[TE] Capture GPU device at registration time in MnnvlTransport / NVLinkTransport instead of inferring from request pointers
#2620 opened on Jun 25, 2026
Description
Background
PR #2569 fixed a bug where MnnvlTransport::allocateSubBatch (and the same bug in NVLinkTransport, addressed in follow-up) created CUDA streams without specifying the target GPU, so all streams landed on GPU 0 and the copy engine became a bottleneck in multi-GPU setups.
The fix defers stream creation until submitTransferTasks, where the device is inferred by iterating over request source pointers and calling cudaPointerGetAttributes:
for (auto &req : request_list) {
device_id = detectDeviceFromPointer(req.source);
if (device_id >= 0) break;
}
This unblocked the immediate performance issue, but the approach has architectural drawbacks worth tracking as a follow-up.
Concern
The pointer-inference approach relies on implicit assumptions that are not enforced by the interface:
- The first valid CUDA pointer in the batch represents the correct device for the entire batch.
- request.source is always a non-null CUDA device pointer. The cudaGetDevice() fallback is the original bug in disguise: it silently selects the calling thread's current device, which may not be where the data lives.
- cudaPointerGetAttributes is called on the hot path of every batch submission, adding a per-request CUDA API round-trip just to recover information we already had.
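To make the first concern concrete, here is a minimal host-only sketch of the first-pointer-wins loop. Request, fakeDetectDevice and inferBatchDevice are hypothetical names: a lookup table stands in for cudaPointerGetAttributes so the sketch runs without a GPU, and only the loop shape is taken from the snippet above.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a transfer request; only the source pointer matters here.
struct Request {
    std::uintptr_t source;
};

using OwnerTable = std::unordered_map<std::uintptr_t, int>;

// Hypothetical stand-in for detectDeviceFromPointer. A lookup table replaces
// cudaPointerGetAttributes; unknown pointers yield -1.
inline int fakeDetectDevice(const OwnerTable& owners, std::uintptr_t ptr) {
    auto it = owners.find(ptr);
    return it == owners.end() ? -1 : it->second;
}

// Same loop shape as the snippet above: the first pointer that resolves to a
// device decides the device for the whole batch.
inline int inferBatchDevice(const OwnerTable& owners,
                            const std::vector<Request>& request_list) {
    int device_id = -1;
    for (const auto& req : request_list) {
        device_id = fakeDetectDevice(owners, req.source);
        if (device_id >= 0) break;
    }
    return device_id;
}
```

With one buffer on GPU 1 and another on GPU 3, the inferred device depends only on which request comes first in the batch; the other buffer's device is never consulted.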
Proposed approach
BufferDesc.location (e.g., "cuda:3") is already the authoritative source of device information and is available during memory registration. SunriseLinkTransport follows this pattern — addMemoryBuffer parses the location and stores a per-buffer mapping:
// mooncake-transfer-engine/tent/src/transport/sunrise_link/sunrise_link_transport.cpp:901
registered_memory_gpu_id_[reinterpret_cast<void*>(desc.addr)] = location.index();
(declaration at tent/include/tent/transport/sunrise_link/sunrise_link_transport.h:123)
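A rough sketch of what registration-time capture could look like, kept independent of the real classes. parseCudaIndex and GpuBufferRegistry are hypothetical names, the "cuda:N" format is taken from the example above, and the length parameter plus the interior-pointer lookup are assumptions: the SunriseLinkTransport map shown above is keyed by base address only, whereas this sketch assumes a request pointer may fall anywhere inside a registered, non-overlapping buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical helper: parse "cuda:<N>" into N, or return -1 for anything else.
inline int parseCudaIndex(const std::string& location) {
    const std::string prefix = "cuda:";
    if (location.size() <= prefix.size() ||
        location.size() > prefix.size() + 4 ||  // reject implausibly long indices
        location.compare(0, prefix.size(), prefix) != 0) {
        return -1;
    }
    int index = 0;
    for (std::size_t i = prefix.size(); i < location.size(); ++i) {
        const char c = location[i];
        if (c < '0' || c > '9') return -1;
        index = index * 10 + (c - '0');
    }
    return index;
}

// Hypothetical registry: records the GPU index once, at registration time.
class GpuBufferRegistry {
public:
    // Returns false if the location is not a CUDA location or the buffer is empty.
    bool add(std::uintptr_t addr, std::size_t length, const std::string& location) {
        const int gpu = parseCudaIndex(location);
        if (gpu < 0 || length == 0) return false;
        ranges_[addr] = Range{length, gpu};
        return true;
    }

    void remove(std::uintptr_t addr) { ranges_.erase(addr); }

    // Returns the GPU index of the registered buffer containing ptr, or -1.
    int deviceOf(std::uintptr_t ptr) const {
        auto it = ranges_.upper_bound(ptr);  // first buffer starting after ptr
        if (it == ranges_.begin()) return -1;
        --it;                                // last buffer starting at or before ptr
        if (ptr - it->first >= it->second.length) return -1;
        return it->second.gpu_id;
    }

private:
    struct Range {
        std::size_t length;
        int gpu_id;
    };
    std::map<std::uintptr_t, Range> ranges_;
};
```

The lookup is a host-side ordered-map search with no CUDA API call. Whether a range lookup is needed at all, or an exact base-address match suffices, depends on how requests reference registered buffers.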
Applying the same pattern to MnnvlTransport and NVLinkTransport would:
- Eliminate pointer inference and the cudaPointerGetAttributes call on every submission
- Remove the implicit "first-pointer wins" assumption: the transport always knows which GPU each registered buffer lives on
- Make the device contract explicit at registration boundaries instead of recovered at use time
- Naturally handle batches that mix buffers from different GPUs (currently silently bound to whichever pointer is inspected first)
Scope
- Add a registered_memory_gpu_id_ (or equivalent) map to MnnvlTransport, populated from BufferDesc.location in addMemoryBuffer
- Same for NVLinkTransport
- Use the registered device ID directly in submitTransferTasks for stream creation; remove detectDeviceFromPointer walks
- Validate behavior when a batch references multiple devices (reject? group? This is an interface decision)
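For the last item, the two candidate policies could be sketched roughly as below. Both function names are hypothetical, and the input is assumed to be one device ID per request, already resolved from the registration-time map (with -1 meaning the source is not in a registered GPU buffer). This is an illustration of the options, not a proposed interface.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// "Group" option: partition request indices by device so each group can be
// submitted on a stream created on its own GPU. Returns false (and leaves
// groups empty) if any request has no registered device.
inline bool groupByDevice(const std::vector<int>& device_ids,
                          std::map<int, std::vector<std::size_t>>& groups) {
    groups.clear();
    for (std::size_t i = 0; i < device_ids.size(); ++i) {
        if (device_ids[i] < 0) {
            groups.clear();
            return false;
        }
        groups[device_ids[i]].push_back(i);
    }
    return true;
}

// "Reject" option: accept only batches whose requests all live on one device.
// Returns that device ID, or -1 for empty, unresolved, or mixed batches.
inline int singleDeviceOrReject(const std::vector<int>& device_ids) {
    if (device_ids.empty()) return -1;
    const int first = device_ids.front();
    for (int id : device_ids) {
        if (id < 0 || id != first) return -1;
    }
    return first;
}
```

Grouping keeps mixed batches working at the cost of one stream per device per batch; rejecting keeps submitTransferTasks simple but pushes the split onto callers.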
References
- PR #2569: original bug fix using pointer inference
- SunriseLinkTransport: existing precedent for registration-time device capture