[TE] Capture GPU device at registration time in MnnvlTransport / NVLinkTransport instead of inferring from request pointers · kvcache-ai/Mooncake#2620

good first issue

Description

Background

PR #2569 fixed a bug where MnnvlTransport::allocateSubBatch (and the same bug in NVLinkTransport, addressed in follow-up) created CUDA streams without specifying the target GPU, so all streams landed on GPU 0 and the copy engine became a bottleneck in multi-GPU setups.

The fix defers stream creation until submitTransferTasks, where the device is inferred by iterating over request source pointers and calling cudaPointerGetAttributes:

for (auto &req : request_list) {
    device_id = detectDeviceFromPointer(req.source);
    if (device_id >= 0) break;
}

This unblocked the immediate performance issue, but the approach has architectural drawbacks worth tracking as a follow-up.

Concern

The pointer-inference approach relies on implicit assumptions that are not enforced by the interface:

The first valid CUDA pointer in the batch represents the correct device for the entire batch.
request.source is always a non-null CUDA device pointer. When it is not, the code falls back to cudaGetDevice(), which is the original bug in disguise: it silently selects the calling thread's current device, which may not be where the data lives.
cudaPointerGetAttributes is called on the hot path of every batch submission, adding a per-request CUDA API round-trip just to recover information we already had.

Proposed approach

BufferDesc.location (e.g., "cuda:3") is already the authoritative source of device information and is available during memory registration. SunriseLinkTransport follows this pattern — addMemoryBuffer parses the location and stores a per-buffer mapping:

// mooncake-transfer-engine/tent/src/transport/sunrise_link/sunrise_link_transport.cpp:901
registered_memory_gpu_id_[reinterpret_cast<void*>(desc.addr)] = location.index();

(declaration at tent/include/tent/transport/sunrise_link/sunrise_link_transport.h:123)
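A minimal sketch of what registration-time capture could look like, written as standalone C++ so it can be read without the Mooncake sources. `GpuBufferRegistry` and `parseCudaIndex` are hypothetical names, not existing Mooncake APIs, and the sketch assumes location strings have the form "cuda:<N>" as in the example above. Unlike the SunriseLinkTransport line quoted above, which keys on the buffer base address, this sketch also stores the buffer length so that a pointer into the middle of a registered buffer resolves to the right GPU; whether the transports need that is an assumption.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <optional>
#include <string>

// Hypothetical helper: parses "cuda:<N>" into N, returns nullopt for anything else.
inline std::optional<int> parseCudaIndex(const std::string &location) {
    const std::string prefix = "cuda:";
    if (location.compare(0, prefix.size(), prefix) != 0) return std::nullopt;
    const std::string digits = location.substr(prefix.size());
    if (digits.empty() || digits.size() > 4) return std::nullopt;
    int value = 0;
    for (char c : digits) {
        if (c < '0' || c > '9') return std::nullopt;
        value = value * 10 + (c - '0');
    }
    return value;
}

// Hypothetical registry: records the GPU id of each buffer when it is registered.
class GpuBufferRegistry {
   public:
    // Called from the registration path. Returns false for non-CUDA locations.
    bool add(std::uintptr_t addr, std::size_t length, const std::string &location) {
        auto index = parseCudaIndex(location);
        if (!index) return false;
        ranges_[addr] = Range{length, *index};
        return true;
    }

    void remove(std::uintptr_t addr) { ranges_.erase(addr); }

    // Returns the GPU id of the registered buffer that contains ptr, if any.
    std::optional<int> find(std::uintptr_t ptr) const {
        auto it = ranges_.upper_bound(ptr);  // first buffer starting after ptr
        if (it == ranges_.begin()) return std::nullopt;
        --it;  // last buffer starting at or before ptr
        if (ptr - it->first >= it->second.length) return std::nullopt;
        return it->second.gpu_id;
    }

   private:
    struct Range {
        std::size_t length;
        int gpu_id;
    };
    std::map<std::uintptr_t, Range> ranges_;  // keyed by buffer base address
};
```

With this shape, the submission path would call `find` on `request.source` and never touch the CUDA runtime to recover the device. A real transport would also need to guard the map against concurrent registration and submission, which the sketch leaves out.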

Applying the same pattern to MnnvlTransport and NVLinkTransport would:

Eliminate pointer inference and the cudaPointerGetAttributes call on every submission
Remove the implicit "first-pointer wins" assumption — the transport always knows which GPU each registered buffer lives on
Make the device contract explicit at registration boundaries instead of recovered at use time
Naturally handle batches that mix buffers from different GPUs (currently silently bound to whichever pointer is inspected first)

Scope

Add a registered_memory_gpu_id_ (or equivalent) map to MnnvlTransport, populated from BufferDesc.location in addMemoryBuffer
Same for NVLinkTransport
Use the registered device ID directly in submitTransferTasks for stream creation; remove detectDeviceFromPointer walks
Validate behavior when a batch references multiple devices (reject the batch, or group requests per device; this is an interface decision)
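The multi-device question in the last item can be explored with a small partitioning step. The sketch below is an illustration under assumptions, not the transports' actual code: `Request`, `DeviceLookup`, and `groupByDevice` are hypothetical names, and the lookup callback stands in for whatever registration-time map the transport ends up with. It fails loudly on an unregistered source instead of falling back to the current device.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <optional>
#include <vector>

// Hypothetical stand-in for the transport's request type.
struct Request {
    const void *source;
    std::size_t length;
};

// Hypothetical lookup: maps a source pointer to the GPU id captured at registration.
using DeviceLookup = std::function<std::optional<int>(const void *)>;

// Groups request indices by GPU id.
// Returns nullopt if any source pointer does not belong to a registered buffer.
inline std::optional<std::map<int, std::vector<std::size_t>>> groupByDevice(
    const std::vector<Request> &requests, const DeviceLookup &lookup) {
    std::map<int, std::vector<std::size_t>> groups;
    for (std::size_t i = 0; i < requests.size(); ++i) {
        auto device = lookup(requests[i].source);
        if (!device) return std::nullopt;  // no silent fallback to the current device
        groups[*device].push_back(i);
    }
    return groups;
}
```

Either policy can be built on the result: a "reject" policy returns an error when the map has more than one key, while a "group" policy creates one stream per key and submits each group on its own device.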

References

PR #2569 — original bug fix using pointer inference
SunriseLinkTransport — existing precedent for registration-time device capture

Contributor guide

Research direction: Implement a `registered_memory_gpu_id_` map in `MnnvlTransport` and `NVLinkTransport`, populated from `BufferDesc.location` in `addMemoryBuffer`, and use it in `submitTransferTasks` to replace `detectDeviceFromPointer`.
Tech stack: C++
Domain: backend infrastructure
Issue type: Refactor
Difficulty: 3
Estimated time: 1-2 days
Activity status: Active
Clarity: Clear
Prerequisites: C++, CUDA
Newbie friendliness: 40