kvcache-ai/Mooncake

[TE] Capture GPU device at registration time in MnnvlTransport / NVLinkTransport instead of inferring from request pointers

Open

#2,620 opened on Jun 25, 2026

good first issue

Description

Background

PR #2569 fixed a bug in MnnvlTransport::allocateSubBatch (the same bug in NVLinkTransport was addressed in a follow-up): CUDA streams were created without specifying the target GPU, so every stream landed on GPU 0 and its copy engine became a bottleneck in multi-GPU setups.

The fix defers stream creation until submitTransferTasks, where the device is inferred by iterating over request source pointers and calling cudaPointerGetAttributes:

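int device_id = -1;  // assumed initial value: no CUDA device found yet
for (auto &req : request_list) {
    // The first pointer that resolves to a GPU decides the device for the whole batch.
    device_id = detectDeviceFromPointer(req.source);
    if (device_id >= 0) break;
}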

This unblocked the immediate performance issue, but the approach has architectural drawbacks worth tracking as a follow-up.

Concern

The pointer-inference approach relies on implicit assumptions that the interface does not enforce, and it adds avoidable work on the hot path:

  1. The first valid CUDA pointer in the batch represents the correct device for the entire batch.
  2. request.source is always a non-null CUDA device pointer. When it is not, the code falls back to cudaGetDevice(), which is the original bug in disguise: it silently selects the calling thread's current device, which may not be where the data lives.
  3. cudaPointerGetAttributes is called on the hot path of every batch submission (once per request until a device pointer is found), adding CUDA API round-trips just to recover information we already had.
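
To make concerns 1 and 2 concrete, here is a minimal host-only sketch of the inference loop. Detect is a hypothetical stand-in for detectDeviceFromPointer (the real helper queries the CUDA runtime), inferBatchDevice is an illustrative name, and fallback_device models whatever cudaGetDevice() would report for the calling thread. None of these names are taken from the Mooncake code.

```cpp
#include <functional>
#include <vector>

// Stand-in for detectDeviceFromPointer: returns a GPU index, or -1 when the
// pointer is not recognized as device memory. Hypothetical, for illustration.
using Detect = std::function<int(const void *)>;

// Mirrors the loop quoted above. fallback_device models the value that
// cudaGetDevice() would return for the calling thread.
int inferBatchDevice(const std::vector<const void *> &sources,
                     const Detect &detect, int fallback_device) {
    for (const void *src : sources) {
        int device_id = detect(src);
        if (device_id >= 0) return device_id;  // first resolvable pointer wins
    }
    return fallback_device;  // no device pointer found: thread's current device
}
```

With a batch whose sources live on GPU 2 and GPU 0 (in that order), this returns 2 for the whole batch. With a batch that has no recognizable device pointer, it returns the fallback, regardless of where the data actually lives.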

Proposed approach

BufferDesc.location (e.g., "cuda:3") is already the authoritative source of device information and is available during memory registration. SunriseLinkTransport follows this pattern — addMemoryBuffer parses the location and stores a per-buffer mapping:

// mooncake-transfer-engine/tent/src/transport/sunrise_link/sunrise_link_transport.cpp:901
registered_memory_gpu_id_[reinterpret_cast<void*>(desc.addr)] = location.index();

(declaration at tent/include/tent/transport/sunrise_link/sunrise_link_transport.h:123)

Applying the same pattern to MnnvlTransport and NVLinkTransport would:

  • Eliminate pointer inference and the cudaPointerGetAttributes call on every submission
  • Remove the implicit "first-pointer wins" assumption — the transport always knows which GPU each registered buffer lives on
  • Make the device contract explicit at registration boundaries instead of recovered at use time
  • Make it possible to handle batches that mix buffers from different GPUs (currently the whole batch is silently bound to the device of the first pointer that resolves)
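
A minimal sketch of what registration-time capture could look like, kept free of CUDA and Mooncake headers so it compiles on its own. GpuBufferRegistry and parseCudaIndex are hypothetical names, and the "cuda:<index>" parsing is a simplification of whatever BufferDesc.location parsing the transports already use. One deliberate difference from the SunriseLinkTransport snippet above: it keys by base address but answers lookups for any address inside a registered range, on the assumption that request pointers may point into the middle of a registered buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <optional>
#include <string>

// Hypothetical helper: parse a location such as "cuda:3" into a GPU index.
// Returns std::nullopt for anything that is not "cuda:<1 to 4 digits>".
std::optional<int> parseCudaIndex(const std::string &location) {
    const std::string prefix = "cuda:";
    if (location.compare(0, prefix.size(), prefix) != 0) return std::nullopt;
    const std::string digits = location.substr(prefix.size());
    if (digits.empty() || digits.size() > 4) return std::nullopt;
    int index = 0;
    for (char c : digits) {
        if (c < '0' || c > '9') return std::nullopt;
        index = index * 10 + (c - '0');
    }
    return index;
}

// Hypothetical registry: records the GPU of each buffer at registration time.
class GpuBufferRegistry {
public:
    // Intended to be called from the registration path (addMemoryBuffer).
    bool add(std::uintptr_t addr, std::size_t length, const std::string &location) {
        auto gpu = parseCudaIndex(location);
        if (!gpu || length == 0) return false;
        ranges_[addr] = Range{length, *gpu};
        return true;
    }

    void remove(std::uintptr_t addr) { ranges_.erase(addr); }

    // Returns the GPU that owns `ptr`, or std::nullopt if `ptr` does not fall
    // inside any registered buffer. No CUDA call is involved.
    std::optional<int> lookup(std::uintptr_t ptr) const {
        auto it = ranges_.upper_bound(ptr);  // first base address > ptr
        if (it == ranges_.begin()) return std::nullopt;
        --it;                                // last base address <= ptr
        assert(it->first <= ptr);
        if (ptr - it->first >= it->second.length) return std::nullopt;
        return it->second.gpu_id;
    }

private:
    struct Range {
        std::size_t length;
        int gpu_id;
    };
    std::map<std::uintptr_t, Range> ranges_;
};
```

In this sketch an unregistered pointer yields std::nullopt instead of a silent fallback, so the caller has to decide explicitly what to do with it. A real implementation would also need synchronization if registration and submission can run on different threads.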

Scope

  • Add a registered_memory_gpu_id_ (or equivalent) map to MnnvlTransport, populated from BufferDesc.location in addMemoryBuffer
  • Same for NVLinkTransport
  • Use the registered device ID directly in submitTransferTasks for stream creation; remove detectDeviceFromPointer walks
  • Define and validate behavior when a batch references multiple devices (reject the batch, or group requests by device; this is an interface decision)
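
For the multi-device question, one option is to partition each batch by registered device before creating streams. The sketch below is only an illustration of that idea: Request, DeviceLookup, and groupByDevice are hypothetical names, and the lookup callback stands in for whatever registration-time map the transport ends up with.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <map>
#include <optional>
#include <vector>

// Hypothetical stand-in for a transfer request; only the source address matters here.
struct Request {
    std::uintptr_t source;
};

// Hypothetical lookup: returns the GPU recorded at registration time,
// or std::nullopt for an unregistered address.
using DeviceLookup = std::function<std::optional<int>(std::uintptr_t)>;

// Partition a batch into per-device lists of request indices.
// Returns std::nullopt if any source address is unregistered.
std::optional<std::map<int, std::vector<std::size_t>>> groupByDevice(
    const std::vector<Request> &batch, const DeviceLookup &lookup) {
    std::map<int, std::vector<std::size_t>> groups;
    for (std::size_t i = 0; i < batch.size(); ++i) {
        auto gpu = lookup(batch[i].source);
        if (!gpu) return std::nullopt;
        groups[*gpu].push_back(i);
    }
    return groups;
}
```

Both policies fall out of the same result: a "reject" policy fails the submission when the returned map has more than one key, while a "group" policy creates one stream per key and submits each index list on its own device.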

References

  • PR #2569 — original bug fix using pointer inference
  • SunriseLinkTransport — existing precedent for registration-time device capture
