[TE] Refactor: unify libfabric-based transports (EFA / CXI) shared infrastructure
#2657 opened on Jun 29, 2026
Description
Background
With #2535 (CXI backend) merged, we now have two transports built on libfabric that share significant structural overlap but are implemented as independent forks. This was the pragmatic choice for initial integration (see discussion in #2535), but leaves maintenance debt.
Scope
- Topology: CXI currently reuses InfiniBand device discovery paths in `topology.cpp`. CXI is not IB; it should have its own topology functions or a shared `libfabric_topology` abstraction (ref: @alogfans' comment).
- Shared base class or utilities: Extract common libfabric patterns (endpoint management, CQ polling, MR registration, error handling) into a shared layer that both EFA and CXI can consume, reducing duplicated code.
- `mr_key_t` consolidation: The current `#if defined(USE_EFA) || defined(USE_CXI)` guards work but will grow unwieldy if more libfabric providers are added. Consider a compile-time or runtime abstraction.
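As a starting point for the `mr_key_t` item, here is a minimal sketch of one possible compile-time abstraction. It assumes the guards exist to select a 64-bit key for libfabric providers and a 32-bit key otherwise; that assumption is based on `fi_mr_key()` returning `uint64_t` and verbs rkeys being `uint32_t`, not on the current code. The names `kUsesLibfabric`, `MrKeyFor`, and `to_mr_key` are hypothetical and do not exist in the repository.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <type_traits>

namespace sketch {

// The only place that enumerates libfabric providers. A new provider adds
// its macro to this one condition; no other #if chain has to change.
#if defined(USE_EFA) || defined(USE_CXI)
constexpr bool kUsesLibfabric = true;
#else
constexpr bool kUsesLibfabric = false;
#endif

// Assumed key widths: 64-bit for libfabric builds, 32-bit otherwise.
template <bool UsesLibfabric>
struct MrKeyFor {
    typedef typename std::conditional<UsesLibfabric, std::uint64_t,
                                      std::uint32_t>::type type;
};

typedef MrKeyFor<kUsesLibfabric>::type mr_key_t;

static_assert(std::is_same<MrKeyFor<true>::type, std::uint64_t>::value,
              "libfabric builds are assumed to use 64-bit keys");
static_assert(std::is_same<MrKeyFor<false>::type, std::uint32_t>::value,
              "other builds are assumed to use 32-bit keys");

// Converts a raw 64-bit key to mr_key_t. Returns false instead of silently
// truncating when the value does not fit in the selected key type.
inline bool to_mr_key(std::uint64_t raw, mr_key_t* out) {
    assert(out != nullptr);
    if (raw > std::numeric_limits<mr_key_t>::max()) return false;
    *out = static_cast<mr_key_t>(raw);
    return true;
}

}  // namespace sketch
```

With this shape, call sites would use `mr_key_t` without any provider macros of their own. A runtime alternative would be to always carry a 64-bit key internally and narrow only at the verbs boundary, but that likely touches more code.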
Non-goals
- Changing the public API or metadata wire format
- Merging EFA and CXI into a single transport (they have meaningful behavioral differences in MR handling and transfer semantics)
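To illustrate how a shared layer could coexist with the second non-goal, the sketch below keeps EFA and CXI as separate transport classes and shares only the common control flow through a base class with provider hooks. Every name here (`LibfabricTransportBase`, `doRegister`, `StubEfaTransport`, `StubCxiTransport`) is hypothetical, and the stub bodies are placeholders rather than real provider behavior; no libfabric calls are made.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

namespace sketch {

// Hypothetical shared layer: owns the flow common to libfabric providers
// and defers provider-specific steps to virtual hooks.
class LibfabricTransportBase {
public:
    virtual ~LibfabricTransportBase() = default;

    // Shared registration flow: validate input, delegate to the provider
    // hook, then record the resulting key.
    bool registerMemory(void* addr, std::size_t len) {
        if (addr == nullptr || len == 0) return false;
        std::uint64_t key = 0;
        if (!doRegister(addr, len, &key)) return false;
        keys_.push_back(key);
        return true;
    }

    std::size_t keyCount() const { return keys_.size(); }

    std::uint64_t keyAt(std::size_t i) const {
        assert(i < keys_.size());
        return keys_[i];
    }

    virtual std::string providerName() const = 0;

protected:
    // Provider-specific MR handling stays in the derived transport.
    virtual bool doRegister(void* addr, std::size_t len,
                            std::uint64_t* key) = 0;

private:
    std::vector<std::uint64_t> keys_;
};

// Placeholder logic only: hands out sequential keys starting at 1.
class StubEfaTransport : public LibfabricTransportBase {
public:
    std::string providerName() const override { return "efa"; }

protected:
    bool doRegister(void*, std::size_t, std::uint64_t* key) override {
        *key = next_++;
        return true;
    }

private:
    std::uint64_t next_ = 1;
};

// Placeholder logic only: hands out keys from a different range.
class StubCxiTransport : public LibfabricTransportBase {
public:
    std::string providerName() const override { return "cxi"; }

protected:
    bool doRegister(void*, std::size_t, std::uint64_t* key) override {
        *key = 0x1000 + count_++;
        return true;
    }

private:
    std::uint64_t count_ = 0;
};

}  // namespace sketch
```

The same hook pattern could extend to endpoint setup, CQ polling, and error translation. Free-function utilities would be an alternative if inheritance turns out to be too rigid for the differences in transfer semantics.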
References
- #2535 — CXI backend PR
- #2564 — 64-bit MR key fix