Unlock Peak GPU Performance: Introducing mKernel for Multi-Node Communication
The mKernel project takes a GPU-driven approach to multi-node communication, targeting the communication bottlenecks that lengthen training times in high-performance computing and AI workloads. As models are spread across more GPUs and nodes, these communication paths increasingly determine how well training scales.

According to the mKernel team, communication overhead can account for 43.6% of the forward pass and 32% of end-to-end training time. For Mixture-of-Experts (MoE) models, inter-device communication can reach 47% of total execution time. To address this, researchers at UC Berkeley's UCCL project developed mKernel, a library of persistent CUDA kernels that fuse intra-node NVLink communication, inter-node RDMA, and compute into a single kernel.
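To put those percentages in perspective, the following sketch computes an Amdahl-style upper bound on the speedup available if communication were hidden completely behind compute. It assumes the quoted fractions are fully exposed (not already overlapped) time, which the source does not state; the function name is illustrative only.

```python
def max_speedup_if_hidden(comm_fraction: float) -> float:
    """Upper bound on speedup when `comm_fraction` of runtime is overlapped away.

    Assumes the fraction is fully exposed communication time (an assumption,
    not a claim from the mKernel team).
    """
    if not 0.0 <= comm_fraction < 1.0:
        raise ValueError("comm_fraction must be in [0, 1)")
    return 1.0 / (1.0 - comm_fraction)


# Fractions quoted in the text above.
quoted = [
    ("forward pass", 0.436),
    ("end-to-end training", 0.32),
    ("MoE execution", 0.47),
]

for label, fraction in quoted:
    print(f"{label}: at most {max_speedup_if_hidden(fraction):.2f}x")
```

Under that assumption the bounds come out to roughly 1.77x, 1.47x, and 1.89x. Real gains would be smaller, because overlap is rarely perfect and communication CTAs take SMs away from compute.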
The Limitations of Host-Driven Communication
Traditional multi-GPU communication relies on a host-driven model, where the CPU manages the control path, using libraries like NCCL or NVSHMEM to execute collective operations across the GPUs. Compute and communication operate on separate CUDA streams, overlapping only at kernel boundaries.
The mKernel team identifies two key issues with this approach:
- CPU Bottleneck: CPUs aren't keeping pace with GPU compute power. A modern system like the GB300 NVL72 rack, with 72 Blackwell Ultra GPUs and 36 Grace CPUs, delivers 720 PFLOP/s of FP8/FP6 and 1.44 EFLOP/s of FP4 Tensor Core performance, plus 130 TB/s of all-to-all intra-rack NVLink bandwidth. At that scale, microsecond-scale host orchestration overhead (cudaLaunchKernel calls, CPU-side "all writes done" checks, and inter-stream events) creates significant pipeline bubbles.
- Coarse-Grained Overlap: Host-driven systems can only overlap compute and communication at kernel boundaries, making finer-grained overlap at the tile or chunk level impossible from the host side.
GPU-driven communication offers an alternative: the GPU initiates transfers itself, and communication is integrated directly into the compute kernel. Removing the host from the control path targets both issues above. Existing fused-kernel libraries primarily operate within a single node or GPU, whereas mKernel is designed for multi-node environments.
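The difference between kernel-boundary overlap and tile-level overlap can be illustrated with a toy timing model. This is a sketch under stated assumptions, not a measurement: all durations are hypothetical, the communication is assumed to depend on the compute output (as in a GEMM followed by a collective), and a single link serializes transfers.

```python
def host_driven_time(n_tiles, compute_us, comm_us, launch_us):
    """One compute kernel, then one collective launched by the host.

    The collective consumes the kernel's output, so it cannot start until
    the whole compute kernel has finished.
    """
    compute_phase = launch_us + n_tiles * compute_us
    comm_phase = launch_us + n_tiles * comm_us
    return compute_phase + comm_phase


def fused_tile_time(n_tiles, compute_us, comm_us, launch_us):
    """Single persistent kernel: tile i is sent as soon as it is computed."""
    compute_done = launch_us  # time at which the latest tile finished computing
    link_free = launch_us     # time at which the link finished its last transfer
    for _ in range(n_tiles):
        compute_done += compute_us
        # A transfer starts once its tile exists and the link is idle.
        link_free = max(link_free, compute_done) + comm_us
    return link_free


# Hypothetical numbers: 64 tiles, 10 us compute and 8 us transfer per tile,
# 5 us per kernel launch.
params = dict(n_tiles=64, compute_us=10, comm_us=8, launch_us=5)
print("host-driven:", host_driven_time(**params), "us")
print("fused, tile-level:", fused_tile_time(**params), "us")
```

In this model the fused version pays for one launch and exposes only the final tile's transfer, while the host-driven version pays for compute and communication back to back.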
mKernel: A Deep Dive
mKernel is a library of persistent CUDA kernels, engineered to fuse intra-node NVLink communication, inter-node RDMA, and dense compute operations into a single unit.
- Unified Multi-GPU + Multi-Node Kernel: Both intra-node NVLink and inter-node RDMA operations are contained within the same persistent kernel.
- Fine-Grained Intra-Kernel Overlap: Compute and communication are overlapped at the tile/chunk level, encompassing both intra-node and inter-node GPU communication.
- Persistent Kernel with SM Specialization: Cooperative Thread Arrays (CTAs) dynamically self-assign roles: compute, intra-comm, inter-send, inter-reduce. The number of Streaming Multiprocessors (SMs) dedicated to each role can be tuned to the workload.
- GPU-Driven Networking on libibverbs: mKernel issues GPU-initiated RDMA writes without relying on NCCL or NVSHMEM. The communication backend is built from the ground up on libibverbs, which, according to the developers, gives greater control over data transfer and supports a wide range of networking devices.
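One way to picture SM specialization is as a ticket scheme: each CTA takes the next ticket from a shared counter and derives its role from per-role SM budgets. The sketch below simulates that on the host in Python. The ticket mechanism, the function name, and the budget split are all assumptions for illustration; the source only says that CTAs self-assign roles and that the per-role SM counts are tunable.

```python
from itertools import count

ROLES = ("compute", "intra-comm", "inter-send", "inter-reduce")


def role_for_ticket(ticket, budget):
    """Map a CTA's arrival ticket to a role using cumulative SM budgets."""
    upper = 0
    for role in ROLES:
        upper += budget[role]
        if ticket < upper:
            return role
    raise ValueError("ticket exceeds the total SM budget")


# Hypothetical split of 132 SMs; a real split would be tuned per workload.
budget = {"compute": 100, "intra-comm": 16, "inter-send": 8, "inter-reduce": 8}

# `tickets` stands in for an atomic counter incremented by each CTA on entry.
tickets = count()
roles = [role_for_ticket(next(tickets), budget) for _ in range(sum(budget.values()))]

for role in ROLES:
    print(f"{role}: {roles.count(role)} CTAs")
```

Because the role depends only on the ticket and the budget table, changing the split for a different workload means changing four numbers rather than restructuring the kernel.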
The Power of Five: mKernel's Fused Kernels
mKernel features five key fused kernels, each designed to optimize specific computational patterns:
| Kernel | What it fuses | Description |
|---|---|---|
| AllGather + GEMM | AllGather → GEMM | Each rank holds a shard of A. As ranks gather shards via NVLink/RDMA, the local GEMM consumes tiles as they arrive. |
| GEMM + AllReduce | GEMM → AllReduce | Computes C = A @ B and reduces partial outputs across all ranks in a single launch. Output tiles are pushed into the reduction tree as soon as they are produced. |
| MoE Dispatch + GEMM | All-to-All dispatch → grouped GEMM | Routes MoE tokens to their expert ranks (intra-node NVLink + inter-node all-to-all) and runs the per-expert grouped GEMM in the same kernel. Tokens are processed immediately upon arrival. |
| Ring Attention | Ring KV exchange → FlashAttention | Sequence-parallel attention across ranks. Each step rotates a KV chunk around the ring while the local FlashAttention consumes the previously-received chunk. |
| GEMM + ReduceScatter | GEMM → ReduceScatter | Computes C = A @ B and reduce-scatters the output. Each output tile is reduced and forwarded to its owning rank as soon as it is produced. |
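The Ring Attention row relies on attention being computable one KV chunk at a time. The following pure-Python sketch shows why: a running (online) softmax merges each newly received chunk into a partial result, and the final output matches attention over the full sequence. It is a numerical illustration only, for a single query vector without masking or scaling; the function names and data are hypothetical, and it does not reflect mKernel's actual kernel code.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def full_attention(q, keys, values):
    """Reference: softmax(q . K) @ V over the whole sequence at once."""
    scores = [dot(q, k) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def ring_attention(q, kv_chunks, start_rank):
    """Consume one KV chunk per ring step, merging with an online softmax."""
    n = len(kv_chunks)
    running_max = float("-inf")
    denom = 0.0
    acc = [0.0] * len(kv_chunks[0][1][0])
    for step in range(n):
        # Chunk held by this rank at this step of the rotation.
        keys, values = kv_chunks[(start_rank - step) % n]
        scores = [dot(q, k) for k in keys]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial sums to the new maximum.
        scale = math.exp(running_max - new_max)
        denom *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, values):
            w = math.exp(s - new_max)
            denom += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / denom for a in acc]


# Hypothetical data: 4 ranks, each holding 2 keys and 2 values.
q = [0.5, -1.0]
kv_chunks = [
    ([[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]]),
    ([[0.5, 0.5], [-1.0, 2.0]], [[5.0, 6.0], [7.0, 8.0]]),
    ([[2.0, -1.0], [0.3, 0.3]], [[9.0, 1.0], [2.0, 3.0]]),
    ([[-0.5, 1.5], [1.2, -0.7]], [[4.0, 5.0], [6.0, 7.0]]),
]
all_keys = [k for keys, _ in kv_chunks for k in keys]
all_values = [v for _, values in kv_chunks for v in values]
reference = full_attention(q, all_keys, all_values)

for rank in range(len(kv_chunks)):
    out = ring_attention(q, kv_chunks, start_rank=rank)
    assert all(math.isclose(a, b, rel_tol=1e-9) for a, b in zip(out, reference))
```

Since the merge is order-independent, every rank reaches the same result regardless of which chunk it starts with, which is what lets the next chunk's transfer overlap with compute on the current one.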
mKernel Evaluation: Testbeds and Benchmarks
The mKernel team evaluated their library on two 2-node × 8-H200 clusters, differing only in their inter-node fabric:
| Testbed | Nodes × GPUs | Intra-node | Inter-node transport | NIC |
|---|---|---|---|---|
| AWS EFA | 2 × 8 H200 | NVLink | AWS EFA / SRD | 16 × 200 Gb/s EFA per node |
| ConnectX-7 | 2 × 8 H200 | NVLink | InfiniBand | 8 × 400 Gb/s NVIDIA ConnectX-7 per node |
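A quick calculation on the NIC column shows that the two testbeds have the same nominal inter-node bandwidth per node, so the fabric and transport are the main variables. The sketch uses only the figures from the table; the even per-GPU split is an assumption, and line rate is an upper bound rather than achieved throughput.

```python
GPUS_PER_NODE = 8

# NIC counts and line rates taken from the testbed table above.
testbeds = {
    "AWS EFA": {"nics_per_node": 16, "gbps_per_nic": 200},
    "ConnectX-7": {"nics_per_node": 8, "gbps_per_nic": 400},
}


def node_bandwidth_gbps(spec):
    """Nominal aggregate inter-node line rate per node, in Gb/s."""
    return spec["nics_per_node"] * spec["gbps_per_nic"]


for name, spec in testbeds.items():
    total = node_bandwidth_gbps(spec)
    per_gpu = total / GPUS_PER_NODE  # assumes NICs are shared evenly by GPUs
    print(f"{name}: {total} Gb/s per node ({total / 8:.0f} GB/s), {per_gpu:.0f} Gb/s per GPU")
```

Both rows work out to 3,200 Gb/s (400 GB/s) per node, or 400 Gb/s per GPU.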
mKernel was benchmarked against NCCL, Triton-distributed, Flux, Mercury, MagiAttention, Transformer-Engine, and ring-flash-attention. The team notes that benchmarking at larger scales is ongoing. The current results should therefore be read as preliminary: they cover two 16-GPU testbeds, and larger, more varied configurations and workloads are needed for a comprehensive assessment.
Networking Backends and System Requirements
mKernel supports two distinct networking backends:
| Backend | Macro | Transport | Where it runs |
|---|---|---|---|
| CX7 | -DINTERNODE_BACKEND_IBVERBS | libibverbs RC | ConnectX-7 / InfiniBand / RoCE |
| EFA | -DINTERNODE_BACKEND_EFA | libibverbs + efadv (SRD) | AWS p5/p5e (H200, EFA) |
Both backends share the same host-side API and on-GPU kernel. The primary difference is in the proxy/session implementation (session.h for CX7, session_efa.h for EFA). System requirements include NVIDIA Hopper GPUs (the default build targets sm_90a), CUDA 12.9, and Python with PyTorch. The CX7 backend requires the libibverbs development headers and libraries. The EFA backend requires an AWS EFA installation with libfabric, libibverbs, efadv, and the EFA headers, located through EFA_HOME, which defaults to /opt/amazon/efa.
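As a rough illustration of how the backend choice maps to compile-time settings, the helper below assembles compiler flags from the macros in the table. It is a hypothetical sketch, not mKernel's build script: the function name, the backend keys, and the assumption that the EFA headers live under `$EFA_HOME/include` are illustrative, and the project's real build may pass these options differently.

```python
import os

# Macro names come from the backend table above.
BACKEND_MACROS = {
    "cx7": "INTERNODE_BACKEND_IBVERBS",
    "efa": "INTERNODE_BACKEND_EFA",
}


def backend_flags(backend, arch="sm_90a", env=None):
    """Return a list of nvcc-style flags for the chosen backend (hypothetical)."""
    env = os.environ if env is None else env
    if backend not in BACKEND_MACROS:
        raise ValueError(f"unknown backend {backend!r}; expected one of {sorted(BACKEND_MACROS)}")
    flags = [f"-arch={arch}", f"-D{BACKEND_MACROS[backend]}"]
    if backend == "efa":
        # EFA_HOME defaults to /opt/amazon/efa per the requirements above;
        # the include subdirectory is an assumption.
        efa_home = env.get("EFA_HOME", "/opt/amazon/efa")
        flags.append(f"-I{efa_home}/include")
    return flags


print(" ".join(backend_flags("cx7")))
print(" ".join(backend_flags("efa")))
```

Keeping the backend behind a single macro is consistent with the shared host-side API: only the session implementation changes between the two builds.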
In summary, mKernel fuses intra-node NVLink communication, inter-node RDMA, and compute into persistent kernels. The benefit will vary with the workload and hardware configuration, but where communication dominates, this fusion could improve the efficiency of distributed training jobs, and packaging it as a library could reduce the development time needed to get such overlap.