Unlock Peak GPU Performance: Introducing mKernel for Multi-Node Communication

Jake Morrison

6 hours ago


The mKernel project introduces a GPU-driven approach to multi-GPU, multi-node communication, targeting the communication bottlenecks that lengthen training times in high-performance computing and AI. Reducing this overhead matters more as models are scaled across more GPUs and nodes.

Meet mKernel: A Multi-GPU, Multi-Node Fused Kernel Library for GPU-Driven Communication

Research cited by the mKernel team indicates that communication overhead can account for 43.6% of the forward pass and 32% of end-to-end training time. For complex Mixture-of-Experts (MoE) models, inter-device communication can reach 47% of total execution time. To address this, researchers at UC Berkeley's UCCL project developed mKernel, a library that uses persistent CUDA kernels to fuse intra-node NVLink communication, inter-node RDMA, and compute into a single kernel.
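To put those percentages in perspective, the short sketch below computes an Amdahl-style upper bound: if a fraction f of the runtime is communication and all of it could be hidden behind compute, the speedup is at most 1 / (1 - f). This is an illustrative calculation, not a result reported by the mKernel team, and it optimistically assumes that hiding communication does not slow the compute itself.

```python
def max_speedup(comm_fraction: float) -> float:
    """Upper bound on speedup if all communication time is hidden."""
    if not 0.0 <= comm_fraction < 1.0:
        raise ValueError("comm_fraction must be in [0, 1)")
    return 1.0 / (1.0 - comm_fraction)


# Communication fractions quoted in the text above.
reported = {
    "forward pass": 0.436,
    "end-to-end training": 0.32,
    "MoE execution": 0.47,
}

bounds = {name: max_speedup(f) for name, f in reported.items()}
for name, bound in bounds.items():
    print(f"{name}: at most {bound:.2f}x")
```

Under these assumptions the bounds come out to roughly 1.77x, 1.47x, and 1.89x. Real gains would be lower, because SMs assigned to communication are not available for compute.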

The Limitations of Host-Driven Communication

Traditional multi-GPU communication relies on a host-driven model, where the CPU manages the control path, using libraries like NCCL or NVSHMEM to execute collective operations across the GPUs. Compute and communication operate on separate CUDA streams, overlapping only at kernel boundaries.

The mKernel team identifies two key issues with this approach:

  • CPU Bottleneck: CPUs are not keeping pace with GPU compute power. A modern system such as the GB300 NVL72 rack, with 72 Blackwell Ultra GPUs and 36 Grace CPUs, offers 720 PFLOP/s of FP8/FP6 and 1.44 EFLOP/s of FP4 Tensor Core performance, plus 130 TB/s of all-to-all intra-rack NVLink bandwidth. At that scale, microsecond-scale host orchestration overhead (cudaLaunchKernel calls, CPU-side "all writes done" checks, and inter-stream events) leaves significant pipeline bubbles.
  • Coarse-Grained Overlap: Host-driven systems can only overlap compute and communication at kernel boundaries, making finer-grained overlap at the tile or chunk level impossible from the host side.
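The difference between the two models can be sketched with a toy timing model. The code below compares a dependent compute-then-communicate pair run as two host-launched kernels against a single persistent kernel that pipelines the same work tile by tile. The function names and all timings are hypothetical and chosen only for illustration; they are not mKernel measurements.

```python
def kernel_boundary_time(n_tiles, compute_us, comm_us, launch_us):
    """Two host-launched kernels: the collective starts only after
    the whole compute kernel has finished."""
    compute_kernel = launch_us + n_tiles * compute_us
    comm_kernel = launch_us + n_tiles * comm_us
    return compute_kernel + comm_kernel


def tile_pipelined_time(n_tiles, compute_us, comm_us, launch_us):
    """One persistent kernel: the transfer of tile i overlaps the
    compute of tile i + 1 (a two-stage pipeline)."""
    steady_state = max(compute_us, comm_us)
    return launch_us + compute_us + (n_tiles - 1) * steady_state + comm_us


# Hypothetical workload: 64 tiles, 10 us compute and 6 us transfer per tile,
# 5 us of host overhead per kernel launch.
coarse = kernel_boundary_time(64, 10, 6, 5)
fine = tile_pipelined_time(64, 10, 6, 5)
print(f"kernel-boundary: {coarse} us, tile-pipelined: {fine} us")
```

With these made-up numbers the host-driven version takes 1034 us and the pipelined version 651 us. The model ignores contention between compute and communication for SMs and memory bandwidth.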

GPU-driven communication offers an alternative: the GPU itself initiates transfers, and communication is integrated directly into the compute kernel. Existing fused kernel libraries primarily operate within a single GPU or a single node, whereas mKernel is designed for multi-node environments. As GPU throughput outpaces host-side orchestration, this shift becomes increasingly important for making full use of modern hardware.

mKernel: A Deep Dive

mKernel is a library of persistent CUDA kernels, engineered to fuse intra-node NVLink communication, inter-node RDMA, and dense compute operations into a single unit.

  • Unified Multi-GPU + Multi-Node Kernel: Both intra-node NVLink and inter-node RDMA operations are contained within the same persistent kernel.
  • Fine-Grained Intra-Kernel Overlap: Compute and communication are overlapped at the tile/chunk level, encompassing both intra-node and inter-node GPU communication.
  • Persistent Kernel with SM Specialization: Cooperative Thread Arrays (CTAs) dynamically self-assign roles: compute, intra-comm, inter-send, inter-reduce. The number of Streaming Multiprocessors (SMs) dedicated to each role can be tuned based on the specific workload.
  • GPU-Driven Networking on libibverbs: mKernel leverages GPU-initiated RDMA writes without relying on NCCL or NVSHMEM. The communication backend is built from the ground up to maximize performance and support a wide range of networking devices. This approach, according to the developers, allows for greater control and optimization of data transfer.
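One simple way to picture SM specialization is a mapping from CTA index to role. The sketch below is a host-side Python model of that idea and uses contiguous index ranges; mKernel's actual assignment logic is not described in this article, so the function name, the static range scheme, and the split values are all assumptions. In a real persistent kernel the equivalent check would run on the GPU against the block index.

```python
from collections import Counter

# Role names as listed in the article.
ROLES = ("compute", "intra-comm", "inter-send", "inter-reduce")


def assign_role(cta_id: int, sm_split: dict) -> str:
    """Map a CTA index to a role using contiguous index ranges."""
    if cta_id < 0:
        raise ValueError("cta_id must be non-negative")
    upper = 0
    for role in ROLES:
        upper += sm_split[role]
        if cta_id < upper:
            return role
    raise ValueError("cta_id exceeds the number of CTAs in the split")


# Hypothetical, workload-tunable split across 132 CTAs (one per SM).
split = {"compute": 112, "intra-comm": 8, "inter-send": 6, "inter-reduce": 6}
total_ctas = sum(split.values())
role_counts = Counter(assign_role(i, split) for i in range(total_ctas))
print(dict(role_counts))
```

Changing the numbers in `split` is the toy equivalent of tuning how many SMs each role receives for a given workload.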

The Power of Five: mKernel's Fused Kernels

mKernel features five key fused kernels, each designed to optimize specific computational patterns:

| Kernel | What it fuses | Description |
| --- | --- | --- |
| AllGather + GEMM | AllGather → GEMM | Each rank holds a shard of A. As ranks gather shards via NVLink/RDMA, the local GEMM consumes tiles as they arrive. |
| GEMM + AllReduce | GEMM → AllReduce | Computes C = A @ B and reduces partial outputs across all ranks in a single launch. Output tiles are pushed into the reduction tree as soon as they are produced. |
| MoE Dispatch + GEMM | All-to-All dispatch → grouped GEMM | Routes MoE tokens to their expert ranks (intra-node NVLink + inter-node all-to-all) and runs the per-expert grouped GEMM in the same kernel. Tokens are processed immediately upon arrival. |
| Ring Attention | Ring KV exchange → FlashAttention | Sequence-parallel attention across ranks. Each step rotates a KV chunk around the ring while the local FlashAttention consumes the previously received chunk. |
| GEMM + ReduceScatter | GEMM → ReduceScatter | Computes C = A @ B and reduce-scatters the output. Each output tile is reduced and forwarded to its owning rank as soon as it is produced. |
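The Ring Attention pattern works because softmax attention can be accumulated one KV chunk at a time with an online softmax, so a rank never needs all keys and values at once. The sketch below is a single-process Python model for one query vector: the ring exchange is simulated by list indexing, and score scaling and masking are omitted for brevity. It illustrates the general technique, not mKernel's kernel code, and all names in it are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def full_attention(q, keys, values):
    """Reference: softmax(q . K) V over all keys at once."""
    scores = [dot(q, k) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def ring_attention(q, kv_chunks, rank):
    """Consume one KV chunk per ring step, starting with the local chunk."""
    world = len(kv_chunks)
    dim = len(kv_chunks[0][1][0])
    running_max, denom, acc = -math.inf, 0.0, [0.0] * dim
    for step in range(world):
        # Chunk this rank holds at this step (simulated ring rotation).
        keys, values = kv_chunks[(rank - step) % world]
        for k, v in zip(keys, values):
            score = dot(q, k)
            new_max = max(running_max, score)
            rescale = math.exp(running_max - new_max)  # fix up old state
            weight = math.exp(score - new_max)
            denom = denom * rescale + weight
            acc = [a * rescale + weight * x for a, x in zip(acc, v)]
            running_max = new_max
    return [a / denom for a in acc]


# Small deterministic example: 8 key/value pairs split across 4 ranks.
keys = [[0.1 * i, 0.5 - 0.2 * i] for i in range(8)]
values = [[float(i), float(i % 3)] for i in range(8)]
q = [1.0, -0.5]
world = 4
kv_chunks = [(keys[2 * r:2 * r + 2], values[2 * r:2 * r + 2])
             for r in range(world)]

reference = full_attention(q, keys, values)
outputs = [ring_attention(q, kv_chunks, rank) for rank in range(world)]
```

Every rank visits the chunks in a different order but arrives at the same result as the reference, up to floating-point rounding. That order independence is what lets the next chunk's transfer overlap with computation on the current one.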

mKernel Evaluation: Testbeds and Benchmarks

The mKernel team evaluated their library on two 2-node × 8-H200 clusters, differing only in their inter-node fabric:

| Testbed | Nodes × GPUs | Intra-node | Inter-node transport | NIC |
| --- | --- | --- | --- | --- |
| AWS EFA | 2 × 8 H200 | NVLink | AWS EFA / SRD | 16 × 200 Gb/s EFA per node |
| ConnectX-7 | 2 × 8 H200 | NVLink | InfiniBand | 8 × 400 Gb/s NVIDIA ConnectX-7 per node |
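A quick calculation from the NIC column shows that both testbeds have the same nominal aggregate inter-node bandwidth per node, which fits the description of clusters that differ only in fabric. The snippet uses only the figures from the table; these are line rates, not measured throughput.

```python
# (NICs per node, Gb/s per NIC), taken from the testbed table.
testbeds = {
    "AWS EFA": (16, 200),
    "ConnectX-7": (8, 400),
}

aggregate_gbps = {name: nics * rate for name, (nics, rate) in testbeds.items()}
aggregate_gbytes = {name: gbps / 8 for name, gbps in aggregate_gbps.items()}

for name in testbeds:
    print(f"{name}: {aggregate_gbps[name]} Gb/s "
          f"= {aggregate_gbytes[name]:.0f} GB/s per node")
```

Both rows work out to 3200 Gb/s, or 400 GB/s, per node.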

mKernel was benchmarked against libraries including NCCL, Triton-distributed, Flux, Mercury, MagiAttention, Transformer-Engine, and ring-flash-attention. The team notes that further benchmarking at larger scales is ongoing. In practice, these benchmarks offer a preliminary glimpse into mKernel's potential; results on larger clusters and a wider range of workloads are needed for a comprehensive assessment.

Networking Backends and System Requirements

mKernel supports two distinct networking backends:

| Backend | Macro | Transport | Where it runs |
| --- | --- | --- | --- |
| CX7 | -DINTERNODE_BACKEND_IBVERBS | libibverbs RC | ConnectX-7 / InfiniBand / RoCE |
| EFA | -DINTERNODE_BACKEND_EFA | libibverbs + efadv (SRD) | AWS p5/p5e (H200, EFA) |

Both backends share a host-side API and on-GPU kernel. The primary difference is in the proxy/session implementation (session.h for CX7, session_efa.h for EFA). System requirements include NVIDIA Hopper GPUs (default build targets sm_90a), CUDA 12.9, and Python with PyTorch. The CX7 backend requires libibverbs development headers and libraries, while the EFA backend requires AWS EFA installation with libfabric, libibverbs, efadv, and EFA headers under EFA_HOME=/opt/amazon/efa by default.
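The backend choice described above amounts to a small lookup. The helper below is a hypothetical Python sketch, not part of mKernel's build system: it uses only the macro names, session headers, default build target, and EFA_HOME default mentioned in this article, and the function name and dictionary layout are assumptions.

```python
import os

# Values taken from the article; the structure itself is illustrative.
BACKENDS = {
    "cx7": {"macro": "-DINTERNODE_BACKEND_IBVERBS",
            "session_header": "session.h"},
    "efa": {"macro": "-DINTERNODE_BACKEND_EFA",
            "session_header": "session_efa.h"},
}
DEFAULT_EFA_HOME = "/opt/amazon/efa"
DEFAULT_GPU_ARCH = "sm_90a"


def backend_config(name, env=None):
    """Return the settings that differ between the two backends."""
    env = os.environ if env is None else env
    key = name.lower()
    if key not in BACKENDS:
        raise ValueError(f"unknown backend {name!r}; expected 'cx7' or 'efa'")
    config = dict(BACKENDS[key])
    config["gpu_arch"] = DEFAULT_GPU_ARCH
    if key == "efa":
        # EFA headers are expected under EFA_HOME, with a default location.
        config["efa_home"] = env.get("EFA_HOME", DEFAULT_EFA_HOME)
    return config


cx7 = backend_config("CX7", env={})
efa = backend_config("efa", env={})
print(cx7)
print(efa)
```

Passing `env={}` here only demonstrates the default; in normal use the helper would read `EFA_HOME` from the process environment.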

By fusing intra-node NVLink communication, inter-node RDMA, and compute into a single persistent kernel, mKernel could improve the efficiency of distributed training jobs and reduce development time. The specific benefits will vary depending on the workload and hardware configuration.

Source

marktechpost
