
Meet mKernel: A Multi-GPU, Multi-Node Fused Kernel Library for GPU-Driven Communication

May 29, 2026

What changed

UC Berkeley’s UCCL team has released mKernel, a CUDA kernel library that fuses multi-GPU and multi-node communication with compute in a single persistent kernel. It integrates intra-node NVLink transfers, inter-node RDMA, and dense compute tasks, removing the overhead of separate memory copies and kernel launches across GPUs and nodes. Because the GPU drives the communication itself, data movement and computation stay inside one resident kernel, which avoids the repeated CPU involvement and memory stalls common in traditional distributed workloads.
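To make the persistent-kernel idea concrete, here is a minimal conceptual sketch in plain Python. It is not mKernel's API and does not touch a GPU. The `Task` type, the task labels, and both function names are hypothetical, chosen only to contrast a host that issues one launch per step with a single resident loop that drains a queue of mixed compute and communication steps.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Task:
    kind: str  # hypothetical labels: "compute", "nvlink", "rdma"
    size: int  # arbitrary work units, for illustration only


def run_separate_launches(tasks):
    """Host-driven style: the host issues one launch or copy per step."""
    launches = 0
    done = []
    for task in tasks:
        launches += 1  # every step costs a separate launch
        done.append(task.kind)
    return launches, done


def run_persistent(tasks):
    """Persistent style: launch once, then a resident loop drains the queue."""
    launches = 1  # launched a single time and kept resident
    queue = deque(tasks)
    done = []
    while queue:  # this loop stands in for the resident kernel
        task = queue.popleft()
        done.append(task.kind)
    return launches, done


steps = [Task("compute", 4), Task("nvlink", 2), Task("rdma", 2), Task("compute", 4)]
separate_launches, separate_order = run_separate_launches(steps)
persistent_launches, persistent_order = run_persistent(steps)
print(separate_launches, persistent_launches)
```

Both versions process the same steps in the same order. The only difference in this toy model is how many times the host has to start work, which is the overhead a fused persistent kernel aims to remove.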

Why builders should care

Managing communication between GPUs and across nodes in large HPC or machine learning clusters typically adds latency and wastes resources through repeated driver overhead and synchronization delays. mKernel targets these costs by fusing operations into one CUDA kernel that stays resident on the GPU. The design promises lower latency and higher throughput for multi-GPU and multi-node workloads, where communication cost is often what limits scaling.
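A back-of-the-envelope model shows why per-step overhead matters at scale. The sketch below is an assumption-laden illustration: the function names are hypothetical, and the step count, per-step work time, and launch overhead are made-up placeholder values, not measurements of mKernel or of any real hardware.

```python
def host_driven_time_us(steps, work_us, launch_overhead_us):
    """Every step pays its own launch overhead on top of the useful work."""
    return steps * (work_us + launch_overhead_us)


def persistent_time_us(steps, work_us, launch_overhead_us):
    """The launch overhead is paid once; the steps then run back to back."""
    return launch_overhead_us + steps * work_us


# Placeholder inputs for illustration only (not benchmark data).
STEPS = 1000
WORK_US = 20.0
LAUNCH_US = 5.0

host_total = host_driven_time_us(STEPS, WORK_US, LAUNCH_US)
fused_total = persistent_time_us(STEPS, WORK_US, LAUNCH_US)
saved_us = host_total - fused_total
print(host_total, fused_total, saved_us)
```

In this simplified model the saving grows linearly with the number of steps, so workloads made of many small communication and compute steps stand to gain the most. Real systems add effects the model ignores, such as overlap between transfers and compute, and contention on the links.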

The practical takeaway

For teams running large-scale GPU clusters, especially those with collective-communication-heavy workloads such as distributed training or simulations, mKernel offers a way to get more sustained performance from existing hardware. By embedding NVLink and RDMA communication inside a single kernel, the library reduces system-level interruptions and data movement delays, potentially lowering total compute time and improving cluster efficiency. If those gains hold up, they could translate into cost savings, better resource utilization, and faster iteration cycles for compute-heavy applications.

What to watch next

Early adopter feedback will show whether mKernel integrates smoothly with existing communication stacks such as MPI and NCCL. Its uptake will depend on how much overhead it removes in real-world workloads outside academic demos. It is also worth watching whether hardware vendors or commercial middleware providers adopt similar fused-kernel communication strategies, which would indicate whether the approach reaches broader production use. Performance benchmarks and open-source availability will further shape its practical impact.

AI Quick Briefs Editorial Desk
