Sakana AI and NVIDIA Introduce TwELL with CUDA Kernels for 20.5% Inference and 21.9% Training Speedup in LLMs

May 11, 2026

What changed

Sakana AI and NVIDIA researchers introduced TwELL, a method that applies simple L1 regularization to induce over 99% sparsity in the feedforward layers of large language models. They translated that sparsity into real wall-clock gains on GPUs by developing new sparse data formats and fused CUDA kernels, yielding a 20.5% speedup in inference and a 21.9% acceleration in training.
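The brief does not detail TwELL's training recipe beyond "simple L1 regularization," so here is only a minimal sketch of the underlying mechanism: an L1 penalty is commonly optimized via its proximal operator, soft-thresholding, which snaps small weights exactly to zero. The matrix dimensions and threshold below are illustrative assumptions, not TwELL's actual settings.

```python
import numpy as np

def soft_threshold(w, lam):
    # Proximal operator of the L1 penalty lam * |w|: shrinks every weight
    # toward zero and sets those with |w| <= lam exactly to zero.
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

# Toy feedforward weight matrix (shape and scale are illustrative only).
rng = np.random.default_rng(0)
W = rng.normal(scale=0.02, size=(512, 2048))

W_sparse = soft_threshold(W, lam=0.06)      # one proximal step
sparsity = float(np.mean(W_sparse == 0.0))  # fraction of exact zeros
print(f"sparsity: {sparsity:.1%}")          # well above 99% for this toy setup
```

This only shows why L1 produces exact zeros rather than merely small weights; in actual training the penalty is applied across many optimizer steps, and TwELL's specific schedule is not described in the brief.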

Why builders should care

Achieving real GPU throughput improvements from model sparsity has been difficult: high sparsity usually degrades model quality, or the gains stay theoretical because hardware cannot exploit unstructured zeros. TwELL shows that straightforward L1 regularization can push sparsity beyond 99% in key layers while keeping downstream performance steady, challenging the prevailing assumption that sparsifying models is too risky or yields little runtime payoff. Builders aiming to scale or cut costs will find TwELL's approach a practical way to reduce compute time and energy without sacrificing accuracy.
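The brief does not specify TwELL's sparse data formats, but the core reason extreme sparsity can pay off is generic: a compressed format stores and multiplies only the nonzeros. A plain NumPy sketch using the standard CSR layout (an assumption for illustration, not TwELL's actual format):

```python
import numpy as np

def dense_to_csr(W):
    # Compressed Sparse Row: keep only nonzero values, their column
    # indices, and per-row offsets into those two arrays.
    values, cols, indptr = [], [], [0]
    for row in W:
        nz = np.nonzero(row)[0]
        values.extend(row[nz])
        cols.extend(nz)
        indptr.append(len(values))
    return (np.asarray(values),
            np.asarray(cols, dtype=np.intp),
            np.asarray(indptr))

def csr_matvec(values, cols, indptr, x):
    # Work scales with the number of nonzeros, not rows * cols --
    # at >99% sparsity that is a >100x reduction in multiply-adds.
    y = np.zeros(len(indptr) - 1)
    for i in range(len(y)):
        lo, hi = indptr[i], indptr[i + 1]
        y[i] = values[lo:hi] @ x[cols[lo:hi]]
    return y

# ~99% sparse toy matrix: the compressed product matches the dense one.
rng = np.random.default_rng(1)
W = rng.normal(size=(64, 256)) * (rng.random((64, 256)) < 0.01)
x = rng.normal(size=256)
vals, cols, indptr = dense_to_csr(W)
assert np.allclose(csr_matvec(vals, cols, indptr, x), W @ x)
```

Real deployments fuse this traversal with surrounding operations in CUDA kernels, as TwELL reportedly does; the Python loop above only shows the accounting that makes the speedup possible.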

The practical takeaway

Hardware-aware sparsity like TwELL can lower the cost of running large language models by reducing both training time and inference latency. For teams constrained by GPU budgets, this means more efficient resource use. The combination of regularization-driven sparsity and custom CUDA kernels targeting that sparsity unlocks concrete acceleration, not just theoretical parameter pruning. Adopting similar sparsity techniques could accelerate experimentation cycles and shrink cloud bills across production LLM workloads.
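The brief does not define how the speedup percentages are measured; assuming they are throughput multipliers (1.205x for inference, 1.219x for training, which is my reading, not a stated fact), the corresponding wall-clock and cost reduction is slightly smaller, as a quick calculation shows:

```python
def time_reduction_from_speedup(speedup_pct):
    # If a "20.5% speedup" means 1.205x throughput, the same workload
    # finishes in 1/1.205 of the original time.
    return 1.0 - 1.0 / (1.0 + speedup_pct / 100.0)

print(f"inference: {time_reduction_from_speedup(20.5):.1%} less time")  # ~17.0%
print(f"training:  {time_reduction_from_speedup(21.9):.1%} less time")  # ~18.0%
```

On a purely time-proportional GPU bill, that reading translates to roughly 17-18% lower cost for the same workload; if the percentages instead describe latency reduction directly, the savings match the headline figures.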

What to watch next

Keep an eye on how TwELL integrates into mainstream AI frameworks and whether wider industry adoption follows. Future work will likely extend the sparse data formats and kernel fusions to hardware beyond NVIDIA GPUs. Also watch whether the technique generalizes beyond feedforward layers to other model architectures. The real test will be sustaining the speed gains without degrading model quality, and how the approach competes with emerging dense-model optimizations.

AI Quick Briefs Editorial Desk
