How to Speed Up Transformer Training Using NVIDIA Apex (FusedAdam, FusedLayerNorm) and Native torch.amp

AI Quick Briefs Editorial Desk · June 2, 2026

What changed

NVIDIA Apex, an open-source extension library for PyTorch, provides fused implementations of the Adam optimizer (FusedAdam) and LayerNorm (FusedLayerNorm) that can speed up Transformer training. After building Apex from source, operators can check whether the fused kernels were actually compiled and benchmark them against the native PyTorch equivalents. PyTorch's native automatic mixed precision (torch.amp) can be used alongside the fused kernels, so the two optimizations stack.
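As a rough illustration of the availability check, the sketch below tries to construct each fused component and records whether that worked. The helper name probe_apex_fused is hypothetical, and probing by construction is an assumption about what is practical, not an official Apex detection API. Depending on the Apex version, an import can succeed while construction still fails when the compiled extensions are missing, which is why the probe goes one step past the import.

```python
def probe_apex_fused():
    """Report which Apex fused components can actually be constructed here.

    Hypothetical helper: any failure (Apex or PyTorch missing, extensions not
    compiled, no usable device) is treated as "not available".
    """
    status = {"FusedAdam": False, "FusedLayerNorm": False}

    try:
        import torch
        from apex.optimizers import FusedAdam

        device = "cuda" if torch.cuda.is_available() else "cpu"
        param = torch.nn.Parameter(torch.zeros(4, device=device))
        FusedAdam([param], lr=1e-3)  # fails here if the fused optimizer cannot be built
        status["FusedAdam"] = True
    except Exception:
        pass

    try:
        from apex.normalization import FusedLayerNorm

        FusedLayerNorm(8)  # fails here if the fused LayerNorm cannot be built
        status["FusedLayerNorm"] = True
    except Exception:
        pass

    return status


if __name__ == "__main__":
    print(probe_apex_fused())
```

Running this once at startup and logging the result gives a simple record of whether a given environment is using the fused paths or has fallen back to native PyTorch.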

Why builders should care

Transformer training is computationally intensive and slow, especially for large models. FusedAdam and FusedLayerNorm combine multiple operations into single GPU kernels, which reduces kernel-launch overhead and memory traffic and improves throughput. That can cut both training time and GPU cost. Meanwhile, torch.amp lowers memory usage and accelerates computation by running eligible operations in reduced precision. Builders using PyTorch can combine these approaches largely by swapping the optimizer and normalization classes, without rewriting model code.

The practical takeaway

Operators running Transformer workloads on NVIDIA GPUs should build Apex from source rather than relying on pre-built binaries, so that the fused kernels are compiled for their environment. After installation, verify that the kernels actually load; if they do not, the expected gains will not materialize. Once set up, FusedAdam, FusedLayerNorm, and torch.amp together can make training faster and more memory efficient. This can relieve training bottlenecks, lower cloud GPU spend, or free up on-prem hardware.
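To confirm that the gains show up on a particular machine, a small timing harness can compare optimizer steps side by side. The sketch below is one way to do that under stated assumptions: the time_optimizer helper, the toy model, and the iteration counts are all hypothetical, and FusedAdam is only included when Apex imports and a CUDA device is present. Numbers from a toy model like this do not predict real Transformer throughput, so treat it as a smoke test before benchmarking the actual workload.

```python
import time

import torch
from torch import nn


def time_optimizer(make_optimizer, device, warmup=3, iters=10):
    """Average seconds per forward/backward/step on a small LayerNorm MLP."""
    torch.manual_seed(0)
    model = nn.Sequential(
        nn.Linear(256, 256), nn.LayerNorm(256), nn.Linear(256, 256)
    ).to(device)
    optimizer = make_optimizer(model.parameters())
    x = torch.randn(64, 256, device=device)

    for i in range(warmup + iters):
        if i == warmup:
            # GPU work is asynchronous, so synchronize before starting the clock.
            if device.type == "cuda":
                torch.cuda.synchronize()
            start = time.perf_counter()
        optimizer.zero_grad(set_to_none=True)
        model(x).pow(2).mean().backward()
        optimizer.step()

    if device.type == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
candidates = {"torch.optim.AdamW": lambda p: torch.optim.AdamW(p, lr=1e-3)}

try:
    from apex.optimizers import FusedAdam

    if device.type == "cuda":
        candidates["apex FusedAdam"] = lambda p: FusedAdam(p, lr=1e-3)
except ImportError:
    pass  # Apex not installed; benchmark the native optimizer only.

results = {}
for name, factory in candidates.items():
    try:
        results[name] = time_optimizer(factory, device)
    except Exception as err:  # e.g. Apex installed without its CUDA extensions
        print(f"{name}: skipped ({err})")

for name, seconds in results.items():
    print(f"{name}: {seconds * 1e3:.2f} ms/step")
```

The warmup iterations are excluded from the measurement so that one-time costs such as memory allocation and kernel setup do not skew the per-step average.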

What to watch next

Future PyTorch and Apex releases are likely to keep refining fused-kernel and mixed-precision support, so builders should track new versions for better integration and broader compatibility. Benchmark results from real training runs, rather than synthetic tests, should guide the choice between native PyTorch and Apex fused routines. Wider adoption may also push frameworks to optimize around these techniques and to compete on training speed and cost.
