Models & Research

Nous Research Proposes Lighthouse Attention: A Training-Only Selection-Based Hierarchical Attention That De…

May 16, 2026

What changed

Nous Research introduced Lighthouse Attention, a training-only hierarchical attention mechanism designed to speed up large language model pretraining on long contexts. Where standard attention costs O(N²·d) and earlier hierarchical methods pool only keys and values, reducing that to O(N·S·d), Lighthouse symmetrically pools queries, keys, and values into a multi-resolution pyramid, cutting the training-time cost to O(S²·d). Here N is the input length, S is a much smaller subsequence size, and d is the embedding dimension. The approach wraps standard scaled dot-product attention and runs stock FlashAttention on the much smaller dense subsequence; Lighthouse is then removed once training finishes.
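
To make the mechanism concrete, here is a minimal PyTorch sketch of the idea as described: pool Q, K, and V symmetrically, then run stock scaled dot-product attention on the pooled subsequence. Average pooling stands in for the paper's multi-resolution pyramid and selection scheme, and the function name, `pool_factor`, and shapes are all illustrative assumptions, not Nous Research's actual API.

```python
import torch
import torch.nn.functional as F

def lighthouse_attention(q, k, v, pool_factor=8):
    """Illustrative sketch only. q, k, v: (batch, heads, N, d).

    Symmetric Q/K/V pooling is the key difference from prior
    hierarchical variants, which pooled only K and V.
    """
    def pool(x):
        b, h, n, d = x.shape
        # Average-pool contiguous windows, shrinking N tokens to
        # S = N / pool_factor representatives. A learned, multi-resolution
        # selection would replace this in a real implementation.
        return x.reshape(b, h, n // pool_factor, pool_factor, d).mean(dim=3)

    q_s, k_s, v_s = pool(q), pool(k), pool(v)

    # Standard scaled dot-product attention on the dense S-length
    # subsequence: O(S^2 * d) instead of O(N * S * d) for KV-only
    # pooling, or O(N^2 * d) for full attention.
    return F.scaled_dot_product_attention(q_s, k_s, v_s)

# Usage: a 4096-token sequence pooled to 512 representatives.
q = torch.randn(2, 8, 4096, 64)
k = torch.randn(2, 8, 4096, 64)
v = torch.randn(2, 8, 4096, 64)
out = lighthouse_attention(q, k, v)  # shape: (2, 8, 512, 64)
```

Note that the output here covers only the pooled positions; a full training setup would also need to map results back to the N token positions for the loss, a step this sketch omits.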

Why builders should care

Pretraining large models on long sequences is expensive and slow, which limits experimentation and scaling. Lighthouse lowers training compute by sharply reducing the number of attention queries, without changing the model architecture used at inference. In a Llama-3-style setup, it delivered a 1.4x to 1.7x pretraining speedup on a 530-million-parameter model. That means faster iteration cycles and lower cloud or on-prem hardware costs when training models that process long contexts. Builders looking to scale long-context transformers should take note of this approach to squeeze more performance from existing attention operations.
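
A back-of-the-envelope comparison shows why the attention savings are large while the end-to-end speedup is a more modest 1.4x to 1.7x. The N and S values below are hypothetical, chosen only for illustration; they are not from the paper.

```python
# Attention-FLOP comparison with assumed sizes: an 8192-token context
# pooled down to 1024 representatives, head dimension 64.
N, S, d = 8192, 1024, 64

full_attn  = N * N * d  # dense attention:         O(N^2 * d)
kv_pooled  = N * S * d  # KV-only pooling:         O(N * S * d)
qkv_pooled = S * S * d  # symmetric Q/K/V pooling: O(S^2 * d)

print(f"KV-only pooling vs full attention: {full_attn / kv_pooled:.0f}x fewer FLOPs")
print(f"Symmetric pooling vs KV-only:      {kv_pooled / qkv_pooled:.0f}x fewer FLOPs")

# The reported 1.4-1.7x wall-clock speedup is smaller than these ratios
# because MLP blocks, embeddings, and optimizer steps are unaffected by
# the attention savings.
```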

The practical takeaway

Lighthouse Attention applies only during training, so there is no inference overhead or added model complexity at deployment time. It delivers scalable speed gains for long-context models by selecting representative queries, keys, and values from across the input, lowering memory and compute demands further than previous hierarchical attention variants that pooled only keys and values. Builders developing transformer-based models for long-sequence tasks, such as document summarization or code understanding, may find a direct path to faster pretraining without sacrificing attention quality.
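
One concrete way to read "training-only": the pooling can live behind an ordinary train/eval switch, so the deployed model falls through to stock dense attention. This is a hedged sketch of that pattern, reusing the hypothetical pooling from above; the paper may wire this differently.

```python
import torch
import torch.nn.functional as F

class LighthouseWrapper(torch.nn.Module):
    """Hypothetical sketch: pooled attention while training, plain
    scaled dot-product attention at inference."""

    def __init__(self, pool_factor=8):
        super().__init__()
        self.pool_factor = pool_factor

    def _pool(self, x):
        b, h, n, d = x.shape
        f = self.pool_factor
        return x.reshape(b, h, n // f, f, d).mean(dim=3)

    def forward(self, q, k, v):
        if self.training:
            # Training path: attend over pooled representatives only.
            # A full implementation would also map pooled outputs back
            # to the N token positions for the loss.
            q, k, v = self._pool(q), self._pool(k), self._pool(v)
        # Inference path is ordinary dense attention, so "removing
        # Lighthouse after training" is just calling .eval().
        return F.scaled_dot_product_attention(q, k, v)

# Deploy-time usage: wrapper = LighthouseWrapper(); wrapper.eval()
```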

What to watch next

Watch for further validation of Lighthouse at larger scales and in architectures beyond the initial 530M-parameter tests. Watch also whether this training-only selection approach becomes a standard feature of pretraining pipelines and frameworks. Finally, see whether the concept inspires hybrid hierarchical attention layers that keep training cheap and inference overhead minimal for even larger models and longer sequences.

AI Quick Briefs Editorial Desk
