
Together AI Open-Sources OSCAR: An Attention-Aware 2-Bit KV Cache Quantization System for Long-Context LLM …

AI Quick Briefs Editorial Desk · May 25, 2026

What changed

Together AI has open-sourced OSCAR, a 2-bit KV cache quantization system for long-context large language model (LLM) serving. Where earlier methods apply a fixed Hadamard transform, OSCAR applies attention-aware rotations derived from key and value covariances estimated offline. This brings storage down to 2.28 bits per KV element while keeping accuracy much closer to full precision: on the Qwen3 variants reported, it narrows the accuracy gap to BF16 by more than 3 points.
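To make the idea of a covariance-derived rotation concrete, here is a toy sketch in plain Python. It is not OSCAR's algorithm: the brief says only that the rotations come from key and value covariances estimated offline, so the 2-D principal-axis rotation, the per-channel 4-level quantizer, and every function name below are illustrative assumptions. What the sketch does show is a property any orthogonal rotation has: applied to both queries and keys, it leaves attention scores unchanged, so the choice of rotation matters only through the quantization error.

```python
import math
import random

def principal_angle(points):
    """Angle of the principal axis of the 2x2 sample covariance of `points`."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    cxx = sum((p[0] - mx) ** 2 for p in points) / n
    cyy = sum((p[1] - my) ** 2 for p in points) / n
    cxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    return 0.5 * math.atan2(2.0 * cxy, cxx - cyy)

def rotate(v, theta):
    """Express the 2-D vector v in axes rotated by theta (an orthogonal map)."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * v[0] + s * v[1], -s * v[0] + c * v[1])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

def quantize_2bit(values):
    """Uniform 4-level quantization of one channel: returns (codes, offset, step)."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / 3 or 1.0  # fall back to 1.0 for a constant channel
    codes = [min(3, max(0, round((v - lo) / step))) for v in values]
    return codes, lo, step

def dequantize(codes, lo, step):
    return [lo + c * step for c in codes]

# Synthetic, strongly correlated "keys" (hypothetical calibration data).
random.seed(0)
keys = []
for _ in range(256):
    t = random.gauss(0.0, 1.0)
    keys.append((t + 0.1 * random.gauss(0.0, 1.0),
                 0.8 * t + 0.1 * random.gauss(0.0, 1.0)))
query = (0.3, -0.7)

# "Offline" step: derive one rotation from the key covariance.
theta = principal_angle(keys)
rot_keys = [rotate(k, theta) for k in keys]
rot_query = rotate(query, theta)

# Rotating both sides leaves every attention score unchanged (up to rounding).
exact_scores = [dot(query, k) for k in keys]
score_drift = max(abs(e - dot(rot_query, rk))
                  for e, rk in zip(exact_scores, rot_keys))

# Quantize each rotated channel to 2 bits, then score against dequantized keys.
channels = [quantize_2bit([rk[i] for rk in rot_keys]) for i in range(2)]
deq = [dequantize(*ch) for ch in channels]
approx_scores = [dot(rot_query, dk) for dk in zip(deq[0], deq[1])]
mean_abs_err = sum(abs(a - e) for a, e in zip(approx_scores, exact_scores)) / len(keys)

print(f"score drift from rotation alone: {score_drift:.2e}")
print(f"mean |score error| after 2-bit quantization: {mean_abs_err:.4f}")
```

Each channel here also carries an offset and a step alongside its 2-bit codes. Metadata of that kind is one plausible reason a nominally 2-bit scheme would land above 2.0 bits per element, though the brief does not break down the 2.28 figure.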

Why builders should care

Serving long-context LLMs is memory-hungry, and the KV cache is a major reason: it grows linearly with context length. OSCAR targets this bottleneck by shrinking the cache with minimal accuracy loss. Because the rotation is attention-aware, quantization is not applied blindly but respects the model’s internal structure, which yields a better accuracy-to-memory tradeoff. Builders can extend context windows or lower inference costs without a painful accuracy penalty.
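A back-of-the-envelope calculation shows the scale of the savings. The sketch below uses the standard KV cache size formula (keys plus values, per layer, per KV head, per token) with a hypothetical model configuration; the layer count, head count, head dimension, and context length are assumptions chosen for round numbers, not the specification of any Qwen3 model. Only the 2.28 bits-per-element figure comes from the brief.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bits_per_element):
    """KV cache size in bytes for a single sequence."""
    # Factor of 2: one key vector and one value vector per token, head, and layer.
    elements = 2 * num_layers * num_kv_heads * head_dim * seq_len
    return elements * bits_per_element / 8

# Hypothetical configuration (assumption, for illustration only).
config = dict(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=131072)

bf16_bytes = kv_cache_bytes(**config, bits_per_element=16)
quant_bytes = kv_cache_bytes(**config, bits_per_element=2.28)

GIB = 2 ** 30
print(f"BF16 cache:      {bf16_bytes / GIB:.2f} GiB")    # 16.00 GiB
print(f"2.28-bit cache:  {quant_bytes / GIB:.2f} GiB")   # 2.28 GiB
print(f"reduction:       {bf16_bytes / quant_bytes:.1f}x")  # 7.0x
```

Under these assumed dimensions, a single 128K-token sequence drops from 16 GiB of cache to about 2.3 GiB. The ratio of roughly 7x depends only on the bit widths, so it holds for any model shape.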

The practical takeaway

OSCAR is available now as open source, giving model deployers a practical lever for scaling long-context LLM workloads. At just over 2 bits per KV element, the cache is roughly 7x smaller than at BF16 (16 bits per element), which cuts memory footprint, memory bandwidth, and operating cost. The attention-aware design mitigates the usual accuracy drawbacks of naive quantization, so OSCAR pairs aggressive compression with preserved task performance. That lowers the barrier to running long-context models in production.

What to watch next

Next steps include integrating OSCAR with LLM architectures beyond Qwen3 and measuring its effect on latency and throughput in real-world inference. Watch for adoption by inference-serving platforms looking to cut costs and raise capacity for long-context applications such as document understanding and conversational AI. Follow-up work may extend OSCAR’s covariance-aware approach to refine quantization schemes or compress other LLM components.
