Together AI Open-Sources OSCAR: An Attention-Aware 2-Bit KV Cache Quantization System for Long-Context LLM …
What changed
Together AI has open-sourced OSCAR, a 2-bit KV cache quantization system for long-context large language model (LLM) serving. Unlike earlier methods that rely on fixed Hadamard transforms, OSCAR applies attention-aware rotations derived from the covariance of keys and values, which is estimated offline. This brings the storage cost down to 2.28 bits per KV element while keeping accuracy much closer to full precision, cutting the accuracy gap to BF16 by more than 3 points on Qwen3 models.
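To make the mechanics concrete, here is a minimal NumPy sketch of the general idea of rotating keys with a matrix estimated offline and then quantizing to 2 bits. It is not OSCAR's actual algorithm: the eigenvector rotation, the per-row quantizer, the random stand-in data, and every function name below are assumptions chosen for illustration. It shows only why a rotation is safe to apply: an orthogonal matrix leaves query-key dot products unchanged, so any error comes from the quantizer.

```python
import numpy as np

def covariance_rotation(calib: np.ndarray) -> np.ndarray:
    """Orthogonal matrix built from the eigenvectors of the calibration covariance.

    Illustrative stand-in for an offline-estimated rotation, not OSCAR's method.
    """
    centered = calib - calib.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / len(calib)
    _, eigvecs = np.linalg.eigh(cov)  # symmetric input -> orthonormal eigenvectors
    return eigvecs

def quantize_2bit(x: np.ndarray):
    """Per-row asymmetric uniform quantization to 4 levels (codes 0..3)."""
    lo = x.min(axis=-1, keepdims=True)
    hi = x.max(axis=-1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / 3.0
    codes = np.clip(np.round((x - lo) / scale), 0, 3).astype(np.uint8)
    return codes, scale, lo

def dequantize(codes: np.ndarray, scale: np.ndarray, lo: np.ndarray) -> np.ndarray:
    return codes.astype(np.float64) * scale + lo

rng = np.random.default_rng(0)
head_dim = 64  # hypothetical head dimension

# Offline step: estimate the rotation from calibration keys (random stand-ins here).
calib_keys = rng.standard_normal((2048, head_dim))
rotation = covariance_rotation(calib_keys)

# Serving step: rotate keys and the query with the same matrix, then quantize the keys.
keys = rng.standard_normal((16, head_dim))
query = rng.standard_normal(head_dim)
rotated_keys = keys @ rotation
rotated_query = query @ rotation

codes, scale, lo = quantize_2bit(rotated_keys)
exact_scores = keys @ query
approx_scores = dequantize(codes, scale, lo) @ rotated_query
```

Here `rotated_keys @ rotated_query` matches `exact_scores` up to floating-point error, while `approx_scores` differs by the quantization error. The choice of rotation matters because it determines how large that error is.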
Why builders should care
Serving long-context LLMs is memory-hungry, and the KV cache, which grows with context length, is a large part of that cost. OSCAR targets this bottleneck directly by shrinking the KV cache with minimal accuracy loss. Because the rotation is attention-aware, quantization is shaped by the model’s own key and value statistics rather than applied blindly, which yields a better accuracy-to-memory tradeoff. Builders who integrate OSCAR can extend context windows or lower inference costs without a painful accuracy penalty.
The practical takeaway
OSCAR is available now as an open-source tool, giving model deployers a practical lever for scaling long-context LLM workloads. At just over 2 bits per KV element, it cuts cache memory and the bandwidth needed to read it, which lowers storage and operating costs. The attention-aware design avoids much of the accuracy loss of naive quantization, so aggressive compression no longer has to come at the expense of task performance. That makes long-context models easier to justify in production.
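For a rough sense of scale, the back-of-the-envelope calculation below compares KV cache size at BF16 (16 bits per element) with the 2.28 bits per element reported for OSCAR. The model shape is a hypothetical example, not a specific Qwen3 configuration, and the function name is illustrative.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bits_per_element):
    """Size of the KV cache for one sequence, in bytes."""
    # Keys and values are both cached, hence the factor of 2.
    elements = 2 * num_layers * num_kv_heads * head_dim * seq_len
    return elements * bits_per_element / 8

# Hypothetical model shape and context length, chosen only for illustration.
cfg = dict(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=131072)

bf16_bytes = kv_cache_bytes(**cfg, bits_per_element=16)
oscar_bytes = kv_cache_bytes(**cfg, bits_per_element=2.28)
ratio = bf16_bytes / oscar_bytes

GIB = 2**30
print(f"BF16:      {bf16_bytes / GIB:.2f} GiB")   # 16.00 GiB
print(f"2.28-bit:  {oscar_bytes / GIB:.2f} GiB")  # 2.28 GiB
print(f"Reduction: {ratio:.2f}x")                 # 7.02x
```

Under these assumed dimensions, a single 128K-token sequence drops from 16 GiB of cache to about 2.28 GiB, roughly a 7x reduction. That headroom can go toward longer contexts or more concurrent requests.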
What to watch next
Next steps include integrating OSCAR with diverse LLM architectures beyond Qwen3 and tracking its impact on latency and throughput in real-world inference workflows. Watch for adoption by inference-serving platforms aiming to shave costs and boost capacity for long-context applications like document understanding and conversational AI. Further innovations may build on OSCAR’s covariance-aware approach to refine quantization schemes and compress other LLM components.
AI Quick Briefs Editorial Desk