UK’s AI Security Institute finds standard benchmarks systematically underestimate what AI agents can actual…

July 3, 2026

What changed

The UK’s AI Security Institute found that common AI benchmarks systematically underestimate what AI agents can do because they cap the compute budget during testing. When the token budget (the amount of input and output the model can process) was increased tenfold on software engineering benchmarks, success rates rose by about 25 percent. Newer models showed the biggest improvements. By the institute’s estimate, actual progress at the frontier of AI capability is around 60 percent higher than traditional benchmarks have indicated.

Why builders should care

Benchmark token limits throttle AI agent performance and hide how far models can go when given more compute. Builders who use AI for complex tasks such as coding or multi-step reasoning may have been misled by scores that make models look less capable than they are. Because models perform notably better with more tokens, strict caps during evaluation underrepresent real-world performance. That matters for tooling, automated coding services, and agent deployment strategies, all of which depend on accurate capability estimates.

The practical takeaway

Operators and developers should be cautious about relying on standard benchmark results that cap computational resources. Testing and deployment need to account for how token limits bottleneck AI performance, especially for newer and larger models. Raising token budgets can unlock substantial capability gains, so products and workflows may improve when agents are allowed longer inputs and outputs. Investors and decision-makers should factor in that AI agents might be further along technically than benchmarks suggest, and adjust their expectations for what AI-driven automation and coding tools can deliver.
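One way to check this on your own workload is to run the same task set at more than one token budget and compare success rates. The Python sketch below is a minimal illustration, not the institute's methodology. The names `success_rate_by_budget`, `run_agent`, and `fake_agent` are hypothetical, and the toy agent and its numbers are invented purely so the example runs. In practice `run_agent` would wrap whatever agent harness you already use.

```python
from typing import Callable, Dict, Iterable, List


def success_rate_by_budget(
    tasks: List[str],
    budgets: Iterable[int],
    run_agent: Callable[[str, int], bool],
) -> Dict[int, float]:
    """Return the fraction of tasks solved at each token budget.

    run_agent(task, budget) is a hypothetical callable that runs one task
    under the given token budget and reports whether it succeeded.
    """
    if not tasks:
        raise ValueError("tasks must not be empty")
    rates: Dict[int, float] = {}
    for budget in budgets:
        solved = sum(1 for task in tasks if run_agent(task, budget))
        rates[budget] = solved / len(tasks)
    return rates


def fake_agent(task: str, budget: int) -> bool:
    # Toy stand-in (an assumption, not real data): pretend each task
    # needs 1,000 tokens per character of its name to be solved.
    return budget >= 1000 * len(task)


if __name__ == "__main__":
    tasks = ["a", "bb", "cccc", "dddddddd"]
    rates = success_rate_by_budget(tasks, [2_000, 20_000], fake_agent)
    for budget, rate in rates.items():
        print(f"{budget:>6} tokens: {rate:.0%} of tasks solved")
```

If the success rate keeps climbing as the budget grows, a score measured at the lower cap is likely understating what the agent can do on that workload.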

What to watch next

Expect evolving benchmarks that relax token constraints or introduce dynamic compute allowances to better reflect AI agent capabilities. New evaluations will influence how vendors market models and what users expect from AI-powered automation. Watch for updated performance claims, adjusted pricing models tied to token usage, and tools that optimize for longer context windows. Regulators and security teams should be aware that underestimated AI power could impact safety and control frameworks as agents become more capable than previously measured.

AI Quick Briefs Editorial Desk
