Models & Research

An AI model programmed nonstop for 19 days on a single MirrorCode task that cost $2,600 to run

AI Quick Briefs Editorial Desk · June 26, 2026

What changed

Epoch AI introduced the MirrorCode benchmark to test AI models’ ability to reconstruct entire programs without seeing the original source code. Its tasks include rebuilding a 16,000-line codebase, and models are scored on how well they can reverse engineer a program and generate working code from scratch. Claude Opus 4.7 led the benchmark, solving 56% of the tasks and reproducing the large toolkit in 14 hours. The hardest tasks were far more expensive: some ran for as long as 19 days and cost over $2,600 in compute.
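To put the longest run in perspective, the two figures above can be turned into a rough burn rate. This is a minimal sketch, not Epoch AI's own accounting: it assumes the 19-day run was continuous and that cost accrued evenly, and it treats "over $2,600" as a lower bound. The function name `burn_rate` is hypothetical.

```python
# Figures taken from the brief. "Over $2,600" is treated as a lower bound,
# so the rates printed below are lower bounds too.
RUN_DAYS = 19
RUN_COST_USD = 2_600


def burn_rate(cost_usd: float, days: float) -> tuple[float, float]:
    """Return (USD per day, USD per hour), assuming an evenly spread, continuous run."""
    if days <= 0:
        raise ValueError("days must be positive")
    per_day = cost_usd / days
    return per_day, per_day / 24


per_day, per_hour = burn_rate(RUN_COST_USD, RUN_DAYS)
print(f"at least ${per_day:.2f}/day, ${per_hour:.2f}/hour")
# prints: at least $136.84/day, $5.70/hour
```

At this rate the bill is driven mainly by how long the run lasts, which is why the cost grows large only on the multi-day tasks.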

Why builders should care

MirrorCode shifts the focus from simple code completion to full program synthesis under tough constraints. For developers and AI builders, the results show that current models still struggle to recreate complex, large-scale programs without access to the original source. Even the leading model solved only 56% of the tasks, which points to limits in scalability and accuracy and argues for cautious expectations when applying AI to legacy system rewrites or full software rebuilds. The long runtimes and steep costs on the hardest tasks underline how resource-intensive this work currently is.

The practical takeaway

AI models can assist in recreating and maintaining software at scale, but only up to a point. For now, models like Claude Opus 4.7 are useful for mid-scale automation and for speeding up code recovery tasks. Businesses attempting full legacy code reconstruction will face long turnaround times and high compute bills, which limits feasibility. Operators should plan hybrid strategies that combine AI suggestions with human validation and target tasks selectively to balance cost against output quality.

What to watch next

Look for advances that reduce compute time and cost while improving accuracy on large-scale code tasks, which will unlock more practical applications for software modernization. Also watch for new benchmarks pushing beyond MirrorCode’s current challenges to test model robustness in more diverse and complex programming environments. Finally, pricing trends around extended AI runs will reveal how economically viable deep code synthesis becomes for enterprises and software teams.

