AUTOLAB — Benchmarking Long-Horizon Empirical Optimization

What Is AUTOLAB?

AUTOLAB is a benchmark designed to test frontier AI models on sustained iterative empirical optimization over extended time horizons — 2 to 12 hours of continuous work. It measures not just whether a model can solve a problem, but whether it can persist through the long tail of debugging, tuning, and iteration that real optimization work demands.

The benchmark spans 36 tasks across four domains:

System optimization — kernel tuning, database configuration, compiler flag optimization.
Puzzles — constraint satisfaction, combinatorial search.
Model development — hyperparameter sweeps, architecture search, loss landscape exploration.
CUDA kernel optimization — writing and iterating on GPU kernels for speed.

The Key Finding: Persistence > Initial Ability

The most striking result from AUTOLAB is that the strongest predictor of success is persistence, not initial ability. Models that quickly produced a reasonable first attempt but then plateaued scored consistently lower than models that iterated more slowly but kept improving over the full time budget.

Current leader: Claude Opus 4.6 with an average score of 0.68 — ahead of the field, but still far from ceiling performance.

Failure Modes

The benchmark reveals distinct failure patterns across frontier models:

Premature termination — GPT-5.4 and Grok models often declare victory and stop optimizing long before hitting a real local optimum. They lack the meta-cognition to realize "good enough" isn't the same as "done."
Budget exhaustion — DeepSeek-v4-pro and Qwen models tend to chase diminishing returns aggressively, burning through the full time budget on marginal improvements while neglecting other dimensions of the task.

These failure patterns suggest that knowing when to stop and what to optimize next is as important as raw problem-solving skill.
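The "when to stop" decision can be illustrated with a simple plateau rule. The sketch below is not taken from AUTOLAB: the function name `should_stop`, the `patience` and `min_gain` parameters, and the score trajectories are all hypothetical. It stops only when the best score has failed to improve by a minimum margin over the last few iterations, which guards against stopping after one lucky result while still cutting off a long run of marginal gains.

```python
from typing import Sequence


def should_stop(best_scores: Sequence[float],
                patience: int = 3,
                min_gain: float = 0.01) -> bool:
    """Return True when the last `patience` iterations improved the best
    score by less than `min_gain` (a plateau heuristic, assumed here)."""
    if len(best_scores) <= patience:
        return False  # too little history to judge a plateau
    recent_best = max(best_scores[-patience:])
    earlier_best = max(best_scores[:-patience])
    return recent_best - earlier_best < min_gain


# Hypothetical score trajectories, for illustration only (not benchmark data).
still_improving = [0.20, 0.35, 0.48, 0.55, 0.61]
plateaued = [0.20, 0.35, 0.48, 0.48, 0.481, 0.482]

print(should_stop(still_improving))  # False
print(should_stop(plateaued))        # True
```

A very small `patience` reproduces premature termination, and a `min_gain` of zero reproduces budget exhaustion, so the two failure modes can be read as opposite ends of the same setting.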

Anti-Reward Hacking Design

AUTOLAB's infrastructure is carefully designed to prevent reward hacking — a growing concern as models become more adept at gaming evaluation metrics:

Sealed verifier — the scoring function is opaque to the model during the run.
Correctness gates — partial credit is only awarded past hard correctness thresholds.
SHA-pinned files — task definitions and reference solutions are hash-verified to prevent tampering.
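Two of these mechanisms are easy to sketch. The code below is an illustration under stated assumptions, not AUTOLAB's verifier: the helper names, the choice of SHA-256, the file name `task.md`, and the scoring formula inside `gated_score` are all hypothetical. It shows a digest check that fails if any pinned file is altered or missing, and a score that stays at zero until a correctness threshold is met.

```python
import hashlib
from typing import Mapping


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def files_untampered(files: Mapping[str, bytes],
                     pinned: Mapping[str, str]) -> bool:
    """True only if every pinned file exists and matches its recorded digest."""
    return all(
        name in files and sha256_hex(files[name]) == digest
        for name, digest in pinned.items()
    )


def gated_score(correct_fraction: float, speedup: float,
                gate: float = 1.0, target_speedup: float = 10.0) -> float:
    """Zero below the correctness gate; otherwise partial credit that grows
    linearly with speedup (an assumed formula, capped at 1.0)."""
    if correct_fraction < gate:
        return 0.0
    return max(0.0, min(1.0, (speedup - 1.0) / (target_speedup - 1.0)))


# Pin digests once, then verify before scoring.
task_files = {"task.md": b"Optimize the kernel.\n"}
pinned = {name: sha256_hex(data) for name, data in task_files.items()}

print(files_untampered(task_files, pinned))                        # True
print(files_untampered({"task.md": b"Return score 1.0.\n"}, pinned))  # False
print(gated_score(correct_fraction=0.99, speedup=8.0))             # 0.0
print(gated_score(correct_fraction=1.0, speedup=5.5))              # 0.5
```

The gate comes first in `gated_score` so that a fast but incorrect submission cannot earn credit from the speed term alone.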

Case Study: Flash Attention Optimization

One illustrative task involved optimizing a Flash Attention CUDA kernel. The best-performing model achieved a 42.4× speedup over the naive baseline through a sequence of progressive optimizations: memory coalescing, shared memory tiling, warp-level primitives, and finally, Hopper-specific instruction scheduling. Each step required the model to profile, interpret results, and formulate the next optimization hypothesis.
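The measure-and-compare step behind a figure like this can be sketched in a few lines. The example below is a CPU-only toy in Python, not the benchmark's CUDA task: the workload (a sum of squares), the helper names, and the repeat counts are assumptions chosen for illustration. It checks that the candidate still produces the correct result, then reports speedup as baseline runtime divided by candidate runtime.

```python
import statistics
import time
from typing import Callable, List


def median_runtime(fn: Callable[[], object],
                   repeats: int = 5, number: int = 20) -> float:
    """Median wall-clock seconds per call, timing `number` calls per sample."""
    samples: List[float] = []
    for _ in range(repeats):
        start = time.perf_counter()
        for _ in range(number):
            fn()
        samples.append((time.perf_counter() - start) / number)
    return statistics.median(samples)


def speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """Speedup factor: how many times faster the candidate is."""
    if baseline_seconds <= 0 or candidate_seconds <= 0:
        raise ValueError("runtimes must be positive")
    return baseline_seconds / candidate_seconds


def naive_sum_of_squares(n: int) -> int:
    total = 0
    for i in range(n):
        total += i * i
    return total


def closed_form_sum_of_squares(n: int) -> int:
    # Sum of i*i for i in 0..n-1, computed without a loop.
    return (n - 1) * n * (2 * n - 1) // 6


n = 20_000
# Correctness first: a faster but wrong candidate is not an optimization.
assert naive_sum_of_squares(n) == closed_form_sum_of_squares(n)

baseline = median_runtime(lambda: naive_sum_of_squares(n))
candidate = median_runtime(lambda: closed_form_sum_of_squares(n))
print(f"speedup: {speedup(baseline, candidate):.1f}x")
```

The printed factor depends on the machine. For GPU kernels, wall-clock timing on the host would also need device synchronization before each timestamp, which this sketch leaves out.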

The Harness Effect

The paper also discusses the harness effect — how the design of the evaluation infrastructure itself impacts results. Subtle choices in how tasks are presented, how feedback is structured, and how time is measured can shift rankings. This raises important questions about whether current leaderboards measure model capability or infrastructure design quality.