Agents' Last Exam (ALE): Benchmarking Economically Valuable AI Work

Paper: arXiv 2606.05405 · Authors: Yiyou Sun, Xinyang Han, Dawn Song et al. · Institution: UC Berkeley

Problem & Motivation

Existing AI benchmarks are increasingly saturated and fail to measure capabilities that translate into real economic value. Benchmarks like MMLU, GSM8K, and HumanEval test narrow skills in isolation. Even agent benchmarks like SWE-bench and WebArena focus on specific domains. There is no comprehensive benchmark designed to measure the economically valuable work AI agents can perform: tasks that require real professional judgement, end-to-end workflows, and domain expertise across multiple industries.

Method / Approach

ALE was constructed by 250+ domain experts across 55 subfields in 13 industry clusters, producing 1,490 task instances. It has three design requirements:

Representativeness: Tasks must reflect the professional software and workflows actually used in industry, not toy problems.
Complexity: Tasks must require end-to-end workflows with multiple steps, tool use, and domain knowledge.
Verifiability: Tasks should have a deterministic evaluation procedure that ensures objective scoring; 93.2% of tasks do.

Three difficulty tiers:

- Near-Term (top 38.1% of tasks solvable): approximates tasks current frontier agents can handle
- Full-Spectrum (22.7% solvable): representative of professional-level work
- Last-Exam (<1% solvable): the hardest tasks, beyond any current system

The evaluation pipeline: Task Specification → Agent → Remote VM Environment → Deterministic Scoring.
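The four pipeline stages can be pictured with a minimal Python sketch. This is an illustration under stated assumptions, not the ALE implementation: every name here (`TaskSpec`, `LocalSandbox`, `run_task`, `toy_agent`) is hypothetical, and an in-memory dictionary stands in for the remote VM.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# All names below are hypothetical and chosen for illustration only;
# none of them are taken from the ALE codebase.


@dataclass(frozen=True)
class TaskSpec:
    task_id: str
    instructions: str
    # Deterministic checker: maps the final environment state to pass/fail.
    check: Callable[[Dict[str, str]], bool]


class LocalSandbox:
    """Stand-in for a remote VM: a dict of file name -> file contents."""

    def __init__(self) -> None:
        self.files: Dict[str, str] = {}

    def write(self, name: str, contents: str) -> None:
        self.files[name] = contents

    def snapshot(self) -> Dict[str, str]:
        return dict(self.files)


Agent = Callable[[str, LocalSandbox], None]


def run_task(spec: TaskSpec, agent: Agent) -> bool:
    env = LocalSandbox()               # fresh environment per task
    agent(spec.instructions, env)      # the agent acts inside the environment
    return spec.check(env.snapshot())  # score the resulting state


def toy_agent(instructions: str, env: LocalSandbox) -> None:
    # A trivial "agent" that always writes the same report.
    env.write("report.txt", "total=42")


spec = TaskSpec(
    task_id="demo-001",
    instructions="Write the total to report.txt as 'total=<n>'.",
    check=lambda files: files.get("report.txt") == "total=42",
)

result = run_task(spec, toy_agent)
print(result)
```

Because the checker depends only on the final environment state, rerunning it on the same snapshot always gives the same verdict, which is the property the deterministic-scoring stage relies on.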

Failure taxonomy derived from agent runs:

- Understanding 31%: Domain knowledge errors (25%), Hallucination (6%)
- Approach 47%: Wrong strategy (30%), Incomplete solution (17%)
- Execution 22%: Implementation bugs, tool misuse, environment interaction errors
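As a quick consistency check, the taxonomy can be written down as a small Python structure and summed. The percentages are copied from the taxonomy above; the dictionary layout and variable names are assumptions made for this sketch.

```python
# Percentages come from the taxonomy above; the layout is an assumption.
FAILURE_TAXONOMY = {
    "Understanding": {"Domain knowledge errors": 25, "Hallucination": 6},
    "Approach": {"Wrong strategy": 30, "Incomplete solution": 17},
    # The source gives only a category total for Execution, with no
    # per-item breakdown, so it is kept as a single aggregate entry.
    "Execution": {
        "Implementation bugs, tool misuse, environment interaction errors": 22
    },
}

# Sum the sub-categories within each top-level category.
category_totals = {
    name: sum(items.values()) for name, items in FAILURE_TAXONOMY.items()
}

# The three categories should account for every failure.
overall = sum(category_totals.values())

# Find the single largest entry across all categories.
largest_subcategory = max(
    (
        (sub, pct)
        for items in FAILURE_TAXONOMY.values()
        for sub, pct in items.items()
    ),
    key=lambda pair: pair[1],
)

print(category_totals)
print(overall)
print(largest_subcategory)
```

The sub-category shares add up to the stated category totals (31, 47, 22), and the three categories together cover 100% of failures.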

Key Results

Setting	Metric	Result
Task instances	Total across 55 subfields	1,490
Score	Near-Term tier solvable	38.1%
Score	Full-Spectrum tier solvable	22.7%
Score	Last-Exam tier solvable	<1%
Deterministic scoring	Fraction of tasks	93.2%
Model choice spread	vs harness choice spread	3×
Most common failure	Wrong strategy (Approach)	30%

Model choice accounts for 3× the performance spread of harness choice, suggesting the underlying model capability is the dominant factor.
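One way such a spread ratio could be computed is sketched below. This summary does not say how the paper defines "spread", so the sketch assumes it is the range (max minus min) of mean scores along each axis. The score grid is made up for illustration and is not data from the paper.

```python
from statistics import mean

# Made-up solve rates (percent), for illustration only; NOT paper results.
scores = {
    "model_a": {"harness_x": 10, "harness_y": 14},
    "model_b": {"harness_x": 16, "harness_y": 20},
    "model_c": {"harness_x": 22, "harness_y": 26},
}


def spread(values):
    """Range of a collection of scores (an assumed definition of 'spread')."""
    values = list(values)
    return max(values) - min(values)


# Average each model over all harnesses, and each harness over all models.
model_means = [mean(list(by_harness.values())) for by_harness in scores.values()]
harnesses = sorted({h for by_harness in scores.values() for h in by_harness})
harness_means = [mean([scores[m][h] for m in scores]) for h in harnesses]

ratio = spread(model_means) / spread(harness_means)
print(ratio)
```

With this toy grid the model means span 12 points and the harness means span 4, so the ratio is 3.0. Other definitions of spread (for example standard deviation) would give different numbers on real data.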

Contributions

Large-scale, multi-domain benchmark (1,490 tasks, 55 subfields, 13 clusters) designed specifically for economically valuable AI work.
Three-tier difficulty structure that provides clear capability baselines and headroom even for frontier models.
Detailed failure taxonomy derived from real agent runs, providing actionable diagnostics for system improvement.
Demonstrates that model choice dominates harness choice by a 3× factor — an important finding for practitioners.

Strengths

Real economic grounding: Tasks are designed by domain experts to reflect actual professional workflows, not academic toy problems.
Deterministic evaluation: Objective scoring for 93.2% of tasks avoids the subjectivity and grader noise that affect LLM-as-judge evaluations.
Actionable failure taxonomy: Understanding how agents fail (47% approach errors vs 22% execution errors) directly informs where to focus engineering effort.
The 3× model-vs-harness finding: A provocative result that challenges the "harness engineering > model capability" narrative common in the agent community.

Weaknesses / Limitations

Single-institution construction: 250+ domain experts from a single institution (UC Berkeley) may introduce systematic bias in task selection and difficulty calibration.
VM-based evaluation cost: Running agents in remote VMs for each task instance is expensive and may limit benchmark accessibility for smaller labs.
Economic value proxy: Tasks are designed to reflect "economically valuable work" but the mapping from task performance to actual economic productivity is not validated.
Rapid saturation risk: The <1% Last-Exam tier may be solved faster than expected given the pace of AI progress.

Connections & Follow-ups

ALE sits in the lineage of agent benchmarks: SWE-bench (software engineering), WebArena (web tasks), AgentBench (general agents), and GAIA (assistant agents). It distinguishes itself through breadth (13 industry clusters) and its economic-relevance criteria. The three-tier design anticipates the "saturate, then replace with a harder variant" cycle seen in the move from MMLU to MMLU-Pro. The failure taxonomy also provides a useful framework for interpreting results on other agent benchmarks.

My Take

ALE is a welcome contribution to the benchmarking ecosystem. The economic-relevance criterion is an important corrective to benchmarks that measure capabilities without asking "capabilities for what?" The failure taxonomy is particularly valuable: the finding that 47% of errors are approach/strategy failures, versus 22% execution failures, suggests that prompting and planning (not just coding) are the bottleneck. The 3× model > harness finding will likely be cited heavily in procurement decisions. My main concern is the saturation timeline: with ALE Near-Term at 38% today, that tier could be solved in 12-18 months. The Last-Exam tier's <1% baseline provides headroom, but a benchmark is most useful when its middle tier also provides resolution.