Can AI Refute Economic Theory?

Source: Can AI Refute Economic Theory?

TL;DR

Alexis Akira Toda (Emory University) tests whether LLMs can identify errors in published economic theory papers. The study covers four papers with known errors and four systems: Gemini, Refine, Claude, and ChatGPT. The headline result is negative: no model located a true error without substantial human guidance, and data contamination complicates interpretation. With guidance, however, ChatGPT Pro 5.5 was "truly amazing": it constructed valid counterexamples for Tirole (1985), provided a corrected proof for Kocherlakota (1992) more elegant than the published corrigendum, and correctly identified a tautological transversality condition in Miao & Wang (2018). Claude Opus 4.8 was weaker on math but stronger on economic judgment. Gemini performed worst, defending errors with plausible but unfounded arguments. The conclusion: "a competent human paired with a frontier model can outperform current peer review, but AI cannot yet refute economic theory on its own."

The Study Design

Professor Toda selected four published economic theory papers with known errors — errors that had been formally identified and, in some cases, accompanied by published corrections. The papers were:

Tirole (1985): "Asset Bubbles and Overlapping Generations"
Kocherlakota (1992): "Bubbles and Constraints on Debt Accumulation"
Miao & Wang (2018): "Asset Bubbles and Credit Constraints"
A fourth paper, not named in this summary (its details are discussed in the paper itself)

These papers were chosen because their errors had been definitively established, allowing a clean test of whether LLMs could independently discover and articulate the problems.

Models Tested

The study evaluated four frontier models:

Gemini (Google)
Refine (a reasoning-focused model)
Claude Opus 4.8 (Anthropic)
ChatGPT Pro 5.5 (OpenAI)

Each model was first given the papers cold and asked to identify errors, with no human guidance on where to look.

The Negative Result: No Autonomous Discovery

The headline finding is sobering: no model located a true error without substantial human guidance. When given papers cold (without hints about where the problem lay), all models either missed the errors entirely or produced false positives — identifying correct mathematical derivations or modeling choices as errors.
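The three outcomes described above (finding a known error, missing it, or flagging something that is actually correct) can be made concrete with a small scoring sketch. This is not the study's procedure, which this summary does not describe at that level of detail; the names `Verdict`, `score_flags`, and `located_true_error`, and the location labels in the usage note, are hypothetical and introduced here only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    hits: int             # flagged locations that match a known error
    false_positives: int  # flagged locations that are actually correct
    misses: int           # known errors the model never flagged


def score_flags(known_errors: set[str], flagged: set[str]) -> Verdict:
    """Compare the locations a model flagged against the known error locations.

    Both arguments are sets of free-form location labels (hypothetical here),
    e.g. the name of a proposition or lemma.
    """
    return Verdict(
        hits=len(known_errors & flagged),
        false_positives=len(flagged - known_errors),
        misses=len(known_errors - flagged),
    )


def located_true_error(verdict: Verdict) -> bool:
    """A run counts as a success only if at least one known error was flagged."""
    return verdict.hits > 0
```

Under these assumptions, a run such as `score_flags({"Proposition 1"}, {"Lemma 2", "Equation (7)"})` scores zero hits, two false positives, and one miss, which is the pattern the unguided runs are reported to have shown.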

Data contamination further complicates interpretation. Because published economic theory papers, and the corrections to them, are likely part of the models' training data, a correct answer cannot be cleanly attributed to genuine reasoning rather than to memorization of known corrections. A model that correctly identifies an error in Tirole (1985) might be repeating a critique it encountered in training rather than reasoning through the mathematics.

The Positive Result: ChatGPT Pro 5.5's Performance

When given targeted guidance or specific questions, ChatGPT Pro 5.5 demonstrated remarkable capabilities:

Tirole (1985): Constructed valid, original counterexamples that exposed the logical flaw — going beyond what appeared in the published correction
Kocherlakota (1992): Produced a corrected proof that was more elegant than the published corrigendum, simplifying the mathematics while maintaining rigor
Miao & Wang (2018): Correctly identified that a transversality condition used in the proof was tautological — it assumed what needed to be proved, meaning the proof was circular
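
For readers unfamiliar with the term, the role a transversality-type condition usually plays in bubble models can be sketched in generic notation. This is textbook discrete-time asset pricing, not the specific condition in Miao & Wang (2018), which this summary does not reproduce. The symbols are assumptions introduced here: q_t is the date-0 price of date-t consumption, P_t the asset price, D_t the dividend, and V_t the fundamental value.

```latex
\begin{align*}
% No-arbitrage between dates t and t+1
q_t P_t &= q_{t+1}\,(P_{t+1} + D_{t+1}) \\
% Iterating forward to a horizon T > t
q_t P_t &= \sum_{s=t+1}^{T} q_s D_s + q_T P_T \\
% Letting T go to infinity: fundamental value plus a bubble term
P_t &= \underbrace{\frac{1}{q_t}\sum_{s=t+1}^{\infty} q_s D_s}_{V_t}
       + \frac{1}{q_t}\lim_{T\to\infty} q_T P_T \\
% No-bubble (transversality-type) condition
\lim_{T\to\infty} q_T P_T = 0 &\iff P_t = V_t
\end{align*}
```

In this generic setting, the limit condition does real work only when it is derived from the model's primitives. If it is instead imposed in a form that already presupposes the conclusion, the argument built on it is circular, which is the kind of problem described in the last item above.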

Professor Toda described ChatGPT Pro 5.5's performance as "truly amazing" in these guided contexts.

Model Comparisons

Claude Opus 4.8 showed a different profile: it was weaker on formal mathematical reasoning (less reliable at constructing counterexamples or following complex algebraic derivations) but stronger on economic judgment — it could identify when a model's assumptions didn't match economic reality, even when it couldn't pinpoint the mathematical error.

Gemini performed worst across all dimensions. The model frequently defended the erroneous papers with plausible-sounding but mathematically unfounded arguments, showing confidence in its incorrect analyses.

Refine performed somewhere between Gemini and the two stronger models (Claude Opus 4.8 and ChatGPT Pro 5.5), occasionally identifying surface-level issues but missing deeper structural problems.

Implications for Peer Review

The study's central conclusion has two components:

"A competent human paired with a frontier model can outperform current peer review."

This is the optimistic reading. A human reviewer who already suspects a problem can use an LLM as a powerful co-pilot — having the model construct counterexamples, check proofs, or suggest alternative approaches. The human provides the judgment about what to investigate; the model provides the computational and mathematical horsepower.

"AI cannot yet refute economic theory on its own."

This is the pessimistic boundary. The models lack the autonomous discovery capability needed to replace human reviewers. They have no reliable internal "error detector": left unguided, they may confidently affirm a flawed proof or flag a correct derivation as an error.

Key Takeaways

No LLM could autonomously locate true errors in published economic theory papers without substantial human guidance
Data contamination makes it unclear whether correct identifications reflect genuine reasoning or memorization
ChatGPT Pro 5.5 was "truly amazing" when guided — constructing valid counterexamples, providing elegant corrected proofs, and identifying tautological reasoning
Claude Opus 4.8 showed stronger economic judgment but weaker formal mathematical reasoning than ChatGPT Pro 5.5
Gemini performed worst, confidently defending erroneous arguments with unfounded reasoning
A competent human paired with a frontier model may outperform current peer review, but AI cannot yet refute economic theory independently