Code-to-Paper Mapping Assessment: Local LLM Evaluation

Source: github.com/nathanlgabriel/paper_code_mapping_assessment
Author: Nathan L. Gabriel
Date: 2025-07-01


Core Thesis

Local LLMs have advanced enough to perform a task that was previously out of reach for small local models: mapping computational simulation code to its corresponding academic research paper with high accuracy. The more surprising finding is that a decent local model combined with an intelligent human can outperform even a far more powerful frontier model.


The Task

Evaluate local LLMs on their ability to map computational simulation code to its corresponding research paper, tracking iterative refinement from initial outputs to corrected mappings.

"Ultimately, my assessment is that the hype is real. Qwen 3.6, Gemma 4, and Nemotron Nano were all able to do reasonably well at a task that was impossible for small local models a few months ago."


Critical Finding: Local Model + Human > Frontier Model

The author found a substantive oversight in Claude Sonnet's mapping of the replicator dynamics code. The oversight concerned continuous versus discrete population representation and the prevention of rounding bias.

  • Claude Opus 4.7 failed to identify the issue even after a follow-up prompt that specifically asked about omissions related to avoiding statistical biases, even though comments in the Python code explicitly state the purpose of the modification.
  • Qwen 3.6 35B A3B, given more detailed, guided prompts, successfully identified the relevant code sections and produced the definitive code-to-paper mapping.
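
To make the kind of issue at stake concrete, here is a minimal sketch of how rounding can bias a population model. It is not the author's notebook code: the function names (replicator_step, floor_counts, stochastic_counts), the fitness values, and the choice of stochastic rounding as the remedy are all assumptions made for illustration. It shows a textbook discrete-time replicator update on continuous shares, then two ways of converting shares to whole individuals.

```python
import random

def replicator_step(shares, fitness):
    """One discrete-time replicator update on continuous population shares."""
    mean_fitness = sum(s * f for s, f in zip(shares, fitness))
    return [s * f / mean_fitness for s, f in zip(shares, fitness)]

def floor_counts(shares, population):
    """Convert shares to whole individuals by truncation (biased downward)."""
    return [int(s * population) for s in shares]

def stochastic_counts(shares, population, rng):
    """Round each expected count up with probability equal to its fractional
    part, so the expected value of every count equals share * population."""
    counts = []
    for s in shares:
        expected = s * population
        base = int(expected)
        counts.append(base + (1 if rng.random() < expected - base else 0))
    return counts

# Illustrative values only (assumption): two types, equal shares, unequal fitness.
shares = replicator_step([0.5, 0.5], fitness=[1.1, 0.9])  # roughly [0.55, 0.45]
rng = random.Random(0)
trials = 10_000
mean_first = sum(stochastic_counts(shares, 10, rng)[0] for _ in range(trials)) / trials
print(floor_counts(shares, 10))  # truncation drops part of the population
print(round(mean_first, 2))      # close to the expected 5.5
```

Truncation always rounds down, so a discrete representation built this way loses individuals systematically, while the stochastic variant is unbiased in expectation. A mapping that skips over code of this kind would miss the reason the modification exists.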

"Conclusion: A decent local model + an intelligent human can still be smarter than an overpowered frontier model."


Models Evaluated

Model              Performance
Qwen 3.6 35B A3B   Standout performer. Captured 75-80% of the definitive mapping. Fast inference.
Qwen 3.6 27B       Reasonable baseline; solid structural understanding but required more extensive correction.
Gemma 4 26B        Generated initial mappings requiring substantial correction; good grasp of simulation pipeline.
Nemotron Nano      Delivered reasonable baseline outputs.
Gemma 4 31B        Failed: exceeded context/VRAM limits, resulting in inference crashes.
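
The summary does not say how the 75-80% figure was measured, and it may be an informal estimate. As one possible way to quantify such a figure, the sketch below assumes a mapping is represented as a set of (code location, paper section) pairs and reports the fraction of the reference pairs that a model's mapping also contains. The function name and the example pairs are hypothetical.

```python
def mapping_coverage(candidate, reference):
    """Fraction of reference pairs that also appear in the candidate mapping."""
    if not reference:
        raise ValueError("reference mapping must not be empty")
    return len(candidate & reference) / len(reference)

# Hypothetical pairs for illustration; not taken from the actual notebooks.
reference = {
    ("init_population", "Sec. 2.1"),
    ("payoff_matrix", "Sec. 2.2"),
    ("replicator_update", "Sec. 3.1"),
    ("rounding_correction", "Appendix A"),
}
candidate = {
    ("init_population", "Sec. 2.1"),
    ("payoff_matrix", "Sec. 2.2"),
    ("replicator_update", "Sec. 3.1"),
}
print(mapping_coverage(candidate, reference))
```

A set-based measure like this counts only exact matches, so it would undercount mappings that are partially right.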

Methodology

  • Context Window: each model used roughly 160k tokens of context; the maximum was configured at up to 262,144 tokens.
  • Iterative Process: Models received structured prompts; outputs were corrected and refined across multiple attempts.
  • Source Materials: Research paper PDF, supplementary appendices (partial and full), and two Jupyter notebooks (reinforcement learning and replicator dynamics).
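
As a rough sketch of the preparation this setup implies, the code below pulls the code cells out of a Jupyter notebook (an .ipynb file is JSON with a "cells" list) and checks whether the combined inputs fit the configured context. The 4-characters-per-token ratio and the reserved_for_output parameter are assumptions for illustration; actual token counts depend on each model's tokenizer, and the author's own tooling is not described in this summary.

```python
import json

CONTEXT_LIMIT = 262_144   # maximum configured context, as reported above
CHARS_PER_TOKEN = 4       # assumption: crude average; real tokenizers differ

def notebook_code(nb_text):
    """Return the concatenated source of all code cells in an .ipynb document."""
    notebook = json.loads(nb_text)
    chunks = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        # The notebook format allows source as a string or a list of lines.
        chunks.append(source if isinstance(source, str) else "".join(source))
    return "\n\n".join(chunks)

def estimate_tokens(text):
    """Very rough token estimate from character count."""
    return -(-len(text) // CHARS_PER_TOKEN)  # ceiling division

def fits_in_context(texts, limit=CONTEXT_LIMIT, reserved_for_output=8_192):
    """Check whether the combined inputs leave room for the model's reply."""
    return sum(estimate_tokens(t) for t in texts) + reserved_for_output <= limit

# Tiny in-memory notebook so the example runs without any files.
sample = json.dumps({"cells": [
    {"cell_type": "markdown", "source": ["# Replicator dynamics"]},
    {"cell_type": "code", "source": ["import numpy as np\n", "x = np.ones(3)"]},
]})
code = notebook_code(sample)
print(code)
print(estimate_tokens(code), fits_in_context([code]))
```

In practice the paper text and appendices would be added to the list passed to fits_in_context alongside the notebook code. A check like this can flag an oversized input before a run, though it says nothing about VRAM limits.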

Key Insight

The hype around local LLMs is justified: they can now handle complex code-to-paper mapping tasks that were impossible for small local models a few months ago. The deeper insight concerns human-AI collaboration: careful prompting and human oversight with a capable local model can catch errors that even the most powerful frontier models miss.