Self-Harness: Harnesses That Improve Themselves

Paper: arXiv 2606.09498 · Authors: Hangfan Zhang, Shao Zhang et al. · Institution: Shanghai AI Lab

Problem & Motivation

Agentic systems rely on a "harness" — the combination of system prompt, tool definitions, runtime orchestration logic, and execution environment — to operate effectively. Currently, harness design is a manual, expert-driven process requiring extensive engineering effort. The key question: can the same fixed LLM that runs within a harness also be used to improve that harness, without any external supervision, additional training, or human intervention?

Method / Approach

Self-Harness treats the harness as a tuple of four components: system prompt, tool descriptions, runtime, and orchestration logic. Improvement runs as a 3-stage loop driven by the same frozen LLM:

  1. Weakness Mining — Run the agent on a held-in task split. Collect failure traces and cluster them by verifier-grounded signatures (deterministic failure patterns extracted from task verifiers, not subjective judgment). This produces interpretable, verifier-backed clusters of failure modes.

  2. Harness Proposal — Conditioned on the mined weaknesses, the LLM generates K diverse, minimal candidate edits to the harness. Diversity is encouraged via temperature sampling and a diversity penalty; minimality is enforced by constraining edit scope.

  3. Proposal Validation — Each candidate harness is tested on both the held-in split (where weaknesses were mined) and a held-out split. A strict non-regression criterion is applied: both splits must improve or stay flat. Candidates failing either are discarded.
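
The weakness-mining stage can be illustrated with a minimal sketch. It assumes each run is stored as a record listing the verifier checks that failed; the `Trace` layout, the check names, and the choice of "sorted set of failed checks" as the signature are hypothetical, not taken from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Trace:
    """One agent run on one task (hypothetical record layout)."""
    task_id: str
    passed: bool
    failed_checks: tuple[str, ...]  # names of verifier checks that failed


def signature(trace: Trace) -> tuple[str, ...]:
    """Deterministic failure signature: the sorted set of failed checks."""
    return tuple(sorted(set(trace.failed_checks)))


def mine_weaknesses(traces: list[Trace]) -> list[tuple[tuple[str, ...], list[str]]]:
    """Group failing traces by signature, largest clusters first."""
    clusters: dict[tuple[str, ...], list[str]] = defaultdict(list)
    for t in traces:
        if not t.passed:
            clusters[signature(t)].append(t.task_id)
    # Sort by cluster size (descending), then by signature for a stable order.
    return sorted(clusters.items(), key=lambda kv: (-len(kv[1]), kv[0]))


traces = [
    Trace("t1", False, ("file_exists",)),
    Trace("t2", False, ("exit_code", "file_exists")),
    Trace("t3", True, ()),
    Trace("t4", False, ("file_exists",)),
    Trace("t5", False, ("file_exists", "exit_code")),
]
for sig, task_ids in mine_weaknesses(traces):
    print(sig, task_ids)
```

Because the signature depends only on verifier output, rerunning the mining step on the same traces always yields the same clusters, which is the reproducibility property the method relies on.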

The loop repeats until convergence or a budget limit. The model is never fine-tuned — only the harness text changes.
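
As a rough sketch of the overall loop, the code below wires the three stages together around a non-regression gate. Everything here is an assumption made for illustration: the harness is modelled as one string, `mine`, `propose`, and `evaluate` are hypothetical callables standing in for the LLM and the benchmark, candidates that merely tie the current scores are skipped so the loop terminates, and the survivor with the highest summed score is kept (the source does not say how survivors are ranked).

```python
from typing import Callable

Harness = str  # assumption: the harness is a single editable text artifact
Scores = tuple[float, float]  # (held-in pass rate, held-out pass rate)


def non_regressing(base: Scores, cand: Scores) -> bool:
    """Strict gate: neither split may score below the current harness."""
    return cand[0] >= base[0] and cand[1] >= base[1]


def self_harness(
    harness: Harness,
    mine: Callable[[Harness], list[str]],
    propose: Callable[[Harness, list[str], int], list[Harness]],
    evaluate: Callable[[Harness], Scores],
    k: int = 4,
    max_rounds: int = 10,
) -> Harness:
    base = evaluate(harness)
    for _ in range(max_rounds):  # budget limit
        weaknesses = mine(harness)  # stage 1: weakness mining
        if not weaknesses:
            break
        candidates = propose(harness, weaknesses, k)  # stage 2: K candidate edits
        scored = [(evaluate(c), c) for c in candidates]  # stage 3: validation
        accepted = [(s, c) for s, c in scored if non_regressing(base, s) and s != base]
        if not accepted:
            break  # no candidate passes the gate: treat as converged
        # Assumption: keep the survivor with the best summed score.
        base, harness = max(accepted, key=lambda sc: sum(sc[0]))
    return harness


# Toy stand-ins so the sketch runs end to end (not real LLM or benchmark calls).
RULES = ["check exit codes", "verify output files"]


def toy_mine(h: Harness) -> list[str]:
    return [r for r in RULES if r not in h]


def toy_propose(h: Harness, weaknesses: list[str], k: int) -> list[Harness]:
    return [h + "\n" + w for w in weaknesses[:k]]


def toy_evaluate(h: Harness) -> Scores:
    n = sum(r in h for r in RULES)
    return (0.4 + 0.1 * n, 0.4 + 0.1 * n)


final = self_harness("You are a terminal agent.", toy_mine, toy_propose, toy_evaluate)
print(final)
```

Only the harness string changes between rounds; the callables, like the frozen model they stand in for, stay fixed.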

Key Results

Setting                 Metric                           Result
MiniMax M2.5            Held-out (Terminal-Bench 2.0)    +21.4pp (+53% relative)
Qwen3.5-35B             Held-in (self-split)             +20.9pp (+138% relative)
Model families tested   Across 3 families                Consistent gains
Task coverage           Terminal-Bench 2.0 tasks         89 tasks

All experiments were conducted on Terminal-Bench 2.0, a benchmark of 89 diverse terminal-based agent tasks.
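
The absolute and relative gains together imply the starting and ending scores, which helps put the numbers in context. The short calculation below backs them out; the function name is hypothetical, and because the relative percentages are rounded, the results are approximations rather than figures reported in this summary.

```python
def implied_scores(abs_gain_pp: float, rel_gain_pct: float) -> tuple[float, float]:
    """Back out (baseline, final) scores in percent from the two reported gains.

    relative gain = absolute gain / baseline, so baseline = absolute / relative.
    """
    baseline = abs_gain_pp / (rel_gain_pct / 100.0)
    return baseline, baseline + abs_gain_pp


for name, gain_pp, rel_pct in [
    ("MiniMax M2.5 (held-out)", 21.4, 53.0),
    ("Qwen3.5-35B (held-in)", 20.9, 138.0),
]:
    base, final_score = implied_scores(gain_pp, rel_pct)
    print(f"{name}: ~{base:.1f}% -> ~{final_score:.1f}%")
```

Under these assumptions the MiniMax M2.5 run goes from roughly 40% to roughly 62%, and the Qwen3.5-35B run from roughly 15% to roughly 36%. The much larger relative gain for Qwen reflects a lower starting point, not a larger absolute improvement.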

Contributions

  • First demonstration that a fixed LLM can iteratively improve its own harness without external supervision or model updates, using only signals from the task verifiers.
  • The 3-stage loop (weakness mining → harness proposal → proposal validation) provides a principled, verifiable self-improvement protocol.
  • Verifier-grounded signature clustering replaces subjective failure analysis with deterministic, reproducible weakness identification.
  • Strict non-regression validation (neither split may regress) prevents overfitting to the mining distribution.
  • Demonstrated across 3 different model families, showing generality.

Strengths

  • Elegantly minimal: No additional training, no RLHF, no human annotation — just an LLM reading its own failure traces and editing a text prompt. The simplicity is the point.
  • Verifier-grounded weakness mining: Using deterministic verifier signatures rather than LLM-generated failure analysis avoids hallucinated or self-serving explanations.
  • Strong held-out results: The +21.4pp (+53% relative) gain on held-out tasks points to genuine generalization rather than overfitting to the mining split.
  • Model-agnostic: Gains appear across the three model families tested, including MiniMax and Qwen, which suggests the principle is general.

Weaknesses / Limitations

  • Dependence on verifier quality: The approach assumes high-quality, deterministic verifiers. For open-ended tasks without clear verifiable criteria, the mining step breaks down.
  • Harness representation: The harness is treated as a flat text artifact. More structured harnesses (e.g., with tool graphs, state machines, or error handlers) may not be easily editable via simple prompt edits.
  • Scalability of validation: Testing K candidate edits × 2 splits could become expensive for large benchmarks. The paper doesn't discuss budget management strategies.
  • Cold start: Initial harness quality likely matters — a very poor initial harness may produce failures too diffuse to cluster meaningfully.
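
To make the validation-cost concern concrete, here is a back-of-the-envelope count of agent runs. The split sizes, K, number of rounds, and repeat count are all hypothetical; only the total of 89 tasks comes from the benchmark description.

```python
def validation_runs(
    k: int,
    n_held_in: int,
    n_held_out: int,
    rounds: int = 1,
    repeats: int = 1,
) -> int:
    """Agent runs needed to validate K candidates on both splits each round."""
    return rounds * k * (n_held_in + n_held_out) * repeats


# Hypothetical setup: 89 tasks split 45/44, K = 4 candidates, 5 rounds.
print(validation_runs(k=4, n_held_in=45, n_held_out=44, rounds=5))  # 1780
```

The cost grows linearly in each factor, so a benchmark ten times larger, or repeated runs to smooth out sampling noise, multiplies the bill accordingly.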

Connections & Follow-ups

Directly related to the emerging "self-improving agents" literature, including Reflexion (Shinn et al.), Auto-GPT, and Voyager-style skill libraries. Where Reflexion improves via in-context learning at runtime, Self-Harness improves the permanent harness structure. Also connects to prompt optimization (DSPy, OPRO) but applied to agent systems rather than pure language tasks. Future work could extend to multi-harness co-optimization or adding structural mutation operators.

My Take

This is one of the cleaner self-improvement results I've seen. The key insight — that you don't need to train the model, just the harness — is both obvious in retrospect and surprisingly underexplored. The verifier-grounded weakness mining is the architectural linchpin: it converts fuzzy failure analysis into a reproducible signal. The strict non-regression gate is also a good engineering practice that prevents degradation. I'm curious about the ceiling — does the improvement asymptote at some harness complexity, or can it compound indefinitely? The Terminal-Bench 2.0 results suggest substantial headroom remains.