Self-Harness: Harnesses That Improve Themselves

Paper: arXiv 2606.09498 · Authors: Hangfan Zhang, Shao Zhang et al. · Institution: Shanghai AI Lab

Problem & Motivation

Agentic systems rely on a "harness" — the combination of system prompt, tool definitions, runtime orchestration logic, and execution environment — to operate effectively. Currently, harness design is a manual, expert-driven process requiring extensive engineering effort. The key question: can the same fixed LLM that runs within a harness also be used to improve that harness, without any external supervision, additional training, or human intervention?

Method / Approach

Self-Harness defines a harness as the tuple (system prompt, tool descriptions, runtime, orchestration logic). The improvement process runs a three-stage loop, with the same frozen LLM driving every stage:

Weakness Mining — Run the agent on a held-in task split. Collect failure traces and cluster them by verifier-grounded signatures (deterministic failure patterns extracted from task verifiers, not subjective judgment). This produces interpretable, verifier-backed clusters of failure modes.
Harness Proposal — Conditioned on the mined weaknesses, the LLM generates K diverse, minimal candidate edits to the harness. Diversity is encouraged via temperature sampling and a diversity penalty; minimality is enforced by constraining edit scope.
Proposal Validation — Each candidate harness is tested on both the held-in split (where weaknesses were mined) and a held-out split. A strict non-regression criterion is applied: both splits must improve or stay flat. Candidates failing either are discarded.
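
The Weakness Mining stage above can be illustrated with a minimal sketch. It assumes each failure trace records which verifier checks failed, and it uses the sorted set of failed checks as the signature. The `FailureTrace` type, its `failed_checks` field, and the check names are hypothetical; the paper's actual signature format is not described in this summary.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureTrace:
    task_id: str
    # Hypothetical field: names of the verifier checks that failed on this task.
    failed_checks: tuple[str, ...]


def signature(trace: FailureTrace) -> tuple[str, ...]:
    """Deterministic signature: the sorted, de-duplicated failed checks."""
    return tuple(sorted(set(trace.failed_checks)))


def cluster_failures(traces: list[FailureTrace]) -> dict[tuple[str, ...], list[str]]:
    """Group task ids by signature. No LLM judgment is involved at this step."""
    clusters: dict[tuple[str, ...], list[str]] = defaultdict(list)
    for trace in traces:
        clusters[signature(trace)].append(trace.task_id)
    return dict(clusters)


# Illustrative data only; these are not traces from the paper.
traces = [
    FailureTrace("task-01", ("file_exists", "exit_code")),
    FailureTrace("task-02", ("exit_code", "file_exists")),
    FailureTrace("task-03", ("timeout",)),
]
clusters = cluster_failures(traces)
# Rank clusters by size so the most common failure mode is surfaced first.
ranked = sorted(clusters.items(), key=lambda kv: len(kv[1]), reverse=True)
```

Because the signature depends only on verifier output, rerunning the clustering on the same traces always yields the same groups, which is the reproducibility property the method relies on.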

The loop repeats until convergence or a budget limit. The model is never fine-tuned — only the harness text changes.
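
As a rough illustration, the loop and its non-regression gate might look like the sketch below. Everything here is an assumption made for the example: the harness is modeled as a plain string, `propose` stands in for weakness mining plus LLM proposal, the two scoring functions stand in for running the agent on each split, and the rule for choosing among surviving candidates (best held-out score, then best held-in score) is not taken from the paper.

```python
from typing import Callable

Harness = str                        # assumption: the harness is one editable text blob
Score = Callable[[Harness], float]   # pass rate of a harness on one split
Propose = Callable[[Harness], list[Harness]]


def passes_gate(cand: Harness, base: Harness, held_in: Score, held_out: Score) -> bool:
    """Strict non-regression: neither split may score below the current harness."""
    return held_in(cand) >= held_in(base) and held_out(cand) >= held_out(base)


def self_harness_loop(harness: Harness, propose: Propose,
                      held_in: Score, held_out: Score, max_rounds: int = 5) -> Harness:
    for _ in range(max_rounds):  # budget limit
        survivors = [c for c in propose(harness)
                     if passes_gate(c, harness, held_in, held_out)]
        if not survivors:
            break  # every candidate regressed on some split
        # Hypothetical selection rule: prefer held-out score, then held-in score.
        best = max(survivors, key=lambda c: (held_out(c), held_in(c)))
        if (held_in(best), held_out(best)) == (held_in(harness), held_out(harness)):
            break  # flat on both splits: treat as converged
        harness = best  # only the harness text changes; the model stays frozen
    return harness


# Toy stand-ins so the sketch runs end to end (not real benchmark scoring).
def held_in(h: Harness) -> float:
    return 0.2 + 0.3 * ("check exit codes" in h) + 0.1 * ("retry" in h)


def held_out(h: Harness) -> float:
    return 0.3 + 0.2 * ("check exit codes" in h) - 0.1 * ("retry" in h)


def propose(h: Harness) -> list[Harness]:
    rules = ("check exit codes", "retry")
    return [h + "\n" + rule for rule in rules if rule not in h]


final = self_harness_loop("You are a terminal agent.", propose, held_in, held_out)
```

In this toy run the "retry" edit helps the held-in split but hurts the held-out split, so the gate rejects it in both rounds and only the "check exit codes" edit is kept.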

Key Results

Setting	Metric	Result
MiniMax M2.5	Held-out (Terminal-Bench 2.0)	+21.4pp (+53% relative)
Qwen3.5-35B	Held-in (self-split)	+20.9pp (+138% relative)
Model families tested	Across 3 families	Consistent gains
Task coverage	Terminal-Bench 2.0 tasks	89 tasks

All experiments were conducted on Terminal-Bench 2.0, a benchmark of 89 diverse terminal-based agent tasks.
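
The table reports each gain both in percentage points and as a relative change, and the two together pin down an approximate baseline. The sketch below back-calculates it. The resulting pass rates are implied by the rounded figures above, not reported values, so treat them as approximate; `implied_rates` is a helper name introduced only for this example.

```python
def implied_rates(gain_pp: float, relative_gain_pct: float) -> tuple[float, float]:
    """Back out (baseline, final) pass rates in percent.

    relative gain = gain_pp / baseline, so baseline = gain_pp / relative gain.
    """
    baseline = gain_pp / (relative_gain_pct / 100.0)
    return baseline, baseline + gain_pp


minimax_base, minimax_final = implied_rates(21.4, 53)   # roughly 40.4 -> 61.8
qwen_base, qwen_final = implied_rates(20.9, 138)        # roughly 15.1 -> 36.0
```

The much larger relative gain for Qwen3.5-35B comes mostly from its lower implied starting point; the absolute gains for the two models are nearly the same.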

Contributions

First demonstration that a fixed LLM can iteratively improve its own harness without human supervision or model updates, relying only on task verifiers for feedback.
The 3-stage loop (weakness mining → harness proposal → proposal validation) provides a principled, verifiable self-improvement protocol.
Verifier-grounded signature clustering replaces subjective failure analysis with deterministic, reproducible weakness identification.
Strict non-regression validation (neither the held-in nor the held-out split may regress) prevents overfitting to the mining split.
Demonstrated across 3 different model families, showing generality.

Strengths

Elegantly minimal: No additional training, no RLHF, no human annotation — just an LLM reading its own failure traces and editing a text prompt. The simplicity is the point.
Verifier-grounded weakness mining: Using deterministic verifier signatures rather than LLM-generated failure analysis avoids hallucinated or self-serving explanations.
Strong held-out results: The +21.4pp (+53% relative) gain on held-out tasks for MiniMax M2.5 suggests genuine generalization, not just overfitting to the mining split.
Model-agnostic: Gains hold across three model families, including MiniMax and Qwen, which suggests the principle is general.

Weaknesses / Limitations

Dependence on verifier quality: The approach assumes high-quality, deterministic verifiers. For open-ended tasks without clear verifiable criteria, the mining step breaks down.
Harness representation: The harness is treated as a flat text artifact. More structured harnesses (e.g., with tool graphs, state machines, or error handlers) may not be easily editable via simple prompt edits.
Scalability of validation: Testing K candidate edits × 2 splits could become expensive for large benchmarks. The paper doesn't discuss budget management strategies.
Cold start: Initial harness quality likely matters — a very poor initial harness may produce failures too diffuse to cluster meaningfully.
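
To make the validation-cost concern concrete, here is a back-of-the-envelope estimate. All of the numbers are hypothetical: the summary gives neither K, the number of loop rounds, nor the split sizes, so the example simply assumes the two splits partition the 89 tasks. It also ignores the agent runs needed for weakness mining.

```python
def validation_runs(k_candidates: int, held_in_tasks: int, held_out_tasks: int,
                    rounds: int, trials_per_task: int = 1) -> int:
    """Agent runs spent on validation: each candidate is run on both splits each round."""
    runs_per_candidate = (held_in_tasks + held_out_tasks) * trials_per_task
    return rounds * k_candidates * runs_per_candidate


# Hypothetical budget: K = 8 candidates, 5 rounds, a 45/44 split of the 89 tasks.
total = validation_runs(k_candidates=8, held_in_tasks=45, held_out_tasks=44, rounds=5)
# 5 rounds * 8 candidates * 89 task runs = 3560 agent runs
```

The cost grows linearly in each factor, so repeated trials per task (to smooth out agent nondeterminism) would multiply the total directly.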

Connections & Follow-ups

Directly related to the emerging "self-improving agents" literature, including Reflexion (Shinn et al.), Auto-GPT, and Voyager-style skill libraries. Where Reflexion improves via in-context learning at runtime, Self-Harness improves the permanent harness structure. Also connects to prompt optimization (DSPy, OPRO) but applied to agent systems rather than pure language tasks. Future work could extend to multi-harness co-optimization or adding structural mutation operators.

My Take

This is one of the cleaner self-improvement results I've seen. The key insight — that you don't need to train the model, just the harness — is both obvious in retrospect and surprisingly underexplored. The verifier-grounded weakness mining is the architectural linchpin: it converts fuzzy failure analysis into a reproducible signal. The strict non-regression gate is also a good engineering practice that prevents degradation. I'm curious about the ceiling — does the improvement asymptote at some harness complexity, or can it compound indefinitely? The Terminal-Bench 2.0 results suggest substantial headroom remains.