SkillOpt: Executive Strategy for Self-Evolving Agent Skills

Source: SkillOpt: Self-Evolving Agent Skills via Controllable Text-Space Optimization
Date Published: 2026-05
Authors: Yifan Yang et al. (Microsoft, SJTU, Tongji, Fudan)
Code: aka.ms/SkillOpt

TL;DR

SkillOpt treats an agent's skill document as trainable external state, analogous to the weights in weight-space optimization. A separate optimizer model proposes bounded add/delete/replace edits, and each candidate is accepted only when it strictly improves a held-out validation score. The result: best or tied-best across all 52 evaluated benchmark cells, with average lifts of +19–25 pts across the GPT-5.5 direct-chat, Codex, and Claude Code settings.

The Core Problem

Agent skills today are hand-crafted, generated one-shot, or evolved through loosely controlled self-revision. None of these approaches behaves like a deep-learning optimizer: none reliably improves on its starting point under feedback. The authors argue that skills should be treated as trainable artifacts, optimized with the same discipline that makes weight-space optimization reproducible.

How SkillOpt Works

SkillOpt introduces a separate optimizer model that turns scored rollouts into controlled edits on a single skill document:

Component	Analogy	Function
Bounded Edits (L_t)	Learning Rate	Limits changes per step, preventing erasure of useful rules
Validation Gate	Validation Set	Accepts candidate only if held-out score strictly improves
Rejected-Edit Buffer	Negative Feedback	Stores failed edits to guide future optimizer calls
Slow/Meta Update	Momentum	End-of-epoch comparison preserves durable domain lessons
Minibatch Reflection	Gradient Step	Separates failures/successes to find reusable patterns

The output is a single best_skill.md file (300–2,000 tokens). The target model and execution harness remain frozen — zero inference-time overhead at deployment.
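
The propose-and-test behavior described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `optimize_skill`, `propose`, and `validate` are hypothetical names, the optimizer model is replaced by a toy function, and the rejected-edit buffer stores whole rejected candidates rather than individual edits.

```python
from typing import Callable, List, Tuple


def optimize_skill(
    skill: str,
    propose: Callable[[str, List[str]], str],  # hypothetical optimizer call
    validate: Callable[[str], float],          # hypothetical held-out scorer
    steps: int,
) -> Tuple[str, float, List[str]]:
    """Keep a candidate skill only if it strictly improves the validation score."""
    best_score = validate(skill)
    rejected: List[str] = []  # rejected-edit buffer, fed back to the proposer
    for _ in range(steps):
        candidate = propose(skill, rejected)
        candidate_score = validate(candidate)
        if candidate_score > best_score:  # strict improvement: ties are rejected
            skill, best_score = candidate, candidate_score
        else:
            rejected.append(candidate)
    return skill, best_score, rejected


# Toy stand-ins (assumptions for illustration only).
USEFUL = {"check units", "cite the cell range"}
POOL = ["check units", "be verbose", "cite the cell range"]


def toy_validate(skill: str) -> float:
    """Score a skill by how many useful rules it contains."""
    return float(len(set(skill.splitlines()) & USEFUL))


def toy_propose(skill: str, rejected: List[str]) -> str:
    """Append the first unused rule whose resulting candidate was not rejected before."""
    for rule in POOL:
        candidate = f"{skill}\n{rule}" if skill else rule
        if rule not in skill.splitlines() and candidate not in rejected:
            return candidate
    return skill


if __name__ == "__main__":
    final_skill, final_score, buffer = optimize_skill("", toy_propose, toy_validate, steps=4)
    print(final_skill)
    print(final_score, len(buffer))
```

In this toy run the unhelpful rule "be verbose" never enters the skill, because a candidate that only ties the current score fails the gate and lands in the buffer instead.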

Results

Headline Performance

52 of 52 evaluated cells: SkillOpt is best or tied best.

Setting	Average Lift	vs Best Baseline
GPT-5.5 Direct Chat	+23.5 pts (58.8 → 82.3)	+5.4 pts over oracle
Codex Harness	+24.8 pts	+14.0 pts over EvoSkill
Claude Code Harness	+19.1 pts	+3.2 pts over EvoSkill

Selected Benchmarks (GPT-5.5)

SpreadsheetBench: 41.8 → 80.7 (+38.9)
OfficeQA: 33.1 → 72.1 (+39.0)
LiveMathematicianBench: 37.6 → 66.9 (+29.3)
ALFWorld: 83.6 → 95.5 (+11.9)
DocVQA: 78.8 → 91.2 (+12.4)
SearchQA: 77.7 → 87.3 (+9.6)
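
The lifts listed above are plain differences between the with-skill and no-skill scores, and they can be checked directly. The short script below uses only the numbers from this list; the variable names are illustrative. It also shows that the six benchmarks average to 58.8 → 82.3, which agrees with the GPT-5.5 Direct Chat row in the headline table (that these six make up the headline average is an inference from the arithmetic, not something stated in this summary).

```python
# (no-skill score, with-skill score) for each GPT-5.5 benchmark listed above
scores = {
    "SpreadsheetBench": (41.8, 80.7),
    "OfficeQA": (33.1, 72.1),
    "LiveMathematicianBench": (37.6, 66.9),
    "ALFWorld": (83.6, 95.5),
    "DocVQA": (78.8, 91.2),
    "SearchQA": (77.7, 87.3),
}

# Per-benchmark lift in points, rounded to one decimal like the reported values
lifts = {name: round(after - before, 1) for name, (before, after) in scores.items()}

# Averages across the six benchmarks
mean_before = round(sum(b for b, _ in scores.values()) / len(scores), 1)
mean_after = round(sum(a for _, a in scores.values()) / len(scores), 1)
mean_lift = round(mean_after - mean_before, 1)

for name, lift in lifts.items():
    print(f"{name}: +{lift}")
print(f"average: {mean_before} -> {mean_after} (+{mean_lift})")
```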

Model-Scale Robustness

Improvement is consistent across all 7 target models, from GPT-5.5 down to Qwen3.5-4B. Small models benefit most: GPT-5.4-nano, for example, gains +49.4 pts on DocVQA and +35.1 pts on ALFWorld.

Why It Works

Ablations reveal the gains are highly sensitive to the core controls:

Validation Gate turns self-editing into propose-and-test optimization
Bounded Learning Rate (L_t = 4–8) prevents skill collapse from unbounded rewriting
Rejected-Edit Buffer recovers ~5 pts on complex benchmarks
Slow/Meta Update matters most: removing it causes the largest degradation, dropping SpreadsheetBench by 22.5 pts
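
To make the bounded-edit control concrete, here is a small sketch of applying add/delete/replace edits under a per-step budget. It assumes that L_t caps the number of edit operations applied in one step; the summary only says it "limits changes per step", so the paper's exact definition may differ. The `Edit` class and `apply_bounded_edits` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Edit:
    op: str         # "add", "delete", or "replace"
    index: int      # line position the edit targets
    text: str = ""  # new line content (unused for "delete")


def apply_bounded_edits(lines: List[str], edits: List[Edit], max_edits: int) -> List[str]:
    """Apply at most `max_edits` edits in order; surplus edits are dropped.

    Indices refer to the document state after the preceding edits.
    """
    out = list(lines)  # never mutate the caller's skill
    for edit in edits[:max_edits]:
        if edit.op == "add":
            # Clamp so an out-of-range index appends or prepends instead of failing
            out.insert(max(0, min(edit.index, len(out))), edit.text)
        elif edit.op == "delete":
            if 0 <= edit.index < len(out):
                del out[edit.index]
        elif edit.op == "replace":
            if 0 <= edit.index < len(out):
                out[edit.index] = edit.text
        else:
            raise ValueError(f"unknown edit op: {edit.op!r}")
    return out


if __name__ == "__main__":
    skill = ["Read the whole sheet first.", "Answer briefly."]
    proposed = [
        Edit("replace", 1, "Answer with the cell value only."),
        Edit("add", 2, "Check units before answering."),
        Edit("delete", 0),  # exceeds the budget below, so it is dropped
    ]
    for line in apply_bounded_edits(skill, proposed, max_edits=2):
        print(line)
```

With a budget of 2, the third edit (deleting the first rule) is never applied, which illustrates how a bound on edits keeps a single step from erasing rules that were already useful.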

"The ablations show the gains are ... much more sensitive to the presence of bounded text-space learning, validation gating, rejected-edit feedback, and epoch-wise slow/meta update — the design choices that make skill editing behave like a controlled training loop."

Transfer & Generalization

Cross-Harness Transfer (Strongest Signal)

Skills trained in one execution environment transfer positively to the other:

Codex → Claude Code (SpreadsheetBench): +59.7 pts
Claude Code → Codex (SpreadsheetBench): +43.6 pts

Cross-Model & Cross-Benchmark

Skills trained on GPT-5.4 improve every smaller GPT variant. OlympiadBench skills transfer positively to Omni-MATH. No reported transfer result falls below the target's no-skill baseline.

Key Takeaways

Skill = trainable state. Treating agent skills as optimizable external state — with bounded edits, validation gating, and rejected-edit feedback — turns skill improvement into a disciplined training loop analogous to gradient descent.
Universal improvement. SkillOpt achieves best-or-tied results across all 52 evaluated cells spanning 7 models, 3 execution harnesses, and diverse benchmarks — with no inference-time cost.
Transfer works. Skills transfer across execution environments (Codex ↔ Claude Code) and across model scales, never falling below the no-skill baseline.