SkillOpt: Executive Strategy for Self-Evolving Agent Skills
Source: SkillOpt: Self-Evolving Agent Skills via Controllable Text-Space Optimization
Date Published: 2026-05
Authors: Yifan Yang et al. (Microsoft, SJTU, Tongji, Fudan)
Code: aka.ms/SkillOpt
TL;DR
SkillOpt treats an agent's skill document as trainable external state, by analogy with weight-space optimization. A separate optimizer model applies bounded add/delete/replace edits, and each candidate is accepted only when it strictly improves a held-out validation score. The result: best or tied-best across all 52 evaluated benchmark cells, with average lifts of +19–25 pts across the GPT-5.5 direct-chat, Codex, and Claude Code settings.
The Core Problem
Agent skills today are hand-crafted, generated one-shot, or evolved through loosely controlled self-revision. None of these approaches behaves like a deep-learning optimizer: none reliably improves on its starting point under feedback. The authors argue that skills should be treated as a trainable artifact, with the same discipline that makes weight-space optimization reproducible.
How SkillOpt Works
SkillOpt introduces a separate optimizer model that turns scored rollouts into controlled edits on a single skill document:
| Component | Analogy | Function |
|---|---|---|
| Bounded Edits (Lt) | Learning Rate | Limits changes per step — prevents erasure of useful rules |
| Validation Gate | Validation Set | Accepts candidate only if held-out score strictly improves |
| Rejected-Edit Buffer | Negative Feedback | Stores failed edits to guide future optimizer calls |
| Slow/Meta Update | Momentum | End-of-epoch comparison preserves durable domain lessons |
| Minibatch Reflection | Gradient Step | Separates failures/successes to find reusable patterns |
The output is a single best_skill.md file (300–2,000 tokens). The target model and execution harness remain frozen — zero inference-time overhead at deployment.
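The validation gate and rejected-edit buffer from the table can be illustrated with a minimal propose-and-test step. This is a sketch under stated assumptions, not the authors' implementation: `SkillState`, `gated_step`, and the toy `propose` and `validate` callables are hypothetical names introduced here, and a real optimizer model and held-out evaluation would stand in for the toy functions.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class SkillState:
    """Hypothetical container for the skill text and optimizer bookkeeping."""
    text: str
    best_score: float
    rejected: List[str] = field(default_factory=list)  # rejected-edit buffer


def gated_step(
    state: SkillState,
    propose: Callable[[str, List[str]], str],
    validate: Callable[[str], float],
) -> bool:
    """Run one propose-and-test step; return True if the candidate was accepted."""
    # The proposer sees the rejected buffer so it can avoid repeating failed edits.
    candidate = propose(state.text, state.rejected)
    score = validate(candidate)
    if score > state.best_score:  # strict improvement only; ties are rejected
        state.text = candidate
        state.best_score = score
        return True
    state.rejected.append(candidate)
    return False


# Toy stand-ins (assumptions): the "validation score" counts a useful rule,
# and the "optimizer" replays two fixed proposals.
def toy_validate(skill: str) -> float:
    return float(skill.count("check units"))


_proposals = iter(["Rule: be brief.", "Rule: check units."])


def toy_propose(skill: str, rejected: List[str]) -> str:
    return skill + "\n" + next(_proposals)


state = SkillState(text="# skill", best_score=toy_validate("# skill"))
results = [gated_step(state, toy_propose, toy_validate) for _ in range(2)]
print(results)  # [False, True]
```

The first proposal leaves the toy score unchanged, so it lands in the rejected buffer. The second strictly improves the score and replaces the current skill text.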
Results
Headline Performance
52 of 52 evaluated cells: SkillOpt is best or tied best.
| Setting | Average Lift | vs Best Baseline |
|---|---|---|
| GPT-5.5 Direct Chat | +23.5 pts (58.8 → 82.3) | +5.4 pts over oracle |
| Codex Harness | +24.8 pts | +14.0 pts over EvoSkill |
| Claude Code Harness | +19.1 pts | +3.2 pts over EvoSkill |
Selected Benchmarks (GPT-5.5)
- SpreadsheetBench: 41.8 → 80.7 (+38.9)
- OfficeQA: 33.1 → 72.1 (+39.0)
- LiveMathematicianBench: 37.6 → 66.9 (+29.3)
- ALFWorld: 83.6 → 95.5 (+11.9)
- DocVQA: 78.8 → 91.2 (+12.4)
- SearchQA: 77.7 → 87.3 (+9.6)
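The lifts listed above are absolute differences in points. As a quick arithmetic check, the short sketch below recomputes them from the before/after scores. The relative-lift view is an added illustration and not a figure reported by the source, and the helper names `lift` and `relative_lift` are hypothetical.

```python
# (baseline, with-skill) score pairs copied from the list above.
scores = {
    "SpreadsheetBench": (41.8, 80.7),
    "OfficeQA": (33.1, 72.1),
    "LiveMathematicianBench": (37.6, 66.9),
    "ALFWorld": (83.6, 95.5),
    "DocVQA": (78.8, 91.2),
    "SearchQA": (77.7, 87.3),
}


def lift(before: float, after: float) -> float:
    """Absolute lift in points, rounded to one decimal place."""
    return round(after - before, 1)


def relative_lift(before: float, after: float) -> float:
    """Lift as a percentage of the baseline score (illustrative extra view)."""
    return round(100 * (after - before) / before, 1)


lifts = {name: lift(b, a) for name, (b, a) in scores.items()}
for name, (b, a) in scores.items():
    print(f"{name}: +{lifts[name]} pts, {relative_lift(b, a)}% relative")
```

Benchmarks with low baselines, such as SpreadsheetBench and OfficeQA, show much larger relative gains than near-saturated ones such as ALFWorld, even though every value above is reported in absolute points.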
Model-Scale Robustness
Improvement is consistent across all 7 target models, from GPT-5.5 down to Qwen3.5-4B. Small models benefit most in relative terms (GPT-5.4-nano: DocVQA +49.4 pts, ALFWorld +35.1 pts).
Why It Works
Ablations reveal the gains are highly sensitive to the core controls:
- Validation Gate turns self-editing into propose-and-test optimization
- Bounded Learning Rate (Lt=4–8) prevents skill collapse from unbounded rewriting
- Rejected-Edit Buffer recovers ~5 pts on complex benchmarks
- Slow/Meta Update has the largest effect: removing it drops SpreadsheetBench by 22.5 pts
"The ablations show the gains are ... much more sensitive to the presence of bounded text-space learning, validation gating, rejected-edit feedback, and epoch-wise slow/meta update — the design choices that make skill editing behave like a controlled training loop."
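To make the bounded-edit control concrete, the following sketch applies add/delete/replace operations to a line-based skill document and rejects any proposal that exceeds a per-step budget. It assumes the bound counts edit operations per step, which the summary does not spell out. The `Edit` type, `apply_bounded`, and the `budget` parameter are hypothetical names, not the paper's API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Edit:
    op: str                     # "add", "delete", or "replace"
    index: int                  # line position the edit targets
    text: Optional[str] = None  # new line content for add/replace


def apply_bounded(lines: List[str], edits: List[Edit], budget: int) -> List[str]:
    """Apply edits in order; reject the whole proposal if it exceeds the budget.

    Indices refer to the document as it stands when each edit is applied.
    """
    if len(edits) > budget:
        raise ValueError(f"{len(edits)} edits exceed the budget of {budget}")
    out = list(lines)  # copy so the current skill is untouched on failure
    for e in edits:
        if e.op in ("add", "replace") and e.text is None:
            raise ValueError(f"'{e.op}' requires text")
        if e.op == "add":
            out.insert(e.index, e.text)
        elif e.op == "delete":
            del out[e.index]
        elif e.op == "replace":
            out[e.index] = e.text
        else:
            raise ValueError(f"unknown op: {e.op}")
    return out


skill = ["Read the task.", "Answer quickly.", "Stop."]
edits = [
    Edit("replace", 1, "Verify the answer before replying."),
    Edit("add", 3, "Report units."),
]
updated = apply_bounded(skill, edits, budget=4)
print(updated)
```

Because the budget is checked before any edit is applied, an oversized proposal leaves the existing rules intact. That matches the stated motivation of preventing erasure of useful rules through unbounded rewriting.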
Transfer & Generalization
Cross-Harness Transfer (Strongest Signal)
Skills trained in one execution environment transfer positively to the other:
- Codex → Claude Code (SpreadsheetBench): +59.7 pts
- Claude Code → Codex (SpreadsheetBench): +43.6 pts
Cross-Model & Cross-Benchmark
Skills trained on GPT-5.4 improve every smaller GPT variant. OlympiadBench skills transfer positively to Omni-MATH. No row falls below the target's no-skill baseline.
Key Takeaways
- Skill = trainable state. Treating agent skills as optimizable external state — with bounded edits, validation gating, and rejected-edit feedback — turns skill improvement into a disciplined training loop analogous to gradient descent.
- Universal improvement. SkillOpt achieves best-or-tied results across all 52 evaluated cells spanning 7 models, 3 execution harnesses, and diverse benchmarks — with no inference-time cost.
- Transfer works. Skills transfer across execution environments (Codex ↔ Claude Code) and across model scales, always beating the no-skill baseline.