SkillOpt: Executive Strategy for Self-Evolving Agent Skills

Source: SkillOpt: Self-Evolving Agent Skills via Controllable Text-Space Optimization
Date Published: 2026-05
Authors: Yifan Yang et al. (Microsoft, SJTU, Tongji, Fudan)
Code: aka.ms/SkillOpt


TL;DR

SkillOpt treats an agent's skill document as trainable external state and optimizes it with the discipline of weight-space training. A separate optimizer model proposes bounded add/delete/replace edits, and a candidate is accepted only when it strictly improves a held-out validation score. The result: best or tied-best in all 52 evaluated benchmark cells, with average lifts of +19.1 to +24.8 pts across the GPT-5.5 direct-chat, Codex, and Claude Code settings.

The Core Problem

Agent skills today are hand-crafted, generated one-shot, or evolved through loosely controlled self-revision. None of these approaches behaves like a deep-learning optimizer: none reliably improves on its starting point under feedback. The authors argue that skills should be treated as trainable artifacts, with the same discipline that makes weight-space optimization reproducible.

How SkillOpt Works

SkillOpt introduces a separate optimizer model that turns scored rollouts into controlled edits on a single skill document:

Component | Analogy | Function
Bounded Edits (Lt) | Learning Rate | Limits changes per step; prevents erasure of useful rules
Validation Gate | Validation Set | Accepts candidate only if held-out score strictly improves
Rejected-Edit Buffer | Negative Feedback | Stores failed edits to guide future optimizer calls
Slow/Meta Update | Momentum | End-of-epoch comparison preserves durable domain lessons
Minibatch Reflection | Gradient Step | Separates failures/successes to find reusable patterns

The output is a single best_skill.md file (300–2,000 tokens). The target model and execution harness remain frozen — zero inference-time overhead at deployment.
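
As a rough illustration of how the validation gate and rejected-edit buffer fit together, here is a minimal propose-and-test sketch. It is not the authors' implementation: the function names, the `(edit, candidate)` proposal format, and the toy proposer and scorer are all assumptions made for this example, and the slow/meta update and minibatch reflection are left out.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical types: none of these names come from the SkillOpt release.
Proposal = Optional[Tuple[str, str]]             # (edit description, candidate skill)
Proposer = Callable[[str, List[str]], Proposal]  # sees the current skill and past rejections
Scorer = Callable[[str], float]                  # skill text -> held-out validation score


def optimize_skill(skill: str, propose: Proposer, validate: Scorer,
                   steps: int) -> Tuple[str, float, List[str]]:
    """Keep a candidate only if it strictly beats the best validation score."""
    best_skill, best_score = skill, validate(skill)
    rejected: List[str] = []                     # rejected-edit buffer
    for _ in range(steps):
        proposal = propose(best_skill, rejected)
        if proposal is None:                     # proposer has nothing left to try
            break
        edit, candidate = proposal
        score = validate(candidate)
        if score > best_score:                   # validation gate: strict improvement
            best_skill, best_score = candidate, score
        else:
            rejected.append(edit)                # remembered so it is not re-proposed
    return best_skill, best_score, rejected


# Toy stand-ins for the optimizer model and the validation set (assumptions).
USEFUL = ("check units", "cite the cell range")
POOL = ("check units", "be verbose", "cite the cell range")


def toy_propose(skill: str, rejected: List[str]) -> Proposal:
    for rule in POOL:
        if rule not in skill and rule not in rejected:
            return rule, skill + "\n- " + rule
    return None


def toy_validate(skill: str) -> float:
    return float(sum(rule in skill for rule in USEFUL))


best_skill, best_score, rejected = optimize_skill(
    "# Skill", toy_propose, toy_validate, steps=10)
print(best_skill)
print(best_score, rejected)
```

In this toy run the unhelpful rule ties the current score, so the strict gate rejects it and the buffer keeps the proposer from offering it again.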

Results

Headline Performance

52 of 52 evaluated cells: SkillOpt is best or tied best.

Setting | Average Lift | vs Best Baseline
GPT-5.5 Direct Chat | +23.5 pts (58.8 → 82.3) | +5.4 pts over oracle
Codex Harness | +24.8 pts | +14.0 pts over EvoSkill
Claude Code Harness | +19.1 pts | +3.2 pts over EvoSkill

Selected Benchmarks (GPT-5.5)

  • SpreadsheetBench: 41.8 → 80.7 (+38.9)
  • OfficeQA: 33.1 → 72.1 (+39.0)
  • LiveMathematicianBench: 37.6 → 66.9 (+29.3)
  • ALFWorld: 83.6 → 95.5 (+11.9)
  • DocVQA: 78.8 → 91.2 (+12.4)
  • SearchQA: 77.7 → 87.3 (+9.6)
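
The lifts above are absolute point differences. The short sketch below recomputes them from the listed before/after scores and adds a relative lift (percent of the baseline). The relative figure is a derived view added here for illustration, not a number reported in the source.

```python
# Before/after scores copied from the list above (GPT-5.5).
benchmarks = {
    "SpreadsheetBench": (41.8, 80.7),
    "OfficeQA": (33.1, 72.1),
    "LiveMathematicianBench": (37.6, 66.9),
    "ALFWorld": (83.6, 95.5),
    "DocVQA": (78.8, 91.2),
    "SearchQA": (77.7, 87.3),
}


def lift(before: float, after: float) -> tuple:
    """Return (absolute lift in points, relative lift in percent of baseline)."""
    absolute = round(after - before, 1)
    relative = round(100.0 * (after - before) / before, 1)
    return absolute, relative


for name, (before, after) in benchmarks.items():
    absolute, relative = lift(before, after)
    print(f"{name}: +{absolute} pts ({relative}% of baseline)")
```

Benchmarks with a high starting score, such as ALFWorld, have less headroom, so their relative lift is smaller even when the absolute gain is substantial.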

Model-Scale Robustness

Improvement holds across all 7 target models, from GPT-5.5 down to Qwen3.5-4B, though its size varies. Small models see some of the largest gains (GPT-5.4-nano: DocVQA +49.4 pts, ALFWorld +35.1 pts).

Why It Works

Ablations reveal the gains are highly sensitive to the core controls:

  • Validation Gate turns self-editing into propose-and-test optimization
  • Bounded Learning Rate (Lt=4–8) prevents skill collapse from unbounded rewriting
  • Rejected-Edit Buffer recovers ~5 pts on complex benchmarks
  • Slow/Meta Update matters most: removing it causes the largest degradation, dropping SpreadsheetBench by 22.5 pts
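
To make the bounded learning rate concrete, the sketch below applies add/delete/replace edits to a skill stored as a list of lines and refuses any step that exceeds an edit budget. It assumes that Lt counts edit operations per step, which the summary does not state, and the `Edit` type and `apply_bounded` function are hypothetical names, not part of the SkillOpt code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Edit:
    op: str                      # "add", "delete", or "replace"
    index: int                   # line position in the skill at the time the edit runs
    text: Optional[str] = None   # new line content for "add" and "replace"


def apply_bounded(lines: List[str], edits: List[Edit], budget: int) -> List[str]:
    """Apply edits in order, rejecting the whole step if it exceeds the budget."""
    if len(edits) > budget:
        raise ValueError(f"{len(edits)} edits exceed the per-step budget of {budget}")
    out = list(lines)            # never mutate the caller's skill in place
    for edit in edits:
        if edit.op in ("add", "replace") and edit.text is None:
            raise ValueError(f"'{edit.op}' needs text")
        if edit.op == "add":
            out.insert(edit.index, edit.text)
        elif edit.op == "delete":
            del out[edit.index]
        elif edit.op == "replace":
            out[edit.index] = edit.text
        else:
            raise ValueError(f"unknown op: {edit.op}")
    return out


skill = [
    "Read the sheet header first.",
    "Guess missing values.",
    "Report the cell range.",
]
step = [
    Edit("replace", 1, "Flag missing values instead of guessing."),
    Edit("add", 3, "Re-check formulas after writing."),
]
updated = apply_bounded(skill, step, budget=4)
print("\n".join(updated))
```

Because a step can touch at most `budget` lines, a single bad proposal cannot wipe out the rest of the skill, which is the failure mode the bullet above calls skill collapse.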

"The ablations show the gains are ... much more sensitive to the presence of bounded text-space learning, validation gating, rejected-edit feedback, and epoch-wise slow/meta update — the design choices that make skill editing behave like a controlled training loop."

Transfer & Generalization

Cross-Harness Transfer (Strongest Signal)

Skills trained in one execution environment transfer positively to the other:

  • Codex → Claude Code (SpreadsheetBench): +59.7 pts
  • Claude Code → Codex (SpreadsheetBench): +43.6 pts

Cross-Model & Cross-Benchmark

Skills trained on GPT-5.4 improve every smaller GPT variant. OlympiadBench skills transfer positively to Omni-MATH. No row falls below the target's no-skill baseline.

Key Takeaways

  1. Skill = trainable state. Treating agent skills as optimizable external state — with bounded edits, validation gating, and rejected-edit feedback — turns skill improvement into a disciplined training loop analogous to gradient descent.
  2. Universal improvement. SkillOpt achieves best-or-tied results across all 52 evaluated cells spanning 7 models, 3 execution harnesses, and diverse benchmarks — with no inference-time cost.
  3. Transfer works. Skills transfer across execution environments (Codex ↔ Claude Code) and across model scales, never falling below the no-skill baseline.