MAI-Thinking-1 — Building a Hill-Climbing Machine
What Is MAI-Thinking-1?
Microsoft AI's MAI-Thinking-1 is a 35 billion active / ~1 trillion total parameter mixture-of-experts (MoE) reasoning model trained entirely from scratch on clean, curated data. It represents a deliberate break from distillation-dependent approaches — the model's capabilities were learned, not inherited.
The project embodies three core principles:
- Capabilities learned, not inherited — no shadow of a teacher model constraining the ceiling.
- Simplicity is sustainable — architectural choices favor what can scale cleanly over what is clever.
- Scientific rigor — every design decision is validated by experiments at the target scale, not extrapolated from toy runs.
Architecture and Training
MAI-Thinking-1 uses a novel LatentMoE configuration with 8 active experts out of 512 total, interleaving MoE and dense layers throughout the transformer. Pre-training ran on 30 trillion tokens — a dataset built for breadth and quality rather than distilled from another model's outputs.
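To make the "8 active out of 512" figure concrete, here is a minimal sketch of generic top-k expert routing. It is not the LatentMoE design itself, whose details are not described here. The function name `top_k_route` and the toy router scores are assumptions made for illustration; only the expert counts come from the text.

```python
import math

def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    # Indices of the k largest router scores, highest first.
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # Softmax restricted to the selected experts (max subtracted for stability).
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

NUM_EXPERTS = 512    # total experts, from the text
ACTIVE_EXPERTS = 8   # experts used per token, from the text

# Hypothetical router scores for a single token (deterministic toy values).
logits = [math.sin(0.37 * i) for i in range(NUM_EXPERTS)]

routing = top_k_route(logits, ACTIVE_EXPERTS)
active_fraction = ACTIVE_EXPERTS / NUM_EXPERTS  # share of experts touched per token
```

Under this generic scheme each token touches only 8 of the 512 experts in an MoE layer, which is how a very large total parameter count can coexist with a much smaller active count per token.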
A critical methodological finding emerged during development: rank non-invariance. Small-scale ablations often fail to predict model ordering at large scale. What works best at 1B parameters may be strictly worse at 35B. This undermines the common practice of extrapolating architecture decisions from cheap experiments and forces researchers to validate at or near production scale.
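One way to quantify rank non-invariance is to compare how candidate variants are ordered at two scales using a rank correlation. The sketch below uses Spearman's coefficient on made-up validation losses. The numbers and variant labels are hypothetical and are not results from the paper.

```python
def ranks(values):
    """Rank values from 1 (smallest) upward; assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman(xs, ys):
    """Spearman rank correlation for tie-free data of equal length."""
    n = len(xs)
    rx, ry = ranks(xs), ranks(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical validation losses for four variants A-D (lower is better).
small_scale = [2.90, 2.95, 3.00, 3.05]   # A looks best in the cheap ablation
large_scale = [2.12, 2.05, 2.10, 2.01]   # D is best at the larger scale

rho = spearman(small_scale, large_scale)
```

A coefficient near 1 would mean the cheap ablation preserves the ordering. In this invented example the coefficient is strongly negative, the kind of reversal the finding warns about.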
Post-Training with GRPO
The post-training pipeline uses Group Relative Policy Optimization (GRPO) with several innovations:
- Adaptive entropy control — dynamically adjusts exploration vs. exploitation pressure during RL.
- Outer ratio clip — stabilizes training by clipping the importance sampling ratio on both ends.
- Reward decomposition — the total reward is decomposed into:
R = R_task + w_lang · R_lang − w_len · R_len
This lets the model optimize for correctness, language quality, and conciseness simultaneously without one axis dominating.
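The reward decomposition and the group-relative step of GRPO can be sketched as follows. The weights, the sample scores, and the function names are illustrative assumptions. The advantage computation follows the commonly published GRPO formulation (standardizing rewards within a group of completions for the same prompt). The adaptive entropy control and outer ratio clip described above are not shown.

```python
from statistics import mean, pstdev

def total_reward(r_task, r_lang, r_len, w_lang=0.1, w_len=0.05):
    """R = R_task + w_lang * R_lang - w_len * R_len (weights are illustrative)."""
    return r_task + w_lang * r_lang - w_len * r_len

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of completions for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps guards against division by zero when every reward in the group is equal.
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical group of four completions, each scored as (R_task, R_lang, R_len).
group = [
    (1.0, 1.0, 0.2),  # correct, clean language, short
    (1.0, 0.0, 0.8),  # correct, poor language, long
    (0.0, 1.0, 0.1),  # incorrect, clean language, short
    (0.0, 1.0, 1.0),  # incorrect, clean language, very long
]

rewards = [total_reward(*scores) for scores in group]
advantages = group_relative_advantages(rewards)
```

With small weights on the language and length terms, the task reward still determines the main ordering, while the other two terms separate completions that tie on correctness.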
The Self-Distillation Breakthrough
Perhaps the most surprising result is the critical importance of self-distillation from RL checkpoints. As training progresses, earlier RL checkpoints contain valuable exploration trajectories that later, more refined checkpoints have moved past. Distilling these intermediate checkpoints back into the model preserves hard-won behavioral diversity.
Infrastructure and Goodput
At peak, MAI-Thinking-1 training achieved 90% goodput across 8,000 GPUs, meaning roughly nine-tenths of cluster time went to useful training progress rather than to failures, restarts, or stalls. To get there, the team invested heavily in fault tolerance, overlapping communication with computation, and rapid failure recovery.
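As a rough illustration of what a 90% figure implies, the sketch below treats goodput as productive training time divided by wall-clock time. This is a common working definition, assumed here rather than taken from the paper. The loss categories and their durations are hypothetical; only the GPU count comes from the text.

```python
def goodput(productive_seconds, wall_clock_seconds):
    """Fraction of wall-clock time spent making useful training progress."""
    if wall_clock_seconds <= 0:
        raise ValueError("wall_clock_seconds must be positive")
    return productive_seconds / wall_clock_seconds

NUM_GPUS = 8000  # from the text

# Hypothetical 24-hour window with three invented sources of lost time (seconds).
wall_clock = 24 * 3600
lost = {
    "restart_after_failure": 3600,
    "checkpoint_save": 2880,
    "recomputed_steps": 2160,
}

lost_seconds = sum(lost.values())
productive = wall_clock - lost_seconds
ratio = goodput(productive, wall_clock)

# Cluster-wide cost of the lost time, in GPU-hours.
gpu_hours_lost = NUM_GPUS * lost_seconds / 3600
```

At this cluster size, even the 10% lost in this invented window corresponds to thousands of GPU-hours per day, which is why failure recovery time matters so much.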
Notably, the paper explicitly states there was no distillation from third-party models. Every capability in MAI-Thinking-1 was earned through clean pre-training data, architectural innovation, and RL — not borrowed from GPT-4, Claude, or any other frontier system.