GRAM: Generative Recursive Reasoning¶
Source: arXiv:2605.19376 \ Authors: Junyeob Baek, Mingyu Jo, Minsu Kim, Mengye Ren, Yoshua Bengio, Sungjin Ahn (KAIST, Mila, NYU, UdeM) \ Date: May 2026
TL;DR¶
GRAM models reasoning as a stochastic latent trajectory, enabling multiple hypotheses, alternative strategies, and inference-time scaling along depth (recursive steps) and width (parallel trajectory sampling). Unlike deterministic recursive models that collapse into a single attractor, GRAM maintains uncertainty, explores diverse reasoning paths, and can generate both conditional (y|x) and unconditional (x) outputs within a single latent-variable framework.
Motivation¶
Existing recursive reasoning models (Looped TF, HRM, TRM) are fundamentally deterministic — they follow a single latent trajectory and cannot maintain uncertainty, consider alternatives, or escape suboptimal reasoning paths. A capable reasoning system should be both deep (repeated refinement) and wide (parallel trajectory exploration).
The GRAM Framework¶
Architecture¶
- Encoder: embeds input x into a reusable representation.
- Recursive core: hierarchical latent state z = (h, l):
- Low-level refinement: K inner steps of deterministic refinement.
- High-level stochastic transition: Gaussian noise is injected at the high level using a reparameterised prior.
- Decoder: reads the final high-level component.
- Key design choice: Stochasticity is injected only at the high level; low-level refinement remains deterministic.
Training¶
- Variational inference with evidence lower bound (ELBO).
- Both prior and posterior are Markov processes over latent states.
- Truncated BPTT: gradients are propagated only through the last transition of each supervision step, enabling memory-efficient training.
Inference-Time Scaling¶
- Depth scaling: Adaptive computation time — each trajectory halts when a learned head predicts it is beneficial to stop.
- Width scaling: Sample N parallel trajectories from the prior, decode each, then select the best candidate using:
- Majority voting, or
- Latent Process Reward Model (LPRM): a value head trained to predict final accuracy.
Key Results¶
Structured Reasoning¶
| Method | Params | Sudoku-Extreme | ARC-AGI-1 | ARC-AGI-2 |
|---|---|---|---|---|
| TRM (deterministic) | 7M | 87.4% | 44.6% | 7.8% |
| GRAM (Ours) | 10M | 97.0% | 52.0% | 11.1% |
Width scaling is especially powerful: GRAM with N=20 samples at 16 iterations outperforms all deterministic baselines at 320 iterations (97.0% vs 90.5%).
Multi-Solution Tasks (N-Queens, Graph Coloring)¶
- Deterministic recursion fails on multi-solution tasks — it cannot discover multiple valid solutions.
- GRAM achieves 99.6% accuracy on 8×8 N-Queens with 90.3% coverage across 20 samples.
- On graph coloring: 85.8% coverage vs 22.3% for the best deterministic baseline.
Significance¶
GRAM addresses a fundamental limitation of deterministic recursive reasoning: the inability to handle uncertainty and explore multiple hypotheses. By introducing controlled stochasticity and explicit width-scaling, it opens a path toward more robust reasoning systems that can consider alternatives and escape local optima — particularly important for open-ended problem-solving where the correct approach is not known in advance.