GRAM: Generative Recursive Reasoning

Source: arXiv:2605.19376
Authors: Junyeob Baek, Mingyu Jo, Minsu Kim, Mengye Ren, Yoshua Bengio, Sungjin Ahn (KAIST, Mila, NYU, UdeM)
Date: May 2026

TL;DR

GRAM models reasoning as a stochastic latent trajectory, enabling multiple hypotheses, alternative strategies, and inference-time scaling along depth (recursive steps) and width (parallel trajectory sampling). Unlike deterministic recursive models that collapse into a single attractor, GRAM maintains uncertainty, explores diverse reasoning paths, and can generate both conditional (y|x) and unconditional (x) outputs within a single latent-variable framework.

Motivation

Existing recursive reasoning models (Looped Transformers, HRM, TRM) are deterministic: each follows a single latent trajectory, so it cannot maintain uncertainty, consider alternatives, or escape a suboptimal reasoning path. A capable reasoning system should be both deep (repeated refinement) and wide (parallel trajectory exploration).

The GRAM Framework

Architecture

Encoder: embeds input x into a reusable representation.
Recursive core: maintains a hierarchical latent state z = (h, l), updated in two stages:
  Low-level refinement: the low-level component l is refined deterministically for K inner steps.
  High-level stochastic transition: the high-level component h is updated with Gaussian noise, injected through a reparameterised prior.
Decoder: reads the final high-level component.
Key design choice: Stochasticity is injected only at the high level; low-level refinement remains deterministic.
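
The split between deterministic low-level refinement and a stochastic high-level transition can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the update rules, the fixed noise scale `sigma`, and the function names are hypothetical stand-ins, not the networks or parameterisation used in the paper.

```python
import math
import random

def low_level_refine(h, l, x, K):
    """Deterministic inner loop: refine l for K steps given h and the encoded input x."""
    for _ in range(K):
        # Toy update rule (hypothetical); the real model would use a learned network.
        l = [math.tanh(0.5 * li + 0.3 * hi + 0.2 * xi) for li, hi, xi in zip(l, h, x)]
    return l

def high_level_transition(h, l, rng, sigma=0.1):
    """Stochastic update in reparameterised form: h' = mu(h, l) + sigma * eps, eps ~ N(0, 1)."""
    mu = [math.tanh(0.6 * hi + 0.4 * li) for hi, li in zip(h, l)]
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    return [m + sigma * e for m, e in zip(mu, eps)]

def recursive_core(x, steps, K, rng, sigma=0.1):
    """Alternate K deterministic low-level steps with one noisy high-level step."""
    h = [0.0] * len(x)
    l = [0.0] * len(x)
    for _ in range(steps):
        l = low_level_refine(h, l, x, K)
        h = high_level_transition(h, l, rng, sigma)
    return h, l

x = [0.5, -1.0, 0.25]  # stands in for the encoder output
h_a, _ = recursive_core(x, steps=4, K=3, rng=random.Random(0))
h_b, _ = recursive_core(x, steps=4, K=3, rng=random.Random(1))
h_det_1, _ = recursive_core(x, steps=4, K=3, rng=random.Random(0), sigma=0.0)
h_det_2, _ = recursive_core(x, steps=4, K=3, rng=random.Random(1), sigma=0.0)
```

With `sigma=0.0` the sketch reduces to a deterministic recursion that always reaches the same state, while a positive `sigma` gives a different trajectory per random seed. That difference is what width scaling relies on.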

Training

Variational inference with evidence lower bound (ELBO).
Both prior and posterior are Markov processes over latent states.
Truncated BPTT: gradients are propagated only through the last transition of each supervision step, enabling memory-efficient training.
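
For orientation, a bound of the following form is what one would typically write for a latent-variable model whose prior and posterior are both Markov over latent states. The notation is assumed rather than taken from the paper: T is the number of recursive steps, z_0 a fixed initial state, θ and φ the prior/decoder and posterior parameters. The paper's exact objective may differ, for example in how supervision steps are weighted.

```latex
\log p_\theta(y \mid x) \;\ge\;
\mathbb{E}_{q_\phi(z_{1:T} \mid x, y)}\!\left[
  \log p_\theta(y \mid z_T, x)
  \;-\; \sum_{t=1}^{T}
  D_{\mathrm{KL}}\!\big(
    q_\phi(z_t \mid z_{t-1}, x, y)
    \,\big\|\,
    p_\theta(z_t \mid z_{t-1}, x)
  \big)
\right]

% Per-dimension KL between Gaussians, if both transitions are diagonal Gaussians:
D_{\mathrm{KL}}\!\big(\mathcal{N}(\mu_q, \sigma_q^2) \,\|\, \mathcal{N}(\mu_p, \sigma_p^2)\big)
= \log\frac{\sigma_p}{\sigma_q}
+ \frac{\sigma_q^2 + (\mu_q - \mu_p)^2}{2\sigma_p^2}
- \frac{1}{2}
```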

Inference-Time Scaling

Depth scaling: Adaptive computation time — each trajectory halts when a learned head predicts it is beneficial to stop.
Width scaling: sample N parallel trajectories from the prior, decode each, then select the best candidate using either:
  Majority voting, or
  Latent Process Reward Model (LPRM): a value head trained to predict final accuracy.
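
The two scaling axes and the two selection rules can be sketched together as follows. This is a toy illustration under stated assumptions: `run_trajectory`, its convergence-based halting rule, and the `score` it returns are hypothetical stand-ins for the learned halting head and the LPRM value head, whose actual form this summary does not describe.

```python
import random
from collections import Counter

def run_trajectory(x, rng, max_steps=16, tol=0.05):
    """Sample one noisy trajectory; halt early once the state stops changing much.

    The halting test is a hypothetical stand-in for a learned halting head.
    """
    state = 0.0
    for step in range(1, max_steps + 1):
        new_state = 0.5 * state + 0.5 * x + rng.gauss(0.0, 0.1)
        converged = abs(new_state - state) < tol
        state = new_state
        if converged:
            break
    answer = round(state)            # toy "decoder"
    score = -abs(state - answer)     # toy stand-in for a value-head score
    return answer, score, step

def width_scale(x, n_samples, seed=0):
    """Width scaling: draw n_samples independent trajectories for the same input."""
    rng = random.Random(seed)
    return [run_trajectory(x, rng) for _ in range(n_samples)]

def majority_vote(candidates):
    """Select the most frequent decoded answer."""
    return Counter(answer for answer, _, _ in candidates).most_common(1)[0][0]

def select_by_value(candidates):
    """Select the answer of the candidate with the highest value score."""
    return max(candidates, key=lambda c: c[1])[0]

candidates = width_scale(3.0, n_samples=20)
voted = majority_vote(candidates)
best = select_by_value(candidates)
```

Majority voting needs only the decoded answers, whereas value-based selection can pick a minority answer if its score is high, which is the case a learned value head is meant to handle.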

Key Results

Structured Reasoning

Method	Params	Sudoku-Extreme	ARC-AGI-1	ARC-AGI-2
TRM (deterministic)	7M	87.4%	44.6%	7.8%
GRAM (Ours)	10M	97.0%	52.0%	11.1%

Width scaling is especially effective at a matched iteration budget: GRAM with N=20 samples at 16 iterations each (20 × 16 = 320 iterations in total) outperforms all deterministic baselines run for 320 iterations (97.0% vs 90.5%).

Multi-Solution Tasks (N-Queens, Graph Coloring)

Deterministic recursion fails on multi-solution tasks: it produces a single output per input, so it cannot discover multiple valid solutions.
GRAM achieves 99.6% accuracy on 8×8 N-Queens with 90.3% coverage across 20 samples.
On graph coloring: 85.8% coverage vs 22.3% for the best deterministic baseline.
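
To make the two reported quantities concrete, here is a minimal sketch of how accuracy and coverage could be computed for N-Queens samples. The definitions are assumptions, since this summary does not spell them out: accuracy is taken as the fraction of samples that are valid placements, and coverage as the fraction of a given reference set of solutions that appears among the samples. The function names are hypothetical.

```python
def is_valid_queens(cols):
    """cols[r] is the column of the queen in row r; valid if no two queens attack."""
    n = len(cols)
    if sorted(cols) != list(range(n)):
        return False  # two queens share a column, or a column index is out of range
    diag_down = {r + c for r, c in enumerate(cols)}
    diag_up = {r - c for r, c in enumerate(cols)}
    return len(diag_down) == n and len(diag_up) == n

def accuracy_and_coverage(samples, reference_solutions):
    """Assumed definitions: accuracy = valid samples / all samples,
    coverage = distinct reference solutions found / all reference solutions."""
    valid = [tuple(s) for s in samples if is_valid_queens(s)]
    accuracy = len(valid) / len(samples)
    found = set(valid) & {tuple(s) for s in reference_solutions}
    coverage = len(found) / len(reference_solutions)
    return accuracy, coverage

# The 4x4 board has exactly two solutions, which keeps the example small.
reference = [(1, 3, 0, 2), (2, 0, 3, 1)]
samples = [(1, 3, 0, 2), (1, 3, 0, 2), (0, 1, 2, 3), (2, 0, 3, 1)]
acc, cov = accuracy_and_coverage(samples, reference)
```

Under these definitions a deterministic model that always returns the same valid placement would score full accuracy but cover only one reference solution, which is the gap the coverage numbers above are measuring.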

Significance

GRAM addresses a fundamental limitation of deterministic recursive reasoning: the inability to handle uncertainty and explore multiple hypotheses. By introducing controlled stochasticity and explicit width-scaling, it opens a path toward more robust reasoning systems that can consider alternatives and escape local optima — particularly important for open-ended problem-solving where the correct approach is not known in advance.