
GRAM: Generative Recursive Reasoning

Source: arXiv:2605.19376
Authors: Junyeob Baek, Mingyu Jo, Minsu Kim, Mengye Ren, Yoshua Bengio, Sungjin Ahn (KAIST, Mila, NYU, UdeM)
Date: May 2026


TL;DR

GRAM models reasoning as a stochastic latent trajectory, enabling multiple hypotheses, alternative strategies, and inference-time scaling along depth (recursive steps) and width (parallel trajectory sampling). Unlike deterministic recursive models that collapse into a single attractor, GRAM maintains uncertainty, explores diverse reasoning paths, and can generate both conditional (y|x) and unconditional (x) outputs within a single latent-variable framework.

Motivation

Existing recursive reasoning models (Looped Transformer, HRM, TRM) are fundamentally deterministic: they follow a single latent trajectory, so they cannot maintain uncertainty, consider alternatives, or escape suboptimal reasoning paths. A capable reasoning system should be both deep (repeated refinement) and wide (parallel trajectory exploration).

The GRAM Framework

Architecture

  • Encoder: embeds the input x into a reusable representation.
  • Recursive core: operates on a hierarchical latent state z = (h, l), with a high-level component h and a low-level component l:
      • Low-level refinement: K inner steps of deterministic refinement.
      • High-level stochastic transition: Gaussian noise is injected at the high level using a reparameterised prior.
  • Decoder: reads the final high-level component.
  • Key design choice: stochasticity is injected only at the high level; low-level refinement remains deterministic.
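The recursive core described above can be illustrated with a minimal, standard-library-only sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the function names (`low_level_refine`, `high_level_transition`, `rollout`, `make_params`), the tanh updates, the additive mixing of h, l, and the input embedding, and the three weight matrices. Only the overall shape follows the summary: K deterministic low-level steps, followed by one reparameterised Gaussian high-level transition.

```python
import math
import random

def matvec(W, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def low_level_refine(h, l, x_emb, W_l, K):
    """K deterministic inner steps that refine the low-level state l."""
    for _ in range(K):
        mixed = [a + b + c for a, b, c in zip(h, l, x_emb)]
        l = [math.tanh(p) for p in matvec(W_l, mixed)]
    return l

def high_level_transition(h, l, W_mu, W_sigma, rng):
    """Stochastic high-level update: h' = mu + sigma * eps, eps ~ N(0, I)."""
    inp = [a + b for a, b in zip(h, l)]
    mu = [math.tanh(m) for m in matvec(W_mu, inp)]
    # Clamp log-sigma so the illustrative noise scale stays bounded.
    log_sigma = [max(-5.0, min(2.0, s)) for s in matvec(W_sigma, inp)]
    return [m + math.exp(s) * rng.gauss(0.0, 1.0)
            for m, s in zip(mu, log_sigma)]

def rollout(x_emb, params, steps, K, rng):
    """Run `steps` recursive steps and return the final high-level state h."""
    d = len(x_emb)
    h, l = [0.0] * d, [0.0] * d
    for _ in range(steps):
        l = low_level_refine(h, l, x_emb, params["W_l"], K)
        h = high_level_transition(h, l, params["W_mu"], params["W_sigma"], rng)
    return h

def make_params(d, seed=0):
    """Random (untrained) weights, purely for demonstration."""
    rng = random.Random(seed)
    def mat():
        return [[rng.gauss(0.0, 1.0) / math.sqrt(d) for _ in range(d)]
                for _ in range(d)]
    return {"W_l": mat(), "W_mu": mat(), "W_sigma": mat()}
```

Calling `rollout` twice on the same input with differently seeded `random.Random` generators yields different final states, while reusing a seed reproduces the trajectory exactly. That per-trajectory randomness is what sampling several trajectories in parallel depends on.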

Training

  • Variational inference with evidence lower bound (ELBO).
  • Both prior and posterior are Markov processes over latent states.
  • Truncated BPTT: gradients are propagated only through the last transition of each supervision step, enabling memory-efficient training.
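For orientation, one generic form such an objective could take is sketched below. This is the standard ELBO for a sequential latent-variable model with a Markov prior and a Markov posterior, written under assumptions that the summary does not confirm: T denotes the number of high-level transitions, θ and φ denote prior/decoder and posterior parameters, the posterior conditions on the target y, and the likelihood reads only the final state. The paper's exact objective (for example, its per-supervision-step terms) may differ.

```latex
\log p_\theta(y \mid x) \;\ge\;
  \mathbb{E}_{q_\phi(z_{1:T} \mid x, y)}\!\left[ \log p_\theta(y \mid z_T, x) \right]
  \;-\; \sum_{t=1}^{T} \mathbb{E}_{q_\phi(z_{1:t-1} \mid x, y)}\!\left[
      \mathrm{KL}\!\left( q_\phi(z_t \mid z_{t-1}, x, y) \,\big\|\, p_\theta(z_t \mid z_{t-1}, x) \right)
  \right]
```

Because both processes are Markov, the KL term decomposes into a sum of per-transition terms, which fits a scheme that backpropagates through only one transition at a time.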

Inference-Time Scaling

  • Depth scaling: Adaptive computation time — each trajectory halts when a learned head predicts it is beneficial to stop.
  • Width scaling: sample N parallel trajectories from the prior, decode each, then select the best candidate using either:
      • Majority voting, or
      • Latent Process Reward Model (LPRM): a value head trained to predict final accuracy.
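The two selection rules for width scaling can be sketched as follows. This is an illustrative outline rather than the authors' code: `sample_fn` stands in for "sample one trajectory from the prior and decode it", and `score_fn` stands in for an LPRM-style value head. Both are hypothetical callables supplied by the caller, as are the function names.

```python
from collections import Counter

def majority_vote(candidates):
    """Return the most frequent decoded answer (ties go to the earliest seen).

    Candidates must be hashable, e.g. tuples rather than lists.
    """
    return Counter(candidates).most_common(1)[0][0]

def best_by_score(candidates, scores):
    """Return the candidate with the highest predicted-accuracy score."""
    best_index = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best_index]

def width_scale(sample_fn, n, score_fn=None):
    """Draw n candidates, then select one.

    sample_fn(i) -> decoded answer for the i-th sampled trajectory (assumed).
    score_fn(answer) -> float, a stand-in for a learned value head (assumed).
    With no score_fn, fall back to majority voting.
    """
    candidates = [sample_fn(i) for i in range(n)]
    if score_fn is None:
        return majority_vote(candidates)
    return best_by_score(candidates, [score_fn(c) for c in candidates])
```

Majority voting needs no extra model but only helps when the correct answer is also the most common one; a learned scorer can pick a correct answer that appears only once among the N samples.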

Key Results

Structured Reasoning

Method               Params  Sudoku-Extreme  ARC-AGI-1  ARC-AGI-2
TRM (deterministic)  7M      87.4%           44.6%      7.8%
GRAM (Ours)          10M     97.0%           52.0%      11.1%

Width scaling is especially powerful: GRAM with N=20 samples at 16 iterations each (20 × 16 = 320 iterations in total) outperforms all deterministic baselines run for 320 iterations (97.0% vs 90.5%).

Multi-Solution Tasks (N-Queens, Graph Coloring)

  • Deterministic recursion fails on multi-solution tasks: it maps each input to a single output, so it cannot discover multiple valid solutions.
  • GRAM achieves 99.6% accuracy on 8×8 N-Queens with 90.3% coverage across 20 samples.
  • On graph coloring: 85.8% coverage vs 22.3% for the best deterministic baseline.
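The summary does not define "coverage", so the sketch below shows only one plausible reading: the fraction of a reference set of valid solutions that appears at least once among the sampled outputs. The function names, the row-to-column board encoding, and the caller-supplied reference set are all assumptions made for illustration; the paper's metric may be defined differently.

```python
def is_valid_queens(cols):
    """Check an N-Queens placement, where cols[r] is the queen's column in row r."""
    n = len(cols)
    if any(not (0 <= c < n) for c in cols):
        return False
    # One queen per column and per diagonal (rows are distinct by construction).
    return (len(set(cols)) == n
            and len({c - r for r, c in enumerate(cols)}) == n
            and len({c + r for r, c in enumerate(cols)}) == n)

def coverage(samples, reference_solutions):
    """Fraction of reference solutions found at least once among the samples.

    Assumed definition: invalid samples are ignored, duplicates count once.
    """
    reference = {tuple(s) for s in reference_solutions}
    found = {tuple(s) for s in samples if is_valid_queens(s)} & reference
    return len(found) / len(reference)
```

For example, the 4×4 board has exactly two solutions, `(1, 3, 0, 2)` and `(2, 0, 3, 1)`. A sampler that returns only the first one, however many times, covers half of the reference set under this definition, which is the failure mode a deterministic model cannot avoid.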

Significance

GRAM addresses a fundamental limitation of deterministic recursive reasoning: the inability to handle uncertainty and explore multiple hypotheses. By introducing controlled stochasticity and explicit width-scaling, it opens a path toward more robust reasoning systems that can consider alternatives and escape local optima — particularly important for open-ended problem-solving where the correct approach is not known in advance.