Latent Spatial Memory for Video World Models (Mirage)

Paper: arXiv · Authors: Wang et al. · Institutions: ZJU, MSR, Adelaide, Monash

Problem & Motivation

Video world models that maintain coherent scene understanding across long generations typically reconstruct pixel-space representations at every step, creating a massive memory and compute bottleneck. Storing and updating a full 3D scene in RGB space is prohibitively expensive. A compressed persistent memory operating directly in the latent space is needed.

Method / Approach

Mirage introduces a latent spatial memory — a persistent 3D cache that stores scene information directly in the diffusion latent space, entirely avoiding pixel-space reconstruction. The pipeline consists of three stages:

  1. Initialize — encode a reference frame, estimate depth, back-project latent features into 3D
  2. Readout — project the 3D memory to latent resolution using Z-buffering for occlusion handling
  3. Update — filter dynamic objects and sky regions via Qwen3-VL + SAM3, re-encode only static regions, and back-project the updated features
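
The three stages above can be sketched in NumPy as follows. This is a minimal illustration, not the paper's implementation: the function names, the pinhole intrinsics `K` at latent resolution, the point-list memory layout, and the append-only update are all assumptions made here, and the boolean `static_mask` merely stands in for the Qwen3-VL + SAM3 filtering output.

```python
import numpy as np

def back_project(latent, depth, K, cam_to_world):
    """Initialize: lift an (H, W, C) latent map to world-space points.

    depth is (H, W) at latent resolution; K is a 3x3 pinhole intrinsics
    matrix scaled to that resolution (both are assumptions of this sketch).
    """
    H, W, C = latent.shape
    v, u = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    # Homogeneous coordinates of each latent cell centre.
    pix = np.stack([u + 0.5, v + 0.5, np.ones((H, W))], axis=-1).reshape(-1, 3)
    rays = pix @ np.linalg.inv(K).T            # camera-space rays with z = 1
    pts_cam = rays * depth.reshape(-1, 1)      # scale each ray by its depth
    pts_h = np.concatenate([pts_cam, np.ones((H * W, 1))], axis=1)
    pts_world = (pts_h @ cam_to_world.T)[:, :3]
    return pts_world, latent.reshape(-1, C)

def readout(points, feats, K, world_to_cam, H, W):
    """Readout: splat the memory into an (H, W, C) map, nearest point wins."""
    pts_h = np.concatenate([points, np.ones((len(points), 1))], axis=1)
    pts_cam = (pts_h @ world_to_cam.T)[:, :3]
    front = pts_cam[:, 2] > 1e-6               # drop points behind the camera
    pts_cam, feats = pts_cam[front], feats[front]
    proj = pts_cam @ K.T
    u = np.floor(proj[:, 0] / proj[:, 2]).astype(int)
    v = np.floor(proj[:, 1] / proj[:, 2]).astype(int)
    inside = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    cell, z, f = (v * W + u)[inside], pts_cam[inside, 2], feats[inside]
    zbuf = np.full(H * W, np.inf)
    np.minimum.at(zbuf, cell, z)               # Z-buffer: nearest depth per cell
    winner = z == zbuf[cell]                   # points that pass the depth test
    out = np.zeros((H * W, feats.shape[1]))
    out[cell[winner]] = f[winner]              # exact depth ties resolve arbitrarily
    return out.reshape(H, W, -1), np.isfinite(zbuf).reshape(H, W)

def update(mem_pts, mem_feats, latent, depth, static_mask, K, cam_to_world):
    """Update: append only the cells marked static to the memory."""
    pts, feats = back_project(latent, depth, K, cam_to_world)
    keep = static_mask.reshape(-1)             # hypothetical static-region mask
    return (np.concatenate([mem_pts, pts[keep]]),
            np.concatenate([mem_feats, feats[keep]]))
```

Reading out from the same camera that wrote the memory should reproduce the original latent map, and a point placed closer along the same ray should overwrite the cell it lands in. The second array returned by `readout` marks which cells received any point, which a generator could use to tell remembered regions from unseen ones.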

Mirage fine-tunes Wan2.2-TI2V-5B in two stages: Stage 1 trains only the ControlNet branch; Stage 2 attaches a rank-64 LoRA.
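
For readers unfamiliar with the term, a rank-64 LoRA adds a trainable low-rank correction to a frozen weight matrix. The sketch below shows the standard LoRA formulation in NumPy; the class name, the 256-dimensional layer in the usage note, the `alpha` scaling, and the initialization scale are illustrative assumptions, and the summary does not say which layers of the backbone receive adapters.

```python
import numpy as np

class LoRALinear:
    """Frozen linear layer plus a trainable low-rank update (standard LoRA)."""

    def __init__(self, weight, rank=64, alpha=64.0, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = weight.shape
        self.weight = weight                                   # frozen base weight
        self.A = rng.normal(scale=0.01, size=(rank, d_in))     # trainable
        self.B = np.zeros((d_out, rank))                       # trainable, zero init
        self.scale = alpha / rank

    def __call__(self, x):
        # Base output plus the scaled low-rank correction x A^T B^T.
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

    def trainable_params(self):
        return self.A.size + self.B.size
```

Because `B` starts at zero, the adapted layer initially matches the frozen layer exactly. For a hypothetical 256 × 256 weight, the adapter trains 64 × (256 + 256) = 32,768 parameters instead of 65,536.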

Key Results

  • 10.57× faster end-to-end video generation vs. RGB-cache baselines
  • 55× lower GPU memory consumption
  • Per-frame cost: 0.25s
  • Cache grows at <0.5 MiB/chunk
  • WorldScore SOTA: 70.36
  • RealEstate10K SOTA: PSNRc 20.05
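
To put the per-frame and cache figures in more familiar units, the snippet below does the arithmetic. It assumes the 0.25 s figure covers the full per-frame latency, and the chunk count in the usage note is a hypothetical input, not a number from the paper.

```python
def frames_per_second(seconds_per_frame):
    """Throughput implied by a fixed per-frame cost."""
    return 1.0 / seconds_per_frame

def cache_upper_bound_mib(num_chunks, mib_per_chunk=0.5):
    """Upper bound on cache size if every chunk adds at most mib_per_chunk."""
    return num_chunks * mib_per_chunk
```

Under that assumption, `frames_per_second(0.25)` gives 4.0 frames per second, and a hypothetical 100-chunk generation would keep the cache under `cache_upper_bound_mib(100)`, which is 50.0 MiB.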

Contributions

  1. First latent-space persistent 3D memory for diffusion-based video generation
  2. Efficient Z-buffered readout for occlusion handling in latent space
  3. Dynamic object filtering pipeline using vision-language models (Qwen3-VL + SAM3)
  4. Two-stage fine-tuning protocol for video backbone adaptation
  5. State-of-the-art results on both WorldScore and RealEstate10K benchmarks

Strengths

  • Dramatic efficiency improvements: roughly 10× faster generation and 55× lower GPU memory consumption are transformative
  • Operating in latent space avoids the pixel-reconstruction bottleneck entirely
  • Filtering dynamic objects and sky regions keeps transient content out of the memory, so the cached static scene stays coherent while dynamic content remains free to change
  • SOTA results validate the approach beyond just efficiency
  • A per-frame cost of 0.25 s (roughly 4 frames per second) puts interactive, near-real-time applications within reach

Weaknesses / Limitations

  • Requires depth estimation as input — depth quality bounds memory quality
  • Dynamic object filtering depends on Qwen3-VL + SAM3, adding preprocessing overhead
  • Two-stage fine-tuning is more complex than single-model approaches
  • Evaluation limited to camera-traversal video — applicability to general video generation unclear
  • Cache structure assumes static or slowly changing scenes

Connections & Follow-ups

Builds on 3D scene representation (NeRF, 3D Gaussian Splatting) and diffusion-based video generation (Wan2.2, Sora, Video LDM). The latent-space memory approach could generalize to other generative domains (3D asset generation, robotic world models). Combining it with Mamba or other state-space backbones could further reduce latency.

My Take

This is a rare paper where the efficiency gains are so dramatic that they fundamentally change what's feasible. A 10× speedup and 55× memory reduction with SOTA quality is the kind of result that shifts production architectures. The latent-space memory design is elegant — it seems obvious in retrospect, but making it work with Z-buffering and dynamic filtering required serious engineering. I'd love to see this extended to fully dynamic scenes without the static-region assumption.