Latent Spatial Memory for Video World Models (Mirage)

Paper: arXiv · Authors: Wang et al. · Institutions: ZJU, MSR, Adelaide, Monash

Problem & Motivation

Video world models that maintain coherent scene understanding across long generations typically reconstruct pixel-space representations at every step, creating a massive memory and compute bottleneck. Storing and updating a full 3D scene in RGB space is prohibitively expensive. A compressed persistent memory operating directly in the latent space is needed.

Method / Approach

Mirage introduces a latent spatial memory — a persistent 3D cache that stores scene information directly in the diffusion latent space, entirely avoiding pixel-space reconstruction. The pipeline consists of three stages:

Initialize — encode a reference frame, estimate depth, back-project latent features into 3D
Readout — project the 3D memory to latent resolution using Z-buffering for occlusion handling
Update — filter dynamic objects and sky regions via Qwen3-VL + SAM3, re-encode only static regions, and back-project the updated features
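
The three stages above can be sketched with a simple point-based cache. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a pinhole camera with intrinsics `K` defined at latent resolution, 4x4 camera poses, a depth map already resized to the latent grid, and a boolean `static_mask` standing in for the Qwen3-VL + SAM3 output. The function names and the append-only update are hypothetical; the paper may merge or prune cached features differently.

```python
import numpy as np

def backproject(latent, depth, K, cam_to_world):
    """Initialize: lift an (h, w, c) latent grid to world-space points."""
    h, w, c = latent.shape
    v, u = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Homogeneous coordinates of each latent cell centre.
    pix = np.stack([u + 0.5, v + 0.5, np.ones_like(u, dtype=float)], axis=-1)
    rays = pix.reshape(-1, 3) @ np.linalg.inv(K).T
    pts_cam = rays * depth.reshape(-1, 1)
    pts_h = np.concatenate([pts_cam, np.ones((h * w, 1))], axis=1)
    return (pts_h @ cam_to_world.T)[:, :3], latent.reshape(-1, c)

def readout(points, feats, K, world_to_cam, h, w):
    """Readout: project cached points to a view, keeping the nearest per cell."""
    c = feats.shape[1]
    pts_h = np.concatenate([points, np.ones((points.shape[0], 1))], axis=1)
    pts_cam = (pts_h @ world_to_cam.T)[:, :3]
    front = pts_cam[:, 2] > 1e-6              # drop points behind the camera
    pts_cam, feats = pts_cam[front], feats[front]
    proj = pts_cam @ K.T
    u = np.floor(proj[:, 0] / proj[:, 2]).astype(int)
    v = np.floor(proj[:, 1] / proj[:, 2]).astype(int)
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    cell, z, f = (v * w + u)[inside], pts_cam[inside, 2], feats[inside]
    # Z-buffer: sort by depth, then keep the first (nearest) point in each cell.
    order = np.argsort(z, kind="stable")
    uniq, first = np.unique(cell[order], return_index=True)
    out = np.zeros((h * w, c))
    valid = np.zeros(h * w, dtype=bool)
    out[uniq] = f[order[first]]
    valid[uniq] = True
    return out.reshape(h, w, c), valid.reshape(h, w)

def update(points, feats, new_latent, new_depth, static_mask, K, cam_to_world):
    """Update: back-project a new frame and append only its static cells."""
    pts, f = backproject(new_latent, new_depth, K, cam_to_world)
    keep = static_mask.reshape(-1)
    return np.concatenate([points, pts[keep]]), np.concatenate([feats, f[keep]])
```

Cells where `valid` is false have no cached content, so the generator would have to synthesize them; cells masked out as dynamic or sky never enter the cache and therefore cannot leave stale features behind.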

Training fine-tunes Wan2.2-TI2V-5B in two stages: Stage 1 trains only the ControlNet branch, and Stage 2 attaches a rank-64 LoRA.
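
To make the rank-64 LoRA concrete, here is a minimal NumPy sketch of a standard LoRA-adapted linear layer. Only the rank comes from the summary above; the layer dimensions, the `alpha` scaling value, the initialization, and the class name `LoRALinear` are assumptions for illustration, and the notes do not say which backbone layers receive adapters.

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / rank) * B @ A."""

    def __init__(self, weight, rank=64, alpha=64.0, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = weight.shape
        self.weight = weight                           # frozen backbone weight
        self.A = rng.normal(0.0, 0.01, (rank, d_in))   # trainable down-projection
        self.B = np.zeros((d_out, rank))               # trainable, zero-initialized
        self.scale = alpha / rank

    def __call__(self, x):
        # Base path plus low-rank path; with B = 0 the layer matches the backbone.
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

    def trainable_params(self):
        return self.A.size + self.B.size
```

For a hypothetical 1024 x 1024 projection, the adapter trains 2 * 64 * 1024 = 131,072 parameters against 1,048,576 in the frozen weight, and the zero-initialized `B` means attaching the adapter does not change the model's outputs until training begins.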

Key Results

10.57× faster end-to-end video generation vs. RGB-cache baselines
55× lower GPU memory consumption
Per-frame cost: 0.25s
Cache grows at <0.5 MiB/chunk
WorldScore SOTA: 70.36
RealEstate10K SOTA: PSNRc 20.05
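
A quick back-of-the-envelope script shows what the per-frame cost and cache growth figures imply. The two constants are the reported numbers; the frame and chunk counts are arbitrary examples, and the notes state neither the number of frames per chunk nor whether 0.25s is amortized over a chunk.

```python
SECONDS_PER_FRAME = 0.25     # reported per-frame cost
CACHE_MIB_PER_CHUNK = 0.5    # reported upper bound on cache growth per chunk

def frames_per_second(seconds_per_frame):
    """Throughput implied by a per-frame cost."""
    return 1.0 / seconds_per_frame

def generation_time_s(num_frames, seconds_per_frame=SECONDS_PER_FRAME):
    """Wall-clock time to generate num_frames at the given per-frame cost."""
    return num_frames * seconds_per_frame

def cache_upper_bound_mib(num_chunks, mib_per_chunk=CACHE_MIB_PER_CHUNK):
    """Upper bound on cache size after num_chunks, assuming linear growth."""
    return num_chunks * mib_per_chunk

print(frames_per_second(SECONDS_PER_FRAME))  # 4.0
print(generation_time_s(240))                # 60.0
print(cache_upper_bound_mib(100))            # 50.0
```

So the reported cost corresponds to about 4 frames per second, and even 100 chunks would keep the cache under 50 MiB if growth stays below the stated bound.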

Contributions

First latent-space persistent 3D memory for diffusion-based video generation
Efficient Z-buffered readout for occlusion handling in latent space
Dynamic object filtering pipeline using vision-language models (Qwen3-VL + SAM3)
Two-stage fine-tuning protocol for video backbone adaptation
State-of-the-art results on both WorldScore and RealEstate10K benchmarks

Strengths

Dramatic efficiency improvements: 10.57× faster generation and 55× lower GPU memory than RGB-cache baselines
Operating in latent space avoids the pixel-reconstruction bottleneck entirely
Dynamic object filtering keeps moving content and sky out of the cache, so the stored scene stays coherent while the generator remains free to render change
SOTA results validate the approach beyond just efficiency
Per-frame cost of 0.25s (about 4 frames per second) makes interactive applications plausible, though it is still below real-time video rates

Weaknesses / Limitations

Requires depth estimation as input — depth quality bounds memory quality
Dynamic object filtering depends on Qwen3-VL + SAM3, adding preprocessing overhead
Two-stage fine-tuning is more complex than single-model approaches
Evaluation limited to camera-traversal video — applicability to general video generation unclear
Cache structure assumes static or slowly changing scenes

Connections & Follow-ups

Builds on 3D scene representations (NeRF, 3D Gaussian Splatting) and diffusion-based video generation (Wan2.2, Sora, Video LDM). The latent-space memory approach could generalize to other generative domains, such as 3D asset generation and robotic world models. Combining it with Mamba or other state-space backbones could further reduce latency.

My Take

This is a rare paper where the efficiency gains are so dramatic that they fundamentally change what's feasible. A 10× speedup and 55× memory reduction with SOTA quality is the kind of result that shifts production architectures. The latent-space memory design is elegant — it seems obvious in retrospect, but making it work with Z-buffering and dynamic filtering required serious engineering. I'd love to see this extended to fully dynamic scenes without the static-region assumption.