LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding
NVIDIA Research
NVIDIA Research (with PolyU, Princeton, NJU, UIUC) introduced LocateAnything, a unified generative grounding and detection framework based on Parallel Box Decoding (PBD).
The Problem
Traditional Vision-Language Models (VLMs) decode coordinates token-by-token, creating a sequential bottleneck that severely limits throughput in dense detection scenarios.
How LocateAnything Works
Instead of emitting each coordinate as a separate token, PBD decodes each geometric element (a bounding box or a point) as an atomic unit in a single step, which preserves intra-box coherence.
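To make the difference concrete, the following is a minimal sketch that only counts sequential decoder steps under each scheme. It assumes one token per coordinate and four coordinates per box, and it ignores labels, separators, and any other tokens; the function names are hypothetical and this is not LocateAnything's actual decoder.

```python
# Illustrative step counts only; not LocateAnything's actual decoder.
# Assumption: a box is (x1, y1, x2, y2) and each coordinate costs one token.
COORDS_PER_BOX = 4


def sequential_steps(num_boxes: int, coords_per_box: int = COORDS_PER_BOX) -> int:
    """Token-by-token decoding: one decoder step per coordinate token."""
    return num_boxes * coords_per_box


def parallel_box_steps(num_boxes: int) -> int:
    """Box-level decoding: one decoder step emits a whole box."""
    return num_boxes


for n in (20, 100, 300):
    print(n, sequential_steps(n), parallel_box_steps(n))
```

Under these assumptions the step count shrinks by the number of coordinates per box. Measured speedups depend on much more than step count, so this ratio should be read as an idealized illustration rather than a prediction.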
Key Components
- Moon-ViT vision encoder
- MLP projector
- Qwen2.5 language decoder
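The three components above form a pipeline: image features come out of the vision encoder, the MLP projector maps them to the decoder's embedding size, and the language decoder consumes the result. The sketch below shows only that data flow with toy sizes and random weights. Every name, dimension, the two-layer ReLU projector, and the choice to place visual tokens before text tokens are assumptions for illustration, not details taken from LocateAnything.

```python
import random
from typing import List

Vector = List[float]


def vision_encoder(num_patches: int, vision_dim: int, seed: int = 0) -> List[Vector]:
    """Stand-in for the vision encoder: one feature vector per image patch."""
    rng = random.Random(seed)
    return [[rng.uniform(-1, 1) for _ in range(vision_dim)] for _ in range(num_patches)]


def linear(x: Vector, weight: List[Vector]) -> Vector:
    """Apply a weight matrix stored as one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def mlp_projector(features: List[Vector], w1: List[Vector], w2: List[Vector]) -> List[Vector]:
    """Hypothetical two-layer MLP with ReLU, mapping vision features to decoder width."""
    projected = []
    for f in features:
        hidden = [max(0.0, h) for h in linear(f, w1)]
        projected.append(linear(hidden, w2))
    return projected


def make_weights(out_dim: int, in_dim: int, rng: random.Random) -> List[Vector]:
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]


VISION_DIM, HIDDEN_DIM, LM_DIM = 8, 16, 12  # toy sizes, not the real model's
rng = random.Random(1)
w1 = make_weights(HIDDEN_DIM, VISION_DIM, rng)
w2 = make_weights(LM_DIM, HIDDEN_DIM, rng)

patch_features = vision_encoder(num_patches=4, vision_dim=VISION_DIM)
visual_tokens = mlp_projector(patch_features, w1, w2)
text_tokens = [[0.0] * LM_DIM for _ in range(3)]  # placeholder prompt embeddings
decoder_input = visual_tokens + text_tokens  # sequence a language decoder would consume
print(len(decoder_input), len(decoder_input[0]))  # prints "7 12"
```

The point of the projector is purely dimensional: after it runs, every visual token has the same width as a text embedding, so both can sit in one input sequence.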
Three Operating Modes
| Mode | Name | Description |
|---|---|---|
| Fast | MTP (Max Throughput) | Prioritizes speed |
| Slow | NTP (Max Stability) | Prioritizes accuracy |
| Hybrid | Default | Fast mode with fallback |
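The hybrid mode's "fast with fallback" behaviour can be sketched as a simple control flow: try the fast path, check the result, and rerun the slow path only when the check fails. The source does not say what triggers the fallback, so the validity check used here (normalized coordinates forming a non-empty box) is an assumption, and `hybrid_decode`, `is_valid`, and the stub decoders are hypothetical names.

```python
from typing import Callable, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), normalized to [0, 1]
Decoder = Callable[[str], Box]


def is_valid(box: Box) -> bool:
    """Assumed fallback trigger: reject boxes that are out of range or empty."""
    x1, y1, x2, y2 = box
    return 0.0 <= x1 < x2 <= 1.0 and 0.0 <= y1 < y2 <= 1.0


def hybrid_decode(fast: Decoder, slow: Decoder, query: str) -> Tuple[Box, str]:
    """Use the fast decoder first; fall back to the slow one on an invalid box."""
    box = fast(query)
    if is_valid(box):
        return box, "fast"
    return slow(query), "slow"


# Stub decoders standing in for the Fast and Slow modes.
def broken_fast(query: str) -> Box:
    return (0.5, 0.1, 0.2, 0.9)  # x1 > x2, so the box is degenerate


def good_fast(query: str) -> Box:
    return (0.1, 0.1, 0.4, 0.6)


def careful_slow(query: str) -> Box:
    return (0.2, 0.1, 0.5, 0.9)


print(hybrid_decode(good_fast, careful_slow, "cat"))
print(hybrid_decode(broken_fast, careful_slow, "cat"))
```

With this structure the slow path only costs time on the queries where the fast path fails its check, which is the usual motivation for a fallback design.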
Training Data
Trained on LocateAnything-Data:

- 12M unique images
- 138M language queries
- 785M bounding boxes
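A quick calculation gives a feel for how dense this dataset is. The totals below are the rounded figures quoted above, so the averages are approximate, and the variable names are illustrative.

```python
# Rounded dataset totals as quoted in the text.
images = 12_000_000
queries = 138_000_000
boxes = 785_000_000

boxes_per_image = boxes / images
queries_per_image = queries / images
boxes_per_query = boxes / queries

print(f"boxes per image:   {boxes_per_image:.1f}")    # 65.4
print(f"queries per image: {queries_per_image:.1f}")  # 11.5
print(f"boxes per query:   {boxes_per_query:.1f}")    # 5.7
```

On average, then, each image carries roughly 65 boxes spread over about 11.5 queries.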
Performance Results
- 12.7 BPS throughput, more than 10x faster than Qwen3-VL
- 2.5x faster than Rex-Omni
- SOTA on LVIS: +3.8% mF1
- Dense200: 58.7 mF1
- ScreenSpot-Pro: 60.3 mF1
- DocLayNet: 76.8 mF1
- Significant speedup in dense scenes: 2x to 6x over NTP with 20–300 target boxes
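The throughput figures above can be turned into rough latency estimates. This sketch assumes that BPS means boxes per second, that throughput stays constant regardless of scene density, and that "N times faster" is a ratio of throughputs. None of these is confirmed by the source, so the derived numbers are back-of-the-envelope estimates rather than reported results.

```python
# Assumption: BPS = boxes per second, constant across scene densities.
LOCATE_ANYTHING_BPS = 12.7


def seconds_for(num_boxes: int, bps: float = LOCATE_ANYTHING_BPS) -> float:
    """Estimated time to decode num_boxes at a fixed boxes-per-second rate."""
    return num_boxes / bps


# Assumption: "2.5x faster" and ">10x faster" are throughput ratios.
implied_rex_omni_bps = LOCATE_ANYTHING_BPS / 2.5
implied_qwen3_vl_bps_upper_bound = LOCATE_ANYTHING_BPS / 10

for n in (20, 300):
    print(f"{n} boxes: about {seconds_for(n):.1f} s")
print(f"implied Rex-Omni throughput: about {implied_rex_omni_bps:.2f} BPS")
print(f"implied Qwen3-VL throughput: below {implied_qwen3_vl_bps_upper_bound:.2f} BPS")
```

Under these assumptions, a 300-box scene would take roughly 24 seconds at 12.7 BPS, which shows why per-box decoding cost dominates in dense detection.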