LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding¶

NVIDIA Research

NVIDIA Research (with PolyU, Princeton, NJU, UIUC) introduced LocateAnything, a unified generative grounding and detection framework based on Parallel Box Decoding (PBD).

The Problem¶

Traditional Vision-Language Models (VLMs) decode coordinates token-by-token, creating a sequential bottleneck that severely limits throughput in dense detection scenarios.

How LocateAnything Works¶

PBD instead decodes each geometric element (a bounding box or a point) as an atomic unit in a single step. Because all coordinates of a box are produced together rather than one token at a time, intra-box coherence is preserved.
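The difference in decoding granularity can be illustrated with a toy step count. This is a minimal sketch, not the LocateAnything implementation: it assumes a box is four coordinates (x1, y1, x2, y2), that token-by-token decoding spends one decoder step per coordinate, and that parallel decoding spends one step per box. Label tokens, separators, and the real cost of a step are ignored, and all names here are hypothetical.

```python
from dataclasses import dataclass

COORDS_PER_BOX = 4  # assumption: a box is (x1, y1, x2, y2)


@dataclass
class Box:
    x1: int
    y1: int
    x2: int
    y2: int


def sequential_steps(num_boxes: int) -> int:
    """Decoder steps when every coordinate is its own token."""
    return num_boxes * COORDS_PER_BOX


def parallel_steps(num_boxes: int) -> int:
    """Decoder steps when each box is emitted as one atomic unit."""
    return num_boxes


def decode_token_by_token(coord_stream):
    """Consume one coordinate per step and assemble boxes as they complete."""
    boxes, current = [], []
    for value in coord_stream:  # one decoder step per coordinate
        current.append(value)
        if len(current) == COORDS_PER_BOX:
            boxes.append(Box(*current))
            current = []
    return boxes


def decode_parallel(box_stream):
    """Consume one complete box per step."""
    return [Box(*coords) for coords in box_stream]  # one decoder step per box


same = decode_token_by_token([1, 2, 3, 4, 5, 6, 7, 8]) == decode_parallel(
    [(1, 2, 3, 4), (5, 6, 7, 8)]
)
print(same, sequential_steps(200), parallel_steps(200))
```

Both paths yield the same boxes. Under these assumptions a 200-box scene needs 800 steps token-by-token and 200 steps box-by-box. The step count only illustrates the sequential bottleneck and is not a prediction of measured speedup.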

Key Components¶

Moon-ViT vision encoder
MLP projector
Qwen2.5 language decoder
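The role of the MLP projector can be sketched as follows. The source only lists the three components. The data flow shown here (encoder features, then projector, then decoder input) is the usual arrangement in encoder-projector-decoder VLMs and is assumed rather than confirmed for LocateAnything. The two-layer ReLU design, all dimensions, and the order in which visual and text embeddings are concatenated are hypothetical, and random numbers stand in for the encoder output.

```python
import random


def make_linear(in_dim, out_dim, rng):
    """Return a function applying a randomly initialised linear layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim

    def apply(vec):
        return [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weights, bias)]

    return apply


def relu(vec):
    return [max(0.0, v) for v in vec]


def make_mlp_projector(vision_dim, hidden_dim, lm_dim, seed=0):
    """Two-layer MLP mapping vision features into the decoder's embedding size."""
    rng = random.Random(seed)
    fc1 = make_linear(vision_dim, hidden_dim, rng)
    fc2 = make_linear(hidden_dim, lm_dim, rng)

    def project(patch_features):
        return [fc2(relu(fc1(patch))) for patch in patch_features]

    return project


# Stand-in for vision encoder output: 6 patches of dimension 8 (hypothetical sizes).
rng = random.Random(1)
patch_features = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(6)]

project = make_mlp_projector(vision_dim=8, hidden_dim=16, lm_dim=12)
visual_tokens = project(patch_features)

# Placeholder prompt embeddings in the decoder's embedding size.
text_tokens = [[0.0] * 12 for _ in range(3)]

# The language decoder would consume visual and text embeddings as one sequence.
decoder_input = visual_tokens + text_tokens
print(len(decoder_input), len(decoder_input[0]))  # 9 12
```

The point of the sketch is the shape change: the projector turns each vision feature into a vector of the decoder's embedding size so that image and text can share one input sequence.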

Three Operating Modes¶

Mode	Name	Description
Fast	MTP (Max Throughput)	Prioritizes speed
Slow	NTP (Max Stability)	Prioritizes accuracy
Hybrid	Default	Fast mode with fallback
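One way to picture the Hybrid mode is a fast attempt followed by a check, with the slow path used only when the check fails. The source does not say what triggers the fallback, so the well-formedness test below is a hypothetical stand-in, and fast_decode and slow_decode are placeholder callables rather than real LocateAnything functions.

```python
from typing import Callable, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def is_well_formed(box: Box, width: float, height: float) -> bool:
    """Hypothetical check: corners are ordered and lie inside the image."""
    x1, y1, x2, y2 = box
    return 0 <= x1 < x2 <= width and 0 <= y1 < y2 <= height


def hybrid_decode(
    fast_decode: Callable[[], Optional[Box]],
    slow_decode: Callable[[], Box],
    width: float,
    height: float,
) -> Tuple[Box, str]:
    """Try the fast path first; fall back to the slow path if the result fails the check."""
    candidate = fast_decode()
    if candidate is not None and is_well_formed(candidate, width, height):
        return candidate, "fast"
    return slow_decode(), "slow"


# A well-formed fast result is kept as-is.
box, mode = hybrid_decode(lambda: (10, 20, 110, 220), lambda: (0, 0, 1, 1), 640, 480)

# A degenerate fast result (x1 > x2) triggers the slow path.
fixed_box, fallback_mode = hybrid_decode(
    lambda: (300, 50, 100, 200), lambda: (100, 50, 300, 200), 640, 480
)
print(mode, fallback_mode)  # fast slow
```

With a design like this, the slow path costs extra time only for the outputs that fail the check, which is consistent with Hybrid being described as Fast mode with fallback.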

Training Data¶

Trained on LocateAnything-Data:

12M unique images
138M language queries
785M bounding boxes
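The headline counts imply some rough per-image densities. The short calculation below uses only the three figures quoted above. Those figures are rounded, so the ratios are approximate, and they are averages that say nothing about how boxes are distributed across images.

```python
images = 12_000_000
queries = 138_000_000
boxes = 785_000_000

queries_per_image = queries / images
boxes_per_image = boxes / images
boxes_per_query = boxes / queries

print(f"queries per image: {queries_per_image:.1f}")  # 11.5
print(f"boxes per image:   {boxes_per_image:.1f}")    # 65.4
print(f"boxes per query:   {boxes_per_query:.1f}")    # 5.7
```

An average of roughly 65 boxes per image suggests the training data is heavily weighted toward dense scenes.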

Performance Results¶

12.7 BPS (boxes per second) throughput, more than 10x faster than Qwen3-VL
2.5x faster than Rex-Omni
SOTA on LVIS: +3.8% mF1
Dense200: 58.7 mF1
ScreenSpot-Pro: 60.3 mF1
DocLayNet: 76.8 mF1
Significant speedup in dense scenes: 2x to 6x over NTP with 20–300 target boxes
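To put the throughput figures in concrete terms, the sketch below converts them into decoding time for a dense scene. It assumes that BPS means boxes per second, that "Nx faster" is a ratio of throughputs, and that throughput stays constant regardless of scene density. The baseline values are derived from the quoted ratios, not taken from reported measurements.

```python
locate_anything_bps = 12.7

# Derived from the quoted ratios (assumption: "Nx faster" is a throughput ratio).
rex_omni_bps = locate_anything_bps / 2.5         # about 5.08
qwen3_vl_bps_upper = locate_anything_bps / 10.0  # upper bound, since the claim is ">10x"


def seconds_for(num_boxes: int, boxes_per_second: float) -> float:
    """Time to emit num_boxes at a constant throughput."""
    return num_boxes / boxes_per_second


for name, bps in [
    ("LocateAnything", locate_anything_bps),
    ("Rex-Omni (derived)", rex_omni_bps),
    ("Qwen3-VL (derived bound)", qwen3_vl_bps_upper),
]:
    print(f"{name}: {seconds_for(200, bps):.1f} s for 200 boxes")
```

Under these assumptions, 200 boxes take about 16 seconds at 12.7 BPS, about 39 seconds at the derived Rex-Omni rate, and more than 157 seconds at the derived Qwen3-VL bound.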

Reference¶

NVIDIA Research - LocateAnything