LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding

NVIDIA Research

NVIDIA Research (with PolyU, Princeton, NJU, UIUC) introduced LocateAnything, a unified generative grounding and detection framework based on Parallel Box Decoding (PBD).

The Problem

Traditional Vision-Language Models (VLMs) decode coordinates token-by-token, creating a sequential bottleneck that severely limits throughput in dense detection scenarios.

How LocateAnything Works

PBD instead decodes each geometric element (a bounding box or a point) as an atomic unit in a single step. This removes the per-coordinate sequential bottleneck and preserves intra-box coherence, since all coordinates of a box are produced together.
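To make the bottleneck concrete, here is a toy count of decoder steps under both schemes. It is a sketch under stated assumptions, not the paper's cost model: it assumes a box is written as four coordinate tokens and ignores label tokens, separators, and any per-step cost differences. The function names are hypothetical.

```python
def sequential_steps(num_boxes: int, tokens_per_box: int = 4) -> int:
    """Decoder steps when every coordinate is emitted as its own token.

    tokens_per_box=4 is an assumption (x1, y1, x2, y2), not a figure
    taken from the source.
    """
    return num_boxes * tokens_per_box


def parallel_box_steps(num_boxes: int) -> int:
    """Decoder steps when each box is emitted as one atomic unit."""
    return num_boxes


# Compare the two step counts for sparse and dense scenes.
for n in (20, 100, 300):
    print(n, sequential_steps(n), parallel_box_steps(n))
```

The step count is only one factor in wall-clock throughput, so this ratio should not be read as a prediction of the measured speedups.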

Key Components

  • Moon-ViT vision encoder
  • MLP projector
  • Qwen2.5 language decoder
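The three components form a pipeline: the vision encoder turns an image into patch features, the MLP projector maps those features into the decoder's embedding space, and the language decoder generates the output. The sketch below illustrates only the projector stage in pure Python. The dimensions, the two-layer shape, the ReLU activation, and the random stand-in for the encoder are all assumptions made for illustration; the source does not specify them.

```python
import random
from typing import Callable, List

Vector = List[float]


def make_mlp_projector(in_dim: int, hidden_dim: int, out_dim: int,
                       seed: int = 0) -> Callable[[Vector], Vector]:
    """Build a two-layer MLP (linear, ReLU, linear) with random weights.

    Layer count and activation are assumptions, not the model's real config.
    """
    rng = random.Random(seed)
    w1 = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(hidden_dim)]
    w2 = [[rng.uniform(-0.1, 0.1) for _ in range(hidden_dim)] for _ in range(out_dim)]

    def project(x: Vector) -> Vector:
        hidden = [max(0.0, sum(w * v for w, v in zip(row, x))) for row in w1]
        return [sum(w * h for w, h in zip(row, hidden)) for row in w2]

    return project


def encode_image_stub(num_patches: int, vision_dim: int, seed: int = 1) -> List[Vector]:
    """Hypothetical stand-in for the vision encoder: one random vector per patch."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(vision_dim)] for _ in range(num_patches)]


patches = encode_image_stub(num_patches=16, vision_dim=8)
project = make_mlp_projector(in_dim=8, hidden_dim=12, out_dim=6)
visual_tokens = [project(p) for p in patches]  # what the decoder would consume
print(len(visual_tokens), len(visual_tokens[0]))  # prints "16 6"
```

The point of the projector is only to change the feature width, so the number of visual tokens stays equal to the number of patches.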

Three Operating Modes

Mode     Name                   Description
Fast     MTP (Max Throughput)   Prioritizes speed
Slow     NTP (Max Stability)    Prioritizes accuracy
Hybrid   Default                Fast mode with fallback
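The Hybrid mode can be pictured as a try-fast-then-fall-back wrapper. The sketch below is one possible reading, not the actual implementation: the source does not say what triggers the fallback or whether it applies per box or per query. Here the fallback target is assumed to be the Slow mode, the trigger is assumed to be a malformed box, and `hybrid_decode`, `is_well_formed`, and both stub decoders are hypothetical names.

```python
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), normalized to [0, 1]
Decoder = Callable[[str], List[Box]]


def is_well_formed(box: Box) -> bool:
    """Assumed validity check: corners are ordered and inside the image."""
    x1, y1, x2, y2 = box
    return 0.0 <= x1 < x2 <= 1.0 and 0.0 <= y1 < y2 <= 1.0


def hybrid_decode(fast: Decoder, slow: Decoder, query: str) -> Tuple[List[Box], str]:
    """Use the fast decoder; rerun with the slow one if any box is malformed."""
    boxes = fast(query)
    if all(is_well_formed(b) for b in boxes):
        return boxes, "fast"
    return slow(query), "slow"


def fast_stub(query: str) -> List[Box]:
    # The second box has x2 < x1, so it fails the validity check.
    return [(0.1, 0.2, 0.5, 0.6), (0.7, 0.4, 0.3, 0.9)]


def slow_stub(query: str) -> List[Box]:
    return [(0.1, 0.2, 0.5, 0.6), (0.3, 0.4, 0.7, 0.9)]


boxes, path = hybrid_decode(fast_stub, slow_stub, "every mug on the table")
print(path)  # prints "slow"
```

A whole-query fallback keeps the sketch short; a per-box fallback would preserve more of the fast path's speed at the cost of more bookkeeping.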

Training Data

Trained on LocateAnything-Data:

  • 12M unique images
  • 138M language queries
  • 785M bounding boxes

Performance Results

  • 12.7 BPS throughput, more than 10x faster than Qwen3-VL
  • 2.5x faster than Rex-Omni
  • State of the art on LVIS, with a +3.8% mF1 gain
  • Dense200: 58.7 mF1
  • ScreenSpot-Pro: 60.3 mF1
  • DocLayNet: 76.8 mF1
  • Significant speedup in dense scenes: 2x to 6x over NTP with 20–300 target boxes
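The throughput figures above can be turned into rough latency estimates. This is back-of-the-envelope arithmetic under several assumptions: BPS is read as boxes per second, "Nx faster" is read as a ratio of throughputs, and throughput is treated as constant regardless of scene density (which the 2x to 6x range suggests is not strictly true). The implied baseline numbers are derived here, not reported in the source.

```python
LOCATE_ANYTHING_BPS = 12.7  # reported throughput; unit assumed to be boxes/second


def seconds_for(num_boxes: int, boxes_per_second: float) -> float:
    """Time to emit num_boxes at a constant throughput (a simplification)."""
    return num_boxes / boxes_per_second


# Baselines implied by the reported ratios; derived values, not measurements.
implied_rex_omni_bps = LOCATE_ANYTHING_BPS / 2.5
implied_qwen3_vl_bps_upper = LOCATE_ANYTHING_BPS / 10  # ">10x" gives an upper bound

for n in (20, 100, 300):
    print(f"{n} boxes: about {seconds_for(n, LOCATE_ANYTHING_BPS):.1f} s")

print(f"implied Rex-Omni throughput: about {implied_rex_omni_bps:.2f} BPS")
print(f"implied Qwen3-VL throughput: below {implied_qwen3_vl_bps_upper:.2f} BPS")
```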

Reference

NVIDIA Research - LocateAnything