Tokenisation via Convex Relaxations (ConvexTok)¶
Source: arXiv:2605.22821
Authors: Jan Tempus, Philip Whittington, Craig W. Schmidt, Dennis Komm, Tiago Pimentel (ETH Zürich, Kensho)
Date: May 2026
TL;DR¶
Current tokenisers (BPE, Unigram) are greedy and produce sub-optimal compression — but finding the globally optimal tokeniser is NP-hard. ConvexTok reformulates tokeniser construction as an Integer Program, relaxes it to a Linear Program, solves it on an NVIDIA GH200, and rounds the result. The solution is provably within 1% of optimal at common vocabulary sizes, and on downstream tasks (CORE benchmark, multilingual Flores+) it matches or outperforms BPE — especially at larger vocabulary sizes.
The Problem with Greedy Tokenisers¶
BPE makes locally optimal merge decisions that compound into globally sub-optimal vocabularies. Example from the paper: on dataset D = {abc, abd, abe, bc, bd, be} with budget 3 tokens:
- BPE: Merges
a+b→ab(saves 3), then arbitrary (saves 1 each) → total savings 5 - Optimal: Merges
bc,bd,be→ total savings 6
How ConvexTok Works¶
- Graph Formulation: Tokenisation recast as a shortest-path DAG problem
- Integer Program (IP): Minimizes compressed sequence length subject to vocabulary budget — NP-hard
- LP Relaxation: Variables relaxed to continuous [0,1], solved with NVIDIA CuOPT (~4 hours on 1 GH200, ~99M constraints)
- Rounding: Several schemes convert fractional LP solutions back to discrete vocabularies
Key Results¶
| Vocabulary | Tokeniser | Gap to Optimal |
|---|---|---|
| 32k | BPE | 101.29% |
| 32k | ConvexTok | 100.07% |
| 128k | BPE | 100.39% |
| 128k | ConvexTok | 100.23% |
| 256k | ConvexTok | 99.99% |
Downstream: ConvexTok matches or beats BPE on CORE benchmark and generalizes better out-of-distribution (Flores+). Takeaway: Existing tokenisers are already close to optimal on compression — significant gains must come from other architectural changes.