Tokenisation via Convex Relaxations (ConvexTok)¶

Source: arXiv:2605.22821
Authors: Jan Tempus, Philip Whittington, Craig W. Schmidt, Dennis Komm, Tiago Pimentel (ETH Zürich, Kensho)
Date: May 2026

TL;DR¶

Current tokenisers (BPE, Unigram) are greedy and produce sub-optimal compression — but finding the globally optimal tokeniser is NP-hard. ConvexTok reformulates tokeniser construction as an Integer Program, relaxes it to a Linear Program, solves it on an NVIDIA GH200, and rounds the result. The solution is provably within 1% of optimal at common vocabulary sizes, and on downstream tasks (CORE benchmark, multilingual Flores+) it matches or outperforms BPE — especially at larger vocabulary sizes.

The Problem with Greedy Tokenisers¶

BPE makes locally optimal merge decisions that compound into globally sub-optimal vocabularies. Example from the paper: on dataset D = {abc, abd, abe, bc, bd, be} with budget 3 tokens:

BPE: Merges a+b→ab (saves 3), then arbitrary (saves 1 each) → total savings 5
Optimal: Merges bc, bd, be → total savings 6

How ConvexTok Works¶

Graph Formulation: Tokenisation recast as a shortest-path DAG problem
Integer Program (IP): Minimizes compressed sequence length subject to vocabulary budget — NP-hard
LP Relaxation: Variables relaxed to continuous [0,1], solved with NVIDIA CuOPT (~4 hours on 1 GH200, ~99M constraints)
Rounding: Several schemes convert fractional LP solutions back to discrete vocabularies

Key Results¶

Vocabulary	Tokeniser	Gap to Optimal
32k	BPE	101.29%
32k	ConvexTok	100.07%
128k	BPE	100.39%
128k	ConvexTok	100.23%
256k	ConvexTok	99.99%

Downstream: ConvexTok matches or beats BPE on CORE benchmark and generalizes better out-of-distribution (Flores+). Takeaway: Existing tokenisers are already close to optimal on compression — significant gains must come from other architectural changes.