Skip to content

Tokenisation via Convex Relaxations (ConvexTok)

Source: arXiv:2605.22821
Authors: Jan Tempus, Philip Whittington, Craig W. Schmidt, Dennis Komm, Tiago Pimentel (ETH Zürich, Kensho)
Date: May 2026


TL;DR

Current tokenisers (BPE, Unigram) are greedy and produce sub-optimal compression — but finding the globally optimal tokeniser is NP-hard. ConvexTok reformulates tokeniser construction as an Integer Program, relaxes it to a Linear Program, solves it on an NVIDIA GH200, and rounds the result. The solution is provably within 1% of optimal at common vocabulary sizes, and on downstream tasks (CORE benchmark, multilingual Flores+) it matches or outperforms BPE — especially at larger vocabulary sizes.


The Problem with Greedy Tokenisers

BPE makes locally optimal merge decisions that compound into globally sub-optimal vocabularies. Example from the paper: on dataset D = {abc, abd, abe, bc, bd, be} with budget 3 tokens:

  • BPE: Merges a+b→ab (saves 3), then arbitrary (saves 1 each) → total savings 5
  • Optimal: Merges bc, bd, betotal savings 6

How ConvexTok Works

  1. Graph Formulation: Tokenisation recast as a shortest-path DAG problem
  2. Integer Program (IP): Minimizes compressed sequence length subject to vocabulary budget — NP-hard
  3. LP Relaxation: Variables relaxed to continuous [0,1], solved with NVIDIA CuOPT (~4 hours on 1 GH200, ~99M constraints)
  4. Rounding: Several schemes convert fractional LP solutions back to discrete vocabularies

Key Results

Vocabulary Tokeniser Gap to Optimal
32k BPE 101.29%
32k ConvexTok 100.07%
128k BPE 100.39%
128k ConvexTok 100.23%
256k ConvexTok 99.99%

Downstream: ConvexTok matches or beats BPE on CORE benchmark and generalizes better out-of-distribution (Flores+). Takeaway: Existing tokenisers are already close to optimal on compression — significant gains must come from other architectural changes.