ELF: Embedded Language Flows

Source: arXiv:2605.10938
Authors: Keya Hu*, Linlu Qiu*, Yiyang Lu, Hanhong Zhao, Tianhong Li, Yoon Kim, Jacob Andreas, Kaiming He (MIT) · *equal contribution
Code: github.com/lillian039/ELF
Date: 2026-05-11

TL;DR

Continuous diffusion models can beat discrete ones for language, provided they stay in embedding space until the very last step. ELF reaches a generative perplexity (Gen. PPL) of about 24 with just 32 sampling steps, using 10× fewer training tokens than leading discrete diffusion language models and no distillation.

Core Thesis

Diffusion and flow models dominate continuous data generation (images, video), but work on language has mostly produced discrete diffusion language models (DLMs), which corrupt and restore tokens directly (for example, by masking). This paper asks whether the performance gap between continuous and discrete DLMs is inherent to language's discrete nature or the result of suboptimal continuous formulations.

The answer: it's the formulation, not the nature of language. ELF demonstrates that a clean continuous flow-based model can substantially outperform state-of-the-art discrete approaches.

Key Design Principles

1. Continuous Embedding Space (No Per-Step Discretization)

Token sequence s is encoded into contextual embeddings via a frozen T5-small encoder
Embedding dimension 512 → bottleneck 128d → model hidden size
The encoder is only used during training — not needed at inference
Denoising happens entirely in continuous space; mapping back to tokens occurs only at the final timestep
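
As a rough illustration of the shapes involved, the sketch below projects a sequence of 512-d encoder outputs down to the 128-d bottleneck. It assumes the bottleneck is a single linear map, which this summary does not state. The names `project` and `W_down`, and the random stand-in for the frozen T5-small outputs, are hypothetical.

```python
import random

ENC_DIM = 512         # T5-small encoder output width (from the text)
BOTTLENECK_DIM = 128  # bottleneck width (from the text)

def project(embeddings, weight):
    """Apply a linear map to each row: (seq_len, in_dim) -> (seq_len, out_dim)."""
    out_dim = len(weight[0])
    return [
        [sum(row[i] * weight[i][j] for i in range(len(row))) for j in range(out_dim)]
        for row in embeddings
    ]

rng = random.Random(0)
seq_len = 4

# Stand-in for frozen encoder outputs; a real pipeline would run T5-small here.
encoder_out = [[rng.gauss(0.0, 1.0) for _ in range(ENC_DIM)] for _ in range(seq_len)]

# Hypothetical down-projection, assumed linear for illustration only.
W_down = [
    [rng.gauss(0.0, ENC_DIM ** -0.5) for _ in range(BOTTLENECK_DIM)]
    for _ in range(ENC_DIM)
]

x = project(encoder_out, W_down)  # clean targets for the flow model
print(len(x), len(x[0]))
```

The flow model then only ever sees the 128-d sequence `x` and its noised versions, never the token ids themselves.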

2. Flow Matching with x-Prediction

Rectified flow: z_t = t·x + (1-t)·ε where ε ~ N(0,I)
True velocity: v = x - ε
Network predicts clean embedding x_θ (not velocity), then velocity is recovered: v_θ = (x_θ - z_t) / (1-t)
Loss (MSE): L_MSE = E[ 1/(1-t)² ||x_θ - x||² ]
x-prediction works better on high-dimensional embeddings and enables weight sharing with the decoder
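
The formulas above can be checked numerically. The sketch below builds z_t, recovers the velocity from an x-prediction, and confirms that the weighted x-loss 1/(1-t)² ||x_θ - x||² equals the plain velocity error ||v_θ - v||², which follows from x - z_t = (1-t)(x - ε). The vectors and the perturbed prediction `x_pred` are toy values, not anything from the paper.

```python
import random

def interpolate(x, eps, t):
    """Rectified-flow path: z_t = t * x + (1 - t) * eps."""
    return [t * xi + (1.0 - t) * ei for xi, ei in zip(x, eps)]

def velocity_from_x(x_pred, z_t, t):
    """Recover velocity from a clean-embedding prediction: (x_pred - z_t) / (1 - t)."""
    return [(xp - zi) / (1.0 - t) for xp, zi in zip(x_pred, z_t)]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

rng = random.Random(0)
dim, t = 8, 0.3
x = [rng.gauss(0.0, 1.0) for _ in range(dim)]    # clean embedding
eps = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # Gaussian noise
z_t = interpolate(x, eps, t)
v_true = [xi - ei for xi, ei in zip(x, eps)]     # true velocity v = x - eps

# A perfect prediction (x_pred = x) recovers the true velocity.
v_exact = velocity_from_x(x, z_t, t)

# Toy imperfect prediction, used to compare the two loss forms.
x_pred = [xi + 0.1 * rng.gauss(0.0, 1.0) for xi in x]
v_pred = velocity_from_x(x_pred, z_t, t)
loss_x = sq_dist(x_pred, x) / (1.0 - t) ** 2     # weighted x-prediction loss
loss_v = sq_dist(v_pred, v_true)                 # plain velocity error

print(abs(loss_x - loss_v) < 1e-9)
```

Note that the 1/(1-t) factor grows without bound as t approaches 1, so the velocity recovery is only meaningful for t < 1.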

3. Shared-Weight Denoiser-Decoder

A single network handles both tasks, conditioned on a binary mode token:

Mode	Input	Loss	Probability
Denoise	`z_t` (noisy embedding, random t)	MSE on velocity	80%
Decode	`z̃` (t=1, token-level corruption)	Cross-entropy on tokens	20%

No separate decoder model — the denoiser is the decoder
At the final step, an unembedding matrix W projects to vocabulary logits
Token-level corruption at t=1 uses a logit-normal noise schedule to ensure nontrivial training signal
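
A minimal sketch of how the two branches could be sampled per training example, using the 80/20 split from the table. The function names are hypothetical, the logit-normal parameters are assumed, and treating the logit-normal draw as a corruption level is a guess at the mechanism; the summary does not say how the corruption itself is applied.

```python
import math
import random

P_DENOISE = 0.8  # from the table: 80% denoise, 20% decode

def logit_normal(rng, mean=0.0, std=1.0):
    """Sample in (0, 1): sigmoid of a Gaussian draw. mean/std are assumed values."""
    return 1.0 / (1.0 + math.exp(-rng.gauss(mean, std)))

def sample_branch(rng):
    """Choose the mode and matching loss for one training example."""
    if rng.random() < P_DENOISE:
        # Denoise branch: noisy embedding z_t at a random t, MSE loss.
        return {"mode": "denoise", "loss": "mse"}
    # Decode branch: t = 1 with token-level corruption, cross-entropy loss.
    # Using the logit-normal draw as the corruption level is an assumption.
    return {"mode": "decode", "loss": "cross_entropy", "corruption": logit_normal(rng)}

rng = random.Random(0)
branches = [sample_branch(rng) for _ in range(10_000)]
denoise_frac = sum(b["mode"] == "denoise" for b in branches) / len(branches)
print(f"denoise fraction: {denoise_frac:.3f}")
```

Because both branches run through the same network, the mode only changes the conditioning token, the input corruption, and the loss.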

4. Training-Time Classifier-Free Guidance

CFG is baked into training: the network learns to directly output the post-combination velocity for a sampled guidance scale ω
Zero extra inference forward passes — unlike standard CFG which requires two passes (conditional + unconditional)
Implemented via learnable control tokens prepended to the input
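
For reference, this is the combination that standard CFG computes from two forward passes at inference, and that a training-time variant would instead learn to output in one pass given ω. The convention used here (ω = 1 gives the conditional velocity, ω = 0 the unconditional one) is an assumption; the paper may parameterize the scale differently.

```python
def cfg_combine(v_cond, v_uncond, omega):
    """Standard CFG mix: v_uncond + omega * (v_cond - v_uncond)."""
    return [vu + omega * (vc - vu) for vc, vu in zip(v_cond, v_uncond)]

# Toy velocities standing in for the two network outputs.
v_cond = [1.0, 2.0, -1.0]
v_uncond = [0.5, 0.0, 1.0]

guided = cfg_combine(v_cond, v_uncond, omega=2.0)
print(guided)  # [1.5, 4.0, -3.0]
```

With ω > 1 the result is pushed past the conditional prediction, away from the unconditional one.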

5. Self-Conditioning

During training, a first forward pass produces a rough prediction, which is concatenated to the input for a second pass
This two-pass path is used with 50% probability during training; self-conditioning is always on at inference
No extra forward pass at inference: each step reuses the prediction already computed at the previous step
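
A toy sketch of the two regimes. It assumes the usual self-conditioning recipe in which inference feeds the previous step's prediction back in as the conditioning input, so each step costs one network call. `CountingNet`, `train_forward`, `sample_with_self_cond`, and `step_fn` are hypothetical names, and the toy network simply averages its two inputs.

```python
import random

class CountingNet:
    """Toy stand-in for the denoiser that records how often it is called."""
    def __init__(self):
        self.calls = 0

    def __call__(self, z_t, x_cond):
        self.calls += 1
        return [0.5 * (zi + ci) for zi, ci in zip(z_t, x_cond)]

def train_forward(net, z_t, rng, p_self_cond=0.5):
    """Training-time forward pass with self-conditioning applied part of the time."""
    zeros = [0.0] * len(z_t)
    if rng.random() < p_self_cond:
        x_first = net(z_t, zeros)   # rough estimate (gradients are typically stopped here)
        return net(z_t, x_first)    # second pass conditioned on the estimate
    return net(z_t, zeros)          # plain pass with an empty conditioning slot

def sample_with_self_cond(net, z, num_steps, step_fn):
    """Inference loop: reuse the previous prediction, one network call per step."""
    x_prev = [0.0] * len(z)
    for k in range(num_steps):
        x_prev = net(z, x_prev)
        z = step_fn(z, x_prev, k)
    return z

net = CountingNet()
sample_with_self_cond(net, [1.0, -1.0], num_steps=4,
                      step_fn=lambda z, x_pred, k: x_pred)
inference_calls = net.calls  # one call per sampling step
print(inference_calls)       # 4
```

The extra cost is therefore paid only on the training examples that take the two-pass path.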

Inference

ODE Euler sampler: start from random noise, step through the velocity field, decode at t=1
SDE-inspired variant: adds Gaussian noise at each step (scale γ) to correct errors, improving few-step generation
Self-conditioning and CFG applied at no extra forward-pass cost

z_0 ~ N(0,I)
for each step:
    t = current_time
    x_pred = net(z_t, t, mode="denoise")
    v = (x_pred - z_t) / (1-t)
    z_{t+dt} = z_t + v * dt
    // SDE variant: z_{t+dt} += γ * N(0,I)
at t=1:
    tokens = argmax(W * net(z_1, t=1, mode="decode"))
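
The pseudocode above can be turned into a small runnable sketch. It uses an oracle "network" that always returns a fixed target embedding, and it omits self-conditioning, CFG, and the decode-mode forward pass (the final projection is applied to the sampled embedding directly). `euler_sample`, `decode`, `oracle`, and the tiny unembedding matrix are hypothetical, and the unscaled noise term follows the pseudocode rather than any confirmed implementation.

```python
import random

def euler_sample(net, dim, num_steps=32, gamma=0.0, seed=0):
    """Integrate the flow from t=0 (noise) toward t=1 (data) with Euler steps."""
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in range(dim)]   # z_0 ~ N(0, I)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt                                  # t < 1, so 1 - t > 0
        x_pred = net(z, t)                          # clean-embedding prediction
        v = [(xp - zi) / (1.0 - t) for xp, zi in zip(x_pred, z)]
        z = [zi + vi * dt for zi, vi in zip(z, v)]
        if gamma > 0.0:                             # SDE-inspired variant
            z = [zi + gamma * rng.gauss(0.0, 1.0) for zi in z]
    return z

def decode(h, unembed):
    """Pick the highest-scoring token for one position: argmax over W @ h."""
    logits = [sum(w * hi for w, hi in zip(row, h)) for row in unembed]
    return max(range(len(logits)), key=logits.__getitem__)

target = [1.0, 0.0, -2.0]
oracle = lambda z, t: target          # toy net that always predicts the target

sample = euler_sample(oracle, dim=3, num_steps=32)
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, -1.0]]  # 3-token toy vocabulary
token = decode(sample, W)
print(token)
```

With a perfect predictor the last Euler step lands on the target, since at t = 1 - dt the update z + dt·(x - z)/(1 - t) reduces to x. In practice the predictions are imperfect, which is where the noise injection of the SDE-inspired variant is meant to help.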

Results

Metric	ELF	Baselines (MDLM, Duo, etc.)
Gen. PPL	~24	Higher (requiring distillation)
Sampling steps	32	Hundreds
Training tokens	1×	10× more
Model size	105M	~170M
Distillation	None required	Often needed

Why It Matters

Prior wisdom held that language's discrete nature demands discrete denoising. ELF provides evidence that the gap came from the formulation rather than from anything fundamental. A clean continuous formulation can:

Outperform discrete DLMs with less data and fewer steps
Reuse image-domain techniques (CFG, self-conditioning, SDE sampling) with little modification
Simplify the pipeline — one network, no separate decoder, no per-step token supervision

The paper opens a path toward simpler, more effective continuous diffusion language models that bridge the image and text generation paradigms.