ELF: Embedded Language Flows
Source: arXiv:2605.10938
Authors: Keya Hu*, Linlu Qiu*, Yiyang Lu, Hanhong Zhao, Tianhong Li, Yoon Kim, Jacob Andreas, Kaiming He (MIT) · *equal contribution
Code: github.com/lillian039/ELF
Date: 2026-05-11
TL;DR
Continuous diffusion models can beat discrete ones for language, provided they stay in embedding space until the very last moment. ELF reaches a generative perplexity (Gen. PPL) of about 24 with just 32 sampling steps, using 10× fewer training tokens than leading discrete DLMs and no distillation.
Core Thesis
Diffusion and flow models dominate continuous data generation (images, video), but work on language has mostly produced discrete diffusion language models (DLMs), which corrupt and restore tokens directly. This paper asks: is the performance gap between continuous and discrete DLMs inherent to language's discrete nature, or due to suboptimal continuous formulations?
The answer: it's the formulation, not the nature of language. ELF demonstrates that a clean continuous flow-based model can substantially outperform state-of-the-art discrete approaches.
Key Design Principles
1. Continuous Embedding Space (No Per-Step Discretization)
- Token sequence `s` is encoded into contextual embeddings via a frozen T5-small encoder
- Embedding dimension 512 → bottleneck 128d → model hidden size
- The encoder is only used during training — not needed at inference
- Denoising happens entirely in continuous space; mapping back to tokens occurs only at the final timestep
2. Flow Matching with x-Prediction
- Rectified flow: `z_t = t·x + (1-t)·ε` where `ε ~ N(0,I)`
- True velocity: `v = x - ε`
- Network predicts clean embedding `x_θ` (not velocity), then velocity is recovered: `v_θ = (x_θ - z_t) / (1-t)`
- Loss (MSE): `L_MSE = E[ 1/(1-t)² ||x_θ - x||² ]`
- x-prediction works better on high-dimensional embeddings and enables weight sharing with the decoder
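
The relations above can be checked numerically. The following is a minimal sketch in plain Python, assuming a toy 8-dimensional vector in place of a real embedding; the helper names (`interpolate`, `velocity_from_x_pred`, `weighted_mse`) are illustrative and not taken from the ELF code. It shows that a perfect x-prediction recovers the true velocity `x - ε`, and that the `1/(1-t)²` weight makes the x-space error equal to the squared error on velocity.

```python
import random

def interpolate(x, eps, t):
    """Rectified-flow path: z_t = t*x + (1-t)*eps."""
    return [t * xi + (1.0 - t) * ei for xi, ei in zip(x, eps)]

def velocity_from_x_pred(x_pred, z_t, t):
    """Recover velocity from a clean-embedding prediction: (x_pred - z_t) / (1-t)."""
    return [(xp - zi) / (1.0 - t) for xp, zi in zip(x_pred, z_t)]

def weighted_mse(x_pred, x, t):
    """Per-sample loss ||x_pred - x||^2 / (1-t)^2, i.e. squared error on velocity."""
    sq = sum((xp - xi) ** 2 for xp, xi in zip(x_pred, x))
    return sq / (1.0 - t) ** 2

rng = random.Random(0)
dim = 8                                            # toy size (assumption), not the 128d bottleneck
x = [rng.gauss(0.0, 1.0) for _ in range(dim)]      # stand-in clean embedding
eps = [rng.gauss(0.0, 1.0) for _ in range(dim)]    # Gaussian noise
t = 0.3
z_t = interpolate(x, eps, t)

# A perfect x-prediction reproduces the true velocity v = x - eps.
v_true = [xi - ei for xi, ei in zip(x, eps)]
v_rec = velocity_from_x_pred(x, z_t, t)
max_err = max(abs(a - b) for a, b in zip(v_true, v_rec))

# An imperfect prediction: the velocity-space error equals the weighted x-space error.
x_bad = [xi + 0.1 for xi in x]
v_bad = velocity_from_x_pred(x_bad, z_t, t)
vel_mse = sum((a - b) ** 2 for a, b in zip(v_bad, v_true))
loss = weighted_mse(x_bad, x, t)
```

Because the weight grows as `t` approaches 1, the same x-space error costs more near the data end of the path, which is one reason implementations usually keep `t` strictly below 1 in this term.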
3. Shared-Weight Denoiser-Decoder
A single network handles both tasks, conditioned on a binary mode token:
| Mode | Input | Loss | Probability |
|---|---|---|---|
| Denoise | z_t (noisy embedding, random t) | MSE on velocity | 80% |
| Decode | z̃ (t=1, token-level corruption) | Cross-entropy on tokens | 20% |
- No separate decoder model — the denoiser is the decoder
- At the final step, an unembedding matrix W projects to vocabulary logits
- Token-level corruption at t=1 uses a logit-normal noise schedule to ensure nontrivial training signal
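
As a rough illustration of the training mix described above, here is a minimal sketch of per-example mode selection, a logit-normal draw, and the token-level cross-entropy. The 80/20 split comes from the table. Everything else is an assumption: the function names, the `mean` and `std` defaults, and how the sampled value would map onto a corruption strength are not specified in this summary.

```python
import math
import random

DENOISE_PROB = 0.8   # from the table above; decode mode gets the remaining 20%

def sample_mode(rng):
    """Pick the training mode for one example."""
    return "denoise" if rng.random() < DENOISE_PROB else "decode"

def sample_logit_normal(rng, mean=0.0, std=1.0):
    """Draw u in (0, 1) by passing a Gaussian sample through a sigmoid.

    mean and std are placeholder values, not the paper's settings.
    """
    n = rng.gauss(mean, std)
    return 1.0 / (1.0 + math.exp(-n))

def cross_entropy(logits, target):
    """Cross-entropy for one token position, using a stable log-sum-exp."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]

rng = random.Random(0)
modes = [sample_mode(rng) for _ in range(10_000)]
denoise_frac = modes.count("denoise") / len(modes)

levels = [sample_logit_normal(rng) for _ in range(1_000)]

# Uniform logits over 4 tokens give a loss of log(4).
uniform_ce = cross_entropy([0.0, 0.0, 0.0, 0.0], target=2)
```

A logit-normal draw concentrates mass away from 0 and 1, which fits the stated goal of keeping the decode task from being trivially easy or impossible.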
4. Training-Time Classifier-Free Guidance
- CFG is baked into training: the network learns to directly output the post-combination velocity for a sampled guidance scale ω
- Zero extra inference forward passes — unlike standard CFG which requires two passes (conditional + unconditional)
- Implemented via learnable control tokens prepended to the input
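
For reference, the sketch below shows the combination that standard two-pass CFG computes at inference, which is the quantity the text says the network learns to output directly for a given ω. The convention used here (ω = 1 recovers the conditional velocity) is an assumption, since guidance-scale conventions differ between papers, and `cfg_combine` is an illustrative name.

```python
def cfg_combine(v_cond, v_uncond, omega):
    """Standard CFG mix: v_uncond + omega * (v_cond - v_uncond).

    Assumed convention: omega = 1 means no guidance.
    """
    return [vu + omega * (vc - vu) for vc, vu in zip(v_cond, v_uncond)]

v_cond = [1.0, 2.0, -1.0]      # toy conditional velocity
v_uncond = [0.5, 1.0, 0.0]     # toy unconditional velocity

no_guidance = cfg_combine(v_cond, v_uncond, omega=1.0)   # equals v_cond
guided = cfg_combine(v_cond, v_uncond, omega=2.0)        # extrapolates past v_cond
```

Standard CFG needs both `v_cond` and `v_uncond` at every sampling step, hence two forward passes. Training the network to emit the combined velocity for a sampled ω removes the second pass.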
5. Self-Conditioning
- A first forward pass produces a rough prediction, which is concatenated to the input for a second pass
- Used with 50% probability during training, always during inference
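- Zero overhead at inference: the conditioning input reuses a prediction from a forward pass the sampler has already run (typically the previous step's), so no second pass is needed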
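
The training-time procedure can be sketched as follows. This is a toy illustration under stated assumptions: `toy_net` is a hypothetical stand-in for the denoiser, the zero vector used when self-conditioning is skipped follows common practice and is not confirmed by this summary, and the first-pass output is assumed to be treated as a constant (no gradient flows through it).

```python
import random

SELF_COND_PROB = 0.5   # training-time probability stated above

def toy_net(z_t, x_prev):
    """Hypothetical denoiser stand-in: consumes z_t concatenated with x_prev."""
    inp = z_t + x_prev                      # list concatenation, length 2 * dim
    dim = len(z_t)
    return [0.5 * inp[i] + 0.5 * inp[i + dim] for i in range(dim)]

def training_forward(z_t, rng):
    """One training forward with optional self-conditioning."""
    zeros = [0.0] * len(z_t)                # assumed placeholder when no estimate exists
    if rng.random() < SELF_COND_PROB:
        x_rough = toy_net(z_t, zeros)       # first pass, assumed detached from the gradient
        return toy_net(z_t, x_rough), True  # second pass sees the rough prediction
    return toy_net(z_t, zeros), False

rng = random.Random(0)
z_t = [1.0, -2.0, 4.0]
x_pred, used_self_cond = training_forward(z_t, rng)

# Reference values for both branches of the toy network.
plain = toy_net(z_t, [0.0, 0.0, 0.0])
refined = toy_net(z_t, plain)
```

Mixing both branches during training lets the same weights work whether or not a previous estimate is available, for example at the first sampling step.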
Inference
- ODE Euler sampler: start from random noise, step through the velocity field, decode at t=1
- SDE-inspired variant: adds Gaussian noise at each step (scale γ) to correct errors, improving few-step generation
- Self-conditioning and CFG applied at no extra forward-pass cost
```
z_0 ~ N(0,I)
for each step:
    t = current_time
    x_pred = net(z_t, t, mode="denoise")
    v = (x_pred - z_t) / (1-t)
    z_{t+dt} = z_t + v * dt
    // SDE variant: z_{t+dt} += γ * N(0,I)
at t=1:
    tokens = argmax(W * net(z_1, t=1, mode="decode"))
```
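
A runnable toy version of the sampler is sketched below, under several assumptions. `oracle_net` is a hypothetical denoiser that always returns the clean embedding, so the example only exercises the integration loop. The decode-mode network pass is omitted and the unembedding `W` is a made-up 3-token matrix. Skipping the noise injection after the last step is also an assumption, not something this summary states.

```python
import random

def oracle_net(z_t, t, x_target):
    """Hypothetical denoiser that always predicts the clean embedding."""
    return list(x_target)

def euler_sample(x_target, num_steps, rng, gamma=0.0):
    """Integrate from noise (t=0) to data (t=1) using x-prediction."""
    dim = len(x_target)
    z = [rng.gauss(0.0, 1.0) for _ in range(dim)]        # z_0 ~ N(0, I)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt                                       # t stays below 1, so 1 - t > 0
        x_pred = oracle_net(z, t, x_target)
        v = [(xp - zi) / (1.0 - t) for xp, zi in zip(x_pred, z)]
        z = [zi + vi * dt for zi, vi in zip(z, v)]       # Euler step
        if gamma > 0.0 and k < num_steps - 1:            # SDE-style variant (assumed form)
            z = [zi + gamma * rng.gauss(0.0, 1.0) for zi in z]
    return z

def decode(z, unembed):
    """Final-step decode: argmax over vocabulary logits W @ z."""
    logits = [sum(w * zi for w, zi in zip(row, z)) for row in unembed]
    return max(range(len(logits)), key=logits.__getitem__)

rng = random.Random(0)
unembed = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]   # toy W: 3 tokens, 2 dims
x_target = [0.0, 1.0]                               # embedding aligned with token 1
z_final = euler_sample(x_target, num_steps=32, rng=rng)
token = decode(z_final, unembed)
```

With a perfect predictor, the last Euler step has `1 - t` equal to `dt`, so the update lands on the predicted embedding. Tokens are produced only once, after the loop, which mirrors the "no per-step discretization" design.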
Results
| Metric | ELF | Baselines (MDLM, Duo, etc.) |
|---|---|---|
| Gen. PPL | ~24 | Higher (requiring distillation) |
| Sampling steps | 32 | Hundreds |
| Training tokens | 1× | 10× more |
| Model size | 105M | ~170M |
| Distillation | None required | Often needed |
Why It Matters
Prior wisdom held that language's discrete nature demands discrete denoising. ELF's results suggest the gap came from the formulation rather than from language itself. A clean continuous formulation can:
- Outperform discrete DLMs with less data and fewer steps
- Directly inherit image-domain techniques (CFG, self-conditioning, SDE sampling) with zero adaptation
- Simplify the pipeline — one network, no separate decoder, no per-step token supervision
The paper opens a path toward simpler, more effective continuous diffusion language models that bridge the image and text generation paradigms.