ELF: Embedded Language Flows

Source: arXiv:2605.10938
Authors: Keya Hu*, Linlu Qiu*, Yiyang Lu, Hanhong Zhao, Tianhong Li, Yoon Kim, Jacob Andreas, Kaiming He (MIT) · *equal contribution
Code: github.com/lillian039/ELF
Date: 2026-05-11


TL;DR

Continuous diffusion models can beat discrete ones for language, provided you stay in embedding space until the very last moment. ELF reaches a generative perplexity (Gen. PPL) of ~24 with just 32 sampling steps, using 10× fewer training tokens than leading discrete DLMs and no distillation.


Core Thesis

Diffusion/flow models dominate continuous data generation (images, video), but for language the field has mostly settled on discrete diffusion language models (DLMs), which corrupt and denoise tokens directly. This paper asks: is the performance gap between continuous and discrete DLMs inherent to language's discrete nature, or due to suboptimal continuous formulations?

The answer: it's the formulation, not the nature of language. ELF demonstrates that a clean continuous flow-based model can substantially outperform state-of-the-art discrete approaches.


Key Design Principles

1. Continuous Embedding Space (No Per-Step Discretization)

  • Token sequence s is encoded into contextual embeddings via a frozen T5-small encoder
  • Embedding dimension 512 → bottleneck 128d → model hidden size
  • The encoder is only used during training — not needed at inference
  • Denoising happens entirely in continuous space; mapping back to tokens occurs only at the final timestep

2. Flow Matching with x-Prediction

  • Rectified flow: z_t = t·x + (1-t)·ε where ε ~ N(0,I)
  • True velocity: v = x - ε
  • Network predicts clean embedding x_θ (not velocity), then velocity is recovered: v_θ = (x_θ - z_t) / (1-t)
  • Loss (MSE): L_MSE = E[ 1/(1-t)² ||x_θ - x||² ]
  • x-prediction works better on high-dimensional embeddings and enables weight sharing with the decoder
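The relations in this section can be checked numerically. The following is a minimal NumPy sketch, not the authors' code: the function names, batch size, and the 128-d stand-in embeddings are illustrative assumptions. It shows that the velocity recovered from a perfect x-prediction equals x - ε, and that the 1/(1-t)²-weighted x-loss equals the squared error in velocity space.

```python
import numpy as np

def interpolate(x, eps, t):
    """Rectified-flow path: z_t = t*x + (1-t)*eps (t=0 is noise, t=1 is data)."""
    return t * x + (1.0 - t) * eps

def velocity_from_x(x_pred, z_t, t):
    """Recover velocity from an x-prediction: v = (x_pred - z_t) / (1 - t)."""
    return (x_pred - z_t) / (1.0 - t)

def weighted_x_loss(x_pred, x, t):
    """Per-example loss ||x_pred - x||^2 / (1-t)^2, summed over the embedding dimension."""
    return np.sum((x_pred - x) ** 2, axis=-1) / (1.0 - t) ** 2

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 128))      # stand-in for clean embeddings (assumed 128-d)
eps = rng.normal(size=(4, 128))    # Gaussian noise
t = 0.3
z_t = interpolate(x, eps, t)

# With a perfect prediction, the recovered velocity is x - eps.
v_exact = velocity_from_x(x, z_t, t)

# With an imperfect prediction, the weighted x-loss matches the velocity-space error.
x_pred = x + 0.1 * rng.normal(size=x.shape)
v_pred = velocity_from_x(x_pred, z_t, t)
loss_x = weighted_x_loss(x_pred, x, t)
loss_v = np.sum((v_pred - (x - eps)) ** 2, axis=-1)
```

The equivalence follows because v_θ - v = (x_θ - x) / (1-t), so the 1/(1-t)² weight is exactly what turns an x-space error into a velocity-space error.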

3. Shared-Weight Denoiser-Decoder

A single network handles both tasks, conditioned on a binary mode token:

Mode    | Input                           | Loss                    | Probability
Denoise | z_t (noisy embedding, random t) | MSE on velocity         | 80%
Decode  | (t=1, token-level corruption)   | Cross-entropy on tokens | 20%

  • No separate decoder model — the denoiser is the decoder
  • At the final step, an unembedding matrix W projects to vocabulary logits
  • Token-level corruption at t=1 uses a logit-normal noise schedule to ensure nontrivial training signal
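The two modes can be illustrated with a small NumPy sketch. This is an illustration under assumptions, not the paper's implementation: `sample_mode`, `decode_tokens`, and `cross_entropy` are hypothetical helper names, the unembedding matrix is a toy identity, and the token-level corruption schedule is not modeled.

```python
import numpy as np

def sample_mode(rng, p_denoise=0.8):
    """Pick the training mode for one batch: 80% denoise, 20% decode."""
    return "denoise" if rng.random() < p_denoise else "decode"

def decode_tokens(hidden, W):
    """Project final hidden states to vocabulary logits and take the argmax token."""
    logits = hidden @ W            # (seq_len, d) @ (d, vocab) -> (seq_len, vocab)
    return logits.argmax(axis=-1)

def cross_entropy(logits, targets):
    """Mean token-level cross-entropy, computed with a stable log-softmax."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

rng = np.random.default_rng(0)
modes = [sample_mode(rng) for _ in range(10_000)]
denoise_fraction = modes.count("denoise") / len(modes)

# Toy unembedding (assumption): the identity maps hidden unit i to token i.
W = np.eye(5)
hidden = np.eye(5)[[2, 0, 4]]      # one-hot hidden states for tokens 2, 0, 4
tokens = decode_tokens(hidden, W)

# Uniform logits give a cross-entropy of log(vocab_size).
uniform_ce = cross_entropy(np.zeros((3, 5)), np.array([2, 0, 4]))
```

In a real model both modes would call the same network with a different mode token; only the loss and the input corruption differ.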

4. Training-Time Classifier-Free Guidance

  • CFG is baked into training: the network learns to directly output the post-combination velocity for a sampled guidance scale ω
  • Zero extra inference forward passes — unlike standard CFG which requires two passes (conditional + unconditional)
  • Implemented via learnable control tokens prepended to the input

5. Self-Conditioning

  • During training, a first forward pass produces a rough prediction of x, which is concatenated to the input for a second pass
  • Used with 50% probability during training, always during inference
  • No extra forward pass at inference: the prediction already computed at the previous sampling step is reused as the conditioning input
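The training-time behaviour can be sketched as follows. This is a minimal illustration with a hypothetical stand-in network (`toy_net`) and an all-zeros placeholder for the empty conditioning slot; neither is taken from the paper.

```python
import numpy as np

def toy_net(z_t, x_cond, t):
    """Hypothetical stand-in denoiser: averages the noisy input with the conditioning slot."""
    return 0.5 * (z_t + x_cond)

def self_conditioned_forward(net, z_t, t, rng, p_self_cond=0.5):
    """Training-time forward pass with self-conditioning applied with probability p_self_cond."""
    x_cond = np.zeros_like(z_t)    # empty conditioning slot (assumed to be zeros)
    if rng.random() < p_self_cond:
        # First pass: rough estimate, treated as a constant (no gradient in a real framework).
        x_cond = net(z_t, x_cond, t)
    # Second (or only) pass: final prediction given the conditioning slot.
    return net(z_t, x_cond, t), x_cond

rng = np.random.default_rng(0)
z_t = np.ones((2, 4))
x_pred_always, cond_always = self_conditioned_forward(toy_net, z_t, 0.5, rng, p_self_cond=1.0)
x_pred_never, cond_never = self_conditioned_forward(toy_net, z_t, 0.5, rng, p_self_cond=0.0)
```

Training on both branches lets the same network work with or without a previous estimate, which matters at the first sampling step when no estimate exists yet.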

Inference

  • ODE Euler sampler: start from random noise, step through the velocity field, decode at t=1
  • SDE-inspired variant: adds Gaussian noise at each step (scale γ) to correct errors, improving few-step generation
  • Self-conditioning and CFG applied at no extra forward-pass cost

Sampling loop (pseudocode):

    z_0 ~ N(0, I)
    for t in 0, dt, 2·dt, ..., 1 - dt:
        x_pred = net(z_t, t, mode="denoise")
        v = (x_pred - z_t) / (1 - t)        // recover velocity from the x-prediction
        z_{t+dt} = z_t + v · dt             // Euler step
        // SDE variant: z_{t+dt} += γ · N(0, I)
    at t = 1:
        tokens = argmax(W · net(z_1, t=1, mode="decode"))
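
A runnable version of the sampling loop, as a sketch under stated assumptions: `euler_sample` is a hypothetical name, the "network" is an oracle that always predicts the same clean embeddings, and the unembedding matrix is a toy identity. Self-conditioning and CFG are omitted for brevity.

```python
import numpy as np

def euler_sample(net, W, shape, num_steps=32, gamma=0.0, seed=0):
    """Euler sampler for an x-predicting flow model, with an optional SDE-style noise term."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=shape)                 # z_0 ~ N(0, I)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt                             # t runs over 0, dt, ..., 1 - dt, so 1 - t > 0
        x_pred = net(z, t, "denoise")
        v = (x_pred - z) / (1.0 - t)           # recover velocity from the x-prediction
        z = z + v * dt                         # Euler step
        if gamma > 0.0:
            z = z + gamma * rng.normal(size=shape)   # SDE-inspired noise injection
    hidden = net(z, 1.0, "decode")
    return z, (hidden @ W).argmax(axis=-1)     # unembed and pick the most likely token

# Hypothetical oracle "network": always predicts the same clean embeddings when denoising,
# and passes its input through unchanged when decoding.
target = np.eye(4)[[3, 1, 2]]                  # three positions, 4-d one-hot embeddings
oracle = lambda z, t, mode: target if mode == "denoise" else z
W = np.eye(4)                                  # toy unembedding matrix

z_final, tokens = euler_sample(oracle, W, shape=target.shape, num_steps=32)
```

With a fixed x-prediction and γ = 0, the last Euler step has dt = 1 - t, so the sample lands on the predicted embedding, which is why the oracle example recovers its target.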

Results

Metric          | ELF           | Baselines (MDLM, Duo, etc.)
Gen. PPL        | ~24           | Higher (requiring distillation)
Sampling steps  | 32            | Hundreds
Training tokens | 10× fewer     | 10× more
Model size      | 105M          | ~170M
Distillation    | None required | Often needed

Why It Matters

Prior wisdom held that language's discrete nature demands discrete denoising. ELF indicates that the gap came from the continuous formulation rather than from language itself. A clean continuous formulation can:

  1. Outperform discrete DLMs with less data and fewer steps
  2. Directly inherit image-domain techniques (CFG, self-conditioning, SDE sampling) with zero adaptation
  3. Simplify the pipeline — one network, no separate decoder, no per-step token supervision

The paper opens a path toward simpler, more effective continuous diffusion language models that bridge the image and text generation paradigms.