Z-Reward: Beyond Scalar Rewards

Paper: arXiv · Authors: Jin & Cai et al. · Institution: Alibaba Group & Nankai University

Problem & Motivation

Scalar reward models collapse rich quality assessments into single numbers, losing information about why an output is good or bad. Meanwhile, large reward models with strong reasoning are too expensive for deployment. The field needs a way to decouple the reasoning-intensive judgment process from efficient reward deployment.

Method / Approach

Z-Reward introduces a teacher-student framework with two novel training methods:

Teacher (27B Qwen3.5): Trained with Group-wise Direct Score Optimization (GDSO), which combines:

- Policy-gradient rewards from distribution expectations (E-step)
- Direct pointwise supervision on score distributions (M-step)
- Pairwise supervision on score gaps between samples
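
To make the pointwise and pairwise terms concrete, here is a minimal sketch, not the paper's GDSO objective. It assumes the model outputs a probability distribution over the 9 score levels, that pointwise supervision is a cross-entropy on the annotated level, and that pairwise supervision is a squared error on the gap between expected scores. The function names, the loss forms, and `pair_weight` are all hypothetical, and the policy-gradient term is left out entirely.

```python
import math

# Score grid used in this summary: 9 levels, 0.0 to 4.0 in 0.5 steps.
SCORE_LEVELS = [i * 0.5 for i in range(9)]


def expected_score(probs):
    """Mean score under a probability distribution over the 9 levels."""
    return sum(p * s for p, s in zip(probs, SCORE_LEVELS))


def pointwise_loss(probs, target_score, eps=1e-12):
    """Assumed form: cross-entropy against the annotated score level."""
    target_index = SCORE_LEVELS.index(target_score)
    return -math.log(max(probs[target_index], eps))


def pairwise_gap_loss(probs_a, probs_b, target_gap):
    """Assumed form: squared error between predicted and annotated score gap."""
    predicted_gap = expected_score(probs_a) - expected_score(probs_b)
    return (predicted_gap - target_gap) ** 2


def combined_loss(probs_a, probs_b, score_a, score_b, pair_weight=1.0):
    """Hypothetical weighted sum of the two supervised terms for one pair."""
    pointwise = pointwise_loss(probs_a, score_a) + pointwise_loss(probs_b, score_b)
    pairwise = pairwise_gap_loss(probs_a, probs_b, score_a - score_b)
    return pointwise + pair_weight * pairwise
```

Working on expected scores keeps the pairwise term sensitive to the whole distribution rather than only its most likely level, which is one reason a distribution-aware objective can carry more signal than a hard label.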

Student (9B Qwen3.5): Trained via Reasoning-Internalized Score Distillation (RISD) — transfers the teacher's reasoning-conditioned score distribution into a compact VLM without requiring explicit reasoning chains at inference time.
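
As an illustration of what transferring a score distribution can look like, the sketch below computes a KL divergence between a teacher distribution and a student distribution over the 9 score levels. This is a generic distillation loss, assumed here for illustration; the summary does not state RISD's actual objective, and `softmax`, `kl_divergence`, and `distillation_loss` are hypothetical names.

```python
import math


def softmax(logits):
    """Convert raw logits into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(teacher_probs, student_probs, eps=1e-12):
    """KL(teacher || student); terms with zero teacher mass contribute nothing."""
    return sum(
        t * (math.log(t + eps) - math.log(s + eps))
        for t, s in zip(teacher_probs, student_probs)
        if t > 0
    )


def distillation_loss(teacher_probs, student_logits):
    """Assumed setup: the student emits logits over the same 9 levels."""
    return kl_divergence(teacher_probs, softmax(student_logits))
```

In this framing the teacher distribution would be produced after its reasoning chain, while the student predicts the distribution directly, so no chain is needed at inference time.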

The reward evaluates 4 quality dimensions — Text-Image Alignment, Realism, Aesthetics, Physical Plausibility — on a 9-level half-point scale (0.0–4.0 in 0.5 increments).
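
The scale can be written down directly. The sketch below encodes the half-point grid and the four dimensions, and checks that a rating is on the grid. The identifiers (`DIMENSIONS`, `score_to_level`, `validate_rating`) are hypothetical and only illustrate the data layout described above.

```python
# Assumed identifiers for the four dimensions named in the text.
DIMENSIONS = ("text_image_alignment", "realism", "aesthetics", "physical_plausibility")

SCORE_STEP = 0.5
MAX_SCORE = 4.0
NUM_LEVELS = int(MAX_SCORE / SCORE_STEP) + 1  # 9 levels: 0.0, 0.5, ..., 4.0


def level_to_score(level):
    """Map a level index in 0..8 to its score in 0.0..4.0."""
    if not 0 <= level < NUM_LEVELS:
        raise ValueError(f"level {level} is outside 0..{NUM_LEVELS - 1}")
    return level * SCORE_STEP


def score_to_level(score):
    """Map a score on the half-point grid back to its level index."""
    level = round(score / SCORE_STEP)
    if not 0 <= level < NUM_LEVELS or level * SCORE_STEP != score:
        raise ValueError(f"score {score} is not on the half-point grid")
    return level


def validate_rating(rating):
    """Return level indices for a rating with one on-grid score per dimension."""
    if set(rating) != set(DIMENSIONS):
        raise ValueError("rating must contain exactly the four dimensions")
    return {name: score_to_level(rating[name]) for name in DIMENSIONS}
```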

Key Results

- 27B teacher: 89.6% human preference accuracy
- 9B student: 88.6% human preference accuracy (1.0 percentage point below the teacher)
- As a differentiable reward for text-to-image RL: 41.3% net improvement over the SFT baseline
- Multi-dimensional scoring provides a richer supervisory signal than scalar rewards
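
For readers unfamiliar with the metric, preference accuracy is commonly computed as the fraction of image pairs where the reward model ranks the pair the same way a human annotator did. The sketch below assumes that definition and counts ties as errors; the paper's exact protocol (tie handling, score aggregation across dimensions) is not given in this summary, and `preference_accuracy` is a hypothetical name.

```python
def preference_accuracy(pairs):
    """Fraction of pairs where the model's ordering matches the human choice.

    Each pair is (score_a, score_b, human_prefers_a). Ties in model scores
    are counted as incorrect, which is an assumption of this sketch.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("at least one pair is required")
    correct = sum(
        1
        for score_a, score_b, human_prefers_a in pairs
        if score_a != score_b and (score_a > score_b) == human_prefers_a
    )
    return correct / len(pairs)


# Toy data, invented for illustration only.
example_pairs = [
    (3.5, 2.0, True),   # model agrees with the annotator
    (1.0, 2.5, True),   # model disagrees
    (2.0, 2.0, False),  # tie, counted as an error
    (0.5, 3.0, False),  # model agrees
]
accuracy = preference_accuracy(example_pairs)
```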

Contributions

- Teacher-student decoupling for reward models: reasoning-heavy training, lightweight deployment
- Group-wise Direct Score Optimization (GDSO) for distribution-aware reward learning
- Reasoning-Internalized Score Distillation (RISD) for chain-free student inference
- Four-dimensional quality scoring with a fine-grained 9-level scale
- Demonstration of the reward model as a differentiable supervisor for text-to-image RL

Strengths

- The decoupling principle is practical and well executed: the student loses only 1.0 percentage point relative to the teacher (89.6% to 88.6%)
- Multi-dimensional scoring is more informative than scalar rewards
- The 41.3% net improvement in RL validates reward quality beyond correlation metrics alone
- GDSO elegantly combines pointwise, pairwise, and distributional supervision
- RISD avoids the inference cost of explicit reasoning chains

Weaknesses / Limitations

- The four quality dimensions are specific to text-to-image generation; generalization to other domains is untested
- The 9-level half-point scale may introduce annotation noise
- The teacher-student gap, while small, still represents information loss
- RL gains are reported as "net improvement over SFT", so the result depends on the strength of the SFT baseline
- No analysis of reward hacking or exploitation by the RL policy

Connections & Follow-ups

Z-Reward connects to the reward-model literature (InstructGPT, RLHF, DPO) and to knowledge distillation. The multi-dimensional approach relates to recent work on compositional reward models. RISD could generalize to other domains where reasoning chains are expensive, such as code generation, mathematical reasoning, or dialogue. Using the reward model as a differentiable reward for RL also opens directions for end-to-end generative model training.

My Take

The teacher-student decoupling is the standout contribution here: it solves a real deployment problem rather than just chasing accuracy numbers. The 1.0-point gap between the 27B teacher and the 9B student is impressive and suggests RISD is doing something genuinely useful. I'd like to see the multi-dimensional approach tested in domains beyond text-to-image, particularly code and math, where reasoning traceability matters. The 0.5-point scale granularity feels a bit arbitrary, but the results speak for themselves.