Stanford CS229 — Building Large Language Models (LLMs)

Source: YouTube · Lecturer: Yann Dubois (Stanford Ph.D. Candidate) · Course: Stanford CS229: Machine Learning · Published: August 27, 2024 · Duration: ~1h15m · Views: 2M

Overview

This guest lecture provides a concise, end-to-end walkthrough of what it takes to build a large language model like ChatGPT. Rather than rehashing the Transformer architecture (which is covered elsewhere in CS229), Yann Dubois focuses on the five practical pillars that determine whether an LLM succeeds or fails in production:

  1. Architecture — the Transformer backbone
  2. Data — sourcing, filtering, and deduplication at scale
  3. Evaluation — how we measure model quality
  4. Systems — hardware optimisation and training infrastructure
  5. Post-training — turning a raw language model into a useful assistant

"In industry, it's data, evaluation, and systems that make or break a model." — Yann Dubois


1. Pretraining: Autoregressive Language Modeling

Core Mechanism

LLMs are autoregressive — they model the joint probability of a sequence of tokens as a product of conditional probabilities:

\[ P(w_1, w_2, \dots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1}) \]

At each step, the model receives the preceding tokens and predicts a probability distribution over the next token. The loss function is standard cross-entropy between the predicted distribution and the actual next token.
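To make the loss concrete, here is a minimal Python sketch of next-token cross-entropy and of the chain-rule factorisation. The three-word vocabulary, the logit values, and the function names are illustrative assumptions, not taken from the lecture.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution over the vocabulary."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits, target_id):
    """Cross-entropy for one position: -log P(actual next token)."""
    return -math.log(softmax(logits)[target_id])

def sequence_log_prob(step_probs, token_ids):
    """Chain rule: log P(w_1..w_n) is the sum of per-step conditional log-probs."""
    return sum(math.log(p[t]) for p, t in zip(step_probs, token_ids))

vocab = ["the", "cat", "sat"]   # toy vocabulary (assumption)
logits = [2.0, 0.5, -1.0]       # made-up model scores for the next token
loss = next_token_loss(logits, vocab.index("the"))
print(f"loss if the true next token is 'the': {loss:.3f}")
```

The loss is small when the model puts high probability on the token that actually comes next, and large otherwise.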

Pipeline

Text → Tokenize → Embed → Transformer Layers → Linear Layer → Softmax → Token Probabilities → Cross-Entropy Loss

Key points:

  • The Transformer architecture has a fixed cost per token: the same computation happens for every position
  • Pretraining is unsupervised in the sense that the labels (next tokens) come from the data itself
  • The scale of pretraining is vastly larger than any supervised learning task: trillions of tokens


2. Tokenization

Byte Pair Encoding (BPE)

Before the model sees text, it must be tokenized. The standard algorithm is Byte Pair Encoding, which works as follows:

  1. Start with individual characters/bytes as the initial token vocabulary
  2. Count frequencies of all adjacent token pairs in the training corpus
  3. Merge the most frequent pair into a new token
  4. Repeat until a desired vocabulary size is reached (typically ~32K–100K tokens)
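The four steps above can be sketched as a toy trainer. This is a simplified illustration, not a production tokenizer: it assumes the corpus is pre-split on whitespace, starts from characters rather than bytes, and breaks ties between equally frequent pairs in an arbitrary but deterministic way. All function names are hypothetical.

```python
from collections import Counter

def count_pairs(words):
    """Step 2: count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Step 3: replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(corpus, num_merges):
    """Steps 1-4: start from characters, then merge the top pair repeatedly."""
    words = dict(Counter(tuple(w) for w in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break
        # Most frequent pair; ties are broken by the pair itself (assumption).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words

merges, words = train_bpe("low low low lower lowest", 2)
print(merges)  # learned merge rules, in order
print(words)   # corpus words expressed in the current vocabulary
```

On this tiny corpus two merges are enough for "low" to become a single token, while "lower" and "lowest" remain split into "low" plus individual characters.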

Trade-offs

Advantage | Disadvantage
Handles punctuation, typos, and rare spellings naturally | Numbers are split oddly: 42 might be "4" + "2", making math harder
Multilingual: no need for language-specific segmentation | Long strings of whitespace or repeated characters waste tokens
Fixed vocabulary makes the embedding layer manageable | Tokenization is an architectural crutch; future architectures may eliminate it

Key insight: Tokens are not words. A token can be a subword, a character, or even a partial character. Understanding your tokeniser's behaviour is critical for debugging model outputs.


3. Training Data

The Scale

The pretraining dataset for modern LLMs is dominated by Common Crawl, a publicly available web scrape containing ~250 billion pages (~1 petabyte of raw data). After aggressive filtering, SOTA models train on roughly 15 trillion tokens.

For perspective: GPT-3 (2020) trained on ~300 billion tokens. Llama 3 class models train on ~15 trillion, a 50× increase in four years.

The Filtering Pipeline

Common Crawl (250B pages, ~1PB)
    ↓ Remove boilerplate HTML
    ↓ Filter NSFW, PII, toxic content
    ↓ Deduplication (URLs, paragraphs, n-gram overlaps)
    ↓ Quality classifier (trained on Wikipedia-linked domains)
    ↓ Domain classification & weighting
    → Final pretraining corpus (~15T tokens)

Deduplication is one of the most important and underappreciated steps:

  • Headers/footers appear on millions of pages
  • The same article may be scraped from multiple mirrors
  • n-gram overlap removal reduces dataset size by 10–30%
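As a rough illustration of paragraph-level deduplication, the sketch below drops any paragraph whose normalised text has already been seen anywhere in the corpus. It only handles exact matches after normalisation; the normalisation rule, the choice of SHA-1, and the sample documents are assumptions made for the example, and the n-gram overlap removal mentioned above is not covered.

```python
import hashlib

def normalise(paragraph):
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(paragraph.lower().split())

def dedup_paragraphs(documents):
    """Keep only the first occurrence of each paragraph across all documents."""
    seen = set()
    cleaned = []
    for doc in documents:
        kept = []
        for para in doc.split("\n\n"):
            norm = normalise(para)
            if not norm:
                continue
            digest = hashlib.sha1(norm.encode("utf-8")).hexdigest()
            if digest in seen:
                continue  # repeated header, footer, or mirrored article
            seen.add(digest)
            kept.append(para.strip())
        cleaned.append("\n\n".join(kept))
    return cleaned

docs = [
    "Site header\n\nAn article about llamas.",
    "Site header\n\nA different article about alpacas.",
    "SITE   HEADER\n\nAn article about llamas.",  # a mirror of the first page
]
cleaned = dedup_paragraphs(docs)
print(cleaned)
```

The repeated header survives only once, and the mirrored third page ends up empty, so a later step could drop it entirely.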

Domain weighting allows upsampling of high-quality sources:

  • Code repositories → better coding ability
  • Academic papers → better reasoning
  • Books → better narrative coherence

Notable Datasets

  • The Pile — an 800GB academic benchmark corpus combining Wikipedia, GitHub, books, academic papers, and more
  • C4 (Colossal Clean Crawled Corpus) — a cleaned version of Common Crawl, commonly used for research

4. Evaluation

Perplexity

The classical pretraining metric. Perplexity measures how "surprised" the model is by the next token — lower is better.

\[ \text{PPL} = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log P(w_i \mid w_{<i})\right) \]
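The formula translates directly into a few lines of Python. In this sketch, `token_probs` is assumed to hold the probability the model assigned to each actual next token; the values are made up.

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-probability of the observed tokens."""
    n = len(token_probs)
    avg_nll = -sum(math.log(p) for p in token_probs) / n
    return math.exp(avg_nll)

# A model that always gives the true token probability 1/4 is exactly as
# uncertain as a uniform choice among 4 options.
print(perplexity([0.25] * 8))
# A perfect model is never surprised.
print(perplexity([1.0, 1.0, 1.0]))
```

One way to read the number: a perplexity of k means the model is, on average, as uncertain as if it were choosing uniformly among k tokens.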

Limitations:

  • Tokeniser-dependent: can't compare across models with different tokenisers
  • Only measures next-token prediction, not reasoning or understanding
  • Poorly correlated with downstream task performance

"Perplexity is no longer favoured in academic settings because it depends too heavily on the tokenizer and data distribution."

MMLU (Massive Multitask Language Understanding)

A benchmark of 57 subjects spanning law, medicine, physics, mathematics, history, and more. Models answer multiple-choice questions. It is one of the most widely used academic benchmarks for knowledge and reasoning.

HELM (Holistic Evaluation of Language Models)

Stanford's framework that tests models across multiple dimensions simultaneously:

  • Accuracy
  • Calibration (do confidence scores match actual correctness?)
  • Robustness (performance under adversarial inputs)
  • Fairness
  • Bias and toxicity
  • Efficiency

"HELM evaluates the 'how' and 'should' of a model's behaviour — not just the 'what'."

Chatbot Arena (LMSys)

Considered the gold standard for evaluating aligned models. Users chat with two anonymous models side-by-side and vote for which one is better. Using Elo ratings (like chess), it produces a reliable ranking that correlates well with real-world usefulness.
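For readers unfamiliar with Elo, here is a minimal sketch of the textbook update applied after one head-to-head vote. The K-factor of 32 and the 400-point scale are the conventional chess values, used here as assumptions; the leaderboard's actual rating computation may differ.

```python
def expected_score(r_a, r_b):
    """Predicted probability that model A beats model B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_a, r_b, score_a, k=32.0):
    """score_a is 1.0 if A wins the vote, 0.0 if B wins, 0.5 for a tie."""
    e_a = expected_score(r_a, r_b)
    new_a = r_a + k * (score_a - e_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b

# Two models start level; a user prefers model A's answer.
print(elo_update(1000.0, 1000.0, 1.0))
```

An upset (a low-rated model beating a high-rated one) moves the ratings more than an expected result, which is why the ranking settles as votes accumulate.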

LLM-as-a-Judge (AlpacaEval)

Uses a strong model (GPT-4) as the judge, comparing a model's outputs against reference outputs from a baseline model. Its rankings show a correlation of about 0.98 with the human-vote rankings on Chatbot Arena, and it is much cheaper than human evaluation at scale.

Key Challenges

  • Test set contamination — benchmark questions may have been seen in training data (Common Crawl includes many evaluation datasets)
  • Open-ended generation is fundamentally hard to evaluate automatically
  • Length bias — LLM judges prefer longer answers regardless of quality
  • Leaderboard hacking — models can be optimised for specific benchmarks rather than general capability

5. Scaling Laws

The Empirical Law

Performance improves predictably with three factors, following a power law (a straight line on a log-log plot):

  • Compute (total FLOPs)
  • Model size (number of parameters)
  • Data size (number of training tokens)

This enables a crucial workflow: train small models at various compute budgets and extrapolate to predict the performance of a much larger model before investing in the full training run.

Chinchilla Optimality

DeepMind's Chinchilla paper (2022) found that most LLMs were undertrained. The optimal allocation for a fixed compute budget is:

Regime | Tokens per Parameter | Use Case
Training-optimal | ~20 tokens/param | Lowest loss for a given compute budget
Inference-optimal | ~100–150 tokens/param | Prefer smaller model with more data (cheaper at inference)

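Practical implication: A 7B model trained on 140B tokens (20 tokens/param) sits at the training-optimal ratio, whereas a 13B model trained on 130B tokens (10 tokens/param) is undertrained for its size. The 7B model is also roughly half as expensive to serve per token, so whenever the two reach similar quality the smaller model is the better deployment choice.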
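The ~20 tokens/param rule can be turned into a back-of-the-envelope calculator. The sketch below relies on the common approximation that training costs about 6 FLOPs per parameter per token (C ≈ 6·N·D); that approximation is an assumption brought in for illustration and is not stated in the lecture summary.

```python
import math

def train_flops(params, tokens):
    """Approximate training compute, assuming C = 6 * N * D."""
    return 6.0 * params * tokens

def compute_optimal(flops_budget, tokens_per_param=20.0):
    """Solve C = 6 * N * D with D = r * N, giving N = sqrt(C / (6 * r))."""
    params = math.sqrt(flops_budget / (6.0 * tokens_per_param))
    return params, tokens_per_param * params

c_7b = train_flops(7e9, 140e9)    # the 7B / 140B-token example
c_13b = train_flops(13e9, 130e9)  # the 13B / 130B-token example
print(f"7B run:  {c_7b:.2e} FLOPs")
print(f"13B run: {c_13b:.2e} FLOPs")

n, d = compute_optimal(c_13b)
print(f"Same budget at 20 tokens/param: {n / 1e9:.1f}B params, {d / 1e9:.0f}B tokens")
```

Under these assumptions the 13B run uses more compute than the 7B run, and the same budget spent at 20 tokens/param would go to a smaller model trained on more tokens.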

Llama 3 Scale (Case Study)

Dimension | Value
Estimated training cost | ~$75 million
GPU hours | ~26 million
Training duration | ~70 days
Compute (FLOPs) | < 1×10²⁶ (below US regulatory threshold)
CO₂ emissions | ~4,000 tonnes equivalent

Note: 4,000 tonnes of CO₂ is significant but modest compared to future models — and small relative to the total carbon footprint of the aviation or shipping industries.


6. Systems: Training Infrastructure

Training a 70B+ parameter model requires sophisticated hardware optimisation:

Mixed Precision (FP16/BF16)

  • FP32 → FP16 halves memory and doubles throughput
  • BF16 (bfloat16) preserves more dynamic range than FP16
  • Some layers (loss computation, normalisation) remain in FP32 for numerical stability
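The range-versus-precision trade-off between FP16 and BF16 can be checked with the standard library alone. This is an emulation sketch: `struct`'s `"e"` format is IEEE half precision, and bfloat16 is approximated by keeping the top 16 bits of a float32 (truncation, whereas real hardware typically rounds). The helper names are hypothetical.

```python
import struct

def to_fp16(x):
    """Round-trip through IEEE half precision; raises OverflowError if too large."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_bf16(x):
    """Emulate bfloat16 by zeroing the low 16 bits of the float32 encoding."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]

def fits_in_fp16(x):
    try:
        to_fp16(x)
        return True
    except OverflowError:
        return False

# Range: FP16 tops out at 65504, while BF16 keeps float32's exponent range.
print(fits_in_fp16(70000.0), to_bf16(70000.0))
# Precision: FP16 has more mantissa bits, so it represents 0.1 more closely.
print(to_fp16(0.1), to_bf16(0.1))
```

This is why BF16 is attractive for training: large activations or gradients are less likely to overflow, at the cost of coarser rounding.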

Operator Fusion

Operations that would each launch a separate GPU kernel are fused into a single kernel, which cuts launch overhead and avoids writing intermediate results out to GPU memory and reading them back. Tools like torch.compile automate this.

3D Parallelism

Strategy | What It Does
Data parallelism | Split the batch across GPUs
Tensor parallelism | Split a single layer's weights across GPUs
Pipeline parallelism | Split layers across GPUs (model depth)

Network Bottleneck

The interconnect between GPUs (NVLink, InfiniBand) is often the limiting factor. Collective communication, such as all-reducing gradients in data parallelism or activations in tensor parallelism, requires extremely high bandwidth.


7. Alignment: From Language Model to Assistant

A pretrained LLM generates plausible text but does not follow instructions, answer helpfully, or refuse harmful requests. Post-training fixes this.

Step 1: Supervised Fine-Tuning (SFT)

Method: Behavioural cloning — train the model on human-written instruction-output pairs using the standard language modeling loss.

Data scale: Surprisingly small. The LIMA paper showed that about 1,000 carefully curated examples are enough for good performance, and scaling from ~2,000 to ~32,000 examples shows diminishing returns.

Limitation: Humans are better at judging quality than generating it. SFT is capped by the ceiling of the human demonstrator.

Step 2: Reinforcement Learning from Human Feedback (RLHF)

The three-stage process that powers ChatGPT:

  1. SFT the pretrained model on high-quality demonstrations
  2. Train a Reward Model — the SFT model generates multiple responses to each prompt; humans rank them; a separate model learns to predict the human preference score (typically using a softmax-based Bradley-Terry model)
  3. PPO Optimisation — the policy (LLM) is optimised to maximise the reward model's score, with a KL penalty to prevent it from drifting too far from the SFT model
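Stages 2 and 3 can be sketched numerically. The first two functions show the Bradley-Terry preference probability and the resulting reward-model loss for one (chosen, rejected) pair. The third shows one common way to fold a KL penalty into the reward, using a per-sample log-ratio estimate. The coefficient `beta`, the scalar inputs, and all function names are illustrative assumptions.

```python
import math

def preference_prob(r_chosen, r_rejected):
    """Bradley-Terry: P(chosen beats rejected) = sigmoid(r_chosen - r_rejected)."""
    return 1.0 / (1.0 + math.exp(-(r_chosen - r_rejected)))

def reward_model_loss(r_chosen, r_rejected):
    """Negative log-likelihood of the human ranking under Bradley-Terry."""
    return -math.log(preference_prob(r_chosen, r_rejected))

def kl_penalised_reward(reward, logp_policy, logp_sft, beta=0.1):
    """Reward minus beta times a per-sample estimate of KL(policy || SFT)."""
    return reward - beta * (logp_policy - logp_sft)

# The loss is small when the reward model agrees with the human ranking.
print(reward_model_loss(2.0, 0.0), reward_model_loss(0.0, 2.0))
# Drifting away from the SFT model (higher log-ratio) lowers the effective reward.
print(kl_penalised_reward(1.0, logp_policy=-1.0, logp_sft=-2.0))
```

The sigmoid of a reward difference is the same as a two-way softmax over the two rewards, which is the "softmax-based" formulation mentioned in stage 2.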

"PPO is theoretically appealing but practically messy — it requires rollouts, out-of-loop optimisation, and clipping."

Data scale: RLHF typically uses ~1 million preference tokens — a rounding error compared to the 15 trillion pretraining tokens.

Step 3: Direct Preference Optimization (DPO)

A simpler alternative that replaces the reward model with a direct optimisation objective:

  • Maximise probability of "chosen" responses
  • Minimise probability of "rejected" responses
  • No separate reward model to train
  • No PPO instability issues
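For a single preference pair, the DPO objective as published can be sketched as below. It compares how much the policy has raised the log-probability of the chosen response relative to a frozen reference model (usually the SFT model) against the same quantity for the rejected response. The reference model and the temperature `beta` come from the DPO formulation rather than from this summary, and the log-probability values are made up.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """-log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(m)) == log(1 + exp(-m))
    return math.log1p(math.exp(-margin))

# Policy identical to the reference: no preference learned yet, loss = log 2.
print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
# Policy has raised the chosen response and lowered the rejected one.
print(dpo_loss(-8.0, -12.0, -10.0, -10.0))
```

Because the loss depends only on sequence log-probabilities, it can be optimised with ordinary gradient descent, with no sampling loop and no separate reward model.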

Result: DPO matches or exceeds PPO in most evaluations while being significantly simpler to implement and tune.

Evaluating Aligned Models

Traditional metrics (perplexity, loss) no longer apply after alignment. The standard evaluations become:

  • Chatbot Arena: human preference head-to-head
  • LLM-as-a-Judge (AlpacaEval, MT-Bench)
  • Safety benchmarks: refusal rates on harmful prompts


8. Inference Optimization

Training is expensive, but inference is where the ongoing cost lives. Several techniques make LLM serving practical:

KV Cache

In autoregressive generation without caching, the key (K) and value (V) tensors for every previous token would be recomputed at every step. KV caching stores K and V from previous tokens and computes them only for the new token, which then attends over the cached entries. This reduces the attention computation from O(n²) to O(n) per step.
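A toy single-head decoder makes the saving visible. In this sketch the "projections" are simple scalings standing in for learned weight matrices, and the token vectors are made up; both are assumptions chosen to keep the example dependency-free. The two decoders produce identical outputs, but the cached one computes K and V once per token.

```python
import math

def attend(query, keys, values):
    """Scaled dot-product attention for one query over all available keys."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]

def project(x, factor):
    """Hypothetical stand-in for a learned K or V projection."""
    return [factor * xi for xi in x]

def decode_without_cache(tokens):
    outputs, projections = [], 0
    for t in range(len(tokens)):
        # Recompute K and V for the whole prefix at every step.
        keys = [project(x, 0.5) for x in tokens[: t + 1]]
        values = [project(x, 2.0) for x in tokens[: t + 1]]
        projections += t + 1
        outputs.append(attend(tokens[t], keys, values))
    return outputs, projections

def decode_with_cache(tokens):
    outputs, projections = [], 0
    k_cache, v_cache = [], []
    for x in tokens:
        # Only the new token's K and V are computed; the rest come from the cache.
        k_cache.append(project(x, 0.5))
        v_cache.append(project(x, 2.0))
        projections += 1
        outputs.append(attend(x, k_cache, v_cache))
    return outputs, projections

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
out_slow, n_slow = decode_without_cache(tokens)
out_fast, n_fast = decode_with_cache(tokens)
print(out_slow == out_fast, n_slow, n_fast)
```

The price of the cache is memory: it grows linearly with sequence length, which is one reason long contexts are expensive to serve.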

Quantization

Reducing the precision of weights and activations:

Precision | Memory per Parameter | Quality Impact
FP32 | 4 bytes | Full precision (baseline)
FP16 / BF16 | 2 bytes | Negligible
INT8 | 1 byte | Small degradation
INT4 | 0.5 bytes | Noticeable but usable
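The table's memory column and the idea of quantization error can both be illustrated briefly. The sketch below uses symmetric "absmax" INT8 quantization with a single scale per tensor, which is one simple scheme among many and is an assumption here. The weight values and the 70B parameter count are made up for the example.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    scale = max(abs(w) for w in weights) / 127.0
    if scale == 0.0:
        scale = 1.0  # all-zero tensor: any scale works
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the integers."""
    return [qi * scale for qi in q]

def weight_memory_gb(num_params, bytes_per_param):
    """Memory for the weights alone, ignoring activations and the KV cache."""
    return num_params * bytes_per_param / 1e9

weights = [0.02, -1.27, 0.5, 0.731]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print(q)
print([round(abs(w - r), 4) for w, r in zip(weights, restored)])

for name, nbytes in [("FP32", 4), ("FP16/BF16", 2), ("INT8", 1), ("INT4", 0.5)]:
    print(name, weight_memory_gb(70e9, nbytes), "GB for a 70B-parameter model")
```

Each weight is off by at most half a quantization step, which is why the quality impact stays small as long as the scale is not dominated by a few outlier values.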

Batching

GPUs are throughput-optimised — processing multiple prompts simultaneously increases utilisation. Batching is complicated by variable-length sequences, which require padding or dynamic batching strategies.

Flash Attention

A GPU kernel-level optimisation that:

  • Avoids materialising the full N×N attention matrix in HBM
  • Uses tiling to keep computation in fast on-chip SRAM
  • Provides 2–4× speedup with no loss in accuracy
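The reason tiling loses no accuracy is the "online softmax" trick: a running maximum and running normaliser let the softmax be accumulated one tile of keys at a time. The sketch below shows this for a single query row in plain Python. It is only an illustration of the arithmetic, not the fused GPU kernel, and it omits the 1/√d scaling for brevity. Function names and inputs are assumptions.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_row_reference(q, keys, values):
    """Standard attention for one query: materialises every score at once."""
    scores = [dot(q, k) for k in keys]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    return [sum(wi * v[j] for wi, v in zip(w, values)) / z for j in range(len(values[0]))]

def attention_row_tiled(q, keys, values, tile=2):
    """Same result, but only `tile` scores exist at any moment."""
    running_max = -math.inf
    denom = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), tile):
        scores = [dot(q, k) for k in keys[start:start + tile]]
        new_max = max(running_max, max(scores))
        # Rescale what has been accumulated so far to the new maximum.
        correction = math.exp(running_max - new_max)  # 0.0 on the first tile
        denom *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, values[start:start + tile]):
            w = math.exp(s - new_max)
            denom += w
            acc = [a + w * vj for a, vj in zip(acc, v)]
        running_max = new_max
    return [a / denom for a in acc]

q = [0.3, -1.2]
keys = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 0.5], [0.5, 0.5]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
ref = attention_row_reference(q, keys, values)
tiled = attention_row_tiled(q, keys, values, tile=2)
print(ref, tiled)
```

The two results agree up to floating-point rounding, which matches the "no loss in accuracy" claim: the computation is reorganised, not approximated.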


9. The Five Pillars Recap

┌────────────────────────────────────────────────────────┐
│                    Building an LLM                     │
├───────────────┬──────────────┬─────────────────────────┤
│ Pillar        │ Key Insight  │ Where the Moat Is       │
├───────────────┼──────────────┼─────────────────────────┤
│ Architecture  │ Transformer  │ Commodity: everyone     │
│               │ is standard  │ uses the same designs   │
├───────────────┼──────────────┼─────────────────────────┤
│ Data          │ 15T tokens   │ Biggest differentiator: │
│               │ heavily      │ curation pipeline is    │
│               │ filtered     │ proprietary know-how    │
├───────────────┼──────────────┼─────────────────────────┤
│ Evaluation    │ MMLU, Arena, │ Necessary but not       │
│               │ AlpacaEval   │ sufficient              │
├───────────────┼──────────────┼─────────────────────────┤
│ Systems       │ FP16, 3D     │ Engineering moat:       │
│               │ parallelism  │ training at scale is    │
│               │              │ genuinely hard          │
├───────────────┼──────────────┼─────────────────────────┤
│ Post-training │ SFT → RLHF   │ Alignment is the        │
│               │ or DPO       │ product differentiator  │
└───────────────┴──────────────┴─────────────────────────┘

Key Takeaways

  1. Data is the real moat — architecture is increasingly commoditised, but the data curation pipeline (filtering, deduplication, domain weighting) is proprietary and empirically critical

  2. Scale compounds predictably — scaling laws let you extrapolate performance before committing to a multi-million dollar training run

  3. Alignment is a separate engineering challenge — pretraining gives you a text generator; SFT + RLHF/DPO turns it into an assistant. The compute and data requirements for alignment are tiny compared to pretraining (~1M tokens vs 15T), but the engineering complexity is high

  4. Evaluation is unsolved — perplexity is dead, MMLU is contaminated, and Chatbot Arena is slow. LLM-as-a-judge is the current best practice but has known biases (length preference, self-enhancement)

  5. Inference efficiency is the long-term cost — training costs millions once; serving costs millions ongoing. KV caching, quantization, and batching are essential production techniques

