Pre-training Paradigms: BERT, GPT, and T5
Prerequisites
- Chapter 8: The Transformer Architecture — Self-attention, multi-head attention, positional encodings, encoder/decoder blocks. BERT uses the encoder, GPT uses the decoder, T5 uses the full encoder-decoder.
Summary
Chapter 9 is THE chapter that explains how Transformer architectures (Ch 8) became large language models. It presents the three foundational pre-training paradigms that transformed NLP between 2018 and 2020: BERT's masked language modeling, GPT's autoregressive language modeling, and T5's span corruption — each representing a different answer to the question "how should we train a Transformer?" The chapter's central insight is that language structure learned from massive unlabeled text transfers to downstream tasks, making the pre-train/fine-tune paradigm vastly more data-efficient than training from scratch. By the chapter's end, students understand why decoder-only models (GPT-style) won the scaling race: autoregressive training is simpler, generation is a more general capability than understanding, and in-context learning emerges as a free bonus at scale. This chapter is the pivot point of the book — everything before it builds toward it, and everything after builds upon it.
Learning Objectives
- Explain why pre-training on large unlabeled corpora followed by task-specific fine-tuning outperforms training from scratch, and quantify the data-efficiency gains with concrete examples
- Derive and implement the masked language modeling (MLM) objective used in BERT, including the 80/10/10 masking strategy, and contrast it with the causal language modeling (CLM) objective used in GPT
- Compare encoder-only (BERT), decoder-only (GPT), and encoder-decoder (T5) architectures in terms of their pre-training objectives, computational trade-offs, and downstream task suitability
- Fine-tune a pre-trained transformer on a text classification task using the Hugging Face Transformers library, evaluate transfer learning performance against a randomly initialized baseline, and interpret fine-tuning loss curves
Section Outline
9.1 The Pre-training Revolution (~5pp)
Why pre-train? The insight that language structure learned from massive unlabeled text transfers to downstream tasks. Historical shift from feature engineering to pre-trained representations, the ImageNet analogy, and the pivotal moment when ELMo, ULMFiT, and BERT demonstrated that pre-training works for NLP.
- 9.1.1 From task-specific to general-purpose representations
- 9.1.2 Transfer learning: the ImageNet moment for NLP
- 9.1.3 The pre-train / fine-tune paradigm
9.2 BERT and Masked Language Modeling (~7pp)
Architecture and training of BERT (Devlin et al., 2019). The bidirectional encoder design, the MLM objective with its 80/10/10 masking strategy, next sentence prediction (NSP) and why it was later dropped, and fine-tuning for downstream tasks.
- 9.2.1 BERT architecture and input representation
- 9.2.2 The masked language modeling objective
- 9.2.3 Next sentence prediction (and its limitations)
- 9.2.4 Fine-tuning BERT for downstream tasks
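The 80/10/10 corruption recipe described above can be sketched in a few lines. This is a minimal illustration in plain Python (not BERT's actual tokenizer-level implementation); the function name `mask_tokens` and the string sentinel `"[MASK]"` are illustrative choices, and real implementations operate on integer token IDs.

```python
import random

def mask_tokens(tokens, vocab, mask_token="[MASK]", mask_prob=0.15, seed=0):
    """Apply BERT-style MLM corruption: select ~15% of positions; of those,
    replace 80% with [MASK], 10% with a random vocabulary token, and keep
    10% unchanged. Returns the corrupted sequence and a dict mapping each
    selected position to its original token (the prediction targets)."""
    rng = random.Random(seed)
    corrupted = list(tokens)
    targets = {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok                       # loss is computed here only
            r = rng.random()
            if r < 0.8:
                corrupted[i] = mask_token          # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted[i] = rng.choice(vocab)   # 10%: random token
            # else: 10% keep the original token unchanged
    return corrupted, targets
```

Note that the model must predict the original token at *every* selected position, including the 10% left unchanged — this prevents it from learning that unmasked tokens are always correct as-is.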
9.3 GPT and Autoregressive Language Modeling (~7pp)
Architecture and training of GPT (Radford et al., 2018) and GPT-2/GPT-3. The decoder-only design, causal masking, the CLM objective, and the discovery that scale enables few-shot learning without fine-tuning.
- 9.3.1 GPT architecture and causal masking
- 9.3.2 The causal language modeling objective
- 9.3.3 From GPT to GPT-2 to GPT-3: the power of scale
- 9.3.4 Zero-shot and few-shot generalization
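The causal masking that Section 9.3.1 covers is mechanically simple: an additive mask zeroes out (sets to $-\infty$ before the softmax) every attention score from a position to any later position. A minimal NumPy sketch, assuming the convention that the mask is added to the raw attention scores:

```python
import numpy as np

def causal_mask(T):
    """Build a T x T additive attention mask: entry (i, j) is 0 where
    position i may attend to position j (j <= i) and -inf where it may
    not (j > i). Adding this to the score matrix before the softmax
    gives zero attention weight to all future positions."""
    return np.triu(np.full((T, T), -np.inf), k=1)
```

Applied to a score matrix, `softmax(scores + causal_mask(T))` yields the strictly left-to-right attention pattern drawn as the "causal masking triangle" in the GPT architecture figure.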
9.4 T5 and Encoder-Decoder Pre-training (~5pp)
The text-to-text transfer transformer (Raffel et al., 2020). Casting every NLP task as text-to-text, the span corruption pre-training objective, the C4 dataset, and systematic ablation findings.
- 9.4.1 The text-to-text framework
- 9.4.2 Span corruption objective
- 9.4.3 Lessons from systematic ablations
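The span corruption objective is easiest to see on a concrete example. Below is a simplified sketch: the function `corrupt_spans` is a hypothetical helper (T5's actual implementation samples span lengths and operates on token IDs), but the sentinel naming `<extra_id_k>` follows T5's convention.

```python
def corrupt_spans(tokens, spans):
    """Simplified T5-style span corruption. Each (start, end) span —
    assumed sorted and non-overlapping — is replaced in the encoder
    input by a unique sentinel token; the decoder target lists each
    sentinel followed by the original span contents."""
    inp, tgt, pos = [], [], 0
    for k, (s, e) in enumerate(spans):
        sentinel = f"<extra_id_{k}>"
        inp.extend(tokens[pos:s])   # copy unchanged tokens
        inp.append(sentinel)        # stand-in for the dropped span
        tgt.append(sentinel)        # target: sentinel, then the span
        tgt.extend(tokens[s:e])
        pos = e
    inp.extend(tokens[pos:])
    return inp, tgt
```

For the sequence "thank you for inviting me to your party" with spans covering "you" and "to your", the encoder sees `thank <extra_id_0> for inviting me <extra_id_1> party` and the decoder must produce `<extra_id_0> you <extra_id_1> to your`.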
9.5 Comparing Paradigms (~6pp)
Encoder-only vs. decoder-only vs. encoder-decoder: which architecture suits which task family? Why decoder-only models won the scaling race. Practical guidelines for model selection.
- 9.5.1 Architecture-task suitability matrix
- 9.5.2 Why decoder-only models won the scaling race
- 9.5.3 Practical guidelines for model selection
Key Equations
Eq 9.1 — Masked Language Modeling loss (Section 9.2)
$$\mathcal{L}_{\text{MLM}} = -\sum_{m \in \mathcal{M}} \log P(w_m \mid \mathbf{w}_{\backslash \mathcal{M}}; \theta)$$
where $\mathcal{M}$ is the set of masked positions and $\mathbf{w}_{\backslash \mathcal{M}}$ denotes the input sequence in which the tokens at positions in $\mathcal{M}$ have been corrupted (masked, randomized, or kept, per the 80/10/10 strategy).
Eq 9.2 — Causal Language Modeling loss (Section 9.3)
$$\mathcal{L}_{\text{CLM}} = -\sum_{t=1}^{T} \log P(w_t \mid w_1, \ldots, w_{t-1}; \theta)$$
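Both Eq 9.1 and Eq 9.2 are sums of per-token negative log-likelihoods; they differ only in which positions contribute. A small sketch (the function name is illustrative) showing how the summed loss relates to per-token perplexity, the standard way these losses are reported:

```python
import math

def nll_and_perplexity(token_probs):
    """Given the probability the model assigned to each target token,
    return the total negative log-likelihood (the summed loss in
    Eqs 9.1-9.2) and the per-token perplexity exp(NLL / count).
    For CLM, token_probs covers every position in the sequence;
    for MLM, only the ~15% of positions that were selected."""
    nll = -sum(math.log(p) for p in token_probs)
    ppl = math.exp(nll / len(token_probs))
    return nll, ppl
```

This makes the data-efficiency contrast of Exercise 1 concrete: per sequence of length $T$, CLM contributes $T$ terms to the sum while MLM contributes only about $0.15\,T$.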
Eq 9.3 — Span corruption loss (Section 9.4)
$$\mathcal{L}_{\text{span}} = -\sum_{t=1}^{|\mathbf{y}|} \log P(y_t \mid y_{<t}, \mathbf{x}_{\text{corrupt}}; \theta)$$
where the encoder processes the corrupted input $\mathbf{x}_{\text{corrupt}}$ and the decoder generates the target sequence $\mathbf{y}$ containing the original spans.
Eq 9.4 — Fine-tuning loss (Section 9.2, 9.5)
$$\mathcal{L}_{\text{fine-tune}} = -\sum_{i=1}^{N} \log P(y_i \mid \mathbf{x}_i; \theta_{\text{pre-trained}})$$
where $\theta_{\text{pre-trained}}$ is initialized from the pre-trained model and updated on the labeled task data.
Key Figures
- Pre-training/Fine-tuning pipeline — Flowchart: large unlabeled corpus → pre-training → pre-trained model → task-specific labeled data → fine-tuning → task model. (Architecture diagram, TikZ)
- BERT architecture + MLM — BERT encoder stack with masked input tokens, bidirectional context flow, and the MLM prediction head. (Architecture diagram, TikZ)
- GPT architecture + CLM — GPT decoder stack with causal masking triangle, left-to-right generation, and the language modeling head. (Architecture diagram, TikZ)
- T5 text-to-text framework — Multiple NLP tasks (classification, translation, summarization, QA) all cast as text-to-text with input/output examples. (Architecture diagram, TikZ)
- Paradigm comparison table — Encoder-only vs. decoder-only vs. encoder-decoder across directionality, objectives, strengths, weaknesses, and best task families. (Table)
- Transfer learning diagram — Side-by-side comparison of training from scratch vs. fine-tuning, showing data efficiency and performance curves. (Plot, Matplotlib)
- Fine-tuning performance curves — Learning curves for BERT, GPT, and T5 on a benchmark task, demonstrating convergence speed differences. (Line plot, Matplotlib)
Exercises
Theory (4 exercises)
- [Basic] Explain why MLM trains on only 15% of tokens per example while CLM trains on 100%. Compute how many training examples each method needs to produce the same number of gradient signals for a corpus of $N$ tokens.
- [Intermediate] Derive the gradient of the MLM loss with respect to the logits for a single masked position. Compare with the CLM gradient.
- [Intermediate] Explain why next sentence prediction (NSP) was dropped from RoBERTa and subsequent BERT variants. What does this tell us about pre-training objective design?
- [Advanced] Compare MLM and CLM in terms of information-theoretic efficiency. Per training example of length $T$, how many bits of information does each objective extract?
Programming (5 exercises)
- [Basic] Implement masked token prediction with BERT using the Hugging Face Transformers library. Mask tokens in 5 sentences and show BERT's top-5 predictions for each mask. Discuss the quality of the predictions.
- [Intermediate] Fine-tune BERT-base on a sentiment classification task (e.g., IMDB). Compare accuracy with a randomly initialized Transformer of the same size. Plot training and validation loss curves for both.
- [Intermediate] Fine-tune GPT-2 on a text classification task by reformulating it as text generation. Compare with the BERT fine-tuning approach in terms of accuracy, training time, and simplicity.
- [Advanced] Compare fine-tuning curves across model sizes: BERT-base (110M) vs. BERT-large (340M) on a GLUE task. Analyze convergence speed, final accuracy, and compute cost.
- [Advanced] Run the T5 text-to-text pipeline on a summarization dataset (e.g., CNN/DailyMail). Compare outputs from T5-small and T5-base. Evaluate with ROUGE scores.
Cross-References
This chapter references:
- Chapter 1 (Sections 1.1–1.2) — The prediction paradigm and the history of language modeling
- Chapter 8 (Sections 8.1–8.6) — The full Transformer architecture: self-attention, multi-head attention, positional encodings, encoder/decoder blocks
This chapter is referenced by:
- Chapter 11 (Section 11.1) — Scaling laws describe how pre-training loss decreases with model size, data, and compute
- Chapter 12 (Sections 12.1–12.6) — Alignment methods (RLHF, DPO) fine-tune the pre-trained models introduced here
- Chapter 13 (Section 13.1) — In-context learning emerges as a capability of large pre-trained LLMs, particularly the GPT family
Key Papers
- Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of NAACL-HLT, 4171–4186.
- Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving Language Understanding by Generative Pre-Training. OpenAI Technical Report.
- Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language Models are Unsupervised Multitask Learners. OpenAI Technical Report.
- Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., et al. (2020). Language Models are Few-Shot Learners. Advances in Neural Information Processing Systems, 33, 1877–1901.
- Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., et al. (2020). Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. JMLR, 21(140), 1–67.
- Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., et al. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv:1907.11692.