Pre-training Paradigms: BERT, GPT, and T5
Prerequisites
- Chapter 8: The Transformer Architecture — Self-attention, multi-head attention, positional encodings, encoder/decoder blocks. BERT uses the encoder, GPT uses the decoder, T5 uses the full encoder-decoder.
Summary
Chapter 9 is THE chapter that explains how Transformer architectures (Ch 8) became large language models. It presents the three foundational pre-training paradigms that transformed NLP between 2018 and 2020: BERT's masked language modeling, GPT's autoregressive language modeling, and T5's span corruption — each representing a different answer to the question "how should we train a Transformer?" The chapter's central insight is that language structure learned from massive unlabeled text transfers to downstream tasks, making the pre-train/fine-tune paradigm vastly more data-efficient than training from scratch. By the chapter's end, students understand why decoder-only models (GPT-style) won the scaling race: autoregressive training is simpler, generation is a more general capability than understanding, and in-context learning emerges as a free bonus at scale. This chapter is the pivot point of the book — everything before it builds toward it, and everything after builds upon it.
Learning Objectives
- Explain why pre-training on large unlabeled corpora followed by task-specific fine-tuning outperforms training from scratch, and quantify the data-efficiency gains with concrete examples
- Derive and implement the masked language modeling (MLM) objective used in BERT, including the 80/10/10 masking strategy, and contrast it with the causal language modeling (CLM) objective used in GPT
- Compare encoder-only (BERT), decoder-only (GPT), and encoder-decoder (T5) architectures in terms of their pre-training objectives, computational trade-offs, and downstream task suitability
- Fine-tune a pre-trained transformer on a text classification task using the Hugging Face Transformers library, evaluate transfer learning performance against a randomly initialized baseline, and interpret fine-tuning loss curves
Section Outline
9.1 The Pre-training Revolution (~5pp)
Why pre-train? The insight that language structure learned from massive unlabeled text transfers to downstream tasks. Historical shift from feature engineering to pre-trained representations, the ImageNet analogy, and the pivotal moment when ELMo, ULMFiT, and BERT demonstrated that pre-training works for NLP.
- 9.1.1 From task-specific to general-purpose representations
- 9.1.2 Transfer learning: the ImageNet moment for NLP
- 9.1.3 The pre-train / fine-tune paradigm
9.2 BERT and Masked Language Modeling (~7pp)
Architecture and training of BERT (Devlin et al., 2019). The bidirectional encoder design, the MLM objective with its 80/10/10 masking strategy, next sentence prediction (NSP) and why it was later dropped, and fine-tuning for downstream tasks.
- 9.2.1 BERT architecture and input representation
- 9.2.2 The masked language modeling objective
- 9.2.3 Next sentence prediction (and its limitations)
- 9.2.4 Fine-tuning BERT for downstream tasks
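The 80/10/10 corruption recipe described above can be sketched in a few lines. This is a minimal illustration in plain Python (not BERT's actual tokenizer-level implementation); the function name `mask_tokens` and the string sentinel `"[MASK]"` are illustrative choices, and real implementations operate on integer token IDs.

```python
import random

def mask_tokens(tokens, vocab, mask_token="[MASK]", mask_prob=0.15, seed=0):
    """Apply BERT-style MLM corruption: select ~15% of positions; of those,
    replace 80% with [MASK], 10% with a random vocabulary token, and keep
    10% unchanged. Returns the corrupted sequence and a dict mapping each
    selected position to its original token (the prediction targets)."""
    rng = random.Random(seed)
    corrupted = list(tokens)
    targets = {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok                       # loss is computed here only
            r = rng.random()
            if r < 0.8:
                corrupted[i] = mask_token          # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted[i] = rng.choice(vocab)   # 10%: random token
            # else: 10% keep the original token unchanged
    return corrupted, targets
```

Note that the model must predict the original token at *every* selected position, including the 10% left unchanged — this prevents it from learning that unmasked tokens are always correct as-is.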
9.3 GPT and Autoregressive Language Modeling (~7pp)
Architecture and training of GPT (Radford et al., 2018) and GPT-2/GPT-3. The decoder-only design, causal masking, the CLM objective, and the discovery that scale enables few-shot learning without fine-tuning.
- 9.3.1 GPT architecture and causal masking
- 9.3.2 The causal language modeling objective
- 9.3.3 From GPT to GPT-2 to GPT-3: the power of scale
- 9.3.4 Zero-shot and few-shot generalization
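The causal masking that Section 9.3.1 covers is mechanically simple: an additive mask zeroes out (sets to $-\infty$ before the softmax) every attention score from a position to any later position. A minimal NumPy sketch, assuming the convention that the mask is added to the raw attention scores:

```python
import numpy as np

def causal_mask(T):
    """Build a T x T additive attention mask: entry (i, j) is 0 where
    position i may attend to position j (j <= i) and -inf where it may
    not (j > i). Adding this to the score matrix before the softmax
    gives zero attention weight to all future positions."""
    return np.triu(np.full((T, T), -np.inf), k=1)
```

Applied to a score matrix, `softmax(scores + causal_mask(T))` yields the strictly left-to-right attention pattern drawn as the "causal masking triangle" in the GPT architecture figure.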
9.4 T5 and Encoder-Decoder Pre-training (~5pp)
The text-to-text transfer transformer (Raffel et al., 2020). Casting every NLP task as text-to-text, the span corruption pre-training objective, the C4 dataset, and systematic ablation findings.
- 9.4.1 The text-to-text framework
- 9.4.2 Span corruption objective
- 9.4.3 Lessons from systematic ablations
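The span corruption objective is easiest to see on a concrete example. Below is a simplified sketch: the function `corrupt_spans` is a hypothetical helper (T5's actual implementation samples span lengths and operates on token IDs), but the sentinel naming `<extra_id_k>` follows T5's convention.

```python
def corrupt_spans(tokens, spans):
    """Simplified T5-style span corruption. Each (start, end) span —
    assumed sorted and non-overlapping — is replaced in the encoder
    input by a unique sentinel token; the decoder target lists each
    sentinel followed by the original span contents."""
    inp, tgt, pos = [], [], 0
    for k, (s, e) in enumerate(spans):
        sentinel = f"<extra_id_{k}>"
        inp.extend(tokens[pos:s])   # copy unchanged tokens
        inp.append(sentinel)        # stand-in for the dropped span
        tgt.append(sentinel)        # target: sentinel, then the span
        tgt.extend(tokens[s:e])
        pos = e
    inp.extend(tokens[pos:])
    return inp, tgt
```

For the sequence "thank you for inviting me to your party" with spans covering "you" and "to your", the encoder sees `thank <extra_id_0> for inviting me <extra_id_1> party` and the decoder must produce `<extra_id_0> you <extra_id_1> to your`.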
9.5 Comparing Paradigms (~6pp)
Encoder-only vs. decoder-only vs. encoder-decoder: which architecture suits which task family? Why decoder-only models won the scaling race. Practical guidelines for model selection.
- 9.5.1 Architecture-task suitability matrix
- 9.5.2 Why decoder-only models won the scaling race
- 9.5.3 Practical guidelines for model selection
Key Equations
Eq 9.1 — Masked Language Modeling loss (Section 9.2)
$$\mathcal{L}_{\text{MLM}} = -\sum_{m \in \mathcal{M}} \log P(w_m \mid \mathbf{w}_{\backslash \mathcal{M}}; \theta)$$
where $\mathcal{M}$ is the set of masked positions and $\mathbf{w}_{\backslash \mathcal{M}}$ denotes the input sequence in which the tokens at positions in $\mathcal{M}$ have been corrupted (masked, randomized, or kept, per the 80/10/10 strategy).
Eq 9.2 — Causal Language Modeling loss (Section 9.3)
$$\mathcal{L}_{\text{CLM}} = -\sum_{t=1}^{T} \log P(w_t \mid w_1, \ldots, w_{t-1}; \theta)$$
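Both Eq 9.1 and Eq 9.2 are sums of per-token negative log-likelihoods; they differ only in which positions contribute. A small sketch (the function name is illustrative) showing how the summed loss relates to per-token perplexity, the standard way these losses are reported:

```python
import math

def nll_and_perplexity(token_probs):
    """Given the probability the model assigned to each target token,
    return the total negative log-likelihood (the summed loss in
    Eqs 9.1-9.2) and the per-token perplexity exp(NLL / count).
    For CLM, token_probs covers every position in the sequence;
    for MLM, only the ~15% of positions that were selected."""
    nll = -sum(math.log(p) for p in token_probs)
    ppl = math.exp(nll / len(token_probs))
    return nll, ppl
```

This makes the data-efficiency contrast of Exercise 1 concrete: per sequence of length $T$, CLM contributes $T$ terms to the sum while MLM contributes only about $0.15\,T$.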
Eq 9.3 — Span corruption loss (Section 9.4)
$$\mathcal{L}_{\text{span}} = -\sum_{t=1}^{|\mathbf{y}|} \log P(y_t \mid y_{<t}, \mathbf{x}_{\text{corrupt}}; \theta)$$
where the encoder processes the corrupted input $\mathbf{x}_{\text{corrupt}}$ and the decoder generates the target sequence $\mathbf{y}$ containing the original spans.
Eq 9.4 — Fine-tuning loss (Section 9.2, 9.5)
$$\mathcal{L}_{\text{fine-tune}} = -\sum_{i=1}^{N} \log P(y_i \mid \mathbf{x}_i; \theta_{\text{pre-trained}})$$
where $\theta_{\text{pre-trained}}$ is initialized from the pre-trained model and updated on the labeled task data.
Key Figures
- Pre-training/Fine-tuning pipeline — Flowchart: large unlabeled corpus → pre-training → pre-trained model → task-specific labeled data → fine-tuning → task model. (Architecture diagram, TikZ)
- BERT architecture + MLM — BERT encoder stack with masked input tokens, bidirectional context flow, and the MLM prediction head. (Architecture diagram, TikZ)
- GPT architecture + CLM — GPT decoder stack with causal masking triangle, left-to-right generation, and the language modeling head. (Architecture diagram, TikZ)
- T5 text-to-text framework — Multiple NLP tasks (classification, translation, summarization, QA) all cast as text-to-text with input/output examples. (Architecture diagram, TikZ)
- Paradigm comparison table — Encoder-only vs. decoder-only vs. encoder-decoder across directionality, objectives, strengths, weaknesses, and best task families. (Table)
- Transfer learning diagram — Side-by-side comparison of training from scratch vs. fine-tuning, showing data efficiency and performance curves. (Plot, Matplotlib)
- Fine-tuning performance curves — Learning curves for BERT, GPT, and T5 on a benchmark task, demonstrating convergence speed differences. (Line plot, Matplotlib)
Exercises
Theory (4 exercises)
- [Basic] Explain why MLM trains on only 15% of tokens per example while CLM trains on 100%. Compute how many training examples each method needs to produce the same number of gradient signals for a corpus of $N$ tokens.
- [Intermediate] Derive the gradient of the MLM loss with respect to the logits for a single masked position. Compare with the CLM gradient.
- [Intermediate] Explain why next sentence prediction (NSP) was dropped from RoBERTa and subsequent BERT variants. What does this tell us about pre-training objective design?
- [Advanced] Compare MLM and CLM in terms of information-theoretic efficiency. Per training example of length $T$, how many bits of information does each objective extract?
Programming (5 exercises)
- [Basic] Implement masked token prediction with BERT using the Hugging Face Transformers library. Mask tokens in 5 sentences and show BERT's top-5 predictions for each mask. Discuss the quality of the predictions.
- [Intermediate] Fine-tune BERT-base on a sentiment classification task (e.g., IMDB). Compare accuracy with a randomly initialized Transformer of the same size. Plot training and validation loss curves for both.
- [Intermediate] Fine-tune GPT-2 on a text classification task by reformulating it as text generation. Compare with the BERT fine-tuning approach in terms of accuracy, training time, and simplicity.
- [Advanced] Compare fine-tuning curves across model sizes: BERT-base (110M) vs. BERT-large (340M) on a GLUE task. Analyze convergence speed, final accuracy, and compute cost.
- [Advanced] Run the T5 text-to-text pipeline on a summarization dataset (e.g., CNN/DailyMail). Compare outputs from T5-small and T5-base. Evaluate with ROUGE scores.
Cross-References
This chapter references:
- Chapter 1 (Sections 1.1–1.2) — The prediction paradigm and the history of language modeling
- Chapter 8 (Sections 8.1–8.6) — The full Transformer architecture: self-attention, multi-head attention, positional encodings, encoder/decoder blocks
This chapter is referenced by:
- Chapter 11 (Section 11.1) — Scaling laws describe how pre-training loss decreases with model size, data, and compute
- Chapter 12 (Sections 12.1–12.6) — Alignment methods (RLHF, DPO) fine-tune the pre-trained models introduced here
- Chapter 13 (Section 13.1) — In-context learning emerges as a capability of large pre-trained LLMs, particularly the GPT family
Key Papers
- Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of NAACL-HLT, 4171–4186.
- Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving Language Understanding by Generative Pre-Training. OpenAI Technical Report.
- Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language Models are Unsupervised Multitask Learners. OpenAI Technical Report.
- Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., et al. (2020). Language Models are Few-Shot Learners. Advances in Neural Information Processing Systems, 33, 1877–1901.
- Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., et al. (2020). Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. JMLR, 21(140), 1–67.
- Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., et al. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv:1907.11692.