Alignment: RLHF, DPO, and Safety
Prerequisites
Summary
Chapter 12 addresses the most consequential transformation in modern NLP: converting a raw language model -- trained to predict the next token and therefore reproducing the full distribution of its training data, including toxic, harmful, and dishonest text -- into a helpful, harmless, and honest assistant. The chapter formalizes the alignment problem as the gap between the pre-training objective (minimize cross-entropy on internet text) and the deployment objective (be useful and safe), then presents the three-stage pipeline: supervised fine-tuning (SFT) on human-written instruction-response pairs, reward model training via the Bradley-Terry preference model, and policy optimization via either RLHF (PPO with KL penalty) or DPO (a simpler closed-form reparameterization that eliminates the explicit reward model). Constitutional AI extends this by replacing human feedback with AI feedback guided by principles. The chapter closes with deployment safety: jailbreak attacks, red-teaming methodology, and the fundamental helpfulness-safety trade-off.
Learning Objectives
- Explain why a pre-trained language model that achieves low perplexity is not automatically helpful, harmless, or honest, and articulate the alignment problem as the gap between the pre-training objective and the deployment objective.
- Derive the RLHF pipeline end-to-end: supervised fine-tuning (SFT), reward model training using the Bradley-Terry model, and PPO optimization with a KL penalty, and implement each stage in code.
- Derive the DPO objective from first principles, showing how it eliminates the need for an explicit reward model by reparameterizing the RLHF objective, and compare its computational and statistical properties to RLHF.
- Evaluate alignment techniques critically -- including Constitutional AI, RLAIF, and red-teaming -- and reason about their limitations, failure modes, and the open challenges in deployment safety.
Section Outline
12.1 The Alignment Problem (~3pp)
Why raw pre-trained LLMs are not safe or helpful: they mimic training data distributions, including toxic, biased, and factually incorrect text. The three H's: Helpful, Harmless, Honest (Askell et al., 2021). The gap between the pre-training objective and the deployment objective.
- 12.1.1 The Pre-training / Deployment Gap
- 12.1.2 Helpful, Harmless, Honest: Defining the Target
- 12.1.3 Why Better Data Is Not Enough
12.2 Instruction Tuning (~4pp)
Supervised fine-tuning (SFT) on human-written instruction-response pairs. FLAN and InstructGPT as landmark examples. How SFT teaches the model the format and style of a helpful assistant, and why SFT alone is insufficient.
- 12.2.1 Supervised Fine-Tuning on Instructions
- 12.2.2 FLAN and InstructGPT: Case Studies
- 12.2.3 Limitations of SFT
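The SFT step in 12.2.1 is ordinary next-token cross-entropy, with one detail worth making concrete: the loss is typically computed only over response tokens, with prompt positions masked out. A minimal sketch (illustrative code, not the chapter's reference implementation; it assumes PyTorch's `ignore_index=-100` convention):

```python
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, labels: torch.Tensor, prompt_len: int) -> torch.Tensor:
    """Next-token cross-entropy over response tokens only: prompt positions
    are set to ignore_index so the model learns to imitate the response,
    not to re-predict the instruction."""
    labels = labels.clone()
    labels[:, :prompt_len] = -100              # PyTorch's default ignore_index
    shift_logits = logits[:, :-1, :]           # predict token t+1 from position t
    shift_labels = labels[:, 1:]
    return F.cross_entropy(shift_logits.reshape(-1, shift_logits.size(-1)),
                           shift_labels.reshape(-1), ignore_index=-100)

# Toy usage: batch of 2, sequence length 8, vocabulary of 50, 3 prompt tokens.
torch.manual_seed(0)
logits = torch.randn(2, 8, 50)
labels = torch.randint(0, 50, (2, 8))
loss = sft_loss(logits, labels, prompt_len=3)
```

Masking the prompt is also why SFT teaches format and style rather than preference: every supervised target is a single "gold" response, which motivates the comparison data of 12.3.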
12.3 RLHF (~6pp)
The full Reinforcement Learning from Human Feedback pipeline. Collecting comparison data, training a reward model using the Bradley-Terry preference model, and optimizing the policy using PPO with a KL divergence penalty. Practical challenges: reward hacking, mode collapse, and PPO hyperparameter sensitivity.
- 12.3.1 Collecting Human Preference Data
- 12.3.2 Training the Reward Model (Bradley-Terry)
- 12.3.3 PPO Optimization with KL Penalty
- 12.3.4 Reward Hacking and Practical Challenges
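The two training steps in 12.3.2 and 12.3.3 can be sketched compactly. The following is a minimal illustration (function names and toy numbers are my own, not the chapter's), assuming the reward model emits one scalar per response and that, in the InstructGPT style, the KL penalty is applied per token with the reward-model score added at the final token:

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Reward-model loss: negative log-likelihood of the Bradley-Terry model,
    -log sigma(r(x, y_w) - r(x, y_l)), averaged over the batch."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

def kl_shaped_rewards(rm_score: float, policy_logps: torch.Tensor,
                      ref_logps: torch.Tensor, beta: float = 0.1) -> torch.Tensor:
    """Per-token rewards for PPO: -beta * (log pi_theta - log pi_ref) at every
    response token, plus the scalar reward-model score at the final token."""
    rewards = -beta * (policy_logps - ref_logps)
    rewards[-1] = rewards[-1] + rm_score
    return rewards

# Toy usage: scalar rewards for four preference pairs.
r_w = torch.tensor([1.2, 0.3, 2.0, -0.5])   # chosen-response rewards
r_l = torch.tensor([0.4, 0.1, 1.5, -1.0])   # rejected-response rewards
rm_loss = bradley_terry_loss(r_w, r_l)
```

Widening every chosen-rejected margin lowers the loss, which is exactly the gradient signal the reward model trains on; the KL term is what keeps PPO from exploiting that signal too aggressively (the reward hacking of 12.3.4).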
12.4 Direct Preference Optimization (DPO) (~5pp)
Reparameterizing the RLHF objective to eliminate the reward model entirely (Rafailov et al., 2023). Full derivation of the DPO loss from the KL-constrained RL objective. Comparison with RLHF and extensions: IPO, KTO.
- 12.4.1 From RLHF to DPO: The Reparameterization
- 12.4.2 The DPO Loss Derivation
- 12.4.3 DPO vs. RLHF: Trade-offs in Practice
- 12.4.4 Beyond DPO: IPO, KTO, and Variants
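As a companion to the derivation in 12.4.2, the resulting loss can be sketched in a few lines (illustrative code, not the chapter's reference implementation; each argument is assumed to be a per-sequence log-probability, summed over response tokens):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor, policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor, ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss (Rafailov et al., 2023): -log sigma(beta * (d_policy - d_ref)),
    where d = log p(y_w | x) - log p(y_l | x). No reward model appears: the
    policy/reference log-ratios play the role of an implicit reward."""
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    logits = beta * (pi_logratios - ref_logratios)
    return -F.logsigmoid(logits).mean()

# Toy usage: sequence-level log-probabilities for two preference pairs.
pi_w  = torch.tensor([-12.0, -8.5])
pi_l  = torch.tensor([-14.0, -9.0])
ref_w = torch.tensor([-13.0, -8.8])
ref_l = torch.tensor([-13.5, -8.9])
loss = dpo_loss(pi_w, pi_l, ref_w, ref_l, beta=0.1)
```

Note that `beta` here is the same KL coefficient as in the RLHF objective: larger values sharpen the implicit reward margin, and much of 12.4.3's practical comparison reduces to how each method behaves as it is tuned.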
12.5 Constitutional AI and RLAIF (~3pp)
Replacing human feedback with AI feedback guided by constitutional principles (Bai et al., 2022). The RLAIF pipeline, scalability advantages, and risks of bias amplification.
- 12.5.1 The Constitutional AI Framework
- 12.5.2 RLAIF: AI as the Preference Annotator
- 12.5.3 Scalability vs. Bias Amplification
12.6 Safety, Red-Teaming, and Guardrails (~4pp)
Deployment-time safety: jailbreak attacks and defenses, red-teaming methodology, safety evaluation benchmarks, and the helpfulness-safety trade-off.
- 12.6.1 Jailbreaks and Adversarial Attacks
- 12.6.2 Red-Teaming Methodology
- 12.6.3 Safety Evaluation and Benchmarks
- 12.6.4 The Helpfulness-Safety Trade-off
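The per-category bookkeeping implied by 12.6.2 and 12.6.3 (and by the red-teaming exercise later in the chapter) is simple enough to sketch directly; the category names below are illustrative, not a proposed taxonomy:

```python
from collections import defaultdict

def attack_success_rate(results: list[tuple[str, bool]]) -> dict[str, float]:
    """results: (category, succeeded) pairs from a red-teaming run.
    Returns the attack success rate (ASR) per category."""
    totals: dict[str, int] = defaultdict(int)
    hits: dict[str, int] = defaultdict(int)
    for category, succeeded in results:
        totals[category] += 1
        hits[category] += int(succeeded)
    return {c: hits[c] / totals[c] for c in totals}

# Toy usage with illustrative categories.
runs = [("jailbreak", True), ("jailbreak", False), ("prompt-injection", True)]
rates = attack_success_rate(runs)
# -> {"jailbreak": 0.5, "prompt-injection": 1.0}
```

Reporting ASR per category, rather than a single aggregate, is what makes the helpfulness-safety trade-off of 12.6.4 measurable: a defense can lower one category's ASR while leaving another untouched.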
Key Equations
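Candidate equations for this section, in their standard forms from the cited papers (notation: $\pi_\theta$ the policy, $\pi_{\mathrm{ref}}$ the SFT reference, $r_\phi$ the reward model, $\beta$ the KL coefficient):

```latex
% Bradley-Terry reward-model loss (Sec. 12.3.2)
\mathcal{L}_{\mathrm{RM}}(\phi) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}
  \big[ \log \sigma\big( r_\phi(x, y_w) - r_\phi(x, y_l) \big) \big]

% KL-regularized RLHF objective (Sec. 12.3.3)
J(\theta) = \mathbb{E}_{x,\; y \sim \pi_\theta}\big[ r_\phi(x, y) \big]
  - \beta\, \mathbb{D}_{\mathrm{KL}}\big( \pi_\theta(\cdot \mid x) \,\big\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)

% DPO loss (Sec. 12.4.2)
\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}
  \bigg[ \log \sigma\bigg( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
  - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \bigg) \bigg]
```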
Key Figures
Exercises
Theory
- The Alignment Gap (Basic). Explain why a language model trained to minimize cross-entropy on internet text is not automatically helpful, harmless, or honest. Give one concrete example for each property.
- DPO Derivation (Intermediate). Derive the DPO loss from the KL-constrained RL objective. Start with $J(\theta)$, solve for the optimal policy, substitute into the Bradley-Terry model, and show that the reward model cancels out.
- KL Penalty Analysis (Intermediate). Analyze the effect of $\beta$ on alignment. What happens as $\beta \to 0$? As $\beta \to \infty$? Why is there an optimal intermediate value?
- DPO-RLHF Equivalence (Intermediate). Show that the DPO optimum coincides with the optimum of the KL-constrained RLHF objective when preferences follow the Bradley-Terry model and the reward model is the maximum-likelihood Bradley-Terry reward.
- Intransitive Preferences (Advanced). The Bradley-Terry model assumes transitivity. Give an example where human preferences might be intransitive and explain the implications for reward model training.
Programming
- Reward Model Training (Basic). Implement the Bradley-Terry reward model loss in PyTorch. Train a simple reward model on 1000 synthetic preference pairs and report pairwise accuracy.
- DPO Fine-Tuning (Intermediate). Implement the DPO loss and fine-tune a small model on the Anthropic HH-RLHF dataset. Compare with the SFT-only baseline using LLM-as-judge evaluation.
- Red-Teaming Evaluation (Intermediate). Implement a red-teaming evaluation loop against 100 adversarial prompts. Categorize failures by type and compute attack success rate per category.
- Helpfulness-Safety Trade-off (Advanced). Train 5 DPO models with $\beta \in \{0.01, 0.05, 0.1, 0.5, 1.0\}$. Measure helpfulness and safety for each. Plot the Pareto frontier.
- Method Comparison (Advanced). Compare SFT-only, RLHF, and DPO on a conversational benchmark using GPT-4 as judge. Report pairwise win rates.
Cross-References
This chapter references:
- Ch 1 (Section 1.1): The prediction paradigm. Chapter 12 reveals the fundamental tension: a model trained purely to predict the next token will reproduce whatever patterns exist in its training data, including harmful ones.
- Ch 9 (Sections 9.1--9.3): Pre-training paradigms (BERT, GPT, T5). Chapter 9 produces the pre-trained models that Chapter 12 aligns. The SFT stage directly continues the fine-tuning paradigm from Chapter 9.
This chapter is referenced by:
- Ch 13 (soft): Understanding alignment helps explain why models follow instructions and produce coherent chain-of-thought reasoning.
Key Papers
- Ouyang, L., Wu, J., Jiang, X., et al. (2022). Training Language Models to Follow Instructions with Human Feedback. Advances in Neural Information Processing Systems (NeurIPS). [Sections 12.2, 12.3]
- Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., & Finn, C. (2023). Direct Preference Optimization: Your Language Model Is Secretly a Reward Model. Advances in Neural Information Processing Systems (NeurIPS). [Section 12.4]
- Askell, A., Bai, Y., Chen, A., et al. (2021). A General Language Assistant as a Laboratory for Alignment. arXiv:2112.00861. [Section 12.1.2]
- Bai, Y., Kadavath, S., Kundu, S., et al. (2022). Constitutional AI: Harmlessness from AI Feedback. arXiv:2212.08073. [Section 12.5]
- Perez, E., Huang, S., Song, F., et al. (2022). Red Teaming Language Models with Language Models. Proceedings of EMNLP. [Section 12.6.2]
- Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal Policy Optimization Algorithms. arXiv:1707.06347. [Section 12.3.3]
- Wei, J., Bosma, M., Zhao, V., et al. (2022). Finetuned Language Models Are Zero-Shot Learners. Proceedings of ICLR. [Section 12.2.2]
- Azar, M. G., et al. (2024). A General Theoretical Paradigm to Understand Learning from Human Feedback. Proceedings of AISTATS. [Section 12.4.4 -- IPO]
- Ethayarajh, K., et al. (2024). KTO: Model Alignment as Prospect Theoretic Optimization. arXiv:2402.01306. [Section 12.4.4 -- KTO]