Alignment: RLHF, DPO, and Safety
Prerequisites
Summary
Chapter 12 addresses the most consequential transformation in modern NLP: converting a raw language model -- trained to predict the next token and therefore reproducing the full distribution of its training data, including toxic, harmful, and dishonest text -- into a helpful, harmless, and honest assistant. The chapter formalizes the alignment problem as the gap between the pre-training objective (minimize cross-entropy on internet text) and the deployment objective (be useful and safe), then presents the three-stage pipeline: supervised fine-tuning (SFT) on human-written instruction-response pairs, reward model training via the Bradley-Terry preference model, and policy optimization via either RLHF (PPO with KL penalty) or DPO (a simpler closed-form reparameterization that eliminates the explicit reward model). Constitutional AI extends this by replacing human feedback with AI feedback guided by principles. The chapter closes with deployment safety: jailbreak attacks, red-teaming methodology, and the fundamental helpfulness-safety trade-off.
Learning Objectives
- Explain why a pre-trained language model that achieves low perplexity is not automatically helpful, harmless, or honest, and articulate the alignment problem as the gap between the pre-training objective and the deployment objective.
- Derive the RLHF pipeline end-to-end: supervised fine-tuning (SFT), reward model training using the Bradley-Terry model, and PPO optimization with a KL penalty, and implement each stage in code.
- Derive the DPO objective from first principles, showing how it eliminates the need for an explicit reward model by reparameterizing the RLHF objective, and compare its computational and statistical properties to RLHF.
- Evaluate alignment techniques critically -- including Constitutional AI, RLAIF, and red-teaming -- and reason about their limitations, failure modes, and the open challenges in deployment safety.
Section Outline
12.1 The Alignment Problem (~3pp)
Why raw pre-trained LLMs are not safe or helpful: they mimic training data distributions, including toxic, biased, and factually incorrect text. The three H's: Helpful, Harmless, Honest (Askell et al., 2021). The gap between the pre-training objective and the deployment objective.
- 12.1.1 The Pre-training / Deployment Gap
- 12.1.2 Helpful, Harmless, Honest: Defining the Target
- 12.1.3 Why Better Data Is Not Enough
12.2 Instruction Tuning (~4pp)
Supervised fine-tuning (SFT) on human-written instruction-response pairs. FLAN and InstructGPT as landmark examples. How SFT teaches the model the format and style of a helpful assistant, and why SFT alone is insufficient.
- 12.2.1 Supervised Fine-Tuning on Instructions
- 12.2.2 FLAN and InstructGPT: Case Studies
- 12.2.3 Limitations of SFT
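The SFT step in 12.2.1 is ordinary next-token cross-entropy, with one detail worth making concrete: the loss is typically computed only over response tokens, with prompt positions masked out. A minimal sketch (illustrative code, not the chapter's reference implementation; it assumes PyTorch's `ignore_index=-100` convention):

```python
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, labels: torch.Tensor, prompt_len: int) -> torch.Tensor:
    """Next-token cross-entropy over response tokens only: prompt positions
    are set to ignore_index so the model learns to imitate the response,
    not to re-predict the instruction."""
    labels = labels.clone()
    labels[:, :prompt_len] = -100              # PyTorch's default ignore_index
    shift_logits = logits[:, :-1, :]           # predict token t+1 from position t
    shift_labels = labels[:, 1:]
    return F.cross_entropy(shift_logits.reshape(-1, shift_logits.size(-1)),
                           shift_labels.reshape(-1), ignore_index=-100)

# Toy usage: batch of 2, sequence length 8, vocabulary of 50, 3 prompt tokens.
torch.manual_seed(0)
logits = torch.randn(2, 8, 50)
labels = torch.randint(0, 50, (2, 8))
loss = sft_loss(logits, labels, prompt_len=3)
```

Masking the prompt is also why SFT teaches format and style rather than preference: every supervised target is a single "gold" response, which motivates the comparison data of 12.3.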
12.3 RLHF (~6pp)
The full Reinforcement Learning from Human Feedback pipeline. Collecting comparison data, training a reward model using the Bradley-Terry preference model, and optimizing the policy using PPO with a KL divergence penalty. Practical challenges: reward hacking, mode collapse, and PPO hyperparameter sensitivity.
- 12.3.1 Collecting Human Preference Data
- 12.3.2 Training the Reward Model (Bradley-Terry)
- 12.3.3 PPO Optimization with KL Penalty
- 12.3.4 Reward Hacking and Practical Challenges
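The two training steps in 12.3.2 and 12.3.3 can be sketched compactly. The following is a minimal illustration (function names and toy numbers are my own, not the chapter's), assuming the reward model emits one scalar per response and that, in the InstructGPT style, the KL penalty is applied per token with the reward-model score added at the final token:

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Reward-model loss: negative log-likelihood of the Bradley-Terry model,
    -log sigma(r(x, y_w) - r(x, y_l)), averaged over the batch."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

def kl_shaped_rewards(rm_score: float, policy_logps: torch.Tensor,
                      ref_logps: torch.Tensor, beta: float = 0.1) -> torch.Tensor:
    """Per-token rewards for PPO: -beta * (log pi_theta - log pi_ref) at every
    response token, plus the scalar reward-model score at the final token."""
    rewards = -beta * (policy_logps - ref_logps)
    rewards[-1] = rewards[-1] + rm_score
    return rewards

# Toy usage: scalar rewards for four preference pairs.
r_w = torch.tensor([1.2, 0.3, 2.0, -0.5])   # chosen-response rewards
r_l = torch.tensor([0.4, 0.1, 1.5, -1.0])   # rejected-response rewards
rm_loss = bradley_terry_loss(r_w, r_l)
```

Widening every chosen-rejected margin lowers the loss, which is exactly the gradient signal the reward model trains on; the KL term is what keeps PPO from exploiting that signal too aggressively (the reward hacking of 12.3.4).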
12.4 Direct Preference Optimization (DPO) (~5pp)
Reparameterizing the RLHF objective to eliminate the reward model entirely (Rafailov et al., 2023). Full derivation of the DPO loss from the KL-constrained RL objective. Comparison with RLHF and extensions: IPO, KTO.
- 12.4.1 From RLHF to DPO: The Reparameterization
- 12.4.2 The DPO Loss Derivation
- 12.4.3 DPO vs. RLHF: Trade-offs in Practice
- 12.4.4 Beyond DPO: IPO, KTO, and Variants
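As a companion to the derivation in 12.4.2, the resulting loss can be sketched in a few lines (illustrative code, not the chapter's reference implementation; each argument is assumed to be a per-sequence log-probability, summed over response tokens):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor, policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor, ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss (Rafailov et al., 2023): -log sigma(beta * (d_policy - d_ref)),
    where d = log p(y_w | x) - log p(y_l | x). No reward model appears: the
    policy/reference log-ratios play the role of an implicit reward."""
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    logits = beta * (pi_logratios - ref_logratios)
    return -F.logsigmoid(logits).mean()

# Toy usage: sequence-level log-probabilities for two preference pairs.
pi_w  = torch.tensor([-12.0, -8.5])
pi_l  = torch.tensor([-14.0, -9.0])
ref_w = torch.tensor([-13.0, -8.8])
ref_l = torch.tensor([-13.5, -8.9])
loss = dpo_loss(pi_w, pi_l, ref_w, ref_l, beta=0.1)
```

Note that `beta` here is the same KL coefficient as in the RLHF objective: larger values sharpen the implicit reward margin, and much of 12.4.3's practical comparison reduces to how each method behaves as it is tuned.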
12.5 Constitutional AI and RLAIF (~3pp)
Replacing human feedback with AI feedback guided by constitutional principles (Bai et al., 2022). The RLAIF pipeline, scalability advantages, and risks of bias amplification.
- 12.5.1 The Constitutional AI Framework
- 12.5.2 RLAIF: AI as the Preference Annotator
- 12.5.3 Scalability vs. Bias Amplification
12.6 Safety, Red-Teaming, and Guardrails (~4pp)
Deployment-time safety: jailbreak attacks and defenses, red-teaming methodology, safety evaluation benchmarks, and the helpfulness-safety trade-off.
- 12.6.1 Jailbreaks and Adversarial Attacks
- 12.6.2 Red-Teaming Methodology
- 12.6.3 Safety Evaluation and Benchmarks
- 12.6.4 The Helpfulness-Safety Trade-off
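The per-category bookkeeping implied by 12.6.2 and 12.6.3 (and by the red-teaming exercise later in the chapter) is simple enough to sketch directly; the category names below are illustrative, not a proposed taxonomy:

```python
from collections import defaultdict

def attack_success_rate(results: list[tuple[str, bool]]) -> dict[str, float]:
    """results: (category, succeeded) pairs from a red-teaming run.
    Returns the attack success rate (ASR) per category."""
    totals: dict[str, int] = defaultdict(int)
    hits: dict[str, int] = defaultdict(int)
    for category, succeeded in results:
        totals[category] += 1
        hits[category] += int(succeeded)
    return {c: hits[c] / totals[c] for c in totals}

# Toy usage with illustrative categories.
runs = [("jailbreak", True), ("jailbreak", False), ("prompt-injection", True)]
rates = attack_success_rate(runs)
# -> {"jailbreak": 0.5, "prompt-injection": 1.0}
```

Reporting ASR per category, rather than a single aggregate, is what makes the helpfulness-safety trade-off of 12.6.4 measurable: a defense can lower one category's ASR while leaving another untouched.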
Key Equations
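Candidate equations for this section, in their standard forms from the cited papers (notation: $\pi_\theta$ the policy, $\pi_{\mathrm{ref}}$ the SFT reference, $r_\phi$ the reward model, $\beta$ the KL coefficient):

```latex
% Bradley-Terry reward-model loss (Sec. 12.3.2)
\mathcal{L}_{\mathrm{RM}}(\phi) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}
  \big[ \log \sigma\big( r_\phi(x, y_w) - r_\phi(x, y_l) \big) \big]

% KL-regularized RLHF objective (Sec. 12.3.3)
J(\theta) = \mathbb{E}_{x,\; y \sim \pi_\theta}\big[ r_\phi(x, y) \big]
  - \beta\, \mathbb{D}_{\mathrm{KL}}\big( \pi_\theta(\cdot \mid x) \,\big\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)

% DPO loss (Sec. 12.4.2)
\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}
  \bigg[ \log \sigma\bigg( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
  - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \bigg) \bigg]
```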
Key Figures
Exercises
Theory
- The Alignment Gap (Basic). Explain why a language model trained to minimize cross-entropy on internet text is not automatically helpful, harmless, or honest. Give one concrete example for each property.
- DPO Derivation (Intermediate). Derive the DPO loss from the KL-constrained RL objective. Start with $J(\theta)$, solve for the optimal policy, substitute into the Bradley-Terry model, and show that the reward model cancels out.
- KL Penalty Analysis (Intermediate). Analyze the effect of $\beta$ on alignment. What happens as $\beta \to 0$? As $\beta \to \infty$? Why is there an optimal intermediate value?
- DPO-RLHF Equivalence (Intermediate). Show that the DPO optimum coincides with the optimum of the KL-constrained RLHF objective when preferences follow the Bradley-Terry model and the reward model is the maximum-likelihood Bradley-Terry reward.
- Intransitive Preferences (Advanced). The Bradley-Terry model assumes transitivity. Give an example where human preferences might be intransitive and explain the implications for reward model training.
Programming
- Reward Model Training (Basic). Implement the Bradley-Terry reward model loss in PyTorch. Train a simple reward model on 1000 synthetic preference pairs and report pairwise accuracy.
- DPO Fine-Tuning (Intermediate). Implement the DPO loss and fine-tune a small model on the Anthropic HH-RLHF dataset. Compare with the SFT-only baseline using LLM-as-judge evaluation.
- Red-Teaming Evaluation (Intermediate). Implement a red-teaming evaluation loop against 100 adversarial prompts. Categorize failures by type and compute attack success rate per category.
- Helpfulness-Safety Trade-off (Advanced). Train 5 DPO models with $\beta \in \{0.01, 0.05, 0.1, 0.5, 1.0\}$. Measure helpfulness and safety for each. Plot the Pareto frontier.
- Method Comparison (Advanced). Compare SFT-only, RLHF, and DPO on a conversational benchmark using GPT-4 as judge. Report pairwise win rates.
Cross-References
This chapter references:
- Ch 1 (Section 1.1): The prediction paradigm. Chapter 12 reveals the fundamental tension: a model trained purely to predict the next token will reproduce whatever patterns exist in its training data, including harmful ones.
- Ch 9 (Sections 9.1--9.3): Pre-training paradigms (BERT, GPT, T5). Chapter 9 produces the pre-trained models that Chapter 12 aligns. The SFT stage directly continues the fine-tuning paradigm from Chapter 9.
This chapter is referenced by:
- Ch 13 (soft): Understanding alignment helps explain why models follow instructions and produce coherent chain-of-thought reasoning.
Key Papers
- Ouyang, L., Wu, J., Jiang, X., et al. (2022). Training Language Models to Follow Instructions with Human Feedback. Advances in Neural Information Processing Systems (NeurIPS). [Sections 12.2, 12.3]
- Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., & Finn, C. (2023). Direct Preference Optimization: Your Language Model Is Secretly a Reward Model. Advances in Neural Information Processing Systems (NeurIPS). [Section 12.4]
- Askell, A., Bai, Y., Chen, A., et al. (2021). A General Language Assistant as a Laboratory for Alignment. arXiv:2112.00861. [Section 12.1.2]
- Bai, Y., Kadavath, S., Kundu, S., et al. (2022). Constitutional AI: Harmlessness from AI Feedback. arXiv:2212.08073. [Section 12.5]
- Perez, E., Huang, S., Song, F., et al. (2022). Red Teaming Language Models with Language Models. Proceedings of EMNLP. [Section 12.6.2]
- Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal Policy Optimization Algorithms. arXiv:1707.06347. [Section 12.3.3]
- Wei, J., Bosma, M., Zhao, V., et al. (2022). Finetuned Language Models Are Zero-Shot Learners. Proceedings of ICLR. [Section 12.2.2]
- Azar, M. G., et al. (2024). A General Theoretical Paradigm to Understand Learning from Human Feedback. Proceedings of AISTATS. [Section 12.4.4 -- IPO]
- Ethayarajh, K., et al. (2024). KTO: Model Alignment as Prospect Theoretic Optimization. arXiv:2402.01306. [Section 12.4.4 -- KTO]