Table of Contents
All 15 chapters organized by Part, with section outlines and page estimates.
Part I: Foundations
Chapters 1–3 · Probability, information theory, and classical n-gram models
Introduction
Sets the stage: why predicting the next word is the central task of language modeling.
Section outline
- 1.1 The Prediction Paradigm (~5 pp)
- 1.2 A Brief History of Language Modeling (~6 pp)
- 1.3 How This Book Is Organized (~5 pp)
- 1.4 Prerequisites and Notation (~4 pp)
Mathematical Foundations
Probability, information theory, entropy, cross-entropy, and perplexity — the evaluation toolkit.
Section outline
- 2.1 Probability and Conditional Probability (~5 pp)
- 2.2 Maximum Likelihood Estimation (~5 pp)
- 2.3 Information Theory Essentials (~7 pp)
- 2.4 Evaluation Metrics (~4 pp)
- 2.5 Optimization Basics (~4 pp)
Classical Language Models
N-gram models, smoothing techniques, and the limitations that motivated neural approaches.
Section outline
- 3.1 N-gram Language Models (~6 pp)
- 3.2 Smoothing Techniques (~6 pp)
- 3.3 Language Model Evaluation (~4 pp)
- 3.4 Limitations of Count-Based Models (~4 pp)
Part II: Neural Language Models
Chapters 4–7 · Embeddings, RNNs, attention, and sequence-to-sequence models
Word Representations
From one-hot encodings to learned embeddings such as Word2Vec and GloVe — the distributional hypothesis in practice.
Section outline
- 4.1 Sparse Representations (~3 pp)
- 4.2 Distributional Semantics (~3 pp)
- 4.3 Word2Vec (~6 pp)
- 4.4 GloVe and FastText (~4 pp)
- 4.5 Evaluating Embeddings (~4 pp)
Sequence Models
RNNs, LSTMs, GRUs: how recurrent networks process variable-length sequences for language.
Section outline
- 5.1 Vanilla RNNs (~5 pp)
- 5.2 The Vanishing Gradient Problem (~5 pp)
- 5.3 Long Short-Term Memory (LSTM) (~6 pp)
- 5.4 Gated Recurrent Units (GRUs) (~4 pp)
- 5.5 Neural Language Models (~5 pp)
The Attention Revolution
Bahdanau and Luong attention, self-attention, and the mechanism that changed everything.
Section outline
- 6.1 Motivation: The Bottleneck Problem (~4 pp)
- 6.2 Bahdanau (Additive) Attention (~6 pp)
- 6.3 Luong (Multiplicative) Attention (~4 pp)
- 6.4 Self-Attention (~5 pp)
- 6.5 Attention as a General Mechanism (~6 pp)
Sequence-to-Sequence and Decoding
The encoder-decoder framework, teacher forcing, beam search, and nucleus sampling.
Section outline
- 7.1 Encoder-Decoder Architecture (~5 pp)
- 7.2 Teacher Forcing and Exposure Bias (~4 pp)
- 7.3 Decoding Strategies (~5 pp)
- 7.4 Evaluation of Generated Text (~3 pp)
- 7.5 Machine Translation as a Case Study (~3 pp)
Part III: The Transformer Revolution
Chapters 8–11 · The Transformer architecture, pre-training, tokenization, and scaling
The Transformer Architecture
The book's centerpiece: scaled dot-product attention, multi-head attention, positional encodings, and the full Transformer block.
Section outline
- 8.1 From Recurrence to Attention (~4 pp)
- 8.2 Scaled Dot-Product Attention (~6 pp)
- 8.3 Multi-Head Attention (~5 pp)
- 8.4 Positional Encodings (~5 pp)
- 8.5 The Full Transformer Block (~6 pp)
- 8.6 Encoder, Decoder, and Encoder-Decoder Variants (~4 pp)
Pre-training Paradigms
BERT, GPT, T5: how masked, causal, and span-corruption objectives create powerful foundation models.
Section outline
- 9.1 The Pre-training Revolution (~5 pp)
- 9.2 BERT and Masked Language Modeling (~7 pp)
- 9.3 GPT and Autoregressive Language Modeling (~7 pp)
- 9.4 T5 and Encoder-Decoder Pre-training (~5 pp)
- 9.5 Comparing Paradigms (~6 pp)
Tokenization and Data at Scale
BPE, WordPiece, Unigram: subword tokenization strategies and data preprocessing for large models.
Section outline
- 10.1 From Words to Subwords (~3 pp)
- 10.2 Byte-Pair Encoding (BPE) (~5 pp)
- 10.3 SentencePiece and Unigram (~4 pp)
- 10.4 The Impact of Tokenization (~4 pp)
- 10.5 Data Curation at Scale (~4 pp)
Scaling Laws and Emergence
Kaplan and Chinchilla scaling laws, compute-optimal training, MoE, and emergent abilities.
Section outline
- 11.1 Scaling Laws (~5 pp)
- 11.2 Emergent Abilities (~4 pp)
- 11.3 Mixture of Experts (MoE) (~4 pp)
- 11.4 Efficient Training (~4 pp)
- 11.5 The Compute Frontier (~3 pp)
Part IV: Frontiers
Chapters 12–15 · Alignment, prompting, agents, and responsible AI
Alignment (RLHF, DPO)
Reward modeling, PPO, DPO, Constitutional AI — making LLMs helpful, harmless, and honest.
Section outline
- 12.1 The Alignment Problem (~3 pp)
- 12.2 Instruction Tuning (~4 pp)
- 12.3 RLHF (~6 pp)
- 12.4 Direct Preference Optimization (DPO) (~5 pp)
- 12.5 Constitutional AI and RLAIF (~3 pp)
- 12.6 Safety, Red-Teaming, and Guardrails (~4 pp)
In-Context Learning
Few-shot learning, chain-of-thought prompting, self-consistency, and the ICL mechanism debate.
Section outline
- 13.1 Few-Shot In-Context Learning (~5 pp)
- 13.2 Prompt Engineering (~6 pp)
- 13.3 Chain-of-Thought Reasoning (~6 pp)
- 13.4 Tool Use and Function Calling (~4 pp)
- 13.5 The Limits of Prompting (~4 pp)
Retrieval, Agents, Multimodality
RAG, ReAct agents, vision-language models such as CLIP, long contexts, and efficient inference.
Section outline
- 14.1 Retrieval-Augmented Generation (RAG) (~5 pp)
- 14.2 Agents and Planning (~4 pp)
- 14.3 Vision-Language Models (~4 pp)
- 14.4 Long-Context Models (~3 pp)
- 14.5 Efficient Inference (~4 pp)
Ethics, Society, Future
Carbon footprint, memorization, bias, watermarking, and responsible AI development.
Section outline
- 15.1 Bias and Fairness (~4 pp)
- 15.2 Privacy and Memorization (~3 pp)
- 15.3 Environmental Impact (~3 pp)
- 15.4 Intellectual Property and Regulation (~3 pp)
- 15.5 The Future of Language Models (~2 pp)