Table of Contents

All 15 chapters organized by Part, with section outlines, depth indicators, and page estimates.

Part I: Foundations

Chapters 1–3 · Probability, information theory, and classical n-gram models

Chapter 1

Introduction

Moderate · Phase 2 · ~20 pp

Sets the stage: why predicting the next word is the central task of language modeling.

Section outline
  • 1.1 The Prediction Paradigm (~5 pp)
  • 1.2 A Brief History of Language Modeling (~6 pp)
  • 1.3 How This Book Is Organized (~5 pp)
  • 1.4 Prerequisites and Notation (~4 pp)
Chapter 2

Moderate · Phase 2 · ~25 pp · Critical Path

Probability, information theory, entropy, cross-entropy, and perplexity — the evaluation toolkit.

Section outline
  • 2.1 Probability and Conditional Probability (~5 pp)
  • 2.2 Maximum Likelihood Estimation (~5 pp)
  • 2.3 Information Theory Essentials (~7 pp)
  • 2.4 Evaluation Metrics (~4 pp)
  • 2.5 Optimization Basics (~4 pp)
Chapter 3

Moderate · Phase 2 · ~20 pp

N-gram models, smoothing techniques, and the limitations that motivated neural approaches.

Section outline
  • 3.1 N-gram Language Models (~6 pp)
  • 3.2 Smoothing Techniques (~6 pp)
  • 3.3 Language Model Evaluation (~4 pp)
  • 3.4 Limitations of Count-Based Models (~4 pp)

Part II: Neural Language Models

Chapters 4–7 · Embeddings, RNNs, attention, and sequence-to-sequence models

Chapter 4

Moderate · Phase 2 · ~20 pp

From one-hot encodings to Word2Vec, GloVe, and learned embeddings — the distributional hypothesis.

Section outline
  • 4.1 Sparse Representations (~3 pp)
  • 4.2 Distributional Semantics (~3 pp)
  • 4.3 Word2Vec (~6 pp)
  • 4.4 GloVe and FastText (~4 pp)
  • 4.5 Evaluating Embeddings (~4 pp)
Chapter 5

Sequence Models

Moderate · Phase 2 · ~25 pp · Critical Path

RNNs, LSTMs, GRUs: how recurrent networks process variable-length sequences for language.

Section outline
  • 5.1 Vanilla RNNs (~5 pp)
  • 5.2 The Vanishing Gradient Problem (~5 pp)
  • 5.3 Long Short-Term Memory (LSTM) (~6 pp)
  • 5.4 Gated Recurrent Units (GRUs) (~4 pp)
  • 5.5 Neural Language Models (~5 pp)
Chapter 6

Deep · Phase 1 · ~25 pp · Critical Path

Bahdanau and Luong attention, self-attention, and the mechanism that changed everything.

Section outline
  • 6.1 Motivation: The Bottleneck Problem (~4 pp)
  • 6.2 Bahdanau (Additive) Attention (~6 pp)
  • 6.3 Luong (Multiplicative) Attention (~4 pp)
  • 6.4 Self-Attention (~5 pp)
  • 6.5 Attention as a General Mechanism (~6 pp)
Chapter 7

Moderate · Phase 2 · ~20 pp

The encoder-decoder framework, teacher forcing, beam search, and nucleus sampling.

Section outline
  • 7.1 Encoder-Decoder Architecture (~5 pp)
  • 7.2 Teacher Forcing and Exposure Bias (~4 pp)
  • 7.3 Decoding Strategies (~5 pp)
  • 7.4 Evaluation of Generated Text (~3 pp)
  • 7.5 Machine Translation as a Case Study (~3 pp)

Part III: The Transformer Revolution

Chapters 8–11 · The architecture, pre-training, tokenization, and scaling

Chapter 8

Deep · Phase 1 · ~30 pp · Critical Path

The book's centerpiece: scaled dot-product attention, multi-head attention, positional encodings, and the full Transformer block.

Section outline
  • 8.1 From Recurrence to Attention (~4 pp)
  • 8.2 Scaled Dot-Product Attention (~6 pp)
  • 8.3 Multi-Head Attention (~5 pp)
  • 8.4 Positional Encodings (~5 pp)
  • 8.5 The Full Transformer Block (~6 pp)
  • 8.6 Encoder, Decoder, and Encoder-Decoder Variants (~4 pp)
Chapter 9

Deep · Phase 1 · ~30 pp · Critical Path

BERT, GPT, T5: how masked and causal language modeling objectives create powerful foundations.

Section outline
  • 9.1 The Pre-training Revolution (~5 pp)
  • 9.2 BERT and Masked Language Modeling (~7 pp)
  • 9.3 GPT and Autoregressive Language Modeling (~7 pp)
  • 9.4 T5 and Encoder-Decoder Pre-training (~5 pp)
  • 9.5 Comparing Paradigms (~6 pp)
Chapter 10

Moderate · Phase 2 · ~20 pp

BPE, WordPiece, Unigram: subword tokenization strategies and data preprocessing for large models.

Section outline
  • 10.1 From Words to Subwords (~3 pp)
  • 10.2 Byte-Pair Encoding (BPE) (~5 pp)
  • 10.3 SentencePiece and Unigram (~4 pp)
  • 10.4 The Impact of Tokenization (~4 pp)
  • 10.5 Data Curation at Scale (~4 pp)
Chapter 11

Moderate · Phase 2 · ~20 pp

Kaplan and Chinchilla scaling laws, compute-optimal training, MoE, and emergent abilities.

Section outline
  • 11.1 Scaling Laws (~5 pp)
  • 11.2 Emergent Abilities (~4 pp)
  • 11.3 Mixture of Experts (MoE) (~4 pp)
  • 11.4 Efficient Training (~4 pp)
  • 11.5 The Compute Frontier (~3 pp)

Part IV: Frontiers

Chapters 12–15 · Alignment, prompting, agents, and responsible AI

Chapter 12

Deep · Phase 1 · ~25 pp

Reward modeling, PPO, DPO, Constitutional AI — making LLMs helpful, harmless, and honest.

Section outline
  • 12.1 The Alignment Problem (~3 pp)
  • 12.2 Instruction Tuning (~4 pp)
  • 12.3 RLHF (~6 pp)
  • 12.4 Direct Preference Optimization (DPO) (~5 pp)
  • 12.5 Constitutional AI and RLAIF (~3 pp)
  • 12.6 Safety, Red-Teaming, and Guardrails (~4 pp)
Chapter 13

In-Context Learning

Moderate · Phase 3 · ~25 pp

Few-shot learning, chain-of-thought prompting, self-consistency, and the ICL mechanism debate.

Section outline
  • 13.1 In-Context Learning (~5 pp)
  • 13.2 Prompt Engineering (~6 pp)
  • 13.3 Chain-of-Thought Reasoning (~6 pp)
  • 13.4 Tool Use and Function Calling (~4 pp)
  • 13.5 The Limits of Prompting (~4 pp)
Chapter 14

Moderate · Phase 3 · ~20 pp

RAG, ReAct agents, tool use, CLIP, and vision-language models.

Section outline
  • 14.1 Retrieval-Augmented Generation (RAG) (~5 pp)
  • 14.2 Agents and Planning (~4 pp)
  • 14.3 Vision-Language Models (~4 pp)
  • 14.4 Long-Context Models (~3 pp)
  • 14.5 Efficient Inference (~4 pp)
Chapter 15

Moderate · Phase 3 · ~15 pp

Carbon footprint, memorization, bias, watermarking, and responsible AI development.

Section outline
  • 15.1 Bias and Fairness (~4 pp)
  • 15.2 Privacy and Memorization (~3 pp)
  • 15.3 Environmental Impact (~3 pp)
  • 15.4 Intellectual Property and Regulation (~3 pp)
  • 15.5 The Future of Language Models (~2 pp)