Table of Contents
All 15 chapters organized by Part, with section outlines and page estimates.
Part I: Foundations
Chapters 1–3 · Probability, information theory, and classical n-gram models
Introduction
Sets the stage: why predicting the next word is the central task of language modeling.
Section outline
- 1.1 The Prediction Paradigm (~5 pp)
- 1.2 A Brief History of Language Modeling (~6 pp)
- 1.3 How This Book Is Organized (~5 pp)
- 1.4 Prerequisites and Notation (~4 pp)
Mathematical Foundations
Probability, information theory, entropy, cross-entropy, and perplexity — the evaluation toolkit.
Section outline
- 2.1 Probability and Conditional Probability (~5 pp)
- 2.2 Maximum Likelihood Estimation (~5 pp)
- 2.3 Information Theory Essentials (~7 pp)
- 2.4 Evaluation Metrics (~4 pp)
- 2.5 Optimization Basics (~4 pp)
Classical Language Models
N-gram models, smoothing techniques, and the limitations that motivated neural approaches.
Section outline
- 3.1 N-gram Language Models (~6 pp)
- 3.2 Smoothing Techniques (~6 pp)
- 3.3 Language Model Evaluation (~4 pp)
- 3.4 Limitations of Count-Based Models (~4 pp)
Part II: Neural Language Models
Chapters 4–7 · Embeddings, RNNs, attention, and sequence-to-sequence models
Word Representations
From one-hot encodings to learned embeddings such as Word2Vec and GloVe — the distributional hypothesis in practice.
Section outline
- 4.1 Sparse Representations (~3 pp)
- 4.2 Distributional Semantics (~3 pp)
- 4.3 Word2Vec (~6 pp)
- 4.4 GloVe and FastText (~4 pp)
- 4.5 Evaluating Embeddings (~4 pp)
Sequence Models
RNNs, LSTMs, GRUs: how recurrent networks process variable-length sequences for language.
Section outline
- 5.1 Vanilla RNNs (~5 pp)
- 5.2 The Vanishing Gradient Problem (~5 pp)
- 5.3 Long Short-Term Memory (LSTM) (~6 pp)
- 5.4 Gated Recurrent Units (GRUs) (~4 pp)
- 5.5 Neural Language Models (~5 pp)
The Attention Revolution
Bahdanau and Luong attention, self-attention, and the mechanism that changed everything.
Section outline
- 6.1 Motivation: The Bottleneck Problem (~4 pp)
- 6.2 Bahdanau (Additive) Attention (~6 pp)
- 6.3 Luong (Multiplicative) Attention (~4 pp)
- 6.4 Self-Attention (~5 pp)
- 6.5 Attention as a General Mechanism (~6 pp)
Sequence-to-Sequence and Decoding
The encoder-decoder framework, teacher forcing, beam search, and nucleus sampling.
Section outline
- 7.1 Encoder-Decoder Architecture (~5 pp)
- 7.2 Teacher Forcing and Exposure Bias (~4 pp)
- 7.3 Decoding Strategies (~5 pp)
- 7.4 Evaluation of Generated Text (~3 pp)
- 7.5 Machine Translation as a Case Study (~3 pp)
Part III: The Transformer Revolution
Chapters 8–11 · The Transformer architecture, pre-training, tokenization, and scaling
The Transformer Architecture
The book's centerpiece: scaled dot-product attention, multi-head attention, positional encodings, and the full Transformer block.
Section outline
- 8.1 From Recurrence to Attention (~4 pp)
- 8.2 Scaled Dot-Product Attention (~6 pp)
- 8.3 Multi-Head Attention (~5 pp)
- 8.4 Positional Encodings (~5 pp)
- 8.5 The Full Transformer Block (~6 pp)
- 8.6 Encoder, Decoder, and Encoder-Decoder Variants (~4 pp)
Pre-training Paradigms
BERT, GPT, T5: how masked, causal, and span-corruption objectives create powerful foundation models.
Section outline
- 9.1 The Pre-training Revolution (~5 pp)
- 9.2 BERT and Masked Language Modeling (~7 pp)
- 9.3 GPT and Autoregressive Language Modeling (~7 pp)
- 9.4 T5 and Encoder-Decoder Pre-training (~5 pp)
- 9.5 Comparing Paradigms (~6 pp)
Tokenization and Data at Scale
BPE, WordPiece, Unigram: subword tokenization strategies and data preprocessing for large models.
Section outline
- 10.1 From Words to Subwords (~3 pp)
- 10.2 Byte-Pair Encoding (BPE) (~5 pp)
- 10.3 SentencePiece and Unigram (~4 pp)
- 10.4 The Impact of Tokenization (~4 pp)
- 10.5 Data Curation at Scale (~4 pp)
Scaling Laws and Emergence
Kaplan and Chinchilla scaling laws, compute-optimal training, MoE, and emergent abilities.
Section outline
- 11.1 Scaling Laws (~5 pp)
- 11.2 Emergent Abilities (~4 pp)
- 11.3 Mixture of Experts (MoE) (~4 pp)
- 11.4 Efficient Training (~4 pp)
- 11.5 The Compute Frontier (~3 pp)
Part IV: Frontiers
Chapters 12–15 · Alignment, prompting, agents, and responsible AI
Alignment (RLHF, DPO)
Reward modeling, PPO, DPO, Constitutional AI — making LLMs helpful, harmless, and honest.
Section outline
- 12.1 The Alignment Problem (~3 pp)
- 12.2 Instruction Tuning (~4 pp)
- 12.3 RLHF (~6 pp)
- 12.4 Direct Preference Optimization (DPO) (~5 pp)
- 12.5 Constitutional AI and RLAIF (~3 pp)
- 12.6 Safety, Red-Teaming, and Guardrails (~4 pp)
In-Context Learning
Few-shot learning, chain-of-thought prompting, self-consistency, and the ICL mechanism debate.
Section outline
- 13.1 Few-Shot In-Context Learning (~5 pp)
- 13.2 Prompt Engineering (~6 pp)
- 13.3 Chain-of-Thought Reasoning (~6 pp)
- 13.4 Tool Use and Function Calling (~4 pp)
- 13.5 The Limits of Prompting (~4 pp)
Retrieval, Agents, Multimodality
RAG, ReAct agents, vision-language models such as CLIP, long contexts, and efficient inference.
Section outline
- 14.1 Retrieval-Augmented Generation (RAG) (~5 pp)
- 14.2 Agents and Planning (~4 pp)
- 14.3 Vision-Language Models (~4 pp)
- 14.4 Long-Context Models (~3 pp)
- 14.5 Efficient Inference (~4 pp)
Ethics, Society, Future
Carbon footprint, memorization, bias, watermarking, and responsible AI development.
Section outline
- 15.1 Bias and Fairness (~4 pp)
- 15.2 Privacy and Memorization (~3 pp)
- 15.3 Environmental Impact (~3 pp)
- 15.4 Intellectual Property and Regulation (~3 pp)
- 15.5 The Future of Language Models (~2 pp)