Predicting the Next Word
A Mathematical Foundation of Language Models
A comprehensive PhD-level textbook covering the complete evolution of language modeling, from Shannon's 1948 information theory to modern large language models.
13 Chapters · 364 Figures · 380 Pages
Book Structure
Part I: Foundations
- Introduction
- N-gram Models
- Tokenization
- Embeddings
Part II: Neural LMs
- RNNs & LSTMs
- Transformers
- Decoding
- Training
Part III: LLMs
- Large LMs
- Scaling Laws
- Post-Training
Part IV: Applications
- Efficiency
- Applications
Featured Visualizations
Over 360 publication-quality figures with Python source code
3D Entropy Surface
Information Theory Visualization
Interactive 3D surface showing entropy across probability distributions.
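A surface like this can be sketched in a few lines of Python (illustrative only, not the book's actual figure code): Shannon entropy H(p) = -Σ pᵢ log₂ pᵢ is evaluated over the 2-simplex of three-outcome distributions (p₁, p₂, 1 − p₁ − p₂), yielding a height field over (p₁, p₂).

```python
import math

def entropy(p):
    # Shannon entropy in bits; zero-probability outcomes contribute nothing.
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

# Height field over the 2-simplex: distributions (p1, p2, 1 - p1 - p2).
# Plotting `surface` over (p1, p2) gives an entropy surface of this kind.
step = 0.02
grid = [i * step for i in range(1, 50)]
surface = {
    (p1, p2): entropy([p1, p2, 1 - p1 - p2])
    for p1 in grid for p2 in grid if p1 + p2 < 1
}
```

The peak sits at the uniform distribution (1/3, 1/3, 1/3), where entropy reaches log₂ 3 ≈ 1.585 bits.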
Chapter 1

Smoothing Comparison
N-gram Smoothing Techniques
Comparison of Laplace, Kneser-Ney, and interpolation methods.
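The simplest of these methods, add-k (Laplace for k = 1) smoothing, can be sketched as follows (a toy illustration under assumed names, not the book's figure code): unseen bigrams receive a small nonzero probability by adding k pseudo-counts to every event.

```python
from collections import Counter

def laplace_bigram_prob(bigram_counts, unigram_counts, vocab_size, w1, w2, k=1.0):
    # Add-k smoothed bigram probability P(w2 | w1); k=1 is Laplace smoothing.
    return (bigram_counts.get((w1, w2), 0) + k) / (
        unigram_counts.get(w1, 0) + k * vocab_size
    )

corpus = "the cat sat on the mat".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
V = len(unigrams)  # vocabulary size: 5

p_seen = laplace_bigram_prob(bigrams, unigrams, V, "the", "cat")    # 2/7
p_unseen = laplace_bigram_prob(bigrams, unigrams, V, "the", "sat")  # 1/7
```

Kneser-Ney instead discounts observed counts and backs off to a continuation probability, which is why the two methods diverge most visibly on rare and unseen n-grams.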
Chapter 2

BPE Algorithm
Byte Pair Encoding
Step-by-step visualization of subword tokenization.
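The core BPE loop visualized here can be sketched in Python (a minimal toy version on an assumed corpus, not the book's figure code): repeatedly find the most frequent adjacent symbol pair across the vocabulary and merge it into a new subword symbol.

```python
from collections import Counter

def most_frequent_pair(vocab):
    # Count adjacent symbol pairs across all words (word -> frequency map).
    pairs = Counter()
    for word, freq in vocab.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(vocab, pair):
    # Replace every adjacent occurrence of `pair` with the merged symbol.
    merged = {}
    for word, freq in vocab.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Toy corpus: words pre-split into characters, with frequencies.
vocab = {tuple("low"): 5, tuple("lower"): 2, tuple("lowest"): 3}
for _ in range(3):  # three merge steps: l+o, lo+w, low+e
    vocab = merge_pair(vocab, most_frequent_pair(vocab))
```

After three merges, "low" has become a single symbol and "lower"/"lowest" share the learned prefix "lowe".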
Chapter 3

Book Progress
Track the development of this comprehensive textbook.
- Chapters Complete: 7 / 13
- Figures Generated: 196 / 364
- Overall Completion: 54%
© Joerg Osterrieder 2025