Predicting the Next Word

Chart Gallery

All figures from "Predicting the Next Word: A Mathematical Foundation of Language Models"

Total Charts: 250
Chapters: 7
Ch 1: 45
Ch 2: 44
Ch 3: 26
Ch 4: 29
Ch 5: 32
Ch 6: 43
Ch 7: 31

Chapter 1: Introduction: The Problem of Prediction

Timeline (fig_01_01_timeline)
Entropy Surface (fig_01_02_entropy_surface)
Shannon Guessing (fig_01_03_shannon_guessing)
Chain Rule (fig_01_04_chain_rule)
N-gram Window (fig_01_05_ngram_window)
Probability Simplex (fig_01_06_probability_simplex)
Cross Entropy Loss (fig_01_07_cross_entropy_loss)
Perplexity Interpretation (fig_01_08_perplexity_interpretation)
Zipf Log-Log (fig_01_09a_zipf_loglog)
Top Words (fig_01_09b_top_words)
Vocab Coverage (fig_01_09c_vocab_coverage)
Alpha Comparison (fig_01_09d_alpha_comparison)
KL Divergence (fig_01_10_kl_divergence)
LM History Evolution (fig_01_11_lm_history_evolution)
Context Window Comparison (fig_01_12_context_window_comparison)
Context Arch (fig_01_12a_context_arch)
Context Evolution (fig_01_12b_context_evolution)
Softmax Temperature (fig_01_13_softmax_temperature)
Vocabulary Structure (fig_01_14_vocabulary_structure)
Conditional Probability Tree (fig_01_15_conditional_probability_tree)
Log Probability Space (fig_01_16_log_probability_space)
Prediction Difficulty (fig_01_17_prediction_difficulty)
Evaluation Metrics (fig_01_18_evaluation_metrics)
Metric Relationships (fig_01_18a_metric_relationships)
Training Curves (fig_01_18b_training_curves)
Training Data Scale (fig_01_19_training_data_scale)
Bits Per Character (fig_01_20_bits_per_character)
BPC Surface 3D (fig_01_20a_bpc_surface_3d)
BPC Progress (fig_01_20b_bpc_progress)
Maximum Likelihood (fig_01_21_maximum_likelihood)
Smoothing Techniques (fig_01_22_smoothing_techniques)
Add-k Smoothing (fig_01_22a_addk_smoothing)
Method Comparison (fig_01_22b_method_comparison)
Language Structure (fig_01_23_language_structure)
Prediction Examples (fig_01_24_prediction_examples)
Syntactic Predictions (fig_01_24a_syntactic_predictions)
Ambiguous Contexts (fig_01_24b_ambiguous_contexts)
Information Flow (fig_01_25_information_flow)
LM Pipeline (fig_01_25a_lm_pipeline)
Output Distribution (fig_01_25b_output_distribution)
Model Comparison Radar (fig_01_26_model_comparison_radar)
N-gram vs Neural (fig_01_26a_ngram_vs_neural)
RNN vs Transformer (fig_01_26b_rnn_vs_transformer)
Chapter Roadmap (fig_01_27_chapter_roadmap)
Book Themes (fig_01_28_book_themes)

Chapter 2: N-gram Language Models

Markov Chain (fig_02_01_markov_chain)
N-gram Context (fig_02_02_ngram_context)
Bigram Surface (fig_02_03_bigram_surface)
Count Matrix Heatmap (fig_02_04a_count_matrix_heatmap)
Count Matrix Sparsity (fig_02_04b_count_matrix_sparsity)
MLE Estimation (fig_02_05_mle_estimation)
Zero Probability (fig_02_06_zero_probability)
Vocab Parameter Growth (fig_02_07a_vocab_parameter_growth)
Vocab Coverage (fig_02_07b_vocab_coverage)
Coverage Diminishing (fig_02_08a_coverage_diminishing)
Coverage Long Tail (fig_02_08b_coverage_longtail)
Sparsity 3D (fig_02_09_sparsity_3d)
Zipf Distribution (fig_02_10a_zipf_distribution)
Zipf Cumulative (fig_02_10b_zipf_cumulative)
Laplace Counts (fig_02_11a_laplace_counts)
Laplace Formula (fig_02_11b_laplace_formula)
Add-k Effect (fig_02_12a_add_k_effect)
Optimal k (fig_02_12b_optimal_k)
Frequency of Frequencies (fig_02_13a_frequency_of_frequencies)
Count Adjustment (fig_02_13b_count_adjustment)
MLE Distribution (fig_02_14a_mle_distribution)
Smoothed Distribution (fig_02_14b_smoothed_distribution)
Discount Effect (fig_02_15a_discount_effect)
Mass Collected (fig_02_15b_mass_collected)
Kneser-Ney (fig_02_16_kneser_ney)
Method Comparison (fig_02_17a_method_comparison)
N-gram Orders (fig_02_17b_ngram_orders)
Backoff Strategy (fig_02_18_backoff_strategy)
Interpolation Formula (fig_02_19a_interpolation_formula)
Interpolation Weights (fig_02_19b_interpolation_weights)
Stupid Backoff Algorithm (fig_02_20a_stupid_backoff_algorithm)
Stupid Backoff Performance (fig_02_20b_stupid_backoff_performance)
Backoff Hierarchy 3D (fig_02_21_backoff_hierarchy_3d)
Perplexity Formula (fig_02_22a_perplexity_formula)
Perplexity Comparison (fig_02_22b_perplexity_comparison)
Cross Entropy Conversion (fig_02_23a_cross_entropy_conversion)
Cross Entropy Bound (fig_02_23b_cross_entropy_bound)
Limitations (fig_02_24_limitations)
Prediction Common (fig_02_25a_prediction_common)
Prediction Phrase (fig_02_25b_prediction_phrase)
Prediction Novel (fig_02_25c_prediction_novel)
Prediction Rare (fig_02_25d_prediction_rare)
Chapter Summary (fig_02_26_chapter_summary)
Historical Impact (fig_02_27_historical_impact)

Chapter 3: Tokenization

Tokenization Overview (fig_03_01_tokenization_overview)
Word-Level Problems (fig_03_02_word_level_problems)
Character vs Word (fig_03_03_character_vs_word)
Subword Concept (fig_03_04_subword_concept)
BPE Algorithm (fig_03_05_bpe_algorithm)
BPE Merge Steps (fig_03_06_bpe_merge_steps)
BPE Vocabulary Growth (fig_03_07_bpe_vocabulary_growth)
WordPiece Algorithm (fig_03_08_wordpiece_algorithm)
SentencePiece Framework (fig_03_09_sentencepiece_framework)
Unigram LM Tokenization (fig_03_10_unigram_lm_tokenization)
Vocabulary Size Tradeoff (fig_03_11_vocabulary_size_tradeoff)
Fertility Comparison (fig_03_12_fertility_comparison)
OOV Handling (fig_03_13_oov_handling)
Special Tokens (fig_03_14_special_tokens)
Tokenization Examples (fig_03_15_tokenization_examples)
Multilingual Tokenization (fig_03_16_multilingual_tokenization)
Byte Fallback (fig_03_17_byte_fallback)
Compression Ratio (fig_03_18_compression_ratio)
Token Frequency 3D (fig_03_19_token_frequency_3d)
Vocab Coverage 3D (fig_03_20_vocab_coverage_3d)
BPE Surface 3D (fig_03_21_bpe_surface_3d)
Tokenization Comparison 3D (fig_03_22_tokenization_comparison_3d)
Fertility Surface 3D (fig_03_23_fertility_surface_3d)
Prediction With Tokens (fig_03_24_prediction_with_tokens)
Context Representation (fig_03_25_context_representation)
Chapter Summary (fig_03_26_chapter_summary)

Chapter 4: Word Embeddings

Embedding Concept (fig_04_01_embedding_concept)
Distributional Hypothesis (fig_04_02_distributional_hypothesis)
Skip-gram Architecture (fig_04_03_skipgram_architecture)
CBOW Architecture (fig_04_04_cbow_architecture)
Negative Sampling (fig_04_05_negative_sampling)
GloVe Co-occurrence (fig_04_06_glove_cooccurrence)
fastText Subwords (fig_04_07_fasttext_subwords)
Contextual vs Static (fig_04_08_contextual_vs_static)
Word Frequency Embedding (fig_04_09_word_frequency_embedding)
Window Performance (fig_04_10a_window_performance)
Similarity Heatmap (fig_04_10b_similarity_heatmap)
Embedding Dimension (fig_04_11_embedding_dimension)
Training Loss Curve (fig_04_12_training_loss_curve)
Analogy Accuracy (fig_04_13_analogy_accuracy)
Similarity Distribution (fig_04_14_similarity_distribution)
Nearest Neighbors (fig_04_15_nearest_neighbors)
Word Clustering (fig_04_16_word_clustering)
Bias Detection (fig_04_17_bias_detection)
OOV Coverage (fig_04_18_oov_coverage)
Embedding Space 3D (fig_04_19_embedding_space_3d)
Skip-gram Objective 3D (fig_04_20_skipgram_objective_3d)
Analogy Geometry 3D (fig_04_21_analogy_geometry_3d)
Semantic Clusters 3D (fig_04_22_semantic_clusters_3d)
Context Evolution 3D (fig_04_23_context_evolution_3d)
Dot Product Geometry (fig_04_24_dot_product_geometry)
Softmax Normalization (fig_04_25_softmax_normalization)
Gradient Flow (fig_04_26_gradient_flow)
Matrix Factorization (fig_04_27_matrix_factorization)
Chapter Summary (fig_04_28_chapter_summary)

Chapter 5: RNNs and LSTMs

Static Embeddings (fig_05_01a_static_embeddings)
Dynamic Hidden (fig_05_01b_dynamic_hidden)
Sequential Processing (fig_05_02_sequential_processing)
Running Example (fig_05_03_running_example)
Hidden State Funnel (fig_05_04_hidden_state_funnel)
RNN Cell (fig_05_05_rnn_cell)
Unrolled RNN (fig_05_06_unrolled_rnn)
Hidden Trajectory 3D (fig_05_07_hidden_trajectory_3d)
Running RNN (fig_05_08_running_rnn)
Gradient Decay (fig_05_09_gradient_decay)
LSTM Cell (fig_05_10_lstm_cell)
Gate Surfaces 3D (fig_05_11_gate_surfaces_3d)
LSTM Highway (fig_05_12a_lstm_highway)
RNN Recompute (fig_05_12b_rnn_recompute)
Running LSTM (fig_05_13_running_lstm)
Cell Evolution 3D (fig_05_14_cell_evolution_3d)
Gradient Comparison (fig_05_15_gradient_comparison)
Forget Gate (fig_05_16a_forget_gate)
Input Gate (fig_05_16b_input_gate)
Output Gate (fig_05_16c_output_gate)
GRU Cell (fig_05_17_gru_cell)
Parameter Comparison (fig_05_18_parameter_comparison)
Performance Comparison (fig_05_19_performance_comparison)
BPTT Diagram (fig_05_20_bptt_diagram)
Gradient Flow 3D (fig_05_21_gradient_flow_3d)
Truncated BPTT (fig_05_22_truncated_bptt)
Training Curve (fig_05_23_training_curve)
Context Evolution (fig_05_24_context_evolution)
Prediction Surface 3D (fig_05_25_prediction_surface_3d)
Sequence Length Effect (fig_05_26_sequence_length_effect)
Stacked LSTM (fig_05_27_stacked_lstm)
Chapter Summary (fig_05_28_chapter_summary)

Chapter 6: Transformers

RNN Sequential (fig_06_01a_rnn_sequential)
Transformer Parallel (fig_06_01b_transformer_parallel)
Information Bottleneck (fig_06_02_information_bottleneck)
Attention Intuition (fig_06_03_attention_intuition)
Running Example Context (fig_06_04_running_example_context)
QKV Projection (fig_06_05_qkv_projection)
Attention Scores (fig_06_06_attention_scores)
Softmax Normalization (fig_06_07_softmax_normalization)
Weighted Sum (fig_06_08_weighted_sum)
Attention Matrix (fig_06_09_attention_matrix)
Attention Example Running (fig_06_10_attention_example_running)
Attention Surface 3D (fig_06_11_attention_surface_3d)
Scaling Effect (fig_06_12_scaling_effect)
Causal Mask (fig_06_13_causal_mask)
Masked Attention Weights (fig_06_14_masked_attention_weights)
Running Example Masked (fig_06_15_running_example_masked)
Before Mask (fig_06_16a_before_mask)
After Mask (fig_06_16b_after_mask)
Generation Steps (fig_06_17_generation_steps)
Single Head (fig_06_18a_single_head)
Head Recent (fig_06_18b1_head_recent)
Head Long-Range (fig_06_18b2_head_longrange)
Head Mid-Range (fig_06_18b3_head_midrange)
Head Immediate (fig_06_18b4_head_immediate)
Head Specialization (fig_06_19_head_specialization)
Head Concatenation (fig_06_20_head_concatenation)
Running Example Multi-Head (fig_06_21_running_example_multihead)
Syntax Head (fig_06_22a_syntax_head)
Semantic Head (fig_06_22b_semantic_head)
Position Head (fig_06_22c_position_head)
Long-Range Head (fig_06_22d_longrange_head)
Sentence 1 (fig_06_23a_sentence1)
Sentence 2 (fig_06_23b_sentence2)
Positional Addition (fig_06_24_positional_addition)
Sinusoidal Surface 3D (fig_06_25_sinusoidal_surface_3d)
Learned vs Sinusoidal (fig_06_26_learned_vs_sinusoidal)
RoPE Encoding (fig_06_27_rope_encoding)
Position Encoding Comparison (fig_06_28_position_encoding_comparison)
RNN Context (fig_06_29a_rnn_context)
Transformer Context (fig_06_29b_transformer_context)
Stacked Layers (fig_06_30_stacked_layers)
Context Comparison 3D (fig_06_31_context_comparison_3d)
Running Example Final (fig_06_32_running_example_final)

Chapter 7: Decoding Strategies

Running Example Prompt (fig_07_01_running_example_prompt)
Decoding Landscape (fig_07_02_decoding_landscape)
Deterministic (fig_07_03a_deterministic)
Stochastic (fig_07_03b_stochastic)
Greedy Decoding Tree (fig_07_04_greedy_decoding_tree)
Repetition Failure (fig_07_05_repetition_failure)
Temperature Distributions (fig_07_06_temperature_distributions)
Temperature Surface 3D (fig_07_07_temperature_surface_3d)
Temperature Entropy (fig_07_08_temperature_entropy)
Top-k Truncation 3D (fig_07_09_topk_truncation_3d)
Top-k Effect (fig_07_10_topk_effect)
Top-k 5 (fig_07_11a_topk5)
Top-k 50 (fig_07_11b_topk50)
Nucleus Threshold 3D (fig_07_12_nucleus_threshold_3d)
Nucleus Peaked (fig_07_13a_nucleus_peaked)
Nucleus Flat (fig_07_13b_nucleus_flat)
Nucleus Cumulative (fig_07_14_nucleus_cumulative)
Typical Set (fig_07_15_typical_set)
Typical vs Nucleus (fig_07_16_typical_vs_nucleus)
Beam Search Tree 3D (fig_07_17_beam_search_tree_3d)
Beam Search Paths (fig_07_18_beam_search_paths)
Length Normalization (fig_07_19_length_normalization)
No Coverage (fig_07_20a_no_coverage)
With Coverage (fig_07_20b_with_coverage)
Contrastive Expert Amateur (fig_07_21_contrastive_expert_amateur)
Contrastive Improvement (fig_07_22_contrastive_improvement)
Constrained Grid (fig_07_23_constrained_grid)
Constraint Satisfaction (fig_07_24_constraint_satisfaction)
Sampling Trajectory 3D (fig_07_25_sampling_trajectory_3d)
Speculative Verify (fig_07_26_speculative_verify)
Quality Diversity 3D (fig_07_27_quality_diversity_3d)
Predicting the Next Word

A Mathematical Foundation of Language Models

Springer Texts in Computer Science
Expected: 2025

How to Cite
@book{osterrieder2025predicting,
  title = {Predicting the Next Word},
  author = {Osterrieder, Joerg},
  publisher = {Springer},
  year = {2025}
}
Contact
  • joerg.osterrieder@fhgr.ch
  • Digital-AI-Finance
  • FHGR - University of Applied Sciences of the Grisons

© 2025 Joerg Osterrieder. All rights reserved.
Built with Jekyll and hosted on GitHub Pages.
