Chart Gallery
All figures from "Predicting the Next Word: A Mathematical Foundation of Language Models"
Total Charts: 250
Chapters: 7
Ch 1: 45
Ch 2: 44
Ch 3: 26
Ch 4: 29
Ch 5: 32
Ch 6: 43
Ch 7: 31
Chapter 1: Introduction: The Problem of Prediction
fig_01_01_timeline
Timeline
fig_01_02_entropy_surface
Entropy Surface
fig_01_03_shannon_guessing
Shannon Guessing
fig_01_04_chain_rule
Chain Rule
fig_01_05_ngram_window
N-gram Window
fig_01_06_probability_simplex
Probability Simplex
fig_01_07_cross_entropy_loss
Cross Entropy Loss
fig_01_08_perplexity_interpretation
Perplexity Interpretation
fig_01_09a_zipf_loglog
Zipf Log-Log
fig_01_09b_top_words
Top Words
fig_01_09c_vocab_coverage
Vocab Coverage
fig_01_09d_alpha_comparison
Alpha Comparison
fig_01_10_kl_divergence
KL Divergence
fig_01_11_lm_history_evolution
LM History Evolution
fig_01_12_context_window_comparison
Context Window Comparison
fig_01_12a_context_arch
Context Arch
fig_01_12b_context_evolution
Context Evolution
fig_01_13_softmax_temperature
Softmax Temperature
fig_01_14_vocabulary_structure
Vocabulary Structure
fig_01_15_conditional_probability_tree
Conditional Probability Tree
fig_01_16_log_probability_space
Log Probability Space
fig_01_17_prediction_difficulty
Prediction Difficulty
fig_01_18_evaluation_metrics
Evaluation Metrics
fig_01_18a_metric_relationships
Metric Relationships
fig_01_18b_training_curves
Training Curves
fig_01_19_training_data_scale
Training Data Scale
fig_01_20_bits_per_character
Bits Per Character
fig_01_20a_bpc_surface_3d
BPC Surface 3D
fig_01_20b_bpc_progress
BPC Progress
fig_01_21_maximum_likelihood
Maximum Likelihood
fig_01_22_smoothing_techniques
Smoothing Techniques
fig_01_22a_addk_smoothing
Add-k Smoothing
fig_01_22b_method_comparison
Method Comparison
fig_01_23_language_structure
Language Structure
fig_01_24_prediction_examples
Prediction Examples
fig_01_24a_syntactic_predictions
Syntactic Predictions
fig_01_24b_ambiguous_contexts
Ambiguous Contexts
fig_01_25_information_flow
Information Flow
fig_01_25a_lm_pipeline
LM Pipeline
fig_01_25b_output_distribution
Output Distribution
fig_01_26_model_comparison_radar
Model Comparison Radar
fig_01_26a_ngram_vs_neural
N-gram vs. Neural
fig_01_26b_rnn_vs_transformer
RNN vs. Transformer
fig_01_27_chapter_roadmap
Chapter Roadmap
fig_01_28_book_themes
Book Themes
Chapter 2: N-gram Language Models
fig_02_01_markov_chain
Markov Chain
fig_02_02_ngram_context
N-gram Context
fig_02_03_bigram_surface
Bigram Surface
fig_02_04a_count_matrix_heatmap
Count Matrix Heatmap
fig_02_04b_count_matrix_sparsity
Count Matrix Sparsity
fig_02_05_mle_estimation
MLE Estimation
fig_02_06_zero_probability
Zero Probability
fig_02_07a_vocab_parameter_growth
Vocab Parameter Growth
fig_02_07b_vocab_coverage
Vocab Coverage
fig_02_08a_coverage_diminishing
Coverage Diminishing
fig_02_08b_coverage_longtail
Coverage Long Tail
fig_02_09_sparsity_3d
Sparsity 3D
fig_02_10a_zipf_distribution
Zipf Distribution
fig_02_10b_zipf_cumulative
Zipf Cumulative
fig_02_11a_laplace_counts
Laplace Counts
fig_02_11b_laplace_formula
Laplace Formula
fig_02_12a_add_k_effect
Add-k Effect
fig_02_12b_optimal_k
Optimal k
fig_02_13a_frequency_of_frequencies
Frequency Of Frequencies
fig_02_13b_count_adjustment
Count Adjustment
fig_02_14a_mle_distribution
MLE Distribution
fig_02_14b_smoothed_distribution
Smoothed Distribution
fig_02_15a_discount_effect
Discount Effect
fig_02_15b_mass_collected
Mass Collected
fig_02_16_kneser_ney
Kneser-Ney
fig_02_17a_method_comparison
Method Comparison
fig_02_17b_ngram_orders
N-gram Orders
fig_02_18_backoff_strategy
Backoff Strategy
fig_02_19a_interpolation_formula
Interpolation Formula
fig_02_19b_interpolation_weights
Interpolation Weights
fig_02_20a_stupid_backoff_algorithm
Stupid Backoff Algorithm
fig_02_20b_stupid_backoff_performance
Stupid Backoff Performance
fig_02_21_backoff_hierarchy_3d
Backoff Hierarchy 3D
fig_02_22a_perplexity_formula
Perplexity Formula
fig_02_22b_perplexity_comparison
Perplexity Comparison
fig_02_23a_cross_entropy_conversion
Cross Entropy Conversion
fig_02_23b_cross_entropy_bound
Cross Entropy Bound
fig_02_24_limitations
Limitations
fig_02_25a_prediction_common
Prediction Common
fig_02_25b_prediction_phrase
Prediction Phrase
fig_02_25c_prediction_novel
Prediction Novel
fig_02_25d_prediction_rare
Prediction Rare
fig_02_26_chapter_summary
Chapter Summary
fig_02_27_historical_impact
Historical Impact
Chapter 3: Tokenization
fig_03_01_tokenization_overview
Tokenization Overview
fig_03_02_word_level_problems
Word Level Problems
fig_03_03_character_vs_word
Character vs. Word
fig_03_04_subword_concept
Subword Concept
fig_03_05_bpe_algorithm
BPE Algorithm
fig_03_06_bpe_merge_steps
BPE Merge Steps
fig_03_07_bpe_vocabulary_growth
BPE Vocabulary Growth
fig_03_08_wordpiece_algorithm
WordPiece Algorithm
fig_03_09_sentencepiece_framework
SentencePiece Framework
fig_03_10_unigram_lm_tokenization
Unigram LM Tokenization
fig_03_11_vocabulary_size_tradeoff
Vocabulary Size Tradeoff
fig_03_12_fertility_comparison
Fertility Comparison
fig_03_13_oov_handling
OOV Handling
fig_03_14_special_tokens
Special Tokens
fig_03_15_tokenization_examples
Tokenization Examples
fig_03_16_multilingual_tokenization
Multilingual Tokenization
fig_03_17_byte_fallback
Byte Fallback
fig_03_18_compression_ratio
Compression Ratio
fig_03_19_token_frequency_3d
Token Frequency 3D
fig_03_20_vocab_coverage_3d
Vocab Coverage 3D
fig_03_21_bpe_surface_3d
BPE Surface 3D
fig_03_22_tokenization_comparison_3d
Tokenization Comparison 3D
fig_03_23_fertility_surface_3d
Fertility Surface 3D
fig_03_24_prediction_with_tokens
Prediction With Tokens
fig_03_25_context_representation
Context Representation
fig_03_26_chapter_summary
Chapter Summary
Chapter 4: Word Embeddings
fig_04_01_embedding_concept
Embedding Concept
fig_04_02_distributional_hypothesis
Distributional Hypothesis
fig_04_03_skipgram_architecture
Skip-gram Architecture
fig_04_04_cbow_architecture
CBOW Architecture
fig_04_05_negative_sampling
Negative Sampling
fig_04_06_glove_cooccurrence
GloVe Co-occurrence
fig_04_07_fasttext_subwords
fastText Subwords
fig_04_08_contextual_vs_static
Contextual vs. Static
fig_04_09_word_frequency_embedding
Word Frequency Embedding
fig_04_10a_window_performance
Window Performance
fig_04_10b_similarity_heatmap
Similarity Heatmap
fig_04_11_embedding_dimension
Embedding Dimension
fig_04_12_training_loss_curve
Training Loss Curve
fig_04_13_analogy_accuracy
Analogy Accuracy
fig_04_14_similarity_distribution
Similarity Distribution
fig_04_15_nearest_neighbors
Nearest Neighbors
fig_04_16_word_clustering
Word Clustering
fig_04_17_bias_detection
Bias Detection
fig_04_18_oov_coverage
OOV Coverage
fig_04_19_embedding_space_3d
Embedding Space 3D
fig_04_20_skipgram_objective_3d
Skip-gram Objective 3D
fig_04_21_analogy_geometry_3d
Analogy Geometry 3D
fig_04_22_semantic_clusters_3d
Semantic Clusters 3D
fig_04_23_context_evolution_3d
Context Evolution 3D
fig_04_24_dot_product_geometry
Dot Product Geometry
fig_04_25_softmax_normalization
Softmax Normalization
fig_04_26_gradient_flow
Gradient Flow
fig_04_27_matrix_factorization
Matrix Factorization
fig_04_28_chapter_summary
Chapter Summary
Chapter 5: RNNs and LSTMs
fig_05_01a_static_embeddings
Static Embeddings
fig_05_01b_dynamic_hidden
Dynamic Hidden
fig_05_02_sequential_processing
Sequential Processing
fig_05_03_running_example
Running Example
fig_05_04_hidden_state_funnel
Hidden State Funnel
fig_05_05_rnn_cell
RNN Cell
fig_05_06_unrolled_rnn
Unrolled RNN
fig_05_07_hidden_trajectory_3d
Hidden Trajectory 3D
fig_05_08_running_rnn
Running RNN
fig_05_09_gradient_decay
Gradient Decay
fig_05_10_lstm_cell
LSTM Cell
fig_05_11_gate_surfaces_3d
Gate Surfaces 3D
fig_05_12a_lstm_highway
LSTM Highway
fig_05_12b_rnn_recompute
RNN Recompute
fig_05_13_running_lstm
Running LSTM
fig_05_14_cell_evolution_3d
Cell Evolution 3D
fig_05_15_gradient_comparison
Gradient Comparison
fig_05_16a_forget_gate
Forget Gate
fig_05_16b_input_gate
Input Gate
fig_05_16c_output_gate
Output Gate
fig_05_17_gru_cell
GRU Cell
fig_05_18_parameter_comparison
Parameter Comparison
fig_05_19_performance_comparison
Performance Comparison
fig_05_20_bptt_diagram
BPTT Diagram
fig_05_21_gradient_flow_3d
Gradient Flow 3D
fig_05_22_truncated_bptt
Truncated BPTT
fig_05_23_training_curve
Training Curve
fig_05_24_context_evolution
Context Evolution
fig_05_25_prediction_surface_3d
Prediction Surface 3D
fig_05_26_sequence_length_effect
Sequence Length Effect
fig_05_27_stacked_lstm
Stacked LSTM
fig_05_28_chapter_summary
Chapter Summary
Chapter 6: Transformers
fig_06_01a_rnn_sequential
RNN Sequential
fig_06_01b_transformer_parallel
Transformer Parallel
fig_06_02_information_bottleneck
Information Bottleneck
fig_06_03_attention_intuition
Attention Intuition
fig_06_04_running_example_context
Running Example Context
fig_06_05_qkv_projection
QKV Projection
fig_06_06_attention_scores
Attention Scores
fig_06_07_softmax_normalization
Softmax Normalization
fig_06_08_weighted_sum
Weighted Sum
fig_06_09_attention_matrix
Attention Matrix
fig_06_10_attention_example_running
Attention Example Running
fig_06_11_attention_surface_3d
Attention Surface 3D
fig_06_12_scaling_effect
Scaling Effect
fig_06_13_causal_mask
Causal Mask
fig_06_14_masked_attention_weights
Masked Attention Weights
fig_06_15_running_example_masked
Running Example Masked
fig_06_16a_before_mask
Before Mask
fig_06_16b_after_mask
After Mask
fig_06_17_generation_steps
Generation Steps
fig_06_18a_single_head
Single Head
fig_06_18b1_head_recent
Head Recent
fig_06_18b2_head_longrange
Head Long-Range
fig_06_18b3_head_midrange
Head Mid-Range
fig_06_18b4_head_immediate
Head Immediate
fig_06_19_head_specialization
Head Specialization
fig_06_20_head_concatenation
Head Concatenation
fig_06_21_running_example_multihead
Running Example Multi-Head
fig_06_22a_syntax_head
Syntax Head
fig_06_22b_semantic_head
Semantic Head
fig_06_22c_position_head
Position Head
fig_06_22d_longrange_head
Long-Range Head
fig_06_23a_sentence1
Sentence 1
fig_06_23b_sentence2
Sentence 2
fig_06_24_positional_addition
Positional Addition
fig_06_25_sinusoidal_surface_3d
Sinusoidal Surface 3D
fig_06_26_learned_vs_sinusoidal
Learned vs. Sinusoidal
fig_06_27_rope_encoding
RoPE Encoding
fig_06_28_position_encoding_comparison
Position Encoding Comparison
fig_06_29a_rnn_context
RNN Context
fig_06_29b_transformer_context
Transformer Context
fig_06_30_stacked_layers
Stacked Layers
fig_06_31_context_comparison_3d
Context Comparison 3D
fig_06_32_running_example_final
Running Example Final
Chapter 7: Decoding Strategies
fig_07_01_running_example_prompt
Running Example Prompt
fig_07_02_decoding_landscape
Decoding Landscape
fig_07_03a_deterministic
Deterministic
fig_07_03b_stochastic
Stochastic
fig_07_04_greedy_decoding_tree
Greedy Decoding Tree
fig_07_05_repetition_failure
Repetition Failure
fig_07_06_temperature_distributions
Temperature Distributions
fig_07_07_temperature_surface_3d
Temperature Surface 3D
fig_07_08_temperature_entropy
Temperature Entropy
fig_07_09_topk_truncation_3d
Top-k Truncation 3D
fig_07_10_topk_effect
Top-k Effect
fig_07_11a_topk5
Top-k (k = 5)
fig_07_11b_topk50
Top-k (k = 50)
fig_07_12_nucleus_threshold_3d
Nucleus Threshold 3D
fig_07_13a_nucleus_peaked
Nucleus Peaked
fig_07_13b_nucleus_flat
Nucleus Flat
fig_07_14_nucleus_cumulative
Nucleus Cumulative
fig_07_15_typical_set
Typical Set
fig_07_16_typical_vs_nucleus
Typical vs. Nucleus
fig_07_17_beam_search_tree_3d
Beam Search Tree 3D
fig_07_18_beam_search_paths
Beam Search Paths
fig_07_19_length_normalization
Length Normalization
fig_07_20a_no_coverage
No Coverage
fig_07_20b_with_coverage
With Coverage
fig_07_21_contrastive_expert_amateur
Contrastive Expert Amateur
fig_07_22_contrastive_improvement
Contrastive Improvement
fig_07_23_constrained_grid
Constrained Grid
fig_07_24_constraint_satisfaction
Constraint Satisfaction
fig_07_25_sampling_trajectory_3d
Sampling Trajectory 3D
fig_07_26_speculative_verify
Speculative Verify
fig_07_27_quality_diversity_3d
Quality Diversity 3D