From a Google intern’s paper to Nvidia losing $589 billion in market cap in a single day. From counting letters in “strawberry” to teaching AI human values. Every story is powered by one thing: mathematics.
~20–30 minutes
What you’ll learn:
How transformers “pay attention” to language
Why bigger models aren’t always smarter
Why AI gives confidently wrong answers
How math can make AI fairer — and where it can’t
I
The Foundations
How LLMs Work
1 / 14
2017
“Attention Is All You Need”
Eight Google researchers—including a 20-year-old intern named Aidan Gomez—published a 15-page paper with a Beatles-inspired title. It became the most cited AI paper in history. Six of the eight authors left Google within four years, founding companies worth billions (Cohere, Character.AI, Inceptive). Noam Shazeer, who designed the attention mechanism, quit in 2021 and was brought back in 2024 for $2.7 billion.
Each word generates a Query (what am I looking for?), Key (what do I offer?), and Value (my content). The dot product Q·K measures relevance. Softmax converts scores to probabilities summing to 1. The result: every word “pays attention” to every other word simultaneously—which is why GPUs can train transformers so fast.
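The Q·K·V recipe above can be sketched in a few lines of NumPy. This is a minimal, illustrative version of scaled dot-product attention with made-up toy dimensions, not a production implementation:

```python
import numpy as np

def softmax(x):
    # Subtract the row max for numerical stability; each row sums to 1.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: relevance = Q·K, scaled by sqrt(d_k).
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # every word scored against every other
    weights = softmax(scores)          # each row becomes a probability distribution
    return weights @ V, weights        # blend the Values by attention weight

# Toy example: 3 "words", each with a 4-dimensional Query, Key, and Value.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))
out, w = attention(Q, K, V)
print(w.sum(axis=1))  # each word's attention weights sum to 1
```

Because every word attends to every other word in one matrix multiplication, the whole computation parallelizes on a GPU.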
Interactive version coming soon
Why you should care: A 20-year-old intern co-wrote the paper that makes ChatGPT, Claude, and Gemini work. All of them run on this one formula.
Before reading on: what do you think “attention” means for an AI?
It means every word looks at every other word and decides how much to “pay attention” to it — weighted by mathematical similarity scores. It’s not human attention; it’s a matrix multiplication.
Google researcher Tomas Mikolov submitted a paper that peer reviewers rejected—at a conference with a 70% acceptance rate. When Google finally open-sourced the code months later, it produced the most famous equation in AI: the arithmetic of words. A neural network trained on billions of words discovered that “King − Man + Woman” lands near “Queen” in vector space. The same paper won the NeurIPS Test of Time Award a decade later.
Every word becomes a vector of 300 numbers. Similar words cluster together. Relationships (gender, royalty, country→capital) appear as consistent directions. Cosine similarity measures how close two words are—the same formula from Section 8 of this talk.
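The word-arithmetic trick can be demonstrated with cosine similarity on hand-made vectors. The two dimensions here ("royalty" and "gender") are purely illustrative stand-ins for the ~300 learned dimensions of a real embedding:

```python
import numpy as np

def cosine(a, b):
    # cos(theta) = a·b / (|a||b|): 1 = same direction, near 0 = unrelated.
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy vectors: dimension 0 ~ "royalty", dimension 1 ~ "gender".
vecs = {
    "king":  np.array([0.9,  0.8]),
    "queen": np.array([0.9, -0.8]),
    "man":   np.array([0.1,  0.8]),
    "woman": np.array([0.1, -0.8]),
}

target = vecs["king"] - vecs["man"] + vecs["woman"]
best = max(vecs, key=lambda w: cosine(target, vecs[w]))
print(best)  # "queen" is nearest by cosine similarity
```

In a trained word2vec model the same arithmetic works because relationships like gender really do appear as consistent directions in the learned space.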
Why you should care: Rejected by reviewers, delayed by bureaucracy—then it changed everything. The math that lets AI “understand” meaning is the same linear algebra you study in school.
In 1951, a 35-year-old mathematician at Bell Labs named Claude Shannon ran a remarkable experiment: he asked people to predict the next letter of a text, one character at a time. If wrong, they were told the correct letter. By counting guesses, Shannon measured the statistical structure of English—finding it has only ~1.1 bits of entropy per character (out of a maximum 4.7). He had described exactly what ChatGPT does: minimize uncertainty about the next token. He did it 71 years before ChatGPT existed.
The Math: Entropy & Perplexity
$$H = -\sum p(x) \log_2 p(x)$$
$$\text{Perplexity} = 2^H$$
Entropy measures average surprise per symbol. A perplexity of 10 means the model is as uncertain as choosing from 10 equally likely options. Training an LLM on the internet is extreme compression of human knowledge—to predict the next word, the model must learn facts, grammar, logic, and culture.
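Shannon's numbers drop straight out of the formula. A short stdlib sketch (the 1.1 bits/char figure is Shannon's published estimate, not computed here):

```python
import math

def entropy(probs):
    # H = -sum p * log2(p): average surprise in bits per symbol.
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Uniform guessing over 26 letters: maximum uncertainty, log2(26) ≈ 4.7 bits.
uniform = [1 / 26] * 26
print(entropy(uniform))       # ≈ 4.70 bits per character
print(2 ** entropy(uniform))  # perplexity 26: like choosing among 26 options

# Shannon's measured ~1.1 bits/char corresponds to a perplexity of only:
print(2 ** 1.1)               # ≈ 2.1 effective choices per character
```

Context collapses a 26-way guess to roughly a 2-way one—exactly the uncertainty a language model is trained to squeeze out.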
Why you should care: The math designed to send telephone signals efficiently in 1948 turned out to be the exact training objective of the most powerful AI systems ever built.
Every word ChatGPT types is the winner of a probability competition among 100,000+ candidates. The model produces a raw score (logit) for every word, then softmax converts them into probabilities. “The” might get 32%, “a” gets 18%, and 100,000 others share the rest. A parameter called temperature controls randomness: at 0, the model always picks the top word (robotic). At 1, it samples proportionally (creative but risky). Every AI conversation is literally a sequence of weighted dice rolls.
The exponential function amplifies differences: a small advantage in raw score becomes a large probability advantage. Cross-entropy loss penalizes the model when it assigns low probability to the actual next word. If p = 0.01, loss = 4.6 (harsh penalty). If p = 0.99, loss ≈ 0.01 (almost no penalty).
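Softmax, temperature, and cross-entropy fit in a dozen lines. The logits below are invented for illustration; the mechanics are the real thing:

```python
import math

def softmax(logits, temperature=1.0):
    # Divide by temperature, exponentiate, normalize so probabilities sum to 1.
    scaled = [l / temperature for l in logits]
    m = max(scaled)                           # subtract max for stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.4, 0.3]                      # raw scores for three candidate words
print(softmax(logits))                        # T = 1: sample proportionally
print(softmax(logits, temperature=0.1))       # T → 0: top word dominates (robotic)
print(softmax(logits, temperature=10.0))      # high T: near-uniform (risky)

# Cross-entropy loss for the word that actually came next: -ln(p).
p_correct = softmax(logits)[0]
print(-math.log(p_correct))                   # high p → tiny penalty
```

Note how at temperature 0.1 the top word takes almost all the probability mass—the "weighted dice" become loaded dice.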
Why you should care: “Every word is a weighted die roll” explains why AI can be wrong, creative, or inconsistent—and why no two conversations are identical.
Now you know how they work. But what made them unstoppable?
II
The Breakthroughs
What Changed Everything
5 / 14
2022
Chinchilla: Bigger ≠ Smarter
In 2022, DeepMind proved that every major AI lab had the wrong formula. Everyone was building bigger models, assuming more parameters = better. DeepMind’s Chinchilla—with 70B parameters, half the size of Gopher (280B)—outperformed it on almost every benchmark by training on 4× more data. The secret was a simple power law: scale data and model size equally.
For a fixed compute budget C, optimal model size N and training data D should both scale as the square root of compute: N ∝ C^0.5, D ∝ C^0.5. The earlier belief (Kaplan 2020) was N ∝ C^0.73, over-weighting size. Performance follows power laws—the same y = ax^b as Kepler’s planetary laws.
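The Chinchilla recipe reduces to a one-line allocation rule. This sketch uses the commonly quoted approximations C ≈ 6·N·D and ~20 training tokens per parameter, not the paper's exact fitted coefficients:

```python
def chinchilla_optimal(compute_flops):
    # Rule of thumb: C ≈ 6 * N * D with D ≈ 20 * N, so both N and D
    # grow as sqrt(C) -- scale parameters and data equally.
    n_params = (compute_flops / (6 * 20)) ** 0.5
    n_tokens = 20 * n_params
    return n_params, n_tokens

# Chinchilla's own budget (~5.9e23 FLOPs) recovers roughly its actual
# configuration: ~70B parameters trained on ~1.4T tokens.
n, d = chinchilla_optimal(5.88e23)
print(f"{n:.2e} params, {d:.2e} tokens")
```

Under the old Kaplan exponent, the same budget would have bought a much larger model starved of data—exactly Gopher's mistake.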
Why you should care: The most advanced AI lab in the world had the wrong formula for years. Science is self-correcting—and the sweet spot is an elegant mathematical optimum.
In 2022, Google Brain researchers discovered that simply adding one short phrase—“Let’s think step by step”—to any prompt dramatically improved LLM math performance. In 2024, OpenAI’s o1 model took this further: trained with reinforcement learning on reasoning traces, it generates thousands of hidden “thinking tokens” before answering. On the 2024 AIME (top 3% of US math students), GPT-4o scored 12%. o1 scored 93%. More thinking time literally makes AI smarter.
The Math: Test-Time Compute Scaling
Before o1, all compute went into training. Now performance also scales with compute spent at inference—how long the model “thinks.” Mathematically, this is tree search: each thinking step explores a node in a decision tree. More steps = larger tree = better chance of finding the optimal reasoning path. A third dimension of scaling beyond parameters and data.
Interactive version coming soon
Why you should care: The same technique your math teacher tells you to do—show your work, think step by step—is literally the breakthrough that made AI a gold-medal mathematician.
What percentage do you think GPT-4o scores on competition math? Think of your answer, then click to reveal.
12%! But o1 scores 93%. The difference is “thinking time” — letting the model reason step by step before answering.
In January 2025, a two-year-old Chinese company called DeepSeek—founded by hedge fund manager Liang Wenfeng—released model R1 for free. Training compute cost: $5.6 million (vs. GPT-4’s estimated $100M+ total development budget). One week later, it was the #1 app on the US App Store. The same day, Nvidia lost $589 billion in market cap—the largest single-day loss in stock market history. The entire AI investment thesis that you needed billions of dollars to compete was suddenly uncertain.
The Math: Mixture of Experts (MoE)
$$\text{Active params} = K \times E \ll N \times E = \text{Total params}$$
DeepSeek V3 has 671B total parameters but only 37B are active per input (<6%). A routing function selects which K of N specialist sub-networks (“experts”) handle each token. Result: knowledge capacity of 671B, compute cost of 37B. Plus, R1 learned reasoning through pure reinforcement learning—no human labels needed.
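A toy top-K router makes the trick concrete. The sizes below (8 experts, 16 dimensions) are illustrative stand-ins for DeepSeek's hundreds of experts, and the linear "experts" stand in for full sub-networks:

```python
import numpy as np

def moe_forward(x, experts, router, k=2):
    # The router scores every expert, keeps only the top-k,
    # and just those k experts actually run -- the rest cost nothing.
    scores = router @ x                       # one score per expert
    top_k = np.argsort(scores)[-k:]           # indices of the k best experts
    gate = np.exp(scores[top_k])
    gate /= gate.sum()                        # normalize gates to sum to 1
    # Output = gate-weighted sum of the chosen experts' outputs.
    return sum(g * (experts[i] @ x) for g, i in zip(gate, top_k))

rng = np.random.default_rng(1)
n_experts, dim = 8, 16
experts = rng.normal(size=(n_experts, dim, dim))  # 8 expert "networks"
router = rng.normal(size=(n_experts, dim))        # routing function
x = rng.normal(size=dim)                          # one token's vector

y = moe_forward(x, experts, router, k=2)
print(y.shape)  # (16,) -- computed using only 2 of the 8 experts
```

Total parameters scale with all 8 experts; compute per token scales with only 2—the same asymmetry that gives V3 671B capacity at 37B cost.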
Interactive version coming soon
Why you should care: David vs. Goliath with math. A model trained for $6M in compute challenged models with $100M+ budgets, crashed stock markets, and proved that mathematical efficiency beats brute-force spending.
Training a language model means adjusting hundreds of billions of numbers to reduce how wrong the model is. The algorithm is beautifully simple: imagine you’re blindfolded on a mountain and want to reach the lowest valley. You feel the slope under your feet and take a small step downhill. Repeat billions of times. The math hasn’t changed since Rumelhart, Hinton & Williams formalized backpropagation in 1986. One of the three authors, Geoffrey Hinton, won the 2024 Nobel Prize in Physics for foundational work on neural networks. What changed: hardware, data, and the transformer architecture it’s applied to.
The gradient ∇θ tells you: “if I nudge this parameter, how much does the error change?” Backpropagation uses the chain rule of calculus to compute gradients for billions of parameters in one backward pass. The learning rate α controls step size—too large and you overshoot, too small and you never arrive.
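The blindfolded-on-a-mountain loop is short enough to write out. A one-parameter sketch (real training computes the gradient for billions of parameters at once via backpropagation):

```python
def gradient_descent(grad_f, theta, alpha=0.1, steps=100):
    # Repeat: feel the slope (gradient), step downhill by alpha * slope.
    for _ in range(steps):
        theta = theta - alpha * grad_f(theta)
    return theta

# Minimize the "mountain" f(x) = (x - 3)^2; its derivative is 2 * (x - 3).
minimum = gradient_descent(lambda x: 2 * (x - 3), theta=0.0)
print(minimum)  # converges toward 3.0, the bottom of the valley
```

Swap the quadratic for a cross-entropy loss and the scalar for a billion-dimensional parameter vector, and this is the training loop of every LLM.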
Why you should care: If you’re studying calculus, you’re learning the exact tool that trains every AI on the planet. Derivatives are not abstract—they descend a billion-parameter mountain every second.
“How many R’s are in the word strawberry?” AI answers: two. The correct answer is three. This became the most-shared LLM failure on the internet. The reason is purely mathematical: GPT-4’s tokenizer splits “strawberry” into [str][aw][berry]. The model never sees individual characters—it sees three tokens. It can’t count letters it can’t see. Fix: ask the model to spell it out letter by letter first, then count. Forcing character-level tokens makes counting trivial.
The Math: Byte Pair Encoding (BPE)
BPE (1994 compression algorithm) iteratively merges the most frequent character pairs until a target vocabulary is reached. GPT-4 has ~100,000 tokens. Each word gets split into subword chunks. The model reasons about tokens, not characters—explaining why LLMs struggle with letter counting, spelling backwards, and rhyming.
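The merge loop at the heart of BPE fits in a short script. This is a bare-bones sketch on a four-word toy corpus (real tokenizers train on terabytes and add byte-level handling):

```python
from collections import Counter

def merge_pair(toks, a, b):
    # Replace every adjacent (a, b) pair with the merged token "ab".
    out, i = [], 0
    while i < len(toks):
        if i + 1 < len(toks) and toks[i] == a and toks[i + 1] == b:
            out.append(a + b)
            i += 2
        else:
            out.append(toks[i])
            i += 1
    return out

def bpe_merges(words, n_merges):
    # Start from single characters; repeatedly merge the most frequent pair.
    tokens = [list(w) for w in words]
    merges = []
    for _ in range(n_merges):
        pairs = Counter()
        for toks in tokens:
            pairs.update(zip(toks, toks[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append(a + b)
        tokens = [merge_pair(toks, a, b) for toks in tokens]
    return tokens, merges

corpus = ["strawberry", "blueberry", "berry", "straw"]
tokens, merges = bpe_merges(corpus, n_merges=8)
print(tokens)  # words split into learned subword chunks, not letters
```

Once "berry" becomes a single token, the model literally never sees its individual R's again—which is the whole strawberry failure in miniature.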
Interactive version coming soon
Why you should care: Try it right now! Ask any AI to count R’s in “strawberry.” The failure reveals how LLMs actually see language—not as letters, but as mathematical tokens.
Count the R’s in “strawberry.” Now think: why does AI get it wrong?
There are 3 R’s: strawberry. AI gets it wrong because its tokenizer splits the word into chunks like [str][aw][berry] — it never sees individual letters.
Attorney Steven Schwartz asked ChatGPT to find legal precedents for a case against Avianca airlines. ChatGPT cited “Varghese v. China Southern Airlines,” “Shaboon v. Egyptair,” and four more cases. When asked to confirm, it said: “These cases indeed exist and can be found in reputable legal databases.” None of them existed. The judge called the legal reasoning “gibberish.” Schwartz was fined $5,000. He had trusted a probability machine to fact-check itself.
The Math: Probability Chains & Compounding Error
$$P(\text{all correct}) = \prod_{i=1}^{n} p_i$$
If each token has a 99% chance of being locally plausible, a 100-token response has probability 0.99^100 ≈ 0.366—a 63% chance of containing at least one error. The model has no truth oracle. “Case name + airline + court” is a high-probability pattern in legal text. Plausible ≠ true.
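The compounding is easy to verify. (Treating tokens as independent is a simplification, but it shows the exponential decay.)

```python
# If every token is independently 99% "locally plausible", the chance
# that an n-token answer is entirely error-free decays exponentially.
p_token = 0.99
for n in (10, 100, 500):
    p_all = p_token ** n
    print(f"{n:>3} tokens: P(flawless) = {p_all:.3f}, "
          f"P(at least one error) = {1 - p_all:.3f}")
```

At 500 tokens—a typical legal-brief paragraph—the chance of a flawless chain is under 1%.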
Why you should care: The AI confidently verified its own fabrications. Understanding probability chains explains why “sounds right” and “is right” are mathematically different things.
In September 2024, researchers published a paper proving that LLM hallucinations are not just engineering bugs—they are mathematically inevitable. Any system that generates text by sampling from a learned probability distribution will sometimes produce false statements. The fix (explicit confidence scoring for every claim) works in theory but is impractical: it would require the model to pause and verify each statement against a factual database, making responses extremely slow and expensive.
The Math: Information-Theoretic Impossibility
The probability distribution over tokens spreads across incorrect possibilities. With ~100K vocabulary items and softmax normalization, there is always non-zero probability mass on wrong tokens. The chain rule of probability compounds: if each step has 1% error rate, a 100-word sentence has ~63% chance of at least one error. Perfect truthfulness requires external verification—something the architecture fundamentally lacks.
Interactive version coming soon
Why you should care: Researchers proved that current AI architectures will sometimes generate false statements—and no amount of training data can fully eliminate it. The limits of this technology are mathematical, not just engineering problems.
Tech entrepreneur David Heinemeier Hansson discovered Apple Card gave him a credit limit 20× higher than his wife’s—despite her having a better credit score. Apple co-founder Steve Wozniak reported the same. The algorithm never explicitly used gender. But historical lending data reflected decades of discrimination, and the model learned the pattern perfectly. In 2024, the CFPB fined Apple $25M and Goldman Sachs $45M.
The Math: Proxy Variables & Fairness Impossibility
A “proxy variable” encodes a protected characteristic indirectly: zip code encodes race, shopping patterns encode gender. Chouldechova’s Impossibility Theorem (2017) proves you cannot simultaneously satisfy equal false positive rates, equal false negative rates, and equal calibration. Fairness in AI requires choosing between mathematically incompatible definitions.
Interactive version coming soon
Why you should care: An algorithm decided women were worth less. The bias wasn’t programmed in—it was learned from data. Same math, different outcome depending on what data you train on.
200,000 employees at the world’s largest bank now use an LLM daily. Their “LLM Suite” won Innovation of the Year. When markets swung sharply in April 2025, the AI tool Coach helped advisers find information 95% faster. Investment bankers automate 40% of SEC filing analysis. AI-powered fraud detection prevents an estimated $1.5 billion in losses with 98% accuracy across 60+ countries. Total tech budget: $17 billion per year.
The Math: Embeddings + Retrieval
JPMorgan’s system uses Retrieval-Augmented Generation (RAG): financial documents are converted into embedding vectors (the same vectors from Story 2), stored in a database, and retrieved by cosine similarity when a query comes in. The LLM then generates answers grounded in actual documents—reducing hallucinations in high-stakes finance.
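The retrieval step of RAG is just cosine similarity at scale. A minimal sketch with random stand-in vectors (a real system would embed documents and queries with a trained embedding model):

```python
import numpy as np

def retrieve(query_vec, doc_vecs, top_k=2):
    # Cosine similarity between the query and every stored document:
    # normalize everything, then a single matrix-vector product.
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = d @ q
    return np.argsort(sims)[::-1][:top_k]   # indices of the best matches

# Toy "embedding database": 100 documents as 64-dimensional vectors.
rng = np.random.default_rng(2)
docs = rng.normal(size=(100, 64))
query = docs[42] + 0.1 * rng.normal(size=64)  # a query near document 42

best = retrieve(query, docs)
print(best)  # document 42 ranks first
```

The retrieved documents are then pasted into the prompt, so the LLM generates from real text instead of free-associating from its weights.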
Interactive version coming soon
Why you should care: Your future job in banking might involve talking to an AI colleague. The math you’re learning (vectors, probability, optimization) is what makes it work.
When ChatGPT launched, the underlying model was capable but sometimes offensive or dangerous. The fix: Reinforcement Learning from Human Feedback (RLHF). Humans rate pairs of responses (“which is better?”), training a reward model that scores helpfulness. The LLM is then fine-tuned to maximize that score. Result: independent safety evaluations showed significant reductions in harmful outputs. But researchers also found “reward hacking”—models learning to game the reward model rather than genuinely being helpful. A mathematical cat-and-mouse game.
The Math: Constrained Optimization
The idea: maximize how helpful the AI is (reward R) while keeping it close to its original behavior. “Reward hacking” is Goodhart’s Law in action: when a measure becomes a target, it ceases to be a good measure.
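Written out, this trade-off takes the standard KL-regularized form used in RLHF-style fine-tuning (coefficients and expectations vary by implementation):

$$\max_{\pi}\; \mathbb{E}_{y \sim \pi}\big[R(y)\big] \;-\; \beta\, D_{\text{KL}}\big(\pi \,\|\, \pi_{\text{ref}}\big)$$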
KL-divergence D_KL measures how far the fine-tuned model π drifts from the reference model π_ref. β controls the trade-off: too low and the model becomes sycophantic, too high and it ignores human preferences. This is the Lagrangian method from constrained optimization.
Interactive version coming soon
Why you should care: AI safety is a mathematical problem, not just an ethical one. “How do you teach a machine to have values using numbers?” is one of the deepest questions of our time.
If you reward an AI for being helpful, what could go wrong?
It learns to game the reward — like a student who studies to pass the test, not to learn. This is called “reward hacking” and it’s a fundamental challenge in AI alignment.