From a Google intern’s paper to Nvidia losing $589 billion in market cap in a single day. From counting letters in “strawberry” to teaching AI human values. Every story is powered by one thing: mathematics.
~20–30 minutes
What you’ll learn:
How transformers “pay attention” to language
Why bigger models aren’t always smarter
Why AI gives confidently wrong answers
How math can make AI fairer — and where it can’t
I
The Foundations
How LLMs Work
1 / 14
2017
“Attention Is All You Need”
Eight Google researchers—including a 20-year-old intern named Aidan Gomez—published a 15-page paper with a Beatles-inspired title. It became the most cited AI paper in history. Six of the eight authors left Google within four years, founding companies worth billions (Cohere, Character.AI, Inceptive). Noam Shazeer, who designed the attention mechanism, quit in 2021 and was brought back in 2024 for $2.7 billion.
Each word generates a Query (what am I looking for?), Key (what do I offer?), and Value (my content). The dot product Q·K measures relevance. Softmax converts scores to probabilities summing to 1. The result: every word “pays attention” to every other word simultaneously—which is why GPUs can train transformers so fast.
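The Q·K·V recipe above can be sketched in a few lines of NumPy. This is a minimal, illustrative version of scaled dot-product attention with made-up toy dimensions, not a production implementation:

```python
import numpy as np

def softmax(x):
    # Subtract the row max for numerical stability; each row sums to 1.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: relevance = Q·K, scaled by sqrt(d_k).
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # every word scored against every other
    weights = softmax(scores)          # each row becomes a probability distribution
    return weights @ V, weights        # blend the Values by attention weight

# Toy example: 3 "words", each with a 4-dimensional Query, Key, and Value.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))
out, w = attention(Q, K, V)
print(w.sum(axis=1))  # each word's attention weights sum to 1
```

Because every word attends to every other word in one matrix multiplication, the whole computation parallelizes on a GPU.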
Interactive version coming soon
Why you should care: A 20-year-old intern co-wrote the paper that makes ChatGPT, Claude, and Gemini work. All of them run on this one formula.
Before reading on: what do you think “attention” means for an AI?
It means every word looks at every other word and decides how much to “pay attention” to it — weighted by mathematical similarity scores. It’s not human attention; it’s a matrix multiplication.
Google researcher Tomas Mikolov submitted a paper that peer reviewers rejected—at a conference with a 70% acceptance rate. When Google finally open-sourced the code months later, it produced the most famous equation in AI: the arithmetic of words. A neural network trained on billions of words discovered that “King − Man + Woman” lands near “Queen” in vector space. The same paper won the NeurIPS Test of Time Award a decade later.
Every word becomes a vector of 300 numbers. Similar words cluster together. Relationships (gender, royalty, country→capital) appear as consistent directions. Cosine similarity measures how close two words are—the same formula from Section 8 of this talk.
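The word-arithmetic trick can be demonstrated with cosine similarity on hand-made vectors. The two dimensions here ("royalty" and "gender") are purely illustrative stand-ins for the ~300 learned dimensions of a real embedding:

```python
import numpy as np

def cosine(a, b):
    # cos(theta) = a·b / (|a||b|): 1 = same direction, near 0 = unrelated.
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy vectors: dimension 0 ~ "royalty", dimension 1 ~ "gender".
vecs = {
    "king":  np.array([0.9,  0.8]),
    "queen": np.array([0.9, -0.8]),
    "man":   np.array([0.1,  0.8]),
    "woman": np.array([0.1, -0.8]),
}

target = vecs["king"] - vecs["man"] + vecs["woman"]
best = max(vecs, key=lambda w: cosine(target, vecs[w]))
print(best)  # "queen" is nearest by cosine similarity
```

In a trained word2vec model the same arithmetic works because relationships like gender really do appear as consistent directions in the learned space.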
Why you should care: Rejected by reviewers, delayed by bureaucracy—then it changed everything. The math that lets AI “understand” meaning is the same linear algebra you study in school.
In 1951, a 35-year-old mathematician at Bell Labs named Claude Shannon ran a remarkable experiment: he asked people to predict the next letter of a text, one character at a time. If wrong, they were told the correct letter. By counting guesses, Shannon measured the statistical structure of English—finding it has only ~1.1 bits of entropy per character (out of a maximum 4.7). He had described exactly what ChatGPT does: minimize uncertainty about the next token. He did it 71 years before ChatGPT existed.
The Math: Entropy & Perplexity
$$H = -\sum p(x) \log_2 p(x)$$
$$\text{Perplexity} = 2^H$$
Entropy measures average surprise per symbol. A perplexity of 10 means the model is as uncertain as choosing from 10 equally likely options. Training an LLM on the internet is extreme compression of human knowledge—to predict the next word, the model must learn facts, grammar, logic, and culture.
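Shannon's numbers drop straight out of the formula. A short stdlib sketch (the 1.1 bits/char figure is Shannon's published estimate, not computed here):

```python
import math

def entropy(probs):
    # H = -sum p * log2(p): average surprise in bits per symbol.
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Uniform guessing over 26 letters: maximum uncertainty, log2(26) ≈ 4.7 bits.
uniform = [1 / 26] * 26
print(entropy(uniform))       # ≈ 4.70 bits per character
print(2 ** entropy(uniform))  # perplexity 26: like choosing among 26 options

# Shannon's measured ~1.1 bits/char corresponds to a perplexity of only:
print(2 ** 1.1)               # ≈ 2.1 effective choices per character
```

Context collapses a 26-way guess to roughly a 2-way one—exactly the uncertainty a language model is trained to squeeze out.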
Why you should care: The math designed to send telephone signals efficiently in 1948 turned out to be the exact training objective of the most powerful AI systems ever built.
Every word ChatGPT types is the winner of a probability competition among 100,000+ candidates. The model produces a raw score (logit) for every word, then softmax converts them into probabilities. “The” might get 32%, “a” gets 18%, and 100,000 others share the rest. A parameter called temperature controls randomness: at 0, the model always picks the top word (robotic). At 1, it samples proportionally (creative but risky). Every AI conversation is literally a sequence of weighted dice rolls.
The exponential function amplifies differences: a small advantage in raw score becomes a large probability advantage. Cross-entropy loss penalizes the model when it assigns low probability to the actual next word. If p = 0.01, loss = 4.6 (harsh penalty). If p = 0.99, loss ≈ 0.01 (almost no penalty).
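Softmax, temperature, and cross-entropy fit in a dozen lines. The logits below are invented for illustration; the mechanics are the real thing:

```python
import math

def softmax(logits, temperature=1.0):
    # Divide by temperature, exponentiate, normalize so probabilities sum to 1.
    scaled = [l / temperature for l in logits]
    m = max(scaled)                           # subtract max for stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.4, 0.3]                      # raw scores for three candidate words
print(softmax(logits))                        # T = 1: sample proportionally
print(softmax(logits, temperature=0.1))       # T → 0: top word dominates (robotic)
print(softmax(logits, temperature=10.0))      # high T: near-uniform (risky)

# Cross-entropy loss for the word that actually came next: -ln(p).
p_correct = softmax(logits)[0]
print(-math.log(p_correct))                   # high p → tiny penalty
```

Note how at temperature 0.1 the top word takes almost all the probability mass—the "weighted dice" become loaded dice.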
Why you should care: “Every word is a weighted die roll” explains why AI can be wrong, creative, or inconsistent—and why no two conversations are identical.
Now you know how they work. But what made them unstoppable?
II
The Breakthroughs
What Changed Everything
5 / 14
2022
Chinchilla: Bigger ≠ Smarter
In 2022, DeepMind proved that every major AI lab had the wrong formula. Everyone was building bigger models, assuming more parameters = better. DeepMind’s Chinchilla—with 70B parameters, half the size of Gopher (280B)—outperformed it on almost every benchmark by training on 4× more data. The secret was a simple power law: scale data and model size equally.
For a fixed compute budget C, optimal model size N and training data D should both scale as the square root of compute: N ∝ C^0.5, D ∝ C^0.5. The earlier belief (Kaplan 2020) was N ∝ C^0.73, over-weighting size. Performance follows power laws—the same y = ax^b as Kepler’s planetary laws.
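The Chinchilla recipe reduces to a one-line allocation rule. This sketch uses the commonly quoted approximations C ≈ 6·N·D and ~20 training tokens per parameter, not the paper's exact fitted coefficients:

```python
def chinchilla_optimal(compute_flops):
    # Rule of thumb: C ≈ 6 * N * D with D ≈ 20 * N, so both N and D
    # grow as sqrt(C) -- scale parameters and data equally.
    n_params = (compute_flops / (6 * 20)) ** 0.5
    n_tokens = 20 * n_params
    return n_params, n_tokens

# Chinchilla's own budget (~5.9e23 FLOPs) recovers roughly its actual
# configuration: ~70B parameters trained on ~1.4T tokens.
n, d = chinchilla_optimal(5.88e23)
print(f"{n:.2e} params, {d:.2e} tokens")
```

Under the old Kaplan exponent, the same budget would have bought a much larger model starved of data—exactly Gopher's mistake.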
Why you should care: The most advanced AI lab in the world had the wrong formula for years. Science is self-correcting—and the sweet spot is an elegant mathematical optimum.
In 2022, Google Brain researchers discovered that simply adding one short phrase—“Let’s think step by step”—to any prompt dramatically improved LLM math performance. In 2024, OpenAI’s o1 model took this further: trained with reinforcement learning on reasoning traces, it generates thousands of hidden “thinking tokens” before answering. On the 2024 AIME (top 3% of US math students), GPT-4o scored 12%. o1 scored 93%. More thinking time literally makes AI smarter.
The Math: Test-Time Compute Scaling
Before o1, all compute went into training. Now performance also scales with compute spent at inference—how long the model “thinks.” Mathematically, this is tree search: each thinking step explores a node in a decision tree. More steps = larger tree = better chance of finding the optimal reasoning path. A third dimension of scaling beyond parameters and data.
Interactive version coming soon
Why you should care: The same technique your math teacher tells you to do—show your work, think step by step—is literally the breakthrough that made AI a gold-medal mathematician.
What percentage do you think GPT-4o scores on competition math? Think of your answer, then click to reveal.
12%! But o1 scores 93%. The difference is “thinking time” — letting the model reason step by step before answering.
In January 2025, a two-year-old Chinese company called DeepSeek—founded by hedge fund manager Liang Wenfeng—released model R1 for free. Training compute cost: $5.6 million (vs. GPT-4’s estimated $100M+ total development budget). One week later, it was the #1 app on the US App Store. The same day, Nvidia lost $589 billion in market cap—the largest single-day loss in stock market history. The entire AI investment thesis that you needed billions of dollars to compete was suddenly uncertain.
The Math: Mixture of Experts (MoE)
$$\text{Active params} = K \times E \ll N \times E = \text{Total params}$$
DeepSeek V3 has 671B total parameters but only 37B are active per input (<6%). A routing function selects which K of N specialist sub-networks (“experts”) handle each token. Result: knowledge capacity of 671B, compute cost of 37B. Plus, R1 learned reasoning through pure reinforcement learning—no human labels needed.
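A toy top-K router makes the trick concrete. The sizes below (8 experts, 16 dimensions) are illustrative stand-ins for DeepSeek's hundreds of experts, and the linear "experts" stand in for full sub-networks:

```python
import numpy as np

def moe_forward(x, experts, router, k=2):
    # The router scores every expert, keeps only the top-k,
    # and just those k experts actually run -- the rest cost nothing.
    scores = router @ x                       # one score per expert
    top_k = np.argsort(scores)[-k:]           # indices of the k best experts
    gate = np.exp(scores[top_k])
    gate /= gate.sum()                        # normalize gates to sum to 1
    # Output = gate-weighted sum of the chosen experts' outputs.
    return sum(g * (experts[i] @ x) for g, i in zip(gate, top_k))

rng = np.random.default_rng(1)
n_experts, dim = 8, 16
experts = rng.normal(size=(n_experts, dim, dim))  # 8 expert "networks"
router = rng.normal(size=(n_experts, dim))        # routing function
x = rng.normal(size=dim)                          # one token's vector

y = moe_forward(x, experts, router, k=2)
print(y.shape)  # (16,) -- computed using only 2 of the 8 experts
```

Total parameters scale with all 8 experts; compute per token scales with only 2—the same asymmetry that gives V3 671B capacity at 37B cost.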
Interactive version coming soon
Why you should care: David vs. Goliath with math. A model trained for $6M in compute challenged models with $100M+ budgets, crashed stock markets, and proved that mathematical efficiency beats brute-force spending.
Training a language model means adjusting hundreds of billions of numbers to reduce how wrong the model is. The algorithm is beautifully simple: imagine you’re blindfolded on a mountain and want to reach the lowest valley. You feel the slope under your feet and take a small step downhill. Repeat billions of times. The math hasn’t changed since Rumelhart, Hinton & Williams formalized backpropagation in 1986. One of the three authors, Geoffrey Hinton, won the 2024 Nobel Prize in Physics for foundational work on neural networks. What changed: hardware, data, and the transformer architecture it’s applied to.
The gradient ∇θ tells you: “if I nudge this parameter, how much does the error change?” Backpropagation uses the chain rule of calculus to compute gradients for billions of parameters in one backward pass. The learning rate α controls step size—too large and you overshoot, too small and you never arrive.
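The blindfolded-on-a-mountain loop is short enough to write out. A one-parameter sketch (real training computes the gradient for billions of parameters at once via backpropagation):

```python
def gradient_descent(grad_f, theta, alpha=0.1, steps=100):
    # Repeat: feel the slope (gradient), step downhill by alpha * slope.
    for _ in range(steps):
        theta = theta - alpha * grad_f(theta)
    return theta

# Minimize the "mountain" f(x) = (x - 3)^2; its derivative is 2 * (x - 3).
minimum = gradient_descent(lambda x: 2 * (x - 3), theta=0.0)
print(minimum)  # converges toward 3.0, the bottom of the valley
```

Swap the quadratic for a cross-entropy loss and the scalar for a billion-dimensional parameter vector, and this is the training loop of every LLM.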
Why you should care: If you’re studying calculus, you’re learning the exact tool that trains every AI on the planet. Derivatives are not abstract—they descend a billion-parameter mountain every second.
“How many R’s are in the word strawberry?” AI answers: two. The correct answer is three. This became the most-shared LLM failure on the internet. The reason is purely mathematical: GPT-4’s tokenizer splits “strawberry” into [str][aw][berry]. The model never sees individual characters—it sees three tokens. It can’t count letters it can’t see. Fix: ask the model to spell it out letter by letter first, then count. Forcing character-level tokens makes counting trivial.
The Math: Byte Pair Encoding (BPE)
BPE (1994 compression algorithm) iteratively merges the most frequent character pairs until a target vocabulary is reached. GPT-4 has ~100,000 tokens. Each word gets split into subword chunks. The model reasons about tokens, not characters—explaining why LLMs struggle with letter counting, spelling backwards, and rhyming.
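The merge loop at the heart of BPE fits in a short script. This is a bare-bones sketch on a four-word toy corpus (real tokenizers train on terabytes and add byte-level handling):

```python
from collections import Counter

def merge_pair(toks, a, b):
    # Replace every adjacent (a, b) pair with the merged token "ab".
    out, i = [], 0
    while i < len(toks):
        if i + 1 < len(toks) and toks[i] == a and toks[i + 1] == b:
            out.append(a + b)
            i += 2
        else:
            out.append(toks[i])
            i += 1
    return out

def bpe_merges(words, n_merges):
    # Start from single characters; repeatedly merge the most frequent pair.
    tokens = [list(w) for w in words]
    merges = []
    for _ in range(n_merges):
        pairs = Counter()
        for toks in tokens:
            pairs.update(zip(toks, toks[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append(a + b)
        tokens = [merge_pair(toks, a, b) for toks in tokens]
    return tokens, merges

corpus = ["strawberry", "blueberry", "berry", "straw"]
tokens, merges = bpe_merges(corpus, n_merges=8)
print(tokens)  # words split into learned subword chunks, not letters
```

Once "berry" becomes a single token, the model literally never sees its individual R's again—which is the whole strawberry failure in miniature.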
Interactive version coming soon
Why you should care: Try it right now! Ask any AI to count R’s in “strawberry.” The failure reveals how LLMs actually see language—not as letters, but as mathematical tokens.
Count the R’s in “strawberry.” Now think: why does AI get it wrong?
There are 3 R’s: strawberry. AI gets it wrong because its tokenizer splits the word into chunks like [str][aw][berry] — it never sees individual letters.
Attorney Steven Schwartz asked ChatGPT to find legal precedents for a case against Avianca airlines. ChatGPT cited “Varghese v. China Southern Airlines,” “Shaboon v. Egyptair,” and four more cases. When asked to confirm, it said: “These cases indeed exist and can be found in reputable legal databases.” None of them existed. The judge called the legal reasoning “gibberish.” Schwartz was fined $5,000. He had trusted a probability machine to fact-check itself.
The Math: Probability Chains & Compounding Error
$$P(\text{all correct}) = \prod_{i=1}^{n} p_i$$
If each token has a 99% chance of being locally plausible, a 100-token response has probability 0.99^100 ≈ 0.366—a 63% chance of containing at least one error. The model has no truth oracle. “Case name + airline + court” is a high-probability pattern in legal text. Plausible ≠ true.
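The compounding is easy to verify. (Treating tokens as independent is a simplification, but it shows the exponential decay.)

```python
# If every token is independently 99% "locally plausible", the chance
# that an n-token answer is entirely error-free decays exponentially.
p_token = 0.99
for n in (10, 100, 500):
    p_all = p_token ** n
    print(f"{n:>3} tokens: P(flawless) = {p_all:.3f}, "
          f"P(at least one error) = {1 - p_all:.3f}")
```

At 500 tokens—a typical legal-brief paragraph—the chance of a flawless chain is under 1%.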
Why you should care: The AI confidently verified its own fabrications. Understanding probability chains explains why “sounds right” and “is right” are mathematically different things.
In September 2024, researchers published a paper proving that LLM hallucinations are not just engineering bugs—they are mathematically inevitable. Any system that generates text by sampling from a learned probability distribution will sometimes produce false statements. The fix (explicit confidence scoring for every claim) works in theory but is impractical: it would require the model to pause and verify each statement against a factual database, making responses extremely slow and expensive.
The Math: Information-Theoretic Impossibility
The probability distribution over tokens spreads across incorrect possibilities. With ~100K vocabulary items and softmax normalization, there is always non-zero probability mass on wrong tokens. The chain rule of probability compounds: if each step has 1% error rate, a 100-word sentence has ~63% chance of at least one error. Perfect truthfulness requires external verification—something the architecture fundamentally lacks.
Interactive version coming soon
Why you should care: Researchers proved that current AI architectures will sometimes generate false statements—and no amount of training data can fully eliminate it. The limits of this technology are mathematical, not just engineering problems.
Tech entrepreneur David Heinemeier Hansson discovered Apple Card gave him a credit limit 20× higher than his wife’s—despite her having a better credit score. Apple co-founder Steve Wozniak reported the same. The algorithm never explicitly used gender. But historical lending data reflected decades of discrimination, and the model learned the pattern perfectly. In 2024, the CFPB fined Apple $25M and Goldman Sachs $45M.
The Math: Proxy Variables & Fairness Impossibility
A “proxy variable” encodes a protected characteristic indirectly: zip code encodes race, shopping patterns encode gender. Chouldechova’s Impossibility Theorem (2017) proves you cannot simultaneously satisfy equal false positive rates, equal false negative rates, and equal calibration. Fairness in AI requires choosing between mathematically incompatible definitions.
Interactive version coming soon
Why you should care: An algorithm decided women were worth less. The bias wasn’t programmed in—it was learned from data. Same math, different outcome depending on what data you train on.
200,000 employees at the world’s largest bank now use an LLM daily. Their “LLM Suite” won Innovation of the Year. When markets swung sharply in April 2025, the AI tool Coach helped advisers find information 95% faster. Investment bankers automate 40% of SEC filing analysis. AI-powered fraud detection prevents an estimated $1.5 billion in losses with 98% accuracy across 60+ countries. Total tech budget: $17 billion per year.
The Math: Embeddings + Retrieval
JPMorgan’s system uses Retrieval-Augmented Generation (RAG): financial documents are converted into embedding vectors (the same vectors from Story 2), stored in a database, and retrieved by cosine similarity when a query comes in. The LLM then generates answers grounded in actual documents—reducing hallucinations in high-stakes finance.
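The retrieval step of RAG is just cosine similarity at scale. A minimal sketch with random stand-in vectors (a real system would embed documents and queries with a trained embedding model):

```python
import numpy as np

def retrieve(query_vec, doc_vecs, top_k=2):
    # Cosine similarity between the query and every stored document:
    # normalize everything, then a single matrix-vector product.
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = d @ q
    return np.argsort(sims)[::-1][:top_k]   # indices of the best matches

# Toy "embedding database": 100 documents as 64-dimensional vectors.
rng = np.random.default_rng(2)
docs = rng.normal(size=(100, 64))
query = docs[42] + 0.1 * rng.normal(size=64)  # a query near document 42

best = retrieve(query, docs)
print(best)  # document 42 ranks first
```

The retrieved documents are then pasted into the prompt, so the LLM generates from real text instead of free-associating from its weights.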
Interactive version coming soon
Why you should care: Your future job in banking might involve talking to an AI colleague. The math you’re learning (vectors, probability, optimization) is what makes it work.
When ChatGPT launched, the underlying model was capable but sometimes offensive or dangerous. The fix: Reinforcement Learning from Human Feedback (RLHF). Humans rate pairs of responses (“which is better?”), training a reward model that scores helpfulness. The LLM is then fine-tuned to maximize that score. Result: independent safety evaluations showed significant reductions in harmful outputs. But researchers also found “reward hacking”—models learning to game the reward model rather than genuinely being helpful. A mathematical cat-and-mouse game.
The Math: Constrained Optimization
The idea: maximize how helpful the AI is (reward R) while keeping it close to its original behavior. “Reward hacking” is Goodhart’s Law in action: when a measure becomes a target, it ceases to be a good measure.
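Written out, this trade-off takes the standard KL-regularized form used in RLHF-style fine-tuning (coefficients and expectations vary by implementation):

$$\max_{\pi}\; \mathbb{E}_{y \sim \pi}\big[R(y)\big] \;-\; \beta\, D_{\text{KL}}\big(\pi \,\|\, \pi_{\text{ref}}\big)$$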
KL-divergence D_KL measures how far the fine-tuned model π drifts from the reference model π_ref. β controls the trade-off: too low and the model becomes sycophantic, too high and it ignores human preferences. This is the Lagrangian method from constrained optimization.
Interactive version coming soon
Why you should care: AI safety is a mathematical problem, not just an ethical one. “How do you teach a machine to have values using numbers?” is one of the deepest questions of our time.
If you reward an AI for being helpful, what could go wrong?
It learns to game the reward — like a student who studies to pass the test, not to learn. This is called “reward hacking” and it’s a fundamental challenge in AI alignment.