Every time an AI says it’s “92% confident” that your email is spam, or a language model chooses the word “the” over “a,” it’s using probability theory — a branch of mathematics born from a 17th-century gambling problem. This lecture traces the 370-year journey from a letter between two French mathematicians to the softmax function running inside every large language model on Earth.
The Timeline
Blaise Pascal & Pierre de Fermat
The Problem of Points — two gamblers must stop a game early. How do you fairly split the stakes? Pascal and Fermat’s famous correspondence invented probability theory itself. Pascal’s approach was revolutionary in its simplicity: enumerate all possible future outcomes and count the favorable ones.
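Pascal's method, enumerate the possible futures and count the favorable ones, fits in a few lines of Python. This is a minimal sketch; the `split_stakes` name and the first-to-three framing are illustrative choices, not from the original correspondence:

```python
from itertools import product

def split_stakes(needed_a, needed_b):
    """Pascal's enumeration: list every way the remaining rounds could
    go, count how many end with player A winning, split proportionally."""
    # At most needed_a + needed_b - 1 further rounds settle the game.
    rounds = needed_a + needed_b - 1
    a_wins = sum(
        1 for outcome in product("AB", repeat=rounds)
        if outcome.count("A") >= needed_a
    )
    share_a = a_wins / 2 ** rounds
    return share_a, 1 - share_a

# The famous scenario: first to 3 points, interrupted with A ahead 2-1,
# so A needs 1 more win and B needs 2.
print(split_stakes(1, 2))  # (0.75, 0.25): A gets three quarters
```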
Before Pascal, probability didn’t exist as mathematics. Games of chance were considered the domain of fate, not numbers.
Jacob Bernoulli
Published posthumously in Ars Conjectandi. Bernoulli proved the Law of Large Numbers: as you repeat an experiment, the observed frequency converges to the true probability. This was the first formal limit theorem — connecting finite observations to infinite truth.
Bernoulli took 20 years to prove this. It bridges the gap between theory and observation — the same gap AI must cross.
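Bernoulli's theorem is easy to watch in simulation. A minimal Python sketch (the `estimate` helper is an illustrative name):

```python
import random

def estimate(p_true, n_trials, seed=0):
    """Estimate an unknown probability by its observed frequency."""
    rng = random.Random(seed)
    hits = sum(rng.random() < p_true for _ in range(n_trials))
    return hits / n_trials

# The observed frequency drifts toward the true probability (0.3)
# as trials accumulate: Bernoulli's Law of Large Numbers in action.
for n in (100, 10_000, 1_000_000):
    print(n, estimate(0.3, n))
```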
Thomas Bayes (published by Richard Price)
An essay published posthumously. Bayes asked: if I observe evidence, how should I update my beliefs? The answer inverts conditional probability: $P(H \mid E) = P(E \mid H)\,P(H) / P(E)$. Largely overlooked for nearly two centuries, the idea then became the foundation of spam filters, medical-diagnosis AI, and every Bayesian neural network.
The Unsolved Debate: Bayesians vs. Frequentists raged for 200 years. Are probabilities beliefs (Bayesian) or frequencies (Frequentist)? Machine learning uses both.
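Bayes' update rule is one line of arithmetic. A Python sketch of the spam-filter case; the numbers below are made up purely for illustration:

```python
def bayes_update(prior, likelihood, evidence_prob):
    """Bayes' rule: P(H | E) = P(E | H) * P(H) / P(E)."""
    return likelihood * prior / evidence_prob

# Illustrative (made-up) numbers: 40% of mail is spam, and the word
# "free" appears in 60% of spam but only 5% of legitimate mail.
p_spam = 0.4
p_free_given_spam = 0.6
p_free_given_ham = 0.05
p_free = p_free_given_spam * p_spam + p_free_given_ham * (1 - p_spam)

posterior = bayes_update(p_spam, p_free_given_spam, p_free)
print(round(posterior, 3))  # 0.889: seeing "free" lifts P(spam) from 0.40 to ~0.89
```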
Carl Friedrich Gauss & Pierre-Simon Laplace
Gauss used the bell curve to model errors in astronomical observations. Laplace proved the Central Limit Theorem: whatever the original distribution (provided it has finite variance), averages of large samples are approximately normal. This is why the bell curve appears everywhere — from weight initialization in neural networks to the noise in diffusion models.
Neural network weights are typically initialized from Gaussian distributions, and CLT-style reasoning about sums of many small contributions helps explain why this works so well.
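A quick simulation shows the effect. This Python sketch averages samples from a flat (uniform) distribution, which is about as un-bell-shaped as a distribution gets:

```python
import random
import statistics

def sample_means(n_samples, sample_size, seed=0):
    """Means of many small samples drawn from Uniform(0, 1)."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.random() for _ in range(sample_size))
        for _ in range(n_samples)
    ]

means = sample_means(5_000, 30)
# Uniform(0, 1) has mean 1/2 and variance 1/12, so the CLT predicts the
# sample means cluster normally around 0.5 with std dev
# sqrt((1/12) / 30), about 0.053.
print(statistics.fmean(means), statistics.stdev(means))
```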
Andrey Markov
Markov studied sequences where the next state depends only on the current state, not the full history. He applied this to analyze the alternation of vowels and consonants in Pushkin’s Eugene Onegin. This “memoryless” property became the foundation of language modeling — predicting the next word from recent context.
The Markov property — “the future depends only on the present, not the past” — makes the Markov chain both the oldest and the simplest language model.
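A bigram Markov chain, in the spirit of Markov's Pushkin analysis, takes only a few lines of Python. The toy corpus below is illustrative:

```python
import random
from collections import defaultdict

def build_chain(words):
    """Record, for each word, every word that ever follows it.
    The Markov property: that list is all the model remembers."""
    chain = defaultdict(list)
    for current, nxt in zip(words, words[1:]):
        chain[current].append(nxt)
    return chain

def generate(chain, start, max_words, seed=0):
    """Walk the chain: each step looks only at the current word."""
    rng = random.Random(seed)
    out = [start]
    while len(out) < max_words and chain[out[-1]]:
        out.append(rng.choice(chain[out[-1]]))
    return " ".join(out)

corpus = "the cat sat on the mat and the cat ran".split()
chain = build_chain(corpus)
print(generate(chain, "the", 8))
```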
Claude Shannon
Shannon’s A Mathematical Theory of Communication defined information as surprise. High-probability events carry little information; low-probability events carry a lot. Shannon entropy measures the average surprise in a probability distribution. This became the loss function for training every language model.
Shannon used Markov chains to generate random English text in 1948. It was arguably the first language model.
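Shannon's measure is nearly a one-liner. A Python sketch:

```python
import math

def entropy(probs):
    """Shannon entropy in bits: the average surprise, -sum p * log2(p)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))    # fair coin: 1.0 bit of surprise per flip
print(entropy([0.99, 0.01]))  # loaded coin: far less average surprise
print(entropy([0.25] * 4))    # four equal outcomes: 2.0 bits
```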
Solomon Kullback & Richard Leibler
KL divergence measures how one probability distribution differs from another. Cross-entropy combines this with Shannon’s entropy. When training an LLM, cross-entropy loss measures how different the model’s predicted word probabilities are from the actual next word. Minimizing this loss IS the entire training objective.
$$H(p, q) = -\sum_{x} p(x) \log q(x) = H(p) + D_{\mathrm{KL}}(p \parallel q)$$

Where $p$ is the true distribution (actual next word) and $q$ is the model’s prediction.
Every LLM in existence — GPT-4, Claude, Gemini, Llama — is trained by minimizing cross-entropy loss. This one equation drives the entire field.
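For a single next-word prediction the loss collapses to one term, because the true distribution is one-hot on the actual next word. A minimal Python sketch; the three-word vocabulary is illustrative:

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) * log q(x), in nats."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

# Vocabulary of three words; the actual next word is "the".
p_true = [1.0, 0.0, 0.0]   # one-hot: the real next word
q_model = [0.7, 0.2, 0.1]  # the model's predicted distribution

loss = cross_entropy(p_true, q_model)
print(loss)  # equals -log(0.7): only the correct word's term survives
```

Training pushes `q_model` to put more mass on the correct word, which drives the loss toward zero.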
The Transformer Revolution
The softmax function converts raw neural network outputs (logits) into a probability distribution. Temperature controls how “confident” or “creative” the model is. At temperature 0, the model always picks the most likely word. At high temperature, it explores more creative choices. This is pure probability theory applied to language generation.
$$p_i = \frac{e^{z_i / T}}{\sum_j e^{z_j / T}}$$

Where $z_i$ are the logits and $T$ is the temperature parameter. $T \to 0$: deterministic. $T \to \infty$: uniform random.
When you adjust the “creativity” slider in ChatGPT, you’re adjusting the temperature parameter in this 370-year-old probability equation.
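Softmax with temperature is short enough to write out in full. A Python sketch, using the standard subtract-the-max trick for numerical stability:

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw logits into a probability distribution.
    Dividing by the temperature before exponentiating sharpens
    (small T) or flattens (large T) the result."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtracting the max avoids overflow in exp
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
for t in (0.1, 1.0, 10.0):
    print(t, [round(p, 3) for p in softmax(logits, t)])
# t = 0.1: almost all mass on the top logit (near-deterministic)
# t = 10:  the distribution flattens toward uniform
```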
The Thread That Connects
This 370-year journey from a gambling problem to softmax temperature shows that probability was always the mathematics of uncertainty — and AI is fundamentally about managing uncertainty. Every milestone on this timeline tackled the same question in a different way: How do we make rational decisions when we don’t know what will happen next? Pascal counted outcomes. Bayes updated beliefs. Shannon measured surprise. And modern transformers convert all of it into the probability distributions that power every AI system you use today.
Connections to Other Lectures
- Lecture 6: Number Theory & Encoding — How tokenization converts text into the discrete symbols that probability distributions operate on.
- Lecture 7: Statistics & Learning Theory — How statistical learning theory proves that minimizing cross-entropy on training data generalizes to unseen data.
- Lecture 4: Logic & Computation — How Boolean logic and Turing machines provide the computational substrate on which probability calculations run.