Every time an AI says it’s “92% confident” that your email is spam, or a language model chooses the word “the” over “a,” it’s using probability theory — a branch of mathematics born from a 17th-century gambling problem. This lecture traces the 370-year journey from a letter between two French mathematicians to the softmax function running inside every large language model on Earth.
The Timeline
Blaise Pascal & Pierre de Fermat
The Problem of Points — two gamblers must stop a game early. How do you fairly split the stakes? Pascal and Fermat’s famous correspondence invented probability theory itself. Pascal’s approach was revolutionary in its simplicity: enumerate all possible future outcomes and count the favorable ones.
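Pascal's method, enumerate the possible futures and count the favorable ones, fits in a few lines of Python. This is a minimal sketch; the `split_stakes` name and the first-to-three framing are illustrative choices, not from the original correspondence:

```python
from itertools import product

def split_stakes(needed_a, needed_b):
    """Pascal's enumeration: list every way the remaining rounds could
    go, count how many end with player A winning, split proportionally."""
    # At most needed_a + needed_b - 1 further rounds settle the game.
    rounds = needed_a + needed_b - 1
    a_wins = sum(
        1 for outcome in product("AB", repeat=rounds)
        if outcome.count("A") >= needed_a
    )
    share_a = a_wins / 2 ** rounds
    return share_a, 1 - share_a

# The famous scenario: first to 3 points, interrupted with A ahead 2-1,
# so A needs 1 more win and B needs 2.
print(split_stakes(1, 2))  # (0.75, 0.25): A gets three quarters
```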
Before Pascal, probability didn’t exist as mathematics. Games of chance were considered the domain of fate, not numbers.
Jacob Bernoulli
Published posthumously in Ars Conjectandi. Bernoulli proved the Law of Large Numbers: as you repeat an experiment, the observed frequency converges to the true probability. This was the first formal limit theorem — connecting finite observations to infinite truth.
Bernoulli took 20 years to prove this. It bridges the gap between theory and observation — the same gap AI must cross.
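Bernoulli's theorem is easy to watch in simulation. A minimal Python sketch (the `estimate` helper is an illustrative name):

```python
import random

def estimate(p_true, n_trials, seed=0):
    """Estimate an unknown probability by its observed frequency."""
    rng = random.Random(seed)
    hits = sum(rng.random() < p_true for _ in range(n_trials))
    return hits / n_trials

# The observed frequency drifts toward the true probability (0.3)
# as trials accumulate: Bernoulli's Law of Large Numbers in action.
for n in (100, 10_000, 1_000_000):
    print(n, estimate(0.3, n))
```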
Thomas Bayes (published by Richard Price)
An essay published posthumously. Bayes asked: if I observe evidence, how should I update my beliefs? The answer inverts conditional probability: $P(H \mid E) = P(E \mid H)\,P(H) / P(E)$. Largely overlooked for nearly two centuries, the idea then became the foundation of spam filters, medical-diagnosis AI, and every Bayesian neural network.
The Unsolved Debate: Bayesians vs. Frequentists raged for 200 years. Are probabilities beliefs (Bayesian) or frequencies (Frequentist)? Machine learning uses both.
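Bayes' update rule is one line of arithmetic. A Python sketch of the spam-filter case; the numbers below are made up purely for illustration:

```python
def bayes_update(prior, likelihood, evidence_prob):
    """Bayes' rule: P(H | E) = P(E | H) * P(H) / P(E)."""
    return likelihood * prior / evidence_prob

# Illustrative (made-up) numbers: 40% of mail is spam, and the word
# "free" appears in 60% of spam but only 5% of legitimate mail.
p_spam = 0.4
p_free_given_spam = 0.6
p_free_given_ham = 0.05
p_free = p_free_given_spam * p_spam + p_free_given_ham * (1 - p_spam)

posterior = bayes_update(p_spam, p_free_given_spam, p_free)
print(round(posterior, 3))  # 0.889: seeing "free" lifts P(spam) from 0.40 to ~0.89
```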
Carl Friedrich Gauss & Pierre-Simon Laplace
Gauss used the bell curve to model errors in astronomical observations. Laplace proved the Central Limit Theorem: whatever the original distribution (provided it has finite variance), averages of large samples are approximately normal. This is why the bell curve appears everywhere — from weight initialization in neural networks to the noise in diffusion models.
Neural network weights are typically initialized from Gaussian distributions, and CLT-style reasoning about sums of many small contributions helps explain why this works so well.
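A quick simulation shows the effect. This Python sketch averages samples from a flat (uniform) distribution, which is about as un-bell-shaped as a distribution gets:

```python
import random
import statistics

def sample_means(n_samples, sample_size, seed=0):
    """Means of many small samples drawn from Uniform(0, 1)."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.random() for _ in range(sample_size))
        for _ in range(n_samples)
    ]

means = sample_means(5_000, 30)
# Uniform(0, 1) has mean 1/2 and variance 1/12, so the CLT predicts the
# sample means cluster normally around 0.5 with std dev
# sqrt((1/12) / 30), about 0.053.
print(statistics.fmean(means), statistics.stdev(means))
```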
Andrey Markov
Markov studied sequences where the next state depends only on the current state, not the full history. He applied this to analyze the alternation of vowels and consonants in Pushkin’s Eugene Onegin. This “memoryless” property became the foundation of language modeling — predicting the next word from recent context.
The Markov property — “the future depends only on the present, not the past” — makes the Markov chain both the oldest and the simplest language model.
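A bigram Markov chain, in the spirit of Markov's Pushkin analysis, takes only a few lines of Python. The toy corpus below is illustrative:

```python
import random
from collections import defaultdict

def build_chain(words):
    """Record, for each word, every word that ever follows it.
    The Markov property: that list is all the model remembers."""
    chain = defaultdict(list)
    for current, nxt in zip(words, words[1:]):
        chain[current].append(nxt)
    return chain

def generate(chain, start, max_words, seed=0):
    """Walk the chain: each step looks only at the current word."""
    rng = random.Random(seed)
    out = [start]
    while len(out) < max_words and chain[out[-1]]:
        out.append(rng.choice(chain[out[-1]]))
    return " ".join(out)

corpus = "the cat sat on the mat and the cat ran".split()
chain = build_chain(corpus)
print(generate(chain, "the", 8))
```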
Claude Shannon
Shannon’s A Mathematical Theory of Communication defined information as surprise. High-probability events carry little information; low-probability events carry a lot. Shannon entropy measures the average surprise in a probability distribution. This became the loss function for training every language model.
Shannon used Markov chains to generate random English text in 1948. It was arguably the first language model.
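Shannon's measure is nearly a one-liner. A Python sketch:

```python
import math

def entropy(probs):
    """Shannon entropy in bits: the average surprise, -sum p * log2(p)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))    # fair coin: 1.0 bit of surprise per flip
print(entropy([0.99, 0.01]))  # loaded coin: far less average surprise
print(entropy([0.25] * 4))    # four equal outcomes: 2.0 bits
```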
Solomon Kullback & Richard Leibler
KL divergence measures how one probability distribution differs from another. Cross-entropy combines this with Shannon’s entropy. When training an LLM, cross-entropy loss measures how different the model’s predicted word probabilities are from the actual next word. Minimizing this loss IS the entire training objective.
$$H(p, q) = -\sum_{x} p(x) \log q(x) = H(p) + D_{\mathrm{KL}}(p \parallel q)$$

Where $p$ is the true distribution (actual next word) and $q$ is the model’s prediction.
Every LLM in existence — GPT-4, Claude, Gemini, Llama — is trained by minimizing cross-entropy loss. This one equation drives the entire field.
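For a single next-word prediction the loss collapses to one term, because the true distribution is one-hot on the actual next word. A minimal Python sketch; the three-word vocabulary is illustrative:

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) * log q(x), in nats."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

# Vocabulary of three words; the actual next word is "the".
p_true = [1.0, 0.0, 0.0]   # one-hot: the real next word
q_model = [0.7, 0.2, 0.1]  # the model's predicted distribution

loss = cross_entropy(p_true, q_model)
print(loss)  # equals -log(0.7): only the correct word's term survives
```

Training pushes `q_model` to put more mass on the correct word, which drives the loss toward zero.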
The Transformer Revolution
The softmax function converts raw neural network outputs (logits) into a probability distribution. Temperature controls how “confident” or “creative” the model is. At temperature 0, the model always picks the most likely word. At high temperature, it explores more creative choices. This is pure probability theory applied to language generation.
$$p_i = \frac{e^{z_i / T}}{\sum_j e^{z_j / T}}$$

Where $z_i$ are the logits and $T$ is the temperature parameter. $T \to 0$: deterministic. $T \to \infty$: uniform random.
When you adjust the “creativity” slider in ChatGPT, you’re adjusting the temperature parameter in this 370-year-old probability equation.
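Softmax with temperature is short enough to write out in full. A Python sketch, using the standard subtract-the-max trick for numerical stability:

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw logits into a probability distribution.
    Dividing by the temperature before exponentiating sharpens
    (small T) or flattens (large T) the result."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtracting the max avoids overflow in exp
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
for t in (0.1, 1.0, 10.0):
    print(t, [round(p, 3) for p in softmax(logits, t)])
# t = 0.1: almost all mass on the top logit (near-deterministic)
# t = 10:  the distribution flattens toward uniform
```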
The Thread That Connects
This 370-year journey from a gambling problem to softmax temperature shows that probability was always the mathematics of uncertainty — and AI is fundamentally about managing uncertainty. Every milestone on this timeline tackled the same question in a different way: How do we make rational decisions when we don’t know what will happen next? Pascal counted outcomes. Bayes updated beliefs. Shannon measured surprise. And modern transformers convert all of it into the probability distributions that power every AI system you use today.
Connections to Other Lectures
- Lecture 6: Number Theory & Encoding — How tokenization converts text into the discrete symbols that probability distributions operate on.
- Lecture 7: Statistics & Learning Theory — How statistical learning theory proves that minimizing cross-entropy on training data generalizes to unseen data.
- Lecture 4: Logic & Computation — How Boolean logic and Turing machines provide the computational substrate on which probability calculations run.