How does a machine learn from data without simply memorizing it? The answer lies in a 200-year statistical tradition: from Gauss fitting curves to astronomical data, to modern scaling laws that predict how much data an LLM needs. Statistics is the science of learning from limited observations — exactly what AI must do.
The Timeline
Carl Friedrich Gauss & Adrien-Marie Legendre
Gauss used least squares to predict the orbit of the asteroid Ceres from just a few observations — and was proven right when Ceres reappeared almost exactly where he predicted. Least squares minimizes the sum of squared errors between predictions and observations. It’s arguably the first machine learning algorithm, predating the term by more than 150 years.
Gauss was 24 when he predicted Ceres’ orbit. His method — minimize squared errors — is still the loss function for linear regression, the simplest machine learning model.
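Gauss’s method fits in a few lines of code. A minimal sketch of ordinary least squares for a line $y = ax + b$, using the closed-form normal-equation solution (the observations below are made up):

```python
# Ordinary least squares for a line y = a*x + b, minimizing the sum of
# squared errors between predictions and observations (Gauss/Legendre).

def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Closed form: slope = covariance(x, y) / variance(x)
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    b = mean_y - a * mean_x  # intercept from the means
    return a, b

# Noisy observations of the underlying line y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]
a, b = fit_line(xs, ys)
print(a, b)  # close to 2 and 1
```

The same closed-form idea, generalized to many features, is exactly what `LinearRegression` solvers in modern libraries compute.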
Ronald A. Fisher
Fisher asked: given observed data, what parameter values make the data most likely? Maximum likelihood estimation (MLE) finds the parameters that maximize the probability of seeing what we actually saw. This is exactly how neural networks are trained: find the weights that make the training data most likely under the model.
Minimizing cross-entropy loss IS maximum likelihood estimation. Every LLM training run is Fisher’s MLE applied at massive scale.
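The identity can be shown in miniature: for a classifier, the average negative log-likelihood of the true labels equals the average cross-entropy loss against one-hot targets. The probabilities below are made up:

```python
import math

# probs_of_true_class[i] = model's probability for sample i's TRUE label
probs_of_true_class = [0.7, 0.9, 0.4]
n = len(probs_of_true_class)

# Average negative log-likelihood of the observed labels (Fisher's MLE objective)
nll = -sum(math.log(p) for p in probs_of_true_class) / n

# Average cross-entropy against one-hot targets: only the true-class term survives
cross_entropy = sum(-math.log(p) for p in probs_of_true_class) / n

print(nll == cross_entropy)  # True: they are the same quantity
```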
Jerzy Neyman & Egon Pearson
How do you decide if a pattern in data is real or just random noise? Neyman and Pearson formalized hypothesis testing with Type I errors (false positives) and Type II errors (false negatives). This framework — balancing false alarms against missed detections — directly maps to precision and recall in AI classification, and to the false positive/negative tradeoffs in fraud detection.
Type I: $P(\text{reject } H_0 | H_0 \text{ true}) = \alpha$ • Type II: $P(\text{fail to reject } H_0 | H_0 \text{ false}) = \beta$
When an AI fraud detector flags a legitimate transaction (false positive) or misses a fraudulent one (false negative), it’s navigating the Neyman-Pearson tradeoff.
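The mapping is mechanical once you have a confusion matrix. A toy example with made-up labels, where $H_0$ = “transaction is legitimate” and flagging a transaction = rejecting $H_0$:

```python
# 1 = fraud, 0 = legitimate (made-up data)
actual  = [1, 0, 0, 1, 1, 0, 0, 0, 1, 0]
flagged = [1, 0, 1, 0, 1, 0, 0, 0, 1, 0]

tp = sum(1 for a, f in zip(actual, flagged) if a == 1 and f == 1)
fp = sum(1 for a, f in zip(actual, flagged) if a == 0 and f == 1)  # Type I errors
fn = sum(1 for a, f in zip(actual, flagged) if a == 1 and f == 0)  # Type II errors
tn = sum(1 for a, f in zip(actual, flagged) if a == 0 and f == 0)

alpha = fp / (fp + tn)   # Type I error rate: legit transactions wrongly flagged
beta = fn / (fn + tp)    # Type II error rate: frauds missed
precision = tp / (tp + fp)
recall = tp / (tp + fn)  # note: recall = 1 - beta

print(alpha, beta, precision, recall)
```

Tightening the detector’s threshold lowers `alpha` at the cost of raising `beta` — the Neyman-Pearson tradeoff in code.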
Multiple Contributors
A model that’s too simple (high bias) misses real patterns. A model that’s too complex (high variance) memorizes noise. The bias-variance tradeoff says you can’t minimize both simultaneously. For decades, this limited model complexity. Then deep learning upended the tradeoff — enormous models with billions of parameters somehow generalize well, and test error can even fall again as models grow past the point of fitting the training data perfectly, a phenomenon called “double descent.”
Classical statistics says more parameters = more overfitting. But GPT-4 — widely reported, though never confirmed by OpenAI, to have over a trillion parameters — generalizes remarkably well. The classical theory breaks down at scale — and we don’t fully understand why.
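The classical decomposition itself is easy to verify by simulation. A sketch using shrinkage estimators of a known mean (all numbers made up): the estimator $c \cdot \bar{x}$ with $c < 1$ is biased but has lower variance, and its mean squared error splits exactly into bias² + variance.

```python
import random

random.seed(0)
true_mean, n, trials = 10.0, 5, 20000

def mse_decomposition(c):
    """Empirical bias^2 and variance of the shrinkage estimator c * sample_mean."""
    estimates = []
    for _ in range(trials):
        sample = [random.gauss(true_mean, 5.0) for _ in range(n)]
        estimates.append(c * sum(sample) / n)
    avg = sum(estimates) / trials
    bias_sq = (avg - true_mean) ** 2
    variance = sum((e - avg) ** 2 for e in estimates) / trials
    return bias_sq, variance

for c in (1.0, 0.9, 0.5):
    b2, v = mse_decomposition(c)
    print(f"c={c}: bias^2={b2:.2f}  variance={v:.2f}  mse={b2 + v:.2f}")
```

As `c` shrinks, variance falls while squared bias grows — neither can be driven to zero without inflating the other.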
Vladimir Vapnik & Leslie Valiant
Vapnik’s VC (Vapnik-Chervonenkis) dimension measures a model’s capacity — how complex a pattern it can learn. Valiant’s PAC (Probably Approximately Correct) framework asks: how much data do you need to learn a good model with high probability? Together, they created computational learning theory — the first rigorous theory of when and why machines can learn.
VC dimension: the largest set of points that can be shattered (correctly classified in all possible ways).
$$m \geq \frac{1}{\epsilon}\left(\ln\frac{1}{\delta} + d_{VC} \ln\frac{1}{\epsilon}\right)$$

VC theory says the number of training examples you need grows with the model’s VC dimension — which, for many model classes, tracks the parameter count. LLMs violate this bound by orders of magnitude — and still work. This is one of the deepest puzzles in modern ML theory.
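Plugging concrete numbers into the sample-complexity bound above makes it tangible. A linear classifier in 10 dimensions has VC dimension $d + 1 = 11$; asking for 95% accuracy with 95% confidence ($\epsilon = \delta = 0.05$) gives:

```python
import math

def vc_sample_bound(epsilon, delta, d_vc):
    """m >= (1/eps) * (ln(1/delta) + d_vc * ln(1/eps)), rounded up."""
    return math.ceil((1 / epsilon) * (math.log(1 / delta)
                                      + d_vc * math.log(1 / epsilon)))

# Linear classifier in 10 dimensions: VC dimension = 11
print(vc_sample_bound(0.05, 0.05, 11))  # 719 examples suffice per the bound
```

Scale `d_vc` up toward billions of parameters and the bound demands billions of examples per task — yet LLMs generalize from far less task-specific data.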
Robert Tibshirani
Regularization adds a penalty for model complexity, preventing overfitting. The Lasso (L1 regularization) forces many parameters to exactly zero, performing automatic feature selection. Weight decay (L2 regularization) keeps parameters small. In LLMs, dropout (randomly zeroing neurons during training) and weight decay are essential regularization techniques.
Lasso (L1): $\min_{\beta} \sum(y_i - \mathbf{x}_i^T\beta)^2 + \lambda\|\beta\|_1$
Ridge (L2): $\min_{\beta} \sum(y_i - \mathbf{x}_i^T\beta)^2 + \lambda\|\beta\|_2^2$
Weight decay (a form of L2 regularization) is applied in virtually every LLM training run. AdamW — now the de facto standard optimizer for large-scale training — was created specifically to apply weight decay correctly inside Adam.
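The shrinkage effect of the L2 penalty is easiest to see in one dimension, where ridge regression (no intercept) has a one-line closed form, $\beta = \sum x_i y_i / (\sum x_i^2 + \lambda)$. The data below are made up:

```python
def ridge_1d(xs, ys, lam):
    """Closed-form 1-D ridge coefficient: the penalty lam shrinks beta toward 0."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.9]   # roughly y = 2x
for lam in (0.0, 1.0, 10.0):
    print(lam, ridge_1d(xs, ys, lam))  # coefficient shrinks as lambda grows
```

The Lasso has no such closed form in general — the L1 penalty’s kink at zero is what forces coefficients exactly to zero, and it requires an iterative solver (e.g. coordinate descent).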
Jared Kaplan et al. (OpenAI)
Kaplan et al. discovered that LLM performance follows predictable power laws: loss decreases as a power of model size, dataset size, and compute. This was revolutionary — it meant you could predict how well a model would perform before training it, just from its size. Scaling laws enabled the “race to scale” that produced GPT-3, GPT-4, and beyond.
$$L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \quad L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \quad L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}$$

Where $N$ = parameters, $D$ = dataset size, $C$ = compute, and the $N_c, D_c, C_c$ and $\alpha$ values are empirically fitted constants and power-law exponents.
Scaling laws are empirical statistical regularities. OpenAI reported predicting GPT-4’s final loss from much smaller training runs, before the full model was ever trained. This is statistics as prophecy.
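Evaluating the parameter-count law is a one-liner. A sketch using the fitted constants reported in Kaplan et al. ($N_c \approx 8.8 \times 10^{13}$, $\alpha_N \approx 0.076$); exact values vary with the fitting setup:

```python
def loss_from_params(n, n_c=8.8e13, alpha=0.076):
    """Kaplan-style power law: predicted loss L(N) = (N_c / N) ** alpha_N."""
    return (n_c / n) ** alpha

# Loss falls predictably and smoothly as parameter count grows
for n in (1e8, 1e9, 1e10, 1e11):
    print(f"{n:.0e} params -> predicted loss {loss_from_params(n):.3f}")
```

Each 10x increase in parameters shaves a fixed multiplicative factor off the loss — which is what makes performance predictable before training.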
Jordan Hoffmann et al. (DeepMind)
The Chinchilla paper revealed that most LLMs were trained on too little data relative to their size. The compute-optimal ratio: for a given compute budget, the number of training tokens should scale roughly linearly with model parameters. This shifted the field from “bigger models” to “more data” — leading to the data-centric AI movement.
Chinchilla (70B parameters, 1.4T tokens) outperformed the much larger Gopher (280B parameters, 300B tokens). Statistics showed that data quality and quantity matter as much as model size.
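The compute-optimal allocation can be sketched under two standard approximations: training FLOPs $C \approx 6ND$, and Chinchilla’s finding of roughly 20 training tokens per parameter (the exact ratio depends on the fit):

```python
import math

TOKENS_PER_PARAM = 20  # Chinchilla's approximate compute-optimal ratio

def compute_optimal(flops):
    """Given a FLOP budget C ~ 6*N*D with D = 20*N, solve for N and D."""
    n = math.sqrt(flops / (6 * TOKENS_PER_PARAM))  # N = sqrt(C / 120)
    return n, TOKENS_PER_PARAM * n

# Roughly Chinchilla's own budget: 6 * 70e9 params * 1.4e12 tokens ~ 5.8e23 FLOPs
n, d = compute_optimal(5.8e23)
print(f"params ~{n:.2e}, tokens ~{d:.2e}")  # recovers ~70B params, ~1.4T tokens
```

Running the same budget through the pre-Chinchilla recipe (much larger $N$, far fewer tokens) is exactly how Gopher ended up undertrained.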
The Thread That Connects
From Gauss’s asteroid prediction to Chinchilla’s scaling laws, statistics has always asked the same question: how do you learn reliable truths from limited data? AI is the ultimate test of this question — and the answer keeps evolving.
Connections to Other Lectures
- Lecture 1: Probability & Uncertainty — Cross-entropy and maximum likelihood estimation bridge probability theory and statistical learning.
- Lecture 3: Calculus & Optimization — Gradient descent is the optimization engine that minimizes the loss functions statistics defines.
- Lecture 10: Game Theory & Strategic AI — RLHF uses statistical reward models to fine-tune LLMs through strategic interaction.