Every time a language model improves its predictions — learning from trillions of words to produce coherent, meaningful text — it’s performing an act of calculus. The same mathematics that Newton invented to describe falling apples now teaches AI to write poetry, code, and conversation.
The Timeline
Pierre de Fermat
Before calculus formally existed, Fermat developed a method to find the maxima and minima of curves. He would set $f(x+e) \approx f(x)$, subtract and divide by $e$, then discard the remaining terms containing $e$ (in modern terms, let $e \to 0$). This “method of adequality” was the first optimization algorithm: a procedure for finding the peaks and valleys of functions.
Fermat’s insight — that at a maximum or minimum, the slope is zero — is literally the principle behind training every neural network.
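Fermat's procedure can be sketched numerically. Here is a minimal Python illustration (the function $f(x) = x(1-x)$ is chosen for simplicity, a variant of the $x(a-x)$ problems Fermat studied): form the quotient $(f(x+e) - f(x))/e$ and watch it approach zero at the maximum as $e$ shrinks.

```python
# A minimal numerical sketch of Fermat's adequality, applied to
# f(x) = x * (1 - x), whose maximum sits at x = 1/2.

def f(x):
    return x * (1 - x)

def adequality_quotient(x, e):
    """Fermat's quotient (f(x + e) - f(x)) / e, his proto-derivative."""
    return (f(x + e) - f(x)) / e

# At the maximum x = 0.5, the quotient tends to 0 as e shrinks.
for e in (0.1, 0.01, 0.001):
    print(e, adequality_quotient(0.5, e))
```

Away from the maximum the same quotient approaches the slope $1 - 2x$, which is exactly the modern derivative.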
Isaac Newton & Gottfried Wilhelm Leibniz
The most famous priority dispute in mathematics. Newton developed “fluxions” for physics (1665–66, published later). Leibniz independently developed calculus with superior notation (1684). Their combined work unified differentiation and integration through the Fundamental Theorem of Calculus. Newton used it for gravity; we use it for gradient descent.
Leibniz’s notation ($\frac{dy}{dx}$, $\int$) won out over Newton’s dots. We still use Leibniz notation today in every machine learning paper.
Leonhard Euler
Euler extended calculus to functions of multiple variables. The chain rule — the derivative of a composition of functions — is the mathematical heart of backpropagation. If a neural network has layers $f \circ g \circ h$, the chain rule tells us how each layer’s parameters affect the final output.
The chain rule is THE mathematical reason deep learning works. Without it, we couldn’t compute gradients through 100+ layers.
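The chain rule is short enough to show in code. This is a hand-rolled toy (not a real autodiff framework, and the three functions are illustrative): we differentiate $y = f(g(h(x)))$ by multiplying the local derivatives from the outside in, then check against a finite difference.

```python
# Toy three-layer composition y = f(g(h(x))) with hand-coded derivatives.
# Backpropagation is this multiplication, done layer by layer at scale.
import math

def h(x): return 3 * x          # h'(x) = 3
def g(u): return u * u          # g'(u) = 2u
def f(v): return math.sin(v)    # f'(v) = cos(v)

def forward(x):
    return f(g(h(x)))

def backward(x):
    """Chain rule, applied from output back to input."""
    u = h(x)
    v = g(u)
    return math.cos(v) * (2 * u) * 3

# Sanity check against a central finite-difference approximation.
x, eps = 0.7, 1e-6
numeric = (forward(x + eps) - forward(x - eps)) / (2 * eps)
print(backward(x), numeric)
```

The two printed values agree to several decimal places; real frameworks automate exactly this bookkeeping across millions of parameters.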
Leonhard Euler & Joseph-Louis Lagrange
Instead of finding the best point, what if we could find the best curve or path? Variational calculus finds the function that minimizes a quantity (a functional). This idea, optimization over function spaces, prefigured the training of neural networks, where we search for the optimal function within a parameterized family.
The Brachistochrone Problem: What curve allows the fastest descent under gravity? This optimization problem inspired variational calculus, which inspired optimization over neural network parameters.
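The brachistochrone claim can be checked numerically. This is a sanity check with illustrative endpoints, not a derivation: we compare the descent time along a straight ramp from $(0,0)$ to $(\pi, 2)$ (with $y$ measured downward) against the time along the cycloid solution $x = \theta - \sin\theta$, $y = 1 - \cos\theta$.

```python
# Numerical check: straight ramp vs. cycloid descent under gravity g,
# using energy conservation v = sqrt(2 * g * y) for the speed.
import math

g = 9.81
X, Y = math.pi, 2.0

# Straight line: closed-form descent time sqrt(2 * (X^2 + Y^2) / (g * Y)).
t_line = math.sqrt(2 * (X**2 + Y**2) / (g * Y))

# Cycloid: integrate dt = ds / v along theta in [0, pi] (midpoint rule).
n = 100_000
dtheta = math.pi / n
t_cycloid = 0.0
for k in range(n):
    theta = (k + 0.5) * dtheta
    y = 1 - math.cos(theta)
    ds = math.sqrt(2 * y) * dtheta   # arc length element
    v = math.sqrt(2 * g * y)         # speed at depth y
    t_cycloid += ds / v

print(t_line, t_cycloid)  # the cycloid is faster
```

The integrand simplifies to the constant $1/\sqrt{g}$, so the cycloid time is exactly $\pi/\sqrt{g} \approx 1.00$ s, beating the straight ramp's $\approx 1.19$ s.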
Augustin-Louis Cauchy
Cauchy proposed a simple idea: to find the minimum of a function, take small steps in the direction of steepest descent (the negative gradient). This is gradient descent — the engine that trains every neural network. Simple, elegant, and 175 years later still the foundation of AI training.
$$\theta \leftarrow \theta - \eta \, \nabla L(\theta)$$

where $\theta$ are the parameters, $\eta$ is the learning rate, and $\nabla L$ is the gradient of the loss.
Cauchy invented gradient descent to solve systems of equations. Today it trains models reported to have over a trillion parameters.
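Cauchy's rule fits in a dozen lines. This is a sketch on a toy quadratic loss; the loss, starting point, and learning rate are all illustrative choices.

```python
# Gradient descent on L(theta) = (theta - 3)^2, whose minimum is at theta = 3.

def grad_L(theta):
    return 2 * (theta - 3)   # dL/dtheta

theta = 0.0   # initial guess
eta = 0.1     # learning rate
for _ in range(100):
    theta -= eta * grad_L(theta)   # step against the gradient

print(theta)  # converges to ~3.0
```

Training a neural network is this same loop, with $\theta$ holding billions of parameters and the gradient supplied by backpropagation.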
David Rumelhart, Geoffrey Hinton & Ronald Williams
The key insight: use the chain rule to efficiently compute gradients through a neural network, layer by layer, from output back to input. Before backpropagation, training deep networks was computationally impractical. After it, the age of deep learning began. The paper “Learning representations by back-propagating errors” is one of the most cited in all of science.
Backpropagation is just the chain rule, applied systematically. Hinton called it “the calculus of neural networks.”
Sepp Hochreiter, Yoshua Bengio & Kaiming He
Deep networks faced a crisis: gradients vanished or exploded as they propagated through many layers. Hochreiter identified this in 1991. Solutions came gradually: LSTMs (1997), ReLU activation (2010), batch normalization (2015), residual connections (2015). Each was a calculus insight — engineering the gradient flow through the network.
The vanishing gradient was the biggest unsolved problem in deep learning for 25 years. Residual connections (skip connections) finally tamed it by creating gradient “highways.”
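The gradient "highway" effect is easy to see with scalars. This toy sketch (hypothetical depth and weight, not a real network) compares backpropagated gradient factors with and without a skip connection: a plain layer $y = wx$ contributes a factor $w$ to the gradient, while a residual layer $y = x + wx$ contributes $1 + w$.

```python
# Why residual connections tame vanishing gradients: backprop multiplies
# one local-gradient factor per layer.

depth, w = 50, 0.1

plain_grad = 1.0
residual_grad = 1.0
for _ in range(depth):
    plain_grad *= w          # shrinks geometrically toward zero
    residual_grad *= 1 + w   # the "+1" highway keeps the gradient alive

print(plain_grad)      # ~1e-50, effectively vanished
print(residual_grad)   # stays well above 1
```

In real networks the factors are Jacobian matrices rather than scalars, but the identity path in $y = x + F(x)$ plays the same role: it guarantees a direct route for the gradient through every layer.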
Diederik Kingma & Jimmy Ba
Adam (Adaptive Moment Estimation) combines the best ideas: momentum (using gradient history) and adaptive learning rates (a different effective rate for each parameter). It maintains exponential moving averages of the first and second moments of the gradient, its mean and uncentered variance, and uses them to scale each step:

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2$$
$$\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t}, \qquad \theta_t = \theta_{t-1} - \eta\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$

Nearly every LLM is trained with Adam or a variant (most often AdamW, which decouples weight decay).
Training GPT-4 reportedly cost over $100 million in compute. Every dollar was spent running this update rule trillions of times.
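The published update rule translates directly into code. This is a minimal single-parameter sketch on the same kind of toy quadratic loss used for gradient descent above; the hyperparameters are the defaults from the Adam paper, while the loss and step count are illustrative.

```python
# Minimal Adam on L(theta) = (theta - 3)^2.
import math

def grad_L(theta):
    return 2 * (theta - 3)

theta = 0.0
eta, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8
m = v = 0.0

for t in range(1, 2001):
    g = grad_L(theta)
    m = beta1 * m + (1 - beta1) * g        # first moment (mean)
    v = beta2 * v + (1 - beta2) * g * g    # second moment (variance)
    m_hat = m / (1 - beta1 ** t)           # bias correction for the
    v_hat = v / (1 - beta2 ** t)           # zero-initialized averages
    theta -= eta * m_hat / (math.sqrt(v_hat) + eps)

print(theta)  # close to 3.0
```

Note the per-parameter normalization by $\sqrt{\hat{v}_t}$: parameters with noisy or large gradients automatically take smaller steps, which is what makes Adam robust across the wildly different scales inside a transformer.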
Culmination
From Fermat’s tangent lines to Adam’s adaptive moments, calculus has always been the mathematics of change. AI training is optimization — and optimization is calculus.
Connections to Other Lectures
Lecture 2 (Linear Algebra): Backpropagation computes gradients through layers of matrix multiplications; calculus and linear algebra are inseparable in neural network training.
Lecture 1 (Probability): The loss function that gradient descent minimizes is cross-entropy, a quantity rooted in probability theory and information theory.
Lecture 7 (Statistics & Learning Theory): Learning theory tells us when optimization on training data will generalize, connecting the calculus of training to the statistics of prediction.