Lecture 3: Calculus & Optimization

From Tangent Lines to Training Runs

1629 → 2024: How Machines Learn

Every time a language model improves its predictions — learning from trillions of words to produce coherent, meaningful text — it’s performing an act of calculus. The same mathematics that Newton invented to describe falling apples now teaches AI to write poetry, code, and conversation.

The Timeline

Origin 1629

Pierre de Fermat

Before calculus formally existed, Fermat developed a method to find maxima and minima of curves. He would set $f(x) \approx f(x+e)$, divide by $e$, then let $e \to 0$. This “method of adequality” was the first optimization algorithm — finding the peaks and valleys of functions.

$$f'(x) = \lim_{e \to 0} \frac{f(x+e) - f(x)}{e}$$
Origin

Fermat’s insight — that at a maximum or minimum the slope is zero — is the very principle behind training every neural network: training searches for parameters where the gradient of the loss vanishes.
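Fermat’s difference quotient can be sketched directly in code: take a small $e$, form $\frac{f(x+e)-f(x)}{e}$, and look for the point where it crosses zero. A minimal illustration (the function $f(x) = x(1-x)$ and the grid search are my own example, not from the lecture):

```python
# Fermat's "adequality" as a numerical derivative: [f(x+e) - f(x)] / e for small e.
def derivative(f, x, e=1e-6):
    """Approximate f'(x) with Fermat's difference quotient."""
    return (f(x + e) - f(x)) / e

# f(x) = x(1 - x) has its maximum where the slope is zero.
f = lambda x: x * (1 - x)

# Scan a grid for the point where the derivative is closest to zero.
best = min((x / 1000 for x in range(1001)), key=lambda x: abs(derivative(f, x)))
print(best)  # close to 0.5, the true maximum
```

The grid search stands in for Fermat’s algebraic step of solving “slope = 0” by hand.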

Breakthrough 1665–1684

Isaac Newton & Gottfried Wilhelm Leibniz

The most famous priority dispute in mathematics. Newton developed “fluxions” for physics (1665–66, published later). Leibniz independently developed calculus with superior notation (1684). Their combined work unified differentiation and integration through the Fundamental Theorem of Calculus. Newton used it for gravity; we use it for gradient descent.

$$\int_a^b f'(x)\,dx = f(b) - f(a)$$
Breakthrough

Leibniz’s notation ($\frac{dy}{dx}$, $\int$) won out over Newton’s dots. We still use Leibniz notation today in every machine learning paper.
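The Fundamental Theorem can be checked numerically: a Riemann sum of $f'(x)$ over $[a,b]$ should approach $f(b) - f(a)$. A quick sketch, using $f(x) = x^3$ as an arbitrary example of my own choosing:

```python
# Numerically verify the Fundamental Theorem of Calculus:
# the Riemann sum of f'(x) over [a, b] approaches f(b) - f(a).
def riemann_sum(g, a, b, n=100_000):
    """Midpoint-rule approximation of the integral of g over [a, b]."""
    dx = (b - a) / n
    return sum(g(a + (i + 0.5) * dx) for i in range(n)) * dx

f = lambda x: x**3            # f(x) = x^3
fprime = lambda x: 3 * x**2   # f'(x) = 3x^2

lhs = riemann_sum(fprime, 1.0, 2.0)   # integral of f' from 1 to 2
rhs = f(2.0) - f(1.0)                 # = 8 - 1 = 7
print(lhs, rhs)
```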

Discovery 1740s

Leonhard Euler

Euler extended calculus to functions of multiple variables. The chain rule — the derivative of a composition of functions — is the mathematical heart of backpropagation. If a neural network has layers $f \circ g \circ h$, the chain rule tells us how each layer’s parameters affect the final output.

$$\frac{\partial L}{\partial w_1} = \frac{\partial L}{\partial z_3} \cdot \frac{\partial z_3}{\partial z_2} \cdot \frac{\partial z_2}{\partial w_1}$$
Discovery

The chain rule is THE mathematical reason deep learning works. Without it, we couldn’t compute gradients through 100+ layers.
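The chain-rule product above can be checked on a toy composition $f \circ g \circ h$: multiply the local derivatives from output back to input and compare against a numerical derivative. The specific functions are illustrative choices of mine:

```python
import math

# Chain rule through a three-layer composition y = f(g(h(x))):
# dy/dx = f'(g(h(x))) * g'(h(x)) * h'(x)
h, dh = lambda x: 3 * x,        lambda x: 3.0
g, dg = lambda u: u**2,         lambda u: 2 * u
f, df = lambda v: math.sin(v),  lambda v: math.cos(v)

def forward(x):
    return f(g(h(x)))

def grad(x):
    """Multiply local derivatives from the output back toward the input."""
    u = h(x)
    v = g(u)
    return df(v) * dg(u) * dh(x)

x = 0.7
analytic = grad(x)
numeric = (forward(x + 1e-6) - forward(x - 1e-6)) / 2e-6  # central difference
print(analytic, numeric)  # the two agree
```

This back-to-front multiplication of local derivatives is exactly what backpropagation automates across all layers.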

Breakthrough 1755

Leonhard Euler & Joseph-Louis Lagrange

Instead of optimizing functions, what if we could optimize entire curves and paths? Variational calculus finds the function that minimizes a quantity. This idea — optimization over function spaces — prefigured the training of neural networks, where we search for the optimal function from a parameterized family.

$$\frac{\partial F}{\partial y} - \frac{d}{dx}\frac{\partial F}{\partial y'} = 0$$
Unsolved Problem

The Brachistochrone Problem: What curve allows the fastest descent under gravity? This optimization problem inspired variational calculus, which inspired optimization over neural network parameters.

Breakthrough 1847

Augustin-Louis Cauchy

Cauchy proposed a simple idea: to find the minimum of a function, take small steps in the direction of steepest descent (the negative gradient). This is gradient descent — the engine that trains every neural network. Simple, elegant, and 175 years later still the foundation of AI training.

$$w_{t+1} = w_t - \eta \nabla L(w_t)$$

Where $\eta$ is the learning rate and $\nabla L$ is the gradient of the loss.
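Cauchy’s update rule is a few lines of code. A minimal sketch on the quadratic $L(w) = (w-3)^2$ (my own toy loss; its minimum is at $w = 3$):

```python
# Gradient descent: w_{t+1} = w_t - eta * grad L(w_t),
# minimizing L(w) = (w - 3)^2, whose gradient is 2(w - 3).
def grad_L(w):
    return 2 * (w - 3)

w, eta = 0.0, 0.1   # initial weight and learning rate
for _ in range(100):
    w = w - eta * grad_L(w)
print(w)  # converges toward 3
```

Each step moves opposite the gradient; the step shrinks as the slope flattens near the minimum.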

Breakthrough

Cauchy invented gradient descent to solve systems of equations. Today it trains models with 1.8 trillion parameters.

AI Connection 1986

David Rumelhart, Geoffrey Hinton & Ronald Williams

The key insight: use the chain rule to efficiently compute gradients through a neural network, layer by layer, from output back to input. Before backpropagation, training deep networks was computationally impractical. After it, the age of deep learning began. The paper “Learning representations by back-propagating errors” is one of the most cited in all of science.

$$\delta^{(l)} = \left(\mathbf{W}^{(l+1)T} \delta^{(l+1)}\right) \odot \sigma'(z^{(l)})$$
AI Connection

Backpropagation is just the chain rule, applied systematically. Hinton called it “the calculus of neural networks.”
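The recursion $\delta^{(l)} = (\mathbf{W}^{(l+1)T}\delta^{(l+1)}) \odot \sigma'(z^{(l)})$ can be seen in miniature with scalar weights, where it reduces to $\delta_1 = w_2\,\delta_2\,\sigma'(z_1)$. A sketch on a 1-1-1 network of my own construction, checked against a numerical gradient:

```python
import math

# Tiny 1-1-1 network: x -> z1 = w1*x -> a1 = sigmoid(z1) -> L = w2*a1
sigmoid = lambda z: 1 / (1 + math.exp(-z))
dsigmoid = lambda z: sigmoid(z) * (1 - sigmoid(z))

def loss(w1, w2, x):
    return w2 * sigmoid(w1 * x)

w1, w2, x = 0.5, -1.2, 2.0
z1 = w1 * x
delta2 = 1.0                          # dL/dz2, since L = z2 here
delta1 = w2 * delta2 * dsigmoid(z1)   # the backprop recursion, scalar case
grad_w1 = delta1 * x                  # dL/dw1

numeric = (loss(w1 + 1e-6, w2, x) - loss(w1 - 1e-6, w2, x)) / 2e-6
print(grad_w1, numeric)  # the two agree
```

In a real network $w_2$ becomes a matrix transpose and the products become matrix-vector operations, but the recursion is the same.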

Unsolved 1991–2015

Sepp Hochreiter, Yoshua Bengio & Kaiming He

Deep networks faced a crisis: gradients vanished or exploded as they propagated through many layers. Hochreiter identified this in 1991. Solutions came gradually: LSTMs (1997), ReLU activation (2010), batch normalization (2015), residual connections (2015). Each was a calculus insight — engineering the gradient flow through the network.

$$\text{ReLU}(x) = \max(0, x), \quad \text{ReLU}'(x) = \begin{cases} 1 & x > 0 \\ 0 & x \leq 0 \end{cases}$$
Unsolved Problem

The vanishing gradient was the biggest unsolved problem in deep learning for 25 years. Residual connections (skip connections) finally tamed it by creating gradient “highways.”
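The vanishing-gradient arithmetic is easy to make concrete: the backward pass multiplies one local derivative per layer, $\sigma'(z) \le 0.25$ for the sigmoid, while $\text{ReLU}'(z) = 1$ on the active side. A sketch of the worst-case bound over 50 layers:

```python
# Gradients through L layers multiply one local derivative per layer.
# sigmoid'(z) peaks at 0.25 (at z = 0); ReLU'(z) = 1 for any z > 0.
dsigmoid_max = 0.25
drelu_active = 1.0

layers = 50
sig_grad = dsigmoid_max ** layers    # geometric decay: effectively zero
relu_grad = drelu_active ** layers   # stays exactly 1
print(sig_grad, relu_grad)
```

Even under the *best* case for sigmoid, the signal shrinks by at least 4x per layer; ReLU (and residual connections, which add an identity path) keep the gradient alive.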

AI Connection 2014–2024

Diederik Kingma & Jimmy Ba

Adam (Adaptive Moment Estimation) combines the best ideas: momentum (using gradient history) and adaptive learning rates (different rates for different parameters). It uses first and second moments of the gradient — mean and variance — to adjust each step. Nearly every LLM is trained with Adam or its variants (AdamW with weight decay).

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)g_t, \quad v_t = \beta_2 v_{t-1} + (1-\beta_2)g_t^2, \quad \hat{m}_t = \frac{m_t}{1-\beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1-\beta_2^t}, \quad w_{t+1} = w_t - \frac{\eta\, \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$
AI Connection

Training GPT-4 reportedly cost over $100 million in compute — spent largely on running the Adam optimizer, this equation, trillions of times.
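Adam’s update fits in a short loop: track the first and second moments of the gradient, correct their initialization bias, and scale each step by $\hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$. A minimal single-parameter sketch on a toy quadratic loss of my own choosing:

```python
import math

# Minimal Adam (Kingma & Ba): moment estimates with bias correction.
def adam_minimize(grad, w, steps=500, eta=0.1,
                  beta1=0.9, beta2=0.999, eps=1e-8):
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g          # first moment (mean)
        v = beta2 * v + (1 - beta2) * g * g      # second moment (uncentered variance)
        m_hat = m / (1 - beta1 ** t)             # bias-corrected first moment
        v_hat = v / (1 - beta2 ** t)             # bias-corrected second moment
        w = w - eta * m_hat / (math.sqrt(v_hat) + eps)
    return w

# Minimize L(w) = (w - 3)^2, gradient 2(w - 3).
w_star = adam_minimize(lambda w: 2 * (w - 3), w=0.0)
print(w_star)  # approaches 3
```

Production implementations add details such as decoupled weight decay (AdamW) and per-parameter vectorization, but the core update is exactly this.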

Culmination

From Fermat’s tangent lines to Adam’s adaptive moments, calculus has always been the mathematics of change. AI training is optimization — and optimization is calculus.

The Calculus Chain
$$\text{Fermat} \to \text{Newton} \to \text{Chain Rule} \to \text{Gradient Descent} \to \text{Backpropagation} \to \text{Adam}$$
Nearly 400 years of calculus, now training every AI model on Earth.

Connections to Other Lectures

Lecture 2: Linear Algebra Backpropagation computes gradients through layers of matrix multiplications — calculus and linear algebra are inseparable in neural network training.

Lecture 1: Probability The loss function that gradient descent minimizes is cross-entropy — a quantity rooted in probability theory and information theory.

Lecture 7: Statistics & Learning Theory Learning theory tells us when optimization on training data will generalize — connecting the calculus of training to the statistics of prediction.
