
Linear Algebra & Transformations

From Ancient Systems to Attention

2000 BCE → 2024
The Language of Neural Networks

When a language model reads your sentence and “understands” which words relate to which, it’s performing millions of matrix multiplications per second — using a branch of mathematics that began with ancient Babylonian scribes solving systems of equations on clay tablets.

2000 BCE
Babylonian & Chinese Mathematicians
Origin
Babylonian clay tablets show systems of linear equations solved 4,000 years ago. The Chinese text Nine Chapters on the Mathematical Art (c. 200 BCE) described what we now call Gaussian elimination — 2,000 years before Gauss.
$$\begin{cases} 3x + 2y = 7 \\ x + 4y = 9 \end{cases}$$

The same idea — organizing numbers into rows and columns to solve for unknowns — is exactly what a matrix is.
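The system above can be solved by exactly the elimination procedure the Nine Chapters describes. A minimal pure-Python sketch for a 2×2 system (illustrative only, no pivoting or error handling):

```python
# Gaussian elimination on the system from the text:
#   3x + 2y = 7
#   x  + 4y = 9

def solve_2x2(a, b, c, d, e, f):
    """Solve [[a, b], [c, d]] @ [x, y] = [e, f] by elimination."""
    m = c / a            # eliminate x from the second equation: row2 -= m * row1
    d2 = d - m * b
    f2 = f - m * e
    y = f2 / d2          # back-substitute
    x = (e - b * y) / a
    return x, y

x, y = solve_2x2(3, 2, 1, 4, 7, 9)
print(x, y)  # x ≈ 1, y ≈ 2
```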

1850s
Arthur Cayley & James Joseph Sylvester
Breakthrough
Sylvester coined the word “matrix” (Latin for “womb”) in 1850, because a matrix generates determinants. In his 1858 memoir, Cayley formalized matrices as mathematical objects with their own algebra, proving that matrices could be added, multiplied, and inverted — a complete algebraic system for transformations.
$$\mathbf{AB} = \begin{pmatrix} a & b \\ c & d \end{pmatrix}\begin{pmatrix} e & f \\ g & h \end{pmatrix} = \begin{pmatrix} ae+bg & af+bh \\ ce+dg & cf+dh \end{pmatrix}$$

Matrix multiplication is NOT commutative: $\mathbf{AB} \neq \mathbf{BA}$. This asymmetry is essential for neural networks, where the order of operations matters.
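The noncommutativity is easy to see concretely. A pure-Python sketch using the 2×2 product formula above (toy matrices chosen for illustration):

```python
# Demonstrating that matrix multiplication is not commutative.

def matmul2(A, B):
    """Multiply two 2x2 matrices given as nested lists."""
    (a, b), (c, d) = A
    (e, f), (g, h) = B
    return [[a*e + b*g, a*f + b*h],
            [c*e + d*g, c*f + d*h]]

A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]   # a permutation matrix

print(matmul2(A, B))   # [[2, 1], [4, 3]] -- B on the right swaps A's columns
print(matmul2(B, A))   # [[3, 4], [1, 2]] -- B on the left swaps A's rows
```

Multiplying by B on the right versus the left does visibly different things, which is exactly why layer order matters in a network.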

1904
David Hilbert
Discovery
Hilbert generalized eigenvalue theory to infinite dimensions. Eigenvalues reveal the fundamental modes of a transformation — the directions that don’t change, only scale. In data science, this became the engine of Principal Component Analysis (PCA), one of the first dimensionality reduction techniques.
$$\mathbf{A}\vec{v} = \lambda \vec{v}$$

An eigenvector $\vec{v}$ is a direction that a matrix only stretches, never rotates. The eigenvalue $\lambda$ is the stretch factor. This is how PCA finds the most important directions in data.
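One way to find the dominant eigenvector is power iteration: repeatedly applying a matrix pulls almost any starting vector toward the direction that is only stretched. A minimal pure-Python sketch for a symmetric 2×2 matrix (illustrative, not PCA itself):

```python
import math

def power_iteration(A, steps=50):
    """Estimate the dominant eigenvector/eigenvalue of a 2x2 matrix A."""
    v = [1.0, 0.0]
    for _ in range(steps):
        w = [A[0][0]*v[0] + A[0][1]*v[1],
             A[1][0]*v[0] + A[1][1]*v[1]]
        norm = math.hypot(w[0], w[1])
        v = [w[0]/norm, w[1]/norm]        # renormalize each step
    # Rayleigh quotient v^T A v estimates the eigenvalue (stretch factor)
    Av = [A[0][0]*v[0] + A[0][1]*v[1], A[1][0]*v[0] + A[1][1]*v[1]]
    lam = v[0]*Av[0] + v[1]*Av[1]
    return v, lam

A = [[2.0, 1.0], [1.0, 2.0]]   # eigenvalues 3 and 1; eigenvectors along (1,1), (1,-1)
v, lam = power_iteration(A)
print(lam)  # ~3.0, with v pointing along (1,1)/sqrt(2)
```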

1936
Carl Eckart & Gale Young
Breakthrough
The singular value decomposition (SVD) factorizes any matrix into three simpler matrices, revealing its fundamental structure. Eckart and Young proved that truncating the SVD gives the best low-rank approximation of a matrix. It’s the mathematical Swiss Army knife: used in data compression, noise reduction, recommendation systems, and — critically — the low-rank approximations that make adapting LLMs computationally feasible (LoRA fine-tuning).
$$\mathbf{A} = \mathbf{U} \boldsymbol{\Sigma} \mathbf{V}^T$$

The winning entries in the Netflix Prize were built on SVD-style matrix factorization. LoRA fine-tuning, which adapts LLMs cheaply, is likewise built on low-rank matrix decomposition.
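The payoff of low rank is easy to quantify. Instead of updating a full d×d weight matrix W, LoRA-style methods learn W + BA, where B is d×r and A is r×d with r much smaller than d. A parameter-count sketch (the hidden size and rank below are hypothetical round numbers, not any specific model's):

```python
# Why low-rank updates are cheap: count trainable parameters.

def full_params(d):
    """Parameters in a full d x d weight update."""
    return d * d

def low_rank_params(d, r):
    """Parameters in a rank-r update: B (d x r) plus A (r x d)."""
    return 2 * d * r

d, r = 4096, 8                   # hypothetical hidden size and rank
print(full_params(d))            # 16777216
print(low_rank_params(d, r))     # 65536 -- 256x fewer
```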

1920s–1930s
Stefan Banach, John von Neumann
Discovery
Banach and von Neumann formalized the abstract vector spaces in which inner products live. The dot product measures similarity between vectors: two vectors pointing in the same direction have a large positive dot product; perpendicular vectors have a dot product of zero. This simple operation became the foundation of word embeddings — representing words as vectors so that similar meanings point in similar directions.
$$\vec{a} \cdot \vec{b} = \sum_{i} a_i b_i = \|\vec{a}\|\,\|\vec{b}\|\cos\theta$$

The famous analogy \(\text{king} - \text{man} + \text{woman} \approx \text{queen}\) is vector arithmetic in embedding space, and the “≈” is measured by cosine similarity — the dot product of the normalized vectors.
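Cosine similarity is just the dot product divided by the vector lengths. A sketch with made-up 3-dimensional "embeddings" (real embeddings have hundreds of dimensions; these numbers are purely illustrative):

```python
import math

def dot(a, b):
    return sum(x*y for x, y in zip(a, b))

def cosine(a, b):
    """Dot product of a and b, normalized by their lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Hypothetical toy vectors -- not real Word2Vec values.
king   = [0.9, 0.8, 0.1]
queen  = [0.9, 0.2, 0.8]
banana = [0.1, 0.1, 0.9]

print(cosine(king, queen) > cosine(king, banana))  # True: king is closer to queen
```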

2013
Tomas Mikolov (Google)
AI Connection
Word2Vec mapped every word to a point in 300-dimensional space, trained so that words in similar contexts land in similar locations. “King” and “queen” are close; “king” and “banana” are far apart. The entire vocabulary becomes a matrix — each row is a word, each column is a dimension of meaning.
$$\mathbf{E} \in \mathbb{R}^{V \times d}$$

Where $V$ is vocabulary size and $d$ is embedding dimension (e.g., 300 or 768)


Every LLM starts by converting each input token to a vector using an embedding matrix. GPT-3’s embedding matrix has 50,257 rows × 12,288 columns; GPT-4’s is undisclosed but reportedly larger still.
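Mechanically, the embedding matrix is a lookup table: row i is the vector for token i. A toy sketch with V = 4 tokens and d = 3 dimensions (both the vocabulary and the numbers are made up):

```python
# A tiny embedding matrix E in R^{V x d}: one row per token.
E = [
    [0.1, 0.2, 0.3],   # token 0: "the"
    [0.9, 0.8, 0.1],   # token 1: "king"
    [0.9, 0.2, 0.8],   # token 2: "queen"
    [0.1, 0.1, 0.9],   # token 3: "banana"
]

def embed(token_ids):
    """Convert a sequence of token ids into a sequence of row vectors."""
    return [E[i] for i in token_ids]

print(embed([1, 2]))  # [[0.9, 0.8, 0.1], [0.9, 0.2, 0.8]]
```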

2017
Vaswani et al. (“Attention Is All You Need”)
AI Connection
The transformer’s self-attention is pure linear algebra. Each word generates three vectors: Query (what am I looking for?), Key (what do I contain?), and Value (what information do I carry?). Attention scores are computed by matrix-multiplying Queries with Keys, then using those scores to weight the Values.
$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

This single equation is the heart of the transformer: every word attends to every other word through matrix multiplication. The $\sqrt{d_k}$ scaling keeps the dot products from growing with dimension, which would otherwise saturate the softmax — a linear algebra insight.
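The attention equation can be written out in a few lines. A minimal pure-Python sketch (one list-of-rows per matrix, no batching or masking — not an optimized implementation):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, with Q, K, V as lists of row vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:  # one output row per query token
        scores = [sum(qi*ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)  # how much this token attends to each token
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
out = attention(Q, K, V)
print(out)  # each output row is a softmax-weighted mix of the rows of V
```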

2020–2024
GPT-4, Claude, Llama
AI Connection
Modern LLMs use multi-head attention: the same QKV computation repeated with different projection matrices, then concatenated. GPT-4 reportedly has 120 attention heads across 120 layers. Each forward pass involves trillions of matrix multiplications. The entire intelligence of LLMs is stored in their weight matrices.
$$\text{MultiHead}(Q,K,V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\,\mathbf{W}^O$$

When Claude reads your question, it performs billions of matrix multiplications — each one a descendant of those ancient Babylonian systems of equations.
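The final step of the multi-head equation — concatenate the head outputs, then apply $\mathbf{W}^O$ — is itself just one more matrix multiplication. A sketch with h = 2 heads and a single token (all numbers are made-up toy values):

```python
# Concat(head_1, head_2) @ W_O for one token.

head_1 = [1.0, 0.0]   # hypothetical per-head outputs, d_head = 2
head_2 = [0.0, 2.0]

concat = head_1 + head_2   # length h * d_head = 4

# W_O projects the concatenated vector back to the model dimension (here 4 -> 2).
W_O = [[1.0, 0.0],
       [0.0, 1.0],
       [1.0, 0.0],
       [0.0, 1.0]]

output = [sum(concat[i] * W_O[i][j] for i in range(4)) for j in range(2)]
print(output)  # [1.0, 2.0]
```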

Culmination

From clay tablets to attention heads, the thread is continuous: organize numbers in grids, transform them, find the essential patterns. Linear algebra is not just a tool for AI — it IS the medium in which AI thinks.

The Thread of Linear Algebra
$$\text{Babylon} \to \text{Matrices} \to \text{Eigenvalues} \to \text{SVD} \to \text{Embeddings} \to \text{Attention}$$

Connections to Other Lectures

Lecture 1: Probability. The softmax function in the attention formula converts raw dot-product scores into probabilities — a direct bridge from linear algebra to probability theory.

Lecture 3: Calculus. Gradient descent is the algorithm that trains these weight matrices — calculus tells each matrix element which direction to move.

Lecture 5: Geometry. Word embeddings live in high-dimensional vector spaces — the geometry of those spaces determines what “similarity” means for AI.
