When a language model reads your sentence and “understands” which words relate to which, it’s performing millions of matrix multiplications per second — using a branch of mathematics that began with ancient Babylonian scribes solving systems of equations on clay tablets.
The same idea — organizing numbers into rows and columns to solve for unknowns — is exactly what a matrix is.
Matrix multiplication is NOT commutative: in general, $\mathbf{AB} \neq \mathbf{BA}$. This asymmetry is essential for neural networks, where the order of operations matters.
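A two-line NumPy check makes the asymmetry concrete (the matrices here are arbitrary, chosen only so the difference is visible):

```python
import numpy as np

# Two arbitrary 2x2 matrices: multiplying in different orders
# gives genuinely different results.
A = np.array([[1, 2],
              [3, 4]])
B = np.array([[0, 1],
              [1, 0]])

print(A @ B)  # [[2 1] [4 3]] -- B on the right permutes A's columns
print(B @ A)  # [[3 4] [1 2]] -- B on the left permutes A's rows
```

In a network, this is why a stack of layers $\mathbf{W}_2(\mathbf{W}_1\vec{x})$ is not the same map as $\mathbf{W}_1(\mathbf{W}_2\vec{x})$.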
An eigenvector $\vec{v}$ is a direction that a matrix only stretches (or flips), never rotates: $\mathbf{M}\vec{v} = \lambda\vec{v}$. The eigenvalue $\lambda$ is the stretch factor. PCA finds the most important directions in data by taking the eigenvectors of the data's covariance matrix.
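The defining property is easy to verify in NumPy (the matrix below is an arbitrary symmetric example; `eigh` is NumPy's eigensolver for symmetric matrices):

```python
import numpy as np

M = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eigh returns eigenvalues in ascending order and unit eigenvectors as columns.
eigenvalues, eigenvectors = np.linalg.eigh(M)

for lam, v in zip(eigenvalues, eigenvectors.T):
    # Applying M only rescales v by lam -- no rotation.
    assert np.allclose(M @ v, lam * v)

print(eigenvalues)  # [1. 3.]
```

PCA applies exactly this decomposition to a covariance matrix; the eigenvector with the largest eigenvalue is the direction of greatest variance.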
The prize-winning Netflix recommendation algorithms were famously built on SVD-style matrix factorization. LoRA fine-tuning, which adapts LLMs cheaply, also rests on low-rank matrix decomposition.
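The shared idea is that a large matrix can often be rebuilt from just a few singular values. A minimal NumPy sketch (the "ratings" matrix here is synthetic, constructed to have rank 2 plus noise):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 20x15 "ratings" matrix: rank 2 by construction, plus small noise.
ratings = rng.standard_normal((20, 2)) @ rng.standard_normal((2, 15))
ratings += 0.01 * rng.standard_normal(ratings.shape)

# Truncated SVD: keep only the top k singular values and vectors.
U, s, Vt = np.linalg.svd(ratings, full_matrices=False)
k = 2
approx = (U[:, :k] * s[:k]) @ Vt[:k]

# Two components recover almost the whole matrix.
rel_error = np.linalg.norm(ratings - approx) / np.linalg.norm(ratings)
print(rel_error)  # tiny: on the order of the added noise
```

LoRA uses the same fact in reverse: rather than decomposing an existing matrix, it learns a weight *update* directly as a product of two skinny matrices.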
Word-vector arithmetic captures analogies: \(\text{king} - \text{man} + \text{woman} \approx \text{queen}\). The "\(\approx\)" is measured by cosine similarity, which is just the dot product of the two vectors, normalized by their lengths.
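Both ideas fit in a few lines of NumPy (the 3-dimensional "embeddings" below are made up for illustration, not trained vectors):

```python
import numpy as np

def cosine_similarity(a, b):
    # Dot product divided by the two vectors' lengths.
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy vectors chosen so the analogy works out exactly.
king  = np.array([0.9, 0.8, 0.1])
man   = np.array([0.5, 0.2, 0.1])
woman = np.array([0.5, 0.2, 0.8])
queen = np.array([0.9, 0.8, 0.8])

analogy = king - man + woman
print(cosine_similarity(analogy, queen))  # 1.0 -- lands exactly on queen here
```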
Every LLM starts by converting each input token to a vector using an embedding matrix $\mathbf{E} \in \mathbb{R}^{V \times d}$, where $V$ is the vocabulary size and $d$ is the embedding dimension (e.g., 300 or 768). GPT-3's embedding matrix, for example, has 50,257 rows × 12,288 columns.
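In code, the embedding step is just row selection (the sizes here are toy values, not any real model's):

```python
import numpy as np

V, d = 5, 4                      # toy vocabulary size and embedding dimension
rng = np.random.default_rng(0)
E = rng.standard_normal((V, d))  # embedding matrix: one row per token

token_ids = [3, 1, 4]            # an input sequence of token ids
embeddings = E[token_ids]        # lookup = picking out rows of E
print(embeddings.shape)          # (3, 4): one d-dimensional vector per token
```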
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

This single equation IS the heart of the transformer. Every word attends to every other word through matrix multiplication. The $\sqrt{d_k}$ scaling keeps the dot products from growing too large and saturating the softmax, a linear algebra insight.
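A from-scratch NumPy sketch of scaled dot-product attention (single head, no masking or learned projections):

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # every token scored against every token
    weights = softmax(scores)        # each row: a probability distribution
    return weights @ V               # weighted mixture of value vectors

rng = np.random.default_rng(0)
n, d_k = 4, 8  # 4 tokens, 8-dimensional head (toy sizes)
Q, K, V = (rng.standard_normal((n, d_k)) for _ in range(3))

out = attention(Q, K, V)
print(out.shape)  # (4, 8): one context-mixed vector per token
```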
When Claude reads your question, it performs billions of matrix multiplications — each one a descendant of those ancient Babylonian systems of equations.
Culmination
From clay tablets to attention heads, the thread is continuous: organize numbers in grids, transform them, find the essential patterns. Linear algebra is not just a tool for AI — it IS the medium in which AI thinks.
Connections to Other Lectures
Lecture 1: Probability. The softmax function in the attention formula converts raw dot-product scores into probabilities, a direct bridge from linear algebra to probability theory.
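The bridge is easy to see numerically (the scores below are arbitrary):

```python
import numpy as np

scores = np.array([2.0, 1.0, 0.1])             # raw scores: any real numbers
probs = np.exp(scores) / np.exp(scores).sum()  # softmax

print(probs)        # every entry strictly between 0 and 1
print(probs.sum())  # approximately 1: a valid probability distribution
```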
Lecture 3: Calculus. Gradient descent is the algorithm that trains these weight matrices; calculus tells each matrix element which direction to move.
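A minimal sketch of that training loop, fitting a single weight matrix by gradient descent on a least-squares loss (all data here is synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))      # synthetic inputs
W_true = rng.standard_normal((3, 2))  # the matrix we hope to recover
Y = X @ W_true                        # targets

W = np.zeros((3, 2))                  # start from zero
lr = 0.005                            # learning rate
for _ in range(500):
    grad = 2 * X.T @ (X @ W - Y)      # gradient of ||XW - Y||^2 w.r.t. W
    W -= lr * grad                    # each entry steps downhill

print(np.abs(W - W_true).max())       # near zero: W was recovered
```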
Lecture 5: Geometry. Word embeddings live in high-dimensional vector spaces; the geometry of those spaces determines what "similarity" means for AI.