In 1736, Euler solved a puzzle about bridges and accidentally invented a new branch of mathematics. Nearly 300 years later, that same mathematics describes the internet, social networks, biological systems — and the architecture of every AI model. Graph theory is the mathematics of connections, and intelligence is, at its core, about making the right connections.
The Timeline
Leonhard Euler
Can you cross all seven bridges of Königsberg exactly once? Euler proved it’s impossible by abstracting the city into nodes (landmasses) and edges (bridges). In doing so, he invented graph theory. The key insight: what matters is not the shape of the bridges, but the pattern of connections. This abstraction — from physical structure to connectivity — is exactly how neural network architectures are designed.
Euler’s criterion: A connected graph has an Eulerian path iff it has exactly 0 or 2 vertices of odd degree.
Euler proved that the shape doesn’t matter — only the connections do. This is the founding insight of graph theory AND neural network architecture design.
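Euler’s parity criterion is easy to check in code. A minimal sketch (connectivity of the edge-bearing vertices is assumed, not verified):

```python
from collections import defaultdict

def has_eulerian_path(edges):
    """Euler's criterion: a connected graph has an Eulerian path
    iff it has exactly 0 or 2 vertices of odd degree."""
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    odd = sum(1 for d in degree.values() if d % 2 == 1)
    return odd in (0, 2)

# The seven bridges of Königsberg: four landmasses A, B, C, D
konigsberg = [("A", "B"), ("A", "B"), ("A", "C"), ("A", "C"),
              ("A", "D"), ("B", "D"), ("C", "D")]
print(has_eulerian_path(konigsberg))  # False: all four vertices have odd degree
```

Note that only the degree sequence matters here, not any geometry — which is exactly Euler’s point.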
Arthur Cayley
Cayley’s formula: the number of labeled trees on $n$ vertices is $n^{n-2}$. Tree structures became fundamental in computer science — parse trees for language, decision trees for classification, syntax trees for compilers. Every time an NLP system parses a sentence’s grammatical structure, it’s building a tree in the sense of Cayley.
Parse trees represent sentence structure: “The cat sat on the mat” has a tree showing subject, verb, and prepositional phrase. Before transformers, NLP was built on trees.
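Cayley’s formula is simple enough to state directly in code:

```python
def cayley(n):
    """Cayley's formula: the number of labeled trees on n vertices is n^(n-2)."""
    return n ** (n - 2)

# Small cases: 1 tree on 2 vertices, 3 on 3 vertices, 16 on 4, 125 on 5
print([cayley(n) for n in range(2, 6)])  # [1, 3, 16, 125]
```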
Paul Erdős & Alfréd Rényi
What happens when you randomly connect nodes? Erdős and Rényi discovered phase transitions: below a critical threshold, the graph is fragmented; above it, a giant connected component suddenly appears. This “emergence” from random connections prefigures the emergent abilities of neural networks — at some scale, new capabilities suddenly appear.
Giant component appears when edge probability $p > \frac{1}{n}$, i.e., average degree $> 1$.
Phase transitions in random graphs mirror the “emergent abilities” of LLMs: at some size, models suddenly gain capabilities (reasoning, translation, coding) that smaller models completely lack.
Duncan Watts & Steven Strogatz
Watts and Strogatz showed that most real networks (social, biological, technological) are “small worlds” — highly clustered locally, but with short paths globally. Famously, about six degrees of separation connect any two people. This structure — local clustering with global shortcuts — is remarkably similar to how residual connections work in transformers: local processing with skip connections that create shortcuts.
Skip connections in ResNets and transformers create “shortcuts” through the network — making them small-world networks where information can flow quickly from any layer to any other.
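The shortcut effect can be demonstrated on a ring lattice: adding even a single long-range edge shortens average path lengths. A small sketch (the lattice size and neighbor count are illustrative):

```python
from collections import deque

def avg_path_length(adj, n):
    """Mean shortest-path length over all node pairs (BFS from each node);
    assumes the graph is connected."""
    total, pairs = 0, 0
    for s in range(n):
        dist = {s: 0}
        q = deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs

def ring_lattice(n, k=2):
    """Ring where each node links to its k nearest neighbors on each side."""
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for d in range(1, k + 1):
            adj[i].add((i + d) % n)
            adj[(i + d) % n].add(i)
    return adj

n = 200
adj = ring_lattice(n)
print(avg_path_length(adj, n))  # long paths: roughly n / (4k)
adj[0].add(n // 2); adj[n // 2].add(0)  # one shortcut across the ring
print(avg_path_length(adj, n))  # noticeably shorter: the small-world effect
```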
Larry Page & Sergey Brin (Google)
PageRank models the web as a graph and computes each page’s importance from the importance of the pages linking to it. This recursive definition is solved with eigenvectors of the link matrix — the same linear algebra that underlies word embeddings. PageRank was the first massive-scale application of graph theory to information retrieval, and it launched one of the most valuable companies in the world.
$$PR(p) = \frac{1-d}{N} + d \sum_{q \in B_p} \frac{PR(q)}{L(q)}$$
Where $d \approx 0.85$ is the damping factor, $N$ is the total number of pages, $B_p$ is the set of pages linking to $p$, and $L(q)$ is the number of outgoing links from $q$.
Google’s PageRank is a random walk on a graph — a Markov chain. It computes the stationary distribution: “If you randomly clicked links forever, how often would you visit each page?”
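The stationary distribution can be computed by power iteration. A minimal sketch (the toy web and iteration count are illustrative; dangling pages distribute their rank evenly):

```python
def pagerank(links, d=0.85, iters=100):
    """Power iteration for PageRank.
    links[q] = list of pages q links to."""
    pages = sorted(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - d) / n for p in pages}
        for q, outs in links.items():
            targets = outs if outs else pages  # dangling page: spread evenly
            share = pr[q] / len(targets)
            for p in targets:
                new[p] += d * share
        pr = new
    return pr

# Tiny web: every page links to 'home', so it ranks highest
web = {"home": ["a"], "a": ["home", "b"], "b": ["home"]}
ranks = pagerank(web)
print(max(ranks, key=ranks.get))  # 'home'
```

Because every page redistributes its full rank each step, the ranks always sum to 1 — a proper probability distribution over pages, matching the random-surfer interpretation.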
Various (AlexNet, VGG, ResNet, Inception)
Every neural network IS a directed graph: nodes are neurons, edges are weighted connections. The architecture revolution (2012–2017) was about graph topology: deeper graphs (VGG), branching graphs (Inception), graphs with skip edges (ResNet). The transformer’s self-attention is a specific graph: a complete directed graph in which every token connects to every other token.
The transformer is a graph where every token attends to every other token — it’s a complete graph. This $O(n^2)$ connectivity is both its strength (any word can relate to any other) and its weakness (quadratic memory cost).
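The complete-graph view can be made concrete: a layer’s attention weights form a row-stochastic $n \times n$ matrix — a weighted adjacency matrix of the complete graph. A toy sketch with made-up scores (real transformers compute scores from learned query/key projections):

```python
import math

def softmax(xs):
    m = max(xs)  # subtract max for numerical stability
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention_graph(scores):
    """Row-wise softmax over an n x n score matrix: the weighted adjacency
    matrix of the complete attention graph (every token attends to every token)."""
    return [softmax(row) for row in scores]

# 3 tokens -> a 3x3 weight matrix: O(n^2) edges
A = attention_graph([[1.0, 0.0, 0.0],
                     [0.0, 2.0, 0.0],
                     [0.5, 0.5, 1.0]])
for row in A:
    print([round(w, 2) for w in row])  # each row sums to 1
```

The $n \times n$ matrix is exactly where the quadratic memory cost comes from: the number of edges in a complete graph grows as $O(n^2)$.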
Google Knowledge Graph, various
Knowledge graphs represent facts as (entity, relation, entity) triples: (Einstein, born_in, Germany). Google’s Knowledge Graph (2012) contains billions of facts. Retrieval-Augmented Generation (RAG) combines LLMs with knowledge graphs: the LLM generates text while consulting a graph database for factual accuracy. This hybrid approach addresses hallucination — one of the biggest challenges in AI.
RAG is one of the most widely used techniques for making LLMs factually accurate. In its graph-based variant, the LLM’s query is turned into a search over the knowledge graph, and the retrieved facts are injected back into the generation process.
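A toy sketch of the graph-consultation step (the triples and helper names here are invented for illustration; a real system would query a graph database and pass the prompt to an actual LLM):

```python
# Facts as (entity, relation, entity) triples
TRIPLES = [
    ("Einstein", "born_in", "Germany"),
    ("Einstein", "field", "physics"),
    ("Germany", "capital", "Berlin"),
]

def retrieve(entity):
    """Return all triples mentioning the entity (a one-hop graph search)."""
    return [t for t in TRIPLES if entity in (t[0], t[2])]

def build_prompt(question, entity):
    """Inject retrieved facts into the generation context."""
    facts = "; ".join(f"{s} {r} {o}" for s, r, o in retrieve(entity))
    return f"Facts: {facts}\nQuestion: {question}"

print(build_prompt("Where was Einstein born?", "Einstein"))
```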
Various (GNN community)
Graph Neural Networks generalize transformers to arbitrary graph structures. Instead of attending to all tokens (complete graph), GNNs attend only to graph neighbors. Graph Transformers combine the best of both: graph structure for efficiency, attention for expressiveness. Applications span drug discovery (molecular graphs), social network analysis, and improving LLMs themselves.
Drug discovery AI represents molecules as graphs (atoms = nodes, bonds = edges). GNNs predict whether a molecule will be an effective drug — graph theory saving lives.
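The neighbor-only attention of a GNN can be sketched as one message-passing layer. This is a simplified averaging update with no learned weights — real GNNs apply learned transformations before aggregating:

```python
def gnn_layer(adj, features):
    """One message-passing step: each node averages its own feature vector
    with those of its graph neighbors (a simplified GCN-style update)."""
    new = []
    for i, neighbors in enumerate(adj):
        group = [features[i]] + [features[j] for j in neighbors]
        dim = len(features[i])
        new.append([sum(v[k] for v in group) / len(group) for k in range(dim)])
    return new

# A path graph 0-1-2. Unlike a transformer (complete graph), node 0's
# signal only reaches node 2 after two layers of message passing.
adj = [[1], [0, 2], [1]]
x = [[1.0], [0.0], [0.0]]
h1 = gnn_layer(adj, x)
print(h1)  # node 2 is still 0.0 after one hop
h2 = gnn_layer(adj, h1)
print(h2)  # node 0's signal has now reached node 2
```

The contrast with the transformer is visible in the receptive field: information travels one edge per layer, which is efficient on sparse graphs but requires depth to span long distances.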
The Thread That Connects
From bridges in Königsberg to transformer attention patterns, graph theory reveals the architecture of intelligence. Connections matter more than the things being connected. The topology of a neural network — which neurons connect to which — determines what it can learn, just as the topology of a social network determines what information flows.
Connections to Other Lectures
- Lecture 2: Linear Algebra & Transformations — Adjacency matrices and eigenvalues are the linear algebra of graphs; PageRank is an eigenvector problem.
- Lecture 5: Geometry of High Dimensions — Graph embeddings map nodes into high-dimensional geometric spaces where distance encodes connectivity.
- Lecture 4: Logic & Computation — Computational complexity of graph problems (P vs NP) — many hard optimization problems are graph problems.