6

Number Theory & Encoding

From Prime Numbers to Tokens

300 BCE → 2024: How AI Reads

Number theory was once called “the purest mathematics” — beautiful but useless. Then it became the foundation of cryptography, the internet, and digital security. Now it underpins how AI reads and writes: from the encoding of text into numbers, to the positional encodings that tell a transformer where each word sits in a sentence.

The Timeline

Origin 300 BCE

Euclid of Alexandria

Euclid’s proof that there are infinitely many prime numbers is one of the most elegant in all of mathematics. Assume finitely many primes, multiply them all together, add 1 — the result is divisible by none of the assumed primes, contradiction. This proof-by-contradiction technique is still used in computer science today.

$$\text{If } p_1, p_2, \ldots, p_n \text{ are all primes, then } p_1 \cdot p_2 \cdots p_n + 1 \text{ has a prime factor not in the list.}$$
Origin

Euclid’s proof is 2,300 years old and still taught in every mathematics program. The concept of prime factorization underlies both cryptography and hashing algorithms used in AI systems.
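Euclid's construction can be checked numerically. A minimal sketch (the helper name `smallest_prime_factor` is my own):

```python
# Euclid's construction, checked for one finite list of primes:
# the product of the list plus 1 must have a prime factor outside the list.

def smallest_prime_factor(n: int) -> int:
    """Return the smallest prime factor of n (n >= 2) by trial division."""
    d = 2
    while d * d <= n:
        if n % d == 0:
            return d
        d += 1
    return n  # n itself is prime

primes = [2, 3, 5, 7, 11, 13]
product = 1
for p in primes:
    product *= p
candidate = product + 1              # 30031 = 59 * 509
new_factor = smallest_prime_factor(candidate)
assert new_factor not in primes      # a prime not in the assumed list
print(candidate, new_factor)         # 30031 59
```

Note that `candidate` itself need not be prime (30031 is not) — the proof only needs that its prime factors avoid the original list.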

Breakthrough 1640–1763

Pierre de Fermat & Leonhard Euler

Fermat’s Little Theorem: if $p$ is prime and $a$ is not divisible by $p$, then $a^{p-1} \equiv 1 \pmod{p}$. Euler generalized this with his totient function. Modular arithmetic — “clock arithmetic” where numbers wrap around — became the foundation of cryptography, hash functions, and the positional encodings in transformers.

$$a^{p-1} \equiv 1 \pmod{p}$$
Breakthrough

Modular arithmetic is “clock math”: after 12 comes 1 again. Transformers use sinusoidal functions — essentially continuous clocks — for positional encoding.
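Fermat's Little Theorem is easy to verify with Python's built-in three-argument `pow`, which performs modular exponentiation directly:

```python
# Verify a^(p-1) ≡ 1 (mod p) for one prime and every valid base a.
p = 101                            # a prime
for a in range(1, p):              # every a not divisible by p
    assert pow(a, p - 1, p) == 1   # built-in modular exponentiation

# The same "clock" idea in its plainest form: 10 hours after 7 o'clock.
print((7 + 10) % 12)               # 5
```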

Discovery 1801

Carl Friedrich Gauss

Every integer greater than 1 has a unique prime factorization — the fundamental theorem of arithmetic, first proved rigorously by Gauss in his Disquisitiones Arithmeticae. This uniqueness is the bedrock of number theory, and the modular arithmetic built on it powers modern hashing and cryptography. Hash functions map data to fixed-size numbers whose security rests on the practical impossibility of reversing the mapping; the related difficulty of factoring large numbers is what underpins RSA encryption. Every AI system uses hashing for data deduplication, caching, and retrieval.

$$n = p_1^{a_1} \cdot p_2^{a_2} \cdots p_k^{a_k} \quad \text{(unique up to order)}$$
Discovery

Hash functions compress any data (a book, an image, an entire dataset) into a fixed-size fingerprint. Collisions must exist by the pigeonhole principle, but a well-designed hash makes them practically impossible to find.
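A quick illustration with Python's standard `hashlib`: any input, however large, maps to a fixed-size digest, and a one-character change scrambles it completely:

```python
import hashlib

# Any input, regardless of size, maps to a fixed-size fingerprint.
book = "a" * 1_000_000                     # stand-in for a large document
digest = hashlib.sha256(book.encode("utf-8")).hexdigest()
print(len(digest))                         # 64 — always 64 hex chars (256 bits)

# A single-character change yields a completely different fingerprint.
digest2 = hashlib.sha256(("b" + book[1:]).encode("utf-8")).hexdigest()
print(digest == digest2)                   # False
```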

Breakthrough 1948–1952

Claude Shannon & David Huffman

Shannon proved that data has a fundamental compression limit: the entropy. Huffman (1952) found an optimal way to encode symbols using variable-length codes — frequent symbols get short codes, rare symbols get long codes. This is the ancestor of all text compression and directly inspired subword tokenization for LLMs.

$$L = \sum_i p_i \cdot l_i \;\geq\; H(X) = -\sum_i p_i \log_2 p_i$$

Expected code length $L$ is bounded below by the entropy $H(X)$

Breakthrough

Huffman’s key insight: assign shorter codes to common symbols. This is exactly the principle behind BPE tokenization — common words get short tokens.
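The frequent-gets-short principle can be seen in a compact Huffman coder. This is a toy sketch, not a production codec — it keeps a symbol→codeword dict per heap node instead of building an explicit tree:

```python
import heapq
from collections import Counter

def huffman_codes(text: str) -> dict[str, str]:
    """Build a Huffman code: frequent symbols get shorter codewords."""
    # Heap entries: [frequency, tiebreak, {symbol: partial codeword}]
    heap = [[f, i, {s: ""}] for i, (s, f) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        lo = heapq.heappop(heap)          # two least-frequent subtrees
        hi = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in lo[2].items()}   # left branch
        merged.update({s: "1" + c for s, c in hi[2].items()})  # right branch
        heapq.heappush(heap, [lo[0] + hi[0], tiebreak, merged])
        tiebreak += 1
    return heap[0][2]

codes = huffman_codes("abracadabra")
print(codes["a"])   # 'a' occurs 5 times and gets the shortest codeword
```

On "abracadabra", `a` (5 occurrences) gets a 1-bit code while `b`, `r`, `c`, `d` each get 3 bits — exactly the entropy bound at work.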

Discovery 1963–1991

Bob Bemer (ASCII) & Unicode Consortium

ASCII (1963) mapped 128 characters to numbers (A=65, B=66, …). Unicode (1991) extended this to 149,000+ characters covering every writing system. This was the fundamental step: converting human text to numbers that computers — and eventually AI — can process. Every LLM begins by converting text to numerical representations.

$$\text{ASCII: } \texttt{'A'} = 65 = 01000001_2, \quad \texttt{'a'} = 97 = 01100001_2$$
Discovery

Before AI can “read,” every character must become a number. Unicode solved the encoding problem for all human languages.
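In Python the whole pipeline is directly visible: `ord` gives the Unicode code point, and `encode` gives the UTF-8 bytes that a tokenizer actually consumes:

```python
# Characters are numbers: code points via ord(), bytes via UTF-8 encoding.
print(ord("A"), format(ord("A"), "08b"))   # 65 01000001
print(ord("a"), format(ord("a"), "08b"))   # 97 01100001
print("é".encode("utf-8"))                 # b'\xc3\xa9' — one character, two bytes
print(ord("猫"))                           # 29483 — CJK characters included
```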

AI Connection 2015–2020

Rico Sennrich (BPE for NLP) & OpenAI (GPT tokenizer)

BPE was originally a data compression algorithm (1994). Sennrich (2015) adapted it for NLP: start with individual characters, repeatedly merge the most frequent adjacent pairs. “unhappiness” becomes [“un”, “happiness”]. This solves the vocabulary problem — the model doesn’t need to memorize every word, just common subword pieces. GPT-4 uses ~100,000 BPE tokens.

$$\text{Merge rule: if ``t'' + ``h'' is most frequent} \to \text{replace with ``th''}$$

Then if “th” + “e” is most frequent → replace with “the”

AI Connection

GPT-4’s tokenizer splits “unhappiness” into [“un”, “happiness”]. Common English words get single tokens; rare words are split into pieces. This is Huffman coding applied to language.
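The merge loop itself is short. Below is a toy sketch of BPE training on a tiny corpus (function names are mine; real tokenizers such as GPT-4's add byte-level fallback, pretokenization, and many more merges):

```python
from collections import Counter

def merge_pair(word: list[str], pair: tuple[str, str]) -> list[str]:
    """Replace every occurrence of an adjacent symbol pair with its fusion."""
    out, i = [], 0
    while i < len(word):
        if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
            out.append(word[i] + word[i + 1])
            i += 2
        else:
            out.append(word[i])
            i += 1
    return out

def bpe_merges(words: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn BPE merges: repeatedly fuse the most frequent adjacent pair."""
    corpus = [list(w) for w in words]      # start from individual characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in corpus:
            pairs.update(zip(w, w[1:]))
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]
        merges.append(best)
        corpus = [merge_pair(w, best) for w in corpus]
    return merges

merges = bpe_merges(["the", "then", "there", "this"], 3)
print(merges[:2])   # [('t', 'h'), ('th', 'e')] — "th", then "the"
```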

AI Connection 2017

Vaswani et al.

Transformers process all words in parallel — so how does the model know word order? The answer: add a unique positional encoding to each word. The original transformer used sinusoidal functions at different frequencies, inspired by Fourier series. Each position gets a unique “fingerprint” based on sine and cosine waves — a direct application of number theory and trigonometry.

$$PE_{(pos, 2i)} = \sin\!\left(\frac{pos}{10000^{2i/d}}\right), \quad PE_{(pos, 2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d}}\right)$$
AI Connection

The base 10,000 in positional encoding creates a system like counting in different number bases simultaneously — each dimension is a different “clock” cycling at a different frequency.
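The formula translates to a few lines of Python; a minimal sketch (interleaving sin/cos per dimension pair, matching the equation above):

```python
import math

def positional_encoding(pos: int, d: int) -> list[float]:
    """Sinusoidal encoding: each dimension pair is a sin/cos 'clock'
    ticking at frequency 1 / 10000^(2i/d)."""
    pe = []
    for i in range(d // 2):
        angle = pos / (10000 ** (2 * i / d))
        pe.append(math.sin(angle))
        pe.append(math.cos(angle))
    return pe

# Position 0 reads sin(0), cos(0) in every pair; later positions diverge.
print(positional_encoding(0, 8))   # [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
```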

AI Connection 2021–2024

Jianlin Su (RoPE) & Meta (Llama)

Rotary Position Embedding (RoPE) represents positions as rotations in 2D subspaces of the embedding. It uses Euler’s formula ($e^{i\theta} = \cos\theta + i\sin\theta$) to rotate query and key vectors based on their position. This elegant number-theoretic approach allows LLMs to generalize to longer sequences than they were trained on — a crucial capability.

$$\text{RoPE}(x_m, m) = x_m \cdot e^{im\theta} \quad \text{where } \theta_j = 10000^{-2j/d}$$
AI Connection

RoPE uses complex number rotations — Euler’s 18th-century formula — to tell modern LLMs where words are. Llama, Mistral, and many other models rely on this.
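A toy sketch using Python's complex numbers. A real implementation rotates query and key vectors inside attention; this just shows the rotation and RoPE's key property — attention scores depend only on the relative offset between positions:

```python
import cmath

def rope(x: list[float], m: int) -> list[float]:
    """RoPE sketch: view consecutive dimension pairs as complex numbers
    and rotate each by the position-dependent angle m * theta_j."""
    d = len(x)
    out = []
    for j in range(d // 2):
        theta = 10000 ** (-2 * j / d)
        z = complex(x[2 * j], x[2 * j + 1]) * cmath.exp(1j * m * theta)
        out.extend([z.real, z.imag])
    return out

dot = lambda a, b: sum(u * v for u, v in zip(a, b))

# Same vector rotated at positions (5, 3) and (12, 10): both pairs have
# offset 2, so the dot products agree — attention sees relative position.
v = [1.0, 0.0, 1.0, 0.0]
q, k = rope(v, m=5), rope(v, m=3)
q2, k2 = rope(v, m=12), rope(v, m=10)
print(abs(dot(q, k) - dot(q2, k2)) < 1e-9)   # True
```

Rotation also preserves vector norms, so RoPE injects position without distorting token content.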

Culmination

The journey from prime numbers to position encoding reveals number theory’s hidden power: the “purest” mathematics became the most practical. Every text you send to an AI passes through layers of encoding — Unicode, tokenization, positional encoding — each rooted in centuries of number-theoretic insight.

The Number Theory Chain
$$\text{Primes} \to \text{Modular Arithmetic} \to \text{Hashing} \to \text{Huffman} \to \text{BPE} \to \text{RoPE}$$

Connections to Other Lectures