The discovery that music can be described by numbers is one of humanity’s oldest mathematical insights. From Pythagoras hearing harmony in hammers to Fourier decomposing heat into waves, the mathematics of vibrations has an unbroken thread leading to how modern AI encodes the position of words in a sentence — using the very same sine and cosine functions that describe a vibrating guitar string.
The Timeline
Pythagoras of Samos
Legend says Pythagoras walked past a blacksmith and noticed that hammers of certain weight ratios produced harmonious sounds. He discovered that harmony corresponds to simple numerical ratios: octave (2:1), fifth (3:2), fourth (4:3). This was the first discovery that nature obeys mathematical relationships — and specifically that waves and vibrations have mathematical structure.
Pythagoras showed that beauty (harmony) has mathematical structure. 2,500 years later, we discovered that meaning (in language) also has mathematical structure.
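The ratios are concrete enough to compute with. A toy Python sketch, taking the modern reference pitch A4 = 440 Hz as an assumption (Pythagoras, of course, had no hertz):

```python
# Pythagorean interval ratios applied to a reference pitch.
# A4 = 440 Hz (modern concert pitch) is assumed purely for illustration.
ratios = {"octave": 2 / 1, "fifth": 3 / 2, "fourth": 4 / 3}

def interval_frequency(base_hz, ratio):
    """Frequency of the note a given interval ratio above the base pitch."""
    return base_hz * ratio

a4 = 440.0
for name, r in ratios.items():
    print(f"{name}: {interval_frequency(a4, r):.1f} Hz")
# A fifth above A440 is 660 Hz; an octave above is 880 Hz.
```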
Jean le Rond d’Alembert & Daniel Bernoulli
D’Alembert derived the wave equation describing a vibrating string. Bernoulli proposed that any vibration is a sum of simple harmonic modes (sines and cosines). This was controversial: can every shape really be built from waves? The answer, supplied by Fourier decades later, was yes. This idea of decomposing complexity into simple waves is the foundation of signal processing and, surprisingly, of transformer position encoding.
Bernoulli’s claim that any vibration is a sum of sines was rejected by Euler and d’Alembert. Fourier proved Bernoulli right 60 years later.
Joseph Fourier
Fourier’s radical claim: ANY periodic function (any reasonably well-behaved one, at least) can be decomposed into a sum of sines and cosines. He developed this to study heat flow, but it became one of the most important ideas in all of mathematics and engineering. Fourier analysis lets you see any signal as a mixture of frequencies, and this frequency-domain view is how transformers encode position.
Fourier’s idea is everywhere: JPEG compression, audio processing, MRI machines, telecommunications, and transformer position encoding all use Fourier decomposition.
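Bernoulli’s claim can be checked numerically: even a square wave, all corners, emerges from smooth sines. A pure-Python sketch of the standard square-wave Fourier series, $\frac{4}{\pi}\sum_{k \text{ odd}} \frac{\sin kx}{k}$ (the evaluation point and term counts are illustrative):

```python
import math

def square_wave_partial(x, n_terms):
    """Partial Fourier sum for a unit square wave:
    (4/pi) * (sin x + sin 3x / 3 + sin 5x / 5 + ...)."""
    return (4 / math.pi) * sum(
        math.sin(k * x) / k for k in range(1, 2 * n_terms, 2)
    )

# The square wave equals 1 at x = pi/2; the partial sums close in on it
# (with the famous Gibbs overshoot near the jumps).
for n in (1, 10, 100):
    print(n, round(square_wave_partial(math.pi / 2, n), 4))
```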
Harry Nyquist & Claude Shannon
To perfectly reconstruct a continuous signal from discrete samples, you must sample at a rate greater than twice the highest frequency present. This theorem governs all digital audio (CD quality: 44,100 samples/second for sounds up to 22,050 Hz), all digital images, and, more loosely, how AI discretizes continuous information into tokens.
Music CDs sample at 44.1 kHz because human hearing tops out at ~20 kHz. The Nyquist theorem says 40 kHz suffices — CD quality adds a small margin.
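The “folding” that happens above the Nyquist limit is easy to compute. A minimal sketch (the helper name `alias_frequency` is made up for illustration, not a standard API):

```python
def alias_frequency(f, fs):
    """Apparent frequency of a pure tone at f Hz after sampling at fs Hz.
    Frequencies above fs/2 (the Nyquist limit) fold back into [0, fs/2]."""
    f = f % fs
    return min(f, fs - f)

fs = 44_100  # CD sampling rate in Hz, Nyquist limit 22,050 Hz
print(alias_frequency(20_000, fs))  # 20000: below the limit, kept faithfully
print(alias_frequency(30_000, fs))  # 14100: above the limit, aliases downward
```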
James Cooley & John Tukey
The FFT algorithm computes the Discrete Fourier Transform in $O(n \log n)$ instead of $O(n^2)$ — a speedup so dramatic it’s been called one of the most important algorithms of the 20th century. Without the FFT, modern signal processing, telecommunications, and many AI applications would be computationally impractical. It’s also the inspiration behind efficient attention mechanisms.
A naive DFT of $N$ points costs $O(N^2)$ operations; the FFT reduces this to $O(N \log N)$.
The FFT made the impossible practical. Transforming a million-point signal went from $10^{12}$ operations to $2 \times 10^{7}$ — a 50,000× speedup.
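Both operation counts correspond to concrete algorithms, and their agreement is easy to verify on small inputs. A pure-Python sketch assuming a power-of-two length (production code would call an optimized library such as `numpy.fft.fft`):

```python
import cmath

def dft(x):
    """Naive discrete Fourier transform: O(N^2) operations."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

def fft(x):
    """Radix-2 Cooley-Tukey FFT: O(N log N). Length must be a power of 2."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    twiddle = [cmath.exp(-2j * cmath.pi * k / n) * odd[k] for k in range(n // 2)]
    return ([even[k] + twiddle[k] for k in range(n // 2)] +
            [even[k] - twiddle[k] for k in range(n // 2)])

# Same answer, vastly different cost as N grows:
signal = [1.0, 2.0, 0.0, -1.0, 1.5, 0.5, -0.5, 2.5]
assert all(abs(a - b) < 1e-9 for a, b in zip(fft(signal), dft(signal)))
```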
Jean Morlet, Ingrid Daubechies & Stéphane Mallat
Wavelets improved on Fourier by providing BOTH frequency AND time information simultaneously. A Fourier transform tells you which frequencies are present, but not when. Wavelets are localized waves that capture transient features. In AI, multi-scale processing (like the hierarchical features learned by CNNs and the multi-head attention of transformers) follows the wavelet philosophy: analyze at multiple resolutions.
Multi-head attention in transformers operates at multiple “scales” — each head can attend to different ranges of context, just like wavelets analyze at multiple resolutions.
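The time-localization that a global Fourier basis lacks can be seen with the simplest wavelet, the Haar wavelet. A sketch of one (unnormalized) analysis step on a toy signal with a single transient:

```python
def haar_step(x):
    """One level of an unnormalized Haar wavelet transform:
    pairwise averages (coarse trend) and differences (localized detail)."""
    averages = [(a + b) / 2 for a, b in zip(x[0::2], x[1::2])]
    details = [(a - b) / 2 for a, b in zip(x[0::2], x[1::2])]
    return averages, details

# A flat signal with one spike: the nonzero detail coefficient tells you
# WHERE the transient is, not just that some high frequency exists.
signal = [1, 1, 1, 1, 1, 9, 1, 1]
avg, det = haar_step(signal)
print(avg)  # [1.0, 1.0, 5.0, 1.0]
print(det)  # [0.0, 0.0, -4.0, 0.0] -- the spike is localized at pair 2
```

Applying `haar_step` again to the averages gives the next coarser scale, which is the multi-resolution recursion the wavelet philosophy is built on.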
Vaswani et al.
Transformers process words in parallel, so positional information must be injected explicitly. The original solution: add sine and cosine waves at geometrically spaced frequencies. Position 1 gets one pattern; position 2 gets a different pattern. Like a Fourier basis, each position has a unique “fingerprint.” The brilliance: the model can learn to compute relative positions, because the angle-addition identities $\sin(a+b) = \sin a \cos b + \cos a \sin b$ and $\cos(a+b) = \cos a \cos b - \sin a \sin b$ make the encoding of position $a+b$ a fixed linear transformation of the encoding of position $a$.
Position encoding IS a Fourier basis. Each dimension is a sine/cosine wave at a different frequency. The position of a word is encoded as a point in this frequency space — Fourier analysis applied to language.
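The encoding is short enough to write out in full. A minimal pure-Python sketch following the formula from “Attention Is All You Need,” where dimension pair $2i$ uses the angle $pos / 10000^{2i/d_{model}}$:

```python
import math

def positional_encoding(position, d_model):
    """Sinusoidal position encoding (Vaswani et al., 2017):
    PE[pos, 2i]   = sin(pos / 10000**(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000**(2i / d_model))
    Each dimension pair is a wave at a different frequency."""
    pe = []
    for two_i in range(0, d_model, 2):
        angle = position / (10000 ** (two_i / d_model))
        pe.append(math.sin(angle))
        pe.append(math.cos(angle))
    return pe[:d_model]

# Every position gets a distinct fingerprint across the frequencies:
print(positional_encoding(0, 8))  # [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
print(positional_encoding(5, 8))
```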
Various Researchers
Standard attention is $O(n^2)$, quadratic in sequence length. Spectral methods use the FFT to bring the mixing cost down to $O(n \log n)$, just as Cooley-Tukey accelerated Fourier transforms. FNet (Google, 2021) replaced attention entirely with Fourier transforms, retaining about 92% of BERT’s accuracy while running up to 7× faster. Hyena (2023) uses long convolutions evaluated in the frequency domain. The future of efficient AI may be Fourier-based.
FNet layer: two FFTs replace self-attention.
FNet showed that replacing attention with simple Fourier transforms loses only 8% accuracy but runs 7× faster. Fourier’s 200-year-old idea may be the key to efficient AI.
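The heart of the FNet idea fits in a few lines. A sketch of the parameter-free mixing sublayer only, with a naive DFT standing in for the FFT; the real model wraps this in residual connections, layer norms, and feed-forward sublayers, all omitted here:

```python
import cmath

def dft(x):
    """Naive 1D DFT (stand-in for an FFT; same result, O(N^2))."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

def fnet_mixing(tokens):
    """FNet-style token mixing: Fourier-transform along the hidden
    dimension, then along the sequence dimension, and keep the real
    part. No learned parameters; this step replaces self-attention."""
    hidden_mixed = [dft(row) for row in tokens]          # mix within tokens
    cols = list(zip(*hidden_mixed))                      # transpose
    seq_mixed = [dft(list(col)) for col in cols]         # mix across tokens
    return [[v.real for v in row] for row in zip(*seq_mixed)]

# Toy "embeddings": 4 tokens, hidden size 4.
x = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0],
     [0.0, 0.0, 0.0, 1.0]]
mixed = fnet_mixing(x)
print(mixed[0])  # each output row now mixes information from every token
```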
The Thread That Connects
From Pythagoras hearing harmony in hammers to FNet using Fourier transforms as attention, the mathematics of waves has always been about decomposing complexity into simple, understandable components. Position encoding, efficient attention, multi-scale processing — all are descendants of the insight that any signal can be built from waves.
Connections to Other Lectures
- Lecture 6: Number Theory & Encoding — Positional encoding and RoPE connect number-theoretic structure with harmonic functions.
- Lecture 2: Linear Algebra & Transformations — The FFT is a factorization of the DFT matrix; Fourier analysis is linear algebra in disguise.
- Lecture 3: Calculus & Optimization — The wave equation and PDEs that gave rise to Fourier analysis are calculus at its deepest.