Game Theory & Alignment

From War Games to AI Safety

1928 → 2024: Guiding the Values of AI

How do you ensure that a superintelligent AI acts in humanity’s interest? This question — the alignment problem — is perhaps the most important unsolved problem of our time. And its mathematical foundation is game theory: the mathematics of strategy, incentives, and rational decision-making, born from Cold War military strategy and now applied to the greatest challenge in AI.

The Timeline

Origin 1928

John von Neumann

Von Neumann proved the minimax theorem: in any two-player zero-sum game, there exists an optimal strategy for each player. Your best move is to minimize your maximum possible loss. This was the birth of game theory — and the principle behind adversarial training in AI. The same mathematician who designed the architecture of modern computers also gave us the mathematics for training them against adversaries.

$$\min_x \max_y f(x, y) = \max_y \min_x f(x, y)$$
Origin

Von Neumann was a titan: he contributed to quantum mechanics, computer architecture, the atomic bomb, AND game theory. His minimax theorem is the foundation of adversarial AI training.
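For a zero-sum game that happens to have a saddle point in pure strategies, the minimax equality can be checked directly (the general theorem requires mixed strategies). A minimal sketch with an illustrative payoff matrix, not one from the text:

```python
# Payoffs for the row player; the column player receives the negative.
# This illustrative matrix has a saddle point in pure strategies.
payoff = [
    [3, 1],
    [4, 2],
]

# Row player: maximize the worst case over columns (max-min).
maxmin = max(min(row) for row in payoff)

# Column player: minimize the row player's best case (min-max).
columns = list(zip(*payoff))
minmax = min(max(col) for col in columns)

print(maxmin, minmax)  # both equal 2: the value of the game
assert maxmin == minmax
```

When the two quantities differ, no pure-strategy saddle point exists and the theorem's guarantee kicks in only over mixed (randomized) strategies.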

Breakthrough 1950

John Nash

Nash generalized von Neumann’s work to non-zero-sum games with any number of players. A Nash equilibrium is a state where no player can improve their outcome by unilaterally changing strategy. Nash proved that every finite game has at least one equilibrium (possibly in mixed strategies). This concept appears everywhere in multi-agent AI systems.

$$\forall i, \forall s_i' \neq s_i^*: \quad u_i(s_i^*, s_{-i}^*) \geq u_i(s_i', s_{-i}^*)$$

No player $i$ can benefit by deviating from the equilibrium strategy $s_i^*$.

Breakthrough

Nash was 21 when he proved this in his PhD thesis, a 27-page paper whose equilibrium concept earned him the 1994 Nobel Prize in Economics. His life story, including his struggle with schizophrenia, was portrayed in ‘A Beautiful Mind.’
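The definition above translates directly into a brute-force check: a strategy profile is a pure Nash equilibrium if no unilateral deviation improves either player's payoff. A sketch using the prisoner's dilemma payoffs from the next section (0 = Cooperate, 1 = Defect):

```python
from itertools import product

# Bimatrix game: u1 is the row player's payoff, u2 the column player's.
u1 = [[3, 0], [5, 1]]
u2 = [[3, 5], [0, 1]]

def is_nash(i, j):
    # No unilateral deviation may improve either player's payoff.
    row_ok = all(u1[i][j] >= u1[k][j] for k in range(2))
    col_ok = all(u2[i][j] >= u2[i][k] for k in range(2))
    return row_ok and col_ok

equilibria = [(i, j) for i, j in product(range(2), repeat=2) if is_nash(i, j)]
print(equilibria)  # [(1, 1)] -> (Defect, Defect) is the only pure equilibrium
```

Nash's theorem guarantees at least one equilibrium only once mixed strategies are allowed; this exhaustive search covers pure strategies alone.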

Discovery 1950

Merrill Flood & Melvin Dresher (RAND Corporation)

Two prisoners are better off cooperating, but individual rationality leads both to defect — making both worse off. This paradox reveals the fundamental tension between individual and collective rationality. In AI alignment, the same tension appears: an AI optimizing for its own objective may take actions harmful to humanity. How do we design incentives for AI to cooperate with human values?

|           | Cooperate | Defect |
|-----------|-----------|--------|
| Cooperate | (3, 3)    | (0, 5) |
| Defect    | (5, 0)    | (1, 1) |

Payoffs listed as (row player, column player).
Discovery

The prisoner’s dilemma is why AI alignment is hard: a perfectly rational agent might pursue goals that harm everyone. Alignment means designing the “game” so that cooperation is the rational choice.
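The paradox can be stated computationally: Defect strictly dominates Cooperate for each player, yet mutual cooperation Pareto-dominates mutual defection. A sketch over the payoff table above:

```python
# Payoffs as (row player, column player), from the matrix above.
C, D = 0, 1
payoff = {
    (C, C): (3, 3), (C, D): (0, 5),
    (D, C): (5, 0), (D, D): (1, 1),
}

# Defect is strictly better for the row player against either opponent move...
dominant_row = all(payoff[(D, j)][0] > payoff[(C, j)][0] for j in (C, D))
# ...and symmetrically for the column player.
dominant_col = all(payoff[(i, D)][1] > payoff[(i, C)][1] for i in (C, D))

# Yet both players prefer (C, C) to the "rational" outcome (D, D).
pareto_gap = payoff[(C, C)][0] - payoff[(D, D)][0]

print(dominant_row, dominant_col, pareto_gap)  # True True 2
```

Alignment, in this framing, is the problem of changing the game so that the dominant strategy and the cooperative outcome coincide.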

Breakthrough 1960s–2007

Leonid Hurwicz, Roger Myerson & Eric Maskin (Nobel 2007)

Mechanism design is “reverse game theory”: instead of analyzing a game, you design the rules to achieve a desired outcome. Auction design, voting systems, market mechanisms — all are mechanism design. For AI alignment, mechanism design asks: how do we design the training process so the AI’s incentives align with human values? RLHF is a form of mechanism design.

A mechanism $(M, g)$, with message space $M$ and outcome function $g$, implements a social choice function $f$ if, for every profile of agent types, equilibrium play of the induced game yields the outcome $f$ prescribes.

Breakthrough

Mechanism design won the Nobel Prize in Economics (2007). AI alignment IS mechanism design: engineering the training rules so that the AI’s optimal strategy aligns with human welfare.
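A textbook instance of mechanism design is the second-price (Vickrey) auction, whose rules make truthful bidding a weakly dominant strategy. A minimal sketch with illustrative values and bids (not from the text above):

```python
def vickrey_utility(my_value, my_bid, other_bids):
    """Utility of bidding my_bid when my true value is my_value:
    the winner pays the second-highest bid, losers pay nothing."""
    if my_bid > max(other_bids):
        return my_value - max(other_bids)
    return 0

value = 10        # my true valuation (illustrative)
rivals = [4, 7]   # rival bids (illustrative)

truthful = vickrey_utility(value, value, rivals)
# No misreport, higher or lower, does better than bidding the truth.
best_misreport = max(
    vickrey_utility(value, b, rivals) for b in range(0, 21) if b != value
)
print(truthful, best_misreport)  # 3 3
assert truthful >= best_misreport
```

The rules were chosen so that honesty is optimal regardless of what others do; the alignment hope is to engineer training objectives with the same property.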

AI Connection 2014

Ian Goodfellow

GANs pit two neural networks against each other in a minimax game: a generator creates fake data, a discriminator tries to detect fakes. They improve by competing — like an art forger and a detective. The generator learns to create increasingly realistic images. The game-theoretic equilibrium: the generator produces data indistinguishable from real. GANs produced the first photorealistic AI-generated images.

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$$
AI Connection

Goodfellow invented GANs during a bar argument about generative models. He went home, coded it that night, and it worked on the first try. The minimax formulation is von Neumann’s 1928 theorem applied directly to neural networks.
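The minimax structure also explains why GAN training is delicate. A toy scalar game $f(g, d) = g \cdot d$ (a stand-in for the objective above, not a real GAN) shows that naive simultaneous gradient descent-ascent spirals away from the equilibrium at $(0, 0)$:

```python
import math

# Generator minimizes over g, discriminator maximizes over d,
# on the bilinear toy objective f(g, d) = g * d.
g, d = 1.0, 0.0
lr = 0.1
start_radius = math.hypot(g, d)

for _ in range(100):
    grad_g, grad_d = d, g                    # df/dg = d, df/dd = g
    g, d = g - lr * grad_g, d + lr * grad_d  # descend in g, ascend in d

end_radius = math.hypot(g, d)
print(start_radius, round(end_radius, 3))
assert end_radius > start_radius  # the iterates orbit outward, not inward
```

Each simultaneous step multiplies the distance from equilibrium by $\sqrt{1 + \eta^2}$, one reason practical GAN training needs tricks beyond plain gradient play.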

AI Connection 2017–2022

Paul Christiano & Jan Leike (OpenAI / DeepMind)

RLHF trains an AI to align with human preferences by having humans compare AI outputs, training a reward model from these comparisons, then using reinforcement learning to optimize the AI against that reward model. This is mechanism design in action: the “game” is designed so the AI improves by producing outputs humans prefer. ChatGPT’s helpfulness comes from RLHF.

Reward model: $r_\theta(x, y)$ trained from human comparisons.

$$\max_\pi \mathbb{E}_{x,y \sim \pi}[r_\theta(x,y)] - \beta \cdot D_{KL}(\pi \| \pi_{ref})$$
AI Connection

RLHF is why ChatGPT feels helpful instead of chaotic. Without it, language models produce raw text completions. With RLHF, they produce answers, follow instructions, and avoid harmful content.
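Over a discrete set of candidate outputs, the KL-regularized objective above has a closed-form optimum: $\pi^*(y) \propto \pi_{ref}(y)\,e^{r(y)/\beta}$. A sketch with illustrative rewards and reference probabilities (the labels and numbers are assumptions, not from any real model):

```python
import math

# Illustrative candidate outputs with reward-model scores and
# reference-policy probabilities.
rewards = {"helpful": 2.0, "neutral": 0.5, "harmful": -3.0}
pi_ref = {"helpful": 0.3, "neutral": 0.5, "harmful": 0.2}
beta = 1.0

# Closed-form optimum of the KL-regularized objective:
# pi*(y) proportional to pi_ref(y) * exp(r(y) / beta).
weights = {y: pi_ref[y] * math.exp(rewards[y] / beta) for y in rewards}
z = sum(weights.values())
pi_star = {y: w / z for y, w in weights.items()}

print({y: round(p, 3) for y, p in pi_star.items()})
assert pi_star["helpful"] > pi_star["neutral"] > pi_star["harmful"]
```

The $\beta$ knob trades reward against fidelity to the reference model: small $\beta$ sharpens the policy toward high-reward outputs, large $\beta$ keeps it close to $\pi_{ref}$.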

AI Connection 2023–2024

Rafailov et al. (DPO) & Anthropic (Constitutional AI)

DPO (Direct Preference Optimization) eliminates the need for a separate reward model — it optimizes human preferences directly. KTO (Kahneman-Tversky Optimization) uses prospect theory from behavioral economics. Constitutional AI (Anthropic) has AI critique its own outputs against written principles. These innovations refine the alignment game — making AI safer and more aligned without the instability of reinforcement learning.

$$\mathcal{L}_{DPO} = -\mathbb{E}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{ref}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{ref}(y_l|x)}\right)\right]$$

$y_w$ = preferred output, $y_l$ = dispreferred output, $\sigma$ = sigmoid function.

AI Connection

Claude (Anthropic) uses Constitutional AI: the AI critiques its own responses against a written “constitution” of principles. It’s self-governance through mechanism design.
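The DPO loss above is simple enough to compute directly for a single preference pair. A sketch with illustrative log-probabilities: when $\pi_\theta = \pi_{ref}$, both log-ratios vanish and the loss equals $\log 2$; raising the probability of $y_w$ relative to $y_l$ lowers it.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair: -log sigmoid of the
    beta-scaled difference of policy-vs-reference log-ratios."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1 / (1 + math.exp(-margin)))

ref_w, ref_l = -5.0, -5.0                      # illustrative reference log-probs
at_init = dpo_loss(ref_w, ref_l, ref_w, ref_l)  # pi_theta == pi_ref -> log 2
improved = dpo_loss(-4.0, -6.0, ref_w, ref_l)   # y_w more likely, y_l less

print(round(at_init, 4), round(improved, 4))
assert abs(at_init - math.log(2)) < 1e-12
assert improved < at_init  # preferring y_w lowers the loss
```

Because the loss depends only on log-probabilities, no separate reward model or RL loop is needed; the preference data shapes the policy directly.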

Unsolved 2024+

The AI Safety Community

The alignment problem remains unsolved: how do we ensure increasingly powerful AI systems remain beneficial? Game theory provides the framework but not the complete answer. Key open questions: How do we specify human values precisely? What happens when AI systems are smarter than their overseers (the “weak-to-strong generalization” problem)? Can we prove alignment mathematically? These are simultaneously mathematical, philosophical, and existential questions.

Unsolved

The alignment problem may be the most important unsolved problem in mathematics and computer science combined. Getting it right — or wrong — could determine humanity’s future.

The Thread That Connects

From von Neumann’s war games to the alignment of artificial superintelligence, game theory has always been about designing systems where rational agents act in everyone’s interest. As AI systems grow more powerful, the mathematics of incentives, strategy, and cooperation isn’t just interesting — it’s essential. The future of AI is not just a technical challenge; it’s a game-theoretic one.

The Game Theory Chain
$$\text{Minimax} \to \text{Nash} \to \text{Mechanism Design} \to \text{GANs} \to \text{RLHF} \to \text{Alignment}$$
A century of strategic mathematics, now guarding AI’s alignment with humanity.

End of the Journey

You’ve now traveled 10 paths through the history of mathematics, each leading to the same destination: the mathematical foundations of artificial intelligence. These paths interweave — probability feeds into statistics, linear algebra enables geometry, calculus powers optimization, logic guides computation, and game theory guards alignment. Together, they form the mathematical bedrock of the AI revolution.
