A fun, accessible look at how mathematics and artificial intelligence help banks, apps, and digital services make decisions
A 45-Minute Talk for High School Students
Every morning, AI makes hundreds of invisible decisions about you
- “You already use AI finance” → Surprise, curiosity
- “Math finds what humans miss” → “Whoa, that’s clever”
- “From data to action” → Empowerment
- “Power, fairness, and your future” → Reflection
- “Math is your superpower” → Inspiration
See Bayes, Sigmoid, Bell Curve & Scatter Plot visualizations ↓
| Concept | Where | How Presented | Formula |
|---|---|---|---|
| Probability | Section 2 | Cafeteria analogy — “What are the chances the mystery meat is good?” | Informal introduction, no formula yet |
| Bayes’ Theorem | Section 3 | Updating your belief when new evidence arrives — “How surprised should you be?” | $$P(\text{Fraud} \mid \text{Data}) = \frac{P(\text{Data} \mid \text{Fraud}) \cdot P(\text{Fraud})}{P(\text{Data})}$$ |
| Normal Distribution | Section 3 | The bell curve describes “normal” spending — outliers trigger alerts | $$f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$ |
| Sigmoid Function | Section 3 | Squishes any number into a probability between 0 and 1 | $$\sigma(x) = \frac{1}{1 + e^{-x}}$$ |
| Decision Boundaries | Section 3 | The line where the AI switches from “OK” to “suspicious” | Visual concept (threshold on sigmoid output) |
| Gradient Descent | Section 4 | Named only — “the AI rolls downhill to find the best answer” | Named, not derived |
| Weighted Average | Section 6 | Credit scores work like school grades — different factors carry different weight | $$\text{Score} = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n$$ |
| Linear Regression | Section 6 | The simplest prediction: draw a straight line through data points | $$y = mx + b$$ |
| Dot Product | Section 8 | Multiply matching preferences, add them up to measure similarity | $$\mathbf{A} \cdot \mathbf{B} = a_1 b_1 + a_2 b_2 + \cdots + a_n b_n$$ |
| Cosine Similarity | Section 8 | How similar are two people’s tastes? Measure the angle between their preference vectors | $$\cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{|\mathbf{A}| \cdot |\mathbf{B}|}$$ |
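Two of the concepts in the table can be combined into a tiny Python sketch: a weighted sum of transaction features, squished through the sigmoid into a fraud probability. The features, weights, and threshold are all made up for illustration:

```python
import math

def sigmoid(x):
    # Squish any real number into a probability between 0 and 1
    return 1.0 / (1.0 + math.exp(-x))

# Made-up transaction features and weights, purely for illustration
features = [2.5, 1.0, 1.0]   # amount z-score, odd-hour flag, new-merchant flag
weights  = [1.2, 0.8, 0.5]   # how much each factor matters

score = sum(w * x for w, x in zip(weights, features))  # the weighted sum
p_fraud = sigmoid(score - 3.0)  # 3.0 is an invented decision threshold

print(round(score, 2))   # 4.3
print(p_fraud > 0.5)     # True: past the decision boundary, so flag it
```

The subtraction of 3.0 is exactly the "decision boundary" idea from the table: scores below the threshold map to probabilities under 0.5, scores above it to probabilities over 0.5.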
Interactive visualizations of the key mathematical concepts
All the formulas from the talk — screenshot-friendly!
How surprised should we be? Update your belief with new evidence.
Squish any number into a probability between 0 and 1.
Different factors matter differently — just like your grade.
Multiply matching preferences, add them up.
How similar are two people’s tastes? Measure the angle.
The simplest prediction: a straight line through your data.
The bell curve — what “normal” looks like mathematically. Outliers trigger alerts.
μ is the mean (center), σ is the standard deviation (spread). The further a transaction is from the center, the more unusual it is.
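The outlier idea can be sketched in a few lines of Python, using made-up transaction amounts:

```python
import statistics

# A week of hypothetical card transactions (EUR)
amounts = [12.5, 9.8, 15.0, 11.2, 14.1, 10.7, 13.4]
mu = statistics.mean(amounts)       # center of "normal" spending
sigma = statistics.stdev(amounts)   # spread

def z_score(x):
    # How many standard deviations is x from normal spending?
    return (x - mu) / sigma

# A sudden 3000 EUR purchase sits far out in the tail of the bell curve
print(z_score(3000) > 3)  # True: unusual enough to trigger an alert
```

A common rule of thumb flags anything beyond about 3 standard deviations; real systems combine many such signals rather than a single threshold.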
How the AI knows which direction to adjust — elegantly expressed in terms of the sigmoid itself.
This derivative is used in backpropagation to train neural networks. Its simplicity is what made early neural networks computationally feasible.
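The identity in question, σ′(x) = σ(x)(1 − σ(x)), can be verified numerically in a few lines of Python:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    # The elegant identity: sigma'(x) = sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

# Numerical check with a central difference
h = 1e-6
for x in (-2.0, 0.0, 1.5):
    numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
    assert abs(numeric - sigmoid_prime(x)) < 1e-8

print(sigmoid_prime(0.0))  # 0.25: the curve is steepest at its midpoint
```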
Every story connects a real event to the mathematics behind Large Language Models
Eight Google researchers—including a 20-year-old intern named Aidan Gomez—published a 15-page paper with a Beatles-inspired title. It became the most cited AI paper in history. Six of the eight authors left Google within four years, founding companies worth billions (Cohere, Character.AI, Inceptive). Noam Shazeer, who designed the attention mechanism, quit in 2021 and was brought back in 2024 for $2.7 billion.
Each word generates a Query (what am I looking for?), Key (what do I offer?), and Value (my content). The dot product Q·K measures relevance. Softmax converts scores to probabilities summing to 1. The result: every word “pays attention” to every other word simultaneously—which is why GPUs can train transformers so fast.
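The mechanics can be sketched in toy Python: scaled dot-product attention over made-up 2-number "word" vectors. Real models use hundreds of dimensions and learned projection matrices for Q, K, and V; here Q = K = V for simplicity:

```python
import math

def softmax(scores):
    # Convert raw scores into probabilities summing to 1
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    # Scaled dot-product attention over toy vectors (lists of floats)
    d = len(Q[0])
    out = []
    for q in Q:
        # Q.K relevance scores, scaled by sqrt(dimension)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)  # how much this word attends to each other word
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Three "words", each a made-up 2-number vector
Q = K = V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
result = attention(Q, K, V)
print(len(result), len(result[0]))  # 3 2: one blended vector per word
```

Note that the loop over words has no dependency between iterations: every word's output can be computed at the same time, which is the parallelism GPUs exploit.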
Google researcher Tomas Mikolov submitted a paper that peer reviewers rejected—at a conference with a 70% acceptance rate. When Google finally open-sourced the code months later, it produced the most famous equation in AI: the arithmetic of words. A neural network trained on billions of words discovered that “King − Man + Woman” lands near “Queen” in vector space. The same paper won the NeurIPS Test of Time Award a decade later.
Every word becomes a vector of 300 numbers. Similar words cluster together. Relationships (gender, royalty, country→capital) appear as consistent directions. Cosine similarity measures how close two words are—the same formula from Section 8 of this talk.
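The "arithmetic of words" can be sketched with tiny made-up 3-number embeddings (real word2vec vectors have ~300 numbers, learned from billions of words):

```python
import math

# Tiny invented "embeddings", purely for illustration
vec = {
    "king":  [0.9, 0.8, 0.1],
    "man":   [0.5, 0.1, 0.1],
    "woman": [0.5, 0.1, 0.9],
    "queen": [0.9, 0.8, 0.9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# king - man + woman, component by component
target = [k - m + w for k, m, w in zip(vec["king"], vec["man"], vec["woman"])]
best = max(vec, key=lambda word: cosine(vec[word], target))
print(best)  # queen
```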
In 1951, a 35-year-old mathematician at Bell Labs named Claude Shannon ran a remarkable experiment: he asked people to predict the next letter of a text, one character at a time. If wrong, they were told the correct letter. By counting guesses, Shannon measured the statistical structure of English—finding it has only ~1.1 bits of entropy per character (out of a maximum 4.7). He had described exactly what ChatGPT does: minimize uncertainty about the next token. He did it 71 years before ChatGPT existed.
Entropy measures average surprise per symbol. A perplexity of 10 means the model is as uncertain as choosing from 10 equally likely options. Training an LLM on the internet is extreme compression of human knowledge—to predict the next word, the model must learn facts, grammar, logic, and culture.
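Entropy and perplexity can be computed directly on toy next-token distributions:

```python
import math

def entropy(probs):
    # Average surprise, in bits per symbol
    return -sum(p * math.log2(p) for p in probs if p > 0)

uniform = [0.25] * 4           # four equally likely next tokens
skewed = [0.7, 0.1, 0.1, 0.1]  # the model has learned some structure

H_uniform = entropy(uniform)
H_skewed = entropy(skewed)
print(H_uniform, round(2 ** H_uniform))  # 2.0 bits, perplexity 4
print(H_skewed < H_uniform)              # True: structure lowers surprise
```

This is Shannon's point in miniature: the more statistical structure a model has learned, the lower its entropy, and the fewer "effective options" (perplexity) it faces per token.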
Every word ChatGPT types is the winner of a probability competition among 100,000+ candidates. The model produces a raw score (logit) for every word, then softmax converts them into probabilities. “The” might get 32%, “a” gets 18%, and 100,000 others share the rest. A parameter called temperature controls randomness: at 0, the model always picks the top word (robotic). At 1, it samples proportionally (creative but risky). Every AI conversation is literally a sequence of weighted dice rolls.
The exponential function amplifies differences: a small advantage in raw score becomes a large probability advantage. Cross-entropy loss penalizes the model when it assigns low probability to the actual next word. If p = 0.01, loss = 4.6 (harsh penalty). If p = 0.99, loss ≈ 0.01 (almost no penalty).
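Both pieces, softmax amplification and the cross-entropy penalty, fit in a short Python sketch (the raw scores are made up):

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(p_correct):
    # Penalty for assigning probability p_correct to the actual next word
    return -math.log(p_correct)

probs = softmax([5.0, 4.4, 1.0])  # invented raw scores for three candidates
# probs[0] / probs[1] = e**0.6: a 0.6 logit gap nearly doubles the probability

print(round(cross_entropy(0.01), 1))  # 4.6: harsh penalty
print(round(cross_entropy(0.99), 2))  # 0.01: almost none
```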
In 2022, DeepMind proved that every major AI lab had the wrong formula. Everyone was building bigger models, assuming more parameters = better. DeepMind’s Chinchilla—with 70B parameters, half the size of Gopher (280B)—outperformed it on almost every benchmark by training on 4× more data. The secret was a simple power law: scale data and model size equally.
For a fixed compute budget C, optimal model size N and training data D should both scale as the square root of compute: N ∝ C^0.5 and D ∝ C^0.5. The earlier belief (Kaplan 2020) was N ∝ C^0.73, over-weighting size. Performance follows power laws—the same y = ax^b as Kepler’s planetary laws.
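A back-of-the-envelope sketch of the recipe, using the commonly quoted approximations C ≈ 6·N·D and roughly 20 training tokens per parameter. Both are simplifications for illustration, not DeepMind's exact fit:

```python
import math

def chinchilla_split(C, tokens_per_param=20.0):
    # Assumes C = 6*N*D and D = 20*N (rough rules of thumb)
    N = math.sqrt(C / (6.0 * tokens_per_param))  # parameters
    D = tokens_per_param * N                     # training tokens
    return N, D

N1, D1 = chinchilla_split(1e23)  # a hypothetical compute budget in FLOPs
N2, D2 = chinchilla_split(4e23)  # 4x the compute...
print(round(N2 / N1, 2), round(D2 / D1, 2))  # 2.0 2.0: both double (sqrt scaling)
```

Quadrupling compute doubles both the optimal parameter count and the optimal token count, which is exactly the "scale data and model size equally" lesson.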
In 2022, Google Brain researchers discovered that simply adding four words—“Let’s think step by step”—to any prompt dramatically improved LLM math performance. In 2024, OpenAI’s o1 model took this further: trained with reinforcement learning on reasoning traces, it generates thousands of hidden “thinking tokens” before answering. On the 2024 AIME (top 3% of US math students), GPT-4o scored 12%. o1 scored 93%. More thinking time literally makes AI smarter.
Before o1, all compute went into training. Now performance also scales with compute spent at inference—how long the model “thinks.” Mathematically, this is tree search: each thinking step explores a node in a decision tree. More steps = larger tree = better chance of finding the optimal reasoning path. A third dimension of scaling beyond parameters and data.
In January 2025, a two-year-old Chinese company called DeepSeek—founded by hedge fund manager Liang Wenfeng—released model R1 for free. Training compute cost: $5.6 million (vs. GPT-4’s estimated $100M+ total development budget). One week later, it was the #1 app on the US App Store. The same day, Nvidia lost $589 billion in market cap—the largest single-day loss in stock market history. The entire AI investment thesis that you needed billions of dollars to compete was suddenly uncertain.
DeepSeek V3 has 671B total parameters but only 37B are active per input (<6%). A routing function selects which K of N specialist sub-networks (“experts”) handle each token. Result: knowledge capacity of 671B, compute cost of 37B. Plus, R1 learned reasoning through pure reinforcement learning—no human labels needed.
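The routing idea in toy form: pick the top-K scoring experts per token. The gating scores here are invented; real routers learn them:

```python
def route(gate_scores, k=2):
    # Return the indices of the k highest-scoring experts for this token
    ranked = sorted(range(len(gate_scores)),
                    key=lambda i: gate_scores[i], reverse=True)
    return ranked[:k]

scores = [0.1, 2.3, -0.5, 1.7, 0.9, -1.2, 0.4, 2.0]  # made-up scores, 8 experts
active = route(scores, k=2)
print(sorted(active))  # [1, 7]: only 2 of 8 experts run, ~25% of the compute
```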
Training a language model means adjusting hundreds of billions of numbers to reduce how wrong the model is. The algorithm is beautifully simple: imagine you’re blindfolded on a mountain and want to reach the lowest valley. You feel the slope under your feet and take a small step downhill. Repeat billions of times. The math hasn’t changed since Rumelhart, Hinton & Williams formalized backpropagation in 1986. One of the three authors, Geoffrey Hinton, won the 2024 Nobel Prize in Physics for foundational work on neural networks. What changed: hardware, data, and the transformer architecture it’s applied to.
The gradient ∇θ tells you: “if I nudge this parameter, how much does the error change?” Backpropagation uses the chain rule of calculus to compute gradients for billions of parameters in one backward pass. The learning rate α controls step size—too large and you overshoot, too small and you never arrive.
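The blindfolded-mountain picture in code: one-dimensional gradient descent on f(x) = (x − 3)², whose gradient is 2(x − 3):

```python
def gradient_descent(grad, x0, lr=0.1, steps=100):
    # Feel the slope, take a small step downhill, repeat
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)  # the learning rate lr is the step size
    return x

# Minimize f(x) = (x - 3)^2; its gradient is 2*(x - 3)
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(round(x_min, 4))  # 3.0: the bottom of the valley
```

Try lr=1.1 to see overshooting, or lr=0.0001 to see "never arriving"; training an LLM is the same loop over billions of parameters at once.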
“How many R’s are in the word strawberry?” AI answers: two. The correct answer is three. This became the most-shared LLM failure on the internet. The reason is purely mathematical: GPT-4’s tokenizer splits “strawberry” into [str][aw][berry]. The model never sees individual characters—it sees three tokens. It can’t count letters it can’t see. Fix: ask the model to spell it out letter by letter first, then count. Forcing character-level tokens makes counting trivial.
BPE (1994 compression algorithm) iteratively merges the most frequent character pairs until a target vocabulary is reached. GPT-4 has ~100,000 tokens. Each word gets split into subword chunks. The model reasons about tokens, not characters—explaining why LLMs struggle with letter counting, spelling backwards, and rhyming.
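A miniature BPE in Python: repeatedly merge the most frequent adjacent pair. Ties are broken by first appearance, a simplification of real tokenizers, and the corpus is just three words:

```python
from collections import Counter

def most_frequent_pair(words):
    pairs = Counter()
    for w in words:
        for a, b in zip(w, w[1:]):
            pairs[(a, b)] += 1
    # max() keeps the first-seen pair on ties (dict insertion order)
    return max(pairs, key=pairs.get)

def merge(words, pair):
    a, b = pair
    merged = []
    for w in words:
        out, i = [], 0
        while i < len(w):
            if i + 1 < len(w) and (w[i], w[i + 1]) == pair:
                out.append(a + b)  # fuse the pair into one token
                i += 2
            else:
                out.append(w[i])
                i += 1
        merged.append(out)
    return merged

corpus = [list("strawberry"), list("blueberry"), list("berry")]
for _ in range(4):  # four merge rounds
    corpus = merge(corpus, most_frequent_pair(corpus))

print(corpus[0])  # ['s', 't', 'r', 'a', 'w', 'berry']
print(corpus[2])  # ['berry']: a whole word has become a single token
```

After only four merges, "berry" is one token and "strawberry" ends in the chunk "berry", so the model would never see its individual R's, which is the letter-counting failure in miniature.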
Attorney Steven Schwartz asked ChatGPT to find legal precedents for a case against Avianca airlines. ChatGPT cited “Varghese v. China Southern Airlines,” “Shaboon v. Egyptair,” and four more cases. When asked to confirm, it said: “These cases indeed exist and can be found in reputable legal databases.” None of them existed. The judge called the legal reasoning “gibberish.” Schwartz was fined $5,000. He had trusted a probability machine to fact-check itself.
If each token has a 99% chance of being locally plausible, a 100-token response is entirely plausible with probability 0.99^100 ≈ 0.366—a 63% chance of containing at least one error. The model has no truth oracle. “Case name + airline + court” is a high-probability pattern in legal text. Plausible ≠ true.
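The compounding arithmetic, checked directly:

```python
p_token_ok = 0.99             # chance each token is locally plausible
p_all_ok = p_token_ok ** 100  # chance all 100 tokens are plausible

print(round(p_all_ok, 3))      # 0.366
print(round(1 - p_all_ok, 2))  # 0.63: chance of at least one error
```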
In September 2024, researchers published a paper proving that LLM hallucinations are not just engineering bugs—they are mathematically inevitable. Any system that generates text by sampling from a learned probability distribution will sometimes produce false statements. The fix (explicit confidence scoring for every claim) works in theory but is impractical: it would require the model to pause and verify each statement against a factual database, making responses extremely slow and expensive.
The probability distribution over tokens spreads across incorrect possibilities. With ~100K vocabulary items and softmax normalization, there is always non-zero probability mass on wrong tokens. The chain rule of probability compounds: if each step has 1% error rate, a 100-word sentence has ~63% chance of at least one error. Perfect truthfulness requires external verification—something the architecture fundamentally lacks.
Tech entrepreneur David Heinemeier Hansson discovered Apple Card gave him a credit limit 20× higher than his wife’s—despite her having a better credit score. Apple co-founder Steve Wozniak reported the same. The algorithm never explicitly used gender. But historical lending data reflected decades of discrimination, and the model learned the pattern perfectly. In 2024, the CFPB fined Apple $25M and Goldman Sachs $45M.
A “proxy variable” encodes a protected characteristic indirectly: zip code encodes race, shopping patterns encode gender. Chouldechova’s Impossibility Theorem (2017) proves you cannot simultaneously satisfy equal false positive rates, equal false negative rates, and equal calibration. Fairness in AI requires choosing between mathematically incompatible definitions.
200,000 employees at the world’s largest bank now use an LLM daily. Their “LLM Suite” won Innovation of the Year. When markets swung sharply in April 2025, the AI tool Coach helped advisers find information 95% faster. Investment bankers automate 40% of SEC filing analysis. AI-powered fraud detection prevents an estimated $1.5 billion in losses with 98% accuracy across 60+ countries. Total tech budget: $17 billion per year.
JPMorgan’s system uses Retrieval-Augmented Generation (RAG): financial documents are converted into embedding vectors (the same vectors from Story 2), stored in a database, and retrieved by cosine similarity when a query comes in. The LLM then generates answers grounded in actual documents—reducing hallucinations in high-stakes finance.
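The retrieval step can be sketched in toy Python: cosine similarity between a query embedding and document embeddings. All vectors and document names here are invented; real systems use embeddings with ~1000 numbers produced by a neural network:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Made-up document embeddings, purely for illustration
docs = {
    "Q2 earnings report": [0.9, 0.1, 0.2],
    "fraud policy memo":  [0.1, 0.9, 0.3],
    "cafeteria menu":     [0.0, 0.1, 0.9],
}
query = [0.8, 0.2, 0.1]  # pretend embedding of "what were last quarter's profits?"

best = max(docs, key=lambda name: cosine(docs[name], query))
print(best)  # the LLM then answers grounded in this retrieved document
```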
When ChatGPT launched, the underlying model was capable but sometimes offensive or dangerous. The fix: Reinforcement Learning from Human Feedback (RLHF). Humans rate pairs of responses (“which is better?”), training a reward model that scores helpfulness. The LLM is then fine-tuned to maximize that score. Result: independent safety evaluations showed significant reductions in harmful outputs. But researchers also found “reward hacking”—models learning to game the reward model rather than genuinely being helpful. A mathematical cat-and-mouse game.
The idea: maximize how helpful the AI is (reward R) while keeping it close to its original behavior. “Reward hacking” is Goodhart’s Law in action: when a measure becomes a target, it ceases to be a good measure.
KL-divergence DKL measures how far the fine-tuned model π drifts from the reference model πref. β controls the trade-off: too low and the model becomes sycophantic, too high and it ignores human preferences. This is the Lagrangian method from constrained optimization.
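A minimal sketch of the penalized objective on toy next-token distributions; the reward and β values are invented:

```python
import math

def kl_divergence(p, q):
    # D_KL(p || q): how far p has drifted from q (0 means identical)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

pi_ref = [0.5, 0.3, 0.2]  # reference model's next-token probabilities
pi_new = [0.7, 0.2, 0.1]  # fine-tuned model, pulled toward higher reward

reward, beta = 1.8, 0.5   # invented numbers for illustration
objective = reward - beta * kl_divergence(pi_new, pi_ref)

print(kl_divergence(pi_ref, pi_ref))      # 0.0: no drift, no penalty
print(kl_divergence(pi_new, pi_ref) > 0)  # True: drift costs reward
```

Raising β makes the drift penalty dominate (the model barely changes); lowering it lets the reward dominate (the model chases the reward signal), which is the trade-off described above.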
Sources: DeepMind, OpenAI, HuggingFace Sept 2025 Report
Presbyterian minister whose theorem powers modern fraud detection. His work on probability was published posthumously by his friend Richard Price in 1763. Core question: “How surprised should you be?”
Used in Section 3 — Fraud Detection
Child prodigy who summed 1 to 100 in seconds (50 pairs of 101 = 5,050). Discovered the bell curve that describes “normal” behavior in data. His hair allegedly matched its shape.
Used in Section 3 — Normal Distribution
Saw that math and computation were one, 180 years before ChatGPT. Wrote the first algorithm for Charles Babbage’s Analytical Engine. “The Analytical Engine weaves algebraic patterns just as the Jacquard loom weaves flowers and leaves.”
Used in Section 4 — How Does the AI Learn?
Used her “coxcomb diagrams” (polar area charts) to prove that sanitation saves more soldiers than medicine. Data visualization pioneer who changed hospital policy through statistics.
Used in Section 2 — Pattern Recognition
Analyzed vowels and consonants in Pushkin’s poetry to discover sequential patterns. His Markov chains now power credit scoring models, autocomplete, and speech recognition.
Used in Section 6 — Credit Scoring
WWII statistician who told the military to armor the parts of returning planes that DID NOT have bullet holes — because the planes hit in those spots never came back. A masterclass in survivorship bias.
Used in Section 5 — Spot the Fraud
Built the Perceptron in 1958 — the first machine that could learn from data. The New York Times headline read: “New Navy Device Learns by Doing.” Ancestor of every neural network alive today.
Used in Section 3 — Fraud Detection
The father of information theory. His 1948 paper defined entropy and showed that all communication is mathematics. In 1951, he ran the first “predict the next character” experiment — exactly what ChatGPT does, 71 years early.
Used in LLM Stories — Story 3
Nightingale + Lovelace remind us: mathematics has always needed diverse thinkers. The field moves forward when different perspectives ask different questions.
Used in Section 10 — The Bigger Picture
Follow BankBot’s journey from overconfident to wise
“I analyzed your breakfast. You are 94% human.” → Overconfident (Sec 1)
“Too easy. Next.” → Cocky (Sec 4 warmup)
(sweating on the boundary) → Stressed (Sec 3)
“FRAUD!” → “Most things are fine” → “Let me calculate…” → Growing (Sec 3.5)
“FRAUD DETECTED!” / “…suspicious flowers.” → Humbled (Sec 4)
“Based on my analysis, you need 47 houseplants.” → Observant (Sec 8)
(shaking head at social media data) → Critical (Sec 9)
“I am 73% confident. But I defer to the human.” → Wise (Sec 10)
“Raise your hand if you KNEW that AI made 50–200 decisions about you before breakfast.”
“Should the AI block this transaction? Hands up for YES, down for NO.”
Three scenarios: Alex buys a guitar in another city; Tomoko buys 50 gift cards at 3 AM; Karla has small charges in 4 countries. Vote: Fraud or Legit?
Predict the AI’s output during the live demo — thumbs up if you think it will approve, thumbs down if it will flag.
Rapid-fire: Should an AI use this data for credit decisions? Income? Social media? Zip code? GPA? Shout “USE IT” or “SKIP IT”!
Visual math explanations that make linear algebra, calculus, and neural networks click.
Free statistics and probability courses — master the foundations at your own pace.
Train your own machine learning model right in the browser — no coding required.
Fun, engaging math videos covering everything from prime numbers to infinity.
By Cathy O’Neil — how unchecked algorithms reinforce inequality. Essential reading on AI bias.
By David Spiegelhalter — learn to make sense of data in everyday life. Accessible and brilliantly written.
Jay Alammar’s visual walkthrough of the transformer architecture—the best visual explanation on the internet.
Grant Sanderson’s visual deep dive into how GPT works—attention, embeddings, and training in one video.
Career paths that combine mathematics, AI, and finance
The detailed plans behind this talk — from content to visuals to deployment
The complete 45-minute talk plan with 10 sections, 8 history vignettes, 7 formulas, BankBot running gag, cartoons, and speaker notes. Includes the storytelling arc, timing philosophy, and backup strategies.
View Talk Plan →
Technical blueprint for this GitHub Pages website — file structure, design system, task breakdown, KaTeX integration, responsive CSS architecture, and deployment verification checklist.
View Deployment Plan →
Comprehensive visual specification: 40 slides, 17 data visualizations, BankBot character bible, cartoon briefs, historical imagery, physical props, live demo UI, and production pipeline.
View Visual Plan →

| Section | Time | Duration | Content |
|---|---|---|---|
| 1. Opening Hook | 0:00 – 3:00 | 3 min | 50–200 AI decisions factoid, BankBot intro, hand raise |
| 2. Think Like an AI | 3:00 – 7:00 | 4 min | Pattern recognition, cafeteria analogy, Nightingale |
| 3. Fraud Detection | 7:00 – 15:30 | 8.5 min | Bayes, bell curve, sigmoid, Gauss/Bayes/Rosenblatt |
| 4. How AI Learns | 15:30 – 17:00 | 1.5 min | Feedback loops, gradient descent, Lovelace |
| 5. Spot the Fraud | 17:00 – 21:00 | 4 min | Interactive voting, Wald, BankBot false positive |
| 6. Credit Scoring | 21:00 – 28:00 | 7 min | Weighted average, linear regression, Markov |
| 7. Live Demo | 28:00 – 32:00 | 4 min | 3-tier demo, formula callbacks |
| 8. Recommendations | 32:00 – 34:00 | 2 min | Dot product, cosine similarity, 47 houseplants |
| 9. Design Your AI | 34:00 – 37:00 | 3 min | USE IT / SKIP IT exercise |
| 10. Bigger Picture | 37:00 – 42:00 | 5 min | Ethics, careers, BankBot finale, call to action |
| Buffer / Q&A | 42:00 – 45:00 | 3 min | Questions, overflow |
Approximately 32 slides total, averaging about 1.3 minutes per slide. Heavier sections (Fraud Detection, Credit Scoring) may use 5–7 slides each; lighter sections (Opening, Closing) use 2–3.