"The Neuron"
| Applicant | Income (k€) | Debt Ratio | Credit Score |
|---|---|---|---|
| Anna | 45 | 0.3 | 720 |
| Ben | 80 | 0.1 | 680 |
| Clara | 35 | 0.5 | 590 |
| David | 60 | 0.2 | 750 |
| Eva | 25 | 0.7 | 520 |
Your Task
- Using weights $w_{\text{income}} = 0.01$, $w_{\text{debt}} = -2.0$, $w_{\text{credit}} = 0.005$, and bias $b = -3.5$, compute the weighted sum for Anna: $z = 0.01 \times 45 + (-2.0) \times 0.3 + 0.005 \times 720 + (-3.5)$
- Apply the sigmoid function $\sigma(z) = \frac{1}{1 + e^{-z}}$ to your result. Should Anna be approved (output > 0.5)?
- Which applicant would get the LOWEST score? Why?
Reveal Solution
Anna's weighted sum: $z = 0.45 - 0.6 + 3.6 - 3.5 = -0.05$. Sigmoid: $\sigma(-0.05) \approx 0.49$. Just below 0.5 — borderline reject!
- This is exactly how a single neuron (or perceptron) works: weighted sum of inputs plus bias, passed through an activation function.
- Eva gets the lowest score because of her low income, high debt ratio, and low credit score.
- A real neural network stacks many neurons to capture more complex patterns.
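The whole table can be checked with a few lines of Python, a minimal sketch of a single neuron using only the standard `math` module and the weights given above:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Weights and bias from the exercise
w = {"income": 0.01, "debt": -2.0, "credit": 0.005}
b = -3.5

# (income in k€, debt ratio, credit score) per applicant
applicants = {
    "Anna":  (45, 0.3, 720),
    "Ben":   (80, 0.1, 680),
    "Clara": (35, 0.5, 590),
    "David": (60, 0.2, 750),
    "Eva":   (25, 0.7, 520),
}

for name, (income, debt, credit) in applicants.items():
    # Weighted sum of inputs plus bias, then the sigmoid activation
    z = w["income"] * income + w["debt"] * debt + w["credit"] * credit + b
    score = sigmoid(z)
    print(f"{name:5s} z = {z:+.2f} sigmoid = {score:.3f} -> "
          f"{'approve' if score > 0.5 else 'reject'}")
```

Running it confirms the solution: Anna lands just below 0.5, and Eva's combination of low income, high debt, and low credit score pushes her z far into negative territory.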
"Layers of Learning"
| Layer | What It Sees | Example (Fraud Detection) |
|---|---|---|
| Input (Raw Data) | Individual numbers | Amount: €847, Time: 3am, Location: abroad |
| Hidden Layer 1 | Simple patterns | "Large amount" + "unusual time" |
| Hidden Layer 2 | Complex combinations | "Unusual spending pattern abroad at night" |
| Output | Decision | Fraud probability: 87% |
Your Task
- Why can't a single neuron detect the fraud pattern "large amount AND unusual time AND abroad"? What would it miss?
- How does adding layers help? What does each layer "learn" that the previous one couldn't?
- What happens if we add 100 layers instead of 2? Is deeper always better?
Reveal Solution
A single neuron computes a weighted sum, so it can only separate the data with one straight line (a hyperplane in higher dimensions). The fraud pattern "large AND unusual AND abroad" requires first detecting each sub-pattern and then combining them — a non-linear combination of features that a lone neuron cannot form.
- Each hidden layer creates new, more abstract representations: raw features → simple patterns → complex patterns → decision.
- More layers ≠ always better: too many layers cause vanishing gradients (signals fade) and overfitting.
- Practical networks for tabular data typically use 2–5 hidden layers.
This layered learning is what makes deep learning "deep."
"Learning from Mistakes"
| Stock | Predicted Return (%) | Actual Return (%) | Squared Error |
|---|---|---|---|
| Apple | 8.2 | 7.5 | ? |
| BMW | 3.1 | 5.0 | ? |
| Nestlé | -1.5 | -2.0 | ? |
| HSBC | 4.0 | 1.2 | ? |
| Tesla | 12.0 | 15.3 | ? |
Your Task
- Compute the squared error $(y_i - \hat{y}_i)^2$ for each stock and the Mean Squared Error (MSE) across all five.
- Which stock contributes the MOST to the total error? What does this tell the network?
- If you could adjust the network's weights by a tiny amount, which prediction would you try to improve first?
Reveal Solution
Squared errors: Apple $(7.5-8.2)^2 = 0.49$, BMW $(5.0-3.1)^2 = 3.61$, Nestlé $(-2.0-(-1.5))^2 = 0.25$, HSBC $(1.2-4.0)^2 = 7.84$, Tesla $(15.3-12.0)^2 = 10.89$. MSE $= \frac{0.49 + 3.61 + 0.25 + 7.84 + 10.89}{5} = 4.616$.
- Tesla and HSBC contribute the most error — large individual mistakes are magnified by squaring.
- The loss function (here MSE: $L = \frac{1}{n}\sum(y_i - \hat{y}_i)^2$) tells the network HOW wrong it is overall.
- Gradient descent adjusts weights: $w \leftarrow w - \alpha \frac{\partial L}{\partial w}$. Because errors are squared, large mistakes produce large gradients, so the weights behind the worst predictions get the biggest nudges.
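The squared errors and MSE from the table can be reproduced in a few lines — a quick sanity check, using plain Python:

```python
# Predicted vs. actual returns (%) from the table
preds   = {"Apple": 8.2, "BMW": 3.1, "Nestlé": -1.5, "HSBC": 4.0, "Tesla": 12.0}
actuals = {"Apple": 7.5, "BMW": 5.0, "Nestlé": -2.0, "HSBC": 1.2, "Tesla": 15.3}

# Squared error per stock, then the mean across all five
sq_errors = {s: (actuals[s] - preds[s]) ** 2 for s in preds}
mse = sum(sq_errors.values()) / len(sq_errors)

# Print the largest contributors first
for stock, err in sorted(sq_errors.items(), key=lambda kv: -kv[1]):
    print(f"{stock:7s} squared error = {err:.2f}")
print(f"MSE = {mse:.3f}")  # 4.616
```

Sorting by squared error puts Tesla and HSBC at the top — the same two predictions the network would "work hardest" to fix.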
"The Chain Rule Trick"
| Layer | Input | Weight | Output |
|---|---|---|---|
| Layer 1 | x = 2.0 | $w_1$ = 0.5 | $h_1$ = 2.0 $\times$ 0.5 = 1.0 |
| Layer 2 | $h_1$ = 1.0 | $w_2$ = 3.0 | $h_2$ = 1.0 $\times$ 3.0 = 3.0 |
| Layer 3 | $h_2$ = 3.0 | $w_3$ = 0.2 | $\hat{y}$ = 3.0 $\times$ 0.2 = 0.6 |
| Target | $y$ = 1.0 | Error | $(1.0 - 0.6)^2$ = 0.16 |
Your Task
- The prediction is 0.6 but the target is 1.0. We need to increase the output. Which weight ($w_1$, $w_2$, or $w_3$) would have the BIGGEST effect if increased slightly?
- If you increase $w_3$ by 0.1 (from 0.2 to 0.3), what is the new output? How much does the error decrease?
- Why can't we just change $w_1$ and ignore the other weights?
Reveal Solution
Increasing $w_3$ to 0.3: new output = 3.0 $\times$ 0.3 = 0.9, error drops from 0.16 to 0.01 — a dramatic improvement!
- $w_3$ has the biggest effect only because $h_2$ is large (3.0). The impact of each weight depends on the values flowing through the network.
- This is backpropagation: compute the error at the output, then trace it backward through each layer using the chain rule.
- Each weight gets a gradient proportional to its contribution to the error — all weights are updated simultaneously.
This is how a network with millions of weights learns: one tiny coordinated nudge at a time.
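The full forward and backward pass for this three-layer chain fits in a dozen lines. Since every layer is just a multiplication, each gradient is the product of everything downstream of that weight — the chain rule made explicit:

```python
# Forward pass through the three-layer linear chain from the table
x, w1, w2, w3, y = 2.0, 0.5, 3.0, 0.2, 1.0
h1 = x * w1              # 1.0
h2 = h1 * w2             # 3.0
y_hat = h2 * w3          # 0.6
loss = (y - y_hat) ** 2  # 0.16

# Backward pass: the chain rule traces the error back layer by layer
dL_dyhat = 2 * (y_hat - y)       # -0.8
dL_dw3 = dL_dyhat * h2           # -2.4  (largest magnitude, because h2 is large)
dL_dw2 = dL_dyhat * w3 * h1      # -0.16
dL_dw1 = dL_dyhat * w3 * w2 * x  # -0.96

# One coordinated gradient-descent step on all three weights at once
lr = 0.05
w1, w2, w3 = w1 - lr * dL_dw1, w2 - lr * dL_dw2, w3 - lr * dL_dw3
new_loss = (y - (x * w1 * w2 * w3)) ** 2
print(loss, "->", new_loss)  # the error shrinks after a single step
```

Note that the gradient for $w_3$ is the largest, matching the intuition from the solution: its input $h_2 = 3.0$ is the biggest value flowing through the network.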
"Too Good to Be True"
| Model | Parameters | Train Accuracy | Test Accuracy | Verdict |
|---|---|---|---|---|
| Tiny | 10 | 72% | 70% | ? |
| Small | 100 | 88% | 85% | ? |
| Medium | 1,000 | 95% | 91% | ? |
| Large | 10,000 | 99% | 82% | ? |
| Huge | 100,000 | 100% | 65% | ? |
Your Task
- Fill in the "Verdict" column: which models are underfitting, which are overfitting, and which is the sweet spot?
- The "Huge" model gets 100% on training data but only 65% on test data. Explain what happened in plain English.
- A bank builds a credit model that scores 99% accuracy on historical data but fails on new applications. What went wrong?
Reveal Solution
Verdicts: Tiny — underfitting (too simple, poor on both). Small/Medium — good generalization. Large — starting to overfit (17% gap). Huge — severe overfitting (35% gap).
- The "Huge" model memorized the noise and individual quirks of the training data instead of learning general patterns.
- The bank's credit model did the same: it memorized historical applications rather than learning what creditworthiness actually means.
- Dropout (randomly disabling neurons during training) and regularization (penalizing large weights) help prevent this.
- Early stopping (halting training when *validation* accuracy stops improving) is one of the simplest and most widely used remedies.
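The verdicts follow a simple rule of thumb that can be coded directly. The cutoffs below (a 10-point train-test gap for overfitting, 75% test accuracy for underfitting) are arbitrary illustrative thresholds chosen to match this table, not universal constants:

```python
# (train accuracy %, test accuracy %) from the table
models = {
    "Tiny":   (72, 70),
    "Small":  (88, 85),
    "Medium": (95, 91),
    "Large":  (99, 82),
    "Huge":   (100, 65),
}

def verdict(train, test):
    """Crude heuristic: judge by the train-test gap, then the absolute level."""
    gap = train - test
    if gap >= 10:
        return "overfitting"   # memorizing the training set
    if test < 75:
        return "underfitting"  # too simple to fit even the training set
    return "good fit"

for name, (train, test) in models.items():
    print(f"{name:6s} gap = {train - test:3d}% -> {verdict(train, test)}")
```

The key signal is the *gap*, not the training accuracy alone: "Huge" looks perfect in training precisely because it memorized, which is why its gap is the widest.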
"The Deep Learning Zoo"
| Architecture | Superpower | Best For | Finance Use Case |
|---|---|---|---|
| CNN (Convolutional) | Detects spatial patterns | Images, documents | Check signature verification, chart pattern recognition |
| RNN / LSTM | Remembers sequences | Time series, text | Stock price forecasting, earnings call analysis |
| Transformer | Pays attention to what matters | NLP, any sequence | FinBERT sentiment, GPT-based analysis, fraud detection |
| GAN | Generates realistic fakes | Synthetic data | Generating realistic but private transaction data for testing |
Your Task
- A bank wants to automatically read handwritten checks. Which architecture would you recommend and why?
- An investment firm wants to predict tomorrow's stock price from the last 30 days of prices. Which architecture fits? What are the risks?
- You've built a fraud detection model but need more training data without violating privacy regulations. Which architecture could help?
Reveal Solution
Each task maps cleanly to one architecture in the zoo:
- Check reading → CNN: convolutional networks scan images for local patterns (edges, curves, characters) regardless of where on the page they appear.
- Stock prediction → RNN/LSTM or Transformer: both handle sequences, but Transformers capture longer-range dependencies better. Risk: no architecture can reliably predict prices — markets are noisy and adversarial, and any detectable pattern tends to be traded away.
- Synthetic data → GAN: Generative Adversarial Networks create realistic transaction records that preserve statistical properties without exposing real customer data.
The Transformer has become the dominant architecture since 2018 (BERT, GPT) and powers most modern NLP in finance, including FinBERT from L06.