"The Neuron"
| Applicant | Income (k€) | Debt Ratio | Credit Score |
|---|---|---|---|
| Anna | 45 | 0.3 | 720 |
| Ben | 80 | 0.1 | 680 |
| Clara | 35 | 0.5 | 590 |
| David | 60 | 0.2 | 750 |
| Eva | 25 | 0.7 | 520 |
Your Task
- Using weights $w_{\text{income}} = 0.01$, $w_{\text{debt}} = -2.0$, $w_{\text{credit}} = 0.005$, and bias $b = -3.5$, compute the weighted sum for Anna: $z = 0.01 \times 45 + (-2.0) \times 0.3 + 0.005 \times 720 + (-3.5)$
- Apply the sigmoid function $\sigma(z) = \frac{1}{1 + e^{-z}}$ to your result. Should Anna be approved (output > 0.5)?
- Which applicant would get the LOWEST score? Why?
Reveal Solution
Anna's weighted sum: $z = 0.45 - 0.6 + 3.6 - 3.5 = -0.05$. Sigmoid: $\sigma(-0.05) \approx 0.49$. Just below 0.5 — borderline reject!
- This is exactly how a single neuron (or perceptron) works: weighted sum of inputs plus bias, passed through an activation function.
- Eva gets the lowest score because of her low income, high debt ratio, and low credit score.
- A real neural network stacks many neurons to capture more complex patterns.
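The whole table can be checked with a few lines of Python, a minimal sketch of a single neuron using only the standard `math` module and the weights given above:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Weights and bias from the exercise
w = {"income": 0.01, "debt": -2.0, "credit": 0.005}
b = -3.5

# (income in k€, debt ratio, credit score) per applicant
applicants = {
    "Anna":  (45, 0.3, 720),
    "Ben":   (80, 0.1, 680),
    "Clara": (35, 0.5, 590),
    "David": (60, 0.2, 750),
    "Eva":   (25, 0.7, 520),
}

for name, (income, debt, credit) in applicants.items():
    # Weighted sum of inputs plus bias, then the sigmoid activation
    z = w["income"] * income + w["debt"] * debt + w["credit"] * credit + b
    score = sigmoid(z)
    print(f"{name:5s} z = {z:+.2f} sigmoid = {score:.3f} -> "
          f"{'approve' if score > 0.5 else 'reject'}")
```

Running it confirms the solution: Anna lands just below 0.5, and Eva's combination of low income, high debt, and low credit score pushes her z far into negative territory.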
"Layers of Learning"
| Layer | What It Sees | Example (Fraud Detection) |
|---|---|---|
| Input (Raw Data) | Individual numbers | Amount: €847, Time: 3am, Location: abroad |
| Hidden Layer 1 | Simple patterns | "Large amount" + "unusual time" |
| Hidden Layer 2 | Complex combinations | "Unusual spending pattern abroad at night" |
| Output | Decision | Fraud probability: 87% |
Your Task
- Why can't a single neuron detect the fraud pattern "large amount AND unusual time AND abroad"? What would it miss?
- How does adding layers help? What does each layer "learn" that the previous one couldn't?
- What happens if we add 100 layers instead of 2? Is deeper always better?
Reveal Solution
A single neuron computes a weighted sum, so it can only separate the data with one straight line (a hyperplane in higher dimensions). The fraud pattern "large AND unusual AND abroad" requires first detecting each sub-pattern and then combining them — a non-linear combination of features that a lone neuron cannot form.
- Each hidden layer creates new, more abstract representations: raw features → simple patterns → complex patterns → decision.
- More layers ≠ always better: too many layers cause vanishing gradients (signals fade) and overfitting.
- Practical networks for tabular data typically use 2–5 hidden layers.
This layered learning is what makes deep learning "deep."
"Learning from Mistakes"
| Stock | Predicted Return (%) | Actual Return (%) | Squared Error |
|---|---|---|---|
| Apple | 8.2 | 7.5 | ? |
| BMW | 3.1 | 5.0 | ? |
| Nestlé | -1.5 | -2.0 | ? |
| HSBC | 4.0 | 1.2 | ? |
| Tesla | 12.0 | 15.3 | ? |
Your Task
- Compute the squared error $(y_i - \hat{y}_i)^2$ for each stock and the Mean Squared Error (MSE) across all five.
- Which stock contributes the MOST to the total error? What does this tell the network?
- If you could adjust the network's weights by a tiny amount, which prediction would you try to improve first?
Reveal Solution
Squared errors: Apple $(7.5-8.2)^2 = 0.49$, BMW $(5.0-3.1)^2 = 3.61$, Nestlé $(-2.0-(-1.5))^2 = 0.25$, HSBC $(1.2-4.0)^2 = 7.84$, Tesla $(15.3-12.0)^2 = 10.89$. MSE $= \frac{0.49 + 3.61 + 0.25 + 7.84 + 10.89}{5} = 4.616$.
- Tesla and HSBC contribute the most error — large individual mistakes are magnified by squaring.
- The loss function (here MSE: $L = \frac{1}{n}\sum(y_i - \hat{y}_i)^2$) tells the network HOW wrong it is overall.
- Gradient descent adjusts weights: $w \leftarrow w - \alpha \frac{\partial L}{\partial w}$. Because errors are squared, large mistakes produce large gradients, so the weights behind the worst predictions get the biggest nudges.
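The squared errors and MSE from the table can be reproduced in a few lines — a quick sanity check, using plain Python:

```python
# Predicted vs. actual returns (%) from the table
preds   = {"Apple": 8.2, "BMW": 3.1, "Nestlé": -1.5, "HSBC": 4.0, "Tesla": 12.0}
actuals = {"Apple": 7.5, "BMW": 5.0, "Nestlé": -2.0, "HSBC": 1.2, "Tesla": 15.3}

# Squared error per stock, then the mean across all five
sq_errors = {s: (actuals[s] - preds[s]) ** 2 for s in preds}
mse = sum(sq_errors.values()) / len(sq_errors)

# Print the largest contributors first
for stock, err in sorted(sq_errors.items(), key=lambda kv: -kv[1]):
    print(f"{stock:7s} squared error = {err:.2f}")
print(f"MSE = {mse:.3f}")  # 4.616
```

Sorting by squared error puts Tesla and HSBC at the top — the same two predictions the network would "work hardest" to fix.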
"The Chain Rule Trick"
| Layer | Input | Weight | Output |
|---|---|---|---|
| Layer 1 | x = 2.0 | $w_1$ = 0.5 | $h_1$ = 2.0 $\times$ 0.5 = 1.0 |
| Layer 2 | $h_1$ = 1.0 | $w_2$ = 3.0 | $h_2$ = 1.0 $\times$ 3.0 = 3.0 |
| Layer 3 | $h_2$ = 3.0 | $w_3$ = 0.2 | $\hat{y}$ = 3.0 $\times$ 0.2 = 0.6 |
| Target | $y$ = 1.0 | Error | $(1.0 - 0.6)^2$ = 0.16 |
Your Task
- The prediction is 0.6 but the target is 1.0. We need to increase the output. Which weight ($w_1$, $w_2$, or $w_3$) would have the BIGGEST effect if increased slightly?
- If you increase $w_3$ by 0.1 (from 0.2 to 0.3), what is the new output? How much does the error decrease?
- Why can't we just change $w_1$ and ignore the other weights?
Reveal Solution
Increasing $w_3$ to 0.3: new output = 3.0 $\times$ 0.3 = 0.9, error drops from 0.16 to 0.01 — a dramatic improvement!
- $w_3$ has the biggest effect only because $h_2$ is large (3.0). The impact of each weight depends on the values flowing through the network.
- This is backpropagation: compute the error at the output, then trace it backward through each layer using the chain rule.
- Each weight gets a gradient proportional to its contribution to the error — all weights are updated simultaneously.
This is how a network with millions of weights learns: one tiny coordinated nudge at a time.
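The full forward and backward pass for this three-layer chain fits in a dozen lines. Since every layer is just a multiplication, each gradient is the product of everything downstream of that weight — the chain rule made explicit:

```python
# Forward pass through the three-layer linear chain from the table
x, w1, w2, w3, y = 2.0, 0.5, 3.0, 0.2, 1.0
h1 = x * w1              # 1.0
h2 = h1 * w2             # 3.0
y_hat = h2 * w3          # 0.6
loss = (y - y_hat) ** 2  # 0.16

# Backward pass: the chain rule traces the error back layer by layer
dL_dyhat = 2 * (y_hat - y)       # -0.8
dL_dw3 = dL_dyhat * h2           # -2.4  (largest magnitude, because h2 is large)
dL_dw2 = dL_dyhat * w3 * h1      # -0.16
dL_dw1 = dL_dyhat * w3 * w2 * x  # -0.96

# One coordinated gradient-descent step on all three weights at once
lr = 0.05
w1, w2, w3 = w1 - lr * dL_dw1, w2 - lr * dL_dw2, w3 - lr * dL_dw3
new_loss = (y - (x * w1 * w2 * w3)) ** 2
print(loss, "->", new_loss)  # the error shrinks after a single step
```

Note that the gradient for $w_3$ is the largest, matching the intuition from the solution: its input $h_2 = 3.0$ is the biggest value flowing through the network.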
"Too Good to Be True"
| Model | Parameters | Train Accuracy | Test Accuracy | Verdict |
|---|---|---|---|---|
| Tiny | 10 | 72% | 70% | ? |
| Small | 100 | 88% | 85% | ? |
| Medium | 1,000 | 95% | 91% | ? |
| Large | 10,000 | 99% | 82% | ? |
| Huge | 100,000 | 100% | 65% | ? |
Your Task
- Fill in the "Verdict" column: which models are underfitting, which are overfitting, and which is the sweet spot?
- The "Huge" model gets 100% on training data but only 65% on test data. Explain what happened in plain English.
- A bank builds a credit model that scores 99% accuracy on historical data but fails on new applications. What went wrong?
Reveal Solution
Verdicts: Tiny — underfitting (too simple, poor on both). Small/Medium — good generalization. Large — starting to overfit (17% gap). Huge — severe overfitting (35% gap).
- The "Huge" model memorized the noise and individual quirks of the training data instead of learning general patterns.
- The bank's credit model did the same: it memorized historical applications rather than learning what creditworthiness actually means.
- Dropout (randomly disabling neurons during training) and regularization (penalizing large weights) help prevent this.
- Early stopping (halting training when *validation* accuracy stops improving) is one of the simplest and most widely used remedies.
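The verdicts follow a simple rule of thumb that can be coded directly. The cutoffs below (a 10-point train-test gap for overfitting, 75% test accuracy for underfitting) are arbitrary illustrative thresholds chosen to match this table, not universal constants:

```python
# (train accuracy %, test accuracy %) from the table
models = {
    "Tiny":   (72, 70),
    "Small":  (88, 85),
    "Medium": (95, 91),
    "Large":  (99, 82),
    "Huge":   (100, 65),
}

def verdict(train, test):
    """Crude heuristic: judge by the train-test gap, then the absolute level."""
    gap = train - test
    if gap >= 10:
        return "overfitting"   # memorizing the training set
    if test < 75:
        return "underfitting"  # too simple to fit even the training set
    return "good fit"

for name, (train, test) in models.items():
    print(f"{name:6s} gap = {train - test:3d}% -> {verdict(train, test)}")
```

The key signal is the *gap*, not the training accuracy alone: "Huge" looks perfect in training precisely because it memorized, which is why its gap is the widest.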
"The Deep Learning Zoo"
| Architecture | Superpower | Best For | Finance Use Case |
|---|---|---|---|
| CNN (Convolutional) | Detects spatial patterns | Images, documents | Check signature verification, chart pattern recognition |
| RNN / LSTM | Remembers sequences | Time series, text | Stock price forecasting, earnings call analysis |
| Transformer | Pays attention to what matters | NLP, any sequence | FinBERT sentiment, GPT-based analysis, fraud detection |
| GAN | Generates realistic fakes | Synthetic data | Generating realistic but private transaction data for testing |
Your Task
- A bank wants to automatically read handwritten checks. Which architecture would you recommend and why?
- An investment firm wants to predict tomorrow's stock price from the last 30 days of prices. Which architecture fits? What are the risks?
- You've built a fraud detection model but need more training data without violating privacy regulations. Which architecture could help?
Reveal Solution
Each task maps cleanly to one architecture in the zoo:
- Check reading → CNN: convolutional networks scan images for local patterns (edges, curves, characters) regardless of where on the page they appear.
- Stock prediction → RNN/LSTM or Transformer: both handle sequences, but Transformers capture longer-range dependencies better. Risk: no architecture can reliably predict prices — markets are noisy and adversarial, and any detectable pattern tends to be traded away.
- Synthetic data → GAN: Generative Adversarial Networks create realistic transaction records that preserve statistical properties without exposing real customer data.
The Transformer has become the dominant architecture since 2018 (BERT, GPT) and powers most modern NLP in finance, including FinBERT from L06.