Lecture 5: Gradient Descent and Backpropagation
| Duration: ~45 minutes | Slides: 38 | Prerequisites: Lecture 4 |
Learning Objectives
After completing this lecture, you should be able to:
- Explain gradient descent as an optimization algorithm
- Visualize loss landscapes and local minima
- Understand the role of learning rate in training
- Apply the chain rule to compute gradients
- Trace backpropagation through a simple network
- Explain how credit assignment works in neural networks
Key Concepts
1. The Optimization Problem
Training a neural network means finding weights that minimize the loss function.
Formal statement:
Find W* = argmin_W L(W)
Where:
- W = all weights and biases in the network
- L = loss function
- W* = optimal weights
The challenge: Neural networks can have millions of parameters. We can’t try all possible combinations!
2. Loss Landscapes
The loss function creates a “landscape” over the weight space.
Visualization (2D slice):
- x-axis, y-axis: Two weights
- z-axis (height): Loss value
- Goal: Find the lowest point
Key features:
- Global minimum: The absolute lowest point (what we want)
- Local minima: Low points that aren’t the lowest
- Saddle points: Points that are minima in some directions, maxima in others
- Plateaus: Flat regions where gradients are tiny
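To make this concrete, the short script below plots one such 2D slice. The toy loss function is invented purely for illustration (it is not from the slides); it has two basins, one deeper than the other, so both a global and a local minimum are visible.

```python
import numpy as np
import matplotlib.pyplot as plt

# Toy loss over two weights: two basins, with the one near (-1.5, -1.5) deeper (the global minimum).
w1, w2 = np.meshgrid(np.linspace(-3, 3, 200), np.linspace(-3, 3, 200))
loss = ((w1 - 1.5)**2 + (w2 - 1.5)**2) * ((w1 + 1.5)**2 + (w2 + 1.5)**2) / 20 + 0.3 * w1

plt.contourf(w1, w2, loss, levels=40)   # color encodes the loss value
plt.xlabel("weight 1")
plt.ylabel("weight 2")
plt.colorbar(label="loss")
plt.show()
```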
3. Gradient Descent: The Core Idea
Intuition: Imagine standing on a hill in fog. To go down, you feel the slope under your feet and step in the direction that goes down most steeply.
Algorithm:
1. Start at a random position (initialize weights)
2. Compute the gradient (direction of steepest increase)
3. Take a step in the opposite direction (gradient descent, not ascent!)
4. Repeat until convergence
Update rule:
W_new = W_old - learning_rate * gradient
Or more precisely:
w_i = w_i - eta * (dL/dw_i)
Where:
- eta (learning rate): How big each step is
- dL/dw_i: Partial derivative of loss with respect to weight w_i
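A minimal sketch of this loop on a one-dimensional loss L(w) = (w - 3)^2, whose gradient 2(w - 3) we can write down by hand (the loss function and starting point are illustrative assumptions, not from the lecture):

```python
def loss(w):
    return (w - 3) ** 2      # minimum at w = 3

def grad(w):
    return 2 * (w - 3)       # dL/dw

w = 0.0                      # step 1: start somewhere
eta = 0.1                    # learning rate
for step in range(25):
    w = w - eta * grad(w)    # steps 2-4: move against the gradient, repeat
print(w)                     # close to 3.0
```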
4. The Learning Rate
The learning rate controls step size and is crucial for successful training.
Too large:
- Overshoots the minimum
- Can diverge (loss increases!)
- Unstable training
Too small:
- Very slow convergence
- May get stuck in local minima
- Wastes computational resources
Just right:
- Steady progress toward minimum
- Smooth decrease in loss
Typical values: 0.001 to 0.1 (often requires experimentation)
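The difference is easy to see on the same toy loss L(w) = (w - 3)^2 from the sketch above; this snippet (illustrative only) runs the identical update with three learning rates:

```python
def grad(w):
    return 2 * (w - 3)              # gradient of L(w) = (w - 3)^2

for eta in (1.1, 0.001, 0.1):       # too large, too small, reasonable
    w = 0.0
    for _ in range(50):
        w = w - eta * grad(w)
    print(f"eta={eta}: w after 50 steps = {w:.3f}")

# eta=1.1   -> diverges (w blows up far away from 3)
# eta=0.001 -> barely moves toward 3
# eta=0.1   -> lands very close to the minimum at 3
```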
5. The Chain Rule (Calculus Review)
Backpropagation relies on the chain rule for computing derivatives of composite functions.
The rule:
If y = f(g(x)), then dy/dx = (dy/dg) * (dg/dx)
Example:
y = (3x + 2)^2
Let u = 3x + 2, so y = u^2
dy/dx = (dy/du) * (du/dx)
= 2u * 3
= 6(3x + 2)
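A quick numerical sanity check of this result, comparing the chain-rule answer with a central finite difference (a standalone sketch; the test point x = 1 is arbitrary):

```python
def y(x):
    return (3 * x + 2) ** 2

def dydx_chain(x):
    return 6 * (3 * x + 2)                       # chain-rule result

x, h = 1.0, 1e-6
numeric = (y(x + h) - y(x - h)) / (2 * h)        # central finite difference
print(dydx_chain(x), numeric)                    # both approximately 30.0
```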
For neural networks: The loss depends on the output, which depends on hidden layers, which depend on weights. We apply the chain rule repeatedly!
6. Backpropagation: The Algorithm
Backpropagation efficiently computes all gradients by working backward through the network.
Key insight: Many gradients share common sub-computations. By working backward, we compute each term only once.
The process:
1. Forward pass: Compute and store all intermediate values (z and a for each layer)
2. Compute the output error: delta_L = dL/da_L * f'(z_L)
3. Backpropagate the error: For each layer l from L-1 down to 1:
   delta_l = (W_{l+1}^T * delta_{l+1}) * f'(z_l)
4. Compute the gradients:
   dL/dW_l = delta_l * a_{l-1}^T
   dL/db_l = delta_l
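A compact sketch of these four steps for a fully connected network with sigmoid activations and squared-error loss (illustrative only; the backprop function, the Ws/bs lists of per-layer parameters, and the loss convention L = (a_L - y)^2 are assumptions made for this example):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(Ws, bs, x, y):
    """Return per-layer gradients (dL/dW_l, dL/db_l) for loss L = (a_L - y)^2."""
    # 1. Forward pass: store pre-activations z and activations a for every layer.
    activations, zs = [x], []
    a = x
    for W, b in zip(Ws, bs):
        z = W @ a + b
        zs.append(z)
        a = sigmoid(z)
        activations.append(a)
    # 2. Output error: dL/da_L = 2 * (a_L - y), times sigmoid'(z_L) = a_L * (1 - a_L).
    delta = 2 * (activations[-1] - y) * a * (1 - a)
    grads_W, grads_b = [], []
    # 3 & 4. Walk backward, computing gradients and propagating the error.
    for l in reversed(range(len(Ws))):
        grads_W.insert(0, np.outer(delta, activations[l]))   # dL/dW_l = delta_l * a_{l-1}^T
        grads_b.insert(0, delta)                              # dL/db_l = delta_l
        if l > 0:
            s = sigmoid(zs[l - 1])
            delta = (Ws[l].T @ delta) * s * (1 - s)           # delta_{l-1}
    return grads_W, grads_b
```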
7. Credit Assignment
The problem: When the network makes a mistake, which weights are responsible?
The solution: Backpropagation assigns “credit” (or blame) to each weight based on how much it contributed to the error.
Intuition:
- Weights on connections that carried large signals get more blame
- Weights connected to neurons with large gradients get more blame
- The further a weight is from the output, the more its credit is diluted
8. A Worked Example
Consider a simple 2-1-1 network:
- Input: x = [1, 2]
- Hidden: 1 neuron with sigmoid
- Output: 1 neuron with sigmoid
- Target: y = 1
Weights:
- W1 = [0.5, 0.5] (input to hidden)
- b1 = 0
- W2 = 0.5 (hidden to output)
- b2 = 0
Forward pass:
z1 = 0.5*1 + 0.5*2 + 0 = 1.5
a1 = sigmoid(1.5) = 0.82
z2 = 0.5*0.82 + 0 = 0.41
a2 = sigmoid(0.41) = 0.60
Loss (MSE):
L = (1 - 0.60)^2 = 0.16
Backward pass:
dL/da2 = -2*(1 - 0.60) = -0.80
da2/dz2 = sigmoid(0.41)*(1-sigmoid(0.41)) = 0.24
delta2 = -0.80 * 0.24 = -0.19
dL/dW2 = delta2 * a1 = -0.19 * 0.82 = -0.16
dL/db2 = delta2 = -0.19
delta1 = (W2 * delta2) * sigmoid'(z1)
= (0.5 * -0.19) * (0.82 * 0.18)
= -0.095 * 0.15 = -0.014
dL/dW1 = delta1 * x^T = -0.014 * [1, 2] = [-0.014, -0.028]
Update (learning rate = 0.5):
W2_new = 0.5 - 0.5*(-0.16) = 0.58
W1_new = [0.5, 0.5] - 0.5*[-0.014, -0.028] = [0.507, 0.514]
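The numbers above are rounded to two decimal places; this short script (an illustrative check, not part of the original notes) reproduces the same forward and backward pass without rounding:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x, y = np.array([1.0, 2.0]), 1.0
W1, b1 = np.array([0.5, 0.5]), 0.0
W2, b2 = 0.5, 0.0

# Forward pass
z1 = W1 @ x + b1                        # 1.5
a1 = sigmoid(z1)                        # ~0.818
z2 = W2 * a1 + b2                       # ~0.409
a2 = sigmoid(z2)                        # ~0.601
loss = (y - a2) ** 2                    # ~0.159

# Backward pass
delta2 = -2 * (y - a2) * a2 * (1 - a2)  # ~-0.192
dW2, db2 = delta2 * a1, delta2          # ~-0.157, ~-0.192
delta1 = (W2 * delta2) * a1 * (1 - a1)  # ~-0.014
dW1 = delta1 * x                        # ~[-0.014, -0.029]

# Update with learning rate 0.5
print(W2 - 0.5 * dW2)                   # ~0.578
print(W1 - 0.5 * dW1)                   # ~[0.507, 0.514]
```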
9. Historical Context: 1986-2012
1986: Rumelhart, Hinton & Williams publish “Learning Representations by Back-propagating Errors” - making backpropagation widely known.
Key insight: Backpropagation made training multi-layer networks practical.
Challenges remained:
- Vanishing gradients in deep networks
- Local minima concerns
- Limited computing power
2012: AlexNet wins ImageNet competition, sparking the deep learning revolution.
10. The Vanishing Gradient Problem
When using sigmoid/tanh activations in deep networks, gradients can become extremely small.
Why it happens:
- Sigmoid derivative max = 0.25 (at z=0)
- Each layer multiplies gradients by activation derivatives
- With 10 layers: 0.25^10 ≈ 0.000001
Result: Early layers learn extremely slowly or not at all.
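A tiny illustration of the effect: even in the best case, multiplying the maximum sigmoid derivative once per layer shrinks the gradient signal geometrically.

```python
MAX_SIGMOID_DERIV = 0.25   # maximum of sigmoid'(z), attained at z = 0

signal = 1.0
for layer in range(10):    # ten sigmoid layers in the backward pass
    signal *= MAX_SIGMOID_DERIV
print(signal)              # ~9.5e-07; real activations make it smaller still
```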
Solutions:
- Use ReLU activations (derivative = 1 for positive inputs)
- Better weight initialization
- Skip connections (ResNets)
- Batch normalization
Key Formulas
Gradient Descent Update
w = w - eta * dL/dw
Chain Rule
dL/dw = dL/da * da/dz * dz/dw
Backpropagation (Output Layer)
delta_L = (a_L - y) * f'(z_L)
For a sigmoid output with MSE loss of the form (1/2)(a_L - y)^2; with the un-halved squared error used in the worked example, the leading factor becomes 2(a_L - y).
Backpropagation (Hidden Layers)
delta_l = (W_{l+1}^T * delta_{l+1}) * f'(z_l)
Weight Gradient
dL/dW_l = delta_l * a_{l-1}^T
Finance Application: Optimizing Trading Strategies
Gradient descent is analogous to optimizing a trading strategy:
Analogy:
- Weights = Strategy parameters (entry threshold, position size, etc.)
- Loss function = Negative Sharpe ratio (we want to maximize Sharpe)
- Gradient = How to adjust parameters to improve performance
- Learning rate = How aggressively to change the strategy
Challenges specific to finance:
- Non-stationary data (market regimes change)
- Limited data (can’t generate more historical data)
- Transaction costs not differentiable
- Look-ahead bias danger
Practice Questions
Mathematical Understanding
Q1: If the loss with respect to the output is dL/da = -2 and the sigmoid derivative at the output is f’(z) = 0.2, what is the output layer delta?
Answer
delta = dL/da * f'(z) = -2 * 0.2 = -0.4
Q2: A weight is currently w = 1.5. The gradient dL/dw = 0.3. With learning rate eta = 0.1, what is the new weight?
Answer
w_new = w_old - eta * dL/dw = 1.5 - 0.1 * 0.3 = 1.5 - 0.03 = 1.47
Q3: Why do we subtract the gradient (gradient descent) rather than add it?
Answer
The gradient points in the direction of steepest increase of the loss function. We want to decrease the loss, so we move in the opposite direction. Subtracting the gradient moves us "downhill" in the loss landscape.
Conceptual Understanding
Q4: Why is backpropagation more efficient than computing each gradient separately?
Answer
Backpropagation reuses intermediate computations. When computing dL/dw for weights in early layers, we need terms like dL/da_L, dL/da_{L-1}, etc. By computing these once and propagating backward, we avoid redundant calculations. If we computed each gradient from scratch, we'd repeat the same chain rule calculations many times. Backpropagation computes all gradients in O(n) time, where n is the number of parameters.
Q5: What happens if we set the learning rate to 0?
Answer
With learning rate = 0, the update rule becomes w_new = w_old - 0 * gradient = w_old. The weights never change! The network never learns. This is why the learning rate is crucial - it must be positive for any learning to occur.
Q6: Explain the vanishing gradient problem in terms of the chain rule.
Answer
By the chain rule, the gradient for an early layer involves multiplying many terms:
dL/dw_1 = (dL/da_L) * (da_L/dz_L) * (dz_L/da_{L-1}) * ... * (da_2/dz_2) * (dz_2/da_1) * (da_1/dz_1) * (dz_1/dw_1)
Each (da/dz) term is an activation derivative. For sigmoid, the maximum is 0.25. Multiplying many numbers less than 1 together gives a tiny result. This is why deep networks with sigmoid activations had gradients that "vanished" to nearly zero for early layers.
Application
Q7: You’re training a network and notice the loss is oscillating wildly without decreasing. What’s likely wrong and how would you fix it?
Answer
The learning rate is likely too high, causing the optimizer to overshoot the minimum and bounce around. Fixes:
1. Reduce the learning rate (e.g., divide by 10)
2. Use learning rate scheduling (start high, decrease over time)
3. Use adaptive learning rate methods (Adam, RMSprop)
4. Check for data preprocessing issues (features not normalized)
Q8: Why is it important to store intermediate values (z, a) during the forward pass?
Answer
Backpropagation needs these values to compute gradients:
- Activation values (a) are needed to compute dL/dW = delta * a^T
- Pre-activation values (z) are needed to compute activation derivatives f'(z)
Without storing these, we'd need to recompute them during backpropagation, at least doubling the computational cost. This is a classic space-time tradeoff.
Reading List
Essential Reading
- Rumelhart, Hinton & Williams (1986) - “Learning Representations by Back-propagating Errors” - The seminal paper
- Nielsen, Chapter 2 - “How the backpropagation algorithm works” (online)
Mathematical Foundation
- Goodfellow et al., Deep Learning, Section 6.5 - “Back-Propagation and Other Differentiation Algorithms”
Practical Insights
- Karpathy (2019) - “A Recipe for Training Neural Networks” (blog)
Video Resources
- 3Blue1Brown - “What is backpropagation really doing?” (YouTube)
- Stanford CS231n - Lecture 4: Backpropagation
Historical Perspective
- Schmidhuber (2015) - “Deep Learning in Neural Networks: An Overview”
Summary
This lecture covered:
- Optimization problem - Finding weights that minimize loss
- Loss landscapes - Visualizing the space we’re searching
- Gradient descent - Walking downhill by following gradients
- Learning rate - Critical hyperparameter controlling step size
- Chain rule - Mathematical foundation for computing gradients
- Backpropagation - Efficient algorithm for gradient computation
- Credit assignment - How networks attribute error to weights
Key Takeaway: Backpropagation + gradient descent is the engine that makes neural network learning possible. Understanding these algorithms is essential for debugging and improving networks.
Next Lecture: Training Dynamics and Regularization - We’ll learn about practical training considerations and how to prevent overfitting.