Lecture 4: Activation and Loss Functions
| Duration: ~45 minutes | Slides: 23 | Prerequisites: Lecture 3 |
Learning Objectives
After completing this lecture, you should be able to:
- Explain why activation functions are necessary
- Compare sigmoid, tanh, and ReLU activation functions
- Choose appropriate activation functions for different layers
- Understand the purpose of loss functions
- Apply MSE and cross-entropy loss to appropriate problems
- Connect loss functions to the optimization process
Key Concepts
1. Why Activation Functions?
Recall from Lecture 3: Without non-linear activation functions, any multi-layer network collapses to a single linear transformation.
Activation functions provide:
- Non-linearity (essential for learning complex patterns)
- Bounded output (for some functions)
- Differentiability (needed for gradient-based learning)
2. The Sigmoid Function
The sigmoid (logistic) function was historically the most popular activation.
Formula:
sigmoid(z) = 1 / (1 + e^(-z))
Properties:
- Output range: (0, 1)
- Smooth, differentiable everywhere
- Output centered at 0.5 (sigmoid(0) = 0.5), not at zero
Derivative:
sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z))
Advantages:
- Outputs interpretable as probabilities
- Smooth gradient
Disadvantages:
- Vanishing gradient for large |z| (the derivative saturates toward 0)
- Not zero-centered
- Computationally expensive (exponentials)
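The formulas above can be sketched in a few lines of plain Python (a minimal illustration using standard-library floats, not a production implementation):

```python
import math

def sigmoid(z: float) -> float:
    """Logistic sigmoid: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(z: float) -> float:
    """Derivative expressed via the function's own output:
    sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)
```

Note that the derivative peaks at 0.25 (at z = 0) and decays toward 0 for large |z|, which is exactly the vanishing-gradient behavior described above.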
3. The Tanh Function
Tanh is a scaled and shifted sigmoid.
Formula:
tanh(z) = (e^z - e^(-z)) / (e^z + e^(-z))
Or equivalently:
tanh(z) = 2 * sigmoid(2z) - 1
Properties:
- Output range: (-1, 1)
- Zero-centered
- Smooth, differentiable everywhere
Derivative:
tanh'(z) = 1 - tanh^2(z)
Advantages:
- Zero-centered (better gradient flow)
- Stronger gradients than sigmoid
Disadvantages:
- Still suffers from vanishing gradients
- Computationally expensive
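The identity tanh(z) = 2 * sigmoid(2z) - 1 from above can be checked numerically; a small sketch assuming plain Python floats:

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def tanh_via_sigmoid(z: float) -> float:
    """tanh expressed as a scaled and shifted sigmoid."""
    return 2.0 * sigmoid(2.0 * z) - 1.0

# The identity should match math.tanh at any point
for z in (-3.0, -0.5, 0.0, 0.5, 3.0):
    assert abs(tanh_via_sigmoid(z) - math.tanh(z)) < 1e-12
```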
4. The ReLU Function
ReLU (Rectified Linear Unit) is the most popular modern activation.
Formula:
ReLU(z) = max(0, z)
Properties:
- Output range: [0, infinity)
- Not bounded above
- Not differentiable at z = 0 (use subgradient)
Derivative:
ReLU'(z) = { 1 if z > 0
{ 0 if z < 0
{ undefined at z = 0 (typically use 0 or 1)
Advantages:
- No vanishing gradient for positive values
- Computationally efficient (no exponentials)
- Sparse activation (many zeros)
Disadvantages:
- “Dead ReLU” problem: neurons can get stuck at 0
- Not zero-centered
- Unbounded (can cause exploding values)
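ReLU and its (sub)derivative are trivial to implement, which is part of why it is so fast; a minimal sketch, using the common convention of derivative 0 at z = 0:

```python
def relu(z: float) -> float:
    """Rectified Linear Unit: max(0, z)."""
    return max(0.0, z)

def relu_derivative(z: float) -> float:
    # Not differentiable at z == 0; we pick the subgradient 0 there,
    # as many frameworks do (1 is an equally valid choice).
    return 1.0 if z > 0 else 0.0
```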
5. Activation Function Comparison
| Function | Formula | Range | Pros | Cons |
|---|---|---|---|---|
| Sigmoid | 1/(1+e^(-z)) | (0,1) | Probabilistic interpretation | Vanishing gradient |
| Tanh | (e^z-e^(-z))/(e^z+e^(-z)) | (-1,1) | Zero-centered | Vanishing gradient |
| ReLU | max(0,z) | [0,inf) | Fast, no vanishing gradient | Dead neurons |
Modern best practice:
- Hidden layers: ReLU (or variants like Leaky ReLU, ELU)
- Output layer: Depends on task (see below)
6. Choosing Output Activation
The output layer activation depends on the problem type:
| Problem Type | Output Activation | Output Range |
|---|---|---|
| Binary classification | Sigmoid | (0, 1) - probability |
| Multi-class classification | Softmax | (0, 1) per class, sum to 1 |
| Regression | None (linear) | (-inf, inf) |
| Bounded regression | Sigmoid or tanh | Scaled to target range |
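Softmax, used for the multi-class row above, converts a vector of raw scores (logits) into probabilities that sum to 1. A minimal sketch; the max-subtraction trick is a standard guard against overflow, not part of the mathematical definition:

```python
import math

def softmax(logits):
    """Numerically stable softmax: subtract the max logit
    before exponentiating, then normalize to sum to 1."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]
```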
7. The Universal Approximation Theorem
Theorem (Cybenko, 1989; Hornik, 1991): A feedforward network with a single hidden layer containing a finite number of neurons can approximate any continuous function on compact subsets of R^n, given appropriate activation functions and sufficient neurons.
Implications:
- Neural networks are theoretically capable of learning any pattern
- BUT: The theorem doesn’t tell us how many neurons are needed
- AND: It doesn’t tell us how to find the right weights
Practical reality:
- Deeper networks often work better than very wide shallow ones
- Finding good weights requires proper training algorithms
8. Loss Functions: Measuring Mistakes
A loss function quantifies how wrong the network’s predictions are.
Purpose:
- Provides a single number to minimize
- Guides the optimization process
- Different losses for different problems
Notation:
- y_true (or y): The correct answer
- y_pred (or y-hat): The network’s prediction
- L: The loss value
9. Mean Squared Error (MSE)
MSE is the standard loss for regression problems.
Formula:
MSE = (1/n) * sum((y_true - y_pred)^2)
For a single example:
L = (y_true - y_pred)^2
Properties:
- Always non-negative
- Zero only when predictions are perfect
- Heavily penalizes large errors (quadratic)
Derivative:
dL/dy_pred = -2 * (y_true - y_pred)
Use for:
- Stock price prediction
- Portfolio return forecasting
- Any continuous target variable
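The MSE formula above translates directly into code; a minimal sketch over paired lists of targets and predictions:

```python
def mse(y_true, y_pred):
    """Mean squared error: average of squared differences."""
    n = len(y_true)
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
```

Because the error is squared, a prediction off by 2 contributes four times the loss of a prediction off by 1, which is the "heavily penalizes large errors" property noted above.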
10. Binary Cross-Entropy Loss
Cross-entropy is the standard loss for classification problems.
Formula (for binary classification):
BCE = -[y_true * log(y_pred) + (1 - y_true) * log(1 - y_pred)]
Intuition:
- When y_true = 1: Loss = -log(y_pred) -> want y_pred close to 1
- When y_true = 0: Loss = -log(1 - y_pred) -> want y_pred close to 0
Properties:
- Always non-negative
- Penalizes confident wrong predictions heavily
- Works well with sigmoid output
Derivative:
dL/dy_pred = -y_true/y_pred + (1-y_true)/(1-y_pred)
Use for:
- Buy/sell classification
- Fraud detection
- Any binary outcome
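The BCE formula above can be sketched as follows; the clipping by a small eps is a common practical safeguard against log(0), not part of the definition:

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Average binary cross-entropy over paired labels and
    predicted probabilities, clipping predictions away from 0 and 1."""
    losses = []
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)
        losses.append(-(t * math.log(p) + (1.0 - t) * math.log(1.0 - p)))
    return sum(losses) / len(losses)
```

For y_true = 1 and y_pred = 0.9 this gives -log(0.9), about 0.105, matching the worked example in the practice questions below.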
11. Choosing the Right Loss Function
| Problem | Loss Function | Output Activation |
|---|---|---|
| Regression | MSE | Linear |
| Binary classification | Binary cross-entropy | Sigmoid |
| Multi-class (one-hot) | Categorical cross-entropy | Softmax |
| Multi-label | Binary cross-entropy per label | Sigmoid per output |
Finance examples:
- Predicting next-day return: MSE + Linear output
- Predicting up/down: Cross-entropy + Sigmoid output
- Predicting sector (11 classes): Categorical cross-entropy + Softmax output
Key Formulas
Activation Functions
| Function | Formula | Derivative |
|---|---|---|
| Sigmoid | sigma(z) = 1/(1+e^(-z)) | sigma(z)(1-sigma(z)) |
| Tanh | tanh(z) = (e^z-e^(-z))/(e^z+e^(-z)) | 1-tanh^2(z) |
| ReLU | max(0, z) | 1 if z>0, 0 otherwise |
Loss Functions
Mean Squared Error:
L_MSE = (1/n) * sum_{i=1}^{n} (y_i - y_hat_i)^2
Binary Cross-Entropy:
L_BCE = -(1/n) * sum_{i=1}^{n} [y_i*log(y_hat_i) + (1-y_i)*log(1-y_hat_i)]
Finance Application: Output Design
When building financial prediction models, output design matters:
Stock Direction Prediction:
- Output: Sigmoid (probability of going up)
- Loss: Binary cross-entropy
- Interpretation: P(up) = 0.7 means 70% confidence in upward movement
Return Prediction:
- Output: Linear (unbounded)
- Loss: MSE
- Interpretation: Predicted return of 0.02 means expected 2% return
Risk Level Classification (Low/Medium/High):
- Output: Softmax (3 neurons)
- Loss: Categorical cross-entropy
- Interpretation: [0.1, 0.3, 0.6] means 60% probability of high risk
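For the risk-level example, the categorical cross-entropy between a one-hot label and the softmax output reduces to -log of the probability assigned to the true class. A minimal sketch using the [0.1, 0.3, 0.6] prediction above with "high" as the true class:

```python
import math

def categorical_cross_entropy(y_true_onehot, y_pred_probs, eps=1e-12):
    """CCE for one example: -sum over classes of y * log(p)."""
    return -sum(t * math.log(max(p, eps))
                for t, p in zip(y_true_onehot, y_pred_probs))

# Classes: [low, medium, high]; true class is "high"
loss = categorical_cross_entropy([0, 0, 1], [0.1, 0.3, 0.6])  # = -log(0.6)
```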
Practice Questions
Mathematical Understanding
Q1: Calculate sigmoid(0), sigmoid(2), and sigmoid(-2).
Answer
sigmoid(0) = 1/(1+e^0) = 1/(1+1) = 0.5
sigmoid(2) = 1/(1+e^(-2)) = 1/(1+0.135) = 0.88
sigmoid(-2) = 1/(1+e^2) = 1/(1+7.39) = 0.12
Q2: Given y_true = 1 and y_pred = 0.9, calculate the binary cross-entropy loss.
Answer
BCE = -[y_true * log(y_pred) + (1-y_true) * log(1-y_pred)]
BCE = -[1 * log(0.9) + 0 * log(0.1)]
BCE = -log(0.9)
BCE = -(-0.105) = 0.105
This is a small loss because the prediction (0.9) is close to the true value (1).
Q3: For the same prediction (y_pred = 0.9), what would the BCE loss be if y_true = 0?
Answer
BCE = -[0 * log(0.9) + 1 * log(0.1)]
BCE = -log(0.1)
BCE = -(-2.303) = 2.303
This is a much higher loss! The model confidently predicted 1 (probability 0.9) but the true answer was 0; a confident wrong prediction is heavily penalized.
Conceptual Understanding
Q4: Why does ReLU help with the vanishing gradient problem?
Answer
For positive inputs, ReLU has a derivative of exactly 1. This means gradients pass through unchanged, avoiding the “shrinking” that happens with sigmoid/tanh, whose derivatives are always less than 1.
With sigmoid: if each layer’s derivative is 0.25 and you have 4 layers, the gradient shrinks by a factor of 0.25^4 = 0.004 (tiny!).
With ReLU: the derivative is 1 for positive inputs, so the factor stays at 1^4 = 1 (preserved!).
Q5: What is the “dead ReLU” problem and how might it be addressed?
Answer
Dead ReLU: if a neuron’s weighted input becomes consistently negative, its output is always 0 and its gradient is always 0. The neuron never learns and is effectively “dead.”
Solutions:
1. Leaky ReLU: f(z) = max(0.01z, z) - small gradient for negative values
2. ELU: f(z) = z if z > 0, else alpha*(e^z - 1) - smooth transition
3. Proper weight initialization (He initialization)
4. Lower learning rates
Q6: Why is cross-entropy preferred over MSE for classification?
Answer
1. Cross-entropy penalizes confident wrong predictions more heavily
2. The gradient of cross-entropy with a sigmoid output simplifies to (y_pred - y_true)
3. MSE + sigmoid has very small gradients when predictions are confidently wrong
4. Cross-entropy is derived from maximum likelihood estimation for Bernoulli distributions, making it theoretically appropriate for classification
Application
Q7: You’re building a model to predict stock volatility (always positive). What output activation and loss would you use?
Answer
Since volatility is always positive and continuous:
Option 1 (simple):
- Output activation: ReLU or Softplus (ensures positive output)
- Loss: MSE
Option 2 (predicting log-volatility):
- Output activation: Linear
- Loss: MSE
- Then exponentiate to get actual volatility
Option 3 (if volatility is bounded, e.g., 0-100%):
- Output activation: Sigmoid scaled to the target range
- Loss: MSE
Reading List
Essential Reading
- Nielsen, Chapter 3 - “Improving the way neural networks learn” (online)
- Goodfellow et al., Chapter 6 - Deep Learning - “Deep Feedforward Networks”
Activation Functions
- Nair & Hinton (2010) - “Rectified Linear Units Improve Restricted Boltzmann Machines”
- Glorot et al. (2011) - “Deep Sparse Rectifier Neural Networks”
Theoretical Foundation
- Cybenko (1989) - “Approximation by Superpositions of a Sigmoidal Function”
- Hornik (1991) - “Approximation Capabilities of Multilayer Feedforward Networks”
Finance Applications
- Heaton et al. (2016) - “Deep Learning for Finance” - Discusses output design for financial problems
Summary
This lecture covered:
- Why activation functions - Enable non-linearity in neural networks
- Sigmoid - Smooth, bounded (0,1), but vanishing gradients
- Tanh - Zero-centered, but still vanishing gradients
- ReLU - Fast, no vanishing gradient, but dead neurons possible
- Loss functions - MSE for regression, cross-entropy for classification
- Output design - Match activation and loss to the problem type
Key Takeaway: The combination of activation function and loss function determines how well your network can learn and what problems it can solve.
Next Lecture: Gradient Descent and Backpropagation - We’ll learn how neural networks actually find good weights.