Lecture 3: Multi-Layer Perceptron Architecture
| Duration: ~45 minutes | Slides: 32 | Prerequisites: Lecture 2 |
Learning Objectives
After completing this lecture, you should be able to:
- Explain why hidden layers are necessary
- Describe the architecture of a multi-layer perceptron (MLP)
- Perform forward propagation calculations
- Understand matrix notation for neural networks
- Explain how MLPs solve the XOR problem
- Design network architectures for different problems
Key Concepts
1. From Perceptron to Multi-Layer Networks
Recall from Lecture 2: A single perceptron cannot solve XOR because XOR is not linearly separable.
The Solution: Stack multiple perceptrons into layers!
By combining simple linear classifiers, we can create complex non-linear decision boundaries.
2. MLP Architecture
An MLP consists of:
| Layer Type | Description |
|---|---|
| Input Layer | Receives raw features (not neurons, just data) |
| Hidden Layer(s) | Intermediate processing layers |
| Output Layer | Produces final prediction |
Notation: A network with 2 inputs, 3 hidden neurons, and 1 output is written as 2-3-1.
Key Properties:
- Each neuron in layer L connects to ALL neurons in layer L+1 (fully connected)
- Information flows forward only (feedforward network)
- No connections within a layer or backward
3. Why “Hidden” Layers?
Hidden layers are “hidden” because:
- Their correct (target) values are not specified by the training data, unlike the inputs and outputs
- They automatically learn useful intermediate representations
- They enable non-linear transformations
Intuition: Hidden layers transform the input space into a new representation where the problem becomes linearly separable.
4. Forward Propagation
Forward propagation computes the network output given an input.
Layer-by-layer computation:
Layer 0 (Input): x
Layer 1 (Hidden): h = f(W_1 * x + b_1)
Layer 2 (Output): y = f(W_2 * h + b_2)
Where:
- W_1, W_2 = weight matrices
- b_1, b_2 = bias vectors
- f = activation function (applied element-wise)
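As a concrete illustration, here is a minimal NumPy sketch of this two-layer computation. The weight values, biases, and choice of sigmoid activation are placeholder assumptions, not values from the lecture.

```python
import numpy as np

def f(z):
    # Sigmoid activation, applied element-wise
    return 1.0 / (1.0 + np.exp(-z))

# Placeholder parameters for a 2-3-1 network (arbitrary illustrative values)
W_1 = np.array([[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]])  # 3x2
b_1 = np.array([0.0, 0.1, -0.1])
W_2 = np.array([[0.7, -0.5, 0.2]])                       # 1x3
b_2 = np.array([0.3])

x = np.array([1.0, 0.0])      # Layer 0 (input)
h = f(W_1 @ x + b_1)          # Layer 1 (hidden)
y = f(W_2 @ h + b_2)          # Layer 2 (output)
print(h, y)
```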
5. Matrix Notation
For efficient computation, we use matrix operations.
Single layer forward pass:
z = W * x + b
a = f(z)
Where:
- W is a (neurons_out x neurons_in) matrix
- x is a (neurons_in x 1) vector
- b is a (neurons_out x 1) vector
- z is a (neurons_out x 1) vector (pre-activation)
- a is a (neurons_out x 1) vector (post-activation)
Example: 2-3-1 Network
Input: x = [x_1, x_2]^T (2x1)
Layer 1:
- W_1 is 3x2 (3 hidden neurons, 2 inputs)
- b_1 is 3x1
- z_1 = W_1 * x + b_1 (3x1)
- h = f(z_1) (3x1)
Layer 2:
- W_2 is 1x3 (1 output, 3 hidden neurons)
- b_2 is 1x1
- z_2 = W_2 * h + b_2 (1x1)
- y = f(z_2) (1x1)
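These dimensions can be checked mechanically. The sketch below uses random parameters and column vectors, exactly as in the notation above, and only verifies that the shapes line up; the tanh activation is an arbitrary element-wise choice.

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.standard_normal((2, 1))    # input, 2x1
W_1 = rng.standard_normal((3, 2))  # 3 hidden neurons, 2 inputs
b_1 = rng.standard_normal((3, 1))
W_2 = rng.standard_normal((1, 3))  # 1 output, 3 hidden neurons
b_2 = rng.standard_normal((1, 1))

z_1 = W_1 @ x + b_1
h = np.tanh(z_1)                   # any element-wise activation works here
z_2 = W_2 @ h + b_2
y = np.tanh(z_2)

assert z_1.shape == h.shape == (3, 1)
assert z_2.shape == y.shape == (1, 1)
```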
6. Solving XOR with an MLP
Let’s see how a 2-2-1 network solves XOR.
The key insight: The hidden layer transforms the input space!
Architecture:
- 2 inputs (x_1, x_2)
- 2 hidden neurons (h_1, h_2)
- 1 output (y)
What the hidden neurons learn:
- h_1 computes something like OR(x_1, x_2)
- h_2 computes something like AND(x_1, x_2)
The output neuron then computes:
- y = h_1 AND (NOT h_2)
- Which equals XOR!
| x_1 | x_2 | h_1 (OR-like) | h_2 (AND-like) | y (XOR) |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 |
| 0 | 1 | 1 | 0 | 1 |
| 1 | 0 | 1 | 0 | 1 |
| 1 | 1 | 1 | 1 | 0 |
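One concrete set of weights that realizes this decomposition can be checked directly. The step activation and the specific values below are illustrative choices (many other weight settings also solve XOR).

```python
import numpy as np

def step(z):
    # Step activation: 1 if z >= 0, else 0
    return (z >= 0).astype(int)

W_1 = np.array([[1, 1],    # h_1: OR-like  (fires when x_1 + x_2 >= 0.5)
                [1, 1]])   # h_2: AND-like (fires when x_1 + x_2 >= 1.5)
b_1 = np.array([-0.5, -1.5])
W_2 = np.array([[1, -1]])  # y fires when h_1 - h_2 >= 0.5, i.e. h_1 AND (NOT h_2)
b_2 = np.array([-0.5])

for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    h = step(W_1 @ np.array(x) + b_1)
    y = step(W_2 @ h + b_2)
    print(x, h, int(y[0]))   # reproduces the truth table above
```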
7. Hidden Layer Representations
The hidden layer creates a new feature space where the problem becomes easier.
Original space (inputs):
- XOR points: (0,0), (0,1), (1,0), (1,1)
- Not linearly separable
Transformed space (hidden layer):
- Same points mapped to new coordinates
- NOW linearly separable!
Key insight: Deep networks learn hierarchical representations - each layer transforms data into more useful forms.
8. Why Activation Functions Matter
The Linear Collapse Problem:
If all neurons used linear activations (f(z) = z), then:
y = W_2 * (W_1 * x + b_1) + b_2
= W_2 * W_1 * x + W_2 * b_1 + b_2
= W' * x + b'
Where W’ = W_2 * W_1 and b’ = W_2 * b_1 + b_2
The result: No matter how many layers, the entire network collapses to a single linear transformation!
Solution: Non-linear activation functions break this collapse and enable learning complex patterns.
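The collapse is easy to verify numerically: with identity (linear) activations, the two-layer network and the single collapsed layer produce identical outputs. The parameter values below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
W_1, b_1 = rng.standard_normal((3, 2)), rng.standard_normal(3)
W_2, b_2 = rng.standard_normal((1, 3)), rng.standard_normal(1)
x = rng.standard_normal(2)

# Two layers with linear (identity) activation
y_two_layers = W_2 @ (W_1 @ x + b_1) + b_2

# Equivalent single linear layer
W_prime = W_2 @ W_1
b_prime = W_2 @ b_1 + b_2
y_collapsed = W_prime @ x + b_prime

assert np.allclose(y_two_layers, y_collapsed)
```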
9. Designing Network Architecture
How many hidden layers?
- 1 hidden layer: Can approximate any continuous function, given enough neurons (universal approximation theorem)
- More layers: Can learn hierarchical features more efficiently
- Modern “deep” networks: Often 10-100+ layers
How many neurons per layer?
- Too few: Underfitting (can’t capture complexity)
- Too many: Overfitting (memorizes training data)
- Rule of thumb: Start with a number of neurons between the input size and the output size
Guidelines:
| Problem | Suggested Architecture |
|---|---|
| Simple classification | 1 hidden layer, ~10-50 neurons |
| Image recognition | Multiple layers, decreasing size |
| Tabular data | 2-3 hidden layers |
| Complex patterns | Deeper networks |
10. Historical Context: 1969-1986
After Minsky and Papert's Perceptrons (1969) dampened enthusiasm and funding, neural network research continued slowly:
Key developments:
- 1974: Werbos develops backpropagation (PhD thesis, largely ignored)
- 1982: Hopfield networks revive interest
- 1986: Rumelhart, Hinton & Williams publish backpropagation
The 1986 breakthrough: “Learning Representations by Back-propagating Errors” showed that MLPs could be trained effectively, reviving mainstream interest in neural networks.
Key Formulas
Forward Propagation (General)
For layer l = 1 to L:
z^[l] = W^[l] * a^[l-1] + b^[l]
a^[l] = f(z^[l])
Where a^[0] = x (the input).
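A loop over the layers implements these two formulas directly. The sketch below assumes the parameters are supplied as a list of (W, b) pairs and uses tanh as the element-wise activation; both are illustrative choices.

```python
import numpy as np

def forward(x, params, f=np.tanh):
    """Forward propagation: a^[0] = x, then z^[l] = W^[l] a^[l-1] + b^[l], a^[l] = f(z^[l])."""
    a = x
    for W, b in params:
        z = W @ a + b
        a = f(z)
    return a

# Example: a 2-3-1 network with random parameters
rng = np.random.default_rng(2)
params = [(rng.standard_normal((3, 2)), rng.standard_normal(3)),
          (rng.standard_normal((1, 3)), rng.standard_normal(1))]
print(forward(np.array([1.0, 0.0]), params))
```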
Weight Matrix Dimensions
W^[l] has shape (n^[l], n^[l-1])
Where n^[l] is the number of neurons in layer l.
Total Parameters
For layer l: n^[l] * n^[l-1] + n^[l] (weights + biases)
Example (2-3-1 network):
- Layer 1: 3*2 + 3 = 9 parameters
- Layer 2: 1*3 + 1 = 4 parameters
- Total: 13 parameters
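The same count can be computed for any list of layer sizes; this small helper is just an illustration of the formula above.

```python
def count_parameters(layer_sizes):
    # layer_sizes = [n^[0], n^[1], ..., n^[L]], e.g. [2, 3, 1] for a 2-3-1 network
    return sum(n_out * n_in + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

print(count_parameters([2, 3, 1]))  # 13
print(count_parameters([3, 4, 2]))  # 26 (see practice question Q1)
```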
Finance Application: Multi-Factor Model
A multi-layer network can capture complex relationships between factors:
Traditional linear factor model:
Return = beta_1*Factor_1 + beta_2*Factor_2 + ... + epsilon
MLP factor model:
- Can capture non-linear factor relationships
- Can model interactions between factors
- Can learn regime-dependent factor exposures
Architecture for stock prediction:
Input layer: 10-20 financial features
Hidden layer 1: 64 neurons (learn feature interactions)
Hidden layer 2: 32 neurons (learn higher-order patterns)
Output: 1 neuron (predicted return or probability)
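One way to express this architecture is with a standard deep-learning library. The PyTorch sketch below assumes 16 input features and a sigmoid output for a probability; the ReLU hidden activations (covered in the next lecture) and the feature count are illustrative assumptions, not requirements.

```python
import torch.nn as nn

n_features = 16  # assumed; the lecture suggests 10-20 financial features

model = nn.Sequential(
    nn.Linear(n_features, 64),  # hidden layer 1: learn feature interactions
    nn.ReLU(),
    nn.Linear(64, 32),          # hidden layer 2: learn higher-order patterns
    nn.ReLU(),
    nn.Linear(32, 1),           # output: predicted probability
    nn.Sigmoid(),
)
print(model)
```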
Practice Questions
Mathematical Understanding
Q1: A 3-4-2 network has how many total parameters (weights + biases)?
Answer
Layer 1: 4*3 + 4 = 16 parameters (12 weights, 4 biases)
Layer 2: 2*4 + 2 = 10 parameters (8 weights, 2 biases)
Total: 26 parameters
Q2: Given a 2-3-1 network with:
- W_1 = [[1, 1], [1, 1], [1, 1]] (3x2)
- b_1 = [-0.5, -1.5, -0.5] (3x1)
- Activation: step function
Compute the hidden layer output for input x = [1, 0].
Answer
z_1 = W_1 * x + b_1
z_1 = [[1, 1], [1, 1], [1, 1]] * [1, 0]^T + [-0.5, -1.5, -0.5]^T
z_1 = [1, 1, 1]^T + [-0.5, -1.5, -0.5]^T
z_1 = [0.5, -0.5, 0.5]^T
Applying the step function (z >= 0 -> 1): h = [1, 0, 1]^T
Q3: Why is a 2-100-1 network more powerful than a 2-1 network for solving complex problems?
Answer
The 2-100-1 network has a hidden layer with 100 neurons, enabling it to:
1. Transform the input space through 100 different linear transformations
2. Create complex non-linear decision boundaries
3. Approximate any continuous function (universal approximation)
The 2-1 network (a single perceptron) can only create linear decision boundaries.
Conceptual Understanding
Q4: In your own words, explain why stacking linear layers still results in a linear transformation.
Answer
When we compose linear functions, the result is still linear. If f(x) = Ax + b and g(x) = Cx + d, then g(f(x)) = C(Ax + b) + d = CAx + Cb + d = Wx + v, which is another linear function. Mathematically, the composition of linear transformations is still a linear transformation. This is why non-linear activation functions are essential: they break this linearity and allow the network to learn complex patterns.
Q5: How does the hidden layer enable solving XOR?
Answer
The hidden layer transforms the 2D input space into a new representation. In this new space, points that were not linearly separable (the XOR pattern) become linearly separable. Each hidden neuron learns a different linear boundary, and the output neuron combines these to create a non-linear decision boundary in the original space.
Q6: What’s the trade-off between network width (neurons per layer) and depth (number of layers)?
Answer
Width:
- More neurons = more capacity to model complex functions
- But: more parameters, risk of overfitting, slower training
Depth:
- More layers = ability to learn hierarchical features
- Each layer can build on representations from previous layers
- But: harder to train (vanishing gradients), more parameters
Modern practice often favors deeper networks with moderate width, using techniques like batch normalization and skip connections to enable training.
Application
Q7: You’re designing an MLP for fraud detection with 50 input features. The output is binary (fraud/not fraud). Suggest an architecture and explain your choices.
Answer
Suggested architecture: 50-32-16-1
Reasoning:
- Input: 50 features
- Hidden 1 (32 neurons): Compress features, learn initial patterns
- Hidden 2 (16 neurons): Learn higher-order feature combinations
- Output (1 neuron): Binary classification with sigmoid
The "funnel" shape (decreasing neurons) is common for classification - it progressively compresses information toward the decision. Starting with fewer neurons than inputs forces the network to learn efficient representations.
Alternative: 50-64-32-1 if you believe the features need expansion before compression.
Reading List
Essential Reading
- Rumelhart, Hinton & Williams (1986) - “Learning Representations by Back-propagating Errors” - The breakthrough paper
- Nielsen, Chapter 1-2 - Neural Networks and Deep Learning (online)
Theoretical Foundation
- Cybenko (1989) - “Approximation by Superpositions of a Sigmoidal Function” - Universal approximation theorem
- Hornik (1991) - “Approximation Capabilities of Multilayer Feedforward Networks”
Modern Perspectives
- Goodfellow et al., Chapter 6 - Deep Learning textbook - Feedforward networks
Video Resources
- 3Blue1Brown - “But what is a neural network?” (YouTube)
- Stanford CS231n - Lecture 4: Neural Networks
Summary
This lecture covered:
- Why hidden layers - Overcome linear limitations of single perceptrons
- MLP architecture - Input, hidden, and output layers
- Forward propagation - Computing output from input layer by layer
- Matrix notation - Efficient representation for computation
- Solving XOR - Hidden layers transform space to enable linear separation
- Architecture design - Guidelines for choosing layers and neurons
Key Takeaway: Multi-layer networks can learn complex, non-linear patterns by transforming data through successive layers.
Next Lecture: Activation and Loss Functions - We’ll explore the activation functions that enable non-linearity and the loss functions that guide learning.