Lecture 2: Perceptron Fundamentals
| Duration: ~45 minutes | Slides: 32 | Prerequisites: Lecture 1 |
Learning Objectives
After completing this lecture, you should be able to:
- Define the mathematical structure of a perceptron
- Explain the role of weights, bias, and activation functions
- Visualize decision boundaries in 2D
- Implement the perceptron learning algorithm
- Identify the XOR problem and understand why it matters
- Apply perceptrons to simple classification tasks
Key Concepts
1. The Perceptron Architecture
A perceptron is the simplest possible neural network: a single artificial neuron.
Components:
| Component | Symbol | Description |
|---|---|---|
| Inputs | x_1, x_2, …, x_n | Feature values |
| Weights | w_1, w_2, …, w_n | Learned parameters |
| Bias | b | Threshold adjustment |
| Net Input | z | Weighted sum + bias |
| Activation | f(z) | Decision function |
| Output | y | Final prediction |
2. Mathematical Formulation
Step 1: Compute the Net Input (z)
The perceptron first computes a weighted sum of inputs:
z = w_1*x_1 + w_2*x_2 + ... + w_n*x_n + b
In vector notation:
z = w^T * x + b
Where:
- w = weight vector [w_1, w_2, …, w_n]
- x = input vector [x_1, x_2, …, x_n]
- b = bias (scalar)
Step 2: Apply the Activation Function
The step function converts z to a binary output:
y = f(z) = { 1 if z >= 0
{ 0 if z < 0
Alternative threshold notation (comparing the weighted sum alone against a threshold):
y = { 1 if w^T*x >= threshold
{ 0 if w^T*x < threshold
With threshold = -b the two forms agree: z >= 0 is equivalent to w^T*x >= -b.
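The two steps above can be sketched in a few lines of pure Python. The function names `net_input`, `step`, and `predict` and the example weights are assumptions chosen for illustration, not part of the lecture. The first example input lands exactly on z = 0, which the step function maps to 1.

```python
def net_input(w, x, b):
    """Return z = w_1*x_1 + ... + w_n*x_n + b."""
    if len(w) != len(x):
        raise ValueError("w and x must have the same length")
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def step(z):
    """Step activation: 1 if z >= 0, else 0."""
    return 1 if z >= 0 else 0


def predict(w, x, b):
    """Perceptron output y = f(w^T x + b)."""
    return step(net_input(w, x, b))


# Illustrative values (assumed for this example)
w, b = [3, -2], -1
print(net_input(w, [1, 1], b), predict(w, [1, 1], b))  # 0 1
print(net_input(w, [0, 1], b), predict(w, [0, 1], b))  # -3 0
```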
3. Weights: The Importance of Features
Weights determine how much each input influences the output.
| Weight Value | Interpretation |
|---|---|
| Large positive | Feature strongly supports class 1 |
| Large negative | Feature strongly supports class 0 |
| Near zero | Feature has little influence |
Finance Example: Stock Screener
| Feature | Weight | Interpretation |
|---|---|---|
| P/E Ratio | -0.5 | Higher P/E slightly decreases buy signal |
| Momentum | +2.0 | Strong positive momentum strongly increases buy signal |
| Volume | +0.3 | Higher volume slightly supports buying |
| Debt/Equity | -1.5 | High debt moderately decreases buy signal |
The perceptron learns these weights from training data!
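To make the table concrete, the sketch below computes each feature's contribution w_i * x_i for one stock. Only the weights come from the table. The feature values and the zero bias are hypothetical, made up purely for illustration.

```python
# Weights from the stock screener table above
weights = {"pe_ratio": -0.5, "momentum": 2.0, "volume": 0.3, "debt_equity": -1.5}

# Hypothetical, already-normalized feature values for one stock (assumed)
stock = {"pe_ratio": 1.0, "momentum": 1.0, "volume": 1.0, "debt_equity": 0.5}
bias = 0.0  # assumed; the table does not give a bias


def contributions(weights, features):
    """Per-feature contribution w_i * x_i to the net input."""
    return {name: weights[name] * features[name] for name in weights}


contrib = contributions(weights, stock)
z = sum(contrib.values()) + bias
signal = "Buy" if z >= 0 else "Pass"

for name, value in contrib.items():
    print(f"{name:12s} {value:+.2f}")
print(f"z = {z:.2f} -> {signal}")
```

In this made-up case the momentum term (+2.00) outweighs the combined negative terms from P/E and debt, so the net input is positive.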
4. The Bias Term
The bias b controls the decision threshold independently of input values.
Intuition: The bias determines how “easily” the neuron fires.
- Positive bias: Lower bar to fire (more likely to output 1)
- Negative bias: Higher bar to fire (more likely to output 0)
Geometric interpretation: The bias shifts the decision boundary away from the origin.
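A small sketch of this effect: the same weights and input give different outputs as the bias changes. The helper name `fires` and all numbers are illustrative assumptions.

```python
def fires(w, x, b):
    """Return 1 if w^T x + b >= 0, else 0."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if z >= 0 else 0


w = [1.0, 1.0]
x = [0.5, 0.25]          # weighted sum w^T x = 0.75

for b in (-1.0, -0.75, -0.5, 0.0):
    print(f"b = {b:+.2f} -> output {fires(w, x, b)}")
```

The neuron fires as soon as b >= -0.75, that is, once the bias is large enough to offset the weighted sum.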
5. Decision Boundaries
A perceptron creates a linear decision boundary - a hyperplane that separates two classes.
In 2D (two features), the decision boundary is a line defined by:
w_1*x_1 + w_2*x_2 + b = 0
Solving for x_2 (valid when w_2 != 0; if w_2 = 0 the boundary is the vertical line x_1 = -b/w_1):
x_2 = -(w_1/w_2)*x_1 - (b/w_2)
This is a line with:
- Slope = -w_1/w_2
- Intercept = -b/w_2
Key Insight: Everything on one side of the line is classified as 1, everything on the other side as 0.
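The slope and intercept formulas can be checked with a short sketch. The function names and the example weights are assumptions for illustration. The code handles only the two-feature case with w_2 != 0.

```python
def boundary_line(w, b):
    """Return (slope, intercept) of the line w_1*x_1 + w_2*x_2 + b = 0.

    Assumes two features and w_2 != 0 (otherwise the boundary is vertical).
    """
    w1, w2 = w
    if w2 == 0:
        raise ValueError("w_2 = 0: boundary is the vertical line x_1 = -b/w_1")
    return -w1 / w2, -b / w2


def classify(w, b, point):
    """Class 1 if the point lies on the side where z >= 0, else class 0."""
    z = w[0] * point[0] + w[1] * point[1] + b
    return 1 if z >= 0 else 0


w, b = [1.0, 2.0], -4.0               # illustrative values (assumed)
slope, intercept = boundary_line(w, b)
print(slope, intercept)               # -0.5 2.0
print(classify(w, b, (0.0, 3.0)))     # above the line -> 1
print(classify(w, b, (0.0, 1.0)))     # below the line -> 0
```

Here "above the line" maps to class 1 only because w_2 is positive. With a negative w_2 the sides swap.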
6. The Perceptron Learning Algorithm
The perceptron learns by adjusting weights when it makes mistakes.
Algorithm:
1. Initialize weights w and bias b (often to zeros or small random values)
2. For each training example (x, y_true):
   a. Compute prediction: y_pred = f(w^T * x + b)
   b. Compute error: error = y_true - y_pred
   c. Update weights: w = w + learning_rate * error * x
   d. Update bias: b = b + learning_rate * error
3. Repeat until convergence or max iterations
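The steps above can be sketched as a small pure-Python training loop. This is one possible reading of the algorithm, not a reference implementation. The function names, the zero initialization, the AND-gate data, and the stopping rule (the first full pass with no mistakes) are assumptions. The learning rate is set to 1.0 here only so the arithmetic stays exact.

```python
def predict(w, b, x):
    """Step-activated perceptron output for one input vector."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if z >= 0 else 0


def train_perceptron(data, learning_rate=1.0, max_epochs=100):
    """Train on (x, y_true) pairs; return (w, b, converged).

    Starts from zero weights and stops after the first full pass
    over the data that makes no mistakes.
    """
    n_features = len(data[0][0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y_true in data:
            error = y_true - predict(w, b, x)       # -1, 0, or +1
            if error != 0:
                w = [wi + learning_rate * error * xi for wi, xi in zip(w, x)]
                b += learning_rate * error
                mistakes += 1
        if mistakes == 0:
            return w, b, True
    return w, b, False


# AND gate: linearly separable, so training is expected to converge
and_data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
w, b, converged = train_perceptron(and_data)
print(converged, [predict(w, b, x) for x, _ in and_data])
```

The `max_epochs` cap matters in practice: without it the loop would never exit on data that is not linearly separable.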
Update Rule Explained:
| y_true | y_pred | error | Action |
|---|---|---|---|
| 1 | 1 | 0 | No update (correct) |
| 0 | 0 | 0 | No update (correct) |
| 1 | 0 | +1 | Increase weights toward x |
| 0 | 1 | -1 | Decrease weights away from x |
Learning Rate (eta): Controls the step size of updates
- Too large: Overshoots, unstable learning
- Too small: Very slow convergence
- Typical values: 0.01 to 0.1
7. Perceptron Convergence Theorem
Theorem: If the training data is linearly separable, the perceptron learning algorithm will converge to a solution in a finite number of steps.
Implications:
- Guaranteed to find a separating hyperplane (if one exists)
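- The number of updates is finite and bounded (Novikoff's bound depends on the margin between the classes), but the bound is not known in advance and can be very large when the margin is small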
- No guarantee of finding the “best” hyperplane
Limitation: If data is NOT linearly separable, the algorithm will never converge.
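The contrast between the two cases can be observed by counting mistakes per epoch. The sketch below is illustrative: the function name, the zero initialization, and the learning rate of 1.0 are assumptions. It uses the OR gate as a separable example and the XOR gate (discussed in the next section) as a non-separable one.

```python
def mistakes_per_epoch(data, epochs, learning_rate=1.0):
    """Run perceptron updates and return the mistake count of each epoch."""
    w = [0.0] * len(data[0][0])
    b = 0.0
    history = []
    for _ in range(epochs):
        mistakes = 0
        for x, y_true in data:
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            error = y_true - (1 if z >= 0 else 0)
            if error != 0:
                w = [wi + learning_rate * error * xi for wi, xi in zip(w, x)]
                b += learning_rate * error
                mistakes += 1
        history.append(mistakes)
    return history


or_data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]   # separable
xor_data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]  # not separable

or_history = mistakes_per_epoch(or_data, epochs=10)
xor_history = mistakes_per_epoch(xor_data, epochs=10)
print("OR :", or_history)
print("XOR:", xor_history)
```

For OR the count drops to zero and stays there. For XOR every epoch contains at least one mistake, because an error-free epoch would require a single fixed line that classifies all four points correctly.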
8. The XOR Problem
The XOR (exclusive or) function demonstrates the fundamental limitation of single perceptrons.
XOR Truth Table:
| x_1 | x_2 | XOR(x_1, x_2) |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
Why can’t a perceptron solve XOR?
Plot the points:
- (0,0) -> 0 (class 0)
- (0,1) -> 1 (class 1)
- (1,0) -> 1 (class 1)
- (1,1) -> 0 (class 0)
No single straight line can separate the 0s from the 1s!
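The geometric claim can also be checked algebraically. The short derivation below, written in LaTeX, assumes the step activation used throughout this lecture (output 1 when z >= 0) and shows that the four XOR constraints cannot hold at once.

```latex
\begin{align*}
(0,0) \mapsto 0 &: \quad b < 0 \\
(0,1) \mapsto 1 &: \quad w_2 + b \ge 0 \\
(1,0) \mapsto 1 &: \quad w_1 + b \ge 0 \\
(1,1) \mapsto 0 &: \quad w_1 + w_2 + b < 0
\end{align*}
Adding the second and third constraints gives $w_1 + w_2 + 2b \ge 0$, so
\[
w_1 + w_2 + b \ge -b > 0 ,
\]
which contradicts the fourth constraint. No choice of $w_1, w_2, b$ satisfies all four.
```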
Solution preview: Multi-layer perceptrons (MLPs) can solve XOR by combining multiple perceptrons. This is covered in Lecture 3.
9. Linear Separability
Definition: A dataset is linearly separable if there exists a hyperplane that perfectly separates the two classes.
Examples of linearly separable problems:
- AND gate
- OR gate
- Simple threshold decisions
Examples of non-linearly separable problems:
- XOR gate
- Circular class boundaries
- Many real-world classification problems
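For the gate examples, separability can be explored with a brute-force search over a small grid of weights and biases. This is a rough sketch: the grid range and step are arbitrary assumptions, and the helper name `find_separator` is made up for illustration. A hit proves a gate is separable. A miss on a finite grid proves nothing on its own, although for XOR it agrees with the fact that no separating line exists.

```python
from itertools import product

GATES = {
    "AND": {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1},
    "OR":  {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 1},
    "XOR": {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0},
}


def find_separator(truth_table, grid):
    """Return the first (w_1, w_2, b) on the grid that reproduces the table, or None."""
    for w1, w2, b in product(grid, repeat=3):
        if all((1 if w1 * x1 + w2 * x2 + b >= 0 else 0) == y
               for (x1, x2), y in truth_table.items()):
            return w1, w2, b
    return None


# Candidate values -2.0, -1.5, ..., 2.0 (an arbitrary grid chosen for this sketch)
grid = [k / 2 for k in range(-4, 5)]

for name, table in GATES.items():
    print(name, find_separator(table, grid))
```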
Finance Connection: Many financial classification problems (e.g., “will this stock outperform?”) are NOT linearly separable, which is why we need more powerful models.
Key Formulas
Perceptron Output
y = f(sum_{i=1}^{n} w_i * x_i + b)
Step Activation Function
f(z) = { 1 if z >= 0
{ 0 otherwise
Weight Update Rule
w_i(new) = w_i(old) + eta * (y_true - y_pred) * x_i
Bias Update Rule
b(new) = b(old) + eta * (y_true - y_pred)
Decision Boundary (2D)
w_1*x_1 + w_2*x_2 + b = 0
Finance Application: Binary Stock Classifier
Problem: Given financial metrics, classify stocks as “Buy” or “Pass”
Features (inputs):
- x_1 = Normalized P/E ratio
- x_2 = 6-month momentum (%)
- x_3 = Normalized trading volume
- x_4 = Debt-to-equity ratio
- x_5 = Earnings surprise (%)
Output:
- y = 1: Buy
- y = 0: Pass
Training data: Historical stocks with known outcomes (did they outperform the benchmark?)
Learned interpretation: After training, examine the weights:
- Large positive w_2 (momentum): Momentum is a strong buy signal
- Negative w_1 (P/E): High P/E stocks are less attractive
- etc.
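One simple way to carry out this inspection is to rank the features by absolute weight. The weights below are hypothetical, invented purely to illustrate the ranking step. They are not results from any trained model, and the feature names are shorthand for x_1 through x_5.

```python
# Hypothetical trained weights, invented purely to illustrate the ranking step
feature_names = ["pe_ratio", "momentum", "volume", "debt_equity", "earnings_surprise"]
weights = [-0.6, 1.8, 0.2, -1.1, 0.9]


def rank_features(names, weights):
    """Sort features by absolute weight, largest first."""
    return sorted(zip(names, weights), key=lambda pair: abs(pair[1]), reverse=True)


ranking = rank_features(feature_names, weights)
for name, w in ranking:
    direction = "supports Buy" if w > 0 else "supports Pass"
    print(f"{name:18s} {w:+.1f}  {direction}")
```

Comparing weight magnitudes is only meaningful when the features are on comparable scales. A feature measured in percent and one normalized to a unit range would need rescaling first.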
Practice Questions
Mathematical Understanding
Q1: Given weights w = [2, -1] and bias b = -1, what is the output for input x = [1, 0]?
Answer
z = w^T * x + b = (2)(1) + (-1)(0) + (-1) = 2 + 0 - 1 = 1
Since z = 1 >= 0, the output is y = 1.
Q2: For the same perceptron (w = [2, -1], b = -1), what is the equation of the decision boundary?
Answer
The decision boundary is where z = 0:
2*x_1 + (-1)*x_2 + (-1) = 0
2*x_1 - x_2 - 1 = 0
x_2 = 2*x_1 - 1
This is a line with slope 2 and x_2-intercept -1.
Q3: A perceptron makes an error: y_true = 1 but y_pred = 0. The input is x = [3, 2] and learning rate is 0.1. If the current weights are w = [0.5, 0.5] and b = -3, what are the new weights and bias?
Answer
Check the premise: z = (0.5)(3) + (0.5)(2) - 3 = -0.5 < 0, so y_pred = 0 as stated.
error = y_true - y_pred = 1 - 0 = 1
w_new = w_old + eta * error * x
w_new = [0.5, 0.5] + 0.1 * 1 * [3, 2]
w_new = [0.5 + 0.3, 0.5 + 0.2]
w_new = [0.8, 0.7]
b_new = b_old + eta * error
b_new = -3 + 0.1 * 1 = -2.9
Conceptual Understanding
Q4: Why is the bias term necessary? What happens if we remove it?
Answer
Without bias, the decision boundary must pass through the origin (0, 0). This severely limits which problems can be solved. With bias, we can shift the decision boundary anywhere in the feature space, making the perceptron much more flexible.
Q5: Can a perceptron learn the AND function? What about the OR function?
Answer
Yes to both! AND and OR are linearly separable.
For AND (both inputs must be 1):
- Weights w = [1, 1], bias b = -1.5
- z = x_1 + x_2 - 1.5
- Only (1,1) gives z = 0.5 > 0
For OR (at least one input is 1):
- Weights w = [1, 1], bias b = -0.5
- z = x_1 + x_2 - 0.5
- (1,0), (0,1), and (1,1) all give z > 0
Q6: Why is XOR important in the history of neural networks?
Answer
XOR demonstrated a fundamental limitation of single-layer perceptrons. Minsky and Papert proved mathematically that no single perceptron can solve XOR. This contributed to the first AI winter by showing that perceptrons couldn't solve many interesting problems. However, this limitation was later overcome with multi-layer networks.
Application
Q7: You’re building a perceptron for credit approval with features: income (x_1), credit score (x_2), existing loans (x_3). After training, the weights are w = [0.8, 1.2, -0.5]. Interpret these weights.
Answer
- w_1 = 0.8 (income): Higher income moderately pushes the decision toward approval
- w_2 = 1.2 (credit score): Higher credit score strongly pushes the decision toward approval (most important feature)
- w_3 = -0.5 (existing loans): More existing loans moderately push the decision toward rejection
Credit score has the largest absolute weight, making it the most influential feature in the decision, assuming the three features are on comparable scales.
Reading List
Essential Reading
- Rosenblatt (1958) - “The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain” - The original paper
- Nielsen, Chapter 1 - Neural Networks and Deep Learning (online)
Mathematical Deep Dive
- Novikoff (1962) - “On Convergence Proofs on Perceptrons” - The convergence theorem proof
- Goodfellow et al., Chapter 6 - Deep Learning textbook
Historical Context
- Minsky & Papert (1969) - “Perceptrons” - The famous critique
Video Resources
- 3Blue1Brown - “Gradient descent, how neural networks learn” (YouTube)
Summary
This lecture covered:
- Perceptron architecture - Inputs, weights, bias, activation, output
- Mathematical formulation - z = w^T * x + b, y = f(z)
- Decision boundaries - Linear hyperplanes separating classes
- Learning algorithm - Adjust weights based on errors
- Convergence theorem - Guaranteed convergence for linearly separable data
- XOR problem - The fundamental limitation that led to multi-layer networks
Key Takeaway: A single perceptron is a powerful but limited classifier. It can only solve linearly separable problems.
Next Lecture: MLP Architecture - We’ll see how stacking perceptrons into layers overcomes the XOR limitation.