Pre-Class Discovery: Random Forests

Activity 1

"20 Questions for Fraud"

Amount	Merchant	Time	Foreign	Outcome
$15	grocery	10am	No	Legit
$800	electronics	2am	Yes	Fraud
$25	cafe	9am	No	Legit
$2500	jewelry	3am	Yes	Fraud
$10	grocery	11am	No	Legit
$950	electronics	1pm	No	Legit
$5	cafe	8am	No	Legit
$3200	online	4am	Yes	Fraud

Your Task

What is the single best yes/no question to separate fraud from legit?
After that split, ask a second question for each group.
Draw your question tree on paper.

Reveal Solution

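You just built a decision tree! Good first questions: "Is it foreign?" or "Is the time between midnight and 5am?" -- each one separates all eight rows perfectly. "Is the amount > $500?" comes close but groups the $950 electronics purchase at 1pm (legit) with the fraud cases. Decision trees split data recursively, at each step choosing the question that best separates the classes.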
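To check the candidate questions against the table, here is a minimal sketch. The tuple layout, the 24-hour encoding of the Time column, and the helper name `count_errors` are assumptions made for illustration; the Merchant column is left out because none of the three questions uses it.

```python
# The eight transactions from the table: (amount in $, hour 0-23, foreign, outcome)
transactions = [
    (15, 10, False, "Legit"),
    (800, 2, True, "Fraud"),
    (25, 9, False, "Legit"),
    (2500, 3, True, "Fraud"),
    (10, 11, False, "Legit"),
    (950, 13, False, "Legit"),
    (5, 8, False, "Legit"),
    (3200, 4, True, "Fraud"),
]

# Candidate yes/no questions; a "yes" answer is read as "predict fraud".
questions = {
    "Is it foreign?": lambda amount, hour, foreign: foreign,
    "Is the amount > $500?": lambda amount, hour, foreign: amount > 500,
    "Is the time between midnight and 5am?": lambda amount, hour, foreign: 0 <= hour < 5,
}

def count_errors(question):
    """Number of rows where 'yes means fraud' disagrees with the recorded outcome."""
    errors = 0
    for amount, hour, foreign, outcome in transactions:
        predicted_fraud = question(amount, hour, foreign)
        if predicted_fraud != (outcome == "Fraud"):
            errors += 1
    return errors

errors_by_question = {text: count_errors(q) for text, q in questions.items()}
for text, errors in errors_by_question.items():
    print(f"{text:40s} misclassified rows: {errors}")
```

A question with zero misclassified rows needs no second question on this table; the amount question leaves one mixed group that a second split would have to clean up.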

Activity 2

"Which Split Is Better?"

Comparison of two possible splits showing class purity distributions

Your Task

Which split creates purer groups?
In the left split, what fraction of each group is fraud?
Why is a group that is 90/10 between the two classes better than one that is 60/40?

Reveal Solution

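Gini impurity measures how mixed a group is: $G = 1 - \sum_i p_i^2$, where $p_i$ is the fraction of the group belonging to class $i$. A pure group (all one class) has $G=0$. With two classes, a 50/50 mix has $G=0.5$ (the worst case). A 90/10 group has $G = 1 - 0.9^2 - 0.1^2 = 0.18$, while a 60/40 group has $G = 0.48$, so the 90/10 group is much purer. The algorithm picks the split that most reduces the size-weighted Gini of the resulting groups -- this impurity decrease plays the same role as information gain, which strictly speaking uses entropy in place of Gini.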
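The Gini formula is easy to try by hand or in a few lines of code. The sketch below is illustrative only: the function names `gini` and `weighted_gini` are made up here, and the two example splits use the fraud/legit counts from the Activity 1 table rather than the figure for this activity.

```python
def gini(counts):
    """Gini impurity G = 1 - sum(p_i^2), given a list of class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def weighted_gini(groups):
    """Size-weighted average Gini of the child groups produced by a split."""
    n = sum(sum(g) for g in groups)
    return sum(sum(g) / n * gini(g) for g in groups)

# Each group is written as [fraud_count, legit_count].
print("90/10 group:", gini([90, 10]))
print("60/40 group:", gini([60, 40]))
print("50/50 group:", gini([50, 50]))
print("pure group: ", gini([10, 0]))

# Activity 1 table: 3 fraud and 5 legit rows before any split.
parent = [3, 5]
foreign_split = [[3, 0], [0, 5]]   # "Is it foreign?"  yes-group, no-group
amount_split = [[3, 1], [0, 4]]    # "Is the amount > $500?"

decrease_foreign = gini(parent) - weighted_gini(foreign_split)
decrease_amount = gini(parent) - weighted_gini(amount_split)
print("impurity decrease, foreign:", decrease_foreign)
print("impurity decrease, amount: ", decrease_amount)
```

The foreign question removes all of the parent's impurity, while the amount question removes only part of it, which matches the intuition from Activity 1.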

Activity 3

"One Tree vs. Many Trees"

Decision tree visualization showing splits and leaf predictions

Your Task

If you trained this tree on slightly different data, would the splits change?
Would the predictions change?
How could you make the predictions more stable?

Reveal Solution

A single tree is unstable -- small data changes cause big tree changes (high variance). Solution: train many trees on slightly different data and let them vote. This is the key insight behind Random Forests: a committee of diverse trees is more reliable than any single tree.
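The instability can be seen even on the eight-row table from Activity 1. The sketch below is an illustration under assumptions, not how any particular library builds trees: it looks only at the Amount column, tries midpoints between consecutive amounts as thresholds, and picks the one with the lowest weighted Gini. The helper name `best_amount_threshold` is hypothetical.

```python
# (amount in $, is_fraud) for the eight rows of the Activity 1 table
rows = [(15, False), (800, True), (25, False), (2500, True),
        (10, False), (950, False), (5, False), (3200, True)]

def gini(labels):
    """Two-class Gini impurity of a list of booleans (0.0 for an empty list)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 1.0 - p * p - (1.0 - p) * (1.0 - p)

def best_amount_threshold(data):
    """Return the threshold t for 'amount > t' with the lowest weighted Gini."""
    amounts = sorted({amount for amount, _ in data})
    # Candidate thresholds: midpoints between consecutive distinct amounts.
    candidates = [(a + b) / 2 for a, b in zip(amounts, amounts[1:])]

    def score(t):
        low = [fraud for amount, fraud in data if amount <= t]
        high = [fraud for amount, fraud in data if amount > t]
        return (len(low) * gini(low) + len(high) * gini(high)) / len(data)

    return min(candidates, key=score)

full = best_amount_threshold(rows)
without_800 = best_amount_threshold([r for r in rows if r[0] != 800])
print("all eight rows:      amount >", full)
print("without the $800 row: amount >", without_800)
```

Dropping a single row moves the chosen threshold from the gap between $25 and $800 to the gap between $950 and $2500, so the two trees would disagree on any transaction priced in between.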

Activity 4

"Voting Committee"

Ensemble voting diagram showing multiple classifiers combining predictions

Your Task

If 7 out of 10 classifiers say "fraud," what do you conclude?
Why is a committee better than one expert?
What if all 10 experts were trained on the exact same data?

Reveal Solution

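Ensemble voting reduces errors: if each tree is wrong 40% of the time and the trees err independently, a strict majority of 10 trees is wrong only about 17% of the time (a 5-5 tie happens another 20% of the time), and the error keeps shrinking as more trees are added. This is Condorcet's jury theorem. But trees must be diverse -- if trained identically, they make the same mistakes and the vote is no better than a single tree. Bootstrap sampling (drawing the training rows at random with replacement, separately for each tree) creates that diversity.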
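The benefit of voting can be computed exactly under one strong assumption: every tree is wrong with the same probability and the trees' errors are independent. Real trees trained on overlapping data are correlated, so treat this as an optimistic sketch. The function names are made up for illustration.

```python
from math import comb

def prob_wrong_votes(n_trees, p_wrong, k):
    """Binomial probability that exactly k of n_trees independent trees are wrong."""
    return comb(n_trees, k) * p_wrong**k * (1 - p_wrong) ** (n_trees - k)

def majority_error(n_trees, p_wrong):
    """Probability that strictly more than half of the trees are wrong."""
    first_losing_count = n_trees // 2 + 1
    return sum(prob_wrong_votes(n_trees, p_wrong, k)
               for k in range(first_losing_count, n_trees + 1))

for n in (1, 10, 101):
    print(f"{n:4d} trees: majority wrong with probability {majority_error(n, 0.4):.4f}")

# With an even number of trees a tie is possible; report it separately.
print(f"  10 trees: 5-5 tie with probability {prob_wrong_votes(10, 0.4, 5):.4f}")
```

Using an odd number of trees avoids ties altogether, which is one reason odd committee sizes are convenient for two-class voting.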

Activity 5

"Which Feature Matters?"

Bar chart showing feature importance scores from a random forest model

Your Task

Which feature is most important for prediction?
Does "important" mean "causal"?
If you removed the top feature, what would happen to model accuracy?

Reveal Solution

Feature importance measures how much each feature reduces impurity across all trees. But importance does not equal causation -- a feature can be important because it's correlated with the true cause. Removing the top feature usually hurts accuracy, but other correlated features may partially compensate.
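The Activity 1 table gives a small example of correlated features compensating for each other. The sketch below is not a forest importance calculation; it only checks, row by row, how well two single features predict the outcome. The "night" flag (purchase between midnight and 5am) is derived from the Time column and is an assumption of this illustration.

```python
# (foreign, night, is_fraud) for the eight rows of the Activity 1 table;
# "night" means the purchase happened between midnight and 5am.
rows = [
    (False, False, False),  # $15 grocery, 10am
    (True, True, True),     # $800 electronics, 2am
    (False, False, False),  # $25 cafe, 9am
    (True, True, True),     # $2500 jewelry, 3am
    (False, False, False),  # $10 grocery, 11am
    (False, False, False),  # $950 electronics, 1pm
    (False, False, False),  # $5 cafe, 8am
    (True, True, True),     # $3200 online, 4am
]

foreign = [r[0] for r in rows]
night = [r[1] for r in rows]
fraud = [r[2] for r in rows]

def accuracy(feature, target):
    """Fraction of rows where the feature's value equals the target's value."""
    return sum(f == t for f, t in zip(feature, target)) / len(target)

print("predict fraud from foreign:", accuracy(foreign, fraud))
print("predict fraud from night:  ", accuracy(night, fraud))
print("foreign and night identical on every row:", foreign == night)
```

Because the two features agree on every row, a model denied "Foreign" could lean on the night-time flag instead, and nothing in the table says which of them, if either, is the actual cause of fraud.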

Activity 6

"Overfitting the Fraud Detector"

Bias-variance tradeoff as a function of model complexity for decision trees and random forests

Your Task

What happens when a single tree is very deep?
What happens when it's very shallow?
How does a random forest fix this problem?

Reveal Solution

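Deep trees overfit (memorize noise). Shallow trees underfit (miss patterns). Random forests fix this: each tree can be deep (low bias), but averaging many trees reduces variance. The formula: $\text{Var}(\bar{X}) = \frac{\sigma^2}{n}$ -- averaging $n$ independent estimates, each with variance $\sigma^2$, reduces variance by a factor of $n$. Trees in a forest are trained on overlapping data, so they are not fully independent and the actual reduction is smaller; this is why the diversity between trees matters.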
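The variance formula can be checked with a small simulation. In this sketch each "tree" is stood in for by a random number with variance 1, and a shared noise term controls how correlated the estimates are; the parameter name `rho` and the function name are assumptions for illustration. For equally correlated estimates the variance of the average is $\rho\sigma^2 + (1-\rho)\sigma^2/n$, which reduces to $\sigma^2/n$ when $\rho = 0$.

```python
import random
import statistics

def variance_of_average(n, rho, trials=20000, seed=0):
    """Simulated variance of the mean of n unit-variance estimates
    whose pairwise correlation is rho (0 means independent)."""
    rng = random.Random(seed)
    averages = []
    for _ in range(trials):
        shared = rng.gauss(0.0, 1.0)  # noise common to all estimates
        estimates = [
            (rho ** 0.5) * shared + ((1 - rho) ** 0.5) * rng.gauss(0.0, 1.0)
            for _ in range(n)
        ]
        averages.append(sum(estimates) / n)
    return statistics.pvariance(averages)

n = 25
independent = variance_of_average(n, rho=0.0)
correlated = variance_of_average(n, rho=0.5)
print("independent estimates:", independent, "theory:", 1 / n)
print("correlated estimates: ", correlated, "theory:", 0.5 + 0.5 / n)
```

With independent estimates the variance falls to roughly 1/25 of a single estimate's, while with correlation 0.5 it stays above 0.5 no matter how many estimates are averaged.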