An ensemble of decorrelated decision trees, averaged or voted.
Key idea
Ask a crowd of decision trees, then take the majority vote. Each tree alone is shaky and makes its own mistakes, but different trees make their mistakes in different places, so when you combine their answers the individual errors tend to cancel and what remains is the pattern most trees agree on.
Interactive figure: the ensemble's decision boundary (large panel) above the individual trees, with a slider for the number of trees N (1 to 80, shown at N = 20).
Each individual tree was trained on a different bootstrap sample of the data, with a random feature subset at every split, so each one is brittle and wrong in its own way. The large panel is the average of their votes: errors specific to one tree tend to cancel, while the patterns most trees agree on are reinforced. As N grows from 1 to 80, the boundary goes from jagged and overconfident to smooth and stable.
A decision tree is a flowchart of yes/no questions ("Is age > 30?" → "Is income > 50k?" → …) that ends in a prediction. It's intuitive, but it's also brittle: small changes in your training data can produce a wildly different tree.
A random forest builds many such trees, each on a slightly different sample of your data, and lets them vote. It's a bit like asking 100 doctors for a second opinion instead of trusting one — the consensus is more reliable than any individual.
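To make the brittleness concrete, here is a minimal sketch that fits models to two halves of the same training set and counts how often they disagree on unseen points. The synthetic `make_moons` data, the 50/50 split, and the helper `disagreement` are assumptions chosen for illustration, not part of the original example.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data with noise (an assumption for illustration).
X, y = make_moons(n_samples=1000, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Two halves of the training set stand in for "slightly different data".
half = len(X_train) // 2
halves = [(X_train[:half], y_train[:half]), (X_train[half:], y_train[half:])]


def disagreement(make_model):
    """Fraction of test points on which models fit to each half disagree."""
    preds = [make_model().fit(Xh, yh).predict(X_test) for Xh, yh in halves]
    return float(np.mean(preds[0] != preds[1]))


tree_gap = disagreement(lambda: DecisionTreeClassifier(random_state=0))
forest_gap = disagreement(
    lambda: RandomForestClassifier(n_estimators=100, random_state=0)
)
print(f"Single trees disagree on {tree_gap:.1%} of test points")
print(f"Forests disagree on      {forest_gap:.1%} of test points")
```

On noisy data like this, the two forests would usually be expected to agree with each other more often than the two single trees do, though the exact numbers depend on the data and the seed.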
Reach for it when
Your data is in rows and columns (a spreadsheet)
You want something that "just works" with minimal setup
You're not sure where to start
You want a reasonable accuracy estimate without extra effort
Skip it when
You need to explain why a specific prediction came out a certain way
Your data is images, text, or sequences (use neural networks)
You need predictions outside the range you trained on
from sklearn.ensemble import RandomForestClassifier
# Train: just give it labelled data
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
# Predict
predictions = model.predict(X_test)
accuracy = model.score(X_test, y_test)
print(f"Accuracy: {accuracy:.2%}")
Each tree is trained on a bootstrap sample with a random feature subset at each split. Both sources of randomness decorrelate the trees so the average cuts variance.
A single decision tree is high-variance — small changes to the training data produce very different trees. Random forests fight this by training many trees on slightly different views of the data, then averaging their predictions.
Two knobs do the work. Bootstrap sampling means each tree sees a different random sample (drawn with replacement) of the training set. Feature subsetting means each split only considers a random subset of the p features (typically √p for classification, p/3 for regression). Neither makes the trees fully independent, but together they make the trees' errors less correlated, which is what lets averaging cut variance.
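The two knobs can be sketched by hand with plain decision trees. This is a simplified illustration, not how scikit-learn implements its forest: it uses a hard majority vote on binary labels, whereas `RandomForestClassifier` averages the trees' predicted probabilities. The synthetic dataset and the choice of 50 trees are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary problem (labels 0/1), assumed for illustration.
X, y = make_classification(
    n_samples=600, n_features=16, n_informative=6, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
n_trees = 50
trees = []
for i in range(n_trees):
    # Knob 1: bootstrap sample, drawn with replacement, same size as the training set.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # Knob 2: max_features="sqrt" makes every split consider only sqrt(p) random features.
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    trees.append(tree.fit(X_train[idx], y_train[idx]))

# Hard majority vote over the binary labels.
votes = np.stack([t.predict(X_test) for t in trees])  # shape (n_trees, n_test)
y_pred = (votes.mean(axis=0) >= 0.5).astype(int)

ensemble_acc = float(np.mean(y_pred == y_test))
tree_accs = [float(np.mean(v == y_test)) for v in votes]
print(f"Mean single-tree accuracy: {np.mean(tree_accs):.3f}")
print(f"Majority-vote accuracy:    {ensemble_acc:.3f}")
```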
Bonus: each bootstrap leaves out about ⅓ of the data per tree. Aggregating the trees that excluded each point gives an out-of-bag (OOB) estimate of generalization error — no cross-validation needed.
Reach for it when
Tabular data with mixed feature types
You need a strong baseline with almost no tuning
Robustness matters more than the last 2% of accuracy
You want a free OOB error estimate
Skip it when
Extrapolation outside the training range is required
Monotonicity constraints are required
You need per-prediction interpretability
High-signal tabular — gradient boosting usually wins
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(
    n_estimators=200,
    max_features="sqrt",  # √p features considered per split
    oob_score=True,       # free generalization estimate
    n_jobs=-1,
    random_state=0,
)
clf.fit(X_train, y_train)
print(f"OOB accuracy: {clf.oob_score_:.3f}")
for name, imp in sorted(
    zip(X_train.columns, clf.feature_importances_),
    key=lambda x: -x[1],
)[:5]:
    print(f" {name:20s} {imp:.3f}")
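The extrapolation caveat in the "Skip it when" list is easy to see on a toy problem. The sketch below assumes a made-up linear target y = 2x observed only for x in [0, 10]. Because every tree predicts an average of training targets, the forest cannot return anything above the largest target it saw.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy linear target y = 2x, observed only for x in [0, 10] (illustrative assumption).
X_train = np.linspace(0, 10, 200).reshape(-1, 1)
y_train = 2.0 * X_train.ravel()

reg = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

# One point inside the training range, two far outside it.
X_new = np.array([[5.0], [20.0], [100.0]])
preds = reg.predict(X_new)
for x, p in zip(X_new.ravel(), preds):
    print(f"x = {x:6.1f}  true = {2 * x:6.1f}  forest = {p:6.2f}")
```

Both out-of-range inputs fall into the rightmost leaf of every tree, so they receive the same prediction, capped near the top of the training targets rather than following the trend.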
Want the bias-variance derivation and honest diagnostics?
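For B identically distributed trees, each with variance σ² and pairwise correlation ρ, the variance of the averaged prediction is
Var(average) = ρσ² + ((1 − ρ)/B)σ²
ρ: pairwise correlation between trees on the same input
σ²: variance of a single tree's prediction
B: number of trees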
As B → ∞ the second term vanishes — variance is floored at ρσ². The whole game is to drive ρ down via per-split feature subsets without making each tree too weak (which inflates σ²).
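The variance floor can be checked numerically. This sketch assumes the standard expression ρσ² + (1 − ρ)σ²/B for the variance of an average of B identically distributed predictions, and models each "tree" as a shared Gaussian component plus an independent one. That construction, and the values of `rho`, `sigma`, and `B`, are illustrative assumptions rather than anything measured on a real forest.

```python
import numpy as np

rng = np.random.default_rng(0)
rho, sigma, B, n_trials = 0.3, 1.0, 10, 200_000

# Shared part + private part: every pair of columns has correlation rho,
# and every column has variance sigma**2.
shared = rng.standard_normal((n_trials, 1))
private = rng.standard_normal((n_trials, B))
trees = sigma * (np.sqrt(rho) * shared + np.sqrt(1 - rho) * private)

empirical = trees.mean(axis=1).var()                  # variance of the B-tree average
predicted = rho * sigma**2 + (1 - rho) / B * sigma**2  # formula value
floor = rho * sigma**2                                 # limit as B grows

print(f"empirical variance of the average: {empirical:.4f}")
print(f"formula value:                     {predicted:.4f}")
print(f"floor as B grows:                  {floor:.4f}")
```

Raising `B` in this sketch moves the formula value toward the floor but never below it, while lowering `rho` lowers the floor itself.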
Random forests are bagging applied to trees, with one twist: at each split, only a random subset of m features (out of p) is considered. This injects an extra source of decorrelation beyond bootstrap sampling alone, addressing the fact that standard bagging produces highly correlated trees whenever a few features dominate the splits.
The max_features hyperparameter trades off correlation (low m → low ρ) against individual tree strength (low m → higher σ²). Empirical defaults — √p for classification, p/3 for regression — work remarkably well across domains.
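One way to explore this trade-off is to sweep `max_features` and compare out-of-bag accuracy. The sketch below uses a synthetic dataset and four hand-picked values of m, both assumptions for illustration. Which m scores best depends on the data, so treat the output as a diagnostic rather than a rule.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with 20 features, 5 of them informative (illustrative assumption).
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=5, random_state=0
)
p = X.shape[1]

oob_by_m = {}
for m in (1, 4, 10, p):  # 4 is int(sqrt(20)); m = p reduces to plain bagging
    clf = RandomForestClassifier(
        n_estimators=200, max_features=m, oob_score=True, random_state=0
    ).fit(X, y)
    oob_by_m[m] = clf.oob_score_
    print(f"max_features = {m:2d}  OOB accuracy = {clf.oob_score_:.3f}")
```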
Out-of-bag error. For a training set of N points, each bootstrap sample omits a fraction (1 − 1/N)^N → e^(−1) ≈ 36.8% of the points. Predicting each point using only the trees that did not see it gives an estimate of generalization error that is typically close to the one from leave-one-out CV, at no extra training cost.
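The 36.8% figure can be checked with a quick simulation. The sketch below draws bootstrap samples of index positions only (no model is trained) and compares the observed left-out fraction with the exact expression and its limit. The sizes `n_points` and `n_trees` are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_points, n_trees = 10_000, 20

oob_fractions = []
for _ in range(n_trees):
    # Bootstrap draw: n_points indices sampled with replacement.
    idx = rng.integers(0, n_points, size=n_points)
    in_bag = np.zeros(n_points, dtype=bool)
    in_bag[idx] = True
    oob_fractions.append(1.0 - in_bag.mean())  # share of points never drawn

empirical = float(np.mean(oob_fractions))
exact = (1 - 1 / n_points) ** n_points  # probability a given point is never drawn
limit = float(np.exp(-1))

print(f"simulated OOB fraction: {empirical:.4f}")
print(f"(1 - 1/N)^N:            {exact:.4f}")
print(f"e^-1:                   {limit:.4f}")
```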
Feature importance. Impurity-based importance (feature_importances_) is biased toward high-cardinality features. Prefer permutation_importance on a held-out set for honest estimates.
Reach for it when
You want a robust baseline before investing in boosting
You need uncertainty estimates (via per-tree predictions)
OOB is desirable (small datasets, expensive CV)
The signal is heterogeneous — feature interactions vary across regimes
Skip it when
You need calibrated probabilities without post-hoc calibration
Memory is tight (forests are heavyweight at inference)
Smooth function approximation matters (forests are piecewise-constant)
The variable of interest is monotone in inputs and you need that respected
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
clf = RandomForestClassifier(
    n_estimators=500,
    max_features="sqrt",
    min_samples_leaf=1,
    oob_score=True,
    bootstrap=True,
    n_jobs=-1,
    random_state=0,
).fit(X_train, y_train)
# Honest importance via permutation on held-out set
perm = permutation_importance(
    clf, X_test, y_test, n_repeats=10, n_jobs=-1, random_state=0
)
ranked = sorted(zip(X_train.columns, perm.importances_mean), key=lambda x: -x[1])
print(f"OOB: {clf.oob_score_:.3f} Test: {clf.score(X_test, y_test):.3f}")
for name, imp in ranked[:10]:
    print(f" {name:20s} {imp:+.4f}")
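The "uncertainty estimates (via per-tree predictions)" item above can be sketched by reading the fitted trees from `estimators_`. The synthetic data and the use of the standard deviation across trees as a spread measure are assumptions. That spread is a rough heuristic for where the trees disagree, not a calibrated confidence interval.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary problem (illustrative assumption).
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=200, min_samples_leaf=5, random_state=0)
clf.fit(X_train, y_train)

# One row per tree: that tree's estimated probability of class 1.
per_tree = np.stack([t.predict_proba(X_test)[:, 1] for t in clf.estimators_])

mean_proba = per_tree.mean(axis=0)  # matches the forest's own predict_proba
spread = per_tree.std(axis=0)       # disagreement between trees, per test point

most_contested = np.argsort(spread)[-5:][::-1]
for i in most_contested:
    print(f"test point {i:3d}: P(class 1) = {mean_proba[i]:.2f} +/- {spread[i]:.2f}")
```

Points with a large spread are the ones the trees split on most sharply, which makes them reasonable candidates for manual review or for collecting more data.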