Key idea

Ask a crowd of decision trees, then take the majority vote. Each tree alone is shaky and makes its own mistakes, but the trees disagree in different places — so when you average their answers, the errors cancel out and the agreement points to the truth.

[Interactive figure: the ensemble's decision boundary (top panel) shown above the boundaries of the individual trees, for N = 1 to 80 trees; displayed at N = 20.]

Each tree in the figure was trained on a different bootstrap sample of the data, with a random feature subset at every split, so each one is brittle and wrong in its own way. The top panel is the average of their votes: the individual errors largely cancel and the agreement reinforces. As N goes from 1 to 80, the ensemble boundary changes from jagged and overconfident to smooth and stable.

A decision tree is a flowchart of yes/no questions ("Is age > 30?" → "Is income > 50k?" → …) that ends in a prediction. It's intuitive, but it's also brittle: small changes in your training data can produce a wildly different tree.
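To make the flowchart picture concrete, here is a minimal sketch that fits a very shallow tree and prints its yes/no questions as text. The iris dataset and `max_depth=2` are assumptions chosen only to keep the output short; they are not part of the original example.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Assumption: iris is used purely as a small stand-in dataset.
iris = load_iris()

# A depth-2 tree asks at most two questions before predicting.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# export_text renders the fitted tree as nested threshold questions.
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```

Each indented line is one question of the form "feature <= threshold", and each leaf line names the predicted class.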

A random forest builds many such trees, each on a slightly different sample of your data, and lets them vote. It's a bit like asking 100 doctors for a second opinion instead of trusting one — the consensus is more reliable than any individual.

Reach for it when

  • Your data is in rows and columns (a spreadsheet)
  • You want something that "just works" with minimal setup
  • You're not sure where to start
  • You want a reasonable accuracy estimate without extra effort

Skip it when

  • You need to explain why a specific prediction came out a certain way
  • Your data is images, text, or sequences (use neural networks)
  • You need predictions outside the range you trained on
from sklearn.ensemble import RandomForestClassifier

# Train: just give it labelled data
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Predict
predictions = model.predict(X_test)
accuracy = model.score(X_test, y_test)
print(f"Accuracy: {accuracy:.2%}")
Want to know how it actually works?
Key idea

$$ \hat{y}(x) \;=\; \frac{1}{B} \sum_{b=1}^{B} T_b(x) $$

  • ŷ(x): the forest's prediction for input x
  • B: the number of trees (typically 100–500)
  • T_b(x): the prediction from the b-th tree

Each tree is trained on a bootstrap sample with a random feature subset at each split. Both sources of randomness decorrelate the trees so the average cuts variance.

A single decision tree is high-variance — small changes to the training data produce very different trees. Random forests fight this by training many trees on slightly different views of the data, then averaging their predictions.

Two knobs do the work. Bootstrap sampling means each tree sees a different random sample (with replacement) of the training set. Feature subsetting means each split only considers a random subset of features (typically √p for classification, p/3 for regression). Together they make the trees' errors less correlated, which is what lets averaging cancel them.
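The two knobs can be sketched by hand with plain decision trees. This is an illustrative toy, not how scikit-learn implements its forest internally; the synthetic dataset, the 51-tree count (odd, to avoid tied votes), and all variable names are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic binary data (labels 0/1) stands in for a real table.
X, y = make_classification(n_samples=600, n_features=16, n_informative=6,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

rng = np.random.default_rng(0)
n_trees = 51
trees = []
for _ in range(n_trees):
    # Knob 1: bootstrap sample, drawn with replacement, same size as X_tr.
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    # Knob 2: each split considers a random subset of about sqrt(p) features.
    t = DecisionTreeClassifier(max_features="sqrt",
                               random_state=int(rng.integers(1_000_000)))
    trees.append(t.fit(X_tr[idx], y_tr[idx]))

# Rows are trees, columns are test points.
votes = np.stack([t.predict(X_te) for t in trees])
# Majority vote over 0/1 labels.
forest_pred = (votes.mean(axis=0) > 0.5).astype(int)

single_acc = float(np.mean([(v == y_te).mean() for v in votes]))
forest_acc = float((forest_pred == y_te).mean())
print(f"mean single-tree accuracy: {single_acc:.3f}")
print(f"majority-vote accuracy:    {forest_acc:.3f}")
```

Note that scikit-learn's `RandomForestClassifier` averages the trees' predicted class probabilities rather than counting hard votes, so its output can differ slightly from this sketch.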

Bonus: each bootstrap sample leaves out about 37% (roughly ⅓) of the training points. Predicting each point using only the trees that never saw it gives an out-of-bag (OOB) estimate of generalization error, with no separate cross-validation run.
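The left-out fraction is easy to check numerically. A small simulation, assuming 1,000 training points and 200 bootstrap draws (both numbers are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n_points, n_trees = 1000, 200

oob_fractions = []
for _ in range(n_trees):
    # One bootstrap sample: n_points indices drawn with replacement.
    idx = rng.integers(0, n_points, size=n_points)
    in_bag = np.zeros(n_points, dtype=bool)
    in_bag[idx] = True
    # Fraction of points that this sample never picked.
    oob_fractions.append(1.0 - in_bag.mean())

mean_oob = float(np.mean(oob_fractions))
print(f"mean out-of-bag fraction: {mean_oob:.3f}")
print(f"(1 - 1/n)^n:              {(1 - 1 / n_points) ** n_points:.3f}")
```

The simulated average should sit close to the closed-form value (1 − 1/n)^n.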

Reach for it when

  • Tabular data with mixed feature types
  • You need a strong baseline with almost no tuning
  • Robustness matters more than the last 2% of accuracy
  • You want a free OOB error estimate

Skip it when

  • Extrapolation outside the training range is required
  • Monotonicity constraints are required
  • You need per-prediction interpretability
  • You need top accuracy on high-signal tabular data (gradient boosting usually wins)
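The extrapolation limit in the list above can be seen on a toy regression. A minimal sketch, assuming a noisy linear target y ≈ 2x observed only for x between 0 and 10 (the data and names are illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# Assumption: the training inputs cover only the interval [0, 10].
X_train = rng.uniform(0, 10, size=(300, 1))
y_train = 2.0 * X_train[:, 0] + rng.normal(0, 0.1, size=300)

reg = RandomForestRegressor(n_estimators=100, random_state=0)
reg.fit(X_train, y_train)

# One point inside the training range, two far outside it.
X_new = np.array([[5.0], [20.0], [100.0]])
preds = reg.predict(X_new)
for x, p in zip(X_new[:, 0], preds):
    print(f"x = {x:6.1f}   prediction = {p:.2f}   linear trend = {2 * x:.2f}")
```

Every tree predicts an average of training targets from one of its leaves, so the forest can never output a value above the largest target it saw. Beyond the training range the prediction goes flat instead of following the trend.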
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(
    n_estimators=200,
    max_features="sqrt",   # √p features considered per split
    oob_score=True,        # free generalization estimate
    n_jobs=-1,
    random_state=0,
)
clf.fit(X_train, y_train)

print(f"OOB accuracy: {clf.oob_score_:.3f}")
for name, imp in sorted(
    zip(X_train.columns, clf.feature_importances_),
    key=lambda x: -x[1],
)[:5]:
    print(f"  {name:20s} {imp:.3f}")
Want the bias-variance derivation and honest diagnostics?
Variance reduction

$$ \mathrm{Var}(\hat{y}) \;=\; \rho\,\sigma^2 \;+\; \frac{1-\rho}{B}\,\sigma^2 $$

  • σ²: variance of an individual tree's prediction
  • ρ: pairwise correlation between trees on the same input
  • B: number of trees

As B → ∞ the second term vanishes — variance is floored at ρσ². The whole game is to drive ρ down via per-split feature subsets without making each tree too weak (which inflates σ²).
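The formula can be sanity-checked with a simulation. The sketch below does not train any trees: it assumes each "tree prediction" is a Gaussian with variance σ² and equal pairwise correlation ρ, built from one shared noise term plus one private term per tree. The values ρ = 0.3 and σ² = 1 are arbitrary.

```python
import numpy as np

def ensemble_variance(rho, sigma2, B):
    """Variance of the average of B predictions with pairwise correlation rho."""
    return rho * sigma2 + (1.0 - rho) / B * sigma2

rng = np.random.default_rng(0)
rho, sigma2, n_draws = 0.3, 1.0, 50_000
sigma = np.sqrt(sigma2)

results = {}
for B in (1, 10, 100):
    shared = rng.standard_normal((n_draws, 1))   # noise common to all trees
    own = rng.standard_normal((n_draws, B))      # noise private to each tree
    # Each column has variance sigma2; any two columns have correlation rho.
    preds = sigma * (np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * own)
    results[B] = float(preds.mean(axis=1).var())
    print(f"B = {B:3d}   simulated = {results[B]:.3f}   "
          f"formula = {ensemble_variance(rho, sigma2, B):.3f}")
```

As B grows, the simulated variance should approach ρσ² rather than zero, which is the floor described above.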

Random forests are bagging applied to trees, with one twist: at each split, only a random subset of m features (out of p) is considered. This injects an extra source of decorrelation beyond bootstrap sampling alone, addressing the fact that standard bagging produces highly correlated trees whenever a few features dominate the splits.

The max_features hyperparameter trades off correlation (low m → low ρ) against individual tree strength (low m → higher σ²). Empirical defaults — √p for classification, p/3 for regression — work remarkably well across domains.
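One hedged way to explore this trade-off is to sweep `max_features` and compare OOB accuracy. The synthetic dataset (25 features, 5 informative) and the three values of m are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Assumption: synthetic data with p = 25 features, so sqrt(p) = 5.
X, y = make_classification(n_samples=1000, n_features=25, n_informative=5,
                           random_state=0)

oob_by_m = {}
for m in (1, 5, 25):   # fully random splits, sqrt(p), plain bagging
    clf = RandomForestClassifier(
        n_estimators=100,
        max_features=m,     # number of features considered at each split
        oob_score=True,
        random_state=0,
    )
    clf.fit(X, y)
    oob_by_m[m] = clf.oob_score_
    print(f"max_features = {m:2d}   OOB accuracy = {clf.oob_score_:.3f}")
```

Setting m = p recovers plain bagging, and m = 1 gives the most decorrelated but weakest trees. Which value scores best depends on how many features carry signal, so treat the sweep as a diagnostic, not a rule.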

Out-of-bag error. Each bootstrap omits a fraction (1 − 1/N)^N → e⁻¹ ≈ 36.8% of points. Averaging predictions from trees that excluded each point gives an estimate of generalization error that is asymptotically equivalent to leave-one-out CV, at no extra training cost.
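The 36.8% figure follows from a short limit argument. A sketch, taking N to be the number of training points and assuming each bootstrap makes N independent draws with replacement, so that a given point is missed by every draw with probability 1 − 1/N:

$$ P(\text{point } i \text{ is out of bag}) \;=\; \left(1 - \frac{1}{N}\right)^{N}, \qquad N \ln\!\left(1 - \frac{1}{N}\right) \;=\; -1 \;-\; \frac{1}{2N} \;-\; \cdots \;\longrightarrow\; -1 $$

Exponentiating gives a limiting out-of-bag fraction of e⁻¹ ≈ 0.368, and the value is already close to that for moderate N.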

Feature importance. Impurity-based importance (feature_importances_) is biased toward high-cardinality features. Prefer permutation_importance on a held-out set for honest estimates.

Reach for it when

  • You want a robust baseline before investing in boosting
  • You need uncertainty estimates (via per-tree predictions)
  • OOB is desirable (small datasets, expensive CV)
  • The signal is heterogeneous — feature interactions vary across regimes
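For the per-tree uncertainty item in the list above, one rough approach is to look at how much the individual trees disagree. A minimal sketch on synthetic data; the spread across trees is a heuristic signal, not a calibrated confidence interval, and the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Assumption: synthetic binary data; the last 100 rows act as new inputs.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X[:400], y[:400])
X_new = X[400:]

# clf.estimators_ holds the fitted trees; collect each tree's P(class 1).
per_tree = np.stack([t.predict_proba(X_new)[:, 1] for t in clf.estimators_])

mean_p = per_tree.mean(axis=0)   # matches the forest's own predict_proba
spread = per_tree.std(axis=0)    # disagreement between trees, per input

most_contested = int(np.argmax(spread))
print(f"most contested input: index {most_contested}, "
      f"mean P(class 1) = {mean_p[most_contested]:.2f}, "
      f"spread = {spread[most_contested]:.2f}")
```

Inputs with a large spread are the ones where the trees split their votes, which makes them natural candidates for manual review.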

Skip it when

  • You need calibrated probabilities without post-hoc calibration
  • Memory is tight (forests are heavyweight at inference)
  • Smooth function approximation matters (forests are piecewise-constant)
  • The variable of interest is monotone in inputs and you need that respected
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

clf = RandomForestClassifier(
    n_estimators=500,
    max_features="sqrt",
    min_samples_leaf=1,
    oob_score=True,
    bootstrap=True,
    n_jobs=-1,
    random_state=0,
).fit(X_train, y_train)

# Honest importance via permutation on held-out set
perm = permutation_importance(
    clf, X_test, y_test, n_repeats=10, n_jobs=-1, random_state=0
)
ranked = sorted(zip(X_train.columns, perm.importances_mean), key=lambda x: -x[1])

print(f"OOB: {clf.oob_score_:.3f}   Test: {clf.score(X_test, y_test):.3f}")
for name, imp in ranked[:10]:
    print(f"  {name:20s} {imp:+.4f}")