Key idea

A series of yes/no questions that funnel data to a prediction. "Is income > £30k?" → "Yes" → "Is age > 40?" → "Yes" → predict "high credit". The algorithm picks the questions and the order, learning them from training data.

Slide the depth — watch the tree carve axis-aligned regions and the right-hand diagram grow
max depth = 3

Every split is axis-aligned — so a single tree can only carve feature space into rectangles. That's fine for the Linear boundary dataset (one split does it) but watch what happens on XOR: depth 1 can't separate the classes at all, but depth 2 nails it with two perpendicular splits. Crank to depth 8 on a noisy dataset and you'll see overfitting — every training point gets its own tiny region.

Decision trees are perhaps the most intuitive ML model — you can read one off and explain exactly why it made each prediction. They handle mixed feature types (numeric + categorical), don't need normalisation, and don't make assumptions about the shape of the data.

The big downside: a single tree is brittle. Small changes in the training data produce very different trees, so they overfit easily. The fix is ensembles (random forests, gradient boosting) — many trees combined. But you have to understand a single tree first.

Reach for it when

  • You need a model you can read off and explain
  • Quick baseline with no feature engineering
  • Mixed numeric / categorical features
  • You'll feed it into a random forest or boosted ensemble

Skip it when

  • You want best possible accuracy — use random forests / GBM instead
  • Smooth function approximation matters (trees are piecewise-constant)
  • Linear relationships dominate — a linear model is more elegant
  • Very high-dim or sparse data
from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt

clf = DecisionTreeClassifier(max_depth=4, random_state=0)
clf.fit(X_train, y_train)

# Visualise the actual decisions
plot_tree(clf, feature_names=X_train.columns, filled=True)
plt.show()
Want the splitting math?
Splitting criterion (Gini impurity) $$ G(S) \;=\; 1 \;-\; \sum_{k=1}^{K} p_k^2, \qquad \Delta G \;=\; G(\text{parent}) - \sum_{\text{children}} \tfrac{|S_i|}{|S|}\,G(S_i) $$
  • Sset of training points at a node
  • pkfraction of class k in S
  • ΔGimpurity decrease from a candidate split — pick the split that maximises this

CART (Classification And Regression Trees) is the canonical algorithm. Greedy: at each node, search over all features and all thresholds for the split that most decreases impurity. Recurse on each child. Stop when a depth limit is reached or impurity decrease is too small.

Splitting criteria. For classification: Gini (default in sklearn) or entropy (information gain). They're nearly equivalent in practice. For regression: variance reduction — pick the split that most reduces within-node variance of y.

Controlling overfitting. Unpruned trees memorise the training set. Three knobs: max_depth (hard limit), min_samples_leaf (refuse splits that produce tiny leaves), min_impurity_decrease (refuse splits below a threshold). Cost-complexity pruning (CCP) is the principled approach: grow a deep tree, then prune back nodes whose contribution to accuracy is too small relative to their complexity.

Categorical features. sklearn's trees don't natively handle categoricals — they expect numeric inputs (use one-hot or target encoding). LightGBM and XGBoost handle them natively, often via partition-based splits which are more efficient than one-hot.

Reach for it when

  • Need an interpretable baseline you can show non-technical stakeholders
  • Quick exploration on a new dataset (no scaling needed)
  • Stage 1 of a tree-based pipeline (RF / boosting build on this)
  • Decisions that should be a literal flowchart

Skip it when

  • Continuous target where smooth predictions matter
  • You'll use an ensemble anyway — go straight to that
  • High-dim sparse text — linear models or NB win
  • Strong distributional assumptions you'd rather encode explicitly
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

# Reasonable hyperparameter sweep
params = {
    "max_depth":        [None, 4, 6, 8, 10],
    "min_samples_leaf": [1, 5, 10, 20],
    "criterion":        ["gini", "entropy"],
}
gs = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    params, cv=5, scoring="f1_macro", n_jobs=-1,
).fit(X_train, y_train)

print(f"Best params: {gs.best_params_}")
print(f"Best CV F1:  {gs.best_score_:.3f}")
Want cost-complexity pruning and the bias side of the story?
Cost-complexity pruning $$ R_\alpha(T) \;=\; R(T) \;+\; \alpha\,|T| $$
  • R(T)misclassification cost of tree T
  • |T|number of leaves
  • αregularization strength — sweep this to trace out a sequence of pruned subtrees

Greedy splits are myopic. CART picks the best local split at each node, which isn't always the globally optimal tree. Finding the optimal tree is NP-hard. Modern interpretable-ML methods (OCT, MurTree) use mathematical programming to find globally optimal trees for small problems — useful for high-stakes applications where you want to defend the structure.

Bias and feature importance. Impurity-based importance (sklearn's feature_importances_) is biased toward high-cardinality features — they offer more split candidates. Permutation importance on a held-out set gives more honest estimates. For random forests this matters even more — see the discussion on the RF page.

Missing values. CART originally used surrogate splits — alternate splits learned to handle missing values gracefully. sklearn doesn't implement this; LightGBM/XGBoost route missing values to whichever child improves the loss most. The simplest workaround in sklearn: impute first, or use HistGradientBoostingClassifier which handles missing natively.

Monotonicity constraints. Sometimes you need the model to be monotone in a feature (e.g. credit risk should not decrease with income). XGBoost and LightGBM both support monotone constraints; vanilla CART doesn't. When this matters, switch frameworks.

Where trees go to die. Decision trees alone are rarely state-of-the-art today. They're foundations for ensembles (RF, GBM). The interesting research is now in: differentiable trees (soft decision trees, tree-MLPs), neural-tree hybrids (NODE, TabNet), and globally-optimal trees for interpretability-critical settings.

Reach for it when

  • Regulatory / interpretability requirements that need a literal flowchart
  • Cost-complexity pruning gives you a principled accuracy/complexity trade-off
  • You're building or debugging an ensemble and need to inspect base learners
  • Globally-optimal trees on small problems — for defensibility

Skip it when

  • You're chasing accuracy — ensembles dominate
  • Smooth function approximation needed
  • Adversarial robustness — trees have known attack surfaces
  • Online / streaming setting — incremental tree algorithms exist but are niche
from sklearn.tree import DecisionTreeClassifier
import numpy as np

# Cost-complexity pruning path
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
path = tree.cost_complexity_pruning_path(X_train, y_train)
alphas, impurities = path.ccp_alphas, path.impurities

# Train one tree per alpha; pick the smallest tree within 1 SE of the best
trees = [
    DecisionTreeClassifier(ccp_alpha=a, random_state=0).fit(X_train, y_train)
    for a in alphas
]
val_scores = np.array([t.score(X_val, y_val) for t in trees])

best_i  = val_scores.argmax()
se      = val_scores.std() / np.sqrt(len(val_scores))
within  = np.where(val_scores >= val_scores[best_i] - se)[0]
chosen  = trees[within[-1]]   # simplest tree within 1 SE — Occam's razor

print(f"Chose tree with {chosen.get_n_leaves()} leaves, "
      f"val acc = {val_scores[within[-1]]:.3f}")
Too dense?