Key idea

A relay race of trees. Train a small tree, see where it makes mistakes, then train the next tree to fix those mistakes. Repeat. Sum them up. The resulting model is usually the strongest thing you can put on tabular data.

Each Step fits one depth-2 tree to the current residuals — watch the orange fit chase the dashed target
η = 0.40 Tree 0

Round 0 is just the mean — the bottom panel shows every point's residual from it. Each Step fits a depth-2 tree (4 leaves) to those residuals and adds it (scaled by η) to the ensemble. After ~20 trees on the Sine target the fit is essentially perfect; the residuals shrink toward zero. The Step function target is a great demo of why trees are excellent on non-smooth targets — boosting nails the jumps where smooth methods would oscillate.

Random forests train trees in parallel and average them. Gradient boosting trains them in sequence: tree 2 is trained on the errors tree 1 made, tree 3 is trained on the errors trees 1+2 made, and so on. Each new tree is small (just a few splits) but specifically targeted at what's still going wrong.

The libraries you've heard of — XGBoost, LightGBM, CatBoost — are heavily-optimized implementations of this idea. They're still the default winning approach on Kaggle tabular competitions.

Reach for it when

  • Tabular data and you want the best possible accuracy
  • You have time to tune hyperparameters (learning rate, depth, etc.)
  • Mixed numeric / categorical features
  • You're competing on Kaggle

Skip it when

  • You need a quick zero-tuning baseline (random forests are friendlier)
  • Data is images, text, or audio (use neural networks)
  • Tiny data — overfits easily without careful tuning
  • You need a fully interpretable model
from sklearn.ensemble import GradientBoostingClassifier

clf = GradientBoostingClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)
print(f"Accuracy: {clf.score(X_test, y_test):.3f}")
Want to see how the gradients work?
Key idea $$ F_m(x) \;=\; F_{m-1}(x) \;+\; \nu \cdot h_m(x), \qquad h_m \approx -\nabla_F \mathcal{L} $$
  • Fmthe ensemble after m trees
  • hmthe new tree at step m, fit to the negative gradient of the loss
  • νlearning rate (shrinkage), typically 0.01–0.1

At each step, fit a small tree to the pseudo-residuals (negative gradient of the loss w.r.t. current predictions). Add it to the ensemble with a small learning rate. Repeat.

The "gradient" in gradient boosting is the derivative of the loss with respect to the predictions. For squared loss, the gradient is just y − F(x) — the residuals — so each new tree is trained to predict the leftover error. For other losses (log loss, hinge), the pseudo-residuals differ but the recipe is the same.

Three hyperparameters do most of the work: n_estimators (more trees → more capacity), max_depth (deeper trees → more interaction modelling), and learning_rate (smaller → slower but better generalization). Early stopping on a validation set is the standard way to choose n_estimators.

Unlike random forests, gradient boosting is not bagged — every tree sees all the data. The decorrelation comes from each tree fitting a different residual signal, not from data sampling.

Reach for it when

  • You want SOTA accuracy on tabular data
  • You can afford hyperparameter tuning
  • You need a flexible loss (regression, classification, ranking)
  • Feature importance / SHAP values are useful

Skip it when

  • Data is non-tabular
  • You're under-resourced — sklearn's default GB is slow; reach for LightGBM
  • You need probability calibration without post-hoc adjustment
  • Strict latency budget at inference (forests can be heavy)
import lightgbm as lgb

train_ds = lgb.Dataset(X_train, label=y_train)
val_ds   = lgb.Dataset(X_val,   label=y_val, reference=train_ds)

params = {
    "objective":     "binary",
    "metric":        "auc",
    "learning_rate": 0.05,
    "num_leaves":    63,
    "feature_fraction": 0.9,
    "verbosity":    -1,
}

model = lgb.train(
    params, train_ds, num_boost_round=2000,
    valid_sets=[val_ds],
    callbacks=[lgb.early_stopping(50)],
)
Want the functional-gradient view and the XGBoost regularizer?
Regularized objective (XGBoost) $$ \mathcal{L}^{(m)} \;=\; \sum_{i} \ell\!\left(y_i, F_{m-1}(x_i) + h_m(x_i)\right) \;+\; \Omega(h_m), \quad \Omega(h) = \gamma\,T + \tfrac{1}{2}\lambda \|w\|^2 $$
  • Tnumber of leaves in the tree
  • wvector of leaf weights
  • γ, λregularization strengths on leaf count and leaf magnitude

Gradient boosting can be viewed as functional gradient descent in function space: at each step, move F in the direction that most decreases the loss, where the "direction" is parameterized as a regression tree. XGBoost adds explicit regularization on tree complexity.

Friedman's original formulation (2001) interprets boosting as gradient descent in function space, with each weak learner an approximate steepest-descent step. The Newton-style variant used by XGBoost expands the loss to second order, fitting trees to g/h instead of just g — this is what the "XGBoost gain" formula computes at every candidate split.

Why LightGBM is faster. Histogram binning (bucket features into ~256 bins before considering splits) reduces split-finding from O(N) to O(bins). Leaf-wise (not level-wise) growth picks the highest-loss-reducing split anywhere in the tree, which converges faster but overfits if depth is uncapped.

CatBoost. Categorical features get ordered target statistics computed on permuted prefixes to avoid target leakage. Often dominates when you have lots of categorical features and not much time to tune.

Common failure mode. Default learning rate (0.1) is often too high. Halve it and double n_estimators — usually a free generalization gain.

Reach for it when

  • Heterogeneous tabular features (numeric + categorical + missing)
  • You need monotone constraints (LightGBM / XGBoost support them)
  • You're confident enough to tune and have a held-out set
  • Quantile / ranking objectives matter (NDCG, Huber, etc.)

Skip it when

  • You're chasing the last 0.5% — try CatBoost-vs-LightGBM ensemble first
  • Data has strong sequential / spatial structure (use specialized nets)
  • You can't tune — random forests are more forgiving defaults
  • Tiny labelled set — boosting overfits faster than RF
import xgboost as xgb

dtrain = xgb.DMatrix(X_train, label=y_train)
dval   = xgb.DMatrix(X_val,   label=y_val)

params = {
    "objective":      "binary:logistic",
    "eval_metric":    "auc",
    "eta":            0.05,
    "max_depth":      6,
    "min_child_weight": 1,
    "gamma":          0.1,    # γ in Ω(h)
    "reg_lambda":     1.0,    # λ in Ω(h)
    "subsample":      0.8,
    "colsample_bytree": 0.8,
    "tree_method":    "hist",
}

booster = xgb.train(
    params, dtrain, num_boost_round=2000,
    evals=[(dval, "val")],
    early_stopping_rounds=50,
    verbose_eval=False,
)
print(f"Best AUC: {booster.best_score:.4f} at round {booster.best_iteration}")
Too dense?