Key idea

There are two ways a model can be wrong. Bias: the model is too simple to capture the pattern. Variance: the model has memorized the training data, including its noise. Most fixes trade one for the other.

[Interactive figure: 30 polynomial fits on different noisy samples, overlaid (left), and the bias-variance U-curve versus polynomial degree (right). Settings shown: degree = 4, σ = 0.18.]

Left: 30 polynomials of the chosen degree, each fit to a different noisy sample of the target. The bold orange line is their pointwise mean. Bias is how far that mean misses the target (dashed). Variance is the spread of the faded indigo curves around the mean. Right: those two quantities as the degree changes — bias drops, variance rises, and total error makes a U. Sliding the degree slider moves a vertical line through that plot so you can see exactly where you are on the trade-off curve.

Picture trying to fit a straight line through data that's actually curved. The line is biased — no matter how much data you give it, it can't bend enough to match. Now picture a wiggly curve that passes through every single training point. Zero training error, but it's memorized the noise — show it new points and they'll be all over the map. That's variance.

The art of training a model is finding the sweet spot in between: complex enough to capture the real pattern, simple enough to ignore the noise.
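
The picture above can be checked numerically. The sketch below is a minimal illustration, not the code behind the figure: it assumes a sine target on [-1, 1] and 25 points per sample, and only the 30 fits and σ = 0.18 come from the figure description. The names `target` and `bias_variance_by_degree` are hypothetical.

```python
import numpy as np
from numpy.polynomial import polynomial as P

def target(x):
    # Assumed ground-truth function; the original does not specify one.
    return np.sin(np.pi * x)

def bias_variance_by_degree(degree, n_fits=30, n_points=25, sigma=0.18, seed=0):
    """Fit `n_fits` polynomials to independent noisy samples and return
    (bias^2, variance), each averaged over a fixed evaluation grid."""
    rng = np.random.default_rng(seed)
    x_grid = np.linspace(-1.0, 1.0, 200)
    preds = np.empty((n_fits, x_grid.size))
    for i in range(n_fits):
        # A fresh noisy sample of the target for every fit.
        x = rng.uniform(-1.0, 1.0, n_points)
        y = target(x) + rng.normal(0.0, sigma, n_points)
        coefs = P.polyfit(x, y, degree)
        preds[i] = P.polyval(x_grid, coefs)
    mean_pred = preds.mean(axis=0)                      # the "bold orange line"
    bias_sq = np.mean((mean_pred - target(x_grid)) ** 2)
    variance = np.mean(preds.var(axis=0))               # spread around the mean
    return bias_sq, variance

for degree in (1, 4, 10):
    b, v = bias_variance_by_degree(degree)
    print(f"degree {degree:2d}: bias^2 = {b:.4f}  variance = {v:.4f}")
```

Under these assumptions the straight line should show the largest bias² and the degree-10 fit the largest variance; the exact numbers depend on the assumed target, noise level, and sample size.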

High bias (underfitting)

  • Training error is poor
  • Test error is also poor
  • The model can't even fit the data it's trained on
  • Fix: more features, a more complex model, less regularization

High variance (overfitting)

  • Training error is tiny
  • Test error is much larger
  • The model "memorized" the training set
  • Fix: more data, a simpler model, regularization, ensembling
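
```python
# Quickest diagnostic: compare training vs test score.
# Assumes `model` is an already fitted scikit-learn estimator and the data is
# already split; .score() is accuracy for classifiers and R² for regressors.
print(f"Train: {model.score(X_train, y_train):.3f}")
print(f"Test:  {model.score(X_test,  y_test):.3f}")

# If both are low      -> high bias (underfitting)
# If train >> test     -> high variance (overfitting)
# If both reasonable   -> you're in the sweet spot
```
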
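
To see both failure modes side by side, here is a small self-contained sketch using only NumPy. The sine-plus-noise data, the sample sizes, and the helper names `make_data` and `train_test_mse` are assumptions made for illustration; mean squared error (lower is better) stands in for `.score()`.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # Assumed synthetic problem: a sine wave plus Gaussian noise.
    x = rng.uniform(-1.0, 1.0, n)
    return x, np.sin(np.pi * x) + rng.normal(0.0, 0.2, n)

x_train, y_train = make_data(20)
x_test, y_test = make_data(500)

def train_test_mse(degree):
    """Fit a polynomial of the given degree on the training set and
    return (training MSE, test MSE)."""
    fit = np.polynomial.Polynomial.fit(x_train, y_train, degree)
    train_mse = float(np.mean((fit(x_train) - y_train) ** 2))
    test_mse = float(np.mean((fit(x_test) - y_test) ** 2))
    return train_mse, test_mse

for degree in (1, 3, 12):
    tr, te = train_test_mse(degree)
    print(f"degree {degree:2d}: train MSE = {tr:.3f}  test MSE = {te:.3f}")
```

With this setup the straight line is expected to do poorly on both sets (high bias), while the degree-12 fit is expected to have a small training error and a clearly larger test error (high variance).
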
Want to see the math?
The decomposition

$$ \mathbb{E}\!\left[(y - \hat{f}(x))^2\right] \;=\; \mathrm{Bias}[\hat{f}(x)]^{\,2} \;+\; \mathrm{Var}[\hat{f}(x)] \;+\; \sigma^2 $$

  • E[…]: expected error over all possible training sets and inputs
  • Bias: how far the average prediction sits from the truth
  • Var: how much predictions wobble as you re-sample the training data
  • σ²: irreducible noise, the error you can't escape

Total expected error splits cleanly into three terms. Two are your problem; one isn't.

Bias is error from picking the wrong model class. A linear model has high bias on a quadratic problem no matter how much data you feed it — the hypothesis space simply doesn't contain the right answer.

Variance is error from being too sensitive to the specific training sample. A deep, unpruned tree picks up on noise; re-fit it on a slightly different sample and you get a noticeably different model.

Irreducible noise (σ²) is the floor: it's noise in y given x. No model can do better.

The trade-off comes from how knobs move bias and variance together:

Increases bias, reduces variance

  • Simpler model class (lower polynomial degree, e.g. polynomial → linear)
  • Stronger regularization (L1, L2, dropout)
  • Smaller trees, more pruning
  • Smaller neural networks

Reduces variance, bias roughly unchanged

  • More training data
  • Ensembling (bagging, random forests)
  • Averaging across initial seeds
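
Two of the knobs above can be tried out in a few lines. This is a rough sketch under stated assumptions: a sine target, a fixed grid of inputs (so training sets differ only in their noise), and closed-form ridge regression on polynomial features. `ridge_bias_variance` and its parameters are hypothetical names, not part of any library.

```python
import numpy as np

def target(x):
    # Assumed ground truth, for illustration only.
    return np.sin(np.pi * x)

def ridge_bias_variance(lam, n_points, degree=8, sigma=0.2, n_fits=200, seed=0):
    """Return (bias^2, variance) of ridge-regularized polynomial fits,
    estimated over `n_fits` training sets that share the same inputs
    but have independent noise."""
    rng = np.random.default_rng(seed)
    x = np.linspace(-1.0, 1.0, n_points)            # fixed design (assumption)
    x_grid = np.linspace(-1.0, 1.0, 100)
    Phi = np.vander(x, degree + 1, increasing=True)
    Phi_grid = np.vander(x_grid, degree + 1, increasing=True)
    A = Phi.T @ Phi + lam * np.eye(degree + 1)
    preds = np.empty((n_fits, x_grid.size))
    for i in range(n_fits):
        y = target(x) + rng.normal(0.0, sigma, n_points)
        # Closed-form ridge solution: (Phi^T Phi + lam I)^-1 Phi^T y
        w = np.linalg.solve(A, Phi.T @ y)
        preds[i] = Phi_grid @ w
    bias_sq = np.mean((preds.mean(axis=0) - target(x_grid)) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias_sq, variance

for lam, n in [(1e-3, 30), (10.0, 30), (1e-3, 300)]:
    b, v = ridge_bias_variance(lam, n)
    print(f"lambda = {lam:<6g} n = {n:3d}: bias^2 = {b:.4f}  variance = {v:.4f}")
```

Raising `lam` at a fixed sample size should trade variance for bias, while raising `n_points` at a small `lam` should cut variance and leave bias about where it was.
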
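
```python
from sklearn.model_selection import learning_curve
import numpy as np

# Assumes `model` is an unfitted scikit-learn estimator and X, y hold the full dataset.
train_sizes, train_scores, val_scores = learning_curve(
    model, X, y,
    train_sizes=np.linspace(0.1, 1.0, 10),
    cv=5,
    scoring="neg_mean_squared_error",
    n_jobs=-1,
)

# Scores are negated MSE, one column per CV fold: flip the sign, average the folds.
train_err = -train_scores.mean(axis=1)
val_err = -val_scores.mean(axis=1)

# Convergence behaviour of the two error curves diagnoses the regime:
#   validation error stays well above training error -> high variance
#   both plateau at a high error                      -> high bias
#   both converge low                                 -> you're in the sweet spot
```
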
Curious how this breaks down for modern deep nets?
Formal decomposition (squared loss)

$$ \begin{aligned} \mathbb{E}_{D,\varepsilon}\!\left[(y - \hat{f}_D(x))^2\right] &\;=\; \underbrace{\left(\bar{f}(x) - f(x)\right)^2}_{\text{bias}^2} \;+\; \underbrace{\mathbb{E}_D\!\left[\left(\hat{f}_D(x) - \bar{f}(x)\right)^2\right]}_{\text{variance}} \;+\; \sigma^2 \end{aligned} $$

  • D: random training sample drawn from the data distribution
  • f(x): true underlying function
  • f̂_D(x): the model fit on a particular dataset D
  • f̄(x): shorthand for E_D[f̂_D(x)], the average prediction across training sets
  • ε: noise; y = f(x) + ε with Var(ε) = σ²
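
For readers who want the intermediate step, one way to reach the three terms is sketched below. It assumes ε has zero mean and is independent of the training sample D; neither assumption is spelled out above.

$$
\begin{aligned}
y - \hat{f}_D(x) &= \varepsilon + \left(f(x) - \bar{f}(x)\right) + \left(\bar{f}(x) - \hat{f}_D(x)\right) \\
\mathbb{E}_{D,\varepsilon}\!\left[(y - \hat{f}_D(x))^2\right]
&= \mathbb{E}[\varepsilon^2] + \left(f(x) - \bar{f}(x)\right)^2 + \mathbb{E}_D\!\left[\left(\bar{f}(x) - \hat{f}_D(x)\right)^2\right] \\
&\quad + 2\left(f(x) - \bar{f}(x)\right)\mathbb{E}[\varepsilon]
 + 2\,\mathbb{E}[\varepsilon]\;\mathbb{E}_D\!\left[\bar{f}(x) - \hat{f}_D(x)\right]
 + 2\left(f(x) - \bar{f}(x)\right)\mathbb{E}_D\!\left[\bar{f}(x) - \hat{f}_D(x)\right] \\
&= \sigma^2 + \left(\bar{f}(x) - f(x)\right)^2 + \mathbb{E}_D\!\left[\left(\hat{f}_D(x) - \bar{f}(x)\right)^2\right]
\end{aligned}
$$

All three cross terms vanish: the first two because E[ε] = 0, the third because E_D[f̂_D(x)] = f̄(x) by definition. With zero mean, E[ε²] = Var(ε) = σ².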

This additive form is specific to squared loss. It does not generalize cleanly to 0-1 loss or cross-entropy; the best-known attempt at a unified treatment is Domingos (2000). For classification, the practical analogue is the trade-off between training and test error, but "bias" and "variance" become heuristic labels rather than algebraic terms.

Double descent. The classical U-shape in test error vs. capacity describes only the under-parameterized regime. For modern interpolating models (wide neural networks, kernel methods pushed past the interpolation threshold), test error often peaks near the point where training error first hits zero and then decreases again as capacity keeps growing. Belkin et al. (2019) named this the double descent phenomenon.
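
A toy version of this curve can be reproduced with random features. The sketch below is an illustration under assumptions chosen here, not the setup of Belkin et al.: a noisy linear target in 5 dimensions, 40 training points, a fixed random ReLU layer, and a minimum-norm least-squares fit of the output weights. `random_feature_test_mse` is a hypothetical helper name.

```python
import numpy as np

def random_feature_test_mse(n_features, n_train=40, n_test=500, d=5,
                            sigma=0.3, n_trials=20, seed=0):
    """Mean test MSE (over `n_trials` random problems) of minimum-norm
    least squares on `n_features` random ReLU features."""
    rng = np.random.default_rng(seed)
    errs = []
    for _ in range(n_trials):
        beta = rng.normal(size=d) / np.sqrt(d)              # assumed linear target
        W = rng.normal(size=(d, n_features)) / np.sqrt(d)   # fixed random first layer
        X_tr = rng.normal(size=(n_train, d))
        X_te = rng.normal(size=(n_test, d))
        y_tr = X_tr @ beta + rng.normal(0.0, sigma, n_train)
        y_te = X_te @ beta + rng.normal(0.0, sigma, n_test)
        Phi_tr = np.maximum(X_tr @ W, 0.0)
        Phi_te = np.maximum(X_te @ W, 0.0)
        # The pseudo-inverse gives ordinary least squares when n_features < n_train
        # and the minimum-norm interpolating solution when n_features > n_train.
        w = np.linalg.pinv(Phi_tr) @ y_tr
        errs.append(np.mean((Phi_te @ w - y_te) ** 2))
    return float(np.mean(errs))

for p in (5, 20, 40, 80, 400):
    print(f"features = {p:3d}: test MSE = {random_feature_test_mse(p):.3f}")
```

In this setup the test error is expected to spike around 40 features, where the model can first interpolate the 40 training points, and to come back down as the feature count grows well beyond that. Whether and where a peak appears depends on the noise level, the feature map, and the sample size.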

Implication. "Increasing capacity inflates variance" is a sound principle for classical ML but a poor description of deep nets. The implicit bias of the optimizer (for example, SGD's tendency to settle on low-norm solutions) is widely thought to do much of the regularization work even when explicit regularization such as weight decay is minimal. Bias-variance is still a useful frame, but only one of several; see also benign overfitting (Bartlett et al.) and norm-based generalization bounds.

Classical regime

  • Number of params ≪ training points
  • Linear models, trees, classical ML
  • Decomposition is quantitatively useful
  • Capacity ↑ → variance ↑ holds reliably

Modern overparameterized regime

  • Models interpolate the training set (zero training loss)
  • Deep nets, wide kernels, transfer-learned models
  • Double descent: test error drops past interpolation
  • Generalization governed by optimizer's implicit bias
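
```python
import numpy as np
from sklearn.base import clone
from sklearn.utils import resample

# Empirical bias-variance estimate via bootstrap.
# Bias² and noise can only be separated when the noise-free targets f(x) are
# known (e.g. in a simulation). `f_test` is an optional argument added here
# for that case; without it, measuring against noisy y_test gives their sum.
def bias_variance(estimator, X_train, y_train, X_test, y_test,
                  n_boot=200, seed=0, f_test=None):
    rng = np.random.RandomState(seed)
    y_test = np.asarray(y_test, dtype=float)
    preds = np.zeros((n_boot, len(X_test)))
    for b in range(n_boot):
        # Each bootstrap resample stands in for a fresh training set D.
        Xb, yb = resample(X_train, y_train, random_state=rng)
        preds[b] = clone(estimator).fit(Xb, yb).predict(X_test)

    mean_pred = preds.mean(axis=0)
    variance = preds.var(axis=0).mean()
    if f_test is None:
        # First value is bias² + noise combined; noise alone is unknown (NaN).
        return ((mean_pred - y_test) ** 2).mean(), variance, np.nan
    f_test = np.asarray(f_test, dtype=float)
    bias_sq = ((mean_pred - f_test) ** 2).mean()
    noise = ((y_test - f_test) ** 2).mean()
    return bias_sq, variance, noise
```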