Bias-Variance & Overfitting — ML Resources Hub

Key idea

There are two ways a model can be wrong. Bias: the model is too simple to capture the pattern. Variance: the model has memorized the training data, including its noise. Most fixes trade one for the other.

[Interactive figure: slide the polynomial degree. Left panel: 30 fits on different samples, overlaid. Right panel: the bias-variance U-curve. Controls: Target, Degree, σ (shown at degree = 4, σ = 0.18).]

Left: 30 polynomials of the chosen degree, each fit to a different noisy sample of the target. The bold orange line is their pointwise mean. Bias is how far that mean misses the target (dashed). Variance is the spread of the faded indigo curves around the mean. Right: those two quantities as the degree changes — bias drops, variance rises, and total error makes a U. Sliding the degree slider moves a vertical line through that plot so you can see exactly where you are on the trade-off curve.
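
The two panels can be approximated numerically. The sketch below is a stand-in for the figure, not its actual code: the target `sin(2πx)`, the 25 training points, and the helper name `bias_variance_by_degree` are illustrative assumptions; only the 30 fits and σ = 0.18 mirror the values shown with the figure.

```python
import numpy as np
from numpy.polynomial import Polynomial


def target(x):
    # Assumed target curve; the page does not say which function the figure uses.
    return np.sin(2 * np.pi * x)


def bias_variance_by_degree(degrees, n_fits=30, n_points=25, sigma=0.18, seed=0):
    """Estimate bias^2 and variance of polynomial fits for each degree."""
    rng = np.random.default_rng(seed)
    x_train = np.linspace(0.0, 1.0, n_points)
    x_grid = np.linspace(0.0, 1.0, 200)
    results = {}
    for deg in degrees:
        fits = np.empty((n_fits, x_grid.size))
        for i in range(n_fits):
            # Each fit sees the same x values but a fresh draw of the noise.
            y_noisy = target(x_train) + rng.normal(0.0, sigma, n_points)
            fits[i] = Polynomial.fit(x_train, y_noisy, deg)(x_grid)
        mean_fit = fits.mean(axis=0)                       # the bold mean curve
        bias_sq = np.mean((mean_fit - target(x_grid)) ** 2)
        variance = np.mean(fits.var(axis=0))               # spread around the mean
        results[deg] = (bias_sq, variance)
    return results


for deg, (b2, var) in bias_variance_by_degree([1, 3, 5, 9]).items():
    print(f"degree {deg}: bias^2={b2:.4f}  variance={var:.4f}  sum={b2 + var:.4f}")
```

Reading the printed rows from low to high degree gives the right-hand panel in table form: the bias column shrinks while the variance column grows.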

Picture trying to fit a straight line through data that's actually curved. The line is biased — no matter how much data you give it, it can't bend enough to match. Now picture a wiggly curve that passes through every single training point. Zero training error, but it's memorized the noise — show it new points and they'll be all over the map. That's variance.

The art of training a model is finding the sweet spot in between: complex enough to capture the real pattern, simple enough to ignore the noise.

High bias (underfitting)

Training error is poor
Test error is also poor
The model can't even fit the data it's trained on
Fix: more features, a more complex model, less regularization

High variance (overfitting)

Training error is tiny
Test error is much larger
The model "memorized" the training set
Fix: more data, a simpler model, regularization, ensembling

# Quickest diagnostic: compare training vs test score
# (.score() is accuracy for scikit-learn classifiers, R^2 for regressors)
print(f"Train: {model.score(X_train, y_train):.3f}")
print(f"Test:  {model.score(X_test,  y_test):.3f}")

# If both are low           -> high bias (underfitting)
# If train >> test          -> high variance (overfitting)
# If both high and close    -> you're in the sweet spot
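
The rule of thumb in those comments can be wrapped in a small helper. This is only a sketch: the function name `diagnose` and the thresholds `good` and `max_gap` are hypothetical, and sensible values depend on the task and the metric.

```python
def diagnose(train_score, test_score, good=0.80, max_gap=0.10):
    """Label a fit from two scores where higher is better (accuracy or R^2).

    The default thresholds are illustrative assumptions, not established cutoffs.
    """
    if train_score - test_score > max_gap:
        return "high variance (overfitting)"
    if train_score < good:
        return "high bias (underfitting)"
    return "reasonable fit"


print(diagnose(0.62, 0.60))  # both scores low
print(diagnose(0.99, 0.71))  # train far above test
print(diagnose(0.91, 0.88))  # both high and close
```

The gap is checked first so that a model which is both mediocre and badly overfit is flagged for variance, since that is usually the cheaper problem to confirm.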

Want to see the math?

The decomposition $$ \mathbb{E}\!\left[(y - \hat{f}(x))^2\right] \;=\; \mathrm{Bias}[\hat{f}(x)]^{\,2} \;+\; \mathrm{Var}[\hat{f}(x)] \;+\; \sigma^2 $$

E[…]: expected error over all possible training sets and inputs
Bias: how far the average prediction sits from the truth
Var: how much predictions wobble as you re-sample the training data
σ²: irreducible noise, the error you can't escape

Total expected error splits cleanly into three terms. Two are your problem; one isn't.

Bias is error from picking the wrong model class. A linear model has high bias on a quadratic problem no matter how much data you feed it — the hypothesis space simply doesn't contain the right answer.

Variance is error from being too sensitive to the specific training sample. A deep, unpruned tree picks up on noise; re-fit it on a slightly different sample and you get a noticeably different model.

Irreducible noise (σ²) is the floor: it's noise in y given x. No model can do better.
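
The three-term split can be checked by simulation on a problem small enough to reason about. The example below is an assumed toy setup, not one from this page: the "model" predicts a constant, namely the training-sample mean multiplied by a hypothetical `shrink` factor, and the truth is a constant `mu` observed with Gaussian noise.

```python
import random
import statistics


def simulate(shrink, mu=2.0, sigma=1.0, n_train=10, n_trials=20000, seed=0):
    """Monte Carlo check that squared error = bias^2 + variance + noise."""
    rng = random.Random(seed)
    preds, sq_errors = [], []
    for _ in range(n_trials):
        train = [rng.gauss(mu, sigma) for _ in range(n_train)]
        pred = shrink * statistics.fmean(train)   # model fit on this training set
        y_new = rng.gauss(mu, sigma)              # fresh noisy observation
        preds.append(pred)
        sq_errors.append((y_new - pred) ** 2)
    bias_sq = (statistics.fmean(preds) - mu) ** 2
    variance = statistics.pvariance(preds)
    noise = sigma ** 2
    total = statistics.fmean(sq_errors)
    return bias_sq, variance, noise, total


for shrink in (1.0, 0.5):
    b2, var, noise, total = simulate(shrink)
    print(f"shrink={shrink}: bias^2={b2:.3f} var={var:.3f} noise={noise:.3f} "
          f"sum={b2 + var + noise:.3f} measured={total:.3f}")
```

With `shrink=1.0` the estimator is unbiased and all reducible error is variance; shrinking toward zero lowers the variance and introduces bias, while the noise term does not move.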

The trade-off comes from how knobs move bias and variance together:

Increases bias, reduces variance

Simpler model class (lower polynomial degree, e.g. cubic → linear)
Stronger regularization (L1, L2, dropout)
Smaller trees, more pruning
Smaller neural networks
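
As one illustration of these knobs, the sketch below turns up L2 (ridge) regularization on a synthetic linear problem and measures both terms. Everything specific here is an assumption made for the example: the true weights, the sample sizes, and the helper name `ridge_bias_variance`. The ridge solution is computed in closed form with NumPy.

```python
import numpy as np


def ridge_bias_variance(alpha, n_train=30, n_reps=500, sigma=1.0, seed=0):
    """Monte Carlo bias^2 and variance of ridge predictions on fixed test points."""
    rng = np.random.default_rng(seed)
    w_true = np.array([1.0, -2.0, 0.5, 3.0, -1.0])   # assumed ground-truth weights
    d = w_true.size
    X = rng.normal(size=(n_train, d))                # design held fixed across reps
    X_test = rng.normal(size=(200, d))
    f_test = X_test @ w_true                         # noiseless targets
    A = X.T @ X + alpha * np.eye(d)
    preds = np.empty((n_reps, X_test.shape[0]))
    for r in range(n_reps):
        y = X @ w_true + rng.normal(0.0, sigma, n_train)   # only the noise changes
        w_hat = np.linalg.solve(A, X.T @ y)                # closed-form ridge fit
        preds[r] = X_test @ w_hat
    bias_sq = np.mean((preds.mean(axis=0) - f_test) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias_sq, variance


for alpha in (0.0, 1.0, 10.0, 100.0):
    b2, var = ridge_bias_variance(alpha)
    print(f"alpha={alpha:>5}: bias^2={b2:.4f}  variance={var:.4f}")
```

Because the same seed is reused, every value of `alpha` sees the same design matrix and test points, so differences between rows come from the penalty alone.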

Reduces variance, bias roughly unchanged

More training data
Ensembling (bagging, random forests)
Averaging across initial seeds
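
Bagging can be tried on a deliberately high-variance learner. The sketch below uses 1-nearest-neighbour regression in one dimension as that learner; this is a choice made for brevity here, not a method this page prescribes, and the target function, sample sizes, and function names are all assumptions.

```python
import numpy as np


def nn1_predict(x_train, y_train, x_test):
    """1-nearest-neighbour regression in one dimension."""
    nearest = np.abs(x_test[:, None] - x_train[None, :]).argmin(axis=1)
    return y_train[nearest]


def bagged_nn1_predict(x_train, y_train, x_test, rng, n_bags=25):
    """Average 1-NN predictions over bootstrap resamples of the training set."""
    n = x_train.size
    total = np.zeros(x_test.size)
    for _ in range(n_bags):
        idx = rng.integers(0, n, size=n)           # sample with replacement
        total += nn1_predict(x_train[idx], y_train[idx], x_test)
    return total / n_bags


def compare(n_datasets=200, n_train=50, sigma=0.3, seed=0):
    """Return {name: (bias^2, variance)} for single and bagged 1-NN."""
    rng = np.random.default_rng(seed)
    x_test = np.linspace(0.05, 0.95, 100)
    f_test = np.sin(2 * np.pi * x_test)            # assumed true function
    single = np.empty((n_datasets, x_test.size))
    bagged = np.empty_like(single)
    for i in range(n_datasets):
        x = rng.uniform(0.0, 1.0, n_train)
        y = np.sin(2 * np.pi * x) + rng.normal(0.0, sigma, n_train)
        single[i] = nn1_predict(x, y, x_test)
        bagged[i] = bagged_nn1_predict(x, y, x_test, rng)
    out = {}
    for name, p in (("single 1-NN", single), ("bagged 1-NN", bagged)):
        out[name] = (np.mean((p.mean(axis=0) - f_test) ** 2), np.mean(p.var(axis=0)))
    return out


for name, (b2, var) in compare().items():
    print(f"{name}: bias^2={b2:.4f}  variance={var:.4f}")
```

Averaging over resamples spreads each prediction across several nearby training points, which is where the variance reduction comes from; the bias term should stay small in both rows.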

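from sklearn.model_selection import learning_curve
import numpy as np

# Assumes `model` (an unfitted estimator), `X`, and `y` already exist.
train_sizes, train_scores, val_scores = learning_curve(
    model, X, y,
    train_sizes=np.linspace(0.1, 1.0, 10),
    cv=5,
    scoring="neg_mean_squared_error",
    n_jobs=-1,
)

# Scores are negated MSE, one column per CV fold: flip the sign and average.
train_err = -train_scores.mean(axis=1)
val_err = -val_scores.mean(axis=1)

# How the two error curves behave as train_sizes grows diagnoses the regime:
#   val_err stays well above train_err    -> high variance
#   both plateau at a high error          -> high bias
#   both converge low                     -> you're in the sweet spot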
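
To make the first two cases concrete, here is a small self-contained run on synthetic data. The data-generating function, the two tree depths, and the helper name `mean_errors` are assumptions chosen for illustration: a depth-1 tree stands in for a high-bias model and an unpruned tree for a high-variance one.

```python
import numpy as np
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem: a sine curve plus Gaussian noise (assumed setup).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(400, 1))
y = np.sin(X[:, 0]) + rng.normal(0.0, 0.3, size=400)


def mean_errors(model, X, y):
    """Return training sizes with mean train / validation MSE at each size."""
    sizes, train_scores, val_scores = learning_curve(
        model, X, y,
        train_sizes=np.linspace(0.1, 1.0, 5),
        cv=5,
        scoring="neg_mean_squared_error",
    )
    # Scores are negated MSE, so flip the sign before averaging over folds.
    return sizes, -train_scores.mean(axis=1), -val_scores.mean(axis=1)


models = {
    "depth-1 tree (high bias)": DecisionTreeRegressor(max_depth=1, random_state=0),
    "unpruned tree (high variance)": DecisionTreeRegressor(random_state=0),
}
for name, model in models.items():
    sizes, train_err, val_err = mean_errors(model, X, y)
    print(name)
    for n, tr, va in zip(sizes, train_err, val_err):
        print(f"  n={n:3d}  train MSE={tr:.3f}  val MSE={va:.3f}")
```

The unpruned tree memorizes every training point, so its training error sits at zero while validation error does not; the depth-1 tree shows similar, nonzero error on both.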

Curious how this breaks down for modern deep nets?

Formal decomposition (squared loss) $$ \begin{aligned} \mathbb{E}_{D,\varepsilon}\!\left[(y - \hat{f}_D(x))^2\right] &\;=\; \underbrace{\left(\bar{f}(x) - f(x)\right)^2}_{\text{bias}^2} \;+\; \underbrace{\mathbb{E}_D\!\left[\left(\hat{f}_D(x) - \bar{f}(x)\right)^2\right]}_{\text{variance}} \;+\; \sigma^2 \end{aligned} $$

D: random training sample drawn from the data distribution
f(x): true underlying function
f̂_D(x): the model fit on a particular dataset D
f̄(x): shorthand for E_D[f̂_D(x)], the average prediction across training sets
ε: noise; y = f(x) + ε with E[ε] = 0 and Var(ε) = σ²

The decomposition holds for squared loss only. It does not generalize cleanly to 0-1 loss or cross-entropy; the most common attempt at a unified treatment is Domingos (2000). For classification, the practical analogue is the trade-off between training and test error, but "bias" and "variance" become heuristic labels rather than algebraic terms.

Double descent. The classical U-shape in test error vs. capacity describes only the under-parameterized regime. For modern interpolating models (wide neural networks, kernel methods fit to zero training error), test error tends to peak near the interpolation threshold, the point where training error first hits zero, and then often decreases again as capacity grows past it. Belkin et al. (2019) named this the double descent phenomenon.
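
A toy version of this curve can be produced with random features. The sketch below is a stand-in, not the experiments from the paper: it fits minimum-norm least squares on random ReLU features of a noisy linear target, and the dimensions, noise level, and function name `random_feature_test_error` are all assumptions. With 40 training points, the interpolation threshold sits at 40 features.

```python
import numpy as np


def random_feature_test_error(n_features, n_train=40, n_test=500, d=5,
                              sigma=0.5, n_trials=20, seed=0):
    """Mean test MSE of min-norm least squares on random ReLU features."""
    rng = np.random.default_rng(seed)
    w_true = np.ones(d) / np.sqrt(d)               # assumed linear target
    errors = []
    for _ in range(n_trials):
        X = rng.normal(size=(n_train, d))
        X_test = rng.normal(size=(n_test, d))
        y = X @ w_true + rng.normal(0.0, sigma, n_train)
        W = rng.normal(size=(d, n_features))       # random, untrained first layer
        phi = np.maximum(X @ W, 0.0)
        phi_test = np.maximum(X_test @ W, 0.0)
        # lstsq returns the minimum-norm solution when there are more
        # features than training points, so the fit interpolates.
        coef, *_ = np.linalg.lstsq(phi, y, rcond=None)
        errors.append(np.mean((phi_test @ coef - X_test @ w_true) ** 2))
    return float(np.mean(errors))


for p in (5, 10, 20, 40, 80, 400):
    print(f"features={p:4d}  test MSE={random_feature_test_error(p):.3f}")
```

The capacity knob here is the number of random features; the row at 40 features is where the error should spike before falling again for wider models.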

Implication. "Increasing capacity inflates variance" is a sound principle for classical ML but a poor description of deep nets. The implicit bias of the optimizer (for example, the kinds of solutions SGD tends to find) is thought to do much of the regularization work even when explicit regularization such as weight decay is minimal. Bias-variance is still a useful frame, but only one of several; see also benign overfitting (Bartlett et al.) and norm-based generalization bounds.

Classical regime

Number of params ≪ training points
Linear models, trees, classical ML
Decomposition is quantitatively useful
Capacity ↑ → variance ↑ holds reliably

Modern overparameterized regime

Models interpolate the training set (zero training loss)
Deep nets, wide kernels, transfer-learned models
Double descent: test error drops past interpolation
Generalization governed by optimizer's implicit bias

import numpy as np
from sklearn.utils import resample
from sklearn.base import clone

# Empirical bias-variance estimate via bootstrap.
# y_test is noisy, so the first term estimates bias^2 + sigma^2 together;
# separating them would need the true f(x), which real data does not provide.
def bias_variance(estimator, X_train, y_train, X_test, y_test, n_boot=200, seed=0):
    rng = np.random.RandomState(seed)
    y_test = np.asarray(y_test, dtype=float)
    preds = np.zeros((n_boot, len(y_test)))
    for b in range(n_boot):
        # Refit a fresh copy of the estimator on each bootstrap resample.
        Xb, yb = resample(X_train, y_train, random_state=rng.randint(10**9))
        preds[b] = clone(estimator).fit(Xb, yb).predict(X_test)

    mean_pred = preds.mean(axis=0)
    bias_sq_plus_noise = ((mean_pred - y_test) ** 2).mean()
    variance = preds.var(axis=0).mean()
    total_error = ((preds - y_test) ** 2).mean()  # equals the sum of the two terms
    return bias_sq_plus_noise, variance, total_error

Where to learn more

ESL (Hastie, Tibshirani & Friedman, The Elements of Statistical Learning), chapter 7: The canonical treatment of bias-variance and model assessment. Read alongside chapter 2 for context.
Belkin et al. (2019), "Reconciling modern machine-learning practice and the classical bias-variance trade-off": The double-descent paper. Short, accessible.
Domingos (2000): Unified bias-variance decomposition for 0-1, squared, and other losses. Read when squared loss isn't enough.
MLU-Explain, Bias-Variance: Interactive visualization showing how complexity moves bias and variance. Best intuition builder.
Wikipedia: Quick reference for the derivation and connections to regularization.