The trade-off at the heart of why models fail to generalize.
Key idea
There are two ways a model can be wrong. Bias: the model is too simple to capture the pattern. Variance: the model has memorized the training data, including its noise. Most fixes trade one for the other.
Figure (interactive): 30 polynomial fits on different samples, overlaid, with the bias-variance U-curve on the right. Settings shown: deg = 4, σ = 0.18.
Left: 30 polynomials of the chosen degree, each fit to a different noisy sample of the target. The bold orange line is their pointwise mean. Bias is how far that mean misses the target (dashed). Variance is the spread of the faded indigo curves around the mean. Right: those two quantities as the degree changes — bias drops, variance rises, and total error makes a U. Sliding the degree slider moves a vertical line through that plot so you can see exactly where you are on the trade-off curve.
Picture trying to fit a straight line through data that's actually curved. The line is biased — no matter how much data you give it, it can't bend enough to match. Now picture a wiggly curve that passes through every single training point. Zero training error, but it's memorized the noise — show it new points and they'll be all over the map. That's variance.
The art of training a model is finding the sweet spot in between: complex enough to capture the real pattern, simple enough to ignore the noise.
High bias (underfitting)
Training error is poor
Test error is also poor
The model can't even fit the data it's trained on
Fix: more features, a more complex model, less regularization
High variance (overfitting)
Training error is tiny
Test error is much larger
The model "memorized" the training set
Fix: more data, a simpler model, regularization, ensembling
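# Quickest diagnostic: compare training vs. test score.
# Assumes `model` is an already-fitted scikit-learn estimator and that
# X_train/y_train and X_test/y_test come from a held-out split.
# .score() returns accuracy for classifiers and R^2 for regressors.
print(f"Train: {model.score(X_train, y_train):.3f}")
print(f"Test:  {model.score(X_test, y_test):.3f}")
# If both are low -> high bias (underfitting)
# If train >> test -> high variance (overfitting)
# If both are high and close together -> you're in the sweet spot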
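To see both failure modes side by side, here is a minimal self-contained sketch of the same diagnostic. The synthetic data, the wavy target, and the choice of decision trees are assumptions made for illustration; they do not come from the text above.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Assumed toy data: a wavy target plus Gaussian noise (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(400, 1))
y = np.sin(3 * X[:, 0]) + rng.normal(scale=0.3, size=400)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

def train_test_r2(max_depth):
    """Fit a tree of the given depth and return (train R^2, test R^2)."""
    model = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
    model.fit(X_train, y_train)
    return model.score(X_train, y_train), model.score(X_test, y_test)

stump_train, stump_test = train_test_r2(max_depth=1)    # far too simple
deep_train, deep_test = train_test_r2(max_depth=None)   # grown until leaves are pure

print(f"depth=1   train={stump_train:.3f}  test={stump_test:.3f}")
print(f"unpruned  train={deep_train:.3f}  test={deep_test:.3f}")
```

The depth-1 tree should score poorly on both splits (high bias), while the unpruned tree should fit the training split almost perfectly and drop noticeably on the test split (high variance).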
E[…]: expected error over all possible training sets and inputs
Bias: how far the average prediction sits from the truth
Var: how much predictions wobble as you re-sample the training data
σ²: irreducible noise, error you can't escape
Total expected error splits cleanly into three terms. Two are your problem; one isn't.
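For reference, the split in its standard squared-loss form is written out below. This is the textbook identity, assumed to be what the legend above annotates; note that the bias term enters squared.

```latex
\mathbb{E}\!\left[\bigl(y - \hat f(x)\bigr)^2\right]
  = \mathrm{Bias}^2 + \mathrm{Var} + \sigma^2
```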
Bias is error from picking the wrong model class. A linear model has high bias on a quadratic problem no matter how much data you feed it — the hypothesis space simply doesn't contain the right answer.
Variance is error from being too sensitive to the specific training sample. A deep, unpruned tree picks up on noise; re-fit it on a slightly different sample and you get a noticeably different model.
Irreducible noise (σ²) is the floor: it's noise in y given x. No model can do better.
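The first two terms can be estimated by simulation: refit the same model class on many noisy samples, then measure how far the average fit sits from the truth and how much individual fits scatter around that average. The sketch below does this for polynomial fits. The sine target, the fixed grid of 30 inputs, the 200 refits, and reading σ = 0.18 as the noise standard deviation are all assumptions made for illustration.

```python
import numpy as np
from numpy.polynomial import polynomial as P

rng = np.random.default_rng(0)
SIGMA = 0.18      # assumed noise standard deviation
N_FITS = 200      # assumed number of refits, chosen to steady the estimates
x_train = np.linspace(0.0, 1.0, 30)   # assumed fixed design; only the noise is re-drawn

def target(x):
    # Hypothetical ground-truth function.
    return np.sin(2 * np.pi * x)

def bias2_and_variance(degree):
    """Estimate squared bias and variance of degree-`degree` fits, averaged over x_train."""
    preds = np.empty((N_FITS, x_train.size))
    for i in range(N_FITS):
        y = target(x_train) + rng.normal(scale=SIGMA, size=x_train.size)
        coeffs = P.polyfit(x_train, y, degree)
        preds[i] = P.polyval(x_train, coeffs)
    mean_pred = preds.mean(axis=0)                        # estimate of the average fit
    bias2 = np.mean((mean_pred - target(x_train)) ** 2)   # squared gap to the truth
    variance = np.mean(preds.var(axis=0))                 # spread of fits around their mean
    return bias2, variance

results = {d: bias2_and_variance(d) for d in (1, 4, 9)}
for d, (b2, var) in results.items():
    print(f"degree={d}: bias^2={b2:.4f}  variance={var:.4f}  noise={SIGMA**2:.4f}")
```

Under these assumptions the squared bias should shrink and the variance should grow as the degree rises, while the noise term stays fixed at σ².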
The trade-off comes from how knobs move bias and variance together:
Increases bias, reduces variance
Simpler model class (lower polynomial degree, or linear instead of polynomial)
Stronger regularization (L1, L2, dropout)
Smaller trees, more pruning
Smaller neural networks
Reduces variance, bias roughly unchanged
More training data
Ensembling (bagging, random forests)
Averaging across initial seeds
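Two of the knobs listed above can be checked numerically. The sketch below compares a degree-9 polynomial fit against a ridge-penalized version of itself and against the same fit with ten times more data. The sine target, noise level, Legendre basis, penalty strength, and sample sizes are assumptions chosen for illustration, and the ridge penalty here also shrinks the intercept, which a production implementation would typically avoid.

```python
import numpy as np
from numpy.polynomial import legendre

rng = np.random.default_rng(0)
SIGMA, DEGREE, N_FITS = 0.18, 9, 200    # assumed settings
x_eval = np.linspace(0.0, 1.0, 50)

def target(x):
    # Hypothetical ground-truth function.
    return np.sin(2 * np.pi * x)

def features(x):
    # Legendre basis on [-1, 1]; better conditioned than raw powers of x.
    return legendre.legvander(2.0 * x - 1.0, DEGREE)

def bias2_and_variance(n_train, alpha):
    """Estimate (bias^2, variance) of ridge fits with penalty `alpha` on n_train points."""
    x = np.linspace(0.0, 1.0, n_train)
    phi, phi_eval = features(x), features(x_eval)
    gram = phi.T @ phi + alpha * np.eye(DEGREE + 1)
    preds = np.empty((N_FITS, x_eval.size))
    for i in range(N_FITS):
        y = target(x) + rng.normal(scale=SIGMA, size=n_train)
        weights = np.linalg.solve(gram, phi.T @ y)   # closed-form ridge solution
        preds[i] = phi_eval @ weights
    bias2 = np.mean((preds.mean(axis=0) - target(x_eval)) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias2, variance

baseline = bias2_and_variance(n_train=30, alpha=0.0)
regularized = bias2_and_variance(n_train=30, alpha=10.0)
more_data = bias2_and_variance(n_train=300, alpha=0.0)

for name, (b2, var) in [("baseline", baseline), ("ridge", regularized), ("10x data", more_data)]:
    print(f"{name:>9}: bias^2={b2:.5f}  variance={var:.5f}")
```

In this setup the ridge penalty should lower variance at the cost of a visibly larger bias, while the extra data should lower variance and leave the bias near zero.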
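from sklearn.model_selection import learning_curve
import numpy as np

# Assumes `model` is an unfitted scikit-learn regressor and X, y hold the full dataset.
train_sizes, train_scores, val_scores = learning_curve(
    model, X, y,
    train_sizes=np.linspace(0.1, 1.0, 10),
    cv=5,
    scoring="neg_mean_squared_error",
    n_jobs=-1,
)
# Scores are negated MSE with shape (n_sizes, n_folds): flip the sign and
# average over folds to get one error value per training size.
train_err = -train_scores.mean(axis=1)
val_err = -val_scores.mean(axis=1)

# Convergence behaviour of the two error curves diagnoses the regime:
# gap between them is large and persists -> high variance
# both plateau at a high error -> high bias
# both converge to a low error -> you're in the sweet spot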
Curious how this breaks down for modern deep nets?
D: random training sample drawn from the data distribution
f(x): true underlying function
f̂_D(x): the model fit on a particular dataset D
f̄(x): shorthand for E_D[f̂_D(x)], the average prediction across training sets
ε: noise; y = f(x) + ε with Var(ε) = σ²
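With these symbols, the decomposition at a fixed input x can be derived in three lines. The derivation below is the standard one and additionally assumes that ε has zero mean and is independent of D, which the definitions above leave implicit. The noise cross term vanishes because E[ε] = 0, and the remaining cross term vanishes because E_D[f̂_D(x) − f̄(x)] = 0 by the definition of f̄.

```latex
\begin{aligned}
\mathbb{E}_{D,\varepsilon}\!\left[\bigl(y - \hat f_D(x)\bigr)^2\right]
  &= \mathbb{E}_{D,\varepsilon}\!\left[\bigl(f(x) + \varepsilon - \hat f_D(x)\bigr)^2\right] \\
  &= \mathbb{E}_{D}\!\left[\bigl(f(x) - \hat f_D(x)\bigr)^2\right] + \sigma^2 \\
  &= \underbrace{\bigl(f(x) - \bar f(x)\bigr)^2}_{\mathrm{Bias}^2}
   + \underbrace{\mathbb{E}_{D}\!\left[\bigl(\hat f_D(x) - \bar f(x)\bigr)^2\right]}_{\mathrm{Var}}
   + \underbrace{\sigma^2}_{\text{noise}}
\end{aligned}
```

Averaging both sides over the input distribution gives the total expected error.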
The decomposition holds exactly for squared loss only. It does not generalize cleanly to 0-1 loss or cross-entropy; the best-known attempt at a unified treatment is Domingos (2000). For classification, the practical analogue is the relationship between training and test error, but "bias" and "variance" become heuristic labels rather than algebraic terms.
Double descent. The classical U-shape in test error vs. capacity describes only the under-parameterized regime. For modern interpolating models (wide neural networks, kernel methods), test error often decreases again as capacity grows past the interpolation threshold, the point where training error hits zero. Belkin et al. (2019) named this the double descent phenomenon.
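A toy version of this curve can be produced with random features and minimum-norm least squares. The sketch below is an illustrative setup under stated assumptions (synthetic data, random cosine features, the listed widths, a single random draw) and not a reproduction of any published experiment; exact numbers will vary with the seed.

```python
import numpy as np

rng = np.random.default_rng(0)
N_TRAIN, N_TEST, N_INPUT, NOISE_STD = 40, 500, 5, 0.1   # assumed toy settings

def make_data(n):
    # Hypothetical target: a sine of a fixed 1-D projection of the inputs, plus noise.
    x = rng.normal(size=(n, N_INPUT))
    y = np.sin(x.sum(axis=1) / np.sqrt(N_INPUT)) + rng.normal(scale=NOISE_STD, size=n)
    return x, y

x_train, y_train = make_data(N_TRAIN)
x_test, y_test = make_data(N_TEST)

def fit_and_score(n_features):
    """Fit min-norm least squares on random cosine features; return (train MSE, test MSE)."""
    projection = rng.normal(size=(N_INPUT, n_features)) / np.sqrt(N_INPUT)
    phase = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    phi_train = np.cos(x_train @ projection + phase)
    phi_test = np.cos(x_test @ projection + phase)
    # The pseudo-inverse gives the least-squares fit when n_features < N_TRAIN
    # and the minimum-norm interpolating solution when n_features > N_TRAIN.
    weights = np.linalg.pinv(phi_train) @ y_train
    train_mse = np.mean((phi_train @ weights - y_train) ** 2)
    test_mse = np.mean((phi_test @ weights - y_test) ** 2)
    return train_mse, test_mse

widths = [5, 10, 20, 40, 80, 160, 640]   # N_TRAIN = 40 is the interpolation threshold
curve = {p: fit_and_score(p) for p in widths}
for p, (train_mse, test_mse) in curve.items():
    print(f"features={p:4d}  train MSE={train_mse:.2e}  test MSE={test_mse:.3f}")
```

In runs of this kind, test error typically spikes near the threshold (here, around 40 features) and falls again as the width keeps growing, even though training error is already essentially zero.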
Implication. "Increasing capacity inflates variance" is a sound principle for classical ML but a poor description of deep nets. There, the implicit bias of the optimizer (for example, SGD) is thought to do much of the regularization work even when explicit regularization such as weight decay is minimal. Bias-variance is still a useful frame, but only one of several; see also benign overfitting (Bartlett et al.) and norm-based generalization bounds.
Classical regime
Number of params ≪ training points
Linear models, trees, classical ML
Decomposition is quantitatively useful
Capacity ↑ → variance ↑ holds reliably
Modern overparameterized regime
Models interpolate the training set (zero training loss)
Deep nets, wide kernels, transfer-learned models
Double descent: test error drops past interpolation
Generalization governed by optimizer's implicit bias