The assumptions baked into a model — what it can express, and what it can't.
Key idea
Every model is a guess about how the world is shaped. A linear model guesses "straight lines"; a tree guesses "axis-aligned splits"; a neural network with ReLUs guesses "piecewise-linear regions". You can't learn without making these guesses — that's the no free lunch theorem. The art is choosing assumptions that match the world you're modelling.
Same data, four model families — each draws a wildly different boundary because of its built-in assumptions
The same dataset shown to four different model families. Each draws its decision boundary in a characteristic shape — a linear model can only draw straight lines; a tree only axis-aligned rectangles; k-NN draws jagged regions reflecting local neighbours; RBF / kernel methods draw smooth curves. None of these is "correct" — each has assumed something different about how the world is shaped. Try the XOR dataset and watch which models can express it.
A model's hypothesis space is the set of functions it can possibly represent. Linear regression's hypothesis space is "all straight lines"; a polynomial of degree 5's is "all degree-5 polynomials"; a deep network's is "all piecewise-linear functions of a certain depth and width". The hypothesis space is the model's capacity.
Within a hypothesis space, the inductive bias is the preferences the learning algorithm uses to pick a specific function when many fit the data. Linear regression prefers the line minimising squared error. k-NN prefers the function that matches its neighbours. Without an inductive bias, infinitely many functions fit any finite dataset — you need a tie-breaker.
Choose a strong inductive bias when
You have little data — strong assumptions compensate
You know the underlying structure (linearity, smoothness, periodicity)
You need interpretability
You need fast inference
Choose a weak inductive bias when
You have lots of data — the data can lead
The underlying structure is unknown or highly nonlinear
Highest accuracy matters more than understanding
You can afford the compute
# Same data, different inductive biases — wildly different generalization
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
models = {
"linear": LogisticRegression(),
"tree": DecisionTreeClassifier(max_depth=4),
"k-NN": KNeighborsClassifier(n_neighbors=5),
"RBF kernel": SVC(kernel="rbf", gamma=1.5),
}
for name, m in models.items():
m.fit(X, y)
print(f"{name:12s} train={m.score(X, y):.3f} test={m.score(X_t, y_t):.3f}")
Want the no-free-lunch theorem and capacity bounds?
Larger ℋ ⇒ more flexibility, but a worse generalization bound
No free lunch. Wolpert (1996) showed that averaged over all possible target functions, every learning algorithm has the same expected performance. A model only beats random because its inductive bias matches the structure of the problems we actually care about. Don't expect a single model to win everywhere.
Common inductive biases in ML.Smoothness (nearby inputs → nearby outputs) is assumed by k-NN, kernel methods, and most regularized models.
Compositionality (complex functions built from simple ones) is assumed by deep networks.
Translation equivariance is assumed by CNNs.
Permutation invariance is assumed by set / graph networks.
Sparsity (few active features) is assumed by L1-regularized models.
Capacity, complexity, generalization. Larger hypothesis spaces fit the training set better but generalize worse — that's the bias-variance trade-off seen through the lens of capacity. Classical bounds (VC dimension, Rademacher complexity) make this rigorous; the strange thing is that modern over-parameterized deep nets seem to violate them, which has become an active research area (double descent, benign overfitting).
Choosing inductive biases is feature engineering. Coordinates, kernels, network architectures, augmentations, and even loss functions all encode assumptions. Choosing them well is the under-appreciated half of doing ML.
# Inductive biases as feature engineering
import numpy as np
# Periodic signal: encode it as sin/cos features → linear model can fit
def periodic_features(t, periods=(1, 7, 30)):
feats = []
for p in periods:
feats.append(np.sin(2 * np.pi * t / p))
feats.append(np.cos(2 * np.pi * t / p))
return np.stack(feats, axis=-1)
# Translation equivariance: a CNN encodes "the same pattern at any location"
# Without it, you'd need exponentially more data to see every shift.
# Permutation invariance: for sets, use a model like Σ φ(x_i)
# A vanilla MLP would have to learn the same thing for every permutation.
Want PAC-learnability, VC dim, and double descent?
"Shattered" = ℋ can realise every possible labelling of those n points
The largest such n is the VC dimension — a measure of effective capacity
PAC learnability. A hypothesis class is PAC-learnable if there's an algorithm that, for any ε and δ, produces a hypothesis with error ≤ ε with probability ≥ 1 − δ from polynomially many samples in 1/ε, 1/δ, and the problem's complexity. Finite VC dimension is sufficient for PAC learnability under realizability assumptions; the modern view is broader (e.g., margin-based bounds, online learning).
VC and Rademacher complexity. VC dimension bounds how complex a binary classifier can be: linear classifiers in d dimensions have VC dim d + 1; a depth-L neural net with p parameters has VC dim roughly O(pL). Rademacher complexity is data-dependent and tighter in practice — it measures how well ℋ correlates with random labels on your specific dataset.
Double descent. Belkin et al. (2019) showed test error in modern models has a second descent beyond the classical interpolation threshold: as you increase capacity past the point where the training set is fit exactly, test error drops again. Inductive biases of SGD and over-parameterization (the network implicitly chooses the "simplest" of the many zero-loss solutions) seem to drive this.
Implicit regularization. SGD on an over-parameterized model doesn't pick any zero-training-loss solution — it picks one with specific structure (often, the minimum-norm one). Architectures, optimizers, and even data augmentations all bias the search through hypothesis space; the explicit regularizer in the loss is only part of the story.
Neural Tangent Kernel. In the infinite-width limit, neural networks trained by gradient descent behave like kernel regression with a specific kernel (the NTK). This connects deep learning to classical generalization theory — but the connection isn't tight at finite width, where feature learning kicks in.
# Implicit regularization: SGD on a deeply over-parameterized network often
# picks a minimum-norm-ish solution. Empirically: increase width and watch
# test error drop *past* the interpolation point — that's double descent.
import torch, torch.nn as nn
def make_net(width):
return nn.Sequential(
nn.Linear(d_in, width), nn.ReLU(),
nn.Linear(width, width), nn.ReLU(),
nn.Linear(width, d_out),
)
# Sweep widths past the "param count = data count" line
for w in [16, 32, 64, 128, 256, 512, 1024, 2048]:
net = make_net(w)
# train to (near-)zero training loss …
print(w, evaluate(net, X_test, y_test))