How to keep a model from memorising its training data — L1, L2, dropout, early stopping, and friends.
Key idea
Penalise complexity along with error. An unregularized model is free to fit noise. Add a penalty on the squared size of the parameters (L2) or on their absolute size (L1, which pushes many of them to exactly zero), or limit the network's effective capacity (dropout, early stopping), and the model trades a little training fit for much less overfitting.
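As a rough formalisation of the paragraph above (the symbols ℓ, f_w, Ω and n are notation introduced here for illustration, not taken from the original), the regularised objective can be written as:

```latex
\min_{w}\; \frac{1}{n}\sum_{i=1}^{n} \ell\bigl(f_w(x_i),\, y_i\bigr) \;+\; \lambda\,\Omega(w),
\qquad
\Omega(w) =
\begin{cases}
\lVert w \rVert_2^2 = \sum_j w_j^2 & \text{(L2 / ridge)} \\[2pt]
\lVert w \rVert_1 = \sum_j \lvert w_j \rvert & \text{(L1 / LASSO)}
\end{cases}
```

Setting λ = 0 recovers the unregularized fit; larger λ buys a simpler model at the cost of some training error.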
Slide λ — watch L2 shrink every coefficient and L1 zero some out completely
A linear model is fit to data with 12 features — only 4 are actually informative, the rest are noise. The bars show each coefficient's value. Slide λ from 0 to high under L1 and watch noise features snap to zero (sparsity). Under L2 every coefficient shrinks smoothly (no exact zeros, just smaller). Elastic Net does some of both — most useful when correlated features collapse into one under pure L1.
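The slider experiment can be approximated offline. The sketch below is only an illustration under assumed settings: the synthetic data, the seed, λ = 0.1, and the choice of ISTA (proximal gradient) as the LASSO solver are all assumptions made here, not details of the original demo.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 12
X = rng.standard_normal((n, d))
w_true = np.zeros(d)
w_true[:4] = [3.0, -2.0, 1.5, 1.0]      # only the first 4 features are informative
y = X @ w_true + 0.5 * rng.standard_normal(n)

lam = 0.1  # assumed penalty strength

# Ridge: minimise (1/2n)||y - Xw||^2 + (lam/2)||w||^2, solved in closed form
w_ridge = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

# LASSO: minimise (1/2n)||y - Xw||^2 + lam*||w||_1 with ISTA
def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

step = 1.0 / np.linalg.eigvalsh(X.T @ X / n).max()  # 1 / Lipschitz constant of the gradient
w_lasso = np.zeros(d)
for _ in range(500):
    grad = X.T @ (X @ w_lasso - y) / n
    w_lasso = soft_threshold(w_lasso - step * grad, step * lam)

print("ridge exact zeros:", int(np.sum(w_ridge == 0.0)))
print("lasso exact zeros:", int(np.sum(w_lasso == 0.0)))
```

The soft-threshold step is what produces exact zeros; ridge only rescales, so its noise coefficients end up small but nonzero.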
L2 (Ridge / weight decay). Add λ·||w||² to the loss. Every coefficient gets pulled toward zero, smoothly. Coefficients essentially never become exactly zero, but they all shrink further as λ grows. The default for neural networks (often called "weight decay") and the safest general-purpose regularizer.
L1 (LASSO). Add λ·||w||₁. The kink in the penalty at zero lets coefficients land exactly on zero, so the model does feature selection while it fits. Useful when you suspect most features are irrelevant.
Dropout. Randomly zero a fraction of activations at each training step. This forces the network not to over-rely on any single unit. A long-standing regularization trick for neural networks, still useful for MLPs and some Transformers.
Early stopping. Stop training when validation loss stops decreasing, and keep the best checkpoint. This limits how far the optimizer can wander from its initialization; for linear models with a quadratic loss it behaves approximately like L2 regularization, and the analogy is often carried over to deep networks.
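Early stopping in particular is easy to get subtly wrong, so here is a minimal patience-based sketch. The class name `EarlyStopper`, its `patience` and `min_delta` parameters, and the validation curve are all hypothetical, made up for illustration.

```python
class EarlyStopper:
    """Signals a stop once validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.best_epoch = -1
        self.bad_epochs = 0

    def step(self, epoch, val_loss):
        if val_loss < self.best - self.min_delta:
            # improvement: remember it and reset the counter
            self.best, self.best_epoch, self.bad_epochs = val_loss, epoch, 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


val_curve = [1.0, 0.8, 0.7, 0.72, 0.71, 0.75, 0.9]  # made-up validation losses
stopper = EarlyStopper(patience=3)
for epoch, loss in enumerate(val_curve):
    if stopper.step(epoch, loss):
        break

print("stopped at epoch", epoch, "best epoch", stopper.best_epoch)
```

In a real training loop you would also save a checkpoint whenever the loss improves and restore it after stopping.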
Reach for it when
You see train ≪ test error (overfitting)
You have many features and suspect most are noise → L1
You want smooth, stable solutions → L2
Deep network — start with weight decay 1e-4 to 1e-5
Be careful when
You're under-fitting — regularization makes it worse
Features are on different scales — standardise first, or the penalty is uneven
L1 with correlated features picks arbitrarily — use Elastic Net
The regularization path is the solution as λ sweeps from 0 to ∞
Bayesian interpretation. L2 regularization is equivalent to a Gaussian prior on the weights centred at zero (MAP estimation under that prior matches ridge regression). L1 is equivalent to a Laplace prior, which is sharply peaked at zero with heavier tails than a Gaussian; that peak is why the MAP estimate is sparse. This framing makes "choose λ" the same as "choose your prior precision" (relative to the noise level).
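The correspondence can be made explicit with a short derivation. It assumes a linear model with Gaussian noise of variance σ² and an isotropic prior of variance τ² (or Laplace scale b); these symbols are introduced here for illustration.

```latex
% Model: y_i = w^\top x_i + \varepsilon_i,\ \varepsilon_i \sim \mathcal{N}(0, \sigma^2);
% prior: w \sim \mathcal{N}(0, \tau^2 I)
-\log p(w \mid X, y)
  = \frac{1}{2\sigma^2}\sum_{i=1}^{n} \bigl(y_i - w^\top x_i\bigr)^2
  + \frac{1}{2\tau^2}\lVert w \rVert_2^2 + \text{const}

\hat{w}_{\text{MAP}}
  = \arg\min_w \sum_{i=1}^{n} \bigl(y_i - w^\top x_i\bigr)^2 + \lambda \lVert w \rVert_2^2,
  \qquad \lambda = \frac{\sigma^2}{\tau^2}

% Laplace prior p(w_j) \propto \exp(-\lvert w_j \rvert / b) gives the LASSO objective instead:
\hat{w}_{\text{MAP}}
  = \arg\min_w \sum_{i=1}^{n} \bigl(y_i - w^\top x_i\bigr)^2 + \lambda \lVert w \rVert_1,
  \qquad \lambda = \frac{2\sigma^2}{b}
```

A tighter prior (small τ or b) means a larger λ, which matches the intuition that stronger beliefs about small weights imply stronger regularization.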
Why L1 gives exact zeros. The L1 ball has corners on the axes; the L2 ball is round. A quadratic loss-contour touching the L1 ball is most likely to touch at a corner (where some coordinates are zero). That geometric argument is the reason LASSO does feature selection while ridge doesn't.
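The same point shows up algebraically in one dimension. As a sketch (the function names and test values are illustrative), take a single coefficient with unpenalised optimum z and compare the two penalised solutions:

```python
def ridge_shrink(z, lam):
    # argmin_w 0.5*(w - z)**2 + 0.5*lam*w**2  ->  proportional shrinkage
    return z / (1.0 + lam)


def lasso_shrink(z, lam):
    # argmin_w 0.5*(w - z)**2 + lam*abs(w)  ->  soft-thresholding
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0  # anything inside [-lam, lam] is set to exactly zero


for z in [2.0, 0.5, 0.05, -0.3]:
    print(z, ridge_shrink(z, 0.4), lasso_shrink(z, 0.4))
```

Ridge rescales every value and so never reaches zero for nonzero z, while the L1 solution has a dead zone of width 2λ around zero.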
Weight decay in deep learning. Adding L2 to the loss and using Adam is subtly wrong — Adam's scaling makes the effective decay rate parameter-dependent. AdamW (Loshchilov & Hutter, 2019) decouples weight decay from the gradient: instead of adding λ·w to the gradient, it multiplies w by (1 − η·λ) directly. The fix matters; transformer recipes assume AdamW.
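A small experiment makes the difference visible. This sketch assumes PyTorch's built-in `torch.optim.Adam` and `torch.optim.AdamW`; the weights, learning rate, and decay value are arbitrary, and the data gradient is set to zero so that only the decay term acts.

```python
import torch

lr, wd = 0.1, 0.1
w0 = torch.tensor([1.0, 0.01])  # one large and one small weight


def one_step(opt_cls):
    w = torch.nn.Parameter(w0.clone())
    opt = opt_cls([w], lr=lr, weight_decay=wd)
    w.grad = torch.zeros_like(w)  # pretend the data loss has zero gradient
    opt.step()
    return w.detach()


w_adamw = one_step(torch.optim.AdamW)  # decoupled: w <- w * (1 - lr * wd)
w_adam = one_step(torch.optim.Adam)    # L2 term enters the gradient, then gets rescaled

print("AdamW:", w_adamw)
print("Adam :", w_adam)
```

With AdamW both weights shrink by the same factor. With Adam plus L2 the adaptive rescaling turns the first step into a move of roughly `lr` for every weight regardless of its size, so the small weight overshoots zero.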
Dropout as ensembling. Each training step samples a different sub-network, so training effectively fits exponentially many weight-sharing sub-networks. At inference time dropout is off and the full network, with activations scaled to match, approximates the average of that ensemble. This is one of several reasons dropout works.
Batch / layer norm as implicit regularization. Normalization layers don't penalise weight magnitude; instead they make the output insensitive to the scale of the weights feeding into them (because the activations are rescaled). The effect is a form of implicit regularization, and it interacts oddly with explicit weight decay: on scale-invariant weights, decay mostly changes the effective learning rate rather than the function. Many recipes therefore exclude normalization gains and biases from weight decay.
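The train/inference asymmetry is easy to check directly. A minimal sketch using `torch.nn.Dropout`, which implements "inverted" dropout (survivors are scaled by 1/(1 − p) during training so no rescaling is needed at inference); the input of all ones and p = 0.5 are arbitrary choices.

```python
import torch

torch.manual_seed(0)
p = 0.5
drop = torch.nn.Dropout(p)
x = torch.ones(10_000)

drop.train()
y_train = drop(x)  # each entry is either 0 or 1 / (1 - p)

drop.eval()
y_eval = drop(x)   # identity at inference

print(y_train.unique(), y_train.mean().item(), y_eval.mean().item())
```

The training-time mean stays close to the inference-time value, which is what lets the full network stand in for the average over sampled sub-networks.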
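The scale-insensitivity claim can be checked in a few lines. This is a sketch with arbitrary layer sizes and a bias-free linear layer (an assumption made so that scaling the weight scales the whole pre-activation):

```python
import torch

torch.manual_seed(0)
x = torch.randn(8, 16)
lin = torch.nn.Linear(16, 32, bias=False)
norm = torch.nn.LayerNorm(32)

out = norm(lin(x))
with torch.no_grad():
    lin.weight.mul_(10.0)  # make the weights 10x larger
out_scaled = norm(lin(x))

print((out - out_scaled).abs().max().item())  # tiny: only the eps term differs
```

Because the output barely changes, an L2 penalty on `lin.weight` cannot be constraining the function itself here; it only changes how large the weights are relative to their updates.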
Data augmentation as regularization. Random crops, flips, mix-ups, and adversarial perturbations all impose smoothness or invariance on the function the model learns, without adding a penalty term to the loss. Often the highest-leverage regularization in practice, especially for vision.
import torch
import torch.nn as nn
import torch.nn.functional as F

# Mixup: blend pairs of inputs; the loss blends their labels the same way
def mixup(x, y, alpha=0.2):
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    idx = torch.randperm(x.size(0), device=x.device)  # random partner for each sample
    x_m = lam * x + (1 - lam) * x[idx]
    return x_m, y, y[idx], lam

def mixup_loss(logits, y_a, y_b, lam):
    return lam * F.cross_entropy(logits, y_a) + (1 - lam) * F.cross_entropy(logits, y_b)

# Label smoothing: keeps the model from becoming over-confident in a single class
def smoothed_loss(logits, targets):
    return F.cross_entropy(logits, targets, label_smoothing=0.1)

# Stochastic depth: drop a whole residual branch at random, used as x + sd(f(x))
class StochasticDepth(nn.Module):
    def __init__(self, p):
        super().__init__()
        self.p = p  # probability of dropping the branch

    def forward(self, x):
        if not self.training or self.p == 0.0:
            return x
        if torch.rand(1).item() < self.p:
            return torch.zeros_like(x)
        return x / (1 - self.p)  # rescale so the expected output matches eval mode
PAC-Bayes view: P is a prior over parameters; Q is the posterior after training
The generalization gap is bounded by a term that grows with KL(Q‖P), the divergence of the posterior from the prior
"Stay close to the prior" ⇔ "generalize well" ⇔ regularize
Regularization path. The set of solutions as λ varies from 0 to ∞. For LASSO, the path is piecewise linear in λ — you can compute the whole path in one pass (LARS, Efron et al. 2004). For ridge, the solution is a closed-form function of λ — useful for fast cross-validation across many λ values.
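For ridge, the "closed-form function of λ" is cheap to exploit. The sketch below assumes the objective ||y − Xw||² + λ||w||² and synthetic data; it reuses one SVD of X for every λ on a grid.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 8))
y = X @ rng.standard_normal(8) + 0.1 * rng.standard_normal(100)

# One SVD, reused for every lambda: w(lam) = V diag(s / (s^2 + lam)) U^T y
U, s, Vt = np.linalg.svd(X, full_matrices=False)
Uty = U.T @ y


def ridge_svd(lam):
    return Vt.T @ (s / (s**2 + lam) * Uty)


lams = np.logspace(-3, 3, 7)
path = np.stack([ridge_svd(lam) for lam in lams])  # shape (7, 8)

# Cross-check one point on the path against a direct normal-equations solve
direct = np.linalg.solve(X.T @ X + lams[3] * np.eye(8), X.T @ y)
print(np.abs(path[3] - direct).max())
print(np.linalg.norm(path, axis=1))  # coefficient norm shrinks as lambda grows
```

Each extra λ costs only a matrix-vector product, which is why sweeping many values during cross-validation is inexpensive for ridge.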
Group LASSO & structured sparsity. Penalize the sum of the (unsquared) L2 norms of groups of coefficients, which amounts to an L1 norm across groups. Forces entire groups to be zero or active together. Useful when features come in natural blocks (e.g., one-hot encoded categories, time-series lags).
Spectral norm regularization. Constrain the largest singular value of each weight matrix — useful for stable adversarial training, GAN discriminators (Miyato et al. 2018 — spectral normalization), and Lipschitz-bounded networks for verified robustness.
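The largest singular value is usually estimated by power iteration rather than a full SVD. A minimal NumPy sketch of that idea follows; the function name `spectral_norm_estimate`, the iteration count, and the toy matrix are illustrative assumptions, and library implementations differ in detail (for example, by carrying the vectors across training steps).

```python
import numpy as np


def spectral_norm_estimate(W, n_iters=50, seed=0):
    """Estimate the largest singular value of W by power iteration."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(W.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(n_iters):
        u = W @ v
        u /= np.linalg.norm(u)
        v = W.T @ u
        v /= np.linalg.norm(v)
    return float(u @ W @ v)


W = np.array([[3.0, 0.0], [4.0, 5.0]])  # singular values are sqrt(45) and sqrt(5)
sigma = spectral_norm_estimate(W)
W_sn = W / sigma                        # normalised matrix has spectral norm 1

print(sigma, np.linalg.svd(W_sn, compute_uv=False)[0])
```

Dividing by the estimate caps how much the layer can stretch any input, which is the Lipschitz bound the paragraph refers to.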
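Sharpness-Aware Minimization (SAM). Foret et al. (2021) explicitly seek flat minima by minimising the worst-case loss within a small neighbourhood of the weights. To first order this adds a penalty on the gradient norm. It often improves generalization, especially on smaller models or vision tasks, at roughly twice the gradient cost per step.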
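A single SAM update can be sketched in a few lines. This is a simplified illustration with plain gradient descent on one tensor; the function name `sam_step`, the toy quadratic loss, and the values of `lr` and `rho` are assumptions, not the reference implementation.

```python
import torch


def sam_step(w, loss_fn, lr=0.1, rho=0.05):
    # 1) gradient at the current weights
    loss = loss_fn(w)
    (g,) = torch.autograd.grad(loss, w)
    # 2) move to the first-order worst point inside an L2 ball of radius rho
    eps = rho * g / (g.norm() + 1e-12)
    w_adv = (w + eps).detach().requires_grad_(True)
    # 3) gradient at the perturbed point, applied to the original weights
    (g_adv,) = torch.autograd.grad(loss_fn(w_adv), w_adv)
    with torch.no_grad():
        w -= lr * g_adv
    return loss.item()


def quadratic(w):
    return (w ** 2).sum()


w = torch.tensor([1.0, -2.0], requires_grad=True)
before = sam_step(w, quadratic)
after = quadratic(w).item()
print(before, after, w.detach())
```

The two gradient evaluations per update are where the extra cost comes from.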
Implicit regularization of SGD. The noise in stochastic gradient descent acts like a diffusion process — it preferentially picks "flat" minima (low Hessian eigenvalues). This is one explanation for why SGD-trained networks sometimes generalize better than Adam-trained ones, even at higher training loss.
MDL & information bottleneck. Minimum Description Length frames regularization as compression — a model is a good fit if (model + residuals) describes the data in fewer bits than the raw data. The Information Bottleneck (Tishby) extends this: compress the input while preserving information about the target.
import torch
from torch.nn.utils import spectral_norm

# Spectral normalization: rescales each layer's weight to spectral norm ≈ 1,
# which bounds the Lipschitz constant of each layer
discriminator = torch.nn.Sequential(
    spectral_norm(torch.nn.Conv2d(3, 64, 4, 2, 1)),
    torch.nn.LeakyReLU(0.2),
    spectral_norm(torch.nn.Conv2d(64, 128, 4, 2, 1)),
    torch.nn.LeakyReLU(0.2),
)

# Group LASSO via proximal gradient: penalise blocks of weights together.
# Call this after the ordinary gradient step on the data loss.
def group_lasso_step(weights, lr, lam, groups):
    # groups is a list of index lists (or index tensors) that should shrink together
    with torch.no_grad():
        for g in groups:
            w_g = weights[g]
            norm = w_g.norm()
            # block soft-threshold: reduce the group's norm by lr * lam, floored at zero
            shrink = torch.clamp(1 - lr * lam / (norm + 1e-12), min=0.0)
            weights[g] = w_g * shrink