Key idea

The loss is the question you're asking. "How wrong was that prediction?" has many possible answers — squared error, absolute error, cross-entropy, hinge. Each answer pulls the model in a different direction. Choosing the loss is choosing what mistakes you care about.

[Interactive chart: loss curves in Regression and Classification modes, with a movable outlier; Huber δ = 0.5]

In regression mode the x-axis is the residual y − ŷ: MSE punishes big errors quadratically (sensitive to outliers); MAE punishes linearly (robust); Huber blends them, quadratic near zero and linear far away. In classification mode the x-axis is the margin y·f(x): cross-entropy keeps penalising even confident-correct predictions a little, hinge stops at margin 1, and the 0/1 loss is what you really want but can't optimise (it's flat almost everywhere).

For regression. The default is mean squared error: it's smooth, differentiable everywhere, minimised by the conditional mean, and the maximum-likelihood choice under Gaussian noise. Mean absolute error is more robust to outliers (it is minimised by the conditional median instead). Huber loss splits the difference: quadratic for small errors, linear for big ones.
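As a minimal sketch of the three shapes, here are the per-example losses written as plain functions of the residual r = y − ŷ. The function names and the `delta=1.0` default are illustrative choices, not taken from any library.

```python
def squared_error(r):
    """Squared error on a single residual r = y - y_hat."""
    return r * r

def absolute_error(r):
    """Absolute error on a single residual."""
    return abs(r)

def huber(r, delta=1.0):
    """Standard Huber form: quadratic for |r| <= delta, linear beyond it."""
    a = abs(r)
    if a <= delta:
        return 0.5 * r * r
    return delta * (a - 0.5 * delta)

# Compare how each loss scales as the residual grows.
for r in (0.1, 1.0, 10.0):
    print(f"r={r:5.1f}  squared={squared_error(r):7.2f}  "
          f"absolute={absolute_error(r):5.2f}  huber={huber(r):5.2f}")
```

The two Huber branches meet at |r| = delta with the same value and slope, which is what keeps the blend smooth.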

For classification. The default is cross-entropy (also called log-loss) — it pairs with softmax, has well-behaved gradients, and corresponds to maximum-likelihood for a categorical model. Hinge loss is the SVM choice — it stops penalising once you're past the margin. Focal loss down-weights easy examples and is useful for highly imbalanced data.
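The classification losses can be sketched the same way, as functions of the margin m = y·f(x) with labels in {−1, +1}. This is an illustrative sketch; the function names are made up here, and `logistic` is binary cross-entropy rewritten in margin form.

```python
import math

def zero_one(m):
    """1 if the example is misclassified (margin <= 0), else 0."""
    return 1.0 if m <= 0 else 0.0

def hinge(m):
    """Zero once the margin reaches 1, linear below that."""
    return max(0.0, 1.0 - m)

def logistic(m):
    """Binary cross-entropy in margin form: log(1 + exp(-m))."""
    return math.log1p(math.exp(-m))

# Hinge goes flat past margin 1; logistic keeps a small positive penalty.
for m in (-2.0, 0.0, 1.0, 3.0):
    print(f"m={m:5.1f}  0/1={zero_one(m):.0f}  "
          f"hinge={hinge(m):.3f}  logistic={logistic(m):.3f}")
```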

The loss you pick changes what the model learns to do: minimise MSE and you predict means; minimise MAE and you predict medians; minimise cross-entropy and you predict calibrated probabilities (in the limit).
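A small check of the mean-versus-median claim, using a made-up toy dataset with one outlier and a brute-force search over constant predictions (the data and the grid resolution are assumptions for illustration):

```python
import statistics

y = [1.0, 2.0, 3.0, 4.0, 100.0]   # toy targets with one outlier

def mse(c):
    """Mean squared error of the constant prediction c."""
    return sum((v - c) ** 2 for v in y) / len(y)

def mae(c):
    """Mean absolute error of the constant prediction c."""
    return sum(abs(v - c) for v in y) / len(y)

# Try every constant prediction from 0 to 100 in steps of 0.01.
grid = [i / 100 for i in range(0, 10001)]
best_mse = min(grid, key=mse)
best_mae = min(grid, key=mae)

print("MSE minimiser:", best_mse, " mean:  ", statistics.mean(y))
print("MAE minimiser:", best_mae, " median:", statistics.median(y))
```

The outlier drags the MSE minimiser far from the bulk of the data, while the MAE minimiser stays with it.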

Common defaults

  • Regression: MSE if noise is Gaussian-ish, Huber if there are outliers
  • Binary classification: binary cross-entropy with sigmoid
  • Multi-class: categorical cross-entropy with softmax
  • Ranking: pairwise / listwise losses (e.g. RankNet, ListNet), usually evaluated with NDCG

Worth a closer look when

  • Heavy class imbalance — try focal loss
  • Heavy outliers — try Huber or quantile losses
  • You care about a custom business metric — write a custom loss
  • The mean / median isn't the right summary — try quantile regression
import torch
import torch.nn.functional as F

# Regression
mse   = F.mse_loss(y_pred, y_true)                          # default
mae   = F.l1_loss(y_pred, y_true)                           # robust to outliers
huber = F.huber_loss(y_pred, y_true, delta=1.0)             # blend (same as smooth_l1_loss when beta = delta = 1)

# Binary classification
bce   = F.binary_cross_entropy_with_logits(logits, y_true)   # numerically stable

# Multi-class
ce    = F.cross_entropy(logits, y_true)                      # logits, NOT softmax-ed

# Class-imbalanced: binary focal loss (y is a float tensor of 0s and 1s)
def focal(logits, y, gamma=2.0):
    bce = F.binary_cross_entropy_with_logits(logits, y, reduction='none')
    p   = torch.sigmoid(logits)
    pt  = y * p + (1 - y) * (1 - p)
    return ((1 - pt).pow(gamma) * bce).mean()
Want the probabilistic story and proper scoring rules?
Loss as negative log-likelihood $$ \mathcal{L}(\theta) = -\frac{1}{N}\sum_{i=1}^{N} \log p_\theta(y_i \mid x_i) $$
  • For Gaussian p(y|x) with fixed variance, this is MSE up to a constant
  • For Bernoulli p(y|x), this is binary cross-entropy
  • For categorical p(y|x), this is multi-class cross-entropy

Most "standard" losses are really negative log-likelihoods. MSE = Gaussian NLL (with fixed variance). Binary cross-entropy = Bernoulli NLL. Multi-class cross-entropy = categorical NLL. Picking a loss is implicitly picking a noise model, and that's clarifying. Heavy-tailed errors? Use a Laplace likelihood (gives MAE) or Student-t. Counts? Poisson. Multiple outputs with correlation? A multivariate Gaussian with a learned covariance.
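As a worked instance of the Gaussian case, assume the model predicts a mean $f_\theta(x)$ with a fixed variance $\sigma^2$ (both symbols are introduced here for illustration). The per-example negative log-likelihood is

$$ -\log p_\theta(y_i \mid x_i) = \frac{\big(y_i - f_\theta(x_i)\big)^2}{2\sigma^2} + \frac{1}{2}\log\!\big(2\pi\sigma^2\big) $$

and averaging over the data gives

$$ \mathcal{L}(\theta) = \frac{1}{2\sigma^2}\cdot\frac{1}{N}\sum_{i=1}^{N}\big(y_i - f_\theta(x_i)\big)^2 + \frac{1}{2}\log\!\big(2\pi\sigma^2\big) $$

Since $\sigma$ is fixed, the second term is constant and the first is MSE times a positive scale, so both objectives have the same minimiser in $\theta$.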

Proper scoring rules. A loss is "proper" if its expected value is minimised by reporting the true distribution: log-loss and Brier score are proper for probabilistic classification, accuracy isn't. Using a proper loss is what lets calibrated probabilities come out of training.
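A quick numerical sketch of what "proper" means, assuming a single binary event whose true probability is 0.7 (an arbitrary value chosen for illustration). The Brier score here is the one-probability binary form, which differs from a sum over both classes only by a factor of 2.

```python
import math

p_true = 0.7                              # assumed true P(y = 1)
grid = [i / 100 for i in range(1, 100)]   # candidate reported probabilities q

def expected_log_loss(q):
    """Expected log-loss of reporting q when the truth is p_true."""
    return -(p_true * math.log(q) + (1 - p_true) * math.log(1 - q))

def expected_brier(q):
    """Expected squared error of reporting q when the truth is p_true."""
    return p_true * (1 - q) ** 2 + (1 - p_true) * q ** 2

def expected_error_rate(q):
    """Expected 0/1 error when predicting class 1 iff q >= 0.5."""
    return (1 - p_true) if q >= 0.5 else p_true

best_log = min(grid, key=expected_log_loss)
best_brier = min(grid, key=expected_brier)

print("log-loss minimised at q =", best_log)
print("Brier minimised at q    =", best_brier)
# The error rate cannot tell q = 0.51 from q = 0.99, so it is not proper.
print(expected_error_rate(0.51) == expected_error_rate(0.99))
```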

Gradient behaviour matters. MSE's gradient grows with error (large mistakes pull harder); MAE's gradient is constant (steady push regardless of magnitude); cross-entropy's gradient with respect to the logits has the nice "prediction − target" form. These shape what training looks like: MAE is harder for gradient descent because its gradient does not shrink as you approach the minimum and is undefined at exactly zero error, so fixed-size steps tend to oscillate around the optimum.
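These gradient shapes can be written down directly. The sketch below uses scalar inputs and hypothetical helper names, and checks the binary cross-entropy gradient with respect to the logit against a finite difference.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_mse(r):
    """d/d(y_hat) of (y - y_hat)^2, with r = y - y_hat: scales with the error."""
    return -2.0 * r

def grad_mae(r):
    """d/d(y_hat) of |y - y_hat|: constant magnitude; set to 0 at r = 0 by convention."""
    return 0.0 if r == 0 else (-1.0 if r > 0 else 1.0)

def bce(z, y):
    """Binary cross-entropy of logit z against label y in {0, 1}."""
    s = sigmoid(z)
    return -(y * math.log(s) + (1 - y) * math.log(1 - s))

def grad_bce_logit(z, y):
    """d/dz of bce(z, y): predicted probability minus target."""
    return sigmoid(z) - y

# Central finite difference as an independent check of the closed form.
eps = 1e-6
z, y = 0.8, 1.0
numeric = (bce(z + eps, y) - bce(z - eps, y)) / (2 * eps)
print(numeric, grad_bce_logit(z, y))
```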

Loss vs. metric. The loss is what you optimise (for gradient-based training it must be differentiable, at least almost everywhere); the metric is what you report. They're often different: you optimise log-loss but evaluate accuracy or F1. Misalignment is fine when the loss is a good proxy, but worth watching: a model with a better log-loss can still score worse on accuracy or F1 at your chosen threshold.
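To make the gap between loss and metric concrete, here is a toy example with invented predictions from two hypothetical models. Model A is confident and mostly right; model B is barely right on everything.

```python
import math

y       = [1, 1, 1, 0, 0, 0]
probs_a = [0.95, 0.95, 0.45, 0.05, 0.05, 0.55]   # confident, two near-misses
probs_b = [0.55, 0.55, 0.55, 0.45, 0.45, 0.45]   # timid, always on the right side

def log_loss(labels, probs):
    """Mean binary cross-entropy of predicted P(y = 1)."""
    return -sum(t * math.log(q) + (1 - t) * math.log(1 - q)
                for t, q in zip(labels, probs)) / len(labels)

def accuracy(labels, probs, threshold=0.5):
    """Fraction of examples classified correctly at the given threshold."""
    return sum((q >= threshold) == bool(t)
               for t, q in zip(labels, probs)) / len(labels)

for name, probs in (("A", probs_a), ("B", probs_b)):
    print(f"model {name}: log-loss={log_loss(y, probs):.3f}  "
          f"accuracy={accuracy(y, probs):.3f}")
```

Model A wins on log-loss and model B wins on accuracy, so the ranking depends on which number you look at.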

Custom losses. When the off-the-shelf options don't match your business problem, write one. Common patterns: weighted losses (per-class or per-sample), penalised losses (add a regularizer), quantile losses (predict any quantile, not just the mean).

import torch
import torch.nn.functional as F

# Quantile regression — predict the τ-th quantile, not the mean
def quantile_loss(y_pred, y_true, tau=0.5):
    e = y_true - y_pred
    return torch.maximum(tau * e, (tau - 1) * e).mean()

# Train three models at τ = 0.1, 0.5, 0.9 and you get prediction intervals.

# Per-class weighted cross-entropy for class imbalance
weights = torch.tensor([1.0, 3.0, 1.0])      # class 1 is rare
loss    = F.cross_entropy(logits, y, weight=weights)

# Brier score: a proper scoring rule that rewards calibration; usable for early stopping
def brier(probs, y_onehot):
    return ((probs - y_onehot) ** 2).sum(dim=-1).mean()
Want robust losses, MAML inner losses, and adversarial losses?
Influence function $$ \mathcal{I}(z; \theta) = -H_\theta^{-1} \, \nabla_\theta \ell(z; \theta) $$
  • H_θ: Hessian of the average loss at the trained parameters
  • How much does the trained parameter change if we up-weight training point z?
  • A measurable consequence of the loss's gradient shape near the optimum

Robust statistics. The influence function quantifies how much a single training point can move the trained parameter. MSE's quadratic shape gives unbounded influence — one extreme point can swing the fit arbitrarily. MAE's bounded influence is what makes it robust. Huber, Tukey's biweight, and other M-estimators sit on this spectrum, trading efficiency at the Gaussian model for robustness against contamination.
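For a simple location estimate, the influence of a point is proportional to the derivative of the loss at its residual, often written ψ. The sketch below compares ψ for the losses named above; the function names are illustrative, and c = 4.685 is the conventional tuning constant for Tukey's biweight, used here as an assumption.

```python
def psi_squared(r):
    """Derivative of 0.5 * r^2: grows without bound as |r| grows."""
    return r

def psi_absolute(r):
    """Derivative of |r|: the sign of r, so bounded by 1 in magnitude."""
    return (r > 0) - (r < 0)

def psi_huber(r, delta=1.0):
    """Derivative of Huber loss: the residual clipped to [-delta, delta]."""
    return max(-delta, min(delta, r))

def psi_tukey(r, c=4.685):
    """Derivative of Tukey's biweight: redescends to exactly 0 beyond c."""
    if abs(r) > c:
        return 0.0
    u = r / c
    return r * (1 - u * u) ** 2

# A gross outlier dominates under squared loss and is ignored by Tukey.
for r in (0.5, 2.0, 100.0):
    print(f"r={r:6.1f}  squared={psi_squared(r):6.1f}  absolute={psi_absolute(r):2d}  "
          f"huber={psi_huber(r):4.1f}  tukey={psi_tukey(r):6.3f}")
```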

Focal loss. Lin et al. (2017) introduced FL(p) = −(1 − p)^γ log p, where p is the probability assigned to the true class, for object detection's extreme class imbalance. The (1 − p)^γ factor down-weights easy examples (where the model is already confident), letting the gradient focus on the hard ones. γ = 2 is a strong default.
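A scalar sketch of the down-weighting, where p is the probability the model gives to the true class (the function names are illustrative). The ratio of focal loss to plain cross-entropy is the modulating factor itself.

```python
import math

def cross_entropy_term(p):
    """Plain cross-entropy for one example: -log p."""
    return -math.log(p)

def focal_term(p, gamma=2.0):
    """Focal loss for one example: -(1 - p)^gamma * log p."""
    return -((1.0 - p) ** gamma) * math.log(p)

# Easy examples (p near 1) are scaled down far more than hard ones.
for p in (0.1, 0.5, 0.9, 0.99):
    ratio = focal_term(p) / cross_entropy_term(p)   # equals (1 - p) ** gamma
    print(f"p={p:4.2f}  ce={cross_entropy_term(p):.4f}  "
          f"focal={focal_term(p):.6f}  ratio={ratio:.4f}")
```

With gamma = 0 the factor is 1 and the focal term reduces to ordinary cross-entropy.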

Adversarial losses. The generator's objective in a GAN is a function of another model's output — the loss landscape becomes a moving target. WGAN, hinge loss, and least-squares GAN are all attempts to give the generator stable gradients. The lesson generalises: any loss whose target depends on another learned thing gets the same difficulty.

Surrogate losses. Hinge, log-loss, exponential, etc. are all surrogates for the 0/1 loss you actually want for classification. Bartlett, Jordan & McAuliffe (2006) characterised which surrogates are "calibrated" — their minimiser agrees with the Bayes classifier. Bad surrogates can converge to the wrong thing even with infinite data.
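One way to see calibration numerically: fix the conditional probability η = P(y = +1 | x), minimise the conditional surrogate risk over the score f, and check that the sign of the minimiser matches the Bayes decision. This is a rough grid-search sketch, not the formal definition from the paper; the grid and the η values are arbitrary.

```python
import math

def hinge(m):
    return max(0.0, 1.0 - m)

def logistic(m):
    return math.log1p(math.exp(-m))

def exponential(m):
    return math.exp(-m)

def best_score(phi, eta, grid):
    """Score f minimising the conditional risk eta*phi(f) + (1 - eta)*phi(-f)."""
    return min(grid, key=lambda f: eta * phi(f) + (1 - eta) * phi(-f))

grid = [i / 100 for i in range(-300, 301)]   # candidate scores in [-3, 3]
results = []
for eta in (0.2, 0.4, 0.6, 0.9):
    for phi in (hinge, logistic, exponential):
        f_star = best_score(phi, eta, grid)
        agrees = (f_star > 0) == (eta > 0.5)   # Bayes rule: predict +1 iff eta > 0.5
        results.append(agrees)
        print(f"eta={eta}  {phi.__name__:12s} f*={f_star:+.2f}  agrees={agrees}")
```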

Auxiliary and contrastive losses. Modern self-supervised learning is built on auxiliary losses — InfoNCE, triplet, supervised contrastive, all encode different ideas of "similar" and "different". The form of the contrastive loss determines what representation emerges (alignment, uniformity, dimensional collapse).

import torch
import torch.nn.functional as F

# InfoNCE — the canonical contrastive loss
def info_nce(query, positives, temperature=0.07):
    # query: (B, d). positives: (B, d), same indexing.
    # Assumes both are already L2-normalised, so the dot product is a cosine similarity.
    # Negatives are everyone else in the batch.
    logits = (query @ positives.t()) / temperature       # (B, B)
    labels = torch.arange(len(query), device=query.device)
    return F.cross_entropy(logits, labels)

# Triplet loss — anchor, positive, negative
def triplet(anchor, pos, neg, margin=0.2):
    d_p = (anchor - pos).norm(dim=-1)
    d_n = (anchor - neg).norm(dim=-1)
    return F.relu(d_p - d_n + margin).mean()