Calibration — ML Resources Hub

Key idea

Accuracy says "right or wrong". Calibration says "well-aimed". A model is calibrated if, among the cases where it predicts 70%, about 70% turn out positive. Many models, especially neural nets, are over-confident: they say 99% on cases where the true rate is closer to 85%. The fix is usually post-hoc: apply a calibrator to the model's output without retraining.


The reliability diagram bins predicted probabilities along the x-axis and plots the observed positive rate on the y-axis. A perfectly calibrated model sits on the diagonal. Over-confident models fall below it at the high-confidence end (say 90%, actually 75% positive); under-confident models sit above it there. Temperature scaling divides logits by a learned scalar: a one-parameter post-hoc fix that is often enough.
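The binning behind a reliability diagram can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `reliability_bins` and the synthetic over-confident data are assumptions made for the example.

```python
import numpy as np

def reliability_bins(y_true, p_pred, n_bins=10):
    """Return (mean predicted prob, observed positive rate, count) per non-empty bin."""
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.asarray(p_pred, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Digitizing against the interior edges yields bin indices 0 .. n_bins - 1.
    idx = np.digitize(p_pred, edges[1:-1])
    rows = []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            rows.append((p_pred[mask].mean(), y_true[mask].mean(), int(mask.sum())))
    return rows

# Synthetic over-confident model: the true positive rate is pulled toward 0.5.
rng = np.random.default_rng(0)
p_pred = rng.uniform(0.0, 1.0, size=20_000)
true_rate = 0.5 + 0.6 * (p_pred - 0.5)
y_true = (rng.uniform(size=p_pred.size) < true_rate).astype(int)

for conf, freq, n in reliability_bins(y_true, p_pred):
    print(f"predicted {conf:.2f}  observed {freq:.2f}  n={n}")
```

Plotting the second column against the first gives the diagram; on this toy data the top bins land below the diagonal and the bottom bins above it.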

Modern neural networks are typically over-confident. Cross-entropy training, particularly with strong models on easy data, pushes predicted probabilities toward 0 or 1 — past what the data warrants.

Temperature scaling: divide logits by a learned T > 0 before softmax, and fit T on validation data to minimise log-loss. It has a single parameter, leaves predictions unchanged (dividing by a positive T preserves the argmax), and is often enough for neural networks.

Platt scaling: fit a logistic regression on the model's scores. Best for SVMs and other models whose output isn't a probability.
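As a rough sketch of the mechanics, Platt scaling can be written as a one-feature logistic regression on held-out scores. The dataset, the split, and the choice of `LinearSVC` are assumptions for illustration. Note also that scikit-learn's `LogisticRegression` applies L2 regularisation by default, which differs slightly from Platt's original fitting procedure.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Toy data: one half trains the SVM, the other half fits the calibrator.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_fit, X_cal, y_fit, y_cal = train_test_split(X, y, test_size=0.5, random_state=0)

svm = LinearSVC().fit(X_fit, y_fit)
scores_cal = svm.decision_function(X_cal).reshape(-1, 1)   # signed margins, not probabilities

# Platt scaling: sigmoid(a * score + b) fitted on held-out scores.
platt = LogisticRegression().fit(scores_cal, y_cal)
p_cal = platt.predict_proba(scores_cal)[:, 1]

print(f"slope a = {platt.coef_[0, 0]:.3f}, intercept b = {platt.intercept_[0]:.3f}")
```

In practice the calibrated probabilities would be evaluated on a third split that neither the SVM nor the calibrator has seen.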

Isotonic regression: fit a non-decreasing step function from scores to probabilities. Strictly more flexible than Platt; needs more calibration data.
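A minimal sketch of isotonic calibration with scikit-learn's `IsotonicRegression`, assuming synthetic scores whose true positive rate is the square of the score (a shape a sigmoid fits poorly). The data-generating rule is an assumption chosen for the example.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Synthetic scores in [0, 1] with true P(y = 1 | s) = s ** 2.
rng = np.random.default_rng(0)
scores = rng.uniform(0.0, 1.0, size=20_000)
y = (rng.uniform(size=scores.size) < scores ** 2).astype(int)

# Non-decreasing step function from score to probability, clipped to [0, 1].
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(scores, y)

grid = np.array([0.2, 0.5, 0.8])
for s, p in zip(grid, iso.predict(grid)):
    print(f"score {s:.1f} -> calibrated {p:.3f} (true rate {s ** 2:.3f})")
```

With far fewer calibration points the fitted step function becomes visibly jagged, which is the over-fitting risk mentioned above.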

Calibrate when

You're going to use the probability (cost-sensitive decisions, ensembles, downstream models)
The model is a neural network with cross-entropy loss
You're combining multiple classifiers (calibration is a prerequisite for proper averaging)
You're reporting probabilities to a human decision-maker

Skip calibration when

You only care about ranking, not absolute probabilities (AUC)
You only need the argmax (just accuracy or top-k)
You will pick a downstream decision threshold on held-out data anyway (a monotone calibrator only shifts where that threshold sits)

from sklearn.calibration import CalibratedClassifierCV
import torch, torch.nn.functional as F

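# Sklearn: wrap any classifier and fit the calibrator with cross-validation.
# method="sigmoid" is Platt scaling; method="isotonic" is isotonic regression.
# base_model is any unfitted sklearn classifier (placeholder name).
cal = CalibratedClassifierCV(base_model, method="isotonic", cv=5)
cal.fit(X_train, y_train)
p_cal = cal.predict_proba(X_test)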

# Temperature scaling for a neural net — one learnable parameter
class TempScale(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.T = torch.nn.Parameter(torch.ones(1))
    def forward(self, logits): return logits / self.T

# Fit T on a validation set with LBFGS, minimising NLL.
# logits_val (N, K) and y_val (N,) are precomputed, detached validation tensors.
ts  = TempScale().to(logits_val.device)        # follow the data instead of assuming a GPU
opt = torch.optim.LBFGS([ts.T], lr=0.01, max_iter=50)
def closure():
    opt.zero_grad()
    loss = F.cross_entropy(ts(logits_val), y_val)
    loss.backward()
    return loss
opt.step(closure)
print(f"Learned T = {ts.T.item():.3f}")        # > 1 ⇒ was over-confident

Want ECE, MCE, beta calibration, and conformal alternatives?

Expected calibration error $$ \text{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \left|\text{acc}(B_m) - \text{conf}(B_m)\right| $$

B_m: predictions binned by predicted probability (e.g. 15 equal-width bins)
acc(B_m): observed positive rate in the bin
conf(B_m): average predicted probability in the bin
ECE is the weighted average gap; ECE = 0 is perfect calibration

ECE and its limits. ECE is the canonical calibration metric, but it depends on the binning scheme — different bin counts can give very different ECEs. MCE (max calibration error) reports the worst bin. Adaptive binning (equal-mass instead of equal-width) is fairer when probabilities are concentrated.
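For reference, here is a small equal-width version of ECE and MCE for binary predictions, evaluated at several bin counts to show the binning sensitivity. The function name `ece_mce` and the synthetic data are assumptions for this sketch.

```python
import numpy as np

def ece_mce(y_true, p_pred, n_bins=15):
    """Equal-width ECE (weighted mean gap) and MCE (worst-bin gap)."""
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.asarray(p_pred, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.digitize(p_pred, edges[1:-1])       # bin index in 0 .. n_bins - 1
    ece, mce = 0.0, 0.0
    for b in range(n_bins):
        mask = idx == b
        if not mask.any():
            continue                             # empty bins contribute nothing
        gap = abs(y_true[mask].mean() - p_pred[mask].mean())
        ece += gap * mask.mean()                 # weight by the bin's share of samples
        mce = max(mce, gap)
    return ece, mce

# Synthetic over-confident predictions concentrated near 1.
rng = np.random.default_rng(0)
p_pred = rng.beta(8, 1, size=5000)
y_true = (rng.uniform(size=p_pred.size) < 0.85 * p_pred).astype(int)

for n_bins in (5, 15, 50):
    ece, mce = ece_mce(y_true, p_pred, n_bins)
    print(f"n_bins={n_bins:>2}  ECE={ece:.3f}  MCE={mce:.3f}")
```

MCE in particular tends to grow with the bin count, because small bins in the sparse low-confidence region have noisy observed rates.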

Temperature scaling intuition. Dividing logits by T > 1 softens the softmax — pushes probabilities away from the corners (0/1) toward uniform. Neural networks trained with cross-entropy push toward extremes; T learns to dial that back. Doesn't change which class is predicted, just the confidence.
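The softening effect is easy to check numerically. The sketch below applies several temperatures to one made-up logit vector; the values are arbitrary and chosen only for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

logits = np.array([4.0, 1.0, 0.5])          # hypothetical logits for a 3-class problem

for T in (0.5, 1.0, 2.0, 5.0):
    p = softmax(logits / T)
    print(f"T={T:>3}: top class {p.argmax()}, confidence {p.max():.3f}")
```

The top class stays the same at every temperature, while the confidence assigned to it falls as T grows.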

Beta calibration. Kull et al. (2017): a three-parameter generalisation of Platt scaling (logistic calibration) for binary classifiers, reported by the authors to perform better empirically. Useful when isotonic over-fits.
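One common way to fit the beta calibration map, logit(q) = a ln(p) - b ln(1 - p) + c, is a logistic regression on the two features ln(p) and -ln(1 - p). The sketch below takes that route on synthetic data. It omits the non-negativity constraint on a and b used in the paper, and uses a large `C` to approximate an unregularised fit; the helper name `beta_features` and the data are assumptions for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def beta_features(p, eps=1e-12):
    """Map raw probabilities to the features [ln p, -ln(1 - p)]."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)
    return np.column_stack([np.log(p), -np.log1p(-p)])

# Synthetic over-confident scores: the true logit is half the raw logit (a = b = 0.5, c = 0).
rng = np.random.default_rng(0)
p_raw = rng.uniform(0.01, 0.99, size=10_000)
true_logit = 0.5 * np.log(p_raw) - 0.5 * np.log1p(-p_raw)
y = (rng.uniform(size=p_raw.size) < 1.0 / (1.0 + np.exp(-true_logit))).astype(int)

# Large C makes the default L2 penalty negligible.
beta_cal = LogisticRegression(C=1e6, max_iter=1000).fit(beta_features(p_raw), y)
a, b = beta_cal.coef_[0]
c = beta_cal.intercept_[0]
print(f"a = {a:.2f}, b = {b:.2f}, c = {c:.2f}")

p_fixed = beta_cal.predict_proba(beta_features([0.9]))[0, 1]
print(f"raw 0.90 -> calibrated {p_fixed:.2f}")
```

Setting a = b recovers a map that is linear in the logit of p, which is how the family contains Platt-style logistic calibration as a special case.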

Conformal prediction. Instead of fixing the probability, fix the set: produce prediction sets that cover the true label with guaranteed probability. Distribution-free; works on top of any base model.
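A minimal sketch of split conformal prediction for classification, using the simple score 1 - p(true class). The helper names and the Dirichlet toy data are assumptions for illustration, and the coverage guarantee is marginal (on average over exchangeable calibration and test points), not per input.

```python
import numpy as np

def conformal_quantile(probs_cal, y_cal, alpha=0.1):
    """Threshold on the score 1 - p(true class) from a held-out calibration set."""
    n = len(y_cal)
    scores = np.sort(1.0 - probs_cal[np.arange(n), y_cal])
    k = int(np.ceil((n + 1) * (1 - alpha)))      # finite-sample corrected rank
    return scores[k - 1] if k <= n else np.inf   # too few points: include every class

def prediction_sets(probs, qhat):
    """Boolean (n, K) mask: class k is in the set when 1 - p_k <= qhat."""
    return 1.0 - probs <= qhat

def sample_labels(probs, rng):
    """Draw one label per row from the row's own distribution (toy data only)."""
    u = rng.uniform(size=len(probs))
    idx = (probs.cumsum(axis=1) < u[:, None]).sum(axis=1)
    return np.minimum(idx, probs.shape[1] - 1)   # guard against rounding in cumsum

rng = np.random.default_rng(0)
probs_cal = rng.dirichlet([0.5, 0.5, 0.5], size=2000)
probs_test = rng.dirichlet([0.5, 0.5, 0.5], size=2000)
y_cal, y_test = sample_labels(probs_cal, rng), sample_labels(probs_test, rng)

qhat = conformal_quantile(probs_cal, y_cal, alpha=0.1)
sets = prediction_sets(probs_test, qhat)
coverage = sets[np.arange(len(y_test)), y_test].mean()
print(f"coverage {coverage:.3f}, mean set size {sets.sum(axis=1).mean():.2f}")
```

Empirical coverage should land near the 90% target; a better base model shrinks the sets rather than changing the coverage.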

Class-conditional calibration. A model can be globally calibrated but mis-calibrated on a specific class. Report per-class calibration in multi-class problems.

Why this matters in production. Many ML systems consume probabilities downstream: cost-sensitive decisions, expected-value computations, fraud-detection cascades. Mis-calibration means the downstream code is doing arithmetic on numbers that don't mean what they should.
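A small worked example of how this plays out in a cost-sensitive decision. The costs and probabilities below are made up for illustration.

```python
def should_act(p_positive, cost_fp, cost_fn):
    """Act (predict positive) when acting has the lower expected cost."""
    expected_cost_act = (1.0 - p_positive) * cost_fp    # pay cost_fp if the case is negative
    expected_cost_skip = p_positive * cost_fn           # pay cost_fn if the case is positive
    return expected_cost_act < expected_cost_skip

# Hypothetical costs: a missed positive is nine times worse than a false alarm.
cost_fp, cost_fn = 1.0, 9.0
threshold = cost_fp / (cost_fp + cost_fn)               # equivalent rule: act when p > threshold

reported_p = 0.30   # what an over-confident model says
true_rate = 0.05    # what actually happens on such cases

print(f"threshold = {threshold:.2f}")
print("decision from reported p:", should_act(reported_p, cost_fp, cost_fn))
print("decision from true rate: ", should_act(true_rate, cost_fp, cost_fn))
```

The two decisions disagree: the reported probability clears the threshold but the true rate does not, so the system acts on cases where acting costs more in expectation.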

import numpy as np
from sklearn.calibration import calibration_curve

# Reliability diagram data
prob_true, prob_pred = calibration_curve(y_val, p_pred, n_bins=15, strategy="quantile")

# ECE with adaptive (equal-mass) binning
def adaptive_ece(y, p, n_bins=15):
    y, p = np.asarray(y, dtype=float), np.asarray(p, dtype=float)
    order = np.argsort(p)
    e = 0.0
    # array_split keeps every sample; bin sizes differ by at most one
    for idx in np.array_split(order, n_bins):
        if len(idx) == 0:
            continue
        e += abs(y[idx].mean() - p[idx].mean()) * len(idx) / len(y)
    return e

Want proper scoring, multi-class calibration, and OOD calibration?

Brier decomposition $$ \text{Brier} = \underbrace{\text{Reliability}}_{\text{miscalibration}} - \underbrace{\text{Resolution}}_{\text{spread of cond. freq.}} + \underbrace{\text{Uncertainty}}_{\text{base rate}} $$

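Brier score split into three additive terms (Resolution enters with a minus sign)
Reliability ↓ as calibration improves
Resolution ↑ as the model usefully separates positives from negatives
Uncertainty depends only on the base rate, so no model can change it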
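The decomposition can be checked numerically. In this sketch forecasts are grouped by their exact value, which makes the identity exact; with continuous forecasts grouped into bins it holds only approximately. The function name and the synthetic data are assumptions for the example.

```python
import numpy as np

def brier_decomposition(y_true, p_pred):
    """Reliability, resolution and uncertainty, grouping by distinct forecast value."""
    y = np.asarray(y_true, dtype=float)
    p = np.asarray(p_pred, dtype=float)
    base_rate = y.mean()
    values, inverse = np.unique(p, return_inverse=True)
    counts = np.bincount(inverse)
    observed = np.bincount(inverse, weights=y) / counts   # positive rate per forecast value
    weights = counts / len(y)
    reliability = np.sum(weights * (values - observed) ** 2)
    resolution = np.sum(weights * (observed - base_rate) ** 2)
    uncertainty = base_rate * (1.0 - base_rate)
    return reliability, resolution, uncertainty

# Forecasts restricted to the grid 0.1, 0.2, ..., 0.9; outcomes drawn at those rates.
rng = np.random.default_rng(0)
p_pred = rng.integers(1, 10, size=5000) / 10.0
y_true = (rng.uniform(size=p_pred.size) < p_pred).astype(int)

rel, res, unc = brier_decomposition(y_true, p_pred)
brier = np.mean((p_pred - y_true) ** 2)
print(f"Brier {brier:.4f} = {rel:.4f} - {res:.4f} + {unc:.4f}")
```

Because the toy forecasts are calibrated by construction, the reliability term comes out close to zero and almost all of the improvement over the base-rate forecast is resolution.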

Multi-class calibration is harder. Class-conditional calibration, top-label calibration, and full-distribution calibration all have different definitions. Temperature scaling addresses top-label calibration well; matrix scaling and vector scaling extend it but can over-fit. Dirichlet calibration (Kull et al. 2019) is a principled multi-class generalisation.

OOD and calibration. Calibration on the training distribution doesn't imply calibration under shift. Neural networks are dramatically over-confident on out-of-distribution inputs — this is one of the harder open problems in ML safety. Deep ensembles and Bayesian neural nets help; large-margin training and outlier exposure help too.

Calibrated probabilities ≠ Bayesian. Calibration is a frequentist consistency property: the long-run frequency in a bin matches the predicted probability. Bayesian uncertainty is something different — it tells you how much you should believe each plausible model given the data. A model can be calibrated without being Bayesian, and a Bayesian model isn't automatically calibrated under prior misspecification.

Calibration vs. selective prediction. Sometimes you'd rather abstain than predict an uncertain answer. Selective classification frames this directly: choose a confidence threshold below which you decline. Plumbing this into a deployed system is easier with calibrated probabilities than with raw logits.
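A confidence-threshold abstention rule can be sketched as follows. The function name and the four-row toy batch are assumptions for illustration.

```python
import numpy as np

def selective_accuracy(probs, y_true, threshold):
    """Coverage and accuracy when the model abstains below a confidence threshold."""
    probs = np.asarray(probs, dtype=float)
    y_true = np.asarray(y_true)
    confidence = probs.max(axis=1)
    prediction = probs.argmax(axis=1)
    keep = confidence >= threshold               # answer only these rows
    coverage = keep.mean()
    accuracy = (prediction[keep] == y_true[keep]).mean() if keep.any() else float("nan")
    return coverage, accuracy

# Hypothetical batch: two confident rows, two borderline rows.
probs = np.array([[0.90, 0.10],
                  [0.60, 0.40],
                  [0.20, 0.80],
                  [0.55, 0.45]])
y_true = np.array([0, 1, 1, 0])

for t in (0.5, 0.7):
    cov, acc = selective_accuracy(probs, y_true, t)
    print(f"threshold {t}: coverage {cov:.2f}, accuracy on answered {acc:.2f}")
```

Sweeping the threshold traces out a risk-coverage curve; with calibrated probabilities the threshold itself can be read as a target accuracy on the answered cases.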

Histogram binning, BBQ, ENIR. A zoo of post-hoc calibrators beyond Platt/isotonic/temperature. Histogram binning is simple; BBQ averages over multiple binnings; ENIR uses nearly-isotonic regression. All are useful when temperature scaling isn't enough but isotonic over-fits.
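Histogram binning is the simplest of these to write down: replace each score with the observed positive rate of its bin. The class below is a minimal sketch with equal-width bins; falling back to the bin midpoint for empty bins is an assumption of this example, not part of any standard definition.

```python
import numpy as np

class HistogramBinning:
    """Binary calibrator: map each score to its bin's observed positive rate."""

    def __init__(self, n_bins=10):
        self.n_bins = n_bins

    def fit(self, p, y):
        p = np.asarray(p, dtype=float)
        y = np.asarray(y, dtype=float)
        self.edges_ = np.linspace(0.0, 1.0, self.n_bins + 1)
        idx = np.digitize(p, self.edges_[1:-1])
        counts = np.bincount(idx, minlength=self.n_bins)
        positives = np.bincount(idx, weights=y, minlength=self.n_bins)
        midpoints = (self.edges_[:-1] + self.edges_[1:]) / 2
        # Empty bins fall back to the bin midpoint (assumption of this sketch).
        self.bin_prob_ = np.where(counts > 0, positives / np.maximum(counts, 1), midpoints)
        return self

    def predict(self, p):
        idx = np.digitize(np.asarray(p, dtype=float), self.edges_[1:-1])
        return self.bin_prob_[idx]

# Tiny example with two bins: [0, 0.5) and [0.5, 1].
hb = HistogramBinning(n_bins=2).fit([0.05, 0.05, 0.95, 0.95], [0, 1, 1, 1])
print(hb.predict([0.1, 0.9]))
```

Unlike isotonic regression, nothing here forces the per-bin rates to be monotone in the score, so the calibrated output can reorder examples.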

Calibration under deferral. When the model can defer to a human, the calibrator should know that — joint calibration of model + deferral is an active research area (e.g. learn-to-defer literature).

import numpy as np
import torch, torch.nn.functional as F

# Matrix scaling for multi-class: a learned affine map on the logits (K*K + K parameters)
class MatrixScale(torch.nn.Module):
    def __init__(self, K):
        super().__init__()
        self.W = torch.nn.Parameter(torch.eye(K))
        self.b = torch.nn.Parameter(torch.zeros(K))
    def forward(self, logits): return logits @ self.W + self.b

# Deep-ensemble calibration: average probabilities from multiple seeds
def ensemble_probs(models, x):
    ps = [F.softmax(m(x), dim=-1) for m in models]
    return torch.stack(ps).mean(dim=0)         # often better calibrated and more accurate than one model


Where to learn more

Guo et al. (2017), On Calibration of Modern Neural Networks. The paper that brought "modern nets are miscalibrated" into the mainstream and proposed temperature scaling. Required reading.
Minderer et al. (2021), Revisiting Calibration. Empirical study showing modern vision models (ViT, MLP-Mixer) are actually fairly well-calibrated out of the box, unlike the convnets in Guo et al.
scikit-learn, Calibration Guide. Practical reference for Platt, isotonic, and reliability diagrams with code.
Angelopoulos & Bates, Conformal Prediction Intro. Alternative to calibrating probabilities: produce prediction sets with guaranteed coverage. Increasingly the preferred answer for high-stakes deployment.