Classification Metrics — ML Resources Hub

Key idea

Accuracy alone hides too much. On a 95% / 5% imbalanced dataset, a model that always predicts the majority class is 95% accurate and useless. You need at least two numbers: precision (when I say positive, am I right?) and recall (of all the positives, how many did I find?). The threshold matters too: slide it and every metric moves.
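The 95% / 5% case can be checked by hand. The sketch below builds a hypothetical 1000-example dataset (the size is an assumption chosen for round numbers) and scores a predictor that always outputs the majority class.

```python
# Hypothetical dataset: 1000 examples, 5% positive.
y_true = [1] * 50 + [0] * 950
# A "model" that always predicts the majority (negative) class.
y_pred = [0] * 1000

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

accuracy = (tp + tn) / len(y_true)   # 0.95
recall = tp / (tp + fn)              # 0.0: not a single positive is found
print(accuracy, recall)
```

Precision is undefined here (no positive predictions at all), which is itself a warning sign.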

[Interactive figure: slide the threshold and the confusion matrix, precision, recall, F1, and the operating point on the ROC curve all move together. Controls: threshold τ (shown at τ = 0.50) and an imbalance preset.]

Left: predicted score distributions for the two classes. The vertical line is the threshold; everything to its right is predicted positive. Top-right: the confusion matrix. Bottom-right: the ROC curve with the current operating point marked. Lowering τ catches more positives (higher recall, lower precision); raising τ does the opposite. With the 10% imbalance preset, the F1-optimal threshold drifts away from 0.5.

Accuracy = (TP + TN) / total. Useful when classes are balanced and costs are equal. Often misleading otherwise.

Precision = TP / (TP + FP). "When I predict positive, how often am I right?" Prioritise it when false positives are expensive (a spam filter that sends an important email to junk).

Recall (sensitivity) = TP / (TP + FN). "Of all the actual positives, how many did I find?" Prioritise it when false negatives are expensive (cancer screening, fraud detection).

F1 = harmonic mean of precision and recall. Single number; punishes the worse of the two. Default for imbalanced classification reports.
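To see how the harmonic mean "punishes the worse of the two", here is a small sketch that computes precision, recall, and F1 from confusion-matrix counts. The function name and the counts are made up for illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from raw counts; 0.0 when a ratio is undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical counts: the model flags almost everything as positive.
p, r, f1 = precision_recall_f1(tp=10, fp=90, fn=0)
arithmetic_mean = (p + r) / 2

print(p, r)                  # precision 0.1, recall 1.0
print(f1, arithmetic_mean)   # F1 stays near the weaker number; the plain mean does not
```

With perfect recall but 10% precision, the arithmetic mean is 0.55 while F1 is about 0.18.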

ROC-AUC = probability that a random positive is scored higher than a random negative (ties count as half). Threshold-independent; useful for ranking and overall discriminative power. Use PR-AUC for highly imbalanced data: ROC-AUC can stay high even when precision on the minority class is poor.
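The probabilistic reading of ROC-AUC can be implemented directly by comparing every positive with every negative. This is a quadratic-time sketch for intuition only; `pairwise_auc` is a hypothetical helper, not a library function.

```python
import numpy as np

def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]          # every positive vs every negative
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (len(pos) * len(neg))

y = [0, 0, 1, 1]
s = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(y, s))   # 3 of the 4 pairs are ordered correctly -> 0.75
```

On the same inputs, `sklearn.metrics.roc_auc_score` should return the same value.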

Reach for it when

Accuracy: balanced classes, equal costs
Precision / Recall: care about FP vs FN asymmetry
F1: imbalanced classes, single-number summary
ROC-AUC: ranking quality, threshold not fixed
PR-AUC: highly imbalanced classes (e.g. around 1% positives)

Watch out when

Accuracy on imbalanced data is misleading
ROC-AUC on very imbalanced data can be misleading
F1 gives equal weight to precision and recall; use Fβ if you want to weight them differently
Macro vs micro vs weighted matters for multi-class

from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, average_precision_score, classification_report,
    confusion_matrix,
)

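# Assumes binary labels y_true, hard predictions y_pred, and
# positive-class probabilities y_prob are already defined.
# One-number summaries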
acc = accuracy_score(y_true, y_pred)
p   = precision_score(y_true, y_pred)
r   = recall_score(y_true, y_pred)
f1  = f1_score(y_true, y_pred)

# Threshold-independent (need probabilities, not labels)
auc      = roc_auc_score(y_true, y_prob)
ap       = average_precision_score(y_true, y_prob)   # PR-AUC; better for imbalance

# The big picture
print(classification_report(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))

# Multi-class: macro vs micro vs weighted
f1_macro    = f1_score(y_true, y_pred, average="macro")     # mean of per-class F1
f1_micro    = f1_score(y_true, y_pred, average="micro")     # = accuracy for single-label multi-class
f1_weighted = f1_score(y_true, y_pred, average="weighted")  # by support

Want PR-curves, cost-sensitive thresholds, and multi-class metrics?

Fβ score $$ F_\beta = (1 + \beta^2) \cdot \frac{\text{precision} \cdot \text{recall}}{\beta^2 \cdot \text{precision} + \text{recall}} $$

β = 1: standard F1, equal weight
β > 1: weights recall more (e.g. F2)
β < 1: weights precision more (e.g. F0.5)
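The effect of β is easiest to see with numbers. A minimal sketch of the formula above, using a hypothetical precision of 0.5 and recall of 0.8:

```python
def f_beta(precision, recall, beta):
    """F-beta from precision and recall; 0.0 when both are zero."""
    denom = beta**2 * precision + recall
    return (1 + beta**2) * precision * recall / denom if denom else 0.0

p, r = 0.5, 0.8               # hypothetical operating point
f05 = f_beta(p, r, 0.5)       # pulled towards precision
f1 = f_beta(p, r, 1.0)        # harmonic mean
f2 = f_beta(p, r, 2.0)        # pulled towards recall

print(round(f05, 3), round(f1, 3), round(f2, 3))
```

scikit-learn exposes the same quantity as `sklearn.metrics.fbeta_score(y_true, y_pred, beta=...)`.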

Picking the right threshold. The model outputs a probability; you choose where to cut it. 0.5 is rarely optimal — pick the threshold that maximises F1 (or Fβ for asymmetric costs) on a validation set. For deployment, freeze that threshold and report performance at it.

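Cost-sensitive thresholds. If a false positive costs c_FP and a false negative costs c_FN (and correct decisions cost nothing), the cost-minimising threshold under a Bayes-decision argument is τ* = c_FP / (c_FP + c_FN), provided the model's probabilities are calibrated. For cancer screening, where a false negative is catastrophic, τ* falls well below 0.5: with c_FN = 10 · c_FP it is 1/11 ≈ 0.09.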
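The threshold formula comes from comparing the expected cost of the two possible decisions. A short derivation, assuming p = P(y = 1 | x) is a calibrated probability and that correct decisions cost nothing (both are assumptions of this sketch):

$$ \begin{aligned} \mathbb{E}[\text{cost} \mid \text{predict positive}] &= (1 - p)\, c_{FP} \\ \mathbb{E}[\text{cost} \mid \text{predict negative}] &= p\, c_{FN} \\ \text{predict positive} \iff (1 - p)\, c_{FP} < p\, c_{FN} &\iff p > \frac{c_{FP}}{c_{FP} + c_{FN}} = \tau^* \end{aligned} $$

With equal costs this gives τ* = 0.5; making false negatives more expensive lowers τ*.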

PR vs ROC. ROC plots TPR vs FPR; PR plots precision vs recall. Both span the same threshold range. On heavily imbalanced data, ROC can be deceptive — a high AUC can hide poor minority-class performance. PR-AUC is the right summary there.
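Why ROC can flatter a model on imbalanced data: a ROC operating point fixes TPR and FPR, but the precision those rates imply depends on prevalence. The sketch below applies Bayes' rule to one hypothetical operating point (TPR 0.9, FPR 0.1) at two prevalences.

```python
def precision_at(tpr, fpr, prevalence):
    """Precision implied by a ROC operating point at a given positive rate."""
    tp = tpr * prevalence            # expected true-positive mass
    fp = fpr * (1 - prevalence)      # expected false-positive mass
    return tp / (tp + fp)

balanced = precision_at(0.9, 0.1, prevalence=0.5)
rare = precision_at(0.9, 0.1, prevalence=0.01)

print(round(balanced, 3))   # 0.9
print(round(rare, 3))       # about 0.083: most alarms are false
```

The ROC point is identical in both cases; only the PR view shows the difference.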

Multi-class. Macro-averaged F1 = mean of per-class F1 — treats every class equally. Micro-averaged F1 = aggregate TP/FP/FN across classes first — dominated by big classes. Weighted = macro but weighted by class support. Pick macro if you care equally about every class (rare-class performance matters); micro if you don't (mass matters).
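The three averages can be computed from per-class TP/FP/FN counts. This is a NumPy sketch for intuition (the helper name and toy labels are made up); in practice `f1_score(..., average=...)` does the same job.

```python
import numpy as np

def averaged_f1(y_true, y_pred, n_classes):
    """Return (macro, micro, weighted) F1 for single-label multi-class data."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.array([np.sum((y_pred == k) & (y_true == k)) for k in range(n_classes)])
    fp = np.array([np.sum((y_pred == k) & (y_true != k)) for k in range(n_classes)])
    fn = np.array([np.sum((y_pred != k) & (y_true == k)) for k in range(n_classes)])
    support = tp + fn                                   # true examples per class
    denom = 2 * tp + fp + fn
    per_class = np.divide(2 * tp, denom, out=np.zeros(n_classes), where=denom > 0)
    macro = per_class.mean()                            # every class counts equally
    micro = 2 * tp.sum() / (2 * tp.sum() + fp.sum() + fn.sum())  # pool counts first
    weighted = np.sum(per_class * support) / support.sum()       # weight by support
    return macro, micro, weighted

# Toy data: a big class 0 predicted well, a small class 1 predicted half right.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 8 + [0, 1]
macro, micro, weighted = averaged_f1(y_true, y_pred, n_classes=2)
print(round(macro, 3), round(micro, 3), round(weighted, 3))
```

Macro is the lowest of the three here because the weak minority class counts as much as the majority class.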

Multi-label vs multi-class. Multi-class: one label per example, K options. Multi-label: any subset of K. Metrics generalise differently — Hamming loss, subset accuracy, per-label F1.
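For the multi-label case, the three metrics named above can be sketched on a small indicator matrix (rows are examples, columns are labels). The data is invented for illustration.

```python
import numpy as np

Y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0]])
Y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 1, 0]])

# Hamming loss: fraction of individual label slots that are wrong.
hamming = np.mean(Y_true != Y_pred)
# Subset accuracy: fraction of rows where every label is right.
subset_acc = np.mean(np.all(Y_true == Y_pred, axis=1))
# Per-label F1 = 2TP / (2TP + FP + FN), computed column by column.
tp = np.sum((Y_true == 1) & (Y_pred == 1), axis=0)
fp = np.sum((Y_true == 0) & (Y_pred == 1), axis=0)
fn = np.sum((Y_true == 1) & (Y_pred == 0), axis=0)
per_label_f1 = 2 * tp / np.maximum(2 * tp + fp + fn, 1)

print(hamming, subset_acc, per_label_f1)
```

One wrong slot out of nine gives a small Hamming loss, but it costs a full row of subset accuracy and drives the third label's F1 to zero.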

Top-k accuracy. Was the right class in the top k predictions? Standard for ImageNet (top-5). Useful when there's structural ambiguity ("dog breed" could be any of several reasonable guesses).
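Top-k accuracy is a few lines of NumPy. The sketch below assumes a score matrix with one row per example and one column per class; the helper name and the scores are hypothetical.

```python
import numpy as np

def top_k_accuracy(y_true, scores, k):
    """Fraction of rows whose true class is among the k highest-scoring classes."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    top_k = np.argsort(scores, axis=1)[:, -k:]        # indices of the k largest scores
    hits = np.any(top_k == y_true[:, None], axis=1)
    return hits.mean()

scores = np.array([[0.60, 0.30, 0.10],
                   [0.20, 0.50, 0.30],
                   [0.10, 0.20, 0.70],
                   [0.40, 0.35, 0.25]])
y = [0, 2, 2, 2]

print(top_k_accuracy(y, scores, k=1))   # 0.5
print(top_k_accuracy(y, scores, k=2))   # 0.75
```

scikit-learn ships an equivalent, `sklearn.metrics.top_k_accuracy_score`.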

import numpy as np
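from sklearn.metrics import precision_recall_curve

# Find the threshold that maximises F1 on validation
# (assumes y_val holds binary labels and p_val positive-class probabilities)
p, r, thresholds = precision_recall_curve(y_val, p_val)
# p and r have one more entry than thresholds: the final point (p=1, r=0)
# has no threshold, so drop it before indexing into thresholds
f1_arr           = 2 * p[:-1] * r[:-1] / (p[:-1] + r[:-1] + 1e-12)
best_idx         = np.argmax(f1_arr)
tau_best         = thresholds[best_idx]
print(f"Best τ = {tau_best:.2f}  F1 = {f1_arr[best_idx]:.3f}")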

# Cost-sensitive choice — FN costs 10x FP
c_fp, c_fn = 1, 10
tau_cost   = c_fp / (c_fp + c_fn)             # → 0.09
y_hat_cost = (p_test >= tau_cost).astype(int)

Want expected calibration, Brier decomposition, & proper scoring?

Brier score decomposition $$ \text{Brier} = \underbrace{\text{Reliability}}_{\text{miscalibration}} - \underbrace{\text{Resolution}}_{\text{discrimination}} + \underbrace{\text{Uncertainty}}_{\text{base-rate variance}} $$

Lower Brier score is better
Reliability ↓ if predicted probabilities match observed frequencies
Resolution ↑ if the model assigns different probabilities to different cases
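The decomposition can be verified numerically. The sketch below groups examples by their exact forecast value, which is an assumption that makes the identity hold exactly; with continuous forecasts you would bin them and the identity becomes approximate. The helper name and data are hypothetical.

```python
import numpy as np

def brier_decomposition(y_true, p_pred):
    """Return (reliability, resolution, uncertainty), grouping by forecast value."""
    y = np.asarray(y_true, dtype=float)
    p = np.asarray(p_pred, dtype=float)
    base_rate = y.mean()
    reliability = resolution = 0.0
    for value in np.unique(p):
        mask = p == value
        weight = mask.mean()              # share of examples with this forecast
        observed = y[mask].mean()         # how often they were actually positive
        reliability += weight * (value - observed) ** 2
        resolution += weight * (observed - base_rate) ** 2
    uncertainty = base_rate * (1 - base_rate)
    return reliability, resolution, uncertainty

y = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1]
p = [0.2] * 4 + [0.6] * 4 + [0.9] * 2

rel, res, unc = brier_decomposition(y, p)
brier_direct = float(np.mean((np.array(p) - np.array(y)) ** 2))
print(brier_direct, rel - res + unc)   # the two numbers agree
```

Here the uncertainty term is the variance of the labels, base_rate · (1 − base_rate), which depends only on the data and not on the model.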

Brier score & proper scoring. Brier = mean squared error between predicted probabilities and labels (one-hot in the multi-class case). A "proper" scoring rule is one whose expected value is optimised by reporting the true probability; Brier and log-loss are both proper, accuracy isn't. When the probabilities themselves matter, optimise and select models on a proper scoring rule.
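Properness can be checked on a single event. Assuming a hypothetical true probability of 0.3, the sketch below scans candidate forecasts q and finds where the expected Brier score is lowest; it also shows that expected accuracy (thresholding q at 0.5) cannot tell the honest forecast from an extreme one.

```python
import numpy as np

true_p = 0.3                              # hypothetical true probability of the event
q = np.linspace(0.0, 1.0, 101)            # candidate forecasts on a 0.01 grid

# Expected Brier score of forecasting q when the event occurs with probability true_p.
expected_brier = true_p * (1 - q) ** 2 + (1 - true_p) * q ** 2
best_q = q[np.argmin(expected_brier)]

# Expected accuracy when q is thresholded at 0.5: flat on each side of the cut.
expected_acc = np.where(q >= 0.5, true_p, 1 - true_p)

print(best_q)                             # the minimum sits at the true probability
print(expected_acc[0], expected_acc[30])  # q = 0.0 and q = 0.3 score the same
```

Because accuracy is flat on either side of the threshold, it gives no incentive to report honest probabilities.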

Expected calibration error (ECE). Bin predictions by predicted probability; in each bin compare the average predicted probability to the fraction actually positive, then average the absolute gaps weighted by bin size. Useful, but its dependence on the binning scheme makes it brittle; alternatives include MCE (maximum calibration error) and calibration-aware scoring rules.

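Calibration vs discrimination. A model can have great AUC but terrible calibration (good ranking, wrong probabilities), and vice versa. The two properties are largely independent. Post-hoc calibration (Platt scaling, isotonic regression) fixes calibration while leaving discrimination essentially unchanged: Platt scaling is a monotone map, so the ranking and AUC are preserved, while isotonic regression can introduce ties that shift AUC slightly. This is useful when deploying a model whose scores you want to interpret as probabilities.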
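A quick way to see the split between ranking and calibration: distort the probabilities with a strictly increasing function. In the sketch below (toy data; cubing is an arbitrary choice of distortion), the AUC is untouched while the Brier score gets worse.

```python
import numpy as np

def pairwise_auc(y, s):
    """AUC as the fraction of correctly ordered (positive, negative) pairs."""
    y, s = np.asarray(y), np.asarray(s, dtype=float)
    diff = s[y == 1][:, None] - s[y == 0][None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

def brier(y, p):
    return float(np.mean((np.asarray(p, dtype=float) - np.asarray(y)) ** 2))

y = [0, 0, 1, 0, 1, 1]
p = np.array([0.1, 0.3, 0.4, 0.5, 0.7, 0.9])
p_distorted = p ** 3                      # same ordering, different probabilities

print(pairwise_auc(y, p), pairwise_auc(y, p_distorted))   # identical
print(brier(y, p), brier(y, p_distorted))                 # second is larger
```

Post-hoc calibration runs this idea in reverse: it learns a monotone map that repairs the probabilities without reordering the scores.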

Class-conditional metrics. For very imbalanced problems, look at per-class precision / recall / F1 separately. The overall macro-F1 hides cases where one class is great and another is awful.

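Threshold-independent metrics under shift. AUC and AP are invariant to strictly increasing score transformations, which makes them useful for comparing models whose scores live on different scales. They behave differently under base-rate shift: ROC-AUC is built from TPR and FPR, both class-conditional rates, so a pure change in class prior leaves it unchanged, whereas AP depends on precision and moves with the prior. Calibrated probabilities also depend on the prior, so if the production base rate differs from training, re-calibrate the scores post hoc and re-pick the threshold.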
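When the only change is the class prior, one common post-hoc fix is to re-weight the predicted odds by the ratio of new to old priors. The sketch below assumes the probabilities were calibrated under the training prior and that the class-conditional feature distributions are unchanged (label shift); the function name and the priors are hypothetical.

```python
def adjust_for_prior(p, train_prior, new_prior):
    """Re-map a calibrated probability from the training prior to a new prior."""
    w_pos = new_prior / train_prior                # re-weight the positive class
    w_neg = (1 - new_prior) / (1 - train_prior)    # re-weight the negative class
    return p * w_pos / (p * w_pos + (1 - p) * w_neg)

# Trained at 10% positives, deployed where only 1% are positive.
p_new = adjust_for_prior(0.5, train_prior=0.10, new_prior=0.01)
print(round(p_new, 4))   # a 0.5 score is worth much less at the lower base rate

# Sanity check: an uninformative score equal to the old prior maps to the new prior.
print(adjust_for_prior(0.10, train_prior=0.10, new_prior=0.01))
```

The map is strictly increasing in p, so ranking metrics are unaffected; only the probabilities and any fixed threshold need revisiting.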

Cohen's κ and MCC. Cohen's kappa adjusts accuracy for the chance level. Matthews Correlation Coefficient is a balanced metric on the confusion matrix; symmetric, robust to imbalance. Both are sometimes preferred for biological / medical applications.
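Both metrics can be computed from the four confusion-matrix counts. A sketch with a hypothetical helper, applied to the always-negative predictor on a 95% / 5% split and to a more useful model:

```python
import math

def kappa_and_mcc(tp, fp, fn, tn):
    """Cohen's kappa and Matthews correlation from binary confusion counts."""
    n = tp + fp + fn + tn
    observed = (tp + tn) / n                                   # plain accuracy
    # Agreement expected by chance, given the row and column totals.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n**2
    kappa = (observed - expected) / (1 - expected) if expected < 1 else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0        # 0 when undefined
    return kappa, mcc

# Always predicts negative: 95% accurate, but no better than chance.
kappa_trivial, mcc_trivial = kappa_and_mcc(tp=0, fp=0, fn=50, tn=950)
# A model that finds 40 of the 50 positives with 10 false alarms.
kappa_good, mcc_good = kappa_and_mcc(tp=40, fp=10, fn=10, tn=940)

print(kappa_trivial, mcc_trivial)
print(round(kappa_good, 3), round(mcc_good, 3))
```

scikit-learn provides `cohen_kappa_score` and `matthews_corrcoef` for label arrays. Returning 0 when the MCC denominator vanishes is a convention chosen for this sketch.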

import numpy as np
from sklearn.metrics import brier_score_loss, matthews_corrcoef

# Brier score — lower is better, proper, sensitive to calibration
brier = brier_score_loss(y_true, p_pred)

# Matthews correlation coefficient — balanced, robust to imbalance
mcc = matthews_corrcoef(y_true, y_pred)

# Expected calibration error (ECE) — quick implementation
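def ece(y_true, p_pred, n_bins=15):
    y_true = np.asarray(y_true)                    # accept lists as well as arrays
    p_pred = np.asarray(p_pred, dtype=float)
    bins = np.linspace(0, 1, n_bins + 1)
    e = 0.0
    for i, (lo, hi) in enumerate(zip(bins[:-1], bins[1:])):
        # half-open bins [lo, hi); the last bin also includes p = 1.0
        upper = (p_pred <= hi) if i == n_bins - 1 else (p_pred < hi)
        mask = (p_pred >= lo) & upper
        if mask.sum() == 0:
            continue
        conf = p_pred[mask].mean()                 # average predicted prob
        acc  = (y_true[mask] == 1).mean()          # observed positive rate
        e += (mask.sum() / len(y_true)) * abs(conf - acc)
    return e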


Where to learn more

scikit-learn: Model Evaluation Guide. Best practical reference for which metric to use when. Has well-chosen examples and pitfalls.
Kaggle: Plotting a Confusion Matrix. Pragmatic primer with multi-class examples; useful for teaching new colleagues.
Frank Harrell: Classification vs Prediction. Opinionated argument that we should report calibrated probabilities, not thresholded classifications. Worth reading even (especially) if you disagree.
Guo et al. (2017): Calibration of Modern Neural Nets. Showed modern networks are miscalibrated despite high accuracy, and recommended temperature scaling as a fix.