Accuracy, precision, recall, F1, ROC, AUC — when each one is the right thing to report.
Key idea
Accuracy alone hides too much. On a 95% / 5% imbalanced dataset, a model that predicts the majority class always is 95% accurate — and useless. You need at least two numbers: precision (when I say positive, am I right?) and recall (of all the positives, how many did I find?). Threshold matters too: slide it and every metric moves.
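The 95% / 5% example above can be reproduced in a few lines. This is a minimal sketch: the labels are synthetic and the always-negative "model" is a hypothetical baseline, not something from a real dataset.

```python
# Synthetic labels: 950 negatives (0) and 50 positives (1), i.e. a 95% / 5% split.
y_true = [0] * 950 + [1] * 50

# Hypothetical baseline that always predicts the majority class.
y_pred = [0] * len(y_true)

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)

accuracy = correct / len(y_true)  # 950 of 1000 labels match
recall = tp / (tp + fn)           # none of the 50 positives are found

print(f"accuracy = {accuracy:.2f}, recall = {recall:.2f}")
```

Accuracy looks healthy while recall on the positive class is zero, which is why a single accuracy figure is not enough here.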
Figure (interactive): slide the threshold and the confusion matrix, precision, recall, F1, and the operating point on the ROC all move together.
Left: predicted score distributions for the two classes. The vertical line is the threshold τ; everything to its right is predicted positive. Top-right: the confusion matrix. Bottom-right: the ROC curve with the current operating point marked. Lowering τ catches more positives (higher recall, lower precision); raising τ does the opposite. With 10% positives, the F1-optimal threshold drifts away from 0.5.
Accuracy = (TP + TN) / total. Useful when classes are balanced and costs are equal. Often misleading otherwise.
Precision = TP / (TP + FP). "When I predict positive, how often am I right?" Prioritise it when false positives are expensive (a spam filter sending an important email to junk).
Recall (sensitivity) = TP / (TP + FN). "Of all the actual positives, how many did I find?" Prioritise it when false negatives are expensive (cancer screening, fraud).
F1 = 2 · precision · recall / (precision + recall), the harmonic mean of precision and recall. A single number that is pulled toward the worse of the two. A common default for imbalanced classification reports.
ROC-AUC = probability that a random positive is scored higher than a random negative. Threshold-independent; useful for ranking and overall discriminative power. Use PR-AUC for highly imbalanced data — ROC-AUC can stay high even when the model is useless on the minority class.
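The definitions above can be checked by hand on a tiny example. The sketch below uses made-up labels and scores, and the helper names `precision_recall_f1` and `pairwise_auc` are illustrative rather than library functions. The AUC helper implements the "random positive beats random negative" reading directly, which is quadratic and only sensible for small inputs.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from confusion-matrix counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def pairwise_auc(y_true, scores):
    """ROC-AUC as P(score of a random positive > score of a random negative).

    Ties count as half a win. Cost is O(n_pos * n_neg).
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Illustrative data (made up for this example).
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.4, 0.45, 0.6, 0.7, 0.9]

# Threshold at 0.5 to get hard labels, then count outcomes.
y_pred = [int(s >= 0.5) for s in scores]
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

precision, recall, f1 = precision_recall_f1(tp, fp, fn)
auc = pairwise_auc(y_true, scores)
print(f"P = {precision:.3f}  R = {recall:.3f}  F1 = {f1:.3f}  AUC = {auc:.3f}")
```

Precision, recall and F1 depend on the 0.5 cut; the AUC is computed from the raw scores and does not.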
Reach for it when
Accuracy: balanced classes, equal costs
Precision / Recall: care about FP vs FN asymmetry
F1: imbalanced classes, single-number summary
ROC-AUC: ranking quality, threshold not fixed
PR-AUC: highly imbalanced classes (e.g. ~1% positives)
Watch out when
Accuracy on imbalanced data is misleading
ROC-AUC on very imbalanced data is misleading
F1 gives equal weight to precision and recall; use Fβ if one matters more to you than the other
Macro vs micro vs weighted matters for multi-class
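For the Fβ point above, here is a minimal sketch of the standard formula Fβ = (1 + β²) · P · R / (β² · P + R). The function name `f_beta` and the precision / recall values are illustrative assumptions.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean: beta > 1 favours recall, beta < 1 favours precision."""
    b2 = beta ** 2
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom else 0.0


# Illustrative operating point: precise but with weak recall.
precision, recall = 0.9, 0.5

f1 = f_beta(precision, recall, beta=1.0)
f2 = f_beta(precision, recall, beta=2.0)      # recall weighted more heavily
f_half = f_beta(precision, recall, beta=0.5)  # precision weighted more heavily

print(f"F0.5 = {f_half:.3f}  F1 = {f1:.3f}  F2 = {f2:.3f}")
```

Because recall is the weaker number in this example, the score falls as β grows.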
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, average_precision_score, classification_report,
    confusion_matrix,
)
# One-number summaries
acc = accuracy_score(y_true, y_pred)
p = precision_score(y_true, y_pred)
r = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
# Threshold-independent (need probabilities, not labels)
auc = roc_auc_score(y_true, y_prob)
ap = average_precision_score(y_true, y_prob) # PR-AUC; better for imbalance
# The big picture
print(classification_report(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))
# Multi-class: macro vs micro vs weighted
f1_macro = f1_score(y_true, y_pred, average="macro") # mean of per-class F1
f1_micro = f1_score(y_true, y_pred, average="micro") # = accuracy for single-label multi-class
f1_weighted = f1_score(y_true, y_pred, average="weighted") # by support
Want PR-curves, cost-sensitive thresholds, and multi-class metrics?
Picking the right threshold. The model outputs a probability; you choose where to cut it. 0.5 is rarely optimal — pick the threshold that maximises F1 (or Fβ for asymmetric costs) on a validation set. For deployment, freeze that threshold and report performance at it.
Cost-sensitive thresholds. If FP costs cFP and FN costs cFN (and correct decisions cost nothing), the optimal threshold on a calibrated probability, under a Bayes-decision argument, is τ* = cFP / (cFP + cFN). For cancer screening (FN is catastrophic), τ* shifts way below 0.5.
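The Bayes-decision argument can be made concrete by comparing the expected cost of the two actions. This is a sketch under stated assumptions: `p` is a calibrated probability of the positive class, correct decisions cost nothing, and the helper names and cost values are illustrative.

```python
def expected_costs(p, c_fp, c_fn):
    """Expected cost of each action when the case is positive with probability p."""
    cost_if_predict_pos = (1 - p) * c_fp  # wrong only if the case is actually negative
    cost_if_predict_neg = p * c_fn        # wrong only if the case is actually positive
    return cost_if_predict_pos, cost_if_predict_neg


def cost_threshold(c_fp, c_fn):
    """Probability at which the two expected costs are equal."""
    return c_fp / (c_fp + c_fn)


c_fp, c_fn = 1.0, 10.0
tau = cost_threshold(c_fp, c_fn)

for p in (0.05, tau, 0.20):
    pos_cost, neg_cost = expected_costs(p, c_fp, c_fn)
    decision = "positive" if pos_cost <= neg_cost else "negative"
    print(f"p = {p:.3f}: predict-positive costs {pos_cost:.3f}, "
          f"predict-negative costs {neg_cost:.3f} -> {decision}")
```

Below τ* predicting negative is cheaper, above it predicting positive is cheaper, and at τ* the two costs tie.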
PR vs ROC. ROC plots TPR vs FPR; PR plots precision vs recall. Both span the same threshold range. On heavily imbalanced data, ROC can be deceptive — a high AUC can hide poor minority-class performance. PR-AUC is the right summary there.
Multi-class. Macro-averaged F1 = mean of per-class F1; treats every class equally.
Micro-averaged F1 = aggregate TP/FP/FN across classes first; dominated by big classes.
Weighted = macro but weighted by class support.
Pick macro if you care equally about every class (rare-class performance matters); pick micro if overall per-example performance matters more (large classes dominate).
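The three averages can be worked through by hand on a small three-class example. The labels below are made up, and the helper names are illustrative; the sketch assumes a single-label multi-class problem.

```python
def per_class_counts(y_true, y_pred, labels):
    """Count TP / FP / FN for every class in a single-label multi-class problem."""
    counts = {c: {"tp": 0, "fp": 0, "fn": 0} for c in labels}
    for t, p in zip(y_true, y_pred):
        if t == p:
            counts[t]["tp"] += 1
        else:
            counts[p]["fp"] += 1  # predicted class p gains a false positive
            counts[t]["fn"] += 1  # true class t gains a false negative
    return counts


def f1_from_counts(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


labels = [0, 1, 2]
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2]  # class 2 is rare
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]  # and is never predicted

counts = per_class_counts(y_true, y_pred, labels)
per_class_f1 = {c: f1_from_counts(**counts[c]) for c in labels}
support = {c: y_true.count(c) for c in labels}

macro_f1 = sum(per_class_f1.values()) / len(labels)
weighted_f1 = sum(per_class_f1[c] * support[c] for c in labels) / len(y_true)
total = {k: sum(counts[c][k] for c in labels) for k in ("tp", "fp", "fn")}
micro_f1 = f1_from_counts(**total)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(f"macro = {macro_f1:.3f}  weighted = {weighted_f1:.3f}  "
      f"micro = {micro_f1:.3f}  accuracy = {accuracy:.3f}")
```

The missed rare class drags macro F1 well below the other two, while micro F1 coincides with accuracy.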
Multi-label vs multi-class. Multi-class: one label per example, K options. Multi-label: any subset of K. Metrics generalise differently — Hamming loss, subset accuracy, per-label F1.
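As a rough illustration of the multi-label metrics named above, the sketch below computes Hamming loss, subset accuracy, and per-label F1 on a made-up indicator matrix (rows are examples, columns are K = 3 labels). The data and the helper name `f1_for_label` are assumptions for the example.

```python
# Each row is one example; each column says whether that label applies.
Y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
Y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 1, 1]]

n_examples, n_labels = len(Y_true), len(Y_true[0])

# Hamming loss: fraction of individual label slots that are wrong.
wrong_slots = sum(t != p for rt, rp in zip(Y_true, Y_pred) for t, p in zip(rt, rp))
hamming_loss = wrong_slots / (n_examples * n_labels)

# Subset accuracy: fraction of examples whose whole label set is exactly right.
subset_accuracy = sum(rt == rp for rt, rp in zip(Y_true, Y_pred)) / n_examples


def f1_for_label(k):
    """Binary F1 for label column k, treating each label as its own problem."""
    tp = sum(rt[k] == 1 and rp[k] == 1 for rt, rp in zip(Y_true, Y_pred))
    fp = sum(rt[k] == 0 and rp[k] == 1 for rt, rp in zip(Y_true, Y_pred))
    fn = sum(rt[k] == 1 and rp[k] == 0 for rt, rp in zip(Y_true, Y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


per_label_f1 = [f1_for_label(k) for k in range(n_labels)]

print(f"Hamming loss = {hamming_loss:.2f}, subset accuracy = {subset_accuracy:.2f}")
print("per-label F1 =", [round(v, 3) for v in per_label_f1])
```

Subset accuracy is strict: one wrong slot fails the whole example, so it can be low even when Hamming loss is modest.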
Top-k accuracy. Was the right class in the top k predictions? Standard for ImageNet (top-5). Useful when there's structural ambiguity ("dog breed" could be any of several reasonable guesses).
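Top-k accuracy is short enough to write out directly. A minimal sketch, assuming one row of class scores per example; the function name `top_k_accuracy` and the scores are made up for illustration.

```python
def top_k_accuracy(y_true, score_rows, k):
    """Fraction of examples whose true class is among the k highest-scored classes."""
    hits = 0
    for label, scores in zip(y_true, score_rows):
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        hits += label in ranked[:k]
    return hits / len(y_true)


y_true = [0, 2, 1, 3]
score_rows = [
    [0.6, 0.2, 0.1, 0.1],  # true class 0 is ranked first
    [0.5, 0.1, 0.3, 0.1],  # true class 2 is ranked second
    [0.4, 0.1, 0.3, 0.2],  # true class 1 is ranked last
    [0.1, 0.2, 0.3, 0.4],  # true class 3 is ranked first
]

top1 = top_k_accuracy(y_true, score_rows, k=1)
top2 = top_k_accuracy(y_true, score_rows, k=2)
print(f"top-1 = {top1:.2f}, top-2 = {top2:.2f}")
```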
import numpy as np
from sklearn.metrics import precision_recall_curve, roc_curve
# Find the threshold that maximises F1 on validation
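p, r, thresholds = precision_recall_curve(y_val, p_val)
# p and r have one more entry than thresholds; drop the final (recall = 0) point so indices line up
f1_arr = 2 * p[:-1] * r[:-1] / (p[:-1] + r[:-1] + 1e-12)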
best_idx = np.argmax(f1_arr)
tau_best = thresholds[best_idx]
print(f"Best τ = {tau_best:.2f} F1 = {f1_arr[best_idx]:.3f}")
# Cost-sensitive choice — FN costs 10x FP
c_fp, c_fn = 1, 10
tau_cost = c_fp / (c_fp + c_fn) # → 0.09
y_hat_cost = (p_test >= tau_cost).astype(int)
Reliability (lower is better): small when predicted probabilities match observed frequencies.
Resolution (higher is better): large when the model assigns different probabilities to different cases.
Brier score & proper scoring. Brier = mean squared error between predicted probabilities and labels (one-hot). A "proper" scoring rule is one whose expected value is minimised by reporting the true probability. Brier and log-loss are both proper; accuracy isn't. When you need the probabilities themselves, optimise a proper scoring rule.
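What "proper" means can be checked numerically. The sketch below assumes a single binary event that is positive with probability 0.3 (an arbitrary illustrative value) and compares the expected Brier score with expected accuracy for different reported probabilities; the helper names are hypothetical.

```python
def expected_brier(q, p_true):
    """Expected squared error when reporting q and the label is 1 with probability p_true."""
    return p_true * (1 - q) ** 2 + (1 - p_true) * q ** 2


def expected_accuracy(q, p_true, threshold=0.5):
    """Expected accuracy of the hard decision obtained by thresholding the report q."""
    return p_true if q >= threshold else 1 - p_true


p_true = 0.3
grid = [i / 100 for i in range(101)]

# The expected Brier score has a unique minimum at the honest report.
best_q = min(grid, key=lambda q: expected_brier(q, p_true))

# Accuracy cannot tell an honest 0.3 from an overconfident 0.0.
acc_honest = expected_accuracy(0.3, p_true)
acc_overconfident = expected_accuracy(0.0, p_true)

print(f"Brier-optimal report = {best_q:.2f}")
print(f"expected accuracy: honest = {acc_honest:.2f}, overconfident = {acc_overconfident:.2f}")
```

Any report below the threshold earns the same expected accuracy, so accuracy gives no incentive to state the true probability.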
Expected calibration error (ECE). Bin predictions by predicted probability; in each bin compare the average predicted probability to the fraction actually positive. Average the absolute gaps, weighting each bin by the share of examples it holds. Useful, but its dependence on the binning scheme makes it brittle; alternatives include MCE (maximum calibration error) and calibration-aware scoring rules.
Calibration vs discrimination. A model can have great AUC but terrible calibration (good ranking but wrong probabilities) and vice versa. The two are largely independent properties. Post-hoc calibration (Platt scaling, isotonic regression) fits a monotone map on the scores, so it fixes calibration while leaving the ranking, and hence discrimination, essentially unchanged (isotonic regression can introduce ties). That is useful when you deploy a model whose scores you want to interpret as probabilities.
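The split between ranking and probability quality can be seen by distorting scores with a strictly increasing function. In this sketch the data are made up and squaring stands in for an arbitrary monotone distortion; it is not a calibration method.

```python
def pairwise_auc(y_true, scores):
    """P(random positive outscores random negative); ties count as half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def brier(y_true, probs):
    """Mean squared error between predicted probabilities and binary labels."""
    return sum((p - t) ** 2 for t, p in zip(y_true, probs)) / len(y_true)


y_true = [0, 0, 1, 0, 1, 1]
probs = [0.1, 0.3, 0.4, 0.5, 0.7, 0.9]

# Squaring is strictly increasing on [0, 1]: same ranking, different probabilities.
distorted = [p ** 2 for p in probs]

auc_before, auc_after = pairwise_auc(y_true, probs), pairwise_auc(y_true, distorted)
brier_before, brier_after = brier(y_true, probs), brier(y_true, distorted)

print(f"AUC:   {auc_before:.3f} -> {auc_after:.3f}")
print(f"Brier: {brier_before:.3f} -> {brier_after:.3f}")
```

The AUC is identical before and after, while the Brier score changes, which is the sense in which a monotone recalibration map can alter probability quality without touching the ranking.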
Class-conditional metrics. For very imbalanced problems, look at per-class precision / recall / F1 separately. The overall macro-F1 hides cases where one class is great and another is awful.
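Threshold-independent metrics under shift. AUC and AP are invariant to strictly increasing score transformations, which makes them useful for comparing models that score on different scales. They behave differently under base-rate shift: ROC-AUC is unchanged when only the class prior moves, because TPR and FPR are each computed within a single class, whereas AP drops as positives get rarer because precision depends on the prior. Calibrated probabilities and any fixed threshold are also prior-dependent, so if the production prior differs from training, post-hoc re-calibration is required.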
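To see how the base rate feeds into precision-based metrics, the sketch below holds one operating point fixed and varies only the prevalence. The TPR / FPR values and the function name `precision_at` are illustrative assumptions.

```python
def precision_at(tpr, fpr, prevalence):
    """Precision implied by a fixed (TPR, FPR) operating point at a given base rate."""
    tp_rate = tpr * prevalence        # share of all cases that are true positives
    fp_rate = fpr * (1 - prevalence)  # share of all cases that are false positives
    return tp_rate / (tp_rate + fp_rate)


tpr, fpr = 0.80, 0.05  # illustrative operating point, held fixed

precisions = {prev: precision_at(tpr, fpr, prev) for prev in (0.5, 0.1, 0.01)}
for prev, prec in precisions.items():
    print(f"prevalence = {prev:>4}: precision = {prec:.3f}")
```

With the same TPR and FPR, precision falls sharply as positives become rarer, so a precision-recall summary measured at one base rate does not transfer to another.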
Cohen's κ and MCC. Cohen's kappa adjusts accuracy for the chance level. Matthews Correlation Coefficient is a balanced metric on the confusion matrix; symmetric, robust to imbalance. Both are sometimes preferred for biological / medical applications.
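Both statistics can be computed straight from the four confusion-matrix counts. A minimal sketch for the binary case, using the standard formulas; the counts are made up, and returning 0.0 when the denominator vanishes is a convention assumed here.

```python
import math


def cohen_kappa(tp, fp, fn, tn):
    """Agreement beyond chance: (observed - chance) / (1 - chance)."""
    n = tp + fp + fn + tn
    p_observed = (tp + tn) / n
    p_chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    return (p_observed - p_chance) / (1 - p_chance) if p_chance != 1 else 0.0


def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient for a binary confusion matrix."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Always-negative baseline on a 95% / 5% split: 95% accurate, no skill.
baseline = dict(tp=0, fp=0, fn=50, tn=950)
# A hypothetical model that actually finds most positives.
model = dict(tp=40, fp=20, fn=10, tn=930)

for name, cm in (("baseline", baseline), ("model", model)):
    acc = (cm["tp"] + cm["tn"]) / sum(cm.values())
    print(f"{name}: accuracy = {acc:.3f}, kappa = {cohen_kappa(**cm):.3f}, "
          f"MCC = {mcc(**cm):.3f}")
```

The baseline scores zero on both κ and MCC despite its high accuracy, which is the behaviour that makes these metrics attractive on imbalanced data.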
import numpy as np
from sklearn.metrics import brier_score_loss, matthews_corrcoef
# Brier score — lower is better, proper, sensitive to calibration
brier = brier_score_loss(y_true, p_pred)
# Matthews correlation coefficient — balanced, robust to imbalance
mcc = matthews_corrcoef(y_true, y_pred)
# Expected calibration error (ECE) — quick implementation
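def ece(y_true, p_pred, n_bins=15):
    """Expected calibration error with equal-width bins (binary labels in {0, 1})."""
    y_true, p_pred = np.asarray(y_true), np.asarray(p_pred)
    bins = np.linspace(0, 1, n_bins + 1)
    e = 0.0
    for i, (lo, hi) in enumerate(zip(bins[:-1], bins[1:])):
        # Half-open bins [lo, hi); the last bin is closed so p = 1.0 is counted
        upper = (p_pred <= hi) if i == n_bins - 1 else (p_pred < hi)
        mask = (p_pred >= lo) & upper
        if mask.sum() == 0:
            continue
        conf = p_pred[mask].mean()             # average predicted prob
        frac_pos = (y_true[mask] == 1).mean()  # observed positive rate
        e += (mask.sum() / len(y_true)) * abs(conf - frac_pos)
    return e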