When the rare class is the one you care about — fraud, disease, anomalies.
Key idea
"99% accuracy" is meaningless when 99% of cases are negative. A constant predictor wins. The interesting class is the rare one — fraud, cancer, anomalies — and the model needs to actually find it. The fixes: re-weight the loss, resample the data, change the decision threshold, or change the metric you optimise.
Interactive demo: slide the class ratio and toggle the fix to watch a logistic regression go from "predict everything negative" to "actually find the rare class".
A logistic-regression boundary on a 2D dataset with adjustable class imbalance. With no fix and 5% positives, the model's predicted-positive region often collapses to a tiny area (or misses the positives entirely) while it still reports about 95% accuracy. Class weights tell the loss that false negatives cost more. Oversample duplicates positives to even out the loss. Undersample drops negatives. All three pull the boundary back into useful territory.
Re-weight the loss. Multiply each positive example's contribution by N_neg / N_pos (or use class_weight="balanced" in sklearn). Cheapest fix; usually the first thing to try.
Oversample the minority. Duplicate (or SMOTE-synthesize) positive examples until the classes are balanced. Risk: with naive duplication, the model can memorize the few real positives. SMOTE instead interpolates between a positive and its nearest positive neighbours, which makes memorization less likely.
Undersample the majority. Drop most negatives. Fast, but you lose information. Useful when negatives are abundant and computation is the bottleneck.
Change the threshold. Train the model normally, then choose the operating point that maximises your real metric (F1, recall at fixed precision, expected cost). Often the simplest fix and the only one that matters at deployment time.
Change the metric. Stop reporting accuracy. Use F1, precision-recall AUC, or a cost-sensitive metric that reflects your real-world preferences over FP vs FN.
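As a rough illustration of the first and fourth fixes, the sketch below trains scikit-learn logistic regressions on synthetic data. The dataset, the 5% positive rate, and the choice of F1 as the target metric are assumptions made for the example, not part of any particular application.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, recall_score
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 5% positives (an assumption for illustration)
X, y = make_classification(n_samples=5000, n_features=10,
                           weights=[0.95, 0.05], random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Fix 1: re-weight the loss so each class contributes equally
plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)
recall_plain = recall_score(y_val, plain.predict(X_val))
recall_weighted = recall_score(y_val, weighted.predict(X_val))

# Fix 4: keep the plain model and move the threshold to maximise F1
proba = plain.predict_proba(X_val)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_val, proba)
f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
best_threshold = thresholds[np.argmax(f1[:-1])]  # the last PR point has no threshold
recall_tuned = recall_score(y_val, (proba >= best_threshold).astype(int))

print(f"recall  plain={recall_plain:.2f}  weighted={recall_weighted:.2f}  "
      f"tuned={recall_tuned:.2f}  (threshold={best_threshold:.2f})")
```

In practice the threshold would be chosen on a validation fold and reported on a separate test fold; here both steps share one fold to keep the sketch short.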
Reach for it when
One class is < 10% of the data
False negatives are much more expensive than false positives (or vice versa)
The model is reporting high accuracy but useless recall
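The last symptom is easy to reproduce with arithmetic alone. A minimal sketch, using hypothetical toy labels (990 negatives, 10 positives) and a predictor that always outputs the majority class:

```python
# Hypothetical toy labels: 990 negatives, 10 positives (1% positive rate)
y_true = [0] * 990 + [1] * 10

# A "model" that ignores its input and always predicts the majority class
y_pred = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_pos / sum(y_true)

print(f"accuracy={accuracy:.2f}  recall={recall:.2f}")  # accuracy=0.99  recall=0.00
```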
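Cost-optimal threshold: t* = c_FP / (c_FP + c_FN), assuming correct decisions cost nothing
c_FP: cost of a false positive; c_FN: cost of a false negative
If c_FN is high (catching fraud), the threshold drops well below 0.5
Optimal decisions just need calibrated probabilities + costs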
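The cost-based threshold rule these notes refer to can be sketched in a few lines. This assumes correct decisions cost nothing and that the model's probabilities are calibrated; the function names and the cost values are hypothetical.

```python
def cost_optimal_threshold(c_fp: float, c_fn: float) -> float:
    """Threshold on P(y=1 | x) that minimises expected cost (correct decisions cost 0)."""
    # Expected cost of predicting positive: (1 - p) * c_fp
    # Expected cost of predicting negative: p * c_fn
    # Predicting positive is cheaper when p >= c_fp / (c_fp + c_fn).
    return c_fp / (c_fp + c_fn)


def decide(p_positive: float, c_fp: float, c_fn: float) -> bool:
    """Return True when flagging the case has the lower expected cost."""
    return p_positive >= cost_optimal_threshold(c_fp, c_fn)


# Hypothetical fraud costs: a false alarm costs 1, a missed fraud costs 50
threshold = cost_optimal_threshold(c_fp=1.0, c_fn=50.0)
print(round(threshold, 4))          # 0.0196
print(decide(0.05, 1.0, 50.0))      # True: a 5% fraud probability is already worth flagging
```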
SMOTE and its descendants. SMOTE interpolates between minority points and their k-nearest minority neighbours. Borderline-SMOTE focuses on the decision boundary; ADASYN weights synthetic samples toward harder cases. All assume a continuous feature space where Euclidean distance is meaningful, so they do not apply directly to categorical or text features (SMOTE-NC is a variant for mixed numeric and categorical data).
Focal loss. Lin et al. (2017). Multiplies cross-entropy by (1 − p_t)^γ, which down-weights easy examples. Lets the model concentrate gradient on hard cases. Originally for object detection; useful for any extreme imbalance.
Two-stage cascades. Train a fast, conservative classifier to filter out obvious negatives; pass the rest to an expensive precise model. Lets you use different fixes at each stage. Standard in real-world fraud and anomaly detection.
Anomaly detection framing. When positives are very rare (< 0.1%), classification is the wrong frame. Treat positives as anomalies — train on negatives only, flag points the model can't reconstruct or that have low density. See the Anomaly Detection page.
Beware naive metrics. ROC-AUC can stay high even when a model is useless on the minority class — flip to PR-AUC, which is sensitive to base-rate shifts. F1 weighs precision and recall equally; F2 weights recall more (use it when missing positives is worse).
Calibration after resampling. Oversampling and class weights change the effective base rate the model sees during training, so its output probabilities are biased toward the resampled ratio. Post-hoc calibration on a held-out fold that keeps the original class ratio (Platt or isotonic) fixes this.
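import torch
import torch.nn.functional as F

# Focal loss for binary classification with extreme imbalance
def focal_loss(logits, y, gamma=2.0, alpha=0.25):
    # y: float tensor of 0/1 labels with the same shape as logits
    bce = F.binary_cross_entropy_with_logits(logits, y, reduction="none")
    p = torch.sigmoid(logits)
    pt = y * p + (1 - y) * (1 - p)         # probability assigned to the true class
    w = alpha * y + (1 - alpha) * (1 - y)  # alpha weights positives, 1 - alpha negatives
    return (w * (1 - pt).pow(gamma) * bce).mean()

# Calibrate after resampling: Platt scaling on a held-out fold
# resampled_model is an already-fitted classifier; X_calib, y_calib keep the original class ratio
from sklearn.calibration import CalibratedClassifierCV
# method="sigmoid" is Platt scaling; method="isotonic" is the non-parametric alternative
# (recent scikit-learn releases deprecate cv="prefit" in favour of wrapping the model in FrozenEstimator)
cal = CalibratedClassifierCV(resampled_model, method="sigmoid", cv="prefit")
cal.fit(X_calib, y_calib)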
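The warning about naive metrics can be checked numerically. The sketch below is a simulation under stated assumptions: a fixed hypothetical scorer whose outputs are N(0, 1) for negatives and N(1.5, 1) for positives, evaluated once at a 50% and once at a 1% positive rate, with an arbitrary cut-off of 1.0 for the F-scores. ROC-AUC should come out roughly the same in both settings while average precision (a PR-AUC estimate) drops sharply.

```python
import numpy as np
from sklearn.metrics import average_precision_score, fbeta_score, roc_auc_score

rng = np.random.default_rng(0)


def simulate(n_neg, n_pos):
    """Labels and scores from a fixed hypothetical scorer (an assumption, not a real model)."""
    y = np.concatenate([np.zeros(n_neg, dtype=int), np.ones(n_pos, dtype=int)])
    s = np.concatenate([rng.normal(0.0, 1.0, n_neg), rng.normal(1.5, 1.0, n_pos)])
    return y, s


y_bal, s_bal = simulate(5000, 5000)   # 50% positives
y_rare, s_rare = simulate(9900, 100)  # 1% positives

roc_bal, roc_rare = roc_auc_score(y_bal, s_bal), roc_auc_score(y_rare, s_rare)
ap_bal = average_precision_score(y_bal, s_bal)
ap_rare = average_precision_score(y_rare, s_rare)

# F1 weighs precision and recall equally; F2 weights recall more heavily
pred_rare = (s_rare >= 1.0).astype(int)
f1_rare = fbeta_score(y_rare, pred_rare, beta=1)
f2_rare = fbeta_score(y_rare, pred_rare, beta=2)

print(f"ROC-AUC  balanced={roc_bal:.3f}  rare={roc_rare:.3f}")
print(f"PR-AUC   balanced={ap_bal:.3f}  rare={ap_rare:.3f}")
print(f"rare class at cut-off 1.0: F1={f1_rare:.3f}  F2={f2_rare:.3f}")
```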
Want recursive sub-sampling, MWMOTE, one-class, and synthetic data?
The posterior under resampling needs a Bayes correction to recover true probabilities
This is why models trained on resampled data are mis-calibrated by default
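One common form of this correction rescales the odds by the ratio of true to training prior odds. The sketch below assumes resampling changed only the class ratio and not the within-class feature distributions, which is exact for random over- or undersampling and only approximate for SMOTE. The function name and the example rates are illustrative.

```python
def correct_for_resampling(p_resampled: float,
                           train_pos_rate: float,
                           true_pos_rate: float) -> float:
    """Map a probability learned at train_pos_rate back to the true base rate.

    Assumes 0 < p_resampled < 1 and that only the class priors were changed.
    """
    prior_odds_true = true_pos_rate / (1.0 - true_pos_rate)
    prior_odds_train = train_pos_rate / (1.0 - train_pos_rate)
    # Bayes' rule in odds form: posterior odds scale with the prior odds
    odds = p_resampled / (1.0 - p_resampled) * prior_odds_true / prior_odds_train
    return odds / (1.0 + odds)


# A model trained on a 50/50 resample says 0.5; the real positive rate is 1%
p_true = correct_for_resampling(0.5, train_pos_rate=0.5, true_pos_rate=0.01)
print(round(p_true, 4))  # 0.01
```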
Beyond SMOTE. Variants like Borderline-SMOTE-1/2, MWMOTE (majority-weighted minority), and SVMSMOTE focus synthetic examples near the boundary or where the model is uncertain. All share the same Euclidean assumption.
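Cluster- and neighbour-based undersampling. Cluster the majority class and keep only representatives from each cluster (e.g. ClusterCentroids), or use nearest-neighbour rules to choose which negatives to keep or drop (e.g. NearMiss, Tomek links). Both aim to shrink or clean the majority class while losing less information than random undersampling.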
Cost-sensitive learning. Build the cost matrix directly into the loss / split criterion. MetaCost (Domingos, 1999) and instance-weighted boosting are classical examples; modern deep nets use class-weighted or cost-sensitive loss functions.
One-class learning. When positives are essentially absent at training time, model only the negatives and flag anything that doesn't fit. One-class SVM, Isolation Forest, deep autoencoders for reconstruction error — all standard anomaly-detection tools.
Synthetic / generative data. When real positives are rare and expensive, GANs or diffusion models trained on the small positive set can supply additional training data. Quality is hard to verify; the practical danger is generating subtly off-distribution samples that the model overfits to.
Subgroup imbalance. A model can be well-calibrated overall but mis-calibrated on a subgroup defined by sensitive attributes. Imbalance compounds with fairness concerns; both have to be analysed together (see the Fairness page).
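To make concrete the interpolation step that SMOTE and its variants share, here is a simplified NumPy sketch. It is not the imbalanced-learn implementation: it picks base points uniformly at random, uses brute-force distances, and the function name and toy data are hypothetical.

```python
import numpy as np


def smote_sketch(X_min, n_synthetic, k=5, seed=0):
    """Simplified SMOTE: interpolate between minority points and their k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)  # needs at least two minority points

    # Pairwise Euclidean distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                # a point is not its own neighbour
    neighbours = np.argsort(d, axis=1)[:, :k]  # indices of the k nearest neighbours

    base = rng.integers(0, n, size=n_synthetic)                  # random minority point
    nb = neighbours[base, rng.integers(0, k, size=n_synthetic)]  # one of its neighbours
    gap = rng.random((n_synthetic, 1))                           # position along the segment
    return X_min[base] + gap * (X_min[nb] - X_min[base])


X_min = np.random.default_rng(1).normal(size=(20, 2))  # hypothetical minority points
X_new = smote_sketch(X_min, n_synthetic=100)
print(X_new.shape)  # (100, 2)
```

Every synthetic point lies on a straight segment between two real minority points, which is why the method depends on Euclidean distance being meaningful in the feature space.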
from imblearn.over_sampling import BorderlineSMOTE, ADASYN
from imblearn.combine import SMOTETomek
# Borderline-SMOTE — interpolate only near the decision boundary
b = BorderlineSMOTE(kind="borderline-2", k_neighbors=5)
X_bal, y_bal = b.fit_resample(X_train, y_train)
# SMOTE + Tomek-link cleaning — generate synthetic, then drop noisy pairs
st = SMOTETomek(sampling_strategy="auto")
X_clean, y_clean = st.fit_resample(X_train, y_train)
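# One-class anomaly detection: fit on negatives only
from sklearn.ensemble import IsolationForest
iforest = IsolationForest(contamination=0.01, random_state=0)
# score_samples returns higher values for more normal points, so negate it to get an anomaly score
y_score = -iforest.fit(X_neg).score_samples(X_test)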