Find the points that don't look like the rest — fraud, defects, intrusions, the rare and weird.
Key idea
Learn what "normal" looks like; flag what doesn't. Anomaly detection is unsupervised — you usually don't have labelled anomalies, just the assumption that most of your data is normal and a few rare points are not.
[Interactive figure: slide the threshold τ; every point below the cutoff density gets flagged in terracotta. Settings shown: τ = 0.25, h = 0.12]
The cream-to-indigo heatmap is a kernel density estimate of "where normal lives". A point is flagged as an anomaly when its local density score drops below the threshold τ. Lower the bandwidth h and the estimate becomes sensitive to local quirks (it over-fits); raise it and the estimate smooths out, so only the most isolated points stand out. The Ring dataset is a classic: the genuinely anomalous points live inside the ring, where methods that measure distance from the centre of the data would happily call them "central and normal".
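To make the two knobs concrete, here is a minimal sketch of the same idea in code, assuming scikit-learn's `KernelDensity` as the density model. The ring data, the anomaly positions, and the values of `h` and `tau` are made up for illustration; they are not the page's actual dataset.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)

# Hypothetical ring dataset: 300 "normal" points near the unit circle,
# plus 5 anomalies placed near the centre of the ring.
angles = rng.uniform(0.0, 2.0 * np.pi, size=300)
radii = 1.0 + rng.normal(0.0, 0.05, size=300)
ring = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
inside = rng.normal(0.0, 0.05, size=(5, 2))
X = np.vstack([ring, inside])

h = 0.12      # bandwidth (assumed value)
tau = 0.25    # density threshold (assumed value)

kde = KernelDensity(kernel="gaussian", bandwidth=h).fit(X)
density = np.exp(kde.score_samples(X))    # score_samples returns the log-density
flagged = density < tau

# For contrast: distance from the data mean ranks the centre points as the most "normal".
dist_from_mean = np.linalg.norm(X - X.mean(axis=0), axis=1)

print("flagged inside-ring points:", int(flagged[-5:].sum()), "of 5")
print("largest centre distance:", dist_from_mean[-5:].max().round(2),
      "| smallest ring distance:", dist_from_mean[:300].min().round(2))
```

With these settings a few ring points on the outer fringe may also fall below `tau`; that is the trade-off the threshold controls.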
Three classical strategies. Density-based: a point with low probability under a model of "normal" is anomalous. Distance-based: a point far from its nearest neighbours is anomalous. Reconstruction-based: train a model to compress and reconstruct normal data; points it reconstructs badly are anomalous.
The right choice depends on what "anomalous" means in your domain — a fraudster looks different from a manufacturing defect looks different from a network intrusion.
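The three strategies can be put side by side on one toy problem. This is a sketch under stated assumptions: the data are synthetic, a KDE stands in for the density model, mean distance to 5 neighbours for the distance score, and a 1-component PCA for the "compress and reconstruct" model. None of these choices comes from the text above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KernelDensity, NearestNeighbors

rng = np.random.default_rng(0)

# Hypothetical data: 200 normal points scattered along the line y = 2x,
# and one query point that sits well off that line.
t = rng.normal(0.0, 1.0, size=200)
X_normal = np.column_stack([t, 2.0 * t + rng.normal(0.0, 0.1, size=200)])
x_odd = np.array([[0.0, 3.0]])

kde = KernelDensity(bandwidth=0.3).fit(X_normal)
knn = NearestNeighbors(n_neighbors=5).fit(X_normal)
pca = PCA(n_components=1).fit(X_normal)

def density_score(X):
    """Negative log-density under the KDE: higher = more anomalous."""
    return -kde.score_samples(X)

def distance_score(X):
    """Mean distance to the 5 nearest normal points."""
    dist, _ = knn.kneighbors(X)
    return dist.mean(axis=1)

def reconstruction_score(X):
    """Squared error after compressing to 1 principal component and back."""
    recon = pca.inverse_transform(pca.transform(X))
    return ((X - recon) ** 2).sum(axis=1)

for score in (density_score, distance_score, reconstruction_score):
    cutoff = np.quantile(score(X_normal), 0.95)
    print(f"{score.__name__:22s} odd point: {score(x_odd)[0]:.2f}   95% of normal below: {cutoff:.2f}")
```

All three agree here because the odd point is far from everything. They start to disagree on harder cases, which is where the choice of strategy matters.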
Reach for it when
Fraud / intrusion / defect detection
You have plenty of normal data but few or no labelled anomalies
Monitoring sensor data for unusual patterns
Cleaning a dataset of outliers before modelling
Skip it when
You have labels for both classes — train a regular classifier (with class weights)
Anomalies are common enough to balance — it's just classification
"Anomalous" isn't well-defined and changes over time
You need to explain why a specific point was flagged
from sklearn.ensemble import IsolationForest
iso = IsolationForest(contamination=0.05, random_state=0).fit(X_train)
# -1 = anomaly, 1 = normal
labels = iso.predict(X_test)
scores = iso.score_samples(X_test) # lower = more anomalous
Each strategy assigns a continuous score — threshold to decide normal vs. anomalous
Choice of strategy = assumption about what makes anomalies "weird"
Isolation Forest. Builds random trees that split on randomly chosen features at random thresholds. Anomalies, sitting in sparse regions, get isolated in fewer splits. The score is derived from the average path length to isolation across the trees: shorter paths mean more anomalous. Cheap, scales well, and works in moderate dimensions. A sensible default for tabular anomaly detection.
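The "fewer splits" intuition is easy to check by hand. The sketch below is a simplified illustration, not scikit-learn's implementation: it skips the subsampling and the path-length normalisation that the real algorithm uses, and the helper names `isolation_depth` and `mean_depth` are hypothetical.

```python
import numpy as np

def isolation_depth(x, X, rng, depth=0, max_depth=50):
    """Count random axis-aligned splits until x is alone in its partition.

    X must contain x as one of its rows.
    """
    if len(X) <= 1 or depth >= max_depth:
        return depth
    j = rng.integers(X.shape[1])                  # pick a random feature
    lo, hi = X[:, j].min(), X[:, j].max()
    if lo == hi:                                  # cannot split on this feature
        return depth
    s = rng.uniform(lo, hi)                       # pick a random threshold
    keep = X[:, j] < s if x[j] < s else X[:, j] >= s   # rows on x's side of the split
    return isolation_depth(x, X[keep], rng, depth + 1, max_depth)

rng = np.random.default_rng(0)
cloud = rng.normal(0.0, 1.0, size=(256, 2))       # hypothetical "normal" data
central = np.array([0.0, 0.0])
isolated = np.array([6.0, 6.0])
X = np.vstack([cloud, central, isolated])

def mean_depth(x, n_trees=200):
    """Average isolation depth over many random partitions."""
    return np.mean([isolation_depth(x, X, rng) for _ in range(n_trees)])

depth_central = mean_depth(central)
depth_isolated = mean_depth(isolated)
print(f"central point: {depth_central:.1f} splits, isolated point: {depth_isolated:.1f} splits")
```

The isolated point needs far fewer splits on average, which is the signal Isolation Forest turns into a score.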
One-Class SVM. Fits a decision boundary around the "normal" data in a kernel feature space, treating the origin as the "anomaly side". Good with small data and a sensible kernel; doesn't scale to big data.
Local Outlier Factor (LOF). Compares each point's local density to its neighbours' local densities. Catches anomalies in heterogeneous-density data that global methods miss. Score > 1 means lower density than neighbours.
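A small sketch of why local density matters, assuming scikit-learn's `LocalOutlierFactor`. One practical detail: scikit-learn stores the negated factor in `negative_outlier_factor_`, so the sign is flipped below to recover the "greater than 1" convention. The two clusters and the odd point are made-up data.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor, NearestNeighbors

rng = np.random.default_rng(0)

# Hypothetical data: a tight cluster, a loose cluster, and one point
# sitting just outside the tight cluster.
tight = rng.normal(0.0, 0.1, size=(100, 2))
loose = rng.normal(5.0, 1.0, size=(100, 2))
odd = np.array([[0.8, 0.8]])
X = np.vstack([tight, loose, odd])

lof = LocalOutlierFactor(n_neighbors=20).fit(X)
lof_score = -lof.negative_outlier_factor_      # flip the sign: > 1 = sparser than neighbours

# Global alternative: distance to the 20th nearest neighbour (self excluded).
k_dist = NearestNeighbors(n_neighbors=20).fit(X).kneighbors()[0][:, -1]

print(f"LOF of odd point: {lof_score[-1]:.1f} (median over clusters: {np.median(lof_score[:200]):.2f})")
print("loose-cluster points with a larger k-distance than the odd point:",
      int((k_dist[100:200] > k_dist[-1]).sum()))
```

By raw neighbour distance the odd point looks no worse than ordinary members of the loose cluster; relative to its own neighbourhood it stands out clearly.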
Autoencoder reconstruction. Train an autoencoder on normal data; at inference time, flag points with high reconstruction error. Scales to images and high-dim data where other methods struggle.
Threshold setting. All methods give continuous scores; the threshold is a business decision (precision vs. recall trade-off). With no labels, use a quantile of training scores; with some labels, calibrate against the validation set.
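Both threshold recipes fit in a few lines. This is a sketch with synthetic data; the 1% quantile and the 90% target recall are placeholder business choices, and Isolation Forest is used only because it is the example model elsewhere on this page.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 3))                       # hypothetical unlabelled training data
X_val = np.vstack([rng.normal(size=(200, 3)),
                   rng.normal(loc=5.0, size=(10, 3))])     # hypothetical labelled validation set
y_val = np.r_[np.zeros(200, dtype=int), np.ones(10, dtype=int)]   # 1 = anomaly

iso = IsolationForest(random_state=0).fit(X_train)

# No labels: flag the lowest 1% of training scores (score_samples: lower = more anomalous).
train_scores = iso.score_samples(X_train)
tau_quantile = np.quantile(train_scores, 0.01)

# Some labels: choose the strictest threshold that still reaches the target recall.
anomaly_score = -iso.score_samples(X_val)                  # flip sign so higher = more anomalous
precision, recall, thresholds = precision_recall_curve(y_val, anomaly_score)
ok = recall[:-1] >= 0.9                                    # thresholds has one fewer entry than recall
tau_labelled = thresholds[ok].max()
flagged = anomaly_score >= tau_labelled

print(f"quantile threshold on training scores: {tau_quantile:.3f}")
print(f"validation recall at labelled threshold: {flagged[y_val == 1].mean():.2f}")
```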
Reach for it when
One-Class SVM: small / clean dataset with kernel intuition
LOF: heterogeneous local density (clusters of different sizes)
Autoencoder: images, time series, high-dim structured data
Skip it when
"Anomalous" depends on labelled examples and you have plenty of them
The scores of normal and anomalous points overlap so heavily that no threshold cleanly separates the classes
Anomalies arrive in groups (collective): single-point methods miss the group
The data-generating process drifts over time — model becomes stale
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM
from sklearn.preprocessing import StandardScaler
# X is assumed to be a numeric array of shape (n_samples, n_features)
X_scaled = StandardScaler().fit_transform(X)
methods = {
    "iforest": IsolationForest(contamination=0.05, random_state=0),
    "lof": LocalOutlierFactor(n_neighbors=20, contamination=0.05),
    "ocsvm": OneClassSVM(nu=0.05, kernel="rbf", gamma="scale"),
}
for name, m in methods.items():
    # All three estimators implement fit_predict; LOF in its default
    # (novelty=False) mode has no separate predict method.
    labels = m.fit_predict(X_scaled)  # -1 = anomaly, 1 = normal
    print(f"{name:10s} flagged {(labels == -1).sum():,} points as anomalous")
Want deep methods and contextual / collective anomalies?
Three flavours of anomaly
$$ \text{point} \;\subset\; \text{contextual} \;\subset\; \text{collective} $$
Point: one observation is anomalous regardless of context
Contextual: an observation is anomalous given a context (time, location)
Collective: a sequence / group is anomalous as a whole
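A contextual anomaly is easiest to see with a score computed per context. The sketch below uses made-up hourly temperatures with day/night as the context; the per-group z-score is one simple choice among many, not a prescribed method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical hourly temperatures over 30 days: cool nights (context 0), warm days (context 1).
is_day = np.tile(np.r_[np.zeros(12, dtype=int), np.ones(12, dtype=int)], 30)
temp = np.where(is_day == 1, 25.0, 10.0) + rng.normal(0.0, 1.5, size=is_day.size)

# Inject a 25-degree reading at night: ordinary globally, odd for its context.
idx = 5                       # an index where is_day == 0
temp[idx] = 25.0

# Point-style score: z-score against the whole series.
global_z = np.abs(temp - temp.mean()) / temp.std()

# Contextual score: z-score within each context group.
context_z = np.empty_like(temp)
for c in (0, 1):
    mask = is_day == c
    context_z[mask] = np.abs(temp[mask] - temp[mask].mean()) / temp[mask].std()

print(f"global z-score: {global_z[idx]:.1f}, contextual z-score: {context_z[idx]:.1f}")
```

The same reading is unremarkable against the full series and extreme against other night-time readings.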
Deep one-class methods. Deep SVDD (Ruff et al.) replaces the kernel feature map with a learned neural network and shrinks the data into a small hypersphere. Trained end-to-end; works for images and time series. Watch for representation collapse (every input maps to the centre).
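A stripped-down sketch of the one-class objective, assuming PyTorch. It is not the full recipe from the paper (which also uses autoencoder pretraining); the data, layer sizes, and training length are arbitrary. Bias-free layers and a fixed, non-zero centre are the safeguards the paper suggests against collapse.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X_normal = torch.randn(512, 8)                                     # hypothetical normal data
X_test = torch.cat([torch.randn(5, 8), torch.randn(5, 8) + 6.0])   # 5 normal, 5 shifted points

# No bias terms: with biases the network can map every input to the centre.
net = nn.Sequential(nn.Linear(8, 16, bias=False), nn.ReLU(), nn.Linear(16, 4, bias=False))

# Fix the centre c as the mean embedding of the untrained network.
with torch.no_grad():
    c = net(X_normal).mean(dim=0)

opt = torch.optim.Adam(net.parameters(), lr=1e-3, weight_decay=1e-5)
for _ in range(200):
    opt.zero_grad()
    loss = ((net(X_normal) - c) ** 2).sum(dim=1).mean()   # pull normal embeddings towards c
    loss.backward()
    opt.step()

# Anomaly score = squared distance from the centre in embedding space.
with torch.no_grad():
    score = ((net(X_test) - c) ** 2).sum(dim=1)

print("mean score, normal test points: ", score[:5].mean().item())
print("mean score, shifted test points:", score[5:].mean().item())
```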
Generative anomaly detection. Train a generative model (GAN, normalizing flow, diffusion) on normal data; anomalies are expected to have low likelihood or poor reconstructions. Methods in this family report strong results on industrial defect detection benchmarks such as MVTec AD. Caveat: deep generative models do not reliably assign low likelihood to out-of-distribution inputs (see Nalisnick et al. 2019).
Sequence and time-series. Forecast-based methods flag points where the prediction error exceeds a threshold (ARIMA residuals, Prophet, deep forecast models). Reconstruction-based methods (LSTM autoencoders, transformer denoisers) work for collective anomalies — flag a window whose reconstruction is poor.
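Both ideas can be sketched without a deep model. In the example below a seasonal-naive forecast stands in for ARIMA or Prophet, and PCA on sliding windows stands in for an autoencoder (it is a linear compress-and-reconstruct model). The signal, the injected anomalies, and the "maximum training error" thresholds are assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
period, n = 50, 600
t = np.arange(n)

def clean_signal():
    """Noisy sine wave with a known period (hypothetical sensor data)."""
    return np.sin(2 * np.pi * t / period) + rng.normal(0.0, 0.1, size=n)

y_train = clean_signal()                                   # anomaly-free history
y_test = clean_signal()
y_test[200] += 3.0                                         # point anomaly: a single spike
y_test[400:450] = np.sin(2 * np.pi * t[400:450] / 25)      # collective anomaly: frequency doubles

# Forecast-based: predict each value with the one observed a full period earlier.
def forecast_error(y):
    return np.abs(y[period:] - y[:-period])

tau_point = forecast_error(y_train).max()
flagged_times = np.flatnonzero(forecast_error(y_test) > tau_point) + period

# Reconstruction-based: compress each window to 2 components and reconstruct it.
def windows(y):
    return np.lib.stride_tricks.sliding_window_view(y, period).copy()

pca = PCA(n_components=2).fit(windows(y_train))

def window_error(y):
    W = windows(y)
    return ((W - pca.inverse_transform(pca.transform(W))) ** 2).mean(axis=1)

tau_window = window_error(y_train).max()
flagged_windows = np.flatnonzero(window_error(y_test) > tau_window)   # window start indices

print("spike flagged by forecast error:", 200 in flagged_times)
print("frequency change flagged by window error:", 400 in flagged_windows)
```

Note that every value in the altered stretch lies inside the normal range of the signal; it is the shape of the window that gives it away. The spike also corrupts the naive forecast one period later, so it can be flagged twice.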
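Calibration. Anomaly scores are not probabilities. Convert them via a percentile-based mapping or by fitting a tail distribution (Generalized Pareto) to the highest scores. PR-AUC is usually the more informative summary metric in the heavily imbalanced regime; ROC-AUC can look deceptively good because the large pool of true negatives keeps the false-positive rate low.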
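The two conversions mentioned above can be sketched as follows, assuming SciPy's `genpareto` for the tail fit. The reference scores are simulated, the 95% cut-off for the tail is an arbitrary choice, and the helper names `percentile_rank` and `tail_prob` are hypothetical.

```python
import numpy as np
from scipy.stats import genpareto

rng = np.random.default_rng(0)
# Hypothetical anomaly scores observed on normal data (higher = more anomalous).
ref_scores = rng.exponential(1.0, size=5000)
sorted_ref = np.sort(ref_scores)

def percentile_rank(s):
    """Fraction of reference scores at or below s (an empirical CDF)."""
    return np.searchsorted(sorted_ref, s, side="right") / sorted_ref.size

# Peaks-over-threshold: fit a Generalized Pareto to exceedances over a high quantile u.
tail_fraction = 0.05
u = np.quantile(ref_scores, 1.0 - tail_fraction)
excess = ref_scores[ref_scores > u] - u
shape, _, scale = genpareto.fit(excess, floc=0.0)   # location fixed at 0

def tail_prob(s):
    """Estimated P(score > s) for s >= u; extrapolates past the largest observed score."""
    return tail_fraction * genpareto.sf(s - u, shape, loc=0.0, scale=scale)

print(f"percentile rank of the median score: {percentile_rank(np.median(ref_scores)):.2f}")
print(f"estimated P(score > 6.0): {tail_prob(6.0):.4f}")
```

The percentile mapping cannot go beyond the most extreme reference score, which is the gap the tail fit is meant to fill.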
Evaluation pitfalls. Without labels, you're estimating performance from synthetic anomalies — which often don't match real ones. With labels, watch out for label leakage from the threshold-setting process. Always evaluate on a held-out period for time-series.
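When labels are available, the gap between the two summary metrics is easy to reproduce. The sketch below uses simulated scores at 0.2% prevalence and scikit-learn's `average_precision_score` as the PR-AUC estimate; the score distributions are assumptions.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)

# Hypothetical detector output: 10,000 normal points and 20 anomalies.
y_true = np.r_[np.zeros(10_000, dtype=int), np.ones(20, dtype=int)]
scores = np.r_[rng.normal(0.0, 1.0, size=10_000),    # normal points
               rng.normal(2.5, 1.0, size=20)]        # anomalies score higher on average

roc_auc = roc_auc_score(y_true, scores)
pr_auc = average_precision_score(y_true, scores)

print(f"ROC-AUC: {roc_auc:.3f}")
print(f"PR-AUC (average precision): {pr_auc:.3f}")
```

A detector that looks strong by ROC-AUC can still bury its true positives under false alarms, and PR-AUC is the number that shows it.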
Reach for it when
Deep SVDD: images, learned representations, end-to-end pipeline
Normalizing flows: need explicit, tractable densities, not just scores
LSTM / transformer reconstruction: sequential data with structure
Density-ratio: compare against a known reference distribution
Skip it when
You truly have labels: use supervised methods with class weights / focal loss
"Normal" is multi-modal and some modes are rare: single-class methods tend to fit the dominant mode and flag the rest
Anomalies must be human-interpretable — deep methods are opaque
You can't retrain regularly and the data drifts
import torch
import torch.nn as nn
class AutoencoderAD(nn.Module):
    def __init__(self, d_in):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d_in, 64), nn.ReLU(), nn.Linear(64, 16))
        self.dec = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, d_in))
    def forward(self, x):
        return self.dec(self.enc(x))
# X_normal and X_test are assumed to be float tensors of shape (n_samples, d_in)
# Train ONLY on normal data
model = AutoencoderAD(d_in=X_normal.shape[1])
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(100):
    opt.zero_grad()
    loss = ((model(X_normal) - X_normal) ** 2).mean()
    loss.backward()
    opt.step()
# Anomaly score = per-sample reconstruction error
with torch.no_grad():
    train_err = ((model(X_normal) - X_normal) ** 2).mean(dim=1)
    err = ((model(X_test) - X_test) ** 2).mean(dim=1)
# Threshold at the 99th percentile of normal training errors (no test labels involved)
threshold = train_err.quantile(0.99)
flagged = err > threshold