Anomaly Detection — ML Resources Hub

Key idea

Learn what "normal" looks like; flag what doesn't. Anomaly detection is usually unsupervised: you rarely have labelled anomalies, only the assumption that most of your data is normal and a few rare points are not.

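[Interactive figure: kernel density heatmap over the selected dataset, with sliders for the threshold τ (shown at 0.25) and the bandwidth h (shown at 0.12). Every point below the cutoff density is flagged in terracotta.]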

The cream-to-indigo heatmap is a kernel density estimate of "where normal lives". Points are flagged as anomalies when their local density score drops below the threshold τ. Lower the bandwidth h and the model becomes sensitive to local quirks (it over-fits); raise it and only the most isolated points stand out. The Ring dataset is a classic: the genuinely anomalous points live inside the ring, where methods that measure distance from the centre of the data would happily call them "central and normal".
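A minimal sketch of the density-threshold idea, assuming scikit-learn's `KernelDensity` as the density model. The ring data is synthetic and made up for illustration; only the values τ = 0.25 and h = 0.12 come from the figure.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)

# Hypothetical "ring" data: normal points on a noisy unit circle,
# plus a few anomalies placed near the centre of the ring.
angles = rng.uniform(0.0, 2.0 * np.pi, size=500)
radii = 1.0 + rng.normal(scale=0.05, size=500)
X_ring = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
X_inside = rng.normal(scale=0.05, size=(5, 2))
X = np.vstack([X_ring, X_inside])

h = 0.12    # bandwidth: smaller = more sensitive to local quirks
tau = 0.25  # density threshold: points below it are flagged

kde = KernelDensity(kernel="gaussian", bandwidth=h).fit(X)
density = np.exp(kde.score_samples(X))  # score_samples returns log-density
flagged = density < tau

print(f"flagged {flagged.sum()} of {len(X)} points")
```

The points inside the ring sit close to the mean of the data, yet their estimated density is low, which is why a density score can catch them where a distance-from-centre score would not.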

Three classical strategies. Density-based: a point with low probability under a model of "normal" is anomalous. Distance-based: a point far from its nearest neighbours is anomalous. Reconstruction-based: train a model to compress and reconstruct normal data; points it reconstructs badly are anomalous.

The right choice depends on what "anomalous" means in your domain — a fraudster looks different from a manufacturing defect looks different from a network intrusion.

Reach for it when

Fraud / intrusion / defect detection
You have plenty of normal data but few or no labelled anomalies
Monitoring sensor data for unusual patterns
Cleaning a dataset of outliers before modelling

Skip it when

You have labels for both classes — train a regular classifier (with class weights)
Anomalies are common enough to balance — it's just classification
"Anomalous" isn't well-defined and changes over time
You need to explain why a specific point was flagged

from sklearn.ensemble import IsolationForest

iso = IsolationForest(contamination=0.05, random_state=0).fit(X_train)

# -1 = anomaly, 1 = normal
labels = iso.predict(X_test)
scores = iso.score_samples(X_test)   # lower = more anomalous

Want the actual algorithms?

Three strategies

$$ \text{score}(x) \;\in\; \big\{\, \underbrace{-\log p(x)}_{\text{density}},\; \underbrace{d(x, \mathrm{NN}_k(x))}_{\text{distance}},\; \underbrace{\|x - \hat{x}\|}_{\text{reconstruction}} \,\big\} $$

Each strategy assigns a continuous score — threshold to decide normal vs. anomalous
Choice of strategy = assumption about what makes anomalies "weird"
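The three scores can be sketched side by side. This is an illustration under assumptions, not a reference implementation: a KDE stands in for p(x), the distance to the k-th nearest neighbour for d(x, NN_k(x)), and a one-component PCA round trip for the reconstruction. The data and the outlier location are invented.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KernelDensity, NearestNeighbors

rng = np.random.default_rng(0)
# Hypothetical data: correlated 2-D "normal" points plus one obvious outlier (last row).
X_normal = rng.multivariate_normal([0, 0], [[1.0, 0.9], [0.9, 1.0]], size=300)
X = np.vstack([X_normal, [[4.0, -4.0]]])

# Density: score = -log p(x) under a KDE fitted to the data.
kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(X)
density_score = -kde.score_samples(X)

# Distance: score = distance to the k-th nearest neighbour.
k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1: each point is its own nearest neighbour
dist, _ = nn.kneighbors(X)
distance_score = dist[:, -1]

# Reconstruction: score = ||x - x_hat|| after compressing to one PCA component.
pca = PCA(n_components=1).fit(X)
x_hat = pca.inverse_transform(pca.transform(X))
reconstruction_score = np.linalg.norm(X - x_hat, axis=1)

for name, s in [("density", density_score),
                ("distance", distance_score),
                ("reconstruction", reconstruction_score)]:
    print(f"{name:15s} most anomalous index: {int(np.argmax(s))}")
```

In all three cases a higher score means more anomalous; turning the score into a decision still requires a threshold.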

Isolation Forest. Builds random trees that split on a randomly chosen feature at a random threshold. Anomalies sit in sparse regions, so they get isolated in fewer splits. The score is derived from the average path length to isolation across trees: the shorter the path, the more anomalous the point. Cheap, scales well, and works in moderate dimensions. A sensible default for tabular anomaly detection.
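To make the "fewer splits" intuition concrete, here is a simplified sketch of the isolation idea in plain NumPy. It is not scikit-learn's implementation (no subsampling, no path-length normalisation); `isolation_depth` and `mean_depth` are hypothetical helpers written for this example.

```python
import numpy as np

def isolation_depth(X, x, rng, max_depth=50):
    """Count the random axis-aligned splits needed to isolate x (a row of X)."""
    depth = 0
    while len(X) > 1 and depth < max_depth:
        j = rng.integers(X.shape[1])          # pick a random feature
        lo, hi = X[:, j].min(), X[:, j].max()
        if lo == hi:                          # cannot split on a constant feature
            break
        t = rng.uniform(lo, hi)               # pick a random threshold
        # Keep only the side of the split that contains x.
        X = X[X[:, j] < t] if x[j] < t else X[X[:, j] >= t]
        depth += 1
    return depth

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(255, 2)), [[6.0, 6.0]]])  # last row is an outlier

def mean_depth(i, n_trees=100):
    """Average isolation depth of row i over several random trees."""
    return float(np.mean([isolation_depth(X, X[i], rng) for _ in range(n_trees)]))

i_in = int(np.argmin(np.linalg.norm(X, axis=1)))  # the most central point
outlier_depth = mean_depth(255)
inlier_depth = mean_depth(i_in)
print(f"outlier: {outlier_depth:.1f} splits, central point: {inlier_depth:.1f} splits")
```

The outlier should need noticeably fewer splits on average than the central point, which is the signal Isolation Forest turns into a score.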

One-Class SVM. Fits a decision boundary around the "normal" data in a kernel feature space, treating the origin as the "anomaly side". Good with small data and a sensible kernel; training cost grows faster than linearly with the number of samples, so it struggles on large datasets.

Local Outlier Factor (LOF). Compares each point's local density to its neighbours' local densities. Catches anomalies in heterogeneous-density data that global methods miss. Score > 1 means lower density than neighbours.
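A small sketch of the heterogeneous-density case, assuming scikit-learn's `LocalOutlierFactor`. The two clusters and the odd point are invented. Note that scikit-learn stores the negated LOF in `negative_outlier_factor_`, so the sign is flipped below to recover the "score > 1" convention.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
# Hypothetical data: a tight cluster, a loose cluster, and one point
# sitting just outside the tight cluster (last row).
tight = rng.normal(loc=[0.0, 0.0], scale=0.1, size=(100, 2))
loose = rng.normal(loc=[5.0, 5.0], scale=1.0, size=(100, 2))
odd = np.array([[0.8, 0.8]])
X = np.vstack([tight, loose, odd])

lof = LocalOutlierFactor(n_neighbors=20).fit(X)
lof_score = -lof.negative_outlier_factor_   # > 1 means sparser than its neighbours

# A global alternative for comparison: plain distance to the 20th neighbour.
dist, _ = lof.kneighbors()                  # neighbours of the fitted points, self excluded
kdist = dist[:, -1]
n_loose_farther = int((kdist[100:200] > kdist[-1]).sum())

print(f"LOF of odd point: {lof_score[-1]:.1f}")
print(f"loose-cluster points with a larger k-distance than the odd point: {n_loose_farther}")
```

The k-distance alone ranks some ordinary members of the loose cluster as more isolated than the odd point, while LOF compares each point to its own neighbourhood and singles the odd point out.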

Autoencoder reconstruction. Train an autoencoder on normal data; at inference time, flag points with high reconstruction error. Scales to images and high-dim data where other methods struggle.

Threshold setting. All methods give continuous scores; the threshold is a business decision (precision vs. recall trade-off). With no labels, use a quantile of training scores; with some labels, calibrate against the validation set.
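The no-label case can be sketched as follows, assuming an Isolation Forest as the scorer and a 1% quantile as the cutoff. Both are illustrative choices, and the data is synthetic.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 3))   # assumed to be mostly normal
X_new = np.vstack([rng.normal(size=(5, 3)), [[6.0, 6.0, 6.0]]])  # last row is far out

iso = IsolationForest(random_state=0).fit(X_train)
train_scores = iso.score_samples(X_train)   # lower = more anomalous

# No labels: flag anything scoring below the 1st percentile of training scores.
threshold = np.quantile(train_scores, 0.01)
new_scores = iso.score_samples(X_new)
flagged = new_scores < threshold

print(f"threshold = {threshold:.3f}, flagged {flagged.sum()} of {len(X_new)} new points")
```

The quantile fixes the expected alert rate on data that resembles the training set (about 1% here); it says nothing about precision or recall, which is why some labelled validation data is worth having when you can get it.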

Reach for it when

Isolation Forest: moderate-dim tabular — strong default
One-Class SVM: small / clean dataset with kernel intuition
LOF: heterogeneous local density (clusters of different sizes)
Autoencoder: images, time series, high-dim structured data

Skip it when

"Anomalous" depends on labelled examples and you have plenty of them
The score distribution is unimodal with no visible gap, so no threshold cleanly separates classes
Anomalies arrive in groups (collective): single-point methods miss the group
The data-generating process drifts over time — model becomes stale

from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM
from sklearn.preprocessing import StandardScaler
import numpy as np

X_scaled = StandardScaler().fit_transform(X)

methods = {
    "iforest":  IsolationForest(contamination=0.05, random_state=0),
    "lof":      LocalOutlierFactor(n_neighbors=20, contamination=0.05),
    "ocsvm":    OneClassSVM(nu=0.05, kernel="rbf", gamma="scale"),
}

for name, m in methods.items():
    # All three estimators implement fit_predict; LOF in its default
    # (novelty=False) mode has no separate predict(), so use it throughout.
    labels = m.fit_predict(X_scaled)             # -1 = anomaly, 1 = normal
    print(f"{name:10s} flagged {(labels == -1).sum():,} points as anomalous")

Want deep methods and contextual / collective anomalies?

Three flavours of anomaly

$$ \text{point} \;\subset\; \text{contextual} \;\subset\; \text{collective} $$

Point: one observation is anomalous regardless of context
Contextual: an observation is anomalous given a context (time, location)
Collective: a sequence / group is anomalous as a whole

Deep one-class methods. Deep SVDD (Ruff et al.) replaces the kernel feature map with a learned neural network and shrinks the data into a small hypersphere. Trained end-to-end; works for images and time series. Watch for representation collapse (every input maps to the centre).

Generative anomaly detection. Train a generative model (GAN, normalizing flow, diffusion) on normal data; anomalies should then receive low likelihood or poor reconstructions. Methods in this family have reported strong results on industrial defect detection benchmarks such as MVTec AD. Caveat: deep generative models do not reliably assign low likelihood to out-of-distribution inputs (see Nalisnick et al. 2019).

Sequence and time-series. Forecast-based methods flag points where the prediction error exceeds a threshold (ARIMA residuals, Prophet, deep forecast models). Reconstruction-based methods (LSTM autoencoders, transformer denoisers) work for collective anomalies — flag a window whose reconstruction is poor.
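A forecast-residual detector can be sketched with a deliberately naive forecaster. Here a trailing mean stands in for ARIMA, Prophet, or a deep model; the series, the injected spike, the window length, and the factor of 5 are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(500)
y = np.sin(2 * np.pi * t / 50) + rng.normal(scale=0.1, size=t.size)
y[300] += 3.0   # injected point anomaly (hypothetical)

window = 5
# One-step-ahead forecast: mean of the previous `window` observations.
forecast = np.array([y[i - window:i].mean() for i in range(window, len(y))])
residual = y[window:] - forecast

# Robust scale estimate (MAD), so the anomaly itself does not inflate the threshold.
mad = np.median(np.abs(residual - np.median(residual)))
threshold = 5.0 * 1.4826 * mad

# Shift by `window` to map residual positions back to positions in y.
flagged_idx = np.flatnonzero(np.abs(residual) > threshold) + window
print("flagged time steps:", flagged_idx.tolist())
```

Only past observations feed each forecast, so the same loop works online. For collective anomalies, the analogous move is to score a whole window (for example, its mean reconstruction error) rather than a single step.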

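Calibration. Anomaly scores are not probabilities. To get something probability-like, map scores to percentiles of a reference score distribution or fit a tail distribution (Generalized Pareto) to the extreme scores. PR-AUC is the more informative summary metric in the heavily imbalanced regime; ROC-AUC can look deceptively high because the large pool of true negatives keeps the false-positive rate small.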
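Both points can be sketched on synthetic scores. The score distributions, the 1% anomaly rate, and the reference sample are assumptions; `average_precision_score` is used here as a common estimate of PR-AUC.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)
# Hypothetical scores (higher = more anomalous) with 1% true anomalies.
y_true = np.r_[np.zeros(9900, dtype=int), np.ones(100, dtype=int)]
scores = np.r_[rng.normal(0.0, 1.0, 9900), rng.normal(2.0, 1.0, 100)]

# Percentile mapping: fraction of reference scores (e.g. scores on training
# data assumed normal) that fall at or below each new score.
ref_scores = np.sort(rng.normal(0.0, 1.0, 5000))
percentile = np.searchsorted(ref_scores, scores, side="right") / ref_scores.size

roc = roc_auc_score(y_true, scores)
pr = average_precision_score(y_true, scores)
print(f"ROC-AUC = {roc:.3f}, PR-AUC (average precision) = {pr:.3f}")
```

The percentile is easier to communicate than a raw score ("more extreme than 99.5% of reference data"), but it is still not the probability that a point is anomalous. On data this imbalanced the two summary metrics diverge sharply, with PR-AUC giving the more sobering picture.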

Evaluation pitfalls. Without labels, you're estimating performance from synthetic anomalies — which often don't match real ones. With labels, watch out for label leakage from the threshold-setting process. Always evaluate on a held-out period for time-series.

Reach for it when

Deep SVDD: images, learned representations, end-to-end pipeline
Normalizing flows: need exact log-likelihoods, not just scores (keep the out-of-distribution caveat in mind)
LSTM / transformer reconstruction: sequential data with structure
Density-ratio: compare against a known reference distribution

Skip it when

You truly have labels: use supervised methods with class weights / focal loss
"Normal" is multi-modal and some modes are rare: one-class methods tend to fit the dominant mode and flag the rare ones
Anomalies must be human-interpretable — deep methods are opaque
You can't retrain regularly and the data drifts

import torch
import torch.nn as nn

class AutoencoderAD(nn.Module):
    def __init__(self, d_in):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d_in, 64), nn.ReLU(), nn.Linear(64, 16))
        self.dec = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, d_in))
    def forward(self, x):
        return self.dec(self.enc(x))

# Train ONLY on normal data
model = AutoencoderAD(d_in=X_normal.shape[1])
opt   = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(100):
    opt.zero_grad()
    loss = ((model(X_normal) - X_normal) ** 2).mean()
    loss.backward()
    opt.step()

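# Anomaly score = per-sample reconstruction error
model.eval()
with torch.no_grad():
    train_err = ((model(X_normal) - X_normal) ** 2).mean(dim=1)
    err = ((model(X_test) - X_test) ** 2).mean(dim=1)
# Threshold from normal training errors only, so no test labels leak in
threshold = train_err.quantile(0.99)   # flags roughly the top 1% of normal training errors
flagged = err > threshold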
