Key idea

Find natural groups in your data without telling the algorithm what to look for. No labels, no target variable. The algorithm decides what "similar" means based on a distance metric you provide.

Step through Lloyd's algorithm — assign points, move centroids, repeat — and watch where k-means fails
iter 0

Lloyd's algorithm alternates two steps: assign each point to its nearest centroid, then move each centroid to the mean of its assigned points. Step through and watch the centroids settle. Try the Two moons dataset to see why k-means is a poor fit for non-convex clusters — it slices the moons clean in half because it only ever draws straight Voronoi boundaries.

The most famous algorithm is k-means: pick k in advance, then iteratively assign each point to the nearest cluster center and move each center to the average of its assigned points. Simple, fast, but biased toward equal-size spherical clusters.

Other approaches: DBSCAN finds clusters of arbitrary shape based on local density (and labels sparse regions as noise). Hierarchical clustering builds a tree of clusters at every scale, letting you pick the granularity afterward.

Reach for it when

  • You suspect there are natural groups but have no labels
  • Customer / user segmentation
  • Exploratory data analysis on a new dataset
  • Vector quantization / image compression

Skip it when

  • You have labels — use supervised learning instead
  • You don't know what distance metric makes sense for your data
  • The data has no real cluster structure (then any algorithm invents some)
  • You need to "explain" the clusters — clustering doesn't justify itself
from sklearn.cluster import KMeans

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

print("Cluster sizes:", [(km.labels_ == k).sum() for k in range(4)])
print("Cluster centers shape:", km.cluster_centers_.shape)

# Predict cluster for new points
new_labels = km.predict(X_new)
Want to see how these algorithms actually work?
k-means objective $$ \min_{\{\boldsymbol{\mu}_k\}} \sum_{k=1}^{K} \sum_{\mathbf{x}_i \in C_k} \|\mathbf{x}_i - \boldsymbol{\mu}_k\|^2 $$
  • Ckthe set of points assigned to cluster k
  • μkthe centroid of cluster k (the mean of its points)
  • Lloyd's algorithm alternates assignment and centroid update — guaranteed to converge to a local minimum

k-means. Assumes spherical, equal-size, equal-variance clusters. Sensitive to initialization — use k-means++ seeding and n_init > 1 to mitigate. Picking K: elbow on inertia, silhouette score, or the gap statistic.

DBSCAN (Density-Based Spatial Clustering). Two parameters: eps (neighbourhood radius) and min_samples. A point is "core" if it has ≥ min_samples neighbours within eps; clusters grow from core points. Points outside any cluster are labelled noise. Handles arbitrary shapes; no K to choose. Struggles with varying densities.

Hierarchical / agglomerative. Start with each point its own cluster, then repeatedly merge the closest pair. Choose linkage: single (chain-like clusters), complete (compact), average, or Ward (minimum variance). Produces a dendrogram you can cut at any height.

Spectral clustering. Build a similarity graph, embed it via the eigenvectors of the graph Laplacian, then run k-means in that embedding. Handles non-convex shapes that k-means can't.

Reach for it when

  • k-means: roughly spherical groups, large data, fast iteration
  • DBSCAN: arbitrary shapes, noise present, unknown K
  • Hierarchical: you want to see structure at multiple scales
  • Spectral: data lies on a manifold, similarity matters more than Euclidean distance

Skip it when

  • Very high-dimensional data — distances become uninformative ("curse of dimensionality")
  • Mixed-type features — clustering is metric-bound
  • You need overlapping clusters (use GMM for soft assignment)
  • Cluster validity is unknown — silhouette & gap statistic only help so much
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.metrics import silhouette_score
import numpy as np

# Compare three algorithms with silhouette
for name, algo in [
    ("k-means",     KMeans(n_clusters=4, n_init=10, random_state=0)),
    ("dbscan",      DBSCAN(eps=0.5, min_samples=5)),
    ("agglomerative", AgglomerativeClustering(n_clusters=4, linkage="ward")),
]:
    labels = algo.fit_predict(X)
    # silhouette is undefined if any single cluster
    if len(set(labels)) > 1:
        s = silhouette_score(X, labels)
        print(f"{name:20s} silhouette = {s:.3f}, n_clusters = {len(set(labels))}")
Want the embedding view and modern density-based methods?
Spectral clustering $$ L \;=\; D - W \quad\text{or}\quad L_{\text{sym}} = I - D^{-1/2} W D^{-1/2} $$
  • Wsimilarity matrix (e.g. Gaussian kernel of pairwise distances)
  • Ddiagonal degree matrix, Dii = Σj Wij
  • Embed via the smallest k eigenvectors of L, then cluster in that space

k-means is a Voronoi quantizer. It minimizes the variance within Voronoi cells. With high-dim data, every pair of random points is approximately equidistant (concentration of measure), which is why k-means quality degrades sharply above ~50 dimensions. Pre-embed via PCA or autoencoder before clustering.

HDBSCAN. A hierarchical variant of DBSCAN that doesn't require a global eps. Builds a tree of mutual-reachability distances and extracts stable clusters at varying densities. The de-facto modern default when you'd previously have used DBSCAN.

Mean shift. Non-parametric mode-seeking: each point climbs the density gradient toward a local mode. Number of clusters determined by the number of distinct modes. Bandwidth selection is the main hyperparameter.

Validity indices. Internal indices (silhouette, Davies-Bouldin, Calinski-Harabasz) score partitions without ground truth but assume some geometric notion of "good clustering". External indices (ARI, NMI) compare against a labelled ground truth — useful for benchmark datasets, less so in the wild.

Deep clustering. Joint learning of representations and cluster assignments (DEC, DeepCluster, contrastive methods). Outperforms classical methods on high-dim structured data (images, text) where the right embedding is task-specific.

Reach for it when

  • HDBSCAN: varying densities, unknown noise level — modern default
  • Deep clustering: images, text, audio — learn the embedding too
  • Mean shift: mode-finding in non-Gaussian densities
  • Stability is a more important property than partition quality

Skip it when

  • Very large data and you need streaming — most methods are batch
  • You can't pre-compute or cheaply approximate the full similarity matrix
  • You'd benefit from a generative model instead — fit a GMM or Bayesian non-parametric
  • The point of the analysis is causal — clustering doesn't infer causes
import hdbscan
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(X)

clusterer = hdbscan.HDBSCAN(
    min_cluster_size=15,
    min_samples=5,
    cluster_selection_method="eom",   # "excess of mass" — good default
)
labels = clusterer.fit_predict(X_scaled)

# -1 = noise; cluster persistence scores indicate stability
print(f"Found {labels.max() + 1} clusters, {(labels == -1).sum()} noise points")
print(f"Cluster persistence: {clusterer.cluster_persistence_.round(3)}")
Too dense?