Group similar items together without labels — k-means, DBSCAN, hierarchical, and friends.
Key idea
Find natural groups in your data without telling the algorithm what to look for. No labels, no target variable. The algorithm decides what "similar" means based on a distance metric you provide.
Step through Lloyd's algorithm — assign points, move centroids, repeat — and watch where k-means fails
iter 0
Lloyd's algorithm alternates two steps: assign each point to its nearest centroid, then move each centroid to the mean of its assigned points. Step through and watch the centroids settle. Try the Two moons dataset to see why k-means is a poor fit for non-convex clusters — it slices the moons clean in half because it only ever draws straight Voronoi boundaries.
The most famous algorithm is k-means: pick k in advance, then iteratively assign each point to the nearest cluster center and move each center to the average of its assigned points. Simple, fast, but biased toward equal-size spherical clusters.
Other approaches: DBSCAN finds clusters of arbitrary shape based on local density (and labels sparse regions as noise). Hierarchical clustering builds a tree of clusters at every scale, letting you pick the granularity afterward.
Reach for it when
You suspect there are natural groups but have no labels
Customer / user segmentation
Exploratory data analysis on a new dataset
Vector quantization / image compression
Skip it when
You have labels — use supervised learning instead
You don't know what distance metric makes sense for your data
The data has no real cluster structure (then any algorithm invents some)
You need to "explain" the clusters — clustering doesn't justify itself
from sklearn.cluster import KMeans
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
print("Cluster sizes:", [(km.labels_ == k).sum() for k in range(4)])
print("Cluster centers shape:", km.cluster_centers_.shape)
# Predict cluster for new points
new_labels = km.predict(X_new)
μkthe centroid of cluster k (the mean of its points)
Lloyd's algorithm alternates assignment and centroid update — guaranteed to converge to a local minimum
k-means. Assumes spherical, equal-size, equal-variance clusters. Sensitive to initialization — use k-means++ seeding and n_init > 1 to mitigate. Picking K: elbow on inertia, silhouette score, or the gap statistic.
DBSCAN (Density-Based Spatial Clustering). Two parameters: eps (neighbourhood radius) and min_samples. A point is "core" if it has ≥ min_samples neighbours within eps; clusters grow from core points. Points outside any cluster are labelled noise. Handles arbitrary shapes; no K to choose. Struggles with varying densities.
Hierarchical / agglomerative. Start with each point its own cluster, then repeatedly merge the closest pair. Choose linkage: single (chain-like clusters), complete (compact), average, or Ward (minimum variance). Produces a dendrogram you can cut at any height.
Spectral clustering. Build a similarity graph, embed it via the eigenvectors of the graph Laplacian, then run k-means in that embedding. Handles non-convex shapes that k-means can't.
Reach for it when
k-means: roughly spherical groups, large data, fast iteration
DBSCAN: arbitrary shapes, noise present, unknown K
Hierarchical: you want to see structure at multiple scales
Spectral: data lies on a manifold, similarity matters more than Euclidean distance
Skip it when
Very high-dimensional data — distances become uninformative ("curse of dimensionality")
Mixed-type features — clustering is metric-bound
You need overlapping clusters (use GMM for soft assignment)
Cluster validity is unknown — silhouette & gap statistic only help so much
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.metrics import silhouette_score
import numpy as np
# Compare three algorithms with silhouette
for name, algo in [
("k-means", KMeans(n_clusters=4, n_init=10, random_state=0)),
("dbscan", DBSCAN(eps=0.5, min_samples=5)),
("agglomerative", AgglomerativeClustering(n_clusters=4, linkage="ward")),
]:
labels = algo.fit_predict(X)
# silhouette is undefined if any single cluster
if len(set(labels)) > 1:
s = silhouette_score(X, labels)
print(f"{name:20s} silhouette = {s:.3f}, n_clusters = {len(set(labels))}")
Want the embedding view and modern density-based methods?
Spectral clustering$$ L \;=\; D - W \quad\text{or}\quad L_{\text{sym}} = I - D^{-1/2} W D^{-1/2} $$
Wsimilarity matrix (e.g. Gaussian kernel of pairwise distances)
Ddiagonal degree matrix, Dii = Σj Wij
Embed via the smallest k eigenvectors of L, then cluster in that space
k-means is a Voronoi quantizer. It minimizes the variance within Voronoi cells. With high-dim data, every pair of random points is approximately equidistant (concentration of measure), which is why k-means quality degrades sharply above ~50 dimensions. Pre-embed via PCA or autoencoder before clustering.
HDBSCAN. A hierarchical variant of DBSCAN that doesn't require a global eps. Builds a tree of mutual-reachability distances and extracts stable clusters at varying densities. The de-facto modern default when you'd previously have used DBSCAN.
Mean shift. Non-parametric mode-seeking: each point climbs the density gradient toward a local mode. Number of clusters determined by the number of distinct modes. Bandwidth selection is the main hyperparameter.
Validity indices. Internal indices (silhouette, Davies-Bouldin, Calinski-Harabasz) score partitions without ground truth but assume some geometric notion of "good clustering". External indices (ARI, NMI) compare against a labelled ground truth — useful for benchmark datasets, less so in the wild.
Deep clustering. Joint learning of representations and cluster assignments (DEC, DeepCluster, contrastive methods). Outperforms classical methods on high-dim structured data (images, text) where the right embedding is task-specific.
Reach for it when
HDBSCAN: varying densities, unknown noise level — modern default
Deep clustering: images, text, audio — learn the embedding too
Mean shift: mode-finding in non-Gaussian densities
Stability is a more important property than partition quality
Skip it when
Very large data and you need streaming — most methods are batch
You can't pre-compute or cheaply approximate the full similarity matrix
You'd benefit from a generative model instead — fit a GMM or Bayesian non-parametric
The point of the analysis is causal — clustering doesn't infer causes