Density estimation and soft clustering via a weighted sum of Gaussian components.
Key idea
Each point belongs to multiple clusters at once. A GMM models the data as coming from several Gaussian "blobs". Each point gets a probability of belonging to each blob — not a hard assignment like k-means.
Step through EM — ellipses rotate and stretch to fit the data, where k-means' spherical regions can't
iter 0
Each Step runs one EM iteration — first an E step assigns soft responsibilities γik to every (point, component) pair, then an M step re-estimates each component's mean, covariance, and weight from those responsibilities. Watch the ellipses rotate and stretch — that's full-covariance Gaussians earning their keep. Switch to the Tilted ellipses dataset and notice how a GMM fits the diagonal stretch that k-means would slice straight through.
K-means says: "this point is in cluster 3, period." A GMM says: "this point is 80% in cluster 3, 20% in cluster 1." That uncertainty is useful when clusters overlap or have very different shapes and sizes — k-means struggles with both.
GMMs are also a density estimator. Once fit, you can ask "how likely is this point under my model?" — useful for anomaly detection (low-likelihood points are anomalies) or for generating new samples from the learned distribution.
Reach for it when
Clusters overlap or have different shapes / sizes
You want soft assignments, not hard ones
You need a density estimate (for anomaly detection or sampling)
You can pick a reasonable number of components
Skip it when
Clusters aren't well-modelled by Gaussians (e.g. moons, spirals)
Very high-dimensional data — covariance matrices get huge
You truly want hard assignments — k-means is faster
You don't know the number of components and don't want to use model selection
from sklearn.mixture import GaussianMixture
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
# Soft assignments (responsibilities)
proba = gmm.predict_proba(X)
# Hard assignments (for cluster labels)
labels = gmm.predict(X)
# Density estimate: log p(x) under the fitted mixture
log_density = gmm.score_samples(X)
πkmixing weight — prior probability of component k, sums to 1
μk, Σkmean and covariance of component k
Each point is assumed to be generated by first picking a component (with probability πk), then sampling from that component's Gaussian. The parameters are fit by maximizing likelihood, but there's no closed form — we use EM.
EM algorithm. Alternates two steps until convergence:
E-step: compute responsibilities — the posterior probability that point i came from component k, holding parameters fixed.
M-step: update component parameters (μk, Σk, πk) as weighted averages, where the weights are the responsibilities from the E-step.
Each iteration is guaranteed to increase the log-likelihood (or leave it unchanged). It converges to a local maximum — sensitive to initialization, so re-run with several random restarts and keep the best.
Covariance types. sklearn offers full (each component free Σ), tied (shared Σ), diag (diagonal Σ), and spherical (scalar variance). Less flexible = fewer parameters = faster but more biased.
Reach for it when
Soft clustering with calibrated probabilities
Density estimation in low / moderate dimensions
Anomaly detection via low likelihood
Generative sampling of new points like the data
Skip it when
Non-Gaussian-shaped clusters (try kernel density or DBSCAN)
K is unknown and you don't want to fit several and compare via BIC
Very high dimensions — full covariance scales as O(d²) per component
You need outlier-robust fitting (single outliers move μ a lot)
from sklearn.mixture import GaussianMixture
import numpy as np
# Pick the number of components via BIC
ks = range(1, 11)
bics = [
GaussianMixture(n_components=k, n_init=5, random_state=0)
.fit(X).bic(X)
for k in ks
]
best_k = ks[int(np.argmin(bics))]
print(f"BIC-optimal K = {best_k}")
gmm = GaussianMixture(
n_components=best_k, covariance_type="full",
n_init=5, random_state=0,
).fit(X)
print(f"Mixing weights: {gmm.weights_.round(3)}")
print(f"Converged: {gmm.converged_}, iterations: {gmm.n_iter_}")
Want the variational / Dirichlet-process extensions?
EM as ELBO maximization$$ \log p(\mathbf{X}) \;\geq\; \mathbb{E}_{q(\mathbf{Z})}\!\left[\log p(\mathbf{X}, \mathbf{Z}) - \log q(\mathbf{Z})\right] \;=\; \mathcal{L}(q, \theta) $$
Zlatent component assignments
q(Z)variational distribution over assignments
E-step optimizes q; M-step optimizes θ
EM is coordinate ascent on the evidence lower bound (ELBO). The E-step makes the bound tight (q = posterior); the M-step pushes the bound up by re-fitting parameters.
Identifiability and label switching. The likelihood is invariant under permutation of components — there are K! equivalent maxima. Standard MLE is fine, but Bayesian inference (especially MCMC) has to handle this; it's the canonical example of a non-identifiable model.
Singularities. If a component collapses onto a single data point, its covariance → 0 and the likelihood → ∞. Regularize with a small ridge added to each Σ, or use a Bayesian prior on Σ to keep it bounded away from singular.
Bayesian GMM (Dirichlet Process). Instead of fixing K, place a Dirichlet process prior over mixing weights — the model decides how many components to use. sklearn's BayesianGaussianMixture implements this via variational inference; it can effectively prune unused components by sending their weights toward zero.
k-means as a limit. Take a GMM with πk = 1/K, Σk = σ²I, and let σ → 0. The E-step responsibilities collapse to hard 0/1 assignments and the M-step becomes the k-means centroid update. K-means is the small-noise limit of spherical GMM.
Reach for it when
Generative modelling of low-dimensional continuous data
Bayesian model selection — BIC, DIC, or a DP prior
Soft clustering with uncertainty quantification
Embedded as part of a larger probabilistic graphical model
Skip it when
Data has heavy tails — use Student-t mixtures
Strong asymmetry — Gaussian shape is wrong
You need conditional density p(y | x) — use mixture-of-experts
Highly multi-modal latent structure that's not blob-shaped
from sklearn.mixture import BayesianGaussianMixture
# Variational Bayesian GMM: set K large; unused components get tiny weight
bgmm = BayesianGaussianMixture(
n_components=20,
weight_concentration_prior_type="dirichlet_process",
weight_concentration_prior=1.0 / 20, # smaller -> sparser solution
n_init=5, random_state=0,
).fit(X)
active = (bgmm.weights_ > 0.01).sum()
print(f"Effective number of components: {active}")
print(f"Top weights: {sorted(bgmm.weights_, reverse=True)[:active]}")