Gaussian Mixture Models — ML Resources Hub

Key idea

Each point belongs to multiple clusters at once. A GMM models the data as coming from several Gaussian "blobs". Each point gets a probability of belonging to each blob — not a hard assignment like k-means.

Step through EM — ellipses rotate and stretch to fit the data, where k-means' spherical regions can't

Data iter 0

Each Step runs one EM iteration — first an E step assigns soft responsibilities γ_ik to every (point, component) pair, then an M step re-estimates each component's mean, covariance, and weight from those responsibilities. Watch the ellipses rotate and stretch — that's full-covariance Gaussians earning their keep. Switch to the Tilted ellipses dataset and notice how a GMM fits the diagonal stretch that k-means would slice straight through.

K-means says: "this point is in cluster 3, period." A GMM says: "this point is 80% in cluster 3, 20% in cluster 1." That uncertainty is useful when clusters overlap or have very different shapes and sizes — k-means struggles with both.

GMMs are also a density estimator. Once fit, you can ask "how likely is this point under my model?" — useful for anomaly detection (low-likelihood points are anomalies) or for generating new samples from the learned distribution.

Reach for it when

Clusters overlap or have different shapes / sizes
You want soft assignments, not hard ones
You need a density estimate (for anomaly detection or sampling)
You can pick a reasonable number of components

Skip it when

Clusters aren't well-modelled by Gaussians (e.g. moons, spirals)
Very high-dimensional data — covariance matrices get huge
You truly want hard assignments — k-means is faster
You don't know the number of components and don't want to use model selection

from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=3, random_state=0).fit(X)

# Soft assignments (responsibilities)
proba = gmm.predict_proba(X)

# Hard assignments (for cluster labels)
labels = gmm.predict(X)

# Density estimate: log p(x) under the fitted mixture
log_density = gmm.score_samples(X)

Want to see EM and the math?

Key idea $$ p(\mathbf{x}) \;=\; \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) $$

Knumber of components
π_kmixing weight — prior probability of component k, sums to 1
μ_k, Σ_kmean and covariance of component k

Each point is assumed to be generated by first picking a component (with probability π_k), then sampling from that component's Gaussian. The parameters are fit by maximizing likelihood, but there's no closed form — we use EM.

EM algorithm. Alternates two steps until convergence:

E-step: compute responsibilities — the posterior probability that point i came from component k, holding parameters fixed.

M-step: update component parameters (μ_k, Σ_k, π_k) as weighted averages, where the weights are the responsibilities from the E-step.

Each iteration is guaranteed to increase the log-likelihood (or leave it unchanged). It converges to a local maximum — sensitive to initialization, so re-run with several random restarts and keep the best.

Covariance types. sklearn offers full (each component free Σ), tied (shared Σ), diag (diagonal Σ), and spherical (scalar variance). Less flexible = fewer parameters = faster but more biased.

Reach for it when

Soft clustering with calibrated probabilities
Density estimation in low / moderate dimensions
Anomaly detection via low likelihood
Generative sampling of new points like the data

Skip it when

Non-Gaussian-shaped clusters (try kernel density or DBSCAN)
K is unknown and you don't want to fit several and compare via BIC
Very high dimensions — full covariance scales as O(d²) per component
You need outlier-robust fitting (single outliers move μ a lot)

from sklearn.mixture import GaussianMixture
import numpy as np

# Pick the number of components via BIC
ks = range(1, 11)
bics = [
    GaussianMixture(n_components=k, n_init=5, random_state=0)
        .fit(X).bic(X)
    for k in ks
]
best_k = ks[int(np.argmin(bics))]
print(f"BIC-optimal K = {best_k}")

gmm = GaussianMixture(
    n_components=best_k, covariance_type="full",
    n_init=5, random_state=0,
).fit(X)

print(f"Mixing weights: {gmm.weights_.round(3)}")
print(f"Converged: {gmm.converged_}, iterations: {gmm.n_iter_}")

Want the variational / Dirichlet-process extensions?

EM as ELBO maximization $$ \log p(\mathbf{X}) \;\geq\; \mathbb{E}_{q(\mathbf{Z})}\!\left[\log p(\mathbf{X}, \mathbf{Z}) - \log q(\mathbf{Z})\right] \;=\; \mathcal{L}(q, \theta) $$

Zlatent component assignments
q(Z)variational distribution over assignments
E-step optimizes q; M-step optimizes θ

EM is coordinate ascent on the evidence lower bound (ELBO). The E-step makes the bound tight (q = posterior); the M-step pushes the bound up by re-fitting parameters.

Identifiability and label switching. The likelihood is invariant under permutation of components — there are K! equivalent maxima. Standard MLE is fine, but Bayesian inference (especially MCMC) has to handle this; it's the canonical example of a non-identifiable model.

Singularities. If a component collapses onto a single data point, its covariance → 0 and the likelihood → ∞. Regularize with a small ridge added to each Σ, or use a Bayesian prior on Σ to keep it bounded away from singular.

Bayesian GMM (Dirichlet Process). Instead of fixing K, place a Dirichlet process prior over mixing weights — the model decides how many components to use. sklearn's BayesianGaussianMixture implements this via variational inference; it can effectively prune unused components by sending their weights toward zero.

k-means as a limit. Take a GMM with π_k = 1/K, Σ_k = σ²I, and let σ → 0. The E-step responsibilities collapse to hard 0/1 assignments and the M-step becomes the k-means centroid update. K-means is the small-noise limit of spherical GMM.

Reach for it when

Generative modelling of low-dimensional continuous data
Bayesian model selection — BIC, DIC, or a DP prior
Soft clustering with uncertainty quantification
Embedded as part of a larger probabilistic graphical model

Skip it when

Data has heavy tails — use Student-t mixtures
Strong asymmetry — Gaussian shape is wrong
You need conditional density p(y | x) — use mixture-of-experts
Highly multi-modal latent structure that's not blob-shaped

from sklearn.mixture import BayesianGaussianMixture

# Variational Bayesian GMM: set K large; unused components get tiny weight
bgmm = BayesianGaussianMixture(
    n_components=20,
    weight_concentration_prior_type="dirichlet_process",
    weight_concentration_prior=1.0 / 20,    # smaller -> sparser solution
    n_init=5, random_state=0,
).fit(X)

active = (bgmm.weights_ > 0.01).sum()
print(f"Effective number of components: {active}")
print(f"Top weights: {sorted(bgmm.weights_, reverse=True)[:active]}")

Too dense?