Key idea

Each point belongs to multiple clusters at once. A GMM models the data as coming from several Gaussian "blobs". Each point gets a probability of belonging to each blob — not a hard assignment like k-means.

Step through EM — ellipses rotate and stretch to fit the data, where k-means' spherical regions can't
iter 0

Each Step runs one EM iteration — first an E step assigns soft responsibilities γik to every (point, component) pair, then an M step re-estimates each component's mean, covariance, and weight from those responsibilities. Watch the ellipses rotate and stretch — that's full-covariance Gaussians earning their keep. Switch to the Tilted ellipses dataset and notice how a GMM fits the diagonal stretch that k-means would slice straight through.

K-means says: "this point is in cluster 3, period." A GMM says: "this point is 80% in cluster 3, 20% in cluster 1." That uncertainty is useful when clusters overlap or have very different shapes and sizes — k-means struggles with both.

GMMs are also a density estimator. Once fit, you can ask "how likely is this point under my model?" — useful for anomaly detection (low-likelihood points are anomalies) or for generating new samples from the learned distribution.

Reach for it when

  • Clusters overlap or have different shapes / sizes
  • You want soft assignments, not hard ones
  • You need a density estimate (for anomaly detection or sampling)
  • You can pick a reasonable number of components

Skip it when

  • Clusters aren't well-modelled by Gaussians (e.g. moons, spirals)
  • Very high-dimensional data — covariance matrices get huge
  • You truly want hard assignments — k-means is faster
  • You don't know the number of components and don't want to use model selection
from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=3, random_state=0).fit(X)

# Soft assignments (responsibilities)
proba = gmm.predict_proba(X)

# Hard assignments (for cluster labels)
labels = gmm.predict(X)

# Density estimate: log p(x) under the fitted mixture
log_density = gmm.score_samples(X)
Want to see EM and the math?
Key idea $$ p(\mathbf{x}) \;=\; \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) $$
  • Knumber of components
  • πkmixing weight — prior probability of component k, sums to 1
  • μk, Σkmean and covariance of component k

Each point is assumed to be generated by first picking a component (with probability πk), then sampling from that component's Gaussian. The parameters are fit by maximizing likelihood, but there's no closed form — we use EM.

EM algorithm. Alternates two steps until convergence:

E-step: compute responsibilities — the posterior probability that point i came from component k, holding parameters fixed.

M-step: update component parameters (μk, Σk, πk) as weighted averages, where the weights are the responsibilities from the E-step.

Each iteration is guaranteed to increase the log-likelihood (or leave it unchanged). It converges to a local maximum — sensitive to initialization, so re-run with several random restarts and keep the best.

Covariance types. sklearn offers full (each component free Σ), tied (shared Σ), diag (diagonal Σ), and spherical (scalar variance). Less flexible = fewer parameters = faster but more biased.

Reach for it when

  • Soft clustering with calibrated probabilities
  • Density estimation in low / moderate dimensions
  • Anomaly detection via low likelihood
  • Generative sampling of new points like the data

Skip it when

  • Non-Gaussian-shaped clusters (try kernel density or DBSCAN)
  • K is unknown and you don't want to fit several and compare via BIC
  • Very high dimensions — full covariance scales as O(d²) per component
  • You need outlier-robust fitting (single outliers move μ a lot)
from sklearn.mixture import GaussianMixture
import numpy as np

# Pick the number of components via BIC
ks = range(1, 11)
bics = [
    GaussianMixture(n_components=k, n_init=5, random_state=0)
        .fit(X).bic(X)
    for k in ks
]
best_k = ks[int(np.argmin(bics))]
print(f"BIC-optimal K = {best_k}")

gmm = GaussianMixture(
    n_components=best_k, covariance_type="full",
    n_init=5, random_state=0,
).fit(X)

print(f"Mixing weights: {gmm.weights_.round(3)}")
print(f"Converged: {gmm.converged_}, iterations: {gmm.n_iter_}")
Want the variational / Dirichlet-process extensions?
EM as ELBO maximization $$ \log p(\mathbf{X}) \;\geq\; \mathbb{E}_{q(\mathbf{Z})}\!\left[\log p(\mathbf{X}, \mathbf{Z}) - \log q(\mathbf{Z})\right] \;=\; \mathcal{L}(q, \theta) $$
  • Zlatent component assignments
  • q(Z)variational distribution over assignments
  • E-step optimizes q; M-step optimizes θ

EM is coordinate ascent on the evidence lower bound (ELBO). The E-step makes the bound tight (q = posterior); the M-step pushes the bound up by re-fitting parameters.

Identifiability and label switching. The likelihood is invariant under permutation of components — there are K! equivalent maxima. Standard MLE is fine, but Bayesian inference (especially MCMC) has to handle this; it's the canonical example of a non-identifiable model.

Singularities. If a component collapses onto a single data point, its covariance → 0 and the likelihood → ∞. Regularize with a small ridge added to each Σ, or use a Bayesian prior on Σ to keep it bounded away from singular.

Bayesian GMM (Dirichlet Process). Instead of fixing K, place a Dirichlet process prior over mixing weights — the model decides how many components to use. sklearn's BayesianGaussianMixture implements this via variational inference; it can effectively prune unused components by sending their weights toward zero.

k-means as a limit. Take a GMM with πk = 1/K, Σk = σ²I, and let σ → 0. The E-step responsibilities collapse to hard 0/1 assignments and the M-step becomes the k-means centroid update. K-means is the small-noise limit of spherical GMM.

Reach for it when

  • Generative modelling of low-dimensional continuous data
  • Bayesian model selection — BIC, DIC, or a DP prior
  • Soft clustering with uncertainty quantification
  • Embedded as part of a larger probabilistic graphical model

Skip it when

  • Data has heavy tails — use Student-t mixtures
  • Strong asymmetry — Gaussian shape is wrong
  • You need conditional density p(y | x) — use mixture-of-experts
  • Highly multi-modal latent structure that's not blob-shaped
from sklearn.mixture import BayesianGaussianMixture

# Variational Bayesian GMM: set K large; unused components get tiny weight
bgmm = BayesianGaussianMixture(
    n_components=20,
    weight_concentration_prior_type="dirichlet_process",
    weight_concentration_prior=1.0 / 20,    # smaller -> sparser solution
    n_init=5, random_state=0,
).fit(X)

active = (bgmm.weights_ > 0.01).sum()
print(f"Effective number of components: {active}")
print(f"Top weights: {sorted(bgmm.weights_, reverse=True)[:active]}")
Too dense?