Bayesian Inference — ML Resources Hub

Key idea

Start with a belief, see some data, update. Classical statistics gives you a single best estimate of a parameter (and maybe a confidence interval). Bayesian inference gives you a full distribution over the parameter — capturing everything you know and don't know about it after seeing the data.

Flip coins · adjust the prior · watch the posterior emerge as prior × likelihood (normalised)

α₀ β₀ α₀ = 2.0 β₀ = 2.0

The dashed indigo curve is your prior — Beta(α₀, β₀). Each "Flip H" / "Flip T" click adds an observation. The bold orange curve is the posterior; it's literally prior × likelihood normalised. With no data the posterior is the prior; with a lot of data the posterior is dominated by the likelihood (the prior is "swamped"). The dashed grey curve in the middle is the likelihood at the current data, plotted to scale.

Suppose you want to know a coin's bias. After 10 flips you saw 7 heads. A classical estimate is 0.7 — period. A Bayesian says: "Before flipping, I believed the bias was probably around 0.5 (the prior). Now I've updated to a distribution centered slightly above 0.5 but with a lot of spread (the posterior) — I'm not very sure yet."

That distribution is the answer. Want a point estimate? Take its mean. Want uncertainty? Read off the credible interval. Want to predict? Average predictions over the posterior. The framework gives you principled answers to all of these.

Reach for it when

You have small data and strong prior knowledge
Honest uncertainty matters (clinical trials, A/B tests, forecasts)
You need to combine evidence from multiple sources
Decision-theoretic framing (loss × posterior → optimal action)

Skip it when

You have a lot of data and the prior doesn't matter
You only need a point estimate and don't care about uncertainty
The model is complex and you can't afford MCMC / VI
The audience won't accept "subjective" priors

# Coin-flip example: Beta(α, β) prior + Binomial likelihood -> Beta posterior
from scipy.stats import beta

alpha_prior, beta_prior = 2, 2          # weak prior around p=0.5
heads, tails = 7, 3                     # observed data

alpha_post = alpha_prior + heads
beta_post  = beta_prior  + tails

posterior = beta(alpha_post, beta_post)
print(f"Posterior mean: {posterior.mean():.3f}")
print(f"95% credible interval: ({posterior.ppf(0.025):.3f}, {posterior.ppf(0.975):.3f})")

Want Bayes' rule and conjugate priors?

Bayes' rule $$ \underbrace{p(\theta \mid \mathcal{D})}_{\text{posterior}} \;=\; \frac{\overbrace{p(\mathcal{D} \mid \theta)}^{\text{likelihood}} \;\overbrace{p(\theta)}^{\text{prior}}}{\underbrace{p(\mathcal{D})}_{\text{evidence}}} $$

θparameters or hypotheses
Dobserved data
p(D)marginal likelihood — usually intractable, but it's just a normalizer over θ

Conjugate priors. If the prior and likelihood are a "conjugate pair", the posterior is in the same family as the prior — closed-form update, no numerical integration needed. Beta-Binomial (coin flips), Gaussian-Gaussian (mean estimation with known variance), Dirichlet-Multinomial (category proportions). Use these when you can.

When conjugacy fails. Most real models aren't conjugate. Two options:

MCMC (Markov chain Monte Carlo) generates samples from the posterior without computing the normalizer. Modern variants — Hamiltonian Monte Carlo (HMC), the No-U-Turn Sampler (NUTS) — handle high-dim continuous parameters well.

Variational inference (VI) turns inference into optimization: pick a tractable approximating family q(θ), then optimize q to minimize KL divergence to the posterior. Faster than MCMC but biased — q can't capture posterior shapes it doesn't have the flexibility for.

Predictive distributions. Instead of plugging in a point estimate, average predictions over the posterior: p(x_new | D) = ∫ p(x_new | θ) p(θ | D) dθ. This automatically accounts for parameter uncertainty.

Reach for it when

Small / heterogeneous data — priors regularize gracefully
You need a coherent uncertainty propagation through several modelling steps
Mixed-effects / hierarchical models — Bayesian framework is natural
Decision making — combine posteriors with utility functions

Skip it when

Likelihood is too expensive to evaluate repeatedly (no MCMC budget)
Prior is hard to justify and the audience wants "no priors"
Posterior is multi-modal and you don't have specialized samplers
Real-time inference — MCMC is slow

import pymc as pm

# Bayesian linear regression: model = α + β·x + noise
with pm.Model() as model:
    alpha = pm.Normal("alpha", mu=0, sigma=10)
    beta  = pm.Normal("beta",  mu=0, sigma=10)
    sigma = pm.HalfNormal("sigma", sigma=1)

    mu = alpha + beta * x_obs
    pm.Normal("y", mu=mu, sigma=sigma, observed=y_obs)

    # NUTS sampler — adaptive HMC
    trace = pm.sample(2000, tune=1000, chains=4, target_accept=0.9)

# Posterior summaries
import arviz as az
print(az.summary(trace, var_names=["alpha", "beta", "sigma"]))

Want the MCMC mechanics, VI bounds, and the diagnostics?

Variational lower bound (ELBO) $$ \log p(\mathcal{D}) \;\geq\; \mathbb{E}_{q(\theta)}\!\left[\log p(\mathcal{D}, \theta) - \log q(\theta)\right] \;=\; \mathcal{L}(q) $$

Maximize ℒ over q — equivalent to minimizing KL(q ∥ posterior)
Gap = KL divergence — when zero, q = posterior

MCMC choices. Random-walk Metropolis is robust but mixes slowly. Gibbs sampling factorizes posterior into conditionals — efficient when conjugacy is partial. Hamiltonian Monte Carlo uses gradients to make informed proposals — much better mixing on continuous high-dim problems. NUTS auto-tunes HMC step size and trajectory length. Use NUTS by default in modern PPLs (PyMC, Stan, NumPyro).

Convergence diagnostics. R-hat compares within-chain to between-chain variance — should be near 1 for convergence. Effective sample size measures autocorrelation-adjusted information per draw. Energy plots reveal HMC pathologies. Trust nothing; always check.

VI methods. Mean-field VI factorizes q(θ) = ∏ q(θ_i) — fast but blind to correlations. Black-box VI / SVI uses gradient-based optimization with Monte Carlo gradients (works for any model in a PPL). Normalizing flows give expressive q while keeping density evaluation tractable — closing the gap on MCMC for high-dim continuous posteriors.

Prior sensitivity. Sensitivity to priors should be checked. Tighter priors = stronger regularization. For hierarchical models, weakly informative priors on top-level scales (HalfNormal, HalfCauchy) prevent funnel pathologies in MCMC. Reparameterize centered models to non-centered when geometry is hostile.

Model comparison. Bayes factors are tempting but unstable for diffuse priors. Use information criteria — WAIC, LOO-CV via importance sampling (PSIS-LOO) — for predictive comparison.

Reach for it when

Hierarchical / multilevel structure to exploit
You need to propagate uncertainty through a pipeline of models
Sparse signals where priors do the regularizing work
Probabilistic programming gives you compositional model construction

Skip it when

Big data & deep models — VI / MCMC don't scale; use SGD-friendly point estimates with explicit uncertainty (deep ensembles, MC dropout)
Likelihood-free / simulation-based setting — use ABC or SBI methods
Strict latency budget at inference time
Posterior is sharply multi-modal and you can't engineer good chains

import numpyro
import numpyro.distributions as dist
from jax import random
from numpyro.infer import MCMC, NUTS

# Same Bayesian regression in NumPyro — JAX-accelerated NUTS
def model(x, y=None):
    alpha = numpyro.sample("alpha", dist.Normal(0., 10.))
    beta  = numpyro.sample("beta",  dist.Normal(0., 10.))
    sigma = numpyro.sample("sigma", dist.HalfNormal(1.))
    mu = alpha + beta * x
    numpyro.sample("obs", dist.Normal(mu, sigma), obs=y)

kernel = NUTS(model, target_accept_prob=0.9)
mcmc   = MCMC(kernel, num_warmup=1000, num_samples=2000, num_chains=4)
mcmc.run(random.PRNGKey(0), x=x_obs, y=y_obs)
mcmc.print_summary()

Too dense?

Where to learn more

McElreath — Statistical Rethinking The friendliest Bayesian textbook. Builds intuition with concrete examples, every chapter has Stan/PyMC code. Lectures on YouTube too.
PyMC learning resources PyMC's curated tutorial collection — A/B testing, GLMs, hierarchical models, time series. Hands-on, well-maintained.
Blei et al. (2017) — VI Review "Variational Inference: A Review for Statisticians" — the standard reference for understanding VI.
Brooks et al. — MCMC Handbook Encyclopedic reference for MCMC algorithms and diagnostics. Free chapters online.