Hidden Markov Models — ML Resources Hub

Key idea

A system you can't see directly, but you can see its outputs. The state changes over time according to its own rules. You only observe noisy emissions. Given the emissions, you infer the hidden states.

Click an activity icon to flip it — the hidden-state posterior and Viterbi path re-shuffle to explain the new evidence

Sequence

The toy model has three hidden weather states (Sunny / Cloudy / Rainy) that you never observe directly — only the activity I picked that day (Walk / Shop / Clean). Each state has its own emission preference, and they tend to persist (Sunny is more likely to stay Sunny than flip straight to Rainy). The three colour-tinted rows show the forward-backward posterior: how strongly the model believes each state was active given the entire sequence. The dark line is the Viterbi most-likely path — the single best explanation rather than per-step marginals.

Classic example: speech recognition. The "hidden state" is the phoneme being spoken; the "observation" is the audio signal. You don't see the phoneme directly, only the sound. The model captures (a) which phonemes tend to follow which (transition probabilities) and (b) what each phoneme tends to sound like (emission probabilities).

HMMs have been largely replaced by RNNs and transformers in speech and NLP, but they're still the right tool when (a) the latent states have a real interpretation you care about, (b) data is small, or (c) you want exact inference rather than approximate.

Reach for it when

The hidden states correspond to real-world categories you care about
Small data — exact inference is feasible and helps
Biological sequence analysis (gene finding, protein structure)
You need a probabilistic story you can interrogate

Skip it when

Lots of data and you don't care about latent-state interpretability — use RNN / transformer
State transitions depend on long-range context (Markov assumption breaks)
Emissions are high-dimensional images or text
You want end-to-end learning of representations

from hmmlearn import hmm

# 3 hidden states, Gaussian emissions
model = hmm.GaussianHMM(n_components=3, covariance_type="full")
model.fit(observations)   # shape: (n_samples, n_features)

# Decode: most likely hidden-state sequence (Viterbi)
hidden_states = model.predict(observations)

# Score: log-likelihood of the observed sequence under the model
print(f"log p(obs) = {model.score(observations):.2f}")

Want the three classical problems and the algorithms?

The model $$ p(\mathbf{x}_{1:T}, \mathbf{z}_{1:T}) \;=\; \pi(z_1)\,\prod_{t=2}^{T} A(z_t \mid z_{t-1}) \,\prod_{t=1}^{T} B(x_t \mid z_t) $$

πinitial state distribution
Astate transition matrix
Bemission distribution (Gaussian, multinomial, …)
z_thidden state at time t
x_tobserved emission at time t

Three classical problems, each with its own algorithm:

1. Evaluation — given a model, how likely is an observed sequence? Use the forward algorithm. Dynamic programming, O(T·K²) where K is the number of states.

2. Decoding — given a model and observations, what's the most likely hidden state sequence? Use the Viterbi algorithm. Same complexity as forward, but tracks the argmax path.

3. Learning — given observations only, fit the model parameters. Use the Baum-Welch algorithm (EM specialized to HMMs). Iterate: E-step computes posterior over latent states via forward-backward; M-step updates π, A, B.

Without the Markov assumption, all of these would be exponential in T. The local conditioning is what makes HMMs tractable.

Reach for it when

Tagging tasks where labels follow a Markov chain (POS tagging, NER)
Acoustic modelling for small-vocabulary speech
Bioinformatics (CpG islands, gene structure, profile HMMs)
Activity recognition from sensor sequences

Skip it when

Long-range dependencies — RNN / transformer captures them better
Continuous high-dim emissions — better handled by deep generative models
You need to learn the state space (not just its parameters)
Labels are observed at every step — use CRF / sequence tagger instead

from hmmlearn import hmm
import numpy as np

# Multinomial emissions for discrete observations
model = hmm.CategoricalHMM(n_components=3, n_iter=100, random_state=0)
model.fit(X, lengths=lengths)    # lengths splits concatenated sequences

# Posterior probability of each state at each timestep
log_prob, posteriors = model.score_samples(X)

# Viterbi: most likely state sequence
states = model.predict(X)

print("Transition matrix:")
print(np.round(model.transmat_, 3))

Want forward-backward, max-product, and the CRF connection?

Forward recursion $$ \alpha_t(j) \;=\; B(x_t \mid z_t{=}j) \;\sum_{i=1}^{K} \alpha_{t-1}(i)\, A(i \to j) $$

α_t(j)joint probability of observations 1..t and hidden state j at t
Combined with the analogous backward recursion β_t(j), you get the full posterior γ_t(j) at each step

Sum-product vs max-product. Forward-backward and Viterbi are the same dynamic program with different semiring operations: forward sums over paths to compute p(x); Viterbi maxes over paths to find the best one. Both are O(T·K²) and both run in log-space in practice to avoid underflow.

Baum-Welch. The EM algorithm for HMMs. The E-step uses forward-backward to compute expected state and transition counts; the M-step normalizes those counts to update π, A, and the emission parameters. Local optima are common; warm-start with k-means or supervised pre-training if you have it.

HMMs vs. CRFs. HMMs are generative (model p(x, z)); CRFs are discriminative (model p(z | x) directly). CRFs win when you can engineer good features over the observations and have labels for training. HMMs win when labels are scarce or you want to do unsupervised structure discovery.

Generalizations. HSMMs (hidden semi-Markov) allow non-geometric state durations. Factorial HMMs decompose the state into multiple chains. Input-output HMMs condition transitions and emissions on external inputs. Beyond that, you've reinvented an RNN — at which point you probably should just use one.

Reach for it when

You need exact inference and have computational budget for it
Latent structure interpretation is the goal (e.g. regime detection)
Probabilistic outputs feed downstream Bayesian computation
Domain knowledge dictates a specific transition structure

Skip it when

You can afford to label sequences — train a sequence tagger / transformer instead
The state space is unknown and continuous
You need to model variable-rate sequences with strong dependencies
You want to backprop through the model end-to-end (use a differentiable variant)

import numpy as np

def forward_log(log_pi, log_A, log_B):
    """Forward in log-space. log_B has shape (T, K) — log emission probs."""
    T, K = log_B.shape
    log_alpha = np.empty_like(log_B)
    log_alpha[0] = log_pi + log_B[0]
    for t in range(1, T):
        # logsumexp trick to stay numerically stable
        log_alpha[t] = log_B[t] + _logsumexp(log_alpha[t-1, :, None] + log_A, axis=0)
    return log_alpha

def _logsumexp(x, axis):
    m = x.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))).squeeze(axis)

Too dense?