Latent-state sequence models — what RNNs replaced, but still useful when interpretability matters.
Key idea
A system you can't see directly, but you can see its outputs. The state changes over time according to its own rules. You only observe noisy emissions. Given the emissions, you infer the hidden states.
Click an activity icon to flip it — the hidden-state posterior and Viterbi path re-shuffle to explain the new evidence
The toy model has three hidden weather states (Sunny / Cloudy / Rainy) that you never observe directly — only the activity I picked that day (Walk / Shop / Clean). Each state has its own emission preference, and they tend to persist (Sunny is more likely to stay Sunny than flip straight to Rainy). The three colour-tinted rows show the forward-backward posterior: how strongly the model believes each state was active given the entire sequence. The dark line is the Viterbi most-likely path — the single best explanation rather than per-step marginals.
Classic example: speech recognition. The "hidden state" is the phoneme being spoken; the "observation" is the audio signal. You don't see the phoneme directly, only the sound. The model captures (a) which phonemes tend to follow which (transition probabilities) and (b) what each phoneme tends to sound like (emission probabilities).
HMMs have been largely replaced by RNNs and transformers in speech and NLP, but they're still the right tool when (a) the latent states have a real interpretation you care about, (b) data is small, or (c) you want exact inference rather than approximate.
Reach for it when
The hidden states correspond to real-world categories you care about
Small data — exact inference is feasible and helps
Biological sequence analysis (gene finding, protein structure)
You need a probabilistic story you can interrogate
Skip it when
Lots of data and you don't care about latent-state interpretability — use RNN / transformer
State transitions depend on long-range context (Markov assumption breaks)
Emissions are high-dimensional images or text
You want end-to-end learning of representations
from hmmlearn import hmm
# 3 hidden states, Gaussian emissions
model = hmm.GaussianHMM(n_components=3, covariance_type="full")
model.fit(observations) # shape: (n_samples, n_features)
# Decode: most likely hidden-state sequence (Viterbi)
hidden_states = model.predict(observations)
# Score: log-likelihood of the observed sequence under the model
print(f"log p(obs) = {model.score(observations):.2f}")
Want the three classical problems and the algorithms?
Three classical problems, each with its own algorithm:
1. Evaluation — given a model, how likely is an observed sequence? Use the forward algorithm. Dynamic programming, O(T·K²) where K is the number of states.
2. Decoding — given a model and observations, what's the most likely hidden state sequence? Use the Viterbi algorithm. Same complexity as forward, but tracks the argmax path.
3. Learning — given observations only, fit the model parameters. Use the Baum-Welch algorithm (EM specialized to HMMs). Iterate: E-step computes posterior over latent states via forward-backward; M-step updates π, A, B.
Without the Markov assumption, all of these would be exponential in T. The local conditioning is what makes HMMs tractable.
Reach for it when
Tagging tasks where labels follow a Markov chain (POS tagging, NER)
Long-range dependencies — RNN / transformer captures them better
Continuous high-dim emissions — better handled by deep generative models
You need to learn the state space (not just its parameters)
Labels are observed at every step — use CRF / sequence tagger instead
from hmmlearn import hmm
import numpy as np
# Multinomial emissions for discrete observations
model = hmm.CategoricalHMM(n_components=3, n_iter=100, random_state=0)
model.fit(X, lengths=lengths) # lengths splits concatenated sequences
# Posterior probability of each state at each timestep
log_prob, posteriors = model.score_samples(X)
# Viterbi: most likely state sequence
states = model.predict(X)
print("Transition matrix:")
print(np.round(model.transmat_, 3))
Want forward-backward, max-product, and the CRF connection?
αt(j)joint probability of observations 1..t and hidden state j at t
Combined with the analogous backward recursion βt(j), you get the full posterior γt(j) at each step
Sum-product vs max-product. Forward-backward and Viterbi are the same dynamic program with different semiring operations: forward sums over paths to compute p(x); Viterbi maxes over paths to find the best one. Both are O(T·K²) and both run in log-space in practice to avoid underflow.
Baum-Welch. The EM algorithm for HMMs. The E-step uses forward-backward to compute expected state and transition counts; the M-step normalizes those counts to update π, A, and the emission parameters. Local optima are common; warm-start with k-means or supervised pre-training if you have it.
HMMs vs. CRFs. HMMs are generative (model p(x, z)); CRFs are discriminative (model p(z | x) directly). CRFs win when you can engineer good features over the observations and have labels for training. HMMs win when labels are scarce or you want to do unsupervised structure discovery.
Generalizations. HSMMs (hidden semi-Markov) allow non-geometric state durations. Factorial HMMs decompose the state into multiple chains. Input-output HMMs condition transitions and emissions on external inputs. Beyond that, you've reinvented an RNN — at which point you probably should just use one.
Reach for it when
You need exact inference and have computational budget for it
Latent structure interpretation is the goal (e.g. regime detection)