Open the black box — feature importance, attribution, mechanistic interpretability, and the limits of "why did the model do that?"
Key idea
Two flavours. Per-prediction explanation: "why did this input get this answer?". Model-wide understanding: "what does this model actually compute?". The first is feature-attribution (SHAP, LIME, Integrated Gradients). The second is mechanistic interpretability (probing, circuits, sparse autoencoders). Both have real successes and real limits.
SHAP-style local explanation — which features pushed this prediction up, which pushed it down?
A single tabular prediction with per-feature SHAP-style contributions. Each bar shows how much that feature pushed the prediction up (orange) or down (indigo) relative to the average. The contributions plus the model's average prediction sum to this specific prediction; SHAP values guarantee this additivity by construction.
Feature importance (model-wide). Permutation importance: shuffle a feature; how much does the score drop? Tree importance: how often a feature appears in splits, weighted by gain. Both are simple and useful but can be misled by correlated features.
SHAP & LIME (per-prediction). SHAP (Lundberg & Lee, 2017) — game-theoretic Shapley values, additivity by construction, slow but principled. LIME — fit a local linear surrogate around a single prediction, fast but unstable. Both give "this feature contributed +X to this prediction".
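The local-surrogate idea behind LIME can be sketched in a few lines. This is a minimal illustration, not the `lime` package's algorithm: `black_box` is a hypothetical model, and the Gaussian perturbation scale, kernel width, and sample count are arbitrary assumptions. It perturbs the input, weights each sample by proximity, and fits a weighted linear model whose coefficients act as the explanation.

```python
import math
import random

def black_box(x):
    """Hypothetical nonlinear model standing in for any classifier score."""
    return 1.0 / (1.0 + math.exp(-(3.0 * x[0] - x[1] ** 2)))

def solve(A, b):
    """Solve A w = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def lime_style_explain(f, x0, n_samples=2000, scale=0.3, width=0.5, seed=0):
    """Fit a proximity-weighted linear surrogate to f around x0.

    Returns [intercept, coef_0, coef_1, ...] expressed in offsets from x0.
    """
    rng = random.Random(seed)
    d = len(x0)
    A = [[0.0] * (d + 1) for _ in range(d + 1)]  # weighted normal equations
    b = [0.0] * (d + 1)
    for _ in range(n_samples):
        delta = [rng.gauss(0.0, scale) for _ in range(d)]
        x = [xi + di for xi, di in zip(x0, delta)]
        # Exponential kernel: samples closer to x0 count more.
        w = math.exp(-sum(di * di for di in delta) / width ** 2)
        row = [1.0] + delta
        y = f(x)
        for i in range(d + 1):
            b[i] += w * row[i] * y
            for j in range(d + 1):
                A[i][j] += w * row[i] * row[j]
    return solve(A, b)

coefs = lime_style_explain(black_box, x0=[0.0, 1.0])
print("intercept and local coefficients:", coefs)
```

Rerunning with a different `seed` or `width` changes the coefficients, which is the instability the paragraph above refers to.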
Saliency maps for images. Where in the image did the model "look"? Gradient × input, Integrated Gradients, Grad-CAM, SmoothGrad. All are easy to compute but easy to misread: a saliency map can highlight the right region for the wrong reason.
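Of the methods listed, Integrated Gradients is straightforward to sanity-check because its attributions should sum to the difference between the model's output at the input and at the baseline. The sketch below assumes a hypothetical three-input `model`, an all-zero baseline, and finite-difference gradients in place of autograd; it approximates the path integral with a midpoint Riemann sum.

```python
import math

def model(x):
    """Hypothetical differentiable scorer (stand-in for a network logit)."""
    return math.tanh(2.0 * x[0] + x[1] * x[2])

def grad(f, x, eps=1e-5):
    """Central-difference gradient; a real implementation would use autograd."""
    g = []
    for i in range(len(x)):
        up = list(x)
        down = list(x)
        up[i] += eps
        down[i] -= eps
        g.append((f(up) - f(down)) / (2 * eps))
    return g

def integrated_gradients(f, x, baseline, steps=200):
    """Average gradients along the straight path baseline -> x, then scale."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint rule
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        total = [t + gi for t, gi in zip(total, grad(f, point))]
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]

x = [0.5, 1.0, -0.8]
baseline = [0.0, 0.0, 0.0]
attr = integrated_gradients(model, x, baseline)

# Completeness check: attributions should add up to f(x) - f(baseline).
gap = sum(attr) - (model(x) - model(baseline))
print("attributions:", attr)
print("completeness gap:", gap)
```

A near-zero gap only confirms the integral was approximated well; it says nothing about whether the chosen baseline is meaningful for the task.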
Attention as explanation? Tempting but controversial. Attention weights don't directly correspond to feature importance, even in transformers. Use as a starting hypothesis, not an explanation. Jain & Wallace (2019): "Attention is not Explanation".
Mechanistic interpretability. Find specific circuits inside a neural network that implement a specific behaviour. Examples: induction heads (Anthropic) and features recovered by sparse autoencoders (Anthropic, OpenAI). Still early: the field is producing real results, but mostly on toy and small models.
Reach for it when
Regulated domain (finance, healthcare, insurance), where explanations are often a legal or compliance requirement
Debugging — why is this prediction wrong?
Audit — does the model rely on a feature it shouldn't?
Trust — communicating model behaviour to non-ML stakeholders
Real limits
Explanations can mislead — humans accept plausible-but-wrong narratives
Correlated features confound most attribution methods
"Why" questions presume a causal model that the attribution doesn't have
Mechanistic understanding is hard work and rarely complete
import shap
import xgboost as xgb
# Train and explain with SHAP
model = xgb.XGBClassifier().fit(X_train, y_train)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
# Force plot — one prediction's attributions
shap.force_plot(explainer.expected_value, shap_values[0], X_test.iloc[0])
# Summary plot — global feature importance with direction
shap.summary_plot(shap_values, X_test)
# Permutation importance — model-agnostic
from sklearn.inspection import permutation_importance
result = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=0)
for i in result.importances_mean.argsort()[::-1][:5]:
    print(f"{X_val.columns[i]:30s} {result.importances_mean[i]:.3f}")
Want global vs local, axiomatic guarantees, & mechanistic methods?
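Exact Shapley computation is exponential in the number of features; SHAP uses tractable approximations (sampling in KernelSHAP) and exact polynomial-time algorithms for specific model classes (TreeSHAP for tree ensembles).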
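To see where the exponential cost comes from, here is a brute-force Shapley computation that enumerates every feature subset. It is a sketch under simplifying assumptions: the model `f`, the input `x`, and the single all-zero `background` point are hypothetical, and "absent" features are simply replaced by the background value, whereas SHAP averages over a background dataset.

```python
from itertools import combinations
from math import factorial

def shapley_values(value, n):
    """Exact Shapley values for a set function `value` over n features.

    Evaluates `value` on every subset, so the cost grows as 2**n.
    """
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi

# Hypothetical model with one additive term and one interaction term.
def f(z):
    return 3.0 * z[0] + 2.0 * z[1] * z[2]

x = [2.0, 1.0, 3.0]
background = [0.0, 0.0, 0.0]  # assumption: single reference point

def value(subset):
    """Model output with features outside `subset` set to the background."""
    z = [x[i] if i in subset else background[i] for i in range(len(x))]
    return f(z)

phi = shapley_values(value, len(x))
print("Shapley values:", phi)
print("sum + base value:", sum(phi) + f(background), "prediction:", f(x))
```

The interaction term is split evenly between the two features that jointly produce it, and the values add up to the prediction minus the base value, which is the additivity property described earlier.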
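Axiomatic attribution. SHAP and Integrated Gradients are each derived from a set of axioms: SHAP from local accuracy (efficiency), missingness, and consistency; Integrated Gradients from sensitivity, implementation invariance, and completeness. LIME is a heuristic local surrogate and carries no comparable guarantees. The choice of axioms determines the method. There's no "right" axiom set; pick what matches your need.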
Local vs global explanations. Local: "why this specific prediction?" — SHAP-per-instance, IG, LIME. Global: "what does the model rely on overall?" — permutation importance, mean |SHAP|, partial dependence plots. Both are useful for different questions.
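Partial dependence plots (PDP) and ICE. PDP: vary one feature over a grid and, at each grid value, average the model's prediction over the dataset. ICE (individual conditional expectation): plot the same curve for each instance separately, without averaging, so heterogeneous effects hidden by the PDP average become visible. Both visualise the marginal effect of one feature on the model's prediction.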
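The relationship between the two plots is easy to show from scratch. In this sketch the `model` and the four-row dataset are hypothetical; the PDP is computed literally as the pointwise average of the ICE curves.

```python
def model(row):
    """Hypothetical fitted model: income matters more for applicants aged 40+."""
    age, income = row
    return 0.01 * age + 0.5 * income * (1.0 if age >= 40 else 0.2)

data = [(25, 1.0), (35, 2.0), (45, 1.5), (60, 3.0)]  # (age, income) rows
grid = [0.0, 1.0, 2.0, 3.0]                          # income values to sweep

def ice_curves(f, rows, feature, grid):
    """One curve per row: overwrite `feature` with each grid value and predict."""
    curves = []
    for row in rows:
        curve = []
        for v in grid:
            modified = list(row)
            modified[feature] = v
            curve.append(f(modified))
        curves.append(curve)
    return curves

def partial_dependence(f, rows, feature, grid):
    """PDP is the average of the ICE curves at each grid value."""
    curves = ice_curves(f, rows, feature, grid)
    return [sum(c[k] for c in curves) / len(curves) for k in range(len(grid))]

ice = ice_curves(model, data, feature=1, grid=grid)
pdp = partial_dependence(model, data, feature=1, grid=grid)
print("PDP:", pdp)
for row, curve in zip(data, ice):
    print("ICE for", row, "->", curve)
```

Here the PDP rises by 0.3 per unit of income, while the individual curves rise by either 0.1 or 0.5 depending on age, a split the averaged plot cannot show.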
Counterfactual explanations. "What's the smallest change to the input that flips the prediction?" Useful for actionable explanations: "to get the loan, raise income by $5k". Tools: DiCE, Alibi, scikit-learn-compat libraries.
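For a linear scorer the smallest change that flips the decision has a closed form, which makes the idea concrete. The weights, feature names, and applicant below are hypothetical, and this is not how DiCE or Alibi work internally; those tools search for counterfactuals on nonlinear models. The sketch restricts the change to features the applicant can actually alter.

```python
def score(x, w, b):
    """Linear decision score; the loan is approved when score > 0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def linear_counterfactual(x, w, b, mutable, margin=1e-6):
    """Smallest L2 change over `mutable` features that lifts the score to +margin."""
    s = score(x, w, b)
    if s > 0:
        return list(x)  # already approved, nothing to change
    norm_sq = sum(w[i] ** 2 for i in mutable)
    if norm_sq == 0:
        raise ValueError("no mutable feature influences the score")
    step = (margin - s) / norm_sq
    return [xi + step * w[i] if i in mutable else xi for i, xi in enumerate(x)]

# Hypothetical features: [income in $k, debt in $k, age]
w = [0.08, -0.05, 0.01]
b = -4.0
applicant = [40.0, 10.0, 30.0]

cf_income = linear_counterfactual(applicant, w, b, mutable={0})
cf_both = linear_counterfactual(applicant, w, b, mutable={0, 1})
print("income only:", cf_income)
print("income and debt:", cf_both)
```

Allowing both income and debt to move yields a smaller total change than income alone, which is why the choice of mutable features is part of the explanation rather than a detail.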
Probing classifiers. Train a small linear classifier on a model's internal representations to predict some property. Reveals what information is linearly decodable from the model's intermediate states. Used heavily in interpretability of language models.
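A probe is just a small supervised model trained on frozen activations. The sketch below uses synthetic "hidden states" in which a binary property is planted along two dimensions; `fake_hidden_state`, the dimensionality, and the training settings are all assumptions made for illustration, and a real probe would read activations from an actual model.

```python
import math
import random

rng = random.Random(0)
DIM = 8

def fake_hidden_state(label):
    """Hypothetical activations: the property is encoded along dims 0 and 3."""
    h = [rng.gauss(0.0, 1.0) for _ in range(DIM)]
    sign = 1.0 if label == 1 else -1.0
    h[0] += 2.0 * sign
    h[3] -= 1.5 * sign
    return h

def make_split(n):
    labels = [rng.randint(0, 1) for _ in range(n)]
    return [fake_hidden_state(y) for y in labels], labels

def train_probe(X, y, epochs=50, lr=0.1):
    """Logistic-regression probe trained with plain SGD."""
    w, b = [0.0] * DIM, 0.0
    for _ in range(epochs):
        for h, t in zip(X, y):
            logit = sum(wi * hi for wi, hi in zip(w, h)) + b
            logit = max(-30.0, min(30.0, logit))  # avoid overflow in exp
            err = 1.0 / (1.0 + math.exp(-logit)) - t
            w = [wi - lr * err * hi for wi, hi in zip(w, h)]
            b -= lr * err
    return w, b

def accuracy(w, b, X, y):
    hits = 0
    for h, t in zip(X, y):
        pred = 1 if sum(wi * hi for wi, hi in zip(w, h)) + b > 0 else 0
        hits += int(pred == t)
    return hits / len(y)

X_train, y_train = make_split(400)
X_test, y_test = make_split(200)
w, b = train_probe(X_train, y_train)
probe_acc = accuracy(w, b, X_test, y_test)
print("held-out probe accuracy:", probe_acc)
print("largest-magnitude weights at dims:",
      sorted(range(DIM), key=lambda i: -abs(w[i]))[:2])
```

High held-out accuracy shows the property is linearly decodable from the representation; it does not show that the model itself uses that information.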
Concept-based methods. TCAV (Kim et al. 2018) — measure how sensitive a prediction is to a high-level concept (defined by examples). Useful for "does the model use this concept?" rather than "this raw feature?".
Sparse autoencoder objective
$$ \mathcal{L} = \big\lVert x - \hat x \big\rVert^2 + \lambda\,\lVert z \rVert_1, \quad z = \sigma(W_\text{enc} x + b), \quad \hat x = W_\text{dec} z $$
Learn a sparse code z over the network's hidden state
Activations of single z-dimensions tend to correspond to interpretable features
Anthropic / DeepMind's flagship mech-interp tool
Mechanistic interpretability. Aim: understand a neural network the way you'd understand a compiled program — find the algorithms it implements. Olah et al.'s Circuits work (Distill 2020+), Anthropic's induction heads (2022), Nanda's modular arithmetic (2023). Real progress on toy and small-real models.
Probing & representational similarity. Probe with a small classifier; representational similarity analysis (RSA); centered kernel alignment (CKA). All are ways of asking "what information is in this layer?" without intervening on the model.
Causal interpretability. Patch activations between forward passes; ablate specific neurons or attention heads; measure the effect. Activation patching (called causal tracing in Meng et al. 2022), path patching (Goldowsky-Dill et al. 2023). The most rigorous way to attribute behaviour to components.
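Activation patching can be illustrated on a network small enough to verify by hand. The weights and inputs below are hypothetical; the procedure is to run a clean and a corrupted input, copy one hidden activation from the clean run into the corrupted run, and record how much of the clean output is restored.

```python
def relu(v):
    return max(0.0, v)

# Hypothetical 3-input, 3-hidden, 1-output network with fixed weights.
W1 = [[1.0, 0.0, 0.0],   # hidden 0 reads input 0
      [0.0, 1.0, 0.0],   # hidden 1 reads input 1
      [0.0, 0.0, 1.0]]   # hidden 2 reads input 2
W2 = [2.0, 0.0, 0.5]     # the output ignores hidden 1 entirely

def forward(x, patch=None):
    """Run the network; `patch` maps hidden index -> activation to overwrite."""
    hidden = [relu(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    if patch:
        for idx, val in patch.items():
            hidden[idx] = val
    out = sum(w * h for w, h in zip(W2, hidden))
    return out, hidden

clean = [1.0, 1.0, 1.0]
corrupted = [0.0, 0.0, 0.0]
clean_out, clean_hidden = forward(clean)
corrupt_out, _ = forward(corrupted)

# Fraction of the clean-vs-corrupted gap recovered by patching each unit.
effects = {}
for i in range(len(W2)):
    patched_out, _ = forward(corrupted, patch={i: clean_hidden[i]})
    effects[i] = (patched_out - corrupt_out) / (clean_out - corrupt_out)

print("recovered fraction per hidden unit:", effects)
```

Hidden unit 1 is active on the clean input yet recovers nothing, which is the distinction between a component carrying information and a component being causally used.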
Sparse autoencoders for feature discovery. Train a sparse autoencoder on a model's hidden states. The sparse codes' dimensions tend to correspond to interpretable features ("this feature fires for fruit"), often more cleanly than individual neurons do. Bricken et al. 2023 (Anthropic), Cunningham et al. 2023.
Why "attention is not explanation" matters. Attention weights are part of the computation but they don't tell you what the model is computing. Two attention distributions can lead to the same output; high attention to a token doesn't mean the model needs that token. Use attention as a hypothesis generator, not an answer.
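The claim that two attention distributions can give the same output is easy to reproduce when tokens share a value vector. The value vectors and weights here are hypothetical and chosen to make the effect exact; in a trained model the redundancy is usually approximate.

```python
def attend(weights, values):
    """Attention output: weighted sum of value vectors."""
    dim = len(values[0])
    return [sum(w * v[k] for w, v in zip(weights, values)) for k in range(dim)]

# Tokens 0 and 1 carry the same value vector; token 2 differs.
values = [[1.0, 0.0],
          [1.0, 0.0],
          [0.0, 1.0]]

attn_a = [0.7, 0.1, 0.2]  # "looks at" token 0
attn_b = [0.1, 0.7, 0.2]  # "looks at" token 1

out_a = attend(attn_a, values)
out_b = attend(attn_b, values)
print("output A:", out_a)
print("output B:", out_b)
```

Reading `attn_a` as "token 0 is what matters" would be wrong: moving that weight to token 1 leaves the output unchanged.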
Limits and dangers of explanations. Even faithful explanations can mislead — humans confabulate. "Right answer for the wrong reason" is hard to detect. Adversarial explanations exist (Slack et al. 2020 — fool LIME / SHAP). Treat explanation as a debugging tool, not as truth.
Mechanistic for LLMs. The frontier — induction heads, name mover heads, indirect object identification, modular arithmetic circuits, the "in-context learning" mechanism. Anthropic and OpenAI have dedicated interpretability teams; results are technical and incremental.
import torch, torch.nn as nn
# Sparse autoencoder over a transformer's hidden states
class SAE(nn.Module):
    def __init__(self, d_model, d_dict, lam=1e-3):
        super().__init__()
        self.W_enc = nn.Linear(d_model, d_dict)
        self.W_dec = nn.Linear(d_dict, d_model, bias=False)
        self.lam = lam
    def forward(self, x):
        z = torch.relu(self.W_enc(x))
        x_hat = self.W_dec(z)
        loss = (x - x_hat).pow(2).mean() + self.lam * z.abs().mean()
        return x_hat, z, loss
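# Train on hidden activations harvested from a transformer forward pass.
# Assumes a Hugging Face-style `transformer` called on tokenised `inputs`;
# `transformer`, `inputs`, and `layer` are placeholders defined elsewhere.
with torch.no_grad():
    hidden = transformer(**inputs, output_hidden_states=True).hidden_states[layer]
sae = SAE(d_model=hidden.shape[-1], d_dict=8 * hidden.shape[-1])  # 8x width is an arbitrary choice
x_hat, z, loss = sae(hidden)  # one training step: loss.backward(), then an optimizer step
# After training, z has thousands of dimensions, most of which are zero at any given input.
# Look at what each non-zero feature fires on across many examples.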