Comparing models in production — sample size, statistical significance, bandits, and the human factors.
Key idea
You can't tell from offline metrics alone whether a model is "better". Offline AUC up doesn't always mean online conversion up. A/B test in production with a clearly-defined metric, enough sample size for the effect you care about, and pre-registered analysis. Then decide.
The basic A/B. Random 50/50 split of users (or sessions). Group A (control) gets the current model; Group B gets the candidate. Compare a primary metric over a fixed evaluation window. Hypothesis test: is the difference larger than chance?
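For a conversion-style metric, the hypothesis test above is often a two-proportion z-test. The following is a minimal sketch under that assumption; the function name and the counts are hypothetical, and other metrics (revenue, latency) would call for a different test.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates (pooled variance)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis that both arms convert equally.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical counts: 10.0% vs 10.6% conversion on 20,000 users per arm.
z, p_value = two_proportion_ztest(2000, 20000, 2120, 20000)
print(f"z = {z:.2f}, p = {p_value:.3f}")
```

The p-value is only meaningful if the sample size and evaluation window were fixed before the test started.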
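Sample size. Smaller effect → larger sample needed. A standard rule of thumb (α = 0.05, 80% power): n ≈ 16 / d² per arm, where d is the effect size in standard-deviation units. For a conversion metric with a 10% baseline, σ² = 0.1 × 0.9 = 0.09, so detecting a 1-percentage-point absolute lift (10% → 11%) needs about 16 × 0.09 / 0.01² ≈ 14 400 users per arm, while a 1% relative lift (10% → 10.1%) needs about 1.4 million per arm.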
Common gotchas. Peeking at results and stopping early (inflates false-positive rate). Network effects (the treatment of one user affects another). Novelty effects (users react to anything new, then revert). Seasonality (test on a representative time window).
Statistical power
$$ n \approx \frac{2\sigma^2 (z_{1-\alpha/2} + z_{1-\beta})^2}{\delta^2} $$
δ: the minimum detectable effect
α, β: false-positive and false-negative rates (typically 0.05, 0.20)
σ: standard deviation of the metric
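The power formula can be evaluated directly. This sketch assumes n is the per-arm sample size, a two-sided test, and a binary conversion metric so that σ² = p(1 − p); the function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_arm(sigma, delta, alpha=0.05, beta=0.20):
    """Per-arm n from n ≈ 2σ²(z_{1-α/2} + z_{1-β})² / δ²."""
    z = NormalDist().inv_cdf
    return ceil(2 * sigma**2 * (z(1 - alpha / 2) + z(1 - beta)) ** 2 / delta**2)

p = 0.10                      # baseline conversion rate
sigma = (p * (1 - p)) ** 0.5  # std. dev. of a Bernoulli(p) metric

n_relative = sample_size_per_arm(sigma, delta=0.01 * p)  # 10.0% -> 10.1%
n_absolute = sample_size_per_arm(sigma, delta=0.01)      # 10% -> 11%
print(n_relative, n_absolute)
```

With the typical α = 0.05 and β = 0.20, the factor 2(z₁₋α/₂ + z₁₋β)² is about 15.7, which is where the "16" in the rule of thumb comes from. The relative-lift case lands around 1.4 million users per arm, the absolute-lift case around 14 thousand.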
Sequential testing. Fixed-horizon A/B requires committing to a sample size up front. Sequential / always-valid p-values let you peek without inflating false positives. Methods: SPRT, mSPRT, group-sequential designs, e-values. Tools: Optimizely, Statsig, Eppo all implement some variant.
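As an illustration of the simplest method in that list, here is a sketch of Wald's SPRT for a single stream of 0/1 outcomes, testing a hypothesized rate p0 against p1. It is not a full two-arm A/B procedure (mSPRT and group-sequential designs handle that case and are more involved); the function name and thresholds follow Wald's classic approximations.

```python
from math import log

def sprt_bernoulli(outcomes, p0, p1, alpha=0.05, beta=0.20):
    """Wald's SPRT for H0: p = p0 vs H1: p = p1 on a stream of 0/1 outcomes.

    Returns (decision, number_of_observations_used).
    """
    upper = log((1 - beta) / alpha)   # crossing above -> accept H1
    lower = log(beta / (1 - alpha))   # crossing below -> accept H0
    llr = 0.0                         # running log-likelihood ratio
    for i, x in enumerate(outcomes, start=1):
        llr += log(p1 / p0) if x else log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "accept H1", i
        if llr <= lower:
            return "accept H0", i
    return "continue", len(outcomes)

decision, n_used = sprt_bernoulli([0] * 50, p0=0.10, p1=0.20)
print(decision, n_used)
```

The test can be checked after every observation because the stopping boundaries are part of the design, which is what makes peeking safe here.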
CUPED (variance reduction). Microsoft's technique: regress the metric on a pre-experiment covariate (the same user's metric before the test). The residuals have much lower variance, so you need fewer users for the same power. 20–50% sample size reduction is typical.
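A minimal sketch of the CUPED adjustment on synthetic data, assuming a single pre-experiment covariate per user. The function name and the simulated numbers are illustrative; in practice θ is usually estimated on data pooled across both arms and then applied to each arm.

```python
import numpy as np

def cuped_adjust(y, x):
    """Return y - theta * (x - mean(x)), with theta = cov(x, y) / var(x)."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    theta = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    # Centering x keeps the mean of the adjusted metric equal to mean(y).
    return y - theta * (x - x.mean())

# Synthetic example: the in-experiment metric is strongly tied to the
# same user's pre-experiment metric.
rng = np.random.default_rng(0)
pre = rng.normal(10.0, 2.0, size=5000)
post = pre + rng.normal(0.0, 1.0, size=5000)

adjusted = cuped_adjust(post, pre)
print(post.var(), adjusted.var())
```

The variance reduction equals the squared correlation between the covariate and the metric, so the gain depends entirely on how predictive the pre-period is.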
Multi-armed bandits. Instead of fixed splits, allocate more traffic to better-performing arms over time. Thompson sampling is the popular choice. Trade-off: faster learning vs cleaner causal inference. Use bandits for exploitation; A/B for explanation.
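Thompson sampling for binary rewards can be sketched with a Beta posterior per arm. This is a simulation under assumed true conversion rates (which a real system would not know); the function name and parameters are hypothetical.

```python
import random

def thompson_sampling(true_rates, n_rounds, seed=0):
    """Simulate Beta-Bernoulli Thompson sampling; return pulls per arm."""
    rng = random.Random(seed)
    k = len(true_rates)
    successes = [0] * k
    failures = [0] * k
    pulls = [0] * k
    for _ in range(n_rounds):
        # Draw one plausible conversion rate per arm from its Beta(1+s, 1+f)
        # posterior, then play the arm whose draw is highest.
        samples = [rng.betavariate(successes[i] + 1, failures[i] + 1) for i in range(k)]
        arm = max(range(k), key=samples.__getitem__)
        reward = rng.random() < true_rates[arm]  # simulated user response
        successes[arm] += int(reward)
        failures[arm] += int(not reward)
        pulls[arm] += 1
    return pulls

pulls = thompson_sampling([0.05, 0.20], n_rounds=5000)
print(pulls)
```

Because traffic shifts toward the leading arm, the losing arm ends up with few observations, which is why effect estimates from a bandit are noisier than from a fixed split.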
Bayesian A/B. Compute the posterior probability that B beats A by at least δ. More intuitive for stakeholders ("80% chance B is better"). Same data; different interpretation. Doesn't fix multiple-comparisons or peeking by itself.
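For a conversion metric, that posterior probability can be approximated by Monte Carlo. The sketch below assumes independent uniform Beta(1, 1) priors on each arm's rate; the function name, counts, and number of draws are hypothetical choices.

```python
import random

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, delta=0.0, draws=20000, seed=0):
    """Monte Carlo estimate of P(p_B - p_A > delta) under Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        p_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        p_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += (p_b - p_a) > delta
    return wins / draws

# Hypothetical counts: 10.0% vs 10.6% conversion on 20,000 users per arm.
p_better = prob_b_beats_a(2000, 20000, 2120, 20000)
p_better_by_1pt = prob_b_beats_a(2000, 20000, 2120, 20000, delta=0.01)
print(p_better, p_better_by_1pt)
```

Setting δ above zero answers the more useful question of whether B is better by a margin that matters, not just better at all.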
Multiple comparisons. Testing many metrics increases the chance of a false positive somewhere. Pre-register a primary metric; report others as "exploratory". Bonferroni or Benjamini-Hochberg for principled correction.
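The Benjamini-Hochberg step-up procedure is short enough to write out. This is a plain-Python sketch with a hypothetical function name and made-up p-values; it assumes the tests are independent or positively dependent, which is the condition under which the procedure controls the false discovery rate.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the hypothesis is rejected at FDR q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) whose sorted p-value is <= k/m * q.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * q:
            k_max = rank
    # Reject every hypothesis ranked at or below k_max.
    rejected = [False] * m
    for i in order[:k_max]:
        rejected[i] = True
    return rejected

# Hypothetical p-values for ten secondary metrics.
p_vals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
bh = benjamini_hochberg(p_vals, q=0.05)
bonferroni = [p <= 0.05 / len(p_vals) for p in p_vals]
print(sum(bh), sum(bonferroni))
```

On these numbers Bonferroni keeps only the smallest p-value while Benjamini-Hochberg keeps the two smallest, which reflects the usual trade: Bonferroni controls the chance of any false positive, BH controls the expected share of false positives among discoveries.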
Subgroup analysis. The treatment can help one segment and hurt another. Slice by demographics, geography, device. Watch for Simpson's paradox: aggregate looks fine, every subgroup is worse.
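A small numeric illustration of Simpson's paradox, using made-up counts in which arm B receives most of its traffic from the high-converting segment. All names and numbers are hypothetical.

```python
# Hypothetical counts: (conversions, users) per segment and arm.
data = {
    "mobile":  {"A": (400, 4000), "B": (95, 1000)},
    "desktop": {"A": (300, 1000), "B": (1160, 4000)},
}

def segment_rates(data):
    """Conversion rate per segment and arm."""
    return {
        seg: {arm: conv / n for arm, (conv, n) in arms.items()}
        for seg, arms in data.items()
    }

def aggregate_rates(data):
    """Conversion rate per arm after pooling all segments."""
    totals = {}
    for arms in data.values():
        for arm, (conv, n) in arms.items():
            c, m = totals.get(arm, (0, 0))
            totals[arm] = (c + conv, m + n)
    return {arm: c / m for arm, (c, m) in totals.items()}

by_segment = segment_rates(data)
overall = aggregate_rates(data)
print(by_segment)
print(overall)
```

B is worse in both segments yet looks far better overall, purely because of the segment mix. In a properly randomized test the mix should match across arms, so a pattern like this is also a prompt to check for a sample ratio mismatch.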
Off-policy evaluation
Estimate the value of a new policy π from logs of an old behaviour policy
Doesn't require deploying π to evaluate it
Variance grows with how different π is from the behaviour policy
Interleaving. Show both models' results to the same user (e.g., ranked-list problems — interleave A's and B's recommendations). Much higher statistical power per user; works because each user is their own control. Used heavily at Microsoft, Netflix, search engines.
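One common scheme for ranked lists is team-draft interleaving. The sketch below is a simplified version under stated assumptions: document ids are hashable, clicks are credited to whichever model contributed the clicked item, and the function names are hypothetical.

```python
import random

def team_draft_interleave(ranking_a, ranking_b, length, seed=None):
    """Merge two rankings; return (interleaved list, doc -> contributing team)."""
    rng = random.Random(seed)
    rankings = {"A": ranking_a, "B": ranking_b}
    interleaved, team = [], {}
    count = {"A": 0, "B": 0}
    while len(interleaved) < length:
        # The team with fewer picks goes next; ties are broken by coin flip.
        if count["A"] != count["B"]:
            picker = "A" if count["A"] < count["B"] else "B"
        else:
            picker = rng.choice("AB")
        candidates = [d for d in rankings[picker] if d not in team]
        if not candidates:
            # This team has nothing new to offer; let the other team pick.
            picker = "B" if picker == "A" else "A"
            candidates = [d for d in rankings[picker] if d not in team]
            if not candidates:
                break
        doc = candidates[0]  # the picker's highest-ranked unused document
        interleaved.append(doc)
        team[doc] = picker
        count[picker] += 1
    return interleaved, team

def credit(clicked_docs, team):
    """Count clicks per contributing team."""
    wins = {"A": 0, "B": 0}
    for doc in clicked_docs:
        if doc in team:
            wins[team[doc]] += 1
    return wins

shown, team = team_draft_interleave(
    ["d1", "d2", "d3", "d4"], ["d3", "d1", "d5", "d6"], length=4, seed=1
)
print(shown, credit(["d3"], team))
```

Aggregating the per-session click credit across many users gives the preference signal; the per-session winner is what gets compared, not raw conversion rates.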
Off-policy evaluation (OPE). Estimate how a new policy would perform from logs of a previous one. IPS (inverse propensity scoring), doubly robust estimators, model-based OPE. Standard in recommendation and ad-ranking. Requires the logging policy to have explored — if it always picked the same thing, you can't evaluate alternatives.
Long-term effects. Some changes hurt short-term metrics but help long-term (paywalls, ads, content moderation). Different evaluation: long-term holdouts, instrumental variables, or careful causal modelling. Hard; rarely done well.
Heterogeneous treatment effects (HTE). The treatment helps some users and hurts others. Estimating τ(x) = E[Y(1) − Y(0) | X = x] with causal forests, double ML, or T-/X-learners. Useful for targeted deployment.
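Of the estimators named above, the T-learner is the easiest to sketch: fit one outcome model per arm and subtract the predictions. The example below uses ordinary least squares as the base model and synthetic data where the true effect is τ(x) = 2x; the function names and data-generating process are assumptions for illustration.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares fit with an intercept; returns the coefficient vector."""
    X1 = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return coef

def predict_linear(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef

def t_learner_cate(X, treated, y, X_new):
    """T-learner: separate outcome models for treated and control users."""
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    treated = np.asarray(treated, dtype=bool)
    coef_1 = fit_linear(X[treated], y[treated])
    coef_0 = fit_linear(X[~treated], y[~treated])
    # Estimated tau(x) = E[Y | X = x, treated] - E[Y | X = x, control].
    return predict_linear(coef_1, X_new) - predict_linear(coef_0, X_new)

# Synthetic randomized experiment: treatment helps when x > 0, hurts when x < 0.
rng = np.random.default_rng(0)
n = 4000
X = rng.uniform(-1.0, 1.0, size=(n, 1))
treated = rng.random(n) < 0.5
y = 1.0 + 0.5 * X[:, 0] + treated * 2.0 * X[:, 0] + rng.normal(0.0, 0.5, size=n)

tau_hat = t_learner_cate(X, treated, y, np.array([[-0.5], [0.5]]))
print(tau_hat)
```

The average effect in this simulation is close to zero, so a plain A/B readout would suggest the change does nothing, while the per-user estimates show who to target.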
Switchback experiments. When users can't be split (e.g., ride-share dispatching), switch the treatment on and off over time within the same population. Mitigates network effects.
Pre-registration & audit. Write the analysis plan before looking at results. Commit to one primary metric, one stopping rule, one analysis method. Reduces hindsight-driven p-hacking — and makes the test reusable in retros.
Cost-aware testing. Each "challenger" model deployment incurs implementation, ramp, and rollback costs. Decision-theoretic framing: expected value of running the test > cost of running it? Often the test isn't worth running because the expected effect is too small.
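A deliberately crude sketch of that framing: compare the expected gain from a winning challenger against the cost of running the test. Every input here is a hypothetical estimate the team would have to supply, and a fuller treatment would model the whole distribution of effects rather than a single win probability.

```python
def expected_net_value(p_win, gain_if_win, test_cost):
    """Expected net value of running a test: P(win) * gain - cost."""
    return p_win * gain_if_win - test_cost

def worth_testing(p_win, gain_if_win, test_cost):
    return expected_net_value(p_win, gain_if_win, test_cost) > 0

# Hypothetical numbers: 20% chance the challenger wins, 30k cost to test.
small_upside = worth_testing(p_win=0.2, gain_if_win=100_000, test_cost=30_000)
large_upside = worth_testing(p_win=0.2, gain_if_win=500_000, test_cost=30_000)
print(small_upside, large_upside)
```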
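import numpy as np
import pandas as pd

# Inverse Propensity Scoring: estimate policy value from logged data
def ips_estimate(actions, contexts, rewards, propensities, new_policy):
    """
    actions:      array of taken actions
    contexts:     context observed when each action was taken
    rewards:      observed reward for each
    propensities: P(action | context) under the LOGGING policy (must be > 0)
    new_policy:   function (action, context) -> P(action | context) under the NEW policy
    """
    rewards = np.asarray(rewards, dtype=float)
    propensities = np.asarray(propensities, dtype=float)
    # Importance weight: how much more (or less) likely the new policy is
    # to take the logged action than the logging policy was.
    weights = np.array([
        new_policy(a, ctx) / p
        for a, ctx, p in zip(actions, contexts, propensities)
    ])
    return (weights * rewards).mean()

# Switchback for network-effect mitigation
def switchback_test(treatment_schedule, observations):
    """Treatment switches on/off in blocks; compare blocks of A vs B.

    observations:       one record per time block, each with a "metric" field
    treatment_schedule: assignment label per block, in the same order
    """
    df = pd.DataFrame(observations)
    df["assignment"] = list(treatment_schedule)
    # Mean metric per assignment; blocks, not users, are the unit of analysis.
    return df.groupby("assignment")["metric"].mean()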