Watch your deployed model — for drift, for degradation, for outright failure.
Key idea
The model that worked yesterday may not work today. Inputs drift. Behaviour changes. Edge cases pile up. Monitor four things: system health (uptime, latency), data drift (input distributions), prediction drift (output distributions), and performance (where ground truth is available). Each fails differently; each needs its own alerting.
The four monitoring planes. System (latency, throughput, errors, GPU util). Data (per-feature drift vs training). Prediction (output distribution drift). Performance (accuracy, calibration where you have feedback). System monitoring is "is the server up". The other three are "is the model still doing its job".
The hard part. You usually don't have ground truth at prediction time. You see the prediction, but the true answer arrives hours or days later, or never. Judging whether the live model is right becomes a guessing game. Workarounds: proxy metrics, A/B tests against a baseline, periodic labelled audits.
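Data drift: the inputs look different, but the relationship between inputs and the correct output is unchanged.
Concept drift: the inputs look the same, but the correct output for them has changed.
Detecting them requires different signals: input distributions for data drift; predictions and labelled performance for concept drift.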
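Data drift detectors. Compare incoming features against the training distribution. Use the Kolmogorov-Smirnov (KS) test for numerical features, the chi-squared test for categorical features, and the Population Stability Index (PSI) for binned numerical features. Run each on a rolling window; alert when the statistic exceeds its threshold. See Data Validation.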
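A minimal sketch of two of these detectors in plain Python, to show what is actually being computed. The function names, the equal-width binning, and the synthetic data are assumptions for illustration; in practice a library routine such as scipy.stats.ks_2samp would also return a p-value.

```python
import bisect
import math
import random

def ks_statistic(reference, current):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    ref, cur = sorted(reference), sorted(current)
    gap = 0.0
    for x in ref + cur:
        cdf_ref = bisect.bisect_right(ref, x) / len(ref)
        cdf_cur = bisect.bisect_right(cur, x) / len(cur)
        gap = max(gap, abs(cdf_ref - cdf_cur))
    return gap

def psi(reference, current, n_bins=10, eps=1e-6):
    """Population Stability Index over equal-width bins spanning the reference range."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins or 1.0

    def shares(values):
        counts = [0] * n_bins
        for v in values:
            idx = min(max(int((v - lo) / width), 0), n_bins - 1)  # clamp out-of-range values
            counts[idx] += 1
        return [max(c / len(values), eps) for c in counts]  # eps avoids log(0)

    ref_shares, cur_shares = shares(reference), shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_shares, cur_shares))

# Synthetic check: one window from the training distribution, one shifted by a standard deviation.
rng = random.Random(0)
train = [rng.gauss(0.0, 1.0) for _ in range(5000)]
same = [rng.gauss(0.0, 1.0) for _ in range(5000)]
shifted = [rng.gauss(1.0, 1.0) for _ in range(5000)]

ks_same, ks_shifted = ks_statistic(train, same), ks_statistic(train, shifted)
psi_same, psi_shifted = psi(train, same), psi(train, shifted)
```

PSI values of 0.1 and 0.25 are commonly quoted rule-of-thumb cut-offs for "watch" and "act"; treat them as starting points to tune, not as fixed thresholds.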
Prediction drift. The model's output distribution itself can drift even without obvious input drift — the model is making different decisions. Often the first signal of concept drift. Compare current prediction histograms against a baseline.
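A small sketch of comparing prediction histograms for a classifier. Total variation distance is used here as one possible distance; the source does not prescribe a specific one, and the label names and counts are made up.

```python
from collections import Counter

def label_shares(predictions):
    """Share of each predicted label in a batch of predictions."""
    counts = Counter(predictions)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

def total_variation(baseline_preds, current_preds):
    """Half the L1 distance between two label distributions (0 = identical, 1 = disjoint)."""
    base, cur = label_shares(baseline_preds), label_shares(current_preds)
    labels = set(base) | set(cur)
    return 0.5 * sum(abs(base.get(label, 0.0) - cur.get(label, 0.0)) for label in labels)

# Hypothetical batches: the reject rate doubles from 20% to 40%.
baseline = ["approve"] * 80 + ["reject"] * 20
current = ["approve"] * 60 + ["reject"] * 40
tv = total_variation(baseline, current)
```

For regression or score outputs, the same comparison can be run on binned scores instead of labels.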
Delayed feedback. Many real systems have labels weeks later (loan defaults, click-through, fraud). Build a delayed-evaluation pipeline that joins predictions with labels when they arrive; recompute metrics on the lookback window. A dashboard that's behind by 30 days is still useful.
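A minimal sketch of the join step, assuming predictions are logged under a request id and labels arrive later under the same id. The record layout, ids, and dates are hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical logged predictions, keyed by request id.
predictions = {
    "r1": {"ts": datetime(2024, 1, 1), "pred": 1},
    "r2": {"ts": datetime(2024, 1, 10), "pred": 0},
    "r3": {"ts": datetime(2024, 1, 20), "pred": 1},
    "r4": {"ts": datetime(2024, 1, 25), "pred": 0},
}
# Labels that have arrived so far; r4 has no label yet.
labels = {"r1": 1, "r2": 1, "r3": 1}

def delayed_accuracy(predictions, labels, now, lookback_days):
    """Accuracy over predictions made inside the lookback window that already have a label."""
    start = now - timedelta(days=lookback_days)
    joined = [
        (p["pred"], labels[rid])
        for rid, p in predictions.items()
        if start <= p["ts"] <= now and rid in labels
    ]
    if not joined:
        return None, 0
    correct = sum(pred == label for pred, label in joined)
    return correct / len(joined), len(joined)

acc, n_labelled = delayed_accuracy(
    predictions, labels, now=datetime(2024, 1, 31), lookback_days=30
)
```

Reporting the labelled count next to the metric matters: early in the window, few labels have arrived and the number is noisy.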
Proxy metrics. When no ground truth is available, fall back on signals that correlate with correctness: engagement metrics, click-through, user feedback, downstream conversion. Calibrate proxies against periodic labelled audits.
Tools. Evidently, NannyML, whylogs, Arize, Fiddler. The first three are open source; the last two are commercial platforms with hosted dashboards. Most do data and prediction drift; Arize and Fiddler add subgroup analysis and explainability.
Alert hygiene. Tune thresholds so alerts fire rarely but reliably. Group related signals. Have a runbook ("if drift on feature X, do Y"). Test alerts periodically. False alarms train the team to ignore alerts.
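One way to make alerts fire rarely but reliably is to require several breaches within a recent window before paging. The sketch below is an assumed design, not a prescribed one; the class name, the threshold, and the runbook entry are hypothetical.

```python
from collections import deque

class DebouncedAlert:
    """Fire only when the metric breached the threshold in `needed` of the last `window` checks."""

    def __init__(self, threshold, needed=3, window=5):
        self.threshold = threshold
        self.needed = needed
        self.recent = deque(maxlen=window)  # oldest check drops out automatically

    def observe(self, value):
        self.recent.append(value > self.threshold)
        return sum(self.recent) >= self.needed

# Hypothetical runbook: signal name -> first action to take.
RUNBOOK = {"psi:income": "Check the upstream income feed for unit or schema changes"}

alert = DebouncedAlert(threshold=0.25, needed=3, window=5)
fired = [alert.observe(v) for v in [0.1, 0.4, 0.1, 0.3, 0.1, 0.3, 0.5]]
```

A single spike (the 0.4 above) does not page anyone; repeated breaches do.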
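from datetime import date

import pandas as pd
from evidently.report import Report
# Presets live in evidently.metric_preset (Report API as of Evidently 0.4.x)
from evidently.metric_preset import DataDriftPreset, TargetDriftPreset

# Assumed inputs: df_train (reference) and df_today (current) are pandas DataFrames
# with the same columns.
today = date.today().isoformat()

# Daily drift report
report = Report(metrics=[DataDriftPreset(), TargetDriftPreset()])
report.run(reference_data=df_train, current_data=df_today)
report.save_html(f"drift_{today}.html")

# Programmatic: fetch the drift result, alert if needed
result = report.as_dict()
# The first metric in DataDriftPreset is the dataset-level summary;
# "dataset_drift" is a boolean flag, not a share.
dataset_drift = result["metrics"][0]["result"]["dataset_drift"]
if dataset_drift:
    # alert_slack is your own notification helper, not part of Evidently
    alert_slack("Data drift detected in production model")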
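Performance estimation without labels. Confidence-based performance estimation (CBPE, as implemented in NannyML) uses the model's own calibrated probabilities to predict its error. It is surprisingly accurate when calibration holds.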
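The idea of predicting error from calibrated probabilities can be sketched for a binary classifier. This is an assumed minimal version of confidence-based estimation, not NannyML's implementation: if p is a calibrated P(y = 1), a prediction of class 1 is right with probability p and a prediction of class 0 with probability 1 - p, so expected accuracy needs no labels.

```python
import random

def estimated_accuracy(probs, threshold=0.5):
    """Expected accuracy without labels, assuming each p is a calibrated P(y = 1)."""
    return sum(p if p >= threshold else 1.0 - p for p in probs) / len(probs)

# Simulate a perfectly calibrated model to compare the estimate with realised accuracy.
rng = random.Random(0)
probs = [rng.random() for _ in range(50_000)]
labels = [1 if rng.random() < p else 0 for p in probs]  # labels drawn with P(y = 1) = p
preds = [1 if p >= 0.5 else 0 for p in probs]

estimate = estimated_accuracy(probs)
realised = sum(int(a == b) for a, b in zip(preds, labels)) / len(labels)
```

The estimate tracks realised accuracy only while calibration holds; if the relationship between inputs and labels changes, the probabilities stop being trustworthy and so does the estimate.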
Shadow models. Run a candidate model alongside production; log both predictions. Compare their distributions and performance (when labels arrive). Identifies regressions before promotion. Standard for high-stakes domains.
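A minimal sketch of summarising a shadow log, assuming each record holds the production prediction, the candidate prediction, and the label if it has arrived (None otherwise). The record layout and values are hypothetical.

```python
def shadow_report(records):
    """Summarise (production_pred, candidate_pred, label_or_None) triples."""
    disagreements = sum(prod != cand for prod, cand, _ in records)
    labelled = [(prod, cand, y) for prod, cand, y in records if y is not None]
    report = {
        "disagreement_rate": disagreements / len(records),
        "n_labelled": len(labelled),
    }
    if labelled:
        # Accuracy is only computed on rows whose labels have arrived.
        report["production_accuracy"] = sum(p == y for p, _, y in labelled) / len(labelled)
        report["candidate_accuracy"] = sum(c == y for _, c, y in labelled) / len(labelled)
    return report

log = [(1, 1, 1), (0, 1, 1), (1, 1, None), (0, 0, 0), (1, 0, 1)]
summary = shadow_report(log)
```

The disagreement rate is available immediately; the accuracy comparison fills in as labels arrive.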
Automatic retraining triggers. When drift alarms fire, kick off a retraining job. Conservative: alarm only, human decides. Aggressive: auto-retrain + auto-promote (with promotion gates). See Automated Retraining.
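The two policies can be sketched as a single handler. Everything here is hypothetical scaffolding: `retrain` and `evaluate` stand in for your own training job and held-out evaluation, and the promotion gate is a simple score comparison.

```python
def on_drift_alarm(policy, retrain, evaluate, production_score, min_gain=0.0):
    """Handle a drift alarm under a 'conservative' or 'aggressive' policy."""
    if policy == "conservative":
        return "human_review"  # alarm only; no retraining is started
    candidate = retrain()
    # Promotion gate: the candidate must match or beat production on held-out data.
    if evaluate(candidate) >= production_score + min_gain:
        return "promoted"
    return "rejected"

calls = []

def fake_retrain():
    calls.append("retrain")
    return "model-v2"

conservative = on_drift_alarm("conservative", fake_retrain, lambda m: 0.95, production_score=0.90)
promoted = on_drift_alarm("aggressive", fake_retrain, lambda m: 0.91, production_score=0.90)
rejected = on_drift_alarm("aggressive", fake_retrain, lambda m: 0.85, production_score=0.90)
```

Keeping the gate separate from the trigger means a noisy drift alarm can waste a training run but cannot push a worse model to production.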
Per-feature drift attribution. When overall drift fires, which features are responsible? Greedy: sort features by individual KS statistic. Better: SHAP-on-drift — the contribution of each feature to the model's output shift.
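The simple ranking approach can be sketched as follows, assuming features are held as lists keyed by name. The feature names and synthetic data are made up, and the SHAP-based variant is not shown.

```python
import bisect
import random

def ks_statistic(a, b):
    """Largest gap between the empirical CDFs of two samples."""
    a, b = sorted(a), sorted(b)
    return max(
        abs(bisect.bisect_right(a, x) / len(a) - bisect.bisect_right(b, x) / len(b))
        for x in a + b
    )

def rank_drifted_features(reference, current):
    """Sort feature names by their individual KS statistic, largest first."""
    scores = {name: ks_statistic(reference[name], current[name]) for name in reference}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

rng = random.Random(1)
reference = {
    "age": [rng.gauss(40, 10) for _ in range(2000)],
    "income": [rng.gauss(50, 15) for _ in range(2000)],
}
current = {
    "age": [rng.gauss(40, 10) for _ in range(2000)],     # unchanged
    "income": [rng.gauss(65, 15) for _ in range(2000)],  # shifted by one standard deviation
}
ranking = rank_drifted_features(reference, current)
```

This ranks features by how much their own distribution moved, which is not the same as how much they moved the model's output; a heavily drifted feature the model barely uses will still rank first.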
Calibration drift. A model can have stable accuracy but drift in its predicted probabilities — confidence rises or falls relative to truth. Track per-bin observed accuracy vs predicted probability over time.
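A minimal sketch of per-bin tracking for a binary classifier, assuming the model outputs P(y = 1). The bin count and the two synthetic weeks are illustrative; the summary number is the usual expected calibration error.

```python
def reliability_table(probs, labels, n_bins=5):
    """Per-bin (mean predicted probability, observed positive rate, count)."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    return [
        (sum(p for p, _ in b) / len(b), sum(y for _, y in b) / len(b), len(b))
        for b in bins
        if b
    ]

def expected_calibration_error(probs, labels, n_bins=5):
    """Count-weighted mean gap between predicted probability and observed rate."""
    table = reliability_table(probs, labels, n_bins)
    return sum(n * abs(mean_p - rate) for mean_p, rate, n in table) / len(probs)

# Week 1: the model says 0.9 and is right 9 times in 10.
ece_week1 = expected_calibration_error([0.9] * 10, [1] * 9 + [0])
# Week 2: the model still says 0.9 but only 6 in 10 are positive.
ece_week2 = expected_calibration_error([0.9] * 10, [1] * 6 + [0] * 4)
```

Plotting this number per week shows confidence drifting away from reality before a thresholded accuracy metric necessarily moves.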
Subgroup monitoring. Aggregate metrics hide subgroup harm. Slice by demographic, geography, customer segment. Alert separately. See Fairness.
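A short sketch of slicing a metric by group and alerting per slice. The group names, the accuracy floor, and the minimum slice size are assumptions chosen to show an aggregate that looks healthy while one slice does not.

```python
from collections import defaultdict

def accuracy_by_group(records):
    """Map each group to (accuracy, row count) from (group, prediction, label) records."""
    hits, totals = defaultdict(int), defaultdict(int)
    for group, pred, label in records:
        totals[group] += 1
        hits[group] += int(pred == label)
    return {group: (hits[group] / totals[group], totals[group]) for group in totals}

def groups_to_alert(records, floor, min_count=3):
    """Groups with accuracy below `floor`; slices under `min_count` rows are skipped as too noisy."""
    return sorted(
        group
        for group, (acc, n) in accuracy_by_group(records).items()
        if n >= min_count and acc < floor
    )

records = [("north", 1, 1)] * 8 + [("south", 1, 0)] * 3 + [("south", 1, 1)]
overall = sum(pred == label for _, pred, label in records) / len(records)
flagged = groups_to_alert(records, floor=0.7)
```

Here the overall accuracy of 0.75 clears the floor, while the "south" slice sits at 0.25 and is flagged on its own.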
Logging infrastructure. Production ML monitoring is essentially "structured logging at scale + dashboards". Kafka for stream ingestion, ClickHouse / DuckDB / Snowflake for analytics, Grafana for dashboards, Alertmanager / PagerDuty for paging.
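What "structured logging" means at the level of a single prediction can be sketched as one JSON line per request. The field names and values are hypothetical; the point is that every downstream check (drift, delayed joins, cost) reads from records like this.

```python
import json
from datetime import datetime, timezone

def prediction_log_line(request_id, model_version, features, prediction, latency_ms):
    """Serialise one prediction as a single JSON line for a stream or log file."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "request_id": request_id,        # join key for labels that arrive later
        "model_version": model_version,  # needed to separate models in dashboards
        "features": features,
        "prediction": prediction,
        "latency_ms": latency_ms,
    }
    return json.dumps(record, sort_keys=True)

line = prediction_log_line(
    "r-123", "fraud-v7", {"amount": 42.5, "country": "DE"}, 0.13, 18.4
)
```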
Cost monitoring. Inference is expensive — track $/request, GPU utilisation, request mix. Many shops over-provision GPUs because the costs are invisible at the model-developer level.
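The arithmetic behind $/request is simple enough to compute from numbers most teams already have. The prices and traffic below are made up for illustration.

```python
def cost_per_request(gpu_hourly_usd, n_gpus, requests_per_hour):
    """Dollars per request for a fleet billed by the GPU-hour."""
    return gpu_hourly_usd * n_gpus / requests_per_hour

# Hypothetical fleet: 4 GPUs at $2.50/hour serving 36,000 requests/hour.
provisioned = cost_per_request(2.50, n_gpus=4, requests_per_hour=36_000)
# The same traffic on a right-sized fleet of 1 GPU.
right_sized = cost_per_request(2.50, n_gpus=1, requests_per_hour=36_000)
overspend_factor = provisioned / right_sized
```

In this made-up case the over-provisioned fleet costs four times as much per request, which is the kind of gap that stays invisible until $/request is on a dashboard.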