Monitoring — ML Resources Hub

Key idea

The model that worked yesterday may not work today. Inputs drift. Behaviour changes. Edge cases pile up. Monitor four things: system health (uptime, latency), data drift (input distributions), prediction drift (output distributions), and performance (where ground truth is available). Each fails differently; each needs its own alerting.

The four monitoring planes. System (latency, throughput, errors, GPU util). Data (per-feature drift vs training). Prediction (output distribution drift). Performance (accuracy, calibration where you have feedback). System monitoring is "is the server up". The other three are "is the model still doing its job".

The hard part. You usually don't have ground truth at deployment time. You see the prediction immediately, but the true answer arrives hours or days later, or never. Until it does, judging model quality is guesswork. Workarounds: proxy metrics, A/B tests against a baseline, periodic labelled audits.

What to alert on

Server: 5xx rate, p99 latency, queue depth
Data: population stability index (PSI) > 0.25, rising missing-value rate, schema violations
Predictions: positive-rate shift, score distribution shift
Performance: aggregate metric drop where labels available
Subgroups: per-segment regression beyond threshold
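
To make the PSI threshold in the list above concrete, here is a minimal standard-library sketch. The function name `psi`, the equal-width binning over the reference range, and the `eps` smoothing are assumptions made for illustration; quantile bins are an equally common choice, and the 0.25 cut-off is the rule of thumb quoted above rather than a universal constant.

```python
import math
import random

def psi(reference, current, n_bins=10, eps=1e-6):
    """Population stability index of `current` against `reference`.

    Bin edges are equal-width over the reference range (an assumption).
    Values outside that range fall into the first or last bin.
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins or 1.0

    def shares(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1
        # eps keeps empty bins from producing log(0) or a division by zero
        return [max(c / len(values), eps) for c in counts]

    ref, cur = shares(reference), shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))

# Synthetic data: one window from the training distribution, one shifted by 1 sigma
rng = random.Random(0)
train = [rng.gauss(0.0, 1.0) for _ in range(5000)]
same = [rng.gauss(0.0, 1.0) for _ in range(5000)]
shifted = [rng.gauss(1.0, 1.0) for _ in range(5000)]

print(f"no shift:      {psi(train, same):.3f}")
print(f"1-sigma shift: {psi(train, shifted):.3f}")
```

The unshifted window should land well below the alert threshold and the shifted one well above it.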

Common mistakes

Only system-level monitoring — the API is "up" but predicting garbage
Alerting on noise — too many false positives, alerts get ignored
No baseline — drift comparisons need a reference distribution
Aggregate-only metrics — subgroup regressions stay hidden

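import time

from fastapi import FastAPI
from prometheus_client import Counter, Histogram, start_http_server
from pydantic import BaseModel

app = FastAPI()  # assumed framework; the @app.post decorator suggests FastAPI

class PredictRequest(BaseModel):
    x: list[float]  # assumed input schema

# Standard server-side metrics
REQUESTS = Counter("preds_total", "Total predictions", ["model_version", "status"])
LATENCY  = Histogram("pred_latency_seconds", "Inference latency",
                    buckets=[.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5])

@app.post("/predict")
def predict(req: PredictRequest):
    t0 = time.perf_counter()
    try:
        score = MODEL(req.x)  # MODEL: your loaded model object, with a .version attribute
        REQUESTS.labels(model_version=MODEL.version, status="ok").inc()
        return {"score": score}
    except Exception:
        REQUESTS.labels(model_version=MODEL.version, status="error").inc()
        raise
    finally:
        LATENCY.observe(time.perf_counter() - t0)  # recorded for successes and failures

start_http_server(9000)        # /metrics endpoint for Prometheus, on its own port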


Two ways "broken" looks

$$ \underbrace{P(X) \;\text{changes}}_{\text{data drift}} \quad \text{or} \quad \underbrace{P(Y \mid X) \;\text{changes}}_{\text{concept drift}} $$

Data drift: inputs look different but the relationship is the same
Concept drift: same inputs, different optimal answer
Detecting them requires different signals

Data drift detectors. Compare incoming features against the training distribution: the Kolmogorov-Smirnov (KS) test for numerical features, chi-squared for categorical, PSI for binned numerical. Run each on a rolling window and alert when the statistic exceeds its threshold. See Data Validation.
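
The two test statistics named above are short enough to write out. This is a standard-library sketch of the statistics only; the function names and the 0.5 pseudo-count for categories unseen in the reference data are assumptions for illustration.

```python
from bisect import bisect_right
from collections import Counter

def ks_statistic(reference, current):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the empirical CDFs."""
    ref, cur = sorted(reference), sorted(current)
    gap = 0.0
    for v in set(ref) | set(cur):
        f_ref = bisect_right(ref, v) / len(ref)
        f_cur = bisect_right(cur, v) / len(cur)
        gap = max(gap, abs(f_ref - f_cur))
    return gap

def chi2_statistic(reference, current):
    """Pearson chi-squared statistic of the current category counts against
    the counts expected from the reference proportions."""
    ref_counts, cur_counts = Counter(reference), Counter(current)
    n_ref, n_cur = len(reference), len(current)
    stat = 0.0
    for cat in set(ref_counts) | set(cur_counts):
        # Categories unseen in the reference get a pseudo-count of 0.5 (assumption)
        expected = n_cur * max(ref_counts[cat], 0.5) / n_ref
        stat += (cur_counts[cat] - expected) ** 2 / expected
    return stat

# Numerical feature: identical windows versus completely separated windows
print(ks_statistic([1, 2, 3, 4], [1, 2, 3, 4]))
print(ks_statistic([1, 2, 3, 4], [5, 6, 7, 8]))

# Categorical feature: a 50/50 split in training becomes 80/20 in production
train_cat = ["a"] * 50 + ["b"] * 50
prod_cat = ["a"] * 80 + ["b"] * 20
print(chi2_statistic(train_cat, prod_cat))
```

In practice `scipy.stats.ks_2samp` and `scipy.stats.chisquare` also return p-values, which are usually easier to threshold than the raw statistics.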

Prediction drift. The model's output distribution itself can drift even without obvious input drift — the model is making different decisions. Often the first signal of concept drift. Compare current prediction histograms against a baseline.
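
One way to compare prediction histograms is the Jensen-Shannon divergence, sketched below with the standard library. The choice of divergence, the helper names, and the assumption that scores lie in [0, 1] are illustrative; any histogram distance with a stable scale would serve.

```python
import math

def histogram(scores, n_bins=10):
    """Share of scores in each equal-width bin over [0, 1]."""
    counts = [0] * n_bins
    for s in scores:
        counts[min(int(s * n_bins), n_bins - 1)] += 1
    return [c / len(scores) for c in counts]

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical histograms, 1 for disjoint ones."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical scores: the baseline is mostly low, today's are mostly high
baseline = histogram([0.10, 0.15, 0.20, 0.30, 0.35, 0.80])
today = histogram([0.60, 0.70, 0.75, 0.80, 0.85, 0.90])

print(f"baseline vs itself: {js_divergence(baseline, baseline):.3f}")
print(f"baseline vs today:  {js_divergence(baseline, today):.3f}")
```

Because the value is bounded between 0 and 1, one threshold can be reused across models with different score ranges once the scores are rescaled.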

Delayed feedback. Many real systems get their labels late: hours for click-through, weeks or months for fraud and loan defaults. Build a delayed-evaluation pipeline that joins predictions with labels when they arrive, then recompute metrics on the lookback window. A dashboard that's behind by 30 days is still useful.
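
The join at the heart of a delayed-evaluation pipeline can be sketched in a few lines. The record layout, the function name `delayed_accuracy`, and the in-memory structures are assumptions for illustration; a real pipeline would run the same join in a warehouse or stream processor.

```python
from datetime import date, timedelta

def delayed_accuracy(predictions, labels, as_of, lookback_days=30):
    """Accuracy over predictions in the lookback window whose labels have arrived.

    predictions: list of (prediction_id, made_on, predicted_label)
    labels: dict of prediction_id -> true_label, filled in as labels arrive
    Returns (accuracy or None, coverage), where coverage is the labelled share.
    """
    start = as_of - timedelta(days=lookback_days)
    window = [p for p in predictions if start <= p[1] <= as_of]
    joined = [(pred, labels[pid]) for pid, _, pred in window if pid in labels]
    if not joined:
        return None, 0.0
    accuracy = sum(pred == truth for pred, truth in joined) / len(joined)
    return accuracy, len(joined) / len(window)

as_of = date(2024, 3, 31)
predictions = [
    ("p1", date(2024, 3, 5), 1),
    ("p2", date(2024, 3, 10), 0),
    ("p3", date(2024, 3, 20), 1),
    ("p4", date(2024, 3, 30), 0),   # label not in yet
    ("p0", date(2024, 1, 15), 1),   # outside the 30-day window
]
labels = {"p0": 1, "p1": 1, "p2": 0, "p3": 0}

accuracy, coverage = delayed_accuracy(predictions, labels, as_of)
print(accuracy, coverage)
```

Reporting coverage next to the metric matters: an accuracy computed on a quarter of the window's predictions deserves less trust than one computed on nearly all of them.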

Proxy metrics. When no ground truth is available, track proxies instead: engagement metrics, click-through, user feedback, downstream conversion. Calibrate proxies against periodic labelled audits.

Tools. Evidently, NannyML, whylogs, Arize, Fiddler. The first three are open source; the last two are commercial platforms. Most cover data and prediction drift; Arize and Fiddler add subgroup analysis and explainability.

Alert hygiene. Tune thresholds so alerts fire rarely but reliably. Group related signals. Have a runbook ("if drift on feature X, do Y"). Test alerts periodically. False alarms train the team to ignore alerts.
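
One simple way to make alerts fire "rarely but reliably" is to require several consecutive breaches before paging anyone. The class below is a sketch of that idea; `DebouncedAlert`, its `patience` parameter, and the sample PSI values are hypothetical.

```python
class DebouncedAlert:
    """Fires once when a metric breaches its threshold on `patience` consecutive checks."""

    def __init__(self, threshold, patience=3):
        self.threshold = threshold
        self.patience = patience
        self.streak = 0

    def check(self, value):
        """Return True only on the check that completes the streak."""
        if value > self.threshold:
            self.streak += 1
        else:
            self.streak = 0
        return self.streak == self.patience

psi_alert = DebouncedAlert(threshold=0.25, patience=3)
daily_psi = [0.30, 0.10, 0.27, 0.31, 0.40, 0.45]
fired = [psi_alert.check(v) for v in daily_psi]
print(fired)
```

The single spike on the first day is ignored, the alert fires on the third consecutive breach, and it stays quiet afterwards instead of paging again every day.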

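from datetime import date

import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset, TargetDriftPreset

# Written against the older evidently.report API; newer releases reorganise these imports.
# df_train / df_today: reference and current pandas DataFrames with the same columns.
today = date.today().isoformat()

# Daily drift report
report = Report(metrics=[DataDriftPreset(), TargetDriftPreset()])
report.run(reference_data=df_train, current_data=df_today)
report.save_html(f"drift_{today}.html")

# Programmatic: fetch the dataset-level drift flag, alert if it is set
result = report.as_dict()
dataset_drift = result["metrics"][0]["result"]["dataset_drift"]  # boolean flag
if dataset_drift:
    alert_slack("Data drift detected in production model")  # alert_slack: your own notifier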


Performance estimation without labels

$$ \text{CBPE: } \hat\mu = \frac{1}{n}\sum_i \mathbb{E}_{p(y \mid \hat y_i, x_i)}\!\big[\,\ell(y, \hat y_i)\big] $$

NannyML's "Confidence-Based Performance Estimation"
Uses the model's own calibrated probabilities to predict its error
Surprisingly accurate when calibration holds
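
For binary classification with accuracy as the metric, the expectation above reduces to simple arithmetic: a prediction of 1 is correct with probability p, a prediction of 0 with probability 1 - p. The sketch below illustrates that idea only; it is not NannyML's implementation, and the function name and sample probabilities are made up.

```python
def estimated_accuracy(y_proba, y_pred):
    """Expected accuracy, assuming y_proba are calibrated P(y = 1 | x)."""
    per_row = [p if pred == 1 else 1.0 - p for p, pred in zip(y_proba, y_pred)]
    return sum(per_row) / len(per_row)

y_proba = [0.9, 0.8, 0.6, 0.3, 0.1]
y_pred = [1 if p >= 0.5 else 0 for p in y_proba]

# Per-row chance of being right: 0.9, 0.8, 0.6, 0.7, 0.9
print(round(estimated_accuracy(y_proba, y_pred), 2))
```

If the probabilities are miscalibrated, for example systematically overconfident, the estimate inherits that bias, which is why the method depends on calibration holding.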

Shadow models. Run a candidate model alongside production; log both predictions. Compare their distributions and performance (when labels arrive). Identifies regressions before promotion. Standard for high-stakes domains.
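
A shadow comparison needs little more than both scores logged per request. The sketch below assumes a hypothetical log record with `prod_score`, `shadow_score`, and an optional `label` that is filled in later; the field names and the 0.5 decision threshold are illustrative.

```python
def shadow_report(records, threshold=0.5):
    """Disagreement rate on all traffic, plus accuracy of each model on labelled rows."""
    disagreements = sum(
        (r["prod_score"] >= threshold) != (r["shadow_score"] >= threshold)
        for r in records
    )
    labelled = [r for r in records if r.get("label") is not None]

    def accuracy(key):
        if not labelled:
            return None
        hits = sum((r[key] >= threshold) == bool(r["label"]) for r in labelled)
        return hits / len(labelled)

    return {
        "disagreement_rate": disagreements / len(records),
        "prod_accuracy": accuracy("prod_score"),
        "shadow_accuracy": accuracy("shadow_score"),
    }

records = [
    {"prod_score": 0.9, "shadow_score": 0.8, "label": 1},
    {"prod_score": 0.4, "shadow_score": 0.6, "label": 1},
    {"prod_score": 0.2, "shadow_score": 0.1, "label": 0},
    {"prod_score": 0.7, "shadow_score": 0.3, "label": None},  # label not in yet
]
report = shadow_report(records)
print(report)
```

The disagreement rate is available immediately, while the accuracy comparison fills in only as labels arrive.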

Automatic retraining triggers. When drift alarms fire, kick off a retraining job. Conservative: alarm only, human decides. Aggressive: auto-retrain + auto-promote (with promotion gates). See Automated Retraining.
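
The two policies differ only in what happens after the alarm, which a small sketch makes explicit. The function names, action strings, and the `min_gain` gate are assumptions for illustration, not a prescribed interface.

```python
def on_drift_alarm(mode):
    """Map a drift alarm to an action under the chosen policy."""
    if mode == "conservative":
        return "notify_human"      # alarm only, a person decides
    if mode == "aggressive":
        return "start_retraining"  # promotion still has to pass the gate below
    raise ValueError(f"unknown mode: {mode}")

def passes_promotion_gate(candidate_metric, production_metric, min_gain=0.0):
    """Promote only if the retrained model beats production on held-out data."""
    return candidate_metric - production_metric >= min_gain

print(on_drift_alarm("conservative"))
print(on_drift_alarm("aggressive"))
print(passes_promotion_gate(candidate_metric=0.91, production_metric=0.90, min_gain=0.005))
```

Keeping the gate as a separate function means the aggressive path can never promote a model that was retrained but did not improve.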

Per-feature drift attribution. When overall drift fires, which features are responsible? Greedy: sort features by individual KS statistic. Better: SHAP-on-drift — the contribution of each feature to the model's output shift.

Calibration drift. A model can have stable accuracy but drift in its predicted probabilities — confidence rises or falls relative to truth. Track per-bin observed accuracy vs predicted probability over time.
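
Per-bin tracking can be sketched as a reliability table plus a single summary number, the expected calibration error (ECE). The sketch assumes binary labels and positive-class probabilities, so "observed accuracy" becomes the observed positive rate per bin; the helper names and the sample weeks are hypothetical.

```python
def reliability_table(y_proba, y_true, n_bins=5):
    """Per non-empty bin: (mean predicted probability, observed positive rate, count)."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(y_proba, y_true):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    table = []
    for rows in bins:
        if rows:
            mean_p = sum(p for p, _ in rows) / len(rows)
            rate = sum(y for _, y in rows) / len(rows)
            table.append((mean_p, rate, len(rows)))
    return table

def expected_calibration_error(y_proba, y_true, n_bins=5):
    """Count-weighted average gap between predicted and observed rates."""
    n = len(y_proba)
    table = reliability_table(y_proba, y_true, n_bins)
    return sum(count / n * abs(mean_p - rate) for mean_p, rate, count in table)

# The model says 0.9 every time; what changes is how often it is right
y_proba = [0.9] * 10
last_week = [1] * 9 + [0] * 1   # 90% positives: calibrated
this_week = [1] * 6 + [0] * 4   # 60% positives: overconfident

print(round(expected_calibration_error(y_proba, last_week), 3))
print(round(expected_calibration_error(y_proba, this_week), 3))
```

Plotting the ECE per week gives a single calibration-drift series to alert on, with the per-bin table available for diagnosis.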

Subgroup monitoring. Aggregate metrics hide subgroup harm. Slice by demographic, geography, customer segment. Alert separately. See Fairness.
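
A minimal sketch of per-segment alerting, assuming each logged row carries a segment key alongside the prediction and label. The function names, the 0.05 maximum allowed drop, and the sample data are hypothetical.

```python
from collections import defaultdict

def accuracy_by_segment(rows):
    """rows: iterable of (segment, y_pred, y_true)."""
    hits, totals = defaultdict(int), defaultdict(int)
    for segment, y_pred, y_true in rows:
        totals[segment] += 1
        hits[segment] += int(y_pred == y_true)
    return {s: hits[s] / totals[s] for s in totals}

def regressed_segments(baseline, current, max_drop=0.05):
    """Segments whose accuracy fell by more than max_drop against the baseline."""
    return sorted(
        s for s in current
        if s in baseline and baseline[s] - current[s] > max_drop
    )

baseline = {"EU": 0.9, "US": 0.9}
rows = (
    [("EU", 1, 1)] * 9 + [("EU", 1, 0)] * 1    # EU: 9 of 10 correct
    + [("US", 1, 1)] * 6 + [("US", 1, 0)] * 4  # US: 6 of 10 correct
)
current = accuracy_by_segment(rows)
print(current)
print(regressed_segments(baseline, current))
```

Here the aggregate accuracy is 0.75, a moderate drop, while the per-segment view shows that the whole regression sits in one segment.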

Logging infrastructure. Production ML monitoring is essentially "structured logging at scale + dashboards". Kafka for stream ingestion, ClickHouse / DuckDB / Snowflake for analytics, Grafana for dashboards, Alertmanager / PagerDuty for paging.

Cost monitoring. Inference is expensive — track $/request, GPU utilisation, request mix. Many shops over-provision GPUs because the costs are invisible at the model-developer level.
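
The $/request figure is simple arithmetic once GPU price and traffic are known. The prices and request rates below are made-up numbers for illustration, and the function name is hypothetical.

```python
def cost_per_1k_requests(gpu_hourly_usd, n_gpus, requests_per_second):
    """Serving cost in USD per 1,000 requests for an always-on GPU fleet."""
    requests_per_hour = requests_per_second * 3600
    return 1000 * gpu_hourly_usd * n_gpus / requests_per_hour

# Same fleet, two traffic levels: the fleet costs the same per hour either way
busy = cost_per_1k_requests(gpu_hourly_usd=2.50, n_gpus=4, requests_per_second=50)
quiet = cost_per_1k_requests(gpu_hourly_usd=2.50, n_gpus=4, requests_per_second=5)
print(f"busy:  ${busy:.4f} per 1k requests")
print(f"quiet: ${quiet:.4f} per 1k requests")
```

A tenfold drop in traffic makes each request ten times more expensive on a fixed fleet, which is the over-provisioning cost that stays invisible unless it is tracked.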

import nannyml as nml

# df_reference: labelled reference period; df_production: unlabelled production data.
# Performance estimation without labels (CBPE): needs calibrated probabilities
estimator = nml.CBPE(
    y_pred_proba="y_proba",
    y_pred="y_pred",
    y_true="y_true",
    timestamp_column_name="ts",
    metrics=["accuracy", "roc_auc"],
    chunk_period="W",
    problem_type="classification_binary",  # required; binary classification assumed here
)
estimator.fit(reference_data=df_reference)
results = estimator.estimate(df_production)
results.plot().show()

# Univariate drift by feature
drift = nml.UnivariateDriftCalculator(
    column_names=feature_cols,
    timestamp_column_name="ts",  # needed for time-based chunking
    chunk_period="D",
).fit(reference_data=df_reference).calculate(df_production)
