Key idea

"Average error" hides outliers; "relative error" hides zeros. Each regression metric makes a different bargain with reality. MSE punishes big mistakes; MAE treats all of them equally; R² compares your model to "always predict the mean"; MAPE answers "what fraction off are we, on average" — but blows up when the truth is zero.

[Interactive figure: drag the outlier and watch MSE explode while MAE shrugs. The model line is an OLS fit to all the points.]

Drag the orange outlier away from the trend and watch the metric strip below: MSE rises quadratically, MAE rises linearly, and R² drops sharply because the variance the model has to explain just grew. A single point can move MSE by an order of magnitude, which is why MSE-trained models often "chase" outliers.
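
The experiment above can be approximated offline. This is a minimal sketch, assuming synthetic data on a noisy line and an ordinary least-squares fit via `np.polyfit`; the data, seed, and the helper name `metrics_with_outlier` are illustrative and not taken from the original widget.

```python
import numpy as np

def metrics_with_outlier(offset, n=20, seed=0):
    """Fit OLS to a noisy line whose last point is shifted by `offset`,
    then return (MSE, MAE, R^2) of that fit on the same points."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 10.0, n)
    y = 2.0 * x + 1.0 + rng.normal(0.0, 0.5, size=n)
    y[-1] += offset                      # the "dragged" outlier
    slope, intercept = np.polyfit(x, y, deg=1)
    resid = y - (slope * x + intercept)
    mse = np.mean(resid ** 2)
    mae = np.mean(np.abs(resid))
    r2 = 1.0 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)
    return mse, mae, r2

for offset in (0.0, 10.0, 40.0):
    mse, mae, r2 = metrics_with_outlier(offset)
    print(f"offset={offset:5.1f}  MSE={mse:8.2f}  MAE={mae:6.2f}  R2={r2:6.3f}")
```

As the offset grows, MSE should grow much faster than MAE, and R² should fall.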

MSE (mean squared error). Bayes-optimal if errors are Gaussian; punishes large errors quadratically. Same units as y², so usually report √MSE (RMSE) for interpretability.

MAE (mean absolute error). Same units as y. Robust to outliers — corresponds to predicting the median rather than the mean.
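
The mean/median correspondence can be checked numerically. The sketch below assumes a small made-up sample and searches a grid of constant predictions; it is an illustration, not a proof.

```python
import numpy as np

# Skewed toy sample with one large value (illustrative data, odd length
# so that the median is a single point).
y = np.array([1.0, 2.0, 2.0, 3.0, 50.0])

# Evaluate every constant prediction c on a fine grid.
grid = np.linspace(0.0, 60.0, 6001)
mse = ((y[None, :] - grid[:, None]) ** 2).mean(axis=1)
mae = np.abs(y[None, :] - grid[:, None]).mean(axis=1)

best_mse = grid[mse.argmin()]   # lands next to the sample mean
best_mae = grid[mae.argmin()]   # lands next to the sample median

print(best_mse, y.mean())
print(best_mae, np.median(y))
```

The single large value pulls the MSE-optimal constant far from the bulk of the data, while the MAE-optimal constant stays with it.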

R² (coefficient of determination). 1 − SSE/SST. The fraction of variance the model explains relative to "always predict the mean". 1 = perfect, 0 = no better than the mean, negative = worse than the mean. Pseudo-R² for nonlinear models is similar in spirit, but it is not a literal fraction of variance explained.
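
A minimal sketch of the 1 − SSE/SST definition, using a hypothetical helper `r2_manual` and made-up numbers to show the three regimes (perfect, mean-level, worse than the mean):

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """R^2 = 1 - SSE/SST, with SST measured around the mean of y_true."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    sse = np.sum((y_true - y_pred) ** 2)
    sst = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - sse / sst

y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(r2_manual(y, y))                            # perfect predictions
print(r2_manual(y, np.full_like(y, y.mean())))    # always predict the mean
print(r2_manual(y, y[::-1]))                      # reversed: worse than the mean
```

Note that SST is zero for a constant target, in which case this helper divides by zero; a production version would need to handle that case.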

MAPE (mean absolute percentage error). Scale-free, easy to communicate ("our forecast is 12% off on average"). Useless when y can be zero or near-zero.
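
To see the near-zero problem concretely, the sketch below uses a hand-rolled `mape` helper (an assumption for illustration, not a library function) and made-up targets. Every prediction is off by exactly 1 unit in both cases; only one target changes.

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error, as a fraction (0.12 means 12%)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs((y_true - y_pred) / y_true))

safe = np.array([100.0, 50.0, 10.0])
risky = np.array([100.0, 50.0, 0.001])   # one target is nearly zero

print(mape(safe, safe + 1.0))     # small
print(mape(risky, risky + 1.0))   # dominated by the near-zero target
```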

Pinball loss / quantile loss. The right metric for quantile regression — measures how well you predicted the τ-th quantile, not the mean.
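
As a rough check that pinball loss targets a quantile, the sketch below searches constant predictions on a grid and compares the winner with the empirical quantile. The exponential sample, the seed, and τ = 0.9 are arbitrary choices for illustration.

```python
import numpy as np

def pinball(y_true, y_pred, tau):
    """Mean pinball loss at quantile level tau in (0, 1)."""
    e = np.asarray(y_true, dtype=float) - y_pred
    return np.maximum(tau * e, (tau - 1.0) * e).mean()

rng = np.random.default_rng(0)
y = rng.exponential(scale=1.0, size=5001)   # skewed illustrative sample
tau = 0.9

# Try constant predictions on a grid and keep the one with the lowest loss.
grid = np.linspace(0.0, 6.0, 601)
losses = np.array([pinball(y, c, tau) for c in grid])
best = grid[losses.argmin()]

print(best, np.quantile(y, tau))   # the two should be close
```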

Reach for

  • RMSE: default, especially when big errors matter most
  • MAE: outliers, robust reporting
  • R²: "how much variance did I explain?"
  • MAPE: forecasting where y > 0 always
  • Pinball: prediction intervals

Don't use

  • MAPE when y can be zero or close to it
  • MSE alone when outliers exist — report MAE too
  • R² for non-linear or non-Gaussian models without care
  • Any single metric without sanity-checking residual plots
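import numpy as np
from sklearn.metrics import (
    mean_squared_error, mean_absolute_error,
    mean_absolute_percentage_error, r2_score,
)

# Illustrative targets and predictions; substitute your own arrays
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

# Take the root explicitly: the squared=False flag is deprecated in
# recent scikit-learn releases
rmse  = np.sqrt(mean_squared_error(y_true, y_pred))
mae   = mean_absolute_error(y_true, y_pred)
mape  = mean_absolute_percentage_error(y_true, y_pred)  # a fraction, not a percent
r2    = r2_score(y_true, y_pred)

# Pinball loss for quantile regression
def pinball(y_true, y_pred, tau=0.5):
    e = np.asarray(y_true) - np.asarray(y_pred)  # positive when under-predicting
    return np.maximum(tau * e, (tau - 1) * e).mean()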

Mean squared error decomposed

$$ \text{MSE} = \mathbb{E}\!\big[(y - \hat y)^2\big] = (\mathbb{E}[\hat y] - y^*)^2 + \mathbb{V}[\hat y] + \sigma_\text{noise}^2 $$
  • Bias² of the mean prediction
  • Variance of the predictions across resampled training sets
  • Irreducible noise — the floor no model can beat
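
The three terms can be estimated by simulation. This is a sketch under stated assumptions: the "true" function is taken to be sin, the model is a deliberately too-simple straight line, and everything is evaluated at a single input `x0`; none of these choices come from the original text.

```python
import numpy as np

rng = np.random.default_rng(0)
f = np.sin                     # assumed "true" function y*
sigma = 0.3                    # noise standard deviation
x0 = 1.0                       # evaluate the decomposition at one input
n_train, n_rep = 30, 5000

preds = np.empty(n_rep)
for r in range(n_rep):
    # Fresh training set on every repetition.
    x = rng.uniform(0.0, 3.0, size=n_train)
    y = f(x) + rng.normal(0.0, sigma, size=n_train)
    coef = np.polyfit(x, y, deg=1)          # straight-line model
    preds[r] = np.polyval(coef, x0)

# Independent noisy test targets at x0.
y_test = f(x0) + rng.normal(0.0, sigma, size=n_rep)

mse = np.mean((y_test - preds) ** 2)
bias2 = (preds.mean() - f(x0)) ** 2
variance = preds.var()
noise = sigma ** 2

print(mse, bias2 + variance + noise)   # approximately equal
```

The two printed numbers agree only up to Monte Carlo error, so expect small differences from run to run if the seed is changed.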

Why RMSE is unstable in practice. One extreme outlier can inflate RMSE by 10× or more. If outliers can't be avoided, report MAE alongside, or look at quantiles of the squared residual rather than the mean. Robust alternatives include median absolute error and trimmed RMSE.

Adjusted R². For a least-squares fit, training-set R² never decreases when you add features (more flexibility, more variance explained). Adjusted R² penalises that: 1 − (1 − R²)(n − 1)/(n − p − 1), where n is the number of samples and p the number of features. Use it when comparing models with different numbers of features.
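
The formula translates directly into code. In this sketch `adjusted_r2` is a hypothetical helper and the R² values are made up; it assumes p counts features excluding the intercept.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for n samples and p features (excluding the intercept)."""
    if n - p - 1 <= 0:
        raise ValueError("need n > p + 1")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# A slightly higher raw R^2 bought with 15 extra features
# can end up with a lower adjusted R^2.
small = adjusted_r2(0.80, n=50, p=5)
large = adjusted_r2(0.81, n=50, p=20)
print(small, large)
```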

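MSLE (mean squared log error). Apply MSE to log(1 + y) and log(1 + ŷ). Errors become log ratios, so the metric is roughly symmetric in relative terms and robust to large positive outliers; it requires non-negative targets and predictions. Useful for skewed targets like prices, counts, or income.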
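
A minimal sketch of MSLE, assuming a hand-written `msle` helper and made-up values. It shows that doubling a prediction costs about the same at very different scales, and that for the same absolute miss, under-prediction is penalised more than over-prediction.

```python
import numpy as np

def msle(y_true, y_pred):
    """Mean squared error between log(1 + y_true) and log(1 + y_pred)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if np.any(y_true < 0) or np.any(y_pred < 0):
        raise ValueError("MSLE needs non-negative targets and predictions")
    return np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2)

# Same relative error (2x) at two very different scales.
print(msle([100.0], [200.0]), msle([10000.0], [20000.0]))

# Same absolute error (50) below and above the target.
print(msle([100.0], [50.0]), msle([100.0], [150.0]))
```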

Predicting distributions. Don't just minimise mean error — predict quantiles or full distributions. Pinball loss for quantiles; CRPS (continuous ranked probability score) for full predicted distributions. CRPS reduces to MAE when the prediction is a point mass.

Residual diagnostics. Plot residuals against predictions, against each feature, against time. Patterns there reveal model failures that aggregate metrics hide — heteroskedasticity, missing interactions, time leakage.

The "loss vs. metric" mismatch. Train on MSE if you want to predict means; on MAE if you want medians; on pinball if you want quantiles. Optimising the wrong thing and then evaluating with the right one is a common mistake.

import numpy as np
from scipy.stats import norm

# Robust RMSE: drop the largest `trim` fraction of squared residuals before averaging
def trimmed_rmse(y_true, y_pred, trim=0.05):
    sq = (np.asarray(y_true) - np.asarray(y_pred)) ** 2
    k = max(1, int(len(sq) * (1 - trim)))  # keep at least one residual
    return np.sqrt(np.sort(sq)[:k].mean())

# CRPS of a Gaussian forecast N(mu, sigma^2): a proper score for predicted distributions
def crps_gaussian(y, mu, sigma):
    z = (y - mu) / sigma
    return sigma * (z * (2 * norm.cdf(z) - 1) +
                    2 * norm.pdf(z) - 1 / np.sqrt(np.pi))

CRPS: proper score for distributions

$$ \text{CRPS}(F, y) = \int_{-\infty}^{\infty} \big(F(z) - \mathbb{1}\{y \leq z\}\big)^2\, dz $$
  • F: predicted CDF
  • Reduces to MAE if F is a point mass at the prediction
  • Proper — minimised when F matches the true distribution
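
The integral definition can be evaluated numerically as a sanity check. This sketch uses a plain Riemann sum on a fixed grid; the helper names (`crps_numeric`, `point_mass_cdf`, `normal_cdf`), the integration bounds, and the grid size are assumptions chosen for illustration.

```python
import numpy as np
from math import erf, sqrt, pi

def crps_numeric(cdf, y, lo=-20.0, hi=20.0, num=40001):
    """Approximate CRPS(F, y) by summing (F(z) - 1{y <= z})^2 over a grid."""
    z = np.linspace(lo, hi, num)
    step = z[1] - z[0]
    indicator = (z >= y).astype(float)
    return np.sum((cdf(z) - indicator) ** 2) * step

def point_mass_cdf(m):
    """CDF of a forecast that puts all probability on the value m."""
    return lambda z: (z >= m).astype(float)

def normal_cdf(mu, sigma):
    """CDF of N(mu, sigma^2), vectorised over a NumPy array."""
    return np.vectorize(lambda z: 0.5 * (1.0 + erf((z - mu) / (sigma * sqrt(2.0)))))

# Point mass at 1.5, observation 0: CRPS should be close to |0 - 1.5|.
print(crps_numeric(point_mass_cdf(1.5), 0.0))

# Standard normal forecast, observation 0: compare with the closed form
# 2 * pdf(0) - 1 / sqrt(pi).
closed_form = 2.0 / sqrt(2.0 * pi) - 1.0 / sqrt(pi)
print(crps_numeric(normal_cdf(0.0, 1.0), 0.0), closed_form)
```

The grid approximation is accurate to roughly the grid spacing, so small discrepancies in the third decimal place are expected.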

Proper scoring rules for regression. A scoring rule S(F, y) is "proper" if the expected score E_{Y∼G}[S(F, Y)] is minimised at F = G. Log score, CRPS, and energy score are all proper for predictive distributions. For point predictions, absolute error and squared error are the matching (consistent) losses for the median and the mean respectively.

Coverage and calibration of intervals. If a model reports a 95% prediction interval, verify that about 95% of the actual observations fall inside it on a held-out set. Empirical coverage below nominal means the intervals are too narrow (over-confident); above nominal means they are too wide (under-confident). Conformal prediction gives finite-sample marginal coverage guarantees under exchangeability.
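
A minimal split-conformal sketch, assuming symmetric intervals built from absolute residuals on a separate calibration set. The toy setup (a "model" that always predicts 0, with standard normal targets) and the helper name `split_conformal_radius` are illustrative assumptions.

```python
import numpy as np

def split_conformal_radius(cal_residuals, alpha=0.05):
    """Half-width q such that [pred - q, pred + q] has marginal coverage
    of at least 1 - alpha under exchangeability (split conformal)."""
    scores = np.sort(np.abs(np.asarray(cal_residuals, dtype=float)))
    n = len(scores)
    k = int(np.ceil((n + 1) * (1.0 - alpha)))   # rank of the conformal quantile
    if k > n:
        return np.inf                            # too few calibration points
    return scores[k - 1]

rng = np.random.default_rng(0)
cal = rng.normal(size=1000)      # calibration residuals (truth minus prediction)
test = rng.normal(size=20000)    # held-out residuals

q = split_conformal_radius(cal, alpha=0.05)
covered = np.mean(np.abs(test) <= q)          # empirical coverage, near 0.95
too_narrow = np.mean(np.abs(test) <= q / 2)   # halved intervals cover far less
print(q, covered, too_narrow)
```

The guarantee is marginal (averaged over calibration and test draws), so coverage on any one held-out set fluctuates around the nominal level.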

Forecast horizons. For time series, report errors at multiple horizons separately, since error at t+1 and t+30 can be very different. SMAPE (symmetric MAPE) is a classical forecasting metric; more recent competitions moved to scaled errors: M4 ranked entries by an average of relative sMAPE and MASE, and M5 used a weighted RMSSE.
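
A sketch of both ideas, assuming forecasts are stored as arrays of shape (n_forecasts, n_horizons). Several sMAPE variants exist; the one below divides by the average of |y| and |ŷ| and treats 0/0 terms as zero error, which is a convention chosen here rather than a universal standard.

```python
import numpy as np

def smape(y_true, y_pred):
    """Symmetric MAPE in [0, 2]; terms with a zero denominator count as 0."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    diff = np.abs(y_true - y_pred)
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    ratio = np.divide(diff, denom, out=np.zeros_like(diff), where=denom > 0)
    return ratio.mean()

def mae_by_horizon(y_true, y_pred):
    """Return one MAE per horizon step (column) instead of a single number."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.abs(y_true - y_pred).mean(axis=0)

# Two forecasts, each covering horizons t+1 and t+2 (made-up numbers).
truth = np.array([[1.0, 2.0], [3.0, 4.0]])
preds = np.array([[1.0, 4.0], [3.0, 8.0]])
print(mae_by_horizon(truth, preds))   # error grows with the horizon
print(smape([100.0], [110.0]))
```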

Heteroskedastic targets. When the noise depends on the input (variance grows with magnitude), uniform metrics can be misleading. Predict and evaluate quantile bands, or use a normalising transformation (log, Box-Cox) before fitting.

Forecast skill scores. Compare model performance to a baseline (climatology, persistence, naive forecast) — the skill score is 1 − error/baseline_error. Important in weather, finance, and epidemiology where the dataset's intrinsic difficulty changes over time.
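
The skill score formula can be sketched as follows, assuming MSE as the error measure and a persistence baseline (tomorrow equals today). The series and the "model" forecasts are made-up numbers, and `skill_score` is a hypothetical helper.

```python
import numpy as np

def mse(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

def skill_score(y_true, y_model, y_baseline, error=mse):
    """1 - error/baseline_error: 1 is perfect, 0 matches the baseline,
    negative is worse than the baseline."""
    base = error(y_true, y_baseline)
    if base == 0:
        raise ValueError("baseline error is zero; skill score is undefined")
    return 1.0 - error(y_true, y_model) / base

series = np.array([10.0, 12.0, 11.0, 13.0, 14.0])
truth = series[1:]           # values to be forecast
persistence = series[:-1]    # baseline: repeat the previous value
model = np.array([11.5, 11.5, 12.5, 13.5])

print(skill_score(truth, model, persistence))
```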

Bootstrap CIs for metrics. Don't just report a single number; bootstrap residuals or resampled test sets to put a confidence interval on RMSE, MAE, or R². Two models with a "10% improvement" can be statistically indistinguishable.

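import numpy as np

# Bootstrap CI for any regression metric (resamples test points with replacement)
def bootstrap_metric(y_true, y_pred, metric, B=1000, alpha=0.05, seed=None):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    rng = np.random.default_rng(seed)
    n = len(y_true)
    vals = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)  # sample indices with replacement
        vals.append(metric(y_true[idx], y_pred[idx]))
    lo, hi = np.percentile(vals, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return np.mean(vals), lo, hi  # bootstrap mean and percentile interval

# Empirical coverage of prediction intervals
def coverage(y, lower, upper):
    y, lower, upper = np.asarray(y), np.asarray(lower), np.asarray(upper)
    return ((y >= lower) & (y <= upper)).mean()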