Train / val / test — the discipline of evaluating on data your model hasn't seen.
Key idea
You can't measure generalization on data you trained on. Split the dataset into three roles: train (fit the model), validation (pick hyperparameters and compare models), and test (one number at the end). Touching the test set in any way during model development invalidates it.
See how different splitting strategies divide your dataset — and which ones leak when the data isn't iid
Above: 60 examples coloured by class (imbalanced 80/20) and grouped (every 3 share a group ID). Random splitting can put almost no class-1 examples in val; stratified preserves the class ratio. Time-based always puts the future in val (no peeking). Group keeps all members of a group on one side — required when independence assumptions break (patient ID, household ID, session ID).
The three roles. Train, validation, test. Train fits. Validation chooses. Test reports. The test set is touched once, at the end. Multiple test set evaluations during development invalidate the test as an estimate of true performance.
Random split. Default for iid data. Typically 60/20/20 or 70/15/15. Doesn't preserve class balance — use stratified if classes are imbalanced.
Stratified split. Same class proportions in each split. Critical for imbalanced classification, and a sensible default for any classification task, binary or multi-class.
Time-based split. Train on the past, validate / test on the future. The only correct split for sequential data. Random splits leak the future into training and overstate performance.
Group split. When examples are not independent — multiple records per patient, per user, per session — keep all examples from a group on one side of the split. Otherwise you measure within-group memorisation, not generalization.
Pick stratified / group / time when
Classes are imbalanced → stratified
Multiple examples per entity → group
Sequential / time-series → time-based
Geographic / spatial → spatial blocks
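Spatial blocking can be treated as a special case of group splitting. The sketch below is one possible way to do it, not a standard recipe: the coordinates are synthetic, and `block_size` and the `block_id` encoding are assumptions made for illustration. Each point is assigned to a grid cell, and the cell ID is passed as the group to `GroupKFold` so that nearby points never straddle a split.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
n = 400

# Hypothetical data: points scattered over a 100 x 100 area plus one extra feature.
xy = rng.uniform(0, 100, size=(n, 2))
X = np.column_stack([xy, rng.normal(size=n)])
y = rng.integers(0, 2, size=n)

# Assumed block size: 25 units per side gives a 4 x 4 grid of spatial blocks.
block_size = 25.0
cell = np.floor(xy / block_size).astype(int)
block_id = cell[:, 0] * 1000 + cell[:, 1]  # one integer ID per grid cell

# Every block lands entirely in train or entirely in test for each fold.
gkf = GroupKFold(n_splits=4)
folds = list(gkf.split(X, y, groups=block_id))
for tr, te in folds:
    assert set(block_id[tr]).isdisjoint(block_id[te])
```

Larger blocks reduce leakage between neighbouring points but leave fewer groups to distribute across folds, so the block size is a judgment call that depends on how far spatial correlation extends in the data.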
Pitfalls to avoid
Random splitting time-series: leaks the future
Random splitting grouped data: leaks the group
Fitting preprocessing (scalers, encoders, imputers) on data that includes val/test: leaks statistics. Fit on train only, then apply to val/test
Touching the test set during model selection
from sklearn.model_selection import (
    train_test_split, StratifiedShuffleSplit,
    GroupShuffleSplit, TimeSeriesSplit,
)
# X, y and groups are assumed to be NumPy arrays defined elsewhere.
# Standard 60/20/20 with stratification
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0, stratify=y)
# 0.25 of the remaining 80% is 20% of the full dataset
X_train, X_val, y_train, y_val = train_test_split(
    X_dev, y_dev, test_size=0.25, random_state=1, stratify=y_dev)
# Group split: never put two examples from the same patient on different sides
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups))
# Time series: always train on the past
tscv = TimeSeriesSplit(n_splits=5)
for tr, val in tscv.split(X):
    # fit() and evaluate() are placeholders for your own training and scoring code
    fit(X[tr], y[tr])
    evaluate(X[val], y[val])
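The preprocessing pitfall listed above is easiest to avoid by putting the preprocessing inside a scikit-learn `Pipeline`, so that its statistics are learned from the training rows only. The following is a minimal sketch on synthetic data; the dataset, the choice of `StandardScaler` and `LogisticRegression`, and the variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data (roughly 80/20), standing in for a real dataset.
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.8, 0.2], random_state=0)

# Stratified 60/20/20 split, as in the snippet above.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(
    X_dev, y_dev, test_size=0.25, random_state=1, stratify=y_dev)

# The scaler is fitted inside the pipeline, so it only ever sees X_train.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_train, y_train)

# Validation data is transformed with the training statistics, never refitted.
val_acc = pipe.score(X_val, y_val)
scaler = pipe.named_steps["standardscaler"]
print(f"val accuracy: {val_acc:.3f}")
```

The same pipeline object can be handed to `cross_val_score` or `GridSearchCV`, which refit the scaler on each training fold automatically.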
Want nested CV, distribution shift checks, and probing splits?
CV estimates generalization performance on data drawn from the same distribution as the training set
Distribution shift is what kills production models — your CV told you nothing about it
Always reserve an out-of-distribution test set if you can
Nested CV. Outer loop estimates generalization; inner loop chooses hyperparameters. Without it, tuning on the same folds you report from gives optimistic results. Expensive — typical setup is 5 outer × 3 inner — but the right thing for small datasets.
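A common way to express nested CV in scikit-learn is to wrap a `GridSearchCV` (the inner loop) inside `cross_val_score` (the outer loop). The sketch below uses synthetic data and an assumed small grid over `C`; the 5 outer × 3 inner layout mirrors the setup mentioned above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic dataset, the regime where nested CV is worth its cost.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}  # illustrative grid

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Inner loop: choose C using only the outer-training portion of each fold.
search = GridSearchCV(pipe, param_grid, cv=inner_cv, scoring="roc_auc")

# Outer loop: score the whole "tune then fit" procedure on untouched folds.
nested_scores = cross_val_score(search, X, y, cv=outer_cv, scoring="roc_auc")
print(f"nested AUC: {nested_scores.mean():.3f} ± {nested_scores.std():.3f}")
```

The reported number estimates the performance of the tuning procedure as a whole, not of one fixed hyperparameter setting; different outer folds may select different values of `C`.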
Holdout vs CV. For large datasets a single 80/20 split is fine and much cheaper. CV pays off when data is small (the variance of any single split is too high). Rule of thumb: CV below ~50k examples, single holdout above.
Distribution-shift tests. A common pattern: train on data from 2020–2023, test on data from 2024. Reveals whether your model relies on features that change over time. Same for geographic shift (train on Europe, test on Asia), demographic shift, etc.
Probe / adversarial splits. Construct a test set deliberately different from the training distribution — long-tail classes, distribution shift, adversarial perturbations. Forces you to find weaknesses before deployment does.
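The split-it-back-together trick. When your dataset is too small to set aside separate validation and test sets, use nested CV: each outer fold's held-out part serves as the test set, and the inner loop carves validation folds out of the remaining data. Cycling through the folds means every example is eventually used for training, validation, and testing, so nothing is permanently locked away. The price is extra compute and some care to keep the two loops separate.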
Repeated random splits. Report mean ± std over multiple random splits. Surprisingly informative on small data: a difference smaller than about 1 std of the split-to-split scores is unlikely to be a real improvement.
from sklearn.model_selection import (
    GroupKFold, RepeatedStratifiedKFold, cross_val_score,
)
# pipe, X, y, groups and df are assumed to be defined elsewhere.
# Repeated stratified CV: report mean ± std over all 25 fold scores
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=0)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="roc_auc")
print(f"{scores.mean():.3f} ± {scores.std():.3f} n={len(scores)}")
# Group K-Fold: keep all examples from a group on one side
gkf = GroupKFold(n_splits=5)
for tr, te in gkf.split(X, y, groups):
    # Train on tr, evaluate on te
    pass
# Distribution-shift split: use a different time period as test
mask_train = df["year"] <= 2023
mask_test = df["year"] == 2024
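To show what the year-based masks are for, here is a small self-contained sketch that compares an in-distribution holdout score with the score on the later year. Everything about the data is an assumption made for illustration: the columns `x1`, `x2`, `label` are hypothetical, and the drift (the label depends on `x1` through 2023 and on `x2` in 2024) is deliberately extreme.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
years = np.repeat([2020, 2021, 2022, 2023, 2024], 400)
x1 = rng.normal(size=years.size)
x2 = rng.normal(size=years.size)
noise = 0.3 * rng.normal(size=years.size)

# Synthetic drift: the predictive feature changes in 2024.
label = np.where(years <= 2023, x1 + noise > 0, x2 + noise > 0).astype(int)
df = pd.DataFrame({"year": years, "x1": x1, "x2": x2, "label": label})

mask_train = df["year"] <= 2023
mask_test = df["year"] == 2024
features = ["x1", "x2"]

# In-distribution holdout carved out of the 2020-2023 data.
past = df[mask_train]
past_fit, past_holdout = train_test_split(
    past, test_size=0.25, random_state=0, stratify=past["label"])

model = LogisticRegression().fit(past_fit[features], past_fit["label"])
in_dist_acc = model.score(past_holdout[features], past_holdout["label"])
shifted_acc = model.score(df.loc[mask_test, features], df.loc[mask_test, "label"])
print(f"in-distribution: {in_dist_acc:.3f}  shifted (2024): {shifted_acc:.3f}")
```

A large gap between the two numbers is the signal to look for; on real data the drift is usually subtler, so the gap is worth tracking over several time windows.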
Want bootstrap, conformal prediction, and OOD detection?
Bootstrap estimator
$$ \hat{\theta}_\text{boot} = \frac{1}{B} \sum_{b=1}^{B} \hat{\theta}\!\big(D_b^*\big), \quad D_b^* \sim D \text{ with replacement} $$
Resample B times to estimate the distribution of any statistic
Out-of-bag examples (~37% per draw) act as a held-out validation set for free
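The ~37% figure follows from the chance that a given example is never picked in n draws with replacement, (1 − 1/n)^n, which approaches 1/e ≈ 0.368 as n grows. A quick simulation, with n and B chosen arbitrarily for illustration, checks this:

```python
import numpy as np

rng = np.random.default_rng(0)
n, B = 1000, 200  # dataset size and number of bootstrap draws (arbitrary choices)

oob_fractions = []
for _ in range(B):
    idx = rng.integers(0, n, size=n)      # one bootstrap sample of indices
    in_bag = np.zeros(n, dtype=bool)
    in_bag[idx] = True                    # mark every example drawn at least once
    oob_fractions.append(1.0 - in_bag.mean())

mean_oob = float(np.mean(oob_fractions))
expected = (1 - 1 / n) ** n               # exact probability of being left out
print(f"simulated OOB fraction: {mean_oob:.3f}, theory: {expected:.3f}")
```

The out-of-bag rows of each draw can be scored with the model fitted on that draw's in-bag rows, which is the idea behind out-of-bag estimates.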
Bootstrap vs CV. Bootstrap resamples with replacement; CV partitions. Bootstrap gives confidence intervals on metrics; CV gives a point estimate. Use both — bootstrap to report uncertainty, CV to pick the best model. Out-of-bag estimation in random forests uses the bootstrap by construction.
Conformal prediction. Use a held-out calibration fold to construct prediction intervals (or prediction sets, for classification) with finite-sample coverage guarantees. Distribution-free and model-agnostic, provided the calibration and test data are exchangeable. Increasingly the right answer for production deployment of regression and classification.
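One simple variant is split conformal prediction for regression with absolute residuals as the nonconformity score. The sketch below assumes synthetic data and a linear model purely for illustration; the guarantee it relies on is marginal coverage of at least 1 − alpha when calibration and new points are exchangeable.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(3000, 1))
y = 2.0 * X[:, 0] + rng.normal(scale=1.0, size=3000)  # synthetic regression task

# Three disjoint parts: fit the model, calibrate the intervals, check coverage.
X_fit, X_rest, y_fit, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_cal, X_new, y_cal, y_new = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=1)

model = LinearRegression().fit(X_fit, y_fit)

alpha = 0.1  # target miscoverage, i.e. 90% intervals
scores = np.sort(np.abs(y_cal - model.predict(X_cal)))  # calibration residuals
n_cal = len(scores)

# Finite-sample correction: take the ceil((n + 1)(1 - alpha))-th smallest score.
k = int(np.ceil((n_cal + 1) * (1 - alpha)))
q_hat = scores[k - 1] if k <= n_cal else np.inf

pred = model.predict(X_new)
lower, upper = pred - q_hat, pred + q_hat
coverage = float(np.mean((y_new >= lower) & (y_new <= upper)))
print(f"interval half-width: {q_hat:.2f}, empirical coverage: {coverage:.3f}")
```

The intervals here have constant width; methods that adapt the width to the input (for example, conformalised quantile regression) follow the same calibrate-then-predict pattern.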
OOD detection. Many production systems combine an in-distribution model with an OOD detector that flags inputs unlike anything the model saw during training. Methods include Mahalanobis distance in feature space, energy-based scores, outlier exposure, and conformal anomaly detection.
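As a rough illustration of the Mahalanobis approach, the sketch below fits a single Gaussian to training features and flags inputs whose squared distance exceeds a high quantile of the training distances. The features are synthetic stand-ins for a model's embedding space, and the 99% threshold is an assumed choice; in practice the threshold would be tuned on held-out in-distribution data.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5

# Stand-in for feature vectors extracted from training inputs.
train_feats = rng.normal(size=(2000, d))
mu = train_feats.mean(axis=0)
cov = np.cov(train_feats, rowvar=False) + 1e-6 * np.eye(d)  # small ridge for stability
cov_inv = np.linalg.inv(cov)

def mahalanobis_sq(feats):
    """Squared Mahalanobis distance of each row to the training distribution."""
    diff = feats - mu
    return np.einsum("ij,jk,ik->i", diff, cov_inv, diff)

# Assumed rule: flag anything beyond the 99th percentile of training distances.
threshold = np.quantile(mahalanobis_sq(train_feats), 0.99)

in_dist = rng.normal(size=(500, d))        # fresh in-distribution inputs
far = rng.normal(loc=6.0, size=(500, d))   # inputs shifted far from training data
flag_in = mahalanobis_sq(in_dist) > threshold
flag_far = mahalanobis_sq(far) > threshold
print(f"flagged in-dist: {flag_in.mean():.3f}, flagged shifted: {flag_far.mean():.3f}")
```

For classifiers, a common refinement is to fit one mean per class with a shared covariance and score an input by its distance to the nearest class mean.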
The "validation set rot" trap. Repeatedly tuning on the same validation set selects for performance on that set specifically — eventually your "best" model is overfit to validation. Hold out a final test set, rotate validation sets, or use nested CV.
Adaptive data analysis. Dwork et al. (2015) showed that interactive data analysis can produce arbitrarily inflated performance estimates if you're not careful, even with held-out validation, because each result you look at shapes the next analysis you run. Differential-privacy-based mechanisms (Thresholdout, gauss-out) give bounds; in practice, treat each new analysis as eating into a budget.
Time-series with non-stationarity. The classical TimeSeriesSplit assumes a fixed underlying process. When the world genuinely changes, error grows with the gap between train and test ends. Walk-forward validation with re-training, exponentially-weighted features, and model monitoring all matter.
import numpy as np
# Bootstrap a metric with a percentile confidence interval
def bootstrap_ci(y_true, y_pred, metric, B=1000, alpha=0.05):
    # Convert to arrays so integer-array indexing works for lists and Series too
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    rng = np.random.default_rng(0)
    n = len(y_true)
    vals = []
    for _ in range(B):
        idx = rng.choice(n, n, replace=True)  # resample indices with replacement
        vals.append(metric(y_true[idx], y_pred[idx]))
    lo, hi = np.percentile(vals, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return np.mean(vals), lo, hi
# Walk-forward CV for non-stationary time series (expanding training window)
def walk_forward(X, y, n_splits=5, min_train=500):
    n = len(X)
    step = (n - min_train) // n_splits
    if step < 1:
        raise ValueError("need at least min_train + n_splits rows")
    for i in range(n_splits):
        tr_end = min_train + i * step
        # Train on everything up to tr_end, validate on the next `step` rows
        yield slice(0, tr_end), slice(tr_end, tr_end + step)