Key idea

ML code is still code. The hard part is what happens with the data and the model; the rest is plain Python that benefits from the same testing you'd write for anything else. On top of that come three ML-specific kinds of test: shape, range, and gradient flow.

The disasters: silent shape bugs (broadcasting where you didn't expect), gradient-stopping mistakes (a .detach() in the wrong place), data corruption (NaN propagation), reproducibility loss (a seed didn't take). All of these are catchable with cheap tests.
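
To make the first disaster concrete, here is a minimal sketch that assumes a regression head returning shape (N, 1) while the targets have shape (N,). The tensor values and the helper name `checked_mse` are illustrative only. The subtraction broadcasts to (N, N), the loss is still a valid scalar, and nothing crashes.

```python
import torch

pred = torch.tensor([[1.0], [2.0], [3.0], [4.0]])   # shape (4, 1), e.g. a regression head
target = torch.tensor([1.0, 2.0, 3.0, 4.0])         # shape (4,)

# Broadcasting silently expands both operands to (4, 4).
diff = pred - target
silent_loss = (diff ** 2).mean()   # 2.5 here, although every prediction is exactly right

def checked_mse(pred, target):
    """MSE that refuses to broadcast: shapes must match exactly."""
    assert pred.shape == target.shape, f"shape mismatch: {pred.shape} vs {target.shape}"
    return ((pred - target) ** 2).mean()

correct_loss = checked_mse(pred.squeeze(1), target)  # 0.0, the loss you meant
```

A one-line shape assertion inside the loss function is usually enough to turn this class of bug into an immediate failure.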

Five tests every ML repo should have. (1) Forward pass produces the right output shape. (2) Backward pass produces non-zero gradients on every learnable parameter. (3) A single training step decreases the loss on a tiny batch. (4) Loaded checkpoints reproduce exactly. (5) The same seed produces the same output.

What to test

  • Shapes: inputs, outputs, intermediate activations
  • Numeric range: no NaNs / Infs after forward / backward
  • Gradients: non-zero on every trainable parameter
  • Loss decrease: a few hundred steps on one tiny batch should overfit it
  • Determinism: same seed → same loss to 1e-6
  • Data pipeline: schemas, ranges, missing rates
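
The numeric-range item can be sketched as a single helper that runs one forward and backward pass and fails on any NaN or Inf. The `nn.Sequential` model below is a stand-in, and `assert_all_finite` is a hypothetical helper name; swap in your own model and loss.

```python
import torch
from torch import nn

def assert_all_finite(model, x, y):
    """Run one forward/backward pass and fail on any NaN or Inf."""
    logits = model(x)
    assert torch.isfinite(logits).all(), "non-finite values in forward output"
    loss = nn.functional.cross_entropy(logits, y)
    assert torch.isfinite(loss), f"non-finite loss: {loss.item()}"
    loss.backward()
    for name, p in model.named_parameters():
        if p.grad is not None:
            assert torch.isfinite(p.grad).all(), f"non-finite grad: {name}"

def test_numeric_range():
    torch.manual_seed(0)
    # Stand-in model; replace with the model under test.
    model = nn.Sequential(nn.Linear(128, 32), nn.ReLU(), nn.Linear(32, 10))
    x = torch.randn(4, 128)
    y = torch.randint(10, (4,))
    assert_all_finite(model, x, y)
```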

What to NOT test

  • Final accuracy on the real dataset (too slow + brittle)
  • Exact-equality of arbitrary intermediate tensors (floating-point)
  • Library internals (PyTorch already tests itself)
  • Anything that needs a GPU (run those separately, less often)

import pytest
import torch
from my_model import MyNet

@pytest.fixture
def model():
    torch.manual_seed(0)  # fixed seed so every test starts from the same weights
    return MyNet(in_dim=128, out_dim=10)

def test_forward_shape(model):
    x = torch.randn(4, 128)
    assert model(x).shape == (4, 10)

def test_grad_flow(model):
    x = torch.randn(4, 128)
    y = torch.randint(10, (4,))
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()
    # Every learnable parameter should receive a gradient that is not all zeros.
    for name, p in model.named_parameters():
        assert p.grad is not None, f"no grad: {name}"
        assert (p.grad != 0).any(), f"all-zero grad: {name}"

def test_overfit_tiny_batch(model):
    x = torch.randn(2, 128)
    y = torch.tensor([3, 7])
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    # A healthy model and optimiser should memorise two samples quickly.
    for _ in range(200):
        opt.zero_grad()
        loss = torch.nn.functional.cross_entropy(model(x), y)
        loss.backward()
        opt.step()
    assert loss.item() < 0.1, f"can't overfit 2 samples: {loss.item()}"
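
Tests (4) and (5) from the list of five, checkpoint round-trips and seed determinism, could look roughly like this. It is a sketch under assumptions: `make_model` is a hypothetical factory standing in for the real model, and exact equality is asserted only because everything runs on CPU in one process.

```python
import io
import torch
from torch import nn

def make_model():
    # Stand-in for the real model; any nn.Module works here.
    return nn.Sequential(nn.Linear(128, 32), nn.ReLU(), nn.Linear(32, 10))

def test_checkpoint_roundtrip():
    torch.manual_seed(0)
    model = make_model().eval()
    x = torch.randn(4, 128)

    buffer = io.BytesIO()                     # in-memory file, no disk needed
    torch.save(model.state_dict(), buffer)
    buffer.seek(0)

    restored = make_model().eval()            # fresh model with different initial weights
    restored.load_state_dict(torch.load(buffer))

    with torch.no_grad():
        assert torch.equal(model(x), restored(x))

def test_same_seed_same_output():
    def run(seed):
        torch.manual_seed(seed)
        model = make_model()
        x = torch.randn(4, 128)
        with torch.no_grad():
            return model(x)

    assert torch.equal(run(123), run(123))
    assert not torch.equal(run(123), run(124))   # sanity check: the seed matters
```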
The test pyramid for ML

$$ \underbrace{\text{unit (fast, many)}}_{\text{shapes, ranges, grads}} \;\;\;>\;\;\; \underbrace{\text{integration (medium)}}_{\text{1-batch overfit, save/load}} \;\;\;>\;\;\; \underbrace{\text{end-to-end (slow, few)}}_{\text{tiny full pipeline}} $$
  • Many cheap tests at the bottom — they catch most bugs
  • A few expensive tests at the top — only what really requires it

Data validation. Schema (column names, types). Ranges (age > 0). Missingness (fewer than X% nulls per column). Cardinality (distinct categories within expected set). Tools: Great Expectations, Pandera, deequ.
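
Missingness and cardinality checks need no dedicated tool to get started. The sketch below uses plain pandas; the helper names `check_missingness` and `check_cardinality`, the 20% threshold, and the sample data are all illustrative assumptions.

```python
import pandas as pd

def check_missingness(df, max_null_frac):
    """Return {column: null fraction} for columns exceeding max_null_frac."""
    null_frac = df.isna().mean()
    return null_frac[null_frac > max_null_frac].to_dict()

def check_cardinality(series, allowed):
    """Return observed non-null values that are not in the allowed set."""
    return set(series.dropna().unique()) - set(allowed)

df = pd.DataFrame({
    "age":  [34, None, 51, 28],
    "city": ["London", "Paris", "Oslo", None],
})

# Both columns are 25% null, so both exceed a 20% budget.
too_sparse = check_missingness(df, max_null_frac=0.2)
# "Oslo" is not in the expected category set.
unexpected = check_cardinality(df["city"], {"London", "Paris", "Berlin"})
```

In a test, assert that both results are empty so the failure message names the offending columns or values.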

Property-based testing. Hypothesis library — generate random tensors and assert properties (output shape always matches input batch size; gradient norm bounded). Catches edge cases hand-written tests miss.

Snapshot tests. Save the output of a deterministic pipeline once, assert it matches on subsequent runs. Catches accidental behaviour changes; brittle against intentional ones — has to be regenerated when you change the model.
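
A snapshot test can be as small as one helper that records on the first run and compares afterwards. This is a sketch using only the standard library; `assert_matches_snapshot` and `pipeline` are hypothetical names, and in a real repo the snapshot file would be committed rather than written to a temporary directory.

```python
import json
import math
import tempfile
from pathlib import Path

def assert_matches_snapshot(path, values, rel_tol=1e-6):
    """Create the snapshot on first run; compare against it afterwards."""
    path = Path(path)
    if not path.exists():
        path.write_text(json.dumps(values))   # first run: record and pass
        return
    expected = json.loads(path.read_text())
    assert len(values) == len(expected), "snapshot length changed"
    for i, (got, want) in enumerate(zip(values, expected)):
        assert math.isclose(got, want, rel_tol=rel_tol, abs_tol=1e-9), (
            f"snapshot mismatch at index {i}: {got} != {want}"
        )

def pipeline(xs):
    # Stand-in for a deterministic preprocessing + model pipeline.
    return [0.5 * x + 1.0 for x in xs]

with tempfile.TemporaryDirectory() as tmp:
    snap = Path(tmp) / "pipeline_output.json"
    assert_matches_snapshot(snap, pipeline([1.0, 2.0, 3.0]))   # records the snapshot
    assert_matches_snapshot(snap, pipeline([1.0, 2.0, 3.0]))   # compares and passes
```

Regenerating after an intentional change is then just deleting the file and rerunning; the tolerance keeps the comparison robust to floating-point noise.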

Integration test recipe. A tiny synthetic dataset (10 examples), a 1-batch training run, a save-and-reload, an inference call. Total runtime: under a second. Catches most "did I break the pipeline" bugs.
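
The recipe maps almost one-to-one onto a single test. The sketch below assumes a plain `nn.Linear` as a stand-in model and made-up dimensions (8 features, 3 classes); the four numbered comments follow the four steps of the recipe.

```python
import tempfile
from pathlib import Path

import torch
from torch import nn

def test_tiny_pipeline():
    torch.manual_seed(0)
    # 1. Tiny synthetic dataset: 10 examples, 8 features, 3 classes.
    x = torch.randn(10, 8)
    y = torch.randint(3, (10,))

    # 2. One training step on a single batch.
    model = nn.Linear(8, 3)                   # stand-in model
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()
    opt.step()

    # 3. Save and reload through a real file on disk.
    with tempfile.TemporaryDirectory() as tmp:
        ckpt = Path(tmp) / "model.pt"
        torch.save(model.state_dict(), ckpt)
        reloaded = nn.Linear(8, 3)
        reloaded.load_state_dict(torch.load(ckpt))

    # 4. Inference call on the reloaded model.
    reloaded.eval()
    with torch.no_grad():
        preds = reloaded(x).argmax(dim=1)
    assert preds.shape == (10,)
    assert preds.min() >= 0 and preds.max() < 3
```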

CI for ML. Run the unit + fast integration tests on every PR. Run slower (1-minute) end-to-end tests on merge. Keep GPU tests on a separate workflow that runs on demand or nightly. See CI for ML.
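
One way to split the tiers is with pytest markers, so each workflow selects its tests with `-m`. The marker names `slow` and `gpu` below are assumptions, not a convention pytest ships with; they should be registered under `markers` in `pytest.ini` or `pyproject.toml` so pytest does not warn about unknown marks.

```python
import pytest
import torch

# Hypothetical marker names; register them in pytest.ini / pyproject.toml.
slow = pytest.mark.slow
gpu = pytest.mark.gpu
needs_cuda = pytest.mark.skipif(
    not torch.cuda.is_available(), reason="needs a CUDA device"
)

def test_forward_shape_cpu():
    # Unmarked: runs on every PR.
    assert torch.nn.Linear(8, 3)(torch.randn(2, 8)).shape == (2, 3)

@slow
def test_end_to_end_tiny():
    # Marked slow: selected on merge.
    model = torch.nn.Linear(8, 3)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    x, y = torch.randn(16, 8), torch.randint(3, (16,))
    for _ in range(5):
        opt.zero_grad()
        torch.nn.functional.cross_entropy(model(x), y).backward()
        opt.step()
    assert all(torch.isfinite(p).all() for p in model.parameters())

@gpu
@needs_cuda
def test_forward_on_gpu():
    # Marked gpu: selected by the nightly or on-demand workflow,
    # and skipped automatically on machines without CUDA.
    model = torch.nn.Linear(8, 3).cuda()
    assert model(torch.randn(2, 8).cuda()).is_cuda
```

With this layout the PR job would run `pytest -m "not slow and not gpu"`, the merge job `pytest -m slow`, and the GPU workflow `pytest -m gpu`.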

Test the failure modes. What happens with an empty batch? With NaN inputs? With a class that wasn't seen at training? Each of these usually warrants a test that asserts an explicit error rather than silent garbage.
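
The three failure modes named above can each get a test that expects an explicit exception. In this sketch `validate_batch` and `encode_label` are hypothetical guards, not part of any library; the point is that the test asserts the error rather than inspecting garbage output.

```python
import pytest
import torch

def validate_batch(x, in_dim):
    """Hypothetical input guard, called before the model sees the batch."""
    if x.ndim != 2 or x.shape[1] != in_dim:
        raise ValueError(f"expected shape (batch, {in_dim}), got {tuple(x.shape)}")
    if x.shape[0] == 0:
        raise ValueError("empty batch")
    if not torch.isfinite(x).all():
        raise ValueError("input contains NaN or Inf")
    return x

def encode_label(label, known_classes):
    """Hypothetical label encoder that refuses classes unseen at training."""
    if label not in known_classes:
        raise KeyError(f"unseen class: {label!r}")
    return known_classes.index(label)

def test_empty_batch_is_rejected():
    with pytest.raises(ValueError, match="empty batch"):
        validate_batch(torch.empty(0, 128), in_dim=128)

def test_nan_input_is_rejected():
    x = torch.randn(4, 128)
    x[0, 0] = float("nan")
    with pytest.raises(ValueError, match="NaN"):
        validate_batch(x, in_dim=128)

def test_unseen_class_is_rejected():
    with pytest.raises(KeyError):
        encode_label("Madrid", ["London", "Paris", "Berlin"])
```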

import pandas as pd
import pandera as pa
from pandera import Column, Check
from hypothesis import given, strategies as st
import torch
from my_model import MyNet

# Pandera schema for input data
schema = pa.DataFrameSchema({
    "age":    Column(int,   Check.in_range(0, 120)),
    "income": Column(float, Check.greater_than_or_equal_to(0)),
    "city":   Column(str,   Check.isin(["London", "Paris", "Berlin"])),
})

def load_data(path):
    df = pd.read_csv(path)
    return schema.validate(df)               # raises if input doesn't conform

# Property-based: model behaviour across random shapes
@given(
    batch=st.integers(min_value=1, max_value=64),
    dim=st.integers(min_value=4, max_value=256),
)
def test_forward_arbitrary_shape(batch, dim):
    model = MyNet(in_dim=dim, out_dim=10)
    x = torch.randn(batch, dim)
    out = model(x)
    assert out.shape == (batch, 10)
    assert not out.isnan().any()
Determinism budget

$$ \|\text{out}(s_1) - \text{out}(s_2)\|_\infty < \epsilon $$
  • Two seeded runs, s_1 and s_2, started from the same seed should agree to within ε
  • cuDNN nondeterminism, atomic adds, multi-GPU sync — all break this
  • The right ε is small but not zero on GPU

Golden datasets. Tiny, hand-curated, labelled by you, kept in the repo. Run the trained model on these in every CI run; assert that accuracy exceeds a threshold and that specific edge-case predictions still pass. Catches regressions from accidental changes.
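
A golden-set test has two assertions: an overall accuracy floor and a per-example check on the edge cases. The sketch below uses a toy sentiment task; the examples, the 0.7 threshold, and the rule-based `predict` stand-in are all invented for illustration.

```python
GOLDEN = [
    # (input text, expected label, is_edge_case)
    ("great product, works well", "pos", False),
    ("terrible, broke in a day",  "neg", False),
    ("not bad at all",            "pos", True),   # negation edge case
    ("",                          "neg", True),   # empty-input edge case
]

def predict(text):
    """Stand-in for the trained model's inference call."""
    if not text:
        return "neg"
    if "not bad" in text:
        return "pos"
    return "neg" if any(w in text for w in ("terrible", "broke", "bad")) else "pos"

def test_golden_set(min_accuracy=0.7):
    results = [(predict(x) == y, edge, x) for x, y, edge in GOLDEN]
    accuracy = sum(ok for ok, _, _ in results) / len(results)
    assert accuracy > min_accuracy, f"golden accuracy dropped to {accuracy:.2f}"
    # Edge cases must each pass individually, regardless of overall accuracy.
    failed_edges = [x for ok, edge, x in results if edge and not ok]
    assert not failed_edges, f"edge cases regressed: {failed_edges}"
```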

Mutation testing. Tools like mutmut systematically introduce small bugs into your code (flipping conditions, changing constants) and check if your tests catch them. If a mutation survives, the test suite has a gap.
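
What "a mutation survives" means is easiest to see by hand. The sketch below is a hand-rolled illustration of the idea, not mutmut's output or API: `is_outlier_mutant` is the kind of change a mutation tool generates, and the two test functions show a suite with a gap and one without.

```python
def is_outlier(value, threshold=3.0):
    return abs(value) > threshold

def is_outlier_mutant(value, threshold=3.0):
    # Mutation: `>` changed to `>=`.
    return abs(value) >= threshold

def weak_test(fn):
    # Only probes values far from the boundary.
    return fn(10.0) is True and fn(0.0) is False

def strong_test(fn):
    # Also pins the behaviour exactly at the boundary.
    return weak_test(fn) and fn(3.0) is False

# The weak test passes for both versions: the mutant survives.
assert weak_test(is_outlier) and weak_test(is_outlier_mutant)
# The strong test passes only for the original: the mutant is killed.
assert strong_test(is_outlier) and not strong_test(is_outlier_mutant)
```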

Flaky test hunting. ML tests that fail intermittently are almost always caused by seeded but non-deterministic ops, or by data loaders that sample randomly without a seed. Use pytest-rerunfailures to surface them, then fix the root cause rather than just retrying.
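
For the data-loader case, one root-cause fix is to give the loader its own seeded generator instead of relying on the global RNG. A minimal sketch, assuming a toy `TensorDataset`; `first_batch` is a hypothetical helper.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(20))

def first_batch(seed):
    # A dedicated generator makes the shuffle order depend only on `seed`,
    # not on whatever consumed the global RNG earlier in the test session.
    g = torch.Generator()
    g.manual_seed(seed)
    loader = DataLoader(dataset, batch_size=5, shuffle=True, generator=g)
    (batch,) = next(iter(loader))
    return batch

def test_loader_order_is_reproducible():
    a = first_batch(seed=0)
    torch.rand(3)                    # unrelated global RNG use between the two runs
    b = first_batch(seed=0)
    assert torch.equal(a, b)
```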

Test data versioning. Your tests are only as stable as their data. Use DVC / LakeFS for test datasets, or commit a small parquet file directly. Either way, pin each test to a specific data version so that a failure can be traced to either a code change or a data change.
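
When the test data is committed directly, a content hash is a cheap way to pin the version. This sketch uses only the standard library; `file_sha256` and `assert_data_version` are hypothetical helpers, and the file contents are made up for the demo.

```python
import hashlib
import tempfile
from pathlib import Path

def file_sha256(path):
    """Content hash of a data file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def assert_data_version(path, expected_sha256):
    actual = file_sha256(path)
    assert actual == expected_sha256, (
        f"{path} changed: expected {expected_sha256[:12]}, got {actual[:12]}"
    )

# Demo with a throwaway file. In a real repo the expected hash is a constant
# committed next to the test and updated deliberately when the data changes.
with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "tiny.csv"
    data.write_bytes(b"age,city\n34,London\n")
    pinned = file_sha256(data)
    assert_data_version(data, pinned)
```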

Determinism on GPU. cuDNN benchmark mode picks fastest kernels, which can differ between runs. torch.use_deterministic_algorithms(True) + CUBLAS_WORKSPACE_CONFIG=:4096:8 get you most of the way; some ops still aren't deterministic. Tests that assert bit-exact GPU equality are usually a mistake.

Differential testing. Compare two implementations of the same thing — your model vs a reference implementation, your CUDA kernel vs the CPU version, the new optimiser vs the old one. Run both on the same input, assert output close.

import os, random
import numpy as np
import torch

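def fully_deterministic(seed=42):
    """Best-effort determinism on PyTorch + CUDA."""
    # PYTHONHASHSEED is read at interpreter startup, so setting it here only
    # affects subprocesses; export it before launching Python for this process.
    os.environ["PYTHONHASHSEED"] = str(seed)
    # Set this before any CUDA work starts so cuBLAS picks it up.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.use_deterministic_algorithms(True)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False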

# Differential test: in eval mode, the model should match a reference.
# `ref_model` and `x` are assumed to be fixtures defined elsewhere.
def test_against_reference(model, ref_model, x):
    model.eval()
    ref_model.eval()
    with torch.no_grad():
        a = model(x)
        b = ref_model(x)
    max_diff = (a - b).abs().max().item()
    assert max_diff < 1e-4, f"max abs diff: {max_diff}"