CI for ML — ML Resources Hub

Key idea

Run the cheap checks on every PR; run the expensive ones less often. Lint + types + unit tests + 1-batch overfit on CPU on every push. Full training runs on a schedule. GPU-required tests on demand or nightly. Don't make every contributor wait for a model to train.

The setup most repos converge on: GitHub Actions for orchestration, three workflow files (PR-fast, PR-slow, nightly), Docker for env consistency, and clear separation between "lint and test the code" and "actually validate the model".

What runs on every PR. ruff / black / mypy (linting + types). Unit tests on CPU. The 1-batch overfit smoke test. A data-schema validation. Total time target: under 5 minutes.

What runs less often. Full training on a tiny dataset. Integration test against a held-out evaluation set. GPU benchmarks. Model registry promotion gates.

# .github/workflows/pr.yml — runs on every push to a PR
name: PR Checks
on: [pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: '3.11' }
      - run: pip install uv && uv pip install --system -r requirements.txt
      - run: ruff check .
      - run: mypy --strict src/
      - run: pytest -q --cov=src tests/unit
      - run: pytest -q tests/smoke      # 1-batch overfit, schema checks

Three workflow tiers

PR (every push): lint, types, unit tests, 1-batch overfit. < 5 min.
Nightly: full pipeline on tiny dataset, model evals on held-out set, security scan
On-demand / release: full training, GPU benchmarks, model registry promotion

Common mistakes

Running everything on every PR — devs stop reading red builds
Caching deps badly — flaky installs slow every run
No matrix testing across Python / CUDA versions you actually support
Storing model artefacts in the repo (use a registry instead)

Want matrix builds, caching, & GPU CI strategies?

CI cost model $$ \text{cost} = (\text{frequency}) \times (\text{duration}) \times (\text{rate}) $$

The PR loop is high-frequency, low-duration — optimise duration
The release loop is low-frequency, high-duration — duration matters less
Spend GPU-minutes where they actually catch bugs

Cache aggressively. pip / uv cache, model checkpoints, dataset downloads, Docker layers. Most ML CI time is waiting for installs and downloads. actions/cache with a key on the lockfile hash gets you 90% of the way.

Matrix builds. Test against the Python + CUDA + torch versions you support. strategy.matrix in GitHub Actions; fail-fast: false so one failure doesn't kill the others.

GPU CI. Self-hosted runners with GPUs, or paid services (Modal, CoreWeave, RunPod). Run only the tests that genuinely need a GPU — most ML code can be unit-tested on CPU.

Container-based CI. Build a Docker image with your env once; reuse across jobs. nvcr.io/nvidia/pytorch:24.05-py3 is a sensible base; pin the tag.

Pre-commit hooks. Run lint + format + simple checks locally before the commit lands. Pre-commit catches typos and format issues without burning CI minutes.

Branch protection. Require PR review, require all checks pass, require linear history. The "checks pass" is what turns CI into actual enforcement instead of a vibes-based warning.

name: PR Checks
on: [pull_request]

jobs:
  test:
    strategy:
      fail-fast: false
      matrix:
        python: ["3.10", "3.11", "3.12"]
        torch:  ["2.4", "2.5"]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: ${{ matrix.python }} }
      - name: Cache pip
        uses: actions/cache@v4
        with:
          path: ~/.cache/pip
          key: pip-${{ matrix.python }}-${{ hashFiles('uv.lock') }}
      - run: pip install uv
      - run: uv pip install --system -r requirements.txt torch==${{ matrix.torch }}
      - run: pytest -q --cov=src

Want artifact registries, automatic version bumps, & ML-specific check gates?

Promotion pipeline $$ \text{commit} \to \text{lint/test} \to \text{train} \to \text{eval} \to \text{registry: staging} \to \text{promote: production} $$

Each stage has explicit criteria for the next
Failing criteria => roll back; passing => promote

Reproducible runs in CI. Pin everything: Python, torch, CUDA, NumPy, system OS. Use a hashed lockfile. Set determinism flags. Save the run's full provenance (config, git sha, data hash) as an artefact.

Model evaluation gates. A model is only promoted if it beats the production baseline on held-out evals and clears subgroup-specific bars (no regressions on minority groups). Tools: model registries (MLflow, W&B), promotion workflows, model cards.

Data drift checks in CI. Before retraining, validate that the new data matches the schema and distribution of training data within thresholds. Great Expectations + a GH Action runs this on every data update.

Security & supply chain. Pip-audit or safety on every PR. Dependabot for automated security updates. SBOM generation. Pin git submodules / external models by SHA.

Cost monitoring. Track CI / training cost per PR. Quotas per team. Alert when a run uses 10× normal compute — usually means a bug.

Self-hosted runners. For GPU jobs or special hardware. Manage scaling (Karpenter, RunPod) so you only pay for time you use. Worth it when GitHub-hosted GPU pricing exceeds your dedicated-runner cost.

Release engineering. Semantic versioning for the package; a separate version for the trained model (commit + checkpoint id). Changelog auto-generation from conventional commits.

name: Nightly Train + Eval
on:
  schedule: [{ cron: '0 6 * * *' }]   # daily at 06:00 UTC

jobs:
  train:
    runs-on: [self-hosted, gpu]
    steps:
      - uses: actions/checkout@v4
      - name: Train tiny model on subset
        run: python -m my_project.train --config configs/ci-tiny.yaml
      - name: Eval against baseline
        run: |
          python -m my_project.eval \
            --checkpoint outputs/latest.pt \
            --baseline   gs://my-bucket/baseline.pt \
            --fail-if-worse-than 0.02       # > 2pp regression fails the job
      - name: Promote on success
        if: ${{ success() }}
        run: mlflow models register --name my-model --source outputs/latest.pt

Too dense?