Key idea

Training and serving are different problems. Training optimises for throughput on a fixed dataset. Serving optimises for latency on one request at a time (or small batches). The right tool, the right format, and the right batching strategy all change. Don't ship a 100-line Flask app; use one of the four standard serving stacks.

The four standard stacks. FastAPI + uvicorn for "I want a Python endpoint that calls my model" (small models, hobby projects). BentoML for "I want a packaged, deployable model" (production tabular / sklearn / small DL). Triton for "I want NVIDIA-optimised, dynamic batching, multi-framework". vLLM (or TGI) for "I'm serving an LLM and need throughput".

Three big serving decisions. Sync vs async (sync when the caller needs the answer immediately; async when the work can be queued or the response streamed). Single-request vs batched (batching trades per-request latency for throughput). CPU vs GPU (CPU for small models, GPU for transformers / large CNNs).

Stack by use-case

  • Small Python model: FastAPI + uvicorn
  • Tabular / sklearn / packaged: BentoML
  • GPU model with dynamic batching: Triton
  • LLM: vLLM, TGI, or a hosted API
  • Edge / mobile: ONNX Runtime, TFLite, CoreML

Common pitfalls

  • Loading the model on every request — load once at startup
  • Hardcoded paths — use a registry alias instead
  • Synchronous I/O blocking GPU — use async or thread pool
  • No request validation — bad inputs crash the server
  • No timeout — a slow GPU stalls everything
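
from fastapi import FastAPI
from mlflow import MlflowClient
from pydantic import BaseModel, Field
import mlflow.pytorch
import torch

app = FastAPI()

# Load the model once at startup, via a registry alias rather than a hardcoded path
MODEL_NAME, MODEL_ALIAS = "credit-risk", "production"
MODEL = mlflow.pytorch.load_model(f"models:/{MODEL_NAME}@{MODEL_ALIAS}").eval()
# The loaded nn.Module has no version attribute, so ask the registry which version the alias points at
MODEL_VERSION = MlflowClient().get_model_version_by_alias(MODEL_NAME, MODEL_ALIAS).version

class PredictRequest(BaseModel):
    # Exactly 16 features; anything else is rejected with a 422 before it reaches the model
    features: list[float] = Field(..., min_length=16, max_length=16)

class PredictResponse(BaseModel):
    score: float
    model_version: str

@app.post("/predict", response_model=PredictResponse)
def predict(req: PredictRequest):
    x = torch.tensor(req.features, dtype=torch.float32).unsqueeze(0)  # shape (1, 16)
    with torch.inference_mode():
        score = MODEL(x).sigmoid().item()  # assumes the model emits a single logit
    return PredictResponse(score=score, model_version=str(MODEL_VERSION))

@app.get("/health")
def health():
    return {"status": "ok"}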

Dynamic batching, ONNX, and the LLM serving stack

Latency budget decomposition

$$ T_{\text{request}} = T_{\text{net}} + T_{\text{preprocess}} + T_{\text{queue}} + T_{\text{forward}} + T_{\text{postprocess}} $$
  • Profile each piece — the bottleneck is rarely where you think
  • Pre/post-processing often dominates small-model latency
  • Queue time is what dynamic batching trades against
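
The decomposition above can be measured stage by stage inside the request handler. The following is a minimal sketch, not a definitive profiler: `timed`, `handle`, and the three stage functions are hypothetical stand-ins that do not appear in the original, and network and queue time have to be measured outside the handler (at the client or load balancer).

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(stage: str, timings: dict[str, float]):
    """Record the wall-clock duration of one stage, in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = (time.perf_counter() - start) * 1000.0

# Hypothetical stand-ins for the real pipeline stages.
def preprocess(raw: list[str]) -> list[float]:
    return [float(v) for v in raw]

def forward(x: list[float]) -> float:
    return sum(x) / len(x)

def postprocess(y: float) -> dict:
    return {"score": round(y, 4)}

def handle(raw: list[str]) -> tuple[dict, dict[str, float]]:
    """Run the three in-process stages and return the result plus per-stage timings."""
    timings: dict[str, float] = {}
    with timed("preprocess", timings):
        x = preprocess(raw)
    with timed("forward", timings):
        y = forward(x)
    with timed("postprocess", timings):
        out = postprocess(y)
    return out, timings

result, timings = handle(["0.2", "0.4", "0.6"])
print(result, timings)
```

Logging these per-stage timings alongside the total makes it easy to see whether the model or the code around it dominates.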

Dynamic batching. The server accumulates in-flight requests until a short time window (10–50 ms) elapses or a maximum batch size is reached, runs the model once on the whole batch, then splits the outputs back to the individual callers. Triton, BentoML, and most serving frameworks support this natively. Huge throughput win for GPU-served models with bursty traffic.
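
To make the mechanism concrete, here is a minimal asyncio sketch of a micro-batcher. It is an illustration of the idea only, not how Triton or BentoML implement it: `MicroBatcher`, `fake_model`, and the parameter names are hypothetical, and error handling (for example, a failing batch function) is left out.

```python
import asyncio

class MicroBatcher:
    """Collect requests until max_batch items or max_wait_s elapse, then run once."""

    def __init__(self, batch_fn, max_batch: int = 8, max_wait_s: float = 0.02):
        self.batch_fn = batch_fn        # takes a list of inputs, returns a list of outputs
        self.max_batch = max_batch
        self.max_wait_s = max_wait_s
        self.queue: asyncio.Queue = asyncio.Queue()

    async def submit(self, item):
        """Called once per request; resolves when the batch containing it has run."""
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((item, fut))
        return await fut

    async def run(self):
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self.queue.get()]          # block until the first request
            deadline = loop.time() + self.max_wait_s  # the window starts now
            while len(batch) < self.max_batch:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            outputs = self.batch_fn([item for item, _ in batch])  # one model call
            for (_, fut), out in zip(batch, outputs):
                fut.set_result(out)                   # split outputs back to callers

batch_sizes = []

def fake_model(xs):
    """Stand-in for a batched forward pass."""
    batch_sizes.append(len(xs))
    return [x * 2 for x in xs]

async def main():
    batcher = MicroBatcher(fake_model, max_batch=4, max_wait_s=0.05)
    worker = asyncio.create_task(batcher.run())
    results = await asyncio.gather(*(batcher.submit(i) for i in range(6)))
    worker.cancel()
    return results

results = asyncio.run(main())
print(results, batch_sizes)
```

The window length is the queue-time term of the latency budget: a longer window gives bigger batches and more throughput, at the cost of every request waiting longer.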

ONNX. Open Neural Network Exchange: convert your PyTorch / TF model to a portable graph that runs on ONNX Runtime (CPU, GPU, edge, mobile). Useful when you want to deploy outside the original training framework, or want graph-level optimisations.

TorchScript vs torch.compile vs ONNX. TorchScript: PyTorch's classic serialisable graph format, now in maintenance mode. torch.compile: the PyTorch 2.0+ JIT compiler; it speeds up inference as well as training, but still needs a Python runtime. ONNX: framework-agnostic. For pure inference on PyTorch, torch.compile or ONNX Runtime are usually fastest.

LLM serving stack. vLLM (UC Berkeley) and TGI (Hugging Face) are the open-source standards. Both provide PagedAttention to avoid wasted KV memory, continuous batching to interleave concurrent requests, and KV cache management. Hosted APIs (OpenAI, Anthropic, Replicate) abstract all of this away.

Quantisation for serving. Store weights as int8 or int4. Models are often 2–4× smaller and faster with little accuracy loss. For LLMs: bitsandbytes, AutoAWQ, GPTQ. For tabular / vision models: torch.ao.quantization. See Quantization & Distillation.
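
The arithmetic behind int8 weights can be shown in a few lines. This is a sketch of symmetric per-tensor quantisation under simple assumptions; the function names are hypothetical, and real libraries typically use per-channel or per-group scales plus calibration data.

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Symmetric per-tensor quantisation: w is approximated by q * scale, q in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q: list[int], scale: float) -> list[float]:
    """Map the integer codes back to approximate float weights."""
    return [v * scale for v in q]

weights = [0.02, -0.51, 0.33, 1.27, -1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Each weight now needs one byte instead of four (fp32), and the rounding error per weight is bounded by half the scale.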

Multi-model serving. Modern serving frameworks let you load multiple models in one process, share GPU memory, and route requests based on the URL. Useful when you have many small models — or different fine-tunes of one base.

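import bentoml
import torch

# One-off export step (run separately after training; `model` is the trained nn.Module):
#   bentoml.pytorch.save_model("credit-risk", model)

# Service definition (credit_risk.py)
@bentoml.service(
    resources={"gpu": 1},
    traffic={"timeout": 5},  # seconds
    workers=1,
)
class CreditRiskService:
    def __init__(self):
        # Load once per worker, not per request
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model = bentoml.pytorch.load_model("credit-risk:latest").to(self.device).eval()

    @bentoml.api(
        batchable=True,
        batch_dim=0,
        max_batch_size=64,
        max_latency_ms=50,
    )
    def predict(self, features: list[list[float]]) -> list[float]:
        # Sync method: the forward pass is blocking work and would stall an event loop
        x = torch.tensor(features, dtype=torch.float32, device=self.device)
        with torch.inference_mode():
            scores = self.model(x).sigmoid().reshape(-1)  # one score per input row
        return scores.tolist()

# bentoml serve credit_risk:CreditRiskService --port 3000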

LLM serving deep-dive, autoscaling, and SLA engineering

LLM serving throughput

$$ \text{tok/s} \approx \frac{\text{compute capacity}}{\text{KV cache pressure} + \text{prefill cost}} $$
  • KV cache dominates memory for long contexts
  • Continuous batching mixes prefill and decode → big throughput gain
  • PagedAttention allocates KV memory in fixed-size blocks, so little is lost to fragmentation
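
A back-of-the-envelope calculation shows why the KV cache dominates memory at long contexts. The sketch below assumes a Llama-3-8B-like shape (32 layers, 8 KV heads, head dimension 128, fp16 values); these numbers are illustrative assumptions rather than figures from the text, and `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int = 1,
    bytes_per_value: int = 2,  # fp16 / bf16
) -> int:
    """Bytes needed to cache keys and values for every layer and token."""
    # The leading 2 accounts for storing both a key and a value vector.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_value

per_token = kv_cache_bytes(32, 8, 128, seq_len=1)
full_context = kv_cache_bytes(32, 8, 128, seq_len=8192)
print(per_token / 1024, "KiB per token")
print(full_context / 2**30, "GiB per 8192-token sequence")
```

Under these assumptions a single 8192-token sequence needs about 1 GiB of cache, so the number of concurrent sequences is limited by memory well before it is limited by compute.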

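PagedAttention (vLLM). Introduced by Kwon et al. (2023). The KV cache is stored in fixed-size blocks mapped through a block table, like pages in virtual memory. This avoids the fragmentation and over-reservation of contiguous-allocation schemes. The paper reports a 2–4× throughput improvement over earlier serving systems at similar latency.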
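
The bookkeeping side of the idea can be sketched as a toy block allocator. This is not vLLM's implementation: `PagedKVAllocator` is a hypothetical class, the block size of 16 tokens is an illustrative choice, and the attention kernel that reads through the block table is omitted entirely.

```python
class PagedKVAllocator:
    """Toy block table: each sequence maps to a list of fixed-size physical blocks."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[str, list[int]] = {}  # sequence id -> physical block ids
        self.lengths: dict[str, int] = {}             # sequence id -> tokens stored

    def append_token(self, seq_id: str) -> None:
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:  # first token, or the current block is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks; a real scheduler would preempt")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.lengths[seq_id] = n + 1

    def free(self, seq_id: str) -> None:
        """Return all of a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

alloc = PagedKVAllocator(num_blocks=4, block_size=16)
for _ in range(17):
    alloc.append_token("a")  # 17 tokens need 2 blocks
for _ in range(5):
    alloc.append_token("b")  # 5 tokens need 1 block
blocks_a = len(alloc.block_tables["a"])
blocks_b = len(alloc.block_tables["b"])
alloc.free("a")
free_after = len(alloc.free_blocks)
print(blocks_a, blocks_b, free_after)
```

Because blocks are allocated on demand, a sequence wastes at most one partially filled block, instead of reserving memory for its maximum possible length up front.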

Continuous batching. Don't wait for a whole batch to finish before starting a new one: feed in new requests as soon as old ones complete. Combined with prefill / decode separation, this gives much higher GPU utilisation than naive batching.
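
A small simulation illustrates the gain. It is a simplified sketch: it counts decode steps only, ignores prefill cost, and assumes every step costs the same regardless of how many slots are busy. The function names and the request lengths are made up for the example.

```python
from collections import deque

def static_batching_steps(lengths: list[int], slots: int) -> int:
    """Each batch runs until its longest request finishes; idle slots are wasted."""
    steps = 0
    for i in range(0, len(lengths), slots):
        steps += max(lengths[i:i + slots])
    return steps

def continuous_batching_steps(lengths: list[int], slots: int) -> int:
    """A finished request frees its slot, which is refilled before the next step."""
    waiting = deque(lengths)
    active: list[int] = []  # tokens still to generate for each running request
    steps = 0
    while waiting or active:
        while waiting and len(active) < slots:
            active.append(waiting.popleft())
        active = [r - 1 for r in active]       # one decode step for every running request
        active = [r for r in active if r > 0]  # drop the ones that just finished
        steps += 1
    return steps

lengths = [2, 10, 3, 2, 9, 2, 2, 2]  # output tokens per request
static_steps = static_batching_steps(lengths, slots=4)
continuous_steps = continuous_batching_steps(lengths, slots=4)
print(static_steps, continuous_steps)
```

With mixed output lengths the static scheme spends most of each batch waiting on its longest request, while the continuous scheme keeps the slots busy.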

Speculative decoding. A small "draft" model proposes k tokens; the big "target" model verifies them in one forward pass. Accept the prefix the target agrees with; resample from the first position where it disagrees. Typically around a 2× speed-up, and with the standard accept/reject rule the output distribution is exactly the target model's, so there is no accuracy loss. Built into vLLM, TGI, llama.cpp.
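
The accept/reject rule can be sketched with toy distributions. This is an illustration under strong simplifying assumptions: the distributions `target` and `draft` are made up and context-free (real models condition on the prefix), and `speculative_step` is a hypothetical helper, not any library's API.

```python
import random

def speculative_step(p: dict[str, float], q: dict[str, float], k: int,
                     rng: random.Random) -> list[str]:
    """One round of speculative sampling; p is the target distribution, q the draft."""
    draft_tokens = list(q)
    out: list[str] = []
    for _ in range(k):
        x = rng.choices(draft_tokens, weights=[q[t] for t in draft_tokens])[0]  # draft proposes
        if rng.random() < min(1.0, p.get(x, 0.0) / q[x]):                        # target verifies
            out.append(x)
            continue
        # Rejected: resample from the residual max(0, p - q), then stop this round.
        residual = {t: max(0.0, p[t] - q.get(t, 0.0)) for t in p}
        out.append(rng.choices(list(residual), weights=list(residual.values()))[0])
        return out
    # All k drafts accepted: the verification pass also yields one extra target token.
    out.append(rng.choices(list(p), weights=list(p.values()))[0])
    return out

rng = random.Random(0)
target = {"a": 0.7, "b": 0.2, "c": 0.1}
draft = {"a": 0.4, "b": 0.4, "c": 0.2}
first_tokens = [speculative_step(target, draft, k=4, rng=rng)[0] for _ in range(20000)]
freq_a = first_tokens.count("a") / len(first_tokens)
print(freq_a)  # close to the target probability of "a"
```

Even though the draft proposes "a" only 40% of the time, the emitted tokens follow the target distribution; the draft's quality affects only how many tokens are accepted per round, and therefore the speed-up.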

Autoscaling. KServe, BentoCloud, Modal, RunPod. Scale replicas based on QPS, queue depth, or GPU utilisation. Pre-warmed pools for low cold-start latency. Scale-to-zero for cost-sensitive workloads.
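
A common scaling rule is proportional: scale the replica count by the ratio of the observed per-replica metric to its target, as the Kubernetes Horizontal Pod Autoscaler does. The sketch below is a simplified version with a hypothetical function name and made-up numbers; real autoscalers add stabilisation windows and tolerances to avoid flapping.

```python
import math

def desired_replicas(current_replicas: int, observed_metric: float, target_metric: float,
                     min_replicas: int = 0, max_replicas: int = 10) -> int:
    """Proportional rule: ceil(current * observed / target), clamped to [min, max]."""
    if observed_metric <= 0:
        return min_replicas  # no load: fall back to the floor (0 means scale-to-zero)
    # Treat zero running replicas as one so traffic can wake the service up.
    raw = math.ceil(max(current_replicas, 1) * observed_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))

# Metric here is QPS per replica, with a target of 60 QPS per replica.
scale_up = desired_replicas(current_replicas=4, observed_metric=90, target_metric=60)
hold = desired_replicas(current_replicas=3, observed_metric=50, target_metric=60)
to_zero = desired_replicas(current_replicas=2, observed_metric=0, target_metric=60)
print(scale_up, hold, to_zero)
```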

SLA engineering. p50, p95, p99 latency. Saturation thresholds. Error budgets. Most ML APIs have "p99 latency under X ms" as a hard requirement. Monitor + alert on it; load-test with realistic traffic distributions.
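
Tail percentiles behave very differently from the mean, which is why SLAs are written against them. A minimal sketch using the nearest-rank definition; the latency samples are made up for illustration and `percentile` is a hypothetical helper.

```python
import math

def percentile(samples: list[float], p: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# 20 request latencies in ms: mostly fast, with two slow outliers.
latencies_ms = [12, 14, 13, 15, 11, 13, 250, 14, 12, 16,
                13, 14, 15, 13, 12, 14, 13, 900, 14, 13]
p50, p95, p99 = (percentile(latencies_ms, p) for p in (50, 95, 99))
mean = sum(latencies_ms) / len(latencies_ms)
print(p50, p95, p99, mean)
```

Here the median looks healthy while p95 and p99 expose the outliers that a mean-based dashboard would blur.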

Edge inference. ONNX Runtime, TFLite, CoreML, MediaPipe. Quantisation is essential (int8 or int4). Many vision and small NLP models run on phones with sub-100ms latency.

Streaming responses. For LLMs, stream tokens as they're generated, using Server-Sent Events or HTTP chunked transfer. This improves perceived latency dramatically and is standard in production chat APIs.
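
On the wire, a Server-Sent Events stream is just text frames of the form `data: ...` separated by blank lines. The sketch below formats tokens that way; `sse_events` is a hypothetical helper, the JSON payload shape is an assumption, and the `[DONE]` sentinel is a convention used by some APIs rather than part of the SSE format.

```python
import json
from typing import Iterable, Iterator

def sse_events(tokens: Iterable[str]) -> Iterator[str]:
    """Wrap each generated token in one SSE frame."""
    for tok in tokens:
        yield f"data: {json.dumps({'token': tok})}\n\n"
    yield "data: [DONE]\n\n"  # end-of-stream marker (convention, not part of SSE)

chunks = list(sse_events(["Hel", "lo", "!"]))
for chunk in chunks:
    print(chunk, end="")
```

In a FastAPI app, a generator like this could be returned through `StreamingResponse` with `media_type="text/event-stream"`, so that each frame is flushed to the client as soon as its token is produced.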

# vLLM: serve an LLM with PagedAttention + continuous batching
# pip install vllm
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    gpu_memory_utilization=0.85,  # fraction of GPU memory for weights + KV cache
    max_model_len=8192,
    enable_chunked_prefill=True,
)

prompt = "Explain dynamic batching in one sentence."  # example prompt
sampling = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=512)
outputs = llm.generate([prompt], sampling)  # one result object per prompt
print(outputs[0].outputs[0].text)

# Or run as a server:
# vllm serve meta-llama/Meta-Llama-3-8B-Instruct --port 8000
# OpenAI-compatible endpoint at /v1/chat/completions