FastAPI, BentoML, Triton, vLLM — how a trained model becomes an endpoint your product can call.
Key idea
Training and serving are different problems. Training optimises for throughput on a fixed dataset. Serving optimises for latency on one request at a time (or small batches). The right tool, the right format, and the right batching strategy all change. Don't ship a 100-line Flask app; use one of the four standard serving stacks.
The four standard stacks. FastAPI + uvicorn for "I want a Python endpoint that calls my model" (small models, hobby projects). BentoML for "I want a packaged, deployable model" (production tabular / sklearn / small DL). Triton for "I want NVIDIA-optimised, dynamic batching, multi-framework". vLLM (or TGI) for "I'm serving an LLM and need throughput".
Three big serving decisions. Sync vs async (synchronous request/response when the caller needs an answer in milliseconds; async jobs or streaming when the caller can wait or consume partial output). Single-request vs batched (batching trades per-request latency for throughput). CPU vs GPU (CPU for small models, GPU for transformers / large CNNs).
Stack by use-case
Small Python model: FastAPI + uvicorn
Tabular / sklearn / packaged: BentoML
GPU model with dynamic batching: Triton
LLM: vLLM, TGI, or a hosted API
Edge / mobile: ONNX Runtime, TFLite, CoreML
Common pitfalls
Loading the model on every request — load once at startup
Hardcoded paths — use a registry alias instead
Synchronous I/O blocking GPU — use async or thread pool
No request validation — bad inputs crash the server
No timeout — a slow GPU stalls everything
from fastapi import FastAPI
from mlflow.tracking import MlflowClient
from pydantic import BaseModel, Field
import mlflow
import torch

app = FastAPI()

# Load the model once at startup, not on every request
MODEL = mlflow.pytorch.load_model("models:/credit-risk@production").eval()
# A torch module has no .version attribute, so resolve the alias in the registry
MODEL_VERSION = str(
    MlflowClient().get_model_version_by_alias("credit-risk", "production").version
)

class Request(BaseModel):
    # Exactly 16 features; any other length is rejected with a 422
    features: list[float] = Field(..., min_length=16, max_length=16)

class Response(BaseModel):
    score: float
    model_version: str

@app.post("/predict", response_model=Response)
def predict(req: Request):
    x = torch.tensor(req.features, dtype=torch.float32).unsqueeze(0)  # shape (1, 16)
    with torch.inference_mode():
        score = MODEL(x).sigmoid().item()
    return Response(score=score, model_version=MODEL_VERSION)

@app.get("/health")
def health():
    return {"status": "ok"}
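The "blocking" and "no timeout" pitfalls can be handled together. The sketch below uses only the standard library and is one possible pattern, not the only one: `slow_model` is a hypothetical stand-in for a blocking model call, and `predict_with_timeout` is an illustrative helper name.

```python
import asyncio
import time
from typing import Optional


def slow_model(features: list, delay_s: float) -> float:
    """Hypothetical stand-in for a blocking model call."""
    time.sleep(delay_s)
    return sum(features) / len(features)


async def predict_with_timeout(
    features: list, delay_s: float, timeout_s: float
) -> Optional[float]:
    """Run the blocking call in a worker thread; return None on timeout."""
    try:
        return await asyncio.wait_for(
            asyncio.to_thread(slow_model, features, delay_s), timeout=timeout_s
        )
    except asyncio.TimeoutError:
        return None  # a web handler would map this to an HTTP 504


fast = asyncio.run(predict_with_timeout([1.0, 3.0], delay_s=0.01, timeout_s=1.0))
slow = asyncio.run(predict_with_timeout([1.0, 3.0], delay_s=0.5, timeout_s=0.05))
```

The timeout releases the request, not the worker thread: the slow call still runs to completion in the background, so a real server also needs a bound on concurrent work.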
Want dynamic batching, ONNX, & the LLM serving stack?
Profile each piece — the bottleneck is rarely where you think
Pre/post-processing often dominates small-model latency
Queue time is what dynamic batching trades against
Dynamic batching. The server accumulates in-flight requests until a small time window (10–50 ms) elapses or a maximum batch size is reached, runs the model once on the batch, then splits the outputs back to the callers. Triton, BentoML, and most serving frameworks support this natively. Huge throughput win for GPU-served models with bursty traffic.
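To make the mechanism concrete, here is a minimal asyncio sketch of a dynamic batcher. It is an illustration under simplifying assumptions, not how Triton or BentoML implement it: `DynamicBatcher`, `batch_fn`, and the parameter names are hypothetical, and error handling is omitted.

```python
import asyncio


class DynamicBatcher:
    """Collect requests until max_batch_size or max_wait_s, then run once."""

    def __init__(self, batch_fn, max_batch_size=8, max_wait_s=0.02):
        self.batch_fn = batch_fn  # list of inputs -> list of outputs
        self.max_batch_size = max_batch_size
        self.max_wait_s = max_wait_s
        self.queue = asyncio.Queue()
        self.batch_sizes = []  # recorded for illustration only

    async def submit(self, item):
        """Called per request: enqueue the input and wait for its output."""
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((item, fut))
        return await fut

    async def run(self):
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self.queue.get()]  # block until the first request
            deadline = loop.time() + self.max_wait_s
            while len(batch) < self.max_batch_size:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            outputs = self.batch_fn([item for item, _ in batch])  # one model call
            self.batch_sizes.append(len(batch))
            for (_, fut), out in zip(batch, outputs):
                fut.set_result(out)  # split outputs back to the callers


async def demo():
    batcher = DynamicBatcher(lambda xs: [x * 2 for x in xs], max_batch_size=4)
    worker = asyncio.create_task(batcher.run())
    results = await asyncio.gather(*(batcher.submit(i) for i in range(10)))
    worker.cancel()
    return results, batcher.batch_sizes


results, batch_sizes = asyncio.run(demo())
```

Ten concurrent requests are served with far fewer than ten model calls; the price is that the last, partially filled batch waits out the time window before it runs.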
ONNX. Open Neural Network Exchange: convert your PyTorch / TF model to a portable graph that runs on ONNX Runtime (CPU, GPU, edge, mobile). Useful when you want to deploy outside the original training framework, or want graph-level optimisations.
TorchScript vs torch.compile vs ONNX. TorchScript: PyTorch's classic graph format, no longer under active development. torch.compile: the PyTorch 2.0+ JIT compiler; speeds up both training and inference without an export step. ONNX: framework-agnostic. For pure inference on PyTorch, torch.compile or ONNX Runtime is usually fastest; benchmark both on your own model.
LLM serving stack. vLLM (UC Berkeley) and TGI (Hugging Face) are the open-source standards. Both provide paged KV-cache management to avoid memory waste and continuous batching to mix concurrent requests. Hosted APIs (OpenAI, Anthropic, Replicate) abstract all of this.
Quantisation for serving. int8 / int4 weights. Often 2–4× smaller than fp16 and faster, with little accuracy loss. bitsandbytes, AutoAWQ, GPTQ for LLMs. torch.ao.quantization for tabular / vision. See Quantization & Distillation.
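A rough sketch of what int8 weight quantisation does, using symmetric per-tensor scaling in pure Python. Real libraries use per-channel or per-group scales and calibrated schemes; the function names and weights here are made up for illustration.

```python
def quantize_int8(weights: list) -> tuple:
    """Symmetric per-tensor quantisation: w is approximated by scale * q."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the int8 range used here
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q: list, scale: float) -> list:
    """Recover approximate float weights from the integer codes."""
    return [scale * v for v in q]


weights = [0.50, -1.27, 0.03, 0.99]  # made-up example weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight is stored in one byte instead of two (fp16) or four (fp32), and the rounding error per weight is at most half a quantisation step, i.e. `scale / 2`.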
Multi-model serving. Modern serving frameworks let you load multiple models in one process, share GPU memory, and route requests based on the URL. Useful when you have many small models — or different fine-tunes of one base.
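import bentoml
import torch

# Export the trained model to the BentoML model store
# (run once in the training script, where `model` is the trained torch.nn.Module)
saved = bentoml.pytorch.save_model("credit-risk", model)

# Service definition
@bentoml.service(
    resources={"gpu": 1},
    traffic={"timeout": 5},  # seconds
    workers=1,
)
class CreditRiskService:
    def __init__(self):
        # Load once per worker, not per request
        self.model = bentoml.pytorch.load_model("credit-risk:latest").eval()

    @bentoml.api(
        batchable=True,
        batch_dim=0,
        max_batch_size=64,
        max_latency_ms=50,
    )
    # Plain `def`: the blocking torch call runs outside the event loop
    def predict(self, features: list[list[float]]) -> list[float]:
        x = torch.tensor(features)
        with torch.inference_mode():
            scores = self.model(x).sigmoid()
        return scores.reshape(-1).tolist()  # one score per input row

# bentoml serve credit_risk:CreditRiskService --port 3000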
Continuous batching mixes prefill and decode → big throughput gain
PagedAttention allocates the KV cache in small fixed-size blocks, so memory is not lost to fragmentation
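PagedAttention (vLLM). Kwon et al. (2023). The KV cache is stored in fixed-size blocks that need not be contiguous, like pages in virtual memory. This avoids the fragmentation and over-reservation of contiguous-allocation schemes. The paper reports 2–4× higher throughput than prior serving systems at similar latency.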
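The bookkeeping side of the idea can be sketched as a toy block allocator. This models only the block table, not the attention kernel, and `PagedKVAllocator` is a hypothetical name, not vLLM's API.

```python
class PagedKVAllocator:
    """Toy block allocator: each sequence maps to a list of fixed-size blocks."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.lengths = {}       # seq_id -> number of tokens stored

    def append_token(self, seq_id: str) -> None:
        """Reserve space for one more token, allocating a block on demand."""
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:  # last block is full, or none yet
            if not self.free_blocks:
                raise MemoryError("no free KV blocks; caller must queue or preempt")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1

    def free(self, seq_id: str) -> None:
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


alloc = PagedKVAllocator(num_blocks=8, block_size=16)
for _ in range(40):
    alloc.append_token("a")  # 40 tokens need ceil(40 / 16) = 3 blocks
for _ in range(5):
    alloc.append_token("b")  # 5 tokens need 1 block
blocks_a = len(alloc.block_tables["a"])
alloc.free("a")
free_after = len(alloc.free_blocks)
```

Memory is reserved one block at a time as a sequence grows, so the waste per sequence is bounded by one partially filled block rather than by the maximum sequence length.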
Continuous batching. Don't wait for a batch to finish before starting a new one — feed in new requests as old ones complete. Combined with prefill / decode separation, much higher GPU util than naive batching.
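A small simulation shows why this helps. It counts decode steps only and assumes every request is already queued; the request lengths are made up, and both function names are illustrative.

```python
def static_batching_steps(lengths: list, batch_size: int) -> int:
    """Each batch runs until its longest request finishes."""
    return sum(
        max(lengths[i : i + batch_size]) for i in range(0, len(lengths), batch_size)
    )


def continuous_batching_steps(lengths: list, batch_size: int) -> int:
    """A freed slot is refilled from the queue on the very next step."""
    waiting = list(lengths)
    running = []  # remaining tokens for each in-flight request
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.pop(0))
        steps += 1  # one decode step for every running request
        running = [r - 1 for r in running if r > 1]
    return steps


lengths = [100, 5, 5, 5, 100, 5, 5, 5]  # output tokens per request
static_steps = static_batching_steps(lengths, batch_size=4)
continuous_steps = continuous_batching_steps(lengths, batch_size=4)
```

On this toy workload the static schedule needs 200 steps and the continuous one 105, because short requests no longer hold slots idle while a long neighbour finishes.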
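Speculative decoding. A small "draft" model proposes k tokens; the big "target" model verifies them in one forward pass. Accept the prefix the target agrees with; resample from the first position where it disagrees. Speed-ups of around 2× are typical and depend on how often drafts are accepted; with the standard accept/resample rule the output distribution is exactly the target model's, so there is no accuracy loss. Built into vLLM, TGI, llama.cpp.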
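The "no accuracy loss" claim can be checked on a toy example. The sketch below computes, for a single position, the exact distribution of the emitted token under the usual rule (accept draft token x with probability min(1, p(x)/q(x)), otherwise resample from the normalised positive part of p - q). The probabilities are invented for illustration.

```python
def residual_distribution(p: list, q: list) -> list:
    """Distribution to resample from after a rejection: normalise max(0, p - q)."""
    diff = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(diff)
    return [d / total for d in diff] if total > 0 else list(p)


def output_distribution(p: list, q: list) -> list:
    """Exact distribution of the emitted token for one speculative step."""
    # Probability that the draft proposes token i AND it is accepted
    accept = [qi * min(1.0, pi / qi) if qi > 0 else 0.0 for pi, qi in zip(p, q)]
    reject_prob = 1.0 - sum(accept)
    residual = residual_distribution(p, q)
    return [a + reject_prob * r for a, r in zip(accept, residual)]


target = [0.6, 0.3, 0.1]  # p: target model's next-token probabilities
draft = [0.3, 0.5, 0.2]   # q: draft model's next-token probabilities
emitted = output_distribution(target, draft)
acceptance_rate = sum(min(pi, qi) for pi, qi in zip(target, draft))
```

The emitted distribution equals the target's, while the acceptance rate (the overlap between the two distributions) determines how much speed-up the draft actually buys.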
Autoscaling. KServe, BentoCloud, Modal, RunPod. Scale replicas based on QPS, queue depth, or GPU utilisation. Pre-warmed pools to cut cold-start latency. Scale-to-zero for cost-sensitive workloads.
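A QPS-based scaling rule reduces to a short calculation. This is a simplified sketch, not the policy of any listed platform; the function name, the 70% utilisation target, and the bounds are illustrative assumptions.

```python
import math


def desired_replicas(
    observed_qps: float,
    qps_per_replica: float,
    target_utilisation: float = 0.7,
    min_replicas: int = 0,
    max_replicas: int = 20,
) -> int:
    """Replicas needed to serve observed_qps with headroom, within bounds."""
    if observed_qps <= 0:
        return min_replicas  # min_replicas=0 gives scale-to-zero
    needed = math.ceil(observed_qps / (qps_per_replica * target_utilisation))
    return max(min_replicas, 1, min(max_replicas, needed))


steady = desired_replicas(observed_qps=450, qps_per_replica=50)
idle = desired_replicas(observed_qps=0, qps_per_replica=50)
spike = desired_replicas(observed_qps=5000, qps_per_replica=50)
```

Running each replica below full capacity leaves headroom for bursts that arrive before new replicas have started.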
SLA engineering. p50, p95, p99 latency. Saturation thresholds. Error budgets. Most ML APIs have "p99 latency under X ms" as a hard requirement. Monitor + alert on it; load-test with realistic traffic distributions.
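Tail percentiles are easy to compute from raw samples. The sketch below uses the nearest-rank definition (monitoring systems often interpolate or use histograms instead), and the latencies are made up.

```python
import math


def percentile(samples: list, pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


latencies_ms = [12, 15, 11, 14, 13, 16, 12, 240, 13, 15]  # one slow outlier
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
mean = sum(latencies_ms) / len(latencies_ms)
```

Here the median is 13 ms while p99 is 240 ms: a single slow request is invisible at p50 and blurred by the mean, which is why SLAs are written against the tail.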
Edge inference. ONNX Runtime, TFLite, CoreML, MediaPipe. Quantisation is essential (int8 or int4). Many vision and small NLP models run on phones with sub-100ms latency.
Streaming responses. For LLMs, stream tokens as they're generated, over Server-Sent Events or HTTP chunked transfer. This improves perceived latency dramatically and is standard in production chat APIs.
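On the wire, a Server-Sent Events stream is just text frames separated by blank lines. The sketch below formats tokens as SSE frames; the JSON payload shape is an assumption for illustration, and the final `[DONE]` sentinel follows the convention of OpenAI-style APIs rather than the SSE standard.

```python
import json
from typing import Iterable, Iterator


def sse_events(tokens: Iterable[str]) -> Iterator[str]:
    """Yield one SSE frame per token, then an end-of-stream sentinel."""
    for token in tokens:
        # Each frame is "data: <payload>" followed by a blank line
        yield f"data: {json.dumps({'token': token})}\n\n"
    yield "data: [DONE]\n\n"


frames = list(sse_events(["Hel", "lo", "!"]))
```

In a web framework this generator would be returned as a streaming response with the `text/event-stream` content type, so the client can render each token as soon as its frame arrives.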
# vLLM: serve an LLM with PagedAttention + continuous batching
# pip install vllm
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    gpu_memory_utilization=0.85,
    max_model_len=8192,
    enable_chunked_prefill=True,
)
sampling = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=512)
prompt = "Explain dynamic batching in one paragraph."  # example prompt
outputs = llm.generate([prompt], sampling)
print(outputs[0].outputs[0].text)

# Or run as a server:
# vllm serve meta-llama/Meta-Llama-3-8B-Instruct --port 8000
# OpenAI-compatible endpoint at /v1/chat/completions