A probability distribution over functions. Predict with calibrated uncertainty, not just a point estimate.
Key idea
Instead of one guess, give me a range. A Gaussian process predicts both a value and how confident it is about that value. Where you have lots of nearby data, it's confident; where you don't, it widens its uncertainty.
Click to add observations · shift+click to remove · slide ℓ to change how far each point's influence reaches
ℓ = 0.25σₙ = 0.05
The bold line is the posterior mean; the shaded band is the ±2σ uncertainty. The thin lines are four sample functions drawn from the posterior — they show what kinds of functions are still consistent with your data. Clear all observations and you're seeing the prior — the GP's belief before any data. Add a point and watch the band collapse to it (within σₙ); the uncertainty inflates the further you go.
Most regression models give you one number per prediction. A Gaussian process gives you a number and an error bar — and the error bar is data-aware: it shrinks near your training points and grows when you predict far away from anything you've seen.
That uncertainty is the whole point. GPs are slow and don't scale to big data, but when you have a few hundred points and you care about knowing what you don't know — design of experiments, Bayesian optimization, robotics — they're the right tool.
Reach for it when
You have small data (≲1000 points)
Calibrated uncertainty matters
You want a non-parametric model that adapts to the data
You're doing Bayesian optimization
Skip it when
You have a lot of data (GPs are O(N³))
You only need point predictions, not uncertainty
The signal is well-behaved and a simpler model would do
You're working with images or text (use neural nets)
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF
gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0))
gp.fit(X_train, y_train)
# Predict with uncertainty
y_mean, y_std = gp.predict(X_test, return_std=True)
Want to see what's actually happening under the hood?
k(x,x')kernel — encodes how similar outputs should be when inputs are similar
A GP is a distribution over functions such that any finite set of function values is jointly Gaussian. The kernel determines what kind of functions are likely — smooth, periodic, rough, etc.
The kernel k is the heart of the model. The most common one — the RBF kernel k(x, x') = exp(−‖x − x'‖² / 2ℓ²) — assumes smooth functions; nearby inputs produce similar outputs, with "nearby" controlled by the length scale ℓ.
Given training data, the posterior over f(x*) at a test point is also Gaussian, in closed form:
Kθcovariance matrix induced by kernel hyperparameters θ
fit term — yTK-1y penalizes poor data fit
complexity term — log|K| penalizes overly-flexible kernels
The log marginal likelihood automatically trades off fit vs. complexity — a built-in Occam's razor. Optimize it w.r.t. kernel hyperparameters to set length scales, amplitudes, and noise simultaneously.
Scaling. The cubic cost of inverting the N×N kernel matrix is the headline limitation. Sparse approximations represent the GP via m ≪ N inducing points and reduce cost to O(Nm²). FITC and DTC are early variants; SVGP (variational, Hensman et al.) is the modern workhorse and scales to millions of points via stochastic optimization.
Kernel composition. Sums and products of valid kernels are valid kernels. This lets you encode structure: RBF + Periodic for time series with seasonality, RBF × Linear for trends, etc. Automatic kernel discovery (Duvenaud et al.) tries to compose these structures from data.
Connection to neural networks. An infinitely-wide neural network with random Gaussian weights converges to a GP (Neal, 1996). The NTK (neural tangent kernel) describes the GP that trained networks effectively act as in the infinite-width limit — connecting GPs to deep learning theory.
Likelihood freedom. The closed-form posterior requires Gaussian likelihood. For classification or other non-Gaussian likelihoods, approximate inference is needed: Laplace, EP (expectation propagation), or variational methods.
Reach for it when
Bayesian optimization / experimental design
Spatial statistics (kriging is a GP)
You want interpretable hyperparameters (length scale = "how far is similar?")
Multi-fidelity modelling — cheap and expensive evaluations
Skip it when
You can't choose a sensible kernel (no inductive bias)
You need extrapolation — RBF kernels decay to the prior mean
Strict inference latency — sparse GPs are still slower than NNs
Heavy-tailed noise — Gaussian likelihood is too sensitive to outliers
import gpytorch, torch
class ExactGP(gpytorch.models.ExactGP):
def __init__(self, X, y, likelihood):
super().__init__(X, y, likelihood)
self.mean = gpytorch.means.ConstantMean()
self.cov = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())
def forward(self, x):
return gpytorch.distributions.MultivariateNormal(self.mean(x), self.cov(x))
likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = ExactGP(X_train, y_train, likelihood)
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)
# Optimize marginal likelihood via Adam
model.train(); likelihood.train()
opt = torch.optim.Adam(model.parameters(), lr=0.1)
for _ in range(200):
opt.zero_grad()
loss = -mll(model(X_train), y_train)
loss.backward()
opt.step()