Find the line (or curve) that splits two classes with the widest margin possible.
Key idea
Find the boundary with the widest possible gap. If you can draw multiple straight lines separating two classes, the SVM picks the one that leaves the most "elbow room" on either side. The points right on the edge of the gap are the support vectors — they alone determine the boundary.
Toggle the kernel · slide C (linear) or γ (RBF) · watch the boundary bend and the support vectors light up
C = 1.00γ = 3.0
In linear mode, dashed lines mark the margin at f = ±1; outlined points are the support vectors. Drop the slider for C and watch the margin widen — the SVM tolerates more violations in exchange for a fatter buffer. Switch to RBF on the Concentric dataset and the boundary bends into a circle; raise γ to make it tighter (and more prone to overfit), lower it for a smoother fit.
Imagine two clusters of points on a 2D plane, one class on the left, one on the right. A regular classifier might draw any line between them. An SVM picks the line that's maximally far from both — the "fattest" possible separator. The intuition: leaving a wide margin should generalize better to new points than squeezing a line right up against the data.
When the classes aren't linearly separable in the original feature space, the kernel trick lets SVMs find a separator in an implicit higher-dimensional space without ever explicitly computing the coordinates. That's how SVMs handle non-linear problems.
Reach for it when
Small to medium data (≲ 50k points)
High-dimensional features (text, gene expression)
You want a strong out-of-the-box classifier with sensible defaults
Clean binary classification where the boundary is the interesting object
Skip it when
Large data — SVMs scale poorly (≳ O(N²))
You need calibrated probabilities (SVM scores aren't probabilities)
You don't have a good kernel guess and don't want to grid-search
w, bnormal vector and offset of the separating hyperplane
yi ∈ {−1, +1}class label
max(0, 1 − …)hinge loss — penalises points inside the margin or misclassified
Cregularization knob — large C = hard margin, small = wide margin tolerating mistakes
The max-margin idea. For a linearly separable problem, find w and b such that yi(wᵀxi + b) ≥ 1 for all training points, and minimise ‖w‖. Minimising ‖w‖ is equivalent to maximising the margin width 2/‖w‖. Only the points exactly on the margin (the "support vectors") matter; the rest could be deleted without changing the answer.
Soft margin. Real data isn't perfectly separable. The hinge loss allows some points inside the margin or misclassified, paying a linear penalty. The C hyperparameter trades margin width against training error.
The kernel trick. SVMs depend on the data only through inner products xiᵀxj. Replace these with a kernel function K(xi, xj) = φ(xi)ᵀφ(xj), where φ maps to a higher-dimensional space — without ever computing φ explicitly. Common kernels:
• Linear:K = xᵀx'. Fast baseline. Use for high-dim sparse data (text).
• Polynomial:K = (xᵀx' + c)d. Captures feature interactions up to degree d.
• RBF (Gaussian):K = exp(−γ‖x − x'‖²). The default. Infinite-dim feature space.
• Sigmoid: historical; rarely used today.
Scaling matters. All distance-based kernels (RBF, polynomial) are sensitive to feature scales. Always standardize first.
Reach for it when
Small or medium dataset where O(N²) training is acceptable
You want a strong baseline with limited hyperparameter tuning
High-dimensional sparse features (text classification was SVMs' golden era)
You want non-linear decision boundaries without engineering features
Skip it when
Data is large — SGD-trained linear models scale much better
You need probability outputs (Platt scaling exists but is post-hoc)
Multi-class with many classes — one-vs-rest gets expensive
You want feature importance / explanations — SVMs are opaque
Decision: f(x) = sign(Σ αi yi K(xi, x) + b) — only kernels with support vectors
Primal vs. dual. The primal optimises over w (size = number of features). The dual optimises over α (size = number of training points). Use the primal when features ≪ samples; use the dual when features ≫ samples (where the kernel trick lives). The KKT conditions give the explicit relationship: w = Σ αiyiφ(xi), summing only over support vectors.
Kernel choice as inductive bias. A kernel encodes "what does similar mean?". RBF treats all features symmetrically; if some matter more, scale them differently. Polynomial encodes specific interaction orders. String kernels, graph kernels, and learned kernels (deep kernels, NTK) extend SVMs to structured data — though for most of these, modern neural approaches have won.
Probability calibration. SVM decision scores aren't probabilities. Platt scaling fits a logistic regression on the SVM outputs using a held-out set: P(y=1|x) ≈ σ(A·f(x) + B). sklearn does this when you pass probability=True (which slows training noticeably). For honest probabilities, prefer logistic regression directly.
Scaling SVMs. Standard SVMs are O(N²)–O(N³). For larger data: (1) LinearSVC uses a different solver and scales to millions; (2) approximate kernels (Nyström, random Fourier features) project data into a finite-dim space, then train a linear model; (3) SGDClassifier with hinge loss = linear SVM via stochastic gradient descent, scales to streams.
Why SVMs lost. Once feature engineering was replaced by representation learning (deep nets) and tabular wins went to gradient boosting, SVMs lost their main strongholds. They're still relevant for: small-data classification with strong kernel priors, one-class anomaly detection (one-class SVM), and as a theoretical object — VC-dimension and margin bounds came from SVM theory.
Reach for it when
Small-data classification with domain-knowledge kernels (e.g. string / graph kernels)
One-class SVMs for novelty / anomaly detection
Linear SVMs at scale via LinearSVC or SGD-hinge
Theoretical work where margin bounds matter
Skip it when
You're in modern deep-learning territory — representations + simple heads dominate
Probabilistic outputs are first-class — go logistic / GP / Bayesian
Big tabular — boosted trees almost always win
You don't want to tune C and γ — RF is more forgiving
from sklearn.svm import LinearSVC
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
# Approximate RBF kernel + linear model — scales to millions of points
big_svm = Pipeline([
("scale", StandardScaler()),
("rbf", RBFSampler(gamma="scale", n_components=500, random_state=0)),
("clf", SGDClassifier(loss="hinge", alpha=1e-4, max_iter=20)),
]).fit(X_train, y_train)
print(f"Test acc: {big_svm.score(X_test, y_test):.3f}")