Key idea

Find the boundary with the widest possible gap. If you can draw multiple straight lines separating two classes, the SVM picks the one that leaves the most "elbow room" on either side. The points right on the edge of the gap are the support vectors — they alone determine the boundary.

Toggle the kernel · slide C (linear) or γ (RBF) · watch the boundary bend and the support vectors light up
C = 1.00 γ = 3.0

In linear mode, dashed lines mark the margin at f = ±1; outlined points are the support vectors. Drop the slider for C and watch the margin widen — the SVM tolerates more violations in exchange for a fatter buffer. Switch to RBF on the Concentric dataset and the boundary bends into a circle; raise γ to make it tighter (and more prone to overfit), lower it for a smoother fit.

Imagine two clusters of points on a 2D plane, one class on the left, one on the right. A regular classifier might draw any line between them. An SVM picks the line that's maximally far from both — the "fattest" possible separator. The intuition: leaving a wide margin should generalize better to new points than squeezing a line right up against the data.

When the classes aren't linearly separable in the original feature space, the kernel trick lets SVMs find a separator in an implicit higher-dimensional space without ever explicitly computing the coordinates. That's how SVMs handle non-linear problems.

Reach for it when

  • Small to medium data (≲ 50k points)
  • High-dimensional features (text, gene expression)
  • You want a strong out-of-the-box classifier with sensible defaults
  • Clean binary classification where the boundary is the interesting object

Skip it when

  • Large data — SVMs scale poorly (≳ O(N²))
  • You need calibrated probabilities (SVM scores aren't probabilities)
  • You don't have a good kernel guess and don't want to grid-search
  • Highly imbalanced classes — needs careful class-weight tuning
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

clf = Pipeline([
    ("scale", StandardScaler()),    # SVMs are scale-sensitive
    ("svm",   SVC(kernel="rbf", C=1.0, gamma="scale")),
]).fit(X_train, y_train)

print(f"Accuracy: {clf.score(X_test, y_test):.3f}")
Want the max-margin math and the kernel trick?
Soft-margin SVM $$ \min_{\mathbf{w}, b} \;\tfrac{1}{2}\|\mathbf{w}\|^2 \;+\; C \sum_{i=1}^{N} \max\!\left(0,\, 1 - y_i\,(\mathbf{w}^\top \mathbf{x}_i + b)\right) $$
  • w, bnormal vector and offset of the separating hyperplane
  • yi ∈ {−1, +1}class label
  • max(0, 1 − …)hinge loss — penalises points inside the margin or misclassified
  • Cregularization knob — large C = hard margin, small = wide margin tolerating mistakes

The max-margin idea. For a linearly separable problem, find w and b such that yi(wᵀxi + b) ≥ 1 for all training points, and minimise ‖w‖. Minimising ‖w‖ is equivalent to maximising the margin width 2/‖w‖. Only the points exactly on the margin (the "support vectors") matter; the rest could be deleted without changing the answer.

Soft margin. Real data isn't perfectly separable. The hinge loss allows some points inside the margin or misclassified, paying a linear penalty. The C hyperparameter trades margin width against training error.

The kernel trick. SVMs depend on the data only through inner products xiᵀxj. Replace these with a kernel function K(xi, xj) = φ(xi)ᵀφ(xj), where φ maps to a higher-dimensional space — without ever computing φ explicitly. Common kernels:

Linear: K = xᵀx'. Fast baseline. Use for high-dim sparse data (text).
Polynomial: K = (xᵀx' + c)d. Captures feature interactions up to degree d.
RBF (Gaussian): K = exp(−γ‖x − x'‖²). The default. Infinite-dim feature space.
Sigmoid: historical; rarely used today.

Scaling matters. All distance-based kernels (RBF, polynomial) are sensitive to feature scales. Always standardize first.

Reach for it when

  • Small or medium dataset where O(N²) training is acceptable
  • You want a strong baseline with limited hyperparameter tuning
  • High-dimensional sparse features (text classification was SVMs' golden era)
  • You want non-linear decision boundaries without engineering features

Skip it when

  • Data is large — SGD-trained linear models scale much better
  • You need probability outputs (Platt scaling exists but is post-hoc)
  • Multi-class with many classes — one-vs-rest gets expensive
  • You want feature importance / explanations — SVMs are opaque
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Standard recipe: scale → SVM with grid-searched C, gamma
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("svm",   SVC()),
])

grid = GridSearchCV(
    pipe,
    {
        "svm__kernel": ["rbf"],
        "svm__C":     [0.1, 1, 10, 100],
        "svm__gamma": ["scale", 0.01, 0.1, 1.0],
    },
    cv=5, scoring="f1", n_jobs=-1,
).fit(X_train, y_train)

print(f"Best params: {grid.best_params_}")
print(f"Test accuracy: {grid.score(X_test, y_test):.3f}")
Want the dual form, KKT conditions, and the scaling story?
Dual formulation $$ \max_{\boldsymbol{\alpha}} \;\sum_{i} \alpha_i \;-\; \tfrac{1}{2}\sum_{i,j} \alpha_i \alpha_j \, y_i y_j \, K(\mathbf{x}_i, \mathbf{x}_j) \qquad \text{s.t.}\; 0 \le \alpha_i \le C,\; \sum_i \alpha_i y_i = 0 $$
  • Dual variables αi — one per training point
  • Non-zero αi ⟺ point i is a support vector
  • Decision: f(x) = sign(Σ αi yi K(xi, x) + b) — only kernels with support vectors

Primal vs. dual. The primal optimises over w (size = number of features). The dual optimises over α (size = number of training points). Use the primal when features ≪ samples; use the dual when features ≫ samples (where the kernel trick lives). The KKT conditions give the explicit relationship: w = Σ αiyiφ(xi), summing only over support vectors.

Kernel choice as inductive bias. A kernel encodes "what does similar mean?". RBF treats all features symmetrically; if some matter more, scale them differently. Polynomial encodes specific interaction orders. String kernels, graph kernels, and learned kernels (deep kernels, NTK) extend SVMs to structured data — though for most of these, modern neural approaches have won.

Probability calibration. SVM decision scores aren't probabilities. Platt scaling fits a logistic regression on the SVM outputs using a held-out set: P(y=1|x) ≈ σ(A·f(x) + B). sklearn does this when you pass probability=True (which slows training noticeably). For honest probabilities, prefer logistic regression directly.

Scaling SVMs. Standard SVMs are O(N²)–O(N³). For larger data: (1) LinearSVC uses a different solver and scales to millions; (2) approximate kernels (Nyström, random Fourier features) project data into a finite-dim space, then train a linear model; (3) SGDClassifier with hinge loss = linear SVM via stochastic gradient descent, scales to streams.

Why SVMs lost. Once feature engineering was replaced by representation learning (deep nets) and tabular wins went to gradient boosting, SVMs lost their main strongholds. They're still relevant for: small-data classification with strong kernel priors, one-class anomaly detection (one-class SVM), and as a theoretical object — VC-dimension and margin bounds came from SVM theory.

Reach for it when

  • Small-data classification with domain-knowledge kernels (e.g. string / graph kernels)
  • One-class SVMs for novelty / anomaly detection
  • Linear SVMs at scale via LinearSVC or SGD-hinge
  • Theoretical work where margin bounds matter

Skip it when

  • You're in modern deep-learning territory — representations + simple heads dominate
  • Probabilistic outputs are first-class — go logistic / GP / Bayesian
  • Big tabular — boosted trees almost always win
  • You don't want to tune C and γ — RF is more forgiving
from sklearn.svm import LinearSVC
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Approximate RBF kernel + linear model — scales to millions of points
big_svm = Pipeline([
    ("scale", StandardScaler()),
    ("rbf",   RBFSampler(gamma="scale", n_components=500, random_state=0)),
    ("clf",   SGDClassifier(loss="hinge", alpha=1e-4, max_iter=20)),
]).fit(X_train, y_train)

print(f"Test acc: {big_svm.score(X_test, y_test):.3f}")
Too dense?