Key idea

Most high-dim data lives on a lower-dim "shape". A photo has millions of pixels, but only a few hundred numbers can describe most of what's in it. Dimensionality reduction finds those few numbers.

[Interactive demo: slide θ to rotate the projection axis and watch the 1D histogram widen and narrow; "Find PC1" snaps to the max-variance angle.]

PCA is exactly this game played analytically: find the axis that maximises projected variance. The faint indigo cross-hairs are the principal axes the model found (PC1 and PC2); the orange axis is yours to spin. When yours matches PC1, the histogram is at its widest. The number under it is what people mean by "captured variance": for the tilted dataset, one component captures ~90% of the total variance.
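The slider game can be replayed in a few lines of NumPy. This is a minimal sketch, not the demo's own code: the dataset (a Gaussian cloud stretched 3:1 and tilted by 30°) and the helper name `projected_variance` are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tilted 2D dataset: a stretched Gaussian cloud rotated by 30 degrees.
tilt = np.deg2rad(30.0)
R = np.array([[np.cos(tilt), -np.sin(tilt)],
              [np.sin(tilt),  np.cos(tilt)]])
X = (rng.normal(size=(2000, 2)) * np.array([3.0, 1.0])) @ R.T
X = X - X.mean(axis=0)

def projected_variance(X, theta):
    """Variance of the data projected onto the unit vector at angle theta."""
    axis = np.array([np.cos(theta), np.sin(theta)])
    return np.var(X @ axis)

# Brute-force version of the slider: try every angle in [0, 180) degrees.
thetas = np.deg2rad(np.arange(0.0, 180.0, 0.5))
variances = np.array([projected_variance(X, t) for t in thetas])
best_theta = thetas[np.argmax(variances)]

# Analytic version: top eigenvector of the covariance matrix.
eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
pc1 = eigvecs[:, -1]                      # eigh sorts eigenvalues ascending
pc1_theta = np.arctan2(pc1[1], pc1[0]) % np.pi

captured = eigvals[-1] / eigvals.sum()    # fraction of variance along PC1
print(f"best angle by search: {np.rad2deg(best_theta):.1f} deg")
print(f"PC1 angle:            {np.rad2deg(pc1_theta):.1f} deg")
print(f"captured variance:    {captured:.2f}")
```

The search and the eigenvector should agree to within the grid spacing, and the captured fraction is the largest eigenvalue divided by the sum of all eigenvalues.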

Two reasons to do this. First, visualization: you can't plot 50 dimensions, but you can plot 2. Algorithms like t-SNE and UMAP project high-dim data into 2D in a way that preserves which points are near each other.

Second, compression and denoising: removing noise dimensions can make downstream models faster and sometimes more accurate. PCA is the classical workhorse here — it finds the directions of greatest variance and projects onto them.
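To make the compression and denoising claim concrete, here is a small sketch on synthetic data. Everything about the data is an assumption for illustration: 3 latent factors spread over 50 features, plus isotropic Gaussian noise. It compresses each row to 3 numbers, maps back with `inverse_transform`, and compares both versions against the noise-free signal.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical data: 3 latent factors spread over 50 features, plus isotropic noise.
n, d, k = 1000, 50, 3
Z = rng.normal(size=(n, k))
W = rng.normal(size=(k, d))
X_clean = Z @ W
X_noisy = X_clean + 0.5 * rng.normal(size=(n, d))

# Compress to k numbers per row, then map back to the original 50 features.
pca = PCA(n_components=k).fit(X_noisy)
codes = pca.transform(X_noisy)               # shape (n, k): the compressed form
X_denoised = pca.inverse_transform(codes)    # shape (n, d): the reconstruction

err_noisy = np.mean((X_noisy - X_clean) ** 2)
err_denoised = np.mean((X_denoised - X_clean) ** 2)
print(f"MSE vs clean signal, noisy input:  {err_noisy:.3f}")
print(f"MSE vs clean signal, PCA-denoised: {err_denoised:.3f}")
```

The denoising works here because the noise is spread evenly over all 50 directions while the signal lives in only 3 of them; on real data the split is rarely this clean.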

Reach for it when

  • Visualizing high-dim data in 2D / 3D
  • Speeding up downstream models on high-dim features
  • Removing redundant correlated features
  • Feature engineering for tabular ML

Skip it when

  • You need every feature to be interpretable — projected dimensions are mixes
  • Each feature already carries unique signal (no redundancy)
  • You're modelling images / text — let the model learn the embedding
  • You need to reconstruct the original space exactly
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# X is assumed to be a numeric array of shape (n_samples, n_features)
# Standardize first: PCA is variance-based, so it is sensitive to feature scale
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2).fit(X_scaled)
X_2d = pca.transform(X_scaled)

print(f"Explained variance: {pca.explained_variance_ratio_}")
print(f"Cumulative:          {pca.explained_variance_ratio_.cumsum()}")
Want PCA's math and the t-SNE / UMAP story?
PCA via SVD
$$ X_{\text{centered}} \;=\; U \Sigma V^\top, \qquad \text{principal components} = V_{:,1{:}k} $$
  • V: right singular vectors, the principal directions in feature space
  • Σ: singular values (magnitudes); squared ∝ variance along each direction
  • Top k columns of V give the best k-dim linear approximation in the L2 sense
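The decomposition above can be sketched directly with `np.linalg.svd`. The data and the variable names (`components`, `scores`, `X_rank_k`) are illustrative choices, not part of any library API.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))   # hypothetical correlated data

k = 2
X_centered = X - X.mean(axis=0)

# Thin SVD: X_centered = U @ diag(S) @ Vt, singular values sorted descending.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

components = Vt[:k]                       # top-k right singular vectors (rows of Vt)
scores = X_centered @ components.T        # data expressed in the top-k directions
explained_variance = S**2 / (len(X) - 1)  # variance along each direction
explained_ratio = explained_variance / explained_variance.sum()

# Best rank-k approximation of the centered data in the least-squares sense.
X_rank_k = scores @ components
```

The squared error of `X_rank_k` equals the sum of the discarded squared singular values, which is why dropping the smallest ones loses the least.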

PCA finds the linear projection that preserves the most variance. Among linear projections it's optimal under squared-error loss, fast (closed form via SVD), interpretable (each component is a weighted combination of original features), and gives you a reconstruction error that measures what was lost. Downside: linear only, so it can't unfold curved manifolds.

t-SNE models pairwise similarities in high-dim space as Gaussian, in low-dim space as Student-t (heavier tails). Minimizes KL divergence between the two. Preserves local neighborhood structure beautifully but distorts global distances. Stochastic — different runs give different layouts. Don't use for downstream modelling, only visualization.
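Written out, the two similarity models and the objective look like this. The symbols are not defined elsewhere on this page and follow the usual t-SNE formulation: x_i are the n high-dim points, y_i their low-dim images, and each bandwidth σ_i is tuned so that the neighborhood around point i matches the requested perplexity.

$$ p_{j \mid i} = \frac{\exp\!\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\!\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)}, \qquad p_{ij} = \frac{p_{j \mid i} + p_{i \mid j}}{2n} $$
$$ q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}, \qquad \mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}} $$

Because the loss weights each pair by p_ij, pairs that are far apart in the original space contribute almost nothing, which is one way to see why global distances end up distorted.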

UMAP is often treated as t-SNE's faster successor. Builds a fuzzy topological graph of nearest neighbors, then optimizes a low-dim embedding to match it. Usually faster than t-SNE, often better at preserving global structure, deterministic-ish with a seed. Has become a default visualization method for embeddings.

ICA (Independent Component Analysis) finds projections that are statistically independent, not just uncorrelated. Recovers the true source signals, up to order and scale, when they're non-Gaussian (at most one source may be Gaussian). Classic application: blind source separation (the "cocktail party problem").
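A toy cocktail party, as a hedged sketch using scikit-learn's `FastICA`. The two sources (a sine and a square wave) and the mixing matrix `A` are made up for the example; real recordings are far messier.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)

# Two hypothetical non-Gaussian sources: a sine wave and a square wave.
S = np.column_stack([np.sin(2 * t), np.sign(np.sin(3 * t))])
S += 0.02 * rng.normal(size=S.shape)

# Each "microphone" hears a different linear mix of the sources.
A = np.array([[1.0, 0.5],
              [0.5, 2.0]])       # mixing matrix, made up for this example
X = S @ A.T

ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)     # recovered sources, up to order, sign, and scale

# Absolute correlation between each true source (rows) and each estimate (columns).
corr = np.abs(np.corrcoef(S.T, S_est.T)[:2, 2:])
print(corr.round(2))
```

Each row of `corr` should have one entry near 1, but which column it lands in is arbitrary: ICA cannot tell you the order, sign, or amplitude of the sources.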

Reach for it when

  • PCA: denoising, decorrelating features for downstream models
  • t-SNE / UMAP: 2D visualization of high-dim embeddings
  • ICA: recovering independent sources (EEG, audio separation)
  • You need to compress before clustering or kNN

Skip it when

  • You need to preserve global distances exactly (t-SNE / UMAP distort them)
  • Non-linear structure with no global trend (PCA fails)
  • Features are sparse — PCA densifies and destroys sparsity benefits
  • Embeddings will be used as features downstream — t-SNE / UMAP outputs aren't designed for that
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import umap  # provided by the umap-learn package

# Reduce to 50 dims with PCA first: t-SNE / UMAP are slow on raw high-dim data
# (assumes X has at least 50 features and at least 50 samples)
X_pca = PCA(n_components=50, random_state=0).fit_transform(X)

# Then non-linear embedding to 2D
X_tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(X_pca)
X_umap = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1, random_state=0).fit_transform(X_pca)

# PCA -> t-SNE / UMAP is the standard recipe for visualizing image / text embeddings
Want the manifold view, autoencoders, and pitfalls?
Probabilistic PCA
$$ \mathbf{x} \;=\; W\,\mathbf{z} + \boldsymbol{\mu} + \boldsymbol{\varepsilon}, \quad \mathbf{z} \sim \mathcal{N}(0, I),\; \boldsymbol{\varepsilon} \sim \mathcal{N}(0, \sigma^2 I) $$
  • PCA = max-likelihood limit (σ² → 0) of this latent-variable model
  • Generalizes to factor analysis, mixture-of-PPCA, and variational autoencoders
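A sketch of the latent-variable model above: sample from it, check the covariance it implies, and fit it back with the standard closed-form maximum-likelihood solution (due to Tipping and Bishop). The ground-truth `W`, `mu`, and `sigma2` are arbitrary values picked for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d, q, n = 5, 2, 20000

# Hypothetical ground-truth parameters of the generative model.
W = rng.normal(size=(d, q))
mu = rng.normal(size=d)
sigma2 = 0.1

# Sample x = W z + mu + eps.
Z = rng.normal(size=(n, q))
eps = np.sqrt(sigma2) * rng.normal(size=(n, d))
X = Z @ W.T + mu + eps

# The model implies cov(x) = W W^T + sigma^2 I.
model_cov = W @ W.T + sigma2 * np.eye(d)
sample_cov = np.cov(X, rowvar=False)

# Closed-form maximum-likelihood fit: eigendecompose the sample covariance;
# sigma^2 is the mean of the discarded eigenvalues.
eigvals, eigvecs = np.linalg.eigh(sample_cov)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]        # sort descending
sigma2_ml = eigvals[q:].mean()
W_ml = eigvecs[:, :q] * np.sqrt(eigvals[:q] - sigma2_ml)  # defined up to a rotation of z

print(f"sigma^2 true / ML estimate: {sigma2} / {sigma2_ml:.3f}")
```

As `sigma2_ml` goes to 0, the columns of `W_ml` become the ordinary principal directions scaled by the square roots of their eigenvalues, which is the sense in which PCA is a limit of this model.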

PCA pitfalls. (1) Scale matters — components are dominated by high-variance features unless you standardize. (2) Linear only — fails on curved manifolds like the Swiss roll. (3) Outliers can hijack the top components — use robust PCA (decomposition into low-rank + sparse) if outliers are a concern.
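Pitfall (1) is easy to reproduce. In this sketch the three features are invented: two small-scale columns that are strongly correlated, and one unrelated column recorded in units a thousand times larger.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500

# Hypothetical features: two correlated small-scale columns and one
# unrelated column measured in much larger units.
a = rng.normal(size=n)
b = a + 0.3 * rng.normal(size=n)
c = 1000.0 * rng.normal(size=n)
X = np.column_stack([a, b, c])

pc1_raw = PCA(n_components=1).fit(X).components_[0]
pc1_std = PCA(n_components=1).fit(StandardScaler().fit_transform(X)).components_[0]

print("PC1 loadings, raw:         ", np.abs(pc1_raw).round(3))
print("PC1 loadings, standardized:", np.abs(pc1_std).round(3))
```

Without scaling, PC1 is essentially the large-unit column on its own; after standardizing, PC1 picks up the shared direction of the two correlated columns instead.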

Choosing k. Cumulative explained variance to a threshold (e.g. 95%). Scree-plot elbow. Cross-validated reconstruction error. For downstream classification, choose k via held-out task performance — the variance criterion isn't task-aligned.
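The cumulative-variance rule can be sketched as follows. The synthetic data (5 strong directions hidden in 30 features) and the 95% threshold are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical data: 5 strong latent directions embedded in 30 features, plus noise.
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 30))
X += 0.1 * rng.normal(size=X.shape)

pca = PCA().fit(X)                                   # keep all components
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest k whose cumulative explained variance reaches the threshold.
threshold = 0.95
k = int(np.searchsorted(cumulative, threshold) + 1)
print(f"k = {k} components reach {cumulative[k - 1]:.3f} cumulative variance")

# scikit-learn shortcut: a float in (0, 1) as n_components applies the same rule.
pca_95 = PCA(n_components=threshold).fit(X)
print(pca_95.n_components_)
```

For a downstream classifier, the same loop would instead score each candidate k on held-out task performance, since a direction with little variance can still carry the label signal.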

Kernel PCA uses the kernel trick to do PCA in a feature space implicitly defined by a kernel. Solves the linearity limitation but loses the interpretability of axes.
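A minimal comparison on two concentric rings, a dataset where no linear projection separates the classes. The RBF kernel and `gamma=10.0` are hand-picked for this toy data, and `gap_along_first_axis` is a throwaway helper defined here, not a library function.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

# Two concentric rings: no straight line in the original 2D space separates them.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

X_lin = PCA(n_components=2).fit_transform(X)
X_rbf = KernelPCA(n_components=2, kernel="rbf", gamma=10.0).fit_transform(X)  # gamma chosen by hand

def gap_along_first_axis(Z, y):
    """Distance between class means on component 1, in units of pooled std."""
    z0, z1 = Z[y == 0, 0], Z[y == 1, 0]
    return abs(z0.mean() - z1.mean()) / np.sqrt(0.5 * (z0.var() + z1.var()))

gap_lin = gap_along_first_axis(X_lin, y)
gap_rbf = gap_along_first_axis(X_rbf, y)
print(f"linear PCA separation: {gap_lin:.2f}")
print(f"RBF kernel PCA:        {gap_rbf:.2f}")
```

The kernel PCA axes no longer correspond to weighted sums of the original features, which is the interpretability cost mentioned above.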

Manifold learning. Isomap (geodesic distances + classical MDS), LLE (locally linear embedding), Laplacian eigenmaps. Each makes a different assumption about local structure. t-SNE and UMAP dominate in practice but the older methods have stronger theoretical guarantees for specific manifold types.
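As one example of the older methods, here is a hedged Isomap sketch on scikit-learn's Swiss roll generator. The choice of `n_neighbors=10` is an assumption that happens to suit this sample size; too large a value lets the neighbor graph jump between layers of the roll.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

# Swiss roll: a 2D sheet rolled up in 3D; t is each point's position along the roll.
X, t = make_swiss_roll(n_samples=1000, random_state=0)

# Isomap: k-nearest-neighbor graph -> shortest-path (geodesic) distances -> classical MDS.
X_iso = Isomap(n_neighbors=10, n_components=2).fit_transform(X)

# If the roll was unrolled, the first coordinate should track position along the roll.
corr = abs(np.corrcoef(X_iso[:, 0], t)[0, 1])
print(f"|corr(first Isomap coordinate, t)| = {corr:.2f}")
```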

Autoencoders are the deep-learning generalization. Encoder maps x → z, decoder maps z → x̂; train to minimize reconstruction error. Bottleneck dimension = embedding dim. VAEs add a probabilistic prior on z, giving you a generative model.

t-SNE caveats. Distances in the t-SNE plot are not meaningful. Cluster sizes are not meaningful. The number of clusters is dictated by perplexity as much as by the data. Always re-run with several perplexity values to check stability — see Wattenberg et al. "How to Use t-SNE Effectively".
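The stability check is a short loop. This sketch uses three Gaussian blobs as stand-in data and three arbitrary perplexity values; swap in your own `X` and a range that suits your sample size (perplexity must stay below the number of samples).

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

# Hypothetical data: 3 Gaussian blobs in 10 dimensions.
X, y = make_blobs(n_samples=300, n_features=10, centers=3, random_state=0)

embeddings = {}
for perplexity in (5, 30, 50):
    tsne = TSNE(n_components=2, perplexity=perplexity, init="pca", random_state=0)
    embeddings[perplexity] = tsne.fit_transform(X)
    print(perplexity, round(float(tsne.kl_divergence_), 3))

# Plot each embeddings[p] side by side (for example with matplotlib) and trust
# only the structure that shows up at every perplexity.
```

The KL values are not comparable across perplexities, because changing perplexity changes the target distribution P itself, so judge stability from the plots and not from the loss.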

Reach for it when

  • Robust PCA — large outliers expected
  • Kernel PCA — known kernel inductive bias and small / moderate data
  • Variational autoencoders — generative model + embedding in one
  • Topological data analysis — UMAP's preserved structure feeds persistent homology

Skip it when

  • Embedding must be stable under small data changes (t-SNE / UMAP are sensitive)
  • You need to invert the embedding back to original space (most manifold methods can't, or only approximately)
  • Out-of-sample extension matters — t-SNE doesn't natively transform new points
  • You care about preserved geodesic distances — only Isomap claims that
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, d_in, d_latent):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Linear(d_in, 256), nn.ReLU(),
            nn.Linear(256, d_latent),
        )
        self.dec = nn.Sequential(
            nn.Linear(d_latent, 256), nn.ReLU(),
            nn.Linear(256, d_in),
        )
    def forward(self, x):
        z = self.enc(x)
        return self.dec(z), z

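# X is assumed to be a float array of shape (n_samples, n_features)
X_t = torch.as_tensor(X, dtype=torch.float32)

model = Autoencoder(d_in=X_t.shape[1], d_latent=16)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Full-batch training for brevity; use a DataLoader with mini-batches on larger data
for epoch in range(50):
    x_hat, _ = model(X_t)
    loss = loss_fn(x_hat, X_t)
    opt.zero_grad()
    loss.backward()
    opt.step()

# The embedding is the encoder output; no gradients needed at this point
with torch.no_grad():
    Z = model.enc(X_t)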