Translation-equivariant networks for grid-structured data — images, video, audio spectrograms.
Key idea
Share the same small filter across the whole image. A CNN scans a tiny window over the input, looking for a particular pattern (an edge, a texture, a colour blob). Stack many such layers and the network builds up from edges, to textures, to object parts, to whole objects.
[Interactive demo: a 3×3 kernel slides over a 24×24 input to produce a 22×22 feature map.]
The kernel slides across the input, computing a weighted sum at every position. With the Sobel-x kernel, the feature map lights up at vertical edges. With a blur kernel on the digit, high-frequency detail vanishes. Editing individual kernel cells lets you build your own filter and see what it detects. The polished external explainers below take this much further.
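The sliding weighted sum can be sketched in a few lines of plain Python. This is a minimal illustration, not the demo's actual code: the helper name correlate2d_valid and the toy half-dark, half-bright image are assumptions made for the example. As in deep-learning libraries, the kernel is not flipped, so strictly speaking this is cross-correlation.

```python
def correlate2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = [[0.0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            # Weighted sum of the 3x3 window whose top-left corner is (i, j)
            out[i][j] = sum(
                kernel[a][b] * image[i + a][j + b]
                for a in range(kh)
                for b in range(kw)
            )
    return out


# Toy 24x24 image: dark left half (0), bright right half (1), so one vertical edge
image = [[0.0] * 12 + [1.0] * 12 for _ in range(24)]

sobel_x = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]

feature_map = correlate2d_valid(image, sobel_x)  # 22 rows x 22 columns
```

On this toy image the feature map is zero everywhere except the two columns whose windows straddle the edge.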
Compare two designs for an image classifier. A fully-connected network gets one big vector, with every pixel multiplied by its own weight. A million pixels means a million weights per output unit, and the network has to relearn "edges look like edges" at every location in the image. A CNN reuses the same small filter (say 3×3) across all positions: far fewer parameters, and the model knows that "an edge in the top left is the same kind of thing as an edge in the bottom right".
That weight sharing is why CNNs work so well on images. They also use pooling layers to shrink the spatial dimensions as you go deeper, building up to a few high-level features that summarize the whole image.
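To put rough numbers on the comparison, here is a back-of-the-envelope weight count. The image size and the number of outputs are assumed values chosen for illustration, and biases are ignored.

```python
height, width = 1000, 1000   # a one-megapixel grayscale image (assumed size)
n_outputs = 32               # hypothetical number of output units / filters

# Fully connected: every output unit has its own weight for every pixel
fc_weights = height * width * n_outputs

# Convolution: every filter is one 3x3 window reused at all positions
conv_weights = 3 * 3 * n_outputs

print(f"fully connected: {fc_weights:,} weights")
print(f"3x3 convolution: {conv_weights:,} weights")
```

The two layers do not produce the same kind of output (32 numbers versus 32 feature maps), so this compares weight counts only.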
Reach for it when
Image classification, detection, segmentation
Audio / spectrogram processing
Any grid-structured data whose local patterns mean the same thing wherever they appear
You have a pretrained backbone available
Skip it when
Text or sequences — use transformers / RNNs
Graph-structured data — use GNNs
Spatial relationships aren't local (e.g. very long-range dependencies)
A Vision Transformer (ViT) would suit the same dataset and you have the compute to train it
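import torch.nn as nn

n_classes = 10  # assumed; set to the number of classes in your dataset

# Assumes 3-channel 32×32 inputs: two 2× pools leave a 64-channel 8×8 map
model = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 32×32 -> 16×16
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 16×16 -> 8×8
    nn.Flatten(),
    nn.Linear(64 * 8 * 8, n_classes),
)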
Output is "what the filter found at each position"
Anatomy of a conv layer. Each layer learns F filters, applied at every spatial position. Input shape (B, Cin, H, W), output (B, F, H', W'). padding controls whether output spatial dims shrink; stride controls how aggressively they shrink. F typically grows with depth (more filters as we get more abstract).
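The effect of padding and stride on H' and W' can be sketched with the output-size formula that PyTorch documents for Conv2d. The function name conv_out_size is made up for this example.

```python
def conv_out_size(size, kernel_size, stride=1, padding=0, dilation=1):
    """Output length along one spatial dim (H or W) of a convolution."""
    return (size + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


valid = conv_out_size(24, kernel_size=3)                         # no padding: shrinks by 2
same = conv_out_size(32, kernel_size=3, padding=1)               # padding=1 keeps the size
halved = conv_out_size(32, kernel_size=3, stride=2, padding=1)   # stride=2 halves it
```

Apply it once for H and once for W; the channel dimension of the output is simply the number of filters F.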
Pooling. Max or average over small windows, downsampling spatial dims. Originally used for tolerance to small shifts and to cut computation and downstream parameters (pooling itself has no learnable weights); modern designs often skip pooling and use strided convolutions instead.
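A small sketch of the two downsampling options side by side, assuming a 16-channel 32×32 input picked only for illustration. Both halve the spatial dims; only the strided convolution has weights to learn.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 16, 32, 32)  # (B, C, H, W), assumed shape

pool = nn.MaxPool2d(kernel_size=2)                                # fixed rule, no parameters
strided = nn.Conv2d(16, 16, kernel_size=3, stride=2, padding=1)   # learned downsampling

pooled_shape = tuple(pool(x).shape)
strided_shape = tuple(strided(x).shape)

n_pool_params = sum(p.numel() for p in pool.parameters())
n_strided_params = sum(p.numel() for p in strided.parameters())   # 16*16*9 weights + 16 biases
```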
ResNet (He et al., 2015). Add a skip connection from input to output of a block: y = x + F(x). This made it possible to train 100+ layer networks by giving gradients a direct path back. Almost every modern vision architecture uses skip connections.
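A minimal sketch of y = x + F(x) as a PyTorch module. The class name ResidualBlock and the exact layers inside F (two 3×3 convs with batch norm) are assumptions for illustration; the original ResNet block also applies a ReLU after the addition, which is left out here to keep the formula visible.

```python
import torch
import torch.nn as nn


class ResidualBlock(nn.Module):
    """Simplified residual block: output = input + body(input)."""

    def __init__(self, channels):
        super().__init__()
        # F(x): padding=1 keeps H and W unchanged so the sum is well-defined
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        # The identity path gives gradients a direct route to earlier layers
        return x + self.body(x)


block = ResidualBlock(channels=8)
x = torch.randn(2, 8, 16, 16)
y = block(x)  # same shape as x, so blocks can be stacked
```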
Modern norms. Batch normalization is the classical choice but breaks with small batches. Group / Layer norm are more robust. The norm + activation order (pre-norm vs post-norm) matters for training stability.
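One way to see the batch-size sensitivity: in training mode, batch norm's output for a sample depends on what else is in the batch, while group norm's does not. The tensor shape and group count below are arbitrary choices for the sketch.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(8, 16, 4, 4)  # batch of 8 samples, 16 channels

bn = nn.BatchNorm2d(16)                            # statistics over (batch, H, W) per channel
gn = nn.GroupNorm(num_groups=4, num_channels=16)   # statistics per sample, per channel group
bn.train()
gn.train()

with torch.no_grad():
    # Normalize sample 0 as part of the full batch vs. on its own
    gn_same = torch.allclose(gn(x)[:1], gn(x[:1]), atol=1e-5)
    bn_same = torch.allclose(bn(x)[:1], bn(x[:1]), atol=1e-5)

print("GroupNorm output independent of batch:", gn_same)
print("BatchNorm output independent of batch:", bn_same)
```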
Reach for it when
Image classification — start from a pretrained ResNet / EfficientNet
Medical imaging where data is limited but a pretrained backbone helps
Skip it when
Very large image datasets and compute — Vision Transformers often win
Inputs aren't really grid-structured
You need global reasoning the receptive field can't reach
Reasoning over symbolic / discrete inputs
import torch.nn as nn
import torchvision.models as tvm

n_classes = 10  # assumed; set to the number of classes in your dataset

# Transfer learning: load a pretrained ResNet, replace the final layer
model = tvm.resnet50(weights=tvm.ResNet50_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, n_classes)

# Freeze the backbone for the first few epochs; train only the new head
for p in model.parameters():
    p.requires_grad = False
for p in model.fc.parameters():
    p.requires_grad = True
The portion of the input that influences one output pixel grows with depth
Receptive field engineering. The output of a deep CNN at one position only "sees" a finite patch of the input. Dilated (atrous) convolutions enlarge the receptive field without growing parameters — crucial for semantic segmentation and dense prediction. The effective receptive field is smaller than theoretical: most weight goes near the center.
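The theoretical receptive field of a stack of conv layers can be computed with a short recurrence. This is a sketch: the function name and the (kernel_size, stride, dilation) tuple layout are assumptions for the example, and it says nothing about the smaller effective receptive field.

```python
def receptive_field(layers):
    """Theoretical receptive field (along one dim) of stacked conv layers.

    layers: list of (kernel_size, stride, dilation), ordered input to output.
    """
    rf = 1    # one output position sees one input pixel before any layer
    jump = 1  # distance in input pixels between adjacent positions at this depth
    for kernel_size, stride, dilation in layers:
        rf += (kernel_size - 1) * dilation * jump
        jump *= stride
    return rf


plain = receptive_field([(3, 1, 1), (3, 1, 1), (3, 1, 1)])     # three ordinary 3x3 convs
dilated = receptive_field([(3, 1, 1), (3, 1, 2), (3, 1, 4)])   # same weights, dilation 1, 2, 4
```

Both stacks have the same number of weights, but the dilated one covers roughly twice the span per side.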
Modern architectures. ResNet (skip connections), DenseNet (every-to-every connections), EfficientNet (compound scaling of depth/width/resolution), ConvNeXt (a CNN redesigned with modern training and design tricks to match ViT performance). Many production CV systems use a ResNet or EfficientNet backbone with task-specific heads.
Inductive bias. CNNs hard-code three priors: locality (filters are small), translation equivariance (same filter everywhere), hierarchical composition (deeper layers see more). These priors let CNNs learn from much less data than ViTs. ViTs win when you have enough data to learn those priors from scratch.
Depthwise separable. A 3×3 conv with Cin→Cout channels has 9·Cin·Cout params. Split into 3×3 depthwise (one filter per input channel) + 1×1 pointwise (mix channels): 9·Cin + Cin·Cout params. Used in MobileNet, EfficientNet — much cheaper at modest accuracy cost.
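The parameter counts can be checked directly in PyTorch, where a depthwise conv is a Conv2d with groups equal to the input channels. The channel sizes below are assumed for illustration, and biases are turned off so the counts match the formulas.

```python
import torch.nn as nn

c_in, c_out = 64, 128  # assumed channel counts


def n_params(module):
    return sum(p.numel() for p in module.parameters())


standard = nn.Conv2d(c_in, c_out, kernel_size=3, padding=1, bias=False)

separable = nn.Sequential(
    # Depthwise: one 3x3 filter per input channel (groups=c_in)
    nn.Conv2d(c_in, c_in, kernel_size=3, padding=1, groups=c_in, bias=False),
    # Pointwise: 1x1 conv mixes channels
    nn.Conv2d(c_in, c_out, kernel_size=1, bias=False),
)

standard_params = n_params(standard)     # 9 * c_in * c_out
separable_params = n_params(separable)   # 9 * c_in + c_in * c_out
```

With these sizes the separable version uses about 8x fewer weights.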
Equivariance and invariance. A convolution is equivariant to translation: shift the input, the output shifts the same way (exactly for stride 1 away from the borders; striding and pooling make it approximate). With pooling / global average pooling on top, the final prediction is approximately invariant to translation. For rotation / reflection equivariance, see group-equivariant CNNs (Cohen & Welling, 2016).
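Both properties can be checked numerically. The sketch below uses circular padding and circular shifts (torch.roll) so that border effects do not get in the way; that choice, and the tensor sizes, are assumptions made to keep the check exact rather than how CNNs are usually padded.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
conv = nn.Conv2d(1, 4, kernel_size=3, padding=1, padding_mode="circular", bias=False)
x = torch.randn(1, 1, 16, 16)
shift = (3, 5)  # rows, columns

with torch.no_grad():
    # Equivariance: shifting then convolving equals convolving then shifting
    shifted_then_conv = conv(torch.roll(x, shifts=shift, dims=(2, 3)))
    conv_then_shifted = torch.roll(conv(x), shifts=shift, dims=(2, 3))

    # Invariance: global average pooling discards position entirely
    pooled_original = conv(x).mean(dim=(2, 3))
    pooled_shifted = shifted_then_conv.mean(dim=(2, 3))

equivariant = torch.allclose(shifted_then_conv, conv_then_shifted, atol=1e-5)
invariant = torch.allclose(pooled_original, pooled_shifted, atol=1e-5)
```

With zero padding or strides greater than 1, the same check only holds approximately.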