Key idea

Images are grids of numbers; vision models find structure in them. Early layers spot edges; middle layers find textures; later ones recognise objects. The architecture (CNN, ViT) and the training objective (classification, segmentation, generation) together determine what the model can do — but they all start by hierarchically composing simple features into complex ones.

Classic data augmentations: the same image, transformed, becomes many training examples

[Figure: a simple 28×28 image shown under each classic augmentation.]

Augmentations are how vision models get data efficiency: one labelled image can yield hundreds of distinct training examples. The model learns to be invariant to the things you augment over: small translations, rotations, lighting changes, colour shifts, partial occlusion.
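
To make "one image becomes many" concrete, here is a minimal NumPy sketch. The `augment` helper, the shift range, brightness range, and 6×6 occlusion size are all illustrative assumptions, not values from any particular library.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(img, rng):
    """Return one randomly augmented copy of a 2D grayscale image in [0, 1]."""
    h, w = img.shape
    out = img.copy()
    # Random horizontal flip.
    if rng.random() < 0.5:
        out = out[:, ::-1]
    # Small translation: shift by up to 2 pixels, zero-filling exposed borders.
    dy, dx = (int(v) for v in rng.integers(-2, 3, size=2))
    shifted = np.zeros_like(out)
    src = out[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
    shifted[max(0, dy):max(0, dy) + src.shape[0],
            max(0, dx):max(0, dx) + src.shape[1]] = src
    out = shifted
    # Lighting change: scale brightness and clip back to [0, 1].
    out = np.clip(out * rng.uniform(0.8, 1.2), 0.0, 1.0)
    # Partial occlusion: zero out a random 6x6 square.
    y = int(rng.integers(0, h - 5))
    x = int(rng.integers(0, w - 5))
    out[y:y + 6, x:x + 6] = 0.0
    return out

# A bright rectangle on a dark background stands in for a real 28x28 image.
image = np.zeros((28, 28))
image[8:20, 10:18] = 1.0

# One labelled image turned into 100 differently transformed training examples.
batch = np.stack([augment(image, rng) for _ in range(100)])
print(batch.shape)  # (100, 28, 28)
```

Every copy keeps the original label, which is exactly what pushes the model towards invariance to these transformations.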

Classification. Predict one (or top-k) label for the whole image. ImageNet-style training. CNNs (ResNet, EfficientNet) dominated; ViTs caught up at scale.
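
A small sketch of what "one (or top-k) label" means in practice: turn logits into probabilities, then keep the k most likely classes. The helper names and the example logits are made up for illustration.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def top_k(logits, k):
    """Return (class indices, probabilities) of the k most likely classes per row."""
    probs = softmax(logits)
    idx = np.argsort(-probs, axis=-1)[..., :k]
    return idx, np.take_along_axis(probs, idx, axis=-1)

def top_k_accuracy(logits, labels, k):
    """Fraction of rows whose true label appears among the top-k predictions."""
    idx, _ = top_k(logits, k)
    return float((idx == labels[:, None]).any(axis=1).mean())

logits = np.array([[2.0, 0.5, -1.0, 3.0, 0.0]])  # one image, five classes
idx, p = top_k(logits, k=2)
print(idx)  # [[3 0]]
```

Top-1 and top-5 accuracy on ImageNet are this same computation with k=1 and k=5.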

Object detection. Localise objects with bounding boxes + class labels. Two-stage (Faster R-CNN) vs one-stage (YOLO, RetinaNet, DETR). YOLO is fast; DETR-family is end-to-end with attention.
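
Comparing a predicted box to a ground-truth box is usually done with intersection-over-union (IoU). The sketch below assumes boxes in `(x1, y1, x2, y2)` corner format; other formats (centre plus width and height) need converting first.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle (may be empty).
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Two 10x10 boxes overlapping in a 5x5 corner: 25 / (100 + 100 - 25).
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))
```

A detection is typically counted as correct when its IoU with a same-class ground-truth box exceeds a chosen threshold such as 0.5.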

Segmentation. Per-pixel class label. Semantic segmentation (label per pixel) vs instance segmentation (label per pixel + instance ID). U-Net, Mask R-CNN, SAM are landmarks.
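
The difference between the two label formats is easiest to see on a toy canvas. In this sketch the class IDs (1 = cat, 2 = dog) and the layout are invented for illustration.

```python
import numpy as np

# Instance map: each object gets its own ID; 0 is background.
instance_ids = np.zeros((6, 8), dtype=int)
instance_ids[1:3, 1:3] = 1   # first cat
instance_ids[1:3, 5:7] = 2   # second cat
instance_ids[4:6, 2:6] = 3   # a dog

# Which class each instance belongs to (0 = background, 1 = cat, 2 = dog).
instance_class = {0: 0, 1: 1, 2: 1, 3: 2}

# The semantic map keeps only the class per pixel, so the two cats merge.
lookup = np.array([instance_class[i] for i in range(len(instance_class))])
semantic = lookup[instance_ids]

num_cat_pixels = int((semantic == 1).sum())
num_cat_instances = sum(1 for c in instance_class.values() if c == 1)
print(num_cat_pixels, num_cat_instances)  # 8 2
```

Semantic segmentation can say "8 pixels are cat"; only instance segmentation can say "there are 2 cats".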

Generation. Make new images from text, sketches, or noise. Diffusion models dominate now (Stable Diffusion, Imagen, DALL-E 3). See the Generative Models page.

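Self-supervised pretraining. The backbone for most modern vision. DINO and MAE learn from unlabelled images; CLIP learns from image-caption pairs, which is weak language supervision rather than strict self-supervision. All three produce strong general-purpose vision features that typically beat training from scratch on downstream tasks.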

Common vision tasks

  • Classification: ResNet, ConvNeXt, ViT, EfficientNet
  • Detection: YOLO (fast), Faster R-CNN (accurate), DETR (end-to-end)
  • Segmentation: U-Net (medical), SAM (general), Mask R-CNN
  • Pose / keypoints: OpenPose, Keypoint R-CNN (via Detectron2), MediaPipe
  • Generation: Stable Diffusion, Flux, DALL-E 3

Watch out

  • Distribution shift kills production models — different lighting, sensors, populations
  • Adversarial perturbations exist — important for safety-critical deployment
  • Annotation is expensive — bounding boxes, masks, keypoints all cost
  • Fairness issues: face datasets are often demographically skewed, so error rates can differ across groups once deployed

import torch
import torchvision
from torchvision import transforms

num_classes = 10  # placeholder: set to the number of classes in your dataset

# Classic augmentation pipeline
train_tfm = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.5, 1.0)),   # random crop, resized to 224x224
    transforms.RandomHorizontalFlip(),                     # flip with probability 0.5
    transforms.ColorJitter(0.2, 0.2, 0.2),                 # brightness, contrast, saturation
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),       # ImageNet stats
])

# Pretrained backbone + a new head sized for the target classes
model = torchvision.models.resnet50(weights="IMAGENET1K_V2")
model.fc = torch.nn.Linear(model.fc.in_features, num_classes)
Convolution $$ (f * g)[i, j] = \sum_{m, n} f[i - m,\, j - n] \cdot g[m, n] $$
  • Local, weight-shared, translation-equivariant
  • The inductive bias that made deep learning work for vision
  • ViTs partially abandon it for self-attention
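
The convolution sum above can be evaluated directly. The sketch below is a deliberately slow, loop-based version with zero padding ("full" output) so that it matches the formula term by term; `conv2d_full` is a name chosen here, not a library function. Note that deep learning frameworks such as PyTorch actually compute cross-correlation (no kernel flip) in their convolution layers, which makes no practical difference when the kernel is learned.

```python
import numpy as np

def conv2d_full(f, g):
    """(f * g)[i, j] = sum_{m, n} f[i - m, j - n] * g[m, n], zero-padded."""
    H, W = f.shape
    kH, kW = g.shape
    out = np.zeros((H + kH - 1, W + kW - 1))
    for m in range(kH):
        for n in range(kW):
            # Adds g[m, n] * f[i - m, j - n] for every valid (i, j) at once.
            out[m:m + H, n:n + W] += g[m, n] * f
    return out

# Tiny worked case.
small = conv2d_full(np.array([[1.0, 2.0], [3.0, 4.0]]),
                    np.array([[1.0, 0.0], [0.0, -1.0]]))

# Translation equivariance: shifting the input shifts the output by the same amount.
rng = np.random.default_rng(0)
f = np.zeros((8, 8))
f[0:5, 0:5] = rng.random((5, 5))                    # content in the top-left corner
g = rng.random((3, 3))
f_shifted = np.roll(f, shift=(2, 1), axis=(0, 1))   # only zeros wrap around

out = conv2d_full(f, g)
out_shifted = conv2d_full(f_shifted, g)
equivariant = np.allclose(out_shifted, np.roll(out, shift=(2, 1), axis=(0, 1)))
print(equivariant)  # True
```

The same 3×3 kernel is applied at every position, which is the weight sharing the bullets refer to.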

CNNs (ResNet family). Convolutional layers — local kernels, weight-shared, translation-equivariant. Residual connections (He et al. 2016) made very deep networks trainable. ResNet-50 was the default vision backbone for years.

Vision Transformers (ViT). Split the image into patches, treat each patch as a token, and run a transformer (Dosovitskiy et al. 2020). Performance scales well with data and compute; ViTs match or beat CNNs once pretraining reaches roughly 100M images or more.
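
The "split into patches, treat each as a token" step is just a reshape followed by a linear projection. A minimal NumPy sketch, assuming a 224×224 RGB image and 16×16 patches (the ViT-Base configuration); the embedding width of 192 and the random projection matrix are placeholders for learned weights.

```python
import numpy as np

def patchify(image, patch):
    """Cut an (H, W, C) image into non-overlapping patches, one flat vector each."""
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0, "image size must be a multiple of patch size"
    gh, gw = H // patch, W // patch
    x = image.reshape(gh, patch, gw, patch, C)
    x = x.transpose(0, 2, 1, 3, 4)               # (gh, gw, patch, patch, C)
    return x.reshape(gh * gw, patch * patch * C)

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))

tokens = patchify(image, 16)
print(tokens.shape)  # (196, 768): 14 x 14 patches, each 16 * 16 * 3 values

# Linear patch embedding; in a real ViT this matrix is learned.
W_embed = 0.02 * rng.standard_normal((768, 192))
embedded = tokens @ W_embed                      # (196, 192) token sequence
```

A real ViT additionally prepends a class token and adds position embeddings before the transformer layers.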

Object detection: from R-CNN to DETR. R-CNN: extract region proposals, classify each. Faster R-CNN: shared backbone + region proposal network. YOLO: single forward pass, dense predictions. DETR: end-to-end transformer trained with bipartite matching between predictions and ground-truth objects. RT-DETR and DINO-DETR are recent transformer-based detectors at or near the state of the art.
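
Bipartite matching assigns each ground-truth object to exactly one prediction so that the total cost is minimal. DETR solves this with the Hungarian algorithm; the sketch below swaps in a brute-force search over permutations, which is only practical for tiny inputs but gives the same optimum. The cost values are made up; in DETR the cost combines class probability and box terms.

```python
from itertools import permutations

def match(cost):
    """cost[p][g] = cost of assigning prediction p to ground truth g.

    Assumes at least as many predictions as ground truths.
    Returns ([(prediction, ground_truth), ...], total_cost) for the cheapest
    one-to-one assignment.
    """
    n_pred, n_gt = len(cost), len(cost[0])
    best, best_cost = None, float("inf")
    # Try every way of picking a distinct prediction for each ground truth.
    for preds in permutations(range(n_pred), n_gt):
        total = sum(cost[p][g] for g, p in enumerate(preds))
        if total < best_cost:
            best, best_cost = preds, total
    return [(p, g) for g, p in enumerate(best)], best_cost

# Three predictions, two ground-truth objects.
cost = [
    [0.9, 0.2],
    [0.1, 0.8],
    [0.5, 0.6],
]
pairs, total = match(cost)
unmatched = set(range(len(cost))) - {p for p, _ in pairs}
print(pairs, unmatched)  # [(1, 0), (0, 1)] {2}
```

Predictions left unmatched (here prediction 2) are trained to output "no object", which is what removes the need for hand-tuned duplicate suppression.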

Segmentation. Fully Convolutional Networks (Long et al. 2015) — the original. U-Net (Ronneberger et al. 2015) — encoder-decoder with skip connections, dominant in medical imaging. SAM (Kirillov et al. 2023) — Meta's "segment anything" foundation model with promptable segmentation.

CLIP and multi-modal. Radford et al. (2021). Image and text encoders trained contrastively on (image, caption) pairs. This aligns images and text in a shared embedding space, which enables zero-shot classification by similarity to text prompts. The approach underpins many modern multi-modal models: LLaVA, for example, uses a CLIP vision encoder (the architecture of GPT-4V is not public).
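
Zero-shot classification in a shared embedding space reduces to cosine similarity plus a softmax. The sketch below uses hand-written 3-dimensional vectors as stand-ins for real encoder outputs, and the temperature of 0.01 is an illustrative value, not a quoted CLIP setting.

```python
import numpy as np

def normalize(x):
    """Scale vectors to unit length along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def zero_shot(image_emb, text_embs, temperature=0.01):
    """Probabilities over text prompts for one image embedding."""
    sims = normalize(text_embs) @ normalize(image_emb)   # cosine similarities
    logits = sims / temperature
    logits = logits - logits.max()                       # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

prompts = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

# Stand-ins for text-encoder and image-encoder outputs.
text_embs = np.array([[1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0],
                      [0.0, 0.0, 1.0]])
image_emb = np.array([0.1, 0.9, 0.2])

probs = zero_shot(image_emb, text_embs)
print(prompts[int(np.argmax(probs))])  # a photo of a dog
```

No classifier is trained here: adding a new class only means writing a new prompt and embedding it.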

Data augmentation as regularization. Beyond simple flips and crops: CutMix, MixUp, AugMix, RandAugment. Strong augmentation is a key reason modern vision models train so data-efficiently — the model is forced to learn meaningful features rather than surface texture.
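
MixUp is the simplest of these to write down: blend two images and blend their labels with the same weight. A minimal NumPy sketch, assuming one-hot labels and a Beta(alpha, alpha) mixing weight; `alpha=0.2` is an illustrative choice.

```python
import numpy as np

def mixup(x, y_onehot, alpha, rng):
    """Blend each example with a randomly chosen partner from the same batch."""
    lam = rng.beta(alpha, alpha)             # mixing weight in [0, 1]
    perm = rng.permutation(len(x))           # partner index for every example
    x_mix = lam * x + (1.0 - lam) * x[perm]
    y_mix = lam * y_onehot + (1.0 - lam) * y_onehot[perm]
    return x_mix, y_mix, lam

rng = np.random.default_rng(0)
x = rng.random((4, 28, 28))                  # a batch of four toy images
y = np.eye(3)[np.array([0, 1, 2, 1])]        # one-hot labels, three classes

x_mix, y_mix, lam = mixup(x, y, alpha=0.2, rng=rng)
print(y_mix.sum(axis=1))                     # each soft label still sums to 1
```

CutMix follows the same pattern but pastes a rectangular region from the partner image instead of blending every pixel, with the label weight set by the pasted area.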

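import torch
from PIL import Image
from transformers import ViTForImageClassification, ViTImageProcessor

num_classes = 10  # placeholder: number of classes in your dataset

# Vision Transformer fine-tuning: keep the pretrained encoder, swap the 1000-way head
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
model     = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224",
    num_labels=num_classes,
    ignore_mismatched_sizes=True,   # the new head's shape differs from the checkpoint's
)

# Placeholder batch: swap in real PIL images and integer class labels
batch_pil = [Image.new("RGB", (224, 224)) for _ in range(2)]
labels    = torch.tensor([0, 1])

inputs = processor(images=batch_pil, return_tensors="pt")
out    = model(**inputs, labels=labels)   # passing labels makes the model return a loss
out.loss.backward()                       # gradients only; an optimizer step would follow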
Neural Radiance Field (NeRF) $$ C(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\, \sigma(\mathbf{r}(t))\, \mathbf{c}(\mathbf{r}(t), \mathbf{d})\, dt $$
  • Volumetric rendering integral parametrised by an MLP
  • Input: 3D point + view direction; output: density and emitted colour
  • 3D reconstruction from 2D images — and the basis for many 3D-aware models
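
In practice the rendering integral is approximated by sampling along the ray, with transmittance $$ T(t) = \exp\left(-\int_{t_n}^{t} \sigma(\mathbf{r}(s))\, ds\right) $$ becoming a running product. The sketch below uses the standard quadrature $$ C \approx \sum_i T_i \left(1 - e^{-\sigma_i \delta_i}\right) \mathbf{c}_i $$ and replaces the MLP with a hand-written toy field (an opaque red slab), so the densities and colours are assumptions for illustration only.

```python
import numpy as np

def render_ray(sigma, color, t):
    """Discrete volume rendering along one ray.

    sigma: (N,) densities, color: (N, 3) RGB, t: (N + 1,) interval boundaries.
    """
    delta = np.diff(t)                              # interval lengths
    alpha = 1.0 - np.exp(-sigma * delta)            # opacity of each interval
    # T_i: fraction of light that survives all earlier intervals.
    trans = np.concatenate([[1.0], np.cumprod(1.0 - alpha)[:-1]])
    weights = trans * alpha
    return weights @ color, weights

# Sample the ray between near and far bounds t_n = 2 and t_f = 6.
t = np.linspace(2.0, 6.0, 65)
mid = 0.5 * (t[:-1] + t[1:])

# Toy stand-in for the MLP: a dense red slab between t = 3.5 and t = 4.5.
in_slab = (mid >= 3.5) & (mid <= 4.5)
sigma = np.where(in_slab, 50.0, 0.0)
color = np.where(in_slab[:, None], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0])

rgb, weights = render_ray(sigma, color, t)
print(np.round(rgb, 3))  # close to pure red: the slab absorbs the whole ray
```

Because every step is differentiable, the same computation lets gradients flow from a pixel colour error back to the densities and colours along the ray.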

Video understanding. Temporal context matters. Approaches: 3D CNNs (I3D, SlowFast), two-stream (RGB + optical flow), video transformers (TimeSformer, ViViT), masked video modelling. Video is also a great pre-training signal — the future-prediction objective transfers to many downstream tasks.

Dense prediction. Depth estimation (MiDaS, Marigold), surface normals, optical flow (RAFT), keypoint detection. Often a shared backbone plus a task-specific head. One modern approach: start from a backbone pretrained at scale on unlabelled images or video, then fine-tune for the specific dense task.

3D from 2D. NeRF (Mildenhall et al. 2020) and its descendants represent a scene as a continuous field over 3D coordinates, rendered by volume rendering along camera rays. Gaussian Splatting (Kerbl et al. 2023) renders much faster and uses no MLP: the scene is an explicit set of 3D Gaussians that are rasterised. Both reconstruct 3D from multi-view images.

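Diffusion for vision. Stable Diffusion, SDXL, and Flux generate in a compressed latent space (diffusion, or closely related flow-based training in Flux's case); DALL-E 3 and Midjourney are proprietary and their architectures are not fully documented. Common uses: text-to-image, image-to-image, inpainting, and control via additional conditioning (ControlNet, T2I-Adapter). See Diffusion Models.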

Foundation models for vision. CLIP, DINOv2, SAM, Florence-2. Pretrained encoders that you build downstream tasks on top of. The shift from "train your own classifier from scratch" to "embed with a foundation model + linear head" is largely complete in industry.
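
The "embed + linear head" recipe can be sketched without any model at all. Below, `fake_embed` is a hypothetical stand-in for a frozen foundation-model encoder (it returns noisy cluster centres), and the head is fitted by ridge regression on one-hot targets, a simple substitute for the usual logistic-regression probe.

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes, dim = 3, 16

# One cluster centre per class, standing in for well-separated features.
centers = np.zeros((num_classes, dim))
centers[np.arange(num_classes), np.arange(num_classes)] = 5.0

def fake_embed(n_per_class):
    """Hypothetical frozen encoder: returns (features, labels)."""
    labels = np.repeat(np.arange(num_classes), n_per_class)
    feats = centers[labels] + 0.5 * rng.standard_normal((len(labels), dim))
    return feats, labels

def fit_linear_head(feats, labels, l2=1e-2):
    """Ridge regression from features (plus bias) to one-hot labels."""
    X = np.hstack([feats, np.ones((len(feats), 1))])
    Y = np.eye(num_classes)[labels]
    return np.linalg.solve(X.T @ X + l2 * np.eye(X.shape[1]), X.T @ Y)

def predict(W, feats):
    X = np.hstack([feats, np.ones((len(feats), 1))])
    return (X @ W).argmax(axis=1)

train_x, train_y = fake_embed(50)
test_x, test_y = fake_embed(20)

W = fit_linear_head(train_x, train_y)        # the encoder itself is never updated
accuracy = float((predict(W, test_x) == test_y).mean())
print(accuracy)
```

With real data, the only change is replacing `fake_embed` with a forward pass through the frozen encoder; the head stays this small.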

Vision-language models. LLaVA, MiniGPT-4, GPT-4V, Gemini, Claude. In the open models, a vision encoder turns the image into tokens that are fed into an LLM; the proprietary systems do not fully document their designs. This enables open-ended image understanding: describe, reason, answer. The bridge between vision and reasoning.

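import cv2
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

# SAM: segment anything with a single prompt point
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
predictor = SamPredictor(sam)

# OpenCV loads images as BGR; SamPredictor.set_image expects RGB by default
image = cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# Prompt: a single point on the object of interest (NumPy arrays, not lists)
masks, scores, _ = predictor.predict(
    point_coords=np.array([[640, 480]]),   # (x, y) in pixels
    point_labels=np.array([1]),            # 1 = foreground, 0 = background
    multimask_output=True,                 # return several candidate masks
)
best_mask = masks[np.argmax(scores)]       # keep the highest-scoring candidate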