Classifying, detecting, segmenting, generating — what models do with pixels.
Key idea
Images are grids of numbers; vision models find structure in them. Early layers spot edges; middle layers find textures; later ones recognise objects. The architecture (CNN, ViT) and the training objective (classification, segmentation, generation) together determine what the model can do — but they all start by hierarchically composing simple features into complex ones.
Apply classic data augmentations — the same image, transformed, becomes many training examples
A simple 28×28 image with each classic augmentation. Augmentations are how vision models get data efficiency — one labelled image becomes hundreds of training examples. The model learns to be invariant to the things you augment over: small translations, rotations, lighting changes, colour shifts, partial occlusion.
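As a rough illustration of the augmentations described above, here is a minimal NumPy sketch on a synthetic 28×28 grayscale image. The function names, the shift range, the brightness range, and the occlusion size are all illustrative assumptions and are not taken from any particular library.

```python
import numpy as np

def translate(img, dx, dy):
    """Shift by (dx, dy) pixels, filling the vacated border with zeros."""
    h, w = img.shape
    out = np.zeros_like(img)
    # Part of the source image that remains visible after the shift
    src = img[max(0, -dy):min(h, h - dy), max(0, -dx):min(w, w - dx)]
    top, left = max(0, dy), max(0, dx)
    out[top:top + src.shape[0], left:left + src.shape[1]] = src
    return out

def horizontal_flip(img):
    """Mirror the image left to right."""
    return img[:, ::-1].copy()

def adjust_brightness(img, factor):
    """Scale intensities and clip back to [0, 1]."""
    return np.clip(img * factor, 0.0, 1.0)

def cutout(img, top, left, size):
    """Zero out a square patch to simulate partial occlusion."""
    out = img.copy()
    out[top:top + size, left:left + size] = 0.0
    return out

def random_augment(img, rng, max_shift=3, patch=6):
    """Apply a random combination; each call yields a new training example."""
    h, w = img.shape
    dx, dy = (int(v) for v in rng.integers(-max_shift, max_shift + 1, size=2))
    out = translate(img, dx, dy)
    if rng.random() < 0.5:
        out = horizontal_flip(out)
    out = adjust_brightness(out, float(rng.uniform(0.7, 1.3)))
    top = int(rng.integers(0, h - patch + 1))
    left = int(rng.integers(0, w - patch + 1))
    return cutout(out, top, left, patch)

# Synthetic 28x28 "image": a grey 10x10 square on a dark background
image = np.zeros((28, 28), dtype=np.float32)
image[9:19, 9:19] = 0.5

rng = np.random.default_rng(0)
augmented = [random_augment(image, rng) for _ in range(100)]
print(len(augmented), augmented[0].shape)
```

Only augment over transformations that preserve the label: a horizontal flip is harmless for most natural images but would be a poor choice for handwritten digits or text.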
Classification. Predict one (or top-k) label for the whole image. ImageNet-style training. CNNs (ResNet, EfficientNet) dominated; ViTs caught up at scale.
Object detection. Localise objects with bounding boxes + class labels. Two-stage (Faster R-CNN) vs one-stage (YOLO, RetinaNet); the DETR family instead predicts the whole set of boxes end-to-end with attention. YOLO is the usual choice when speed matters.
Segmentation. Per-pixel class label. Semantic segmentation (label per pixel) vs instance segmentation (label per pixel + instance ID). U-Net, Mask R-CNN, SAM are landmarks.
Generation. Make new images from text, sketches, or noise. Diffusion models dominate now (Stable Diffusion, Imagen, DALL-E 3). See the Generative Models page.
Self-supervised pretraining. The backbone for most modern vision. DINO, MAE, CLIP — all produce strong general-purpose vision features that beat training from scratch on most downstream tasks.
Convolution: the inductive bias that made deep learning work for vision
ViTs partially abandon it in favour of self-attention
CNNs (ResNet family). Convolutional layers — local kernels, weight-shared, translation-equivariant. Residual connections (He et al. 2016) made very deep networks trainable. ResNet-50 was the default vision backbone for years.
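To make "local kernels, weight-shared, translation-equivariant" concrete, here is a small pure-Python sketch. It implements a valid cross-correlation (the operation deep learning frameworks call convolution) and checks that shifting the input shifts the response by the same amount. The toy image, the kernel, and the helper names are assumptions chosen for illustration.

```python
def conv2d(image, kernel):
    """Valid 2D cross-correlation: slide one shared kernel over the image."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

def shift_right(image, k):
    """Shift every row right by k pixels, padding with zeros on the left."""
    return [[0] * k + row[:len(row) - k] for row in image]

# 6x8 image with a vertical edge: dark on the left, bright on the right
image = [[0, 0, 0, 1, 1, 1, 1, 1] for _ in range(6)]

# Horizontal-gradient kernel: responds where intensity rises from left to right
kernel = [[-1, 0, 1],
          [-1, 0, 1],
          [-1, 0, 1]]

response = conv2d(image, kernel)
shifted_response = conv2d(shift_right(image, 2), kernel)

print(response[0])          # [0, 3, 3, 0, 0, 0]
print(shifted_response[0])  # [0, 0, 0, 3, 3, 0]
```

The same nine weights are reused at every position, so the edge is detected wherever it appears. In this example the equivariance is exact; in general it holds only away from the image border.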
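Vision Transformers (ViT). Split the image into patches, treat each patch as a token, and run a transformer (Dosovitskiy et al. 2020). Scales well with data and compute: with large-scale pretraining (on the order of 100M+ images, such as JFT-300M in the original paper) it matches or beats CNNs, while on smaller datasets the convolutional inductive bias still helps.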
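The "split into patches, treat each as a token" step can be sketched with a NumPy reshape. This is only the tokenisation stage: a real ViT then applies a learned linear projection to each flattened patch, prepends a class token, and adds position embeddings. The function name `patchify` is an assumption for this example.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C),
    ordered row by row across the image.
    """
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0, "image must tile evenly"
    gh, gw = h // patch_size, w // patch_size
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)  # (gh, gw, p, p, c)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

# Dummy 224x224 RGB image; the values are arbitrary
image = np.arange(224 * 224 * 3, dtype=np.float32).reshape(224, 224, 3)
tokens = patchify(image, patch_size=16)
print(tokens.shape)  # (196, 768)
```

A 224×224 image with 16×16 patches gives a 14×14 grid, so the transformer sees a sequence of 196 tokens regardless of image content.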
Object detection: from R-CNN to DETR. R-CNN: extract proposals, classify each. Faster R-CNN: shared backbone + region proposal network. YOLO: single forward pass, dense predictions. DETR: end-to-end transformer with bipartite matching. RT-DETR and DINO-DETR are recent transformer-based detectors at or near the state of the art (DINO-DETR is unrelated to the self-supervised DINO method).
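The bipartite matching idea can be sketched in a few lines: each ground-truth box is assigned to exactly one prediction so that the total cost is minimal. This is a simplified illustration, not DETR's actual procedure. It assumes a cost of 1 − IoU only and brute-forces the assignment, whereas DETR uses the Hungarian algorithm with a cost that also includes class probability and a box-distance term. The boxes and function names are made up for the example.

```python
from itertools import permutations

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def match(predictions, targets):
    """One-to-one assignment of predictions to targets minimising total (1 - IoU).

    Brute force over permutations, which is fine for a handful of boxes.
    Returns a list of (prediction_index, target_index) pairs.
    """
    assert len(predictions) >= len(targets)
    best_cost, best_pairs = float("inf"), []
    for pred_idx in permutations(range(len(predictions)), len(targets)):
        cost = sum(1 - iou(predictions[p], targets[t])
                   for t, p in enumerate(pred_idx))
        if cost < best_cost:
            best_cost = cost
            best_pairs = [(p, t) for t, p in enumerate(pred_idx)]
    return best_pairs

targets = [(10, 10, 50, 50), (60, 60, 100, 100)]
predictions = [(58, 62, 98, 102), (0, 0, 5, 5), (12, 8, 52, 48)]

pairs = match(predictions, targets)
print(sorted(pairs))  # [(0, 1), (2, 0)]
```

Prediction 1 is left unmatched; in a set-prediction detector, unmatched predictions are trained to output a "no object" class, which is what removes the need for non-maximum suppression.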
Segmentation. Fully Convolutional Networks (Long et al. 2015) — the original. U-Net (Ronneberger et al. 2015) — encoder-decoder with skip connections, dominant in medical imaging. SAM (Kirillov et al. 2023) — Meta's "segment anything" foundation model with promptable segmentation.
CLIP and multi-modal. Radford et al. (2021). Image and text encoders trained contrastively on (image, caption) pairs. Aligns image and text in a shared embedding space, which enables zero-shot classification by similarity to text prompts. CLIP-style encoders are a foundation for modern multi-modal models (LLaVA, for example, uses a CLIP vision encoder).
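Zero-shot classification by similarity to text prompts reduces to a cosine similarity followed by a softmax. The sketch below uses made-up 4-dimensional vectors in place of real encoder outputs, and the temperature value is an assumption for illustration; in practice the embeddings would come from the trained image and text encoders.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def zero_shot_classify(image_embedding, text_embeddings, temperature=0.01):
    """Softmax over cosine similarities between one image and several prompts."""
    img = normalize(image_embedding)
    sims = [sum(a * b for a, b in zip(img, normalize(t)))
            for t in text_embeddings]
    logits = [s / temperature for s in sims]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

prompts = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

# Hypothetical embeddings standing in for text and image encoder outputs
text_embeddings = [
    [0.9, 0.1, 0.0, 0.2],
    [0.1, 0.9, 0.1, 0.1],
    [0.0, 0.1, 0.9, 0.3],
]
image_embedding = [0.8, 0.2, 0.1, 0.2]

probs = zero_shot_classify(image_embedding, text_embeddings)
predicted = prompts[probs.index(max(probs))]
print(predicted)  # a photo of a cat
```

No classifier is trained here: adding a new class only requires writing a new prompt and embedding it.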
Data augmentation as regularization. Beyond simple flips and crops: CutMix, MixUp, AugMix, RandAugment. Strong augmentation is a key reason modern vision models train so data-efficiently — the model is forced to learn meaningful features rather than surface texture.
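Of the methods listed, MixUp is the simplest to write down: it blends two training examples and their labels with a weight drawn from a Beta distribution. The sketch below is a minimal NumPy version for a single pair. The value `alpha=0.2`, the random stand-in images, and the function signature are assumptions for illustration.

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Blend two (image, one-hot label) pairs with a Beta(alpha, alpha) weight."""
    rng = rng or np.random.default_rng()
    lam = float(rng.beta(alpha, alpha))
    x = lam * x1 + (1.0 - lam) * x2  # pixel-wise convex combination
    y = lam * y1 + (1.0 - lam) * y2  # soft label with the same weights
    return x, y, lam

rng = np.random.default_rng(0)

# Random arrays standing in for two training images of different classes
cat_img, dog_img = rng.random((28, 28)), rng.random((28, 28))
cat_label, dog_label = np.array([1.0, 0.0]), np.array([0.0, 1.0])

mixed_img, mixed_label, lam = mixup(cat_img, cat_label, dog_img, dog_label, rng=rng)
print(mixed_img.shape, mixed_label, round(lam, 3))
```

Training on the soft label (with cross-entropy against a probability vector) discourages the model from making overconfident predictions between classes. CutMix follows the same label-mixing idea but pastes a rectangular region instead of blending every pixel.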
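import torch
from PIL import Image
from transformers import ViTForImageClassification, ViTImageProcessor

# Vision Transformer fine-tuning: a single training step
num_classes = 10  # placeholder: set to the number of classes in your dataset
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224",
    num_labels=num_classes,
    ignore_mismatched_sizes=True,  # replace the 1000-class ImageNet head with a fresh one
)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)  # illustrative learning rate

# Placeholder batch; in practice the images and labels come from your DataLoader
batch_pil = [Image.new("RGB", (224, 224)) for _ in range(4)]
labels = torch.randint(0, num_classes, (4,))

inputs = processor(images=batch_pil, return_tensors="pt")
out = model(**inputs, labels=labels)  # passing labels makes the model return a cross-entropy loss
out.loss.backward()
optimizer.step()
optimizer.zero_grad()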
Want video, 3D, dense prediction, NeRF, and diffusion-based generation?
NeRF: a volume rendering integral whose integrand is parametrised by an MLP
Input: 3D point + view direction; output: density and emitted colour
Used for 3D reconstruction from 2D images, and the basis for many 3D-aware models
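For reference, the NeRF rendering integral (Mildenhall et al. 2020) can be written as follows. Here $\mathbf{r}(t) = \mathbf{o} + t\,\mathbf{d}$ is a camera ray with origin $\mathbf{o}$ and direction $\mathbf{d}$, $t_n$ and $t_f$ are the near and far bounds, and the MLP supplies the density $\sigma$ and the colour $\mathbf{c}$:

```latex
C(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\,\sigma(\mathbf{r}(t))\,\mathbf{c}(\mathbf{r}(t), \mathbf{d})\,dt,
\qquad
T(t) = \exp\!\left(-\int_{t_n}^{t} \sigma(\mathbf{r}(s))\,ds\right)
```

$T(t)$ is the accumulated transmittance: the probability that the ray travels from $t_n$ to $t$ without being absorbed.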
Video understanding. Temporal context matters. Approaches: 3D CNNs (I3D, SlowFast), two-stream (RGB + optical flow), video transformers (TimeSformer, ViViT), masked video modelling. Video is also a great pre-training signal — the future-prediction objective transfers to many downstream tasks.
Dense prediction. Depth estimation (MiDaS, Marigold), surface normals, optical flow (RAFT), keypoint detection. Often shared backbone + task-specific head. Modern approach: pretrain on huge unlabelled video, fine-tune for the specific dense task.
3D from 2D. NeRF (Mildenhall et al. 2020) and its descendants represent a scene as a continuous field over 3D coordinates and render it by integrating density and colour along camera rays (volume rendering). Gaussian Splatting (Kerbl et al. 2023) renders much faster: it replaces the MLP with an explicit set of 3D Gaussians that are rasterised. Both reconstruct 3D scenes from multi-view images.
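In practice the volume rendering step is approximated by sampling points along each ray and compositing them front to back. The sketch below shows that compositing for a single ray in pure Python. The densities and colours are hand-written stand-ins; in NeRF they would be the MLP's outputs at each sampled point.

```python
import math

def render_ray(densities, colors, deltas):
    """Composite samples along one ray, front to back.

    densities: density at each sample point
    colors:    (r, g, b) emitted at each sample point
    deltas:    distance between adjacent samples
    Returns the pixel colour and the transmittance left after the last sample.
    """
    pixel = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light not yet absorbed
    for sigma, color, delta in zip(densities, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        weight = transmittance * alpha
        pixel = [p + weight * c for p, c in zip(pixel, color)]
        transmittance *= 1.0 - alpha
    return pixel, transmittance

# Empty space, then a nearly opaque red surface, then a green one hidden behind it
densities = [0.0, 0.0, 50.0, 50.0]
colors = [(0, 0, 0), (0, 0, 0), (1, 0, 0), (0, 1, 0)]
deltas = [0.25, 0.25, 0.25, 0.25]

pixel, remaining = render_ray(densities, colors, deltas)
print([round(v, 3) for v in pixel])  # [1.0, 0.0, 0.0]
```

The green sample contributes almost nothing because the red surface in front of it has already absorbed the ray. Every operation here is differentiable, which is what lets the scene representation be trained from photographs alone.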
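Diffusion for vision. Stable Diffusion and SDXL are latent diffusion models, and Flux is a closely related flow-based model that also works in a learned latent space; DALL-E 3 and Midjourney are closed systems whose architectures are not fully public. Common uses: text-to-image, image-to-image, inpainting, and control via additional conditioning (ControlNet, T2I-Adapter). See Diffusion Models.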
Foundation models for vision. CLIP, DINOv2, SAM, Florence-2. Pretrained encoders that you build downstream tasks on top of. The shift from "train your own classifier from scratch" to "embed with a foundation model + linear head" is now largely complete in industry.
Vision-language models. LLaVA, MiniGPT-4, GPT-4V, Gemini, Claude. A vision encoder feeds image tokens into an LLM, which enables open-ended image understanding: describe, reason, answer. The bridge between vision and reasoning.
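import cv2
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

# SAM: segment anything with a single prompt point
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
predictor = SamPredictor(sam)

# OpenCV loads images as BGR; SamPredictor expects RGB by default
image = cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# Prompt: a single point on the object of interest (coordinates must be NumPy arrays)
masks, scores, _ = predictor.predict(
    point_coords=np.array([[640, 480]]),  # (x, y) in pixels
    point_labels=np.array([1]),           # 1 = positive (this point is on the object)
    multimask_output=True,                # return several candidate masks for an ambiguous prompt
)
best_mask = masks[np.argmax(scores)]      # keep the highest-scoring candidate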