Computer Vision
From pixels to understanding — image, video, 3D, VLMs, and world models.
Image Fundamentals: Pixels, Channels, Color Spaces
An image is a tensor of light samples. Every vision model you will ever use starts from this one fact.
Convolutions from Scratch
A convolution is a tiny dense layer you slide across an image, sharing the same weights at every location.
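A minimal sketch of that sliding dense layer, assuming a single-channel image stored as nested lists, stride 1, and no padding. The function name `conv2d_valid` is illustrative, not taken from the lesson. As in most deep learning frameworks, the kernel is not flipped, so strictly speaking this is cross-correlation.

```python
def conv2d_valid(image, kernel):
    """Slide one shared kernel over a 2D image (stride 1, no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # The "tiny dense layer": a dot product between the current
            # patch and the kernel, using the same weights at every (i, j).
            total = 0.0
            for di in range(kh):
                for dj in range(kw):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, -1]]  # responds to change along the main diagonal
result = conv2d_valid(image, kernel)
```

A 3x3 input and a 2x2 kernel give a 2x2 output, because the kernel fits at only two positions along each axis.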
CNNs: LeNet to ResNet
Every major CNN of the last thirty years is the same conv–nonlinearity–downsample recipe with one new idea bolted on. Learn the ideas in order.
Image Classification
A classifier is a function from pixels to a probability distribution over classes. Everything else is plumbing.
Transfer Learning & Fine-Tuning
Somebody else spent a million GPU hours teaching a network what edges, textures, and object parts look like. You should borrow those features before training your own.
Object Detection — YOLO from Scratch
Detection is classification plus regression, run at every position in a feature map, then cleaned up with non-maximum suppression.
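The clean-up step can be sketched as greedy non-maximum suppression. This is a sketch under stated assumptions: boxes are `(x1, y1, x2, y2)` tuples, the names `iou` and `nms` are illustrative, and the 0.5 threshold is an arbitrary example value, not a recommendation from the lesson.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes to keep, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any
        # higher-scoring box that was already kept.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

Here the second box overlaps the first with an IoU of about 0.68, so it is suppressed, while the third box is far away and survives.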
Semantic Segmentation — U-Net
Segmentation is classification at every pixel. U-Net makes it work by pairing a downsampling encoder with an upsampling decoder and wiring skip connections between them.
Instance Segmentation — Mask R-CNN
Add a tiny mask branch to a Faster R-CNN detector and you have instance segmentation. The hard part is RoIAlign, and it is harder than it looks.
Image Generation — GANs
A GAN is two neural networks in an adversarial game. One draws, one critiques. They get better together until the drawings fool the critic.
Image Generation — Diffusion Models
A diffusion model learns to denoise. Train it to remove a tiny bit of noise from a noisy image, repeat that backwards a thousand times, and you have an image generator.
Stable Diffusion — Architecture & Fine-Tuning
Stable Diffusion is a DDPM that runs in the latent space of a pretrained VAE, conditioned on text via cross-attention, sampled with a fast deterministic ODE solver, and steered …
Video Understanding — Temporal Modeling
A video is a sequence of images plus the physics that connects them. Every video model either treats time as an extra axis (3D conv), a sequence to attend over (transformer), or…
3D Vision: Point Clouds, NeRFs
3D vision comes in two flavours. Point clouds are the sensor's raw output. NeRFs are the learned volumetric field. Both answer "what is where in space."
Vision Transformers (ViT)
Cut the image into patches, treat each patch as a word, run a standard transformer. Don't look back.
Real-Time Vision: Edge Deployment
Edge inference is the discipline of getting a 90%-accurate model to run at 30 fps on a device with 2 GB of RAM. Every percentage point of accuracy is traded against milliseconds …
Build a Complete Vision Pipeline
A production vision system is a chain of models and rules stitched with data contracts. The pieces are already in this phase; the capstone wires them together end-to-end.
Self-Supervised Vision — SimCLR, DINO, MAE
Labels are the bottleneck of supervised vision. Self-supervised pretraining removes them: learn visual features from 100M unlabelled images, fine-tune on 10k labelled ones.
Open-Vocabulary Vision — CLIP
Train an image encoder and a text encoder together so that matching (image, caption) pairs land at the same point in a shared space. That is the whole trick.
OCR & Document Understanding
OCR is a three-stage pipeline — detect text boxes, recognise the characters, then lay them out. Every modern OCR system reorders these stages or merges them.
Image Retrieval & Metric Learning
A retrieval system ranks candidates by a distance in embedding space. Metric learning is the discipline of shaping that space so the distances mean what you want.
Keypoint Detection & Pose Estimation
A pose is a set of ordered keypoints. A keypoint detector is a heatmap regressor. Everything else is bookkeeping.
3D Gaussian Splatting from Scratch
A scene is a cloud of millions of 3D Gaussians. Each one has a position, orientation, scale, opacity, and a colour that depends on viewing direction. Rasterise them, backprop th…
Diffusion Transformers & Rectified Flow
The U-Net is not the secret of diffusion. Replace it with a transformer, swap the noise schedule for a straight-line flow, and suddenly you have SD3, FLUX, and every 2026 text-t…
SAM 3 & Open-Vocabulary Segmentation
Give a model a text prompt and an image and get masks for every matching object. SAM 3 made that a single forward pass.
Vision-Language Models (ViT-MLP-LLM)
A vision encoder converts an image into tokens. An MLP projector maps those tokens into the LLM's embedding space. A language model does the rest. That pattern — ViT-MLP-LLM — i…
Monocular Depth & Geometry Estimation
A depth map is a single-channel image where each pixel is a distance from the camera. Predicting it from one RGB frame used to be impossible without stereo or LiDAR. In 2026 a f…
Multi-Object Tracking & Video Memory
Tracking is detection plus association. Detect objects in every frame, then match this frame's detections to last frame's tracks so that each object keeps its ID.
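One simple way to do the association step is greedy matching on box overlap. The sketch below assumes tracks are a dict from integer ID to the last known `(x1, y1, x2, y2)` box. The names `box_iou` and `associate` and the 0.3 threshold are hypothetical choices for illustration. Production trackers typically use an optimal assignment (for example the Hungarian algorithm) and keep unmatched tracks alive for a few frames, whereas this sketch simply drops them.

```python
def box_iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def associate(tracks, detections, iou_threshold=0.3):
    """Greedily match detections to existing tracks; return updated tracks."""
    # Score every (track, detection) pair, best overlap first.
    pairs = sorted(
        ((box_iou(tbox, dbox), tid, di)
         for tid, tbox in tracks.items()
         for di, dbox in enumerate(detections)),
        reverse=True,
    )
    assigned = {}        # detection index -> track ID
    used_tracks = set()
    for score, tid, di in pairs:
        if score < iou_threshold:
            break        # remaining pairs overlap too little to match
        if tid in used_tracks or di in assigned:
            continue
        assigned[di] = tid
        used_tracks.add(tid)

    next_id = max(tracks, default=-1) + 1
    updated = {}
    for di, box in enumerate(detections):
        if di in assigned:
            updated[assigned[di]] = box   # existing object keeps its ID
        else:
            updated[next_id] = box        # unmatched detection starts a new track
            next_id += 1
    return updated


tracks = {0: (0, 0, 10, 10), 1: (50, 50, 60, 60)}
detections = [(51, 51, 61, 61), (1, 1, 11, 11), (100, 100, 110, 110)]
updated = associate(tracks, detections)
```

In this example both existing objects moved by one pixel and keep IDs 0 and 1, while the detection at (100, 100) has no overlapping track and receives the new ID 2.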
World Models & Video Diffusion
A video model that predicts the next seconds of a scene is a world simulator. Condition that prediction on actions and you have a learned game engine.