28 lessons

Computer Vision

From pixels to understanding — image, video, 3D, VLMs, and world models.

Image Fundamentals: Pixels, Channels, Color Spaces

Learn

Python

An image is a tensor of light samples. Every vision model you will ever use starts from this one fact.
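As a rough illustration of the "tensor of light samples" idea, here is a minimal sketch that stores a 2x2 RGB image as a height x width x channel nested list. The helper names `shape` and `to_gray` are hypothetical, and the grayscale weights are the common ITU-R BT.601 luma coefficients, chosen here as an assumption rather than anything the lesson prescribes.

```python
# A tiny 2x2 RGB image: height x width x channel, values in 0-255.
image = [
    [[255, 0, 0], [0, 255, 0]],
    [[0, 0, 255], [255, 255, 255]],
]

def shape(img):
    """Return (height, width, channels) of a nested-list image."""
    return (len(img), len(img[0]), len(img[0][0]))

def to_gray(img):
    """Collapse the channel axis with BT.601 luma weights (an assumed choice)."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for r, g, b in row] for row in img]

gray = to_gray(image)
print(shape(image))  # (height, width, channels)
print(gray)
```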

Convolutions from Scratch

Build

Python

A convolution is a tiny dense layer you slide across an image, sharing the same weights at every location.
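The weight-sharing idea can be sketched in a few lines of plain Python. This is a minimal, unoptimised version assuming a single-channel image, no padding, and stride 1; `conv2d` and `edge_kernel` are illustrative names, and like most deep learning libraries it computes cross-correlation (the kernel is not flipped).

```python
def conv2d(image, kernel):
    """Valid (no padding), stride-1 2D cross-correlation on nested lists."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # The same kernel weights are reused at every (i, j) location:
            # each output is a dot product between the kernel and one patch.
            acc = 0.0
            for di in range(kh):
                for dj in range(kw):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        out.append(row)
    return out

image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
edge_kernel = [[-1, 1]]  # fires where intensity rises from left to right
response = conv2d(image, edge_kernel)
print(response)
```

The response is non-zero only in the column where the image steps from 0 to 1, which is what a vertical-edge detector should report.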

CNNs: LeNet to ResNet

Build

Python

Every major CNN of the last thirty years is the same conv–nonlinearity–downsample recipe with one new idea bolted on. Learn the ideas in order.

Image Classification

Build

Python

A classifier is a function from pixels to a probability distribution over classes. Everything else is plumbing.
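The last step of that function, turning raw scores into a probability distribution, is usually a softmax. The sketch below assumes the network has already produced three logits; the values are made up for illustration.

```python
import math

def softmax(logits):
    """Map raw class scores to probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical scores for three classes
probs = softmax(logits)
predicted = max(range(len(probs)), key=probs.__getitem__)
print(probs, predicted)
```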

Transfer Learning & Fine-Tuning

Build

Python

Somebody else spent a million GPU hours teaching a network what edges, textures, and object parts look like. You should borrow those features before training your own.

Object Detection — YOLO from Scratch

Build

Python

Detection is classification plus regression, run at every position in a feature map, then cleaned up with non-maximum suppression.
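The clean-up step can be sketched as greedy non-maximum suppression. This is a minimal version assuming axis-aligned `(x1, y1, x2, y2)` boxes and a single class; the 0.5 IoU threshold and the example boxes are illustrative assumptions.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring box, drop boxes that overlap it too much, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)  # indices of the surviving boxes
```

Here the second box overlaps the first with IoU of about 0.68, so it is suppressed, while the distant third box survives.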

Semantic Segmentation — U-Net

Build

Python

Segmentation is classification at every pixel. U-Net makes it work by pairing a downsampling encoder with an upsampling decoder and wiring skip connections between them.

Instance Segmentation — Mask R-CNN

Build

Python

Add a tiny mask branch to a Faster R-CNN detector and you have instance segmentation. The hard part is RoIAlign, and it is harder than it looks.

Image Generation — GANs

Build

Python

A GAN is two neural networks locked in an adversarial game. One draws, one critiques. They get better together until the drawings fool the critic.

Image Generation — Diffusion Models

Build

Python

A diffusion model learns to denoise. Train it to remove a tiny bit of noise from a noisy image, repeat that backwards a thousand times, and you have an image generator.

Stable Diffusion — Architecture & Fine-Tuning

Build

Python

Stable Diffusion is a DDPM that runs in the latent space of a pretrained VAE, conditioned on text via cross-attention, sampled with a fast deterministic ODE solver, and steered …

Video Understanding — Temporal Modeling

Build

Python

A video is a sequence of images plus the physics that connects them. Every video model either treats time as an extra axis (3D conv), a sequence to attend over (transformer), or…

3D Vision: Point Clouds, NeRFs

Build

Python

3D vision comes in two flavours. Point clouds are the sensor's raw output. NeRFs are the learned volumetric field. Both answer "what is where in space."

Vision Transformers (ViT)

Build

Python

Cut the image into patches, treat each patch as a word, run a standard transformer. Don't look back.
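The "cut into patches" step can be sketched as follows. This assumes a single-channel image whose sides are divisible by the patch size; `patchify` is a hypothetical helper, and a real ViT would follow it with a learned linear projection and position embeddings, which are omitted here.

```python
def patchify(image, patch):
    """Split an H x W grid into flattened, non-overlapping patch x patch tiles."""
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    tokens = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            # Flatten one tile in row-major order; this is one "word".
            tokens.append(
                [image[top + i][left + j] for i in range(patch) for j in range(patch)]
            )
    return tokens

image = [[r * 4 + c for c in range(4)] for r in range(4)]  # 4x4 toy image
tokens = patchify(image, 2)
print(len(tokens), tokens[0])
```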

Real-Time Vision: Edge Deployment

Build

Python

Edge inference is the discipline of getting a 90%-accurate model to run at 30 fps on a device with 2 GB of RAM. Every percentage point of accuracy is traded against milliseconds …

Build a Complete Vision Pipeline

Build

Python

A production vision system is a chain of models and rules stitched with data contracts. The pieces are already in this phase; the capstone wires them together end-to-end.

Self-Supervised Vision — SimCLR, DINO, MAE

Build

Python

Labels are the bottleneck of supervised vision. Self-supervised pretraining removes them: learn visual features from 100M unlabelled images, fine-tune on 10k labelled ones.

Open-Vocabulary Vision — CLIP

Build

Python

Train an image encoder and a text encoder together so that matching (image, caption) pairs land close together in a shared embedding space. That is the whole trick.

OCR & Document Understanding

Build

Python

OCR is a three-stage pipeline — detect text boxes, recognise the characters, then lay them out. Every modern OCR system reorders these stages or merges them.

Image Retrieval & Metric Learning

Build

Python

A retrieval system ranks candidates by a distance in embedding space. Metric learning is the discipline of shaping that space so the distances mean what you want.
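Ranking by distance can be sketched with cosine distance over toy embeddings. The 2-D vectors and the names in `gallery` are made up for illustration; in practice the embeddings would come from a trained encoder, and the choice of cosine rather than Euclidean distance is an assumption.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def rank(query, gallery):
    """Return gallery keys sorted from nearest to farthest."""
    return sorted(gallery, key=lambda name: cosine_distance(query, gallery[name]))

gallery = {
    "cat_1": [0.9, 0.1],
    "cat_2": [0.8, 0.3],
    "car_1": [0.1, 0.95],
}
ranking = rank([1.0, 0.0], gallery)
print(ranking)
```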

Keypoint Detection & Pose Estimation

Build

Python

A pose is a set of ordered keypoints. A keypoint detector is a heatmap regressor. Everything else is bookkeeping.
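Decoding a pose from heatmaps can be sketched as one argmax per keypoint. This assumes the detector outputs one 2-D heatmap per keypoint in a fixed order; `decode_keypoint`, the keypoint names, and the heatmap values are illustrative, and sub-pixel refinement is left out.

```python
def decode_keypoint(heatmap):
    """Return (x, y, score) of the highest-scoring cell in a 2-D heatmap."""
    best = (0, 0, heatmap[0][0])
    for y, row in enumerate(heatmap):
        for x, value in enumerate(row):
            if value > best[2]:
                best = (x, y, value)
    return best

names = ["nose", "left_wrist"]  # hypothetical keypoint order
heatmaps = [
    [[0.0, 0.1, 0.0], [0.2, 0.9, 0.1], [0.0, 0.1, 0.0]],
    [[0.0, 0.0, 0.7], [0.0, 0.1, 0.2], [0.0, 0.0, 0.0]],
]
pose = {name: decode_keypoint(h) for name, h in zip(names, heatmaps)}
print(pose)
```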

3D Gaussian Splatting from Scratch

Build

Python

A scene is a cloud of millions of 3D Gaussians. Each one has a position, orientation, scale, opacity, and a colour that depends on viewing direction. Rasterise them, backprop th…

Diffusion Transformers & Rectified Flow

Build

Python

The U-Net is not the secret of diffusion. Replace it with a transformer, swap the noise schedule for a straight-line flow, and suddenly you have SD3, FLUX, and every 2026 text-t…

SAM 3 & Open-Vocabulary Segmentation

Build

Python

Give a model a text prompt and an image and get masks for every matching object. SAM 3 made that a single forward pass.

Vision-Language Models (ViT-MLP-LLM)

Build

Python

A vision encoder converts an image into tokens. An MLP projector maps those tokens into the LLM's embedding space. A language model does the rest. That pattern — ViT-MLP-LLM — i…

Monocular Depth & Geometry Estimation

Build

Python

A depth map is a single-channel image where each pixel is a distance from the camera. Predicting it from one RGB frame used to be impossible without stereo or LiDAR. In 2026 a f…

Multi-Object Tracking & Video Memory

Build

Python

Tracking is detection plus association. Detect in every frame, then match this frame's detections to last frame's tracks so each object keeps its ID.
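The association step can be sketched as greedy IoU matching between last frame's tracks and this frame's detections. This is a simplified stand-in: many trackers use the Hungarian algorithm and motion models instead, and the `associate` helper, the 0.3 threshold, and the example boxes are all assumptions for illustration.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def associate(tracks, detections, min_iou=0.3):
    """Greedily map detection index -> track ID, best-overlapping pairs first."""
    pairs = sorted(
        (
            (iou(box, det), track_id, det_idx)
            for track_id, box in tracks.items()
            for det_idx, det in enumerate(detections)
        ),
        reverse=True,
    )
    assigned, used_tracks = {}, set()
    for score, track_id, det_idx in pairs:
        if score < min_iou:
            break  # remaining pairs overlap too little to be the same object
        if track_id in used_tracks or det_idx in assigned:
            continue
        assigned[det_idx] = track_id
        used_tracks.add(track_id)
    return assigned

tracks = {1: (0, 0, 10, 10), 2: (50, 50, 60, 60)}  # last frame: ID -> box
detections = [(51, 51, 61, 61), (1, 0, 11, 10), (100, 100, 110, 110)]
matches = associate(tracks, detections)
print(matches)
```

Detections left out of `matches` (index 2 here) would typically start new tracks, and tracks with no match would be aged out after a few frames.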

World Models & Video Diffusion

Build

Python

A video model that predicts the next seconds of a scene is a world simulator. Condition that prediction on actions and you have a learned game engine.