4
28 lessons

Computer Vision

From pixels to understanding — image, video, 3D, VLMs, and world models.

01

Image Fundamentals: Pixels, Channels, Color Spaces

Learn
Python

An image is a tensor of light samples. Every vision model you will ever use starts from this one fact.
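As a minimal sketch of that fact, the snippet below stores a 2x2 RGB image as a height x width x channel nested list and collapses it to one grayscale channel. The helper names `shape` and `to_grayscale` are illustrative, and the luma weights are the common ITU-R BT.601 ones, chosen here as an assumption rather than anything the lesson prescribes.

```python
# A tiny 2x2 RGB image: height x width x channel, values in 0-255.
image = [
    [[255, 0, 0], [0, 255, 0]],
    [[0, 0, 255], [255, 255, 255]],
]


def shape(img):
    """Return (height, width, channels) of a nested-list image."""
    return (len(img), len(img[0]), len(img[0][0]))


def to_grayscale(img):
    """Collapse RGB to a single luma channel (BT.601 weights, an assumption)."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
        for row in img
    ]


gray = to_grayscale(image)
print(shape(image))  # (2, 2, 3)
print(gray)          # one number per pixel instead of three
```

Real pipelines would hold the same data in a NumPy array or framework tensor; the indexing idea is the same.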

02

Convolutions from Scratch

Build
Python

A convolution is a tiny dense layer you slide across an image, sharing the same weights at every location.
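The sliding, weight-sharing idea can be sketched in a few lines of plain Python. This is an illustrative version with no padding and stride 1, and like most deep learning libraries it computes cross-correlation (the kernel is not flipped). The function name `conv2d` and the toy edge kernel are assumptions made for the example.

```python
def conv2d(image, kernel):
    """Slide `kernel` over a 2D list `image` (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # The same kernel weights are reused at every (i, j):
            # each output is a dot product between the kernel and one patch.
            acc = 0
            for di in range(kh):
                for dj in range(kw):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        out.append(row)
    return out


image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
edge_kernel = [[-1, 1]]  # fires where intensity rises from left to right
response = conv2d(image, edge_kernel)
print(response)
```

The response is non-zero only in the column where the dark-to-bright edge sits, which is the whole point of a learned filter bank.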

03

CNNs: LeNet to ResNet

Build
Python

Every major CNN of the last thirty years is the same conv–nonlinearity–downsample recipe with one new idea bolted on. Learn the ideas in order.

04

Image Classification

Build
Python

A classifier is a function from pixels to a probability distribution over classes. Everything else is plumbing.
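The last step of that function, turning raw scores into a probability distribution, is usually a softmax. The sketch below assumes the network has already produced three logits; the numbers are made up for illustration.

```python
import math


def softmax(logits):
    """Map raw class scores to probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # hypothetical scores for three classes
probs = softmax(logits)
predicted = max(range(len(probs)), key=probs.__getitem__)
print(probs, predicted)
```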

05

Transfer Learning & Fine-Tuning

Build
Python

Somebody else spent a million GPU hours teaching a network what edges, textures, and object parts look like. You should borrow those features before training your own.

06

Object Detection — YOLO from Scratch

Build
Python

Detection is classification plus regression, run at every position in a feature map, then cleaned up with non-maximum suppression.
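The clean-up step can be sketched as greedy non-maximum suppression over `(x1, y1, x2, y2)` boxes. The box format, the 0.5 threshold, and the function names are assumptions for this example, not the lesson's exact implementation.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring box, drop boxes that overlap it too much, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)  # indices of surviving boxes
```

Here the second box overlaps the first heavily and is suppressed, while the distant third box survives.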

07

Semantic Segmentation — U-Net

Build
Python

Segmentation is classification at every pixel. U-Net makes it work by pairing a downsampling encoder with an upsampling decoder and wiring skip connections between them.

08

Instance Segmentation — Mask R-CNN

Build
Python

Add a tiny mask branch to a Faster R-CNN detector and you have instance segmentation. The hard part is RoIAlign, and it is harder than it looks.

09

Image Generation — GANs

Build
Python

A GAN is two neural networks locked in an adversarial game. One draws, one critiques. They get better together until the drawings fool the critic.

10

Image Generation — Diffusion Models

Build
Python

A diffusion model learns to denoise. Train it to remove a tiny bit of noise from a noisy image, repeat that backwards a thousand times, and you have an image generator.

11

Stable Diffusion — Architecture & Fine-Tuning

Build
Python

Stable Diffusion is a DDPM that runs in the latent space of a pretrained VAE, conditioned on text via cross-attention, sampled with a fast deterministic ODE solver, and steered …

12

Video Understanding — Temporal Modeling

Build
Python

A video is a sequence of images plus the physics that connects them. Every video model either treats time as an extra axis (3D conv), a sequence to attend over (transformer), or…

13

3D Vision: Point Clouds, NeRFs

Build
Python

3D vision comes in two flavours. Point clouds are the sensor's raw output. NeRFs are the learned volumetric field. Both answer "what is where in space."

14

Vision Transformers (ViT)

Build
Python

Cut the image into patches, treat each patch as a word, run a standard transformer. Don't look back.
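The "cut into patches" step is small enough to sketch directly. The example assumes a single-channel image stored as a nested list and a patch size that divides both dimensions; `patchify` is an illustrative name. A real ViT would then apply a learned linear projection and add position embeddings to each flattened patch.

```python
def patchify(image, patch):
    """Split an H x W grid into non-overlapping patch x patch tiles.

    Each tile is flattened into one token (a flat list of pixel values),
    returned in row-major order of the tiles.
    """
    h, w = len(image), len(image[0])
    if h % patch or w % patch:
        raise ValueError("image size must be divisible by the patch size")
    tokens = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            tokens.append([
                image[top + di][left + dj]
                for di in range(patch)
                for dj in range(patch)
            ])
    return tokens


image = [[r * 4 + c for c in range(4)] for r in range(4)]  # 4x4 toy image
tokens = patchify(image, 2)
print(len(tokens), tokens[0])
```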

15

Real-Time Vision: Edge Deployment

Build
Python

Edge inference is the discipline of getting a 90%-accurate model to run at 30 fps on a device with 2 GB of RAM. Every percentage point of accuracy is traded against milliseconds …

16

Build a Complete Vision Pipeline

Build
Python

A production vision system is a chain of models and rules stitched with data contracts. The pieces are already in this phase; the capstone wires them together end-to-end.

17

Self-Supervised Vision — SimCLR, DINO, MAE

Build
Python

Labels are the bottleneck of supervised vision. Self-supervised pretraining removes them: learn visual features from 100M unlabelled images, fine-tune on 10k labelled ones.

18

Open-Vocabulary Vision — CLIP

Build
Python

Train an image encoder and a text encoder together so that matching (image, caption) pairs land at the same point in a shared space. That is the whole trick.

19

OCR & Document Understanding

Build
Python

OCR is a three-stage pipeline — detect text boxes, recognise the characters, then lay them out. Every modern OCR system reorders these stages or merges them.

20

Image Retrieval & Metric Learning

Build
Python

A retrieval system ranks candidates by a distance in embedding space. Metric learning is the discipline of shaping that space so the distances mean what you want.
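A minimal sketch of the ranking step, assuming embeddings have already been computed: the gallery vectors and keys below are invented, and cosine similarity is used as the score (cosine distance is one minus this value), which is one common choice among several.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank(query, gallery):
    """Return gallery keys sorted from most to least similar to the query."""
    return sorted(
        gallery,
        key=lambda k: cosine_similarity(query, gallery[k]),
        reverse=True,
    )


# Hypothetical 2-D embeddings; real ones have hundreds of dimensions.
gallery = {"cat_1": [0.9, 0.1], "cat_2": [0.8, 0.3], "truck": [0.0, 1.0]}
ranking = rank([1.0, 0.0], gallery)
print(ranking)
```

Metric learning is what makes the two cat embeddings land close to the query in the first place; the ranking code itself stays this simple.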

21

Keypoint Detection & Pose Estimation

Build
Python

A pose is a set of ordered keypoints. A keypoint detector is a heatmap regressor. Everything else is bookkeeping.
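Decoding a pose from heatmaps can be sketched as one argmax per keypoint. The 3x3 heatmaps below are toy values, and `decode_keypoints` is an illustrative helper; production decoders usually add sub-pixel refinement, which is omitted here.

```python
def decode_keypoints(heatmaps):
    """For each keypoint heatmap, return (row, col, score) of its peak."""
    keypoints = []
    for hm in heatmaps:
        score, r, c = max(
            ((v, r, c) for r, row in enumerate(hm) for c, v in enumerate(row)),
            key=lambda t: t[0],
        )
        keypoints.append((r, c, score))
    return keypoints


# One heatmap per keypoint, in a fixed order (e.g. nose, then left eye).
heatmaps = [
    [[0.0, 0.1, 0.0], [0.2, 0.9, 0.1], [0.0, 0.1, 0.0]],
    [[0.0, 0.0, 0.7], [0.0, 0.1, 0.2], [0.0, 0.0, 0.0]],
]
pose = decode_keypoints(heatmaps)
print(pose)
```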

22

3D Gaussian Splatting from Scratch

Build
Python

A scene is a cloud of millions of 3D Gaussians. Each one has a position, orientation, scale, opacity, and a colour that depends on viewing direction. Rasterise them, backprop th…

23

Diffusion Transformers & Rectified Flow

Build
Python

The U-Net is not the secret of diffusion. Replace it with a transformer, swap the noise schedule for a straight-line flow, and suddenly you have SD3, FLUX, and every 2026 text-t…

24

SAM 3 & Open-Vocabulary Segmentation

Build
Python

Give a model a text prompt and an image and get masks for every matching object. SAM 3 made that a single forward pass.

25

Vision-Language Models (ViT-MLP-LLM)

Build
Python

A vision encoder converts an image into tokens. An MLP projector maps those tokens into the LLM's embedding space. A language model does the rest. That pattern — ViT-MLP-LLM — i…

26

Monocular Depth & Geometry Estimation

Build
Python

A depth map is a single-channel image where each pixel is a distance from the camera. Predicting it from one RGB frame used to be impossible without stereo or LiDAR. In 2026 a f…

27

Multi-Object Tracking & Video Memory

Build
Python

Tracking is detection plus association. Detect every frame. Match this frame's detections to last frame's tracks so each object keeps its ID.
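The association step can be sketched as greedy matching on box overlap. This assumes tracks and detections are `(x1, y1, x2, y2)` boxes and uses IoU as the only cue; the threshold and the names `iou` and `associate` are choices made for the example. Real trackers often use Hungarian matching plus motion or appearance models instead.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (
        (a[2] - a[0]) * (a[3] - a[1])
        + (b[2] - b[0]) * (b[3] - b[1])
        - inter
    )
    return inter / union if union > 0 else 0.0


def associate(tracks, detections, iou_threshold=0.3):
    """Greedily match tracks {track_id: box} to detections; returns {track_id: index}."""
    pairs = sorted(
        (
            (iou(box, det), tid, j)
            for tid, box in tracks.items()
            for j, det in enumerate(detections)
        ),
        reverse=True,
    )
    matches, used = {}, set()
    for score, tid, j in pairs:
        if score < iou_threshold:
            break  # remaining pairs overlap too little to be the same object
        if tid in matches or j in used:
            continue
        matches[tid] = j
        used.add(j)
    return matches


tracks = {7: (0, 0, 10, 10), 8: (50, 50, 60, 60)}  # last frame's tracks
detections = [(51, 50, 61, 60), (1, 0, 11, 10), (100, 100, 110, 110)]
matches = associate(tracks, detections)
print(matches)
```

Detections left unmatched (index 2 here) would typically start new tracks with fresh IDs.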

28

World Models & Video Diffusion

Build
Python

A video model that predicts the next seconds of a scene is a world simulator. Condition that prediction on actions and you have a learned game engine.