7
16 lessons

Transformers Deep Dive

The architecture that changed everything.

01

Why Transformers: The Problems with RNNs

Learn
Python

RNNs process tokens one at a time. Transformers process all tokens at once. That single architectural bet changed every scaling curve in deep learning after 2017.

02

Self-Attention from Scratch

Build
Python

Attention is a lookup table where every word asks "who matters to me?" - and learns the answer.
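The lookup can be sketched in a few lines of plain Python. This is a minimal sketch, assuming scaled dot-product attention with the learned query, key and value projections left out; the function names are illustrative, not taken from the lesson.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors.

    Each query scores every key ("who matters to me?"), the scores become
    weights via softmax, and the output is the weighted mix of the values.
    """
    d = len(keys[0])
    width = len(values[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values)) for j in range(width)])
    return out

# Hypothetical 2-d embeddings for three tokens, used as queries, keys and values.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed = attention(x, x, x)
```

In a trained model the "learns the answer" part lives in the projection matrices that produce the queries, keys and values; here they are taken as given.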

03

Multi-Head Attention

Build
Python

One attention head learns one relation at a time. Eight heads learn eight. At a fixed model width, extra heads cost almost nothing. Take more of them.
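One way to read the "free" claim is a parameter count. This is a rough sketch, assuming the common setup where the head dimension is `d_model // n_heads` and biases are ignored; `mha_projection_params` is a hypothetical helper.

```python
def mha_projection_params(d_model, n_heads):
    """Weights in the Q, K, V and output projections of one attention layer.

    Assumes head_dim = d_model // n_heads and no bias terms.
    """
    if d_model % n_heads != 0:
        raise ValueError("d_model must be divisible by n_heads")
    head_dim = d_model // n_heads
    qkv_per_head = 3 * d_model * head_dim          # one Q, K and V slice per head
    output_proj = (n_heads * head_dim) * d_model   # concatenated heads back to d_model
    return n_heads * qkv_per_head + output_proj

one_head = mha_projection_params(512, 1)
eight_heads = mha_projection_params(512, 8)
```

Under these assumptions both counts equal `4 * 512 * 512`: more heads slice the same projection into narrower pieces rather than adding weights. The trade-off is that each head then works in a smaller subspace.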

04

Positional Encoding: Sinusoidal, RoPE, ALiBi

Build
Python

Attention is permutation-equivariant. Without a positional signal, "The cat sat on the mat" and "mat the on sat cat the" produce the same output vectors in a different order. Three algorithms fix it, each with …
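The order-blindness is easy to check numerically. This is a minimal sketch, assuming self-attention with identity projections and made-up embeddings: reversing the input only reverses the output rows.

```python
import math

def self_attention(xs):
    """Self-attention with identity projections and no positional signal."""
    d = len(xs[0])
    out = []
    for q in xs:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in xs]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        out.append([sum(e / z * v[j] for e, v in zip(exps, xs)) for j in range(d)])
    return out

# Hypothetical embeddings for three distinct tokens.
tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0]]
forward = self_attention(tokens)
backward = self_attention(tokens[::-1])

# Each token gets the same vector in both orders, so the model cannot tell them apart.
same_up_to_order = all(
    abs(a - b) < 1e-9
    for row_f, row_b in zip(forward, backward[::-1])
    for a, b in zip(row_f, row_b)
)
```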

05

The Full Transformer: Encoder + Decoder

Build
Python

Attention is the star. Everything else — residuals, normalization, feed-forward, cross-attention — is the scaffolding that lets you stack it deep.

06

BERT — Masked Language Modeling

Build
Python

GPT predicts the next word. BERT predicts a missing word. One sentence of difference — and half a decade of everything embedding-shaped.

07

GPT — Causal Language Modeling

Build
Python

BERT sees both sides. GPT sees only the past. The triangle mask is the most consequential single line of code in modern AI.
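The triangle mask itself is small. This is a minimal sketch in plain Python, assuming boolean masks and a per-row softmax; the helper names are illustrative.

```python
import math

def causal_mask(n):
    """Lower-triangular mask: position i may attend to positions 0..i only."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, allowed):
    """Softmax over the allowed positions; masked positions get weight 0."""
    m = max(s for s, ok in zip(scores, allowed) if ok)
    exps = [math.exp(s - m) if ok else 0.0 for s, ok in zip(scores, allowed)]
    z = sum(exps)
    return [e / z for e in exps]

mask = causal_mask(3)
# The first token has a large score for the last token, but the mask hides it.
first_row = masked_softmax([1.0, 5.0, 9.0], mask[0])
```

Tensor libraries usually express the same idea by setting masked scores to negative infinity before the softmax.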

08

T5, BART — Encoder-Decoder Models

Learn
Python

Encoders understand. Decoders generate. Put them back together and you get a model built for input → output tasks: translate, summarize, rewrite, transcribe.

09

Vision Transformers (ViT)

Build
Python

An image is a grid of patches. A sentence is a sequence of tokens. Flatten the grid and the same transformer eats both.
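Turning the grid into tokens is mostly slicing. This is a minimal sketch, assuming a single-channel image stored as a list of rows and square, non-overlapping patches; a real ViT would follow this with a learned linear projection.

```python
def patchify(image, patch):
    """Split an H x W image into flattened patch x patch tokens, row-major."""
    h, w = len(image), len(image[0])
    if h % patch or w % patch:
        raise ValueError("image size must be divisible by patch size")
    tokens = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            tokens.append([
                image[r][c]
                for r in range(top, top + patch)
                for c in range(left, left + patch)
            ])
    return tokens

# A hypothetical 4 x 4 image with pixel values 0..15.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
tokens = patchify(image, 2)
```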

10

Audio Transformers — Whisper Architecture

Learn
Python

Audio is an image of frequency over time. Whisper is an encoder-decoder transformer that eats log-mel spectrograms and writes text back.

11

Mixture of Experts (MoE)

Build
Python

A dense 70B transformer activates every parameter for every token. A 671B MoE activates only 37B per token and can still match or beat it on many benchmarks. Sparsity is the most important scalin…
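Sparse activation comes down to a router that picks a few experts per token. This is a toy sketch, assuming top-k routing with a softmax over the selected logits and scalar "experts"; real routers are learned and differ in detail between models.

```python
import math

def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and renormalize their gate weights."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)[:k]
    m = max(router_logits[i] for i in ranked)
    exps = {i: math.exp(router_logits[i] - m) for i in ranked}
    z = sum(exps.values())
    return {i: e / z for i, e in exps.items()}

def moe_forward(x, experts, router_logits, k=2):
    """Run only the selected experts and mix their outputs by gate weight."""
    gates = top_k_route(router_logits, k)
    return sum(g * experts[i](x) for i, g in gates.items())

# Four hypothetical experts; only two run for this token.
experts = [lambda x: x, lambda x: 2 * x, lambda x: 3 * x, lambda x: 4 * x]
y = moe_forward(1.0, experts, router_logits=[0.1, 2.0, -1.0, 2.0], k=2)
```

Experts 1 and 3 tie for the top score here, so each gets a gate of 0.5 and the other two experts never run.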

12

KV Cache, Flash Attention & Inference Optimization

Build
Python

Training is parallel and compute-bound. Token-by-token inference is serial and memory-bandwidth-bound. Different bottleneck, different tricks.
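A back-of-the-envelope cache size shows where the memory pressure comes from. This is a rough sketch, assuming one key and one value vector per layer, per KV head, per position; the model shape below is hypothetical.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch=1, bytes_per_value=2):
    """Bytes needed to cache keys and values for every layer and position.

    The leading 2 covers keys plus values; bytes_per_value=2 assumes 16-bit floats.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_value

# Hypothetical shape: 32 layers, 32 KV heads of width 128, 4096-token context.
cache = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096)
gib = cache / 1024**3
```

Under these assumptions a single 4096-token sequence needs 2 GiB of cache, and every decoding step has to read all of it to produce one token.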

13

Scaling Laws

Learn
Python

The 2020 Kaplan paper said: bigger model, lower loss. The 2022 Hoffmann paper said: you were under-training. Compute goes into two buckets — parameters and tokens — and the spli…
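The two buckets can be put into rough numbers. This is a sketch under two commonly quoted approximations, both assumptions here rather than statements from the lesson: training compute of about 6 FLOPs per parameter per token, and a compute-optimal ratio of roughly 20 tokens per parameter.

```python
import math

def compute_optimal_split(flops, tokens_per_param=20.0):
    """Split a training budget into parameters and tokens.

    Assumes flops ~= 6 * params * tokens and tokens ~= tokens_per_param * params.
    Both are rules of thumb, not exact laws.
    """
    n_params = math.sqrt(flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Hypothetical budget of 1.2e20 FLOPs.
params, tokens = compute_optimal_split(1.2e20)
```

With these assumptions the budget lands near one billion parameters trained on about twenty billion tokens; a different ratio shifts the split but not the shape of the calculation.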

14

Build a Transformer from Scratch

Build
Python

Thirteen lessons. One model. No shortcuts.

15

Attention Variants — Sliding Window, Sparse, Differential

Build
Python

Full attention is a circle. Every token sees every token, and memory pays the price. Four variants bend the shape of the circle and recover half the cost.
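One of the variants in the title, the sliding window, fits in a one-line mask. This is a minimal sketch, assuming a causal window in which each token sees itself and the `window - 1` tokens before it; other sliding-window schemes are centered or add global tokens.

```python
def sliding_window_mask(n, window):
    """Causal mask where token i attends to tokens i - window + 1 .. i."""
    return [[i - window < j <= i for j in range(n)] for i in range(n)]

def attended_pairs(mask):
    """Number of (query, key) pairs the mask allows."""
    return sum(sum(row) for row in mask)

full_causal = attended_pairs(sliding_window_mask(8, 8))   # window covers everything
windowed = attended_pairs(sliding_window_mask(8, 3))
```

For 8 tokens the window of 3 allows 21 pairs against 36 for full causal attention, and the gap widens as the sequence grows because the windowed count scales linearly with length.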

16

Speculative Decoding — Draft, Verify, Repeat

Build
Python

Autoregressive decoding is serial. Each token waits for the previous one. Speculative decoding breaks the chain: a cheap model drafts N tokens, the expensive model verifies all …
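The draft-and-verify loop can be sketched with two toy "models". This is a simplified greedy variant, assumed for illustration: a drafted token is kept only if it equals the target model's own greedy choice. Sampling-based speculative decoding uses a probabilistic accept/reject rule instead, and the verification runs as one batched forward pass rather than a Python loop.

```python
def speculative_step(prefix, draft_next, target_next, n_draft):
    """One greedy draft-and-verify round; returns the tokens to append."""
    # Draft: the cheap model proposes n_draft tokens one after another.
    drafted = []
    for _ in range(n_draft):
        drafted.append(draft_next(prefix + drafted))

    # Verify: keep drafted tokens while they match the target's greedy choice.
    accepted = []
    for tok in drafted:
        expected = target_next(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # first mismatch: take the target's token
            return accepted
        accepted.append(tok)

    # Every draft matched, so the target contributes one extra token.
    accepted.append(target_next(prefix + accepted))
    return accepted

# Hypothetical toy models over digit tokens: the draft goes wrong after a 3.
target = lambda seq: (seq[-1] + 1) % 10
draft = lambda seq: 0 if seq[-1] == 3 else (seq[-1] + 1) % 10

new_tokens = speculative_step([1], draft, target, n_draft=4)
```

In this greedy setting the output is identical to what the target model alone would have produced; the draft only changes how many target steps are needed to get there.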