16 lessons

Transformers Deep Dive

The architecture that changed everything.

Why Transformers: The Problems with RNNs

RNNs process tokens one at a time. Transformers process all tokens at once. That single architectural bet changed every scaling curve in deep learning after 2017.

Self-Attention from Scratch

Build

Python

Attention is a lookup table where every word asks "who matters to me?" - and learns the answer.
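As a rough illustration of that lookup, here is a minimal sketch of single-head scaled dot-product self-attention in plain Python. The helper names (`softmax`, `matmul`, `self_attention`) and the nested-list matrix layout are assumptions made for this example, not the lesson's own code.

```python
import math

def softmax(row):
    # Subtract the max for numerical stability before exponentiating.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    # Plain nested-list matrix multiply: (n x k) @ (k x m) -> (n x m).
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def self_attention(x, w_q, w_k, w_v):
    # Project the inputs into queries, keys, and values.
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    # Score every token's query against every token's key, scaled by sqrt(d_k).
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    # Each output row is a weighted average of the value vectors.
    return matmul(weights, v), weights

# Three tokens of dimension 2, with identity projections for readability.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
eye = [[1.0, 0.0], [0.0, 1.0]]
out, weights = self_attention(x, eye, eye, eye)
```

Each row of `weights` sums to 1 and answers "who matters to me?" for one token. In a trained model the three projection matrices are learned rather than fixed.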

Multi-Head Attention

Build

Python

One attention head learns one relation at a time. Eight heads learn eight. Heads split the model dimension rather than add to it, so they cost no extra parameters. Take more of them.

Positional Encoding: Sinusoidal, RoPE, ALiBi

Build

Python

Without a positional signal, self-attention is permutation-equivariant: shuffle the input tokens and the outputs shuffle with them. "The cat sat on the mat" and "mat the on sat cat the" produce the same set of output vectors. Three algorithms fix it, each with …
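Of the three schemes named in the lesson title, the sinusoidal one is the shortest to sketch. The function name `sinusoidal_encoding` and the default `base=10000.0` (the value used in the original Transformer paper) are choices made for this example.

```python
import math

def sinusoidal_encoding(seq_len, d_model, base=10000.0):
    # Even column i holds sin(pos / base^(i/d_model)); column i+1 holds the
    # cosine of the same angle. Frequencies fall as the column index grows.
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            angle = pos / (base ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

pe = sinusoidal_encoding(seq_len=4, d_model=8)
```

Adding `pe[pos]` to the embedding of the token at position `pos` gives each position a distinct signature, so the two word orders above no longer look alike to attention.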

The Full Transformer: Encoder + Decoder

Build

Python

Attention is the star. Everything else — residuals, normalization, feed-forward, cross-attention — is the scaffolding that lets you stack it deep.

BERT — Masked Language Modeling

Build

Python

GPT predicts the next word. BERT predicts a missing word. One sentence of difference — and half a decade of everything embedding-shaped.

GPT — Causal Language Modeling

Build

Python

BERT sees both sides. GPT sees only the past. The triangle mask is the most consequential single line of code in modern AI.
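A minimal sketch of that triangle mask, assuming attention scores arrive as a nested list. The names `causal_mask` and `masked_softmax` are illustrative, not taken from the lesson.

```python
import math

def causal_mask(n):
    # Lower-triangular mask: position i may attend to positions 0..i only.
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask):
    out = []
    for row, allowed in zip(scores, mask):
        # Disallowed positions get -inf, so they receive exactly zero weight.
        masked = [s if ok else -math.inf for s, ok in zip(row, allowed)]
        m = max(masked)  # finite, because the diagonal is always allowed
        exps = [math.exp(s - m) for s in masked]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

# With uniform scores, each token spreads its weight evenly over the past.
uniform = [[0.0] * 3 for _ in range(3)]
weights = masked_softmax(uniform, causal_mask(3))
```

Here the first row of `weights` puts all its mass on position 0, while the last row spreads it evenly over all three positions. No token ever sees the future.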

T5, BART — Encoder-Decoder Models

Learn

Python

Encoders understand. Decoders generate. Put them back together and you get a model built for input → output tasks: translate, summarize, rewrite, transcribe.

Vision Transformers (ViT)

Build

Python

An image is a grid of patches. A sentence is a grid of tokens. The same transformer eats both.
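To make the patch idea concrete, here is a small sketch that cuts a single-channel image into non-overlapping square patches and flattens each into a token vector. The function name `patchify` and the nested-list image format are assumptions for this example. A real ViT would also apply a learned linear projection and add position embeddings.

```python
def patchify(image, patch):
    # image: H x W nested list of pixel values (single channel for brevity).
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0, "image must divide evenly into patches"
    tokens = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            # Flatten one patch x patch block into a token vector, row-major.
            tokens.append([image[r][c]
                           for r in range(top, top + patch)
                           for c in range(left, left + patch)])
    return tokens

# A 4x4 image with pixel values 0..15 becomes four tokens of length 4.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
tokens = patchify(image, patch=2)
```

From here on, the sequence of `tokens` is handled like a sentence of word embeddings.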

Audio Transformers — Whisper Architecture

Learn

Python

Audio is an image of frequency over time. Whisper is an encoder-decoder transformer that eats log-mel spectrograms and speaks back.

Mixture of Experts (MoE)

Build

Python

A dense 70B transformer activates every parameter for every token. A 671B MoE activates only 37B per token and can match or beat it on many benchmarks. Sparsity is the most important scalin…

KV Cache, Flash Attention & Inference Optimization

Build

Python

Training is parallel and FLOP-bound. Inference is serial and memory-bound. Different bottleneck, different tricks.
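One of those inference tricks, the KV cache, can be sketched for a single attention head as follows. The class name `KVCache` and its `step` method are hypothetical. The sketch assumes each new token's query, key, and value vectors have already been computed.

```python
import math

class KVCache:
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, q, k, v):
        # Store this token's key/value once; earlier entries are reused as-is,
        # so each decode step costs O(sequence length) instead of recomputing all.
        self.keys.append(k)
        self.values.append(v)
        scale = math.sqrt(len(q))
        scores = [sum(a * b for a, b in zip(q, key)) / scale for key in self.keys]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted average of all cached values for the newest token.
        return [sum(w * val[j] for w, val in zip(weights, self.values))
                for j in range(len(v))]

cache = KVCache()
out1 = cache.step(q=[1.0, 0.0], k=[1.0, 0.0], v=[2.0, 0.0])
out2 = cache.step(q=[0.0, 1.0], k=[0.0, 1.0], v=[0.0, 4.0])
```

The cache trades memory for compute: it grows by one key and one value per generated token, which is why long-context inference tends to be memory-bound.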

Scaling Laws

Learn

Python

The 2020 Kaplan paper said: bigger model, lower loss. The 2022 Hoffmann paper said: you were under-training. Compute goes into two buckets — parameters and tokens — and the spli…

Build a Transformer from Scratch

Build

Python

Thirteen lessons. One model. No shortcuts.

Attention Variants — Sliding Window, Sparse, Differential

Build

Python

Full attention is a circle. Every token sees every token, and memory pays the price. Four variants bend the shape of the circle and recover half the cost.

Speculative Decoding — Draft, Verify, Repeat

Build

Python

Autoregressive decoding is serial. Each token waits for the previous one. Speculative decoding breaks the chain: a cheap model drafts N tokens, the expensive model verifies all …
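A minimal sketch of one draft-and-verify round under greedy decoding. The callables `draft_next` and `target_next` are hypothetical stand-ins for the cheap and expensive models: each takes a token list and returns the next token. A real system would score all drafted positions in one batched forward pass of the expensive model, while this sketch calls it once per position for clarity.

```python
def speculative_step(prefix, draft_next, target_next, n_draft):
    # 1. Draft: the cheap model proposes n_draft tokens autoregressively.
    draft = []
    for _ in range(n_draft):
        draft.append(draft_next(prefix + draft))
    # 2. Verify: keep drafted tokens while they match the target's greedy choice.
    accepted = []
    for tok in draft:
        expected = target_next(prefix + accepted)
        if tok != expected:
            # First mismatch: take the target's token instead and stop.
            accepted.append(expected)
            return prefix + accepted
        accepted.append(tok)
    # Every draft was accepted: the target contributes one bonus token.
    accepted.append(target_next(prefix + accepted))
    return prefix + accepted

# Toy models: the target counts upward mod 5; the draft agrees for three tokens.
target_next = lambda seq: len(seq) % 5
draft_next = lambda seq: len(seq) % 5 if len(seq) < 3 else 0
result = speculative_step([], draft_next, target_next, n_draft=4)
```

Under greedy decoding the result matches what the expensive model would have produced alone. The gain is that each round can emit several tokens for one verification pass.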