Why Transformers: The Problems with RNNs
RNNs process tokens one at a time. Transformers process all tokens at once. That single architectural bet changed every scaling curve in deep learning after 2017.
Self-Attention from Scratch
Attention is a soft lookup table where every word asks "who matters to me?" and learns the answer.
Multi-Head Attention
One attention head learns one relation at a time. Eight heads learn eight. Heads are nearly free, since they split the model width instead of adding parameters. Take more of them.
Positional Encoding: Sinusoidal, RoPE, ALiBi
Self-attention is permutation-equivariant. Without a positional signal, "The cat sat on the mat" and "mat the on sat cat the" produce the same token outputs, just reordered. Three schemes fix it, each with …
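The order-blindness claim can be checked in a few lines. What follows is a minimal sketch, not the lesson's implementation: `self_attention` is a hypothetical helper that uses identity Q/K/V projections and toy 2-dimensional embeddings, both assumptions made to keep it short. Shuffling the input tokens shuffles the output rows in the same way and changes nothing else, so the model has no way to tell the two word orders apart.

```python
import math

def self_attention(x):
    # Single-head attention with identity Q/K/V projections (an assumption
    # made for brevity); no positional signal is added anywhere.
    d = len(x[0])
    out = []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * v[i] for w, v in zip(weights, x)) for i in range(d)])
    return out

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy token embeddings
perm = [2, 0, 1]                           # a reordering of the "sentence"

y = self_attention(x)
y_perm = self_attention([x[i] for i in perm])

# Row j of the shuffled output matches row perm[j] of the original output.
same_up_to_order = all(
    math.isclose(a, b, abs_tol=1e-9)
    for j, i in enumerate(perm)
    for a, b in zip(y_perm[j], y[i])
)
print(same_up_to_order)
```

Any order-insensitive summary of the outputs, such as their mean, is therefore identical for both orderings.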
The Full Transformer: Encoder + Decoder
Attention is the star. Everything else — residuals, normalization, feed-forward, cross-attention — is the scaffolding that lets you stack it deep.
BERT — Masked Language Modeling
GPT predicts the next word. BERT predicts a missing word. One sentence of difference — and half a decade of everything embedding-shaped.
GPT — Causal Language Modeling
BERT sees both sides. GPT sees only the past. The triangle mask is the most consequential single line of code in modern AI.
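For a sense of how small the triangle mask is, here is a hedged sketch in plain Python. The names `causal_mask` and `masked_softmax` are illustrative, not taken from any library, and the uniform toy scores are an assumption chosen so the effect is easy to read.

```python
import math

def causal_mask(n):
    # The "triangle": position i may attend to position j only when j <= i.
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask):
    # Row-wise softmax that gives masked-out (future) positions zero weight.
    out = []
    for row, keep in zip(scores, mask):
        m = max(s for s, k in zip(row, keep) if k)
        exps = [math.exp(s - m) if k else 0.0 for s, k in zip(row, keep)]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

scores = [[0.0] * 4 for _ in range(4)]  # toy uniform attention scores
weights = masked_softmax(scores, causal_mask(4))
for row in weights:
    print(row)
```

With uniform scores, each row spreads its weight evenly over the current and earlier positions and puts exactly zero on later ones. That is the sense in which GPT sees only the past.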
T5, BART — Encoder-Decoder Models
Encoders understand. Decoders generate. Put them back together and you get a model built for input → output tasks: translate, summarize, rewrite, transcribe.
Vision Transformers (ViT)
An image is a grid of patches. A sentence is a sequence of tokens. Flatten the grid and the same transformer eats both.
Audio Transformers — Whisper Architecture
Audio is an image of frequency over time. Whisper is an encoder-decoder transformer that eats log-mel spectrograms and writes text back.
Mixture of Experts (MoE)
A dense 70B transformer activates every parameter for every token. A 671B MoE activates only 37B per token and can match or beat it on many benchmarks. Sparsity is the most important scalin…
KV Cache, Flash Attention & Inference Optimization
Training is parallel and FLOP-bound. Inference is serial and memory-bandwidth-bound. Different bottleneck, different tricks.
Scaling Laws
The 2020 Kaplan paper said: bigger model, lower loss. The 2022 Hoffmann paper said: you were under-training. Compute goes into two buckets — parameters and tokens — and the spli…
Build a Transformer from Scratch
Thirteen lessons. One model. No shortcuts.
Attention Variants — Sliding Window, Sparse, Differential
Full attention is a circle. Every token sees every token, and memory pays the price. Four variants bend the shape of the circle and recover half the cost.
Speculative Decoding — Draft, Verify, Repeat
Autoregressive decoding is serial. Each token waits for the previous one. Speculative decoding breaks the chain: a cheap model drafts N tokens, the expensive model verifies all …