Phase 10
24 lessons

LLMs from Scratch

Build, train, and understand large language models.

01

Tokenizers: BPE, WordPiece, SentencePiece

Build
Python, Rust

Your LLM does not read English. It reads integers. The tokenizer decides whether those integers carry meaning or waste it.

02

Building a Tokenizer from Scratch

Build
Python

Lesson 01 gave you a toy. This lesson gives you a weapon.

03

Data Pipelines for Pre-Training

Build
Python

The model is a mirror. It reflects whatever data you feed it. Feed it garbage, it reflects garbage with perfect fluency.

04

Pre-Training a Mini GPT (124M)

Build
Python

GPT-2 Small has 124 million parameters. That's 12 transformer layers, 12 attention heads, and 768-dimensional embeddings. You can train it from scratch on a single GPU in a few …
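The 124 million figure can be checked by hand. The sketch below is a rough tally, not the lesson's own code: it assumes the published GPT-2 Small vocabulary size (50,257), context length (1,024), a 4x MLP expansion, and an output head tied to the token embedding, none of which are stated in the blurb above. The function name is hypothetical. The number of attention heads does not change the count, since the heads split the same 768-wide projections.

```python
# Rough parameter tally for a GPT-2 Small style model.
# Assumed values (not stated in the lesson blurb): vocab size, context
# length, 4x MLP expansion, and a tied output head.
VOCAB_SIZE = 50257
CONTEXT_LEN = 1024
D_MODEL = 768
N_LAYERS = 12
MLP_RATIO = 4


def gpt2_param_count(vocab=VOCAB_SIZE, ctx=CONTEXT_LEN, d=D_MODEL,
                     n_layers=N_LAYERS, mlp_ratio=MLP_RATIO):
    """Count weights and biases, assuming the LM head reuses the token embedding."""
    embeddings = vocab * d + ctx * d            # token + learned position embeddings
    layer_norm = 2 * d                          # scale and shift
    attention = (d * 3 * d + 3 * d) + (d * d + d)   # fused QKV projection + output projection
    hidden = mlp_ratio * d
    mlp = (d * hidden + hidden) + (hidden * d + d)  # up projection + down projection
    block = 2 * layer_norm + attention + mlp
    return embeddings + n_layers * block + layer_norm  # plus the final LayerNorm


total = gpt2_param_count()
print(f"{total:,}")  # 124,439,808
```

Roughly 85M of those parameters sit in the 12 transformer blocks and about 39M in the token embedding table.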

05

Distributed Training, FSDP, DeepSpeed

Build
Python

Your 124M model trained on one GPU. Now try 7 billion parameters. The model doesn't fit in memory. The data takes weeks on a single machine. Distributed training isn't optional …

06

Instruction Tuning — SFT

Build
Python

A base model predicts the next token. That's it. It doesn't follow instructions, answer questions, or refuse harmful requests. SFT is the bridge between a token predictor and a …

07

RLHF — Reward Model + PPO

Build
Python

SFT teaches the model to follow instructions. But it doesn't teach the model which response is BETTER. Two grammatically correct, factually accurate answers can differ enormousl…

08

DPO — Direct Preference Optimization

Build
Python

RLHF works. It also requires training three models (SFT, reward model, policy), managing PPO's instability, and tuning a KL penalty. DPO asks: what if you could skip all of that…

09

Constitutional AI & Self-Improvement

Build
Python

RLHF needs humans in the loop. Constitutional AI replaces most of them with the model itself. Write a list of principles, have the model critique its own outputs against those p…

10

Evaluation — Benchmarks, Evals

Build
Python

Goodhart's Law: when a measure becomes a target, it ceases to be a good measure. Every frontier lab games benchmarks. MMLU scores go up while models still can't reliably count t…

11

Quantization: INT8, GPTQ, AWQ, GGUF

Build
Python

A 70B model in FP16 needs 140GB. Two A100s just for weights. Quantize to FP8: one 80GB GPU. INT4: a MacBook.
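The arithmetic behind those numbers is just parameters times bytes per parameter. The sketch below is a weights-only estimate in decimal gigabytes; it ignores KV cache, activations, and the per-group scales that real quantization formats store, so actual footprints run somewhat higher. The helper names are hypothetical.

```python
import math


def weight_gb(n_params: int, bits_per_param: int) -> float:
    """Weights-only memory in decimal GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


def gpus_needed(gb: float, gpu_gb: int = 80) -> int:
    """Minimum number of GPUs of size gpu_gb needed just to hold the weights."""
    return math.ceil(gb / gpu_gb)


N_PARAMS = 70_000_000_000  # 70B

for name, bits in [("FP16", 16), ("FP8", 8), ("INT4", 4)]:
    gb = weight_gb(N_PARAMS, bits)
    print(f"{name}: {gb:.0f} GB, {gpus_needed(gb)} x 80GB GPU(s)")
```

At INT4 the weights come to about 35 GB, which is why a laptop with enough unified memory can hold them.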

12

Inference Optimization

Build
Python

Two phases define LLM inference. Prefill processes your prompt in parallel -- compute-bound. Decode generates tokens one at a time -- memory-bound. Every optimization targets on…

13

Building a Complete LLM Pipeline

Build
Python

Everything from Lessons 01 to 12 is one stage of one pipeline. This lesson is the scaffold that turns those stages into a single end-to-end run: tokenize, pre-train, scale, SFT,…

14

Open Models: Architecture Walkthroughs

Learn
Python

You built a GPT-2 Small from scratch in Lesson 04. Frontier open models in 2026 are the same family with five or six concrete changes. RMSNorm instead of LayerNorm. SwiGLU inste…

15

Speculative Decoding and EAGLE-3

Build
Python

Phase 7 · Lesson 16 proved the math: the Leviathan rejection rule preserves the verifier's distribution exactly. This lesson is the training-stack view of 2026 production specul…

16

Differential Attention (V2)

Build
Python

Softmax attention spreads a small amount of probability over every non-matching token. Over 100k tokens that noise adds up and drowns the signal. Differential Transformer (Ye et…

17

Native Sparse Attention (DeepSeek NSA)

Build
Python

At 64k tokens, attention eats 70-80% of decode latency. Every open-model lab has a plan to fix it. DeepSeek's NSA (ACL 2025 best paper) is the one that stuck: three parallel att…

18

Multi-Token Prediction (MTP)

Build
Python

Every autoregressive LLM from GPT-2 to Llama 3 trains on one loss per position: predict the next token. DeepSeek-V3 added a second loss per position: predict the token after tha…

19

DualPipe Parallelism

Learn
Python

DeepSeek-V3 was trained on 2,048 H800 GPUs with MoE experts scattered across nodes. Cross-node expert all-to-all communication cost 1 GPU-hour of comm for every 1 GPU-hour of co…

20

DeepSeek-V3 Architecture Walkthrough

Learn
Python

Phase 10 · Lesson 14 named the six architectural knobs every open model turns. DeepSeek-V3 (December 2024, 671B parameters total, 37B active) turns all six and adds four more: M…

21

Jamba — Hybrid SSM-Transformer

Learn
Python

State space models (SSMs) and transformers want different things. Transformers buy quality via attention at quadratic cost. SSMs buy linear-time inference and constant memory vi…

22

Async and Hogwild! Inference

Build
Python

Speculative decoding (Phase 10 · Lesson 15) parallelizes tokens within one sequence. Multi-agent frameworks parallelize across whole sequences but force explicit coordination (voting, …

23

Speculative Decoding and EAGLE

Build
Python

A frontier LLM generating one token requires a full forward pass over billions of parameters. That forward pass is massively over-provisioned: most of the time a much smaller mo…

24

Gradient Checkpointing and Activation Recomputation

Build
Python

Backprop keeps every intermediate activation. At 70B parameters and 128K context that is 3 TB of activations per rank. Checkpointing trades FLOPs for memory: recompute instead o…