Capstone Projects
17 end-to-end products + 9 deep-build tracks. 20-40 hours per project; 4-12 lessons per track.
Terminal-Native Coding Agent
By 2026 the shape of a coding agent is settled. A TUI harness, a stateful plan, a sandboxed tool surface, a loop that plans, acts, observes, recovers. Claude Code, Cursor 3, and…
RAG over Codebase (Cross-Repo Semantic Search)
Every serious engineering org in 2026 runs an internal code search that understands meaning, not just strings. Sourcegraph Amp, Cursor's codebase answers, Augment's enterprise g…
Real-Time Voice Assistant (ASR → LLM → TTS)
A voice agent that feels right has end-to-end latency under 800ms, knows when you have stopped talking, handles barge-in, and can call a tool without stalling. Retell, Vapi, Liv…
Multimodal Document QA (Vision-First)
The 2026 document-QA frontier moved away from OCR-then-text and toward vision-first late interaction. ColPali, ColQwen2.5, and ColQwen3-omni treat each PDF page as an image, emb…
Autonomous Research Agent (AI-Scientist Class)
Sakana's AI-Scientist-v2 published full papers. Agent Laboratory ran the experiments. Allen AI shared traces. The 2026 shape is plan-execute-verify tree search over experiments,…
DevOps Troubleshooting Agent for Kubernetes
AWS's DevOps Agent went GA, Resolve AI published its K8s playbooks, NeuBird demoed semantic monitoring, and Metoro tied AI SRE to per-service SLOs. The production shape is settl…
End-to-End Fine-Tuning Pipeline
An 8B model trained on your own data, DPO-aligned on your own preferences, quantized, speculative-decoded, and served at measurable $/1M tokens. The 2026 open stack is Axolotl v…
Production RAG Chatbot (Regulated Vertical)
Harvey, Glean, Mendable, and LlamaCloud all run the same production shape in 2026. Ingest with docling or Unstructured and ColPali for visuals. Hybrid search. Re-rank with bge-r…
Code Migration Agent (Repo-Level Upgrade)
Amazon's MigrationBench (Java 8 to 17) and Google's App Engine Py2-to-Py3 migrator set the 2026 bar. Moderne's OpenRewrite does deterministic AST rewrites at scale. Grit targets…
Multi-Agent Software Engineering Team
SWE-AF's factory architecture, MetaGPT's role-based prompting, AutoGen 0.4's typed actor graph, Cognition's Devin, and Factory's Droids all converged on the same 2026 shape: an …
LLM Observability & Eval Dashboard
Langfuse went open-core. Arize Phoenix published the 2026 GenAI semconv mappings. Helicone and Braintrust both doubled down on per-user cost attribution. Traceloop's OpenLLMetry…
Video Understanding Pipeline (Scene → QA)
Twelve Labs productized Marengo + Pegasus. VideoDB shipped the CRUD-for-video API. AI2's Molmo 2 published open VLM checkpoints. Gemini long-context handles hours of video nativ…
MCP Server with Registry and Governance
The Model Context Protocol stopped being the future and became the default tool-use spec in 2026. Anthropic, OpenAI, Google, and every major IDE ship MCP clients. Pinterest publ…
Speculative-Decoding Inference Server
EAGLE-3 in vLLM 0.7 ships 2.5-3x throughput on real traffic. P-EAGLE (AWS 2026) pushed parallel speculation even further. SGLang's SpecForge trained draft heads at scale. Red Ha…
Constitutional Safety Harness + Red-Team Range
Anthropic's Constitutional Classifiers, Meta's Llama Guard 4, Google's ShieldGemma-2, NVIDIA's Nemotron 3 Content Safety, and X-Guard for multilingual coverage defined the 2026 …
GitHub Issue-to-PR Autonomous Agent
AWS Remote SWE Agents, Cursor Background Agents, OpenAI Codex cloud, and Google Jules all ship the same 2026 product shape: label an issue, get a PR. Run an agent in a cloud san…
Personal AI Tutor (Adaptive, Multimodal)
Khanmigo (Khan Academy), Duolingo Max, Google LearnLM / Gemini for Education, Quizlet Q-Chat, and Synthesis Tutor all shipped adaptive multimodal tutoring at scale in 2026. The …
Agent Harness Loop Contract
The harness is the agent. The model is a coprocessor. This lesson freezes the loop contract you can wire any model into.
Tool Registry with Schema Validation
A tool the agent cannot validate is a tool the agent cannot call. Build the registry and the schema checker before you build the tools.
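As a rough illustration of "registry first, tools second", the sketch below validates a call against a schema before dispatching it. The `Tool` and `ToolRegistry` names and the name-to-type schema format are hypothetical stand-ins, not the lesson's actual design; a real registry would more likely use JSON Schema.

```python
from dataclasses import dataclass
from typing import Any, Callable

# Hypothetical mini-schema: argument name -> required Python type.
Schema = dict[str, type]


@dataclass
class Tool:
    name: str
    schema: Schema
    fn: Callable[..., Any]


class ToolRegistry:
    def __init__(self) -> None:
        self._tools: dict[str, Tool] = {}

    def register(self, tool: Tool) -> None:
        if tool.name in self._tools:
            raise ValueError(f"duplicate tool: {tool.name}")
        self._tools[tool.name] = tool

    def validate(self, name: str, args: dict[str, Any]) -> list[str]:
        """Return a list of problems; an empty list means the call is valid."""
        tool = self._tools.get(name)
        if tool is None:
            return [f"unknown tool: {name}"]
        errors = []
        for key, expected in tool.schema.items():
            if key not in args:
                errors.append(f"missing argument: {key}")
            elif not isinstance(args[key], expected):
                errors.append(f"{key}: expected {expected.__name__}")
        for key in args:
            if key not in tool.schema:
                errors.append(f"unexpected argument: {key}")
        return errors

    def call(self, name: str, args: dict[str, Any]) -> Any:
        # Refuse to dispatch anything that fails validation.
        errors = self.validate(name, args)
        if errors:
            raise ValueError("; ".join(errors))
        return self._tools[name].fn(**args)


registry = ToolRegistry()
registry.register(Tool("add", {"a": int, "b": int}, lambda a, b: a + b))
print(registry.call("add", {"a": 1, "b": 2}))
print(registry.validate("add", {"a": 1}))
```

Returning a list of errors rather than raising on the first one lets the harness hand the model a complete correction message in one turn.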
JSON-RPC 2.0 Over Newline-Delimited Stdio
The transport between a model client and a tool server is JSON-RPC over stdio. Hand-rolling it once teaches you what every framing layer is paying for.
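A minimal sketch of the framing idea, assuming one JSON object per line and an in-memory stream standing in for stdio. The helper names (`write_message`, `read_message`, `handle`) are hypothetical; notifications, batching, and the other JSON-RPC 2.0 error codes are left out.

```python
import io
import json


def write_message(stream, payload: dict) -> None:
    # json.dumps escapes newlines inside strings, so each frame is one line.
    stream.write(json.dumps(payload) + "\n")
    stream.flush()


def read_message(stream):
    line = stream.readline()
    if not line:  # empty string means end of stream
        return None
    return json.loads(line)


def handle(request: dict, methods: dict) -> dict:
    fn = methods.get(request.get("method"))
    if fn is None:
        # -32601 is the JSON-RPC 2.0 "Method not found" error code.
        return {"jsonrpc": "2.0", "id": request.get("id"),
                "error": {"code": -32601, "message": "Method not found"}}
    result = fn(**request.get("params", {}))
    return {"jsonrpc": "2.0", "id": request.get("id"), "result": result}


# Round trip through an in-memory pipe (a stand-in for the server's stdin).
pipe = io.StringIO()
write_message(pipe, {"jsonrpc": "2.0", "id": 1, "method": "echo",
                     "params": {"text": "hi\nthere"}})
pipe.seek(0)
request = read_message(pipe)
response = handle(request, {"echo": lambda text: text})
print(response)
```

The embedded newline in the payload survives because it travels as the two-character escape `\n` inside the JSON string, which is the property newline-delimited framing relies on.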
Function Call Dispatcher
The dispatcher is where the harness pays for every promise the schema made. Timeouts, retries, dedupe, error mapping. All on one seam.
Plan-Execute Control Flow
A plan that cannot survive a failure is a script. A script that can replan is an agent. Build the replanner first.
Verification Gates and Observation Budget
An agent harness without a verification layer is a wish in a trenchcoat. This lesson builds the deterministic gate chain that decides whether a tool call is allowed to fire, how…
Sandbox Runner with Denylist and Path Jail
The verification gate decides whether a tool call should run. The sandbox decides what happens when it does. This lesson ships a subprocess runner that refuses dangerous executa…
Eval Harness with Fixture Tasks
A coding agent is only as good as the suite of tasks you measure it against. This lesson builds an evaluation harness that takes a folder of fixture tasks, runs each through a c…
Observability with OTel GenAI Spans and Prometheus Metrics
An agent harness without observability is a black box that costs money. This lesson hand-rolls a span builder that emits records compliant with the OpenTelemetry GenAI semantic …
End-to-End Coding Agent on the Harness
Track A's payoff. This lesson stitches the gate chain, the sandbox, the eval harness, and the OTel spans into one working coding agent that fixes a real (small, fixture-scale) b…
BPE Tokenizer From Scratch
Bytes in, ids out, ids back to the same bytes. Build the tokenizer that every modern text model still starts from.
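The round-trip property can be sketched with a toy byte-level BPE. This is an illustrative version under simple assumptions (greedy most-frequent-pair merges, no pre-tokenization, no special tokens), and the function names are hypothetical rather than the lesson's own.

```python
from collections import Counter


def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(data: bytes, num_merges: int):
    ids = list(data)  # start from raw byte values 0..255
    merges = {}       # (left_id, right_id) -> new_id, in learned order
    for step in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        pair = max(pairs, key=pairs.get)
        new_id = 256 + step
        merges[pair] = new_id
        ids = merge(ids, pair, new_id)
    return merges


def encode(data: bytes, merges):
    ids = list(data)
    for pair, new_id in merges.items():  # dicts keep insertion order
        ids = merge(ids, pair, new_id)
    return ids


def decode(ids, merges) -> bytes:
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids)


text = "low lower lowest".encode("utf-8")
merges = train_bpe(text, num_merges=5)
ids = encode(text, merges)
print(len(text), "bytes ->", len(ids), "ids")
```

Because the base vocabulary is all 256 byte values, any input encodes without an unknown token, and decoding is just concatenating the byte strings behind each id.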
Tokenized Dataset with Sliding Window
A pretraining run is a function from token ids to gradients. This lesson builds the conveyor that feeds the ids in.
Token and Positional Embeddings
Ids are integers. The model wants vectors. Two lookup tables sit between them, and the choice of the positional one shapes what the model can learn.
Multi-Head Self-Attention
One linear projection, three views, H parallel heads, one mask. The attention block as the model actually uses it.
Transformer Block from Scratch
One block is the unit of every modern decoder LLM. Layer norm, multi-head attention, residual, MLP, residual. The pre-LN variant trains stably without warmup. The post-LN varian…
GPT Model Assembly
Twelve blocks stacked, a token embedding, a learned position embedding, a final LayerNorm, and a tied language model head. That is the entire 124-million-parameter GPT model. Th…
Training Loop and Evaluation
A loop that does not measure is a loop that lies. This lesson builds the training loop that drives the GPT model: AdamW with weight decay split, a warmup plus cosine learning ra…
Loading Pretrained Weights
Training a 124-million-parameter model from scratch is a budget decision; loading a published checkpoint is a Tuesday. This lesson loads pretrained GPT-2-style weights from a sa…
Classifier Fine-Tuning by Head Swap
Track B's first capstone. A pretrained language model is a stack of self-attention blocks ending in a token-prediction head. When you want spam vs ham, the head is wrong but the…
Instruction Tuning by Supervised Fine-Tuning
A pretrained base model can extend a sequence but cannot follow an instruction. Supervised fine-tuning is the smallest change that fixes this: feed the model paired examples of …
Direct Preference Optimization from Scratch
Reward models and PPO are the classical RLHF stack. DPO collapses that stack into a single supervised loss that fits a policy directly against preference pairs. This lesson deri…
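For orientation, the per-pair DPO loss can be written as a plain function of four sequence log-probabilities. This is a scalar sketch of the standard formula, loss = -log sigmoid(beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))); the argument names and the default `beta=0.1` are illustrative assumptions, and a real implementation would operate on batched tensors.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    # How much more the policy prefers chosen over rejected than the
    # frozen reference model does, scaled by beta.
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed stably.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# A policy identical to the reference sits at log(2).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
# Raising the chosen log-prob relative to the reference lowers the loss.
print(dpo_loss(-8.0, -12.0, -10.0, -12.0))
```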
Full Evaluation Pipeline
Training is the part you can monitor with loss curves. Evaluation is the part you have to design. This lesson builds a unified eval pipeline that takes any trained language mode…
Large Corpus Downloader
Training a language model begins long before the first forward pass. The corpus has to land on disk, decompressed, deduplicated, and addressable, with the resume story already w…
HDF5 Tokenized Corpus
The downloaded corpus has to land in a layout the trainer can stream from at line speed. JSONL on disk does not survive 16 dataloader workers. HDF5 with a resizable, chunked int…
Cosine LR with Linear Warmup
The learning-rate schedule is the second most important decision after the loss function. AdamW with a cosine decay and a linear warmup is the modern default for language-model …
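A sketch of the schedule shape, assuming a linear ramp to `max_lr` over `warmup_steps` followed by a cosine decay to `min_lr` at `total_steps`. The function and parameter names are illustrative, not taken from the lesson.

```python
import math


def lr_at(step: int, max_lr: float, min_lr: float,
          warmup_steps: int, total_steps: int) -> float:
    if step < warmup_steps:
        # Linear warmup: reaches max_lr on the last warmup step.
        return max_lr * (step + 1) / warmup_steps
    if step >= total_steps:
        return min_lr
    # Cosine decay from max_lr down to min_lr over the remaining steps.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 9, 10, 55, 100):
    print(step, round(lr_at(step, 1.0, 0.1, 10, 100), 4))
```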
Gradient Clipping and Mixed Precision
The optimizer and schedule from the previous lesson assume gradients are sane. They usually are not. A single bad batch can spike the gradient norm by three orders of magnitude.…
Gradient Accumulation
Train at an effective batch you cannot afford, one micro-batch at a time. Scale the loss, hold the optimizer step, and let the gradients pile up.
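Why the loss is scaled can be checked on a one-parameter toy model. In the sketch below (hypothetical function names, mean-squared-error loss, equal-sized micro-batches assumed), dividing each micro-batch gradient by the number of micro-batches reproduces the full-batch gradient.

```python
def grad_mse(w: float, xs, ys) -> float:
    # d/dw of mean((w * x - y) ** 2) over the batch.
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w: float, xs, ys, micro_batch_size: int) -> float:
    # Assumes len(xs) is divisible by micro_batch_size.
    n_micro = len(xs) // micro_batch_size
    grad = 0.0
    for i in range(n_micro):
        lo = i * micro_batch_size
        hi = lo + micro_batch_size
        # Scaling each micro-batch loss by 1/n_micro scales its gradient too.
        grad += grad_mse(w, xs[lo:hi], ys[lo:hi]) / n_micro
    return grad  # the optimizer step would be taken once, after the loop


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.3, 13.9, 16.2]
print(grad_mse(0.5, xs, ys), accumulated_grad(0.5, xs, ys, micro_batch_size=2))
```

Without the 1/n_micro scaling the accumulated gradient would be n_micro times too large, which behaves like an unintended learning-rate increase.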
Checkpoint Save and Resume
Interruptions kill training runs; checkpoints let them continue. Save model, optimizer, scheduler, loss history, step counter, and RNG state atomically, so a kill at any moment leave…
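One common way to get the atomic part is write-to-temp-then-rename. The sketch below is an assumption about the approach, not the lesson's code: it uses a JSON dict as a stand-in for real model and optimizer state, and the function names are hypothetical.

```python
import json
import os
import tempfile


def save_checkpoint(path: str, state: dict) -> None:
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temp file in the same directory so the rename stays on
    # one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # push bytes to disk before the rename
        # os.replace swaps the file in one step: readers see either the
        # old checkpoint or the new one, never a partial write.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path: str) -> dict:
    with open(path) as f:
        return json.load(f)


with tempfile.TemporaryDirectory() as workdir:
    ckpt = os.path.join(workdir, "ckpt.json")
    save_checkpoint(ckpt, {"step": 1, "loss_history": [2.5]})
    save_checkpoint(ckpt, {"step": 2, "loss_history": [2.5, 2.1]})
    restored = load_checkpoint(ckpt)
    leftover = os.listdir(workdir)
print(restored, leftover)
```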
Distributed Data Parallel and FSDP from Scratch
Multi-rank training is two collectives and one rule. Broadcast the parameters at startup, average the gradients after backward, never let the ranks disagree about what step they…
Language Model Evaluation Harness
A model that does well on a task you cannot define is a model that does well by accident. The harness is the task definition, the metric, the runner, and the leaderboard, in one…
Hypothesis Generator
A research agent that asks the same question twice is wasting tokens. The trick is forcing each draft to land somewhere new.
Literature Retrieval
A hypothesis is cheap. Knowing whether someone already proved it is the expensive part. Build the retrieval layer that answers that question before the runner spins up a sandbox.
Experiment Runner
The loop is only as honest as its measurements. Build the runner that takes a spec, executes it in a sandboxed subprocess, and emits a JSON metrics blob the evaluator can trust.
Result Evaluator
The runner produced numbers. The evaluator decides whether those numbers are an improvement, a regression, or noise. Build the verdict path that turns metrics into a one line co…
Paper Writer
A LaTeX skeleton is a contract between the researcher and the typesetter. If the contract is broken the document does not compile, and the failure is loud. Build the skeleton fi…
Critic Loop
A critic that returns "looks good" the first time is broken. A critic that always returns "needs work" is broken. The interesting critic is the one that converges, and you have …
Iteration Scheduler
A research loop without a scheduler is a queue with delusions. The scheduler is where the loop decides what to stop exploring, and that decision is the whole game.
End-to-End Research Demo
A demo is the place where every contract you wrote earlier has to compose. If any one of them leaks, the demo is the lesson that catches it.
Vision Encoder Patches
A vision model that reads pixels needs a tokenizer for pixels. Patch embedding is that tokenizer. Cut the image into a grid of squares, flatten each square, project it through o…
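The cut-and-flatten step can be sketched on a plain nested list. This is an illustrative, single-channel version with a hypothetical `patchify` name; it assumes the image height and width divide evenly by the patch size, and it omits the projection step.

```python
def patchify(image, patch: int):
    """Split an H x W grid into flattened patch vectors, row-major order."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([image[top + dy][left + dx]
                            for dy in range(patch)
                            for dx in range(patch)])
    return patches


# A 4 x 4 "image" whose pixel values are 0..15.
image = [[row * 4 + col for col in range(4)] for row in range(4)]
patches = patchify(image, patch=2)
print(patches)
```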
Vision Transformer Encoder
Patches alone do not see. A 12-layer pre-LN transformer with 12 attention heads turns the sequence of patch tokens into a sequence of contextual tokens, with the CLS token pooli…
Projection Layer for Modality Alignment
A vision encoder produces image tokens. A text decoder consumes text tokens. The two live in different vector spaces. A small two-layer MLP projects image tokens into the text e…
Cross-Attention Fusion
The projection layer aligns one image vector with one caption vector. A real vision-language decoder needs every text token to attend to every patch token, so the model can grou…
Vision-Language Pretraining
The encoder, projection, and decoder are wired. Now train them together. Two objectives drive learning: a contrastive image-text loss (InfoNCE) that pulls matching pairs togethe…
Multimodal Evaluation
Training is half the loop. The other half is measurement. This lesson builds three evaluation surfaces from primitives: image-caption retrieval reported as R@1, R@5, R@10; visua…
Chunking Strategies, Compared
Chunking decides what your retriever can ever surface. Get the boundaries wrong and no embedding model, no reranker, no LLM can repair the damage downstream.
Hybrid Retrieval with BM25 and Dense Embeddings
Lexical and semantic retrieval fail on opposite query distributions. Hybrid retrieval with reciprocal rank fusion does not interpolate; it votes, and the vote wins on every que…
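The "voting" in reciprocal rank fusion uses only ranks, never raw scores. A minimal sketch, assuming each retriever returns an ordered list of document ids and using the commonly cited constant k = 60; the function name and the alphabetical tie-break are illustrative choices.

```python
def reciprocal_rank_fusion(rankings, k: int = 60):
    """Fuse ranked lists: each list adds 1 / (k + rank) to a document's score."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


bm25_ranking = ["a", "b", "c"]   # hypothetical lexical results
dense_ranking = ["c", "a", "d"]  # hypothetical embedding results
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)
```

Because BM25 scores and cosine similarities live on different scales, fusing by rank avoids having to normalize or weight them against each other.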
Cross-Encoder Reranker
A bi-encoder embeds query and document independently. A cross-encoder concatenates them and reads both at once. The cross-encoder is the smartest reader and the slowest. Used as…
Query Rewriting: HyDE, Multi-Query, and Decomposition
The query the user types is not the query your retriever wants. Rewriting bridges the gap before retrieval, so the index sees something closer to what the answer looks like.
RAG Evaluation: Precision, Recall, MRR, nDCG, Faithfulness, Answer Relevance
If you cannot grade your retrieval and your answer at the same time, you cannot ship the system. The two are not the same metric and the same prompt fails on different axes.
End-to-End RAG System
Six lessons of components. One pipeline. One eval loop. One self-terminating demo. This is the system you ship.
Task Spec Format
An eval harness is only as good as the contract its tasks honour. Freeze the JSONL shape and the metric vocabulary before you write a single scoring function.
Classical Metrics
BLEU, ROUGE-L, F1, exact-match, accuracy. Five metrics that still account for most published LLM eval numbers. Implement each from first principles so you know what the number m…
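Two of the five are small enough to sketch here. The version below assumes lowercase, whitespace-split tokens with no punctuation stripping; published benchmarks each apply their own normalization, so treat these functions as illustrative rather than as any benchmark's official scorer.

```python
from collections import Counter


def exact_match(prediction: str, reference: str) -> float:
    return float(prediction.strip().lower() == reference.strip().lower())


def token_f1(prediction: str, reference: str) -> float:
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection counts each shared token at most as often
    # as it appears on both sides.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match(" Paris ", "paris"))
print(token_f1("the cat sat", "the cat"))
```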
Code Exec Metric
Generated code is right when it passes the tests. The eval harness has to extract code, run it without crashing the host, and tally pass-rates honestly. This lesson builds that …
Perplexity and Calibration
If your model says 90 percent confident on a thousand answers and gets six hundred right, it is not well calibrated. Calibration is half of trustworthy eval. The other half is p…
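The 90-percent-confident, 600-of-1000-correct case can be turned into a number with expected calibration error (ECE). The sketch below assumes equal-width confidence bins and a hypothetical function name; it is one common formulation, not necessarily the one the lesson uses.

```python
def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Confidence 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(conf for conf, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        # Weight each bin's confidence/accuracy gap by its share of samples.
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece


confidences = [0.9] * 1000
correct = [True] * 600 + [False] * 400
print(expected_calibration_error(confidences, correct))
```

Here every prediction lands in one bin, so the ECE is simply the gap between stated confidence (0.9) and observed accuracy (0.6).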
Leaderboard Aggregation
Per-task scores are easy. Per-model rankings across heterogeneous tasks are harder. Statistical significance on a thousand-prediction leaderboard is the part everyone skips. Thi…
End-to-End Eval Runner
Five lessons of plumbing, one lesson to glue them. The runner reads the task spec from lesson 70, calls a model through an adapter, scores with lessons 71 and 72, attaches the c…
Collective Ops From Scratch
The four collective operations that hold distributed training together are allreduce, broadcast, allgather, and reduce_scatter. Every other primitive a training framework offers…
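The semantics of the four operations can be simulated in one process by treating each rank's buffer as a list. This is a toy model with hypothetical function names, sum as the reduction, and buffer length assumed divisible by the number of ranks; no real communication happens.

```python
def broadcast(buffers, root: int = 0):
    # Every rank ends up with a copy of the root's buffer.
    return [list(buffers[root]) for _ in buffers]


def allreduce(buffers):
    # Every rank ends up with the elementwise sum across ranks.
    total = [sum(values) for values in zip(*buffers)]
    return [list(total) for _ in buffers]


def allgather(buffers):
    # Every rank ends up with all ranks' buffers concatenated in rank order.
    gathered = [x for buf in buffers for x in buf]
    return [list(gathered) for _ in buffers]


def reduce_scatter(buffers):
    # Elementwise sum, then rank r keeps only its own contiguous chunk.
    total = [sum(values) for values in zip(*buffers)]
    n_ranks = len(buffers)
    chunk = len(total) // n_ranks
    return [total[r * chunk:(r + 1) * chunk] for r in range(n_ranks)]


ranks = [[1, 2, 3, 4], [10, 20, 30, 40]]
print(allreduce(ranks))
print(allgather(reduce_scatter(ranks)))
```

The two printed results match, which is the usual decomposition of allreduce into a reduce_scatter followed by an allgather.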
Data Parallel DDP From Scratch
DistributedDataParallel is a hook on top of allreduce. Wrap a model, broadcast the initial parameters from rank 0 so every rank starts identical, install a backward hook on ever…
ZeRO Optimizer State Sharding
Adam stores two moment estimates per parameter, both in float32. A 7B-parameter model carries 56 GB of optimiser state. ZeRO stage 1 shards that across N ranks; each rank owns 1…
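The 56 GB figure is 7 billion parameters times two moments times 4 bytes each, with GB taken as 10^9 bytes. A small sketch of that arithmetic and of an even split across ranks follows; the function names are hypothetical, and weights, gradients, and any float32 master copy are deliberately not counted.

```python
def adam_state_bytes(n_params: int, moments: int = 2,
                     bytes_per_value: int = 4) -> int:
    # Two float32 moment estimates per parameter.
    return n_params * moments * bytes_per_value


def zero1_state_bytes_per_rank(n_params: int, n_ranks: int) -> int:
    # Each rank owns a 1/N shard of the parameters' optimizer state
    # (ceiling division covers sizes that do not divide evenly).
    shard_params = -(-n_params // n_ranks)
    return adam_state_bytes(shard_params)


n_params = 7_000_000_000
print(adam_state_bytes(n_params) / 1e9, "GB unsharded")
print(zero1_state_bytes_per_rank(n_params, 8) / 1e9, "GB per rank on 8 ranks")
```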
Pipeline Parallel and Bubble Analysis
Tensor parallelism splits the matrix multiply across ranks. Pipeline parallelism splits the model across ranks, one stage per rank. Microbatches flow through the pipeline. The e…
Sharded Checkpoint and Atomic Resume
A 70B-parameter training job is paused by a node failure every few hours. The checkpoint format decides whether you lose 30 minutes or 30 hours. A sharded checkpoint writes ever…
End-to-End Distributed Training
Lessons 76 through 80 each built one piece. This is the assembly: a tiny GPT trained across 4 simulated ranks with DDP for gradient sync, ZeRO-1 for optimiser-state sharding, an…
Jailbreak Taxonomy
A safety harness without a taxonomy is a coin flip. Name the attack before you defend it.
Prompt Injection Detector
A detector is a function from prompt to confidence and category. Anything else is a vibe.
Refusal Evaluation
Helpfulness on benign prompts and refusal on harmful prompts are two metrics, not one. Measure both.
Content Classifier Integration
Classifiers on the output side answer a different question than rules on the input side. Both need a policy router.
Constitutional Rules Engine
A rule is a name, a predicate, and an explanation. Anything missing one of those three is a vibe, not a rule.
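The three-part definition maps directly onto a small record type. The sketch below is illustrative: the `Rule` class, the `evaluate` helper, and the two example rules are hypothetical, and the predicate convention (True means the text violates the rule) is an assumption.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rule:
    name: str
    predicate: Callable[[str], bool]  # True means the text violates the rule
    explanation: str


def evaluate(rules, text: str):
    """Return (name, explanation) for every rule the text violates."""
    return [(rule.name, rule.explanation)
            for rule in rules if rule.predicate(text)]


# Hypothetical example rules, for illustration only.
rules = [
    Rule("no_ssn",
         lambda text: "ssn:" in text.lower(),
         "Output must not contain social security numbers."),
    Rule("max_length",
         lambda text: len(text) > 200,
         "Output must be at most 200 characters."),
]

print(evaluate(rules, "Here is the record, SSN: 000-00-0000"))
print(evaluate(rules, "All clear."))
```

Carrying the explanation alongside the predicate means every blocked request can be logged with a human-readable reason instead of a bare boolean.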
End-to-End Safety Gate
Pre-gen, during-gen, post-gen. Three checkpoints, one verdict, an audit trail per request.