85 lessons

Capstone Projects

17 end-to-end products + 9 deep-build tracks. 20-40 hours per project; 4-12 lessons per track.

Terminal-Native Coding Agent

By 2026 the shape of a coding agent is settled. A TUI harness, a stateful plan, a sandboxed tool surface, a loop that plans, acts, observes, recovers. Claude Code, Cursor 3, and…

RAG over Codebase (Cross-Repo Semantic Search)

Capstone

Python

Every serious engineering org in 2026 runs an internal code search that understands meaning, not just strings. Sourcegraph Amp, Cursor's codebase answers, Augment's enterprise g…

Real-Time Voice Assistant (ASR → LLM → TTS)

Capstone

Python

A voice agent that feels right has end-to-end latency under 800ms, knows when you have stopped talking, handles barge-in, and can call a tool without stalling. Retell, Vapi, Liv…

Multimodal Document QA (Vision-First)

Capstone

Python

The 2026 document-QA frontier moved away from OCR-then-text and toward vision-first late interaction. ColPali, ColQwen2.5, and ColQwen3-omni treat each PDF page as an image, emb…

Autonomous Research Agent (AI-Scientist Class)

Capstone

Python

Sakana's AI-Scientist-v2 published full papers. Agent Laboratory ran the experiments. Allen AI shared traces. The 2026 shape is plan-execute-verify tree search over experiments,…

DevOps Troubleshooting Agent for Kubernetes

Capstone

Python

AWS's DevOps Agent went GA, Resolve AI published its K8s playbooks, NeuBird demoed semantic monitoring, and Metoro tied AI SRE to per-service SLOs. The production shape is settl…

End-to-End Fine-Tuning Pipeline

Capstone

Python

An 8B model trained on your own data, DPO-aligned on your own preferences, quantized, speculative-decoded, and served at measurable $/1M tokens. The 2026 open stack is Axolotl v…

Production RAG Chatbot (Regulated Vertical)

Capstone

Python

Harvey, Glean, Mendable, and LlamaCloud all run the same production shape in 2026. Ingest with docling or Unstructured and ColPali for visuals. Hybrid search. Re-rank with bge-r…

Code Migration Agent (Repo-Level Upgrade)

Capstone

Python

Amazon's MigrationBench (Java 8 to 17) and Google's App Engine Py2-to-Py3 migrator set the 2026 bar. Moderne's OpenRewrite does deterministic AST rewrites at scale. Grit targets…

Multi-Agent Software Engineering Team

Capstone

Python

SWE-AF's factory architecture, MetaGPT's role-based prompting, AutoGen 0.4's typed actor graph, Cognition's Devin, and Factory's Droids all converged on the same 2026 shape: an …

LLM Observability & Eval Dashboard

Capstone

Python

Langfuse went open-core. Arize Phoenix published the 2026 GenAI semconv mappings. Helicone and Braintrust both doubled down on per-user cost attribution. Traceloop's OpenLLMetry…

Video Understanding Pipeline (Scene → QA)

Capstone

Python

Twelve Labs productized Marengo + Pegasus. VideoDB shipped the CRUD-for-video API. AI2's Molmo 2 published open VLM checkpoints. Gemini long-context handles hours of video nativ…

MCP Server with Registry and Governance

Capstone

Python

The Model Context Protocol stopped being the future and became the default tool-use spec in 2026. Anthropic, OpenAI, Google, and every major IDE ship MCP clients. Pinterest publ…

Speculative-Decoding Inference Server

Capstone

Python

EAGLE-3 in vLLM 0.7 ships 2.5-3x throughput on real traffic. P-EAGLE (AWS 2026) pushed parallel speculation even further. SGLang's SpecForge trained draft heads at scale. Red Ha…

Constitutional Safety Harness + Red-Team Range

Capstone

Python

Anthropic's Constitutional Classifiers, Meta's Llama Guard 4, Google's ShieldGemma-2, NVIDIA's Nemotron 3 Content Safety, and X-Guard for multilingual coverage defined the 2026 …

GitHub Issue-to-PR Autonomous Agent

Capstone

Python

AWS Remote SWE Agents, Cursor Background Agents, OpenAI Codex cloud, and Google Jules all ship the same 2026 product shape: label an issue, get a PR. Run an agent in a cloud san…

Personal AI Tutor (Adaptive, Multimodal)

Capstone

Python

Khanmigo (Khan Academy), Duolingo Max, Google LearnLM / Gemini for Education, Quizlet Q-Chat, and Synthesis Tutor all shipped adaptive multimodal tutoring at scale in 2026. The …

Agent Harness Loop Contract

Capstone

Python

The harness is the agent. The model is a coprocessor. This lesson freezes the loop contract you can wire any model into.

Tool Registry with Schema Validation

Capstone

Python

A tool the agent cannot validate is a tool the agent cannot call. Build the registry and the schema checker before you build the tools.
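A minimal sketch of the "registry first, tools second" idea, assuming a deliberately simplified schema (a dict mapping field name to Python type) rather than full JSON Schema. The `ToolRegistry` class, its method names, and the `add` tool are hypothetical, chosen only for illustration.

```python
class ToolRegistry:
    """Hypothetical registry: a tool is callable only if its arguments validate."""

    def __init__(self):
        self._tools = {}  # name -> (schema, function)

    def register(self, name, schema, fn):
        # schema is a simplified stand-in for JSON Schema: {field_name: python_type}
        self._tools[name] = (schema, fn)

    def validate(self, name, args):
        # Return a list of human-readable problems; an empty list means valid.
        if name not in self._tools:
            return [f"unknown tool: {name}"]
        schema, _ = self._tools[name]
        errors = []
        for field, expected in schema.items():
            if field not in args:
                errors.append(f"missing field: {field}")
            elif not isinstance(args[field], expected):
                errors.append(f"wrong type for {field}: expected {expected.__name__}")
        for field in args:
            if field not in schema:
                errors.append(f"unexpected field: {field}")
        return errors

    def call(self, name, args):
        # Refuse to dispatch anything that fails validation.
        errors = self.validate(name, args)
        if errors:
            raise ValueError("; ".join(errors))
        return self._tools[name][1](**args)


registry = ToolRegistry()
registry.register("add", {"a": int, "b": int}, lambda a, b: a + b)
result = registry.call("add", {"a": 2, "b": 3})
problems = registry.validate("add", {"a": "2"})
```

Here `validate` returns every problem at once instead of stopping at the first, so the model receives one complete correction message per failed call.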

JSON-RPC 2.0 Over Newline-Delimited Stdio

Capstone

Python

The transport between a model client and a tool server is JSON-RPC over stdio. Hand-rolling it once teaches you what every framing layer is paying for.
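A small sketch of what newline-delimited framing can look like, assuming one JSON object per line. The helper names (`encode_frame`, `read_frames`, `handle`) and the `echo` method are hypothetical. An in-memory byte stream stands in for stdin so the example is self-contained.

```python
import io
import json


def encode_frame(message: dict) -> bytes:
    # json.dumps escapes newlines inside strings, so each message stays on one line.
    return (json.dumps(message, separators=(",", ":")) + "\n").encode("utf-8")


def read_frames(stream):
    # Yield one decoded message per non-empty line of the byte stream.
    for raw in stream:
        line = raw.strip()
        if line:
            yield json.loads(line)


def handle(request: dict) -> dict:
    # Hypothetical server exposing a single "echo" method.
    if request.get("method") == "echo":
        return {"jsonrpc": "2.0", "id": request["id"], "result": request.get("params")}
    # -32601 is the JSON-RPC 2.0 error code for "Method not found".
    return {
        "jsonrpc": "2.0",
        "id": request.get("id"),
        "error": {"code": -32601, "message": "Method not found"},
    }


# Two requests written back to back, as they would arrive on stdin.
wire = io.BytesIO(
    encode_frame({"jsonrpc": "2.0", "id": 1, "method": "echo", "params": {"text": "hi\nthere"}})
    + encode_frame({"jsonrpc": "2.0", "id": 2, "method": "missing"})
)
responses = [handle(request) for request in read_frames(wire)]
```

The embedded `\n` in the first request's params shows why the framing relies on JSON string escaping: the payload contains a newline, but the frame still occupies exactly one line.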

Function Call Dispatcher

Capstone

Python

The dispatcher is where the harness pays for every promise the schema made. Timeouts, retries, dedupe, error mapping. All on one seam.

Plan-Execute Control Flow

Capstone

Python

A plan that cannot survive a failure is a script. A script that can replan is an agent. Build the replanner first.

Verification Gates and Observation Budget

Capstone

Python

An agent harness without a verification layer is a wish in a trenchcoat. This lesson builds the deterministic gate chain that decides whether a tool call is allowed to fire, how…

Sandbox Runner with Denylist and Path Jail

Capstone

Python

The verification gate decides whether a tool call should run. The sandbox decides what happens when it does. This lesson ships a subprocess runner that refuses dangerous executa…

Eval Harness with Fixture Tasks

Capstone

Python

A coding agent is only as good as the suite of tasks you measure it against. This lesson builds an evaluation harness that takes a folder of fixture tasks, runs each through a c…

Observability with OTel GenAI Spans and Prometheus Metrics

Capstone

Python

An agent harness without observability is a black box that costs money. This lesson hand-rolls a span builder that emits records compliant with the OpenTelemetry GenAI semantic …

End-to-End Coding Agent on the Harness

Capstone

Python

Track A's payoff. This lesson stitches the gate chain, the sandbox, the eval harness, and the OTel spans into one working coding agent that fixes a real (small, fixture-scale) b…

BPE Tokenizer From Scratch

Capstone

Python

Bytes in, ids out, ids back to the same bytes. Build the tokenizer that every modern text model still starts from.
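The round-trip property in this blurb can be sketched with a tiny byte-level BPE. This is an illustrative toy under stated assumptions (greedy most-frequent-pair merges, merges replayed in learned order at encode time), not the lesson's implementation; all function names are hypothetical.

```python
from collections import Counter


def merge(ids, pair, new_id):
    # Replace every non-overlapping occurrence of `pair` with `new_id`, left to right.
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(data: bytes, num_merges: int):
    # Start from raw bytes (ids 0..255) and learn `num_merges` new ids.
    ids = list(data)
    merges = {}  # (left_id, right_id) -> new_id, in learned order
    for step in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges[best] = 256 + step
        ids = merge(ids, best, 256 + step)
    return merges


def encode(data: bytes, merges):
    ids = list(data)
    for pair, new_id in merges.items():  # dicts preserve insertion (learned) order
        ids = merge(ids, pair, new_id)
    return ids


def decode(ids, merges) -> bytes:
    # Rebuild the byte string behind every id, then concatenate.
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids)


corpus = "low lower lowest".encode("utf-8")
merges = train_bpe(corpus, num_merges=5)
ids = encode(corpus, merges)
unseen = "héllo wörld".encode("utf-8")
```

Because the base vocabulary is all 256 byte values, any input (including text never seen during training) encodes without an unknown token and decodes back to the same bytes.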

Tokenized Dataset with Sliding Window

Capstone

Python

A pretraining run is a function from token ids to gradients. This lesson builds the conveyor that feeds the ids in.

Token and Positional Embeddings

Capstone

Python

Ids are integers. The model wants vectors. Two lookup tables sit between them, and the choice of the positional one shapes what the model can learn.

Multi-Head Self-Attention

Capstone

Python

One linear projection, three views, H parallel heads, one mask. The attention block as the model actually uses it.

Transformer Block from Scratch

Capstone

Python

One block is the unit of every modern decoder LLM. Layer norm, multi-head attention, residual, MLP, residual. The pre-LN variant trains stably without warmup. The post-LN varian…

GPT Model Assembly

Capstone

Python

Twelve blocks stacked, a token embedding, a learned position embedding, a final LayerNorm, and a tied language model head. That is the entire 124-million-parameter GPT model. Th…

Training Loop and Evaluation

Capstone

Python

A loop that does not measure is a loop that lies. This lesson builds the training loop that drives the GPT model: AdamW with weight decay split, a warmup plus cosine learning ra…

Loading Pretrained Weights

Capstone

Python

Training a 124-million-parameter model from scratch is a budget decision; loading a published checkpoint is a Tuesday. This lesson loads pretrained GPT-2 style weights from a sa…

Classifier Fine-Tuning by Head Swap

Capstone

Python

Track B's first capstone. A pretrained language model is a stack of self-attention blocks ending in a token-prediction head. When you want spam vs ham, the head is wrong but the…

Instruction Tuning by Supervised Fine-Tuning

Capstone

Python

A pretrained base model can extend a sequence but cannot follow an instruction. Supervised fine-tuning is the smallest change that fixes this: feed the model paired examples of …

Direct Preference Optimization from Scratch

Capstone

Python

Reward models and PPO are the classical RLHF stack. DPO collapses that stack into a single supervised loss that fits a policy directly against preference pairs. This lesson deri…

Full Evaluation Pipeline

Capstone

Python

Training is the part you can monitor with loss curves. Evaluation is the part you have to design. This lesson builds a unified eval pipeline that takes any trained language mode…

Large Corpus Downloader

Capstone

Python

Training a language model begins long before the first forward pass. The corpus has to land on disk, decompressed, deduplicated, and addressable, with the resume story already w…

HDF5 Tokenized Corpus

Capstone

Python

The downloaded corpus has to land in a layout the trainer can stream from at line speed. JSONL on disk does not survive 16 dataloader workers. HDF5 with a resizable, chunked int…

Cosine LR with Linear Warmup

Capstone

Python

The learning-rate schedule is the second most important decision after the loss function. AdamW with a cosine decay and a linear warmup is the modern default for language-model …
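One common way to write a linear-warmup, cosine-decay schedule is sketched below. The function name `lr_at` and every default value (peak and floor learning rates, warmup length, total steps) are assumptions for illustration, not values taken from the lesson.

```python
import math


def lr_at(step, max_lr=3e-4, min_lr=3e-5, warmup_steps=100, total_steps=1000):
    # Linear warmup: ramp up to max_lr over the first `warmup_steps` steps.
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    # Past the end of the schedule, hold the floor.
    if step >= total_steps:
        return min_lr
    # Cosine decay: progress runs 0 -> 1, the cosine factor runs 1 -> 0.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (max_lr - min_lr) * cosine


schedule = [lr_at(step) for step in range(1000)]
```

The schedule peaks at the last warmup step and then decreases monotonically toward `min_lr`, so the largest updates happen only after the optimizer's moment estimates have had some steps to settle.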

Gradient Clipping and Mixed Precision

Capstone

Python

The optimizer and schedule from the previous lesson assume gradients are sane. They usually are not. A single bad batch can spike the gradient norm by three orders of magnitude.…

Gradient Accumulation

Capstone

Python

Train at an effective batch you cannot afford, one micro-batch at a time. Scale the loss, hold the optimizer step, and let the gradients pile up.
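The claim that scaled micro-batch gradients add up to the full-batch gradient can be checked with a dependency-free sketch. It assumes a one-parameter model `y = w * x` with mean-squared-error loss and equally sized micro-batches; the data and names are made up for illustration.

```python
def grad_mse(w, batch):
    # d/dw of mean((w*x - y)^2) over the batch.
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2),
        (5.0, 9.8), (6.0, 12.1), (7.0, 14.0), (8.0, 15.9)]
w = 0.5
lr = 0.01
micro_batch_size = 2
accum_steps = len(data) // micro_batch_size

# Accumulate: scale each micro-batch's contribution by 1/accum_steps
# (the same effect as scaling the loss before backward) and hold the step.
accumulated = 0.0
for i in range(accum_steps):
    micro = data[i * micro_batch_size:(i + 1) * micro_batch_size]
    accumulated += grad_mse(w, micro) / accum_steps

# One optimizer step after all micro-batches have contributed.
w_accumulated = w - lr * accumulated

# Reference: a single step on the full batch.
w_full_batch = w - lr * grad_mse(w, data)
```

The two updated weights agree up to floating-point rounding. The equivalence relies on equal micro-batch sizes; with uneven sizes, each contribution would need weighting by its share of the examples.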

Checkpoint Save and Resume

Capstone

Python

Interruptions kill training runs; checkpoints let them continue. Save model, optimizer, scheduler, loss history, step counter, and RNG state, atomically, so a kill at any moment leave…
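A sketch of an atomic save using the write-to-temp-then-rename pattern, assuming a JSON-serializable state dict and Python's standard `random` module in place of a real model and framework RNG. `save_checkpoint` and `load_checkpoint` are hypothetical helper names.

```python
import json
import os
import random
import tempfile


def save_checkpoint(path, state):
    # Write to a temp file in the same directory, then rename over the target.
    # os.replace is atomic on the same filesystem, so readers see either the
    # old checkpoint or the new one, never a half-written file.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before the rename
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path):
    with open(path) as f:
        return json.load(f)


rng = random.Random(0)
rng.random()  # advance the generator so there is real state to restore
version, internal, gauss_next = rng.getstate()
state = {"step": 120, "loss_history": [2.3, 1.9],
         "rng": [version, list(internal), gauss_next]}

with tempfile.TemporaryDirectory() as workdir:
    ckpt_path = os.path.join(workdir, "ckpt.json")
    save_checkpoint(ckpt_path, state)
    restored = load_checkpoint(ckpt_path)
    leftovers = [name for name in os.listdir(workdir) if name.endswith(".tmp")]

# Restore the RNG and confirm the resumed stream matches the original one.
expected_next = rng.random()
resumed = random.Random()
v, internal_list, g = restored["rng"]
resumed.setstate((v, tuple(internal_list), g))
resumed_next = resumed.random()
```

Saving the RNG state alongside the step counter is what lets a resumed run draw the same sequence (for shuffling, dropout, or sampling) that the uninterrupted run would have drawn.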

Distributed Data Parallel and FSDP from Scratch

Capstone

Python

Multi-rank training is two collectives and one rule. Broadcast the parameters at startup, average the gradients after backward, never let the ranks disagree about what step they…

Language Model Evaluation Harness

Capstone

Python

A model that does well on a task you cannot define is a model that does well by accident. The harness is the task definition, the metric, the runner, and the leaderboard, in one…

Hypothesis Generator

Capstone

Python

A research agent that asks the same question twice is wasting tokens. The trick is forcing each draft to land somewhere new.

Literature Retrieval

Capstone

Python

A hypothesis is cheap. Knowing whether someone already proved it is the expensive part. Build the retrieval layer that answers that question before the runner spins up a sandbox.

Experiment Runner

Capstone

Python

The loop is only as honest as its measurements. Build the runner that takes a spec, executes it in a sandboxed subprocess, and emits a JSON metrics blob the evaluator can trust.

Result Evaluator

Capstone

Python

The runner produced numbers. The evaluator decides whether those numbers are an improvement, a regression, or noise. Build the verdict path that turns metrics into a one line co…

Paper Writer

Capstone

Python

A LaTeX skeleton is a contract between the researcher and the typesetter. If the contract is broken the document does not compile, and the failure is loud. Build the skeleton fi…

Critic Loop

Capstone

Python

A critic that returns "looks good" the first time is broken. A critic that always returns "needs work" is broken. The interesting critic is the one that converges, and you have …

Iteration Scheduler

Capstone

Python

A research loop without a scheduler is a queue with delusions. The scheduler is where the loop decides what to stop exploring, and that decision is the whole game.

End-to-End Research Demo

Capstone

Python

A demo is the place where every contract you wrote earlier has to compose. If any one of them leaks, the demo is the lesson that catches it.

Vision Encoder Patches

Capstone

Python

A vision model that reads pixels needs a tokenizer for pixels. Patch embedding is that tokenizer. Cut the image into a grid of squares, flatten each square, project it through o…

Vision Transformer Encoder

Capstone

Python

Patches alone do not see. A 12-layer pre-LN transformer with 12 attention heads turns the sequence of patch tokens into a sequence of contextual tokens, with the CLS token pooli…

Projection Layer for Modality Alignment

Capstone

Python

A vision encoder produces image tokens. A text decoder consumes text tokens. The two live in different vector spaces. A small two-layer MLP projects image tokens into the text e…

Cross-Attention Fusion

Capstone

Python

The projection layer aligns one image vector with one caption vector. A real vision-language decoder needs every text token to attend to every patch token, so the model can grou…

Vision-Language Pretraining

Capstone

Python

The encoder, projection, and decoder are wired. Now train them together. Two objectives drive learning: a contrastive image-text loss (InfoNCE) that pulls matching pairs togethe…

Multimodal Evaluation

Capstone

Python

Training is half the loop. The other half is measurement. This lesson builds three evaluation surfaces from primitives: image-caption retrieval reported as R@1, R@5, R@10; visua…

Chunking Strategies, Compared

Capstone

Python

Chunking decides what your retriever can ever surface. Get the boundaries wrong and no embedding model, no reranker, no LLM can repair the damage downstream.

Hybrid Retrieval with BM25 and Dense Embeddings

Capstone

Python

Lexical and semantic retrieval fail on opposite query distributions. Hybrid retrieval with reciprocal rank fusion does not interpolate; it votes, and the vote wins on every que…
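Reciprocal rank fusion can be sketched in a few lines: each ranked list gives a document a score of 1/(k + rank), and the scores are summed across lists. The constant `k=60` is a commonly used default and is an assumption here, as are the document ids and the two example rankings.

```python
def reciprocal_rank_fusion(rankings, k=60):
    # Each ranking is a list of doc ids, best first. Only ranks matter,
    # so the raw BM25 and cosine scores never need to share a scale.
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score (descending), breaking ties by doc id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


bm25_ranking = ["doc_a", "doc_b", "doc_c"]    # hypothetical lexical results
dense_ranking = ["doc_c", "doc_a", "doc_d"]   # hypothetical embedding results
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
# fused == ["doc_a", "doc_c", "doc_b", "doc_d"]
```

`doc_a` wins because both retrievers place it near the top, while `doc_b` and `doc_d`, each seen by only one retriever, fall to the bottom. That is the "voting" behavior: agreement across lists outweighs a high position in a single list.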

Cross-Encoder Reranker

Capstone

Python

A bi-encoder embeds query and document independently. A cross-encoder concatenates them and reads both at once. The cross-encoder is the smartest reader and the slowest. Used as…

Query Rewriting: HyDE, Multi-Query, and Decomposition

Capstone

Python

The query the user types is not the query your retriever wants. Rewriting bridges the gap before retrieval, so the index sees something closer to what the answer looks like.

RAG Evaluation: Precision, Recall, MRR, nDCG, Faithfulness, Answer Relevance

Capstone

Python

If you cannot grade your retrieval and your answer at the same time, you cannot ship the system. The two are not the same metric and the same prompt fails on different axes.

End-to-End RAG System

Capstone

Python

Six lessons of components. One pipeline. One eval loop. One self-terminating demo. This is the system you ship.

Task Spec Format

Capstone

Python

An eval harness is only as good as the contract its tasks honour. Freeze the JSONL shape and the metric vocabulary before you write a single scoring function.

Classical Metrics

Capstone

Python

BLEU, ROUGE-L, F1, exact-match, accuracy. Five metrics that still account for most published LLM eval numbers. Implement each from first principles so you know what the number m…

Code Exec Metric

Capstone

Python

Generated code is right when it passes the tests. The eval harness has to extract code, run it without crashing the host, and tally pass-rates honestly. This lesson builds that …

Perplexity and Calibration

Capstone

Python

If your model says 90 percent confident on a thousand answers and gets six hundred right, it is not well calibrated. Calibration is half of trustworthy eval. The other half is p…
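The 90-percent-confident, 600-of-1000-correct example can be turned into a number with expected calibration error (ECE). The sketch below uses equal-width confidence bins; the function name and the bin count are illustrative assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    # Group predictions into equal-width confidence bins.
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, is_correct))

    # ECE = sum over bins of (bin weight) * |mean confidence - accuracy|.
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(mean_conf - accuracy)
    return ece


# The scenario from the blurb: 1000 answers at 0.9 confidence, 600 correct.
confidences = [0.9] * 1000
correct = [True] * 600 + [False] * 400
ece = expected_calibration_error(confidences, correct)
```

For this input the ECE comes out at about 0.3: the model claims 90 percent and delivers 60 percent, a 30-point gap. A perfectly calibrated model would score 0.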

Leaderboard Aggregation

Capstone

Python

Per-task scores are easy. Per-model rankings across heterogeneous tasks are harder. Statistical significance on a thousand-prediction leaderboard is the part everyone skips. Thi…

End-to-End Eval Runner

Capstone

Python

Five lessons of plumbing, one lesson to glue them. The runner reads the task spec from lesson 70, calls a model through an adapter, scores with lessons 71 and 72, attaches the c…

Collective Ops From Scratch

Capstone

Python

The four collective operations that hold distributed training together are allreduce, broadcast, allgather, and reduce_scatter. Every other primitive a training framework offers…

Data Parallel DDP From Scratch

Capstone

Python

DistributedDataParallel is a hook on top of allreduce. Wrap a model, broadcast the initial parameters from rank 0 so every rank starts identical, install a backward hook on ever…

ZeRO Optimizer State Sharding

Capstone

Python

Adam stores two moment estimates per parameter, both in float32. A 7B-parameter model carries 56 GB of optimizer state. ZeRO stage 1 shards that across N ranks; each rank owns 1…
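The 56 GB figure follows from simple arithmetic: 7 billion parameters, two moments each, 4 bytes per float32 value. The sketch below reproduces it and shows the per-rank share under an even split. The rank count of 8 is an assumed example value, and the helper names are hypothetical.

```python
GB = 10**9  # decimal gigabytes, matching the 56 GB figure


def adam_state_bytes(num_params, moments=2, bytes_per_value=4):
    # Two float32 moment estimates (first and second) per parameter.
    return num_params * moments * bytes_per_value


def per_rank_bytes(total_bytes, num_ranks):
    # Even split across ranks, rounded up so the shards cover everything.
    return -(-total_bytes // num_ranks)


total = adam_state_bytes(7 * 10**9)
shard = per_rank_bytes(total, num_ranks=8)  # 8 ranks is an assumed example

print(f"optimizer state: {total / GB:.0f} GB total, {shard / GB:.0f} GB per rank on 8 ranks")
```

This counts only the Adam moments. Parameters, gradients, and activations sit on top of it and are not reduced by this stage of sharding.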

Pipeline Parallel and Bubble Analysis

Capstone

Python

Tensor parallelism splits the matrix multiply across ranks. Pipeline parallelism splits the model across ranks, one stage per rank. Microbatches flow through the pipeline. The e…

Sharded Checkpoint and Atomic Resume

Capstone

Python

A 70B-parameter training job is paused by a node failure every few hours. The checkpoint format decides whether you lose 30 minutes or 30 hours. A sharded checkpoint writes ever…

End-to-End Distributed Training

Capstone

Python

Lessons 76 through 80 each built one piece. This is the assembly: a tiny GPT trained across 4 simulated ranks with DDP for gradient sync, ZeRO-1 for optimizer-state sharding, an…

Jailbreak Taxonomy

Capstone

Python

A safety harness without a taxonomy is a coin flip. Name the attack before you defend it.

Prompt Injection Detector

Capstone

Python

A detector is a function from prompt to confidence and category. Anything else is a vibe.

Refusal Evaluation

Capstone

Python

Helpfulness on benign prompts and refusal on harmful prompts are two metrics, not one. Measure both.

Content Classifier Integration

Capstone

Python

Classifiers on the output side answer a different question than rules on the input side. Both need a policy router.

Constitutional Rules Engine

Capstone

Python, YAML

A rule is a name, a predicate, and an explanation. Anything missing one of those three is a vibe, not a rule.
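The three-part definition maps naturally onto a small data structure. The sketch below is one possible shape, assuming a predicate returns True when the text violates the rule; the `Rule` class, the `evaluate` helper, and both example rules are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Rule:
    name: str                          # stable identifier for audit logs
    predicate: Callable[[str], bool]   # True when the text violates the rule
    explanation: str                   # why the rule exists, shown on a hit


def evaluate(rules: List[Rule], text: str) -> List[Rule]:
    # Return every rule the text violates, in declaration order.
    return [rule for rule in rules if rule.predicate(text)]


rules = [
    Rule(
        name="no_raw_secrets",
        predicate=lambda text: "api_key=" in text.lower(),
        explanation="Outputs must not echo credential-like strings.",
    ),
    Rule(
        name="max_length",
        predicate=lambda text: len(text) > 200,
        explanation="Responses over 200 characters need a summary first.",
    ),
]

violations = evaluate(rules, "Use API_KEY=abc123 to authenticate.")
report = [(rule.name, rule.explanation) for rule in violations]
```

Because every hit carries its name and explanation, the audit entry for a blocked response can say which rule fired and why, without any extra lookup.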

End-to-End Safety Gate

Capstone

Python

Pre-gen, during-gen, post-gen. Three checkpoints, one verdict, an audit trail per request.