28 lessons

Infrastructure & Production

Ship AI to the real world.

Managed LLM Platforms — Bedrock, Azure OpenAI, Vertex AI

Three hyperscalers, three distinct strategies. AWS Bedrock is a model marketplace — Claude, Llama, Titan, Stability, Cohere behind one API. Azure OpenAI is an exclusive OpenAI p…

Inference Platform Economics — Fireworks, Together, Baseten, Modal

Learn

Python

The 2026 inference market is no longer just GPU-time rental. It splits into custom silicon (Groq, Cerebras, SambaNova), GPU platforms (Baseten, Together, Fireworks, Modal), and A…

GPU Autoscaling on Kubernetes — Karpenter, KAI Scheduler

Learn

Python

Three layers, not one. Karpenter provisions nodes dynamically (under one minute, 40% faster than Cluster Autoscaler). KAI Scheduler handles gang scheduling, topology awareness, …

vLLM Serving Internals — PagedAttention, Continuous Batching, Chunked Prefill

Learn

Python

vLLM's dominance in 2026 rests on three compounding defaults, not a single trick. PagedAttention is always on. Continuous batching injects new requests into the active batch bet…

EAGLE-3 Speculative Decoding in Production

Learn

Python

Speculative decoding pairs a fast draft model with the target model. The draft proposes K tokens; the target verifies them in a single forward pass; accepted tokens are free. In 2026, EAG…

SGLang and RadixAttention for Prefix-Heavy Workloads

Learn

Python

SGLang treats the KV cache as a first-class, reusable resource stored in a radix tree. Where vLLM schedules requests FCFS (first-come, first-served), SGLang's cache-aware schedu…

TensorRT-LLM on Blackwell with FP8 and NVFP4

Learn

Python

TensorRT-LLM is NVIDIA-only but it wins on Blackwell. On GB200 NVL72 with Dynamo orchestration, SemiAnalysis InferenceX measured $0.012 per million tokens on a 120B model in Q1-…

Inference Metrics — TTFT, TPOT, ITL, Goodput, P99

Learn

Python

Four metrics decide whether an inference deployment is working. TTFT is prefill plus queue plus network. TPOT (equivalently ITL) is the memory-bound decode cost per token. End-t…

Production Quantization — AWQ, GPTQ, GGUF, FP8, NVFP4

Learn

Python

Quantization format is not a universal choice — it is a function of hardware, serving engine, and workload. GGUF Q4_K_M or Q5_K_M owns CPU and edge, delivered through llama.cpp …

Cold Start Mitigation for Serverless LLMs

Learn

Python

A 20 GB model image takes 5-10 minutes (7B) to 20+ minutes (70B) to go from cold to serving. In a true serverless world, that is not a warm-up — it is an outage. Mitigations ope…

Multi-Region LLM Serving and KV Cache Locality

Learn

Python

Round-robin load balancing is actively harmful for cached LLM inference. A request that does not land on the node holding its prefix pays full prefill cost — roughly 800 ms at P…

Edge Inference — ANE, Hexagon, WebGPU, Jetson

Learn

Python

The core edge constraint is memory bandwidth, not compute. Mobile DRAM sits at 50-90 GB/s; datacenter HBM3 clears 2-3 TB/s — a 30-50x gap. Decode is memory-bound so the gap is d…

LLM Observability Stack Selection

Learn

Python

The 2026 observability market splits into two categories. Development platforms (LangSmith, Langfuse, Comet Opik) bundle monitoring with evals, prompt management, session replay…

Prompt Caching and Semantic Caching Economics

Learn

Python

**Pricing snapshot dated 2026-04.** Numeric claims below reflect vendor rate cards captured at this lesson's publication; verify against the linked docs before quoting them down…

Batch APIs — the 50% Discount as Industry Standard

Learn

Python

Every major provider ships an async batch API with a 50% discount and ~24-hour turnaround. OpenAI, Anthropic, Google, and most of the inference platforms (Fireworks batch tier, …

Model Routing as a Cost-Reduction Primitive

Learn

Python

A dynamic broker evaluates every request (task type, token length, embedding similarity, confidence) and sends simple queries to a cheap model, escalating complex ones to a fron…

Disaggregated Prefill/Decode — NVIDIA Dynamo and llm-d

Learn

Python

Prefill is compute-bound; decode is memory-bound. Running both on the same GPU wastes one resource. Disaggregation splits them onto separate pools and transfers KV cache between…

vLLM Production Stack with LMCache KV Offloading

Learn

Python

vLLM's production-stack is the reference Kubernetes deployment — router, engines, and observability wired together. LMCache is the KV-offloading layer that extracts KV cache out…

AI Gateways — LiteLLM, Portkey, Kong, Bifrost

Learn

Python

A gateway sits between your apps and model providers. Core features are provider routing, fallback, retries, rate limiting, secret references, observability, and guardrails. Market …

Shadow, Canary, and Progressive Deployment

Learn

Python

LLM rollouts combine the hardest parts of software deployment: no unit tests, diffuse failure modes, delayed signals. The sequence is (1) shadow mode — duplicate prod requests t…

A/B Testing LLM Features — GrowthBook and Statsig

Learn

Python

Traditional A/B testing was not built for non-deterministic LLMs. The critical distinction: evals answer "can the model do the job?" A/B tests answer "do users care?" Both are r…

Load Testing LLM APIs — k6, LLMPerf, GenAI-Perf

Build

Python

Traditional load testers were not designed for streaming responses, variable output lengths, token-level metrics, or GPU saturation. Two traps bite most teams. The GIL trap: Loc…

SRE for AI — Multi-Agent Incident Response

Learn

Python

AI SRE uses LLMs grounded in infrastructure data (logs, runbooks, service topology) via RAG to automate investigation, documentation, and coordination phases. The 2026 architect…

Chaos Engineering for LLM Production

Learn

Python

Chaos engineering for LLMs is its own discipline in 2026. Prerequisites before running experiments in production: defined SLI/SLO, trace+metric+log observability, automated roll…

Security — Secrets, PII Scrubbing, Audit Logs

Learn

Python

Eliminate secret sprawl via centralized vaults (HashiCorp Vault, AWS Secrets Manager, Azure Key Vault). Never store credentials in config files, env files committed to VCS, or spreadsheets. …

Compliance — SOC 2, HIPAA, GDPR, EU AI Act, ISO 42001

Learn

Python

Multi-framework coverage is table stakes for 2026 enterprise deals. **EU AI Act**: in force since August 1, 2024. Most high-risk requirements apply from August 2, 2026. Fines up to…

FinOps for LLMs — Unit Economics and Multi-Tenant Attribution

Learn

Python

Traditional FinOps breaks on LLM spend. Costs are token-transactions, not resource-uptime. Tags don't map — an API call is a transaction, not an asset. Engineering decisions (pr…

Self-Hosted Serving Selection — llama.cpp, Ollama, TGI, vLLM, SGLang

Learn

Python

Four engines dominate self-hosted inference in 2026. Pick based on hardware, scale, and ecosystem. **llama.cpp** is fastest on CPU — widest model support, full control over quan…