Managed LLM Platforms — Bedrock, Azure OpenAI, Vertex AI
Three hyperscalers, three distinct strategies. AWS Bedrock is a model marketplace — Claude, Llama, Titan, Stability, Cohere behind one API. Azure OpenAI is an exclusive OpenAI p…
Inference Platform Economics — Fireworks, Together, Baseten, Modal
The 2026 inference market is no longer GPU time rental. It bifurcates into custom silicon (Groq, Cerebras, SambaNova), GPU platforms (Baseten, Together, Fireworks, Modal), and A…
GPU Autoscaling on Kubernetes — Karpenter, KAI Scheduler
Three layers, not one. Karpenter provisions nodes dynamically (under one minute, 40% faster than Cluster Autoscaler). KAI Scheduler handles gang scheduling, topology awareness, …
vLLM Serving Internals — PagedAttention, Continuous Batching, Chunked Prefill
vLLM's dominance in 2026 rests on three compounding defaults, not a single trick. PagedAttention is always on. Continuous batching injects new requests into the active batch bet…
EAGLE-3 Speculative Decoding in Production
Speculative decoding pairs a fast draft model with the target model. The draft proposes K tokens; the target verifies in a single forward; accepted tokens are free. In 2026, EAG…
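A minimal sketch of the propose-and-verify loop described above, assuming greedy decoding on both models. `speculative_step`, `toy_draft`, and `toy_target` are hypothetical names for illustration; nothing here is specific to EAGLE-3, and a real engine scores all K positions in one batched target forward pass rather than the Python loop that emulates it here.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # hypothetical greedy next-token function


def speculative_step(prefix: List[int], draft_next: NextToken,
                     target_next: NextToken, k: int) -> List[int]:
    """One greedy propose-and-verify step; returns the tokens to append."""
    # 1. The draft model proposes k tokens autoregressively.
    proposed: List[int] = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2. The target checks each proposed position in order.
    accepted: List[int] = []
    ctx = list(prefix)
    for tok in proposed:
        expected = target_next(ctx)
        if tok != expected:
            # First mismatch: keep the target's token and drop the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3. Every draft token matched, so the target adds one extra token.
    accepted.append(target_next(ctx))
    return accepted


def toy_target(ctx: List[int]) -> int:
    # Toy "target": counts upward modulo 10.
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    # Toy "draft": agrees with the target except right after a 3.
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10
```

Each step emits between 1 and K+1 tokens, and the output is identical to decoding with the target alone under this greedy rule; the speedup depends on how often the draft agrees.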
SGLang and RadixAttention for Prefix-Heavy Workloads
SGLang treats the KV cache as a first-class, reusable resource stored in a radix tree. Where vLLM schedules requests FCFS (first-come, first-served), SGLang's cache-aware schedu…
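To illustrate the prefix-reuse idea above, here is a small sketch that stores token sequences in an uncompressed trie and orders waiting requests by how much of their prefix is already cached. It is an assumption-laden toy, not SGLang's implementation: a real radix tree compresses edges and tracks KV blocks and eviction, and `PrefixCache` and `cache_aware_order` are hypothetical names.

```python
from typing import Dict, List


class PrefixNode:
    def __init__(self) -> None:
        self.children: Dict[int, "PrefixNode"] = {}


class PrefixCache:
    """Trie over token ids; stands in for a radix tree of cached KV prefixes."""

    def __init__(self) -> None:
        self.root = PrefixNode()

    def insert(self, tokens: List[int]) -> None:
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, PrefixNode())

    def match_len(self, tokens: List[int]) -> int:
        """Number of leading tokens whose KV entries would be cache hits."""
        node = self.root
        matched = 0
        for t in tokens:
            nxt = node.children.get(t)
            if nxt is None:
                break
            node = nxt
            matched += 1
        return matched


def cache_aware_order(cache: PrefixCache,
                      waiting: List[List[int]]) -> List[List[int]]:
    # Longest cached prefix first, instead of arrival (FCFS) order.
    return sorted(waiting, key=cache.match_len, reverse=True)
```

Requests that share a long system prompt or few-shot prefix skip most of their prefill under this kind of ordering, which is why prefix-heavy workloads benefit most.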
TensorRT-LLM on Blackwell with FP8 and NVFP4
TensorRT-LLM is NVIDIA-only but it wins on Blackwell. On GB200 NVL72 with Dynamo orchestration, SemiAnalysis InferenceX measured $0.012 per million tokens on a 120B model in Q1-…
Inference Metrics — TTFT, TPOT, ITL, Goodput, P99
Four metrics decide whether an inference deployment is working. TTFT is prefill plus queue plus network. TPOT (equivalently ITL) is the memory-bound decode cost per token. End-t…
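The definitions above can be made concrete with a short client-side sketch. It assumes you record the send time and the arrival time of each streamed token in milliseconds; `stream_metrics` is a hypothetical helper, and TPOT is taken here as the mean gap between consecutive tokens after the first.

```python
from typing import Dict, List


def stream_metrics(request_sent_ms: float,
                   token_times_ms: List[float]) -> Dict[str, float]:
    """Compute TTFT, mean TPOT, and end-to-end latency for one streamed response."""
    if not token_times_ms:
        raise ValueError("need at least one token timestamp")

    # TTFT as seen by the client: queueing + prefill + network.
    ttft = token_times_ms[0] - request_sent_ms

    # Inter-token gaps cover the decode phase only.
    gaps = [b - a for a, b in zip(token_times_ms, token_times_ms[1:])]
    tpot = sum(gaps) / len(gaps) if gaps else 0.0

    e2e = token_times_ms[-1] - request_sent_ms
    return {"ttft_ms": ttft, "tpot_ms": tpot, "e2e_ms": e2e}
```

For example, a request sent at 0 ms whose tokens arrive at 500, 600, 700, and 800 ms has a TTFT of 500 ms, a TPOT of 100 ms, and an end-to-end latency of 800 ms. Percentiles such as P99 are then taken over many such per-request values.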
Production Quantization — AWQ, GPTQ, GGUF, FP8, NVFP4
Quantization format is not a universal choice — it is a function of hardware, serving engine, and workload. GGUF Q4_K_M or Q5_K_M owns CPU and edge, delivered through llama.cpp …
Cold Start Mitigation for Serverless LLMs
A 20 GB model image takes 5-10 minutes (7B) to 20+ minutes (70B) to go from cold to serving. In a true serverless world, that is not a warm-up — it is an outage. Mitigations ope…
Multi-Region LLM Serving and KV Cache Locality
Round-robin load balancing is actively harmful for cached LLM inference. A request that does not land on the node holding its prefix pays full prefill cost — roughly 800 ms at P…
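One way to avoid the cache miss described above is to route on a hash of the prompt prefix, so that requests sharing a prefix land on the same node. The sketch below is a simplified illustration under stated assumptions: `prefix_affinity_node` and the 64-character prefix window are hypothetical, and plain modulo hashing is used where a production router would more likely use consistent hashing plus load awareness.

```python
import hashlib
from typing import List


def prefix_affinity_node(prompt: str, nodes: List[str],
                         prefix_chars: int = 64) -> str:
    """Pick a node from a stable hash of the first `prefix_chars` characters."""
    prefix = prompt[:prefix_chars].encode("utf-8")
    digest = hashlib.sha256(prefix).digest()
    # sha256 is stable across processes, unlike Python's built-in hash().
    index = int.from_bytes(digest[:8], "big") % len(nodes)
    return nodes[index]
```

Two prompts that share the same system prompt (at least `prefix_chars` long) always map to the same node, so the second one can reuse the cached prefix instead of paying full prefill. Round-robin would spread them across nodes regardless of what each node has cached.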
Edge Inference — ANE, Hexagon, WebGPU, Jetson
The core edge constraint is memory bandwidth, not compute. Mobile DRAM sits at 50-90 GB/s; datacenter HBM3 clears 2-3 TB/s, a 30-50x gap. Decode is memory-bound, so the gap is d…
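A back-of-envelope sketch of why bandwidth dominates: if each decoded token must stream all model weights from memory once (batch size 1, KV cache traffic ignored), the decode rate is capped at bandwidth divided by model size. The 4 GB model size is an illustrative assumption, and `decode_tokens_per_sec_ceiling` is a hypothetical helper.

```python
def decode_tokens_per_sec_ceiling(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound on single-stream decode rate for a memory-bound model."""
    # Each token reads every weight once, so time per token >= model_gb / bandwidth.
    return bandwidth_gb_s / model_gb


MODEL_GB = 4.0  # assumed size of a small quantized model, for illustration only

mobile_low = decode_tokens_per_sec_ceiling(50, MODEL_GB)      # mobile DRAM, low end
mobile_high = decode_tokens_per_sec_ceiling(90, MODEL_GB)     # mobile DRAM, high end
hbm_low = decode_tokens_per_sec_ceiling(2000, MODEL_GB)       # HBM3, low end
hbm_high = decode_tokens_per_sec_ceiling(3000, MODEL_GB)      # HBM3, high end
```

With these assumed numbers the mobile ceiling is 12.5 to 22.5 tokens/s against 500 to 750 tokens/s on HBM3. The ratio is independent of model size because it is just the bandwidth ratio.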
LLM Observability Stack Selection
The 2026 observability market splits into two categories. Development platforms (LangSmith, Langfuse, Comet Opik) bundle monitoring with evals, prompt management, session replay…
Prompt Caching and Semantic Caching Economics
**Pricing snapshot dated 2026-04.** Numeric claims below reflect vendor rate cards captured at this lesson's publication; verify against the linked docs before quoting them down…
Batch APIs — the 50% Discount as Industry Standard
Every major provider ships an async batch API with a 50% discount and ~24-hour turnaround. OpenAI, Anthropic, Google, and most of the inference platforms (Fireworks batch tier, …
Model Routing as a Cost-Reduction Primitive
A dynamic broker evaluates every request (task type, token length, embedding similarity, confidence) and sends simple queries to a cheap model, escalating complex ones to a fron…
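The broker described above can be sketched as a small rule-based function. Everything specific here is an assumption for illustration: the model names, the task list, the thresholds, and the `confidence` field (imagined as the score of a cheap upstream classifier). Embedding similarity is omitted to keep the example short.

```python
from dataclasses import dataclass


@dataclass
class Request:
    task_type: str        # e.g. "faq", "extraction", "code_review"
    prompt_tokens: int    # token length of the prompt
    confidence: float     # hypothetical score in [0, 1] that the task is simple


SIMPLE_TASKS = {"classification", "extraction", "faq"}  # assumed task taxonomy


def route(req: Request, max_cheap_tokens: int = 2000,
          min_confidence: float = 0.8) -> str:
    """Return the model tier for a request; escalate when any signal is weak."""
    is_simple = (
        req.task_type in SIMPLE_TASKS
        and req.prompt_tokens <= max_cheap_tokens
        and req.confidence >= min_confidence
    )
    return "cheap-model" if is_simple else "frontier-model"
```

Defaulting to the frontier tier whenever a signal is missing or weak trades some savings for quality, which is usually the safer failure mode for a router.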
Disaggregated Prefill/Decode — NVIDIA Dynamo and llm-d
Prefill is compute-bound; decode is memory-bound. Running both on the same GPU wastes one resource. Disaggregation splits them onto separate pools and transfers KV cache between…
vLLM Production Stack with LMCache KV Offloading
vLLM's production-stack is the reference Kubernetes deployment — router, engines, and observability wired together. LMCache is the KV-offloading layer that extracts KV cache out…
AI Gateways — LiteLLM, Portkey, Kong, Bifrost
A gateway sits between your apps and model providers. Core features are provider routing, fallback, retries, rate limiting, secret references, observability, and guardrails. Market …
Shadow, Canary, and Progressive Deployment
LLM rollouts combine the hardest parts of software deployment: no unit tests, diffuse failure modes, delayed signals. The sequence is (1) shadow mode — duplicate prod requests t…
A/B Testing LLM Features — GrowthBook and Statsig
Traditional A/B testing was not built for non-deterministic LLMs. The critical distinction: evals answer "can the model do the job?" A/B tests answer "do users care?" Both are r…
Load Testing LLM APIs — k6, LLMPerf, GenAI-Perf
Traditional load testers were not designed for streaming responses, variable output lengths, token-level metrics, or GPU saturation. Two traps bite most teams. The GIL trap: Loc…
SRE for AI — Multi-Agent Incident Response
AI SRE uses LLMs grounded in infrastructure data (logs, runbooks, service topology) via RAG to automate investigation, documentation, and coordination phases. The 2026 architect…
Chaos Engineering for LLM Production
Chaos engineering for LLMs is its own discipline in 2026. Prerequisites before running experiments in production: defined SLI/SLO, trace+metric+log observability, automated roll…
Security — Secrets, PII Scrubbing, Audit Logs
Eliminate secret sprawl via centralized vaults (HashiCorp Vault, AWS Secrets Manager, Azure Key Vault). Never store credentials in config files, env files in VCS, or spreadsheets. …
Compliance — SOC 2, HIPAA, GDPR, EU AI Act, ISO 42001
Multi-framework coverage is table stakes for 2026 enterprise deals. **EU AI Act**: in force since August 1, 2024. Most high-risk requirements apply from August 2, 2026. Fines up to…
FinOps for LLMs — Unit Economics and Multi-Tenant Attribution
Traditional FinOps breaks on LLM spend. Costs are token-transactions, not resource-uptime. Tags don't map — an API call is a transaction, not an asset. Engineering decisions (pr…
Self-Hosted Serving Selection — llama.cpp, Ollama, TGI, vLLM, SGLang
Four engines dominate self-hosted inference in 2026. Pick based on hardware, scale, and ecosystem. **llama.cpp** is fastest on CPU — widest model support, full control over quan…