22 lessons

Autonomous Systems

Long-horizon agents, self-improvement, and the 2026 safety stack.

From Chatbots to Long-Horizon Agents (METR)

In 2023 a chatbot answered a question in one turn. In 2026 a frontier model routinely runs minutes to hours on a single task. METR's Time Horizon 1.1 benchmark (January 2026) pu…

STaR, V-STaR, Quiet-STaR: Self-Taught Reasoning

Learn

Python

The smallest possible self-improvement loop sits inside the rationale. A model generates a chain of thought, keeps the ones that land on correct answers, and fine-tunes on those…
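
The filtering step in that loop can be sketched as below. This is a minimal illustration, not the STaR authors' code: `sample_rationales` and `toy_sampler` are hypothetical stand-ins for a model call, answers are compared by exact string match as a simplifying assumption, and the fine-tuning step is left out.

```python
from typing import Callable, Iterable

# A rationale is a (chain_of_thought, final_answer) pair.
Rationale = tuple[str, str]


def filter_rationales(
    problems: Iterable[tuple[str, str]],
    sample_rationales: Callable[[str], list[Rationale]],
) -> list[dict]:
    """Keep only rationales whose final answer matches the reference answer."""
    kept = []
    for question, gold in problems:
        for thought, answer in sample_rationales(question):
            if answer.strip() == gold.strip():
                kept.append(
                    {"question": question, "rationale": thought, "answer": answer}
                )
    return kept


def toy_sampler(question: str) -> list[Rationale]:
    # Hypothetical stand-in for a model: one correct and one wrong chain for "2+3".
    if question == "2+3":
        return [("2 plus 3 is 5", "5"), ("2 times 3 is 6", "6")]
    return [("no idea", "?")]


# The kept rows would become the fine-tuning set for the next round.
dataset = filter_rationales([("2+3", "5"), ("7*8", "56")], toy_sampler)
```

Only the chain that reached the reference answer survives; problems with no correct chain contribute nothing to the next round.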

AlphaEvolve: Evolutionary Coding Agents

Learn

Python

Pair a frontier coding model with an evolutionary loop and a machine-checkable evaluator. Let the loop run long enough. It discovers a 4x4 complex-matrix multiplication procedur…

Darwin Gödel Machine: Self-Modifying Agents

Learn

Python

Schmidhuber's 2003 Gödel Machine required a formal proof that any self-modification was beneficial before accepting it. That proof is impossible in practice. Darwin Gödel Machin…

AI Scientist v2: Workshop-Level Research

Learn

Python

Sakana's AI Scientist v2 (Yamada et al., arXiv:2504.08066) runs the full research loop: hypothesis, code, experiments, figures, writeup, submission. It is the first system to ha…

Automated Alignment Research (Anthropic AAR)

Learn

Python

Anthropic ran parallel teams of Claude Opus 4.6 Autonomous Alignment Researchers in independent sandboxes, coordinating via a shared forum whose logs live outside any sandbox (s…

Recursive Self-Improvement: Capability vs Alignment

Learn

Python

Recursive self-improvement (RSI) is no longer speculation. The ICLR 2026 RSI Workshop in Rio (April 23-27) framed it as an engineering problem with concrete tooling. Demis Hassa…

Bounded Self-Improvement Designs

Learn

Python

Research has converged on four primitives for bounding a self-improvement loop. Formal invariants that must hold across every edit. Alignment anchors that cannot be modified. Mu…

Autonomous Coding Agent Landscape (SWE-bench, CodeAct)

Learn

Python

SWE-bench Verified went from 4% to 80.9% in under three years. The same Claude Sonnet 4.5 scored 43.2% on SWE-agent v1 and 59.8% on Cline autonomous: the scaffolding around the mod…

Claude Code Permission Modes and Auto Mode

Learn

Python

Claude Code exposes seven permission modes. "plan" asks before every action, "default" asks only for risky ones, "acceptEdits" auto-approves file writes but still confirms shell…

Browser Agents and Indirect Prompt Injection

Learn

Python

ChatGPT agent (July 2025) merged Operator and deep research into one browser/terminal agent and set BrowseComp SOTA at 68.9%. OpenAI shut Operator down August 31, 2025 — consoli…

Durable Execution for Long-Running Agents

Learn

Python

Production long-horizon agents do not run in `while True`. Every LLM call becomes an activity with checkpoint, retry, and replay. Temporal's OpenAI Agents SDK integration went G…
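
The checkpoint, retry, and replay idea can be sketched in plain Python. This is not Temporal's API: `Journal`, `run_activity`, and `flaky_llm_call` are hypothetical names, and the journal is an in-memory dict where a real system would use durable storage.

```python
import json
from typing import Callable


class Journal:
    """In-memory stand-in for a durable event log (assumption: real systems persist this)."""

    def __init__(self) -> None:
        self.results: dict[str, str] = {}


def run_activity(
    journal: Journal, step_id: str, fn: Callable[[], object], max_attempts: int = 3
):
    # Replay: a step that already completed returns its recorded result.
    if step_id in journal.results:
        return json.loads(journal.results[step_id])
    last_error = None
    for _ in range(max_attempts):
        try:
            result = fn()
        except Exception as exc:  # retry on any failure in this sketch
            last_error = exc
            continue
        journal.results[step_id] = json.dumps(result)  # checkpoint the result
        return result
    raise RuntimeError(f"{step_id} failed after {max_attempts} attempts") from last_error


calls = {"n": 0}


def flaky_llm_call() -> str:
    # Hypothetical model call that fails once, then succeeds.
    calls["n"] += 1
    if calls["n"] == 1:
        raise TimeoutError("transient failure")
    return "summary text"


journal = Journal()
first = run_activity(journal, "summarize", flaky_llm_call)     # one retry, then checkpoint
replayed = run_activity(journal, "summarize", flaky_llm_call)  # served from the journal
```

The second call never reaches the model, which is what lets a restarted worker replay a workflow without repeating paid or side-effecting calls.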

Action Budgets, Iteration Caps, Cost Governors

Learn

Python

A mid-sized e-commerce agent's monthly LLM cost jumped from $1,200 to $4,800 after its team enabled the "order-tracking" skill. That is not a pricing bug. That is an agent that …

Kill Switches, Circuit Breakers, Canary Tokens

Learn

Python

A kill switch is a boolean held outside the agent's edit surface — a Redis key, a feature flag, a signed config — that disables the agent entirely. A circuit breaker is finer-gr…
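
A minimal sketch of the kill-switch half, under stated assumptions: the `external_flags` dict stands in for the Redis key, feature flag, or signed config, and `run_agent_steps` and `AgentDisabled` are hypothetical names. The agent only receives a read function, never a handle it could write to.

```python
from typing import Callable


class AgentDisabled(Exception):
    """Raised when the externally held flag says the agent must not act."""


def run_agent_steps(
    steps: list[Callable[[], str]], kill_switch_on: Callable[[], bool]
) -> list[str]:
    """Check the flag before every step and stop as soon as it is set."""
    outputs = []
    for step in steps:
        if kill_switch_on():
            raise AgentDisabled("kill switch set; refusing to act")
        outputs.append(step())
    return outputs


# Stand-in for a flag store outside the agent's edit surface.
external_flags = {"agent_disabled": False}


def read_flag() -> bool:
    return external_flags["agent_disabled"]


def step_one() -> str:
    return "step one done"


def step_two() -> str:
    return "step two done"


ok = run_agent_steps([step_one, step_two], read_flag)

external_flags["agent_disabled"] = True  # an operator flips the switch
try:
    run_agent_steps([step_one], read_flag)
    halted = False
except AgentDisabled:
    halted = True
```

Checking before each step, not once at startup, is the design choice that matters: a long-running agent stops at its next action after the flag changes.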

HITL: Propose-Then-Commit

Learn

Python

The 2026 consensus on HITL is specific. It is not "the agent asks, the user clicks Approve." It is propose-then-commit: the proposed action is persisted to a durable store with …
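
One way to picture propose-then-commit is the sketch below. The teaser above is cut off before it lists what is stored alongside the proposal, so the fields here (`action`, `status`) and the `ProposalStore` class are illustrative assumptions, with an in-memory dict standing in for the durable store.

```python
import uuid
from typing import Callable


class ProposalStore:
    """In-memory stand-in for a durable store (assumption)."""

    def __init__(self) -> None:
        self.rows: dict[str, dict] = {}

    def propose(self, action: dict) -> str:
        # Persist the proposed action before anyone is asked to approve it.
        proposal_id = str(uuid.uuid4())
        self.rows[proposal_id] = {"action": action, "status": "proposed"}
        return proposal_id

    def approve(self, proposal_id: str) -> None:
        self.rows[proposal_id]["status"] = "approved"

    def commit(self, proposal_id: str, execute: Callable[[dict], object]) -> object:
        row = self.rows[proposal_id]
        if row["status"] != "approved":
            raise PermissionError("proposal has not been approved")
        # Execute the stored action, not whatever the agent holds in memory now.
        result = execute(row["action"])
        row["status"] = "committed"
        return result


store = ProposalStore()
pid = store.propose({"type": "refund", "amount_usd": 20})

try:
    store.commit(pid, execute=lambda a: f"executed {a['type']}")
    blocked = False
except PermissionError:
    blocked = True

store.approve(pid)
result = store.commit(pid, execute=lambda a: f"executed {a['type']}")
```

Because `commit` reads the action back from the store, what runs is exactly what the reviewer saw.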

Checkpoints and Rollback

Learn

Python

Every graph-state transition persists. When a worker crashes, its lease expires and another worker picks up at the latest checkpoint. Cloudflare Durable Objects hold state acros…
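
The lease-and-resume behavior described above can be sketched as follows. `TaskRecord`, `try_acquire`, and `resume_state` are hypothetical names, the 30-second TTL is an arbitrary choice, and time is passed in explicitly so the example is deterministic; a real system would keep this record in a shared durable store.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TaskRecord:
    checkpoints: list = field(default_factory=list)  # persisted state transitions
    lease_owner: Optional[str] = None
    lease_expires_at: float = 0.0


def try_acquire(task: TaskRecord, worker: str, now: float, ttl: float = 30.0) -> bool:
    """Grant the lease if it is free, expired, or already held by this worker."""
    if task.lease_owner in (None, worker) or now >= task.lease_expires_at:
        task.lease_owner = worker
        task.lease_expires_at = now + ttl
        return True
    return False


def resume_state(task: TaskRecord) -> dict:
    """Return a copy of the latest checkpoint, or an initial state if none exists."""
    return dict(task.checkpoints[-1]) if task.checkpoints else {"step": 0}


task = TaskRecord()
try_acquire(task, "worker-a", now=0.0)
task.checkpoints.append({"step": 1})
task.checkpoints.append({"step": 2})
# worker-a crashes here and stops renewing its lease.

blocked = try_acquire(task, "worker-b", now=10.0)    # lease still live
took_over = try_acquire(task, "worker-b", now=31.0)  # lease expired
state = resume_state(task)
```

The second worker starts from step 2 and does not redo step 1, which is the point of persisting every transition.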

Constitutional AI and Rule Overrides

Learn

Python

Anthropic's January 22, 2026 Claude Constitution runs 79 pages and is CC0. It moves from rule-based to reason-based alignment and establishes a four-tier priority hierarchy: (1)…

Llama Guard and Input/Output Classification

Learn

Python

Llama Guard 3 (Meta, Llama-3.1-8B base, fine-tuned for content safety) classifies both LLM inputs and outputs against an MLCommons 13-hazard taxonomy across 8 languages. A 1B-IN…

Anthropic Responsible Scaling Policy v3.0

Learn

Python

RSP v3.0 went into effect February 24, 2026, replacing the 2023 policy. Two-tier mitigation: what Anthropic will do unilaterally vs what is framed as an industry-wide recommenda…

OpenAI Preparedness Framework and DeepMind FSF

Learn

Python

OpenAI Preparedness Framework v2 (April 2025) introduces Research Categories — Long-range Autonomy, Sandbagging, Autonomous Replication and Adaptation, Undermining Safeguards — …

METR Time Horizons and External Evaluation

Learn

Python

METR (ex-ARC Evals) has been an independent 501(c)(3) since December 2023. Their Time Horizon 1.1 benchmark (January 2026) fits a logistic curve to task-success probability vs log(exp…

CAIS, CAISI, and Societal-Scale Risk

Learn

Python

The Center for AI Safety (CAIS, San Francisco, founded 2022 by Hendrycks and Zhang) publishes the four-risk framework — malicious use, AI races, organizational risks, rogue AIs …