AI Systems
Distributed Training Job Scheduler
Orchestrate large multi-node GPU training jobs sharing a cluster with other workloads.
Scale to anchor on
Thousands of GPUs in a cluster, multi-week jobs, mixed batch and interactive workloads, multi-tenant teams.
Requirements
Functional
- Gang-schedule N GPUs atomically across nodes.
- Support priority, preemption, and quotas per team.
- Resume jobs after failure from sharded checkpoints.
- Topology-aware placement for collective communication efficiency.
Non-functional
- High cluster utilization without starvation.
- Bounded queueing time for high-priority jobs.
- Recovery from node failures without restarting the whole job.
High-level architecture
A scheduler tracks GPU inventory and topology. Gang scheduling either allocates all of a job's required GPUs atomically or leaves the job queued. Topology-aware placement keeps each tensor-parallel (TP) group within a single node. Checkpoint storage lets a job resume after node failure or preemption.
Components
Inventory service
Live view of GPU availability, health, and topology.
Scheduler
Gang-aware, priority-aware, topology-aware allocation.
Health monitor
Detects degraded nodes (thermal, NIC speed, ECC) and drains them.
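One way to picture the drain decision is a relative threshold on a per-node timing signal. The sketch below is a minimal illustration under stated assumptions: the function name `find_stragglers`, the per-node step-time input, and the 1.2x threshold are all hypothetical and not part of the design above.

```python
from statistics import median


def find_stragglers(step_times: dict[str, float], slowdown: float = 1.2) -> list[str]:
    """Return nodes whose step time exceeds `slowdown` times the median.

    `step_times` maps a node name to its most recent step time in seconds
    (an assumed signal; a real monitor would also read thermal, NIC, and ECC data).
    """
    if not step_times:
        return []
    baseline = median(step_times.values())
    # Compare against the median so a single slow node cannot shift the baseline much.
    return sorted(node for node, t in step_times.items() if t > slowdown * baseline)
```

Comparing against the median rather than a fixed number of seconds keeps the rule independent of model size, at the cost of missing a slowdown that affects most nodes at once.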
Checkpoint store
Sharded, asynchronous, durable storage for resumption.
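Because shards are written asynchronously, a restart can find a newer checkpoint that is only partly on disk. The sketch below shows one way to pick a safe resume point; the function name `latest_complete_step` and the in-memory map of written shards are hypothetical stand-ins for whatever manifest the store actually keeps.

```python
from typing import Optional


def latest_complete_step(shards_written: dict[int, set[int]], num_shards: int) -> Optional[int]:
    """Return the newest training step for which every shard has been written.

    `shards_written` maps a step number to the set of shard indices found in storage.
    Returns None when no checkpoint is complete.
    """
    expected = set(range(num_shards))
    # Keep only steps whose shard set covers every expected index.
    complete = [step for step, shards in shards_written.items() if shards >= expected]
    return max(complete, default=None)
```

For example, with four shards and steps 100 and 200 fully written but step 300 holding only two shards, the job would resume from step 200.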
Job controller (per job)
Drives the training process, handles restart on failure.
Key decisions
Gang scheduling.
Partial allocation wastes GPUs, which sit idle while waiting for the rest of the job to land.
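The all-or-nothing rule can be sketched as a plan-then-commit step. This is a minimal illustration, assuming inventory is a simple map of node name to free GPU count; the function name `gang_allocate` and that data shape are hypothetical.

```python
from typing import Optional


def gang_allocate(free: dict[str, int], gpus_needed: int) -> Optional[dict[str, int]]:
    """Return a node -> GPU count plan, or None so the job stays queued.

    Inventory in `free` is mutated only when the whole request fits.
    """
    if sum(free.values()) < gpus_needed:
        return None  # never hold a partial allocation
    plan: dict[str, int] = {}
    remaining = gpus_needed
    # Take from nodes with the most free GPUs first so the job spans few nodes.
    for node, count in sorted(free.items(), key=lambda kv: -kv[1]):
        take = min(count, remaining)
        if take:
            plan[node] = take
            remaining -= take
        if remaining == 0:
            break
    # Commit only after the full plan exists.
    for node, take in plan.items():
        free[node] -= take
    return plan
```

A request that does not fit leaves the inventory untouched, so other queued jobs can still use those GPUs.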
Topology-aware placement.
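Cross-node tensor parallelism (TP) can be 5–10x slower than intra-node TP, because its per-layer collectives leave the fast intra-node interconnect; placement is the cheapest performance lever.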
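A minimal placement sketch, assuming the only topology constraint is that each TP group must fit on one node. The function name `place_tp_groups` and the best-fit ordering are illustrative choices, not the design's prescribed algorithm.

```python
from typing import Optional


def place_tp_groups(free: dict[str, int], tp_size: int, num_groups: int) -> Optional[list[str]]:
    """Assign each tensor-parallel group to a single node.

    Returns one node name per group, or None if some group would have to span nodes.
    """
    placement: list[str] = []
    # Best fit: try nodes with the fewest free GPUs first to limit fragmentation.
    for node, count in sorted(free.items(), key=lambda kv: kv[1]):
        fits = count // tp_size  # whole TP groups this node can host
        placement.extend([node] * min(fits, num_groups - len(placement)))
        if len(placement) == num_groups:
            return placement
    return None
```

Returning None instead of splitting a group across nodes keeps the decision consistent with gang scheduling: the job waits rather than running in a slow configuration.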
Periodic checkpoints with hot rolling.
Multi-week jobs without checkpoints are not economically viable on commodity infrastructure.
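For a first estimate of checkpoint frequency, one common rule of thumb is Young's first-order approximation, which sets the interval to sqrt(2 x checkpoint cost x mean time between failures). This formula is swapped in here for illustration and is not prescribed by the design above; the function name, the assumption of independent node failures, and the input values are hypothetical.

```python
import math


def checkpoint_interval_s(checkpoint_cost_s: float, node_mtbf_s: float, num_nodes: int) -> float:
    """Estimate seconds between checkpoints using Young's approximation.

    Assumes independent node failures, so the job-level MTBF shrinks
    in proportion to the number of nodes the job occupies.
    """
    job_mtbf_s = node_mtbf_s / num_nodes
    return math.sqrt(2 * checkpoint_cost_s * job_mtbf_s)


# Illustrative inputs: 50 s per checkpoint, 1,000,000 s node MTBF, 100 nodes.
interval = checkpoint_interval_s(50, 1_000_000, 100)
```

With these inputs the job-level MTBF is 10,000 s and the estimated interval is 1,000 s. The estimate shrinks as the job grows, which is one reason large jobs favor cheap, asynchronous checkpoint writes.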
Per-team quota with preemption.
Prevents one team from monopolizing the cluster while allowing utilization in idle periods.
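One possible automated policy is to reclaim only borrowed capacity: preempt lower-priority jobs from teams currently above quota, and do nothing if that cannot free enough GPUs. The sketch below assumes this policy; the `Job` fields and the `pick_victims` name are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    team: str
    gpus: int
    priority: int  # higher means more important


def pick_victims(running: list[Job], quota: dict[str, int],
                 gpus_needed: int, priority: int) -> list[Job]:
    """Choose jobs to preempt for an incoming job with the given priority.

    Returns an empty list when eligible jobs cannot free `gpus_needed` GPUs,
    so no job is preempted for nothing.
    """
    usage: dict[str, int] = {}
    for job in running:
        usage[job.team] = usage.get(job.team, 0) + job.gpus
    victims: list[Job] = []
    freed = 0
    # Lowest priority first, so the least important borrowed work goes first.
    for job in sorted(running, key=lambda j: j.priority):
        if freed >= gpus_needed:
            break
        over_quota = usage[job.team] > quota.get(job.team, 0)
        if job.priority < priority and over_quota:
            victims.append(job)
            usage[job.team] -= job.gpus  # re-evaluate the team's quota status
            freed += job.gpus
    return victims if freed >= gpus_needed else []
```

Usage is updated after each victim, so a team is never pushed below its quota by more than one job's worth of GPUs.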
Pitfalls
- No gang scheduling: jobs that each hold part of their GPUs while waiting for the rest can deadlock one another.
- Straggler nodes stay invisible until step time degrades.
- Single-node checkpoint bottleneck: all shards funnel through one writer.
- No clear preemption policy: political battles instead of automation.
Follow-up questions
- How do you handle a single bad GPU mid-run?
- What's the checkpoint frequency and write strategy?
- How do you reduce mean queue time for high-priority jobs?