AI Systems

Distributed Training Job Scheduler

Orchestrate large multi-node GPU training jobs sharing a cluster with other workloads.

Scale to anchor on

Thousands of GPUs in a cluster, multi-week jobs, mixed batch and interactive workloads, multi-tenant teams.

Requirements

Functional

Gang-schedule N GPUs atomically across nodes.
Support priority, preemption, and quotas per team.
Resume jobs after failure from sharded checkpoints.
Topology-aware placement for collective communication efficiency.

Non-functional

High cluster utilization without starvation.
Bounded queueing time for high-priority jobs.
Survive node failures without restarting the whole job.

High-level architecture

A scheduler tracks GPU inventory and topology. Gang scheduling allocates all of a job's GPUs atomically or leaves the job queued. Topology-aware placement keeps each tensor-parallel (TP) group within a single node. Checkpoint storage lets a job resume after node failure or preemption.
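
The all-or-nothing rule can be sketched as follows. This is a minimal illustration, not a production scheduler: the Cluster class, its free map, and try_gang_allocate are hypothetical names, and the inventory is reduced to a count of free GPUs per node.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Cluster:
    # Hypothetical inventory: node name -> number of free GPUs.
    free: dict[str, int] = field(default_factory=dict)

    def try_gang_allocate(self, gpus_needed: int) -> Optional[dict[str, int]]:
        """Return a node -> GPU-count plan covering the whole job, or None.

        Nothing is reserved unless the full request fits, so a job never
        holds a partial allocation while waiting for the remainder.
        """
        plan: dict[str, int] = {}
        remaining = gpus_needed
        # Prefer nodes with the most free GPUs so the job spans few nodes.
        for node, avail in sorted(self.free.items(), key=lambda kv: -kv[1]):
            if remaining == 0:
                break
            take = min(avail, remaining)
            if take > 0:
                plan[node] = take
                remaining -= take
        if remaining > 0:
            return None  # not enough capacity: the caller queues the job
        # Commit only after the whole plan is known to fit.
        for node, take in plan.items():
            self.free[node] -= take
        return plan


cluster = Cluster(free={"node-a": 8, "node-b": 8, "node-c": 4})
print(cluster.try_gang_allocate(16))  # fits: the whole gang is placed
print(cluster.try_gang_allocate(8))   # only 4 GPUs left: nothing is reserved
print(cluster.free)
```

The second request fails without touching the inventory, which is the property that keeps half-placed jobs from pinning GPUs.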

Components

Inventory service

Live view of GPU availability, health, and topology.

Scheduler

Gang-aware, priority-aware, topology-aware allocation.

Health monitor

Detects degraded nodes (thermal, NIC speed, ECC) and drains them.
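
A drain decision of this kind might look like the sketch below. The NodeHealth fields and both thresholds are illustrative assumptions; real limits depend on the GPU and NIC models in the cluster.

```python
from dataclasses import dataclass


@dataclass
class NodeHealth:
    name: str
    gpu_temp_c: float        # hottest GPU on the node
    nic_gbps: float          # negotiated NIC link speed
    ecc_uncorrectable: int   # uncorrectable ECC errors since the last check


# Illustrative thresholds only (assumptions, not vendor limits).
MAX_TEMP_C = 85.0
MIN_NIC_GBPS = 200.0


def degradation_reasons(h: NodeHealth) -> list[str]:
    """List the checks this node currently fails."""
    reasons = []
    if h.gpu_temp_c > MAX_TEMP_C:
        reasons.append("thermal")
    if h.nic_gbps < MIN_NIC_GBPS:
        reasons.append("nic_speed")
    if h.ecc_uncorrectable > 0:
        reasons.append("ecc")
    return reasons


def nodes_to_drain(reports: list[NodeHealth]) -> dict[str, list[str]]:
    """Map each degraded node to the reasons it should be drained."""
    drain = {}
    for h in reports:
        reasons = degradation_reasons(h)
        if reasons:
            drain[h.name] = reasons
    return drain


reports = [
    NodeHealth("node-a", gpu_temp_c=70.0, nic_gbps=400.0, ecc_uncorrectable=0),
    NodeHealth("node-b", gpu_temp_c=91.0, nic_gbps=100.0, ecc_uncorrectable=0),
]
print(nodes_to_drain(reports))
```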

Checkpoint store

Sharded, asynchronous, durable storage for resumption.
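
One way to get sharded writes that are safe to resume from is sketched below, assuming a shared filesystem with atomic rename. The directory layout, the MANIFEST.json marker, and the function names are hypothetical; a thread pool stands in for the per-rank writers that would run in parallel on separate hosts.

```python
import json
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Optional


def write_shard(ckpt_dir: Path, rank: int, payload: bytes) -> str:
    """Write one rank's shard via a temp file plus atomic rename."""
    final = ckpt_dir / f"shard-{rank:05d}.bin"
    tmp = final.with_suffix(".tmp")
    with open(tmp, "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, final)  # atomic when tmp and final share a filesystem
    return final.name


def save_checkpoint(root: Path, step: int, shards: dict[int, bytes]) -> Path:
    ckpt_dir = root / f"step-{step:08d}"
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    # Shards are written in parallel, so no single writer is the bottleneck.
    with ThreadPoolExecutor(max_workers=8) as pool:
        names = list(pool.map(lambda kv: write_shard(ckpt_dir, *kv), shards.items()))
    # The manifest is written last: its presence marks the checkpoint complete.
    manifest_tmp = ckpt_dir / "MANIFEST.tmp"
    manifest_tmp.write_text(json.dumps({"step": step, "shards": sorted(names)}))
    os.replace(manifest_tmp, ckpt_dir / "MANIFEST.json")
    return ckpt_dir


def latest_complete(root: Path) -> Optional[Path]:
    """Newest checkpoint directory that has a manifest, or None."""
    done = [d for d in root.glob("step-*") if (d / "MANIFEST.json").exists()]
    return max(done, key=lambda d: d.name, default=None)


with tempfile.TemporaryDirectory() as tmpdir:
    root = Path(tmpdir)
    save_checkpoint(root, 100, {0: b"rank0-state", 1: b"rank1-state"})
    print(latest_complete(root).name)
```

A checkpoint interrupted mid-write has no manifest, so resumption skips it and falls back to the previous complete one.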

Job controller (per job)

Drives the training process, handles restart on failure.
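
The restart behavior can be sketched as a loop around the training entry point. NodeFailure, run_with_restarts, and the two callbacks are hypothetical names; the simulated training function exists only to exercise the loop.

```python
from typing import Callable, Optional


class NodeFailure(Exception):
    """Stand-in for any recoverable failure reported by the training process."""


def run_with_restarts(
    train_from: Callable[[int], None],
    latest_checkpoint_step: Callable[[], Optional[int]],
    max_restarts: int = 3,
) -> int:
    """Run training, resuming from the newest checkpoint after each failure.

    Returns the number of restarts that were needed.
    """
    restarts = 0
    while True:
        start_step = latest_checkpoint_step() or 0
        try:
            train_from(start_step)
            return restarts
        except NodeFailure:
            restarts += 1
            if restarts > max_restarts:
                raise  # give up and surface the failure to the operator


# Simulated job: checkpoints every 100 steps and fails once at step 250.
state = {"ckpt": None, "failed": False}


def fake_train(start: int) -> None:
    for step in range(start + 1, 501):
        if step == 250 and not state["failed"]:
            state["failed"] = True
            raise NodeFailure("node lost")
        if step % 100 == 0:
            state["ckpt"] = step


restarts = run_with_restarts(fake_train, lambda: state["ckpt"])
print(restarts, state["ckpt"])
```

In the simulation the job loses only the work since step 200, not the whole run.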

Key decisions

Gang scheduling.

A partial allocation wastes GPUs: they sit idle while waiting for the rest of the job to land.

Topology-aware placement.

Cross-node tensor parallelism (TP) can be 5–10x slower than intra-node TP; placement is the cheapest performance lever.
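
A placement rule that never splits a TP group across nodes could be sketched like this. The function name and the free-GPU map are assumptions, and the sketch ignores finer topology such as rack or switch locality.

```python
from typing import Optional


def place_tp_groups(free: dict[str, int], tp_size: int, num_groups: int) -> Optional[list[str]]:
    """Assign each tensor-parallel group to a single node.

    Returns one node name per group, or None if the groups cannot all be
    kept intra-node with the GPUs currently free.
    """
    placement: list[str] = []
    # Best fit: use nodes with the fewest free GPUs first to limit fragmentation.
    for node, avail in sorted(free.items(), key=lambda kv: kv[1]):
        slots = avail // tp_size  # whole groups that fit on this node
        take = min(slots, num_groups - len(placement))
        placement.extend([node] * take)
        if len(placement) == num_groups:
            return placement
    return None


free = {"node-a": 8, "node-b": 4, "node-c": 6}
print(place_tp_groups(free, tp_size=4, num_groups=4))
print(place_tp_groups(free, tp_size=8, num_groups=2))
```

The second call is rejected even though 18 GPUs are free in total, because only one node can hold a full 8-GPU group. Counting free GPUs alone would have accepted it and forced a cross-node group.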

Periodic checkpoints with hot rolling.

Multi-week jobs without checkpoints are not economically viable on commodity infrastructure.
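
For choosing the checkpoint period, a common starting point is Young's first-order approximation, interval ≈ sqrt(2 × checkpoint cost × MTBF). The sketch below applies it under the assumption that node failures are independent, so the job-level MTBF is the per-node MTBF divided by the node count. The input numbers are made up for illustration.

```python
import math


def job_mtbf_s(node_mtbf_s: float, num_nodes: int) -> float:
    """Job-level mean time between failures, assuming independent node failures."""
    return node_mtbf_s / num_nodes


def young_interval_s(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's approximation for the interval between checkpoints."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


# Illustrative inputs: 60 s to write a checkpoint, 250 nodes,
# each node failing on average once every 125 days.
node_mtbf = 125 * 24 * 3600
mtbf = job_mtbf_s(node_mtbf, num_nodes=250)   # 12 hours for the whole job
interval = young_interval_s(60.0, mtbf)
print(f"job MTBF: {mtbf / 3600:.1f} h, checkpoint every {interval / 60:.0f} min")
```

The approximation shows the trade-off directly: cheaper checkpoint writes or a larger job (lower MTBF) both push toward more frequent checkpoints.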

Per-team quota with preemption.

Prevents one team from monopolizing the cluster while still letting jobs use capacity that would otherwise sit idle.
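
One possible automated policy is sketched below: only jobs from teams running above their quota are preemptible, lower-priority jobs go first, and nothing is preempted unless enough GPUs can be freed for the incoming job. The Job fields and pick_victims are hypothetical, and this is one policy among many.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Job:
    name: str
    team: str
    gpus: int
    priority: int  # larger means more important


def pick_victims(
    running: list[Job], quota: dict[str, int], needed: int, incoming_priority: int
) -> Optional[list[Job]]:
    """Choose jobs to preempt so that `needed` GPUs are freed, or None."""
    usage: dict[str, int] = {}
    for j in running:
        usage[j.team] = usage.get(j.team, 0) + j.gpus
    victims: list[Job] = []
    freed = 0
    # Lowest priority first.
    for j in sorted(running, key=lambda j: j.priority):
        if freed >= needed:
            break
        over_quota = usage[j.team] > quota.get(j.team, 0)
        if over_quota and j.priority < incoming_priority:
            victims.append(j)
            freed += j.gpus
            usage[j.team] -= j.gpus  # the team may drop back within quota
    return victims if freed >= needed else None


quota = {"research": 16, "prod": 16}
running = [
    Job("r1", "research", gpus=8, priority=1),
    Job("r2", "research", gpus=16, priority=2),
    Job("p1", "prod", gpus=8, priority=5),
]
print(pick_victims(running, quota, needed=8, incoming_priority=5))
print(pick_victims(running, quota, needed=16, incoming_priority=5))
```

In the second call, preempting r1 brings the research team back to its quota, so r2 is protected and the request is refused rather than partially served.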

Pitfalls

No gang scheduling — partial allocations create deadlock.
Slow straggler nodes invisible until step time degrades.
Single-node checkpoint bottleneck.
No clear preemption policy — political battles instead of automation.
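
The straggler pitfall above can be made visible earlier by comparing ranks against each other instead of watching the global step time. The sketch below assumes each rank reports its own compute time per step, measured before the collective (after synchronization every rank sees the same wall time). The function name and the 1.2 tolerance are illustrative.

```python
from statistics import median


def find_stragglers(step_time_s: dict[int, float], tolerance: float = 1.2) -> list[int]:
    """Return ranks whose step time exceeds the median by the given factor."""
    typical = median(step_time_s.values())
    return sorted(r for r, t in step_time_s.items() if t > tolerance * typical)


# Per-rank compute time for one step, in seconds (made-up values).
times = {0: 1.00, 1: 1.02, 2: 0.99, 3: 1.45}
print(find_stragglers(times))
```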

Follow-up questions

How do you handle a single bad GPU mid-run?
What's the checkpoint frequency and write strategy?
How do you reduce mean queue time for high-priority jobs?

Related patterns

queue-decoupling
leader-election
circuit-breaker