Module 3 · Lesson 2 · Core · 30 min

Training Pipelines: Reproducibility as a System Property

Reproducibility isn't a checkbox — it's a system property that requires versioning data, code, environment, randomness, and intent. This lesson covers the 5-Axis Model and how to design for it without paying 10× in operational overhead.

Reproducibility in ML is usually discussed as a binary: either your pipeline is reproducible or it isn't. The interviewer asks 'is your training pipeline reproducible?' and the candidate either says 'yes, we set seeds and use git' or 'we try.' Both answers reveal the same gap: there is no single thing called reproducibility. There are five axes, each independently versionable, each independently expensive, and each independently breakable. The Staff move is to name which axes you've committed to, which you've sacrificed for throughput, and why.

The 5-Axis Reproducibility Model is the framework that converts the binary question into a system-design question with explicit trade-offs. It also rules out the most common reproducibility failure, 'we can't figure out which version of the code trained this model,' by construction: Axis 2 (code) is the cheapest axis and the first one teams adopt, so a pipeline that records it on every run cannot lose track of it. The model's job is not to demand strict reproducibility everywhere; it's to make the right trade visible.

Framework

The 5-Axis Reproducibility Model

Reproducibility isn't a checkbox. It's a system property that requires versioning five distinct axes: data, code, environment, randomness, and intent. Every axis you don't version is an axis on which your model silently changes between training runs. The 5-Axis Model is the framework that lets you defend the level of reproducibility you've designed for and refuse the false dichotomy of 'reproducible vs fast.'

1
Axis 1 — Data
The training dataset hash, the labeling pipeline version, the snapshot of the feature store as of training time. A model trained on 'today's data' is not reproducible because today's data is gone tomorrow. The cheapest version of this axis is a content-addressed snapshot reference (Delta, Iceberg, DVC); the most expensive is full materialization of the training set per run.
2
Axis 2 — Code
Git commit hash for the training code, dependency versions, the actual transformation logic that touches data. Code reproducibility is the cheapest axis and the one teams usually have. It's also the axis whose absence is most embarrassing — 'we don't know which code trained this model' is the conversation no one wants to have.
3
Axis 3 — Environment
CUDA version, library versions, kernel versions, hardware revision. Different GPU generations produce different floating-point results; library upgrades change reduction orders. Container images pin most of this; the hardware revision is the often-missed sub-axis, especially when training spans different cloud-vendor instance pools.
4
Axis 4 — Randomness
Seeds for shuffling, dropout, augmentation, weight init, optimizer state. Naively setting a single seed doesn't make a run reproducible: modern frameworks have multiple independent RNGs (Python, NumPy, framework, CUDA), and parallel data loaders use derived seeds. Reproducible randomness means setting and recording all of them, and accepting the small throughput hit from deterministic operations.
5
Axis 5 — Intent
The label on the run — what was this training trying to demonstrate? Hyperparameters, ablation toggles, the experiment-tracking metadata. Without intent, you can reproduce the bit-exact training run and still not be able to answer 'why did we train this?' Intent is the axis that makes the other four useful in retrospect.
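
To make Axis 1 concrete: a content-addressed snapshot reference can be as small as one digest computed over the training files. The sketch below is a minimal illustration under that assumption, not a description of how Delta, Iceberg, or DVC work internally. The function names and the idea of a single snapshot directory are hypothetical, chosen only for the example.

```python
import hashlib
from pathlib import Path


def _file_digest(path: Path) -> str:
    """SHA-256 of one file, read in 1 MiB chunks to bound memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot_id(root: Path) -> str:
    """Return one digest covering every file under `root` (hypothetical helper).

    Files are ordered by their relative POSIX path, and each path is mixed
    into the digest next to its content hash, so edits, renames, additions,
    and deletions all change the ID.
    """
    files = sorted(
        (p for p in root.rglob("*") if p.is_file()),
        key=lambda p: p.relative_to(root).as_posix(),
    )
    combined = hashlib.sha256()
    for path in files:
        rel = path.relative_to(root).as_posix()
        combined.update(f"{rel}\0{_file_digest(path)}\n".encode("utf-8"))
    return combined.hexdigest()
```

A pipeline that logs this ID at training time turns 'today's data' into a reference that still resolves tomorrow, provided the files behind it are retained.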

When to use

Apply the model to any 'design our training pipeline' or 'how do you handle reproducibility?' question. Also apply it during model post-mortems — most 'I can't reproduce the bug' incidents trace to an unversioned axis.

Worked example

Interview prompt: 'How do you make training reproducible?' Senior answer: 'We use git, set random seeds, and use containers.' Staff answer: 'Five axes. Data — content-addressed snapshots with the training pipeline pinning the snapshot ID. Code — git commit hash, dependency versions, captured by the pipeline. Environment — container image hash plus the GPU generation we ran on, because A100 and H100 produce different bit-level outputs in some operations. Randomness — all four seeds (Python, NumPy, framework, CUDA) set and logged, deterministic ops enabled with the throughput cost accepted. Intent — every run has experiment-tracker metadata explaining what it was trying to demonstrate. Strict reproducibility costs ~10% throughput; we pay it on shipped models, not on hyperparameter searches.'

Calibration ladder

You're three months into production with a model that's behaving unexpectedly. You want to retrain to compare against a known-good baseline. Can you?

Reproducibility probe disguised as a debugging question.

L4 · Mid

Yes, we use git so we can retrain from the same code.

Missed: Treated git as sufficient. Will not be able to reproduce this model.

L5 · Senior

Yes — same git commit, same training data snapshot, same hyperparameters. Should produce a similar model.

Missed: Knew about data and code but missed environment, randomness, and intent.

L6 · Staff

Depends on which axes we versioned. Code via git is fine. Data: if we have a content-addressed snapshot of the training data, yes; if we trained on 'last week's snapshot' without pinning, no. Environment: probably the same container image, but if we trained on a different GPU generation than what's now available, expect bit-level differences. Randomness: only if we set and logged all four seeds, which most teams don't. Realistically, we can get to a 'qualitatively similar model' but not a bit-exact reproduction.

Missed: Strong axis-by-axis analysis. Missing the meta-move — that reproducibility is a tiered policy and the right level depends on the use case.

L7 · Principal

Same five-axis decomposition, with the meta-acknowledgment that strict reproducibility costs throughput and most teams don't pay that cost on hyperparameter searches. The right policy is tiered: dev runs and hyperparameter searches get fast-and-non-reproducible; shipped models get strict-and-slow. The team has to decide which axis to enforce when, and the interview answer is to name that decision. For this specific debugging scenario, what we actually need is not bit-exact reproduction but reproducibility-of-conclusions: can we retrain a model that demonstrates the same behavior on the same data and confirms whether the bug is in the model or in the serving path? That's a softer reproducibility bar: it typically requires Axes 1, 2, 3, and 5 (data, code, environment, intent) but tolerates non-bit-exact randomness. Naming the right reproducibility bar for the use case is what separates strict-correctness Staff answers from production-pragmatic Principal answers.

What scored L7

Named the tiered policy (dev runs vs shipped models) and the use-case-specific bar (reproducibility-of-conclusions vs bit-exact). The conclusion-level bar is the practical reproducibility most teams need and the one most candidates miss. Reframing 'can you reproduce' as 'reproduce to what bar' is the L7 move.

Pattern recognition

When you see

Someone says 'we need this pipeline to be reproducible.'

→

Think

Ask: 'reproducible to what bar — bit-exact, statistically equivalent, or conclusion-equivalent?' The three bars cost very different amounts to enforce.

Strict bit-exact reproducibility costs throughput, locks you to specific hardware, and adds operational complexity. Statistical-equivalence reproducibility (same data, same code, similar metrics within tolerance) is much cheaper and adequate for most use cases. Conclusion-equivalence reproducibility (same data, can demonstrate the same behavior) is cheaper still and adequate for most debugging. Picking the wrong bar means either over-paying for reproducibility you don't need, or under-investing in the level you actually need. Naming the bar is the Staff move.
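
The first two bars can be checked mechanically, which is part of why they are cheap to name. The sketch below is one possible check under stated assumptions: model weights are available as serialized bytes, metrics are plain name-to-value mappings, and the 1% relative tolerance is an arbitrary placeholder rather than a recommended value. The function name `achieved_bar` is hypothetical. Conclusion-equivalence is deliberately not automated here, because it depends on the behavior being investigated.

```python
import hashlib
from typing import Mapping


def achieved_bar(
    baseline_weights: bytes,
    retrained_weights: bytes,
    baseline_metrics: Mapping[str, float],
    retrained_metrics: Mapping[str, float],
    rel_tolerance: float = 0.01,  # assumed placeholder, tune per model
) -> str:
    """Classify a retrain against a baseline (hypothetical helper)."""
    # Bar 1: bit-exact means the serialized weights hash identically.
    if (hashlib.sha256(baseline_weights).digest()
            == hashlib.sha256(retrained_weights).digest()):
        return "bit-exact"

    # Bar 2: statistical equivalence means every metric agrees within
    # the relative tolerance, and both runs report the same metric names.
    same_names = baseline_metrics.keys() == retrained_metrics.keys()
    if same_names and baseline_metrics and all(
        abs(retrained_metrics[name] - value) <= rel_tolerance * abs(value)
        for name, value in baseline_metrics.items()
    ):
        return "statistically-equivalent"

    # Bar 3 needs a behavior-specific check that only the team can define.
    return "needs-conclusion-check"
```

For example, a retrain whose AUC moves from 0.900 to 0.895 lands on the statistical bar with this tolerance, while a drop to 0.864 (4% lower) falls through to the conclusion-level check.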

Drill · 10 minutes

Practice this. Time yourself.

You have 10 minutes. Your team trains a model on Monday, ships it Tuesday, and reports excellent offline metrics. On Wednesday you discover a serving bug and need to retrain to compare. The retrain produces metrics 4% lower than Monday's reported numbers. Which of the five axes are most likely unversioned? Write a 4-paragraph diagnostic: (1) ranked axis candidates, (2) how to confirm each, (3) the structural fix per axis, (4) the policy you'd commit to for future training runs.

Self-assessment rubric

Dimension	Weak	Passing	Strong	Staff bar
Axis ranking	Listed axes unranked.	Ranked Data > Randomness > Environment.	Ranked correctly with reasoning per rank.	Same plus: explicitly identified that 4% is above the 2-3% noise range that unversioned randomness alone typically explains, suggesting that Data is likely the dominant cause and Randomness is a secondary contributor.
Confirmation strategy	Generic 'check logs.'	Specific check per axis (data snapshot hash, git commit, container image).	Each check has an expected outcome and a falsification criterion.	Confirmation strategies are runnable in order of cost — cheapest checks first. Demonstrates the iterative-narrowing mental model.
Structural fix per axis	Suggested 'add logging.'	Per-axis fix: content-addressed data snapshots, git+container pinning, seed logging.	Per-axis fix plus the implementation cost (e.g., 'content-addressed snapshots cost ~10% storage; seed logging is free').	Same plus: the fix is wired into the pipeline so it cannot be forgotten on the next run — versioning enforced by the platform, not by the model team.
Future policy	Said 'we'll be more careful.'	Said 'we'll enforce all 5 axes on shipped models.'	Tiered policy: strict on shipped models, relaxed on hyperparameter searches.	Tiered policy with the explicit threshold — what counts as 'shipped' (e.g., any model that serves >1% of production traffic), what counts as 'experimental' (hyperparameter sweeps), and the platform enforcement mechanism that makes the tier visible at the pipeline level.
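
The Staff-bar reasoning in the axis-ranking row depends on comparing the observed gap with the spread expected from randomness alone. A minimal sketch of that comparison follows, assuming you have the same metric from several runs that differ only in seed. The function name and the two-standard-deviation threshold are illustrative assumptions, not a prescribed test.

```python
from statistics import mean, stdev
from typing import Sequence


def gap_exceeds_seed_noise(
    seed_run_metrics: Sequence[float],
    retrain_metric: float,
    num_std: float = 2.0,  # assumed threshold, not a standard
) -> bool:
    """True if the retrain sits further from the seed-run mean than
    `num_std` sample standard deviations (hypothetical helper)."""
    if len(seed_run_metrics) < 2:
        raise ValueError("need at least two seed-varied runs to estimate noise")
    center = mean(seed_run_metrics)
    spread = stdev(seed_run_metrics)
    return abs(retrain_metric - center) > num_std * spread
```

If the gap exceeds seed noise, randomness alone is an unlikely explanation and the cheaper next check is usually the data snapshot.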

Model solution

Axis ranking. (1) Data: most likely. 'Trained on Monday's data' usually means whatever was in the warehouse Monday morning; by Wednesday the data has changed because new events landed and possibly some labels were corrected. A 4% gap is consistent with two days of natural label drift. (2) Randomness: second. If the team set one seed and not all four, statistical noise alone can account for 2-3% on most production models. (3) Environment: third. Less likely but possible: a CUDA library update or a different GPU pool can produce small bit-level differences that compound through long training runs.

Confirmation strategy. Data: check whether the original training pipeline logged a snapshot ID (Delta, Iceberg, DVC). If yes, retrain from that snapshot and see if the gap closes. If no logged snapshot, this is almost certainly the cause and we can't confirm; we can only commit to fixing it. Randomness: check the run logs for seeds. If only torch.manual_seed was set, retrain with all four seeds set to the same values across runs; gap-from-randomness should be near zero. Environment: check the container image hash and the GPU type for Monday's run vs Wednesday's. If they differ, restart Wednesday's training on the original instance type with the original container.

Structural fix per axis. (1) Data: content-addressed snapshot enforced at the pipeline level. The training pipeline can only consume snapshots, not 'today's data.' Storage cost ~10% per shipped model; eliminates Axis 1 failures forever. (2) Randomness: a deterministic_seeds() utility called at training entry that sets all four seeds, enables deterministic CUDA ops, pins data loader worker seeds. Throughput cost ~10%; eliminates randomness drift. (3) Environment: container image hash pinned in the training manifest; GPU type recorded. Cost: free for image hash; GPU type can vary by ~5% in some operations and is the trade we accept for cloud-vendor flexibility. (4) Intent: experiment-tracking metadata captured automatically by the pipeline.

Future policy. Tiered. Shipped models (any model serving >1% of production traffic) get strict enforcement on all five axes. The pipeline refuses to ship a model that doesn't pass the five-axis check. Hyperparameter searches and experimental runs get the relaxed policy: code and data versioning, but no enforced randomness or environment pinning. The tiered policy is implemented at the platform level: the model registry has two ingestion paths (experimental, shipped), and the shipped path enforces the axes via pre-publish validation. This makes the right policy the default and the wrong policy impossible.
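
The lesson names a deterministic_seeds() utility but does not show its body. The sketch below is one way it might look, assuming PyTorch is the framework; NumPy and PyTorch are treated as optional so the function still runs where they are not installed. The return value and the seed-everything-with-one-integer choice are assumptions made for the example.

```python
import os
import random
from typing import Dict


def deterministic_seeds(seed: int) -> Dict[str, object]:
    """Seed every RNG available in this process and report what was set."""
    recorded: Dict[str, object] = {"python": seed}
    random.seed(seed)

    try:
        import numpy as np
    except ImportError:
        pass  # NumPy absent: nothing to seed on this axis
    else:
        np.random.seed(seed)  # seeds NumPy's legacy global RNG
        recorded["numpy"] = seed

    try:
        import torch
    except ImportError:
        pass  # assumption: PyTorch is the framework; skip if not installed
    else:
        # cuBLAS reads this when CUDA work starts; required for
        # deterministic matmuls on CUDA 10.2 and later.
        os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")
        torch.manual_seed(seed)
        # Explicit CUDA seeding; harmless when no GPU is present.
        torch.cuda.manual_seed_all(seed)
        # Raise an error instead of silently using a nondeterministic op.
        torch.use_deterministic_algorithms(True)
        # Autotuning can pick different kernels from run to run.
        torch.backends.cudnn.benchmark = False
        recorded.update(framework=seed, cuda=seed, deterministic_ops=True)

    return recorded
```

Logging the returned dictionary alongside the run covers the 'set and logged' half of the requirement. PyTorch data loader workers additionally need a `worker_init_fn` and a seeded `generator` passed to `DataLoader`, which this sketch leaves out.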

Common failures

✗ Did not rank the axes. Generic 'check everything' answers don't demonstrate prioritization.
✗ Suggested 'set the seed' as the randomness fix without naming the four independent RNGs.
✗ Suggested manual checking as the structural fix. Manual is not structural; platform enforcement is.
✗ Did not name the tiered policy. The single policy 'always strict' is operationally untenable; the policy 'always loose' is the status quo causing the bug.

Artifact · checklist

The 5-Axis Reproducibility Manifest

Per shipped model — record these (manifest fields)

☐ Axis 1 — Data: snapshot ID (Delta/Iceberg/DVC), training feature-store version.
☐ Axis 2 — Code: git commit hash, training script path, dependency lock file hash.
☐ Axis 3 — Environment: container image hash, GPU type and count, CUDA version.
☐ Axis 4 — Randomness: all four seeds (Python random, NumPy, framework, CUDA), deterministic ops flag.
☐ Axis 5 — Intent: experiment name, hyperparameters, ablation flags, hypothesis being tested.
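
The manifest fields above map naturally onto a small record type. The sketch below is one possible shape, assuming every field starts unset and an axis counts as versioned only when all of its fields are filled in. The class name, field names, and `unversioned_axes` helper are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class ReproManifest:
    # Axis 1: data
    data_snapshot_id: Optional[str] = None
    feature_store_version: Optional[str] = None
    # Axis 2: code
    git_commit: Optional[str] = None
    training_script: Optional[str] = None
    lockfile_sha256: Optional[str] = None
    # Axis 3: environment
    container_image_hash: Optional[str] = None
    gpu_type: Optional[str] = None
    gpu_count: Optional[int] = None
    cuda_version: Optional[str] = None
    # Axis 4: randomness
    seeds: Optional[Dict[str, int]] = None
    deterministic_ops: Optional[bool] = None
    # Axis 5: intent
    experiment_name: Optional[str] = None
    hyperparameters: Optional[Dict[str, object]] = None
    ablation_flags: Optional[Dict[str, bool]] = None
    hypothesis: Optional[str] = None


# Which manifest fields belong to which axis, in axis order.
AXIS_FIELDS: Dict[str, Tuple[str, ...]] = {
    "data": ("data_snapshot_id", "feature_store_version"),
    "code": ("git_commit", "training_script", "lockfile_sha256"),
    "environment": ("container_image_hash", "gpu_type", "gpu_count", "cuda_version"),
    "randomness": ("seeds", "deterministic_ops"),
    "intent": ("experiment_name", "hyperparameters", "ablation_flags", "hypothesis"),
}


def unversioned_axes(manifest: ReproManifest) -> List[str]:
    """Axes with at least one field still unset."""
    return [
        axis
        for axis, names in AXIS_FIELDS.items()
        if any(getattr(manifest, name) is None for name in names)
    ]
```

Using `None` as the unset marker keeps legitimately empty values, such as a run with no ablation flags, distinct from values that were never recorded.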

Per pipeline — enforce these (platform contracts)

☐ Pipeline refuses to train on non-snapshotted data.
☐ Pipeline captures git commit and dependency lock automatically.
☐ Pipeline records container hash and GPU type.
☐ Pipeline calls deterministic_seeds() at entry for shipped runs.
☐ Pipeline requires intent metadata before submission.
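
The 'captures automatically' contracts are the easiest to sketch, because the pipeline can read most of this from its own process. The example below covers only the git commit, the lock file hash, and the Python runtime; container hash and GPU type depend on the orchestration layer and are left out. The function names and the returned keys are hypothetical.

```python
import hashlib
import platform
import subprocess
from pathlib import Path
from typing import Dict, Optional


def _git(repo: Path, *args: str) -> Optional[str]:
    """Run a git command in `repo`; return None if git or the repo is missing."""
    try:
        done = subprocess.run(
            ["git", "-C", str(repo), *args],
            capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return done.stdout.strip()


def capture_run_context(repo: Path, lockfile: Path) -> Dict[str, object]:
    """Collect code and runtime identifiers without asking the model team."""
    status = _git(repo, "status", "--porcelain")
    return {
        "git_commit": _git(repo, "rev-parse", "HEAD"),
        # Uncommitted changes mean the commit hash alone does not
        # describe the code that actually ran.
        "git_dirty": None if status is None else bool(status),
        "lockfile_sha256": (
            hashlib.sha256(lockfile.read_bytes()).hexdigest()
            if lockfile.is_file() else None
        ),
        "python_version": platform.python_version(),
        "platform": platform.platform(),
    }
```

Recording the dirty flag matters as much as the hash: a run launched from a modified working tree is unversioned on Axis 2 even though a commit ID was logged.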

Tier policy

☐ Shipped models (>1% production traffic): all 5 axes enforced.
☐ Experimental runs (hyperparameter sweeps): Axes 1, 2, 5 enforced; 3 and 4 relaxed for throughput.
☐ Throwaway debug runs: Axes 2 and 5 enforced only.
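
The tier policy above is small enough to encode as a lookup plus one check that runs before submission. The sketch below mirrors the three tiers in the checklist; the tier keys, the exception type, and the `check_tier` function are hypothetical names, and how a platform decides which tier a run belongs to is outside its scope.

```python
from typing import Dict, FrozenSet, Iterable

# Axes each tier must have versioned, taken from the checklist above.
REQUIRED_AXES: Dict[str, FrozenSet[str]] = {
    "shipped": frozenset({"data", "code", "environment", "randomness", "intent"}),
    "experimental": frozenset({"data", "code", "intent"}),
    "debug": frozenset({"code", "intent"}),
}


class TierViolation(Exception):
    """Raised when a run lacks an axis its tier requires."""


def check_tier(tier: str, versioned_axes: Iterable[str]) -> None:
    """Refuse the run if any required axis is unversioned."""
    if tier not in REQUIRED_AXES:
        raise ValueError(f"unknown tier: {tier!r}")
    missing = REQUIRED_AXES[tier] - set(versioned_axes)
    if missing:
        raise TierViolation(f"{tier} run is missing axes: {sorted(missing)}")
```

Raising instead of warning is the point: the check lives in the platform, so the model team cannot forget it on the next run.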

Post-mortem · anonymized

Setup

Large-scale recsys team at a video platform. A senior engineer trained a model on a Friday afternoon, shipped it Saturday, and went on vacation. On Wednesday the team noticed a quality regression and rolled back. The senior engineer returned Thursday and tried to reproduce the original model to compare.

What happened

She could not reproduce it. The training pipeline didn't pin data snapshots; she had trained on 'last week's data' and that data had been overwritten by the next week's snapshot in the warehouse. The pipeline pinned code (git hash logged) and environment (container image hash logged) but not data or seeds. A retrained model on the new snapshot produced metrics 6% below Friday's. There was no way to tell whether the gap was data drift, randomness, or a real change. The team rolled forward to the new model because they had no baseline to roll back to in any meaningful sense.

The moment

The retrospective conclusion was that the team had two of five axes versioned and three unversioned. The unversioned axes had been fine for two years; the moment they mattered, they were missing. The platform team had been planning to add data snapshot enforcement 'when there was time'; the cost of adding it was estimated at one engineer-week. The cost of not having it was several weeks of investigation, a quality regression, and a model the team could not confidently characterize.

What they should have said

Two years earlier, when the platform was being built: 'Five axes need to be enforced for shipped models: data, code, environment, randomness, intent. The pipeline must refuse to train on non-snapshotted data and must call deterministic_seeds() automatically. Two of these are free; three cost a small amount of throughput and storage. The total platform investment is one engineer-week. The alternative is that every model team independently learns this lesson, expensively.' That conversation, with the platform team and ML leadership, would have produced a different platform. The fact that the team had been operating successfully without it for two years was not evidence that it wasn't needed — it was evidence that the failure mode hadn't yet been triggered.

Lesson

Reproducibility is a system property that is invisible until it matters and irreplaceable when it does. The 5-Axis Model lets you make the right investment proactively and tier the enforcement by use case. The wrong move is to defer the investment because 'we haven't hit the issue yet' — the issue is always two iterations away, and the investment is small relative to the cost of being unable to compare a problematic model against a known-good baseline.