Module 4 · Lesson 3 · Advanced · 40 min

Fraud Detection (Stripe-scale, <50 ms decision): Simulated Interview

Inline risk scoring at the payment-auth latency budget. Feature freshness, model size constraints, and the rules-vs-ML composition pattern that always comes up. The Inline Risk Budget is the framework that converts 'design fraud detection' from a tour of ML techniques into a budget-driven architectural decomposition.

Fraud-detection interviews are deceptively simple. The candidate hears 'design real-time fraud scoring' and reaches for a familiar shape: collect transaction features, train an XGBoost model, deploy with a feature store, monitor for drift. Every individual move is right. The whole answer is wrong because it does not acknowledge the constraint that defines the system: the entire risk decision must complete inline with payment authorization, with a hard <50 ms budget that the payment processor will not negotiate. That single number — 50 ms — writes the architecture. Candidates who derive the architecture from the budget produce systems that work; candidates who design the model first and then try to fit it inside the budget produce systems that miss the SLA in production.

The Inline Risk Budget is the framework that forces the budget-first derivation. Four components — feature lookup, model inference, rules evaluation, decision logging — each with a sub-budget, each with a forced architectural commitment. Plus the hidden fifth component: the async escalation path for the cases the inline budget cannot afford. Naming all five and deriving the architecture from them is the structural opening to any inline-risk interview.

Framework

The Inline Risk Budget

Real-time fraud-scoring systems have one defining constraint: the entire risk decision happens inline with payment authorization, with a hard <50 ms budget that no other system in the request path negotiates around. The Inline Risk Budget framework names the four budget components — feature lookup, model inference, rules evaluation, decision logging — and the forced architectural choices each one implies. Candidates who don't decompose the budget end up proposing rich models that violate the SLA. The structural answer is to derive the architecture from the budget, not the other way around.

Component 1 — Feature lookup (12 ms)
Per-transaction features for the user, merchant, card, device, geo, and behavioral signals. At inline-payment SLA, these must be co-located with the model serving, not in a separate service. The single biggest architectural commitment: the feature store is in-process or a sidecar, not an RPC away. A network round trip for features typically costs 10-30 ms, which can consume most of the 50 ms budget on its own.
Component 2 — Model inference (18 ms)
The fraud scoring model itself. At 18 ms budget on tabular features, the model is small — gradient-boosted trees, small MLPs, or distilled deep models. Frontier-scale models don't fit. The architectural commitment that follows: the team operates a model-distillation pipeline because the production model must always be small enough for the budget, regardless of what the research team trains.
Component 3 — Rules evaluation (10 ms)
Hard-policy rules that operate alongside the ML score: known-bad lists, regulatory hard-blocks, merchant-specific overrides, operator manual interventions. Rules run in parallel with the model and combine via an explicit policy engine. The architectural commitment: rules are first-class infrastructure with their own deploy pipeline, not strings in a config file.
Component 4 — Decision logging (10 ms)
Every decision is logged with the full feature snapshot, model version, rule outcomes, and final decision. This is the training data for the next model and the audit trail for disputes. At inline SLA, logging must be async (fire-and-forget to Kafka), not blocking. The architectural commitment: decision logs are the ground truth source of training data, not a side effect.
The hidden fifth component — Async escalation
Inline decisions are necessarily fast and necessarily limited in depth. Borderline cases (high-risk but not auto-block) route to a separate async path with a richer model and human-in-the-loop review. The inline budget protects the SLA; the async path captures the accuracy that the inline budget cannot afford. Designing only the inline path is the canonical Senior failure.
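The rules-plus-model composition described in Component 3 can be sketched as a small policy function. This is a minimal illustration, not a prescribed design: the `Decision` values, the `RuleOutcome` fields, the precedence order (hard block, then allow override, then score), and both thresholds are assumptions introduced here.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Sequence


class Decision(Enum):
    APPROVE = "approve"
    BLOCK = "block"
    ESCALATE = "escalate"  # borderline: hand off to the async escalation path


@dataclass(frozen=True)
class RuleOutcome:
    name: str
    hard_block: bool = False   # e.g. known-bad list or regulatory hard-block
    force_allow: bool = False  # e.g. merchant override or operator intervention


def combine(score: float, rules: Sequence[RuleOutcome],
            block_at: float = 0.90, escalate_at: float = 0.60) -> Decision:
    """Combine the model score with rule outcomes under an explicit precedence.

    Assumed precedence: hard blocks win, then allow overrides, then the score.
    The thresholds are hypothetical placeholders.
    """
    if any(r.hard_block for r in rules):
        return Decision.BLOCK
    if any(r.force_allow for r in rules):
        return Decision.APPROVE
    if score >= block_at:
        return Decision.BLOCK
    if score >= escalate_at:
        return Decision.ESCALATE
    return Decision.APPROVE
```

Keeping the precedence in one function makes the policy reviewable and testable on its own, which is one way to treat rules as first-class infrastructure rather than scattered conditionals.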

When to use

Apply the framework to any fraud, abuse, or real-time risk-scoring interview prompt. The framework also works for any inline-decision problem at sub-100 ms SLA (ad eligibility, content policy at post-time, transaction approval).

Worked example

Senior answer to 'design Stripe-scale fraud detection': 'Train a deep model on transaction data, deploy with caching.' Staff answer: '50 ms total inline. Feature lookup 12 ms: co-located with serving, in-process or sidecar, not RPC. Model inference 18 ms: distilled small model; the research team's big model is the distillation teacher but doesn't serve inline. Rules 10 ms: separate policy engine, runs in parallel with the model. Logging 10 ms, async to Kafka. Plus an async escalation path for borderline cases with a richer model and human review. The inline path catches obvious fraud; the async path catches the cases the inline model can't. Without the escalation path, the inline model's accuracy ceiling becomes the ceiling for the whole system.'

Calibration ladder

Your research team trained a 200M-parameter neural model that beats your current gradient-boosted model by 4% on fraud catch rate. How do you ship it?

The interviewer wants to see whether you understand that the inline budget rules out serving this model inline at all, and whether you can say where the model belongs instead.

L4 · Mid

Deploy it with GPU inference, optimize for latency, add caching for repeated features.

Missed: Treated the model size as a serving-optimization problem. Serving optimizations such as GPU inference, caching, or quantization don't make a 200M model fit in 18 ms on tabular data.

L5 · Senior

Quantize it to int8, possibly distill to a smaller variant, deploy with continuous batching. Verify p99 latency is under 50 ms in shadow before rolling out.

Missed: Knew about distillation but didn't articulate the two-model architecture. Will propose distillation as a fallback when it should be the default.

L6 · Staff

Won't fit inline. A 200M-parameter model on tabular features at p99 <18 ms is roughly impossible regardless of quantization. The right path: train a distilled student model (1-5M parameters, GBT-class size) on the teacher's outputs as soft labels, plus the original ground truth. Distilled student deployed inline; teacher used for offline labeling refinement, async escalation scoring, and training-data quality improvements. The 4% catch-rate improvement from the teacher mostly transfers to the student via distillation — typically you recover 70-90% of the gain.

Missed: Strong technical answer. Missing the meta-move — naming the two-model architecture as the standing pattern and the cultural commitment around teacher-student separation.

L7 · Principal

Same distillation answer with the meta-acknowledgment that the question is testing whether the candidate understands the structural constraint. The 200M model is not deployable inline; that's not a tuning problem, it's an architectural fact. The right answer is the two-model architecture: teacher trains, distilled student serves inline, teacher serves the async escalation path for borderline cases where the extra accuracy actually pays for the extra latency. This is also the architecture that lets the team adopt research advances continuously — every research win produces a teacher update, which produces a re-distilled student. The pattern: at inline SLA, the model is always a distilled student; the question 'can we use the bigger model' has one answer (no inline, yes async escalation), and recognizing that without being told is the L7 signal. The candidate also names the cultural commitment: the research team has to accept that their model will be distilled before it ships, which is a real organizational change that's easier to negotiate at design time than at deploy time.

What scored L7

Named the two-model architecture (distilled student inline, teacher for async + training-data refinement) as the standing pattern at inline SLA, and named the cultural commitment around teacher-student separation. The L7 move is recognizing the constraint is architectural, not tuning, and that the organizational change is part of the design.
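The distillation step in the Staff and Principal answers (teacher outputs as soft labels, plus the original ground truth) can be sketched as a blended per-example loss. This is a minimal illustration under assumptions: the blend weight `alpha` and the function names are hypothetical, and a real pipeline would express this as the training objective of whichever student library is used.

```python
import math


def bce(target: float, pred: float, eps: float = 1e-7) -> float:
    """Binary cross-entropy of a predicted probability against a target in [0, 1]."""
    p = min(max(pred, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def distillation_loss(student_p: float, teacher_p: float, label: int,
                      alpha: float = 0.7) -> float:
    """Blend the soft-label term (teacher) with the hard-label term (ground truth).

    alpha is an assumed weight on the teacher term; 1 - alpha goes to the label.
    """
    soft = bce(teacher_p, student_p)
    hard = bce(float(label), student_p)
    return alpha * soft + (1.0 - alpha) * hard


# A student that tracks the teacher on a fraudulent example scores a lower loss
# than one that disagrees with both the teacher and the label.
close = distillation_loss(student_p=0.9, teacher_p=0.9, label=1)
far = distillation_loss(student_p=0.2, teacher_p=0.9, label=1)
```

The soft-label term is what carries the teacher's ranking of ambiguous cases into the student; the hard-label term keeps the student anchored to confirmed outcomes.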

Architecture

Stripe-scale inline fraud detection. The design is the four budget components on the inline path, plus the async escalation path beside it. Notice the feature store is co-located with serving (sidecar), the model is the distilled student, and the teacher lives in the async path.

Payment service · inline risk call

“Hard <50 ms p99 SLA; payment auth blocks on this.”

Risk service · orchestrates the four budget components

“Owns the 50 ms budget. Parallelizes feature lookup, rules, and model inference where possible.”

Feature store sidecar · in-process / unix socket

“Network RPC for features costs 10-30 ms; sidecar is the architectural commitment that fits the budget.”

Distilled student model (~1-5M params)

“GBT or small MLP. The teacher does not serve inline, ever.”

Rules engine · parallel with model

“Hard policy filters (known-bad, regulatory, merchant overrides). Runs in parallel with model; results combine in policy engine.”

Decision log → Kafka (async)

“Every decision logged with full feature snapshot, rule outcomes, model score. Fire-and-forget; does not block.”

Async escalation path

“Borderline scores route here. Richer teacher model + human review for the cases inline cannot afford.”

Teacher model (~200M params) · offline + async

“Trains the student via distillation. Also serves the async escalation path. Never inline.”

Training pipeline · teacher → student distillation

“Continuous distillation: teacher updates produce new student. The cultural commitment that lets research adopt continuously.”

Per-component observability

“Per-budget-component latency tracked separately; per-class accuracy; rule-firing rates.”

payment → risk-svc · inline auth

risk-svc → feature-sidecar · in-process lookup

risk-svc → model-inline · score

risk-svc → rules · policy check

risk-svc → decision-log · async log

decision-log → async-path · borderline cases

async-path → teacher · richer scoring

teacher → training · distillation

training → model-inline · new student
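The inline call flow above can be sketched with `asyncio`: look up features, run the student model and the rules concurrently against the same snapshot, decide, then log without blocking. Everything below is a hedged sketch. The stand-in functions, the thresholds, and the list used as a log sink are assumptions; a real system would call the sidecar, the model runtime, and a Kafka producer in their place, and would define a fallback for timeouts, which this sketch simply lets propagate.

```python
import asyncio
import time

# Sub-budgets in seconds, using the lesson's numbers.
FEATURE_BUDGET_S = 0.012
MODEL_BUDGET_S = 0.018
RULES_BUDGET_S = 0.010

# Keep references to fire-and-forget tasks so they are not garbage collected.
_background_tasks = set()


async def lookup_features(txn: dict) -> dict:
    # Hypothetical stand-in for the co-located feature sidecar.
    return {"amount": txn["amount"], "card_age_days": 400}


async def score_student(features: dict) -> float:
    # Hypothetical stand-in for the distilled student model.
    return 0.05 if features["amount"] < 1000 else 0.75


async def evaluate_rules(features: dict) -> bool:
    # Hypothetical stand-in for the rules engine; True means a hard block fired.
    return features["card_age_days"] < 1


async def write_decision_log(record: dict, sink: list) -> None:
    # Hypothetical stand-in for an async write to a durable queue.
    sink.append(record)


async def decide(txn: dict, sink: list) -> str:
    features = await asyncio.wait_for(lookup_features(txn), FEATURE_BUDGET_S)
    # Model and rules both read the same feature snapshot, so they run concurrently.
    score, hard_block = await asyncio.gather(
        asyncio.wait_for(score_student(features), MODEL_BUDGET_S),
        asyncio.wait_for(evaluate_rules(features), RULES_BUDGET_S),
    )
    if hard_block or score >= 0.90:      # assumed block threshold
        decision = "block"
    elif score >= 0.60:                  # assumed borderline threshold
        decision = "escalate"
    else:
        decision = "approve"
    record = {"txn": txn, "features": features, "score": score,
              "hard_block": hard_block, "decision": decision, "ts": time.time()}
    # Fire-and-forget: schedule the log write and return without awaiting it.
    task = asyncio.create_task(write_decision_log(record, sink))
    _background_tasks.add(task)
    task.add_done_callback(_background_tasks.discard)
    return decision
```

Each stage carries its own timeout, so a budget overrun surfaces as an `asyncio.TimeoutError` attributed to a specific component rather than as an unexplained slow request.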

Latency anatomy · budget 50 ms

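Inline budget tree. Feature lookup is the binding constraint; everything else is sized to fit. Notice the model and rules run in parallel, so the critical path is 5 + 12 + 18 + 5 = 40 ms. The candidate who serializes them spends 28 ms instead of 18 ms on that stage and consumes the full 50 ms with no headroom.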

Network + auth into risk-svc · 5 ms

Sidecar pattern; tight.

Feature lookup (in-process) · 12 ms

Hot features in memory, cold in local SSD cache. Network call for any feature is budget-busting.

Model inference (parallel with rules) · 18 ms

Distilled student on tabular features. GBT-class latency.

Rules evaluation (parallel with model) · 10 ms

Policy engine over feature snapshot. Combines with model via policy decision.

Decision + async log + response · 5 ms

Decision returned to payment service; log writes to Kafka fire-and-forget.
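The budget tree can be checked with a few lines of arithmetic. The stage values are the lesson's own; the dictionary keys and the function name are illustrative.

```python
SLA_MS = 50

# Stage budgets from the latency anatomy above, in milliseconds.
STAGES_MS = {
    "network_auth": 5,
    "feature_lookup": 12,
    "model_inference": 18,
    "rules_evaluation": 10,
    "decision_log_response": 5,
}


def critical_path_ms(stages: dict, parallel_model_rules: bool) -> int:
    """Sum the sequential stages; model and rules overlap only when run in parallel."""
    pair = [stages["model_inference"], stages["rules_evaluation"]]
    middle = max(pair) if parallel_model_rules else sum(pair)
    return (stages["network_auth"] + stages["feature_lookup"]
            + middle + stages["decision_log_response"])


parallel_ms = critical_path_ms(STAGES_MS, parallel_model_rules=True)     # 40
serialized_ms = critical_path_ms(STAGES_MS, parallel_model_rules=False)  # 50
headroom_ms = SLA_MS - parallel_ms                                       # 10
```

Running the model and rules in parallel leaves 10 ms of headroom under the 50 ms SLA; serializing them spends all of it.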

Pattern recognition

When you see

Anyone proposes a frontier-class model for an inline payment-decision workload.

→

Think

The model must be distilled before it serves inline. The architecture is teacher + student; the question is not 'which model' but 'how aggressive is the distillation.'

Inline risk SLAs are non-negotiable because they sit in the payment-auth path that all customer revenue flows through. Any model that violates the SLA is by definition not deployable inline. Candidates who suggest the big model are not testing the SLA; they're hoping it doesn't bind. It binds. The teacher-student pattern is the architectural answer that lets the team adopt research wins without compromising the SLA.

Drill · 12 minutes

Practice this. Time yourself.

You have 12 minutes. A team has been running inline fraud detection on a small XGBoost model. Their false-positive rate is at the limit of what merchants will tolerate (0.3%), and the research team has a new neural model that would reduce FP to 0.15% but takes 120 ms per inference. The inline SLA is 50 ms. Walk through how you'd actually ship the FP improvement. Write a 4-paragraph response: (1) why the obvious answer (replace XGBoost with neural inline) doesn't work, (2) the two-model architecture that does, (3) the rollout plan, (4) the cultural commitment the research team needs to make.

Self-assessment rubric

Dimension	Weak	Passing	Strong	Staff bar
Why obvious answer fails	'It's too slow.'	Specific: 120 ms vs 18 ms budget; quantization won't bridge the gap.	Same plus: even if the model fits, feature lookup wouldn't, because the neural model probably needs richer features that aren't co-located.	Named that the inline SLA is non-negotiable because it sits in the payment-auth path, the gap is 7x not 2x, and optimization doesn't bridge 7x.
Two-model architecture	Suggested distillation without architecture.	Distilled student inline; teacher for batch.	Distilled student inline; teacher for async escalation on borderline cases; teacher's full benefit on the cases that route to async, partial benefit (distilled) on inline.	Same plus: explicit routing policy — borderline = score within X of the inline decision threshold; the bandwidth of async review determines X.
Rollout plan	Said 'shadow then canary.'	Shadow → canary → A/B, per the Rollout Ladder from Lesson 3.3.	Same plus: separate rollout for the inline distilled student vs the async teacher; they have different risk profiles.	Same plus: the async escalation path needs human-review capacity; the rollout has to ramp escalation volume in step with reviewer hiring/training.
Cultural commitment	Did not name.	Said 'research has to accept distillation.'	Named that research's metric becomes 'teacher quality + distillation quality' as a combined system, not 'teacher quality' alone.	Same plus: the distillation pipeline is co-owned. Research can't ship without ML platform support; platform can't update student without research approval. The org structure has to support the architecture.

Model solution

Why the obvious answer fails. The 120 ms neural model cannot serve inline at the 50 ms SLA. Quantization typically yields ~2x latency improvement; pruning another 1.5-2x; together at most 3-4x. The gap is 7x (120 ms against the 18 ms model-inference budget). Optimization doesn't bridge 7x. The inline SLA is set by the payment processor and cannot be negotiated because it sits in the auth path that all customer revenue flows through. Worse, the neural model probably needs richer features (sequential transaction history, deeper user context) that aren't economically co-located with serving, so even if the model fit, the feature lookup wouldn't. The 'we'll optimize it' path is a multi-month investment with a high probability of not reaching the SLA, during which the team ships nothing.

Two-model architecture. Distilled student (GBT-class, 1-5M params) serves inline at the 18 ms budget. Teacher neural model (200M class) serves two roles: (a) the async escalation path for borderline cases, where transactions whose inline score falls within a defined band around the decision threshold get re-scored by the teacher with richer features, with a 30-second latency budget (acceptable because the payment is already authorized but flagged for review), and (b) offline batch scoring of historical transactions to generate the distillation training data for the student. The student inherits most of the teacher's accuracy on the inline-decidable cases; the teacher catches the cases the student can't. Expected FP improvement: 0.3% → ~0.20% on inline alone, → ~0.15% with the escalation path catching the borderline cases the student over-flagged.

Rollout plan. Two parallel rollouts. The inline distilled student rolls through the full Quality-Aware Rollout Ladder from Lesson 3.3: shadow, canary, A/B against the current XGBoost, ramped with holdback. ~3 weeks. The async escalation path is a new system; it rolls out separately starting with a low fraction of borderline cases (~10%) to validate the teacher's accuracy and the human-review capacity, then ramps as reviewers scale. The two rollouts are intentionally decoupled so a regression in one doesn't block the other. Total timeline: ~6-8 weeks to full deployment.

Cultural commitment. Research has to accept that their model's success is measured as 'teacher + student distillation quality' as a combined system. A teacher that's 10% better but doesn't distill well (because it relies on features the student can't fit) is worse than a teacher that's 5% better with clean distillation. The research roadmap needs to include distillation-friendliness as a constraint, not an afterthought. Organizationally, the distillation pipeline is co-owned by research and ML platform: research can't ship a model without platform support for distillation; platform can't update the student without research signoff. This is a real org-design commitment that needs to happen at design time, before the first model is built. Getting this in writing prevents the canonical fight where research ships a teacher that platform can't distill and both teams blame each other.
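The "optimization doesn't bridge the gap" argument from the first paragraph of the model solution can be checked numerically. The sketch below uses the solution's own figures (120 ms, ~2x from quantization, 1.5-2x from pruning, 18 ms model budget) and makes the optimistic assumption that the speedups multiply cleanly; the function name is illustrative.

```python
MODEL_BUDGET_MS = 18.0
NEURAL_LATENCY_MS = 120.0


def optimized_latency_ms(base_ms: float, speedups: list) -> float:
    """Latency after applying each speedup factor, assuming they compound fully."""
    latency = base_ms
    for factor in speedups:
        latency /= factor
    return latency


required_speedup = NEURAL_LATENCY_MS / MODEL_BUDGET_MS              # about 6.7x
best_case_ms = optimized_latency_ms(NEURAL_LATENCY_MS, [2.0, 2.0])  # 30 ms
likely_case_ms = optimized_latency_ms(NEURAL_LATENCY_MS, [2.0, 1.5])  # 40 ms
still_over_budget = best_case_ms > MODEL_BUDGET_MS                  # True
```

Even with both optimizations at the top of their quoted ranges, the model lands at 30 ms against an 18 ms budget, which is why the solution moves to the two-model architecture instead.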

Common failures

✗ Suggested replacing XGBoost inline with the neural model and 'optimizing' it. Optimization doesn't bridge 7x.
✗ Proposed only the distilled student without naming the async escalation path. Misses the structural answer for the cases the student can't catch.
✗ Did not address human-review capacity scaling. Async escalation paths fail when the reviewer pool can't keep up.
✗ Did not name the cultural commitment. The architecture works only if research accepts distillation as a constraint.

Artifact · checklist

The Inline Risk Budget Worksheet

Step 1 — Confirm the SLA is non-negotiable

☐ Where does this latency budget come from? (Payment processor, customer SLA, downstream system.)
☐ Is it actually non-negotiable, or is the team treating it as such by default? Ask.
☐ If non-negotiable, the budget writes the architecture. Proceed.

Step 2 — Allocate the four budget components

☐ Feature lookup: ____ ms (typical 10-15 at <50 ms SLA)
☐ Model inference: ____ ms (typical 15-20)
☐ Rules evaluation: ____ ms (typical 8-12, parallel with model)
☐ Decision logging: ____ ms (typical 5-10, async)
☐ Total: ____ ms ≤ SLA

Step 3 — Forced architectural commitments

☐ Feature store co-located with serving (sidecar / in-process). RPC features are budget-busting.
☐ Model is distilled student. Frontier models do not serve inline.
☐ Rules are first-class infrastructure with their own deploy pipeline.
☐ Logging is async to a durable queue. Blocking log is a budget violation.

Step 4 — Design the async escalation path

☐ Define the borderline band around the decision threshold.
☐ Route borderline cases to the async path (richer model + human review).
☐ Async latency budget: tens of seconds, not milliseconds.
☐ Human review capacity scales with the borderline volume.
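One way to size the borderline band in Step 4 is to work backward from review capacity: pick the band width so that the share of traffic escalated stays within what reviewers can absorb. The sketch below is an illustration under assumptions. The function names, the use of a sample of recent inline scores, and expressing capacity as a fraction of traffic are all hypothetical choices.

```python
def band_half_width(scores: list, threshold: float,
                    max_escalation_rate: float) -> float:
    """Return X such that at most max_escalation_rate of scores lie strictly within X
    of the threshold.

    X is the distance of the first score that must NOT be escalated, so ties at
    the boundary are excluded rather than pushing volume over capacity.
    """
    if not scores:
        return 0.0
    distances = sorted(abs(s - threshold) for s in scores)
    allowed = int(max_escalation_rate * len(scores))  # cases reviewers can absorb
    if allowed >= len(distances):
        return float("inf")  # capacity covers every case in the sample
    return distances[allowed]


def is_borderline(score: float, threshold: float, half_width: float) -> bool:
    """Escalate only scores strictly inside the band around the threshold."""
    return abs(score - threshold) < half_width


# Hypothetical sample of recent inline scores and a 0.60 decision threshold.
sample = [0.10, 0.40, 0.55, 0.62, 0.70, 0.95]
x = band_half_width(sample, threshold=0.60, max_escalation_rate=0.5)
escalated = [s for s in sample if is_borderline(s, 0.60, x)]
```

Recomputing the band as reviewer capacity grows gives a concrete mechanism for ramping escalation volume in step with the review team.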

Step 5 — Two-model org commitment

☐ Research roadmap includes distillation-friendliness as a constraint.
☐ Distillation pipeline co-owned by research and ML platform.
☐ Teacher updates trigger student re-distillation as a standing process.

Post-mortem · anonymized

Setup

Payments company, fraud team. Spent six months building a 'big new neural model' (≈300M params) intended to replace the existing GBT inline scorer. Internal benchmark showed 6% FP improvement. The team focused entirely on the model and minimally on serving.

What happened

When the model was ready to ship, the team discovered p99 latency at 180 ms — well outside the 50 ms inline budget. They spent another three months on quantization and inference optimization, getting it to 95 ms — still well outside the budget. The CTO eventually killed the project. The team had ignored the inline budget for six months, then tried to bridge the gap for three more.

The moment

Month seven, an engineer who had recently joined from a peer fintech asked, 'Why aren't we distilling? The teacher is great; the inline serve doesn't need to be the teacher.' The team had not considered distillation because the original framing was 'replace the model.' Replacing the model was the wrong framing. The right framing was always 'use the better model where it fits, distill it where it doesn't.' The two-model architecture was sitting unrecognized for nine months.

What they should have said

At project kickoff: 'The 300M model will not serve inline at our SLA. Architecturally, we should plan for it to be a teacher — it serves the async escalation path with full accuracy, and a distilled student serves inline. The inline target is a 1-5M parameter distilled student that recovers most of the teacher's accuracy via distillation. We measure two things: teacher accuracy on the async path, and student accuracy after distillation. The project is shipping the two-model architecture, not replacing the inline model.' That framing at month zero would have produced a shipped system at month four instead of a killed project at month nine.

Lesson

Inline risk systems are budget-derived. The latency budget forces a model class, which forces an architectural pattern (teacher + student), which forces an organizational commitment (research roadmap includes distillation-friendliness). Skipping any link in the chain produces months of misallocated work. The framework's job is to keep the chain visible from the first design conversation.