ML Systems

Fraud / Risk Detection Pipeline

Score transactions or actions in real time for fraud risk with a feedback loop from human review.

Scale to anchor on

Hundreds of millions of events per day (thousands per second on average), p99 inline scoring latency under 100 ms, and a human review queue of tens of thousands of cases per day.

Requirements

Functional

Score transactions / listings / users in real time.
Apply rule-based and ML-based scoring.
Send borderline cases to human review.
Feed reviewed labels back into training.

Non-functional

Low false-positive rate (cost = lost legitimate revenue).
Auditable decisions for regulatory and dispute review.
Resilient to adversarial drift.

High-level architecture

An online feature store provides real-time features, and a model server scores each event inline. A rules engine adds operator-defined logic. Borderline scores route to a human review queue. Reviewer labels feed a labeling pipeline, and periodic retraining updates the model.
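
As a rough illustration of the routing step in this flow, the sketch below maps a model score and an optional rule verdict to an outcome. The two thresholds, the name `route`, and the choice to block very high scores outright are assumptions made for the example, not values from this design.

```python
from typing import Optional

# Hypothetical operating points; real values would come from cost analysis.
REVIEW_THRESHOLD = 0.30
BLOCK_THRESHOLD = 0.90


def route(model_score: float, rule_verdict: Optional[str] = None) -> str:
    """Return "allow", "review", or "block" for one event."""
    # An operator rule that fired takes precedence over the model.
    if rule_verdict is not None:
        return rule_verdict
    if model_score >= BLOCK_THRESHOLD:
        return "block"
    if model_score >= REVIEW_THRESHOLD:
        return "review"  # borderline: send to the human review queue
    return "allow"
```

Keeping the routing logic this small makes it easy to log the exact inputs that produced each outcome.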

Components

Online feature store

Real-time per-entity features (velocity, history, network signals).
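
A velocity feature is typically a count of events per entity over a trailing window. The in-memory sketch below shows the idea only; `VelocityCounter` is a hypothetical name, and a production store would keep this state in a shared low-latency system rather than in process memory.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class VelocityCounter:
    """Counts events per entity within a trailing time window."""

    def __init__(self, window_seconds: float) -> None:
        self.window_seconds = window_seconds
        self._events: Dict[str, Deque[float]] = defaultdict(deque)

    def record(self, entity_id: str, timestamp: float) -> None:
        # Assumes timestamps arrive in non-decreasing order per entity.
        self._events[entity_id].append(timestamp)

    def count(self, entity_id: str, now: float) -> int:
        events = self._events[entity_id]
        # Drop timestamps that have aged out of the window.
        while events and events[0] <= now - self.window_seconds:
            events.popleft()
        return len(events)
```

For example, a 60-second window over card activity yields a feature such as "transactions on this card in the last minute".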

Model server

Inline scoring with strict latency budgets.
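
One way to enforce a latency budget is to stop waiting for the model after a deadline and fall back to a default score. The sketch below assumes a 50 ms model budget inside the 100 ms p99 target and a neutral fallback score; both numbers and the name `score_with_budget` are illustrative assumptions.

```python
import concurrent.futures
from typing import Callable, Dict

# Hypothetical defaults: a model budget inside the overall p99 target, and a
# neutral score used when the model does not answer in time.
MODEL_BUDGET_SECONDS = 0.05
FALLBACK_SCORE = 0.5

_executor = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def score_with_budget(
    model: Callable[[Dict[str, float]], float],
    features: Dict[str, float],
    budget_seconds: float = MODEL_BUDGET_SECONDS,
) -> float:
    """Return the model score, or FALLBACK_SCORE if the budget is exceeded."""
    future = _executor.submit(model, features)
    try:
        return future.result(timeout=budget_seconds)
    except concurrent.futures.TimeoutError:
        # The worker thread keeps running; we only stop waiting for it.
        return FALLBACK_SCORE
```

Whether a timeout should fail open (allow), fail closed (block), or route to review is a policy choice that depends on the relative error costs.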

Rules engine

Operator-authored rules for known patterns and hard policy.
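
A minimal sketch of how operator rules might be represented as data and evaluated in priority order. The `Rule` type, the field names, and the example limits are all made up for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional

Event = Dict[str, Any]


@dataclass(frozen=True)
class Rule:
    rule_id: str
    predicate: Callable[[Event], bool]
    action: str  # "block", "review", or "allow"


def first_match(rules: List[Rule], event: Event) -> Optional[Rule]:
    """Return the first rule whose predicate matches, in list order."""
    for rule in rules:
        if rule.predicate(event):
            return rule
    return None


# Illustrative rules only; field names and limits are hypothetical.
RULES = [
    Rule("blocked_country", lambda e: e.get("country") in {"XX"}, "block"),
    Rule(
        "large_first_purchase",
        lambda e: e.get("is_first_purchase") is True and e.get("amount", 0) > 1000,
        "review",
    ),
]
```

Returning the matching rule, rather than only its action, keeps the rule identifier available for the audit trail.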

Review queue

Human reviewers handle borderline cases; labels return to training.
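
When the number of borderline cases exceeds reviewer capacity, the queue needs a priority order. The sketch below ranks by expected loss (score times amount), which is one plausible choice and an assumption of this example rather than something the design prescribes.

```python
import heapq
from typing import List, NamedTuple


class Case(NamedTuple):
    case_id: str
    score: float   # model's estimated fraud probability
    amount: float  # value at risk


def pick_for_review(cases: List[Case], capacity: int) -> List[Case]:
    """Select the `capacity` cases with the highest expected loss."""
    return heapq.nlargest(capacity, cases, key=lambda c: c.score * c.amount)
```

Cases that do not fit within capacity still need a default outcome, which ties back to the operating point.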

Labeling pipeline

Combines reviewer labels with delayed ground truth (chargebacks).
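
A sketch of one possible merge policy for the two label sources. It assumes that a chargeback outcome, once known, outranks the reviewer's call; `resolve_label` and that precedence are assumptions for illustration.

```python
from typing import Optional


def resolve_label(
    reviewer_label: Optional[bool], chargeback: Optional[bool]
) -> Optional[bool]:
    """Return True for fraud, False for legitimate, None if unknown.

    `chargeback` is None when the outcome is not observable, for example
    because the event was blocked or the chargeback window is still open.
    """
    if chargeback is not None:
        return chargeback
    return reviewer_label
```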

Key decisions

Hybrid rules + ML.

Pure ML misses sharp policy lines; pure rules miss subtle patterns. Hybrid is the operational reality.

Audit every decision.

Disputes, regulatory inquiries, and post-incident reviews all depend on a record of what was decided and why.
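
A sketch of what one audit record might capture. The field names are hypothetical; the point is to store the inputs, the model version, the rules that fired, and the outcome together, with a digest of the features so later tampering or drift in stored inputs can be detected.

```python
import hashlib
import json
from typing import Any, Dict, List


def audit_record(
    event_id: str,
    features: Dict[str, Any],
    model_version: str,
    score: float,
    fired_rules: List[str],
    decision: str,
    decided_at: str,
) -> Dict[str, Any]:
    """Build a self-contained record of one scoring decision."""
    # Canonical JSON so the same features always produce the same digest.
    canonical = json.dumps(features, sort_keys=True, separators=(",", ":"))
    return {
        "event_id": event_id,
        "decided_at": decided_at,
        "model_version": model_version,
        "score": score,
        "fired_rules": fired_rules,
        "decision": decision,
        "features": features,
        "features_sha256": hashlib.sha256(canonical.encode("utf-8")).hexdigest(),
    }
```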

Distinguish false-positive cost from false-negative cost.

These differ by orders of magnitude in payments; the operating point must reflect that.
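
To make the link between costs and operating point concrete: if the score is a calibrated fraud probability p, blocking has expected cost (1 - p) * cost_fp and allowing has expected cost p * cost_fn, so blocking is cheaper when p exceeds cost_fp / (cost_fp + cost_fn). The sketch below computes that break-even point; it assumes calibration and a simple two-way block/allow choice, and the cost values in the note are hypothetical.

```python
def break_even_threshold(cost_fp: float, cost_fn: float) -> float:
    """Score above which blocking has lower expected cost than allowing.

    Assumes the score is a calibrated probability of fraud.
    """
    if cost_fp <= 0 or cost_fn <= 0:
        raise ValueError("costs must be positive")
    return cost_fp / (cost_fp + cost_fn)
```

With a hypothetical false-positive cost of 5 and false-negative cost of 95, the break-even score is 0.05. A review band between "allow" and "block" shifts this picture, since review has its own cost.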

Delayed ground truth handling.

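Real fraud labels arrive weeks later (chargebacks). Training must respect that delay: a recent event with no chargeback yet is not confirmed legitimate, and features must be computed as of decision time to avoid leakage.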
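
One way to respect the delay is to label each event using only what was known at a chosen cutoff, and to leave recent events unlabeled until a maturity window has passed. The 60-day window and the name `training_label` are assumptions for this sketch; the window would normally be chosen from the observed chargeback delay distribution.

```python
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical window; pick it from the observed chargeback delays.
MATURITY_WINDOW = timedelta(days=60)


def training_label(
    event_time: datetime,
    chargeback_time: Optional[datetime],
    as_of: datetime,
) -> Optional[bool]:
    """Label an event using only information available at `as_of`."""
    if chargeback_time is not None and chargeback_time <= as_of:
        return True   # chargeback already received: confirmed fraud
    if as_of - event_time >= MATURITY_WINDOW:
        return False  # window elapsed with no chargeback: treat as legitimate
    return None       # too recent to label; exclude from training
```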

Pitfalls

Training on biased reviewer labels without counterfactual handling.
No rule-based override channel for incident response.
Leaving the operating point unchanged as adversaries adapt.
Ignoring feature freshness, so the model decays silently as features go stale.
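
On the first pitfall, one common form of counterfactual handling is inverse-propensity weighting. The sketch below assumes the serving policy lets a small random fraction of would-be-blocked events through so their true outcomes can be observed, and that the probability of doing so is logged per event. That exploration setup, the name `ipw_weight`, and the clipping value are assumptions for illustration.

```python
def ipw_weight(propensity: float, max_weight: float = 100.0) -> float:
    """Inverse-propensity weight for one labeled training example.

    `propensity` is the probability that the serving policy let this event
    through so that its true outcome could be observed.
    """
    if not 0.0 < propensity <= 1.0:
        raise ValueError("propensity must be in (0, 1]")
    # Clip so that a few rare examples do not dominate the loss.
    return min(1.0 / propensity, max_weight)
```

The resulting values can be passed as per-example weights to a trainer that supports them. Events that were always allowed have propensity 1.0 and weight 1.0.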

Follow-up questions

How do you handle a new fraud pattern that the model has never seen?
How do you balance reviewer load against latency to decision?
What's the feedback loop between review and the model?

Related patterns

queue-decoupling, feature-flags, rate-limiting