ML Systems

Fraud / Risk Detection Pipeline

Score transactions or actions in real time for fraud risk with a feedback loop from human review.

Scale to anchor on

Hundreds of millions of events per day, p99 scoring < 100 ms inline, human review queue of tens of thousands per day.

Requirements

Functional

  • Score transactions / listings / users in real time.
  • Apply rule-based and ML-based scoring.
  • Send borderline cases to human review.
  • Feed reviewed labels back into training.

Non-functional

  • Low false-positive rate (cost = lost legitimate revenue).
  • Auditable decisions for regulatory and dispute review.
  • Resilient to adversarial drift.

High-level architecture

An online feature store supplies real-time features to a model server, which scores each event inline. A rules engine layers operator-defined logic on top of the model score. Borderline scores route to a human review queue. Reviewer labels flow into a labeling pipeline, and periodic retraining updates the model.

Components

Online feature store
Real-time per-entity features (velocity, history, network signals).
Model server
Inline scoring with strict latency budgets.
Rules engine
Operator-authored rules for known patterns and hard policy.
Review queue
Human reviewers handle borderline cases; labels return to training.
Labeling pipeline
Combines reviewer labels with delayed ground truth (chargebacks).

Key decisions

Hybrid rules + ML.
Pure ML misses sharp policy lines; pure rules miss subtle patterns. Hybrid is the operational reality.
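
One way to picture the hybrid is a single decision function in which operator rules take precedence and the model score handles everything else. The sketch below is illustrative only: the Decision values, the Rule type, the thresholds (0.30 and 0.90), and the "sanctioned_country" rule are all hypothetical names and numbers, not part of any specific product.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Decision(Enum):
    APPROVE = "approve"
    REVIEW = "review"
    BLOCK = "block"


@dataclass(frozen=True)
class Rule:
    name: str
    predicate: Callable[[dict], bool]  # returns True when the rule fires
    decision: Decision                 # hard outcome applied when it fires


def decide(event: dict, model_score: float, rules: list[Rule],
           review_low: float = 0.30,
           block_high: float = 0.90) -> tuple[Decision, str]:
    """Return (decision, reason). Rules are checked first, then the score."""
    for rule in rules:
        if rule.predicate(event):
            return rule.decision, f"rule:{rule.name}"
    if model_score >= block_high:
        return Decision.BLOCK, "model:high_score"
    if model_score >= review_low:
        return Decision.REVIEW, "model:borderline"
    return Decision.APPROVE, "model:low_score"


# Hypothetical hard-policy rule; "XX" is a placeholder country code.
rules = [
    Rule("sanctioned_country", lambda e: e.get("country") == "XX", Decision.BLOCK),
]
```

Returning a reason string alongside the decision keeps the rule path and the model path distinguishable later, which matters for audit and for the incident-response override channel.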
Audit every decision.
Disputes, regulatory inquiries, and post-incident review all need it.
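
As a rough illustration of what "audit every decision" might capture, the sketch below records the inputs and outcome of one decision and derives a checksum from a stable serialization. The field names are assumptions chosen for the example; a real schema would follow the applicable regulatory and dispute requirements.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AuditRecord:
    event_id: str
    decided_at: str      # ISO 8601 UTC timestamp
    decision: str
    reason: str
    model_version: str
    model_score: float
    features: dict       # feature values exactly as seen at scoring time
    rules_fired: tuple   # names of rules that fired, possibly empty

    def to_json(self) -> str:
        # sort_keys gives a stable serialization so the checksum is reproducible.
        return json.dumps(asdict(self), sort_keys=True)

    def checksum(self) -> str:
        # SHA-256 over the serialized record, usable for tamper detection.
        return hashlib.sha256(self.to_json().encode("utf-8")).hexdigest()
```

Storing the feature values and model version at decision time, rather than re-deriving them later, is what lets a dispute reviewer reconstruct why a given score was produced.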
Distinguish false-positive cost from false-negative cost.
These can differ by orders of magnitude in payments; the operating point (the score threshold at which an event is blocked or reviewed) must reflect that.
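
A minimal sketch of how asymmetric costs translate into a threshold, assuming the model outputs a calibrated fraud probability p and that costs are fixed per event. Blocking costs (1 - p) * cost_fp in expectation and approving costs p * cost_fn, so blocking is cheaper when p > cost_fp / (cost_fp + cost_fn). The function name and the cost figures are hypothetical.

```python
def block_threshold(cost_fp: float, cost_fn: float) -> float:
    """Probability above which blocking has lower expected cost than approving.

    cost_fp: cost of blocking a legitimate event (false positive).
    cost_fn: cost of approving a fraudulent event (false negative).
    Assumes the model score is a calibrated probability of fraud.
    """
    if cost_fp <= 0 or cost_fn <= 0:
        raise ValueError("costs must be positive")
    return cost_fp / (cost_fp + cost_fn)


# Illustrative numbers only: a false negative costs 19x a false positive.
threshold = block_threshold(cost_fp=5.0, cost_fn=95.0)
print(f"block when p > {threshold:.2f}")
```

In practice costs often vary with transaction amount, so the threshold may be computed per event rather than fixed globally.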
Delayed ground truth handling.
Real fraud labels arrive weeks later (chargebacks); training must respect that delay to avoid leakage.
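
One common way to respect the delay is a label maturity window: an event becomes a negative training example only after enough time has passed for a chargeback to have arrived, and a chargeback counts only if it was known by the training cutoff. The sketch below assumes a 60-day window and the field names event_time and chargeback_at; both are illustrative choices, not values from the source.

```python
from datetime import datetime, timedelta, timezone

MATURITY = timedelta(days=60)  # assumed chargeback window, tune per network


def training_rows(events, as_of, maturity=MATURITY):
    """Return (event_id, label) pairs whose label is resolved as of `as_of`."""
    rows = []
    for e in events:
        chargeback_at = e.get("chargeback_at")
        if chargeback_at is not None and chargeback_at <= as_of:
            rows.append((e["event_id"], 1))   # fraud confirmed before cutoff
        elif e["event_time"] + maturity <= as_of:
            rows.append((e["event_id"], 0))   # window elapsed, no chargeback
        # Otherwise the label is still immature, so the event is excluded.
    return rows


def utc(y, m, d):
    return datetime(y, m, d, tzinfo=timezone.utc)


events = [
    {"event_id": "a", "event_time": utc(2024, 1, 1), "chargeback_at": None},
    {"event_id": "b", "event_time": utc(2024, 5, 20), "chargeback_at": utc(2024, 5, 28)},
    {"event_id": "c", "event_time": utc(2024, 5, 25), "chargeback_at": None},
    {"event_id": "d", "event_time": utc(2024, 5, 20), "chargeback_at": utc(2024, 6, 10)},
]
rows = training_rows(events, as_of=utc(2024, 6, 1))
```

Here events "c" and "d" are dropped: "c" has not matured, and the chargeback on "d" was not yet known at the cutoff. The same point-in-time discipline applies to features, which should be computed as of the event time.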

Pitfalls

  • Training on reviewer labels without counterfactual handling: reviewers only see the cases the system routed to them, so the labels are selection-biased.
  • No rule-based override channel for incident response.
  • Operating point unchanged as adversaries adapt.
  • Ignoring feature freshness: stale features make the model decay silently.
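
The freshness pitfall can be guarded against with a simple staleness check at scoring time. The sketch below assumes each feature carries a last-updated Unix timestamp and that operators define a maximum age per feature; the function name and feature names are hypothetical.

```python
import time
from typing import Optional


def stale_features(updated_at: dict[str, float],
                   max_age_s: dict[str, float],
                   now: Optional[float] = None) -> list[str]:
    """Return the names of features that are missing or older than allowed.

    updated_at: feature name -> Unix timestamp of its last update.
    max_age_s:  feature name -> maximum tolerated age in seconds.
    """
    now = time.time() if now is None else now
    stale = []
    for name, limit in max_age_s.items():
        ts = updated_at.get(name)
        if ts is None or now - ts > limit:
            stale.append(name)
    return sorted(stale)


# Illustrative feature names and limits.
limits = {"txn_count_1h": 60.0, "device_age_days": 86400.0, "ip_risk": 300.0}
seen = {"txn_count_1h": 990.0, "device_age_days": 100.0}
```

Emitting the stale list as a metric, or falling back to rules when critical features are stale, turns silent decay into a visible signal.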

Follow-up questions

  • How do you handle a new fraud pattern that the model has never seen?
  • How do you balance reviewer load against latency to decision?
  • What's the feedback loop between review and the model?
