Module 6 · Lesson 1 · Interview Craft · 32 min

The AI-Leadership Story Bank: STAR-DRIVE

Every Staff AI loop has 1-2 behavioral rounds. Most Senior candidates show up with STAR stories and score Senior because two moves are missing: naming the trade-off and reflecting on what would change. STAR-DRIVE adds both as required parts. This lesson also names the 8 story archetypes Staff AI engineers must have a STAR-DRIVE story for — and the trap of trying to cover them with three rehashed stories.

There is a recognizable shape to behavioral debriefs for Staff candidates who downlevel. The technical content of the loop is fine, the design rounds are solid, and the behavioral rounds come back with 'good stories, but felt more Senior-tier in scope.' The candidate had no idea this happened. They told the same stories they had told in every previous loop where they passed. The difference is that the previous loops were for L5 roles, where STAR is the right tool. STAR was designed for entry-level interviews; it stops working at the level where the interviewer is grading whether you can navigate trade-offs and learn from them. Two moves separate Senior STAR from Staff STAR — and STAR-DRIVE bakes them into the structure.

The other invisible downlevel comes from story coverage. Staff candidates often arrive at the loop with three or four well-rehearsed stories and try to flex them into whatever question the interviewer asks. The interviewer notices when the same project shows up as the answer to 'tell me about leading a difficult migration,' 'tell me about handling conflict with a peer,' and 'tell me about a technical decision you'd reverse.' Each of these probes is testing a different archetype, and the story bank needs to cover all eight archetypes Staff AI engineers are expected to have lived through. This lesson names the archetypes and the trap of trying to cover them with fewer stories than there are slots.

Framework

STAR-DRIVE

The textbook STAR (Situation, Task, Action, Result) framework was designed for entry-level behavioral interviews and stops working at Staff. Senior candidates running STAR produce stories that hit the structural marks and still score Senior because two things are missing: the trade-off being navigated and the reflection on what would change next time. STAR-DRIVE makes both required: Trade-off takes the place of Task, and Reflection closes the story. It also adds Driver of impact, which ties the result to a second-order outcome. Every Staff-level behavioral story therefore has six parts in this order: Situation, Trade-off, Action, Result, Driver of impact, Reflection. The parts STAR lacks (T, D, and E) are where the L7 signal lives. Most behavioral failures at Staff level are not story-quality failures; they are STAR-instead-of-STAR-DRIVE failures.

1
S — Situation (≤30 seconds)
Set the scene with three specifics: scale (team size, traffic, dollars), stakes (what could go wrong), and your role (lead, contributor, the person paged). Keep it short. The interviewer needs context, not a tour. Senior candidates spend 90 seconds here; Staff candidates spend 25.
2
T — Trade-off (the missing first move)
Name the explicit trade-off the situation forced you to navigate. 'We were trading speed-to-ship against retraining-data quality.' 'It was a research-team-roadmap vs platform-stability decision.' Without this, the story is just 'here's what we did'; with it, the story is 'here's the decision space and where we landed.' This is the missing front-half of every Staff-level story.
3
A — Action (what YOU did)
First-person, specific actions. Not 'the team decided' — 'I proposed X, pushed back on Y, brought Z into the room.' If you can't honestly use 'I' for the load-bearing action, this is the wrong story for a Staff round; pick one where you were actually the driver. The most common Staff-level downlevel is the story where the candidate used 'we' for everything important.
4
R — Result (with a number)
Outcomes with at least one number — percent improvement, dollars saved, incident-rate reduction, hires made, models shipped. Vague results ('it went well, the team was happy') downlevel. Hard numbers don't have to be perfect; 'roughly 30% fewer incidents in Q3' beats 'significant improvement.' If the story has no number, find a different story.
5
D — Driver of impact (the move that is often missing too)
Why did this matter beyond the immediate metric? Connect the result to a broader system or org outcome. 'The 30% incident reduction freed our on-call to ship features instead of chase pages, which let us close two roadmap items that had been deferred for a year.' Driver-of-impact is the move that converts a single project story into a Staff-scope story by surfacing the second-order consequences.
6
E — Reflection (the second missing move)
What would you do differently if you ran it again, knowing what you know now? This single move is the highest-signal closer because it demonstrates the metacognitive ability that the rest of the interview only tests indirectly. Senior candidates skip it because they think it shows weakness; Staff candidates use it because it shows maturity. The right answer is specific and small: 'I'd have negotiated the willingness-to-trade ratio upfront instead of two weeks in.' Not 'I'd be better.'

When to use

Run STAR-DRIVE on every behavioral story in your Staff-loop bank. Audit your existing STAR stories: they are almost certainly missing T (trade-off) and E (reflection), and possibly D (driver of impact). Each missing letter is signal you're leaving on the table.
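The audit described above can be sketched as a small script. This is a minimal illustration, not part of the framework itself: the `StarDriveStory` schema, the `missing_letters` helper, and the digit check in `has_number` are all assumptions made for the example, and the sample story is the STAR-only ranker story from this lesson.

```python
from dataclasses import dataclass, fields


@dataclass
class StarDriveStory:
    """One behavioral story, one text field per STAR-DRIVE part (hypothetical schema)."""
    situation: str = ""
    trade_off: str = ""
    action: str = ""
    result: str = ""
    driver_of_impact: str = ""
    reflection: str = ""


# Map each field to the letter the lesson uses for it.
LETTERS = {
    "situation": "S",
    "trade_off": "T",
    "action": "A",
    "result": "R",
    "driver_of_impact": "D",
    "reflection": "E",
}


def missing_letters(story: StarDriveStory) -> list[str]:
    """Return the letters whose part is empty, in S-T-A-R-D-E order."""
    return [
        LETTERS[f.name]
        for f in fields(story)
        if not getattr(story, f.name).strip()
    ]


def has_number(text: str) -> bool:
    """Crude proxy for 'Result with a number': any digit counts."""
    return any(ch.isdigit() for ch in text)


# A STAR-only story: no trade-off, no driver of impact, no reflection.
star_only = StarDriveStory(
    situation="Q3 ranker launch against an OKR deadline; I led the team.",
    action="I led the team through the design and the launch.",
    result="Shipped on time; CTR went up 4%.",
)

print(missing_letters(star_only))    # ['T', 'D', 'E']
print(has_number(star_only.result))  # True
```

An empty field is only the cheapest possible check. A filled-in Reflection that says 'I'd be better' still fails the bar, and no script will catch that for you.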

Worked example

STAR version: 'We had to ship the new ranker by the Q3 OKR deadline. I led the team through the design, we shipped on time, and CTR went up 4%. The team was proud.'

STAR-DRIVE version: 'Q3, we had a CTR-improvement OKR and a competing infra-migration the platform team needed. (S) The trade-off was: ship a new ranker on the old infra and pay migration cost twice, or wait for the migration and miss the OKR. (T) I proposed a third option: build the new ranker against the new infra's API contract before the migration shipped, accepting the risk that the contract might change. I drove the API-contract negotiation with the platform team and ran the ranker development against the contracted interface. (A) We shipped on time, CTR went up 4.2%, and the migration landed two weeks later with zero ranker rework. (R) The pattern (negotiate the API contract early so two teams can work in parallel against it) became how our org defaulted to handling concurrent migrations. (D) Looking back, I'd have written the contract as a formal RFC instead of a Slack agreement; we got lucky that the platform team didn't change it under us. (E)'

Same project, four times the signal.

The 8 archetypes every Staff AI engineer needs a story for

The interviewer's question is not the story they want; it's the archetype they're probing. Memorize the archetypes; you'll recognize the question type within five seconds. Most loops sample 3-5 archetypes; you need a STAR-DRIVE story ready for all 8 because you can't predict which sample comes up.

Dimension	1. Contested technical decision	2. Killed-my-own-project	3. Research-vs-shipping	4. AI-safety/policy escalation	5. Quality regression incident	6. Cost negotiation with product	7. Mentoring through ambiguity	8. Ethical pushback
Interviewer probe (what they actually ask)	'Tell me about a technical decision you got pushback on.'	'Tell me about a project you killed (or should have killed sooner).'	'Tell me about choosing between shipping fast and getting it right.'	'Tell me about pushing back on a launch for safety reasons.'	'Walk me through an incident you owned.'	'Tell me about a tough conversation with a stakeholder.'	'Tell me about coaching someone through a hard problem.'	'Tell me about a time you disagreed with an AI-related product decision on ethical grounds.'
Trade-off you should name (the T)	Your judgment vs. team consensus. Cost of being wrong vs. cost of slowing the team.	Sunk cost vs. opportunity cost. Team morale vs. honest assessment.	Research depth vs. user value delivered. Generalizable model quality vs. shippable narrow win.	Product velocity vs. trust risk. Internal pressure vs. external commitment.	Detection speed vs. monitoring noise. Quick fix vs. structural fix.	Capability the customer wants vs. cost you'll pay forever. Short-term revenue vs. unit economics.	Giving the answer vs. growing the engineer. Your time vs. their development.	Your conscience vs. team alignment. Refusing vs. shaping the design.
Common downlevel trap	Refusing to name the conditions under which you'd update your view. Reads as inflexibility.	Picking a project someone else killed. Reads as evasion; the interviewer wants YOUR call.	Story where 'right' won and 'fast' lost. The interviewer wants nuance; bias toward research reads as out of touch with product.	Story without a clear escalation. Says you saw the issue but didn't act with appropriate authority. Reads as passive.	Story where you fixed the symptom. Staff stories end with the structural change that made the failure class impossible.	Story where you 'compromised.' Compromise without naming the willingness-to-trade ratio reads as concession, not negotiation.	Story where you solved the problem yourself and called it mentorship. Reads as taking credit.	Story where you 'felt uncomfortable but didn't say anything.' Internal monologue is not a story; action is.
Choose when	When asked about pushback, disagreement, or being overruled. Lead with the trade-off.	When asked about kill decisions, reversals, or things you'd undo. Bring a YOUR call story.	When asked about velocity, MVP vs. polish, or research vs. ship. Show nuance both ways.	When asked about safety, trust, or launch pushback. Show the escalation explicitly.	When asked about incidents or on-call. End with the structural fix.	When asked about cost, customer, or product negotiation. Name the trade ratio.	When asked about coaching, mentorship, or developing others. Lean into THEIR growth.	When asked about ethics, fairness, or refusing to build something. Action, not feeling.

Verdict

Eight archetypes, eight STAR-DRIVE stories. Most candidates try to cover them with three rehashed stories — the interviewer notices when the same project keeps showing up. The investment is roughly 8 hours of story prep to write all eight and rehearse them aloud; the return is roughly one level of behavioral signal across the loop.

Calibration ladder

The interviewer asks: 'Tell me about a time you killed a project that you had been championing.'

Archetype 2 probe. The interviewer wants to see whether you can make and own a kill decision against your own prior advocacy.

L4 · Mid

We had a project that wasn't working out and we ended up cancelling it. The team moved on to other things.

Missed: No specifics, no first-person ownership, no result. The interviewer learns nothing about the candidate's judgment.

L5 · Senior

I had been championing a research project for two quarters that wasn't delivering. After it missed its third checkpoint, we held a review and decided to cancel it. I moved my team onto a different initiative that had stronger signal. It was hard but the right call.

Missed: First-person but missing the trade-off articulation and the reflection. Reads as 'I made a hard call and moved on,' not 'I made a hard call and grew from it.'

L6 · Staff

I had championed a 6-month research project to replace our ranking model with a transformer-based architecture. Two months in, our offline experiments showed the new model was matching the old one on primary metric but losing on long-tail recall. The trade-off was sunk cost (4 engineer-months invested, team morale, my own advocacy) against opportunity cost (the same team could be shipping per-slot ranking improvements that we knew would move metrics). I made the kill call in a team meeting, explained the data, took responsibility for having underestimated the long-tail problem upfront, and we pivoted to per-slot improvements. We landed three of those in the next quarter with cumulative metric impact larger than the transformer would have delivered.

Missed: STAR coverage is strong and the trade-off is named, but the structural change (D, Driver of impact) and the specific reflection (E) are absent. The two missing letters are the level.

L7 · Principal

I had championed a 6-month research project for our ranking infrastructure — moving from gradient-boosted to transformer-based. (S) The trade-off was sunk cost on a high-conviction research bet against opportunity cost of the same team shipping known-good incremental improvements. (T) Two months in, my own offline experiments showed the transformer matching primary metric and losing on long-tail recall — a slice I hadn't analyzed upfront. I took it to the team meeting, presented the data, called the kill myself before anyone else suggested it, and proposed the pivot to per-slot improvements. I also did the harder ask — I sent a written postmortem to my skip-level explaining what I'd missed and what I'd do differently. (A) We landed three per-slot wins in the next quarter; cumulative impact roughly 2.5x what the transformer projection showed. (R) The pattern stuck — our team's research-bet template now requires explicit upfront analysis of which slices the bet must beat the baseline on, not just the aggregate. It's saved at least two other bets from going six months before discovery. (D) What I'd do differently: I should have insisted on the long-tail analysis in the bet's design doc, before we committed. We'd have caught the gap in two weeks instead of two months. The framework for this — 'what slice could this lose on and still match aggregate' — is something I now insist on in every research roadmap review. (E)

What scored L7

Full STAR-DRIVE with all six parts: scoped Situation, named Trade-off, first-person Action including the harder ask (writing the postmortem to skip-level), quantified Result (2.5x cumulative impact), Driver of impact (the pattern became a template that saved other bets), and specific Reflection (the slice-analysis-in-design-doc move that scales). The L7 signal is the conversion of one project's lesson into a process change that compounds across other bets — that's the structural-change signature interviewers grade for.

Unspoken rubric

What behavioral interviewers actually score on a Staff AI loop, beyond the surface 'good story' rating.

What they score

·Did the story have a specific trade-off named, or was it 'we had a hard problem and we solved it'?
·Was the action first-person ('I proposed', 'I drove'), or did 'we' do all the load-bearing work?
·Did the result include a number, or was it 'it went well'?
·Did the candidate name something specific they would do differently, or did they skip the reflection or settle for 'I'd be better'?
·Did the story include a structural change — process, framework, template — that compounded beyond the single project?
·Did the candidate use one project per archetype, or did the same project keep showing up?

Why it's not on the rubric

Behavioral rubrics say 'demonstrates leadership' and 'shows growth mindset' — abstractions that interviewers translate into the bullets above when they grade. The translation is consistent across interviewers because the bullets are what 'leadership' actually looks like in story form. Practice the bullets, not the abstractions.

How to signal it

→Open each story with the trade-off, not the situation: 'The trade-off was X vs Y; here's the situation that forced it.' Inverting the order alone separates Staff stories from Senior ones.
→Use 'I' for the load-bearing action even when it feels uncomfortable. 'I proposed,' 'I escalated,' 'I made the kill call' — these are the verbs interviewers grade for.
→End every story with a Driver-of-impact + Reflection couplet, even on stories where they feel optional. The couplet is the closer that converts a project story into a Staff-scope story.
→Before the loop, audit your story bank: do you have one story per archetype, or are three stories doing the work of eight? If the latter, write the missing ones. Eight hours of prep, one level of behavioral signal.
→If you're caught with a question that matches no story in your bank, name it: 'I don't have a perfect example of that, but the closest is X — let me adapt.' The honesty scores better than forcing the wrong story.

Real-world reference · Will Larson

StaffEng.com — real Staff promotion narratives

StaffEng.com publishes interviews with engineers who reached Staff and Principal at major tech companies. Reading 10-20 of these reveals the recurring pattern: every successful promotion story is structured around a specific trade-off the engineer navigated and a specific structural change they drove, not a list of projects they shipped. The successful candidates told stories closer to STAR-DRIVE shape than to STAR shape, often without naming the framework.

Takeaway: The structure of a successful Staff-promotion story is almost identical to the structure of a successful Staff-interview story: trade-off named, first-person action, result quantified, structural change that scales, specific reflection. The framework isn't novel; what's novel is using it deliberately under interview pressure instead of hoping the story comes out structured. Read 5-10 StaffEng narratives before your loop and notice the recurring shape — you'll recognize it in your own bank.

Drill · 25 minutes

Practice this. Time yourself.

You have 25 minutes. Pick three of the eight archetypes you currently have NO solid story for. Write a STAR-DRIVE outline (one paragraph per letter, 6 paragraphs total) for each. Time-box at 8 minutes per story. The goal is not a polished narrative — it's surfacing whether you actually have the story material in your career and identifying which archetypes are gaps. If you can't find a story for an archetype, that's the most important diagnostic finding of this drill.

Self-assessment rubric

Dimension	Weak	Passing	Strong	Staff bar
Coverage discovery	Wrote three stories from the same project. Did not discover gaps.	Three different projects, three different archetypes covered.	Three different projects with one of them being a stretch (archetype you weren't sure you had).	Three different projects AND explicit identification of which archetypes you still don't have material for. Naming the gaps is more valuable than writing the stories you already had.
Trade-off articulation	Skipped the T or wrote a generic 'fast vs right.'	Specific trade-off named in each story.	Trade-off named in dimensions the interviewer can grade — e.g., 'platform stability vs research velocity' instead of 'tough call.'	Same plus: the trade-off is one the team explicitly negotiated, not one you internalized alone. Stories where the trade-off was made in conversation with other senior people score higher because they demonstrate cross-team influence.
Reflection specificity	Skipped E or wrote 'I'd be better/more careful.'	Concrete thing you'd do differently.	Concrete change AND why it would have helped — connected to the specific failure mode.	Same plus: the change became a template or process you applied to subsequent work. Demonstrates that the reflection compounded into structural change.
Structural-change driver	Result was the project outcome only.	Result included some downstream consequence.	Result connected to a process, framework, or pattern that scaled beyond the project.	Same plus: the structural change is something the org still does. The most credible Driver-of-impact is the one with continuing artifacts (the design-doc template, the release-gate rule, the on-call rotation policy you introduced that's still running).

Model solution

Archetype 5: Quality regression incident.

(S) Situation. Q2, our fraud-scoring model silently flagged an additional 0.3% of transactions as false positives for two weeks before we noticed. ~2,400 affected customers, ~$180K in held-up legitimate revenue.

(T) Trade-off. The pressure was to roll back immediately to stop the bleeding vs. investigate first to avoid rolling back to a model that might have its own issue we hadn't found yet. The org's instinct was the fast rollback; my instinct was a 24-hour investigation before committing.

(A) Action. I made the call to delay rollback by 24 hours while we ran the diagnostic. Specifically I (a) pulled the per-class confidence distribution for each model version from our observability stack; the regression was concentrated in one merchant category whose feature distribution had shifted; (b) confirmed the previous model version did not have the same shift sensitivity (so rollback was safe); (c) flagged the affected customers for manual reversal in parallel with the rollback; (d) made the rollback decision and notified leadership within 24 hours of starting the diagnosis.

(R) Result. Rolled back the model; recovered $165K of the held revenue via manual reversal within 48 hours; the previous-model rollback caused no secondary issues. Mean-time-to-detect on this class of failure went from 14 days to 4 hours after we shipped the structural fix.

(D) Driver of impact. The structural fix was the per-merchant-category prediction distribution alerting (Tier 2 of the Detection Latency Hierarchy) that we shipped two weeks after the incident. It has caught three subsequent regressions on similar slice patterns, each within hours instead of weeks. That alerting pattern became the team's required release gate for any new model, and it's still in production.

(E) Reflection. What I'd do differently: I'd have shipped the per-category distribution alerting before this incident, not after. We'd had two earlier near-misses where aggregate accuracy looked fine and slice-specific issues caused user-visible problems; I noticed the pattern but treated it as 'something to do when we have time' rather than as the next incident waiting to happen. The lesson (when a near-miss recurs, treat the next instance as the certainty it is) is something I now invoke in every quarterly observability roadmap.

Common failures

✗Wrote three stories from the same project to make it easy. Defeats the diagnostic purpose of the drill — the point is to surface which archetypes you don't have material for.
✗Skipped E because it felt like admitting weakness. The reflection is the highest-signal closer; skipping it is exactly the level-leaving move the lesson named.
✗Used 'we' for all action verbs. Reads as committee work, not Staff-level driving.
✗Result was just 'we shipped on time' or 'it went well.' Without a number, the interviewer can't grade the impact.
✗Driver-of-impact was the immediate metric only. Staff stories connect to the structural change that compounded — process, framework, template, policy.

Artifact · checklist

The Story Coverage Matrix

Audit your story bank — one entry per archetype

☐Archetype 1 — Contested technical decision: ___________________ (project / year)
☐Archetype 2 — Killed-my-own-project: ___________________
☐Archetype 3 — Research-vs-shipping: ___________________
☐Archetype 4 — AI-safety/policy escalation: ___________________
☐Archetype 5 — Quality regression incident: ___________________
☐Archetype 6 — Cost negotiation with product: ___________________
☐Archetype 7 — Mentoring through ambiguity: ___________________
☐Archetype 8 — Ethical pushback: ___________________

Per story — STAR-DRIVE checklist

☐S — Situation in ≤30 seconds with scale, stakes, your role.
☐T — Trade-off explicitly named in dimensions the interviewer can grade.
☐A — First-person action verbs ('I proposed', 'I drove'); no 'we' for load-bearing work.
☐R — Result with at least one number.
☐D — Driver of impact — second-order consequence or structural change.
☐E — Reflection — specific thing you'd do differently and why.

Pre-loop checks

☐Each archetype has its own story, and no single project supplies more than two of them.
☐Each story rehearsed aloud at least twice — once timed, once with a friend pushing back.
☐Each story's number is defensible — can survive 'where did that come from?'
☐Each story's reflection is small and specific (no 'I'd be better' answers).
☐At least 2 stories have structural-change Drivers (process / template / policy that scaled).
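The coverage matrix and the 'no project covers more than two archetypes' check can also be run mechanically. The sketch below is one way to do it, under stated assumptions: the `audit_story_bank` function and the dict-based bank layout are hypothetical, and the sample bank mirrors the over-reused project from the post-mortem in this lesson rather than real data.

```python
from collections import Counter

# The eight archetypes named in this lesson.
ARCHETYPES = [
    "Contested technical decision",
    "Killed-my-own-project",
    "Research-vs-shipping",
    "AI-safety/policy escalation",
    "Quality regression incident",
    "Cost negotiation with product",
    "Mentoring through ambiguity",
    "Ethical pushback",
]


def audit_story_bank(bank, max_per_project=2):
    """Audit a story bank given as {archetype: project name or None}.

    Returns (gaps, overused):
      gaps     - archetypes with no story yet, in lesson order
      overused - projects that cover more than max_per_project archetypes
    """
    gaps = [a for a in ARCHETYPES if not bank.get(a)]
    counts = Counter(project for project in bank.values() if project)
    overused = sorted(p for p, n in counts.items() if n > max_per_project)
    return gaps, overused


# Illustrative bank: one favorite project stretched across three archetypes.
bank = {
    "Contested technical decision": "training-infra overhaul",
    "Research-vs-shipping": "training-infra overhaul",
    "Mentoring through ambiguity": "training-infra overhaul",
    "Quality regression incident": "data pipeline regression",
}

gaps, overused = audit_story_bank(bank)
print("Gaps:", gaps)
print("Overused projects:", overused)
```

The gap list is the useful output. It is the same diagnostic the drill asks for: the archetypes you have no material for yet.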

Post-mortem · anonymized

Setup

L6 candidate at a top AI lab, six-round on-site. Strong technical content in all four design rounds. Behavioral debrief came back as 'good engineer, more Senior-tier on the leadership side than we'd want for L6.'

What happened

Across two behavioral rounds, the candidate told four stories. Three of them were variants of the same project — a big training-infrastructure overhaul they had led. Each version emphasized a different aspect (collaboration, technical decision, mentorship) but all three were the same shipped project. The fourth story was a different project but was a peer-collaboration anecdote with no clear first-person action. None of the stories had explicit trade-off articulation; all of them used 'we' for the load-bearing actions. None of them had a reflection beyond 'I'd communicate more proactively.' The candidate's actual project portfolio was strong — they had killed projects, navigated research-vs-shipping calls, handled incidents, and mentored others. They just hadn't surfaced those stories in interview prep because they were proud of the training-infra project and defaulted to it.

The moment

The interviewer's debrief note: 'I asked four behavioral questions hitting four different archetypes; I got back what felt like one and a half stories told from different angles. Couldn't grade leadership scope from this; the technical depth was clear but the leadership signal was inconsistent.' The candidate had not failed at storytelling; they had failed at coverage. The investment to fix this would have been 6-8 hours of writing the missing archetype stories from projects they had actually lived through. The cost of not doing it was one level.

What they should have said

Two weeks before the loop: 'Audit my story bank against the 8 archetypes. For each archetype I don't have a strong story for, write one from the projects I've actually done. Specifically write a kill-decision story from the recsys project I deprioritized in Q3, a research-vs-ship story from the embedding model I shipped before benchmarking, an incident story from the data pipeline regression last winter, and an ethics-pushback story from the time I delayed a content-moderation launch.' That's roughly 6-8 hours of writing and rehearsal. The technical content of the loop would have produced an L6 outcome with that prep instead of an L5 one.

Lesson

Behavioral preparation is investment, not improvisation. Staff candidates who arrive at the loop with 8 archetype stories rehearsed in STAR-DRIVE format score Staff on behavioral. Staff candidates who arrive with 3-4 favorite stories score Senior. The technical content of the loop does not save you on this — the behavioral signal is graded independently. The framework here gives you the structure; the work to apply it is yours, and the ROI per hour is the highest of any prep activity for a Staff loop.