Module 4 · Lesson 5 · Advanced · 42 min

AI Coding Assistant (Copilot-scale) — Simulated Interview

Copilot-class coding assistants serve completions inline as the user types, with sub-100 ms perceived latency. The Ghost-Text Completion Stack names the five layers (anticipatory fetch, FIM prompting, speculative cancellation, privacy-preserving tenancy, acceptance feedback) and shows where the architectural decisions live.

Coding assistants are among the most latency-sensitive products in the AI space. Users perceive a completion that takes longer than ~150 ms as 'broken': the cursor sits idle, the user keeps typing, and by the time the completion arrives it no longer matches what is on screen. Beat the perceived latency budget and the product feels magical; miss it and the feature gets uninstalled. The 100 ms p99 SLA is not a stretch goal; it is the difference between a usable product and a demo.

The Ghost-Text Completion Stack is the architecture that makes the 100 ms budget achievable. Five layers, each addressing a constraint that the obvious approach (LLM + IDE plugin + prompt-with-context) fails to satisfy. The interview prompt 'design Copilot' usually surfaces all five constraints during follow-up questioning; candidates who name the layers up front are doing in the first three minutes what the interviewer would have spent twenty minutes drawing out. The framework's job is to make the constraints structural, not surprises.

Framework

The Ghost-Text Completion Stack

Copilot-class coding assistants target <100 ms p99 because users perceive completions slower than ~150 ms as 'broken.' The Ghost-Text Completion Stack is the five-layer architecture that makes that latency budget achievable: aggressive context prediction, fill-in-the-middle (FIM) prompt construction, speculative cancellation of in-flight requests when the user keeps typing, privacy-preserving repo access that doesn't ship the user's source code to a vendor by default, and acceptance feedback captured as a training signal. Each layer has its own architectural commitment; missing any one produces a system that's slow, wrong, a compliance incident, or unable to learn from its own usage.

Layer 1 — Context prediction (anticipatory fetch)
When the user pauses for ~50 ms after typing, the assistant fetches relevant context: other relevant functions in the current file, recently edited files, declared imports, and similar files in the repo. The fetch happens before the user explicitly requests a completion. This anticipatory fetch is what makes the inline latency budget achievable; a reactive fetch on the completion request would spend part of the visible budget on context lookup.
Layer 2 — Fill-in-the-middle (FIM) prompt construction
Code completion is not text completion. The model needs prefix (what comes before the cursor), suffix (what comes after), and relevant context (other functions, imports). FIM-trained models accept this structure natively; non-FIM models can be tricked into it with prompt engineering but produce worse completions. Architectural commitment: the model choice is constrained to FIM-capable models.
Layer 3 — Speculative cancellation
Users keep typing. When the user types one more character, the in-flight completion is now wrong. Cancel it cheaply and start a new one. The cancellation has to be cheap because users type at ~200 ms intervals — every wasted in-flight inference is wasted GPU. Architecturally: requests carry a cancellation token; serving infrastructure honors cancellations within ~20 ms.
Layer 4 — Privacy-preserving repo access
Enterprise customers will not ship source code to a vendor. The architecture must support a tenancy model — single-tenant deployment on customer infrastructure, federated retrieval where embeddings leave but raw code does not, or on-device inference for the most sensitive customers. The privacy posture is not a feature; it is the precondition for selling the product to any company larger than ~50 engineers.
Layer 5 — Acceptance feedback as training signal
Users implicitly label every completion: accepted (typed past it), partially accepted (edited it), rejected (deleted it, kept typing). This signal is the highest-quality training data the coding assistant team has. The architectural commitment: capture acceptance signals with enough context to retrain — the prompt, the completion, the surrounding edit state, the time-to-accept-or-reject. Most coding assistant teams underuse this signal because the logging path was bolted on instead of designed in.
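
The type-pause trigger from Layer 1 and the cancel-on-keystroke rule from Layer 3 can be sketched together on the client side. This is a minimal illustration using Python's asyncio, not a description of any real assistant's implementation: `CompletionSession`, `fake_infer`, and `demo` are hypothetical names, and a production system would also have to propagate the cancellation to the serving fleet rather than only dropping the local task.

```python
import asyncio


class CompletionSession:
    """Keeps at most one completion request in flight per editor session."""

    def __init__(self, infer, pause_s=0.05):
        self._infer = infer        # hypothetical async model call
        self._pause_s = pause_s    # type-pause threshold (~50 ms in the text)
        self._task = None
        self.cancelled = 0         # cancelled requests, a simple waste indicator

    def on_keystroke(self, buffer_text):
        # A new keystroke makes whatever is in flight stale, so cancel it.
        if self._task is not None and not self._task.done():
            self._task.cancel()
            self.cancelled += 1
        self._task = asyncio.create_task(self._run(buffer_text))
        return self._task

    async def _run(self, buffer_text):
        await asyncio.sleep(self._pause_s)    # wait for the type-pause
        return await self._infer(buffer_text)


async def fake_infer(text):
    # Stand-in for the model service; sleeps briefly and echoes the buffer.
    await asyncio.sleep(0.01)
    return text + "<completion>"


async def demo():
    session = CompletionSession(fake_infer)
    first = session.on_keystroke("def f")
    await asyncio.sleep(0.005)                # user types again before the pause ends
    second = session.on_keystroke("def fo")
    result = await second
    return first.cancelled(), result, session.cancelled
```

Running `asyncio.run(demo())` cancels the first request and returns the completion for the second buffer. The `cancelled` counter is the client-side view of the waste that Layer 3 asks the team to track.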

When to use

Apply the framework to any coding-assistant, IDE-integration, or low-latency-completion product interview. The framework also applies to in-app suggestions (Gmail Smart Compose, Notion AI inline) where the latency budget is similarly tight and the context prediction lever is similar.

Worked example

Senior answer to 'design a Copilot-class assistant': 'LLM with code training, integrate with IDE, prompt with surrounding context.' Staff answer: 'Five layers. Anticipatory fetch on type-pause: without it, reactive context lookup eats into the latency budget. FIM-capable model: non-FIM models produce worse completions, so the model choice is constrained. Speculative cancellation: users keep typing, so in-flight inferences must be cheap to cancel. Privacy-preserving repo access: enterprise won't ship code to a vendor, so the tenancy model is mandatory, not optional. Acceptance feedback loop: implicit user labels are the training data, captured by design, not bolted on. The 100 ms budget forces anticipatory fetch and speculative cancellation; the enterprise market forces privacy; the FIM and feedback layers determine completion quality. Each layer answers a different design constraint.'

Calibration ladder

How do you handle the privacy concerns of enterprise customers who don't want their source code shipped to your servers?

This question rules out 60% of theoretically correct candidate answers because most candidates have not thought about the enterprise tenancy model.

L4 · Mid

Add encryption in transit and at rest. Promise we don't train on customer code.

Missed: Treated privacy as compliance theater. Enterprise won't buy on these grounds.

L5 · Senior

Offer a no-training-on-data tier. SOC 2 and similar compliance. Possibly air-gapped on-premise deployment for the most sensitive customers.

Missed: Knew that on-premise exists, but didn't name the three tiers or the architectural implications.

L6 · Staff

Three-tier tenancy model. Tier 1: cloud multi-tenant with strong promises (no training, encryption, audit logs); fine for small teams and individual developers. Tier 2: single-tenant cloud — customer's instance, customer's data plane, vendor's control plane; fits most enterprise. Tier 3: fully on-premise or VPC-deployed; fits regulated industries. The architecture has to support all three with the same product surface; that constrains the model choice (must be deployable in customer infra) and the inference stack (must run without vendor cloud dependencies for Tier 3).

Missed: Strong three-tier model. Missing the meta-move — that privacy posture is product strategy and the architecture has to be designed for the most-restrictive tier from day one.

L7 · Principal

Same three-tier model with the meta-acknowledgment that the privacy posture is the product strategy, not a feature checkbox. Enterprise customers will pay 10× for Tier 2/3 deployment, and the vendors that ship Tier 2/3 will dominate the enterprise market regardless of who has the best Tier 1 product. The architectural decision is therefore: do we make the model and inference stack deployable in customer infrastructure from day one, or do we build for cloud-only first and retrofit later? The latter is the canonical mistake — retrofitting on-prem deployment onto a cloud-native architecture costs years and frequently produces a degraded product. The right move is to design for the most-restrictive tenancy from the start, and let the cloud version be a deployment mode of the same architecture, not a separate product. The pattern: when the customer market segments by privacy posture, the privacy posture writes the architecture. Same shape as 'latency budget writes the architecture' from earlier lessons.

What scored L7

Named that privacy posture writes the architecture and that retrofitting on-prem is the canonical mistake. Connected back to the 'constraints write the architecture' pattern from earlier lessons. The L7 move is recognizing that enterprise market segmentation is itself a design constraint.
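
The three-tier model from the Staff and Principal answers can be made concrete as a single resolution function that returns a different deployment record per tier while the calling code stays the same. This is a sketch under stated assumptions: the `Tier`, `Deployment`, and `resolve` names, the example hostnames, and the choice to let Tier 2 export telemetry are all hypothetical and not taken from the lesson.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    MULTI_TENANT = 1    # Tier 1: shared cloud
    SINGLE_TENANT = 2   # Tier 2: customer data plane, vendor control plane
    ON_PREM = 3         # Tier 3: no vendor cloud dependencies


@dataclass(frozen=True)
class Deployment:
    tier: Tier
    inference_url: str       # where completions are served
    export_telemetry: bool   # may acceptance labels leave the customer boundary?


def resolve(tier, customer_host=None):
    """Same product surface, three topologies: only the deployment record differs."""
    if tier is Tier.MULTI_TENANT:
        # Hypothetical vendor endpoint, used for illustration only.
        return Deployment(tier, "https://inference.vendor.example/v1", True)
    if customer_host is None:
        raise ValueError("Tier 2 and Tier 3 need a customer-owned host")
    if tier is Tier.SINGLE_TENANT:
        # Assumption: Tier 2 customers allow telemetry export; real contracts vary.
        return Deployment(tier, f"https://{customer_host}/v1", True)
    # Tier 3: everything stays inside the customer boundary.
    return Deployment(tier, f"https://{customer_host}/v1", False)
```

For example, `resolve(Tier.ON_PREM, "llm.corp.internal")` yields a record that points at the customer's host and disables telemetry export. Designing for the most restrictive tier first means every other component must already tolerate that record.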

Architecture

Copilot-class coding assistant. The IDE-side components handle anticipatory fetch and speculative cancellation; the model service handles FIM-aware completion; the tenancy layer determines deployment topology; the feedback loop captures acceptance signals.
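
As a rough illustration of the anticipatory fetch that the IDE side owns, the sketch below caches context on type-pause so the later completion request finds it ready. `ContextCache`, its `fetch` callback, and the 2-second freshness window are assumptions made for the example; the lesson does not specify a cache policy.

```python
import time


class ContextCache:
    """Holds context pre-fetched on type-pause for the imminent completion request."""

    def __init__(self, fetch, ttl_s=2.0, clock=time.monotonic):
        self._fetch = fetch     # hypothetical (file_path) -> list[str] context lookup
        self._ttl_s = ttl_s     # assumed freshness window, not from the lesson
        self._clock = clock
        self._entries = {}      # file_path -> (fetched_at, context)

    def on_type_pause(self, file_path):
        # Runs before any completion is requested, so its cost is off the visible budget.
        self._entries[file_path] = (self._clock(), self._fetch(file_path))

    def get(self, file_path):
        # Called on the completion request: a fresh hit makes prompt assembly cheap.
        entry = self._entries.get(file_path)
        if entry is not None and self._clock() - entry[0] <= self._ttl_s:
            return entry[1], True
        # Miss or stale entry: fall back to a reactive fetch, which spends visible budget.
        context = self._fetch(file_path)
        self._entries[file_path] = (self._clock(), context)
        return context, False
```

The boolean returned by `get` separates pre-fetched hits from reactive fallbacks, which is one way to measure how often the anticipatory path actually protects the budget.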

IDE extension · type-pause detection + cancellation tokens

“Owns anticipatory fetch trigger and cancellation token issuance. The IDE side is where the latency budget is won or lost.”

Context fetcher · relevant files, imports, symbols

“Triggered on type-pause, not on completion request. Anticipatory; results are cached for the imminent completion.”

Prompt assembler · FIM-format prefix + suffix + context

“Constructs the FIM prompt with prefix (before cursor), suffix (after cursor), and ranked context. Constrained to FIM-capable model.”

Tenancy router · cloud / single-tenant / on-prem

“Routes the inference call to the appropriate tenancy. Same product surface, three deployment topologies.”

Code LLM (FIM-capable) · serving fleet

“Continuous batching, speculative decoding, cancellation honoring within ~20 ms. Code-tuned model with FIM training.”

Acceptance feedback collector

“Captures accept/edit/reject signals with prompt, completion, and edit-state context. Highest-quality training data the team has.”

Training pipeline · acceptance signals + curated code

“Continuous re-training on the feedback signal with bias-aware sampling. Acceptance is implicit labeling at production scale.”

Per-tenancy per-language observability

“TTFT per tenancy, completion quality per language, acceptance rate per model version. The feedback loop's measurement layer.”
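
The prompt assembler's job (prefix, suffix, and ranked context in FIM format) might look roughly like the sketch below. The sentinel strings are placeholders: each FIM-trained model defines its own special tokens, so `<PRE>`, `<SUF>`, and `<MID>` are assumptions for illustration, as are the `ContextBlock` type, the prefix-suffix-middle ordering, and the character-based budget (a real assembler would count tokens).

```python
from dataclasses import dataclass

# Placeholder sentinels: real FIM-trained models define their own special tokens.
PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"


@dataclass
class ContextBlock:
    path: str
    text: str
    score: float   # relevance score from the (hypothetical) context fetcher


def assemble_fim_prompt(prefix, suffix, blocks, context_budget_chars=2000):
    """Pack the highest-ranked context that fits, then lay out prefix, suffix, middle."""
    chosen, used = [], 0
    for block in sorted(blocks, key=lambda b: b.score, reverse=True):
        rendered = f"# file: {block.path}\n{block.text}\n"
        if used + len(rendered) > context_budget_chars:
            continue   # skip blocks that do not fit; smaller ones may still fit
        chosen.append(rendered)
        used += len(rendered)
    context = "".join(chosen)
    # The model is asked to generate the text that belongs between prefix and suffix.
    return f"{PRE}{context}{prefix}{SUF}{suffix}{MID}"
```

Context is placed ahead of the prefix so the text immediately before the cursor stays adjacent to the completion point. The budget parameter is where a per-language context limit would plug in.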

ide → context · anticipatory fetch on pause

ide → prompt · completion request

context → prompt · ranked context

prompt → router

router → model · tenancy-appropriate route

model → ide · streaming tokens

ide → feedback · accept / edit / reject

feedback → training

training → model · new model versions
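
The edges above describe two loops: a synchronous one that the user waits on, and an offline one that improves the model. A small sketch, assuming the edge list is encoded as a plain adjacency dict (the `EDGES` and `shortest_path` names are illustrative), makes that split explicit.

```python
from collections import deque

# Adjacency list transcribed from the edges listed above.
EDGES = {
    "ide": ["context", "prompt", "feedback"],
    "context": ["prompt"],
    "prompt": ["router"],
    "router": ["model"],
    "model": ["ide"],
    "feedback": ["training"],
    "training": ["model"],
}


def shortest_path(edges, start, goal):
    """Breadth-first search; returns the shortest node list from start to goal."""
    queue = deque([[start]])
    seen = set()
    while queue:
        path = queue.popleft()
        for nxt in edges.get(path[-1], []):
            if nxt == goal:
                return path + [nxt]
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


# The loop the user waits on: request out, streamed tokens back.
HOT_PATH = shortest_path(EDGES, "ide", "ide")
# The loop that runs offline: acceptance signals into new model versions.
LEARNING_PATH = shortest_path(EDGES, "feedback", "model")
```

Here `HOT_PATH` comes out as ide, prompt, router, model, ide, and neither the context fetcher nor the feedback collector appears on it. That is the structural reason their cost stays outside the visible latency budget.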

Latency anatomy · budget 100 ms

Inline completion latency budget for a 100 ms p99. The anticipatory fetch shifts context lookup outside the visible budget; a small FIM-trained model keeps inference within the budget; speculative decoding accelerates decode.

IDE → completion request (network) · 8 ms

Local network or persistent connection.

Prompt assembly (context already pre-fetched) · 5 ms

Anticipatory fetch already completed; assembly is cheap.

Tenancy routing · 2 ms

Trivial routing decision.

Model inference (FIM, ~30-50 tokens output) · 70 ms

Small code-tuned FIM model with speculative decoding. Output is short, so total inference is dominated by prefill of a few hundred tokens.

Response framing + stream to IDE · 15 ms

Token-by-token to IDE. First token is what user sees.
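
The stage figures above sum to exactly the 100 ms budget, which leaves no headroom for a reactive context lookup. The sketch below checks that arithmetic; the stage values come from the breakdown above, while the 30 ms reactive-fetch cost used in the usage note is a hypothetical number chosen only to show the effect.

```python
BUDGET_MS = 100

# Stage costs transcribed from the latency anatomy above.
STAGES_MS = {
    "IDE -> completion request (network)": 8,
    "Prompt assembly (context pre-fetched)": 5,
    "Tenancy routing": 2,
    "Model inference (FIM)": 70,
    "Response framing + stream to IDE": 15,
}


def headroom_ms(stages, budget=BUDGET_MS):
    """Budget left after all visible stages; negative means the budget is blown."""
    return budget - sum(stages.values())


def with_reactive_fetch(stages, fetch_ms):
    """Same stages, with context lookup moved back inside the visible budget."""
    out = dict(stages)
    out["Context fetch (reactive)"] = fetch_ms
    return out
```

`headroom_ms(STAGES_MS)` is 0, so any context lookup performed after the completion request pushes the total over budget by exactly its own cost. With an assumed 30 ms lookup, `headroom_ms(with_reactive_fetch(STAGES_MS, 30))` is -30.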

Drill · 12 minutes

Practice this. Time yourself.

You have 12 minutes. Your coding assistant's p99 inline completion latency just regressed from 90 ms to 160 ms after a model update. The new model produces higher-quality completions (acceptance rate up 5%). The product team is asking you to ship it anyway. Walk through: (1) why the latency regression is product-killing despite quality being better, (2) the three things you'd try to recover latency without giving up quality, (3) what you'd tell product if none of them recovered the budget.

Self-assessment rubric

Dimension	Weak	Passing	Strong	Staff bar
Why latency dominates	'Users like it fast.'	User-perception threshold is ~150 ms; 160 ms is past the cliff.	Same plus: at 160 ms, users race the completion by typing — the in-flight result is wrong by arrival, acceptance rate measurement is corrupted, the 5% quality gain disappears in production.	Same plus: connected to the perception-vs-measurement gap. The offline 5% acceptance gain is on a benchmark that doesn't include the race condition; in production with the slower model, the acceptance rate measurement itself is affected by the user behavior change.
Three latency interventions	Generic 'optimize.'	Quantization, speculative decoding, smaller model.	Ranked: (1) speculative decoding tuning (no quality loss), (2) anticipatory fetch tightening (pre-fetch more context so prompt assembly drops), (3) FIM-specific batching (cluster requests by language for better batch efficiency).	Same plus: noted that quantization is the wrong instinct here (Lesson 2.1 — memory fix, not latency fix), and that the model change itself may be wrong direction — should consider rolling back and shipping the previous model's architecture with the new training data instead.
Pushback to product	'We can't ship this.'	Articulated the user-experience cliff with data.	Same plus: proposed a measurement-aware experiment — A/B the new model on willing users for two weeks, measuring acceptance rate AND retention. If retention drops, latency cliff confirmed.	Same plus: named the willingness-to-trade ratio question — product needs to commit to a quality-vs-latency ratio explicitly. 'How much quality regression are we willing to accept to recover 60 ms of latency?' That ratio is what tells us whether to ship a smaller model variant.

Model solution

Why latency dominates. Inline code completion has a hard perceptual threshold around 150 ms. Past that, users race the completion by typing, so the in-flight result arrives after the cursor has already moved past where it would have been useful. The 160 ms p99 sits on the wrong side of the cliff. The 5% acceptance gain measured offline is misleading because the offline benchmark doesn't include the race condition; in production, users will stop accepting completions that arrive late, and the acceptance rate measurement itself will be corrupted. The right framing for product: we have a model that's better on benchmarks but worse in user perception; shipping it will regress retention even if the offline number looks good.

Three latency interventions.

(1) Speculative decoding tuning. The new model may have a lower acceptance rate for speculative drafts, which under-uses the speedup. Tune the draft model and acceptance threshold to recover ~15-25 ms with no quality loss. Highest impact, lowest risk.

(2) Anticipatory fetch tightening. Extend the type-pause detection to fetch more context earlier, so prompt assembly time during the visible budget drops. ~10-20 ms recovery. Requires careful tuning to avoid over-fetching that wastes bandwidth.

(3) FIM-specific batching. Cluster requests by language or by similar context shape to improve batch efficiency at the model serving level. ~5-15 ms recovery. Operationally tricky, but a real win at production QPS.

Notably absent from this list: quantization. Per Lesson 2.1, quantization is a memory fix: it helps decode but doesn't meaningfully help TTFT. Wrong tool for the regression we're seeing.

What I'd tell product. If after the three interventions we're still at ~135 ms p99, ship it; we're under the cliff. If we're at 145+ ms, do not ship: the quality gain disappears in production due to user behavior, and the regression in retention will be invisible until the long-term holdback measures it.

The conversation with product needs to be: 'Offline benchmarks are not the right measurement for inline-completion quality. We need an A/B that measures retention, not just acceptance rate, because the 160 ms regression will change user behavior in ways that corrupt the acceptance signal.' Propose a 2-week A/B on willing users measuring both metrics. If retention drops, the latency cliff is confirmed and we don't ship at this latency.

Also propose to product: commit to a willingness-to-trade ratio between latency and quality before we run more experiments: 'how much quality regression are we willing to accept to recover 30 ms of latency?' That ratio tells us whether to ship a smaller-model variant that trades some of the 5% gain for the latency recovery. Right now product is asking the engineering team to make that trade-off implicitly, which is the wrong altitude for the decision.
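
The recovery estimates and ship thresholds in the model solution can be combined into a quick projection. This is a back-of-the-envelope sketch, assuming the three recoveries add linearly (they may overlap in practice). The label for the 135-145 ms band is also an assumption, since the model solution gives no verdict for that range.

```python
REGRESSED_P99_MS = 160
SHIP_AT_OR_BELOW_MS = 135   # "still at ~135 ms p99, ship it"
HOLD_AT_OR_ABOVE_MS = 145   # "145+ ms, do not ship"

# (low, high) recovery estimates in ms, taken from the three interventions.
RECOVERY_MS = {
    "speculative decoding tuning": (15, 25),
    "anticipatory fetch tightening": (10, 20),
    "FIM-specific batching": (5, 15),
}


def projected_range(start_ms, recoveries):
    """Return (best case, worst case) p99, assuming recoveries simply add up."""
    low_total = sum(low for low, _ in recoveries.values())
    high_total = sum(high for _, high in recoveries.values())
    return start_ms - high_total, start_ms - low_total


def decision(p99_ms):
    if p99_ms <= SHIP_AT_OR_BELOW_MS:
        return "ship"
    if p99_ms >= HOLD_AT_OR_ABOVE_MS:
        return "do not ship"
    # Assumption: the text leaves this band open, so defer to the retention A/B.
    return "undecided: run the retention A/B"
```

Under the additive assumption, `projected_range(REGRESSED_P99_MS, RECOVERY_MS)` gives (100, 130), so even the pessimistic end lands under the 135 ms ship line. If only the first two interventions hit their low estimates, the projection is 135 ms, which sits right at the ship line.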

Common failures

✗ Treated this as a pure latency optimization problem without naming the perceptual cliff.
✗ Suggested quantization. Wrong fix for TTFT-bound systems.
✗ Did not propose retention, rather than acceptance rate, as the measurement. Acceptance rate is corrupted by the latency regression itself.
✗ Did not name the willingness-to-trade ratio question for product. That's the conversation product owes the team before more experiments.

Artifact · checklist

The Coding-Assistant Design Checklist

Layer 1 — Anticipatory fetch

☐ Type-pause detection in IDE (~50 ms).
☐ Pre-fetch ranked context (current file, recently-edited, similar files, imports).
☐ Cache pre-fetched context for the imminent completion request.

Layer 2 — FIM prompting

☐ Model is FIM-trained (non-FIM models lose ~20% on code completion quality).
☐ Prompt structure: prefix + suffix + ranked context blocks.
☐ Context budget tuned per language (longer for Python, shorter for terse languages).

Layer 3 — Speculative cancellation

☐ Every completion request carries a cancellation token.
☐ Serving infrastructure honors cancellations within ~20 ms.
☐ Cancelled-but-paid-for GPU time is a tracked metric (waste indicator).

Layer 4 — Privacy-preserving tenancy

☐ Cloud multi-tenant (Tier 1): no training on customer data, strong promises, audit logs.
☐ Single-tenant cloud (Tier 2): customer instance, customer data plane.
☐ Fully on-prem (Tier 3): no vendor cloud dependencies.
☐ Architecture designed for most-restrictive tier from day one.

Layer 5 — Acceptance feedback loop

☐ Accept / edit / reject captured per completion.
☐ Prompt + completion + edit-state context captured for retraining.
☐ Bias-aware sampling: don't only retrain on accepted completions (that's overfitting to current strengths).
☐ Per-tenancy logging that respects the privacy posture (Tier 3 may not export labels at all).
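
The Layer 5 items can be illustrated with a record that carries the fields the checklist names and a sampler that keeps accepted completions from dominating the retraining batch. Stratifying by outcome is only one possible reading of "bias-aware sampling"; the `AcceptanceEvent` and `sample_for_training` names and the fixed per-outcome quota are assumptions made for this sketch.

```python
import random
from dataclasses import dataclass


@dataclass
class AcceptanceEvent:
    prompt: str
    completion: str
    edit_state: str       # surrounding edit context at decision time
    outcome: str          # "accept", "edit", or "reject"
    ms_to_decision: int   # time-to-accept-or-reject


def sample_for_training(events, per_outcome, seed=0):
    """Draw up to per_outcome events from each outcome so accepts do not dominate."""
    rng = random.Random(seed)
    by_outcome = {"accept": [], "edit": [], "reject": []}
    for event in events:
        by_outcome[event.outcome].append(event)
    batch = []
    for group in by_outcome.values():
        # rng.sample draws without replacement; cap at the group size.
        batch.extend(rng.sample(group, min(per_outcome, len(group))))
    return batch
```

With 10 accepts, 3 edits, and 2 rejects in the log, a quota of 2 per outcome produces a batch of 6 with equal representation. For a Tier 3 deployment, a sampler like this would have to run inside the customer boundary, or not at all.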

Post-mortem · anonymized

Setup

Series B coding assistant startup, 30 engineers. Built a cloud-native architecture optimized for inference cost and developer velocity. Two years in, signed first enterprise customer requiring on-premise deployment.

What happened

The on-prem deployment took 14 months and required substantial re-architecture. The cloud-native assumptions (managed inference endpoints, vendor-specific compute, centralized telemetry pipeline, multi-tenant model routing) did not translate to on-prem. The team built a parallel deployment topology that lagged the cloud version by ~6 months on model updates. The enterprise customer was furious. A larger enterprise customer in the pipeline canceled because they wanted on-prem and the team couldn't commit to a timeline.

The moment

The retrospective conclusion was that the original architecture had been optimized for the wrong customer. Cloud multi-tenant was the cheapest deployment but represented the smallest enterprise opportunity. The team had implicitly bet that they could retrofit on-prem when needed; the bet was wrong by 12+ months. Two years of cloud-native architectural decisions were difficult to undo once they had accumulated.

What they should have said

At founding: 'The enterprise coding-assistant market segments by privacy posture. We will need to deploy on-prem to win the customers that pay for it. The architecture should be designed for on-prem from day one; the cloud version is a deployment mode of the same architecture, not a separate product. This adds ~30% engineering cost upfront and saves years of retrofitting cost later. We accept the upfront cost.' That decision at founding would have produced a product that scales to enterprise smoothly. Deferring the decision produced a product that did not.

Lesson

When the customer market segments by privacy posture, the privacy posture writes the architecture. Cloud-first then on-prem-retrofit is a strategic mistake for products targeting enterprise. Design for the most-restrictive deployment from day one; the cloud version becomes a special case of the on-prem architecture, not vice versa. The Ghost-Text Completion Stack's privacy layer is not a feature — it is the architectural commitment that determines which customers you can win.