AI Coding Assistant (Copilot-scale) — Simulated Interview
Copilot-class coding assistants serve completions inline as the user types, with sub-100 ms perceived latency. The Ghost-Text Completion Stack names the five layers (anticipatory fetch, FIM prompting, speculative cancellation, privacy-preserving tenancy, acceptance feedback) and shows where the architectural decisions live.
Coding assistants are the most demanding latency product in the AI space. Users perceive completions slower than ~150 ms as 'broken': the cursor sits idle, the user keeps typing, and the completion is stale by the time it arrives. Beat the perceived latency budget and the product feels magical; miss it and the feature gets uninstalled. The 100 ms p99 SLA is not a stretch goal; it is the difference between a usable product and a demo.
The Ghost-Text Completion Stack is the architecture that makes the 100 ms budget achievable. Five layers, each addressing a constraint that the obvious approach (LLM + IDE plugin + prompt-with-context) fails to satisfy. The interview prompt 'design Copilot' usually surfaces all five constraints during follow-up questioning; candidates who name the layers up front are doing in the first three minutes what the interviewer would have spent twenty minutes drawing out. The framework's job is to make the constraints structural, not surprises.
The Ghost-Text Completion Stack
Copilot-class coding assistants serve completions in <100 ms p99 because users perceive any slower latency as 'broken.' The Ghost-Text Completion Stack is the five-layer architecture that makes that latency budget achievable: aggressive context prediction, fill-in-the-middle (FIM) prompt construction, speculative cancellation of in-flight requests when the user keeps typing, privacy-preserving repo access that doesn't ship the user's source code to a vendor by default, and acceptance feedback captured as a training signal. Each layer has its own architectural commitment; missing any one produces a system that's either slow, wrong, or a compliance incident.
- Layer 1: Context prediction (anticipatory fetch). When the user pauses for ~50 ms after typing, the assistant fetches relevant context: other relevant functions in the current file, recently edited files, declared imports, and similar files in the repo. The fetch happens before the user explicitly requests a completion. This anticipatory fetch is what makes the inline latency budget achievable; a reactive fetch at completion-request time spends the budget on context lookup.
- Layer 2: Fill-in-the-middle (FIM) prompt construction. Code completion is not text completion. The model needs the prefix (what comes before the cursor), the suffix (what comes after), and relevant context (other functions, imports). FIM-trained models accept this structure natively; non-FIM models can be coaxed into it with prompt engineering but produce worse completions. Architectural commitment: the model choice is constrained to FIM-capable models.
- Layer 3: Speculative cancellation. Users keep typing. When the user types one more character, the in-flight completion is now wrong. Cancel it cheaply and start a new one. The cancellation has to be cheap because users type at ~200 ms intervals, so every in-flight inference that is not cancelled promptly is wasted GPU. Architecturally: requests carry a cancellation token; serving infrastructure honors cancellations within ~20 ms.
- Layer 4: Privacy-preserving repo access. Enterprise customers will not ship source code to a vendor. The architecture must support a tenancy model: single-tenant deployment on customer infrastructure, federated retrieval where embeddings leave but raw code does not, or on-device inference for the most sensitive customers. The privacy posture is not a feature; it is the precondition for selling the product to any company larger than ~50 engineers.
- Layer 5: Acceptance feedback as training signal. Users implicitly label every completion: accepted (typed past it), partially accepted (edited it), rejected (deleted it, kept typing). This signal is the highest-quality training data the coding assistant team has. The architectural commitment: capture acceptance signals with enough context to retrain, meaning the prompt, the completion, the surrounding edit state, and the time-to-accept-or-reject. Most coding assistant teams underuse this signal because the logging path was bolted on instead of designed in.
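Layer 3 is the easiest of the five to show in code. The following is a minimal client-side sketch, assuming an asyncio-based plugin host; `CompletionSession`, `fake_infer`, and `demo` are hypothetical names used only for illustration, and the 50 ms sleep stands in for model latency. A real serving stack would also need to propagate the cancellation to the GPU worker, which this sketch does not attempt.

```python
import asyncio


class CompletionSession:
    """Keeps at most one in-flight completion; a new keystroke cancels the old one."""

    def __init__(self, infer):
        self._infer = infer      # hypothetical async callable: prefix -> completion
        self._task = None
        self.cancelled = 0       # how many in-flight requests were superseded

    def on_keystroke(self, prefix: str) -> asyncio.Task:
        if self._task is not None and not self._task.done():
            self._task.cancel()  # the in-flight result is stale once the prefix changes
            self.cancelled += 1
        self._task = asyncio.create_task(self._infer(prefix))
        return self._task


async def fake_infer(prefix: str) -> str:
    await asyncio.sleep(0.05)    # stand-in for model latency (assumed value)
    return prefix + "_completed"


async def demo():
    session = CompletionSession(fake_infer)
    first = session.on_keystroke("pri")
    await asyncio.sleep(0.01)    # the user types again before the result arrives
    second = session.on_keystroke("prin")
    result = await second
    return first.cancelled(), result, session.cancelled
```

Running `asyncio.run(demo())` shows the first request cancelled and only the second one delivering a result. The cancellation token in the text plays the role that the `asyncio.Task` handle plays here.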
Apply the framework to any coding-assistant, IDE-integration, or low-latency-completion product interview. The framework also applies to in-app suggestions (Gmail Smart Compose, Notion AI inline) where the latency budget is similarly tight and the context prediction lever is similar.
Senior answer to 'design Copilot-class assistant': 'LLM with code training, integrate with IDE, prompt with surrounding context.' Staff answer: 'Five layers. Anticipatory fetch on type-pause — without it, reactive context lookup costs the latency budget. FIM-capable model — non-FIM models produce worse completions, so the model choice is constrained. Speculative cancellation — users keep typing, so in-flight inferences must be cheap to cancel. Privacy-preserving repo access — enterprise won't ship code to a vendor, so the tenancy model is mandatory not optional. Acceptance feedback loop — implicit user labels are the training data, captured by design not bolted on. The 100 ms budget forces anticipatory + speculative; the enterprise market forces privacy; the FIM and feedback layers determine completion quality. Each layer is a different design constraint.'
How do you handle the privacy concerns of enterprise customers who don't want their source code shipped to your servers?
This question rules out 60% of theoretically-correct candidate answers because most candidates have not thought about the enterprise tenancy model.
Add encryption in transit and at rest. Promise we don't train on customer code.
Offer a no-training-on-data tier. SOC 2 and similar compliance. Possibly air-gapped on-premise deployment for the most sensitive customers.
Three-tier tenancy model. Tier 1: cloud multi-tenant with strong promises (no training, encryption, audit logs); fine for small teams and individual developers. Tier 2: single-tenant cloud — customer's instance, customer's data plane, vendor's control plane; fits most enterprise. Tier 3: fully on-premise or VPC-deployed; fits regulated industries. The architecture has to support all three with the same product surface; that constrains the model choice (must be deployable in customer infra) and the inference stack (must run without vendor cloud dependencies for Tier 3).
Same three-tier model with the meta-acknowledgment that the privacy posture is the product strategy, not a feature checkbox. Enterprise customers will pay 10× for Tier 2/3 deployment, and the vendors that ship Tier 2/3 will dominate the enterprise market regardless of who has the best Tier 1 product. The architectural decision is therefore: do we make the model and inference stack deployable in customer infrastructure from day one, or do we build for cloud-only first and retrofit later? The latter is the canonical mistake — retrofitting on-prem deployment onto a cloud-native architecture costs years and frequently produces a degraded product. The right move is to design for the most-restrictive tenancy from the start, and let the cloud version be a deployment mode of the same architecture, not a separate product. The pattern: when the customer market segments by privacy posture, the privacy posture writes the architecture. Same shape as 'latency budget writes the architecture' from earlier lessons.
Named that privacy posture writes the architecture and that retrofitting on-prem is the canonical mistake. Connected back to the 'constraints write the architecture' pattern from earlier lessons. The L7 move is recognizing that enterprise market segmentation is itself a design constraint.
Copilot-class coding assistant. The IDE-side components handle anticipatory fetch and speculative cancellation; the model service handles FIM-aware completion; the tenancy layer determines deployment topology; the feedback loop captures acceptance signals.
Inline completion latency budget for a 100 ms p99. The anticipatory fetch shifts context lookup outside the visible budget; FIM prompting keeps the model small enough to fit; speculative decoding accelerates decode.
Practice this. Time yourself.
You have 12 minutes. Your coding assistant's p99 inline completion latency just regressed from 90 ms to 160 ms after a model update. The new model produces higher-quality completions (acceptance rate up 5%). The product team is asking you to ship it anyway. Walk through: (1) why the latency regression is product-killing despite quality being better, (2) the three things you'd try to recover latency without giving up quality, (3) what you'd tell product if none of them recovered the budget.
Self-assessment rubric
| Dimension | Weak | Passing | Strong | Staff bar |
|---|---|---|---|---|
| Why latency dominates | 'Users like it fast.' | User-perception threshold is ~150 ms; 160 ms is past the cliff. | Same plus: at 160 ms, users race the completion by typing — the in-flight result is wrong by arrival, acceptance rate measurement is corrupted, the 5% quality gain disappears in production. | Same plus: connected to the perception-vs-measurement gap. The offline 5% acceptance gain is on a benchmark that doesn't include the race condition; in production with the slower model, the acceptance rate measurement itself is affected by the user behavior change. |
| Three latency interventions | Generic 'optimize.' | Quantization, speculative decoding, smaller model. | Ranked: (1) speculative decoding tuning (no quality loss), (2) anticipatory fetch tightening (pre-fetch more context so prompt assembly drops), (3) FIM-specific batching (cluster requests by language for better batch efficiency). | Same plus: noted that quantization is the wrong instinct here (Lesson 2.1 — memory fix, not latency fix), and that the model change itself may be wrong direction — should consider rolling back and shipping the previous model's architecture with the new training data instead. |
| Pushback to product | 'We can't ship this.' | Articulated the user-experience cliff with data. | Same plus: proposed a measurement-aware experiment — A/B the new model on willing users for two weeks, measuring acceptance rate AND retention. If retention drops, latency cliff confirmed. | Same plus: named the willingness-to-trade ratio question — product needs to commit to a quality-vs-latency ratio explicitly. 'How much quality regression are we willing to accept to recover 60 ms of latency?' That ratio is what tells us whether to ship a smaller model variant. |
Common failures
- ✗ Treated this as a pure latency optimization problem without naming the perceptual cliff.
- ✗ Suggested quantization. Wrong fix for TTFT-bound systems.
- ✗ Proposed acceptance rate rather than retention as the measurement. Acceptance rate is corrupted by the latency regression itself.
- ✗ Did not name the willingness-to-trade ratio question for product. That's the conversation product owes the team before more experiments.
The Coding-Assistant Design Checklist
Layer 1 — Anticipatory fetch
- ☐ Type-pause detection in IDE (~50 ms).
- ☐ Pre-fetch ranked context (current file, recently-edited, similar files, imports).
- ☐ Cache pre-fetched context for the imminent completion request.
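The three items above can be sketched as a small prefetcher. This is an illustration under stated assumptions, not a reference design: `ContextPrefetcher`, its `fetch` callable, and the timer-driven `on_tick` hook are hypothetical, timestamps are passed in explicitly so the logic is easy to test, and invalidating the cache on every keystroke is one possible policy among several.

```python
from dataclasses import dataclass, field
from typing import Callable

PAUSE_MS = 50  # type-pause threshold taken from the checklist


@dataclass
class ContextPrefetcher:
    fetch: Callable[[str], list]   # hypothetical: file path -> ranked context snippets
    cache: dict = field(default_factory=dict)
    last_key_ms: float = 0.0
    fetches: int = 0

    def on_keystroke(self, path: str, now_ms: float) -> None:
        self.last_key_ms = now_ms
        self.cache.pop(path, None)  # assumed policy: an edit invalidates cached context

    def on_tick(self, path: str, now_ms: float) -> None:
        """Called from an editor timer; prefetches once the user has paused."""
        paused = now_ms - self.last_key_ms >= PAUSE_MS
        if paused and path not in self.cache:
            self.cache[path] = self.fetch(path)
            self.fetches += 1

    def context_for_completion(self, path: str) -> list:
        # Cache hit: context lookup stays off the critical path.
        # Cache miss: fall back to the slower reactive fetch.
        if path not in self.cache:
            self.cache[path] = self.fetch(path)
            self.fetches += 1
        return self.cache[path]
```

When the completion request arrives after a pause, `context_for_completion` returns from the cache without calling `fetch` again, which is the point of the layer: the lookup cost is paid before the visible latency budget starts.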
Layer 2 — FIM prompting
- ☐ Model is FIM-trained (non-FIM models lose ~20% on code completion quality).
- ☐ Prompt structure: prefix + suffix + ranked context blocks.
- ☐ Context budget tuned per language (longer for Python, shorter for terse languages).
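A sketch of the prompt structure above, with loud caveats: the sentinel strings `<PRE>`, `<SUF>`, and `<MID>` are placeholders (each FIM-trained model defines its own special tokens), the budget is counted in characters as a stand-in for tokens, and the half/quarter split between prefix and suffix is an assumed default rather than a recommendation. `build_fim_prompt` is a hypothetical helper.

```python
def build_fim_prompt(prefix, suffix, context_blocks, budget_chars=2000,
                     pre="<PRE>", suf="<SUF>", mid="<MID>"):
    """Assemble ranked context + prefix + suffix within a size budget."""
    # Keep the text nearest the cursor: the tail of the prefix, the head of the suffix.
    prefix_keep = min(len(prefix), budget_chars // 2)
    prefix_part = prefix[len(prefix) - prefix_keep:]
    suffix_part = suffix[: budget_chars // 4]

    # Spend whatever is left on context blocks, assumed to be ranked best-first.
    remaining = budget_chars - len(prefix_part) - len(suffix_part)
    chosen = []
    for block in context_blocks:
        if len(block) > remaining:
            break
        chosen.append(block)
        remaining -= len(block)

    head = "\n".join(chosen) + "\n" if chosen else ""
    return f"{head}{pre}{prefix_part}{suf}{suffix_part}{mid}"
```

Tuning the budget per language then amounts to passing a different `budget_chars` (or a token-based equivalent) for each language the plugin supports.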
Layer 3 — Speculative cancellation
- ☐ Every completion request carries a cancellation token.
- ☐ Serving infrastructure honors cancellations within ~20 ms.
- ☐ Cancelled-but-paid-for GPU time is a tracked metric (waste indicator).
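The waste metric and the ~20 ms cancellation target can be computed from per-request records. The following is a minimal sketch; `RequestRecord` and `waste_report` are hypothetical names, and it assumes the serving layer already logs GPU time and the delay between a cancel signal and the moment work actually stopped.

```python
from dataclasses import dataclass
from typing import Optional

CANCEL_DEADLINE_MS = 20.0  # "~20 ms" target from the checklist


@dataclass
class RequestRecord:
    gpu_ms: float                              # GPU time the request consumed
    cancel_latency_ms: Optional[float] = None  # None: the request ran to completion


def waste_report(records):
    """Summarise cancelled-but-paid-for GPU time and slow cancellations."""
    total = sum(r.gpu_ms for r in records)
    cancelled = [r for r in records if r.cancel_latency_ms is not None]
    wasted = sum(r.gpu_ms for r in cancelled)
    late = sum(1 for r in cancelled if r.cancel_latency_ms > CANCEL_DEADLINE_MS)
    return {
        "waste_ratio": wasted / total if total else 0.0,
        "late_cancellations": late,
    }
```

A rising `waste_ratio` suggests requests are being started too eagerly or cancelled too slowly; `late_cancellations` separates the second cause from the first.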
Layer 4 — Privacy-preserving tenancy
- ☐ Cloud multi-tenant (Tier 1): no training on customer data, strong promises, audit logs.
- ☐ Single-tenant cloud (Tier 2): customer instance, customer data plane.
- ☐ Fully on-prem (Tier 3): no vendor cloud dependencies.
- ☐ Architecture designed for most-restrictive tier from day one.
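One way to keep the three tiers as deployment modes of a single architecture is to describe every deployment with the same record and validate it against tier rules. The sketch below is illustrative only: `Tier`, `Deployment`, and `violations` are hypothetical names, the dependency string in the usage note is made up, and applying the no-training rule to every tier is an assumption (the checklist states it for Tier 1).

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    MULTI_TENANT_CLOUD = 1
    SINGLE_TENANT_CLOUD = 2
    ON_PREM = 3


@dataclass(frozen=True)
class Deployment:
    tier: Tier
    data_plane_owner: str                 # "vendor" or "customer"
    vendor_cloud_dependencies: tuple = ()  # e.g. managed endpoints, central telemetry
    trains_on_customer_data: bool = False


def violations(d: Deployment) -> list:
    """Return the tier rules this deployment breaks (empty list means compliant)."""
    problems = []
    if d.trains_on_customer_data:
        # Assumption: the no-training promise is enforced for every tier.
        problems.append("must not train on customer data")
    if d.tier is not Tier.MULTI_TENANT_CLOUD and d.data_plane_owner != "customer":
        problems.append("data plane must belong to the customer")
    if d.tier is Tier.ON_PREM and d.vendor_cloud_dependencies:
        deps = ", ".join(d.vendor_cloud_dependencies)
        problems.append("on-prem must not depend on vendor cloud: " + deps)
    return problems
```

For example, `violations(Deployment(Tier.ON_PREM, "customer", ("managed-inference-endpoint",)))` returns a single violation naming the dependency. Running a check like this in CI from day one is one way to notice cloud-only assumptions before they accumulate.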
Layer 5 — Acceptance feedback loop
- ☐ Accept / edit / reject captured per completion.
- ☐ Prompt + completion + edit-state context captured for retraining.
- ☐ Bias-aware sampling: don't only retrain on accepted completions (that's overfitting to current strengths).
- ☐ Per-tenancy logging that respects the privacy posture (Tier 3 may not export labels at all).
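The capture and sampling items can be sketched as an event record plus a stratified sampler. This is one possible reading of "bias-aware sampling" (an equal cap per outcome), not the only one; `FeedbackEvent` and `sample_for_training` are hypothetical names, and the sketch leaves out the per-tenancy export rules.

```python
import random
from dataclasses import dataclass


@dataclass
class FeedbackEvent:
    prompt: str
    completion: str
    outcome: str                 # "accept", "edit", or "reject"
    time_to_decision_ms: float
    edit_state: str = ""         # surrounding edit context at decision time


def sample_for_training(events, per_outcome, seed=0):
    """Draw up to `per_outcome` events from each outcome so accepts do not dominate."""
    rng = random.Random(seed)
    by_outcome = {"accept": [], "edit": [], "reject": []}
    for event in events:
        by_outcome[event.outcome].append(event)

    batch = []
    for bucket in by_outcome.values():
        k = min(per_outcome, len(bucket))
        batch.extend(rng.sample(bucket, k))
    return batch
```

Capping each outcome keeps edited and rejected completions in the training mix even when accepts are far more numerous in the raw log.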
Series B coding assistant startup, 30 engineers. Built a cloud-native architecture optimized for inference cost and developer velocity. Two years in, signed first enterprise customer requiring on-premise deployment.
The on-prem deployment took 14 months and required substantial re-architecture. The cloud-native assumptions (managed inference endpoints, vendor-specific compute, centralized telemetry pipeline, multi-tenant model routing) did not translate to on-prem. The team built a parallel deployment topology that lagged the cloud version by ~6 months on model updates. The enterprise customer was furious. A larger enterprise customer in the pipeline canceled because they wanted on-prem and the team couldn't commit to a timeline.
The retrospective conclusion was that the original architecture had been optimized for the wrong customer. Cloud multi-tenant was the cheapest deployment but represented the smallest enterprise opportunity. The team had implicitly bet that they could retrofit on-prem when needed; the bet was wrong by 12+ months. Two years of cloud-native architectural decisions were difficult to undo once they had accumulated.
At founding: 'The enterprise coding-assistant market segments by privacy posture. We will need to deploy on-prem to win the customers that pay for it. The architecture should be designed for on-prem from day one; the cloud version is a deployment mode of the same architecture, not a separate product. This adds ~30% engineering cost upfront and saves years of retrofitting cost later. We accept the upfront cost.' That decision at founding would have produced a product that scales to enterprise smoothly. Deferring the decision produced a product that did not.
When the customer market segments by privacy posture, the privacy posture writes the architecture. Cloud-first then on-prem-retrofit is a strategic mistake for products targeting enterprise. Design for the most-restrictive deployment from day one; the cloud version becomes a special case of the on-prem architecture, not vice versa. The Ghost-Text Completion Stack's privacy layer is not a feature — it is the architectural commitment that determines which customers you can win.