|

AI Engineer Interview Questions: Junior, Mid, and Senior Answers Grounded in Real 2024-2026 Production Failures

AI engineer workspace: a laptop displaying a labeled RAG architecture diagram (User Query → Vector Store → Retriever → LLM → Response) beside an open notebook with a hand-sketched architecture and a printed Anthropic multi-agent research paper.



Interviewers who once asked “explain gradient descent” now open with “tell me about an AI failure you’d worry about in production” and expect you to name Air Canada, Mata v. Avianca, or Samsung — with dates and mitigations. This guide is for software engineers, ML engineers, and data scientists interviewing for AI engineer or ML engineer roles at frontier labs (Anthropic, OpenAI, Cohere, Databricks), AI-native startups, or Big Tech AI teams (Google AI, Meta AI, Microsoft Research). Not an introduction to prompt engineering.

AI engineer interview questions in 2026 test four domains: GenAI/LLM fundamentals (RAG, fine-tuning, embeddings, evaluation), production system design (multi-agent cost tradeoffs, latency optimization, deployment patterns), named-incident postmortem analysis (Air Canada chatbot ruling February 2024, Mata v. Avianca June 22, 2023, Samsung ChatGPT leak March 2023), and MCP tool-use security (Checkmarx 11-risk-category taxonomy, November 2025). Candidates who cite exact figures — Anthropic’s 90.2% multi-agent performance gain, 15x token cost, 80% task-to-task variance — consistently outperform candidates with only framework familiarity.

  1. What does an AI engineer do in 2026?
  2. RAG vs. fine-tuning vs. prompt engineering?
  3. How do you evaluate an LLM-based feature?
  4. How do you choose an embedding model for RAG?
  5. Explain the transformer architecture and attention.
  6. What is tokenization and why does it matter for cost?
  7. What’s a context window — what happens when you exceed it?
  8. What’s a vector database, and how does it differ from a traditional database?
  9. What is the ReAct pattern? Walk me through one cycle.
  10. Chunking strategy for a RAG system over heterogeneous documents?
  11. Set up evaluation-driven development for an LLM feature.
  12. CFO asks why OpenAI bills are $40K — diagnose and reduce cost.
  13. LLM-based search at 8 seconds, product wants <2s.
  14. Design termination criteria for an agent loop.
  15. Multi-agent vs. single-agent — the cost tradeoff.
  16. Deploying a stateful agent with persistent memory.
  17. Security implications of LLM tool-call access via MCP.
  18. Recent agent or alignment research that changed how you build?
  19. An AI feature you shipped that failed in production.

AI Engineer Hiring in 2026: What Actually Changed

By 2026, per Sundeep Teki’s top-lab interview guide (2025-11), approximately 75% of AI engineer technical screens are GenAI or LLM-adjacent, with only 20-25% classical ML — an inversion from 2022. Four specific shifts drove this:

Test Your Knowledge Quick knowledge check
  • MCP rollout as standard knowledge: The Model Context Protocol became widespread across major AI vendors in 2024-2025. Security-aware interviewers now expect you to know what tool poisoning means and how it differs from prompt injection. The Checkmarx MCP Risk Taxonomy (Nov 2025) is the structured frame they probe against.
  • Multi-agent architecture as a standard probe: Following Anthropic’s multi-agent research (June 2025) — 90.2% performance gains, 15x token cost, 80% task-to-task variance — interviewers ask the cost-tradeoff question directly. Candidates without the specific numbers fail it.
  • Rainbow deployments emerging: Stateful agents require deployment patterns distinct from blue-green. Anthropic formally named this “rainbow deployment” in June 2025. It’s now a senior-tier probe at companies running agentic workloads in production.
  • LLMOps maturity: Every senior interview tests production reasoning — prompt versioning, cost optimization, evaluation pipelines, termination criteria. Framework familiarity (LangChain, LlamaIndex) is table stakes, not a differentiator.

Frontier lab screens run 5–7 rounds; applied AI engineer screens 3–4, per Teki (2025-11). See the 30-day prep roadmap below.

What AI Engineer Interviews Actually Test in 2026

Senior AI engineer interviewers cycle through four question types. Glassdoor reviewers of Anthropic’s process note “no LeetCode — walk me through a system you built and the hardest tradeoff you made,” per Glassdoor Anthropic interview reviews:

  • GenAI/LLM fundamentals: RAG vs. fine-tuning vs. prompting, embeddings, evaluation, hallucination mitigation. Minimum viable mid-level answer names the tradeoff axis and at least one production constraint. (Chip Huyen on LLM Engineering)
  • Production system design and cost tradeoffs: multi-agent vs. single-agent (anchored to Anthropic 90.2%/15x/80% figures), latency optimization, deployment patterns. Interviewers want specific levers, not framework names. (Anthropic Multi-Agent Research (June 2025))
  • Named-incident postmortem framing: “Tell me about an AI failure you’d worry about in production.” Tests whether you analyze rather than moralize and name specific mitigations. Probed incidents: Air Canada (February 2024), Mata v. Avianca (June 22, 2023), Samsung (March 2023), DPD (January 18, 2024). (Stanford Law Hallucinations Study (2024))
  • MCP/tool-use security: The Checkmarx MCP Risk Taxonomy (Nov 2025) — 11 attack categories including tool poisoning, rug pulls, cross-server data exfiltration. SAST tooling misses these because the attack surface is natural language, not code.

What does an AI engineer actually do in 2026, and how is the role different from an ML engineer or a software engineer?

Concept: role definition and scope | Difficulty: junior/mid | Stage: recruiter/technical

Direct answer: An AI engineer builds production systems with LLMs — RAG pipelines, agentic workflows, evaluation harnesses, LLMOps infrastructure. Distinct from an ML engineer (trains and optimizes models) and a software engineer (general-purpose software). The 2026 split: ~75% GenAI-adjacent (API integration, RAG, agents, prompt engineering), 25% classical ML. AI engineers don’t train foundation models from scratch — but they fine-tune, evaluate, and operate them. (Chip Huyen on LLM Engineering)

What they’re really probing: Whether you understand the application-layer boundary. The tell: candidates who drift into training loop hyperparameters or gradient accumulation are describing ML engineer territory, not AI engineer.

ML engineers own the model layer (training loops, hyperparameters, metrics). AI engineers own the application layer (retrieval pipeline, prompt versioning, agent tools, evaluation harness).

When would you use RAG vs. fine-tuning vs. prompt engineering?

Concept: cost, freshness, and control matrix | Difficulty: junior/mid | Stage: technical

Direct answer: Three axes: data freshness, upfront cost, format consistency. Prompt engineering first — zero cost, no infra, works when model has required knowledge. RAG when data changes faster than retraining cadence, or hallucination reduction is a hard requirement. Fine-tuning for consistent format prompting can’t lock in, 1,000+ examples, or distilling into a smaller model — but fine-tuning doesn’t reduce per-token inference cost on the fine-tuned model. (Chip Huyen on LLM Engineering) For a deeper dive into the open-source inference engine behind many of these production deployments, see our vLLM interview questions guide.

What they’re really probing: Whether you default to fine-tuning (cost-naive) or articulate the RAG-first reasoning.

Production rule of thumb: when hallucination rate on novel recall exceeds your golden-set threshold, RAG is the lower-cost fix — no retraining cycle, retrieval source auditable.

How do you evaluate an LLM-based feature?

Concept: evaluation methodology | Difficulty: mid | Stage: technical/system-design

Direct answer: Four layers: (1) Automated metrics (BLEU, ROUGE, semantic similarity) — regression gates, not ground truth; (2) LLM-as-judge calibrated against a human golden set — scalable but requires calibration; (3) Human golden set — expensive but necessary ground truth; (4) Production monitoring (hallucination rate, refusal rate, user correction rate) — catches distribution shift golden-set evals miss. MMLU caveat: scores vary ~5 percentage points from formatting alone. (Anthropic Evaluating AI Systems (2023))

What they’re really probing: Whether you’ve shipped an evaluation pipeline, or think “run it on a benchmark” equals production readiness.

Name all four layers and cite a production signal — refusal rate, user correction rate — that golden-set evals structurally miss.

How do you choose an embedding model for a RAG system?

Concept: RAG architecture decision | Difficulty: mid | Stage: technical/system-design

Direct answer: Choosing an embedding model for a RAG system depends on four axes: dimensionality vs. cost; domain specificity (general models underperform on legal, biomedical, code-heavy corpora); multilingual coverage (cross-lingual models avoid per-language indexes); vendor lock-in vs. self-hosted cost. Industry benchmarks like MTEB measure standard English retrieval — they don’t predict domain-specific performance. Measure on your actual corpus before committing. (Chip Huyen on LLM Engineering)

What they’re really probing: Whether you name a specific model with benchmark numbers before explaining domain-fit — signals memorized vendor blog posts, not retrieval-task analysis.

Strong answers cover the choice axes (dimensions, cost, domain-fit, lock-in) before naming any model.

AI Engineering Reference Stack (2026)

Senior interviewers expect you to place tools in the correct layer of the stack and articulate when each applies. The table below covers the layers with primary-research backing — it is not exhaustive of the ecosystem. Practitioner-consensus alternatives in the same layers (LlamaIndex for orchestration, Pinecone/Weaviate/Chroma for vector search, RAGAS/Phoenix for evaluation) are widely used but not individually defended here; treat them as capable alternatives in their respective layer unless you have specific reasons to prefer one.

Layer Tool / Standard When to Use Common Interview Probe
Model training PyTorch Research and production training; dominant at frontier labs; dynamic computation graph preferred for experimentation “Why PyTorch over TensorFlow for a new training project?”
Model training TensorFlow Production deployment pipelines, TFX integration, mobile/edge via TFLite; legacy advantage in enterprise “What would make you choose TensorFlow over PyTorch?”
Inference / model hosting Hugging Face Transformers Loading and running open-weight models; broad model zoo coverage; fine-tuning with PEFT/LoRA “How would you run inference on a 7B parameter open-weight model on a single GPU?”
Agent orchestration LangChain Rapid prototyping of RAG pipelines and agentic workflows; large ecosystem of integrations; not ideal for custom low-latency production agents “When would you move off LangChain for a production agent?”
Tool-use protocol MCP — Model Context Protocol Standardized tool-call interface across LLM providers; enables portable agent tool sets; requires understanding of 11-category security risk taxonomy “What security risks does MCP introduce that traditional REST APIs don’t?” (Checkmarx MCP Risk Taxonomy (Nov 2025))
Foundation models OpenAI / Anthropic / Cohere APIs Prototyping, production inference when managed API cost is acceptable; Anthropic for agents and alignment-sensitive deployments; Cohere for enterprise retrieval “How do you decide which foundation model API to use for a new production feature?”
Evaluation LLM-as-judge + golden-set human eval Scalable quality scoring combined with human ground truth; calibrate LLM judge against human labels before production use “How do you know your LLM-as-judge is calibrated?” (Anthropic Evaluating AI Systems (2023))

Junior-Tier Questions: Definitions, Architectures, and Frameworks (0-2 Years)

These five questions appear consistently across junior and early-career screens at both frontier labs and applied AI teams. (Chip Huyen on LLM Engineering) Deeper-context notes show what separates juniors from mids.

Explain the transformer architecture and why attention matters.

Concept: foundational architecture | Difficulty: junior | Stage: technical

Direct answer: A transformer processes input as a token sequence, encodes each token as an embedding, then applies self-attention — each position attends to every other simultaneously (not sequentially, as in RNNs). Attention computes query-key dot products, scales and softmaxes them, produces a weighted combination of value vectors. Multi-head attention runs this in parallel across multiple subspaces. Why it matters for production: attention enables long-range dependency resolution, and its O(n²) complexity with sequence length is why context windows carry a real cost ceiling. (OpenAI o1 Reasoning (Sept 2024))

What they’re really probing: Whether you connect attention to production decisions — context window cost, chunking strategy — not just textbook definitions.

The follow-up: “What’s the computational complexity of self-attention?” O(n²) with sequence length — doubling context window quadruples attention compute, which is exactly why large context windows are expensive and why chunking strategies exist.

What is tokenization, and why does it matter for cost?

Concept: token economics | Difficulty: junior | Stage: technical

Direct answer: Tokenization splits text into sub-word units (tokens) via algorithms like Byte-Pair Encoding (BPE). Tokens aren’t words: “unbelievable” might be 3-4 tokens; a short English word might be one. LLM APIs price on input tokens + output tokens, making tokenization the root of every cost decision. Per Chip Huyen, the four cost drivers are: input token count, output token count, API call count, and model tier. Verbose prompts, long few-shot examples, and large retrieved chunks all inflate input tokens. English averages ~0.75 tokens per word; code and multilingual text tokenize less efficiently.

What they’re really probing: Whether you treat the LLM API as a token economy requiring cost modeling, not a black box.

Prompt engineering reduces cost more than fine-tuning — fine-tuning doesn’t reduce per-token inference cost on the model itself. Design system prompts with token budget in mind: compress, cache, and version. If your target list includes NVIDIA, our NVIDIA interview tracks segments questions by hiring track (CUDA SW, TensorRT inference, DL framework, ASIC, applied research) with Blackwell-era anchors.

What’s a context window, and what happens when you exceed it?

Concept: context window limits and RAG motivation | Difficulty: junior | Stage: technical

Direct answer: A context window is the maximum token count (input + output) for a single inference call. When exceeded: truncation (simple, lossy — model silently loses the truncated content), chunking and retrieval (RAG pattern — retrieve relevant subset rather than stuffing all content), or sliding window with overlap (sequential processing, higher cost). Silent quality degradation is the real production risk: the model doesn’t know what it didn’t see. RAG directly addresses this by retrieving the relevant chunks rather than truncating. (Anthropic Building Effective Agents (Dec 2024))

What they’re really probing: Whether context windows are architectural motivation for you, not just a trivia fact.

100K+ context windows don’t eliminate chunking — they raise the ceiling but not the compute cost. A 200K context window still costs 200K tokens of compute. Retrieval is structurally cheaper than stuffing.

What’s a vector database, and how does it differ from a traditional database?

Concept: RAG infrastructure — storage layer | Difficulty: junior | Stage: technical

Direct answer: A vector database stores high-dimensional embeddings and retrieves by geometric proximity using approximate nearest-neighbor (ANN) algorithms. A relational database stores structured rows retrieved by exact-match queries (B-tree index). The RAG distinction: you retrieve documents by semantic similarity to a query embedding, not by keyword match — traditional databases can’t do this at scale. Vector DBs are retrieval tools, not persistence layers; they sit between your document corpus and the LLM context window. (Anthropic Building Effective Agents (Dec 2024))

What they’re really probing: Whether you grasp why RAG pipelines need a dedicated retrieval store — not just “we store embeddings somewhere.”

Follow-up: “When would you use hybrid search?” Hybrid (dense + sparse BM25) outperforms pure vector search on rare-term queries — BM25 handles exact vocabulary that dense embeddings smooth over.

What is the ReAct pattern? Walk me through one cycle.

Concept: agentic workflow patterns | Difficulty: junior/mid | Stage: technical

Direct answer: ReAct is one of five agentic workflow patterns in Anthropic’s taxonomy (Dec 2024). It interleaves Reasoning (model thought about next step) with Acting (tool call) and Observation (tool return). One cycle: model sees task → generates “Thought: I need to search” + “Action: search(query)” → environment executes tool → model receives “Observation: result” → continues or terminates. Critical constraint: ReAct is single-agent and sequential. The full taxonomy: prompt chaining, routing, parallelization, orchestrator-workers, evaluator-optimizer.

What they’re really probing: Whether you know ReAct is one specific pattern, not synonymous with “agent” — a conflation that reveals shallow agentic systems experience.

Senior follow-up: “What’s wrong with ReAct for long-horizon tasks?” Sequential reasoning accumulates latency, has no parallelism, no evaluation loop. Evaluator-optimizer closes the quality-loop gap; orchestrator-workers adds parallelism. (Anthropic Building Effective Agents (Dec 2024))

Mid-Tier Questions: Production Patterns and System Design (2-5 Years)

These five questions probe whether you’ve shipped production LLM systems and reasoned through cost, latency, and evaluation tradeoffs — not just whether you know the frameworks. (Chip Huyen on LLM Engineering)

How do you handle the chunking-strategy choice for a RAG system over heterogeneous documents?

Concept: RAG architecture depth | Difficulty: mid | Stage: system-design

Direct answer: Heterogeneous documents require a document-type-aware chunking strategy. Primary choices: fixed-size chunking (simplest, ignores semantic boundaries, works for uniform prose); semantic chunking (splits at sentence/paragraph boundaries, better retrieval quality, higher processing cost); hierarchical chunking (parent-child: large parent chunks for generation context, small child chunks for retrieval precision); document-type-specific chunking (code by function scope; PDFs by section header; tables as single units). Fixed-size chunking on heterogeneous documents produces mid-sentence splits on technical prose, scattered table rows, and split code functions — retrieval quality degrades sharply. (Chip Huyen on LLM Engineering)

What they’re really probing: Whether you’ve dealt with the chunking problem in production, or whether “chunk size 512 tokens” is your answer.

Compare chunking strategies with retrieval metrics (MRR, NDCG) on a labeled set from your actual corpus. Ship the chunker, then evaluate it — not the reverse.

Walk me through how you’d set up evaluation-driven development for an LLM feature.

Concept: LLMOps and evaluation pipeline | Difficulty: mid | Stage: system-design

Direct answer: Define the eval harness before the first prompt iteration, not after. Setup sequence: (1) define evaluation dimensions (factual accuracy, format compliance, refusal rate, latency SLO); (2) build a golden set — 50-200 human-labeled examples spanning expected distribution plus edge cases; (3) wire automated metrics as CI gate; (4) calibrate LLM-as-judge on a golden-set subset; (5) ship first prompt with version tag; (6) every prompt change triggers the eval suite before merge. Prompt versioning is a production prerequisite: minor wording changes shift output distributions, per Chip Huyen’s analysis. (Anthropic Evaluating AI Systems (2023))

What they’re really probing: Whether you treat prompts as versioned code artifacts or free-text you iterate on informally.

LLM features drift in production — providers update models silently, distributions shift, edge cases accumulate. Continuous evaluation detects degradation before users do.

You’ve shipped an LLM feature and your CFO is asking why monthly OpenAI bills are $40K. Walk me through how you’d diagnose and reduce cost.

Concept: LLM cost modeling and optimization | Difficulty: mid | Stage: system-design

Direct answer: Four cost drivers per Chip Huyen: input token count, output token count, number of API calls, model tier. Log all four per-request first. Optimization sequence: (1) Prompt caching via KV cache reuse — cache system prompts and few-shot examples before any model-level changes; highest single ROI lever. (2) Prompt compression — cut redundant phrasing, shorten retrieved chunks. (3) Model tier routing — simple classification to cheaper smaller model, complex reasoning to frontier. (4) Output length control — max_tokens constraints and explicit format instructions. (5) Speculative decoding (self-hosted): small draft model proposes tokens, large verifier accepts/rejects in parallel — 2-3x throughput gain.

What they’re really probing: Whether you diagnose before switching models, or immediately reach for “use a cheaper model.”

Prompt caching and tier routing together cut 40-60% of costs before touching model choice.

Your LLM-based search returns in 8 seconds and product wants <2s. What do you do?

Concept: LLM latency optimization | Difficulty: mid | Stage: system-design

Direct answer: Profile first: split total latency into retrieval, inference, and post-processing. By layer — Retrieval: reduce chunk count (fewer chunks = shorter context = faster generation), pre-filter with structured metadata. Inference: streaming output (time-to-first-token matters for perceived latency), KV cache reuse on repeated system prompt tokens, INT8 quantization (~2x memory reduction, minimal quality loss). Architecture: route to a faster smaller model if quality permits; semantic caching for repeated queries. If still over budget: cap output length, reduce retrieved context, or renegotiate the SLO. (Chip Huyen on LLM Engineering)

What they’re really probing: Whether you profile before optimizing, and understand the latency surface across retrieval, inference, and architecture — not just “make the model faster.”

Streaming and semantic caching often reduce perceived latency without touching generation time. Know when to renegotiate the SLO.

How do you design the termination criteria for an agent loop? What goes wrong without them?

Concept: agent safety and cost control | Difficulty: mid/senior | Stage: system-design

Direct answer: Without explicit termination criteria, an agent loop runs until it exceeds your context window or token budget. Four termination criteria: (1) Task completion signal — agent self-assessment, necessary but insufficient (agents hallucinate completion); (2) Maximum step count — hard cap on reasoning steps; (3) Token budget — hard cap per loop invocation; (4) Human-in-the-loop checkpoints before irreversible actions. Cost risk without these: per Anthropic (June 2025), agents use ~15x more tokens than chat on equivalent tasks with 80% task-to-task variance. The “minimal footprint” principle from Anthropic’s agent framework (Dec 2024): request only permissions needed, prefer reversible actions, confirm when uncertain.

What they’re really probing: Whether you’ve debugged a runaway agent loop in production, or you reason about agents only in the happy-path case.

The evaluator-optimizer pattern provides external termination — a critic agent assesses quality before continuing, sidestepping self-assessment (unreliable for the same reason an LLM confirms its own fabrications). For AI engineer candidates probing real-world runaway incidents in coding agents specifically, the OpenAI Codex CLI job interview prep grounded in –full-auto postmortems covers the documented blast radius scenarios from Codex CLI’s fully autonomous mode — a practitioner-sourced frame for terminal-agent termination design that interviewers at GPT-5.5-era AI coding shops now probe directly.

Senior-Tier Questions: Architecture Decisions and Recent Research (5+ Years)

These six questions require specific numbers, named research, and production postmortem fluency — not framework familiarity. (Anthropic Multi-Agent Research (June 2025))

When would you choose multi-agent over single-agent? Walk me through the cost tradeoff.

Concept: multi-agent architecture decision | Difficulty: senior | Stage: system-design

Direct answer: Per Anthropic’s June 2025 multi-agent research: multi-agent systems outperform single-agent by 90.2% — but at 15x token cost and 80% task-to-task variance. That variance is the underappreciated risk; per-task cost estimation is unreliable without profiling real workloads. Before reaching for multi-agent, apply Anthropic’s 5-pattern taxonomy (Dec 2024) — covered in the ReAct section above — the 5 patterns help you decide which orchestration shape fits the work BEFORE you reach for multi-agent. Multi-agent is justified when (a) work is genuinely parallelizable, (b) 80% variance is acceptable given your cost ceiling, (c) you’ve profiled real workloads.

What they’re really probing: Whether you cite the actual Anthropic numbers, and whether you know when NOT to use multi-agent.

The senior signal: “Multi-agent only when parallelizable, variance acceptable, cost ceiling negotiated — otherwise orchestrator-workers first.” Without the 15x/80% figures, you haven’t justified the bill.

How would you deploy a stateful agent with persistent memory? What’s wrong with blue-green for this case?

Concept: stateful agent deployment patterns | Difficulty: senior | Stage: system-design

Direct answer: Blue-green routes all traffic instantaneously — which breaks for stateful agents where sessions carry accumulated context across interactions. Cutting over a mid-flight session may corrupt memory or cause the new version to misinterpret accumulated state. The rainbow deployment pattern, formally documented by Anthropic (June 2025): run old and new versions simultaneously during migration; existing sessions complete on old version, new sessions start on new. Once all old sessions conclude, retire the old version. More operationally complex than blue-green — two live versions, potentially different memory schemas — but the correct pattern for preserving session integrity for stateful agents.

What they’re really probing: Whether you’ve thought about statefulness in agent deployment, or would naively apply standard web-service patterns.

Memory schemas must be versioned alongside agent logic. Rainbow deployments surface the forward-compatibility requirement that blue-green hides until both versions are live.

Walk me through the security implications of giving an LLM tool-call access via MCP.

Concept: MCP security risk taxonomy | Difficulty: senior | Stage: technical/system-design

Direct answer: The Checkmarx MCP Risk Taxonomy (Nov 2025) identifies 11 distinct security risk categories. Three are novel: Tool poisoning — malicious instructions embedded in tool descriptions redirect agent behavior; SAST tooling won’t catch this because the attack surface is natural language, not code. Rug pulls — MCP server changes tool definitions after user approval; previously-vetted tools execute new unapproved actions. Cross-server data exfiltration — one tool’s description encodes instructions to pass data from another tool to an attacker-controlled endpoint. Mitigations: tool-call allowlists from a vetted registry, parameter validation at gateway, full payload audit logs (not just function names), human-in-the-loop before irreversible calls.

What they’re really probing: Whether you treat MCP as a network of injection vectors, not just a convenient API abstraction.

Defense-in-depth: pin tool versions and hash tool descriptions at approval time, compare at execution. Treat tool descriptions as executable artifacts.

AWS-shop AI teams evaluating MCP integration in production commonly probe AgentCore as the managed runtime layer — it provides native MCP server support, sandboxed tool execution, and credential management that self-built Lambda orchestration requires teams to implement from scratch. The AgentCore + MCP + agentic AI interview prep for AWS-shop teams covers the 10 AgentCore components, Runtime-vs-Lambda tradeoffs, and the October 2025 us-east-1 outage postmortem that shapes production architecture decisions.

For a focused breakdown of the full MCP spec — JSON-RPC 2.0 mechanics, stdio/SSE/Streamable HTTP transports, server primitives (tools/resources/prompts), client primitives (sampling/roots), OpenAI Mar 2025 and Google DeepMind Apr 2025 adoption, and the Wiz security briefing on tool-poisoning and rug-pull attacks — see the Model Context Protocol interview questions and Wiz security briefing.

Have you read recent agent or alignment research that changed how you build production systems?

Concept: research-to-production translation | Difficulty: senior | Stage: technical/behavioral

Direct answer: Two pieces changed production reasoning meaningfully. DeepMind AlphaEvolve (May 2025): an evolutionary coding agent where Gemini proposes algorithmic improvements, validators check correctness automatically, successful mutations propagate. It recovered 0.7% of Google’s total computing capacity. Production implication: ReAct is one agentic architecture among others — evolutionary search with automatic validation is categorically different for optimization tasks where you can define an objective function. OpenAI’s deliberative alignment (Dec 2024): a training-time mechanism that teaches the model to retrieve and reason over a written safety specification at inference time — alignment baked in at training, policy reasoning executes at inference. Production implication: alignment mechanism matters when evaluating models for sensitive domains — a deliberative-alignment model applies policy reasoning differently than an RLHF model.

What they’re really probing: Whether you read research to understand how it changes what you’d build — not just to demonstrate awareness.

“AlphaEvolve made me reconsider whether I need a generative reasoning loop for optimization tasks — if I can define an objective function and validate automatically, evolutionary search is cheaper and more interpretable.” That’s a production-reasoning change. “AlphaEvolve is impressive” is not.

Tell me about an AI/LLM feature you’ve shipped that failed in production. What did you change?

Concept: production postmortem reasoning | Difficulty: senior | Stage: behavioral/system-design

Direct answer: STAR grounded in real production context. Situation: describe the system and failure mode (hallucination in a high-stakes path, runaway agent loop, retrieval returning irrelevant chunks). Task: production impact (user errors, cost spike, data exposure). Action: which architectural lever you pulled (retrieval validation, termination criteria, output classifiers, prompt versioning). Result: how you measured improvement. Real-world anchors: the Air Canada failure was a retrieval-grounding problem; the DPD failure was an instruction-hierarchy enforcement problem. Both would have been caught by the mitigation layers in the postmortems below. (CBC News on Moffatt v. Air Canada, ITV on DPD chatbot)

What they’re really probing: Whether you’ve shipped something real, debugged a real failure, and can name the root cause without deflecting to “the model wasn’t good enough.”

Vague postmortems fail. Name the specific failure mode, impact, fix, and measurement. Air Canada and DPD are usable as real-world precedent even in a personal STAR story — they show you read postmortems, not documentation.

Named-Incident Postmortems Senior Interviewers Probe

Senior interviewers use this question to test whether you read industry news, analyze rather than moralize, and ground answers in specific mitigations rather than generic principles. Generic answers fail; named incidents with dates and root causes pass. The five incidents below are the most frequently probed in 2024-2026 interviews at production-focused AI teams.

Air Canada Chatbot Tribunal Ruling (Moffatt v. Air Canada, February 2024)

February 2024: BC Civil Resolution Tribunal awarded CAD $812.02 to Jake Moffatt, who purchased a bereavement airfare based on chatbot guidance about a refund policy that didn’t exist. Tribunal member Christopher Rivers rejected Air Canada’s defense that the chatbot was “a separate legal entity” — establishing that AI operators are fully liable for chatbot outputs. Root cause: retrieval-grounding failure; the chatbot generated policy statements from training distribution rather than authoritative documents. (CBC News on Moffatt v. Air Canada)

Probe framing: “Walk me through guardrails you’d add to avoid the Air Canada problem.” Mitigation skeleton: RAG grounded to authoritative policy documents only, source citation required per policy response, pre-prod red-teaming on policy edge cases, human escalation for high-stakes refund decisions. The “separate legal entity” defense failed — build AI systems as if you’re fully liable for every output.

Mata v. Avianca: ChatGPT Hallucination Sanctions (June 22, 2023)

June 22, 2023: Judge P. Kevin Castel (S.D.N.Y.) imposed $5,000 sanctions on attorney Steven Schwartz, finding “subjective bad faith,” after Schwartz submitted a brief citing six non-existent cases generated by ChatGPT. The compounding failure: when Schwartz asked ChatGPT to verify the citations, the model confirmed the fabricated cases were real. (Reuters on Mata v. Avianca sanctions)

Probe framing: “How would you design an LLM research tool to prevent the Mata problem?” Mitigation skeleton: tool-call to authoritative citation database before any citation appears in output; never trust the LLM’s self-confirmation in the same session that generated the fabrication; eval-set with hallucinated entity references as known failure cases; human review gate before legal-filing output class.

Samsung ChatGPT Data Leak (March 2023)

March 2023: Three Samsung Semiconductor employees pasted proprietary data into ChatGPT within weeks of Samsung granting internal access — semiconductor source code, defect database code, and confidential meeting minutes in three separate incidents. Samsung imposed a 1,024-byte emergency input limit. (Dark Reading on Samsung leak)

Probe framing: “What DLP controls would you put around employee LLM use?” Mitigation skeleton: inline DLP scanning of LLM payloads before transmission; on-premises or VPC inference for sensitive data classes; gateway-level audit logging with data classification tags; employee training on data classification before access is granted, not after the leak. Interviewers raising local-inference probes alongside the cloud-API ones often ask whether you can name the Ollama runtime trade-offs (quantization tier, OLLAMA_NUM_PARALLEL, the vLLM crossover) — that material lives in our companion Ollama guide.

DPD Chatbot Tone Failure (January 18, 2024)

January 18, 2024: Customer Ashley Beauchamp prompted DPD’s AI chat widget to ignore its rules and write a poem criticizing DPD. The bot complied, calling itself “the worst delivery company in the world.” DPD disabled the AI component that same day. Root cause: system-prompt-only instruction enforcement is insufficient against prompt injection. (ITV on DPD chatbot)

Probe framing: “How would you sandbox an LLM chatbot’s tone and forbid topics?” Mitigation skeleton: input-side prompt injection detection before the model sees the input; output-side tone classifier gating every response; constitutional rules designed to resist “ignore your instructions” commands; jailbreak corpus A/B testing pre-launch. System prompt rules alone are not sufficient for production deployments.

Stanford Legal Hallucinations Study (2024): 58-88% Failure Rate

June 2024: Stanford Law researchers found general-purpose LLMs hallucinate legal citations at 58-88%. Specialized legal AI tools (Harvey, Lexis+AI, Westlaw AI) hallucinate at approximately one-fifth to one-third of general-model rates — substantially better, but still non-zero. (Stanford Law Hallucinations Study (2024))

Probe framing: “Cite empirical evidence for hallucination rates in domain-specific tasks.” Answer skeleton: cite the 58-88% range for general LLMs, note that specialized legal tools improve substantially but don’t eliminate hallucination, anchor the implication — domain-specific fine-tuning and RAG on verified databases reduce but don’t eliminate hallucination; human review remains mandatory for legal filings.

Red-Flag Answers (And What They Signal)

  1. “ChatGPT can do that.” — Tutorial-engineer signal. Describes a product, not a system. Follow-up: “How would you implement that in production at scale?” Prepare an architecture answer.
  2. “I’d just fine-tune it.” — Cost-naive. Fine-tuning doesn’t reduce per-token inference cost and doesn’t solve data freshness. If you can’t articulate RAG-first reasoning, you haven’t done the cost analysis. (Chip Huyen on LLM Engineering)
  3. “We use RAG” without retrieval-method specifics. Dense embedding? Sparse BM25? Hybrid? Reranking? Chunk size? “We use RAG” without these details signals you’ve read about retrieval pipelines, not built one.
  4. “I’d use an agent for that” without the cost discussion. Agents use ~15x more tokens than chat with 80% task-to-task variance, per Anthropic (June 2025). No cost/termination discussion = architecture-naive.
  5. “We can’t have hallucinations” without citing eval methodology. That’s a constraint, not a mitigation. “How do you measure your hallucination rate?” — golden set, LLM-as-judge calibration, production monitoring — is the answer they’re looking for.
  6. “Just add a prompt that says ‘don’t hallucinate.’” — The Air Canada/Mata failure mode directly. Neither chatbot’s instruction layer was sufficient without retrieval grounding and output validation. Expect the interviewer to name these incidents. (CBC News on Moffatt v. Air Canada, Reuters on Mata v. Avianca sanctions)

Questions to Ask Your Interviewer (2026-Aware)

The questions you ask signal as much as the answers you give. Grouped by employer context:

Frontier lab (Anthropic, OpenAI, DeepMind, Cohere)

  • “How does the AI engineering team contribute to alignment research decisions — or is there a hard boundary between research and engineering?”
  • “What does your evaluation pipeline look like for a new model capability before it ships to the API?”
  • “What’s the most significant architectural decision you’ve revisited in the last 12 months based on production data?”

AI-native startup (building on foundation models)

  • “How do you handle model provider changes — if your primary API provider deprecates an endpoint, what’s the migration path?”
  • “What’s your current approach to prompt versioning and regression testing — and where are the gaps you’d want this role to close?”
  • “How do you make the fine-tune vs. RAG vs. prompt decision for a new feature request — framework or case-by-case?”

Big Tech AI team (Google AI, Meta AI, Microsoft Research)

  • “What’s the relationship between the AI engineering team and the LLMOps platform team — where does the product boundary sit?”
  • “How does your team approach multi-agent cost governance — who owns the token budget for an agentic feature?”
  • “Is the team primarily integrating foundation models via API, fine-tuning open-weight models, or both?”

30-Day AI Engineer Interview Prep Roadmap

Week 1 — Foundations

Read Anthropic’s Building Effective Agents (Dec 2024) — map the 5-pattern taxonomy to use cases you’ve seen. Read Chip Huyen on LLM Engineering — cost drivers and caching. Build or trace one end-to-end RAG pipeline. Target: transformer architecture, tokenization cost, context window, vector DB mechanics, and all 5 Anthropic patterns from memory.

Week 2 — Production Patterns

Design an eval harness: golden set criteria, LLM-as-judge calibration plan, production monitoring signals. Run a cost-optimization exercise on any LLM API — log input tokens, output tokens, API calls; apply prompt compression; measure the delta. Study speculative decoding, KV cache reuse, INT8 quantization conceptually. (Chip Huyen on LLM Engineering)

Week 3 — Senior Topics

Read Anthropic’s multi-agent research (June 2025) — memorize 90.2% / 15x / 80% as anchor points. Read Checkmarx MCP security (Nov 2025) — explain tool poisoning and rug pulls without notes. For each postmortem: root cause in one sentence, mitigation in two. Read AlphaEvolve and deliberative alignment — one sentence each on what changed in your production thinking. Frontier-lab screens increasingly probe familiarity with Anthropic’s own agentic toolchain, including slash commands, sub-agents, and CLAUDE.md project memory — the Claude Code CLI interview questions covering slash commands, hooks, and skills maps these to the answer shapes interviewers want at Anthropic and Anthropic-adjacent roles.

Week 4 — Mocks and Reverse Questions

Run mock interviews on the multi-agent cost tradeoff, MCP security, and production failure questions — each under 3 minutes. Prepare 3 reverse questions per employer type. Audit the red-flag answers and prepare sharper versions. Target: a personal STAR story for an LLM production failure, 4-5 minutes, with specific root cause and mitigation.

What Winning Candidates Do Differently

The candidates who win in 2026 ground every answer in specifics: not “multi-agent can be expensive” but “Anthropic documented 15x token cost and 80% task-to-task variance in June 2025.” Not “hallucination is a risk” but “Stanford Law researchers found 58-88% citation hallucination rates in 2024 — here’s the eval methodology.” Not “we need guardrails” but “the Air Canada tribunal in February 2024 established AI operators are fully liable for chatbot outputs — here are the retrieval-grounding layers.” The named incidents, exact figures, and production postmortems are the vocabulary of a practitioner. Use this as your starting line.

Similar Posts