vLLM Interview Questions: 2026 PagedAttention + Production Guide

By mid-2025, vLLM had become the open-source LLM serving engine that most production teams reach for first. The 2024 release cadence settled it: v0.6 shipped chunked prefill in September 2024, eliminating the head-of-line blocking that had been the biggest throughput tax for long-context serving (Source: vLLM blog).

v0.7 and v0.8 in early 2025 added first-class structured output and tool calling, closing the gap with proprietary inference APIs. By the time enterprise teams hired against “vLLM experience” as a real line item, interview questions had drifted toward production-deployment judgment, multi-LoRA fleet economics, and FP8/NVFP4 quantization on Hopper and Blackwell.

This guide maps the questions vLLM-using teams actually ask in 2025-2026 across AI engineer interview loops, ML platform engineer screens, and senior research-engineer rounds at Anthropic, Mistral, Together AI, and Cloudflare Workers AI. Each question carries the evaluation rubric, the named release or paper that anchors a strong answer, and a citation.

In this article, we’ll cover the following 18 questions:

  1. Walk me through PagedAttention. Why does vLLM use it instead of contiguous KV cache?
  2. What’s the difference between PagedAttention and FlashAttention?
  3. How do you size the KV cache for a 70B model on an 8xH100 node?
  4. How does prefix caching work in vLLM and when does it actually help?
  5. When does vLLM swap to CPU vs recompute on preemption?
  6. Explain continuous batching. How is it different from static batching?
  7. What is chunked prefill and why did vLLM add it in v0.6?
  8. When would you turn on speculative decoding in vLLM, and what’s the failure mode?
  9. Walk me through disaggregated prefill and decode. Why do papers like DistServe and Mooncake argue for it?
  10. If you had to choose between higher throughput and lower TTFT, what scheduler settings would you change?
  11. How does multi-LoRA serving work in vLLM and what’s the cost vs spinning up separate replicas?
  12. When would you choose FP8 over INT4 quantization for inference?
  13. GPTQ vs AWQ vs FP8 — how do you pick?
  14. Tensor parallelism vs pipeline parallelism in vLLM — when would you reach for each?
  15. A vLLM deployment is hitting 40% of expected throughput. How do you debug?
  16. How does vLLM handle tool calling and structured output as of v0.8?
  17. How would you build multi-tenant fairness on top of vLLM?
  18. What metrics would you put on a vLLM SLO dashboard?

Why vLLM interviews shifted in 2024-2026

vLLM in 2026 is not the vLLM of its original research paper. The interview rubric has moved with it. Three shifts matter most.

  • Chunked prefill (vLLM v0.6, September 2024) changed how interviewers reason about throughput vs time-to-first-token — the “which knob do you turn” question now has a deterministic answer that didn’t exist 18 months ago.
  • Multi-LoRA hot-swap serving went from research demo to production feature, making fleet economics for fine-tuned model families an interview topic in its own right.
  • The FP8 and NVFP4 quantization wave on Hopper and Blackwell GPUs (Source: NVIDIA developer blog) made cost-down questions a default screen for ML platform roles.

The role spectrum spans four bands. ML inference engineers own the serving stack and get the deepest technical probes. ML platform engineers focus on multi-tenant scheduling across many fine-tuned variants. AI engineers consume vLLM as a black box and get more workload-shaping questions.

Senior research engineers at frontier labs — including hires into Anthropic, Mistral, and the Together AI / Anyscale serving teams — get cross-paper questions connecting vLLM internals to academic papers like DistServe and Mooncake.

What does NOT belong in your prep: the original PagedAttention paper abstract from a few years back. It is still the foundational reference, but the 2024-2025 release notes carry the questions actually being asked in 2026.

PagedAttention and KV cache memory questions

PagedAttention is the central abstraction that lets vLLM pack more concurrent requests onto the same GPU than a naive contiguous-cache server. Every vLLM technical screen opens here because the depth of the candidate’s answer correlates with how much they have actually deployed vLLM versus read about it. The five questions in this category form the foundation interviewers test before moving on to scheduling and quantization. (Source: vLLM 2023 paper at SOSP, vLLM project blog.)

PagedAttention vs contiguous KV cache: the central memory abstraction that lets vLLM pack more concurrent requests onto the same GPU.

Walk me through PagedAttention. Why does vLLM use it instead of contiguous KV cache?

Concept: KV cache memory management | Difficulty: junior to mid | Stage: technical screen

Direct answer: PagedAttention treats the KV cache like virtual memory in an operating system. Instead of pre-allocating a contiguous chunk per request sized to the maximum sequence length, vLLM allocates the cache in fixed-size blocks (commonly 16 tokens per block) that can be scattered across non-contiguous GPU memory. A per-request block table maps logical positions to physical blocks. This removes the waste a contiguous scheme suffers when requests finish well short of the reserved maximum (internal fragmentation is bounded by the last, partially filled block), and it enables copy-on-write block sharing for parallel sampling and beam search. The practical effect is that far more concurrent requests fit in the same GPU memory budget; the original SOSP paper reports 2-4x higher throughput than contiguous-cache baselines at the same level of latency.

What they’re really probing: Whether the candidate understands that PagedAttention is a memory-layout idea, not an attention-compute idea — and whether they can name the OS analogy without prompting.

The strongest answers also flag the cost: block-table lookups add a small per-token overhead, and the attention kernel needs to be rewritten to follow indirect block pointers. vLLM’s kernel handles this; a naive PyTorch attention does not. Candidates who can describe the kernel-level work — coalesced memory access patterns across paged blocks, the FlashInfer integration in newer releases — separate themselves from candidates who only know the high-level analogy. (Source: vLLM docs.)
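The block-table idea is easier to defend in an interview with a toy model in hand. The following is a minimal sketch, not vLLM's implementation: `BlockAllocator` and `Sequence` are hypothetical names, and the only detail taken from the text is the 16-token block size.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV block, the common size quoted above


class BlockAllocator:
    """Hands out physical block ids from a fixed pool (hypothetical helper)."""

    def __init__(self, num_blocks: int) -> None:
        self.free = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("no free KV blocks; a real scheduler would preempt")
        return self.free.pop()

    def release(self, block_id: int) -> None:
        self.free.append(block_id)


@dataclass
class Sequence:
    """One request: a token count plus its logical-to-physical block table."""

    num_tokens: int = 0
    block_table: list = field(default_factory=list)

    def append_token(self, allocator: BlockAllocator) -> None:
        # A new physical block is needed only when the last one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(allocator.allocate())
        self.num_tokens += 1

    def locate(self, position: int) -> tuple:
        """Map a logical token position to (physical block id, offset in block)."""
        return self.block_table[position // BLOCK_SIZE], position % BLOCK_SIZE


allocator = BlockAllocator(num_blocks=8)
seq = Sequence()
for _ in range(20):
    seq.append_token(allocator)

# 20 tokens occupy two blocks; the blocks need not be adjacent in memory.
print(seq.block_table, seq.locate(17))
```

The indirection in `locate` is the per-token lookup cost mentioned above: every KV read goes through the block table instead of a base pointer plus offset.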

What’s the difference between PagedAttention and FlashAttention?

Concept: KV cache vs attention kernel | Difficulty: mid | Stage: technical screen

Direct answer: PagedAttention is a KV cache memory-management scheme; FlashAttention is an attention compute kernel. They operate at different layers of the stack and modern vLLM uses both. FlashAttention (Tri Dao, 2022; FlashAttention-2 in 2023; FlashAttention-3 in 2024) restructures the attention computation to keep intermediate tensors in fast on-chip SRAM rather than spilling to HBM, which cuts memory traffic for the attention math itself. PagedAttention restructures how the KV cache is laid out in memory so the server can pack more requests. A candidate who calls them alternatives is treating an “AND” as an “OR”: vLLM’s attention kernel is a paged variant that runs FlashAttention-style tiled computation over KV blocks located through the block table.

What they’re really probing: Whether the candidate has actually read kernel-level code or only marketing summaries.

The follow-up usually digs into when each matters more. For short-context, high-concurrency workloads, PagedAttention’s packing wins are larger than FlashAttention’s kernel wins — you’re memory-bandwidth-bound on KV reads, not compute-bound on attention math. For long-context, lower-concurrency workloads (think 100K-token analysis queries), FlashAttention’s compute win matters more because the attention math grows quadratically. Strong candidates will mention FlashInfer as the next step in this lineage — a 2024 attention serving library that vLLM integrates for some configurations. (Source: FlashAttention GitHub.)

How do you size the KV cache for a 70B model on an 8xH100 node?

Concept: capacity planning | Difficulty: mid to senior | Stage: technical deep-dive

Direct answer: Start from the math: a 70B model in FP16 weights uses about 140 GB of GPU memory, leaving roughly 500 GB across 8 H100s (640 GB total minus weights and framework overhead). The KV cache per token, per layer, is 2 × num_kv_heads × head_dim × dtype_bytes. For Llama 3 70B with 80 layers, 8 KV heads (GQA), and 128 head_dim, that’s 4 KB per token per layer in FP16, or about 320 KB per token across all 80 layers (160 KB in FP8). A naive 4K-context, 100-concurrent-request workload uses about 1.25 GB of KV per request, roughly 125 GB in total, fitting comfortably. The interesting case is 32K context and 64 concurrent requests: that’s about 10 GB per request and 640 GB total in FP16, which overshoots the budget and forces the scheduler to preempt mid-request, while an FP8 KV cache halves it to about 5 GB per request and 320 GB total. Use vLLM’s --gpu-memory-utilization flag (default 0.9) to set the target fraction and --max-model-len to cap context per request rather than letting it auto-derive from the model config.

What they’re really probing: Whether the candidate can do the back-of-the-envelope without reaching for a calculator, and whether they know the GQA simplification (KV cache scales with num_kv_heads, not num_attention_heads).

The senior-tier follow-up: what happens when you swap to FP8 KV cache? You roughly double concurrent-request capacity for the same context length. vLLM supports FP8 KV cache via the --kv-cache-dtype fp8 flag. The risk is small accuracy loss on long-tail queries: calibration is required, and not all model families behave the same. (Source: vLLM documentation, GitHub release notes.)
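The back-of-the-envelope arithmetic is worth rehearsing as code. This sketch uses the Llama 3 70B shape quoted above (80 layers, 8 KV heads, head_dim 128); the 500 GiB KV budget is an assumed round number, not a measured value, and the function names are illustrative.

```python
GIB = 1024 ** 3


def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int, dtype_bytes: int) -> int:
    # The factor of 2 covers the key tensor and the value tensor.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_gib_per_request(context_len: int, bytes_per_token: int) -> float:
    return context_len * bytes_per_token / GIB


def max_concurrent(budget_gib: float, context_len: int, bytes_per_token: int) -> int:
    # How many full-length requests fit in the KV budget.
    return int(budget_gib // kv_gib_per_request(context_len, bytes_per_token))


FP16 = kv_bytes_per_token(80, 8, 128, dtype_bytes=2)
FP8 = kv_bytes_per_token(80, 8, 128, dtype_bytes=1)
KV_BUDGET_GIB = 500  # assumption: memory left after weights on an 8xH100 node

for name, per_token in (("fp16", FP16), ("fp8", FP8)):
    for context_len in (4096, 32768):
        print(
            name,
            context_len,
            f"{kv_gib_per_request(context_len, per_token):.2f} GiB/request",
            max_concurrent(KV_BUDGET_GIB, context_len, per_token),
            "requests",
        )
```

Note that the formula scales with the number of KV heads, so a GQA model with 8 KV heads needs one eighth of the cache that a 64-head multi-head-attention model of the same depth would.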

How does prefix caching work in vLLM and when does it actually help?

Concept: cache reuse for shared prompts | Difficulty: mid | Stage: technical deep-dive

Direct answer: vLLM’s automatic prefix caching (APC), enabled with --enable-prefix-caching, hashes the prompt tokens in fixed-size chunks (one chunk = one KV block, typically 16 tokens) and reuses already-computed KV blocks when a new request shares a leading prefix. The hit rate depends entirely on workload shape: workloads with long shared system prompts (RAG with the same instructions, agent loops with the same tool definitions, multi-turn chats replaying the same conversation history) see large TTFT reductions, sometimes 5-10x for the prefill portion. Workloads with diverse prompts get near-zero benefit and pay a small hash-lookup overhead. The honest answer in an interview is “measure the prefix overlap on a sample of production traffic before turning it on”, not “always enable it.”

What they’re really probing: Whether the candidate reflexively reaches for an optimization or whether they ask about workload shape first.

Implementation details that signal depth:

  • The cache is local to each engine instance and is not shared across replicas, so horizontal scaling needs sticky routing for cross-request hits.
  • Eviction is LRU on the block level, not the request level.
  • Hashing is content-based, so two requests with identical prompts hit the cache even if they came from different users.
  • That last point matters for privacy: candidates with security backgrounds should flag the side-channel risk and the mitigations (per-tenant cache partitions or hashing with a tenant salt). (Source: vLLM PR for automatic prefix caching, available since v0.4.x.)
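The content-based hashing and the tenant-salt mitigation can be illustrated with a short sketch. This is not vLLM's hashing code: `block_hashes` and `shared_prefix_blocks` are hypothetical helpers, SHA-256 is chosen only for illustration, and the chaining (each block's hash depends on the hash of the block before it) is an assumption about how a prefix-only match is typically enforced.

```python
import hashlib

BLOCK_SIZE = 16  # tokens per KV block


def block_hashes(token_ids, tenant_salt=""):
    """Return one chained hash per full block of the prompt.

    Partial trailing blocks are skipped because only full blocks are reusable.
    """
    hashes = []
    parent = tenant_salt.encode()
    full_len = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for start in range(0, full_len, BLOCK_SIZE):
        block = token_ids[start:start + BLOCK_SIZE]
        payload = parent + b"|" + ",".join(map(str, block)).encode()
        digest = hashlib.sha256(payload).hexdigest()
        hashes.append(digest)
        parent = digest.encode()  # chain so a block only matches after the same prefix
    return hashes


def shared_prefix_blocks(hashes_a, hashes_b):
    """Count the leading blocks two requests could share in the cache."""
    count = 0
    for a, b in zip(hashes_a, hashes_b):
        if a != b:
            break
        count += 1
    return count


system_prompt = list(range(40))            # 40 shared tokens
request_a = system_prompt + list(range(100, 110))
request_b = system_prompt + list(range(200, 210))

same_tenant = shared_prefix_blocks(block_hashes(request_a), block_hashes(request_b))
cross_tenant = shared_prefix_blocks(
    block_hashes(request_a, tenant_salt="tenant-a"),
    block_hashes(request_b, tenant_salt="tenant-b"),
)
print(same_tenant, cross_tenant)
```

Only 32 of the 40 shared tokens are reusable here, because the third block mixes shared and request-specific tokens. Salting the first hash per tenant removes cross-tenant hits entirely, which is the privacy tradeoff named in the last bullet.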

When does vLLM swap to CPU vs recompute on preemption?

Concept: preemption policy | Difficulty: senior | Stage: technical deep-dive

Direct answer: When the scheduler runs out of GPU KV blocks and needs to preempt a lower-priority request, it has two options: swap the request’s KV blocks to CPU memory and restore them later, or discard the KV blocks and recompute the prefill when the request resumes. vLLM picks based on the --preemption-mode flag (defaults to recompute) and the request’s current decoded length. Recompute is cheaper when the prefill cost is small relative to PCIe transfer time (short prompts, fast GPU). Swap is cheaper when the prefill was expensive and CPU memory has the bandwidth (long prompts, big batches, modern PCIe Gen5 systems). The break-even point on H100s is roughly 8K-16K prompt tokens.

What they’re really probing: Whether the candidate has actually watched a vLLM deployment hit memory pressure under load, or is reading from documentation.

The follow-up question often goes meta: why does this matter for SLOs? Recompute spikes latency on the resumed request by the full prefill cost — visible in p99 TTFT. Swap spikes latency by the PCIe transfer cost — visible in p99 inter-token latency. Different SLO contracts call for different choices. A senior answer connects this to broader scheduler design, not just the flag toggle. (Source: vLLM scheduler source, vLLM GitHub.)

Throughput, chunked prefill, and request scheduling questions

The scheduler is where vLLM separates from a research demo. Continuous batching landed in 2023; chunked prefill in 2024; speculative decoding integration matured across 2024-2025; and the 2024 academic work on disaggregated prefill/decode (DistServe, Mooncake, Splitwise) became required reading for senior interview prep. Interviewers test depth in this category because the scheduler choices are the operator’s main lever for TTFT vs throughput tradeoffs. (Source: vLLM scheduling documentation.)

Chunked prefill (vLLM v0.6) interleaves small prefill chunks with ongoing decode iterations, eliminating the head-of-line blocking pattern.

Explain continuous batching. How is it different from static batching?

Concept: dynamic request batching | Difficulty: junior to mid | Stage: technical screen

Direct answer: Static batching waits for B requests to arrive, runs all of them through generation step by step, and returns the entire batch when the longest sequence finishes. Short sequences sit idle while the long one completes. Continuous batching (also called iteration-level scheduling or in-flight batching) treats each generation step as an opportunity to add new requests to the active batch and remove finished ones — the batch composition changes every iteration. The practical effect is that GPU utilization stays near 100% even with heterogeneous request lengths, and new requests don’t wait for an arbitrary batch boundary to start.

What they’re really probing: Whether the candidate can articulate the head-of-line blocking problem that continuous batching solves.

The crisp follow-up is “what’s the cost?” — and the honest answer is implementation complexity in the kernel layer, because batched matmuls now have to handle requests at different positions in their generation. vLLM’s solution is the paged-attention kernel that operates on per-request block tables, allowing variable-length sequences to share a single kernel launch. Most production serving stacks now use continuous batching as the default; static batching survives only in offline-throughput jobs where job-level batching is acceptable. (Source: Orca paper at OSDI 2022 introduced the pattern; vLLM productionized it.)
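A small simulation makes the head-of-line blocking argument concrete. It is a sketch under simplifying assumptions: every iteration costs the same regardless of batch size, each request is described only by the number of tokens it will decode, and the function names are illustrative.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Decode iterations when each batch runs until its longest request ends."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Decode iterations when freed slots are refilled at every iteration."""
    waiting = deque(lengths)
    running = []  # remaining tokens for each in-flight request
    steps = 0
    while waiting or running:
        # Admit new requests into any free slots before the next iteration.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1
        # Each running request emits one token; finished requests leave the batch.
        running = [remaining - 1 for remaining in running if remaining > 1]
    return steps


lengths = [100, 5, 5, 5, 100, 5, 5, 5]
print(static_batching_steps(lengths, batch_size=4))
print(continuous_batching_steps(lengths, batch_size=4))
```

With this request mix the static scheduler needs 200 iterations and the continuous one needs 105, because the second long request starts as soon as the first short requests finish instead of waiting for a batch boundary.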

What is chunked prefill and why did vLLM add it in v0.6?

Concept: prefill scheduling | Difficulty: mid to senior | Stage: technical deep-dive

Direct answer: A long prefill (say, 32K tokens) processed in one shot would block all decode iterations for many milliseconds: every other in-flight request sees its inter-token latency spike because the GPU is stuck on the long prefill. Chunked prefill, shipped in vLLM v0.6 in September 2024, splits the prefill into smaller chunks (default 512 tokens) and interleaves them with decode iterations. The decode requests keep their inter-token-latency SLO; the prefill takes the same total time but gives up its monopoly. The flag is --enable-chunked-prefill with --max-num-batched-tokens controlling the chunk-plus-decode budget per iteration.

What they’re really probing: Whether the candidate knows the v0.6 release feature by name, and whether they can connect it to the head-of-line latency story.

The senior follow-up: when does chunked prefill hurt? Pure-throughput workloads (offline batch jobs with no per-request SLO) sometimes do worse because chunking adds a small per-chunk overhead. And on workloads where prefills are short anyway (e.g., chatbot turns averaging 200 tokens), the feature is mostly inert. As of v0.7 and v0.8, the scheduler heuristics auto-tune chunk size, but the operator can still override. (Source: vLLM v0.6 performance blog.)
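The per-iteration token budget can be sketched in a few lines. This is an illustrative model rather than the vLLM scheduler: it assumes decodes are scheduled first at one token per running sequence and that a single prefill receives whatever budget is left, and the function names are hypothetical.

```python
def schedule_iteration(decode_seqs, prefill_remaining, token_budget):
    """Split one iteration's token budget between decodes and a prefill chunk."""
    decode_tokens = min(decode_seqs, token_budget)  # one token per running decode
    chunk = min(prefill_remaining, token_budget - decode_tokens)
    return decode_tokens, chunk


def iterations_to_prefill(prompt_len, decode_seqs, token_budget):
    """Count iterations needed to finish a prefill while decodes keep running."""
    remaining, iterations = prompt_len, 0
    while remaining > 0:
        _, chunk = schedule_iteration(decode_seqs, remaining, token_budget)
        if chunk == 0:
            raise ValueError("token budget leaves no room for prefill")
        remaining -= chunk
        iterations += 1
    return iterations


# A 32K-token prompt arriving while 64 requests are decoding, budget 512 tokens.
print(schedule_iteration(decode_seqs=64, prefill_remaining=32768, token_budget=512))
print(iterations_to_prefill(prompt_len=32768, decode_seqs=64, token_budget=512))
```

In this model the long prompt is spread over 74 iterations of 448 prefill tokens each, and every one of the 64 decoding requests still emits a token in each of them.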

When would you turn on speculative decoding in vLLM, and what’s the failure mode?

Concept: speculative decoding economics | Difficulty: senior | Stage: technical deep-dive

Direct answer: Speculative decoding uses a small “draft” model to propose K tokens and the large “target” model to verify them in parallel: when the proposals are accepted, you emit several tokens for the cost of one target forward pass. vLLM supports several draft strategies: a separate small draft model, n-gram speculation (matches in the prefix), and EAGLE-style trained speculators. Turn it on for latency-bound workloads where inter-token latency matters more than raw throughput, such as interactive chat and code completion; it does not improve TTFT, which is set by prefill. The failure mode is workloads where the draft model’s acceptance rate is low: throughput drops because you’re paying for both models with little benefit, and tail latency gets worse because verification batches have variable acceptance counts.

What they’re really probing: Whether the candidate understands speculative decoding is not a free lunch — acceptance rate is the dominant variable.

Strong answers connect the choice to model family and workload. EAGLE-2 and Medusa-2 hit higher acceptance rates than vanilla speculative for code and chat workloads. N-gram speculation is essentially free to enable but only helps on repetitive prompts. The senior tier discriminator is what you do when acceptance rate is borderline: ship a feature flag, A/B test on production traffic, and decide based on measured p50/p99 latency — not on benchmarks from the speculative-decoding paper. (Source: vLLM speculative decoding docs; Leviathan et al. speculative decoding paper.)
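The dependence on acceptance rate can be made quantitative with the expected-value formula from the Leviathan et al. paper cited above. The sketch below assumes a constant per-token acceptance rate `alpha` and a draft-to-target cost ratio `draft_cost_ratio`; both are modelling simplifications, and the numbers in the example calls are made up.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per target forward pass with k drafted tokens."""
    if alpha >= 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)


def speedup(alpha: float, k: int, draft_cost_ratio: float) -> float:
    """Expected speedup over plain decoding.

    draft_cost_ratio is the cost of one draft forward pass relative to one
    target forward pass; each step pays for k draft passes plus one target pass.
    """
    return expected_tokens_per_step(alpha, k) / (k * draft_cost_ratio + 1)


# A cheap, well-matched draft model versus an expensive, poorly matched one.
print(round(speedup(alpha=0.8, k=4, draft_cost_ratio=0.05), 2))
print(round(speedup(alpha=0.3, k=4, draft_cost_ratio=0.2), 2))
```

A result below 1.0 means speculation makes decoding slower than running the target model alone, which is the failure mode described above.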

Walk me through disaggregated prefill and decode. Why do papers like DistServe and Mooncake argue for it?

Concept: disaggregated serving architecture | Difficulty: senior | Stage: senior research engineer

Direct answer: Prefill is compute-bound (matmul-heavy, benefits from high FLOPS); decode is memory-bandwidth-bound (one token at a time, benefits from high HBM bandwidth). Co-located serving forces both onto the same GPU, where the optimal hardware profile differs. Disaggregated serving splits prefill and decode onto separate GPU pools — DistServe (2024), Mooncake (Moonshot AI, 2024), and Microsoft’s Splitwise paper formalized this. Each pool can be sized and scheduled for its own bottleneck. vLLM has shipped first-class disaggregated serving primitives across 2024-2025; the KV cache transfer between pools happens over NVLink or RDMA, and the orchestration layer routes prefill-complete requests to the decode pool.

What they’re really probing: Whether the candidate has read the 2024-2025 papers and can argue both sides — disaggregation is not always a win.

The pushback question every senior screen includes: when does disaggregation NOT pay off? Short prefills don’t justify the KV-transfer cost. Workloads with high inter-pool traffic relative to compute do worse. And the operational complexity — two pools to monitor, two sets of failures to debug — is real. The honest answer is that disaggregation is a win for long-context, large-fleet workloads at frontier labs; it is overkill for a single-tenant deployment serving sub-second chat turns. (Source: DistServe paper, Mooncake / Moonshot AI technical report 2024.)

If you had to choose between higher throughput and lower TTFT, what scheduler settings would you change?

Concept: scheduler tuning | Difficulty: senior | Stage: domain deep-dive

Direct answer: For higher throughput, raise --max-num-batched-tokens (the per-iteration token budget) and --max-num-seqs (concurrent request cap) and accept that TTFT will degrade because new requests wait longer for batch room. For lower TTFT, lower those numbers, enable chunked prefill with smaller chunks, and lean into prefix caching for the workload’s shared system prompts. The single most-leveraged setting is --gpu-memory-utilization: pushing toward 0.95 buys more concurrent capacity at the risk of OOM on tail-context requests; staying at 0.85 gives headroom for tail latency. Production-grade answers cite specific numbers from a benchmark on the actual workload, not generic guidance.

What they’re really probing: Whether the candidate has actually tuned vLLM under load and knows where the levers are.

The strongest answers also flag the second-order effects. Aggressive batching helps throughput but spreads variance in inter-token latency. Chunked prefill helps p99 TTFT but costs a few percent throughput. Prefix caching is essentially free if the workload has shared prefixes and harmless otherwise. The senior signal is naming three numbers — your target throughput, your TTFT SLO, your decode-latency SLO — and walking through which knob each maps to. (Source: vLLM benchmarking guides; vLLM docs.)
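One way to keep the two tuning directions straight is to write them down as launch profiles. The sketch below only assembles command lines from the flags named above; the model name is a placeholder and every numeric value is an illustrative assumption to be replaced with numbers from a benchmark on the real workload.

```python
import shlex

MODEL = "your-org/your-model"  # hypothetical placeholder, not a real checkpoint


def serve_command(max_num_batched_tokens, max_num_seqs, gpu_memory_utilization,
                  chunked_prefill=False, prefix_caching=False):
    """Build a `vllm serve` command line from a tuning profile."""
    args = [
        "vllm", "serve", MODEL,
        "--max-num-batched-tokens", str(max_num_batched_tokens),
        "--max-num-seqs", str(max_num_seqs),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
    ]
    if chunked_prefill:
        args.append("--enable-chunked-prefill")
    if prefix_caching:
        args.append("--enable-prefix-caching")
    return shlex.join(args)


# Throughput-leaning profile: large token budget, many sequences, tight memory.
throughput_cmd = serve_command(8192, 256, 0.95)

# TTFT-leaning profile: smaller budget, chunked prefill, prefix caching, headroom.
latency_cmd = serve_command(1024, 64, 0.85, chunked_prefill=True, prefix_caching=True)

print(throughput_cmd)
print(latency_cmd)
```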

Multi-LoRA, quantization, and fleet economics questions

Multi-LoRA hot-swap serving and quantization are the 2024-2025 economic story. Teams running fleets of fine-tuned model variants asked vLLM for a way to serve N adapters from one base-model replica; vLLM shipped LoRA serving across 2024 and refined the adapter-management API in v0.6/v0.7. Quantization questions track the FP8 / NVFP4 / INT4 wave on NVIDIA Hopper and Blackwell hardware (Source: NVIDIA developer blog). Together they form the cost-down toolkit every ML platform interview now tests.

How does multi-LoRA serving work in vLLM and what’s the cost vs spinning up separate replicas?

Concept: LoRA adapter fleet serving | Difficulty: mid to senior | Stage: technical deep-dive

Direct answer: vLLM’s multi-LoRA serving loads N small LoRA adapters (typical adapter size 10-200 MB) on top of one base model replica and routes each request to the adapter named in the request payload. The base model’s KV cache and weights are shared; only the adapter weights swap per request. The cost vs separate replicas is dramatic: serving 50 adapters this way uses one base-model GPU instead of 50, with a small per-request overhead for the adapter lookup and a slight throughput penalty (typically 5-15%) compared to a single dedicated model. The flag is --enable-lora with --max-loras controlling how many adapters can be hot at once.

What they’re really probing: Whether the candidate understands the economic case — many teams have a fleet of fine-tunes from PEFT/LoRA training and need to serve them cost-effectively.

The senior follow-up: what’s the failure mode? If the adapter set is bigger than --max-loras and request mix touches all of them, vLLM has to evict and reload adapters mid-request, adding latency. The fix is right-sizing --max-loras to the working set, sticky routing per adapter, or sharding the adapter set across replicas. The “when does this break” answer separates engineers who have run this in production from engineers who read the docs. (Source: vLLM LoRA serving documentation.)
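The thrashing failure mode shows up clearly in a toy adapter cache. This sketch assumes LRU eviction and counts adapter loads; `AdapterCache` is a hypothetical class for illustration, not the vLLM adapter manager.

```python
from collections import OrderedDict


class AdapterCache:
    """Keeps at most max_loras adapters resident, evicting the least recently used."""

    def __init__(self, max_loras: int) -> None:
        self.max_loras = max_loras
        self.resident = OrderedDict()
        self.loads = 0  # number of times an adapter had to be loaded

    def get(self, name: str) -> bool:
        """Return True on a hit, False when the adapter had to be loaded."""
        if name in self.resident:
            self.resident.move_to_end(name)  # mark as most recently used
            return True
        if len(self.resident) >= self.max_loras:
            self.resident.popitem(last=False)  # evict the least recently used
        self.resident[name] = True
        self.loads += 1
        return False


def count_loads(max_loras, requests):
    cache = AdapterCache(max_loras)
    for adapter in requests:
        cache.get(adapter)
    return cache.loads


traffic = ["a", "b", "c"] * 4  # three adapters requested round-robin
print(count_loads(2, traffic), count_loads(3, traffic))
```

With room for only two of the three adapters, every one of the 12 requests triggers a load; sizing the cache to the working set of three brings that down to 3 loads in total.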

When would you choose FP8 over INT4 quantization for inference?

Concept: quantization format tradeoffs | Difficulty: mid to senior | Stage: domain deep-dive

Direct answer: Choose FP8 when accuracy preservation matters and the hardware supports it natively. Hopper and Blackwell GPUs have FP8 tensor cores that deliver roughly twice the peak FLOPS of BF16 with half the memory footprint. FP8 typically loses under 1% on common evals and needs minimal calibration. Choose INT4 (via GPTQ, AWQ, or similar) when memory is the binding constraint: fitting a 70B model (about 35 GB of weights at 4 bits) onto a single 48 GB GPU, or fitting more concurrent requests onto a server-grade GPU. INT4 needs careful calibration (some layers degrade badly) and the kernel ecosystem is less mature, but the memory savings are roughly 2x vs FP8 and 4x vs FP16.

What they’re really probing: Whether the candidate treats quantization as a one-knob decision or as a layered tradeoff (memory, compute, accuracy, hardware support).

The follow-up that surfaces depth: how do you validate? A strong answer names a held-out eval suite specific to the deployment (not just MMLU), runs the quantized model against the FP16 baseline and reports confidence intervals on the score differences, and ships a feature flag to roll back if production-traffic metrics regress. Junior answers cite a single benchmark number; senior answers describe the rollout plan. The 2024-2025 NVFP4 format on Blackwell shifts this calculus, and strong candidates flag it without prompting. (Source: NVIDIA developer blog, Hugging Face quantization docs.)
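The memory side of the tradeoff is simple arithmetic. The sketch below counts weight bytes only, under the assumption that quantization scales, zero points, and any layers left in higher precision are ignored, so real checkpoints come out somewhat larger.

```python
def weight_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate weight footprint in GB (10^9 bytes), weights only."""
    return params_billion * bits_per_weight / 8


FORMATS = {"FP16": 16, "FP8": 8, "INT4": 4}
footprint = {name: weight_gb(70, bits) for name, bits in FORMATS.items()}

for name, gb in footprint.items():
    print(f"70B in {name}: ~{gb:.0f} GB of weights")
```

Every gigabyte saved on weights becomes KV cache capacity on the same GPU, which is why the format choice feeds directly into the concurrency math from the capacity-planning question.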

GPTQ vs AWQ vs FP8 — how do you pick?

Concept: quantization algorithm choice | Difficulty: senior | Stage: domain deep-dive

Direct answer: GPTQ (Frantar et al., 2022) is post-training INT4 with second-order error compensation — cheap to apply, broad model support, slightly less accurate than AWQ on average. AWQ (Lin et al., 2023) is activation-aware INT4 weight quantization — protects salient weight channels by scaling them, hits better accuracy than GPTQ on most evals, and is the current default for community-distributed INT4 checkpoints on Hugging Face. FP8 needs Hopper or Blackwell hardware but skips most of the accuracy questions — it’s a near-drop-in. The decision tree: if your hardware is Hopper/Blackwell and weight memory isn’t the binding constraint, use FP8. If you need INT4 for memory and your model has community AWQ checkpoints, use AWQ. GPTQ survives mostly as a fallback for models without AWQ versions.

What they’re really probing: Whether the candidate can name a real eval workflow, not just cite an algorithm.

One trap interviewers like: candidates who reach for INT8 first. INT8 was the 2023 default; it has been displaced by FP8 on modern hardware (better accuracy, same memory) and by INT4 for the memory-constrained case (much less memory, acceptable accuracy with AWQ). Strong candidates flag this transition rather than recommending INT8 as a still-current default. (Source: GPTQ paper, AWQ paper.)

Tensor parallelism vs pipeline parallelism in vLLM — when would you reach for each?

Concept: model parallelism strategy | Difficulty: senior | Stage: domain deep-dive

Direct answer: Tensor parallelism (TP) splits each layer’s weight matrices across N GPUs and runs all-reduce on activations after each layer, which means high communication volume per token but low latency per layer. Pipeline parallelism (PP) splits the layers across stages and passes activations forward, which means low per-stage communication, but pipeline bubbles introduce latency. Inside a single node with NVLink (8 H100s, 8 H200s, an HGX board), use tensor parallelism, because NVLink bandwidth is high enough to absorb the all-reduce traffic. Across nodes over InfiniBand or Ethernet, prefer pipeline parallelism or hybrid TP-within-node + PP-across-node. vLLM flags: --tensor-parallel-size and --pipeline-parallel-size.

What they’re really probing: Whether the candidate has actually deployed a model bigger than one node, or only read about it.

The senior signal in this answer is naming the interconnect (NVLink, InfiniBand) and the collective library that runs over it (NCCL). Candidates who handwave “use TP” without distinguishing intra-node vs inter-node show they have only deployed in a single-node setting. The 2024-2025 deployment trend has been toward sequence parallelism on top of TP for very long context, and expert parallelism for MoE models like Mixtral and DeepSeek-V3, both of which vLLM supports. (Source: Megatron-LM paper for the original TP/PP framework; vLLM distributed serving docs.)

Production deployment and debugging questions

Production debugging questions separate the engineer who has owned a vLLM deployment from the engineer who has only read about one. The four questions in this category cover the failure modes that show up in real fleets — throughput regression with no obvious cause, tool-calling reliability after the v0.8 update, multi-tenant fairness, and the SLO observability that lets you debug at 3am. (Source: r/LocalLLaMA production threads, vLLM GitHub issue tracker.)

A vLLM deployment is hitting 40% of expected throughput. How do you debug?

Concept: production debugging | Difficulty: senior | Stage: domain deep-dive / system design

Direct answer: Work from the outside in. First, check GPU utilization via nvidia-smi dmon — if it’s pegged at 100%, the problem is elsewhere; if it’s at 40%, you’re feeding the GPU too slowly. Second, check request mix — long prefills interleaved with short decodes will pull throughput down even with chunked prefill enabled, and the metric to watch is num_running_seqs from vLLM’s Prometheus endpoint. Third, check KV cache pressure — if num_swapped_seqs or recompute count is non-trivial, you’re hitting memory and the scheduler is preempting requests. Fourth, check tokenizer throughput — slow Python-side tokenization can starve the GPU on workloads with very short requests.

What they’re really probing: Whether the candidate has an actual debugging playbook or reaches for “increase batch size” reflexively.

The senior tier discriminator is the second-order failures:

  • PCIe bandwidth saturation when running tensor parallelism without NVLink (common in cloud VMs that advertise 8 GPUs but route them through different PCIe roots).
  • NCCL communication overhead during multi-GPU all-reduce, visible only with profiling.
  • Kernel selection regressions across vLLM versions — a v0.6 to v0.7 upgrade silently swapping attention kernels can change throughput in either direction.

Strong candidates name two or three of these without prompting and describe how they would isolate each. (Source: vLLM GitHub issue tracker, NVIDIA Nsight profiling docs.)

How does vLLM handle tool calling and structured output as of v0.8?

Concept: structured output + tool calling | Difficulty: mid to senior | Stage: technical deep-dive

Direct answer: vLLM’s v0.7 and v0.8 releases (early 2025) landed first-class tool calling and structured output, closing the feature gap with proprietary inference APIs that orchestration layers like LangChain consume. Structured output is implemented via guided decoding: the engine constrains token sampling to match a JSON schema, regex, or grammar (lark / EBNF), refusing tokens that would violate the constraint. Tool calling is layered on top: the OpenAI-compatible API exposes tools in the request, and vLLM uses guided decoding to force the model’s output into the expected tool-call format. The flag is --guided-decoding-backend with options like outlines, lm-format-enforcer, and xgrammar.

What they’re really probing: Whether the candidate knows the v0.8 feature by name and can speak to the implementation choice between backends.

The follow-up that tests depth: which backend would you pick? xgrammar (2024) is fastest because grammars are precompiled and the constraint check is a fast finite-automaton step. outlines is more flexible (regex, JSON schema, Pydantic models) but slower for complex grammars. lm-format-enforcer is the middle ground. Production teams usually pick xgrammar for hot-path tool calling, outlines for one-off structured extraction. (Source: vLLM structured output docs, xgrammar paper 2024.)
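The core of guided decoding, masking out tokens the constraint forbids before picking the next one, fits in a toy example. This is a sketch of the general technique and not the code of any backend named above: the seven-token vocabulary, the hand-written automaton, and the greedy selection are all assumptions made for illustration.

```python
import math

# Toy automaton accepting exactly: {"ok":true} or {"ok":false}
TRANSITIONS = {
    0: {"{": 1},
    1: {'"ok"': 2},
    2: {":": 3},
    3: {"true": 4, "false": 4},
    4: {"}": 5},
}
ACCEPT_STATE = 5


def constrained_argmax(logits, state):
    """Pick the highest-scoring token among those the automaton allows."""
    allowed = TRANSITIONS[state]
    masked = {
        token: (score if token in allowed else -math.inf)
        for token, score in logits.items()
    }
    token = max(masked, key=masked.get)
    return token, allowed[token]


def generate(logits_per_step):
    """Greedy decoding under the constraint; stops at the accepting state."""
    state, output = 0, []
    for logits in logits_per_step:
        token, state = constrained_argmax(logits, state)
        output.append(token)
        if state == ACCEPT_STATE:
            break
    return "".join(output)


# The unconstrained model would emit "hello" at every step.
step_logits = {"{": 0.1, '"ok"': 0.2, ":": 0.3, "true": 1.0,
               "false": 0.5, "}": 0.4, "hello": 5.0}
print(generate([step_logits] * 5))
```

The backend choice discussed above is largely about how cheaply the allowed set for each state can be computed over a real vocabulary of a hundred thousand tokens or more, rather than about the masking step itself.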

How would you build multi-tenant fairness on top of vLLM?

Concept: multi-tenant scheduling | Difficulty: senior | Stage: system design

Direct answer: vLLM’s default scheduler is FCFS with priority preemption — it does not natively enforce per-tenant fairness. To add fairness, you have three layers to work with. At the gateway, implement token-bucket rate limiting per tenant before requests reach vLLM. At the orchestrator (Ray Serve, Kubernetes ingress), use weighted round-robin across vLLM replicas with sticky routing so a tenant’s prefix-cache hits are preserved. Inside vLLM, use the priority API to mark requests with tenant priority; the scheduler then preempts lower-priority requests when memory is tight. The combination — rate limit at the edge, weighted routing in the middle, priority preemption at the engine — is what production multi-tenant deployments at Together AI, Anyscale, and Cloudflare Workers AI look like in practice.

What they’re really probing: Whether the candidate can architect across the stack or reaches for one layer.

The senior follow-up asks about isolation. Two tenants routed to the same replica share its KV cache, so a noisy neighbor can evict the other’s cached prefixes. The mitigations are per-tenant cache partitions (vLLM does not currently support this natively, so it needs custom work) or dedicated per-tenant replicas above a usage threshold. Strong candidates flag that fairness in a shared serving system has a hard limit set by the cache-sharing model and that hard isolation requires separate replicas. (Source: Anyscale Ray Serve + vLLM deployment guides; production threads on r/LocalLLaMA.)
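One way to sketch the routing side of this is shown below: tenants above a usage threshold get a dedicated replica, and everyone else is pinned to a shared replica so their cached prefixes stay in one place. The sticky part uses rendezvous hashing, which is a substitute for (not an implementation of) weighted round-robin; the function names, replica names, and threshold are hypothetical.

```python
import hashlib

def _score(tenant, replica):
    """Deterministic pseudo-random score for a (tenant, replica) pair."""
    digest = hashlib.sha256(f"{tenant}|{replica}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def plan_dedicated(usage, threshold, spare_replicas):
    """Assign the heaviest tenants at or above the threshold to spare replicas."""
    heavy = sorted(
        (t for t, used in usage.items() if used >= threshold),
        key=lambda t: -usage[t],
    )
    return dict(zip(heavy, spare_replicas))

def pick_replica(tenant, shared_pool, dedicated=None):
    """Dedicated replica if one is assigned, else a sticky shared replica."""
    dedicated = dedicated or {}
    if tenant in dedicated:
        return dedicated[tenant]
    # Rendezvous hashing: the highest-scoring replica wins for this tenant.
    return max(shared_pool, key=lambda replica: _score(tenant, replica))

# Usage: tokens per minute per tenant (made-up numbers).
usage = {"acme": 900_000, "beta": 40_000, "gamma": 5_000}
dedicated = plan_dedicated(usage, threshold=500_000, spare_replicas=["vllm-d0"])
pool = ["vllm-0", "vllm-1", "vllm-2", "vllm-3"]
acme_replica = pick_replica("acme", pool, dedicated)
beta_replica = pick_replica("beta", pool, dedicated)
```

With rendezvous hashing, removing a replica only remaps the tenants that were pinned to it, so the prefix caches of the other tenants stay warm through scale-down events.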

What metrics would you put on a vLLM SLO dashboard?

Concept: production observability | Difficulty: mid to senior | Stage: system design

Direct answer: Four metric families, with p50, p95, p99 on each. Latency: TTFT (time to first token), ITL (inter-token latency), total request duration. Throughput: tokens-per-second per replica, requests-per-second, active concurrent sequences. Memory pressure: KV cache utilization percent, num_swapped_seqs, num_preemptions, recompute count. Capacity headroom: GPU utilization, GPU memory free, max_num_seqs vs running_seqs ratio. The vLLM Prometheus exporter at /metrics ships most of these natively — you wire it to Prometheus, alert on TTFT p99 or ITL p99 breaches, and use the throughput and memory-pressure metrics for capacity planning.
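As a rough illustration of where a TTFT p99 comes from, the sketch below parses histogram buckets from Prometheus text output and interpolates a quantile in the same spirit as PromQL's histogram_quantile. The sample text is hand-written, the metric name vllm:time_to_first_token_seconds is an assumption to check against your own /metrics output, and the parser ignores the case of multiple label sets.

```python
import math
import re

_BUCKET = re.compile(r'^(?P<name>[^{\s]+)_bucket\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')
_LE = re.compile(r'le="([^"]+)"')

def parse_buckets(text, metric):
    """Return sorted (upper_bound, cumulative_count) pairs for one histogram."""
    buckets = []
    for line in text.splitlines():
        match = _BUCKET.match(line.strip())
        if not match or match.group("name") != metric:
            continue
        le = _LE.search(match.group("labels"))
        if le is not None:
            buckets.append((float(le.group(1)), float(match.group("value"))))
    return sorted(buckets)

def histogram_quantile(q, buckets):
    """Estimate a quantile by linear interpolation inside the matching bucket."""
    if not buckets or buckets[-1][1] == 0:
        return math.nan
    rank = q * buckets[-1][1]
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound        # cap at the highest finite bound
            if count == prev_count:
                return bound             # empty bucket, nothing to interpolate
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound

# Hand-written sample in Prometheus text format (values are illustrative).
SAMPLE = """\
# TYPE vllm:time_to_first_token_seconds histogram
vllm:time_to_first_token_seconds_bucket{le="0.1",model_name="demo"} 50.0
vllm:time_to_first_token_seconds_bucket{le="0.5",model_name="demo"} 90.0
vllm:time_to_first_token_seconds_bucket{le="1.0",model_name="demo"} 99.0
vllm:time_to_first_token_seconds_bucket{le="+Inf",model_name="demo"} 100.0
vllm:time_to_first_token_seconds_count{model_name="demo"} 100.0
"""
ttft_buckets = parse_buckets(SAMPLE, "vllm:time_to_first_token_seconds")
ttft_p50 = histogram_quantile(0.50, ttft_buckets)
ttft_p95 = histogram_quantile(0.95, ttft_buckets)
```

In a real deployment Prometheus does this for you over a rate window; the point of the sketch is that a histogram-derived p99 is only as precise as the bucket boundaries, which matters when an SLO threshold falls in the middle of a wide bucket.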

What they’re really probing: Whether the candidate distinguishes user-facing SLO metrics from operator-facing capacity metrics.

The strongest answers add two more. Cache hit rate for prefix caching: if it dips, your workload shape changed and TTFT will follow. Quantization-related accuracy regression signals: a small canary eval that runs nightly against production traffic samples catches FP8 calibration drift on tail data. Bonus signal: candidates who frame this in agent-loop language (where tool-call latency multiplies across turns) show they understand that modern agentic AI interview questions overlap with vLLM SRE work. (Source: vLLM Prometheus metrics docs; Anyscale + Together production posts.)

Questions to ask the interviewer (the senior-signal set)

The reverse-questions section separates senior candidates from mid candidates. These seven questions signal that you have run vLLM in production, and interviewers note which ones you reach for:

  1. What’s your KV cache hit rate on production traffic? The answer reveals whether the team uses prefix caching seriously and how shared their workload prompts are.
  2. Are you running chunked prefill with a fixed budget or adaptive? Asks about scheduler maturity — adaptive tuning is more advanced.
  3. How do you handle multi-LoRA adapter hot-swap latency in your SLO? Reveals whether the team has shipped multi-LoRA or is still pre-LoRA.
  4. What’s your TTFT p99 versus ITL p99 — and which one does the business actually care about? Tests whether the team has had to defend SLO definitions to product.
  5. Are you considering disaggregated prefill / decode, or staying co-located? Probes how senior the architecture conversation has gotten.
  6. How is the team handling the FP8 to NVFP4 migration on Blackwell? Forward-looking — signals you know what’s coming.
  7. What’s the most painful vLLM upgrade you’ve shipped, and what broke? The behavioral question disguised as technical — interviewers love it because the answer is always specific.

vLLM interview prep: a 7-day reading and lab sequence

This is a concrete sequence, not a motivational closer:

  1. Day 1 — read the 2023 PagedAttention paper (Kwon et al., SOSP 2023). Skip the related-work section; focus on §3 (PagedAttention) and §4 (system design). 90 minutes.
  2. Day 2 — read the v0.6 chunked prefill release notes on the vLLM blog and the corresponding GitHub PR. Read the v0.7 and v0.8 release notes too. Note the dates — interviewers remember timelines.
  3. Day 3 — stand up vLLM on a single GPU. Llama 3 8B fits on a 24 GB consumer card. Try --enable-prefix-caching and --enable-chunked-prefill with a multi-turn prompt mix. Watch the Prometheus metrics.
  4. Day 4 — read the DistServe and Mooncake papers (both 2024). 2-3 hours. You don’t need to remember every number; you need to be able to articulate the prefill-bound vs decode-bound split.
  5. Day 5 — implement a small multi-LoRA serve. Fine-tune two LoRA adapters on different small tasks (a 1B model is fine for the exercise). Serve both through one vLLM replica with --enable-lora. Measure the per-adapter latency overhead.
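  6. Day 6 — read the AWQ and GPTQ papers (skim, not deep) and serve an FP8-quantized 8B model with vLLM’s --quantization fp8 flag. Add --kv-cache-dtype fp8 if you also want the KV cache stored in FP8; that flag on its own does not quantize the weights. Run a small eval (MMLU or your own) and compare to FP16.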
  7. Day 7 — read three r/LocalLLaMA production threads about vLLM deployments that broke. Look for failure modes: PCIe bandwidth, KV cache thrashing, tokenizer overhead, LoRA eviction storms. These war stories are the language interviewers use.
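Before the Day 3 and Day 6 labs, it can help to estimate how much KV cache a 24 GB card actually leaves you. The sketch below is back-of-the-envelope arithmetic, not a vLLM API: the Llama 3 8B shape (32 layers, 8 KV heads, head dimension 128) is from the published model config, while the 0.9 memory utilization fraction and the 1.5 GiB activation overhead are assumptions you should replace with your own numbers.

```python
GIB = 1024 ** 3

def kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes):
    """Bytes of KV cache per token: keys and values for every layer."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes

def kv_token_capacity(gpu_gib, utilization, weight_bytes, overhead_gib, per_token):
    """Tokens of KV cache that fit after weights and runtime overhead."""
    free = gpu_gib * GIB * utilization - weight_bytes - overhead_gib * GIB
    return max(0, int(free // per_token))

LLAMA3_8B = dict(layers=32, kv_heads=8, head_dim=128)
weight_bytes = 8.03e9 * 2  # about 8.03B parameters at 2 bytes each (BF16)

per_token_fp16 = kv_bytes_per_token(**LLAMA3_8B, dtype_bytes=2)  # 131072 bytes
per_token_fp8 = kv_bytes_per_token(**LLAMA3_8B, dtype_bytes=1)   # 65536 bytes

# Assumed: 24 GiB card, 90% usable by the engine, 1.5 GiB runtime overhead.
capacity_fp16 = kv_token_capacity(24, 0.9, weight_bytes, 1.5, per_token_fp16)
capacity_fp8 = kv_token_capacity(24, 0.9, weight_bytes, 1.5, per_token_fp8)
print(per_token_fp16, per_token_fp8, capacity_fp16, capacity_fp8)
```

Under these assumptions the FP16 KV cache holds on the order of 40k tokens across all concurrent sequences, and an FP8 KV cache roughly doubles that. That budget is why the Day 3 multi-turn prompt mix can push a single consumer card into preemption quickly.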

Skip generic interview-prep advice for this loop. The vLLM-using teams hire for depth on the engine, the v0.6-v0.8 release timeline, and the production failure surface. The candidate who has spent 7 days as above will outperform the candidate who has spent 30 days on STAR-method drilling.
