DSPy Interview Questions (2026): MIPROv2, Compile vs Runtime, and Production Failure Modes

DSPy’s interview footprint changed in late 2024 and never looked back. The DSPy 3.0 release at the 2025 Databricks Data + AI Summit baked native MLflow tracing into every compile run, and the DSPy 3.2.1 release on May 5, 2026 stabilized 108 versions of optimization machinery into something senior engineers expect candidates to reason about.

The MIPROv2 paper (Opsahl-Ong et al., June 2024) introduced the optimizer that replaced hand prompt engineering at JetBlue and inside several Databricks R&D groups, and the GEPA optimizer paper followed in July 2025 with reflective prompt evolution as an alternative. That shift is what interviewers now test for at Mistral, Databricks, Anthropic-adjacent labs, and Stanford-NLP-adjacent startups.

This guide covers what those interviews probe in 2026 — the declarative-vs-imperative paradigm that separates DSPy from LangChain at a layer level, the three-stage internals of MIPROv2 that senior loops walk through line by line, and the production failure modes practitioners on r/LocalLLaMA and the Databricks blog flagged before the docs caught up (LM-swap demo transfer breakage, non-English locale issues, and the inspect_history() debug workflow).

Why DSPy interviews shifted in 2024-2026: from teleprompters to MIPROv2 and Databricks DBRX

DSPy interview questions in 2026 anchor on three release cycles: the April 8, 2024 Databricks integration that opened DBRX, Mixtral-8x7B, and Llama 2 70B as first-class compile targets; the June 2024 MIPROv2 paper by Opsahl-Ong, Khattab, Singhvi and colleagues that gave the discrete-search optimizer its current shape; and the DSPy 3.0 release at DAIS 2025 that wired MLflow observability directly into compile runs (Databricks engineering, April 8, 2024). Together they shifted the interview question stack from “what is a teleprompter?” trivia toward “walk me through the three stages of MIPROv2 and tell me where each one breaks in production.”

The other shift is who’s interviewing for DSPy literacy. Until early 2024, DSPy was largely a Stanford NLP research artifact. By DAIS 2025, JetBlue was using DSPy in production for customer feedback classification and RAG-powered predictive maintenance chatbots (Databricks DAIS, 2024), and Databricks’ own ML teams were running internal DSPy stacks.

That production footprint changed what senior interviewers probe: not the surface API, but the optimizer-choice tradeoffs, the dev-vs-prod LM mismatch, and the cost ceiling of running MIPROv2 against GPT-4-class compile targets.

Declarative paradigm foundations: signatures, modules, and the compile-vs-runtime split

The foundations bucket is where junior-to-mid DSPy candidates either earn or lose the room. Interviewers expect you to articulate the declarative-programming framing — that LM calls are compiled artifacts, not hand-tuned prompt strings — and to walk a signature, module, and compile step at a level that proves you’ve actually run optimizer.compile() end-to-end.

The four questions below match the order most loops use:

  • Paradigm framing first — articulate “programming, not prompting” in one sentence.
  • Signature mechanics next — what does context, query -> answer do internally?
  • Module choice — when do you reach for Predict versus ChainOfThought?
  • Compile-vs-runtime boundary — as a senior-engineer mental model anchor.

What do interviewers mean when they say DSPy is “programming, not prompting” foundation models?

Concept: declarative paradigm framing | Difficulty: junior | Stage: recruiter / technical

Direct answer: “Programming, not prompting” is DSPy’s framing for the shift from hand-tuned prompt strings to compiled declarative modules. In DSPy you write a typed signature like question -> answer: str, wrap it in a module such as dspy.ChainOfThought, define a metric function, build a small trainset, and call optimizer.compile() — the optimizer then generates few-shot demonstrations or refined instructions automatically. The compiled program becomes a deterministic artifact you save and load, replacing the brittle prompt strings that most LangChain-era pipelines glued together by hand. DSPy founder Omar Khattab frames it as treating LM calls as parameterized graph nodes that learn from data, the same way deep-learning frameworks treat tensor ops (Khattab et al., ICLR 2024).

What they’re really probing: whether you can articulate the paradigm shift in one sentence without falling back on feature-list jargon. The trap is reciting “DSPy has signatures and modules” without explaining what changed.

Most senior interviewers want to hear you contrast it with LangChain’s imperative chain-of-calls pattern. The right phrase to drop is “compiled prompts versus hand-tuned strings.” If you can name MIPROv2 or BootstrapFewShot as the compilers that do this work, you’ve answered at mid-level. If you can describe how the optimizer’s metric function turns prompt engineering into a search problem, that’s a senior signal — see the official optimizer docs for the canonical phrasing.
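The workflow in the direct answer can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not canonical DSPy usage: the model name, the file name qa_compiled.json, and the helper names exact_match and compile_qa are illustrative, and dspy is imported lazily so the metric can be exercised without the library or an API key.

```python
def exact_match(example, pred, trace=None):
    """Metric: 1.0 when the predicted answer matches the label (case/space-insensitive)."""
    return float(example.answer.strip().lower() == pred.answer.strip().lower())


def compile_qa(trainset, model="openai/gpt-4o-mini", path="qa_compiled.json"):
    """Compile a ChainOfThought program against `trainset` and save the artifact.

    `trainset` is assumed to be a list of
    dspy.Example(question=..., answer=...).with_inputs("question").
    """
    import dspy  # lazy import: requires the dspy package and provider credentials
    from dspy.teleprompt import BootstrapFewShot

    dspy.configure(lm=dspy.LM(model))                 # assumed model name
    program = dspy.ChainOfThought("question -> answer: str")
    optimizer = BootstrapFewShot(metric=exact_match)  # the "compiler" in this sketch
    compiled = optimizer.compile(program, trainset=trainset)
    compiled.save(path)                               # artifact you can version
    return compiled
```

The metric is the only hand-written logic here; the prompt text itself never appears in the source, which is the point the "search problem" framing makes.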

Walk me through what a DSPy signature like context, query -> answer does internally.

Concept: signature mechanics | Difficulty: junior | Stage: technical

Direct answer: A DSPy signature like context, query -> answer is a typed contract that tells the framework what fields to extract from the LM’s output and how to populate the prompt. Internally, DSPy turns it into a structured prompt template with input field labels (context, query) and output field labels (answer), then parses the LM response by field. You can express the same signature as a class with dspy.InputField() and dspy.OutputField() when you need typed constraints or per-field descriptions. The signature is what the optimizer mutates during compile — it can rewrite the instruction prefix, add few-shot demos, or refine field descriptions without you touching prompt text directly (dspy.ai docs).

What they’re really probing: whether you treat signatures as the unit of optimization rather than as cosmetic syntax. Candidates who reach for hand-written prompts inside the module’s forward() are flagged immediately.

The follow-up most interviewers ask is what happens when the LM returns a malformed output. The right answer references DSPy’s parsing layer — the signature’s field structure is what the runtime uses to detect a parse failure, and typed output fields (answer: float versus the untyped default) constrain the LM enough to recover. This is the practitioner gotcha behind a Reddit-reported failure: untyped signatures lead to brittle parsing in production, especially on smaller LMs.
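To make the parse-failure point concrete, here is a toy illustration of how a field contract lets a runtime detect malformed output. It is not DSPy's actual parsing layer; parse_signature and parse_response are hypothetical helpers, and the "label: value" response format is an assumption made for the example.

```python
def parse_signature(signature):
    """Split 'a, b -> c: float' into typed input and output field lists."""
    inputs, outputs = signature.split("->")

    def fields(part):
        parsed = []
        for raw in part.split(","):
            name, _, type_name = raw.strip().partition(":")
            parsed.append((name.strip(), type_name.strip() or "str"))  # untyped means str
        return parsed

    return fields(inputs), fields(outputs)


def parse_response(text, output_fields):
    """Read 'label: value' lines; raise ValueError on missing or mistyped fields."""
    casters = {"str": str, "int": int, "float": float}
    found = {}
    for line in text.splitlines():
        label, sep, value = line.partition(":")
        if sep:
            found[label.strip().lower()] = value.strip()
    parsed = {}
    for name, type_name in output_fields:
        if name.lower() not in found:
            raise ValueError(f"missing output field: {name}")
        parsed[name] = casters[type_name](found[name.lower()])  # ValueError if uncastable
    return parsed
```

With a typed field such as answer: float, a chatty response that never emits the field, or emits a non-numeric value, fails loudly instead of flowing downstream as a string.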

What is the difference between dspy.Predict and dspy.ChainOfThought, and when would you pick each?

Concept: module choice | Difficulty: mid | Stage: technical

Direct answer: dspy.Predict is the baseline module that emits a signature’s output fields directly from the LM. dspy.ChainOfThought is a thin wrapper that adds a reasoning field (named rationale in older releases) to the signature before the output fields, so the LM emits a reasoning trace before the final answer. Pick Predict for classification, extraction, or any task where intermediate reasoning would just bloat tokens with no metric lift; pick ChainOfThought for arithmetic, multi-hop reasoning, or anything where the reasoning itself helps the LM land a better answer. Both are subclasses of dspy.Module, both work with every optimizer in the DSPy lineup, and both are interchangeable at the optimizer-API level: you can swap one for the other without changing the compile call (dspy.ai docs).

What they’re really probing: whether you understand that ChainOfThought is mechanically built on top of Predict — not a separate code path — and whether you can defend the cost tradeoff.

The senior follow-up is usually whether you’d reach for dspy.ProgramOfThought (which generates and executes Python code) or dspy.ReAct (which combines reasoning with tool calls). The module-choice pattern to articulate:

  • ProgramOfThought for arithmetic and structured-output problems.
  • ReAct for tool-using agents.
  • ChainOfThought for everything else where intermediate reasoning helps.

A practitioner pattern flagged in the Lovelytics consulting blog: start with Predict, instrument with MLflow, and only switch to ChainOfThought when the metric improves enough to justify the token cost.
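The two ideas in this answer, that ChainOfThought extends the signature and that the switch has to pay for its tokens, can be sketched in plain Python. Both helpers are hypothetical, the field name reasoning and the lift-per-1k-tokens threshold are assumptions, and none of this is DSPy's own code.

```python
def with_reasoning(signature, field="reasoning"):
    """Toy version of the ChainOfThought idea: put a reasoning field before the outputs."""
    inputs, outputs = (part.strip() for part in signature.split("->"))
    return f"{inputs} -> {field}, {outputs}"


def cot_is_worth_it(predict_score, cot_score, predict_tokens, cot_tokens,
                    min_lift_per_1k_tokens=0.01):
    """Switch only when the metric lift justifies the extra tokens per call.

    The threshold is an assumed budget knob, not a DSPy setting.
    """
    lift = cot_score - predict_score
    extra_tokens = cot_tokens - predict_tokens
    if lift <= 0:
        return False           # no metric gain, keep Predict
    if extra_tokens <= 0:
        return True            # better and not more expensive
    return lift / (extra_tokens / 1000) >= min_lift_per_1k_tokens
```

In practice the scores would come from a held-out evaluation and the token counts from your tracing layer, measured per call on the same devset.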

Explain DSPy’s compile-vs-runtime split as if I were a senior engineer who has only used LangChain.

Concept: compilation boundary | Difficulty: mid-senior | Stage: technical / system-design

Direct answer: In LangChain, the prompts you write are what run in production — every call to the LM uses the strings you typed. In DSPy, the prompts you write are signature contracts, and what runs in production is the compiled artifact from optimizer.compile(program, trainset=trainset, metric=metric). Compile is the one-time step that mutates the program against a metric and trainset; runtime is the deterministic execution of the compiled artifact against new inputs. You save the compiled program to disk and version it alongside your code, the same way you’d version a fine-tuned model checkpoint. This is the load-bearing distinction: DSPy treats prompts as the output of an optimization pass, not as source code (Khattab et al., ICLR 2024).

What they’re really probing: whether you treat the compiled artifact as a versioned asset, or whether you mentally collapse compile and runtime into “just calling the LM.”

The follow-up is almost always operational. Production teams cache MIPROv2-compiled programs aggressively and version them in MLflow alongside the LM, since a 300-example trainset against GPT-4-class models can run for hours and cost real money. The practitioner gotcha is that compiled programs are LM-specific — demos optimized against GPT-4 degrade on a smaller LM at runtime.
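One way to act on "compiled programs are LM-specific" is to make the LM part of the cache key for the compiled artifact. The sketch below is framework-free and hedged: compile_cache_key and load_or_compile are hypothetical helpers, the artifact is assumed to be JSON-serializable state, and a real setup would more likely register the artifact in MLflow than write it to a local directory.

```python
import hashlib
import json
from pathlib import Path


def compile_cache_key(signature, lm_name, optimizer, trainset):
    """Hash everything the compiled artifact depends on, including the LM."""
    payload = json.dumps(
        {"signature": signature, "lm": lm_name,
         "optimizer": optimizer, "trainset": trainset},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def load_or_compile(cache_dir, key, compile_fn):
    """Return the cached artifact for `key`; run the expensive compile only on a miss."""
    path = Path(cache_dir) / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    artifact = compile_fn()  # one-time compile step; must return JSON-serializable state
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(artifact), encoding="utf-8")
    return artifact
```

Because the LM name feeds the hash, swapping the runtime LM produces a different key and forces a recompile rather than silently reusing demos tuned for another model.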

Optimizer three-stage internals: the MIPROv2 walkthrough senior interviewers actually probe

This is the section that separates candidates who have only used DSPy from candidates who’ve debugged it under cost pressure. Senior interviewers at Databricks, JetBlue, and Stanford-adjacent labs use the optimizer walkthrough as a wedge — can you name the three MIPROv2 stages, defend the data-size rule of thumb for picking BootstrapFewShot versus MIPROv2 versus BootstrapFinetune, and articulate why “teleprompter” still appears in some docs? The four questions below are the canonical loop for this bucket.

When would you reach for BootstrapFewShot vs MIPROv2 vs BootstrapFinetune?

Concept: optimizer selection | Difficulty: mid | Stage: technical

Direct answer: Pick by trainset size and what you’re optimizing. With 10 or fewer labeled examples, use BootstrapFewShot — it uses a teacher module to generate candidate demos from the trainset itself. Around 50 examples, switch to BootstrapFewShotWithRandomSearch for cheap demo-combination search. At 300+ examples, reach for MIPROv2, which jointly optimizes instructions and few-shot examples via Bayesian optimization. Use BootstrapFinetune when you can afford to update LM weights and you have hundreds of examples (official optimizer docs). The rule of thumb is community-stable but the boundaries are softer than the numbers suggest — Weaviate’s DSPy optimizer comparison walks through cases where the boundaries flex.

What they’re really probing: whether you’d jam BootstrapFewShot onto a 500-example trainset (wrong tool) or whether you’d reach for MIPROv2 on 8 examples (Bayesian search needs more signal than that).

Senior follow-ups usually involve GEPA (Reflective Prompt Evolution), introduced in the July 2025 paper. GEPA mutates prompts using textual feedback from the metric — not just a scalar score — so reach for it when your metric naturally produces explanatory text. The wrong answer in any senior loop is defaulting to one optimizer for every task.
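The rule of thumb above can be written down as a small decision function. This is only a sketch of the heuristic as stated in this guide: pick_optimizer is a hypothetical helper, the exact cutoffs between the named sizes are assumptions (the text itself notes the boundaries are soft), and the function returns optimizer names rather than instantiating anything.

```python
def pick_optimizer(n_examples, can_finetune=False, metric_returns_text_feedback=False):
    """Map trainset size and constraints to an optimizer name (heuristic only)."""
    if metric_returns_text_feedback:
        return "GEPA"                              # metric yields explanatory text
    if can_finetune and n_examples >= 300:         # "hundreds" read as 300+ (assumption)
        return "BootstrapFinetune"
    if n_examples >= 300:
        return "MIPROv2"                           # enough signal for Bayesian search
    if n_examples >= 50:
        return "BootstrapFewShotWithRandomSearch"  # cheap demo-combination search
    return "BootstrapFewShot"                      # small trainsets
```

For example, pick_optimizer(8) returns "BootstrapFewShot" and pick_optimizer(500) returns "MIPROv2", which matches the two wrong-tool cases the interviewer is probing for.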

MIPROv2 has three stages — bootstrapping, grounded proposal, discrete search. Walk me through each and tell me where it can break.

Concept: optimizer internals | Difficulty: senior | Stage: technical / system-design

Direct answer: MIPROv2’s three stages are (1) bootstrapping — run the unoptimized program over the trainset and collect successful traces as candidate few-shot demos; (2) grounded proposal — sample instruction candidates from a proposer LM, grounded in the program’s structure, the bootstrapped demos, and dataset summaries; (3) discrete search — run mini-batch evaluations across (instruction, demo-set) combinations and update a Bayesian surrogate model to find the best joint configuration. Each stage has a documented failure mode: bootstrapping fails when the unoptimized program scores too poorly to produce any usable traces; grounded proposal fails when the dataset summaries are too generic to ground meaningful instructions; discrete search fails when the search space is too large for the evaluation budget and the surrogate model can’t converge (MIPROv2 API docs).

What they’re really probing: whether you can localize a debugging hypothesis to a specific MIPROv2 stage, not just say “the optimizer isn’t improving the score.”

The walkthrough above is from the Optimizing Instructions and Demonstrations paper (Opsahl-Ong et al., June 2024); the December 2024 comparative study at arxiv 2412.15298 benchmarked MIPROv2 against COPRO and BootstrapFewShot and found MIPROv2 showed the strongest correlation with human evaluation. The practitioner debug pattern is to fail forward through the stages with verbose=True — check bootstrapping for successful traces, then proposer variance, then whether discrete search actually explored.
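The stage-by-stage debug pattern can be captured as a small triage function. This is a sketch, not a DSPy API: diagnose_mipro_run is hypothetical, its inputs are numbers you would read off the verbose logs by hand, and the min_coverage threshold is an assumption.

```python
def diagnose_mipro_run(n_traces, instructions, n_demo_sets, n_trials, min_coverage=0.25):
    """Point at the first MIPROv2 stage whose symptoms show up in the run stats.

    n_traces:     successful bootstrapped traces collected in stage 1
    instructions: instruction candidates proposed in stage 2
    n_demo_sets:  candidate demo sets available to the search
    n_trials:     (instruction, demo-set) configurations evaluated in stage 3
    """
    if n_traces == 0:
        return "bootstrapping: no successful traces; the unoptimized program scores too poorly"
    distinct = len({text.strip().lower() for text in instructions})
    if distinct <= 1:
        return "grounded proposal: instruction candidates show no variance"
    search_space = distinct * max(n_demo_sets, 1)
    if n_trials < min_coverage * search_space:
        return "discrete search: trial budget is too small for the search space"
    return "no stage-level red flag; inspect the metric next"
```

The checks run in stage order on purpose: a later stage cannot be judged fairly if an earlier one produced nothing to work with.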

Why does the DSPy library still call optimizers “teleprompters” in some places, and why was the rename done?

Concept: framework history | Difficulty: mid | Stage: technical / culture

Direct answer: “Teleprompter” was the original name in DSP and early DSPy releases; the metaphor was that the framework prompts the LM the way a teleprompter feeds lines to a speaker. The community renamed them to optimizers around the DSPy 2.x release train because the teleprompter framing understated what they actually do: they’re search algorithms that mutate program parameters (instructions, few-shot demos, sometimes LM weights) against a metric, much closer in spirit to PyTorch optimizers than to a prompt-feed UI. The old name still appears in module paths: optimizers are still imported from the dspy.teleprompt package, which keeps older code working, and a fair amount of community tutorial content from 2023-2024 still uses the old vocabulary (stanfordnlp/dspy README).

What they’re really probing: whether you’ve been following DSPy long enough to remember the teleprompter era, or whether you only read the current docs.

Senior interviewers at Stanford-adjacent labs and Databricks use this as a low-stakes culture probe. Treating the rename as cosmetic is the wrong answer: it reflects a maturity shift, away from a prompt-engineering tool and toward a compilation framework. If you can name the rough timeframe (the late 2024 transition window across the DSPy 2.4 / 2.5 train) and point out that optimizers are still imported from dspy.teleprompt, you’ve answered at the right depth.

What is dspy.BetterTogether and when would you compose optimizers?

Concept: optimizer composition | Difficulty: senior | Stage: technical / system-design

Direct answer: dspy.BetterTogether is a meta-optimizer that runs optimization passes sequentially against the same program. In the DSPy codebase it pairs a prompt optimizer with a weight optimizer (fine-tuning) and sequences them, so each pass starts from what the previous one produced. The same layering idea applies when you chain prompt-only optimizers by hand: bootstrap demonstrations first with BootstrapFewShot, then pass the compiled program to MIPROv2 or COPRO to refine instructions on top, which can converge faster than running MIPROv2 from scratch against an unprimed program. Reach for composition when you have enough trainset to support both passes (typically 100+ examples), your compile budget allows a longer optimization window, and single-optimizer scores have plateaued. The composition order itself matters: bootstrapping first establishes demo quality, then instruction-tuning refines what frame the LM applies to those demos (official optimizer docs).

What they’re really probing: whether you’d treat optimizer choice as a one-shot decision or as a pipeline you can layer.

The senior-level follow-up is whether you’ve seen BetterTogether beat a single-optimizer run in practice. The honest answer is that it depends on the metric — for sharp metrics that benefit from both demo and instruction tuning, composition wins; where instruction quality dominates, running MIPROv2 alone is cheaper. The defensible interview position is that composition is a tool you reach for when single-optimizer scores plateau, not a default.

Production failure modes: what Databricks and senior NLP teams probe after deployment

The production failure-modes bucket is where senior loops separate candidates who’ve shipped DSPy from candidates who’ve only run it in a notebook. The four questions below map to the failure modes Databricks and JetBlue teams have documented at DAIS sessions and that practitioners on r/LocalLLaMA have flagged before the official docs caught up.

Each question has a debugging-rooted answer. If you can name inspect_history(), MLflow tracing, and the dev-vs-prod LM mismatch as your default toolkit, you’re answering at the level the room expects.

How does DSPy handle non-English use cases, and what is the production workaround?

Concept: locale failure mode | Difficulty: senior | Stage: technical / system-design

Direct answer: DSPy’s default signatures and module templates are English-first, and the proposer LM that MIPROv2 uses for grounded proposal will draft instructions in English unless explicitly configured otherwise. Practitioners report that BootstrapFewShot handles non-English trainsets reasonably well because the bootstrapped demos inherit the language of the trainset, but MIPROv2’s instruction-mutation stage often produces English instructions that don’t match the dataset language. The result is a compiled program with English instructions wrapping non-English demos. The production workaround is to explicitly specify the proposer language via custom signature instructions or to pass a localized instruction prompt template that the proposer extends rather than rewrites.

What they’re really probing: whether you’ve actually compiled a non-English DSPy program, or whether you assume the framework is locale-neutral by default.

This is one of the failure modes flagged in r/LocalLLaMA threads where engineers running DSPy against Spanish, Hindi, and Mandarin trainsets reported instruction-mutation regressions. The Databricks blog covers the recommended workaround for production: pin the proposer instructions explicitly and use a metric that penalizes language drift. The interview signal is whether you treat locale as an active design constraint or as an afterthought — Anthropic-adjacent and Mistral-adjacent loops weight this heavily because their target deployments span languages.
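A metric that penalizes language drift can be sketched with the standard library alone. This is a crude illustration under loud assumptions: script_ratio and penalize_language_drift are hypothetical helpers, and checking Unicode script names only separates languages with different scripts (Hindi versus English, say). It cannot tell Spanish from English, where a real language-ID model would be needed.

```python
import unicodedata


def script_ratio(text, script):
    """Fraction of alphabetic characters whose Unicode name starts with `script`."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    hits = sum(unicodedata.name(ch, "").startswith(script) for ch in letters)
    return hits / len(letters)


def penalize_language_drift(base_metric, script, min_ratio=0.8):
    """Wrap a metric so answers that drift out of the target script score 0.0."""
    def metric(example, pred, trace=None):
        score = base_metric(example, pred, trace)
        if script_ratio(pred.answer, script) < min_ratio:
            return 0.0  # right content in the wrong language still fails
        return score
    return metric
```

The wrapper keeps the (example, pred, trace=None) shape that DSPy metrics use, so the optimizer sees language drift as a scored failure rather than something you catch by eye after compile.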

When a DSPy program produces unexpected outputs, what is your debug workflow?

Concept: debug workflow | Difficulty: mid-senior | Stage: technical

Direct answer: The DSPy debug workflow has three layers. First, call lm.inspect_history(n=5) to dump the last n LM calls — input prompts, raw outputs, and parse results. Second, run with dspy.settings.configure(trace=[]) and inspect the trace list for module-level decisions and intermediate outputs. Third, if you’re on DSPy 3.0 or later, the MLflow integration emits structured traces per compile step and per runtime call, queryable by signature and module — this is the production-grade view that replaces ad-hoc print debugging. The community-cited workhorse is inspect_history() because it shows the exact prompt the LM saw, including bootstrapped demos that the optimizer inserted (Databricks DAIS 2025 DSPy 3.0 session).

What they’re really probing: whether you reach for print statements (junior) or for the structured trace tooling (mid-senior).

The follow-up most interviewers ask is what you do when inspect_history() shows well-formed output but the metric still scores it badly. The right answer points at the metric itself — DSPy’s optimizers are only as good as the signal you give them, and the most common debug session ends in rewriting the metric, not tuning the optimizer. The Lovelytics blog notes this is the failure mode that catches DSPy newcomers most often.
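A cheap first check when the metric is the suspect is to score the gold labels against themselves: if a perfect prediction does not get full marks, the metric is broken before the optimizer ever runs. The helper names below are hypothetical and the buggy metric is contrived for illustration.

```python
def audit_metric(metric, examples):
    """Return the examples whose own gold answer fails to earn a full score."""
    return [ex for ex in examples if metric(ex, ex) < 1.0]


def buggy_metric(example, pred, trace=None):
    # Bug: only the label is lowercased, so any capitalized gold answer can never match.
    return float(example.answer.lower() == pred.answer)


def fixed_metric(example, pred, trace=None):
    # Normalize both sides the same way before comparing.
    return float(example.answer.strip().lower() == pred.answer.strip().lower())
```

Running audit_metric over the devset before compiling costs nothing in LM calls and rules out the class of failures where well-formed output is scored as wrong.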

Show me how you would build a multi-hop RAG system on Databricks using DSPy and DBRX.

Concept: production architecture | Difficulty: senior | Stage: system-design / coding

Direct answer: The canonical DSPy-on-Databricks multi-hop RAG stack composes a dspy.Retrieve module wired to Databricks Vector Search, a dspy.ChainOfThought hop module wrapped around a signature like context, query -> sub_question, a final answer module with signature context, query -> answer, and an LM client configured via dspy.Databricks pointing at DBRX Instruct or Mixtral-8x7B Instruct on Databricks Model Serving. You compose the modules in a custom dspy.Module subclass whose forward() calls retrieve → hop → answer in sequence, then compile against a metric using a HotPotQA-style trainset. The Databricks integration blog walks through this stack end-to-end and reports up to 46% improvement over hand-written few-shot prompts on GPT-3.5-class targets (Databricks engineering, April 8, 2024).

What they’re really probing: whether you can name the integration components by their actual API names and articulate the metric design, not just the module composition.

The senior follow-up is usually about observability. The right answer references MLflow Tracing for per-module spans, the MLflow Model Registry for versioning the compiled DSPy program alongside the DBRX model, and Databricks Lakehouse Monitoring for drift on the retrieved-context distribution. Naming MLflow as the observability spine — and explaining why compiled programs need versioning as model artifacts — lands the answer at the right depth.
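The retrieve, hop, answer control flow can be sketched without any framework so the sequencing is easy to see. This is not the Databricks reference implementation: MultiHopRAG here is a plain class with injected callables, whereas in DSPy it would subclass dspy.Module and the three callables would be a retriever and two ChainOfThought modules. The re-retrieval on the sub-question and the max_hops parameter are assumptions.

```python
class MultiHopRAG:
    """Framework-free skeleton of a multi-hop RAG forward pass."""

    def __init__(self, retrieve, hop, answer, max_hops=2):
        self.retrieve = retrieve  # query -> list of passages
        self.hop = hop            # (context, query) -> sub_question
        self.answer = answer      # (context, query) -> answer
        self.max_hops = max_hops

    def forward(self, query):
        context = list(self.retrieve(query))  # first hop uses the raw query
        for _ in range(self.max_hops - 1):
            sub_question = self.hop(context=context, query=query)
            for passage in self.retrieve(sub_question):
                if passage not in context:     # de-duplicate across hops
                    context.append(passage)
        return self.answer(context=context, query=query)
```

Keeping the hop and answer steps as separate callables mirrors the two signatures in the direct answer, which is what lets an optimizer tune each step's prompt independently.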

A practitioner on r/LocalLLaMA said DSPy’s compiled prompts “are not generalizable beyond the training samples.” How do you respond to that critique in an interview?

Concept: framework critique | Difficulty: senior | Stage: technical / culture

Direct answer: The critique is partially valid and worth engaging directly. DSPy’s bootstrapped demonstrations are sampled from the trainset, so a compiled program optimizes for the distribution it saw; when production traffic shifts away from that distribution, the demos can become less helpful or actively misleading. The defensible response in an interview: this is a generalization problem, not a DSPy-specific one, and the production discipline is to (a) split trainset / devset / testset and only trust scores on held-out data, (b) re-compile when the production distribution drifts, and (c) prefer MIPROv2’s instruction-mutation over BootstrapFewShot’s demo-stacking when the trainset is small and demos are likely to overfit. The framework gives you the tools; the discipline is yours to apply.

What they’re really probing: whether you can engage a real practitioner critique without dismissing it and without conceding the framework is broken.

This question tests whether candidates know what compiled DSPy programs actually contain — bootstrapped demos that look like the trainset — versus an idealized view of “optimized prompts that just work everywhere.” A defensible answer names both the generalization risk and the mitigation toolkit (held-out evaluation, periodic recompile, optimizer-choice driven by trainset size) rather than waving the critique away.
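The held-out discipline in mitigation (a) comes down to a deterministic split plus a gap check. The sketch below uses hypothetical helper names and assumed split fractions; score_fn stands in for running your compiled program on one example and applying the metric.

```python
import random


def split_examples(examples, train_frac=0.6, dev_frac=0.2, seed=0):
    """Shuffle deterministically, then cut into train / dev / test partitions."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_dev = int(len(shuffled) * dev_frac)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_dev],
            shuffled[n_train + n_dev:])


def mean_score(score_fn, examples):
    """Average per-example score over a non-empty list."""
    return sum(score_fn(ex) for ex in examples) / len(examples)


def generalization_gap(score_fn, train, heldout):
    """Train score minus held-out score; a large gap suggests the demos overfit."""
    return mean_score(score_fn, train) - mean_score(score_fn, heldout)
```

A program that has effectively memorized its trainset shows a gap near 1.0, which is the concrete version of the r/LocalLLaMA critique; a small gap is the evidence you would cite against it.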

DSPy vs LangChain: the philosophical question senior interviewers use as a wedge

This is the wedge bucket. Senior interviewers use the DSPy-vs-LangChain question not to test framework allegiance but to probe whether you understand they sit at different layers — one is a compilation framework that optimizes LM-call internals, the other is an orchestration framework that glues components together. Candidates who say “DSPy is just LangChain but better” fail the question; candidates who articulate the layering and acknowledge both have a place pass it. The two questions below cover the philosophical framing and the practical migration path.

When would you choose DSPy over LangChain? Frame it as a philosophical question, not a feature comparison.

Concept: framework layering | Difficulty: senior | Stage: technical / culture

Direct answer: DSPy and LangChain solve different problems at different layers. LangChain is an imperative orchestration framework — you wire together retrievers, tools, memory modules, and LM calls into a chain, and the chain runs the way you wrote it. DSPy is a declarative compilation framework — you describe LM calls as typed signatures, define a metric, and an optimizer compiles the prompts and weights against that metric. Choose DSPy when your bottleneck is prompt quality and you have a metric to optimize against; choose LangChain when your bottleneck is component integration and you need orchestration plumbing. Many production systems use both: LangChain for the outer orchestration shell, DSPy for the inner LM calls that need optimization.

What they’re really probing: whether you can articulate framework choice without falling into “X versus Y” tribalism.

The trap is treating them as direct competitors when they actually sit at different layers in the same stack. The defensible position is to name the layer each operates at and frame choice as a function of which layer your bottleneck lives in. Naming a concrete production pattern — LangChain agents whose tool-call prompts are DSPy-compiled artifacts — lands the senior-level answer.

How would you migrate a LangChain RAG pipeline to DSPy? What changes structurally?

Concept: migration mechanics | Difficulty: senior | Stage: system-design

Direct answer: The structural changes are concentrated in three places. First, hand-written prompt strings inside LangChain prompt templates become DSPy signatures: "Given context, answer the question..." becomes context, question -> answer. Second, retrieval stays largely the same: LangChain retrievers and DSPy Retrieve modules wrap the same vector stores, so you mostly swap the call surface. Third, you add a compile step with a metric and trainset; this is the new piece that LangChain doesn’t have an equivalent for. The LangGraph orchestration shell can stay if you want it; you just replace the LM-call leaves with DSPy modules and add the compile step to your CI pipeline so the compiled artifact is versioned alongside the code.

What they’re really probing: whether you’d rip out LangChain entirely (often the wrong move) or whether you’d surgically replace the LM-call layer while keeping the orchestration.

The follow-up is what you’d gain from the migration: automated prompt optimization replacing hand-tuned strings, structured signatures replacing brittle string parsing, and a compile step that yields an auditable artifact. The cost: compile time, compile money, and a new mental model. Framing it as replacing the LM-call layer while keeping the orchestration lands the nuance.

Common candidate red flags DSPy interviewers screen for

Before the reverse-questions section, a quick inventory of tells that mark a candidate as DSPy-shallow in any senior loop. If you catch yourself heading toward any of these, course-correct mid-sentence:

  • Framing DSPy as a LangChain alternative — signals you haven’t internalized the layering distinction.
  • Hand-writing prompts inside a DSPy Module’s forward() — defeats the whole framework.
  • Using BootstrapFewShot with a 500-example trainset — wrong tool for the size; should be MIPROv2.
  • Not specifying typed output fields in the Signature — leads to brittle parsing in prod, especially on smaller LMs.
  • Not saving the compiled program — recompiling every run wastes money and time.
  • Treating the optimizer’s score as the production metric without a held-out set — overfitting risk that bites at deploy time.

Questions to ask the interviewer about their DSPy stack

The reverse-questions section is where you signal seniority. The seven below are the ones that work in DSPy loops at production-mature shops — each one extracts something specific about the team’s actual stack, none are generic “what’s the culture like” filler.

  • What’s your team’s compile-time budget per optimization run, and how do you cache compiled programs across model upgrades?
  • Are you using DSPy’s MLflow tracing in production, or rolling your own observability layer?
  • How do you handle the LM-swap problem — when a compiled program needs to migrate from one LM provider to another?
  • What does your evaluation harness look like? Are you using DSPy’s evaluate.Evaluate or something custom?
  • Have you experimented with GEPA on this codebase? What’s been the verdict versus MIPROv2 so far?
  • How do you version optimized DSPy programs alongside model versions — is it part of your MLflow registry, or a separate artifact store?
  • What’s the most surprising bug you’ve hit in a DSPy program in the last quarter, and how did the team debug it?

Two-week DSPy prep sequence: from signatures to MIPROv2 to a Databricks-style demo

Two weeks is enough to walk into a DSPy-literate loop with confidence if you spend the time on the right artifacts. The sequence below is calibrated for an engineer with prior LM experience but no DSPy production time.

  • Days 1-2: Read the DSPy docs from top to bottom. Write three signatures, wrap them in dspy.Predict, then upgrade to dspy.ChainOfThought. Inspect the difference with lm.inspect_history().
  • Days 3-4: Write a metric function for a simple task (classification or extraction). Build a 20-example trainset. Compile with BootstrapFewShot. Verify the score moves.
  • Days 5-7: Scale to 100+ examples. Switch to MIPROv2. Read the MIPROv2 API docs and the December 2024 comparative study. Walk through the three stages on paper so you can recite them at a whiteboard.
  • Days 8-10: Build a multi-hop RAG module composing Retrieve + ChainOfThought. If you have Databricks access, wire it to DBRX and Vector Search per the DSPy on Databricks blog. Otherwise use any vector store and any LM provider — the pattern is the same.
  • Days 11-12: Read the DAIS 2025 DSPy 3.0 session notes. Practice articulating the MLflow tracing story. Skim the GEPA paper so you can name it as an optimizer alternative.
  • Days 13-14: Mock-interview the questions above. Specifically rehearse the MIPROv2 three-stage walkthrough, the DSPy-vs-LangChain layering answer, and the non-English locale failure mode — these are the three places candidates most often stumble. Read adjacent tracks like the RAG interview questions, agentic AI interview questions, AI engineer interview track, and NLP engineer interview prep so cross-framework questions land cleanly.

The pattern that distinguishes strong DSPy candidates is not breadth of optimizer knowledge but specificity about failure modes. If you can name three things that broke for you during compile, three production gotchas you’d watch for, and one debugging tool you reach for first, you’ve covered the surface area senior interviewers actually probe.
