OpenAI Codex CLI Interview Questions: Junior, Mid, and Senior Answers with a Full Postmortem Scenario
OpenAI Codex CLI interview questions in 2026 test three things simultaneously: whether you can cleanly separate the 2025 Codex CLI from the deprecated 2021 Codex API (an immediate credibility gate at every seniority level), whether you know the three approval modes and their blast-radius implications, and whether you can articulate the planning system — AGENTS.md, Plan mode, and Reverse Interview — as an integrated framework rather than three isolated features. Senior roles add enterprise governance and a postmortem scenario that has no answer in any existing guide: you ran --full-auto in production and deleted your database migration files.
- What is OpenAI Codex CLI and how is it different from the 2021 Codex API?
- How do you install Codex CLI and verify the installation?
- Walk me through the three approval modes and when you’d use each one.
- What is AGENTS.md and what should it contain?
- Explain the Goal / Context / Constraints / Done-When prompt formula.
- When would you use Plan mode instead of running Codex directly?
- What reasoning level would you choose for a 5,000-line legacy refactor task?
- What is the “fork rather than fight” pattern and when do you apply it?
- What is the difference between Codex CLI and Claude Code? Which would you use for a 10-hour refactor?
- You’re rolling out Codex to a 200-person engineering organization. What governance controls do you put in place?
- How does MCP integration work in Codex, and why configure a centralized tool layer instead of per-developer configs?
- What are the prompt injection risks when you use Codex in a GitHub Actions CI/CD pipeline?
- You ran
codex --full-autoand it deleted your database migration files. Walk me through the postmortem.
What OpenAI Codex CLI Interviews Actually Test in 2026
This guide is for software engineers, AI engineers, platform engineers, and staff-level architects preparing for roles where Codex CLI proficiency is now a hiring signal. That includes AI-native startups building agentic CI/CD pipelines, enterprise engineering teams deploying Codex at scale — Cisco, Nvidia, and Ramp are publicly named early adopters — and staff or principal engineers evaluating multi-tool AI coding workflows. It is not a walkthrough of how to use Codex. It is a map of what interviewers probe and why, grounded in the official OpenAI documentation, Kuldeep Paul’s 2026 practitioner guide from Cisco/Nvidia/Ramp deployments, and real security incidents with named attributions.
Four structural shifts define the 2025–2026 Codex CLI hiring cycle:
- OpenAI Codex CLI launched April 16, 2025. This is a new agentic terminal coding tool — not a refresh of the 2021 Codex API, which was a stateless code-completion model. The two share a name and nothing else. Conflating them at a phone screen is an immediate credibility hit that ends interviews. The launch post makes this explicit: the 2025 CLI reads files, edits code, runs shell commands, and manages tasks autonomously. The 2021 API generated code completions and is now fully deprecated.
- 4 million weekly active users by April 2026. Sam Altman cited this figure directly. Kuldeep Paul’s 2026 practitioner guide — drawing on deployments at Cisco, Nvidia, and Ramp — corroborates the enterprise adoption trajectory and adds operational context that official documentation omits. Enterprise scale means interviewers now test governance, cost controls, and multi-team AGENTS.md ownership, not just CLI mechanics.
- The Rust rewrite changed what candidates must know about the architecture. The original Codex CLI was built in TypeScript. OpenAI rebuilt it in Rust. Shareuhack’s practitioner review frames the significance plainly: “built fresh in Rust — getting this wrong in an interview is an immediate credibility hit.” Claude Code, by contrast, remains in TypeScript. The architectural contrast — Rust + GPT-5.5 vs. TypeScript + Claude Sonnet 4.x — is a recurring senior interview probe.
- Security scrutiny has arrived via a named vulnerability class. The PromptPwnd pattern — untrusted user input injected into AI agent prompts in GitHub Actions pipelines — was documented by Aikido Security in 2025 and confirmed in at least six Fortune 500 companies. Codex Actions in CI/CD pipelines are an explicit attack surface. Senior security-adjacent roles now probe whether candidates understand where approval modes provide deterministic protection vs. where prompt injection bypasses them entirely.
The SERP for “Codex CLI interview questions” as of May 2026 contains zero interview-prep pages. Every ranked result is a how-to guide or best-practices post. This article is the first structured interview preparation resource for Codex CLI, organized by seniority level with answers grounded in primary sources and practitioner accounts.
Foundation Questions: What Codex CLI Is and How It Works
Foundation questions appear at the phone screen or first technical round. Their purpose is to filter candidates who cannot separate the 2025 CLI from the 2021 API — a mistake that signals the candidate has done keyword research but no hands-on preparation.
What is OpenAI Codex CLI and how is it different from the 2021 Codex API?
Concept: Product history and disambiguation | Difficulty: Junior | Stage: Phone screen
OpenAI Codex CLI is an agentic terminal coding tool launched by OpenAI on April 16, 2025. It reads files, edits code, executes shell commands, runs test suites, and opens pull requests — all within sandboxed environments, in parallel, without continuous human approval. It is powered by GPT-5.5 as the default model for complex coding tasks (GPT-5.4 and GPT-5.3-Codex are available for narrower workloads). The 2021 Codex API was an entirely different product: a stateless code-completion model that took a prompt and returned a code string. It had no file system access, no shell execution, no agentic loop. It is now deprecated. The two products share a name and nothing else architecturally.
What they’re really probing: The disambiguation itself. An interviewer who hears “the original Codex that OpenAI launched to complete code” without the word “deprecated” or the year “2021” knows the candidate is conflating the products. Shareuhack’s practitioner review puts it plainly: “Codex CLI (2025) is not the 2021 Codex API — built fresh in Rust. Getting this wrong in an interview is an immediate credibility hit.”
Common pitfall: Saying “Codex is OpenAI’s code model.” That describes the 2021 API. The 2025 CLI is a tool-calling agent — the model is GPT-5.5; the CLI is the agentic wrapper around it. These are different layers and the interviewer wants to know you understand both.
How do you install Codex CLI and verify the installation?
Concept: CLI setup and version awareness | Difficulty: Junior | Stage: Phone screen / technical screen
Installation uses npm: npm install -g @openai/codex. After installation, run codex --version to confirm the binary is in PATH and the version is current. Authentication uses an OpenAI API key or a ChatGPT account linked via OAuth. Tosea.ai’s 2025 guide documents the five-step sequence: global npm install, authenticate, verify with codex --version, navigate to a project directory, and issue a first task. The CLI runs in interactive mode by default (a TUI with composer, inline review, and Ctrl+R history search); --full-auto switches to headless/pipeline mode.
What they’re really probing: Installation awareness confirms hands-on use. The follow-up question — “what do you check when Codex produces output that contradicts your existing codebase conventions?” — immediately escalates to AGENTS.md, which is the practical answer.
Common pitfall: Confusing the npm package name @openai/codex with any other OpenAI CLI package. The 75K GitHub stars and 14.5 million monthly npm downloads (per Shareuhack) make this a live, actively maintained package — not a deprecated artifact.
Codex CLI Feature Reference (2026)
Interviewers at the mid level cycle through the core feature set: approval modes first (the security surface), then planning primitives (AGENTS.md, Plan mode, Reverse Interview), then session management (resume, compaction, fork). Knowing all three layers and the reason each exists as a distinct mechanism is the threshold test for mid-level roles.
| Feature | What It Does | Interview Signal |
|---|---|---|
| Approval modes | Three-level trust system: suggest (default, per-action approval), auto-edit (file edits without approval; shell still gated), full-auto (all gates bypassed). Different blast-radius profiles. | “What happens if you run –full-auto in a production environment without a sandbox?” |
| AGENTS.md | Project-level markdown file providing Codex with durable guidance: repo conventions, prohibited commands, test suite invocation, environment setup. Carries across CLI, IDE, and Codex app surfaces. | “How would you govern AGENTS.md across 10 teams in a monorepo?” |
| Plan mode | Produces a PLANS.md execution plan for human review before any action is taken. Recommended for fuzzy or high-risk tasks. | “When do you pay the planning overhead vs. running Codex directly?” |
| Reverse Interview | Developer asks Codex what information it needs before execution begins — elicits clarifying questions upfront rather than mid-task course corrections. | “Walk me through using all three planning patterns for a high-risk database migration.” |
| Session resume | codex resume restores a previous session via local transcript storage. --all flag for cross-directory resume. |
“What happens to a Codex session when it hits context window limits?” |
| Compaction | Manages context window limits for multi-hour sessions. If not configured, the session may lose context and produce incorrect results for the remainder of the task. | “You’re running a 4-hour refactor job with Codex — what do you configure before starting?” |
| Automations | Scheduled and webhook-triggered Codex workflows — Codex can run tasks on a defined schedule without manual invocation. | “How would you trigger a Codex task from a GitHub webhook?” |
| Skills (Skill.md) | Parameterized, reusable task definitions — separate from AGENTS.md project conventions. Invoked by name within a session. | “How is a Skill.md different from an entry in AGENTS.md?” |
Codex CLI vs. Claude Code: Architecture and Workflow Split
This comparison appears in senior interviews whenever the role involves AI tooling evaluation or multi-tool workflow design. The wrong answer is treating it as a product war. The correct answer is a workflow split grounded in architectural differences. While Codex is OpenAI’s tool, understanding Anthropic’s Claude Code approach can inform your evaluation — review Anthropic interview questions to grasp the safety-first design philosophy behind Claude Code.
| Dimension | Codex CLI | Claude Code |
|---|---|---|
| Implementation language | Rust (rewritten from TypeScript in 2025) | TypeScript (remains Anthropic’s flagship product) |
| Underlying model | GPT-5.5 (default); GPT-5.4, GPT-5.3-Codex available | Claude Sonnet 4.x (Anthropic model family) |
| Approval model | Three modes: suggest / auto-edit / full-auto (mode-level trust) | Hooks-based deterministic enforcement (PreToolUse / PostToolUse lifecycle events) |
| Task model | Parallel async sandboxed sessions; repo pre-loaded into sandbox; tasks run in isolation | Interactive single-session; tight feedback loop; exploratory by design |
| Enterprise path | AI gateway governance (Bifrost, Maxim AI); centralized MCP registry; AGENTS.md standards | Managed settings / plugins architecture; CLAUDE.md scoped hierarchy |
| Best for | Parallel async tasks; batch work; long-running autonomous sessions; CI/CD integration | Exploratory interactive sessions; tight feedback loops; complex multi-turn debugging |
Shareuhack’s practitioner review frames the split directly: “Codex for parallel async tasks; Claude Code for exploratory interactive sessions. Experienced engineers use both — the workflow split is a practical reality, not a product war.” The community debate around the Rust rewrite adds a footnote: r/programming commenter u/Florence-Equator noted “the real bottleneck of an AI coding assistant lies in the AI computation itself, not the frontend” — Claude Code remains in TypeScript because Anthropic treats it as a flagship product, not because of performance necessity.
Junior-Tier Questions (L3/L4): CLI Mechanics and Fundamentals
Junior questions test basic operational knowledge: do you know the approval modes, can you explain AGENTS.md, do you understand the prompt formula. These are the phone-screen and first-round technical questions at AI engineer and software engineer roles.
Walk me through the three approval modes in Codex CLI and when you’d use each one.
Concept: Trust model and blast-radius tradeoffs | Difficulty: Junior | Stage: Technical screen
Codex CLI has three approval modes defined in the official features documentation. Suggest (default) requires human approval before every action — every proposed file edit or shell command is presented for confirmation. Lowest blast radius; appropriate for unfamiliar codebases or sensitive environments. Auto-edit allows Codex to make file edits without per-action approval, but shell command execution still requires explicit confirmation. Useful for bulk refactors where you trust Codex’s edits but want a gate on side-effecting commands. Full-auto (--full-auto flag) bypasses all approval gates — Codex executes both file edits and shell commands without any human-in-the-loop gate. Recommended only in isolated sandbox environments. Used in headless/pipeline mode where human confirmation is structurally impossible. The risk: any misdirected session in full-auto mode can delete files, overwrite configuration, or run destructive shell commands with no gate to stop it.
What they’re really probing: Whether you understand the escalating risk at each level. The follow-up — “what happens if you run –full-auto in a production environment without a sandbox?” — is the senior-level postmortem scenario in disguise. Junior candidates who answer “full-auto is faster and more convenient” have not thought through the blast radius.
Common pitfall: Describing full-auto as simply “more autonomous.” The correct framing is: full-auto removes the safety net entirely. Without a sandbox, there is nothing between Codex’s actions and your production file system.
What is AGENTS.md and what should it contain?
Concept: Project configuration and durable agent guidance | Difficulty: Junior | Stage: Technical screen
AGENTS.md is a project-level markdown file that gives Codex durable guidance about the repository before any task begins. According to the OpenAI Best Practices documentation, it should contain: repo conventions (import ordering, naming patterns, file structure standards), prohibited file paths or commands that Codex must never touch, test suite invocation commands (so Codex can run tests autonomously), environment setup steps, and project-specific constraints. It carries across CLI, IDE extensions, and the Codex app surface — it is the single configuration artifact that follows the agent everywhere. Kuldeep Paul calls it “the single source of truth for repo conventions” (Practice 2): without it, Codex fills gaps with training-data defaults that may not match your codebase, producing technically correct but review-failing PRs.
What they’re really probing: Whether you treat AGENTS.md as infrastructure (created before the first Codex session) or as a reactive fix (written after Codex does something wrong). The production incident pattern from Kuldeep Paul is instructive: teams that skip AGENTS.md see Codex apply its own conventions — import ordering, test framework choices — that conflict with existing patterns. The correct answer positions AGENTS.md as infrastructure, not documentation.
Common pitfall: Conflating AGENTS.md with Skill.md. AGENTS.md is project-level permanent context about the repo. Skill.md defines parameterized, reusable tasks. They are different artifacts with different purposes — knowing the distinction signals hands-on experience.
Explain the Goal / Context / Constraints / Done-When prompt formula.
Concept: Official prompt structure for Codex tasks | Difficulty: Junior | Stage: Technical screen
The Goal / Context / Constraints / Done-When formula is OpenAI’s canonical four-part prompt structure for Codex tasks, documented in the Best Practices documentation. Goal: what Codex must achieve, stated as a specific outcome. Context: what Codex needs to know about the codebase, environment, or constraints that are not immediately visible from reading files. Constraints: what Codex must not do — prohibited approaches, files it should not touch, performance budgets, or style rules. Done-When: the explicit success criterion — usually a test suite result, a specific output, or a measurable state that Codex can verify without human interpretation. The Done-When clause transforms a vague instruction into a testable objective: instead of “refactor the payment module,” it becomes “Done-When: all 47 payment tests pass and no test file is modified.”
What they’re really probing: Whether you understand why the Done-When clause matters. Without an explicit success criterion, Codex must infer when it is finished — which in long sessions leads to over-engineering, unnecessary refactoring, or stopping too early. The Done-When clause gives the agent an objective exit condition.
Common pitfall: Treating the formula as optional polish rather than structural necessity. Interviewers at Cisco, Nvidia, and Ramp-level deployments (per Kuldeep Paul) expect this formula to be the default prompt pattern for any agentic Codex task, not something you apply only to complex cases.
What is the difference between Codex CLI and Claude Code? Which would you use for a 10-hour refactor task?
Concept: Architecture comparison and workflow selection | Difficulty: Junior/Mid | Stage: Technical screen
The core architectural differences: Codex CLI is built in Rust, powered by GPT-5.5, and runs tasks in isolated sandboxed environments — multiple tasks can run in parallel across separate sandboxes. Claude Code is built in TypeScript, powered by Claude Sonnet 4.x (Anthropic), and is designed for interactive exploratory sessions with a tight feedback loop. For a 10-hour refactor task, Codex CLI is the better fit: it is designed for long-running autonomous sessions (Kuldeep Paul cites typical task runtime of 1–30 minutes per discrete task; a 10-hour refactor would break into parallel sandboxed subtasks), the sandbox model isolates blast radius, and compaction support manages context window limits across extended sessions. Claude Code excels at exploratory debugging sessions where you are interactively iterating on a problem and need frequent back-and-forth with the agent. As Shareuhack puts it: “Codex for parallel async tasks; Claude Code for exploratory interactive sessions. Experienced engineers use both.”
What they’re really probing: Whether you understand the workflow split or treat the tools as interchangeable. The wrong answer is “I’d use whichever I’m more comfortable with.” The right answer is grounded in the architectural difference: Codex’s sandbox model and parallel task execution make it the correct choice for long-running batch work; Claude Code’s interactive loop makes it the correct choice for exploratory investigation.
Common pitfall: Not knowing that Codex CLI was rebuilt in Rust. The original was TypeScript. This architectural shift is frequently probed — it signals you are tracking the project, not just using the tool.
Mid-Tier Questions (L5): Planning Patterns and Session Management
Mid-level questions test whether you understand the three-pattern planning system as an integrated framework rather than isolated features. The threshold test at this level: can you explain when to pay the planning overhead, when to fork a session, and how to select a reasoning level for a given task?
When would you use Plan mode instead of running Codex directly?
Concept: Pre-execution review gate and risk-adjusted planning | Difficulty: Mid | Stage: Technical screen
Plan mode produces a PLANS.md artifact — a structured execution plan for human review — before Codex takes any action. Per the OpenAI Best Practices documentation and Kuldeep Paul’s Practice 3, use Plan mode for: fuzzy tasks where the correct approach is not obvious (and you want to evaluate Codex’s proposed approach before committing resources), high-blast-radius tasks where mistakes are expensive to reverse (database migrations, infrastructure changes, multi-file refactors), and multi-step tasks where an early error would invalidate all subsequent work. Do not use Plan mode for narrow, well-specified tasks where the approach is clear — the planning overhead is unnecessary. The artifact PLANS.md is a signal in itself: knowing this term and what it contains (the proposed execution plan, not a general notes file) distinguishes candidates who have used Plan mode from those who have only read about it.
What they’re really probing: Risk calibration. The interviewer wants to know whether you default to Plan mode for everything (wasteful overhead), never use it (dangerous for high-risk tasks), or apply it selectively based on task fuzzy-ness and blast radius. The correct answer is selective and grounded in specific criteria.
Common pitfall: Describing Plan mode as “the safe option.” It is a planning overhead decision, not a safety mechanism. Safety comes from the sandbox model and approval modes. Plan mode is about reviewing the strategy before execution, not preventing damage during execution.
What reasoning level would you choose for a task like “refactor this 5,000-line legacy module to use dependency injection”?
Concept: Reasoning level selection and compute budgeting | Difficulty: Mid | Stage: Technical screen
Codex has four reasoning levels: Low, Medium, High, and Extra High, per the OpenAI Best Practices documentation. Selection depends on task complexity and computation budget. For a 5,000-line legacy module refactor with dependency injection: High or Extra High. Reasoning: this is a multi-file refactor requiring Codex to understand cross-module dependencies, trace injection boundaries through the call graph, and maintain consistency across dozens of files. Low and Medium reasoning levels are appropriate for narrow, well-specified tasks — generating boilerplate, writing a single test, renaming a variable. The practical cost signal: a documented 5-hour Codex session using GPT-5.3-Codex at Extra High reasoning consumed 50% of a 5-hour usage limit. High-reasoning sessions require pre-planned compute budgets — not just task scoping but explicit token and time limits before the session starts.
What they’re really probing: Whether you understand that reasoning level selection is a tradeoff between quality and compute cost, not a “higher is always better” decision. The follow-up — “at what task complexity do you escalate from High to Extra High?” — expects a boundary condition: Extra High is warranted when the task requires multi-step reasoning where an error at step 3 cascades through steps 4–10 (legacy modernization, complex refactors). High is warranted when the task is complex but each step is relatively independent.
Common pitfall: Always defaulting to Extra High. This burns compute budget unnecessarily on tasks that Medium reasoning handles correctly. Enterprise deployments (per Kuldeep Paul) cap reasoning levels per task type to control cost — knowing this is appropriate at the mid level.
What is the “fork rather than fight” pattern and when do you apply it?
Concept: Session management and context recovery | Difficulty: Mid | Stage: Technical screen
“Fork rather than fight” is Practice 5 from Kuldeep Paul’s 2026 practitioner guide: when a Codex session has drifted significantly from the goal — producing work that is increasingly wrong or inconsistent — starting a new session with a refined, narrower prompt is more effective than attempting to correct the existing session through follow-up instructions. The rationale: once a session accumulates enough misdirected context, every correction must fight the existing (wrong) trajectory. A fresh session with a better-specified prompt is a lower-cost reset. Apply it when: Codex has made three or more consecutive corrections in the wrong direction, the session output has diverged significantly from the Done-When criterion, or the accumulated context is visibly confusing the agent about which version of a file or approach is current.
What they’re really probing: Session management judgment. Senior interviewers want to hear that you recognize drift early rather than fighting it for hours before starting over. The anti-pattern — continuing to fight a drifted session because you’ve already invested 45 minutes — is explicitly what this practice counters.
Common pitfall: Treating every session as recoverable through better prompting. Some sessions need to be forked. The signal that a session is unrecoverable: the context itself has become the problem, not the individual prompts. Compaction (managing context window limits) can help extend a session, but it cannot fix a session that has fundamentally drifted from the goal.
How do you use the Reverse Interview planning pattern, and where does it fit relative to Plan mode?
Concept: Pre-specification elicitation vs. pre-execution review | Difficulty: Mid | Stage: Technical screen
The three Codex planning patterns operate at different stages and serve different purposes. AGENTS.md is pre-session infrastructure: what the agent knows about the repo before any task begins. Reverse Interview is pre-specification elicitation: before issuing a task, you ask Codex what information it needs to complete the task successfully — eliciting clarifying questions upfront rather than discovering mid-task what context was missing. Plan mode is pre-execution review: Codex has enough information to propose an execution plan, which you review in PLANS.md before any action is taken. The correct sequencing for a high-risk task: (1) AGENTS.md provides repo context, (2) Reverse Interview surfaces what else Codex needs, (3) Plan mode produces a plan for human review, (4) execution proceeds with appropriate approval mode. As Kuldeep Paul notes, teams that use all three patterns see significantly fewer mid-task corrections and session forks.
What they’re really probing: Whether you see the three patterns as a system rather than three options to pick from. The integrated workflow — AGENTS.md → Reverse Interview → Plan mode → execution — is the staff-level answer. Mid-level candidates should understand each pattern individually and be able to articulate when to use them together.
Common pitfall: Using Plan mode as a substitute for AGENTS.md. PLANS.md is produced per-session and addresses a specific task. AGENTS.md is permanent project infrastructure. They are not interchangeable — AGENTS.md informs every session; PLANS.md documents a specific session’s proposed approach.
Senior-Tier Questions (L6+): Governance, Security, and Architecture
Senior questions test whether you can design Codex deployments at organization scale — governance controls, cost attribution, security surfaces, and cross-tool architecture decisions. The postmortem scenario is the defining senior question: it has no correct answer in any existing Codex guide, which is why it is used as a differentiator.
You’re rolling out Codex to a 200-person engineering organization. What governance controls do you put in place?
Concept: Enterprise AI tooling governance at scale | Difficulty: Senior | Stage: System design
The governance framework for enterprise Codex deployment, grounded in Kuldeep Paul’s deployment patterns at Cisco, Nvidia, and Ramp, covers six domains:
1. AI gateway (Practice 6): Before scaling Codex to any team, put an AI gateway (Kuldeep Paul names Bifrost and Maxim AI) in front of all Codex API calls. The gateway enforces rate limits, cost controls, audit trails, and policy enforcement at the infrastructure layer — not at the developer layer, where it will be inconsistently applied. This is the practice Kuldeep Paul identifies as “the one most teams skip and most teams regret.”
2. AGENTS.md governance: At 10+ teams in a monorepo, AGENTS.md requires ownership governance: define which team owns each section, how conflicts between team-specific and org-wide conventions are resolved, and what the update process is. Unowned AGENTS.md files drift and become stale — Codex applies outdated conventions that new team members have never seen.
3. Cost controls and chargeback: Per-user or per-team token budget caps enforced at the gateway layer. Usage anomaly alerts for sessions that consume disproportionate compute (e.g., multi-hour Extra High reasoning sessions that were not pre-approved). Audit logs for cost attribution to teams and cost centers.
4. Approval mode policy: Full-auto must only run in sandboxed environments. This is a policy, not a recommendation — enforce it at the gateway level or through CI/CD pipeline controls. Developer workstations should default to suggest or auto-edit. Pipeline jobs should run in pre-provisioned sandboxes.
5. PR review requirements (Practice 9): Treat Codex output like production code — it requires review, testing, and merge hygiene. Do not allow Codex-generated PRs to bypass standard review workflows because “the AI checked it.” The gateway audit trail provides the evidence that Codex ran; the PR review process provides the human gate before merge.
6. MCP centralized tool layer (Practice 8): Rather than letting individual developers configure their own MCP servers, a platform team maintains a shared MCP registry. This reduces configuration drift, ensures consistent tool access control, and provides a single audit point for all Codex tool calls.
What they’re really probing: Whether you lead with the gateway or with developer-level controls. The correct answer leads with the AI gateway — it is the infrastructure-layer control that makes all other controls consistent. Developer-level controls (telling people which approval mode to use) are inconsistently applied and do not provide audit trails.
Common pitfall: Treating governance as documentation (“we’ll write a policy doc”). Governance at 200 engineers requires infrastructure-layer enforcement — the gateway rate limits are not optional, the sandbox policy is not a suggestion, and the AGENTS.md ownership model is not a naming convention. Each control needs an enforcement mechanism.
How does MCP integration work in Codex, and why configure a centralized tool layer instead of per-developer MCP configs?
Concept: MCP architecture and enterprise tool governance | Difficulty: Senior | Stage: System design
Codex supports MCP (Model Context Protocol) integration — connecting to external toolchains (databases, APIs, internal services) via a standardized protocol layer, documented in the OpenAI Best Practices documentation. In a typical per-developer configuration, each developer maintains their own MCP server configurations — which tools they connect to, which credentials they use, which server versions they run. At team scale, this creates configuration drift: different developers have different tool versions, different permissions, and different connection patterns. Codex sessions produce inconsistent results because they have access to different toolsets.
Kuldeep Paul’s Practice 8 — “MCP through a centralized tool layer” — addresses this: a platform team maintains a shared MCP registry that all Codex sessions connect to. Benefits: consistent tool versions across all developers, centralized access control (who can call which tools), a single audit point for all MCP tool calls, and simplified onboarding (new developers inherit the full toolset without configuration). The centralized layer also enables the gateway pattern: the MCP registry sits behind the AI gateway, so all tool calls are subject to the same rate limits and audit logging as direct Codex API calls.
What they’re really probing: Whether you think about MCP as a developer-level convenience or as an infrastructure component with governance implications. At platform engineering roles at AI-adopting enterprises, MCP governance is increasingly a first-class concern alongside access control and cost management.
Common pitfall: Describing MCP as “a way to give Codex access to your tools” without addressing the access control question. The follow-up — “who decides which tools Codex can call, and how is that enforced?” — expects a centralized registry answer, not “it depends on the developer’s config.”
What are the prompt injection risks when you use Codex in a GitHub Actions CI/CD pipeline?
Concept: PromptPwnd vulnerability class and CI/CD security surface | Difficulty: Senior | Stage: Security-adjacent
The PromptPwnd vulnerability class — documented by Aikido Security in 2025 and confirmed in at least six Fortune 500 companies — applies directly to Codex Actions in CI/CD pipelines. The attack pattern: untrusted user-controlled text (GitHub issue bodies, PR descriptions, commit messages) is passed into Codex prompts within GitHub Actions workflows. Codex, treating this text as instructions, executes privileged tools it has access to — gh issue edit, shell commands, environment variable reads. Pattern: Untrusted input → AI prompt → privileged tool execution → secrets leaked or workflows modified. The Aikido Security research team named it “PromptPwnd” and documented that Google’s own Gemini CLI repo was affected and patched within four days of disclosure.
Mitigations (per the Aikido Security r/programming thread): (1) restrict what tools Codex Actions can call — do not give the agent access to privileged tools unless strictly necessary; (2) do not inject untrusted user-controlled text directly into Codex prompts; (3) treat all Codex output as untrusted — review before acting; (4) use GitHub token IP restrictions to reduce the blast radius if an injection succeeds.
What they’re really probing: Whether you understand that the approval mode system does not protect against prompt injection. Full-auto bypasses human gates, but even suggest mode can be exploited if the injected instruction causes Codex to request an action that looks reasonable on the surface. The defense is input sanitization and tool access restriction — not approval mode selection.
Common pitfall: Answering “we’d use suggest mode so a human reviews everything.” Suggest mode presents Codex’s proposed action — if the injected instruction has constructed a plausible-looking action, the human reviewer may approve it without recognizing the injection. The correct defense is architectural: sanitize inputs before they reach the prompt.
You ran codex --full-auto on your production environment and it deleted your database migration files. Walk me through the postmortem.
Concept: Full-auto blast radius, root cause analysis, and prevention architecture | Difficulty: Senior | Stage: System design postmortem
This scenario — from practitioner guidance by Kuldeep Paul and the OpenAI CLI Features documentation — is the flagship senior interview question for Codex CLI. There is no existing answer for it in the SERP. A structured postmortem response covers four phases:
Immediate triage: Run git status to assess current working tree state. Run git reflog to identify the commit before the damage occurred. Run git log --oneline -20 to understand what Codex committed before the deletion. If migration files were deleted and committed, git show HEAD -- path/to/migrations/ recovers the content. If deleted but not committed, git checkout HEAD -- path/to/migrations/ restores. The first 10 minutes are about stopping the bleeding and understanding the scope.
Root cause analysis: Full-auto bypassed all approval gates — no human saw the deletion before it executed. The root cause has two components: (a) no sandbox — full-auto ran directly against the production file system rather than an isolated environment; (b) no AGENTS.md prohibition — the migration directory was not listed as a prohibited path. Codex had no signal that deleting migration files was out of bounds. Contributing factors: no gateway logging (so there is no record of what Codex was asked to do), no Plan mode run before execution (so no human reviewed the proposed approach), and no pre-run test baseline (so Codex had no Done-When criterion that would have caught the deletion as a failure).
Prevention architecture: Four controls that would have prevented this incident: (1) Sandbox-only policy for full-auto — enforce at the CI/CD pipeline level or AI gateway layer; full-auto never runs against production file systems. (2) AGENTS.md explicit prohibitions — migration files, seed data, and infrastructure configuration should be listed as paths Codex must not touch without explicit instruction. (3) Gateway audit trail — every Codex task run is logged at the gateway with the prompt, the session duration, the tool calls made, and the files modified. The audit trail is how you reconstruct what happened; without it, you are guessing. (4) Plan mode for destructive operations — any task that touches database migrations, infrastructure code, or deployment configuration should produce a PLANS.md for human review before execution proceeds.
Organizational learning: Treat this as a missing governance control, not an operator error. The operator ran full-auto because the tooling permitted it on production. The fix is an infrastructure-layer policy, not a reminder to be more careful. Treat the gateway audit trail as the permanent record and the AGENTS.md prohibition list as the standing prevention mechanism.
What they’re really probing: Whether your postmortem leads to infrastructure changes or to individual blame. The correct answer changes the system — sandbox policy, AGENTS.md prohibitions, gateway logging, Plan mode gate. A weak answer blames the developer who ran full-auto.
Common pitfall: Proposing “better documentation” as a prevention measure. Documentation tells people what to do; infrastructure-layer controls (sandbox policy, gateway enforcement, AGENTS.md prohibitions) make it impossible to do the wrong thing. At senior level, the expectation is infrastructure-first remediation.
Full-Auto Postmortem: Root Cause to Prevention Matrix
This table synthesizes the postmortem scenario into a structured reference for interviewers and candidates. The five columns map to the standard blameless postmortem format used at AI-adopting enterprises.
| Layer | What Failed | Why It Matters | Prevention Control | Enforcement Point |
|---|---|---|---|---|
| Execution environment | Full-auto ran against production file system (no sandbox) | No isolation = unlimited blast radius | Sandbox-only policy for –full-auto | AI gateway or CI/CD pipeline config |
| Agent configuration | No AGENTS.md prohibition on migration paths | Codex had no signal that deletions were out of bounds | Add prohibited paths to AGENTS.md | AGENTS.md (pre-session infrastructure) |
| Planning | No Plan mode run; no human reviewed the proposed approach | Human review of PLANS.md would have caught the deletion proposal | Require Plan mode for destructive migrations | Team SOP or gateway policy |
| Observability | No gateway logging; no audit trail of what Codex was instructed to do | Without the audit trail, root cause analysis is guesswork | AI gateway with per-task logging | Gateway infrastructure (Bifrost, Maxim AI) |
| Success criterion | No Done-When criterion; Codex had no test baseline to validate against | Migration integrity would have failed a test — catching the error before commit | Test-first before agent runs (Practice 4) | Prompt design (Done-When clause) |
Red Flag Answers: What Interviewers Note Immediately
Interviewers at AI-native companies and enterprise engineering teams conducting Codex interviews in 2026 have identified recurring answer patterns that immediately signal underprepared candidates. These are not edge cases — they are the most common wrong answers at each level.
- Calling it “the Codex API.” The 2021 Codex API is deprecated. Using “API” to describe the 2025 CLI reveals the candidate has not worked with the current product. Every evaluator at a Codex-adopting company knows this distinction. Red flag: immediately credibility-damaging.
- “Full-auto is better because it’s faster.” Faster, yes. Safe without a sandbox, no. Any candidate who describes full-auto as a productivity feature without mentioning sandbox requirements and blast radius has not thought through the security model. Red flag: eliminated at mid-to-senior level.
- “AGENTS.md is like a README for Codex.” A README is documentation for humans. AGENTS.md is operational configuration for an autonomous agent — it carries across surfaces, it prohibits actions, it specifies test commands. Framing it as documentation signals surface-level familiarity. Red flag: weak mid-level answer.
- “I’d use suggest mode to prevent the prompt injection problem.” Suggest mode does not prevent prompt injection — it presents the injected instruction’s proposed action for human approval. The defense is input sanitization and tool access restriction, not approval mode selection. Red flag: reveals misunderstanding of the security model.
- Postmortem that concludes with “we’d train developers to be more careful.” A postmortem that ends at individual behavior rather than infrastructure-layer controls is not a production-grade postmortem. The correct answer changes the system. Red flag: insufficient for senior or staff level.
- “Codex and Claude Code do the same thing — I’d just pick one.” They have different architectures (Rust vs. TypeScript), different underlying models (GPT-5.5 vs. Claude Sonnet 4.x), different task models (parallel async sandboxed vs. interactive), and different enterprise governance paths. Treating them as interchangeable signals the candidate has not compared them operationally. Red flag: eliminated in workflow-design rounds.
Questions to Ask Your Interviewer
Asking informed questions at the end of a Codex CLI interview serves two purposes: it demonstrates you have thought beyond the basics, and it surfaces information you need to evaluate whether the role is the right fit. These are not generic “show engagement” questions — they are diagnostic probes calibrated to the Codex CLI hiring context in 2026.
- What approval mode is the team currently using by default, and is there a sandbox policy for full-auto runs? The answer tells you immediately how mature the team’s Codex governance is. “We use suggest mode and full-auto only in CI sandboxes” signals a mature deployment. “It depends on the developer” signals an early-stage rollout.
- Do you have an AGENTS.md in the repository? Who owns it, and how often does it change? This surfaces whether the team treats Codex configuration as infrastructure or as an afterthought.
- Is there a gateway in front of your Codex API calls? What tool are you using for cost controls and audit logging? At a 200-person org that has not answered this, you are being hired to build the answer.
- Are you using Codex alongside Claude Code for different task types, or have you standardized on one tool? Teams that have thought about the workflow split (Codex for parallel async, Claude Code for exploratory interactive) are further along than teams treating them as interchangeable.
Four-Week Prep Roadmap
This roadmap assumes you have access to a Codex CLI account and an OpenAI API key. It is calibrated for engineers targeting mid-to-senior roles at AI-native companies or enterprise engineering teams that have adopted Codex at scale.
Week 1 — Operational foundations. Read the CLI Features documentation and Best Practices documentation in full. Install Codex CLI on a personal project. Use suggest mode exclusively for the first week — review every proposed action before approving, building intuition for what Codex does by default vs. what it needs to be constrained via AGENTS.md. Write your first AGENTS.md for the project. Run one task using the Goal / Context / Constraints / Done-When formula. Note where the formula improved output quality compared to a free-form prompt.
Week 2 — Planning patterns and session management. Run a non-trivial task (a 500+ line refactor or a test-writing session) using Plan mode. Read the produced PLANS.md before approving execution — evaluate whether it is the plan you would have written. Try the Reverse Interview pattern on a fuzzy task: ask Codex what it needs before issuing the task. Document one case where “fork rather than fight” was the right call. Read Kuldeep Paul’s 2026 practitioner guide in full — the nine practices are the operational framework for enterprise Codex use.
Week 3 — Security and governance. Read the Aikido Security PromptPwnd thread — understand the prompt injection pattern, the attack surface (GitHub Actions), and the four mitigations. Build the full-auto postmortem scenario into memory: root cause (no sandbox + no AGENTS.md prohibitions), immediate triage (git reflog), contributing factors (no gateway logging, no Plan mode), prevention controls (sandbox policy, AGENTS.md prohibitions, gateway audit trail, Plan mode gate). For the Codex vs. Claude Code comparison, run the same task on both tools and compare the output, the session experience, and the time-to-completion.
Week 4 — Enterprise design and mock interviews. Answer all 13 questions in this guide out loud, without notes, targeting 2–3 minutes per answer. Focus on the governance question (Practice 6 gateway pattern, AGENTS.md ownership model, cost controls, approval mode policy) and the postmortem scenario. Do a timed mock with the postmortem question — the goal is a structured four-phase answer (triage → root cause → prevention → organizational learning) delivered in 4–5 minutes. Review the red flag answers section and confirm you cannot catch yourself giving any of the six flagged answers.
What Separates Hired Candidates from Strong Rejects
Codex CLI interviews in 2026 are first-mover territory: interviewers are often calibrating the bar in real time because no established interview canon exists yet. What consistently separates hired candidates is not memorization depth — it is the ability to reason from principles. Every strong answer in this guide follows the same pattern: name the mechanism, explain why it exists (the design decision behind it), and connect it to a consequence that matters at scale (blast radius, cost, team consistency, audit trail). The candidate who says “full-auto bypasses all gates, which is why you must have a sandbox policy enforced at the infrastructure layer before you give anyone full-auto access” is demonstrating reasoning, not recall. That reasoning pattern — mechanism → design intent → scale consequence — is the throughline of a successful Codex CLI interview.