OpenTelemetry interview questions in 2026 sit on top of a different stack than they did two years ago. Datadog shipped native OTLP ingest in late 2023, and through 2024-2025 Honeycomb, Grafana Cloud, New Relic, and Splunk Observability Cloud all followed (Datadog OpenTelemetry docs).
The OpenTelemetry Logs SDK went stable in 2024, and the continuous-profiling signal was accepted as a first-class fourth signal the same year (OTel Logs spec). CNCF now lists OTel as the second-most-active project after Kubernetes (CNCF Annual Report 2024).
Senior interviewers no longer expect a textbook definition. They probe operational corners: tail-sampling decision windows, Collector processor ordering, eBPF auto-instrumentation versus SDK, and OpAMP fleet management. This guide groups the questions into four substantive categories with an italic concept-difficulty-stage tag line on every question so you can scan to your interview’s level.
- What is OpenTelemetry, in one sentence?
- Are there three signals or four? Where does profiling fit?
- What’s the difference between OTLP and a backend like Datadog or Grafana?
- What’s the difference between a trace and a span?
- Why do you need both resource attributes and span attributes?
- What does the OpenTelemetry Collector do?
- Agent mode vs gateway mode — which would you pick?
- Memory_limiter before batch — why does processor order matter?
- What’s the difference between the core and contrib Collector distributions?
- How do you manage 200 Collectors? Walk me through OpAMP.
- Head-based vs tail-based sampling — when would you pick each?
- How does the tail_sampling processor’s decision_wait actually work?
- How would you debug a metrics cardinality explosion?
- Auto-instrumentation vs manual — what are the trade-offs?
- When would you reach for eBPF auto-instrumentation (Beyla, Pixie)?
- How do you propagate trace context across a Kafka producer/consumer?
- A trace is missing spans from one service. How do you debug it?

How OpenTelemetry interviews shifted in 2024-2026
Three concrete shifts changed what gets asked. Logs went stable in 2024, so interviewers no longer treat them as experimental; questions about the Logs Bridge API and routing existing log libraries through OTel now appear in technical screens for SRE and platform-engineer roles (OTel logs specification). The profiles signal was accepted as a first-class signal in 2024 after Grafana, Elastic, and Splunk donated their profiling work — Go has the most mature SDK as of 2026, with other languages following (OTel profiling announcement).
The second shift is operational. The Collector graduated to stable 1.0 in September 2023, and through 2024-2025 the operational pattern of “run a Collector fleet, not vendor agents” became the default for any organization above a few dozen services. OpAMP (the Open Agent Management Protocol) emerged as the spec for managing those fleets remotely; senior infrastructure interviews now reliably probe whether candidates have used or evaluated it.
The third shift is eBPF auto-instrumentation. Grafana’s Beyla, the New Relic-owned Pixie, and Coroot all matured through 2024-2025 to the point where you can get RED metrics and partial traces with zero code changes (Beyla on GitHub). Interviewers ask about this as a complement-versus-replacement question: candidates who say “eBPF means we don’t need SDKs” are flagged junior, because business-level attributes still require SDK instrumentation.
Together those three shifts mean the modern OpenTelemetry interview rewards depth on the Collector, sampling, and fleet management — and treats the bare definition of the project as a recruiter-screen question, not a technical-screen question.
Foundation questions: OpenTelemetry fundamentals
What is OpenTelemetry, in one sentence?
Concept: project definition | Difficulty: junior | Stage: recruiter screen
Direct answer: OpenTelemetry is a CNCF-incubating project that provides vendor-neutral SDKs, APIs, and the OTLP wire protocol for emitting traces, metrics, logs, and profiles. You instrument your code once and route the data to whichever observability backend you pick. The project formed from the 2019 merger of OpenTracing and OpenCensus, has been incubating in the CNCF since 2021, and is now the second-most-active project after Kubernetes by contributor count and commit volume (CNCF Annual Report 2024). The Collector — the routing layer that lives between SDKs and backends — graduated to stable 1.0 in September 2023, which is the milestone that moved OTel from “experimental adoption” to default observability infrastructure across most engineering organizations.
What they’re really probing: whether you can frame OTel as instrumentation plus transport, separate from any backend. Junior candidates conflate OTel with a vendor and lose credibility immediately.
The phrase to avoid is “OpenTelemetry is a tracing library.” It is not — it has been a four-signal project since the profiles signal was accepted in 2024. Phrase to use instead: “OpenTelemetry is the vendor-neutral instrumentation layer for the four observability signals.”
Are there three signals or four? Where does profiling fit?
Concept: signals + stability | Difficulty: mid | Stage: technical phone screen
Direct answer: As of 2026, OpenTelemetry covers four signals: traces, metrics, logs, and profiles. Stability is staggered — Traces hit 1.0 in 2021, Metrics in early 2023, Logs SDK declared stable in 2024 (the Logs Bridge API for routing existing log libraries is now production-recommended), and the continuous-profiling signal was accepted as a first-class signal in 2024 after Grafana, Elastic, and Splunk donated their profiling work (OTel profiling announcement). The Profiles SDK maturity varies by language as of 2026, with Go furthest along and the JVM following. The practical implication of the four-signal model: all four share resource attributes, so you can correlate a slow span to the matching profile sample and the matching log line from the same service instance.
What they’re really probing: whether you know the recent timeline. Candidates still calling logs “experimental” signal that they last touched OTel before 2024.
Pair the answer with the practical implication: because all four signals share resource attributes, you can correlate a slow span (trace) to the corresponding profile sample and the corresponding log line, all from the same service-instance fingerprint. That correlation across signals is the real value the unified project provides.
What’s the difference between OTLP and a backend like Datadog or Grafana?
Concept: protocol vs backend | Difficulty: mid | Stage: technical phone screen
Direct answer: OTLP (OpenTelemetry Protocol) is the vendor-neutral wire protocol — gRPC on port 4317, HTTP/protobuf on port 4318 — for shipping telemetry from an SDK or Collector to a receiver. A backend is the storage, query, and visualization layer that ingests OTLP and lets engineers search and dashboard the data. Datadog, Honeycomb, Grafana Cloud, New Relic, Splunk Observability Cloud, AWS X-Ray (via ADOT), Azure Monitor, and Google Cloud all accept OTLP natively as of 2025 (Datadog OTel docs). The split matters because switching backends becomes a YAML edit on the Collector’s exporters section — not a re-deploy of every service in your fleet, which is what the pre-OTel world required when you swapped a vendor agent.
What they’re really probing: the vendor-lock-in story. The whole reason OTel won the observability space is that engineers stopped accepting “re-instrument every service when we switch vendors.”
If the interviewer pushes further, name the practical seam: switching backends is a YAML edit on the Collector’s exporters section, not a re-deploy of every service. This single fact is the business case for OpenTelemetry adoption in most organizations.
What’s the difference between a trace and a span?
Concept: data model | Difficulty: junior | Stage: technical phone screen
Direct answer: A trace is a directed acyclic graph of spans that represents one end-to-end request across services. A span is a single unit of work — a function call, an HTTP request, a database query — with a start time, end time, attributes, and a parent span ID. All spans in one trace share the same 16-byte trace ID; each span has its own 8-byte span ID and points at its parent. The root span has no parent and represents the entry point of the request. Beyond the basic parent-child relationship, the spec defines span links (relationships between spans in different traces, used for async jobs triggered by a request) and span events (named timestamped points within a span, like cache hits or queue waits). Naming these unprompted signals real OpenTelemetry experience.
What they’re really probing: whether you flip trace ID and span ID. It’s one of the most common junior tells.
Practitioners on r/devops and r/sre frequently note that candidates who can also explain span links (relationships between spans in different traces, e.g., async job triggered by a request) and span events (named timestamped points within a span) signal experience beyond the textbook (r/devops OpenTelemetry threads).
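If it helps to make the ID sizes and the parent pointer concrete, here is a minimal sketch of the data model described above. The `Span` class and the helper names are illustrative assumptions, not the OpenTelemetry SDK's own types, and the span names are made up.

```python
import secrets
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """Toy span for illustration only; not the OpenTelemetry SDK's Span class."""
    name: str
    trace_id: bytes                   # 16 bytes, shared by every span in the trace
    span_id: bytes                    # 8 bytes, unique to this span
    parent_span_id: Optional[bytes]   # None marks the root span
    events: list = field(default_factory=list)  # named, timestamped points in the span
    links: list = field(default_factory=list)   # (trace_id, span_id) pairs in other traces


def new_root(name: str) -> Span:
    # A root span starts a new trace: fresh trace ID, no parent.
    return Span(name, secrets.token_bytes(16), secrets.token_bytes(8), None)


def new_child(parent: Span, name: str) -> Span:
    # A child inherits the trace ID and records its parent's span ID.
    return Span(name, parent.trace_id, secrets.token_bytes(8), parent.span_id)


root = new_root("GET /checkout")
db = new_child(root, "SELECT orders")

# An async job runs in its own trace but links back to the request that triggered it.
job = new_root("send-receipt-email")
job.links.append((root.trace_id, root.span_id))
```

The thing to notice is the asymmetry: `db` and `root` share a trace ID and differ in span ID, while `job` has a different trace ID and reaches the request only through a link.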
Why do you need both resource attributes and span attributes?
Concept: semantic conventions | Difficulty: mid | Stage: technical phone screen
Direct answer: Resource attributes describe the emitter — they apply to every signal a service instance produces. The canonical ones are service.name, service.version, service.instance.id, and deployment.environment. Span attributes describe the specific operation — http.request.method, db.system, messaging.destination.name. The split exists so dashboards can group by emitter without duplicating per-span data. Putting service.name on every span instead of as a resource attribute would multiply your storage cost and break the standard query patterns (OpenTelemetry semantic conventions). Naming attributes consistently with the spec is what lets a query like http.response.status_code >= 500 work uniformly across the entire fleet, regardless of which language or framework emitted the span.
What they’re really probing: whether you understand the cardinality consequences of mis-locating attributes.
The semantic conventions spec is the authoritative naming guide: HTTP semconv stabilized in 2023, database semconv in 2024, messaging is still in development as of 2026. Naming attributes consistently with the spec is what lets a query like http.response.status_code >= 500 work across every service in the fleet.
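A rough way to see the split is to compare a batch where the emitter's identity is stated once against one where it is copied onto every span. The sketch below uses plain dictionaries and made-up values (`checkout`, `1.4.2`, and so on are assumptions for illustration); it is not the OTLP encoding, only the same grouping idea.

```python
import json

# Hypothetical emitter identity: stated once per service instance.
resource = {
    "service.name": "checkout",
    "service.version": "1.4.2",
    "service.instance.id": "checkout-7d9f",
    "deployment.environment": "prod",
}

# Per-operation detail: differs from span to span.
spans = [
    {"name": "GET /cart",
     "attributes": {"http.request.method": "GET", "http.response.status_code": 200}},
    {"name": "POST /pay",
     "attributes": {"http.request.method": "POST", "http.response.status_code": 503}},
    {"name": "SELECT orders",
     "attributes": {"db.system": "postgresql"}},
]

# Resource attributes attached once to the whole batch.
grouped = {"resource": resource, "spans": spans}

# Anti-pattern: the same identity copied onto every span.
flattened = [
    {"name": s["name"], "attributes": {**resource, **s["attributes"]}}
    for s in spans
]

grouped_bytes = len(json.dumps(grouped))
flattened_bytes = len(json.dumps(flattened))

# Consistent attribute names let one filter work across every emitter.
server_errors = [
    s["name"] for s in spans
    if s["attributes"].get("http.response.status_code", 0) >= 500
]
```

With only three spans the flattened form is already larger, and the gap grows with every additional span in the batch.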
Collector architecture and deployment questions
What does the OpenTelemetry Collector do?
Concept: Collector basics | Difficulty: mid | Stage: technical phone screen
Direct answer: The OpenTelemetry Collector is a vendor-agnostic agent and gateway that receives, processes, and exports telemetry. Its configuration is a YAML file declaring pipelines (traces, metrics, logs — profiles is coming) each shaped as receivers → processors → exporters. The Collector reached stable 1.0 in September 2023, which moved it from “experimental adoption” to default observability infrastructure across most engineering organizations (OTel Collector docs). Common components you should be able to name on demand:
- Receivers: otlp, prometheus (scraping), filelog, hostmetrics, plus Kubernetes-specific kubeletstats and k8s_cluster.
- Processors: memory_limiter, batch, attributes, resource, transform, filter, tail_sampling.
- Exporters: otlp (forward), prometheusremotewrite, loki, datadog, awsxray, googlecloud, kafka, file.
What they’re really probing: whether you can describe the pipeline shape without reaching for a diagram.
The Collector graduated to stable 1.0 in September 2023, which is the date senior interviewers expect you to know if the topic comes up. Before that release, production deployments often hesitated; after, “deploy a Collector” became the default first answer to any observability scaling question.
Agent mode vs gateway mode — which would you pick?
Concept: deployment topology | Difficulty: senior | Stage: system design
Direct answer: Run the Collector in agent mode when the SDK is on the same host or pod — usually as a Kubernetes DaemonSet or sidecar — so SDKs ship to localhost over OTLP and the agent handles batching, retries, and a first level of enrichment. Run it in gateway mode when you need centralized policy: tail-based sampling (the gateway needs every span of a trace), fan-out to multiple backends, cross-region routing, or cost-control aggregation. The production pattern most teams converge on is both — node-local agents absorb SDK retries and shape batches, while a gateway cluster makes cross-trace decisions and routes to backends. The pattern matters because tail-based sampling categorically requires a gateway: you cannot decide on a trace’s fate at the agent because each agent sees only its host’s spans, not the full trace.
What they’re really probing: whether you understand why tail-based sampling forces the gateway pattern.
The trade-off candidates often miss: agent-only is cheap but loses you tail-sampling and fan-out; gateway-only adds a cross-host hop on the hot path. The hybrid pattern wins because the local agent absorbs SDK retries while the gateway makes the cross-trace decisions. Discuss this with the interviewer in terms of where the policy lives, not just where the boxes sit.
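To see why the tail-sampling decision cannot live at the agent, it can help to run the same "keep if any span errored" rule from both vantage points. This is a toy sketch with made-up hosts, trace IDs, and services; it models only where the decision is made, not how the Collector implements it.

```python
from collections import defaultdict

# (host, trace_id, service, is_error): made-up spans from requests crossing three hosts.
spans = [
    ("node-a", "t1", "frontend", False),
    ("node-b", "t1", "cart", False),
    ("node-c", "t1", "payments", True),
    ("node-a", "t2", "frontend", False),
    ("node-b", "t2", "cart", False),
]


def keep_if_any_error(group):
    return any(is_error for (_, _, _, is_error) in group)


# Agent view: each agent decides using only the spans from its own host.
by_host_and_trace = defaultdict(list)
for span in spans:
    by_host_and_trace[(span[0], span[1])].append(span)
agent_decisions = {
    key: keep_if_any_error(group) for key, group in by_host_and_trace.items()
}

# Gateway view: every span of a trace reaches one place before the decision.
by_trace = defaultdict(list)
for span in spans:
    by_trace[span[1]].append(span)
gateway_decisions = {
    trace_id: keep_if_any_error(group) for trace_id, group in by_trace.items()
}
```

For trace `t1`, the agents on `node-a` and `node-b` would drop their spans while `node-c` keeps its own, leaving a fragment; the gateway keeps the whole trace. One caveat worth raising in the interview: with several gateway replicas, spans still have to be routed by trace ID so that one replica sees the full trace. The contrib loadbalancing exporter is the usual tool for that, though its current options are worth checking in its README.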
Memory_limiter before batch — why does processor order matter?
Concept: Collector internals | Difficulty: senior | Stage: technical phone screen
Direct answer: Processors run in declared order, so put memory_limiter first so it can refuse incoming data when the Collector is near its memory ceiling, then batch, then any enrichment processors like attributes, resource, or transform. If batch runs before memory_limiter, the batcher will accept and buffer data the limiter would have rejected — pushing the Collector into OOM under burst load. Reversed order is one of the most cited Collector configuration mistakes on r/devops Collector threads, and it usually surfaces as an apparently-random Collector crash under traffic spikes that the team can’t reproduce in staging. The tail_sampling processor, when present, sits late in the pipeline — after enrichment, before the exporter — because it needs the full trace context including any attributes added upstream.
What they’re really probing: whether you’ve operated a Collector under sustained burst load and learned this lesson the hard way.
If the interviewer pushes for more, mention that tail_sampling, when used, sits late in the pipeline — after enrichment, before the exporter — because the sampler needs the full trace context including any attributes added by upstream processors. The processor order is documented in the Collector’s configuration reference.
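The ordering argument can be sketched as a deliberately crude toy model. Nothing below is Collector code: `run_burst`, the limits, and the "one item costs one memory unit" accounting are all assumptions chosen to mirror the reasoning above, with a burst that arrives faster than the batch buffer can flush.

```python
class OutOfMemory(Exception):
    """Raised when the toy Collector exceeds its hard memory ceiling."""


def run_burst(order, incoming, soft_limit=80, hard_limit=100):
    """Push `incoming` items through processors in `order`.

    Returns (buffered, refused). Toy model only, not the Collector's logic.
    """
    buffered = 0  # items held in the batch buffer, one memory unit each
    refused = 0   # items pushed back to the sender by memory_limiter
    for _ in range(incoming):
        for processor in order:
            if processor == "memory_limiter" and buffered >= soft_limit:
                refused += 1
                break  # refused data never reaches later processors
            if processor == "batch":
                buffered += 1  # the burst outpaces export, so nothing is flushed
                if buffered > hard_limit:
                    raise OutOfMemory(f"{buffered} items buffered")
    return buffered, refused


def survives(order, incoming):
    try:
        run_burst(order, incoming)
        return True
    except OutOfMemory:
        return False


correct = run_burst(["memory_limiter", "batch"], 500)
reversed_survives = survives(["batch", "memory_limiter"], 500)
```

In the correct order the limiter starts refusing once the buffer reaches the soft limit, so memory plateaus and the sender is told to retry. In the reversed order the batcher has already taken the data by the time the limiter looks at it, and the toy runs past its hard ceiling.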
What’s the difference between the core and contrib Collector distributions?
Concept: distributions | Difficulty: mid | Stage: technical phone screen
Direct answer: The upstream core distribution (opentelemetry-collector) ships a deliberately narrow set of components — primarily OTLP receivers and exporters plus a handful of universally useful processors. The contrib distribution (opentelemetry-collector-contrib) bundles the long tail of community components: vendor-specific exporters like datadog and awsxray, specialized receivers like kafkametrics, and processors like tail_sampling. Most production deployments use contrib because they need at least one component that doesn’t live in core. Vendor distributions — Splunk OTel Collector, AWS Distro for OpenTelemetry (ADOT) — extend contrib with their own components and ship pre-built images with the vendor’s preferred defaults already configured.
What they’re really probing: whether you’ve actually looked at the GitHub repo layout or only used pre-built images.
The practical implication: a query about which exporter to use often resolves to “check whether it’s in contrib first.” If it isn’t, the path is usually OTLP to a vendor endpoint, which has been the de facto answer since the 2024-2025 OTLP-native push at major backends.
How do you manage 200 Collectors? Walk me through OpAMP.
Concept: fleet management | Difficulty: senior | Stage: system design
Direct answer: OpAMP (Open Agent Management Protocol) is the OTel spec for remotely managing Collector fleets: pushing configuration updates, rotating certificates, upgrading packages, and reporting health. The protocol runs over WebSocket or HTTP; clients (Collectors) connect to an OpAMP server and receive instructions. Production OpAMP servers in 2026 include Splunk’s implementation, observIQ’s BindPlane OP, Honeycomb (limited), and Grafana’s in-progress server (OpAMP specification). The pre-OpAMP fallback, and still the common pattern in 2026, is GitOps: Collector configs in a Git repo, applied via Argo CD or Flux. OpAMP wins for dynamic updates without redeploy and for heterogeneous fleets (mixed VMs and Kubernetes) where GitOps coverage is uneven.
What they’re really probing: whether you’ve thought past per-Collector config management. Anyone running more than a dozen Collectors has felt this pain.
Mentioning OpAMP in an interview is itself an experience signal, because most candidates won’t reach for it. Pair it with the GitOps fallback so the interviewer hears that you know both options and when each one fits.
Sampling and cost-control questions
Head-based vs tail-based sampling — when would you pick each?
Concept: sampling strategy | Difficulty: senior | Stage: system design
Direct answer: Head-based sampling decides at the SDK when the root span starts — typically a deterministic hash on trace ID so all spans of a trace share the decision. It’s cheap and adds no latency, but you commit to keeping or dropping a trace before knowing whether it errored or ran slow. Tail-based sampling decides at the Collector gateway after the full trace has been assembled — you can keep every error trace, every latency-tail trace, and a percentage of healthy traces. It’s expensive (requires the Collector to buffer spans) and adds a decision-wait window, but it’s the only way to guarantee error retention without keeping 100% of traces. The production pattern most senior teams converge on is hybrid: head-based at low rates (1-10%) for cheap baseline coverage, plus tail-based at the gateway with rules like “keep all errors, keep all p99-latency traces, keep 1% of healthy traffic.”
What they’re really probing: whether you understand the buffer-and-decide cost.
Production pattern: head-based at low rates (1-10%) plus tail-based at the gateway with rules “keep all errors, keep all p99-latency traces, keep 1% of healthy traces.” This is the pattern Honeycomb’s engineering blog has documented across multiple customer write-ups, and it’s the pattern senior SRE interviewers expect you to describe.
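The "deterministic hash on trace ID" part of head-based sampling can be sketched in a few lines. The SDKs ship a trace-ID-ratio sampler that works along broadly similar lines; the function below is a hand-rolled stand-in written for illustration, not that implementation, and `head_sample` is a hypothetical name.

```python
import secrets


def head_sample(trace_id: bytes, rate: float) -> bool:
    """Keep a trace when the low 8 bytes of its ID fall under rate * 2**64.

    The verdict depends only on the trace ID, so every service that sees the
    same trace reaches the same decision without coordinating.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    threshold = int(rate * (1 << 64))
    return int.from_bytes(trace_id[-8:], "big") < threshold


trace_id = secrets.token_bytes(16)
decision_in_service_a = head_sample(trace_id, 0.10)
decision_in_service_b = head_sample(trace_id, 0.10)  # same ID, same verdict

# Over many random trace IDs the kept fraction approaches the configured rate.
kept = sum(head_sample(secrets.token_bytes(16), 0.10) for _ in range(20_000))
kept_fraction = kept / 20_000
```

Note what the function never looks at: status codes, durations, or anything else about how the request went. That is the limitation the tail-based half of the hybrid pattern exists to cover.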
How does the tail_sampling processor’s decision_wait actually work?
Concept: tail_sampling internals | Difficulty: senior | Stage: technical deep-dive
Direct answer: The Collector’s tail_sampling processor buffers all spans of a trace in memory, keyed by trace ID, and runs its decision policies when one of two things happens: the decision_wait timer fires (default 30 seconds after the first span arrives), or the buffer hits a configured span limit. Once the decision is made, all buffered spans of that trace are either exported or dropped, and the buffer entry is freed. The 30-second default exists because most traces complete well within it; longer traces (batch jobs, long-running queries) need a higher decision_wait or they get dropped as incomplete. The memory cost of this buffering is the operational risk: at high RPS, a thirty-second window can hold tens of millions of spans in RAM, which is why tail-sampling deployments need careful num_traces ceiling tuning.
What they’re really probing: whether you’ve tuned decision_wait in production. The default isn’t always right.
The practical failure mode: if your traces span a batch job that takes 60 seconds, a 30-second decision_wait will flush the trace before the batch’s spans arrive — the trace will look broken or truncated. Either raise decision_wait or split the workflow into two traces with a span link. The processor’s documentation in the contrib repo is the source of truth.
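The failure mode above is easier to explain with a toy buffer-then-decide loop. `ToyTailSampler` is an illustrative assumption, not the contrib processor's code: it has a single keep-on-error policy, takes the clock as an argument, and simply sets late spans aside, whereas the real processor's handling of late spans depends on its version and configuration.

```python
from collections import defaultdict


class ToyTailSampler:
    """Illustrative buffer-then-decide loop; not the contrib processor's code."""

    def __init__(self, decision_wait: float):
        self.decision_wait = decision_wait
        self.first_seen = {}              # trace_id -> arrival time of its first span
        self.buffer = defaultdict(list)   # trace_id -> buffered (name, is_error) spans
        self.decided = {}                 # trace_id -> True (kept) or False (dropped)
        self.exported = []                # spans forwarded to the exporter
        self.late = []                    # spans that arrived after the decision

    def on_span(self, now, trace_id, name, is_error=False):
        if trace_id in self.decided:
            self.late.append((trace_id, name))
            return
        self.first_seen.setdefault(trace_id, now)
        self.buffer[trace_id].append((name, is_error))

    def tick(self, now):
        # Decide every trace whose wait window has elapsed, then free its buffer.
        for trace_id, started in list(self.first_seen.items()):
            if now - started >= self.decision_wait:
                spans = self.buffer.pop(trace_id)
                del self.first_seen[trace_id]
                keep = any(is_error for _, is_error in spans)
                self.decided[trace_id] = keep
                if keep:
                    self.exported.extend((trace_id, name) for name, _ in spans)


def replay(decision_wait):
    """A batch job whose failing step shows up 60 seconds after the first span."""
    sampler = ToyTailSampler(decision_wait)
    sampler.on_span(0, "t1", "start-batch")
    sampler.tick(30)
    sampler.on_span(60, "t1", "batch-step-failed", is_error=True)
    sampler.tick(90)
    return sampler


short_wait = replay(decision_wait=30)
long_wait = replay(decision_wait=90)
```

With a 30-second window the decision fires before the error span exists, so the toy drops a trace it should have kept. With a 90-second window the error is in the buffer when the policy runs, and both spans are exported.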
How would you debug a metrics cardinality explosion?
Concept: cardinality | Difficulty: senior | Stage: incident-response
Direct answer: Start at the backend’s cardinality dashboard — Datadog, Grafana Mimir, and Honeycomb all expose top-N high-cardinality dimensions. Identify which metric and which label exploded. Common causes: a label that holds an unbounded value (user ID, request path with IDs, error message), or a recently-added label on a hot metric. Fix at the source by removing the label from the SDK, or use the Collector’s attributes processor with the delete action to strip it before export. Alternatively, use the filter processor to drop metric streams matching a label pattern. The Collector-side fix is the senior answer because one platform team can apply a guardrail without coordinating with every service owner — and you can roll it back the moment you confirm the spike has stopped.
What they’re really probing: whether you reach for the Collector as a guardrail rather than chasing every team for a code change.
The cost framing is important: most observability spend is metrics cardinality, not trace volume. A single label change can multiply your monthly bill 10x within a week. The Collector’s filter and attributes processors are your first line of defense, and this is a frequent topic on the r/sre OpenTelemetry threads.
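The triage step ("which metric, which label") and the effect of stripping the label can both be sketched with plain Python. The data points, the `user_id` label, and the helper names below are made up for illustration; the backends' cardinality dashboards and the Collector's processors do the equivalent work in practice.

```python
from collections import defaultdict


def label_cardinality(points):
    """Return (distinct_values, metric, label) tuples, highest cardinality first."""
    seen = defaultdict(set)
    for metric, labels in points:
        for key, value in labels.items():
            seen[(metric, key)].add(value)
    return sorted(
        ((len(values), metric, key) for (metric, key), values in seen.items()),
        reverse=True,
    )


def series_count(points, drop=()):
    """Number of distinct time series once the labels in `drop` are removed."""
    series = set()
    for metric, labels in points:
        kept = tuple(sorted((k, v) for k, v in labels.items() if k not in drop))
        series.add((metric, kept))
    return len(series)


# Hypothetical data: someone added user_id to a hot HTTP metric.
points = []
for i in range(1000):
    points.append(("http.server.request.duration",
                   {"http.request.method": "GET", "http.route": "/orders/{id}",
                    "user_id": f"u{i}"}))
    points.append(("http.server.request.duration",
                   {"http.request.method": "POST", "http.route": "/orders",
                    "user_id": f"u{i}"}))

worst = label_cardinality(points)[0]
before = series_count(points)
after = series_count(points, drop=("user_id",))
```

Here one unbounded label turns 2 series into 2,000, and deleting it collapses them back. That ratio is the argument for having a Collector-side guardrail ready before the incident rather than after.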
Instrumentation and context-propagation questions
Auto-instrumentation vs manual — what are the trade-offs?
Concept: instrumentation strategy | Difficulty: mid | Stage: technical phone screen
Direct answer: Auto-instrumentation attaches via the language’s instrumentation hooks — bytecode rewriting in Java (the opentelemetry-javaagent JAR), monkeypatching in Python, ESM hooks in Node, or eBPF for system-level coverage. It gives you HTTP, gRPC, and database spans for free without code changes. Manual instrumentation requires SDK calls in your code — tracer.start_as_current_span() in Python, tracer.Start() in Go — and lets you attach business-level attributes like tenant_id, plan, or feature_flag that auto-instrumentation can’t infer. The production pattern is both: turn on auto-instrumentation for baseline coverage across every service immediately, then add manual spans and attributes for the handful of business operations that drive your most valuable queries. Auto-only leaves your best queries impossible to write; manual-only is too slow to roll out at fleet scale.
What they’re really probing: whether you’ll default to one or the other in a rollout plan.
In a rollout plan, state the sequencing explicitly: auto-instrumentation goes on first for fleet-wide baseline HTTP and DB spans, and manual spans and attributes follow for the handful of business operations that matter to your queries.
When would you reach for eBPF auto-instrumentation (Beyla, Pixie)?
Concept: eBPF coverage | Difficulty: senior | Stage: system design
Direct answer: Reach for eBPF auto-instrumentation — Grafana’s Beyla, New Relic’s Pixie, Coroot, or the upstream OpenTelemetry eBPF Collector — when you need RED metrics and partial traces across services you can’t or won’t modify: third-party binaries, legacy services without an active maintainer, or polyglot fleets where instrumenting every language is uneconomical. eBPF attaches at the kernel/syscall layer, so it sees HTTP, gRPC, and DB calls regardless of the language above it (Beyla GitHub repo). Beyla was donated to the OpenTelemetry project in 2024, which consolidated the upstream eBPF instrumentation story under the CNCF umbrella. The ceiling is important: business-level attributes like tenant_id or feature_flag still require SDK code — eBPF gives you the network-level picture but cannot infer application-level context.
What they’re really probing: whether you understand eBPF’s ceiling.
The candidate red flag: saying eBPF means you can skip SDK instrumentation. It can’t — business-level attributes still need code. Beyla and Pixie give you the network-level picture for free; the application-level depth still requires SDK spans with custom attributes.
How do you propagate trace context across a Kafka producer/consumer?
Concept: context propagation | Difficulty: senior | Stage: system design
Direct answer: Use the OpenTelemetry Kafka instrumentation, which injects the W3C traceparent and tracestate headers into the Kafka message headers on the producer side and extracts them on the consumer side. The consumer’s span links to or continues the producer’s trace using the extracted span context. The propagator is configured globally via the SDK’s propagator registry; typically the default tracecontext,baggage composite propagator works without code changes. The propagators API spec defines the interfaces. The principle extends to any async boundary: any time data crosses a process boundary, the trace context must travel with it. For Kafka, SQS, and HTTP the auto-instrumentation handles injection; for custom async paths (a row another service polls, a file on shared storage), inject the trace context manually as a column or field.
What they’re really probing: whether you can extend the pattern to any async boundary — SQS, scheduled jobs, webhooks.
The generalization is what senior interviewers want: any time data crosses a process boundary, the trace context has to travel with it. For Kafka and HTTP, auto-instrumentation handles header injection; for custom async boundaries you inject the trace context manually as a trace_context column or field. The principle generalizes beyond any single transport.
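For the manual case, it helps to know what actually travels. The sketch below hand-rolls the W3C `traceparent` value (version `00`, 32 hex characters of trace ID, 16 of parent span ID, 2 of flags) and carries it in a Kafka-style list of `(key, bytes)` headers. The `inject` and `extract` functions are illustrative assumptions; in real services the OpenTelemetry propagators do this for you, and `tracestate` and baggage are left out here.

```python
import re
import secrets

TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def inject(headers, trace_id: bytes, span_id: bytes, sampled: bool = True):
    """Append a W3C traceparent entry to a list of (key, bytes) headers."""
    flags = "01" if sampled else "00"
    value = f"00-{trace_id.hex()}-{span_id.hex()}-{flags}"
    headers.append(("traceparent", value.encode("ascii")))


def extract(headers):
    """Return (trace_id, parent_span_id, sampled), or None if absent or malformed."""
    for key, raw in headers:
        if key != "traceparent":
            continue
        match = TRACEPARENT_RE.match(raw.decode("ascii", errors="replace"))
        if match is None:
            return None
        trace_hex, span_hex, flags = match.groups()
        if trace_hex == "0" * 32 or span_hex == "0" * 16:
            return None  # all-zero IDs are invalid in the W3C format
        sampled = bool(int(flags, 16) & 1)
        return bytes.fromhex(trace_hex), bytes.fromhex(span_hex), sampled
    return None


# Producer side: attach the current span's context to the outgoing message.
producer_trace_id = secrets.token_bytes(16)
producer_span_id = secrets.token_bytes(8)
message_headers = [("content-type", b"application/json")]
inject(message_headers, producer_trace_id, producer_span_id)

# Consumer side: recover the context and start a span that continues or links to it.
parent_context = extract(message_headers)
```

The same string works as a `trace_context` column in a polled table or a field in a file on shared storage; only the carrier changes.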
A trace is missing spans from one service. How do you debug it?
Concept: debugging | Difficulty: senior | Stage: incident-response
Direct answer: Walk the SDK-to-backend pipeline in five steps, in order. Naming this pipeline walk unprompted is itself the senior signal — junior candidates default to “check the logs,” which doesn’t isolate which stage is failing:
- Is the SDK starting spans? Add a span_processor that logs span starts, or query the service’s local OTLP endpoint.
- Is the SDK exporting? Look for batch-processor logs and exporter errors in the service.
- Is the Collector receiving? The Collector’s own metrics, exposed via the telemetry config block, show received spans per pipeline.
- Is the Collector dropping? The memory_limiter or tail_sampling processor may be discarding spans; check their refused-span counters.
- Is the exporter delivering? Exporter retry counts and the backend’s own ingest dashboard close the loop.
What they’re really probing: whether you have a systematic mental model of the SDK-to-backend pipeline, not just “check the logs.”
A frequent answer to this question on r/devops threads is “propagation broke at a gRPC boundary because the team had disabled auto-instrumentation to avoid a dependency conflict,” which is exactly the kind of root cause that emerges from walking the pipeline. The Collector’s debug exporter, which prints received spans to the Collector’s own console output, is invaluable in step 3: turn it on and you see exactly what the Collector thinks it has.
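The five-step walk amounts to comparing span counts at successive stages and stopping at the first hop that loses data. A minimal sketch follows; the stage names are hypothetical labels rather than real metric names, and you would fill the counts from SDK logs, the Collector's own metrics, and the backend's ingest dashboard.

```python
# Hypothetical stage labels, in pipeline order; not real metric names.
STAGES = [
    "sdk_spans_started",
    "sdk_spans_exported",
    "collector_spans_received",
    "collector_spans_after_processors",
    "exporter_spans_delivered",
]


def first_lossy_stage(counts, tolerance=0.01):
    """Return (previous_stage, stage, lost) for the first hop losing spans, else None."""
    if counts[STAGES[0]] == 0:
        return (None, STAGES[0], 0)  # the SDK never started a span
    for prev, cur in zip(STAGES, STAGES[1:]):
        lost = counts[prev] - counts[cur]
        if lost > counts[prev] * tolerance:
            return (prev, cur, lost)
    return None


observed = {
    "sdk_spans_started": 1000,
    "sdk_spans_exported": 1000,
    "collector_spans_received": 1000,
    "collector_spans_after_processors": 400,
    "exporter_spans_delivered": 400,
}
culprit = first_lossy_stage(observed)
```

A drop at the processor stage is not automatically a bug: tail sampling discards spans on purpose. The walk tells you where to look; whether the loss is intended is the next question.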
Questions to ask the interviewer
Asking these signals seniority and surfaces information you genuinely need before accepting an offer:
- What signals are you collecting today — traces, metrics, logs, profiles — and which are the gaps you’d want me to close?
- What backend do you ship to, and how is the Collector deployed — agent-only, gateway-only, or hybrid?
- Do you do tail-based sampling? If so, where does the decision policy live, and who owns tuning it?
- How do you manage Collector configuration across the fleet — manual YAML, GitOps, OpAMP, or something else?
- What’s your monthly observability spend, and what’s the biggest cost driver — traces, metrics cardinality, or logs?
- What’s the on-call story when the observability stack itself goes down — do you have a backup ingest path?
- What part of the OTel spec are you actively waiting on — profiles GA across more languages, messaging semconv stabilization, something else?
- How does a new service get onboarded into the OTel pipeline today, and how long does it typically take?
OpenTelemetry interview prep: A 7-day sequence
Cram-paths fail because OTel rewards operational depth. A focused week beats a panicked weekend:
- Day 1 — Re-read the spec landing pages. Walk through opentelemetry.io’s Signals page for each of the four signals. Note the stability status. Note the date each one stabilized.
- Day 2 — Run a Collector locally. Pull the contrib image, point a sample SDK at it, write a minimal YAML pipeline with otlp receiver → memory_limiter → batch → debug exporter. Watch the spans print.
- Day 3 — Add tail sampling. Modify the YAML to include the tail_sampling processor with one keep-all-errors policy and one probabilistic policy. Trigger errors in the sample SDK, confirm they pass through.
- Day 4 — Instrument a real service. Pick one of your own services, add the SDK manually plus auto-instrumentation. Add three business-level attributes (user_id, plan, feature_flag). Watch them show up in span attributes.
- Day 5 — Read the practitioner threads. Skim recent r/devops and r/sre OpenTelemetry threads. Note recurring complaints — those are the questions interviewers will probe.
- Day 6 — Map your war story. Pick one observability incident from your past — a debugging session, a cost spike, a rollout — and rehearse telling it in STAR shape with OTel-specific detail (which signal, which Collector component, what processor change).
- Day 7 — Mock the loop. Run through the foundation, Collector, sampling, and instrumentation question blocks above out loud. Listen for the moment you reach for a hand-wave — that’s where you re-read.
Two specific traps to watch on interview day: do not call OpenTelemetry “a tracing library” (it has been a four-signal project since 2024), and do not flip trace ID and span ID (a trace contains many spans; spans share a trace ID). Either tell flags you as junior in seconds.