Bioinformatics Interview Questions: Junior, Mid, and Senior Answers Anchored to Named Scientific Failures and 2024-2026 Research
Bioinformatics interviews in 2026 test three things simultaneously: your command of tools and file formats (SAM/BAM, GATK, DESeq2, AlphaFold), your awareness of documented failures in the field (Potti/Duke fraud, Excel gene-name auto-conversion, CRISPR off-target retraction), and your judgment about when to apply statistical rigor under reproducibility pressure. This guide organizes questions by career tier — junior, mid, and senior — and anchors every key answer to the named papers and postmortems that senior interviewers cite.
- What’s the difference between a bioinformatician and a computational biologist?
- Walk me through how BLAST works and when you’d use it vs ClustalW vs HMMER.
- Walk me through FASTQ → SAM → BAM → VCF — what changes at each step?
- What’s a Phred quality score, and why does Q30 matter?
- You ran BLAST with default settings and got too many hits. What knobs do you turn?
- What’s multiple-testing correction, and when would you use Bonferroni vs Benjamini-Hochberg?
- How do you decide between NCBI, Ensembl, and UCSC for a given task?
- How would you evaluate the quality of a FASTQ file before alignment?
- Walk me through GATK best practices for germline variant calling.
- You have RNA-seq data with 6 samples per group. Which differential-expression tool do you choose and why?
- What does the CIGAR string 8M1I4M1D3M tell you about an alignment?
- BWA-MEM vs Bowtie2 vs minimap2: when do you reach for each?
- What are batch effects, and how would you detect and correct them in an RNA-seq dataset?
- How would you make your variant-calling pipeline reproducible across labs?
- Walk me through the AlphaFold 2 architecture. How does AlphaFold 3 differ?
- Walk me through a complete scRNA-seq workflow including the choices that matter.
- Why would a senior interviewer immediately reject a candidate who exports gene lists to Excel?
- Walk me through what went wrong with the Anil Potti / Duke chemotherapy genomics work.
- How would you design a multi-omics integration analysis combining bulk RNA-seq, single-cell RNA-seq, and ATAC-seq?
- You’re asked to apply transformer models to genomics. What baselines do you establish first?
Bioinformatics Hiring in 2026: What Actually Changed
This guide is for biology, computer science, or statistics graduates targeting bioinformatics scientist, bioinformatics engineer, or computational biology roles at biotech (Genentech, Regeneron, Illumina, Moderna, 23andMe), pharma R&D (Pfizer, Novartis, Roche), academic labs (Broad Institute, Wellcome Sanger, EMBL-EBI, MSKCC), or genomics core facilities. It is not written for learners taking a general “computational biology” MOOC or for candidates who have not yet worked with sequencing data.
Before diving into questions, one distinction matters at the resume-screening stage:
- Bioinformatics = engineering tools, pipelines, and data infrastructure for biological data. Writing the BWA wrapper, building the variant-calling pipeline, maintaining cluster schedulers. Often hired into bioinformatics-engineer or bioinformatics-scientist roles at biotech and pharma. CS and statistics backgrounds are common.
- Computational biology = using computational methods to answer biological questions — designing the experiment, formulating the model, interpreting results in biological context. Often hired into research-scientist or PI-track roles at academic labs. Biology PhDs with computational training dominate.
Most postings labeled “bioinformatics” sit between these poles. Read the JD carefully to see which side it leans before tailoring your answers.
2024-2026 hiring shifts worth knowing before your first screen
- Long-read sequencing is now standard. PacBio HiFi and Oxford Nanopore have moved from niche to expected in many pipelines. The 2022 T2T-CHM13 telomere-to-telomere assembly and the Human Pangenome Reference Consortium (2023) reshape variant-calling benchmarks — candidates who mention only GRCh38 may be flagged as behind.
- AlphaFold 3 and foundation models. Abramson et al., Nature 630, 493–500 (2024); doi: 10.1038/s41586-024-07487-w extended AlphaFold to protein-DNA, protein-RNA, and protein-ligand complexes. AlphaMissense (variant pathogenicity prediction), Evo (DNA foundation model), and Geneformer (single-cell transformers) are now cited in senior screens at pharma and biotech.
- Multi-omics integration. MOFA+, Seurat v5 multimodal, and scvi-tools are standard at top labs. Candidates who know only bulk RNA-seq will be asked why they haven’t touched single-cell or ATAC-seq.
- Reproducibility expectations. Snakemake or Nextflow plus Conda or Singularity or Docker is now the baseline for senior candidates, not a bonus. “I have a shell script” is a red flag.
- NIH funding pressure. 2024-2025 budget cuts have squeezed genomics core facilities; pharma and biotech are absorbing more computational talent, shifting the negotiating balance.
Salary signal (industry-reported aggregator data, hedged): mid-to-senior bioinformatics scientists cluster in the $110K–$190K range in US markets; PhD-level computational specialization commands a meaningful premium at pharma and large biotech.
What Bioinformatics Interviews Actually Test in 2026
Interviewers at Broad Institute and similar institutions cycle through four question types:
- Sequence and database fundamentals: BLAST, alignment algorithms, FASTQ/SAM/BAM/VCF format internals, NCBI/Ensembl/UCSC navigation.
- NGS pipeline design: variant calling (BWA→GATK), RNA-seq differential expression (DESeq2 vs edgeR vs limma-voom), scRNA-seq (Seurat vs Scanpy, clustering, batch correction).
- Named-incident postmortem framing: “Tell me about a published bioinformatics result that turned out to be wrong.” Potti/Duke, Excel gene names, ENCODE 80% functional, Schaefer CRISPR retraction — these come up at senior screens.
- Statistical and ML methodology under reproducibility pressure: multiple-testing correction, batch effects, train-test contamination, GWAS winner’s curse.
What’s the difference between a bioinformatician and a computational biologist?
Concept: Role disambiguation | Difficulty: foundational | Stage: intro / phone screen
Direct answer: A bioinformatician primarily builds and maintains tools and pipelines that process biological data; a computational biologist primarily uses those tools to generate and test biological hypotheses. The distinction maps roughly to software-engineering versus research-scientist tracks, though most roles blend both. The JD almost always tells you which side of the line a specific opening sits on — if it lists “pipeline maintenance,” “cloud infrastructure,” or “HPC cluster management,” it’s engineering-leaning; if it lists “experimental design,” “model development,” or “manuscript preparation,” it’s research-leaning.
What they’re really probing: Whether you’ve thought clearly about what kind of work you’re pursuing and whether your background matches the role. Candidates who call themselves “bioinformaticians” while describing only wet-lab work, or “computational biologists” while describing only DevOps, signal misalignment.
Walk me through how BLAST works and when you’d use it vs ClustalW vs HMMER.
Concept: Sequence search algorithms | Difficulty: foundational | Stage: technical screen
Direct answer: BLAST, documented at NCBI BLAST+ documentation, uses a seed-and-extend heuristic: short word matches (“seeds”) are extended into local alignments, scored with a substitution matrix for protein (BLOSUM62 by default) or with match rewards and mismatch penalties for nucleotide. The E-value reports the number of hits with at least that score expected by chance in a database of that size. Use BLAST when you need fast identification of known homologs against large databases. ClustalW and MUSCLE are multiple-sequence aligners: use them after you’ve found your sequences to produce a column-by-column alignment for phylogenetics. HMMER uses profile hidden Markov models trained on aligned families; it is more sensitive than BLAST for distant homologs and is the backend for Pfam and InterPro domain searches.
What they’re really probing: Whether you understand that each tool solves a different problem. Reaching for BLAST when you need a profile-based distant-homolog search signals a gap.
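To make the seed-and-extend idea concrete, here is a toy sketch in Python. It is not how BLAST is implemented: it uses exact k-mer seeds, ungapped extension, and an X-drop cutoff, and the scoring values (`match=1`, `mismatch=-2`, `x_drop=3`) and function names are illustrative assumptions.

```python
from collections import defaultdict


def build_index(subject, k):
    """Map every k-mer in the subject to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(subject) - k + 1):
        index[subject[i:i + k]].append(i)
    return index


def extend(query, subject, q, s, k, match=1, mismatch=-2, x_drop=3):
    """Ungapped extension of one seed, right then left, with an X-drop stop."""
    score = best = k * match
    right = 0
    i = 0
    while q + k + i < len(query) and s + k + i < len(subject):
        score += match if query[q + k + i] == subject[s + k + i] else mismatch
        i += 1
        if score > best:
            best, right = score, i
        elif best - score > x_drop:
            break  # score fell too far below the best seen: stop extending
    score = best
    left = 0
    i = 0
    while q - 1 - i >= 0 and s - 1 - i >= 0:
        score += match if query[q - 1 - i] == subject[s - 1 - i] else mismatch
        i += 1
        if score > best:
            best, left = score, i
        elif best - score > x_drop:
            break
    # (score, query start, subject start, alignment length)
    return best, q - left, s - left, k + left + right


def seed_and_extend(query, subject, k=4):
    """Find exact k-mer seeds, extend each one, return hits best-first."""
    index = build_index(subject, k)
    hits = set()
    for q in range(len(query) - k + 1):
        for s in index.get(query[q:q + k], []):
            hits.add(extend(query, subject, q, s, k))
    return sorted(hits, reverse=True)
```

Raising `k` in this sketch mirrors the effect of a larger word size: fewer seeds are found, so fewer spurious extensions are attempted, at the cost of missing alignments that lack a long exact match.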
Walk me through FASTQ → SAM → BAM → VCF — what changes at each step?
Concept: NGS data format pipeline | Difficulty: foundational | Stage: technical screen
Direct answer: FASTQ stores raw reads: sequence plus per-base Phred quality scores, four lines per read. SAM (Sequence Alignment/Map), described in Li et al. 2009, Bioinformatics 25(16):2078–2079, adds alignment context: each record has 11 mandatory fields — QNAME, FLAG (bitwise), RNAME, POS (1-based leftmost), MAPQ (Phred-scaled mapping quality), the CIGAR run-length string (one of the 11 fields), MRNM, MPOS, ISIZE, SEQ, and QUAL. BAM is the BGZF-compressed binary equivalent — for a 112 Gbp Illumina dataset, BAM requires roughly 116 GB (~1.0 byte per input base). VCF records only positions where the sample differs from the reference, plus genotype likelihoods and filter annotations.
What they’re really probing: Whether you can explain the WHY behind each format. MAPQ encoding, FLAG bit arithmetic, and CIGAR operations are common follow-up probes.
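Since FLAG bit arithmetic is a common follow-up, a short sketch of decoding a SAM record may help. The field names follow the list above (the current SAM specification calls the last three positional fields RNEXT, PNEXT, and TLEN); the function names and the example record are made up for illustration.

```python
# The 11 mandatory SAM columns, named as in the answer above.
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "MRNM", "MPOS", "ISIZE", "SEQ", "QUAL"]

# FLAG is a bitwise OR of these values.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """Return the names of all bits set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


def parse_sam_line(line):
    """Split one alignment line into the 11 mandatory fields plus optional tags."""
    cols = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, cols[:11]))
    for key in ("FLAG", "POS", "MAPQ", "MPOS", "ISIZE"):
        record[key] = int(record[key])
    record["TAGS"] = cols[11:]
    return record
```

For example, a FLAG of 99 decomposes into 1 + 2 + 32 + 64: a paired read, in a proper pair, whose mate is on the reverse strand, and which is the first read of the pair.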
What’s a Phred quality score, and why does Q30 matter?
Concept: Quality encoding | Difficulty: foundational | Stage: technical screen
Direct answer: A Phred quality score Q encodes base-call error probability as Q = -10 × log₁₀(P), where P is the probability that the base call is wrong. Q30 means P = 0.001, or 1 error in 1,000 bases (99.9% base-call accuracy). Q20 means 1 error in 100 bases; Q40 means 1 error in 10,000 bases. Many variant-calling pipelines require a minimum mapping quality (MAPQ, a separate Phred-scaled score for the alignment rather than the base) of 20 and a per-base quality threshold in the Q20–Q30 range for reliable SNV calling. Q30 appears in QC thresholds because errors from low-quality bases compound downstream: a Q10 base has a 10% chance of being wrong, 100 times the error rate of a Q30 base, so a few low-quality bases stacked at one position can look like a variant.
What they’re really probing: That you understand the logarithmic scale and can reason about error rates during QC filtering, not just that you’ve seen the FastQC output screen.
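The logarithmic relationship is easy to check numerically. The sketch below converts between Q and P and decodes a FASTQ quality string, assuming the Phred+33 ASCII offset used by current Illumina output; the function names are illustrative.

```python
import math


def phred_to_error(q):
    """Error probability P for a Phred score Q: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_to_phred(p):
    """Phred score Q for an error probability P: Q = -10 * log10(P)."""
    return -10 * math.log10(p)


def decode_quality(qual, offset=33):
    """Turn a FASTQ quality string into integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in qual]
```

With this encoding, the characters `I`, `?`, `5`, and `+` decode to Q40, Q30, Q20, and Q10 respectively.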
Bioinformatics Reference Stack (2026)
Tool choice questions are among the highest-signal junior-to-mid prompts — interviewers are checking whether you can articulate why each tool exists, not just name it.
| Layer | Tool / Database | When to Use | Common Interview Probe |
|---|---|---|---|
| Short-read DNA alignment | BWA-MEM (Heng Li, 2013) | Gold standard for 70bp–few Mbp reads against large genomes; GATK Best Practices default | “When would you use BWA-aln vs BWA-MEM?” |
| Short-read DNA alignment | Bowtie2 (Langmead) | Fast end-to-end alignment; useful for RNA-seq splice-unaware cases | “Why doesn’t GATK recommend Bowtie2 for germline calling?” |
| Long-read alignment | minimap2 (Heng Li) | PacBio HiFi + Oxford Nanopore; assembly-to-assembly; spliced RNA-seq (as of April 2025, per Heng Li blog) | “How does minimap2 handle splice sites differently from STAR?” |
| Variant calling | GATK Best Practices (Broad Institute) | Germline SNPs + Indels, somatic variants, CNVs, RNAseq variants; hg38 reference | “Walk me through the germline variant calling pipeline end to end.” |
| RNA-seq differential expression | DESeq2 / edgeR / limma-voom | DESeq2 for small-to-medium studies; edgeR for quasi-likelihood; limma-voom for large arrays or when linear model flexibility matters | “Why can’t you give DESeq2 TPM values?” |
| scRNA-seq | Seurat (R, Satija lab) / Scanpy (Python, Theis lab) | Choose based on team language; Scanpy scales to 1M+ cells; Seurat v5 introduced multimodal integration | “When would you choose Scanpy over Seurat?” |
| Sequence search | BLAST+ (NCBI) | Known-homolog search; functional annotation; blastn / blastp / blastx / tblastn / DELTA-BLAST | “What’s the difference between blastx and tblastn?” |
| Genome browser | UCSC / Ensembl / IGV | UCSC for annotation track richness; Ensembl for API access; IGV for local BAM inspection | “How would you load a custom BED track in UCSC?” |
| Reproducible pipelines | Snakemake / Nextflow + Conda / Singularity / Docker | Any production pipeline; Nextflow preferred at large institutions; Snakemake preferred in academic labs | “How do you pin tool versions in a Snakemake workflow?” |
| Protein structure | AlphaFold 2 / AlphaFold 3 / RoseTTAFold All-Atom | AF2 for protein monomers; AF3 for protein-DNA/RNA/ligand complexes | “How do you interpret pLDDT when deciding whether to trust an AF2 prediction?” |
| Chromatin / ATAC-seq | MACS2 / ChIPseeker | Peak calling + annotation | “Fragment size distribution in ATAC-seq?” |
| Foundation models (2024-2026) | AlphaMissense / Evo / Geneformer | Variant pathogenicity / DNA / single-cell modeling | “What baseline before trusting AlphaMissense?” |
Junior-Tier Questions: Definitions, Tools, and Formats (0–2 Years)
Junior bioinformatics screens focus on format literacy (FASTQ/SAM/BAM/VCF, Phred quality), tool parameter awareness, database navigation, and basic statistics. The most common failure mode is answering “I just use defaults” — which signals that you haven’t had to troubleshoot a real dataset.
You ran BLAST with default settings and got too many hits. What knobs do you turn?
Concept: BLAST parameterization | Difficulty: junior | Stage: technical screen
Direct answer: The primary controls are: E-value threshold (default 10 — reduce to 0.001 or 1e-5 to cut noise), word size (default 11 for blastn — increase to 15+ to require longer seed matches and reduce false alignments), and the scoring matrix (BLOSUM62 for standard protein; BLOSUM45 for distant homologs; PAM matrices for highly diverged sequences). For BLAST+ applications, the -max_target_seqs and -max_hsps flags control output volume; reducing E-value is almost always the first move. If you’re using blastx or tblastn, choosing the right genetic code matters for non-standard organisms.
What they’re really probing: Parameter sensitivity awareness. The follow-up is often “what happens to sensitivity when you lower the E-value threshold?” The answer is that you get fewer hits but risk missing real, distant homologs, so there’s a sensitivity-specificity tradeoff.
Trap: Saying “I’d just filter the output table afterward” signals you don’t understand that filtering after the fact doesn’t recover hits that were never computed because the seed never extended.
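As a rough illustration of where these knobs live, the sketch below assembles a `blastn` argument list in Python. The flags shown (`-evalue`, `-word_size`, `-max_target_seqs`, `-max_hsps`, `-outfmt 6`) are standard BLAST+ options; the function name, default values, and file names are assumptions chosen for the example, and nothing is executed.

```python
def build_blastn_command(query, database, evalue=1e-5, word_size=15,
                         max_target_seqs=50, max_hsps=1):
    """Return an argv list for a stricter-than-default blastn search."""
    return [
        "blastn",
        "-query", query,
        "-db", database,
        "-evalue", str(evalue),            # tighter E-value cuts chance hits
        "-word_size", str(word_size),      # longer seeds mean fewer extensions
        "-max_target_seqs", str(max_target_seqs),
        "-max_hsps", str(max_hsps),
        "-outfmt", "6",                    # tabular output for downstream parsing
    ]
```

The resulting list can be handed to `subprocess.run` on a machine with BLAST+ installed; keeping the parameters explicit in code also records exactly which thresholds produced a given hit table.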
What’s multiple-testing correction, and when would you use Bonferroni vs Benjamini-Hochberg?
Concept: Multiple testing correction | Difficulty: junior | Stage: technical screen or stats round
Direct answer: When you test many hypotheses simultaneously (for example, differential expression across 20,000 genes), the probability of at least one false positive by chance alone grows sharply. Multiple-testing correction accounts for this. Bonferroni correction divides the significance threshold α by the number of tests (α/N), controlling the family-wise error rate (FWER): the probability of making at least one false positive across the whole set of tests. It is conservative: use it when a single false positive is dangerous (e.g., a clinical genomic diagnostic). Benjamini-Hochberg (FDR correction) controls the false discovery rate: the expected proportion of false positives among all results called significant. Use it for exploratory RNA-seq analysis where you expect many true positives and can tolerate some false discoveries, then validate top hits experimentally. DESeq2 applies Benjamini-Hochberg by default via its results() function.
What they’re really probing: Whether you understand FWER vs FDR conceptually, not just which formula to apply. The probe often extends to “what’s a p-value adjusted to 0.05 FDR actually saying?”
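A small worked example shows how differently the two corrections behave on the same p-values. This is a plain-Python sketch of the textbook procedures, not the code any particular package runs; the function names are illustrative.

```python
def bonferroni(pvals):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    n = len(pvals)
    return [min(1.0, p * n) for p in pvals]


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values (step-up FDR procedure)."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])  # indices by ascending p
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * n / rank)
        adjusted[i] = running_min
    return adjusted
```

For the p-values 0.01, 0.04, 0.03, and 0.005, Bonferroni yields 0.04, 0.16, 0.12, and 0.02, so only two tests pass at 0.05. Benjamini-Hochberg yields 0.02, 0.04, 0.04, and 0.02, so all four pass at a 5% FDR.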
How do you decide between NCBI, Ensembl, and UCSC for a given task?
Concept: Genomic database selection | Difficulty: junior | Stage: technical screen
Direct answer: UCSC Genome Browser (UCSC docs) excels at annotation track density — known genes, ESTs, CpG islands, cross-species homologies, GWAS data via Genome Graphs — and its MariaDB-backed Table Browser enables SQL-style queries over annotation tables. Use UCSC when you need rich track comparison or custom track loading. Ensembl (EMBL-EBI pathway course at doi: 10.6019/TOL.IntroPathway-t.2021.00001.1) has a well-documented REST API and variant effect predictor (VEP) — use it for programmatic access and cross-species genomics. NCBI Entrez integrates BLAST, gene records, and literature — start here for literature-to-sequence workflows or when you need transcript variants and RefSeq status.
What they’re really probing: Task-appropriate tool selection — you don’t open the same browser by habit.
How would you evaluate the quality of a FASTQ file before alignment?
Concept: Pre-alignment QC | Difficulty: junior | Stage: technical screen
Direct answer: The standard tool is FastQC, which reports: per-base quality score distribution (flag if mean drops below Q20 at 3′ end), per-sequence quality score distribution, per-base sequence content (flag GC bias or adapter contamination), overrepresented sequences (adapter dimers), and k-mer enrichment. For a paired-end library, run FastQC on both R1 and R2 before trimming. Then run MultiQC across samples to spot batch effects in quality metrics before alignment. Common findings: 3′ quality drop (trim with Trimmomatic or Cutadapt, though modern aligners such as STAR and BWA-MEM can soft-clip poorly matching read ends), adapter contamination (trim), and unexpected GC bias (flag for possible rRNA contamination or library prep artifact). The Galaxy Training Network, which hosts 525 tutorials, includes QC tutorials covering these steps.
What they’re really probing: Whether you have a systematic QC habit before touching alignment. “I just run the aligner and look at the alignment rate” is a weak answer.
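To show what the per-base quality check is measuring, here is a minimal sketch that computes mean Phred quality at each read position. It assumes uncompressed four-line FASTQ records with Phred+33 encoding; it is a teaching aid, not a replacement for FastQC, and the function names are illustrative.

```python
def read_fastq(lines):
    """Yield (name, sequence, quality) from an iterable of FASTQ lines."""
    it = iter(lines)
    for header in it:
        header = header.rstrip()
        if not header:
            continue  # tolerate blank lines between records
        seq = next(it, "").rstrip()
        next(it, "")  # the '+' separator line
        qual = next(it, "").rstrip()
        if len(seq) != len(qual):
            raise ValueError(f"truncated or malformed record: {header}")
        yield header[1:], seq, qual


def mean_quality_by_position(lines, offset=33):
    """Mean Phred score at each read position across all records."""
    totals, counts = [], []
    for _, _, qual in read_fastq(lines):
        for i, ch in enumerate(qual):
            if i == len(totals):
                totals.append(0)
                counts.append(0)
            totals[i] += ord(ch) - offset
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]


def low_quality_positions(means, threshold=20):
    """0-based positions whose mean quality falls below the threshold."""
    return [i for i, m in enumerate(means) if m < threshold]
```

Passing an open file handle works as well as a list of lines, since both are iterables of strings.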
Mid-Tier Questions: Pipeline Design and Statistical Methodology (2–5 Years)
Mid-level screens probe pipeline architecture decisions, not just tool names. Interviewers expect you to justify why each step exists and what goes wrong when you skip it.
Walk me through GATK best practices for germline variant calling.
Concept: Germline variant calling pipeline | Difficulty: mid | Stage: technical deep-dive
Direct answer: The germline pipeline per GATK Best Practices runs in six stages:
- BWA-MEM alignment with read group tags — mandatory for downstream GATK steps that parse RG fields.
- MarkDuplicates (Picard) — flags PCR duplicates to prevent coverage inflation in variant calling.
- BQSR (Base Quality Score Recalibration): models systematic errors in the instrument’s reported base qualities, using known variant sites so that real variation is not counted as error; materially improves SNV accuracy.
- HaplotypeCaller — produces a per-sample GVCF that records evidence at every site, enabling cohort joint genotyping.
- GenotypeGVCFs — aggregates GVCFs across samples into a cohort VCF.
- VariantFiltration or VQSR — final filtering; VQSR preferred when cohort size is sufficient. Standard reference: hg38.
What they’re really probing: That you can explain WHY each step exists — duplicate marking prevents inflation, BQSR corrects instrument error, GVCF enables scalable joint genotyping — not just recite the order.
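The stage order above can be written down as a single-sample command plan. The sketch below only builds argument lists and runs nothing; the tool names and flags are standard `bwa` and GATK4 options, while the function name, file-naming scheme, and `PL:ILLUMINA` read-group value are assumptions for illustration. Coordinate sorting, multi-sample GVCF consolidation, and the final filtering step are left out.

```python
def germline_plan(sample, ref, r1, r2, known_sites):
    """Return (stage, argv) pairs for one sample, in execution order."""
    # bwa accepts literal "\t" sequences in the read-group string.
    read_group = f"@RG\\tID:{sample}\\tSM:{sample}\\tPL:ILLUMINA"
    sorted_bam = f"{sample}.sorted.bam"   # assumed output of a sort step
    dedup_bam = f"{sample}.dedup.bam"
    recal_table = f"{sample}.recal.table"
    recal_bam = f"{sample}.recal.bam"
    gvcf = f"{sample}.g.vcf.gz"
    return [
        ("align", ["bwa", "mem", "-R", read_group, ref, r1, r2]),
        ("mark_duplicates", ["gatk", "MarkDuplicates", "-I", sorted_bam,
                             "-O", dedup_bam, "-M", f"{sample}.dup_metrics.txt"]),
        ("bqsr_model", ["gatk", "BaseRecalibrator", "-R", ref, "-I", dedup_bam,
                        "--known-sites", known_sites, "-O", recal_table]),
        ("bqsr_apply", ["gatk", "ApplyBQSR", "-R", ref, "-I", dedup_bam,
                        "--bqsr-recal-file", recal_table, "-O", recal_bam]),
        ("call", ["gatk", "HaplotypeCaller", "-R", ref, "-I", recal_bam,
                  "-O", gvcf, "-ERC", "GVCF"]),
        ("genotype", ["gatk", "GenotypeGVCFs", "-R", ref, "-V", gvcf,
                      "-O", f"{sample}.vcf.gz"]),
    ]
```

In a real pipeline each pair would become one Snakemake rule or Nextflow process, which is what makes the inputs and outputs of every stage explicit.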
You have RNA-seq data with 6 samples per group. Which differential-expression tool do you choose and why?
Concept: Differential expression tool selection | Difficulty: mid | Stage: technical deep-dive
Direct answer: With 6 samples per group, DESeq2 (Love, Huber, Anders 2014, Genome Biology 15:550) is the most defensible default. It fits a negative binomial GLM and uses empirical Bayes shrinkage of dispersion estimates, which works well at n=6 where per-gene estimates would be noisy. Critical requirement: DESeq2 needs raw count matrices, not TPM or RPKM, because its dispersion estimates depend on the mean-variance relationship of unnormalized counts. Feed it Salmon or kallisto output via tximport/tximeta. Use lfcShrink with the apeglm method before ranking genes. edgeR (Robinson et al. 2010) uses quasi-likelihood F-tests and is slightly more flexible for complex designs. limma-voom (Law et al. 2014) transforms counts to log-CPM with precision weights; it scales best to large sample sizes and offers the most flexible design formula.
What they’re really probing: That you know DESeq2 requires raw counts, understand why (dispersion estimation), and can explain the tradeoffs between the three frameworks.
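One way to see why raw counts matter is to look at how library-size normalization is estimated from them. The sketch below implements the median-of-ratios size-factor idea in plain Python; it is a simplified illustration of that technique rather than DESeq2's own code, and the function name and input layout are assumptions.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts: list of genes, each a list of raw counts (one per sample).
    """
    n_samples = len(counts[0])
    log_ratios = [[] for _ in range(n_samples)]
    for gene in counts:
        if any(c == 0 for c in gene):
            continue  # log(0) is undefined, so genes with a zero are skipped
        log_geo_mean = sum(math.log(c) for c in gene) / n_samples
        for j, c in enumerate(gene):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    # Each sample's factor is the median ratio to the per-gene geometric mean.
    return [math.exp(median(r)) for r in log_ratios]
```

If one sample was simply sequenced twice as deeply as another, its size factor comes out twice as large, and dividing by the factors puts the samples on a common scale. Feeding in TPM values would remove that depth information before the model ever sees it.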
What does the CIGAR string 8M1I4M1D3M tell you about an alignment, and when does the difference between =/X operators matter?
Concept: SAM CIGAR field internals | Difficulty: mid | Stage: technical deep-dive
Direct answer: The CIGAR string is one of the 11 mandatory SAM fields; it encodes alignment operations as a compact run-length string. Reading 8M1I4M1D3M left to right: 8M = 8 bases of match or mismatch (M ambiguously covers both); 1I = 1 base inserted in the read relative to the reference (consumes read, not reference); 4M = 4 more match/mismatch bases; 1D = 1 base deleted from the reference (consumes reference, not read); 3M = 3 final match/mismatch bases. The = operator (exact match) and X operator (explicit mismatch) are extended CIGAR operations that replace the ambiguous M. They matter for variant-calling QC: when the aligner emits =/X-explicit CIGAR (minimap2 does this with its --eqx option), you can count mismatches directly from the CIGAR without comparing SEQ against the reference. This speeds up per-base mismatch profiling and is useful in BQSR-like recalibration workflows.
What they’re really probing: Whether you understand that M is ambiguous (a common source of confusion), and that extended CIGAR operators have practical implications in recalibration and QC pipelines.
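The consume-read versus consume-reference rules are easiest to internalize by computing both lengths. The sketch below parses a CIGAR string and applies the operator table from the SAM specification; the function names are illustrative.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_READ = set("MIS=X")   # operators that advance along the read
CONSUMES_REF = set("MDN=X")    # operators that advance along the reference


def parse_cigar(cigar):
    """Return a list of (length, operator) pairs, rejecting malformed input."""
    ops = [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR: {cigar!r}")
    return ops


def cigar_lengths(cigar):
    """Return (bases consumed on the read, bases consumed on the reference)."""
    ops = parse_cigar(cigar)
    read_len = sum(n for n, op in ops if op in CONSUMES_READ)
    ref_len = sum(n for n, op in ops if op in CONSUMES_REF)
    return read_len, ref_len


def explicit_mismatches(cigar):
    """Count mismatched bases when the CIGAR uses =/X; None if it uses M."""
    ops = parse_cigar(cigar)
    if any(op == "M" for _, op in ops):
        return None  # M hides whether a base matched
    return sum(n for n, op in ops if op == "X")
```

For 8M1I4M1D3M both lengths are 16: the read spends 8 + 1 + 4 + 3 bases and the reference spans 8 + 4 + 1 + 3 positions, with the insertion and the deletion cancelling out.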
BWA-MEM vs Bowtie2 vs minimap2 — when do you reach for each?
Concept: Aligner selection | Difficulty: mid | Stage: technical screen
Direct answer: The decision tree is read-length-first, then downstream task. BWA-MEM (Heng Li, arXiv 2013) is the gold standard for short-read DNA against large reference genomes (it accepts sequences from about 70bp up to a few Mbp). It’s the GATK Best Practices default because it handles chimeric reads and produces well-calibrated MAPQ scores that downstream variant callers rely on. Bowtie2 (Langmead) is faster for short reads in end-to-end mode but is not recommended for GATK germline calling where MAPQ calibration matters. minimap2 (Heng Li) is the standard for long-read alignment (PacBio HiFi, Oxford Nanopore, assembly-to-assembly); as of April 2025 it also handles short RNA-seq reads, per Heng Li’s blog. For RNA-seq from short reads, neither BWA-MEM nor Bowtie2 is appropriate for genome alignment; use a splice-aware aligner (STAR or HISAT2).
What they’re really probing: That you understand why splice-aware alignment exists and won’t route short RNA-seq through BWA-MEM, which is a common mistake.
What are batch effects, and how would you detect and correct them in an RNA-seq dataset from 3 sequencing centers?
Concept: Batch effect detection and correction | Difficulty: mid | Stage: technical deep-dive
Direct answer: Batch effects are systematic non-biological differences caused by sequencing center, library prep kit version, or sequencing date. In a 3-center RNA-seq dataset, detection starts with PCA on variance-stabilized or log-transformed counts (for example DESeq2’s vst output) rather than raw counts: if the first PC separates by center rather than biological condition, you have a batch effect (confirm with limma’s plotMDS). Correction strategy depends on confounding: if batch is not confounded with the biological variable, include it as a covariate in the DESeq2 or limma formula (~batch + condition), which is the cleanest approach. For visualization only, use ComBat or limma::removeBatchEffect. Do not run DE analysis on ComBat-corrected values: the test no longer accounts for the degrees of freedom spent removing batch, which can overstate significance, and adding batch to the formula on top of it corrects twice. For scRNA-seq, MNN (Haghverdi et al. 2018, Nature Biotechnology 36:421) or Seurat v3 integration are the standard batch-correction methods.
What they’re really probing: The distinction between covariate-based correction (for DE testing) and count correction (visualization only), and the specific danger of running DE on ComBat-corrected data.
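Before choosing a correction strategy, it helps to check whether the design even allows batch and condition to be separated. The sketch below tabulates samples by batch and condition and flags the case where every batch contains a single condition; the function names and labels are illustrative assumptions.

```python
from collections import defaultdict


def design_table(batches, conditions):
    """Count samples per (batch, condition) combination."""
    table = defaultdict(lambda: defaultdict(int))
    for batch, condition in zip(batches, conditions):
        table[batch][condition] += 1
    return {batch: dict(row) for batch, row in table.items()}


def batch_nested_in_condition(batches, conditions):
    """True when every batch holds only one condition.

    In that layout a ~batch + condition model cannot separate the two
    effects, so a covariate will not rescue the analysis.
    """
    table = design_table(batches, conditions)
    return all(len(row) == 1 for row in table.values())
```

A design where each center sequenced both control and treated samples passes this check; one where center A ran only controls and center B only treated samples does not, and no downstream method can fully fix that.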
How would you make your variant-calling pipeline reproducible across labs?
Concept: Pipeline reproducibility | Difficulty: mid | Stage: technical deep-dive or systems design
Direct answer: Reproducibility in variant-calling pipelines has four layers. First, workflow management: write the pipeline in Snakemake or Nextflow, with every rule, input, output, and tool invocation declared, not a shell script requiring manual ordering. Second, environment management: pin every tool version via Conda environment files or Singularity/Docker, because a BWA minor-version change can shift alignments enough to change variant calls. Third, reference data versioning: record the exact URL, checksum (MD5/SHA256), and download date of every reference genome, dbSNP build, and interval list, because two labs using “hg38” with different dbSNP builds produce different VQSR models. Fourth, execution records: Nextflow’s workflow log and Snakemake’s provenance tracking capture which tool version ran on which input, which is necessary for audit. GATK distributes versioned Docker images, and Bioconductor workflows such as rnaseqGene are tied to versioned Bioconductor releases, for the same reason.
What they’re really probing: That you’ve thought about reproducibility as an engineering problem, not just as “I’ll write README instructions.”
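The reference-data layer is the one most often skipped, so a minimal sketch may be useful. It records a SHA-256 checksum per reference file and reports any file whose contents later differ; the function names and manifest layout are assumptions, not a standard format.

```python
import hashlib
from pathlib import Path


def sha256sum(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large references fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(paths):
    """Map each reference file path to its checksum."""
    return {str(Path(p)): sha256sum(p) for p in paths}


def changed_files(manifest):
    """Return the paths whose current checksum no longer matches the manifest."""
    return [p for p, expected in manifest.items() if sha256sum(p) != expected]
```

Committing the manifest alongside the workflow definition lets a second lab confirm it is running against byte-identical reference files before comparing calls.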
Senior-Tier Questions: Architecture Decisions and Named-Incident Postmortems (5+ Years)
Senior screens expect you to defend architectural choices with named papers, cite the failures the field has already documented, and demonstrate that you know where current methods still fail. Citing exact RMSD figures, exact retraction years, and exact DOIs distinguishes candidates who have read the primary literature from those who have read summaries.
Walk me through the AlphaFold 2 architecture. How does AlphaFold 3 differ?
Concept: Protein structure prediction architecture | Difficulty: senior | Stage: technical deep-dive
Direct answer: AlphaFold 2 (Jumper et al., Nature 2021) has two main stages. The Evoformer trunk processes a multiple sequence alignment (MSA), an Nseq × Nres array, and a pairwise residue representation (Nres × Nres) through attention blocks that update both representations jointly via triangle multiplicative updates. This allows each residue pair to gather information from all other pairs simultaneously. The structure module represents residues as a “residue gas” of independent rotations and translations per residue, breaking chain connectivity to allow simultaneous local refinement across the sequence. The per-residue pLDDT confidence score (predicted local-distance difference test) correlates with actual lDDT-Cα accuracy (Pearson’s r = 0.76) and is your primary signal for which regions to trust. At CASP14, AlphaFold 2 achieved 0.96 Å RMSD95 median backbone accuracy vs 2.8 Å for the next-best method, roughly a 3× improvement. Caveats that senior interviewers expect: AlphaFold 2 still struggles with intrinsically disordered proteins (IDPs), novel folds without MSA depth, and large complexes.
AlphaFold 3 (Abramson et al., Nature 630, 493–500, 2024; doi: 10.1038/s41586-024-07487-w) replaces the Evoformer with the simpler Pairformer trunk, which reduces the amount of MSA processing, and replaces the structure module with a diffusion module. It extends the system beyond proteins to protein-DNA, protein-RNA, and protein-ligand complexes. This is a qualitative expansion: AF2 could not model drug-target binding poses or nucleic acid interactions. Because the diffusion module predicts raw atom coordinates directly, rather than per-residue frames and side-chain torsion angles, one architecture can handle arbitrary chemical components. It is still imperfect for novel folds, IDPs, and very large complexes, so do not claim it “solved” structural biology.
What they’re really probing: Whether you can name specific architectural components (Evoformer, residue gas, pLDDT), cite the exact CASP14 performance figure, and accurately describe AF3’s scope extension without over-claiming.
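In practice, reasoning about pLDDT starts with reading it out of the model file. AlphaFold PDB-format outputs store pLDDT in the B-factor column, so the sketch below pulls one value per residue from the Cα atoms and maps it to the confidence bands used by the AlphaFold Protein Structure Database. The function names are illustrative, and the parser assumes standard fixed-width ATOM records.

```python
def plddt_per_residue(pdb_text):
    """Return {(chain, residue number): pLDDT} read from CA atom B-factors."""
    scores = {}
    for line in pdb_text.splitlines():
        # Fixed-width PDB columns: atom name 13-16, chain 22,
        # residue number 23-26, B-factor 61-66 (1-based).
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            residue = (line[21], int(line[22:26]))
            scores[residue] = float(line[60:66])
    return scores


def confidence_band(plddt):
    """Map a pLDDT value to the database's four confidence bands."""
    if plddt > 90:
        return "very high"
    if plddt > 70:
        return "confident"
    if plddt > 50:
        return "low"
    return "very low"


def low_confidence_residues(pdb_text, cutoff=70.0):
    """Residues below the cutoff, which are often disordered or unreliable."""
    return [res for res, score in plddt_per_residue(pdb_text).items()
            if score < cutoff]
```

Long runs of residues in the lowest band are a common sign of intrinsic disorder rather than a failed prediction, which is worth stating explicitly when presenting a model.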
Walk me through a complete scRNA-seq workflow including the choices that matter.
Concept: scRNA-seq pipeline design | Difficulty: senior | Stage: technical deep-dive
Direct answer: A complete workflow: Cell Ranger (10x Genomics) or STARsolo for read alignment + UMI collapsing → cell barcode filtering (knee-plot or EmptyDrops) → QC filtering (thresholds are dataset-specific, not universal — common starting points: <200 genes, >20% mitochondrial reads) → normalization → highly variable gene selection → PCA → UMAP for visualization (preferred over t-SNE because UMAP better preserves global cluster relationships) → neighbor graph construction → Louvain or Leiden clustering on the KNN graph — both Seurat and Scanpy (Wolf, Angerer, Theis 2018, Genome Biology 19:15) use this approach. If combining samples, batch correction is mandatory: MNN (Haghverdi et al. 2018, Nature Biotechnology 36:421) or Seurat v3 integration. Cell-type annotation: marker genes, SingleR, or manual DEG review per cluster. Scanpy handles 1.3M+ cells; Seurat v5 introduced multimodal integration for CITE-seq and multiome data.
What they’re really probing: Whether you know the difference between Seurat (Satija lab) and Scanpy (Theis lab), why UMAP is preferred for downstream interpretation, and that batch correction is mandatory when combining samples.
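The QC filtering step can be sketched without any single-cell library. The example below computes genes detected and mitochondrial read percentage per cell and applies the starting thresholds mentioned above. The input layout (barcode mapped to a gene-count dict), the function names, and the `MT-` prefix for human mitochondrial genes are assumptions; real analyses would do this in Seurat or Scanpy and tune the cutoffs per dataset.

```python
def qc_metrics(counts, mito_prefix="MT-"):
    """Return (genes detected, percent mitochondrial counts) for one cell."""
    total = sum(counts.values())
    n_genes = sum(1 for c in counts.values() if c > 0)
    mito = sum(c for gene, c in counts.items() if gene.startswith(mito_prefix))
    pct_mito = 100.0 * mito / total if total else 0.0
    return n_genes, pct_mito


def filter_cells(cells, min_genes=200, max_pct_mito=20.0):
    """Keep cells with enough detected genes and a low mitochondrial fraction."""
    kept = {}
    for barcode, counts in cells.items():
        n_genes, pct_mito = qc_metrics(counts)
        if n_genes >= min_genes and pct_mito <= max_pct_mito:
            kept[barcode] = counts
    return kept
```

Plotting the two metrics before fixing thresholds is the safer habit, since a tissue with naturally high mitochondrial content can lose real cells under a 20% cutoff.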
Why would a senior interviewer immediately reject a candidate who exports gene lists to Excel?
Concept: Data integrity in genomics | Difficulty: senior | Stage: technical or culture screen
Direct answer: The documented problem: Excel automatically converts gene names to dates (SEPT2 becomes “2-Sep,” MARCH1 becomes “1-Mar,” DEC1 becomes “1-Dec”) and converts identifiers that resemble scientific notation to floating-point numbers (the RIKEN clone ID 2310009E13 becomes 2.31E+13). A 2016 baseline study by Ziemann et al. found this affected approximately 20% of Excel gene lists in papers published 2005–2015. The Abeysooriya 2023 update (Retraction Watch, 2023-09-20) examined 11,000+ articles published 2014–2020 and found 31% of Excel supplementary gene lists contained auto-conversion errors, so the problem grew despite widespread awareness. Novel failure modes include locale-specific conversions: AGO2 → “Aug-02” in Spanish locale, MEI1 → “May-01” in Dutch locale. The HGNC responded by renaming SEPT1 → SEPTIN1 and MARCH1 → MARCHF1 to avoid auto-conversion; the committee described broad adoption as “a lengthy process.” The practical test: if you export a gene list to Excel and a collaborator opens it in a different locale, you cannot guarantee what they receive. Safe alternatives: TSV with explicit text formatting, or R/Python scripts that output the final list.
What they’re really probing: That you know the specific documented scale of the problem (31%, Abeysooriya 2023), not just that “Excel is bad for genomics.” The HGNC rename detail is the senior signal.
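A quick screen for already-damaged gene lists is straightforward to write. The sketch below flags symbols that look like Excel date or scientific-notation conversions; the patterns cover English month abbreviations only, so locale-specific forms would need additional patterns, and the function names are illustrative.

```python
import re

MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"

# Matches forms such as "2-Sep" and "Aug-02".
DATE_LIKE = re.compile(
    rf"^(\d{{1,2}}-({MONTHS})|({MONTHS})-\d{{1,2}})$", re.IGNORECASE
)
# Matches forms such as "2.31E+13".
SCI_NOTATION = re.compile(r"^\d+(\.\d+)?E\+?\d+$", re.IGNORECASE)


def looks_mangled(symbol):
    """True if a value resembles an Excel date or float conversion."""
    value = symbol.strip()
    return bool(DATE_LIKE.match(value) or SCI_NOTATION.match(value))


def mangled_symbols(symbols):
    """Return the suspicious entries from a list of gene symbols."""
    return [s for s in symbols if looks_mangled(s)]
```

Running a check like this on any supplementary gene list before analysis is cheap, and a non-empty result is a signal to go back to the original source file rather than trying to reverse the conversion.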
Walk me through what went wrong with the Anil Potti / Duke chemotherapy genomics work.
Concept: Named scientific failure postmortem | Difficulty: senior | Stage: senior culture or postmortem screen
Direct answer: Anil Potti claimed to have developed microarray-based signatures predicting chemotherapy response. Keith Baggerly and Kevin Coombes at MD Anderson published their critique in Nature Medicine (2007), identifying: mislabeled sample classes (sensitive and resistant labels swapped), off-by-one row shifts in expression matrices (a spreadsheet offset that corrupted feature vectors), and train-test contamination (test samples in training data, inflating apparent performance). Clinical trials were opened based on these signatures. Papers were retracted from journals including Nature Medicine and the Journal of Clinical Oncology; check the Retraction Watch Anil Potti retractions tracker before naming any other journal in an interview. In 2015, the US Office of Research Integrity found Potti had “engaged in research misconduct by falsifying and/or fabricating data”; under the resulting settlement, his research was subject to supervision for five years.
What they’re really probing: Whether you can articulate the specific analytical errors — row shifts, label swaps, train-test contamination — not just “there was fraud.” The errors were computational before they were ethical; a code review would have caught them.
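Two of these errors are mechanically checkable. The sketch below detects a uniform row shift between the sample IDs of an expression matrix and those of a label table, and reports any sample IDs shared between training and test sets. The function names and ID lists are hypothetical, and the sketch is not a reconstruction of the Baggerly and Coombes analysis.

```python
def detect_shift(matrix_ids, label_ids, max_shift=3):
    """Return k if the two ID lists line up only after shifting by k rows.

    Positive k means the labels lag the matrix by k rows (for example, an
    extra header row in the label sheet); None means no small shift fits.
    """
    if matrix_ids == label_ids:
        return 0
    for k in range(1, max_shift + 1):
        if len(matrix_ids) <= k or len(label_ids) <= k:
            break  # lists too short for this shift to be meaningful
        if label_ids[k:] == matrix_ids[:-k]:
            return k
        if matrix_ids[k:] == label_ids[:-k]:
            return -k
    return None


def train_test_overlap(train_ids, test_ids):
    """Sample IDs that appear in both sets, sorted for stable reporting."""
    return sorted(set(train_ids) & set(test_ids))
```

Asserting `detect_shift(...) == 0` and an empty overlap at the top of an analysis script turns both failure modes into an immediate error instead of a silent corruption.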
How would you design a multi-omics integration analysis combining bulk RNA-seq, single-cell RNA-seq, and ATAC-seq from the same samples?
Concept: Multi-omics integration architecture | Difficulty: senior | Stage: systems design or research design screen
Direct answer: Integration across modalities requires a clear question first: are you seeking regulatory logic (which ATAC-seq peaks correlate with gene expression changes), cell-type deconvolution (using scRNA-seq as a reference to deconvolve bulk), or latent factor discovery (shared sources of variation across omics)? Tool selection follows the question. For latent factor integration across bulk RNA-seq + ATAC-seq + protein data, MOFA+ (Multi-Omics Factor Analysis v2) identifies shared and modality-specific factors without requiring paired single-cell data. For joint single-cell RNA + ATAC analysis, Seurat v5 multimodal workflows, scvi-tools, ArchR, or muon handle paired 10x Multiome data (RNA + ATAC from the same cell); CITE-seq, by contrast, pairs RNA with surface protein rather than chromatin accessibility. For cross-modal deconvolution, use scRNA-seq cell-type signatures (from Seurat or Scanpy clustering) as a reference to decompose bulk RNA-seq with CIBERSORT or MuSiC. Quality control across modalities is critical: peak calling parameters in MACS2, library-size differences across assay types, and batch effects between single-cell and bulk runs must all be controlled before integration.
What they’re really probing: Whether you think about what question integration is supposed to answer before picking a tool, and whether you know the practical constraints of each framework.
You’re asked to apply transformer models to genomics. What baselines do you establish first?
Concept: ML methodology in genomics | Difficulty: senior | Stage: ML / research design screen
Direct answer: The senior answer starts with the question of whether the biological signal is strong enough to justify the model complexity. Before reaching for Evo (DNA foundation model), Geneformer (single-cell transformers), or AlphaMissense (variant pathogenicity), establish: logistic regression or linear regression baseline on the same feature set — if a linear model achieves 90% of the transformer’s AUC, the transformer is adding engineering complexity for marginal gain. Then a random forest or XGBoost baseline, which captures nonlinear interactions without attention mechanisms. Only if both fall materially short do transformers justify their compute and interpretability costs. In genomics specifically, the GWAS literature documents the winner’s curse problem: top-effect-size hits in discovery datasets systematically over-estimate true effects in replication cohorts — a 2021 Scientific Reports study (doi: 10.1038/s41598-021-97896-y) found binary phenotype GWAS SNVs replicate at only 58.1% vs 94.8% for quantitative phenotypes. Train-test contamination (the Potti failure mode) is equally dangerous in ML genomics — ensure your held-out test set has no overlap with training subjects, and account for population stratification in genetic data splits.
What they’re really probing: Whether you have methodological discipline — establishing simpler baselines first, knowing the replication literature, and catching contamination before publication. The Potti case is directly relevant: his errors were train-test contamination in a predictive genomics model.
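The “no overlap with training subjects” requirement is easy to violate when one subject contributes several samples. One way to enforce it, sketched here with the standard library only, is to assign whole subjects rather than individual samples to a split. The function name `subject_split`, the hashing scheme, and the 20% default are assumptions for illustration; it does not address population stratification.

```python
import hashlib


def subject_split(sample_to_subject, test_fraction=0.2):
    """Assign every sample of a subject to the same side of the split.

    The bucket is derived from a hash of the subject ID, so the
    assignment is deterministic and independent of sample order.
    """
    train, test = [], []
    for sample, subject in sorted(sample_to_subject.items()):
        digest = hashlib.sha256(subject.encode("utf-8")).hexdigest()
        bucket = int(digest[:8], 16) / 0xFFFFFFFF  # stable value in [0, 1]
        if bucket < test_fraction:
            test.append(sample)
        else:
            train.append(sample)
    return train, test


if __name__ == "__main__":
    # Hypothetical cohort: three samples per subject.
    cohort = {f"P{p}_rep{r}": f"P{p}" for p in range(10) for r in range(3)}
    train, test = subject_split(cohort)
    train_subjects = {cohort[s] for s in train}
    test_subjects = {cohort[s] for s in test}
    print(len(train), len(test), train_subjects & test_subjects)
```

With hashing, the realized test fraction only approximates `test_fraction` on small cohorts. Where scikit-learn is available, `sklearn.model_selection.GroupKFold` provides group-aware cross-validation folds with balanced sizes.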
Named-Incident Quick Reference for Interview Recall
Full narratives are in the senior-tier Q&As above. Use this table for rapid review — if you can narrate each row’s failure mode in 30 seconds, you’re ready for the postmortem probe.
| Incident | Failure Mode | Probe | Key Fact |
|---|---|---|---|
| Anil Potti / Duke (retractions 2010–2015) | Row shifts, label swaps, train-test contamination | “Walk me through what went wrong analytically.” | Retracted from Nature Medicine + JCO, not Lancet Oncology |
| Excel gene-name auto-conversion (Abeysooriya 2023) | SEPT2 → date; locale variants | “Why is Excel a red flag for gene lists?” | 31% of 2014–2020 supplementary gene lists had errors; HGNC renamed SEPT1→SEPTIN1, MARCH1→MARCHF1 |
| ENCODE 80% functional debate (Graur, PNAS 2013) | Biochemical activity ≠ biological function | “What does 80% functional mean?” | ENCODE: biochemically active sites. Graur: larger-genome organisms aren’t phenotypically more complex. |
| CRISPR off-target: Schaefer et al. (Nature Methods 2017) | No parental controls; background SNVs attributed to CRISPR | “How do you validate CRISPR off-targets?” | Published 2017; retracted 2018. Isogenic parental controls are mandatory. |
| Cancer microbiome: Sepich-Poore et al. 2020 (Rob Knight lab; retracted 2024) | Contamination misread as tumor microbiome; shrimp virus as cancer classifier | “How do you verify a microbiome signal is real?” | Salzberg et al. (mBio 2023): bacteria 100× lower than claimed. Decontamination controls mandatory. |
| GWAS winner’s curse (Scientific Reports 2021) | Discovery-phase effect sizes inflate; binary phenotypes replicate at 58.1% | “How do you design a GWAS with replication?” | Replication cohort + effect-size shrinkage required. Replication predictive model AUC = 0.90. |
Red-Flag Answers (And What They Signal)
These answer patterns recur in bioinformatics screens and reliably signal gaps. Each probe is checking a specific area of the literature — understanding what the probe is really testing helps you avoid the trap.
- “I’d export the gene list to Excel.” This reveals unfamiliarity with the documented auto-conversion problem. The Abeysooriya 2023 data (31% of supplementary gene lists in 2014–2020 papers contain errors) is well-known. Interviewers aren’t asking whether you’ve made the mistake; they’re checking whether you’ve read recent methods literature. The fix is straightforward: TSV with gene columns typed as text, or programmatic output from R/Python.
- “I just use BLAST with default settings.” Defaults are a starting point, not an answer. The default E-value threshold of 10 will flood most searches with noise; word size and scoring matrix choices depend on evolutionary distance. No awareness of parameter sensitivity signals no troubleshooting experience.
- “I’d use t-SNE for clustering and downstream analysis.” t-SNE preserves local neighborhood structure but distorts global distances between clusters, so it is valuable for visualization but misleading for inferring cluster relationships or trajectories. UMAP is often preferred for visualization because it tends to retain more global structure, but it also distorts distances, so clustering should run on the PCA space or neighbor graph rather than on either 2-D embedding. Interviewers aren’t penalizing you for knowing t-SNE; they’re checking whether you know its limitation.
- “I’d ask ChatGPT to write the pipeline.” LLMs are useful for scaffolding, but pipeline architecture is the candidate’s accountability, specifically version pinning, reference data provenance, and deterministic execution. Snakemake or Nextflow plus container-pinned environments are not optional for production pipelines; they’re the reproducibility standard. The concern isn’t tool use; it’s whether you know what the tool can’t do.
- “I just trust the BWA output.” No QC mindset. After alignment, the standard steps are duplicate marking (MarkDuplicates), MAPQ filtering, and BQSR before variant calling. Skipping any of these is documented to inflate false-positive variant calls in the GATK Best Practices literature.
- “AlphaFold makes structural biology obsolete.” Over-claim. AlphaFold 2’s 0.96 Å RMSD95 median at CASP14 is genuinely transformative, but predictions for intrinsically disordered proteins, novel folds without deep MSA, and protein-protein complexes remain imperfect. AlphaFold 3 extends to protein-DNA/RNA/ligand but does not resolve these limitations. Structural validation (cryo-EM, X-ray crystallography) remains essential for high-stakes applications.
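For the “I just trust the BWA output” item above, it helps to know what the post-alignment filters actually inspect. The sketch below reads plain-text SAM records and keeps only mapped, non-duplicate alignments above a MAPQ cutoff. It is illustrative only: the function names and the cutoff of 20 are assumptions, it presumes duplicates were already flagged upstream, and it does not cover BQSR. Production pipelines would normally use samtools or a BAM library rather than string splitting.

```python
UNMAPPED_FLAG = 0x4     # SAM FLAG bit: segment unmapped
DUPLICATE_FLAG = 0x400  # SAM FLAG bit: PCR or optical duplicate


def keep_alignment(sam_line, min_mapq=20):
    """Return True for mapped, non-duplicate records at or above min_mapq."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])  # column 2: bitwise FLAG
    mapq = int(fields[4])  # column 5: mapping quality
    if flag & (UNMAPPED_FLAG | DUPLICATE_FLAG):
        return False
    return mapq >= min_mapq


def filter_sam(lines, min_mapq=20):
    """Yield header lines unchanged and alignment records that pass the filter."""
    for line in lines:
        if line.startswith("@") or keep_alignment(line, min_mapq):
            yield line


if __name__ == "__main__":
    records = [
        "@HD\tVN:1.6",
        "r1\t0\tchr1\t100\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII",
        "r2\t1024\tchr1\t100\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII",
        "r3\t0\tchr1\t200\t5\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII",
    ]
    for kept in filter_sam(records):
        print(kept.split("\t")[0])
```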
Questions to Ask Your Interviewer (2026-Aware)
Reverse questions serve two purposes: they demonstrate you’ve thought about the specific organization, and they surface information you need to evaluate fit. The questions below are organized by employer type, since the technical landscape and team structure differ substantially. Practitioners at the Broad Institute and similar institutions consistently note that candidates who ask specific technical questions about stack and pipeline ownership project seniority; candidates who ask only about work-life balance come across as junior.
For biotech (Genentech / Regeneron / Illumina / Moderna)
- How are long-read sequencing assays (PacBio HiFi / Nanopore) currently integrated into your pipeline stack?
- Are pipelines in Nextflow, WDL, Snakemake, or something internal, and does a platform engineering team maintain the execution environment, or do scientists manage their own containers?
- How does the team navigate the tension between proprietary instrument data (Illumina DRAGEN) and open-source reproducibility for regulatory submissions?
For academic labs (Broad Institute / Wellcome Sanger / EMBL-EBI / MSKCC)
- How does the lab balance pipeline maintenance work versus first-author publications — what’s the authorship culture for computational contributions?
- With NIH funding under pressure in 2024-2025, how is the lab scoping computational projects for the next grant cycle?
- What’s the code-availability standard on publication — GitHub + Zenodo, or a lab-specific approach?
For pharma R&D (Pfizer / Novartis / Roche)
- What documentation standard is expected for variant-calling pipeline validation in genomic biomarker submissions to regulatory affairs?
- Is multi-omics integration (bulk RNA-seq + scRNA-seq + proteomics) in production for candidate selection, or still exploratory?
- How is the team using AlphaFold 3 for structure-based drug design, and what experimental validation loop is expected for computational predictions?
For genomics core facilities / clinical diagnostics labs
- Is the pipeline CAP/CLIA accredited, and how does the team manage change control when a tool update shifts variant calls?
- What’s the turnaround time expectation for clinical NGS reports, and how does variant tier classification work under those constraints?
- How are QC metrics (coverage, duplicate rates, run QC) communicated to submitting labs, and what happens when a run fails mid-project?
6-Week Bioinformatics Interview Prep Roadmap
This roadmap assumes you have a biology or CS background and roughly 10 hours per week available. It prioritizes the resources that senior interviewers actually cite, not bootcamp curricula.
Weeks 1–2: Format literacy and sequence fundamentals
- Read the SAM/BAM format paper (Li et al. 2009), the original spec. Know all 11 mandatory fields and CIGAR operations including extended `=`/`X`.
- Work through the NCBI BLAST+ documentation: run a blastp and blastx search, adjust E-value and word size, inspect the output format.
- Complete the EMBL-EBI Introductory Bioinformatics Pathway (3+ hours; covers Ensembl, UniProt, Expression Atlas).
- Explore the UCSC Genome Browser docs — load a BED track, query the Table Browser, and use the Genome Graphs view for GWAS data.
- Pick one Galaxy Training Network tutorial from the Variant Analysis or Transcriptomics topic as an end-to-end hands-on workflow without local infrastructure.
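For the SAM/BAM item in the list above, a useful self-test is to parse a CIGAR string by hand and work out how many query and reference bases it consumes. The sketch below does that for the nine CIGAR operations, including the extended `=`/`X`. The function names are illustrative, and the parser rejects anything that is not a clean sequence of count-plus-operation pairs (including the `*` placeholder used when no CIGAR is available).

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

# Which operations advance along the read and along the reference.
CONSUMES_QUERY = set("MIS=X")
CONSUMES_REFERENCE = set("MDN=X")


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) tuples."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Rebuilding the string catches stray characters the regex skipped.
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR: {cigar!r}")
    return ops


def cigar_lengths(cigar):
    """Return (query_bases, reference_bases) consumed by the alignment."""
    ops = parse_cigar(cigar)
    query = sum(n for n, op in ops if op in CONSUMES_QUERY)
    reference = sum(n for n, op in ops if op in CONSUMES_REFERENCE)
    return query, reference


if __name__ == "__main__":
    print(parse_cigar("8M1I4M1D3M"))
    print(cigar_lengths("8M1I4M1D3M"))  # 16 read bases, 16 reference bases
```

In `8M1I4M1D3M`, the insertion adds one base to the read only and the deletion adds one base to the reference only, so both totals come to 16 even though the alignment contains two gaps.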
Weeks 3–4: NGS pipelines and differential expression
- Read the GATK Best Practices overview and germline pipeline page. Be able to explain BQSR and GVCF-based joint genotyping.
- Work through the DESeq2 vignette (Bioconductor) end to end: construct a DESeqDataSet, run the model, apply lfcShrink with apeglm, and interpret the MA plot. DESeq2 expects raw, unnormalized counts as input.
- Read the Bioconductor rnaseqGene workflow for the full FASTQ-to-DE picture including tximport for Salmon output.
- Use the Biopython tutorial to write a script using Bio.Entrez and Bio.Blast — practicals reinforce reading.
Weeks 5–6: Senior topics, postmortems, and practitioner perspective
- Read the AlphaFold 2 paper (Jumper et al., Nature 2021) — Methods and Extended Data. Know the Evoformer, pLDDT, and CASP14 0.96 Å RMSD95. Read the AlphaFold 3 paper (Abramson et al., Nature 630, 493–500, 2024) for the scope extension to protein-DNA/RNA/ligand.
- Read the Scanpy paper (Wolf et al. 2018) and the scRNA-seq challenges review (Nature Reviews Genetics 2018).
- Follow Heng Li’s blog for alignment design decisions (BWA, SAMtools, minimap2 author). Read the Lior Pachter blog for RNA-seq methodology critique and comp bio commentary.
- Memorize the six postmortem incidents — Potti (retracted NM + JCO), Excel gene names (31% Abeysooriya 2023), ENCODE debate (Graur 2013), Schaefer CRISPR (retracted 2018), Sepich-Poore microbiome (retracted 2024), GWAS winner’s curse. Practice each in 60 seconds with the specific error mechanism named.
Where Bioinformatics Hiring Goes From Here
The 2026 bioinformatics interview tests three converging pressures: format and pipeline literacy proven at scale, ML methodology discipline that the genomics reproducibility crisis made non-negotiable, and awareness of named scientific failures that distinguish candidates who have read the primary literature from those who have only read summaries. Long-read sequencing, AlphaFold 3, and multi-omics integration are now table stakes — candidates who engage with those literatures will set the tone of the interview rather than react to it. The Potti errors, the Excel gene-name problem, and the Schaefer CRISPR retraction are documented in the primary literature; knowing them signals you read the field’s history, not just its present state.