This blog post was written in collaboration with Phylo, the team behind the Biomni biomedical agent environment used in this work.
Bringing a single new medicine to patients often takes over a decade and can cost billions of dollars, and most candidates fail somewhere along the way. Much of that attrition is decided in the early, analysis-heavy stages, so anything that speeds them up has outsized leverage on what eventually reaches the clinic. The decision-making in those stages, from target validation through hit discovery and lead optimization, is computational and tool-driven. These are exactly the kind of multi-step problems that agentic coding tools are now being pointed at.
Whether an agent can actually drive this kind of work, though, is hard to measure, and it takes a specialized, agentic evaluation set beyond simple question-and-answer. To that end, we evaluated three state-of-the-art coding agents, Claude Code (Opus 4.7), Codex (GPT-5.5), and Gemini CLI (Gemini 3.1 Pro), on 66 verifiable, domain-expert-curated tasks run inside a shared biomedical tool environment.
While this is an early report from an ongoing internal evaluation with the task set and numbers still likely to change, several findings stand out:
No single agent wins everywhere. GPT-5.5 leads in structural reasoning and database screening while Claude Opus 4.7 is the most accurate in patent mining and molecular biology tasks. Gemini 3.1 Pro shows strength in exploring a couple of different paths before committing to a final answer.
Long-horizon tasks remain the biggest challenge. On about two-thirds of the tasks at least one agent reaches the answer; the tasks that defeat all three agents are consistently the longest pipelines.
Chemistry and structure are the strong suit. The agents are most reliable when a task comes down to finding the right molecule and reading or computing one property of it. They struggle most on the retrieval- and biology-heavy tasks that chain many database queries and filters together before arriving at an answer.
Given an expert's method, they recover. Tasks that stump all three unaided frequently become solvable once the agent is handed the recipe: the sequence of steps and which tools to use, but not the answer. The agents execute well once the path is set; what they tend to lack is the know-how to plan the right high-level workflow on their own and hold it across every step.
Dataset
The tasks span the drug-discovery workflow, from target identification and validation through hit discovery, hit-to-lead, and lead optimization, alongside a smaller set of adjacent biomedical tasks in protein engineering and cancer genomics. Much of the work sits where a project already has a structure and a series of analogs and is characterizing or ranking them, which is where most computational chemistry happens.
Each task has a verifiable ground-truth answer, usually a single value or string, and is graded by an LLM verifier against a set of weighted, expert-written rubric criteria. The work behind even a short answer can be a full agentic workflow. For example:
Take the task that supplies a KRAS G12D co-crystal and asks for the net charge of the bound inhibitor: arriving at that one signed integer means pulling the structure, working out which heteroatom group is the inhibitor rather than a buffer or cofactor, parsing the deposited per-atom records, and summing their charges, calling and chaining several tools along the way.
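As a rough illustration of how weighted rubric grading can turn several criteria into one score between 0 and 1, the sketch below assumes the score is the weight-normalized sum of the criteria marked as met. The aggregation rule, the `Criterion` class, and the example criteria and weights are all assumptions made for illustration; the evaluation's actual rubric schema and scoring formula are not described here.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One rubric line (hypothetical structure, not the evaluation's real schema)."""
    description: str
    weight: float
    met: bool  # in the real setup an LLM verifier would make this judgment


def rubric_score(criteria: list[Criterion]) -> float:
    """Weight-normalized fraction of criteria met (assumed aggregation rule)."""
    total = sum(c.weight for c in criteria)
    if total <= 0:
        raise ValueError("rubric needs a positive total weight")
    return sum(c.weight for c in criteria if c.met) / total


# Made-up rubric for a ligand net-charge task.
example = [
    Criterion("identifies the inhibitor among the heteroatom groups", 1.0, True),
    Criterion("uses the deposited per-atom records", 1.0, True),
    Criterion("reports the correct signed integer", 3.0, False),
]
print(rubric_score(example))
```

Under this assumed rule, an agent that follows the right procedure but reports the wrong final value lands in the partial-credit range rather than at zero.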
What a task demands of an agent is captured better by capability than by pipeline stage, since two tasks from the same stage can require very different skills. We group the set into the capabilities below.
Structural reasoning, the largest group, asks the agent to read a property off a real co-crystal structure: the net charge of a bound ligand, a count of hydrogen bonds or salt bridges, a metal's coordination number, or the affinity ranking of a few analogs from their modeled contacts.
Cheminformatics tasks identify the right molecule and compute one well-defined descriptor, such as a Crippen LogP, a TPSA, a monoisotopic mass, a canonical SMILES or InChIKey, or a structural-alert set.
Database screening chains queries across Open Targets, ChEMBL, UniProt, and PubChem to land on a gene, target, or compound.
Molecular biology works directly on DNA and protein sequence, covering primer design, restriction digestion, directional cloning, and protein-engineering edits.
Patent mining pulls a specific patent or paper, extracts a compound table from long unstructured text, and applies multi-threshold filters with explicit tie-breaks.
Genomics covers mutation-frequency ranking and cross-genome regulatory-element analysis.
[Figure: the drug discovery stages and the corresponding capabilities captured by our evaluation.]
A few tasks, abbreviated:
- "PDB entry 7RT2 describes a co-crystal of KRAS G12D with inhibitor X. What is the net charge of the protein-bound inhibitor?" (structural reasoning)
- "Find a melanoma protein marker: search a reviewed human protein database, rank by disease-associated variants, filter to skin-expressed genes, and report the top gene by pathogenic-variant count." (database screening)
- "From the competitor US patent, filter the disclosed compounds by the stated assay thresholds and report a descriptor of the surviving compound." (patent mining)
- "Provide the standard InChIKey of the neutral form of the macrocyclic ligand in PDB entry 6HZC." (cheminformatics)
Structural reasoning and cheminformatics together make up half the set. That is also where the models differ most, as the results below show.
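To make the patent-mining description concrete ("multi-threshold filters with explicit tie-breaks"), here is a minimal sketch of that filtering step once a compound table has been extracted. The `Compound` fields, the thresholds, and the tie-break order are hypothetical and chosen only for illustration; real tasks state their own assays, cut-offs, and tie-break rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Compound:
    compound_id: str
    ic50_nm: float        # potency, lower is better (hypothetical assay)
    solubility_um: float  # higher is better (hypothetical assay)


def select_compound(table, max_ic50_nm, min_solubility_um):
    """Apply every threshold, then rank with a deterministic tie-break."""
    survivors = [
        c for c in table
        if c.ic50_nm <= max_ic50_nm and c.solubility_um >= min_solubility_um
    ]
    if not survivors:
        return None
    # Primary key: lowest IC50. Tie-breaks: highest solubility, then ID order.
    return min(survivors, key=lambda c: (c.ic50_nm, -c.solubility_um, c.compound_id))


table = [
    Compound("Ex-3", 12.0, 40.0),
    Compound("Ex-7", 12.0, 85.0),  # ties Ex-3 on potency, wins on solubility
    Compound("Ex-9", 4.0, 5.0),    # most potent, but fails the solubility cut
]
best = select_compound(table, max_ic50_nm=50.0, min_solubility_um=10.0)
print(best.compound_id)
```

Encoding the tie-break in the sort key keeps the result deterministic, which matters when the graded answer is a descriptor of the single surviving compound.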
Environment
The agents work inside Biomni, the biomedical agent framework from Stanford's SNAP lab, in a version we adapted at Scale for evaluation (since Biomni's open-source release, Phylo has released a newer version, Biomni Lab; the experiments in this analysis use the former). Biomni gives an agent a few hundred biomedical functions spanning structure parsing, sequence manipulation, database clients, and cheminformatics, and the agent uses them the way it would any Python library: from biomni.tool.<domain> import <function>. A set of curated reference datasets (binding-affinity, gene-expression, cancer-dependency, and genetic-association resources such as BindingDB, GTEx, DepMap, and the GWAS catalog) sits on disk for tasks that need it.
Each task is presented the same way: the agent receives the prompt, a note about the available tools and libraries, and an instruction to write its final answer to a file. Everything else is left to the agent: which functions to call, whether to lean on the library or write its own Python, how many steps to take.
We evaluated three models along with their native harnesses and refer to the agent or the model interchangeably:
- Claude Code: Claude Opus 4.7
- Codex: GPT-5.5
- Gemini CLI: Gemini 3.1 Pro
What the Scores Show
Codex and Gemini each returned an answer on all 66 tasks; Claude Code answered 61, declining 5 on content-policy grounds. The declined tasks include ones on HIV-1 reverse transcriptase and SARS-CoV-2 PLpro, along with two competitor-patent tasks. On review these refusals had no clear safety basis; the tasks ask for routine structural or patent analysis of already-published data. That said, to compare the three on equal footing, the accuracy figures below, and the per-task comparisons throughout, are computed over the 61 tasks all three agents answered.
The Average panel shows mean outcome accuracy across capabilities: 46% for Claude Code, 62% for Codex, and 53% for Gemini. Codex also produces the most exactly-correct answers, with 27 perfect trials (score ≥ 0.9) of 61 against 21 for Gemini and 19 for Claude Code, at roughly $1 to $1.40 in API costs per task across the three.
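The comparison on equal footing can be sketched as follows: restrict to the tasks every agent answered, then compute a mean score and a count of trials at or above the 0.9 "perfect" cut-off. The agent names, task IDs, and scores below are made up, and the sketch takes a plain per-task mean; it does not attempt to reproduce how the reported averages were aggregated across capabilities.

```python
from statistics import mean

# Hypothetical per-task scores in [0, 1]; None marks a declined task.
scores = {
    "agent_a": {"t1": 1.0, "t2": 0.5, "t3": None, "t4": 0.0},
    "agent_b": {"t1": 1.0, "t2": 0.0, "t3": 1.0, "t4": 0.95},
    "agent_c": {"t1": 0.4, "t2": 1.0, "t3": 0.9, "t4": 0.0},
}


def common_tasks(scores):
    """Tasks for which every agent returned an answer."""
    answered = [
        {task for task, s in per_task.items() if s is not None}
        for per_task in scores.values()
    ]
    return set.intersection(*answered)


def summarize(scores, perfect_threshold=0.9):
    """Mean score and perfect-trial count per agent over the shared tasks."""
    shared = sorted(common_tasks(scores))
    summary = {}
    for agent, per_task in scores.items():
        values = [per_task[t] for t in shared]
        summary[agent] = {
            "mean": mean(values),
            "perfect": sum(v >= perfect_threshold for v in values),
        }
    return shared, summary


shared, summary = summarize(scores)
print(shared)
for agent, stats in summary.items():
    print(agent, round(stats["mean"], 3), stats["perfect"])
```

Dropping the declined task for all agents, rather than scoring it as zero for one of them, is what keeps the per-agent means comparable.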
Codex leads on the structure tasks (73%) and on database screening (62%), with Gemini just behind on structure (68%). Claude Code does worst on both of those, 39% and 34%, but comes out on top for molecular biology (58%) and patent mining (61%), where a task is a sequence of steps worked over a protein or a document rather than one value to look up.
The shape of the scores differs too: Codex's are close to bimodal, mostly near 0 or near 1 with little in between. That points to strong execution once it commits; when Codex settles on the right approach it tends to carry the task all the way to a correct answer, and when it picks the wrong one it misses cleanly, which is why it rarely lands in the partial-credit middle. Gemini sits between the extremes, and Claude Code's trials cluster more in that partial-credit middle.
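One simple way to quantify the shape described above is to bucket each trial's score into miss, partial, or hit and compare the fractions. The 0.9 upper cut-off mirrors the "perfect trial" threshold quoted earlier; the 0.3 lower cut-off and both score lists are assumptions chosen for illustration, not measured data.

```python
from collections import Counter


def bucket(score, low=0.3, high=0.9):
    """Classify one score; both cut-offs are illustrative assumptions."""
    if score <= low:
        return "miss"
    if score >= high:
        return "hit"
    return "partial"


def shape(scores, low=0.3, high=0.9):
    """Fraction of trials in each bucket."""
    counts = Counter(bucket(s, low, high) for s in scores)
    return {name: counts[name] / len(scores) for name in ("miss", "partial", "hit")}


# Made-up score lists: one near-bimodal, one concentrated in the middle.
bimodal = [0.0, 0.0, 1.0, 1.0, 0.95, 0.1, 1.0, 0.5]
middle_heavy = [0.4, 0.5, 0.6, 0.7, 1.0, 0.0, 0.5, 0.6]

print(shape(bimodal))
print(shape(middle_heavy))
```

A near-bimodal agent shows a small "partial" fraction; an agent whose trials cluster in the partial-credit middle shows a large one.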
Failures the Models Share
The wrong answers are overwhelmingly reasoning and planning failures: somewhere in the chain the agent queries the wrong record, applies the wrong filter, or misreads a structure, and the final value comes out wrong. The agents diverge on the tasks they get right but converge on a smaller set they get wrong. On 12 tasks all three score at or below 0.3, and on a few they return the same wrong answer.
Two such tasks are worth calling out.
- On a melanoma-marker task all three agents scored 0: each ranked candidate genes by their genome-wide pathogenic-variant count and returned a non-melanoma gene such as BRCA2 or PTEN, when the answer is CDKN2A.
- On a ligand intramolecular-H-bond count they again all scored 0, each reporting 0 H-bonds where the correct answer is 1.
The melanoma task is the clearest worked example. The prompt anchors on melanoma in its first sentence and carries that scope through each step:
I am looking for a protein marker linked to a highly invasive cutaneous malignancy of melanocytic origin called melanoma. [...] rank candidates by the number of disease-associated variants [...] For each retained gene, count the number of pathogenic or likely pathogenic variants [...] Rank the genes by descending variant count and report the top-ranked gene.
The first steps keep the melanoma scope: the agents query UniProt for melanoma-associated entries and filter to skin-expressed genes. At the final counting step all three silently dropped that scope, counting every pathogenic and likely-pathogenic variant on each gene rather than the melanoma-associated ones the task is about. Ranking by gene-wide counts surfaces genes like BRCA2 and PTEN, which carry large pathogenic-variant totals through unrelated syndromes but are not primary familial melanoma genes. The expected answer, CDKN2A, is correct.
Per the task's author, this is not an error in biological reasoning but a failure to hold the subject of the question across the workflow; applying scientific common sense, a domain expert would have kept the melanoma scope to the end. The intramolecular-H-bond task is the same shape: all three run a reasonable counting procedure and arrive at the same wrong number.
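The scope drop in the melanoma task can be reproduced on toy data. The sketch below ranks genes by pathogenic or likely-pathogenic variant count twice: once gene-wide, as the agents did, and once restricted to melanoma-associated variants. Every record and count here is invented purely to show the mechanism and does not reflect any real database; the record layout and the `rank_genes` helper are likewise hypothetical.

```python
from collections import Counter

# Made-up variant records: (gene, clinical significance, associated condition).
# The counts are illustrative only and do not come from any real database.
records = (
    [("BRCA2", "pathogenic", "hereditary breast and ovarian cancer")] * 6
    + [("BRCA2", "likely pathogenic", "melanoma")] * 1
    + [("PTEN", "pathogenic", "Cowden syndrome")] * 4
    + [("CDKN2A", "pathogenic", "melanoma")] * 3
    + [("CDKN2A", "benign", "melanoma")] * 2
)

PATHOGENIC = {"pathogenic", "likely pathogenic"}


def rank_genes(records, disease=None):
    """Rank genes by pathogenic-variant count, optionally scoped to one disease."""
    counts = Counter(
        gene
        for gene, significance, condition in records
        if significance in PATHOGENIC
        and (disease is None or disease in condition.lower())
    )
    return [gene for gene, _ in counts.most_common()]


print(rank_genes(records))              # gene-wide: scope dropped
print(rank_genes(records, "melanoma"))  # scope held to the final step
```

The only difference between the two rankings is one filter clause at the counting step, which is the requirement all three agents let go of.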
Case Studies
Restricting to the 61 tasks all three answered, we looked at the cases where their scores spread the widest. When the models diverge, one usually solves the task outright while another misses, and which one wins tracks the task type.
Case Study 1: Reading the Bound State Vs. the Canonical Form
Four tasks ask for the net charge of a protein-bound ligand, among them the HIV-RT co-crystal (ligand 6FT) and the 1N5X co-crystal (ligand TEI). One reads:
What is the net formal charge of the bound inhibitor X in its co-crystal structure with HIV RT, PDB ID: 5J1E? Provide the answer as an integer, with a sign.
The ground-truth charge is -1. Codex and Gemini get these; Claude Code reports 0 and loses the task.
For readers outside structural biology: a deposited crystal structure stores two different descriptions of a ligand. One is the canonical chemical component, the idealized molecule as it exists in a reference dictionary, which is usually drawn in its neutral protonation state. The other is the set of per-atom records for the molecule as it actually sits in the protein's binding pocket, which can carry a charge because the bound environment shifts protonation. The question asks for the second while an API field reports the first.
Claude Code's error in this task stems from a fundamental misunderstanding of structural biology rather than a simple misidentification. While it successfully navigates the entry to find the correct inhibitor (6FT) and queries its chemical-component endpoint, it stops there. It finds a dictionary field literally named pdbx_formal_charge, reads 0 (the canonical neutral form), and writes it down as the final answer. Codex and Gemini download the deposited coordinate file and sum the per-atom charge column over the bound ligand's atoms, which gives -1:

    curl -fsSL https://files.rcsb.org/download/5J1E.cif -o 5J1E.cif
    # The script arrives on stdin via the here-document, so the structure
    # file is passed as an argument rather than piped in.
    python3 - 5J1E.cif <<'PY'
    import sys

    charge_sum = 0
    with open(sys.argv[1]) as fh:
        for line in fh:
            if line.startswith('HETATM') and ' 6FT ' in line:
                c = line.split()[15]  # _atom_site.pdbx_formal_charge
                if c not in ('?', '.'):
                    charge_sum += int(c)
    print(charge_sum)
    PY

The expert's solution goes further and predicts the protonation from the published crystallization pH.
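Reading the charge by fixed column position works only while the file's column order matches what the script expects. A slightly more defensive variant, sketched below, looks the column up by its `_atom_site.` header name instead. The embedded structure text is a made-up, heavily reduced mmCIF fragment with a placeholder ligand ID `LIG`, not data from 5J1E, and the whitespace-based row splitting is a simplifying assumption that would not handle quoted values containing spaces.

```python
# Synthetic, reduced mmCIF fragment for illustration (not real 5J1E records).
SAMPLE_CIF = """\
loop_
_atom_site.group_PDB
_atom_site.id
_atom_site.type_symbol
_atom_site.label_atom_id
_atom_site.label_comp_id
_atom_site.Cartn_x
_atom_site.Cartn_y
_atom_site.Cartn_z
_atom_site.pdbx_formal_charge
ATOM   1 N  N  ALA 1.0 2.0 3.0 ?
HETATM 2 C  C1 LIG 4.0 5.0 6.0 ?
HETATM 3 O  O1 LIG 4.5 5.5 6.5 -1
HETATM 4 O  O2 LIG 5.0 6.0 7.0 0
HETATM 5 ZN ZN ZN  9.0 9.0 9.0 2
#
"""


def ligand_formal_charge(cif_text: str, comp_id: str) -> int:
    """Sum per-atom formal charges for one ligand, locating columns by name."""
    columns = []
    total = 0
    for raw in cif_text.splitlines():
        line = raw.strip()
        if line.startswith("_atom_site."):
            columns.append(line.split(".", 1)[1])
        elif columns and line.startswith(("ATOM", "HETATM")):
            row = dict(zip(columns, line.split()))
            if row["group_PDB"] == "HETATM" and row["label_comp_id"] == comp_id:
                charge = row["pdbx_formal_charge"]
                if charge not in ("?", "."):  # unknown or not applicable
                    total += int(charge)
        elif columns and line.startswith("#"):
            break  # end of the atom_site loop
    return total


print(ligand_formal_charge(SAMPLE_CIF, "LIG"))
```

If a structure holds several copies of the same ligand, this sums over all of them, so a real pipeline would also need to filter by chain.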
Case Study 2: A Six-step Pipeline Where Each Model Fails Differently
The task prompt lays out the pipeline as a list of steps:
- Find the top five targets associated with Parkinson's disease, ranked by overall association score.
- Keep only the genes with the highest interaction score with the top-ranked gene from step 1 (alphabetical order as the tie-breaker).
- Search for the top five 3D structures for the gene from step 2, human, solved by Solution NMR (as of January 2020).
- Among those PDB IDs, take the one with the longest sequence for the gene from step 2.
- If that structure contains other peptides, look for the small molecules associated with each and find the compound with the smallest molecular weight.
- Predict the metabolism of all compounds found in step 5 by CYP3A4.
- Return the metabolism percentage in a table.
The task is a chain: Open Targets → STRING → PDB → other peptides → ChEMBL → ADMET. All three models fail, and the interesting part is that each fails in a different way.
Claude Code walked all six steps in order and erred only at the ChEMBL query, where it applied an activity filter before the molecular-weight sort and so passed over the smallest molecules the step asks for. Codex erred at the same step in a different way, taking "small molecules associated with these peptides" to mean the crystallographic ions in the structure and running the metabolism prediction on a zinc ion:

    # Codex pulled the ligand records out of the 6N13 structure file
    predict_admet_property(['[Zn+2]', '[Ca+2]', ...])
    # final answer table: | Zinc ion | [Zn+2] | 65.4 | ... |

The most interesting part of this failure is not that the model misread "small molecules associated with these peptides", but that it did not recognize that its interpretation led to incoherent results. A human would have backtracked at this step.
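The kind of backtracking a human would do here can be approximated with a cheap sanity check before the metabolism step: discard single-atom species such as metal ions, and stop if nothing drug-like remains. The sketch below is one hypothetical way to write that guard in plain Python; the regular expression covers only simple bracketed atoms like `[Zn+2]`, and the function names and the example organic SMILES (aspirin) are illustrative, not part of the task or of any agent's run.

```python
import re

# A SMILES string that is one bracketed atom, optionally charged,
# e.g. "[Zn+2]", "[Ca+2]", "[Cl-]". Deliberately simple; an assumption
# for this sketch rather than a full SMILES parser.
_SINGLE_ATOM = re.compile(r"^\[[A-Z][a-z]?(?:[+-]\d*)?\]$")


def is_single_atom_species(smiles: str) -> bool:
    return bool(_SINGLE_ATOM.match(smiles.strip()))


def drug_like_candidates(smiles_list):
    """Drop single-atom species; fail loudly if nothing else is left."""
    kept = [s for s in smiles_list if not is_single_atom_species(s)]
    if not kept:
        raise ValueError(
            "only single-atom species found; revisit how "
            "'small molecules' was interpreted"
        )
    return kept


candidates = ["[Zn+2]", "[Ca+2]", "CC(=O)Oc1ccccc1C(=O)O"]
print(drug_like_candidates(candidates))
```

Raising an error instead of silently proceeding forces a re-read of the step, which is the behavior missing from the run above.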
Gemini broke earliest, at step 2: it selected the wrong interaction partner for the top gene and followed that gene through the rest of the chain.
Overall, none of the three errors is exotic on its own; each is a single careless step in an otherwise reasonable run, yet a slip at any step ends the task. The multi-step pipelines are where the distance between getting the science right and carrying the whole workflow to the end is widest, and where the three models look least alike.
Case Study 3: Exploration Strategy as a Differentiator
On a subset of three open-ended tasks, only Gemini lands the answer: the hCHIT1 ligand charge, the HDAC6 affinity ranking, and a PubChem search for the most potent KRAS G12D inhibitor in a date-bounded slice of the literature. That last one reads:
Query PubChem for any open-access literature with links in PubChem, discussing KRAS G12D inhibitors, and published in the first six months of 2025. Within your search findings, look for the single unique compound X with the most potent activity against A549 cells. Provide the following RDKit-calculated properties: RDKit canonical SMILES, clogP, hydrogen bond donors, hydrogen bond acceptors, topological polar surface area.
The common factor is that these tasks reward trying several approaches before committing. Gemini and Codex both take far more steps per trial than Claude Code, about 47 and 42 against 19, so neither is short on room to explore. With Gemini and Codex so close on raw tool-call count, the volume of calls alone does not explain why only Gemini lands these tasks; the difference is how Gemini spends those steps, reformulating the literature search several ways before settling on a compound rather than committing early to a single query that does not pan out.
Patterns Across the Full Set
Stepping back from individual tasks to the whole set, a few patterns hold across the full dataset.
Long-horizon tasks remain the challenge. The tasks that resist all three are consistently the longest pipelines: multi-step chains like the Parkinson's task above, where a slip at any one of six or seven dependent steps sinks the result even when each individual step is within reach. Eval and training value therefore concentrates in these multi-step tasks that require holding a requirement to the end, which separate models far more than single-shot questions.
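The arithmetic behind this is worth spelling out. If each step of a dependent chain succeeds independently with the same probability, the chance of finishing the whole chain is that probability raised to the number of steps. Both the independence assumption and the 90% per-step figure below are illustrative simplifications, not measured values.

```python
def chain_success(per_step: float, n_steps: int) -> float:
    """Probability that every step of a dependent chain succeeds,
    assuming independent steps with equal success probability."""
    if not 0.0 <= per_step <= 1.0:
        raise ValueError("per_step must be a probability")
    return per_step ** n_steps


for n in (1, 3, 6, 7):
    print(n, round(chain_success(0.9, n), 3))
```

Under these assumptions a step that works nine times in ten yields a six-step pipeline that finishes only about half the time, and a seven-step one slightly less often.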
Some capabilities are more tractable than others. The chemistry-and-structure groups fare best: at least one of the three models lands the answer on 18 of 22 structural-reasoning tasks and 10 of 13 cheminformatics tasks. The retrieval- and biology-heavy groups are harder: at least one model solves 6 of 13 database-screening tasks, 6 of 9 molecular-biology tasks, and 3 of 6 patent-mining tasks. The difference is largely structural: the chemistry and structure tasks tend to be deterministic, with few decision points between the data and the answer, whereas the retrieval and biology tasks are long chains with many decision points, where a wrong turn early compounds into a wrong final answer.
About two-thirds of the set is within reach today. On 43 of the 66 tasks at least one of the three agents produced a correct answer (≥85% outcome); on the remaining 23 none did. The unsolved set spans multiple capabilities across planning and scientific commonsense rather than clustering in one. The per-model profiles are somewhat stable: Codex is most accurate on well-defined single answers, Gemini explores more and reaches answers on open-ended tasks the others miss, and Claude Code holds up better on long tasks that pile up several constraints at once.
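The per-capability counts and the overall figure quoted above can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers stated in the text; the last line reports what is left over once the itemized groups are subtracted from the totals, and the text itself does not say which capability those remaining tasks belong to.

```python
# (solved by at least one agent, total tasks), as quoted in the text above.
solved_by_capability = {
    "structural reasoning": (18, 22),
    "cheminformatics": (10, 13),
    "database screening": (6, 13),
    "molecular biology": (6, 9),
    "patent mining": (3, 6),
}
TOTAL_TASKS, TOTAL_SOLVED = 66, 43

for name, (solved, total) in solved_by_capability.items():
    print(f"{name:22s} {solved:2d}/{total:<2d} {solved / total:6.1%}")

listed_solved = sum(s for s, _ in solved_by_capability.values())
listed_total = sum(t for _, t in solved_by_capability.values())

print(f"overall: {TOTAL_SOLVED}/{TOTAL_TASKS} = {TOTAL_SOLVED / TOTAL_TASKS:.1%}")
print(
    "not itemized:",
    TOTAL_TASKS - listed_total, "tasks,",
    TOTAL_SOLVED - listed_solved, "solved",
)
```

The itemized groups account for all 43 solved tasks, so the tasks not itemized contribute none of the solves.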
Given the expert's method, the agents recover. To assess solvability, we re-ran a subset of the unsolved tasks with the expert's methodology supplied as guidance: the sequence of steps, including which databases and filters to use, but not the answer or any intermediate value. A majority of the tasks that all three models had missed unaided became solvable. For example, given the method, an agent returned the correct molecular formula for the carboxypeptidase-bound inhibitor and completed the Pf3 phage-engineering design, where unaided every model had been wrong.
Implications
The clearest signal in these results is where the gap is, and where it is not. The agents are not short on domain knowledge: they recognize the right structures, know the relevant databases, and execute individual steps competently. What separates them is the ability to plan a multi-step workflow and hold every constraint to the end. Difficulty tracks workflow length, and the long pipelines that demand this ability are exactly the tasks that pull the three models apart.
That has a direct consequence for how these agents should be evaluated and trained. The discriminating signal lives in the long, dependent chains where one wrong turn compounds into a wrong final answer. We are therefore concentrating our evaluation and training efforts on the retrieval-, patent-, and biology-heavy capabilities, where these long-horizon failures are most common.
The most encouraging result is that the gap appears closable. When we supply an expert's method, the sequence of steps and which tools to use without revealing the answer, a majority of the unsolved tasks we re-ran become solvable. The bottleneck is high-level planning rather than the underlying science, which is a more tractable target: it points to planning and self-correction as the place to focus, rather than to a ceiling on what the agents know. Closing that gap is the focus of our continuing work, and we will share a fuller report as the evaluation matures.