DrugDiscoveryBench: Can Coding Agents Assist Early-Stage Drug Discovery?

Afra Feyza Akyürek^†, Xinming Tu^† [Phylo], Alec Gutmanstein, Jason Qin, Sofia Monasdotter, Sergey Chekhov, Brenda Hernandez Villegas, Kirill Chugunov, Judah Engel, Veronica Chatrath, Divyansh Agarwal, Geobio Boo, Ernesto Hernandez, Ying Liu, Yuan (Emily) Xue, Aakash Sabharwal, Daniel Yue Zhang, Zainab Doctor, Yuanhao Qu [Phylo], Yunzhong He, Sami Hassaan

Scale Labs and Phylo built DrugDiscoveryBench to measure how today's top AI agents handle the computational tasks of early-stage drug discovery.

The drug discovery landscape is being reshaped by powerful general-purpose frontier AI models and bespoke AI agents that can plan, write code, and propose and test drug candidates. Yet we still cannot rigorously measure how reliably frontier agents perform the computational, multi-step work that early-stage drug discovery actually demands. We present DrugDiscoveryBench, a benchmark of 82 verifiable, domain-expert-curated tasks spanning drug discovery workflows from target identification to patent mining to structure-activity analysis. Each task is authored by pharmaceutical scientists and biomedical researchers and is grounded in real artifacts, such as patents, papers, and database records, that agents must retrieve. To solve tasks, agents use a biomedical tool environment that we adapted from the open-source BIOMNI environment. We evaluate frontier LLM-based coding agents across six harnesses and report four main findings. First, only about half of the tasks are within reach for any given agent. Second, pass rate scales cleanly with test-time compute within a model family (GPT-5.5 Codex climbs 27.6% → 39.8% → 43.9%). Third, the frontier is tight: GPT-5.5 (xhigh, mini-SWE-agent) and Gemini 3.5 Flash (high, Gemini CLI) lead with pass rates of 51.6% and 50.0%, ahead of Opus 4.8 (max, mini-SWE-agent) at 46.8%. Fourth, agents lack the scientific reasoning and common sense to carry a long workflow through to the end rigorously, without dropping a constraint or skipping a step. Re-running unsolved tasks with the expert's method (the step-by-step instructions and tools) supplied as a hint recovers many of them, so that 80 of the 82 tasks are solved by at least one agent. We believe this benchmark provides comprehensive coverage of the computational and information-retrieval work of early drug discovery.
