[Blog]

Insights, analysis, and updates from Scale Labs on AI evaluation, benchmarks, and research.

Authors

Date Title

2026

15 posts

6/4/2026Can Coding Agents Tackle Early-Stage Drug Discovery?ResearchAfra Feyza Akyürek, Xinming Tu, Sofia Monasdotter, Yuanhao Qu, Sergey Chekhov, Sami Hassaan

6/4/2026

Can Coding Agents Tackle Early-Stage Drug Discovery?Research

5/27/2026HiL-Dynamics: Understanding Agents That Don’t Know What They Don’t KnowResearchTu Trinh, Kelvin Luu, Weijun Luo, Matthew Siegel, Mohamed Elfeki

5/27/2026

HiL-Dynamics: Understanding Agents That Don’t Know What They Don’t KnowResearch

5/19/2026The Path to Large Scale Dense Video CaptioningResearchJade Choghari, Agustin Sansone, Nicolas Pasqualis, Conrado Mader, Aleks Tiupikov, Mouli Sivapurapu

5/19/2026

The Path to Large Scale Dense Video CaptioningResearch

5/11/202657 Healthcare Professionals Told Us What They Need from AIResearchSami Hassaan, Oscar Kavanagh, Matthew Siegel

5/11/2026

57 Healthcare Professionals Told Us What They Need from AIResearch

5/6/2026Coverage Not Averages: Rethinking Retrieval EvaluationResearchAndrew Klearman, Radu Revutchi, Rohin Garg

5/6/2026

Coverage Not Averages: Rethinking Retrieval EvaluationResearch

4/6/2026Improving Multi-Turn Tool Use with GRPO: Results and InsightsResearchRazvan Dumitru, Chetan Rane, Sami Hassaan, Divyansh Agarwal

4/6/2026

Improving Multi-Turn Tool Use with GRPO: Results and InsightsResearch

3/23/2026MultiChallenge Update: A More Reliable Multi-Turn BenchmarkResearchVipul Gupta, Matthew Siegel, Marcos Ayestaran

3/23/2026

MultiChallenge Update: A More Reliable Multi-Turn BenchmarkResearch

3/20/2026Voice Showdown: An In-the-Wild Preference Arena for Voice AIResearchAdvait Gosai, Janie Gu, Bing Liu, Mohamed Elfeki

3/20/2026

Voice Showdown: An In-the-Wild Preference Arena for Voice AIResearch

3/11/2026Agentic Rubrics: Teaching AI to Verify Code the Way Developers DoResearchMohit Raghavendra, Anisha Gunjal, Bing Liu, Yunzhong He

3/11/2026

Agentic Rubrics: Teaching AI to Verify Code the Way Developers DoResearch

3/5/2026VeRO: Can AI Agents Build Better AI Agents?ResearchVarun Ursekar, Apaar Shanker, Veronica Chatrath, Sam Denton

3/5/2026

VeRO: Can AI Agents Build Better AI Agents?Research

3/4/2026When AI Safety Becomes a Denial‑of‑Service for DefendersResearchDavid Campbell

3/4/2026

When AI Safety Becomes a Denial‑of‑Service for DefendersResearch

2/17/2026Introducing Long Horizon Augmented Workflows: Controllable Underspecification for Long-Horizon TasksResearchGeorge Pu, Mike Lee, Sam Denton

2/17/2026

Introducing Long Horizon Augmented Workflows: Controllable Underspecification for Long-Horizon TasksResearch

1/28/2026How Profession Shapes LLM Usage: Insights from SEAL ShowdownShowdownJanie Gu, Jaehwan Jeong, David Lee, Bing Liu, Zihao Wang

1/28/2026

How Profession Shapes LLM Usage: Insights from SEAL ShowdownShowdown

1/23/2026MoReBench: Evaluating the Process of AI Moral ReasoningSafety Brandon Handoko, Matthew Siegel, Mike Lee

1/23/2026

MoReBench: Evaluating the Process of AI Moral ReasoningSafety

1/12/2026Training Robust Multi-Turn LM Agents with On-Policy Expert CorrectionsResearchNiklas Lauffer

1/12/2026

Training Robust Multi-Turn LM Agents with On-Policy Expert CorrectionsResearch

2025

1 post

11/17/2025Scaling Enterprise Agent Performance with Reinforcement Learning via Verifiable Feedback LoopsResearchJerry Chan, Vijay Kalmath, George Pu, Sam Denton

11/17/2025

Scaling Enterprise Agent Performance with Reinforcement Learning via Verifiable Feedback LoopsResearch

16 posts found