[Blog]

Insights, analysis, and updates from Scale Labs on AI evaluation, benchmarks, and research.


2026

10 posts

4/6/2026 · Improving Multi-Turn Tool Use with GRPO: Results and Insights · Research · Razvan Dumitru, Chetan Rane, Sami Hassaan

3/23/2026 · MultiChallenge Update: A More Reliable Multi-Turn Benchmark · Research · Vipul Gupta, Matthew Siegel

3/20/2026 · Voice Showdown: An In-the-Wild Preference Arena for Voice AI · Research · Advait Gosai, Janie Gu, Bing Liu, Mohamed Elfeki

3/11/2026 · Agentic Rubrics: Teaching AI to Verify Code the Way Developers Do · Research · Mohit Raghavendra, Anisha Gunjal, Bing Liu, Yunzhong He

3/5/2026 · VeRO: Can AI Agents Build Better AI Agents? · Research · Varun Ursekar, Apaar Shanker, Veronica Chatrath, Sam Denton

3/4/2026 · When AI Safety Becomes a Denial‑of‑Service for Defenders · Research · David Campbell

2/17/2026 · Introducing Long Horizon Augmented Workflows: Controllable Underspecification for Long-Horizon Tasks · Research · George Pu, Mike Lee, Sam Denton

1/28/2026 · How Profession Shapes LLM Usage: Insights from SEAL Showdown · Showdown · Janie Gu, Jaehwan Jeong, David Lee, Bing Liu, Zihao Wang

1/23/2026 · MoReBench: Evaluating the Process of AI Moral Reasoning · Safety · Brandon Handoko, Matthew Siegel, Mike Lee

1/12/2026 · Training Robust Multi-Turn LM Agents with On-Policy Expert Corrections · Research · Niklas Lauffer

2025

1 post

11/17/2025 · Scaling Enterprise Agent Performance with Reinforcement Learning via Verifiable Feedback Loops · Research · Jerry Chan, Vijay Kalmath, George Pu, Sam Denton

11 posts found