Scale Labs
Copyright 2026 Scale Inc. All rights reserved.

[Blog]

Insights, analysis, and updates from Scale Labs on AI evaluation, benchmarks, and research.

2026
10 posts
4/6/2026 | Improving Multi-Turn Tool Use with GRPO: Results and Insights | Research | Razvan Dumitru, Chetan Rane, Sami Hassaan
3/23/2026 | MultiChallenge Update: A More Reliable Multi-Turn Benchmark | Research | Vipul Gupta, Matthew Siegel
3/20/2026 | Voice Showdown: An In-the-Wild Preference Arena for Voice AI | Research | Advait Gosai, Janie Gu, Bing Liu, Mohamed Elfeki
3/11/2026 | Agentic Rubrics: Teaching AI to Verify Code the Way Developers Do | Research | Mohit Raghavendra, Anisha Gunjal, Bing Liu, Yunzhong He
3/5/2026 | VeRO: Can AI Agents Build Better AI Agents? | Research | Varun Ursekar, Apaar Shanker, Veronica Chatrath, Sam Denton
3/4/2026 | When AI Safety Becomes a Denial‑of‑Service for Defenders | Research | David Campbell
2/17/2026 | Introducing Long Horizon Augmented Workflows: Controllable Underspecification for Long-Horizon Tasks | Research | George Pu, Mike Lee, Sam Denton
1/28/2026 | How Profession Shapes LLM Usage: Insights from SEAL Showdown | Showdown | Janie Gu, Jaehwan Jeong, David Lee, Bing Liu, Zihao Wang
1/23/2026 | MoReBench: Evaluating the Process of AI Moral Reasoning | Safety | Brandon Handoko, Matthew Siegel, Mike Lee
1/12/2026 | Training Robust Multi-Turn LM Agents with On-Policy Expert Corrections | Research | Niklas Lauffer
2025
1 post
11/17/2025 | Scaling Enterprise Agent Performance with Reinforcement Learning via Verifiable Feedback Loops | Research | Jerry Chan, Vijay Kalmath, George Pu, Sam Denton

11 posts found