Insights, analysis, and updates from Scale Labs on AI evaluation, benchmarks, and research.
2026 (8 posts)
Mar 20, 2026 | Voice Showdown: An In-the-Wild Preference Arena for Voice AI | Research | Advait Gosai, Janie Gu, Bing Liu, Mohamed Elfeki
Mar 11, 2026 | Agentic Rubrics: Teaching AI to Verify Code the Way Developers Do | Research | Mohit Raghavendra, Anisha Gunjal, Bing Liu, Yunzhong He
Mar 5, 2026 | VeRO: Can AI Agents Build Better AI Agents? | Research | Varun Ursekar, Apaar Shanker, Veronica Chatrath, Sam Denton
Mar 4, 2026 | When AI Safety Becomes a Denial‑of‑Service for Defenders | Research | David Campbell
Feb 17, 2026 | Introducing Long Horizon Augmented Workflows: Controllable Underspecification for Long-Horizon Tasks | Research | George Pu, Mike Lee, Sam Denton
Jan 28, 2026 | How Profession Shapes LLM Usage: Insights from SEAL Showdown | Showdown | Janie Gu, Jaehwan Jeong, David Lee, Bing Liu, Zihao Wang
Jan 23, 2026 | MoReBench: Evaluating the Process of AI Moral Reasoning | Safety | Brandon Handoko, Matthew Siegel, Mike Lee
Jan 12, 2026 | Training Robust Multi-Turn LM Agents with On-Policy Expert Corrections | Research | Niklas Lauffer
2025 (1 post)
Nov 17, 2025 | Scaling Enterprise Agent Performance with Reinforcement Learning via Verifiable Feedback Loops | Research | Jerry Chan, Vijay Kalmath, George Pu, Sam Denton