Research to Advance AI
Scale Labs advances AI through research. Our research focuses on agents, post-training, reasoning, safety, evaluation and alignment, and the science of data.
[LEADERBOARDS]
Benchmarks for frontier, agentic, and safety capabilities.
[SHOWDOWN]
Model-preference rankings from real-world usage.
[PAPERS]
Research papers and publications covering agents, post-training, reasoning, safety, evaluation and alignment, and the science of data.
VeRO: An Evaluation Harness for Agents to Optimize Agents
[BLOG]
Insights, analysis, and updates from Scale Labs
Voice Showdown: An In-the-Wild Preference Arena for Voice AI
Voice Showdown is the first large-scale preference arena for voice AI, ranking models through blind comparisons embedded in real user conversations across 60+ languages.
Agentic Rubrics: Teaching AI to Verify Code the Way Developers Do
Agentic Rubrics is a method for verifying AI-generated code fixes. An agent explores the repo, writes a checklist for what a correct patch should do, and uses that rubric to score candidate fixes.
VeRO: Can AI Agents Build Better AI Agents?
VeRO benchmarks whether coding agents can improve other AI agents by modifying their prompts, tools, and control logic. Across 105 optimization runs, results show modest gains on tool-use tasks but persistent limits in exploration, cross-model generalization, and deeper architectural changes.
When AI Safety Becomes a Denial-of-Service for Defenders
Most AI safety benchmarks measure whether models help when they shouldn’t. But what happens when they refuse when they shouldn’t? An analysis of real-world defender interactions reveals how alignment systems can block legitimate cybersecurity work—exposing a blind spot in how AI safety is currently evaluated.
All posts