Scale Labs

Research to Advance AI

Scale Labs advances AI through research. Our research focuses on agents, post-training, reasoning, safety, evaluation and alignment, and the science of data.

[LEADERBOARDS]

Benchmarks for frontier, agentic, and safety capabilities

SWE Atlas - Test Writing
SWE Atlas - Codebase QnA
MCP Atlas
SWE-Bench Pro (Public Dataset)
SWE-Bench Pro (Private Dataset)

[SHOWDOWN]

Model-preference rankings from real-world usage.

1 | claude-opus-4-6 | 35.4K votes | 1120.46
1 | gpt-5.2-chat-latest | 51.0K votes | 1116.55
3 | claude-opus-4-6 (Thinking) | 31.4K votes | 1096.17
3 | claude-sonnet-4-6 | 19.6K votes | 1095.96
3 | claude-opus-4-5-20251101 | 24.8K votes | 1092.19

[PAPERS]

Research papers and publications covering agents, post-training, reasoning, safety, evaluation and alignment, and the science of data.

Date | Title | Category | Authors
3/12/2026 | Defensive Refusal Bias: How Safety Alignment Fails Cyber Defenders | Safety | David Campbell, Neil Kale, Udari Madhushani Sehwag, Bert Herring, Nick Price, Dan Borges, Alex Levinson, Christina Q. Knight
2/26/2026 | LLM Novice Uplift on Dual-Use, In Silico Biology Tasks | Safety | Chen Bo Calvin Zhang, Christina Q. Knight, Nicholas Kruus, Jason Hausenloy, Pedro Medeiros, Nathaniel Li, Aiden Kim, Yury Orlovskiy, Coleman Breen, Bryce Cai, Jasper Götting, Andrew Bo Liu, Samira Nedungadi, Paula Rodriguez, Yannis Yiming He, Mohamed Shaaban, Zifan Wang, Seth Donoughe, Julian Michael
2/25/2026 | VeRO: An Evaluation Harness for Agents to Optimize Agents | Agents, Post-Training, Evaluation and Alignment | Varun Ursekar, Apaar Shanker, Veronica Chatrath, Yuan Xue, Sam Denton
2/12/2026 | LHAW: Controllable Underspecification for Long-Horizon Tasks | Agents, Safety, Evaluation and Alignment | George Pu, Michael S. Lee, Udari Madhushani Sehwag, David Lee, Bryan Zhu, Yash Maurya, Mohit Raghavendra, Yuan Xue, Sam Denton
1/15/2026 | SciPredict: Can LLMs Predict the Outcomes of Research Experiments in Natural Sciences? | Safety, Evaluation and Alignment | Udari Madhushani Sehwag, Elaine Lau, Haniyeh Ehsani Oskouie, Shayan Shabihi, Erich Liang, Andrea Toledo, Guillermo Mangialardi, Sergio Fonrouge, Ed-Yeremai Hernández Cardona, Paula Vergara, Utkarsh Tyagi, Chen Bo Calvin Zhang, Pavi Bhatter, Nicholas Johnson, Furong Huang, Ernesto Gabriel Hernández Montoya, Bing Liu
1/6/2026 | Agentic Rubrics as Contextual Verifiers for SWE Agents | Agents, Safety, Evaluation and Alignment | Mohit Raghavendra, Anisha Gunjal, Bing Liu, Yunzhong He

[BLOG]

Insights, analysis, and updates from Scale Labs

Research · Apr 6, 2026

Improving Multi-Turn Tool Use with GRPO: Results and Insights

We’re sharing early insights from applying GRPO reinforcement learning to multi-turn tool-use tasks using our MCP Tool Use dataset. In a controlled experiment with 3,000 samples, we fine-tuned Qwen2.5-14B using LoRA (rank 32) and evaluated it on MCP Atlas. We observed significant improvement in both coverage rate and pass rate. In this article, we share observations on how data quality, reward design, and training constraints interact in agentic training settings.
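As a rough illustration of the group-relative step that gives GRPO its name, the sketch below normalizes each rollout's reward against the other rollouts sampled for the same prompt. It is not the training code from the post: the function name, the reward values, and the choice of population standard deviation are all assumptions made for the example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own group.

    `rewards` holds one scalar reward per rollout sampled for the same prompt.
    Using the population standard deviation is an assumption of this sketch.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every rollout earns the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four rollouts of one multi-turn tool-use prompt.
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(rewards)

# A group with identical rewards carries no learning signal.
flat = group_relative_advantages([0.5, 0.5, 0.5])
```

Rollouts that beat their group's average get a positive advantage and the rest get a negative one, so the reward design directly decides which tool-use trajectories are reinforced.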

Research · Mar 23, 2026

MultiChallenge Update: A More Reliable Multi-Turn Benchmark

We’ve updated the MultiChallenge benchmark to improve evaluation reliability and reduce subjectivity, and re-evaluated frontier models under the new setup.

Research · Mar 20, 2026

Voice Showdown: An In-the-Wild Preference Arena for Voice AI

Voice Showdown is the first large-scale preference arena for voice AI, ranking models through blind comparisons embedded in real user conversations across 60+ languages.

Research · Mar 11, 2026

Agentic Rubrics: Teaching AI to Verify Code the Way Developers Do

Agentic Rubrics is a method for verifying AI-generated code fixes. An agent explores the repo, writes a checklist for what a correct patch should do, and uses that rubric to score candidate fixes.
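The scoring half of that idea can be sketched as follows, assuming a rubric is a list of weighted checklist items and a patch's score is the weighted fraction of items it satisfies. The `RubricItem` type, the weights, and the string-matching checks are hypothetical stand-ins; in the method described above, an agent writes the rubric after exploring the repo, and how items are judged and aggregated is not specified here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class RubricItem:
    description: str              # what a correct patch should do
    check: Callable[[str], bool]  # toy stand-in for a judgment on the patch
    weight: float = 1.0


def score_patch(patch: str, rubric: List[RubricItem]) -> float:
    """Return the weighted fraction of rubric items the patch satisfies."""
    total = sum(item.weight for item in rubric)
    if total == 0:
        return 0.0
    earned = sum(item.weight for item in rubric if item.check(patch))
    return earned / total


# Hypothetical rubric for a bug where empty input crashed a function.
rubric = [
    RubricItem("Handles empty input", lambda p: "if not items" in p, 2.0),
    RubricItem("Adds a regression test", lambda p: "def test_" in p, 1.0),
]

candidates: Dict[str, str] = {
    "patch_a": "if not items:\n    return []\n",
    "patch_b": "if not items:\n    return []\n\ndef test_empty():\n    pass\n",
}

scores = {name: score_patch(patch, rubric) for name, patch in candidates.items()}
best = max(scores, key=scores.get)
```

Here `patch_b` satisfies both items and ranks above `patch_a`, which fixes the bug but adds no test.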


[JOBS]

Date | Position | Location
04.08.2026 | Director, Enterprise Machine Learning & Research | San Francisco, CA; New York, NY
03.26.2026 | Machine Learning Research Engineer, GenAI Applied ML | San Francisco, CA; New York, NY
03.26.2026 | Machine Learning Research Scientist / Engineer, Reasoning | San Francisco, CA; Seattle, WA; New York, NY
03.26.2026 | Machine Learning Research Scientist / Research Engineer, Post-Training | San Francisco, CA; Seattle, WA; New York, NY
03.26.2026 | Manager, Machine Learning Research Scientist, GenAI | San Francisco, CA; Seattle, WA; New York, NY
03.26.2026 | ML Research Engineer, ML Systems | San Francisco, CA; Seattle, WA; New York, NY

Copyright 2026 Scale Inc. All rights reserved.
