Scale Labs

Research to Advance AI

Scale Labs advances AI through research on agents, post-training, reasoning, safety, evaluation and alignment, and the science of data.

[LEADERBOARDS]

Benchmarks for frontier, agentic, and safety capabilities

SWE Atlas - Codebase QnA
MCP Atlas
SWE-Bench Pro (Public Dataset)
SWE-Bench Pro (Private Dataset)
SciPredict

[SHOWDOWN]

Model-preference rankings from real-world usage.

Rank  Model                        Votes  Score
1     gpt-5.2-chat-latest          8.6K   1117.62
2     claude-opus-4-5-20251101     9.7K   1101.62
3     gpt-5-chat                   11.5K  1087.88
3     claude-sonnet-4-5-20250929   14.6K  1087.71
3     gemini-3-flash               8.6K   1082.20

[PAPERS]

Research papers and publications covering agents, post-training, reasoning, safety, evaluation and alignment, and the science of data.

Date · Title · Category · Authors

02.12.2026 · LHAW: Controllable Underspecification for Long-Horizon Tasks
Agents, Safety, Evaluation and Alignment
George Pu*, Michael S. Lee*, Udari Madhushani Sehwag, David J. Lee, Bryan Zhu, Yash Maurya, Mohit Raghavendra, Yuan Xue, and Samuel Marc Denton (*equal contribution)

01.15.2026 · SciPredict: Can LLMs Predict the Outcomes of Research Experiments in Natural Sciences?
Safety, Evaluation and Alignment
Udari Madhushani Sehwag¹, Elaine Lau¹†, Haniyeh Ehsani Oskouie²,⁵, Shayan Shabihi³, Erich Liang⁴,⁵, Andrea Toledo¹, Guillermo Mangialardi¹, Sergio Fonrouge¹, Ed-Yeremai Hernández Cardona¹, Paula Vergara¹, Utkarsh Tyagi¹, Chen Bo Calvin Zhang¹, Pavi Bhatter¹, Nicholas Johnson¹, Furong Huang³, Ernesto Gabriel Hernández Montoya¹, and Bing Liu¹
¹Scale AI, ²University of California, Los Angeles, ³University of Maryland, ⁴Princeton University, ⁵Human Frontier Collective, Scale AI; †Work done while at Scale AI

01.06.2026 · Agentic Rubrics as Contextual Verifiers for SWE Agents
Agents, Safety, Evaluation and Alignment
Mohit Raghavendra*, Anisha Gunjal*, Bing Liu, Yunzhong He (*equal contribution)

12.22.2025 · MoReBench: Evaluating Procedural and Pluralistic Moral Reasoning in Language Models, More than Outcomes
Reasoning, Safety, Evaluation and Alignment
Yu Ying Chiu, Michael S. Lee, Rachel Calcott, Brandon Handoko, Paul de Font-Reaulx, Paula Rodriguez, Chen Bo Calvin Zhang, Ziwen Han, Udari Madhushani Sehwag, Yash Maurya, Christina Q Knight, Harry R. Lloyd, Florence Bacus, Mantas Mazeika, Bing Liu, Yejin Choi, Mitchell L Gordon, Sydney Levine

12.18.2025 · MCP-Atlas: A Large-Scale Benchmark for Tool-Use Competency with Real MCP Servers
Agents, Reasoning, Safety, Evaluation and Alignment
Chaithanya Bandi, Ben Hertzberg, Geobio Boo, Tejas Polakam, Jeff Da, Sami Hassaan, Manasi Sharma, Andrew Park, Ernesto Hernandez, Dan Rambado, Ivan Salazar, Rafael Cruz, Chetan Rane, Ben Levin, Brad Kenstler, Bing Liu

12.17.2025 · Audio MultiChallenge
Multimodal, Safety, Evaluation and Alignment
Advait Gosai*, Tyler Vuong*, Utkarsh Tyagi, Steven Li, Wenjia You, Miheer Bavare, Arda Uçar, Zhongwang Fang, Brian Jang, Bing Liu, Yunzhong He (*equal contribution)

[BLOG]

Insights, analysis, and updates from Scale Labs

Research · Mar 5, 2026

VeRO: Can AI Agents Build Better AI Agents?

VeRO benchmarks whether coding agents can improve other AI agents by modifying their prompts, tools, and control logic. Across 105 optimization runs, results show modest gains on tool-use tasks but persistent limits in exploration, cross-model generalization, and deeper architectural changes.

Research · Mar 4, 2026

When AI Safety Becomes a Denial-of-Service for Defenders

Most AI safety benchmarks measure whether models help when they shouldn’t. But what happens when they refuse when they shouldn’t? An analysis of real-world defender interactions reveals how alignment systems can block legitimate cybersecurity work—exposing a blind spot in how AI safety is currently evaluated.

Research · Feb 17, 2026

Introducing Long Horizon Augmented Workflows: Controllable Underspecification for Long-Horizon Tasks

LHAW is a dataset-agnostic pipeline for generating underspecified long-horizon tasks and evaluating strategic clarification. Across MCP-Atlas, TAC, and SWE-Bench Pro, we find large differences in how frontier models detect missing information and recover performance under ambiguity.

Showdown · Jan 28, 2026

How Profession Shapes LLM Usage: Insights from SEAL Showdown

We analyze 580k+ production prompts and 100k+ preference battles from SEAL Showdown to study how profession shapes LLM usage. We find that professional background—independent of topic—predicts prompt difficulty, task type, and model preference, with domain experts asking harder in-domain questions and ranking models differently. These results motivate profession-aware evaluation of LLMs in expert workflows.


[Jobs]

Date · Position · Location
02.12.2026 · AI Infrastructure Engineer, Core Infrastructure · San Francisco, CA; Seattle, WA; New York, NY
02.12.2026 · AI Infrastructure Engineer, Model Serving Platform · San Francisco, CA; New York, NY
02.12.2026 · Machine Learning Research Engineer, GenAI Applied ML · San Francisco, CA; New York, NY
02.12.2026 · Machine Learning Research Scientist / Engineer, Reasoning · San Francisco, CA; Seattle, WA; New York, NY
02.12.2026 · Machine Learning Research Scientist / Research Engineer, Post-Training · San Francisco, CA; Seattle, WA; New York, NY
02.12.2026 · Manager, Machine Learning Research Scientist, GenAI · San Francisco, CA; Seattle, WA; New York, NY

Scale Labs Newsletter

Research, benchmarks, and insights — delivered to your inbox.

Copyright 2026 Scale Inc. All rights reserved.

Terms · Privacy