Scale Labs

Research to Advance AI

Scale Labs advances AI through research. Our research focuses on agents, post-training, reasoning, safety, evaluation and alignment, and the science of data.

[LEADERBOARDS]

Benchmarks for frontier, agentic, and safety capabilities

SWE Atlas - Refactoring
SWE Atlas - Test Writing
SWE Atlas - Codebase QnA
HiL-Bench (Human-in-Loop Benchmark)
MCP Atlas

[SHOWDOWN]

Model-preference rankings from real-world usage.

1 · claude-opus-4-6 · 45.5K votes · 1070.53
1 · gpt-5.2-chat-latest · 55.2K votes · 1069.52
1 · claude-opus-4-7 (Thinking) · 6.1K votes · 1063.82
1 · claude-opus-4-7 · 5.4K votes · 1062.19
3 · gpt-5.5-2026-04-23 · 5.3K votes · 1052.91

[PAPERS]

Research papers and publications covering agents, post-training, reasoning, safety, evaluation and alignment, and the science of data.

Date · Title · Category · Authors
5/19/2026 · Not Every Rubric Teaches Equally: Policy-Aware Rubric Rewards for RLVR · Research · Utkarsh Tyagi, Xingang Guo, MohammadHossein Rezaei, Daniel George, Anas Mahmoud, Jackson Lee, Bing Liu, Yunzhong He
5/17/2026 · ASPI: Seeking Ambiguity Clarification Amplifies Prompt Injection Vulnerability in LLM Agents · Safety, Agents, Evaluation and Alignment · Udari Madhushani Sehwag, Zhengyang Shan, Heming Liu, Dileepa Lakshan, Joseph Brandifino, Max Fenkell
5/12/2026 · Reward Hacking in Rubric-Based Reinforcement Learning · Research · Anas Mahmoud, MohammadHossein Rezaei, Zihao Wang, Anisha Gunjal, Bing Liu, Yunzhong He
5/7/2026 · SWE Atlas: Benchmarking Coding Agents Beyond Issue Resolution · Evaluation and Alignment, Agents · Mohit Raghavendra, Soham Dan, Miguel Romero Calvo, Yannis Yiming He, Johannes Baptist Mols, Gautam Anand, Cole McCollum, Edgar Arakelyan, Vijay Bharadwaj, Andrew Park, Jeff Da, Mohammad Hossein Rezaei, Bing Liu, Brad Kenstler, Yunzhong He
4/22/2026 · Coverage, Not Averages: Semantic Stratification for Trustworthy Retrieval Evaluation · Evaluation and Alignment, Science of Data, Research · Andrew Klearman, Radu Revutchi, Rohin Garg, Rishav Chakravarti, Samuel Marc Denton, Yuan Xue
4/13/2026 · HiL-BENCH (Human-in-Loop Benchmark) · Research · Mohamed Elfeki, Tu Trinh, Kelvin Luu, Guangze Luo, Nathan Hunt, Ernesto Hernández, Nandan Marwaha, Yannis Yiming He, Charles Wang, Fernando Carabedo, Alessa Castillo, Bing Liu

[BLOG]

Insights, analysis, and updates from Scale Labs

Research · May 27, 2026

HiL-Dynamics: Understanding Agents That Don’t Know What They Don’t Know

HiL-Dynamics is our new diagnostic tool for studying how coding agents handle underspecified tasks. Across four modern harnesses, the verdict is the same: agents have learned to ask well, but not when to ask.

Research · May 19, 2026

The Path to Large Scale Dense Video Captioning

We ran dozens of experiments on dense captioning for robot manipulation video. The biggest lever turned out to be how we represented the video to the model. Most techniques from the literature added noise on smaller models.

Research · May 11, 2026

57 Healthcare Professionals Told Us What They Need from AI

We surveyed 57 healthcare professionals about what they actually want from AI. Their answers point to three capability gaps that current evaluations miss.

Research · May 6, 2026

Coverage Not Averages: Rethinking Retrieval Evaluation

A single benchmark score suggests stability and completeness. In reality, it may reflect performance on a narrow and biased slice of the problem.


[JOBS]

Date · Position · Location
05.08.2026 · Director, Enterprise Machine Learning & Research · San Francisco, CA; New York, NY
05.26.2026 · Machine Learning Research Engineer, GenAI Applied ML · San Francisco, CA; New York, NY
05.08.2026 · Machine Learning Research Scientist, Post-Training · San Francisco, CA; Seattle, WA; New York, NY
05.08.2026 · Machine Learning Research Scientist, Reasoning · San Francisco, CA; Seattle, WA; New York, NY
05.08.2026 · Manager, Machine Learning Research Scientist, GenAI · San Francisco, CA; Seattle, WA; New York, NY
05.08.2026 · ML Research Engineer, ML Systems · San Francisco, CA; Seattle, WA; New York, NY

Copyright 2026 Scale Inc. All rights reserved.
