Scale Labs

Research to Advance AI

Scale Labs advances AI through research. Our research focuses on agents, post-training, reasoning, safety, evaluation and alignment, and the science of data.

[LEADERBOARDS]

Benchmarks for frontier, agentic, and safety capabilities

SWE Atlas - Refactoring
SWE Atlas - Test Writing
SWE Atlas - Codebase QnA
HiL-Bench (Human-in-Loop Benchmark)
MCP Atlas

[SHOWDOWN]

Model-preference rankings from real-world usage.

1 · claude-opus-4-6 · 45.5K votes · 1070.41
1 · gpt-5.2-chat-latest · 55.2K votes · 1069.57
1 · claude-opus-4-7 (Thinking) · 6.1K votes · 1063.86
1 · claude-opus-4-7 · 5.4K votes · 1062.40
3 · gpt-5.5-2026-04-23 · 5.3K votes · 1052.62

[PAPERS]

Research papers and publications covering agents, post-training, reasoning, safety, evaluation and alignment, and the science of data.

Date · Title · Category · Authors
5/19/2026 · Not Every Rubric Teaches Equally: Policy-Aware Rubric Rewards for RLVR · Research · Utkarsh Tyagi, Xingang Guo, MohammadHossein Rezaei, Daniel George, Anas Mahmoud, Jackson Lee, Bing Liu, Yunzhong He
5/17/2026 · ASPI: Seeking Ambiguity Clarification Amplifies Prompt Injection Vulnerability in LLM Agents · Safety, Agents, Evaluation and Alignment · Udari Madhushani Sehwag, Zhengyang Shan, Heming Liu, Dileepa Lakshan, Joseph Brandifino, and Max Fenkell
5/12/2026 · Reward Hacking in Rubric-Based Reinforcement Learning · Research · Anas Mahmoud, MohammadHossein Rezaei, Zihao Wang, Anisha Gunjal, Bing Liu, Yunzhong He
5/7/2026 · SWE Atlas: Benchmarking Coding Agents Beyond Issue Resolution · Evaluation and Alignment, Agents · Mohit Raghavendra, Soham Dan, Miguel Romero Calvo, Yannis Yiming He, Johannes Baptist Mols, Gautam Anand, Cole McCollum, Edgar Arakelyan, Vijay Bharadwaj, Andrew Park, Jeff Da, Mohammad Hossein Rezaei, Bing Liu, Brad Kenstler, Yunzhong He
4/22/2026 · Coverage, Not Averages: Semantic Stratification for Trustworthy Retrieval Evaluation · Evaluation and Alignment, Science of Data, Research · Andrew Klearman, Radu Revutchi, Rohin Garg, Rishav Chakravarti, Samuel Marc Denton, Yuan Xue
4/13/2026 · HiL-BENCH (Human-in-Loop Benchmark) · Research · Mohamed Elfeki, Tu Trinh, Kelvin Luu, Guangze Luo, Nathan Hunt, Ernesto Hernández, Nandan Marwaha, Yannis Yiming He, Charles Wang, Fernando Carabedo, Alessa Castillo, Bing Liu

[BLOG]

Insights, analysis, and updates from Scale Labs

Research · May 19, 2026

The Path to Large Scale Dense Video Captioning

We ran dozens of experiments on dense captioning for robot manipulation video. The biggest lever turned out to be how we represented the video to the model. Most techniques from the literature added noise on smaller models.

Research · May 11, 2026

57 Healthcare Professionals Told Us What They Need from AI

We surveyed 57 healthcare professionals about what they actually want from AI. Their answers point to three capability gaps that current evaluations miss.

Research · May 6, 2026

Coverage Not Averages: Rethinking Retrieval Evaluation

A single benchmark score suggests stability and completeness. In reality, it may reflect performance on a narrow and biased slice of the problem.

Research · Apr 6, 2026

Improving Multi-Turn Tool Use with GRPO: Results and Insights

We’re sharing early insights from applying GRPO reinforcement learning to multi-turn tool-use tasks using our MCP Tool Use dataset. In a controlled experiment with 3,000 samples, we fine-tuned Qwen2.5-14B using LoRA (rank 32) and evaluated it on MCP Atlas. We observed significant improvement in both coverage rate and pass rate. In this article, we share observations on how data quality, reward design, and training constraints interact in agentic training settings.


[Jobs]

Date · Position · Location
05.08.2026 · Director, Enterprise Machine Learning & Research · San Francisco, CA; New York, NY
05.08.2026 · Machine Learning Research Engineer, GenAI Applied ML · San Francisco, CA; New York, NY
05.08.2026 · Machine Learning Research Scientist, Post-Training · San Francisco, CA; Seattle, WA; New York, NY
05.08.2026 · Machine Learning Research Scientist, Reasoning · San Francisco, CA; Seattle, WA; New York, NY
05.08.2026 · Manager, Machine Learning Research Scientist, GenAI · San Francisco, CA; Seattle, WA; New York, NY
05.08.2026 · ML Research Engineer, ML Systems · San Francisco, CA; Seattle, WA; New York, NY

Copyright 2026 Scale Inc. All rights reserved.
