Scale Labs

Research to Advance AI

Scale Labs advances AI through research on agents, post-training, reasoning, safety, evaluation and alignment, and the science of data.

[LEADERBOARDS]

Benchmarks for frontier, agentic, and safety capabilities

SWE Atlas - Codebase QnA
MCP Atlas
SWE-Bench Pro (Public Dataset)
SWE-Bench Pro (Private Dataset)
SciPredict

[SHOWDOWN]

Model-preference rankings from real-world usage.

Rank  Model                        Votes  Score
1     gpt-5.2-chat-latest          8.6K   1117.62
2     claude-opus-4-5-20251101     9.7K   1101.62
3     gpt-5-chat                   11.5K  1087.88
3     claude-sonnet-4-5-20250929   14.6K  1087.71
3     gemini-3-flash               8.6K   1082.20

[PAPERS]

Research papers and publications covering agents, post-training, reasoning, safety, evaluation and alignment, and the science of data.

Date · Title · Category · Authors

02.12.2026 · LHAW: Controllable Underspecification for Long-Horizon Tasks
Agents, Safety, Evaluation and Alignment
George Pu*, Michael S. Lee*, Udari Madhushani Sehwag, David J. Lee, Bryan Zhu, Yash Maurya, Mohit Raghavendra, Yuan Xue, and Samuel Marc Denton (*equal contribution)

01.15.2026 · SciPredict: Can LLMs Predict the Outcomes of Research Experiments in Natural Sciences?
Safety, Evaluation and Alignment
Udari Madhushani Sehwag¹, Elaine Lau¹†, Haniyeh Ehsani Oskouie²,⁵, Shayan Shabihi³, Erich Liang⁴,⁵, Andrea Toledo¹, Guillermo Mangialardi¹, Sergio Fonrouge¹, Ed-Yeremai Hernández Cardona¹, Paula Vergara¹, Utkarsh Tyagi¹, Chen Bo Calvin Zhang¹, Pavi Bhatter¹, Nicholas Johnson¹, Furong Huang³, Ernesto Gabriel Hernández Montoya¹, and Bing Liu¹
¹Scale AI, ²University of California, Los Angeles, ³University of Maryland, ⁴Princeton University, ⁵Human Frontier Collective, Scale AI; †Work done while at Scale AI

01.06.2026 · Agentic Rubrics as Contextual Verifiers for SWE Agents
Agents, Safety, Evaluation and Alignment
Mohit Raghavendra*, Anisha Gunjal*, Bing Liu, Yunzhong He (*equal contribution)

12.22.2025 · MoReBench: Evaluating Procedural and Pluralistic Moral Reasoning in Language Models, More than Outcomes
Reasoning, Safety, Evaluation and Alignment
Yu Ying Chiu, Michael S. Lee, Rachel Calcott, Brandon Handoko, Paul de Font-Reaulx, Paula Rodriguez, Chen Bo Calvin Zhang, Ziwen Han, Udari Madhushani Sehwag, Yash Maurya, Christina Q Knight, Harry R. Lloyd, Florence Bacus, Mantas Mazeika, Bing Liu, Yejin Choi, Mitchell L Gordon, Sydney Levine

12.18.2025 · MCP-Atlas: A Large-Scale Benchmark for Tool-Use Competency with Real MCP Servers
Agents, Reasoning, Safety, Evaluation and Alignment
Chaithanya Bandi, Ben Hertzberg, Geobio Boo, Tejas Polakam, Jeff Da, Sami Hassaan, Manasi Sharma, Andrew Park, Ernesto Hernandez, Dan Rambado, Ivan Salazar, Rafael Cruz, Chetan Rane, Ben Levin, Brad Kenstler, Bing Liu

12.17.2025 · Audio MultiChallenge
Multimodal, Safety, Evaluation and Alignment
Advait Gosai*, Tyler Vuong*, Utkarsh Tyagi, Steven Li, Wenjia You, Miheer Bavare, Arda Uçar, Zhongwang Fang, Brian Jang, Bing Liu, Yunzhong He (*equal contribution)

[BLOG]

Insights, analysis, and updates from Scale Labs

Research · Mar 5, 2026

VeRO: Can AI Agents Build Better AI Agents?

VeRO benchmarks whether coding agents can improve other AI agents by modifying their prompts, tools, and control logic. Across 105 optimization runs, results show modest gains on tool-use tasks but persistent limits in exploration, cross-model generalization, and deeper architectural changes.

Research · Mar 4, 2026

When AI Safety Becomes a Denial-of-Service for Defenders

Most AI safety benchmarks measure whether models help when they shouldn’t. But what happens when they refuse when they shouldn’t? An analysis of real-world defender interactions reveals how alignment systems can block legitimate cybersecurity work—exposing a blind spot in how AI safety is currently evaluated.

Research · Feb 17, 2026

Introducing Long Horizon Augmented Workflows: Controllable Underspecification for Long-Horizon Tasks

LHAW is a dataset-agnostic pipeline for generating underspecified long-horizon tasks and evaluating strategic clarification. Across MCP-Atlas, TAC, and SWE-Bench Pro, we find large differences in how frontier models detect missing information and recover performance under ambiguity.

Showdown · Jan 28, 2026

How Profession Shapes LLM Usage: Insights from SEAL Showdown

We analyze 580k+ production prompts and 100k+ preference battles from SEAL Showdown to study how profession shapes LLM usage. We find that professional background—independent of topic—predicts prompt difficulty, task type, and model preference, with domain experts asking harder in-domain questions and ranking models differently. These results motivate profession-aware evaluation of LLMs in expert workflows.


[Jobs]

Date · Position · Location
02.12.2026 · AI Infrastructure Engineer, Core Infrastructure · San Francisco, CA; Seattle, WA; New York, NY
02.12.2026 · AI Infrastructure Engineer, Model Serving Platform · San Francisco, CA; New York, NY
02.12.2026 · Machine Learning Research Engineer, GenAI Applied ML · San Francisco, CA; New York, NY
02.12.2026 · Machine Learning Research Scientist / Engineer, Reasoning · San Francisco, CA; Seattle, WA; New York, NY
02.12.2026 · Machine Learning Research Scientist / Research Engineer, Post-Training · San Francisco, CA; Seattle, WA; New York, NY
02.12.2026 · Manager, Machine Learning Research Scientist, GenAI · San Francisco, CA; Seattle, WA; New York, NY

Scale Labs Newsletter

Research, benchmarks, and insights — delivered to your inbox.

Copyright 2026 Scale Inc. All rights reserved.

Terms · Privacy