Research to Advance AI
Scale Labs advances AI through research. Our research focuses on agents, post-training, reasoning, safety, evaluation, alignment, and the science of data.
[LEADERBOARDS]
Benchmarks for frontier, agentic, and safety capabilities.
[SHOWDOWN]
Model-preference rankings from real-world usage.
[PAPERS]
Research papers and publications covering agents, post-training, reasoning, safety, evaluation, alignment, and the science of data.
LHAW: Controllable Underspecification for Long-Horizon Tasks
[BLOG]
Insights, analysis, and updates from Scale Labs
VeRO: Can AI Agents Build Better AI Agents?
VeRO benchmarks whether coding agents can improve other AI agents by modifying their prompts, tools, and control logic. Across 105 optimization runs, results show modest gains on tool-use tasks but persistent limits in exploration, cross-model generalization, and deeper architectural changes.
When AI Safety Becomes a Denial-of-Service for Defenders
Most AI safety benchmarks measure whether models help when they shouldn’t. But what happens when they refuse when they shouldn’t? An analysis of real-world defender interactions reveals how alignment systems can block legitimate cybersecurity work—exposing a blind spot in how AI safety is currently evaluated.
Introducing Long Horizon Augmented Workflows: Controllable Underspecification for Long-Horizon Tasks
LHAW is a dataset-agnostic pipeline for generating underspecified long-horizon tasks and evaluating strategic clarification. Across MCP-Atlas, TAC, and SWE-Bench Pro, we find large differences in how frontier models detect missing information and recover performance under ambiguity.
How Profession Shapes LLM Usage: Insights from SEAL Showdown
We analyze 580k+ production prompts and 100k+ preference battles from SEAL Showdown to study how profession shapes LLM usage. We find that professional background—independent of topic—predicts prompt difficulty, task type, and model preference, with domain experts asking harder in-domain questions and ranking models differently. These results motivate profession-aware evaluation of LLMs in expert workflows.
View all posts