Research to Advance AI
Scale Labs advances AI through research. Our research focuses on agents, post-training, reasoning, safety, evaluation, alignment, and the science of data.
[LEADERBOARDS]
Benchmarks for frontier, agentic, and safety capabilities.
[SHOWDOWN]
Model-preference rankings from real-world usage.
[PAPERS]
Research papers and publications covering agents, post-training, reasoning, safety, evaluation, alignment, and the science of data.
Defensive Refusal Bias: How Safety Alignment Fails Cyber Defenders
[BLOG]
Insights, analysis, and updates from Scale Labs.
Improving Multi-Turn Tool Use with GRPO: Results and Insights
We’re sharing early insights from applying GRPO reinforcement learning to multi-turn tool-use tasks built on our MCP Tool Use dataset. In a controlled experiment with 3,000 samples, we fine-tuned Qwen2.5-14B with LoRA (rank 32) and evaluated it on MCP Atlas, observing significant improvements in both coverage rate and pass rate. We also discuss how data quality, reward design, and training constraints interact in agentic training settings.
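For readers who want a concrete starting point, here is a minimal sketch of this kind of setup, assuming Hugging Face TRL's GRPOTrainer and PEFT. The dataset file, reward function, and all hyperparameters other than the LoRA rank are illustrative placeholders, not our actual configuration.

```python
# Minimal sketch of GRPO fine-tuning with a LoRA adapter, assuming
# Hugging Face TRL's GRPOTrainer and PEFT. Dataset path, reward
# function, and most hyperparameters are placeholders.
from datasets import load_dataset
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer

# Hypothetical JSONL of tool-use prompts; GRPOTrainer expects a
# "prompt" column in the training dataset.
train_dataset = load_dataset("json", data_files="mcp_tool_use.jsonl", split="train")

def reward_tool_use(completions, **kwargs):
    """Placeholder reward: 1.0 if the completion emits a tool call,
    else 0.0. A real reward design would be far more involved."""
    return [1.0 if "<tool_call>" in c else 0.0 for c in completions]

peft_config = LoraConfig(
    r=32,             # LoRA rank from the post
    lora_alpha=64,    # assumed; not stated in the post
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

training_args = GRPOConfig(
    output_dir="qwen2.5-14b-grpo-lora",
    num_generations=8,  # completions sampled per prompt; assumed value
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-14B-Instruct",
    reward_funcs=reward_tool_use,
    args=training_args,
    train_dataset=train_dataset,
    peft_config=peft_config,
)
trainer.train()
```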
MultiChallenge Update: A More Reliable Multi-Turn Benchmark
We’ve updated the MultiChallenge benchmark to improve evaluation reliability and reduce subjectivity, and re-evaluated frontier models under the new setup.
Voice Showdown: An In-the-Wild Preference Arena for Voice AI
Voice Showdown is the first large-scale preference arena for voice AI, ranking models through blind comparisons embedded in real user conversations across 60+ languages.
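As background on how arena-style leaderboards are typically computed: blind pairwise votes are aggregated with an Elo- or Bradley-Terry-style rating model. The post does not specify Voice Showdown's exact method, so the sketch below is a generic online Elo update over made-up votes.

```python
# Generic sketch of turning blind pairwise preferences into a ranking
# via Elo-style updates. Model names and votes are illustrative; this
# is not necessarily the rating method Voice Showdown uses.
from collections import defaultdict

K = 32  # update step size

def expected(r_a: float, r_b: float) -> float:
    """Expected win probability of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

ratings: dict[str, float] = defaultdict(lambda: 1000.0)

# Each vote: (model_a, model_b, score_for_a) where 1 means A was
# preferred, 0 means B was preferred, 0.5 means a tie.
votes = [("voice-x", "voice-y", 1), ("voice-y", "voice-z", 0.5), ("voice-z", "voice-x", 0)]

for a, b, s_a in votes:
    e_a = expected(ratings[a], ratings[b])
    ratings[a] += K * (s_a - e_a)
    ratings[b] += K * ((1 - s_a) - (1 - e_a))

print(sorted(ratings.items(), key=lambda kv: -kv[1]))
```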
Agentic Rubrics: Teaching AI to Verify Code the Way Developers Do
Agentic Rubrics is a method for verifying AI-generated code fixes. An agent explores the repo, writes a checklist for what a correct patch should do, and uses that rubric to score candidate fixes.
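To make the flow concrete, here is an illustrative sketch of rubric-based patch scoring. The data structures, function names, and the toy judge heuristic are all hypothetical; the actual agent explores the repository and applies model-based judgment to each checklist item.

```python
# Illustrative sketch of the Agentic Rubrics flow: write a checklist
# for what a correct patch should do, then score a candidate fix
# against it. Names and the judge heuristic are hypothetical.
from dataclasses import dataclass

@dataclass
class RubricItem:
    check: str     # what a correct patch should do
    weight: float  # relative importance

def write_rubric(repo_context: str) -> list[RubricItem]:
    """Stand-in for the agent step: after exploring the repo, the agent
    emits a checklist of verifiable properties for a correct patch."""
    return [
        RubricItem("Fixes the off-by-one bound in parse_range()", 2.0),
        RubricItem("Adds a regression test for the empty-input case", 1.0),
        RubricItem("Does not change the public API of parse_range()", 1.0),
    ]

def judge(item: RubricItem, patch: str) -> bool:
    """Stand-in for an LLM judging one checklist item against the
    candidate patch; here, a trivial keyword heuristic."""
    return item.check.split()[0].lower() in patch.lower()

def score_patch(rubric: list[RubricItem], patch: str) -> float:
    """Weighted fraction of rubric items the candidate patch satisfies."""
    total = sum(item.weight for item in rubric)
    passed = sum(item.weight for item in rubric if judge(item, patch))
    return passed / total

rubric = write_rubric(repo_context="...")
print(score_patch(rubric, patch="Fixes parse_range() upper bound; adds test."))
```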
View all posts