Scale Labs

Posts by Sam Denton

Research · 05.03.2026

VeRO: Can AI Agents Build Better AI Agents?

VeRO benchmarks whether coding agents can improve other AI agents by modifying their prompts, tools, and control logic. Across 105 optimization runs, results show modest gains on tool-use tasks but persistent limits in exploration, cross-model generalization, and deeper architectural changes.

Varun Ursekar, Apaar Shanker, Veronica Chatrath, Sam Denton

Research · 17.02.2026

Introducing Long Horizon Augmented Workflows: Controllable Underspecification for Long-Horizon Tasks

LHAW is a dataset-agnostic pipeline for generating underspecified long-horizon tasks and evaluating strategic clarification. Across MCP-Atlas, TAC, and SWE-Bench Pro, we find large differences in how frontier models detect missing information and recover performance under ambiguity.

George Pu, Mike Lee, Sam Denton

Research · 17.11.2025

Scaling Enterprise Agent Performance with Reinforcement Learning via Verifiable Feedback Loops

We demonstrate that reinforcement learning can fine-tune agents within realistic enterprise environments, using task-specific feedback and structured rewards to substantially improve performance over baseline models.

Jerry Chan, Vijay Kalmath, George Pu, Sam Denton


Copyright 2026 Scale Inc. All rights reserved.
