Scale Labs

Posts by Mike Lee

Research · 17.02.2026

Introducing Long Horizon Augmented Workflows: Controllable Underspecification for Long-Horizon Tasks

LHAW is a dataset-agnostic pipeline for generating underspecified long-horizon tasks and evaluating strategic clarification. Across MCP-Atlas, TAC, and SWE-Bench Pro, we find large differences in how frontier models detect missing information and recover performance under ambiguity.

George Pu, Mike Lee, Sam Denton

Safety · 23.01.2026

MoReBench: Evaluating the Process of AI Moral Reasoning

MoReBench is a benchmark for evaluating the procedural moral reasoning of large language models. Using expert-authored rubrics across diverse ethical scenarios, it scores models on the structure and coherence of their reasoning rather than on task outcomes. Our findings show that moral-reasoning performance is only weakly correlated with established benchmarks and warrants targeted evaluation and training.

Brandon Handoko, Matthew Siegel, Mike Lee


Copyright 2026 Scale Inc. All rights reserved.
