Scale Labs
Safety, Evaluation and Alignment · 11.13.2025

Professional Reasoning Benchmark

Afra Feyza Akyürek, Advait Gosai, Chen Bo Calvin Zhang, Vipul Gupta, Jaehwan Jeong, Anisha Gunjal, Tahseen Rabbani, Maria Mazzone, David Randolph, Mohammad Mahmoudi Meymand, Gurshaan Chattha, Paula Rodriguez, Diego Mares, Pavit Singh, Michael Liu, Subodh Chawla, Pete Cline, Lucy Ogaz, Ernesto Hernandez, Zihao Wang, Pavi Bhatter, Marcos Ayestaran, Bing Liu, and Yunzhong He

View paper

PRBench is the first benchmark to evaluate LLMs on high-stakes professional reasoning in Finance and Law.

See PRBench for Finance here: https://scale.com/leaderboard/prbench-finance

See PRBench for Legal here: https://scale.com/leaderboard/prbench-legal

Explore the data here: https://prbench-explorer.vercel.app/

Frontier model progress is often measured by academic benchmarks, which offer a limited view of performance in real-world professional contexts. This gap is significant, as high-stakes domains like Legal and Finance are common professional use cases yet remain underexplored. Existing evaluations often fail to assess open-ended, economically consequential tasks where practical returns are paramount.

To address this, we introduce Professional Reasoning Bench (PRBench), a realistic, open-ended, and difficult benchmark of real-world problems in Finance and Law. We open-source its 1,100 expert-authored tasks and 19,356 expert-curated criteria, making it, to our knowledge, the largest public rubric-based benchmark for the legal and finance domains. We recruited 182 qualified professionals, holding JDs or CFAs or with 6+ years of experience, who contributed tasks based on their actual client work. This process yields significant diversity: tasks span 114 countries and 47 US jurisdictions across both the Finance and Legal domains.

Our expert-curated rubrics were validated through a rigorous quality pipeline, including inter-rater agreement analysis and independent expert validation. Subsequent evaluation of 20 leading models reveals substantial room for improvement, with top scores of only 0.39 (Finance) and 0.37 (Legal) on our Hard subsets. We further analyze model performance using the rubric categories provided by our annotators, revealing that even models with similar overall scores can exhibit large performance disparities on specific capability clusters. Combined with hierarchical clustering on rubrics and ablations, our analysis also reveals common failure modes, including inaccurate judgments, a lack of process transparency, and incomplete reasoning. These findings highlight critical gaps in reliability for professional adoption.
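To make the rubric-based setup concrete, here is a minimal sketch of how a per-task score could be computed as the weighted fraction of expert criteria a model response satisfies. The criterion names, weights, and scoring function are illustrative assumptions, not PRBench's actual implementation or data.

```python
# Hypothetical sketch of rubric-based scoring. Each task carries a list of
# expert-written criteria with weights; a model response is scored as the
# weighted fraction of criteria it satisfies. All names below are invented.

def score_task(criteria, satisfied):
    """criteria: list of (name, weight) pairs; satisfied: set of names met.

    Returns the weighted fraction of satisfied criteria in [0, 1].
    """
    total = sum(weight for _, weight in criteria)
    met = sum(weight for name, weight in criteria if name in satisfied)
    return met / total if total else 0.0

# Example rubric for a (fictional) legal task.
rubric = [
    ("cites_controlling_statute", 2.0),
    ("flags_jurisdictional_limits", 1.0),
    ("no_fabricated_authority", 2.0),
]
print(score_task(rubric, {"cites_controlling_statute", "no_fabricated_authority"}))  # 0.8
```

Averaging such per-task scores over a subset would yield leaderboard-style numbers like the 0.39 (Finance) and 0.37 (Legal) figures reported above; whether criteria are actually weighted, and how satisfaction is judged, is specific to the benchmark's pipeline.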
