Scale Labs
Evaluation and Alignment | Agents | Enterprise | 6/9/2026

PSEBench: A Controllable and Verifiable Benchmark for Evaluating LLMs in Patient Safety Event Triage

Keqi Han†, Ryan Young†, Annabel Strauss, Lindsey Hughes, Katharine M. Nesbitt, Nicole Schueler, Che Ngufor, Carl Yang, Yuan (Emily) Xue, Zhijun Yin

PSEBench: a 5,074-case benchmark for evaluating LLMs on patient safety event triage, covering complete, missing-information, and uncertain cases.

Patient safety event triage, which determines whether a clinical event is reportable under jurisdiction-specific policy, is a high-stakes task typically performed manually by patient safety experts. Although LLMs may support this workflow, reliable evaluation is limited by the lack of benchmarks that capture evidence-grounded policy reasoning, proactive information seeking for incomplete reports, and principled abstention in irreducibly ambiguous cases. We address this gap with a policy-grounded construction methodology centered on the clause card, a structured representation that factorizes regulatory text into auditable decision specifications. Combining clause cards with anchor-driven instantiation and closed-loop verification, our scalable pipeline produces narratives with by-construction ground truth and naturally supports generating missing-information and uncertain variants. We instantiate this method on Minnesota’s 29 Reportable Adverse Health Events, producing PSEBench, a 5,074-case benchmark with an agentic evaluation environment. Evaluation on 15 representative LLMs reveals consistent capability trends, demonstrates the benchmark’s utility, and identifies actionable gaps toward reliable LLM-based patient safety event triage.
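
To make the three case types concrete, here is a minimal Python sketch of how a clause card could act as an auditable decision specification. It is an illustration only, not the paper's schema or pipeline: the `ClauseCard` fields, the `Decision` labels, the `UNCERTAIN` sentinel, the `triage` function, and the placeholder element names are all hypothetical, and the sketch assumes a clause is reportable only when every required element is established.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Tuple


class Decision(Enum):
    REPORTABLE = "reportable"
    NOT_REPORTABLE = "not_reportable"
    NEEDS_INFORMATION = "needs_information"  # missing-information case
    ABSTAIN = "abstain"                      # uncertain case


# Hypothetical sentinel: the report addresses the element but cannot resolve it.
UNCERTAIN = "uncertain"


@dataclass(frozen=True)
class ClauseCard:
    """Hypothetical clause card: one policy clause as a checklist of elements."""
    clause_id: str
    required_elements: Tuple[str, ...]


def triage(card: ClauseCard, facts: Dict[str, object]) -> Decision:
    """Map extracted facts to a decision under the assumptions stated above.

    `facts` maps an element name to True, False, or UNCERTAIN.
    An element that is absent from `facts` counts as missing.
    """
    values = [facts.get(name) for name in card.required_elements]
    # One clearly failed element settles the case, whatever else is unknown.
    if any(v is False for v in values):
        return Decision.NOT_REPORTABLE
    # Missing elements can still be requested, so ask before abstaining.
    if any(v is None for v in values):
        return Decision.NEEDS_INFORMATION
    # Every element is addressed, but at least one cannot be resolved.
    if any(v == UNCERTAIN for v in values):
        return Decision.ABSTAIN
    return Decision.REPORTABLE


# Placeholder clause and element names, not taken from any real policy text.
card = ClauseCard("example-clause", ("element_a", "element_b"))
complete = triage(card, {"element_a": True, "element_b": True})
missing = triage(card, {"element_a": True})
uncertain = triage(card, {"element_a": True, "element_b": UNCERTAIN})
```

Checking for missing elements before uncertain ones is a design choice in this sketch: a gap that can be filled by asking is handled first, and abstention is reserved for reports that are complete but still ambiguous.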

Copyright 2026 Scale Inc. All rights reserved.
