[Papers]
Research papers and publications from Scale Labs covering AI evaluation, safety, benchmarking, and frontier model analysis.
02.12.2026  LHAW: Controllable Underspecification for Long-Horizon Tasks
Topics: Agents, Safety, Evaluation and Alignment
Authors: George Pu*, Michael S. Lee*, Udari Madhushani Sehwag, David J. Lee, Bryan Zhu, Yash Maurya, Mohit Raghavendra, Yuan Xue, and Samuel Marc Denton (*equal contribution)

01.15.2026  SciPredict: Can LLMs Predict the Outcomes of Research Experiments in Natural Sciences?
Topics: Safety, Evaluation and Alignment
Authors: Udari Madhushani Sehwag1, Elaine Lau1†, Haniyeh Ehsani Oskouie2,5, Shayan Shabihi3, Erich Liang4,5, Andrea Toledo1, Guillermo Mangialardi1, Sergio Fonrouge1, Ed-Yeremai Hernández Cardona1, Paula Vergara1, Utkarsh Tyagi1, Chen Bo Calvin Zhang1, Pavi Bhatter1, Nicholas Johnson1, Furong Huang3, Ernesto Gabriel Hernández Montoya1, and Bing Liu1 (1 Scale AI, 2 University of California, Los Angeles, 3 University of Maryland, 4 Princeton University, 5 Human Frontier Collective, Scale AI; † work done while at Scale AI)

01.06.2026  Agentic Rubrics as Contextual Verifiers for SWE Agents
Topics: Agents, Safety, Evaluation and Alignment
Authors: Mohit Raghavendra*, Anisha Gunjal*, Bing Liu, Yunzhong He (*equal contribution)

12.22.2025  MoReBench: Evaluating Procedural and Pluralistic Moral Reasoning in Language Models, More than Outcomes
Topics: Reasoning, Safety, Evaluation and Alignment
Authors: Yu Ying Chiu, Michael S. Lee, Rachel Calcott, Brandon Handoko, Paul de Font-Reaulx, Paula Rodriguez, Chen Bo Calvin Zhang, Ziwen Han, Udari Madhushani Sehwag, Yash Maurya, Christina Q Knight, Harry R. Lloyd, Florence Bacus, Mantas Mazeika, Bing Liu, Yejin Choi, Mitchell L Gordon, Sydney Levine

12.18.2025  MCP-Atlas: A Large-Scale Benchmark for Tool-Use Competency with Real MCP Servers
Topics: Agents, Reasoning, Safety, Evaluation and Alignment
Authors: Chaithanya Bandi, Ben Hertzberg, Geobio Boo, Tejas Polakam, Jeff Da, Sami Hassaan, Manasi Sharma, Andrew Park, Ernesto Hernandez, Dan Rambado, Ivan Salazar, Rafael Cruz, Chetan Rane, Ben Levin, Brad Kenstler, Bing Liu

12.17.2025  Audio MultiChallenge
Topics: Multimodal, Safety, Evaluation and Alignment
Authors: Advait Gosai*, Tyler Vuong*, Utkarsh Tyagi, Steven Li, Wenjia You, Miheer Bavare, Arda Uçar, Zhongwang Fang, Brian Jang, Bing Liu, Yunzhong He (*equal contribution)

11.25.2025  PropensityBench
Topics: Safety, Evaluation and Alignment
Authors: Udari Madhushani Sehwag1*, Shayan Shabihi2*, Alex McAvoy3, Vikash Sehwag4, Yuancheng Xu5, Dalton Towers6, Furong Huang2 (1 Scale AI, 2 University of Maryland, College Park, 3 University of North Carolina at Chapel Hill, 4 Google DeepMind, 5 Netflix, 6 University of Texas at Austin; *equal contribution)

11.13.2025  Professional Reasoning Benchmark
Topics: Safety, Evaluation and Alignment, Reasoning
Authors: Afra Feyza Akyürek, Advait Gosai, Chen Bo Calvin Zhang, Vipul Gupta, Jaehwan Jeong, Anisha Gunjal, Tahseen Rabbani, Maria Mazzone, David Randolph, Mohammad Mahmoudi Meymand, Gurshaan Chattha, Paula Rodriguez, Diego Mares, Pavit Singh, Michael Liu, Subodh Chawla, Pete Cline, Lucy Ogaz, Ernesto Hernandez, Zihao Wang, Pavi Bhatter, Marcos Ayestaran, Bing Liu, and Yunzhong He

11.10.2025  ResearchRubrics: A Benchmark of Prompts and Rubrics For Evaluating Deep Research Agents
Topics: Reasoning, Safety, Evaluation and Alignment
Authors: Manasi Sharma1, Chen Bo Calvin Zhang1, Chaithanya Bandi1, Clinton Wang†, Ankit Aich1, Huy Nghiem2, Tahseen Rabbani3, Ye Htet4, Brian Jang1, Sumana Basu5, Aishwarya Balwani1, Denis Peskoff6, Marcos Ayestaran1, Sean M. Hendryx†, Brad Kenstler1, Bing Liu1 (1 Scale AI, 2 University of Maryland, 3 University of Chicago, 4 Washington University, St. Louis, 5 McGill University, 6 University of California, Berkeley; † work conducted while at Scale AI)

11.05.2025  Best Practices for Biorisk Evaluations on Open-Weight Bio-Foundation Models
Topics: Safety, Evaluation and Alignment
Authors: Boyi Wei1,2*†, Zora Che1,3*†, Nathaniel Li1†, Udari Madhushani Sehwag1, Jasper Götting4, Samira Nedungadi4, Julian Michael1†, Summer Yue1†, Dan Hendrycks5, Peter Henderson2, Zifan Wang1†, Seth Donoughe4, Mantas Mazeika5 (1 Scale AI, 2 Princeton University, 3 University of Maryland, 4 SecureBio, 5 Center for AI Safety; *equal contribution, † work done while at Scale AI)

10.28.2025  Remote Labor Index: Measuring AI Automation of Remote Work
Topics: Agents, Safety, Evaluation and Alignment, Reasoning
Authors: Mantas Mazeika*1, Alice Gatti*1, Cristina Menghini*†, Udari Madhushani Sehwag*2, Shivam Singhal*†, Yury Orlovskiy*1, Steven Basart1, Manasi Sharma2, Denis Peskoff2, Elaine Lau2, Sumana Basu2, Jaehyuk Lim1, Lachlan Carroll1, Alice Blair1, Vinaya Sivakumar1, Brad Kenstler2, Yuntao Ma†, Julian Michael†, Xiaoke Li1, Oliver Ingebretsen1, Aditya Mehta1, Jean Mottola1, John Teichmann‡, Kevin Yu‡, Zaina Shaik‡, Adam Khoja1, Richard Ren1, Jason Hausenloy1, Long Phan1, Connor Smith1, Ye Htet2, Ankit Aich2, Tahseen Rabbani2, Vivswan Shah†, Andriy Novykov1, Felix Binder†, Kirill Chugunov2, Luis Ramirez2, Matias Geralnik2, Hernán Mesura2, Dean Lee2, Ed-Yeremai Hernandez Cardona2, Annette Diamond2, Summer Yue**†, Alexandr Wang**†, Bing Liu**2, Ernesto Hernandez**2, Dan Hendrycks**1 (1 Center for AI Safety, 2 Scale AI; *equal contribution, **senior authors, † work done while at Scale AI, ‡ work done while at CAIS)

10.20.2025  REASONING GYM: Reasoning Environments for Reinforcement Learning with Verifiable Rewards
Topics: Reasoning, Agents, Safety, Evaluation and Alignment
Authors: Zafir Stojanovski1*, Oliver Stanley1,2*, Joe Sharratt1*, Richard Jones1*, Abdulhakeem Adefioye1, Jean Kaddour3†, Andreas Köpf1† (1 Open-Thought, 2 Scale AI, 3 University College London)

10.15.2025  Beyond Seeing: Evaluating Multimodal LLMs On Tool-enabled Image Perception, Transformation, and Reasoning
Topics: Safety, Evaluation and Alignment, Reasoning, Multimodal
Authors: Xingang Guo1,2, Utkarsh Tyagi1, Advait Gosai1, Paula Vergara1, Ernesto Gabriel Hernandez Montoya1, Chen Bo Calvin Zhang1, Bin Hu2, Yunzhong He1, Bing Liu1, Rakshith Sharma Srinivasa1 (1 Scale AI, 2 University of Illinois at Urbana-Champaign)

10.08.2025  Online Rubrics Elicitation from Pairwise Comparisons
Topics: Safety, Evaluation and Alignment, Post-Training
Authors: MohammadHossein Rezaei1,2,*, Robert Vacareanu1, Zihao Wang1, Clinton Wang1, Bing Liu1, Yunzhong He1, and Afra Feyza Akyürek1 (1 Scale AI, 2 University of Arizona; *work done during internship at Scale AI)

09.25.2025  Chasing the Tail: Effective Rubric-based Reward Modeling for Large Language Model Post-Training
Topics: Post-Training, Science of Data
Authors: Junkai Zhang, Zihao Wang, Lin Gui, Swarnashree Mysore Sathyendra, Jaehwan Jeong, Victor Veitch, Wei Wang, Yunzhong He, Bing Liu, Lifeng Jin

09.23.2025  Progress over Points: Reframing LM Benchmarks Around Scientific Objectives
Topics: Safety, Evaluation and Alignment
Authors: Alwin Jin, Sean M. Hendryx, Vaskar Nath

09.19.2025  SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?
Topics: Agents, Safety, Evaluation and Alignment
Authors: Xiang Deng*, Jeff Da*, Edwin Pan, Yannis Yiming He, Charles Ide, Kanak Garg, Niklas Lauffer, Andrew Park, Nitin Pasari, Chetan Rane, Karmini Sampath, Maya Krishnan, Srivatsa Kundurthy, Sean Hendryx, Zifan Wang, Chen Bo Calvin Zhang, Noah Jacobson, Bing Liu, Brad Kenstler (*co-first authors, equal contribution)

09.11.2025  TutorBench: A Benchmark To Assess Tutoring Capabilities Of Large Language Models
Topics: Safety, Evaluation and Alignment
Authors: Rakshith S Srinivasa, Zora Che, Chen Bo Calvin Zhang, Diego Mares, Ernesto Hernandez, Jayeon Park, Dean Lee, Guillermo Mangialardi, Charmaine Ng, Ed-Yeremai Hernandez Cardona, Anisha Gunjal, Yunzhong He, Bing Liu, Chen Xing

08.26.2025  Reliable Weak-to-Strong Monitoring of LLM Agents
Topics: Safety, Evaluation and Alignment, Oversight
Authors: Neil Kale1,2,†, Chen Bo Calvin Zhang1,*, Kevin Zhu1,3,†,*, Ankit Aich1, Paula Rodriguez1, Scale Red Team1, Christina Q. Knight1, and Zifan Wang1 (1 Scale AI, 2 Carnegie Mellon University, 3 Massachusetts Institute of Technology; *equal contribution, † work done during internship at Scale AI)

08.13.2025  Search-Time Data Contamination
Topics: Safety, Evaluation and Alignment, Oversight
Authors: Ziwen Han, Meher Mankikar, Julian Michael, and Zifan Wang

07.23.2025  MultiNRC: A Challenging and Native Multilingual Reasoning Evaluation Benchmark for LLMs
Topics: Reasoning, Safety, Evaluation and Alignment
Authors: Alexander R. Fabbri, Diego Mares, Jorge Flores, Meher Mankikar, Ernesto Hernandez, Dean Lee, Bing Liu, Chen Xing

07.23.2025  Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains
Topics: Science of Data, Post-Training
Authors: Anisha Gunjal, Anthony Wang, Elaine Lau, Vaskar Nath, Bing Liu, Sean Hendryx

07.21.2025  WebGuard: Building a Generalizable Guardrail for Web Agents
Topics: Agents, Safety, Evaluation and Alignment
Authors: Boyuan Zheng1, Zeyi Liao1, Scott Salisbury1, Zeyuan Liu1, Michael Lin1, Qinyuan Zheng1, Zifan Wang2, Xiang Deng2, Dawn Song3, Huan Sun1, Yu Su1 (1 The Ohio State University, 2 Scale AI, 3 University of California, Berkeley)

07.15.2025  Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety
Topics: Reasoning, Oversight, Safety, Evaluation and Alignment
Authors: Tomek Korbak, Mikita Balesni, Elizabeth Barnes, Yoshua Bengio, Joe Benton, Joseph Bloom, Mark Chen, Alan Cooney, Allan Dafoe, Anca Dragan, Scott Emmons, Owain Evans, David Farhi, Ryan Greenblatt, Dan Hendrycks, Marius Hobbhahn, Evan Hubinger, Geoffrey Irving, Erik Jenner, Daniel Kokotajlo, Victoria Krakovna, Shane Legg, David Lindner, David Luan, Aleksander Mądry, Julian Michael, Neel Nanda, Dave Orr, Jakub Pachocki, Ethan Perez, Mary Phuong, Fabien Roger, Joshua Saxe, Buck Shlegeris, Martín Soto, Eric Steinberger, Jasmine Wang, Wojciech Zaremba, Bowen Baker, Rohin Shah, Vlad Mikulik

06.28.2025  Teaching Models to Verbalize Reward Hacking in Chain-of-Thought Reasoning
Topics: Post-Training, Reasoning
Authors: Miles Turpin, Andy Arditi, Marvin Li, Joe Benton, Julian Michael

06.18.2025  FORTRESS: Frontier Risk Evaluation for National Security and Public Safety
Topics: Safety, Evaluation and Alignment
Authors: Christina Q. Knight*, Kaustubh Deshpande⋄, Ved Sirdeshmukh⋄, Meher Mankikar, Scale Red Team, SEAL Research Team, and Julian Michael (*project lead, ⋄ equal contribution)

06.16.2025  Adaptive Guidance Accelerates Reinforcement Learning of Reasoning Models
Topics: Reasoning
Authors: Vaskar Nath, Elaine Lau, Anisha Gunjal, Manasi Sharma, Nikhil Baharte, Sean Hendryx

06.13.2025  Agent-RLVR: Training Software Engineering Agents via Guidance and Environment Rewards
Topics: Agents, Post-Training, Reasoning
Authors: Jeff Da, Clinton Wang, Xiang Deng, Yuntao Ma, Nikhil Barhate, Sean Hendryx

06.05.2025  A Red Teaming Roadmap Towards System-Level Safety
Topics: Safety, Evaluation and Alignment
Authors: Zifan Wang, Christina Q. Knight, Jeremy Kritz, Willow E. Primack, Julian Michael

05.09.2025  Assessing Robustness to Spurious Correlations in Post-Training Language Models
Topics: Post-Training, Science of Data
Authors: Julia Shuieh, Prasann Singhal, Apaar Shanker, John Heyer, George Pu, Samuel Denton

03.14.2025  Relevance Isn't All You Need: Scaling RAG Systems With Inference-Time Compute Via Multi-Criteria Reranking
Topics: Reasoning
Authors: Will LeVine, Bijan Varjavand

03.08.2025  Critical Foreign Policy Decisions (CFPD)-Benchmark: Measuring Diplomatic Preferences in Large Language Models
Topics: Safety, Evaluation and Alignment
Authors: Benjamin Jensen, Ian Reynolds, Yasir Atalan, Michael Garcia, Austin Woo, Anthony Chen, Trevor Howarth

03.05.2025  The MASK Benchmark: Disentangling Honesty From Accuracy in AI Systems
Topics: Safety, Evaluation and Alignment
Authors: Richard Ren*1, Arunim Agarwal*1, Mantas Mazeika*1, Cristina Menghini*2, Robert Vacareanu2, Brad Kenstler2, Mick Yang1, Isabelle Barrass1, Alice Gatti1, Xuwang Yin1, Eduardo Trevino2, Matias Geralnik2, Adam Khoja1, Dean Lee2, Summer Yue2, Dan Hendrycks1 (1 Center for AI Safety, 2 Scale AI; *equal contribution)

02.13.2025  ENIGMAEVAL: A Benchmark of Long Multimodal Reasoning Challenges
Topics: Reasoning, Safety, Evaluation and Alignment
Authors: Clinton J. Wang1, Dean Lee1, Cristina Menghini1, Johannes Mols1, Jack Doughty1, Adam Khoja2, Jayson Lynch3, Sean Hendryx1, Summer Yue1, Dan Hendrycks2 (1 Scale AI, 2 Center for AI Safety, 3 MIT)

02.11.2025  J2: Jailbreaking to Jailbreak
Topics: Safety, Evaluation and Alignment
Authors: Jeremy Kritz, Vaughn Robinson, Robert Vacareanu, Bijan Varjavand, Michael Choi, Bobby Gogov, Scale Red Team, Summer Yue, Willow E. Primack, Zifan Wang

02.10.2025  ProjectTest: A Project-level LLM Unit Test Generation Benchmark and Impact of Error Fixing Mechanisms
Topics: Safety, Evaluation and Alignment
Authors: Yibo Wang, Congying Xia, Wenting Zhao, Jiangshu Du, Chunyu Miao, Zhongfen Deng, Philip S. Yu, Chen Xing

01.29.2025  MultiChallenge: A Realistic Multi-Turn Conversation Evaluation Benchmark Challenging to Frontier LLMs
Topics: Safety, Evaluation and Alignment, Reasoning
Authors: Ved Sirdeshmukh*, Kaustubh Deshpande*, Johannes Mols*, Lifeng Jin, Ed-Yeremai Cardona, Dean Lee, Jeremy Kritz, Willow Primack, Summer Yue, Chen Xing (*equal contribution)

01.23.2025  Humanity's Last Exam
Topics: Safety, Evaluation and Alignment, Reasoning
Authors: Long Phan*1, Alice Gatti*1, Ziwen Han*2, Nathaniel Li*1, Josephina Hu2, Hugh Zhang‡, Sean Shi2, Michael Choi2, Anish Agrawal2, Arnav Chopra2, Adam Khoja1, Ryan Kim†, Richard Ren1, Jason Hausenloy1, Oliver Zhang1, Mantas Mazeika1, Summer Yue**2, Alexandr Wang**2, Dan Hendrycks**1 (1 Center for AI Safety, 2 Scale AI; *co-first authors, **senior authors, † work conducted while at Center for AI Safety, ‡ work conducted while at Scale AI). Refer to the PDF for the full list of dataset contributors.

01.02.2025  ToolComp: A Multi-Tool Reasoning & Process Supervision Benchmark
Topics: Safety, Evaluation and Alignment, Reasoning, Oversight
Authors: Vaskar Nath, Pranav Raja, Claire Yoon, Sean Hendryx

10.11.2024  Refusal-Trained LLMs Are Easily Jailbroken As Browser Agents
Topics: Safety, Evaluation and Alignment
Authors: Priyanshu Kumar1, Elaine Lau3, Saranya Vijayakumar1, Tu Trinh3, Scale Red Team3, Elaine Chang3, Vaughn Robinson3, Sean Hendryx3, Shuyan Zhou1, Matt Fredrikson1,2, Summer Yue3, Zifan Wang3 (1 Carnegie Mellon University, 2 GraySwan AI, 3 Scale AI)

09.29.2024  Balancing Cost and Effectiveness of Synthetic Data Generation Strategies for LLMs
Topics: Post-Training, Science of Data
Authors: Yung-Chieh Chan*, George Pu*, Apaar Shanker, Parth Suresh, Penn Jenks, John Heyer, Sam Denton (*equal contribution; work was done while Yung-Chieh was interning at Scale AI)

09.27.2024  Revisiting the Superficial Alignment Hypothesis
Topics: Post-Training
Authors: Mohit Raghavendra°1, Vaskar Nath2, Sean Hendryx2 (1 Georgia Institute of Technology, 2 Scale AI; ° work conducted while at Scale AI)

09.05.2024  Planning In Natural Language Improves LLM Search For Code Generation
Topics: Post-Training
Authors: Evan Wang1,2, Federico Cassano°3,4, Catherine Wu°, Yunfeng Bai1, Will Song1, Vaskar Nath1, Ziwen Han1, Sean Hendryx1, Summer Yue1, Hugh Zhang1 (1 Scale AI, 2 California Institute of Technology, 3 Northeastern University, 4 Cursor AI; ° work conducted while at Scale AI)

08.30.2024  Pre-Training Multimodal Hallucination Detectors with Corrupted Grounding Data
Topics: Safety, Evaluation and Alignment, Multimodal, Science of Data
Authors: Spencer Whitehead, Jacob Phillips, Sean Hendryx

08.27.2024  LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet
Topics: Safety, Evaluation and Alignment
Authors: Nathaniel Li1,2, Ziwen Han1, Ian Steneker1, Willow Primack1, Riley Goodside1, Hugh Zhang1, Zifan Wang1, Cristina Menghini1, Summer Yue1 (1 Scale AI, 2 UC Berkeley)

07.18.2024  Learning Goal-Conditioned Representations for Language Reward Models
Topics: Post-Training
Authors: Vaskar Nath*†, Dylan Slack*, Jeff Da, Yuntao Ma, Hugh Zhang, Spencer Whitehead‡, Sean Hendryx‡ (*equal contribution, † corresponding author: vaskar.nath@scale.com, ‡ equal senior authorship)

05.01.2024  A Careful Examination of Large Language Model Performance on Grade School Arithmetic
Topics: Safety, Evaluation and Alignment
Authors: Hugh Zhang, Jeff Da, Dean Lee, Vaughn Robinson, Catherine Wu, Will Song, Tiffany Zhao, Pranav Raja, Dylan Slack, Qin Lyu, Sean Hendryx, Russell Kaplan, Michele (Mike) Lunati†, Summer Yue†

03.05.2024  The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning
Topics: Safety, Evaluation and Alignment, Post-Training
Authors: Nathaniel Li, Alexander Pan, Anjali Gopal, Summer Yue, Daniel Berrios, Alice Gatti, Justin D. Li, Ann-Kathrin Dombrowski, Shashwat Goel, Long Phan, Gabriel Mukobi, Nathan Helm-Burger, Rassin Lababidi, Lennart Justen, Andrew B. Liu, Michael Chen, Isabelle Barrass, Oliver Zhang, Xiaoyuan Zhu, Rishub Tamirisa, Bhrugu Bharathi, Adam Khoja, Zhenqi Zhao, Ariel Herbert-Voss, Cort B. Breuer, Samuel Marks, Oam Patel, Andy Zou, Mantas Mazeika, Zifan Wang, Palash Oswal, Weiran Lin, Adam A. Hunt, Justin Tienken-Harder, Kevin Y. Shih, Kemper Talley, John Guan, Russell Kaplan, Ian Steneker, David Campbell, Brad Jokubaitis, Alex Levinson, Jean Wang, William Qian, Kallol Krishna Karmakar, Steven Basart, Stephen Fitz, Mindy Levine, Ponnurangam Kumaraguru, Uday Tupakula, Vijay Varadharajan, Ruoyu Wang, Yan Shoshitaishvili, Jimmy Ba, Kevin M. Esvelt, Alexandr Wang, Dan Hendrycks

01.22.2024  Out-of-Distribution Detection & Applications With Ablated Learned Temperature Energy
Topics: Computer Vision
Authors: Will LeVine, Benjamin Pikus, Jacob Phillips, Berk Norman, Fernando Amat Gil, Sean Hendryx

11.21.2023  A Baseline Analysis of Reward Models' Ability To Accurately Analyze Foundation Models Under Distribution Shift
Topics: Post-Training
Authors: Will LeVine, Benjamin Pikus, Anthony Chen, Sean Hendryx

10.05.2023  A Holistic Approach For Test And Evaluation Of Large Language Models
Topics: Safety, Evaluation and Alignment
Authors: Dylan Slack*, Jean Wang*, Denis Semenenko*, Kate Park, Sean Hendryx (*equal contribution)

10.04.2023  On the Performance of Multimodal Language Models
Topics: Multimodal, Post-Training
Authors: Utsav Garg, Erhan Bas

04.28.2023  Empirical Analysis of the Strengths and Weaknesses of PEFT Techniques for LLMs
Topics: Post-Training
Authors: George Pu, Anirudh Jain, Jihan Yin, Russell Kaplan

04.11.2023  Detecting and Preventing Hallucinations in Large Vision Language Models
Topics: Computer Vision
Authors: Anisha Gunjal*, Jihan Yin*, Erhan Bas† (*equal contribution, † work done at Scale AI)

03.11.2023  Enabling Calibration In The Zero-shot Inference Of Large Vision-Language Models
Topics: Computer Vision
Authors: Will Levine†, Benjamin Pikus†, Pranav Raja & Fernando Amat Gil († equal contribution)

01.29.2023  Improving the Accuracy-Robustness Trade-Off of Classifiers via Adaptive Smoothing
Topics: Safety, Evaluation and Alignment
Authors: Yatong Bai, Brendon G. Anderson, Aerin Kim, Somayeh Sojoudi

03.07.2022  GlideNet: Global, Local and Intrinsic based Dense Embedding NETwork for Multi-category Attributes Prediction
Topics: Computer Vision
Authors: Kareem Metwaly, Aerin Kim, Elliot Branson, Vishal Monga

11.16.2021  CAR – Cityscapes Attributes Recognition: A Multi-category Attributes Dataset for Autonomous Vehicles
Topics: Computer Vision
Authors: Kareem Metwaly, Aerin Kim, Elliot Branson, Vishal Monga

11.07.2021  Natural Adversarial Objects
Topics: Computer Vision
Authors: Felix Lau, Nishant Subramani, Sasha Harrison, Aerin Kim, Elliot Branson, Rosanne Liu

10.11.2021  DEBAGREEMENT: A comment-reply dataset for (dis)agreement detection in online debates
Topics: Safety, Evaluation and Alignment
Authors: John Pougué-Biyong*, Valentina Semenova*, Alexandre Matton, Rachel Han, Aerin Kim, Renaud Lambiotte, J. Doyne Farmer (*equal contribution)

07.31.2021  On The State of Data In Computer Vision: Human Annotations Remain Indispensable for Developing Deep Learning Models
Topics: Computer Vision, Science of Data
Authors: Zeyad Emam1,2, Andrew Kondrich1, Sasha Harrison1, Felix Lau1, Yushi Wang1, Aerin Kim1, Elliot Branson1

04.20.2021  Evaluating Deep Neural Networks Trained on Clinical Images in Dermatology with the Fitzpatrick 17k Dataset
Topics: Computer Vision
Authors: Matthew Groh, Caleb Harris, Luis Soenksen, Felix Lau, Rachel Han, Aerin Kim, Arash Koochek, Omar Badri

11.27.2020  A Survey of Deep Learning Approaches for OCR and Document Understanding
Topics: Computer Vision
Authors: Nishant Subramani, Alexandre Matton, Malcolm Greaves, Adrian Lam
63 papers found