LLMs often struggle with a core scientific capability: forecasting experimental outcomes. Doing so requires integrating experimental design, background knowledge, and causal reasoning to anticipate real-world results. SciPredict evaluates this capability directly, testing whether frontier models can predict the outcomes of real experiments in physics, biology, and chemistry rather than rely on theoretical recall or simulated tasks.

For each experiment, experts distill the experimental setup and separate it from the reported outcome. Relevant background knowledge is explicitly annotated, enabling evaluation with and without external context. Models also self-report confidence, perceived difficulty, and feasibility (whether an outcome can be predicted without running the physical experiment), enabling analysis of calibration alongside predictive accuracy. All questions are drawn from recent scientific literature, ensuring models must reason about new science rather than recite training data.
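To make the idea of analyzing calibration alongside accuracy concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not SciPredict's actual scoring code: the `Prediction` record, its field names, the 0-to-1 confidence scale, and the use of expected calibration error (ECE) with equal-width bins are all hypothetical choices made for this example.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    # Hypothetical record layout; field names are assumptions for this sketch.
    correct: bool        # did the predicted outcome match the reported outcome?
    confidence: float    # self-reported confidence, assumed to lie in [0, 1]
    with_context: bool   # was the annotated background knowledge supplied?


def accuracy(preds):
    """Fraction of predictions that matched the reported outcome."""
    if not preds:
        raise ValueError("accuracy() needs at least one prediction")
    return sum(p.correct for p in preds) / len(preds)


def expected_calibration_error(preds, n_bins=5):
    """Weighted mean gap between confidence and accuracy over equal-width bins."""
    if not preds:
        raise ValueError("expected_calibration_error() needs at least one prediction")
    bins = [[] for _ in range(n_bins)]
    for p in preds:
        # Clamp so that a confidence of exactly 1.0 lands in the last bin.
        idx = min(int(p.confidence * n_bins), n_bins - 1)
        bins[idx].append(p)
    ece = 0.0
    for b in bins:
        if not b:
            continue  # empty bins contribute nothing
        avg_conf = sum(p.confidence for p in b) / len(b)
        ece += (len(b) / len(preds)) * abs(accuracy(b) - avg_conf)
    return ece


# Toy data invented for illustration only (not benchmark results).
preds = [
    Prediction(correct=True, confidence=0.9, with_context=True),
    Prediction(correct=False, confidence=0.9, with_context=True),
    Prediction(correct=True, confidence=0.5, with_context=False),
    Prediction(correct=False, confidence=0.5, with_context=False),
]

acc_with = accuracy([p for p in preds if p.with_context])
acc_without = accuracy([p for p in preds if not p.with_context])
ece = expected_calibration_error(preds)
```

A model that is well calibrated would show a small gap: predictions made at 90% confidence would be right about 90% of the time. Splitting on `with_context` mirrors the comparison between evaluation with and without the annotated background knowledge.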

SciPredict

Dataset Design

Methodology

Evaluation Modes

Core Metrics

Data Summary

How to Read the Leaderboard

Performance Comparison