Leaderboard snapshot (top three scores per benchmark, reported as score ± uncertainty; model names shown where available):

- Challenging LLMs at the frontier of human knowledge: 25.32±1.70, 21.64±1.61, 20.32±1.58
- Challenging LLMs at the frontier of human knowledge: 26.32±1.86, 22.06±1.75, 20.57±1.71
- Multilingual Native Reasoning Evaluation Benchmark for LLMs: 52.13±3.01, 49.00±3.02, 45.50±3.00 (leading entries include o3-pro-2025-06-10-high and o3-2025-04-16-high)
- Assessing models across diverse, interdisciplinary challenges: 63.77±1.53, 58.55±3.03, 59.09±1.08
- Frontier Risk Evaluation for National Security and Public Safety: 8.24±1.93, 12.96±2.34, 14.79±2.49
- Evaluate model honesty when pressured to lie: 95.33±2.29, 94.20±1.79, 92.00±0.86 (leading entries include Claude Sonnet 4 (Thinking))
- Evaluating model performance on complex, multi-step reasoning tasks: 13.09±1.92, 11.91±1.85, 10.47±1.74
- Vision-Language Understanding benchmark for multimodal models: 54.65±1.46, 54.63±0.55, 51.79±0.63
- Evaluating model performance on common tutoring tasks for high school and AP-level subjects: gemini-2.5-pro-preview-06-05 (55.65±1.11), gpt-5-2025-08-07 (55.33±1.02), o3-pro-2025-06-10 (54.62±1.02)
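The ± values above are the uncertainty reported alongside each headline score. As a rough illustration, the sketch below turns per-prompt pass/fail grades into a score in that "mean ± interval" format, assuming a 95% confidence interval under a normal approximation; the function name, sample data, and methodology are illustrative assumptions rather than the leaderboard's exact procedure.

```python
import math

def score_with_ci(per_prompt_scores, z=1.96):
    """Aggregate per-prompt grades (0.0-1.0) into a 'mean ± half-width' string,
    in percentage points, using a normal-approximation 95% confidence interval.
    Illustrative only: the leaderboard's actual interval methodology may differ
    (e.g., bootstrap resampling)."""
    n = len(per_prompt_scores)
    mean = sum(per_prompt_scores) / n
    variance = sum((s - mean) ** 2 for s in per_prompt_scores) / (n - 1)
    half_width = z * math.sqrt(variance / n)
    return f"{100 * mean:.2f}±{100 * half_width:.2f}"

# Hypothetical example: 500 pass/fail grades for one model on one benchmark.
grades = [1.0] * 127 + [0.0] * 373
print(score_with_ci(grades))  # -> "25.40±3.82"
```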
We conduct high-complexity evaluations to expose model failures, prevent benchmark saturation, and push model capabilities, while continuously evaluating the latest frontier models.
Humans design complex evaluations and define precise criteria to assess models, while LLMs scale evaluations—ensuring efficiency and alignment with human judgment.
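To make that division of labor concrete, here is a minimal sketch of a rubric-graded evaluation loop in which humans author the prompts and precise grading criteria, and an LLM judge applies those criteria to every response. The `EvalItem`, `judge`, and `evaluate` names and the toy data are hypothetical illustrations, not our internal pipeline.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalItem:
    prompt: str            # human-written task
    criteria: list[str]    # human-defined, precise grading criteria

def judge(llm: Callable[[str], str], response: str, item: EvalItem) -> float:
    """Have an LLM judge check each human-defined criterion; return the fraction satisfied."""
    satisfied = 0
    for criterion in item.criteria:
        verdict = llm(
            "You are a strict grader. Answer only YES or NO.\n"
            f"Criterion: {criterion}\n\nResponse to grade:\n{response}"
        )
        satisfied += verdict.strip().upper().startswith("YES")
    return satisfied / len(item.criteria)

def evaluate(model_under_test: Callable[[str], str],
             judge_llm: Callable[[str], str],
             items: list[EvalItem]) -> float:
    """Average rubric score across the eval set (0.0-1.0)."""
    scores = [judge(judge_llm, model_under_test(item.prompt), item) for item in items]
    return sum(scores) / len(scores)

# Toy usage with stand-in callables; in practice both would wrap a chat-completion API.
items = [EvalItem(prompt="Explain why the sky is blue.",
                  criteria=["Mentions Rayleigh scattering", "Is factually correct"])]
mock_model = lambda prompt: "Rayleigh scattering makes shorter wavelengths dominate."
mock_judge = lambda grading_prompt: "YES"
print(evaluate(mock_model, mock_judge, items))  # -> 1.0
```

In practice, judge verdicts are themselves spot-checked by humans so that the scaled grading stays aligned with the original human-defined criteria.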
Our leaderboards are built on carefully curated evaluation sets, combining private datasets to prevent overfitting and open-source datasets for broad benchmarking and comparability.
If you'd like to add your model to this leaderboard or a future version, please contact leaderboards@scale.com. To ensure leaderboard integrity, we require that a model be featured only the first time its organization encounters the prompts.