These leaderboards include benchmarks on agentic coding, frontier reasoning, and safety alignment, covering models from leading AI labs, including OpenAI, Anthropic, Google, and Meta, as well as open-source contributors.

Evaluating deep code comprehension and reasoning
1. gpt-5.4-codex (xHigh) (Codex CLI): 35.48 ± 8.70
2. claude-opus-4.6 Thinking (Claude Code Harness)^: 31.50 ± 8.62
3. gpt-5.2-2025-12-11 (High) (SWE-Agent): 29.03 ± 8.53

Evaluating real-world tool use through the Model Context Protocol (MCP)
1. claude-opus-4-5-20251101: 62.30 ± 1.76
2. gpt-5.2-2025-12-11: 60.57 ± 1.62
3. gemini-3-flash-preview: 57.40 ± 1.48

Evaluating long-horizon software engineering tasks in public open source repositories
1. claude-opus-4-5-20251101: 45.89 ± 3.60
2. claude-4-5-Sonnet: 43.60 ± 3.60
3. gemini-3-pro-preview: 43.30 ± 3.60

Evaluating long-horizon software engineering tasks in commercial-grade private repositories
1. gpt-5.2-2025-12-11: 23.81 ± 5.09
2. claude-opus-4-5-20251101: 23.44 ± 5.07
3. gemini-3-pro-preview: 17.95 ± 4.78

Forecasting scientific experiment outcomes
1. gemini-3-pro-preview: 25.27 ± 1.92
2. claude-opus-4-5-20251101: 23.05 ± 0.51
3. claude-opus-4-1-20250805: 22.22 ± 1.48

Challenging LLMs at the frontier of human knowledge
1. 37.52 ± 1.90
2. 34.44 ± 1.86
3. 31.64 ± 1.82

Challenging LLMs at the frontier of human knowledge
1. 37.72 ± 2.04
2. 36.24 ± 2.03
3. 33.32 ± 1.99

Evaluating spoken dialogue systems in multi-turn interaction
1. gemini-3-pro-preview (Thinking)*: 54.65 ± 4.57
2. gemini-2.5-pro (Thinking)*: 46.90 ± 4.58
3. gemini-2.5-flash (Thinking)*: 40.04 ± 4.50

Evaluating spoken dialogue systems in multi-turn interaction
1. gpt-realtime-1.5: 34.73 ± 4.38
2. Qwen3-Omni-30B-A3B-Instruct: 24.34 ± 3.95
3. gpt-4o-audio-preview-2025-06-03: 23.23 ± 3.88

Evaluating professional reasoning in finance
1. claude-opus-4-6 (Non-Thinking): 53.28 ± 0.18
2. gpt-5: 51.32 ± 0.17
3. gpt-5-pro: 51.06 ± 0.59

Evaluating professional reasoning in legal practice
1. claude-opus-4-6 (Non-Thinking): 52.27 ± 0.66
2. gpt-5-pro: 49.89 ± 0.36
3. o3-pro: 49.67 ± 0.50

Evaluating AI agents' ability to perform real-world, economically valuable remote work
1. claude-opus-4-6 (CoWork): 4.17 ± 0.00
2. claude-opus-4-5-20251101-thinking: 3.75 ± 0.00
3. Manus_1.6 (Max): 2.92 ± 0.00

Simulating real-world pressure to choose between safe and harmful behavior
1. o3-2025-04-16: 10.50 ± 0.60
2. claude-sonnet-4-20250514: 12.20 ± 0.20
3. o4-mini-2025-04-16: 15.80 ± 0.40

Evaluating how LLMs can dynamically interact with and reason about visual information
1. gemini-3-pro-preview: 26.85 ± 0.54
2. gpt-5-2025-08-07-thinking: 18.68 ± 0.25
3. gpt-5-2025-08-07: 16.96 ± 0.06

Multilingual native reasoning evaluation benchmark for LLMs
1. 65.20 ± 1.24
2. 58.96 ± 2.97
3. 57.06 ± 2.99

Assessing models across diverse, interdisciplinary challenges
1. gemini-3-pro-preview: 65.67 ± 2.20
2. gpt-5.1-2025-11-13-thinking: 63.41 ± 2.11
3. gpt-5-thinking: 63.19 ± 1.63

Frontier risk evaluation for national security and public safety
1. 8.24 ± 1.93
2. 9.63 ± 2.11
3. 12.80 ± 2.36

Evaluating model honesty when pressured to lie
1. 96.28 ± 0.41
2. 96.13 ± 0.57
3. Claude Sonnet 4 (Thinking): 95.33 ± 2.29

Evaluating model performance on complex, multi-step reasoning tasks
1. 18.75 ± 2.22
2. 18.24 ± 2.20
3. 13.09 ± 1.92

Vision-language understanding benchmark for multimodal models
1. Gemini 2.5 Pro Experimental (March 2025): 54.65 ± 1.46
2. gemini-2.5-pro-preview-06-05: 54.63 ± 0.55
3. gpt-5-pro-2025-10-06: 52.39 ± 1.07

Evaluating model performance on common tutoring tasks for high school and AP-level subjects
1. gemini-2.5-pro-preview-06-05: 55.65 ± 1.11
2. gpt-5-2025-08-07: 55.33 ± 1.02
3. o3-pro-2025-06-10: 54.62 ± 1.02

We conduct high-complexity evaluations to expose model failures, prevent benchmark saturation, and push model capabilities, all while continuously evaluating the latest frontier models.
Humans design complex evaluations and define precise criteria for assessing models, while LLMs scale the evaluation process, ensuring both efficiency and alignment with human judgment.
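A minimal sketch of that division of labor is below, assuming a rubric of human-written criteria that an LLM judge applies to each model response. The rubric contents, the `call_llm` stub, and the 0-100 scoring convention are hypothetical placeholders, not Scale's actual grading pipeline.

```python
# Sketch: human-authored criteria, applied at scale by an LLM judge (hypothetical pipeline).
from dataclasses import dataclass

@dataclass
class Criterion:
    name: str
    description: str  # written and reviewed by a human expert
    weight: float

RUBRIC = [
    Criterion("correctness", "The final answer matches the reference solution.", 0.6),
    Criterion("reasoning", "The intermediate steps are sound and support the answer.", 0.4),
]

def call_llm(prompt: str) -> str:
    # Stand-in for a judge-model API call; a real pipeline would query an LLM here.
    return "yes"

def judge(response: str, reference: str) -> float:
    """Score one model response against the rubric; returns a 0-100 value."""
    total = 0.0
    for criterion in RUBRIC:
        prompt = (
            f"Criterion: {criterion.description}\n"
            f"Reference answer: {reference}\n"
            f"Model response: {response}\n"
            "Answer strictly 'yes' or 'no': does the response satisfy the criterion?"
        )
        verdict = call_llm(prompt).strip().lower()
        total += criterion.weight * (1.0 if verdict.startswith("yes") else 0.0)
    return 100.0 * total

print(judge("The derivative of x^2 is 2x, by the power rule.", "2x"))  # 100.0 with the stub judge
```

Averaging such per-response scores across a benchmark's tasks, with human spot-checks of the judge's verdicts, is one way human-designed criteria can be applied at LLM speed.
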
Our leaderboards are built on carefully curated evaluation sets, combining private datasets, which guard against overfitting, with open-source datasets for broad benchmarking and comparability.
If you'd like to add your model to this leaderboard or a future version, please contact leaderboards@scale.com. To ensure leaderboard integrity, models can only be featured the first time an organization encounters the prompts.