
LLM Leaderboard

SEAL LLM Leaderboards evaluate frontier LLM capabilities, providing insight into model performance through robust datasets and precise evaluation criteria that benchmark the latest AI advancements.

Humanity's Last Exam

Challenging LLMs at the frontier of human knowledge

1. 25.32±1.70
2. 21.64±1.61
2. 20.32±1.58

Humanity's Last Exam (Text Only)

Challenging LLMs at the frontier of human knowledge

1. 26.32±1.86
2. 22.06±1.75
2. 20.57±1.71

MultiNRC

Multilingual Native Reasoning Evaluation Benchmark for LLMs

1. 52.13±3.01
1. o3-pro-2025-06-10-high: 49.00±3.02
2. o3-2025-04-16-high: 45.50±3.00

MultiChallenge

Assessing models across diverse, interdisciplinary challenges

1. 63.77±1.53
2. 58.55±3.03
3. 59.09±1.08

Fortress

Frontier Risk Evaluation for National Security and Public Safety

1. 8.24±1.93
2. 12.96±2.34
2. 14.79±2.49

MASK

Evaluating model honesty when pressured to lie

1. Claude Sonnet 4 (Thinking): 95.33±2.29
1. 94.20±1.79
2. 92.00±0.86

EnigmaEval

Evaluating model performance on complex, multi-step reasoning tasks

1. 13.09±1.92
1. 11.91±1.85
1. 10.47±1.74

VISTA

Vision-Language Understanding benchmark for multimodal models

1. 54.65±1.46
1. 54.63±0.55
3. 51.79±0.63

TutorBench

Evaluating model performance on common tutoring tasks for high school and AP-level subjects

1. gemini-2.5-pro-preview-06-05: 55.65±1.11
1. gpt-5-2025-08-07: 55.33±1.02
1. o3-pro-2025-06-10: 54.62±1.02

Frontier AI Model Evaluations & Benchmarks

We conduct high-complexity evaluations to expose model failures, prevent benchmark saturation, and push model capabilities, while continuously evaluating the latest frontier models.

Scaling with Human Expertise

Human experts design complex evaluations and define precise criteria for assessing models, while LLMs scale those evaluations, ensuring efficiency and alignment with human judgment.
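
As a rough illustration of this human-criteria-plus-LLM-grader pattern, the sketch below shows what such a grading loop could look like in Python. This is a minimal sketch, not Scale's actual pipeline: the EvalItem structure, the judge callable, and the PASS/FAIL rubric format are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalItem:
    prompt: str          # task written by a human expert
    rubric: str          # human-defined grading criteria
    model_response: str  # response from the model under evaluation


def grade_items(items: List[EvalItem], judge: Callable[[str], str]) -> List[int]:
    """Apply the human-written rubric to each response via an LLM judge.

    `judge` is any callable that takes a grading prompt and returns text;
    it is assumed here to answer with PASS or FAIL.
    """
    grades = []
    for item in items:
        grading_prompt = (
            "You are grading a model response against a rubric.\n"
            f"Rubric:\n{item.rubric}\n\n"
            f"Response:\n{item.model_response}\n\n"
            "Reply with PASS or FAIL only."
        )
        verdict = judge(grading_prompt)
        grades.append(1 if "PASS" in verdict.upper() else 0)
    return grades


def toy_judge(grading_prompt: str) -> str:
    # Stand-in for a real LLM call, used only to make the sketch runnable.
    return "PASS" if "Response:\n4\n" in grading_prompt else "FAIL"


if __name__ == "__main__":
    items = [
        EvalItem("What is 2 + 2?", "Correct iff the final answer is 4.", "4"),
        EvalItem("What is 2 + 2?", "Correct iff the final answer is 4.", "5"),
    ]
    print(grade_items(items, toy_judge))  # [1, 0]
```

In practice the toy_judge placeholder would be replaced by a call to a grading model, and the human-written rubric is what keeps that automated grading anchored to expert judgment.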

Robust Datasets for Reliable AI Benchmarks

Our leaderboards are built on carefully curated evaluation sets, combining private datasets to prevent overfitting and open-source datasets for broad benchmarking and comparability.
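
The scores on each leaderboard card are reported as a central value with a ± term. One plausible reading, and it is only an assumption since the page does not state how the intervals are computed, is that each entry is the mean of per-prompt scores and the ± term is a standard error of the mean (it could instead be a confidence half-width). Under that assumption, the aggregation could be sketched as:

```python
import math
from typing import List


def leaderboard_score(per_item_scores: List[float]) -> str:
    """Aggregate per-prompt scores (e.g., 0/1 grades) into 'mean±SE' on a 0-100 scale."""
    n = len(per_item_scores)
    mean = sum(per_item_scores) / n
    # Sample variance with Bessel's correction, then standard error of the mean.
    variance = sum((s - mean) ** 2 for s in per_item_scores) / (n - 1)
    std_err = math.sqrt(variance / n)
    return f"{100 * mean:.2f}±{100 * std_err:.2f}"


# Example: 2,500 graded prompts with a 25% pass rate.
print(leaderboard_score([1.0] * 625 + [0.0] * 1875))  # 25.00±0.87
```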

Run evaluations on frontier AI capabilities

If you'd like to add your model to this leaderboard or a future version, please contact leaderboards@scale.com. To ensure leaderboard integrity, we require that a model be featured only the first time its organization encounters the evaluation prompts.