
LLM Leaderboards

Discover the SEAL LLM Leaderboards, where leading large language models (LLMs) are ranked using a rigorous evaluation methodology for precise and reliable results.

Developed by Scale’s Safety, Evaluations, and Alignment Lab (SEAL), these leaderboards utilize private datasets to guarantee fair and uncontaminated results. Regular updates ensure the leaderboard reflects the latest in AI advancements, making it an essential resource for understanding the performance and safety of top LLMs.

Private Datasets

Scale’s proprietary, private evaluation datasets can’t be gamed, ensuring unbiased and uncontaminated results.

Evolving Competition

We periodically update leaderboards with new datasets and models, fostering a dynamic, contest-like environment.

Expert Evaluations

Our evaluations are performed by thoroughly vetted experts using domain-specific methodologies, ensuring the highest quality and credibility.

Learn more about our LLM evaluation methodology.

Agentic Tool Use (Chat)

Score    95% Confidence
56.85    +6.92 / -6.92
56.06    +6.91 / -6.91
55.10    +6.96 / -6.96
53.03    +6.95 / -6.95
51.27    +6.98 / -6.98
49.50    +6.96 / -6.96
48.49    +6.96 / -6.96
40.40    +6.84 / -6.84
40.40    +6.84 / -6.84
40.10    +6.84 / -6.84
37.88    +6.78 / -6.78
35.50    +6.57 / -6.68
33.50    +6.59 / -6.59
32.83    +6.54 / -6.54
20.20    +5.59 / -5.59
 6.09    +3.34 / -3.34
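
The second column in these tables is a 95% confidence interval on each model's score. As an illustration of how such an interval can be derived (a percentile-bootstrap sketch over hypothetical per-prompt pass/fail grades — not necessarily Scale's exact procedure; the prompt counts below are made up):

```python
import random

def bootstrap_ci(per_prompt_scores, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for a mean score,
    returned as (point, +upper, -lower) offsets like the tables above."""
    rng = random.Random(seed)
    n = len(per_prompt_scores)
    # Resample the per-prompt grades with replacement and record each mean.
    means = sorted(
        sum(rng.choices(per_prompt_scores, k=n)) / n
        for _ in range(n_resamples)
    )
    point = sum(per_prompt_scores) / n
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return point, hi - point, point - lo

# Hypothetical run: 200 graded prompts, 113 passes (counts are illustrative).
scores = [1.0] * 113 + [0.0] * 87
point, plus, minus = bootstrap_ci(scores)
```

Note that the upper and lower offsets need not match exactly (e.g. the +6.57 / -6.68 row above), since the resampled distribution can be slightly skewed.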

Agentic Tool Use (Enterprise)

Score    95% Confidence
66.43    +5.47 / -5.47
64.58    +5.52 / -5.52
60.76    +5.64 / -5.64
60.28    +5.66 / -5.66
59.93    +5.67 / -5.67
59.38    +5.67 / -5.67
54.17    +5.78 / -5.78
52.78    +5.77 / -5.78
51.74    +5.77 / -5.77
51.39    +5.77 / -5.77
50.35    +5.78 / -5.78
50.35    +5.78 / -5.78
40.42    +5.68 / -5.68
37.23    +5.60 / -5.60
30.21    +5.30 / -5.30
17.42    +4.39 / -4.39

Coding

Score    95% Confidence
1265     +40 / -32
1195     +32 / -32
1115     +24 / -24
1086     +28 / -31
1076     +26 / -26
1074     +22 / -23
1073     +29 / -29
1072     +28 / -27
1062     +25 / -25
1022     +27 / -24
1020     +30 / -34
 995     +22 / -23
 972     +27 / -25
 931     +27 / -30
 916     +27 / -29
 912     +24 / -25
 852     +28 / -28
 726     +33 / -33
 636     +37 / -39
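
Unlike the 0-100 accuracy-style scores elsewhere on this page, the Coding and Spanish numbers sit on a roughly 600-1300 scale, which is consistent with Elo-style ratings fit from pairwise model-vs-model comparisons. Purely as an illustration of how such ratings move (a minimal online Elo update with an assumed K-factor of 32 — not a description of Scale's actual fitting procedure):

```python
def elo_expected(r_a, r_b):
    """Expected win probability of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a, r_b, outcome_a, k=32):
    """Update both ratings after one comparison.
    outcome_a is 1.0 (A preferred), 0.0 (B preferred), or 0.5 (tie)."""
    e_a = elo_expected(r_a, r_b)
    r_a_new = r_a + k * (outcome_a - e_a)
    r_b_new = r_b + k * ((1 - outcome_a) - (1 - e_a))
    return r_a_new, r_b_new

# Two models starting at 1000; A wins one pairwise comparison.
a, b = elo_update(1000, 1000, 1.0)  # a -> 1016.0, b -> 984.0
```

On this scale a 400-point gap corresponds to roughly 10:1 preference odds, which is why the spread between the top and bottom Coding entries is so meaningful.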

Instruction Following

Score    95% Confidence
87.32    +1.71 / -1.71
87.09    +1.51 / -1.52
86.01    +1.54 / -1.53
85.29    +1.61 / -1.61
85.09    +1.83 / -1.83
84.63    +1.81 / -1.82
83.87    +1.42 / -1.43
83.72    +1.88 / -1.88
81.85    +1.96 / -1.96
81.32    +1.75 / -1.75
80.77    +1.84 / -1.83
80.49    +1.72 / -1.72
80.03    +1.57 / -1.58
78.52    +2.33 / -2.32
78.24    +2.19 / -2.19
77.25    +1.96 / -1.97
67.97    +2.61 / -2.62
57.69    +2.58 / -2.57

Spanish

Score    95% Confidence
1130     +32 / -30
1106     +24 / -24
1090     +26 / -26
1089     +29 / -33
1080     +31 / -27
1051     +21 / -20
1050     +30 / -33
1026     +30 / -30
1004     +26 / -27
1002     +34 / -31
 977     +28 / -33
 943     +22 / -22
 940     +25 / -25
 905     +29 / -25
 870     +29 / -30
 869     +28 / -28
 869     +27 / -27

Math

Score    95% Confidence
96.60    +1.02 / -1.02
95.68    +1.15 / -1.15
95.60    +1.16 / -1.16
95.19    +1.21 / -1.21
95.10    +1.22 / -1.22
94.85    +1.25 / -1.25
94.69    +1.27 / -1.27
93.94    +1.35 / -1.35
93.28    +1.41 / -1.41
92.28    +1.51 / -1.51
90.54    +1.65 / -1.65
90.12    +1.69 / -1.69
90.12    +1.69 / -1.69
87.47    +1.87 / -1.87
79.83    +2.27 / -2.27
37.51    +2.73 / -2.73

Adversarial Robustness

Number of Violations    95% Confidence
 8                      +8 / -4
10                      +8 / -5
13                      +9 / -5
14                      +9 / -6
16                      +10 / -6
20                      +11 / -7
37                      +14 / -10
67                      +17 / -14
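
Note that these intervals are asymmetric (e.g. +8 / -4 around a count of 8), as expected for intervals on small counts. As an illustration of one standard way to get an asymmetric interval for a rare-event rate (a Wilson score interval on a hypothetical violation rate; the prompt count is assumed, and this is not necessarily Scale's method):

```python
from math import sqrt

def wilson_interval(violations, n_prompts, z=1.96):
    """Wilson score interval for a violation rate. Unlike the naive
    normal interval, it stays within [0, 1] and is asymmetric around
    the observed rate when counts are small."""
    p = violations / n_prompts
    denom = 1 + z * z / n_prompts
    center = (p + z * z / (2 * n_prompts)) / denom
    half = (z / denom) * sqrt(
        p * (1 - p) / n_prompts + z * z / (4 * n_prompts ** 2)
    )
    return center - half, center + half

# Hypothetical: 8 violations observed over an assumed 1000 adversarial prompts.
lo, hi = wilson_interval(8, 1000)
```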

If you’d like to add your model to this leaderboard or a future version, please contact seal@scale.com. To preserve leaderboard integrity, a model can be featured only the first time its organization encounters the evaluation prompts.