LLM Leaderboards
Expert-Driven Private Evaluations
Discover the SEAL LLM Leaderboards for precise and reliable LLM rankings, where leading large language models (LLMs) are evaluated using a rigorous methodology.
Developed by Scale’s Safety, Evaluations, and Alignment Lab (SEAL), these leaderboards utilize private datasets to guarantee fair and uncontaminated results. Regular updates ensure the leaderboard reflects the latest in AI advancements, making it an essential resource for understanding the performance and safety of top LLMs.
Private Datasets
Scale’s proprietary, private evaluation datasets can’t be gamed, ensuring unbiased and uncontaminated results.
Evolving Competition
We periodically update leaderboards with new datasets and models, fostering a dynamic, contest-like environment.
Expert Evaluations
Our evaluations are performed by thoroughly vetted experts using domain-specific methodologies, ensuring the highest quality and credibility.
Learn more about our LLM evaluation methodology
Agentic Tool Use (Chat)
| Model | Score | 95% Confidence |
|---|---|---|
| | 56.85 | +6.92/-6.92 |
| | 56.06 | +6.91/-6.91 |
| 3rd | 55.10 | +6.96/-6.96 |
| | 53.03 | +6.95/-6.95 |
| | 51.27 | +6.98/-6.98 |
| | 49.50 | +6.96/-6.96 |
| | 48.49 | +6.96/-6.96 |
| | 40.40 | +6.84/-6.84 |
| | 40.40 | +6.84/-6.84 |
| | 40.10 | +6.84/-6.84 |
| 11 | 37.88 | +6.78/-6.78 |
| | 35.50 | +6.57/-6.68 |
| | 33.50 | +6.59/-6.59 |
| | 32.83 | +6.54/-6.54 |
| | 20.20 | +5.59/-5.59 |
| | 6.09 | +3.34/-3.34 |
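One way to read the 95% confidence column is to turn each row into a lower and upper bound and check whether two rows' ranges overlap. The sketch below does this for three rows of the table above. The names `Entry` and `intervals_overlap` are illustrative assumptions, not part of any SEAL tooling, and an overlap check is only a rough reading aid, not a formal significance test.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    """One leaderboard row: a score with its +/- 95% confidence bounds."""
    score: float
    plus: float
    minus: float

    @property
    def low(self) -> float:
        return self.score - self.minus

    @property
    def high(self) -> float:
        return self.score + self.plus


def intervals_overlap(a: Entry, b: Entry) -> bool:
    """Return True when the two confidence ranges share at least one point."""
    return a.low <= b.high and b.low <= a.high


# Values copied from the Agentic Tool Use (Chat) table.
first = Entry(56.85, 6.92, 6.92)
third = Entry(55.10, 6.96, 6.96)
last = Entry(6.09, 3.34, 3.34)

print(intervals_overlap(first, third))  # True: the ranges overlap
print(intervals_overlap(first, last))   # False: the ranges are far apart
```

Read this way, the top rows of the table are close enough that their ordering should be treated with caution, while the gap between the top and bottom rows is well outside the stated ranges.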
Agentic Tool Use (Enterprise)
| Model | Score | 95% Confidence |
|---|---|---|
| 1st | 66.43 | +5.47/-5.47 |
| | 64.58 | +5.52/-5.52 |
| | 60.76 | +5.64/-5.64 |
| | 60.28 | +5.66/-5.66 |
| | 59.93 | +5.67/-5.67 |
| | 59.38 | +5.67/-5.67 |
| | 54.17 | +5.78/-5.78 |
| | 52.78 | +5.77/-5.78 |
| | 51.74 | +5.77/-5.77 |
| 10 | 51.39 | +5.77/-5.77 |
| | 50.35 | +5.78/-5.78 |
| | 50.35 | +5.78/-5.78 |
| | 40.42 | +5.68/-5.68 |
| | 37.23 | +5.60/-5.60 |
| | 30.21 | +5.30/-5.30 |
| | 17.42 | +4.39/-4.39 |
Coding
| Model | Score | 95% Confidence |
|---|---|---|
| 1st | 1265 | +40/-32 |
| 2nd | 1195 | +32/-32 |
| | 1115 | +24/-24 |
| | 1086 | +28/-31 |
| | 1076 | +26/-26 |
| | 1074 | +22/-23 |
| | 1073 | +29/-29 |
| | 1072 | +28/-27 |
| | 1062 | +25/-25 |
| | 1022 | +27/-24 |
| | 1020 | +30/-34 |
| | 995 | +22/-23 |
| | 972 | +27/-25 |
| | 931 | +27/-30 |
| | 916 | +27/-29 |
| | 912 | +24/-25 |
| | 852 | +28/-28 |
| | 726 | +33/-33 |
| | 636 | +37/-39 |
Instruction Following
| Model | Score | 95% Confidence |
|---|---|---|
| 1st | 87.32 | +1.71/-1.71 |
| | 87.09 | +1.51/-1.52 |
| | 86.01 | +1.54/-1.53 |
| | 85.29 | +1.61/-1.61 |
| | 85.09 | +1.83/-1.83 |
| | 84.63 | +1.81/-1.82 |
| | 83.87 | +1.42/-1.43 |
| | 83.72 | +1.88/-1.88 |
| | 81.85 | +1.96/-1.96 |
| | 81.32 | +1.75/-1.75 |
| | 80.77 | +1.84/-1.83 |
| | 80.49 | +1.72/-1.72 |
| | 80.03 | +1.57/-1.58 |
| | 78.52 | +2.33/-2.32 |
| | 78.24 | +2.19/-2.19 |
| | 77.25 | +1.96/-1.97 |
| | 67.97 | +2.61/-2.62 |
| | 57.69 | +2.58/-2.57 |
Spanish
| Model | Score | 95% Confidence |
|---|---|---|
| 1st | 1130 | +32/-30 |
| | 1106 | +24/-24 |
| | 1090 | +26/-26 |
| | 1089 | +29/-33 |
| | 1080 | +31/-27 |
| | 1051 | +21/-20 |
| | 1050 | +30/-33 |
| | 1026 | +30/-30 |
| | 1004 | +26/-27 |
| | 1002 | +34/-31 |
| | 977 | +28/-33 |
| | 943 | +22/-22 |
| | 940 | +25/-25 |
| | 905 | +29/-25 |
| | 870 | +29/-30 |
| | 869 | +28/-28 |
| | 869 | +27/-27 |
Math
| Model | Score | 95% Confidence |
|---|---|---|
| | 96.60 | +1.02/-1.02 |
| | 95.68 | +1.15/-1.15 |
| | 95.60 | +1.16/-1.16 |
| | 95.19 | +1.21/-1.21 |
| | 95.10 | +1.22/-1.22 |
| | 94.85 | +1.25/-1.25 |
| | 94.69 | +1.27/-1.27 |
| | 93.94 | +1.35/-1.35 |
| | 93.28 | +1.41/-1.41 |
| | 92.28 | +1.51/-1.51 |
| | 90.54 | +1.65/-1.65 |
| | 90.12 | +1.69/-1.69 |
| | 90.12 | +1.69/-1.69 |
| | 87.47 | +1.87/-1.87 |
| | 79.83 | +2.27/-2.27 |
| | 37.51 | +2.73/-2.73 |
Adversarial Robustness
| Model | Number of Violations | 95% Confidence |
|---|---|---|
| | 8 | +8/-4 |
| | 10 | +8/-5 |
| | 13 | +9/-5 |
| | 14 | +9/-6 |
| | 16 | +10/-6 |
| | 20 | +11/-7 |
| | 37 | +14/-10 |
| | 67 | +17/-14 |
If you’d like to add your model to this leaderboard or a future version, please contact seal@scale.com. To preserve leaderboard integrity, a model can be featured only the first time its organization encounters the prompts.