
TutorBench

AI Tutoring

Overview

Large Language Models serve as on-demand tutors for learners worldwide, yet a critical evaluation gap exists. While most benchmarks assess an LLM's ability to solve problems, this capability alone does not make a model an effective tutor. Effective tutoring requires nuanced, human-centered skills essential for student learning, such as providing adaptive explanations, offering guiding feedback, and adjusting to a learner's specific needs.

To address this gap, we introduce TutorBench, a comprehensive benchmark designed to rigorously evaluate the core tutoring skills of LLMs. TutorBench moves beyond simple answer-correctness to measure how well models perform three common and critical tutoring tasks:

  • Generating adaptive explanations tailored to a student's background
  • Providing actionable feedback on a student's work
  • Promoting active learning through effective hint generation

Comprising 1,490 challenging prompts curated by human experts, TutorBench is intentionally difficult and multimodal, incorporating images of student work to reflect authentic learning interactions and expose the true tutoring strengths and weaknesses of today's most advanced AI.

    The benchmark contains:

    • 1,490 total examples across six STEM subjects (physics, chemistry, biology, calculus, statistics, computer science).
    • 828 multimodal examples (≈56%) requiring models to interpret images such as handwritten work, diagrams, or screenshots.
    • 15,220 rubric criteria, with 3–39 per example, covering correctness, explanation quality, tone, personalization, and more.

    The leaderboard reports overall tutoring ability across three use cases: adaptive explanation generation, assessment & feedback, and active learning support.

    Methodology

    TutorBench evaluates models using three tutoring scenarios, each reflecting a common real-world interaction between student and tutor:

    1. Adaptive Explanation Generation
    • Input: a student's question → expert answer → student follow-up
    • Model task: adapt its explanation to the student's specific confusion
    • Rubrics: clarity, adaptation to student level, correctness
    2. Assessment & Feedback
    • Input: a question + a student's (often incorrect) solution, in text or image
    • Model task: identify errors, provide feedback, and classify misconceptions
    • Rubrics: correctness of assessment, identification and categorization of mistake type, constructive tone
    3. Active Learning Support
    • Input: a question + a student's partial solution
    • Model task: generate helpful hints without revealing the final answer
    • Rubrics: guidance quality, step-by-step scaffolding, avoidance of spoilers

    Rubrics

    Each example includes a sample-specific rubric authored by expert tutors. Rubrics decompose desirable tutoring behavior into pass/fail checks with weights:

    • +5: highly desirable behavior
    • +1: desirable but less critical behavior
    • −5: critical failure (e.g., giving away the answer when only a hint is requested)
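Under this weighting scheme, an example's score can be sketched as earned points over the maximum attainable positive points. The normalization and clamping below are illustrative assumptions, not the benchmark's published formula:

```python
def rubric_score(criteria):
    """Score one example from weighted pass/fail rubric checks.

    criteria: list of (weight, occurred) pairs, where weight is
    +5 (highly desirable), +1 (desirable but less critical), or
    -5 (critical failure), and occurred says whether the described
    behavior appeared in the model's response.

    Returns a percentage: earned points over the maximum attainable
    positive points, clamped to [0, 100].
    """
    earned = sum(w for w, occurred in criteria if occurred)
    max_positive = sum(w for w, _ in criteria if w > 0)
    if max_positive == 0:
        return 0.0
    return max(0.0, min(100.0, 100.0 * earned / max_positive))

# Two +5 checks satisfied, one +1 missed, one -5 failure triggered:
# earned = 5 + 5 - 5 = 5 out of 11 possible positive points.
score = rubric_score([(5, True), (5, True), (1, False), (-5, True)])
```

Note how a single −5 criterion can wipe out the credit from a +5 criterion, which matches its role as a critical-failure penalty (e.g., giving away the answer when only a hint is requested).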

    Rubrics are also tagged along several axes:

    • Evaluation dimensions:
      • Instruction-following
      • Truthfulness
      • Style/tone
      • Visual reasoning
      • Visual perception
      • Calibration to student level
      • Conciseness
      • Emotional component
    • Tutoring skills:
      • Identify core misconceptions
      • Ask guiding questions
      • Give examples or analogies
      • Provide alternative solutions
      • Offer step-by-step help (scaffolding)
      • Recall relevant knowledge
      • Identify correct/incorrect steps
    • Explicit vs. implicit and objective vs. subjective criteria

    LLM-Judge

    To automate evaluation at scale, we use an LLM-judge (Claude-4-Sonnet). Validation against 250 human-rated examples shows:

    • Mean inter-human agreement: 0.75
    • Human–judge agreement: 0.78
    • F1 agreement on majority labels: 0.81

    This indicates the judge aligns with human raters as well as a typical human annotator does.
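These agreement figures can be computed in several ways; below is a minimal sketch assuming simple fraction-of-matching pass/fail labels and a majority vote across human raters (the paper's exact statistics may differ):

```python
def percent_agreement(a, b):
    """Fraction of rubric criteria on which two raters assign the
    same pass/fail label."""
    assert len(a) == len(b)
    return sum(x == y for x, y in zip(a, b)) / len(a)

def majority_labels(raters):
    """Per-criterion majority pass/fail label across several raters."""
    n = len(raters)
    return [sum(col) * 2 > n for col in zip(*raters)]

# Three hypothetical human raters and an LLM-judge on five criteria.
humans = [
    [True, True, False, True, False],
    [True, False, False, True, False],
    [True, True, False, False, False],
]
judge = [True, True, False, True, False]

gold = majority_labels(humans)                      # majority vote per criterion
judge_vs_majority = percent_agreement(judge, gold)  # judge agreement with majority
```

Comparing the judge's agreement with the majority label (0.78 reported) against the mean pairwise agreement among humans (0.75 reported) is what supports the conclusion that the judge aligns with humans about as well as humans align with each other.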

    Dataset Design

    Authoring Process

    • Expert tutors (with at least a bachelor’s degree and teaching experience) created questions, gold solutions, and rubrics.
    • Examples were drawn from six STEM subjects with varied difficulty (mapped to Bloom’s taxonomy levels).
    • Each example includes 3–39 rubric checks for fine-grained evaluation.

    Filtering for Difficulty

    To ensure the dataset remains challenging:

    • Candidate examples were tested on five frontier models.
    • Only examples where at least 3 of 5 models scored <50% were retained.
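The retention rule can be sketched as follows (model names and scores are hypothetical):

```python
def keep_example(model_scores, threshold=50.0, min_failing=3):
    """Retain a candidate example only if at least `min_failing` of the
    tested models scored below `threshold` percent on it."""
    failing = sum(1 for s in model_scores.values() if s < threshold)
    return failing >= min_failing

# Five hypothetical frontier-model scores (percent) on one candidate:
scores = {"m1": 62.0, "m2": 41.0, "m3": 48.5, "m4": 55.0, "m5": 30.0}
keep_example(scores)  # 3 of 5 models scored below 50%, so it is retained
```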

    Multimodality

    Over half the dataset includes images (handwritten equations, diagrams, or screenshots) requiring visual reasoning in addition to text understanding.

    Example Anatomy

    A typical example includes:

    1. Prompt (student question or partial solution).
    2. Supporting content (image or text).
    3. Rubric set (3–39 criteria with weights).
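The anatomy above can be captured in a small schema. The field and class names here are illustrative assumptions, not the benchmark's actual data format:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class RubricCriterion:
    description: str
    weight: int                        # +5, +1, or -5

@dataclass
class TutorBenchExample:
    prompt: str                        # student question or partial solution
    use_case: str                      # one of the three tutoring scenarios
    subject: str                       # one of six STEM subjects
    supporting_text: Optional[str] = None
    image_path: Optional[str] = None   # present for multimodal examples
    rubric: List[RubricCriterion] = field(default_factory=list)  # 3-39 checks

# A hypothetical active-learning example:
ex = TutorBenchExample(
    prompt="I got stuck after factoring. Can you give me a hint?",
    use_case="active_learning_support",
    subject="calculus",
    rubric=[
        RubricCriterion("Hint points toward the next step", 5),
        RubricCriterion("Reveals the final answer", -5),
    ],
)
```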

    Evaluation & Results

    We evaluated 16 of the most advanced frontier LLMs on TutorBench. Results show that while progress is promising, today’s models fall short of being effective tutors.

    • The best-performing model, Gemini 2.5 Pro, achieved an overall score of just 55.7%, meaning even the strongest model fails nearly half of the essential tutoring criteria.
    • Models performed worst on Adaptive Explanation Generation, averaging only 47.3%. They were stronger in Assessment & Feedback (52.6% avg), a more structured task, and in Active Learning Support (53.4% avg).

    The breakdown highlights a clear pattern:

    • Models are relatively proficient at analytical tasks, such as identifying correct and incorrect steps in a student’s work (53.4% average).
    • They struggle much more with pedagogical and creative skills. Performance drops sharply when asked to provide alternative solutions, examples, and analogies, averaging only 37.4%.

    These results suggest that while current LLMs can provide structured checking and feedback, they lack the adaptability, creativity, and pedagogical nuance that real tutoring requires. Adaptive personalization remains the hardest challenge.

    Read the TutorBench paper here.

    Performance Comparison

    Rank (UB)  Model                                  Score
    1          gemini-2.5-pro-preview-06-05           55.65±1.11
    1          gpt-5-2025-08-07                       55.33±1.02
    1          o3-pro-2025-06-10                      54.62±1.02
    1          kimi-k2.5 (NEW)                        54.56±1.20
    1          gpt-5.1-thinking                       54.09±1.06
    1          claude-opus-4-6-thinking-max (NEW)     53.68±1.02
    1          gemini-3-pro-preview                   53.67±1.05
    1          claude-opus-4-6 (Non-Thinking) (NEW)   53.55±1.01
    1          gpt-5.2-2025-12-11                     53.49±1.06
    3          o3-2025-04-16-medium                   52.76±1.00
    5          o3-2025-04-16-high                     52.09±1.01
    10         claude-opus-4-5-20251101-thinking      51.20±0.99
    10         claude-opus-4-1-20250805-thinking      50.78±1.05
    12         claude-opus-4-5-20251101               49.82±0.98
    12         claude-4-opus-20250514-thinking        49.71±1.02
    13         gpt-5.1-instant                        49.08±1.06
    13         claude-sonnet-4-5-20250929-thinking    49.00±1.01
    16         claude-opus-4-1-20250805_anthropic     47.40±1.06
    18         claude-37-sonnet-thinking              46.45±1.03
    18         claude-sonnet-4-5-20250929             45.70±1.01
    18         claude-opus-4-20250514                 45.46±1.06
    22         llama4-maverick                        40.20±1.00
    23         gpt-4o                                 36.12±0.96

    Rank (UB): 1 + the number of models whose lower CI bound exceeds this model’s upper CI bound.
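This ranking rule can be computed directly from the reported mean ± CI values. A sketch on an illustrative three-model subset:

```python
def rank_ub(models):
    """Rank (UB): 1 + the number of models whose lower CI bound
    exceeds this model's upper CI bound.

    models: dict mapping name -> (mean, ci), where the confidence
    interval is taken as mean +/- ci.
    """
    ranks = {}
    for name, (mean, ci) in models.items():
        upper = mean + ci
        strictly_better = sum(
            1 for other, (m, c) in models.items()
            if other != name and m - c > upper
        )
        ranks[name] = 1 + strictly_better
    return ranks

# Illustrative subset of the leaderboard above:
subset = {
    "gemini-2.5-pro-preview-06-05": (55.65, 1.11),
    "o3-2025-04-16-medium": (52.76, 1.00),
    "gpt-4o": (36.12, 0.96),
}
rank_ub(subset)
```

Under this rule, models whose confidence intervals overlap with the leader's all share rank 1, which is why the top of the table contains a long run of ties.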
