Scale Labs
2025 Scale AI. All rights reserved.

HiL-Bench (Human-in-Loop Benchmark)

Overview

HiL-Bench (Human-in-Loop Benchmark) measures help-seeking judgment in agents: their ability to recognize when missing, ambiguous, or conflicting information cannot be resolved through exploration or inference alone, and to ask targeted questions that clarify the right information at the right time. We refer to this as "selective escalation."

Current benchmarks typically provide complete task specifications and reward execution correctness alone, so an agent that silently guesses past a missing requirement can score the same as one that would have asked. Real-world tasks, however, are rarely perfectly specified. HiL-Bench is designed to reflect this: its tasks deliberately withhold context the agent needs, so the agent must ask the right questions to obtain the information required to solve the task correctly. Across software engineering and text-to-SQL tasks, HiL-Bench reveals a large judgment gap: models that perform strongly when full information is provided upfront recover only a fraction of that performance when they must decide for themselves whether and when to ask for help (via an ask_human() tool).

Resources:

  • Paper

  • Data

  • Code & Harness

Dataset Overview

HiL-Bench includes two domains: software engineering and text-to-SQL. Tasks are pulled from SWE-Bench Pro and BIRD and have blockers injected. We filtered to only include original benchmark tasks where frontier models already performed strongly, so that performance drops under blocked conditions reflect failures of judgment rather than failures of underlying capability.

The benchmark contains 300 tasks split evenly across the two domains, with 200 public tasks and 100 private held-out tasks for unbiased evaluation. Across those tasks, the dataset contains 1,131 total blockers, averaging 3.8 blockers per task.

                          SWE        SQL        Total
Tasks                     150        150        300
Avg. blockers / task      3.55       3.99       3.77
Total blockers            533        598        1,131
Public / held-out split   100 / 50   100 / 50   200 / 100

Dataset Design

Each task is modified to include 3–5 human-validated, realistic, and unguessable blockers: pieces of critical information that have been removed, obscured, or made contradictory. These blockers take three forms: missing information, ambiguous requests, and contradictory information.

A key design feature is progressive discovery. The blockers are not meant to be obvious from the initial prompt. They often surface during exploration, as the agent inspects the codebase or schema, begins execution, and encounters a point it cannot resolve. The agent must begin working, detect that a gap exists, determine that it cannot be resolved from available context, and then ask a targeted question.
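The detect-then-ask loop described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual harness: the step structure, the `resolvable_locally` flag, and the `ask_human` callable are all hypothetical stand-ins.

```python
# Hypothetical sketch of the progressive-discovery loop: the agent begins
# working, surfaces gaps during exploration, and escalates only the ones
# it cannot resolve from available context.

def run_agent(steps, ask_human):
    """Work through exploration steps; escalate only unresolvable gaps."""
    resolutions = {}
    for step in steps:
        gap = step.get("gap")
        if gap and not step.get("resolvable_locally"):
            # The gap surfaced mid-exploration and cannot be inferred,
            # so the agent asks a targeted question.
            resolutions[gap] = ask_human(gap)
    return resolutions

# Toy trace: one detail resolvable from context, one genuine blocker.
trace = [
    {"action": "inspect codebase"},
    {"gap": "default retry count?", "resolvable_locally": True},
    {"gap": "which timezone for report dates?", "resolvable_locally": False},
]
answers = run_agent(trace, ask_human=lambda q: "UTC")
# Only the unresolvable gap is escalated.
```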

Every blocker must satisfy seven criteria:

  • Realism: the blocker must plausibly arise in a real-world engineering or data-analysis context

  • Criticality: the blocker must prevent the core task from being completed correctly

  • Objectivity: the blocker must have a single, unambiguous resolution with exact values or behaviors

  • Vast search space: the correct resolution cannot be found through guessing or brute-force enumeration within the agent's step budget

  • Independence: resolving one blocker must not reveal the resolution of any other blocker in the same task

  • No contamination: the resolution cannot be inferred from any information available to the agent; it exists only in the blocker registry

  • Non-contrivance: the blocker must be grounded in existing task context, not an artificially inserted requirement

In practice, this means blockers must be plausible, genuinely task-blocking, resistant to guessing, independent from one another, and unable to leak their resolution anywhere in the task environment. Any blocker that fails one of these criteria is rejected.
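One plausible shape for a blocker-registry entry is sketched below; the field names and example values are assumptions for illustration, not the benchmark's actual schema.

```python
from dataclasses import dataclass

# Illustrative blocker-registry entry. Each blocker records its form,
# what was withheld, and the single exact resolution, which exists only
# here (the "no contamination" criterion).

@dataclass(frozen=True)
class Blocker:
    kind: str         # "missing" | "ambiguous" | "contradictory"
    description: str  # what critical information was removed or obscured
    resolution: str   # the single unambiguous answer, held only in the registry

registry = {
    "task-001": [
        Blocker("missing", "target timezone for report dates", "UTC"),
        Blocker("ambiguous", "cutoff for 'recent' orders", "last 30 days"),
    ],
}
```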

[Figure: progressive discovery chart]

Evaluation Methodology

All models are evaluated using the same SWE-Agent scaffolding with identical tool access and step budgets. The system prompt informs the agent that a knowledgeable human collaborator is available via ask_human() and instructs it to use the tool when it encounters information it cannot resolve from the environment. For SQL tasks, agents instead have custom tools for schema exploration, business-logic retrieval, and SQL execution.

The ask_human() tool is backed by a frozen open-source LLM (Llama-3.3-70B-Instruct) acting as a semantic judge. It returns a blocker's resolution only when the agent's question directly targets a registered information gap; otherwise it returns a fixed response “irrelevant question.” This produces a binary, reproducible signal without free-form simulation confounds.
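The ask_human() contract can be mocked as below. The real judge is a frozen LLM (Llama-3.3-70B-Instruct) doing semantic matching; here simple keyword overlap stands in for that, purely to illustrate the gated, binary behavior. The registry shape and keyword scheme are assumptions.

```python
# Mock of the ask_human() contract: return a blocker's resolution only
# when the question directly targets a registered gap; otherwise return
# a fixed rejection string.

FIXED_REJECTION = "irrelevant question"

def make_ask_human(registry):
    """registry: list of {"keywords": [...], "resolution": str} entries."""
    def ask_human(question: str) -> str:
        q = question.lower()
        for blocker in registry:
            # Release the resolution only if the question targets this gap.
            if all(word in q for word in blocker["keywords"]):
                return blocker["resolution"]
        return FIXED_REJECTION  # binary, reproducible rejection signal
    return ask_human

ask = make_ask_human(
    [{"keywords": ["timezone"], "resolution": "All report dates use UTC."}]
)
```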

Tasks are validated in two directions. Necessity: without ask_human(), the pass rate must stay below 5%, confirming that the blockers cannot be bypassed by inference or luck. Sufficiency: with all resolutions provided upfront, at least one model must reach a pass rate of 85% or higher, confirming that the blockers are the only real obstacle. Tasks failing either condition are discarded.

Core Metrics

The leaderboard reports two complementary metrics that measure different things.

ASK-F1 measures selective escalation quality: how well the agent detects information gaps and asks targeted questions. It is the harmonic mean of Question Precision (the share of questions that target real blockers) and Blocker Recall (the share of blockers the agent identifies and asks about). The harmonic mean is deliberate: it structurally prevents gaming through question spam, since high recall achieved by asking fifty questions per task is crushed by near-zero precision.
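The anti-spam property follows directly from the harmonic mean's definition, as a quick computation shows:

```python
# ASK-F1 as defined above: harmonic mean of question precision and
# blocker recall.

def ask_f1(question_precision: float, blocker_recall: float) -> float:
    if question_precision + blocker_recall == 0:
        return 0.0
    return (2 * question_precision * blocker_recall
            / (question_precision + blocker_recall))

# Question spam: perfect recall via dozens of questions, but near-zero
# precision, so the harmonic mean stays low.
spam = ask_f1(0.05, 1.0)       # ~0.095
balanced = ask_f1(0.60, 0.50)  # ~0.545
```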

Pass@3 measures task outcome: whether the agent produces a correct solution in at least one of three independent runs. This depends on both help-seeking quality and the agent's ability to integrate resolved information into a correct solution.
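The aggregation is straightforward; the sketch below is an illustrative reading of the definition, not the benchmark's exact harness code:

```python
# Pass@3 over a task set: a task counts as passed if any of its three
# independent runs produced a correct solution.

def pass_at_3(task_runs):
    """task_runs: per-task lists of 3 booleans (run correctness)."""
    return sum(any(runs) for runs in task_runs) / len(task_runs)

score = pass_at_3([
    [False, True, False],   # passes: one of three runs succeeded
    [False, False, False],  # fails: no run succeeded
])
# score == 0.5
```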

These metrics can diverge. An agent with strong ASK-F1 but low Pass@3 detects gaps well but fails to convert answers into correct solutions. An agent with higher Pass@3 but lower ASK-F1 gets lucky on a few tasks while exhibiting worse help-seeking judgment overall. Both signals matter: ASK-F1 tells you how reliably the agent collaborates; Pass@3 tells you how often you get a correct deliverable.

[Figure: HiL-Bench results with ask_human() available. Each agent decides whether and when to use the tool. Pass@3 reports task outcome; ASK-F1 (harmonic mean of question precision and blocker recall) reports selective escalation quality.]

The table reports results under the "with tool" condition: each agent has access to ask_human() and must decide for itself whether and when to use it. Results are shown combined and broken out by domain (SWE, SQL).

SQL scores are consistently higher than SWE across all models. SWE tasks lean on general engineering patterns where models have strong existing priors, so they default to confident assumptions rather than recognizing gaps. SQL tasks involve domain-specific business logic (threshold definitions, schema ambiguities, column semantics) where missing information is more recognizable during exploration.

ASK-F1 and Pass@3 can tell different stories. GLM-5.1 and Gemini 3.1 Pro illustrate this: both achieve roughly 20% combined Pass@3, but Gemini's ASK-F1 is 43% vs GLM's 30%. Gemini asks more questions and resolves more blockers per task, reflecting stronger uncertainty detection. But because each task requires all blockers resolved for a correct solution, partial resolution does not guarantee completion. GLM asks fewer questions but converts its resolutions into finished tasks more efficiently. This is exactly the distinction the two metrics are designed to surface.

Takeaways

The leaderboard reveals a large and consistent judgment gap. The best model (Claude Opus 4.6) achieves only 24% combined Pass@3 when it must decide whether to ask for help, despite reaching 75-91% when all information is provided upfront (see Table 1 in the paper). The bottleneck is not capability but judgment: knowing when to act and when to escalate.

No model achieves strong selective escalation. The highest combined ASK-F1 is 44%, meaning even the best agent fails to identify or properly target more than half the information gaps it encounters. Models cluster into distinct failure profiles: some under-ask with high precision but low recall (GPT-5.3-Codex: 56% precision, 18% recall), while others detect more gaps but ask imprecisely (GLM-5.1: 23% precision, 42% recall). No model scores consistently high on both.

SWE is where the gap is sharpest. Pass@3 ranges from 1-9% on SWE vs 5-39% on SQL. Blocker recall peaks at 36% on SWE, compared to 61% on SQL.

Performance Comparison

Rank   Model             Pass@3 (%)
1      Claude Opus 4.7   27.67 ± 5.32
1      Claude Opus 4.6   24.33 ± 5.16
1      GLM-5.1           21.00 ± 4.96
1      Gemini 3.1 Pro    20.33 ± 4.92
5      GPT-5.4            9.33 ± 3.83
5      Grok-4.20          8.00 ± 4.60
5      Minimax-M2.5       7.33 ± 3.52
6      GPT-5.3-codex      3.67 ± 2.78
