An agent’s performance on a task depends on both the underlying LLM and the agent harness: the prompts, tools, and control flow that wrap it. Tuning this harness is the most practical lever for improving an agent’s performance, but the process is highly manual: modify code, run evaluations, inspect traces, tweak prompts or tools, commit, and repeat. As agent deployments scale, this iteration loop becomes a bottleneck.
However, harnesses are code. What if we treated harness optimization as a code generation problem? Developers use coding agents like Claude Code and Codex every day for tasks ranging from front-end development to cluster management, either in tight human-in-the-loop cycles in their IDE or terminal, or autonomously in the cloud. But how good are they at the harness optimization task?
Versioning, Rewards, and Observations (VeRO) seeks to answer this question. VeRO is an evaluation harness to benchmark coding agents on agent optimization, treating the entire target agent program as the search space. Across 105 optimization runs spanning five benchmarks, we found that while tool-use tasks admit meaningful optimization, current models suffer from limited exploration diversity, fragile cross-model generalization, and a strong bias toward prompt modifications over architectural changes.

The Problem: Optimizing Stochastic Programs
Agent optimization is a specific subset of software engineering with its own peculiarities. First, there’s domain knowledge: optimizing agents requires understanding how they work and the ever-growing quiver of tricks and methods described in academic literature, blog posts, internal documentation, etc. Second, agents are stochastic programs; they mix deterministic logic with probabilistic LLM completions. You need to attribute performance differences to specific code changes while accounting for the particular noise distribution introduced by LLMs.
Viewed as an optimization problem, the search space is massive. Framed this way, agent program optimization requires searching over the space of valid Python programs, a space that is technically unbounded and constrained only by what coding agents can generate.
The optimization loop goes like this: a coding agent receives a target agent's codebase, a task specification, and a budget. It then:
- Runs the target agent on training samples and observes execution traces
- Identifies failure modes through those traces
- Modifies any part of the implementation (prompts, tools, control flow, parameters)
- Evaluates modifications and iterates until its budget is exhausted
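The loop above can be sketched in a few lines of Python. Everything here (the `Trace` record, `evaluate`, `propose_and_commit_edit`) is an illustrative stand-in rather than the real VeRO API, and the stubs simulate a target agent that improves with each successive edit:

```python
from dataclasses import dataclass

@dataclass
class Trace:
    sample_id: int
    passed: bool

def evaluate(commit: str, samples: list) -> tuple:
    """Stub evaluator: pretend each successive commit fixes more failure modes."""
    version = int(commit.split("-")[1])
    traces = [Trace(s, passed=(s % 5) >= max(0, 3 - version)) for s in samples]
    score = sum(t.passed for t in traces) / len(traces)
    return score, traces

def propose_and_commit_edit(version: int, failures: list) -> str:
    """Stub for the coding agent's edit step; in VeRO every edit auto-commits."""
    return f"commit-{version + 1}"

def optimize(samples: list, budget: int):
    """Run the edit-execute-evaluate loop until the evaluation budget is spent."""
    best_commit, best_score = None, float("-inf")
    commit, version = "commit-0", 0
    while budget > 0:
        score, traces = evaluate(commit, samples)  # one evaluation = one budget unit
        budget -= 1
        if score > best_score:
            best_commit, best_score = commit, score
        # Inspect traces for failure modes, then modify the implementation
        # (prompts, tools, control flow, parameters) and commit the change.
        failures = [t for t in traces if not t.passed]
        version += 1
        commit = propose_and_commit_edit(version, failures)
    return best_commit, best_score
```

With a budget of 8, the sketch evaluates eight commits and returns the best-scoring one, mirroring how the real loop terminates on budget exhaustion rather than convergence.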
Mathematically, this is the problem we are trying to solve:

A* = argmax_{A ∈ Aᵣ} 𝔼[s(A)]  subject to  nE ≤ B
In other words, we're searching over a restricted program space Aᵣ (e.g. fixed model checkpoint, API constraints, no test data access) to maximize expected evaluation score, where the expectation marginalizes over test distribution, evaluator stochasticity, and agent stochasticity. The budget constraint nE ≤ B reflects the total allowance for target agent evaluation.
The VeRO Harness: Reproducible Evaluation Infrastructure
Our architecture is partitioned into three functional spaces, as illustrated in Figure 1, to ensure controlled resource enforcement and observability:
- Optimizer Space (Builder): Contains the coding agent tasked with proposing edits, inspecting traces, and navigating the version history.
- VeRO Harness (Infrastructure): Features Git worktrees for version control (auto-committing snapshots), an Experiment Database that stores granular traces/scores and exposes them to the agent, and an Evaluation Engine that scores agents generated by the builder and stores results in the database.
- Target Space (Subject): Contains the target agent code as a Python package, providing rewards and traces back to the harness.
The coding agent interacts with the target agent through a mix of native and harness-specific tools. Native tools are hooked or restricted, which lets us fully observe what coding agents are doing and keeps comparisons fair. In particular, we use the following:
- Auto-Commit: Every edit to the target agent auto-commits to Git, so we have a full record of what the coding agent did.
- ExperimentRunner: A gated evaluation tool. The coding agent can hand it a commit hash and a number of samples; the tool checks out that version, runs the target agent, stores results, and decrements the budget. The agent cannot evaluate more times than its budget allows; the tool enforces this as a hard limit.
- ExperimentViewer: An interface to all prior evaluation results: per-sample scores, full execution traces, and error logs. This is how the coding agent diagnoses failures and identifies room for improvement.
- DatasetViewer: An interface for inspecting target agent inputs. Access to the held-out test set is blocked at the tool level so the agent can’t game its evaluation.
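As a rough illustration of how a gated evaluation tool can enforce a hard budget, here is a minimal sketch. The class name matches ExperimentRunner above, but the interface is our own invention, not the VeRO source:

```python
class BudgetExhausted(Exception):
    """Raised when the coding agent tries to evaluate past its allowance."""
    pass

class ExperimentRunner:
    """Sketch of a gated evaluation tool: it runs the target agent at a given
    commit, persists the results, and decrements a budget the coding agent
    cannot bypass. `run_fn` and `store_fn` are illustrative injection points."""

    def __init__(self, budget: int, run_fn, store_fn):
        self._budget = budget
        self._run_fn = run_fn      # runs the target agent at a given commit
        self._store_fn = store_fn  # persists per-sample scores and traces

    @property
    def remaining(self) -> int:
        return self._budget

    def run(self, commit: str, n_samples: int) -> dict:
        if self._budget <= 0:
            # Hard enforcement: the tool refuses, regardless of agent intent.
            raise BudgetExhausted("evaluation budget exhausted")
        self._budget -= 1
        results = self._run_fn(commit, n_samples)
        self._store_fn(commit, results)
        return results
```

Putting the budget check inside the tool, rather than in the prompt, is what makes the constraint hard: a coding agent can ignore instructions, but it cannot make this call succeed once `remaining` hits zero.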
Evaluation Methodology: The Edit-Execute-Evaluate Cycle
Every optimization run follows the same loop. The optimizer starts with a base target agent A₀ and a budget B (in our experiments, B = 8 full evaluation calls). Though we don’t impose any hard restrictions on what the coding agent does with its time, we prompt it to:
- Inspect: The optimizer reads the current agent code and checks any prior evaluation traces in the Experiment Database to understand what the agent is getting wrong.
- Implement: The optimizer changes the code, whether a prompt rewrite, a new tool, a change to control flow, or a parameter tweak, and writes it to disk. The auto-commit hook fires immediately, capturing the modification as a versioned Git snapshot.
- Evaluate: The optimizer calls ExperimentRunner on the new commit. The harness checks out that version in an isolated environment, runs the target agent on the training split, logs per-sample scores and full execution traces, and returns aggregate performance back to the optimizer.
- Iterate: If the new version underperforms, the optimizer can roll back and try something different. The best-performing commit across all evaluations is selected as the final output.
A key design decision: we select the best commit based on validation performance, and only evaluate the test set at the start (baseline) and end (best commit). This prevents optimizers from overfitting their search to the test distribution, a subtle but important form of reward hacking to guard against.
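The selection protocol can be sketched as follows. The function and argument names are illustrative, not VeRO's API; the key property is that the test evaluator is called exactly twice, once for the baseline and once for the commit selected on validation scores:

```python
def select_and_report(baseline_commit: str, val_results: dict, eval_test):
    """Pick the best commit by *validation* score; touch the held-out test
    set only for the baseline and the final selected commit.
    `val_results` maps commit hash -> validation score."""
    test_baseline = eval_test(baseline_commit)           # test eval #1: baseline
    best_commit = max(val_results, key=val_results.get)  # selected on validation only
    test_final = eval_test(best_commit)                  # test eval #2: final commit
    return best_commit, test_baseline, test_final
```

Because the optimizer never sees test scores during the search, it cannot steer its edits toward the test distribution, which is exactly the reward-hacking channel this design closes.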
We compare five optimizer configurations spanning two coding agent scaffolds (a minimal custom scaffold and Claude Code) and three underlying LLMs (Claude Sonnet 4.5, Claude Opus 4.5, and GPT-5.2-Codex), running each configuration N = 3 times per task for a total of 105 experiments.

Main Findings from Benchmark Study
The headline number: across the three tool-use-oriented benchmarks (GAIA, TAU-Bench Retail, SimpleQA), the best optimizer configurations achieved roughly 8–9% average lift over baseline. The best individual checkpoints showed impressive jumps, including a 4.3x increase on GAIA, a 1.9x increase on TAU-Bench Retail, and a 1.4x increase on SimpleQA. Reasoning-heavy benchmarks (GPQA, MATH) showed almost no improvement under any configuration, a consistent finding that suggests current coding agents can add and refine tools effectively but cannot yet improve the reasoning a model does inside its forward pass.
A few other notable findings:
Infrastructure matters more than model. Claude Code with no harness-specific tools (i.e. Claude Code in YOLO mode with the full dataset and a budgeted API key) improved average performance by just 3% over baseline. Adding VeRO tools and harness support pushed that to 8%. Tools exposing structured traces and versioned snapshots guide the coding agent in the optimization process.
The best optimizer model is task-dependent. Claude Sonnet and Opus significantly outperformed GPT-5.2-Codex on GAIA, TAU-Bench Retail, and SimpleQA, while GPT-5.2-Codex was best on GPQA. No single model dominated across all tasks. Interestingly, Sonnet outperformed Opus on TAU-Bench Retail despite being the smaller model, suggesting task fit matters more than raw model scale.
Instruction templates have a variance-performance tradeoff. More prescriptive templates (like our Cookbook+Reasoning, which includes a library of optimization patterns and a structured 4-phase workflow) produced higher peak performance but also more variance and occasional catastrophic regressions. Constrained templates (Evidence-Based, which explicitly discourages complex new tools and enforces single-variable experimentation) produced stable but capped improvements. If you care about worst-case behavior, use tighter templates; if you're hunting for breakthroughs and can tolerate regressions, give the optimizer creative freedom.
The best commit is usually found early. For tool-use tasks, the optimal agent version was typically discovered before the halfway point in the optimization trajectory. This hints at diminishing returns and suggests that smarter budget allocation (front-loading evaluation calls and branching from promising commits rather than running linearly) could be a fruitful research direction.
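As a hypothetical illustration of what front-loading could look like (this is a suggestion prompted by the finding, not something VeRO currently implements), evaluation calls could be allocated geometrically across phases so early phases get more of the budget:

```python
def front_loaded_schedule(total_budget: int, n_phases: int, decay: float = 0.5) -> list:
    """Allocate evaluation calls across phases with geometric decay:
    phase i gets weight decay**i, so early phases are favored.
    Every phase is guaranteed at least one call."""
    weights = [decay ** i for i in range(n_phases)]
    total_w = sum(weights)
    alloc = [max(1, round(total_budget * w / total_w)) for w in weights]
    # Trim any rounding overshoot from the tail so the sum fits the budget.
    i = len(alloc) - 1
    while sum(alloc) > total_budget and i >= 0:
        if alloc[i] > 1:
            alloc[i] -= 1
        else:
            i -= 1
    return alloc
```

With the B = 8 budget from our experiments and four phases, this schedule spends half the budget in the first phase, then tapers, instead of spreading calls uniformly.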
Three Uncomfortable Truths of Agent Optimization
After analyzing all 105 runs, three patterns emerged that we weren't entirely expecting:
1. Coding agents almost always reach for the prompt first.
Over the course of a single optimization run, the coding agent alternates between modifying code and evaluating the target agent; we call one cycle of modification and evaluation a phase. Across all configurations and all optimization phases beyond the first, prompt modifications dominated, accounting for over 50% of all changes. This was true even when the task clearly required new tools or structural changes to improve. It's the path of least resistance: prompts are easy to change, easy to evaluate, and rarely break things catastrophically. The downside is that prompt tuning has a ceiling. The most impactful improvements came from structural changes (new tools, better control flow) that optimizers were hesitant to attempt even when instructed to do so.
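The kind of bookkeeping behind this statistic is straightforward to sketch. The labels and structure below are our own illustration, not VeRO's internal taxonomy; each phase simply contributes a list of change-type labels:

```python
from collections import Counter

def prompt_share_by_phase(phase_changes: dict) -> dict:
    """Compute the fraction of changes in each phase that were prompt edits.
    `phase_changes` maps a phase index to the list of change-type labels
    (e.g. "prompt", "tool", "control_flow", "params") observed in that phase."""
    shares = {}
    for phase, changes in phase_changes.items():
        counts = Counter(changes)
        shares[phase] = counts["prompt"] / sum(counts.values())
    return shares
```

A phase where prompt edits exceed 0.5 of all changes is exactly the pattern we observed from the second phase onward.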

2. What works for one model doesn't always work for another.
The best commits found using GPT-4.1 mini as the target agent’s model sometimes improved performance on GPT-4.1, Gemini 2.5 Flash, or Qwen3 variants, yet sometimes degraded it substantially. This matters when productionizing the optimized target agent: if your model changes (e.g., you upgrade to a newer API version), the optimized scaffolding may not carry over. We found that improvements driven by tool additions and workflow changes tended to generalize better than prompt-heavy modifications, which can be sensitive to a specific model's instruction-following tendencies.
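A simple portability check before productionizing an optimized commit could look like the sketch below: re-evaluate the commit under each candidate target model and flag regressions against that model's own baseline. The function and argument names are illustrative assumptions, not part of VeRO:

```python
def cross_model_check(best_commit: str, eval_fn, models: list, baseline_scores: dict) -> dict:
    """Re-score an optimized commit under alternative target models.
    `eval_fn(commit, model)` returns that model's score on the commit;
    `baseline_scores` maps model name -> that model's unoptimized score."""
    report = {}
    for model in models:
        score = eval_fn(best_commit, model)
        delta = score - baseline_scores[model]
        report[model] = {"score": score, "delta": delta, "regressed": delta < 0}
    return report
```

A `regressed` flag on any deployment-candidate model is a signal to prefer the tool- and workflow-driven parts of the optimization over the prompt-heavy parts, which we found transfer less reliably.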

3. Simpler agents are easier to improve, but harder to improve stably.
Our case study compared a minimal "Pawn" agent (4 tools, 25-line prompt) against a sophisticated "Knight" agent (6 tools, 140-line prompt with best-practice patterns). Pawn showed larger peak gains, up to +13.3% on SimpleQA, but also much higher variance: a single bad modification could regress performance across multiple benchmarks simultaneously. Knight was harder to move but more predictable. This suggests that optimization headroom and optimization risk are tightly coupled: the same open space that allows big improvements also allows big regressions.

Aside: The Tricks Coding Agents Discovered
Across all runs, the coding agents made several particularly insightful optimizations that move beyond simple prompt adjustments; we catalog these in the pre-print.

What’s Next
The results establish that agent optimization is an essential, but far from solved, capability for coding agents.
- Recursive Self-Optimization: If agents are code, then the optimizers themselves are subjects for refinement. Applying VeRO to allow a coding agent to optimize its own scaffolds could lead to an "auto-pilot" for agent building.
- Budget-Aware ROI: Future work should incorporate "Agentic ROI" as an explicit objective, rewarding modifications that improve performance while reducing token costs or API latency.
- RL for Trajectory Tuning: The structured trajectories captured by VeRO are themselves a source of training data for fine-tuning. By training LLMs on successful optimization phases, where a model accurately identifies a failure from a trace and proposes a structural fix, we can teach the specific "builder" skills currently lacking.
By focusing on the "edit-execute-evaluate" loop, we can transition from a world of human-crafted agents to one where agents and their optimizers evolve together to solve increasingly complex challenges. For details on the harness and the experiments described above, you can read our pre-print on arXiv: https://arxiv.org/abs/2602.22480.