Benchmark AI Agents in 10 Minutes

Test your coding agent on real-world tasks with automatic evaluation.

What You'll Build

Test Cases → Run Agent (parallel) → Evaluate Solutions → Summary Report

Prerequisites

  • A coding agent compatible with TerminalBench or SWE-Bench
  • API access to Jetty
  • Configured model API keys

The Workflow

Parent Workflow: benchmark-agent

{
  "init_params": {
    "agent": "your-agent-name",
    "model": "anthropic/claude-sonnet-4-20250514",
    "test_cases": [
      "django__django-11119",
      "astropy__astropy-7166",
      "django__django-10880"
    ]
  },
  "step_configs": {
    "run_tests": {
      "activity": "list_emit_await",
      "items_path": "init_params.test_cases",
      "task_reference": {
        "task_name": "run-single-test"
      },
      "data_mapping": {
        "agent": "{{ init_params.agent }}",
        "model": "{{ init_params.model }}",
        "task_id": "{{ $item }}"
      },
      "execution_config": {
        "max_parallel": 3,
        "timeout_seconds": 35000
      }
    },
    "collect_results": {
      "activity": "extract_from_trajectories",
      "trajectory_list_path": "run_tests.outputs.trajectory_references",
      "extract_keys": {
        "task_id": "init_params.task_id",
        "resolved": "eval.outputs.resolved_instances",
        "errors": "tbench.outputs.n_errors"
      }
    },
    "summarize": {
      "model": "gemini/gemini-2.5-flash",
      "activity": "litellm_chat",
      "user_prompt": "Generate a markdown summary of these benchmark results:\n\n{{ collect_results.outputs.extracted_data | tojson }}\n\nInclude: pass rate, common failure patterns, and recommendations."
    }
  },
  "steps": ["run_tests", "collect_results", "summarize"]
}
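To make the fan-out step concrete, here is a minimal Python sketch of what `list_emit_await` does with `data_mapping`: it spawns one child workflow per list item, rendering `{{ $item }}` to the current element and `{{ init_params.* }}` from the parent's parameters. The `render` and `run_child` names are illustrative stand-ins, not the actual Jetty API.

```python
# Illustrative sketch of list_emit_await fan-out semantics.
# "render" and "run_child" are hypothetical names, not Jetty APIs.

def render(template: str, item, init_params: dict) -> str:
    """Resolve a data_mapping template: $item is the current list element."""
    if template == "{{ $item }}":
        return item
    key = template.strip("{} ").removeprefix("init_params.")
    return init_params[key]

def list_emit_await(init_params: dict, data_mapping: dict, run_child):
    """Spawn one child workflow per test case and collect all results."""
    results = []
    for item in init_params["test_cases"]:
        child_params = {k: render(v, item, init_params)
                        for k, v in data_mapping.items()}
        results.append(run_child(child_params))
    return results
```

With three test cases and `max_parallel: 3`, all three children run concurrently; the sketch above shows only the mapping, not the scheduling.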

Child Workflow: run-single-test

{
  "init_params": {
    "agent": "",
    "model": "",
    "task_id": ""
  },
  "step_configs": {
    "tbench": {
      "activity": "terminal_bench",
      "agent_path": "init_params.agent",
      "model_path": "init_params.model",
      "task_id_path": "init_params.task_id",
      "env": "swebench"
    },
    "eval": {
      "activity": "swe_bench_docker_eval",
      "predictions_path": "tbench.outputs.predictions",
      "task_id_path": "init_params.task_id"
    }
  },
  "steps": ["tbench", "eval"]
}

What This Does

  1. Runs your agent on multiple SWE-Bench tasks in parallel
  2. Evaluates each solution against the official test suite
  3. Generates a summary of pass/fail rates with analysis
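As a sketch of step 3, the pass rate the summary step reports can be computed from the collected rows in plain Python (this runs outside the workflow engine; it just mirrors the arithmetic in the example output below):

```python
def pass_rate(extracted_data: list[dict]) -> float:
    """Fraction of tasks resolved (resolved is 1 for pass, 0 for fail)."""
    resolved = sum(row["resolved"] for row in extracted_data)
    return resolved / len(extracted_data)

rows = [
    {"task_id": "django__django-11119", "resolved": 1, "errors": 0},
    {"task_id": "astropy__astropy-7166", "resolved": 0, "errors": 2},
    {"task_id": "django__django-10880", "resolved": 1, "errors": 0},
]
print(f"{pass_rate(rows):.1%}")  # 66.7%
```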

Supported Benchmarks

| Benchmark | Activity | Description |
| --- | --- | --- |
| SWE-Bench | terminal_bench + swe_bench_docker_eval | Real GitHub issues |
| TerminalBench | harbor_terminal_bench | Terminal-based coding tasks |
| Custom | Build your own | Any Docker-based evaluation |

Harbor Terminal Bench

For container-based agent evaluation:

{
  "init_params": {
    "agent": "your-agent",
    "model": "anthropic/claude-sonnet-4-20250514",
    "dataset": "your-dataset"
  },
  "step_configs": {
    "run": {
      "activity": "harbor_terminal_bench",
      "agent_path": "init_params.agent",
      "model_path": "init_params.model",
      "dataset_path": "init_params.dataset"
    }
  },
  "steps": ["run"]
}

Configure Execution

Control parallelism

{
  "execution_config": {
    "max_parallel": 3,
    "timeout_seconds": 35000
  }
}

Agent benchmarks are resource-intensive. Recommended settings:

| Test Count | max_parallel | timeout_seconds |
| --- | --- | --- |
| 1-5 | 3 | 35000 |
| 10-50 | 5-10 | 35000 |
| 100+ | 10-20 | 45000 |
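The recommendations above can be encoded as a small helper. This is a sketch of the table's thresholds, not something Jetty enforces; it picks the upper end of each recommended `max_parallel` range.

```python
def execution_config(test_count: int) -> dict:
    """Recommended execution settings by test count (upper end of each range)."""
    if test_count <= 5:
        return {"max_parallel": 3, "timeout_seconds": 35000}
    if test_count <= 50:
        return {"max_parallel": 10, "timeout_seconds": 35000}
    return {"max_parallel": 20, "timeout_seconds": 45000}
```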

Handle long-running tests

Some SWE-Bench tasks take 10+ minutes. Set appropriate timeouts:

{
  "execution_config": {
    "timeout_seconds": 60000
  }
}

The Output

{
  "collect_results": {
    "outputs": {
      "extracted_data": [
        {"task_id": "django__django-11119", "resolved": 1, "errors": 0},
        {"task_id": "astropy__astropy-7166", "resolved": 0, "errors": 2},
        {"task_id": "django__django-10880", "resolved": 1, "errors": 0}
      ]
    }
  },
  "summarize": {
    "outputs": {
      "content": "## Benchmark Results\n\n**Pass Rate:** 66.7% (2/3)\n\n### Passed\n- django__django-11119\n- django__django-10880\n\n### Failed\n- astropy__astropy-7166 (2 errors)\n\n### Recommendations\n..."
    }
  }
}

SWE-Bench Lite (Quick Test)

Test with a smaller subset:

{
  "init_params": {
    "test_cases": [
      "sympy__sympy-13031",
      "django__django-11099",
      "requests__requests-3362"
    ]
  }
}

These are known to be solvable and run quickly.


Full SWE-Bench Verified

For comprehensive evaluation:

{
  "init_params": {
    "agent": "your-agent",
    "model": "anthropic/claude-sonnet-4-20250514"
  },
  "step_configs": {
    "load_dataset": {
      "activity": "download_file",
      "url": "https://your-dataset-url/swebench-verified.json"
    },
    "run_all": {
      "activity": "list_emit_await",
      "items_path": "load_dataset.outputs.content",
      "task_reference": {
        "task_name": "run-single-test"
      },
      "data_mapping": {
        "agent": "{{ init_params.agent }}",
        "model": "{{ init_params.model }}",
        "task_id": "{{ $item.instance_id }}"
      },
      "execution_config": {
        "max_parallel": 10,
        "timeout_seconds": 45000
      }
    }
  },
  "steps": ["load_dataset", "run_all"]
}

Compare Multiple Agents

Run the same tests with different agents:

{
  "init_params": {
    "agents": ["agent-v1", "agent-v2", "agent-v3"],
    "model": "anthropic/claude-sonnet-4-20250514",
    "test_case": "django__django-11119"
  },
  "step_configs": {
    "run_all_agents": {
      "activity": "list_emit_await",
      "items_path": "init_params.agents",
      "task_reference": {
        "task_name": "run-single-test"
      },
      "data_mapping": {
        "agent": "{{ $item }}",
        "model": "{{ init_params.model }}",
        "task_id": "{{ init_params.test_case }}"
      }
    },
    "collect": {
      "activity": "extract_from_trajectories",
      "trajectory_list_path": "run_all_agents.outputs.trajectory_references",
      "extract_keys": {
        "agent": "init_params.agent",
        "resolved": "eval.outputs.resolved_instances"
      }
    }
  },
  "steps": ["run_all_agents", "collect"]
}
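Once the `collect` step has run, its extracted rows can be rendered as a quick side-by-side view. A minimal sketch (plain Python over the collected data, with the row shape assumed from the `extract_keys` above):

```python
def comparison_table(rows: list[dict]) -> str:
    """Render per-agent pass/fail from the collect step's extracted rows."""
    lines = ["agent        result"]
    for row in rows:
        mark = "pass" if row["resolved"] else "fail"
        lines.append(f"{row['agent']:<12} {mark}")
    return "\n".join(lines)

rows = [
    {"agent": "agent-v1", "resolved": 1},
    {"agent": "agent-v2", "resolved": 0},
    {"agent": "agent-v3", "resolved": 1},
]
print(comparison_table(rows))
```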

Best Practices

  1. Start with 3-5 tests: Verify your setup before scaling
  2. Use known-solvable tasks: Test with tasks your agent has solved before
  3. Monitor resources: Agent benchmarks are compute-intensive
  4. Save trajectories: Review individual runs for debugging
  5. Track over time: Compare results across agent versions

Next Steps