# Benchmark AI Agents in 10 Minutes
Test your coding agent on real-world tasks with automatic evaluation.
## What You'll Build
Test Cases → Run Agent (parallel) → Evaluate Solutions → Summary Report
## Prerequisites
- A coding agent compatible with TerminalBench or SWE-Bench
- API access to Jetty
- Configured model API keys
## The Workflow

### Parent Workflow: `benchmark-agent`

```json
{
  "init_params": {
    "agent": "your-agent-name",
    "model": "anthropic/claude-sonnet-4-20250514",
    "test_cases": [
      "django__django-11119",
      "astropy__astropy-7166",
      "django__django-10880"
    ]
  },
  "step_configs": {
    "run_tests": {
      "activity": "list_emit_await",
      "items_path": "init_params.test_cases",
      "task_reference": {
        "task_name": "run-single-test"
      },
      "data_mapping": {
        "agent": "{{ init_params.agent }}",
        "model": "{{ init_params.model }}",
        "task_id": "{{ $item }}"
      },
      "execution_config": {
        "max_parallel": 3,
        "timeout_seconds": 35000
      }
    },
    "collect_results": {
      "activity": "extract_from_trajectories",
      "trajectory_list_path": "run_tests.outputs.trajectory_references",
      "extract_keys": {
        "task_id": "init_params.task_id",
        "resolved": "eval.outputs.resolved_instances",
        "errors": "tbench.outputs.n_errors"
      }
    },
    "summarize": {
      "model": "gemini/gemini-2.5-flash",
      "activity": "litellm_chat",
      "user_prompt": "Generate a markdown summary of these benchmark results:\n\n{{ collect_results.outputs.extracted_data | tojson }}\n\nInclude: pass rate, common failure patterns, and recommendations."
    }
  },
  "steps": ["run_tests", "collect_results", "summarize"]
}
```
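To make the fan-out concrete: for each entry in `test_cases`, the `data_mapping` placeholders are expanded and the result becomes the child workflow's `init_params`. The sketch below is illustrative only, assuming Jinja-style `{{ ... }}` substitution; `render_mapping` is a hypothetical helper, not part of the platform.

```python
import re

def render_mapping(mapping, init_params, item):
    """Illustrative sketch: expand {{ init_params.* }} and {{ $item }} placeholders."""
    def render(value):
        def repl(match):
            expr = match.group(1).strip()
            if expr == "$item":
                return str(item)
            if expr.startswith("init_params."):
                return str(init_params[expr.split(".", 1)[1]])
            return match.group(0)  # leave unknown expressions untouched
        return re.sub(r"\{\{(.*?)\}\}", repl, value)
    return {key: render(value) for key, value in mapping.items()}

init_params = {"agent": "your-agent-name", "model": "anthropic/claude-sonnet-4-20250514"}
mapping = {
    "agent": "{{ init_params.agent }}",
    "model": "{{ init_params.model }}",
    "task_id": "{{ $item }}",
}

# One child-workflow payload per test case
for task in ["django__django-11119", "astropy__astropy-7166"]:
    print(render_mapping(mapping, init_params, task))
```

Each rendered dict is what one `run-single-test` child receives as its `init_params`.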
### Child Workflow: `run-single-test`

```json
{
  "init_params": {
    "agent": "",
    "model": "",
    "task_id": ""
  },
  "step_configs": {
    "tbench": {
      "activity": "terminal_bench",
      "agent_path": "init_params.agent",
      "model_path": "init_params.model",
      "task_id_path": "init_params.task_id",
      "env": "swebench"
    },
    "eval": {
      "activity": "swe_bench_docker_eval",
      "predictions_path": "tbench.outputs.predictions",
      "task_id_path": "init_params.task_id"
    }
  },
  "steps": ["tbench", "eval"]
}
```
## What This Does
- Runs your agent on multiple SWE-Bench tasks in parallel
- Evaluates each solution against the official test suite
- Generates a summary of pass/fail rates with analysis
## Supported Benchmarks

| Benchmark | Activity | Description |
|---|---|---|
| SWE-Bench | `terminal_bench` + `swe_bench_docker_eval` | Real GitHub issues |
| TerminalBench | `harbor_terminal_bench` | Terminal-based coding tasks |
| Custom | Build your own | Any Docker-based evaluation |
## Harbor Terminal Bench

For container-based agent evaluation:

```json
{
  "init_params": {
    "agent": "your-agent",
    "model": "anthropic/claude-sonnet-4-20250514",
    "dataset": "your-dataset"
  },
  "step_configs": {
    "run": {
      "activity": "harbor_terminal_bench",
      "agent_path": "init_params.agent",
      "model_path": "init_params.model",
      "dataset_path": "init_params.dataset"
    }
  },
  "steps": ["run"]
}
```
## Configure Execution

### Control parallelism

```json
{
  "execution_config": {
    "max_parallel": 3,
    "timeout_seconds": 35000
  }
}
```
Agent benchmarks are resource-intensive. Recommended settings:
| Test Count | max_parallel | timeout_seconds |
|---|---|---|
| 1-5 | 3 | 35000 |
| 10-50 | 5-10 | 35000 |
| 100+ | 10-20 | 45000 |
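The table above can be encoded as a small helper if you generate configs programmatically. This is a convenience sketch, not part of the platform; where the table gives a range for `max_parallel`, the upper bound is used.

```python
def recommended_settings(test_count):
    """Return execution_config values mirroring the recommendations table."""
    if test_count <= 5:
        return {"max_parallel": 3, "timeout_seconds": 35000}
    if test_count <= 50:
        return {"max_parallel": 10, "timeout_seconds": 35000}
    return {"max_parallel": 20, "timeout_seconds": 45000}

print(recommended_settings(3))
```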
### Handle long-running tests

Some SWE-Bench tasks take 10+ minutes. Set appropriate timeouts:

```json
{
  "execution_config": {
    "timeout_seconds": 60000
  }
}
```
## The Output

```json
{
  "collect_results": {
    "outputs": {
      "extracted_data": [
        {"task_id": "django__django-11119", "resolved": 1, "errors": 0},
        {"task_id": "astropy__astropy-7166", "resolved": 0, "errors": 2},
        {"task_id": "django__django-10880", "resolved": 1, "errors": 0}
      ]
    }
  },
  "summarize": {
    "outputs": {
      "content": "## Benchmark Results\n\n**Pass Rate:** 66.7% (2/3)\n\n### Passed\n- django__django-11119\n- django__django-10880\n\n### Failed\n- astropy__astropy-7166 (2 errors)\n\n### Recommendations\n..."
    }
  }
}
```
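The pass rate the summary step reports follows directly from `extracted_data`, so you can also compute it yourself, e.g. when the LLM summary is skipped:

```python
# Same shape as collect_results.outputs.extracted_data above
extracted_data = [
    {"task_id": "django__django-11119", "resolved": 1, "errors": 0},
    {"task_id": "astropy__astropy-7166", "resolved": 0, "errors": 2},
    {"task_id": "django__django-10880", "resolved": 1, "errors": 0},
]

passed = [r["task_id"] for r in extracted_data if r["resolved"]]
pass_rate = len(passed) / len(extracted_data)
print(f"Pass rate: {pass_rate:.1%} ({len(passed)}/{len(extracted_data)})")
# → Pass rate: 66.7% (2/3)
```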
## SWE-Bench Lite (Quick Test)

Test with a smaller subset:

```json
{
  "init_params": {
    "test_cases": [
      "sympy__sympy-13031",
      "django__django-11099",
      "requests__requests-3362"
    ]
  }
}
```
These are known to be solvable and run quickly.
## Full SWE-Bench Verified

For comprehensive evaluation:

```json
{
  "init_params": {
    "agent": "your-agent",
    "model": "anthropic/claude-sonnet-4-20250514"
  },
  "step_configs": {
    "load_dataset": {
      "activity": "download_file",
      "url": "https://your-dataset-url/swebench-verified.json"
    },
    "run_all": {
      "activity": "list_emit_await",
      "items_path": "load_dataset.outputs.content",
      "task_reference": {
        "task_name": "run-single-test"
      },
      "data_mapping": {
        "task_id": "{{ $item.instance_id }}"
      },
      "execution_config": {
        "max_parallel": 10,
        "timeout_seconds": 45000
      }
    }
  },
  "steps": ["load_dataset", "run_all"]
}
```
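Because `data_mapping` references `{{ $item.instance_id }}`, the downloaded dataset must be a JSON array of objects, each carrying an `instance_id` field (as SWE-Bench instances do). A minimal sketch of that expectation, with illustrative data:

```python
import json

# Illustrative two-entry dataset; the real SWE-Bench Verified file has
# many more fields per instance, but instance_id is what run_all consumes.
dataset_json = """
[
  {"instance_id": "django__django-11119", "repo": "django/django"},
  {"instance_id": "astropy__astropy-7166", "repo": "astropy/astropy"}
]
"""

items = json.loads(dataset_json)
task_ids = [item["instance_id"] for item in items]
print(task_ids)
```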
## Compare Multiple Agents

Run the same tests with different agents:

```json
{
  "init_params": {
    "agents": ["agent-v1", "agent-v2", "agent-v3"],
    "model": "anthropic/claude-sonnet-4-20250514",
    "test_case": "django__django-11119"
  },
  "step_configs": {
    "run_all_agents": {
      "activity": "list_emit_await",
      "items_path": "init_params.agents",
      "task_reference": {
        "task_name": "run-single-test"
      },
      "data_mapping": {
        "agent": "{{ $item }}",
        "model": "{{ init_params.model }}",
        "task_id": "{{ init_params.test_case }}"
      }
    },
    "collect": {
      "activity": "extract_from_trajectories",
      "trajectory_list_path": "run_all_agents.outputs.trajectory_references",
      "extract_keys": {
        "agent": "init_params.agent",
        "resolved": "eval.outputs.resolved_instances"
      }
    }
  },
  "steps": ["run_all_agents", "collect"]
}
```
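The `collect` step yields one `{agent, resolved}` record per agent, which is easy to turn into a side-by-side view. A sketch with illustrative values (field names match `extract_keys` above):

```python
# Same shape as collect.outputs.extracted_data; values are made up
extracted_data = [
    {"agent": "agent-v1", "resolved": 1},
    {"agent": "agent-v2", "resolved": 0},
    {"agent": "agent-v3", "resolved": 1},
]

# Passing agents first; stable sort preserves original order within ties
for row in sorted(extracted_data, key=lambda r: -r["resolved"]):
    status = "PASS" if row["resolved"] else "FAIL"
    print(f"{row['agent']:<10} {status}")
```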
## Best Practices
- Start with 3-5 tests: Verify your setup before scaling
- Use known-solvable tasks: Test with tasks your agent has solved before
- Monitor resources: Agent benchmarks are compute-intensive
- Save trajectories: Review individual runs for debugging
- Track over time: Compare results across agent versions
## Next Steps
- Batch Processing - General batch workflow patterns
- LLM Evaluation - Evaluate agent outputs with judges