Benchmark AI Agents in 10 Minutes

Test your coding agent on real-world tasks with automatic evaluation.

What You'll Build

Test Cases → Run Agent (parallel) → Evaluate Solutions → Summary Report

Prerequisites

  • A coding agent compatible with TerminalBench or SWE-Bench
  • API access to Jetty
  • Configured model API keys

The Workflow

Parent Workflow: benchmark-agent

{
  "init_params": {
    "agent": "your-agent-name",
    "model": "anthropic/claude-sonnet-4-20250514",
    "test_cases": [
      "django__django-11119",
      "astropy__astropy-7166",
      "django__django-10880"
    ]
  },
  "step_configs": {
    "run_tests": {
      "activity": "list_emit_await",
      "items_path": "init_params.test_cases",
      "task_reference": {
        "task_name": "run-single-test"
      },
      "data_mapping": {
        "agent": "{{ init_params.agent }}",
        "model": "{{ init_params.model }}",
        "task_id": "{{ $item }}"
      },
      "execution_config": {
        "max_parallel": 3,
        "timeout_seconds": 35000
      }
    },
    "collect_results": {
      "activity": "extract_from_trajectories",
      "trajectory_list_path": "run_tests.outputs.trajectory_references",
      "extract_keys": {
        "task_id": "init_params.task_id",
        "resolved": "eval.outputs.resolved_instances",
        "errors": "tbench.outputs.n_errors"
      }
    },
    "summarize": {
      "model": "gemini/gemini-2.5-flash",
      "activity": "litellm_chat",
      "user_prompt": "Generate a markdown summary of these benchmark results:\n\n{{ collect_results.outputs.extracted_data | tojson }}\n\nInclude: pass rate, common failure patterns, and recommendations."
    }
  },
  "steps": ["run_tests", "collect_results", "summarize"]
}
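To make the fan-out step concrete, here is a minimal Python sketch of what `list_emit_await` does with `data_mapping`: it spawns one child workflow per list item, rendering `{{ $item }}` to the current element and `{{ init_params.* }}` from the parent's parameters. The `render` and `run_child` names are illustrative stand-ins, not the actual Jetty API.

```python
# Illustrative sketch of list_emit_await fan-out semantics.
# "render" and "run_child" are hypothetical names, not Jetty APIs.

def render(template: str, item, init_params: dict) -> str:
    """Resolve a data_mapping template: $item is the current list element."""
    if template == "{{ $item }}":
        return item
    key = template.strip("{} ").removeprefix("init_params.")
    return init_params[key]

def list_emit_await(init_params: dict, data_mapping: dict, run_child):
    """Spawn one child workflow per test case and collect all results."""
    results = []
    for item in init_params["test_cases"]:
        child_params = {k: render(v, item, init_params)
                        for k, v in data_mapping.items()}
        results.append(run_child(child_params))
    return results
```

With three test cases and `max_parallel: 3`, all three children run concurrently; the sketch above shows only the mapping, not the scheduling.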

Child Workflow: run-single-test

{
  "init_params": {
    "agent": "",
    "model": "",
    "task_id": ""
  },
  "step_configs": {
    "tbench": {
      "activity": "terminal_bench",
      "agent_path": "init_params.agent",
      "model_path": "init_params.model",
      "task_id_path": "init_params.task_id",
      "env": "swebench"
    },
    "eval": {
      "activity": "swe_bench_docker_eval",
      "predictions_path": "tbench.outputs.predictions",
      "task_id_path": "init_params.task_id"
    }
  },
  "steps": ["tbench", "eval"]
}

What This Does

  1. Runs your agent on multiple SWE-Bench tasks in parallel
  2. Evaluates each solution against the official test suite
  3. Generates a summary of pass/fail rates with analysis
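As a sketch of step 3, the pass rate the summary step reports can be computed from the collected rows in plain Python (this runs outside the workflow engine; it just mirrors the arithmetic in the example output below):

```python
def pass_rate(extracted_data: list[dict]) -> float:
    """Fraction of tasks resolved (resolved is 1 for pass, 0 for fail)."""
    resolved = sum(row["resolved"] for row in extracted_data)
    return resolved / len(extracted_data)

rows = [
    {"task_id": "django__django-11119", "resolved": 1, "errors": 0},
    {"task_id": "astropy__astropy-7166", "resolved": 0, "errors": 2},
    {"task_id": "django__django-10880", "resolved": 1, "errors": 0},
]
print(f"{pass_rate(rows):.1%}")  # 66.7%
```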

Supported Benchmarks

| Benchmark | Activity | Description |
| --- | --- | --- |
| SWE-Bench | terminal_bench + swe_bench_docker_eval | Real GitHub issues |
| TerminalBench | harbor_terminal_bench | Terminal-based coding tasks |
| Custom | Build your own | Any Docker-based evaluation |

Harbor Terminal Bench

For container-based agent evaluation:

{
  "init_params": {
    "agent": "your-agent",
    "model": "anthropic/claude-sonnet-4-20250514",
    "dataset": "your-dataset"
  },
  "step_configs": {
    "run": {
      "activity": "harbor_terminal_bench",
      "agent_path": "init_params.agent",
      "model_path": "init_params.model",
      "dataset_path": "init_params.dataset"
    }
  },
  "steps": ["run"]
}

Configure Execution

Control parallelism

{
  "execution_config": {
    "max_parallel": 3,
    "timeout_seconds": 35000
  }
}

Agent benchmarks are resource-intensive. Recommended settings:

| Test Count | max_parallel | timeout_seconds |
| --- | --- | --- |
| 1-5 | 3 | 35000 |
| 10-50 | 5-10 | 35000 |
| 100+ | 10-20 | 45000 |
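The recommendations above can be encoded as a small helper. This is a sketch of the table's thresholds, not something Jetty enforces; it picks the upper end of each recommended `max_parallel` range.

```python
def execution_config(test_count: int) -> dict:
    """Recommended execution settings by test count (upper end of each range)."""
    if test_count <= 5:
        return {"max_parallel": 3, "timeout_seconds": 35000}
    if test_count <= 50:
        return {"max_parallel": 10, "timeout_seconds": 35000}
    return {"max_parallel": 20, "timeout_seconds": 45000}
```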

Handle long-running tests

Some SWE-Bench tasks take 10+ minutes. Set appropriate timeouts:

{
  "execution_config": {
    "timeout_seconds": 60000
  }
}

The Output

{
  "collect_results": {
    "outputs": {
      "extracted_data": [
        {"task_id": "django__django-11119", "resolved": 1, "errors": 0},
        {"task_id": "astropy__astropy-7166", "resolved": 0, "errors": 2},
        {"task_id": "django__django-10880", "resolved": 1, "errors": 0}
      ]
    }
  },
  "summarize": {
    "outputs": {
      "content": "## Benchmark Results\n\n**Pass Rate:** 66.7% (2/3)\n\n### Passed\n- django__django-11119\n- django__django-10880\n\n### Failed\n- astropy__astropy-7166 (2 errors)\n\n### Recommendations\n..."
    }
  }
}

SWE-Bench Lite (Quick Test)

Test with a smaller subset:

{
  "init_params": {
    "test_cases": [
      "sympy__sympy-13031",
      "django__django-11099",
      "requests__requests-3362"
    ]
  }
}

These are known to be solvable and run quickly.


Full SWE-Bench Verified

For comprehensive evaluation:

{
  "init_params": {
    "agent": "your-agent",
    "model": "anthropic/claude-sonnet-4-20250514"
  },
  "step_configs": {
    "load_dataset": {
      "activity": "download_file",
      "url": "https://your-dataset-url/swebench-verified.json"
    },
    "run_all": {
      "activity": "list_emit_await",
      "items_path": "load_dataset.outputs.content",
      "task_reference": {
        "task_name": "run-single-test"
      },
      "data_mapping": {
        "agent": "{{ init_params.agent }}",
        "model": "{{ init_params.model }}",
        "task_id": "{{ $item.instance_id }}"
      },
      "execution_config": {
        "max_parallel": 10,
        "timeout_seconds": 45000
      }
    }
  },
  "steps": ["load_dataset", "run_all"]
}

Compare Multiple Agents

Run the same tests with different agents:

{
  "init_params": {
    "agents": ["agent-v1", "agent-v2", "agent-v3"],
    "model": "anthropic/claude-sonnet-4-20250514",
    "test_case": "django__django-11119"
  },
  "step_configs": {
    "run_all_agents": {
      "activity": "list_emit_await",
      "items_path": "init_params.agents",
      "task_reference": {
        "task_name": "run-single-test"
      },
      "data_mapping": {
        "agent": "{{ $item }}",
        "model": "{{ init_params.model }}",
        "task_id": "{{ init_params.test_case }}"
      }
    },
    "collect": {
      "activity": "extract_from_trajectories",
      "trajectory_list_path": "run_all_agents.outputs.trajectory_references",
      "extract_keys": {
        "agent": "init_params.agent",
        "resolved": "eval.outputs.resolved_instances"
      }
    }
  },
  "steps": ["run_all_agents", "collect"]
}
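Once the `collect` step has run, its extracted rows can be rendered as a quick side-by-side view. A minimal sketch (plain Python over the collected data, with the row shape assumed from the `extract_keys` above):

```python
def comparison_table(rows: list[dict]) -> str:
    """Render per-agent pass/fail from the collect step's extracted rows."""
    lines = ["agent        result"]
    for row in rows:
        mark = "pass" if row["resolved"] else "fail"
        lines.append(f"{row['agent']:<12} {mark}")
    return "\n".join(lines)

rows = [
    {"agent": "agent-v1", "resolved": 1},
    {"agent": "agent-v2", "resolved": 0},
    {"agent": "agent-v3", "resolved": 1},
]
print(comparison_table(rows))
```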

Best Practices

  1. Start with 3-5 tests: Verify your setup before scaling
  2. Use known-solvable tasks: Test with tasks your agent has solved before
  3. Monitor resources: Agent benchmarks are compute-intensive
  4. Save trajectories: Review individual runs for debugging
  5. Track over time: Compare results across agent versions

Next Steps