Custom Benchmarks: Runtime Agent and Dataset Upload

This guide explains how to create and submit custom agents and datasets to Jetty for execution via the Terminal-Bench evaluation framework. You can upload a zip file containing your custom agent code and dataset, allowing you to run benchmarks without pre-registering them.

Overview

Jetty's harbor_terminal_bench step supports runtime upload of both custom agents and datasets. This allows you to:

  • Run benchmarks on datasets you've created locally
  • Test custom agents against any task
  • Iterate quickly without publishing to a registry
  • Bundle everything needed for evaluation in a single zip file

| Component | Purpose | Format |
| --- | --- | --- |
| Dataset | Task definitions with environments, tests, and solutions | Directory structure with task.toml |
| Agent | Custom AI agent implementation | Python class extending Harbor's base agent |
| agents.json | Agent registry defining import paths and configuration | JSON configuration file |

Payload Structure

Your zip file should contain the following structure:

my-payload/
├── agents/
│   ├── __init__.py          # Package init (re-export your agents)
│   ├── agents.json          # Agent definitions
│   └── my_agent/            # Agent implementation folder
│       ├── __init__.py
│       ├── cli_agent.py     # Agent class implementation
│       └── install.sh.j2    # Jinja2 template for agent installation
└── datasets/
    ├── __init__.py
    └── my_dataset/          # Dataset name (matches "dataset" param)
        └── my_task/         # Task name (matches "task_name" param)
            ├── task.toml
            ├── instruction.md
            ├── environment/
            │   ├── Dockerfile
            │   └── setup.sh
            ├── tests/
            │   ├── test.sh
            │   └── test_outputs.py
            └── solution/
                └── solve.sh

Creating a Custom Dataset

Task Structure

Each task is a directory containing everything needed to evaluate an agent:

my_task/
├── task.toml            # Task metadata and configuration
├── instruction.md       # The task instruction given to the agent
├── environment/         # Docker environment setup
│   ├── Dockerfile       # Base environment definition
│   └── setup.sh         # Additional setup commands
├── tests/               # Verification tests
│   ├── test.sh          # Test runner script
│   └── test_outputs.py  # Python test implementation
└── solution/            # Reference solution (optional)
    └── solve.sh         # Commands that solve the task

task.toml

The task metadata file defines how the task should be run:

[task]
name = "hello-world"
description = "Create a simple hello world HTML page"
difficulty = "easy"

[task.environment]
type = "docker"
dockerfile = "environment/Dockerfile"
setup_script = "environment/setup.sh"

[task.verification]
test_script = "tests/test.sh"
timeout_seconds = 120

instruction.md

The instruction given to the agent. Be clear and specific:

Create an `index.html` file in the `/app` directory with a simple "Hello, World!" message.

The page should:
1. Be valid HTML5
2. Display "Hello, World!" as the main heading
3. Include a basic document structure

Save your work to `/app/index.html`.

Environment Setup

Dockerfile

Define the base environment your task runs in:

FROM python:3.12-slim

WORKDIR /app

# Install any required dependencies
RUN apt-get update && apt-get install -y \
    curl \
    && rm -rf /var/lib/apt/lists/*

# Create working directory
RUN mkdir -p /app

setup.sh

Additional setup that runs after the container starts:

#!/bin/bash
# Setup script for the task environment
echo "Environment ready"

Tests

test.sh

The main test runner that returns exit code 0 on success:

#!/bin/bash
set -e

# Run Python tests
python /task/tests/test_outputs.py

echo "All tests passed!"

test_outputs.py

Python tests to verify the agent's work:

"""Tests for hello-world task."""
import os
import sys

def test_index_html_exists():
"""Check that index.html was created."""
assert os.path.exists("/app/index.html"), "index.html not found"

def test_hello_world_content():
"""Check that index.html contains Hello World."""
with open("/app/index.html", "r") as f:
content = f.read()
assert "Hello" in content, "Missing 'Hello' in content"
assert "World" in content, "Missing 'World' in content"

if __name__ == "__main__":
try:
test_index_html_exists()
test_hello_world_content()
print("All tests passed!")
sys.exit(0)
except AssertionError as e:
print(f"Test failed: {e}")
sys.exit(1)

Creating a Custom Agent

Agent Class

Implement a Python class extending Harbor's BaseInstalledAgent:

"""Custom agent implementation."""

import logging
import os
import shlex
from pathlib import Path
from shlex import quote

from harbor.agents.installed.base import BaseInstalledAgent, ExecInput
from harbor.models.agent.context import AgentContext


class MyCustomAgent(BaseInstalledAgent):
"""Custom agent for Terminal-Bench evaluation."""

def __init__(
self,
project_dir: Path = Path("/app"),
**kwargs,
):
"""Initialize the agent.

Args:
project_dir: The working directory for the agent.
**kwargs: Additional configuration passed from agents.json
"""
super().__init__(**kwargs)
self._project_dir = project_dir
self._logger = logging.getLogger(__name__)
self.kwargs = kwargs # Store for use in _template_variables

@staticmethod
def name() -> str:
"""Return the agent name for logging/identification."""
return "my-custom-agent"

@property
def _install_agent_template_path(self) -> Path:
"""Path to the Jinja2 installation template."""
return Path(__file__).parent / "install.sh.j2"

@property
def _template_variables(self) -> dict[str, str]:
"""Variables available in the install template."""
variables = super()._template_variables

# Add custom variables from kwargs
# Example: pass a wheel URL for installing dependencies
if "wheel_url" in self.kwargs:
variables["wheel_url"] = self.kwargs["wheel_url"]

return variables

def populate_context_post_run(self, context: AgentContext) -> None:
"""Called after agent execution to update context."""
pass

def create_run_agent_commands(self, instruction: str) -> list[ExecInput]:
"""Create the commands to run the agent.

Args:
instruction: The task instruction to execute.

Returns:
List of ExecInput commands to execute.
"""
# Build environment with API keys
env = {}
api_keys = [
"ANTHROPIC_API_KEY",
"OPENAI_API_KEY",
"OPENROUTER_API_KEY",
]
for key in api_keys:
value = os.environ.get(key)
if value:
env[key] = value

# Create the command to run your agent
# This example writes instruction to a file and runs a CLI tool
prompt_file = "/tmp/agent_prompt.txt"

full_command = (
f"mkdir -p {quote(str(self._project_dir))} && "
f"echo {shlex.quote(instruction)} > {prompt_file} && "
f"my-agent-cli run --prompt {prompt_file} --workdir {quote(str(self._project_dir))}"
)

return [
ExecInput(
command=full_command,
cwd=str(self._project_dir),
timeout_sec=None,
env=env,
),
]

Installation Template (install.sh.j2)

A Jinja2 template that installs your agent in the Docker environment:

#!/bin/bash
# Install script for custom agent
set -e

echo "Installing custom agent..."

# Install system dependencies
apt-get update
apt-get install -y curl python3-pip

{% if wheel_url %}
# Install from wheel URL if provided
pip install "{{ wheel_url }}"
{% else %}
# Install from PyPI
pip install my-agent-package
{% endif %}

# Verify installation
my-agent-cli --version

echo "Agent installation complete!"

agents.json

Define your agents in agents/agents.json:

{
  "my_agent": {
    "import_path": "agents.my_agent.cli_agent:MyCustomAgent",
    "kwargs": {
      "wheel_url": "https://example.com/my-agent-0.1.0-py3-none-any.whl",
      "custom_option": "value"
    }
  }
}

| Field | Description |
| --- | --- |
| import_path | Python import path in the format module.submodule:ClassName |
| kwargs | Dictionary of configuration passed to the agent's __init__ |
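
Harbor resolves the import_path string at runtime by importing the module and looking up the class. Conceptually it works like the sketch below; `resolve_import_path` is a simplified illustration, not Harbor's actual code:

```python
import importlib


def resolve_import_path(import_path: str):
    """Split "module.submodule:ClassName" and import the named class."""
    module_path, class_name = import_path.split(":")
    module = importlib.import_module(module_path)
    return getattr(module, class_name)


# The same format works for any importable class, e.g. from the stdlib:
cls = resolve_import_path("json.decoder:JSONDecoder")
```

This is why the `__init__.py` files and folder names matter: if `importlib.import_module("agents.my_agent.cli_agent")` would fail on your machine from the payload root, it will fail in Jetty too.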

Package __init__.py Files

agents/__init__.py

from .my_agent import MyCustomAgent

__all__ = ["MyCustomAgent"]

agents/my_agent/__init__.py

from .cli_agent import MyCustomAgent

__all__ = ["MyCustomAgent"]

Submitting to Jetty

Creating the Zip File

# From your payload directory
cd /path/to/my-payload

# Create zip excluding unnecessary files
zip -r my-payload.zip . \
    -x "*.DS_Store" \
    -x "*__pycache__*" \
    -x "*.pyc" \
    -x "*.git*"

API Request

Submit your payload using a multipart form request:

curl -X POST "https://api.jetty.io/api/v1/run-sync/{collection}/{task}" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -F 'init_params={
    "agent": "my_agent",
    "model": "anthropic/claude-sonnet-4-20250514",
    "dataset": "my_dataset",
    "task_name": "my_task"
  }' \
  -F "files=@my-payload.zip"

Request Parameters

| Parameter | Description | Example |
| --- | --- | --- |
| agent | Agent name (key in agents.json) | "my_agent" |
| model | LLM model for the agent | "anthropic/claude-sonnet-4-20250514" |
| dataset | Dataset folder name | "my_dataset" |
| task_name | Task folder name within the dataset | "my_task" |
| env | Environment type | "sandbox" (default) |
| debug | Enable debug logging | false |
| network_enabled | Allow network access in the container | true |

Async Execution

For long-running tasks, use the async endpoint:

# Start the workflow
# Start the workflow
curl -X POST "https://api.jetty.io/api/v1/run/{collection}/{task}" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -F 'init_params={
    "agent": "my_agent",
    "model": "openai/gpt-4o",
    "dataset": "my_dataset",
    "task_name": "my_task"
  }' \
  -F "files=@my-payload.zip"

# Response includes workflow_id for polling
# {"workflow_id": "collection-task--abc12345", ...}

# Check status
curl "https://api.jetty.io/api/v1/flows/github-action-status/{collection}/{task}/{trajectory_id}" \
  -H "Authorization: Bearer YOUR_API_TOKEN"

Response Structure

Successful Execution

{
  "message": "Bakery workflow executed successfully.",
  "workflow_id": "jettyiodev-my-task--abc12345",
  "trajectory": {
    "status": "completed",
    "steps": {
      "harbor_tbench": {
        "inputs": {
          "agent": "agents.my_agent.cli_agent:MyCustomAgent",
          "dataset": "my_dataset",
          "task_name": "my_task"
        },
        "outputs": {
          "success": true,
          "agent_source": "uploaded",
          "dataset_source": "uploaded",
          "mean_reward": 1.0,
          "n_trials": 1,
          "n_errors": 0,
          "results": [...]
        }
      }
    }
  }
}

Output Fields

| Field | Description |
| --- | --- |
| success | Whether the job completed without errors |
| agent_source | "uploaded" for custom agents, "standard" for built-in |
| dataset_source | "uploaded" for custom datasets, "registry" for built-in |
| mean_reward | Average reward across trials (1.0 = all tests passed) |
| n_trials | Number of trials executed |
| n_errors | Number of errors encountered |
| results | Detailed trial results, including verifier output |
| files | Artifacts saved to storage (logs, results, etc.) |
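
When scripting runs, you typically want just the headline metrics out of the response. This follows the response shape shown above; `summarize` is an illustrative helper, not part of any Jetty client:

```python
def summarize(response: dict) -> str:
    """Extract the headline metrics from a run-sync response body."""
    outputs = response["trajectory"]["steps"]["harbor_tbench"]["outputs"]
    return (
        f"success={outputs['success']} "
        f"mean_reward={outputs['mean_reward']} "
        f"trials={outputs['n_trials']} errors={outputs['n_errors']}"
    )
```

For the example response above this yields `success=True mean_reward=1.0 trials=1 errors=0`, which is convenient for CI logs or pass/fail gates.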

Best Practices

Dataset Design

  1. Clear Instructions: Write unambiguous task instructions
  2. Comprehensive Tests: Test all requirements, not just the happy path
  3. Minimal Environment: Include only necessary dependencies in Dockerfile
  4. Reproducible Setup: Ensure setup scripts are idempotent

Agent Implementation

  1. Error Handling: Handle failures gracefully in your agent
  2. Timeout Awareness: Design for potential timeouts
  3. Clean Outputs: Avoid excessive logging that clutters results
  4. Environment Variables: Use standard API key environment variables

Packaging

  1. Exclude Unnecessary Files: Don't include .git, __pycache__, etc.
  2. Test Locally First: Verify your agent works before uploading
  3. Version Your Payloads: Keep track of payload versions
  4. Document Dependencies: List all required packages

Troubleshooting

Common Errors

"No module named 'agents.xxx'"

Cause: Missing or incorrect __init__.py files in your agents directory.

Solution: Ensure all directories have __init__.py files with proper imports:

# agents/__init__.py
from .my_agent import MyCustomAgent
__all__ = ["MyCustomAgent"]

"Dataset 'xxx' not found in uploaded zip"

Cause: The dataset folder name doesn't match the dataset parameter.

Solution: Verify the folder structure:

datasets/
└── my_dataset/      # Must match "dataset": "my_dataset"
    └── my_task/     # Must match "task_name": "my_task"

"'MyAgent' object has no attribute 'kwargs'"

Cause: Agent class doesn't store kwargs from __init__.

Solution: Add self.kwargs = kwargs in your agent's __init__:

def __init__(self, **kwargs):
    super().__init__(**kwargs)
    self.kwargs = kwargs  # Add this line

"Invalid agent import path"

Cause: PYTHONPATH issue or incorrect import path in agents.json.

Solution:

  1. Verify the import path matches your file structure
  2. Ensure all __init__.py files exist
  3. Check for circular imports

Debugging Tips

  1. Enable Debug Mode: Set "debug": true in init_params
  2. Check Artifacts: Review saved log files in the response
  3. Test Components Separately: Test your agent locally first
  4. Verify Docker Environment: Ensure your Dockerfile builds correctly

Example: Complete Payload

Here's a complete example payload structure:

example-payload/
├── agents/
│   ├── __init__.py
│   ├── agents.json
│   └── simple_agent/
│       ├── __init__.py
│       ├── agent.py
│       └── install.sh.j2
└── datasets/
    ├── __init__.py
    └── simple/
        └── hello-world/
            ├── task.toml
            ├── instruction.md
            ├── environment/
            │   ├── Dockerfile
            │   └── setup.sh
            ├── tests/
            │   ├── test.sh
            │   └── test_outputs.py
            └── solution/
                └── solve.sh

agents.json:

{
  "simple": {
    "import_path": "agents.simple_agent.agent:SimpleAgent",
    "kwargs": {}
  }
}

Run command:

curl -X POST "https://api.jetty.io/api/v1/run-sync/myorg/benchmark" \
  -H "Authorization: Bearer $JETTY_API_TOKEN" \
  -F 'init_params={"agent": "simple", "dataset": "simple", "task_name": "hello-world"}' \
  -F "files=@example-payload.zip"