Custom Benchmarks: Runtime Agent and Dataset Upload

This guide explains how to create and submit custom agents and datasets to Jetty for execution via the Terminal-Bench evaluation framework. You can upload a zip file containing your custom agent code and dataset, allowing you to run benchmarks without pre-registering them.

Overview

Jetty's harbor_terminal_bench step supports runtime upload of both custom agents and datasets. This allows you to:

  • Run benchmarks on datasets you've created locally
  • Test custom agents against any task
  • Iterate quickly without publishing to a registry
  • Bundle everything needed for evaluation in a single zip file

| Component | Purpose | Format |
| --- | --- | --- |
| Dataset | Task definitions with environments, tests, and solutions | Directory structure with task.toml |
| Agent | Custom AI agent implementation | Python class extending Harbor's base agent |
| agents.json | Agent registry defining import paths and configuration | JSON configuration file |

Payload Structure

Your zip file should contain the following structure:

my-payload/
├── agents/
│   ├── __init__.py          # Package init (re-export your agents)
│   ├── agents.json          # Agent definitions
│   └── my_agent/            # Agent implementation folder
│       ├── __init__.py
│       ├── cli_agent.py     # Agent class implementation
│       └── install.sh.j2    # Jinja2 template for agent installation
└── datasets/
    ├── __init__.py
    └── my_dataset/          # Dataset name (matches "dataset" param)
        └── my_task/         # Task name (matches "task_name" param)
            ├── task.toml
            ├── instruction.md
            ├── environment/
            │   ├── Dockerfile
            │   └── setup.sh
            ├── tests/
            │   ├── test.sh
            │   └── test_outputs.py
            └── solution/
                └── solve.sh

Creating a Custom Dataset

Task Structure

Each task is a directory containing everything needed to evaluate an agent:

my_task/
├── task.toml            # Task metadata and configuration
├── instruction.md       # The task instruction given to the agent
├── environment/         # Docker environment setup
│   ├── Dockerfile       # Base environment definition
│   └── setup.sh         # Additional setup commands
├── tests/               # Verification tests
│   ├── test.sh          # Test runner script
│   └── test_outputs.py  # Python test implementation
└── solution/            # Reference solution (optional)
    └── solve.sh         # Commands that solve the task

task.toml

The task metadata file defines how the task should be run:

[task]
name = "hello-world"
description = "Create a simple hello world HTML page"
difficulty = "easy"

[task.environment]
type = "docker"
dockerfile = "environment/Dockerfile"
setup_script = "environment/setup.sh"

[task.verification]
test_script = "tests/test.sh"
timeout_seconds = 120

instruction.md

The instruction given to the agent. Be clear and specific:

Create an `index.html` file in the `/app` directory with a simple "Hello, World!" message.

The page should:
1. Be valid HTML5
2. Display "Hello, World!" as the main heading
3. Include a basic document structure

Save your work to `/app/index.html`.

Environment Setup

Dockerfile

Define the base environment your task runs in:

FROM python:3.12-slim

WORKDIR /app

# Install any required dependencies
RUN apt-get update && apt-get install -y \
    curl \
    && rm -rf /var/lib/apt/lists/*

# Create working directory
RUN mkdir -p /app

setup.sh

Additional setup that runs after the container starts:

#!/bin/bash
# Setup script for the task environment
echo "Environment ready"

Tests

test.sh

The main test runner that returns exit code 0 on success:

#!/bin/bash
set -e

# Run Python tests
python /task/tests/test_outputs.py

echo "All tests passed!"

test_outputs.py

Python tests to verify the agent's work:

"""Tests for hello-world task."""
import os
import sys

def test_index_html_exists():
"""Check that index.html was created."""
assert os.path.exists("/app/index.html"), "index.html not found"

def test_hello_world_content():
"""Check that index.html contains Hello World."""
with open("/app/index.html", "r") as f:
content = f.read()
assert "Hello" in content, "Missing 'Hello' in content"
assert "World" in content, "Missing 'World' in content"

if __name__ == "__main__":
try:
test_index_html_exists()
test_hello_world_content()
print("All tests passed!")
sys.exit(0)
except AssertionError as e:
print(f"Test failed: {e}")
sys.exit(1)

Creating a Custom Agent

Agent Class

Implement a Python class extending Harbor's BaseInstalledAgent:

"""Custom agent implementation."""

import logging
import os
import shlex
from pathlib import Path
from shlex import quote

from harbor.agents.installed.base import BaseInstalledAgent, ExecInput
from harbor.models.agent.context import AgentContext


class MyCustomAgent(BaseInstalledAgent):
"""Custom agent for Terminal-Bench evaluation."""

def __init__(
self,
project_dir: Path = Path("/app"),
**kwargs,
):
"""Initialize the agent.

Args:
project_dir: The working directory for the agent.
**kwargs: Additional configuration passed from agents.json
"""
super().__init__(**kwargs)
self._project_dir = project_dir
self._logger = logging.getLogger(__name__)
self.kwargs = kwargs # Store for use in _template_variables

@staticmethod
def name() -> str:
"""Return the agent name for logging/identification."""
return "my-custom-agent"

@property
def _install_agent_template_path(self) -> Path:
"""Path to the Jinja2 installation template."""
return Path(__file__).parent / "install.sh.j2"

@property
def _template_variables(self) -> dict[str, str]:
"""Variables available in the install template."""
variables = super()._template_variables

# Add custom variables from kwargs
# Example: pass a wheel URL for installing dependencies
if "wheel_url" in self.kwargs:
variables["wheel_url"] = self.kwargs["wheel_url"]

return variables

def populate_context_post_run(self, context: AgentContext) -> None:
"""Called after agent execution to update context."""
pass

def create_run_agent_commands(self, instruction: str) -> list[ExecInput]:
"""Create the commands to run the agent.

Args:
instruction: The task instruction to execute.

Returns:
List of ExecInput commands to execute.
"""
# Build environment with API keys
env = {}
api_keys = [
"ANTHROPIC_API_KEY",
"OPENAI_API_KEY",
"OPENROUTER_API_KEY",
]
for key in api_keys:
value = os.environ.get(key)
if value:
env[key] = value

# Create the command to run your agent
# This example writes instruction to a file and runs a CLI tool
prompt_file = "/tmp/agent_prompt.txt"

full_command = (
f"mkdir -p {quote(str(self._project_dir))} && "
f"echo {shlex.quote(instruction)} > {prompt_file} && "
f"my-agent-cli run --prompt {prompt_file} --workdir {quote(str(self._project_dir))}"
)

return [
ExecInput(
command=full_command,
cwd=str(self._project_dir),
timeout_sec=None,
env=env,
),
]

Installation Template (install.sh.j2)

A Jinja2 template that installs your agent in the Docker environment:

#!/bin/bash
# Install script for custom agent
set -e

echo "Installing custom agent..."

# Install system dependencies
apt-get update
apt-get install -y curl python3-pip

{% if wheel_url %}
# Install from wheel URL if provided
pip install "{{ wheel_url }}"
{% else %}
# Install from PyPI
pip install my-agent-package
{% endif %}

# Verify installation
my-agent-cli --version

echo "Agent installation complete!"

agents.json

Define your agents in agents/agents.json:

{
  "my_agent": {
    "import_path": "agents.my_agent.cli_agent:MyCustomAgent",
    "kwargs": {
      "wheel_url": "https://example.com/my-agent-0.1.0-py3-none-any.whl",
      "custom_option": "value"
    }
  }
}

| Field | Description |
| --- | --- |
| import_path | Python import path in the format module.submodule:ClassName |
| kwargs | Dictionary of configuration passed to the agent's __init__ |
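
Harbor resolves the import_path string at runtime by importing the module and looking up the class. Conceptually it works like the sketch below; `resolve_import_path` is a simplified illustration, not Harbor's actual code:

```python
import importlib


def resolve_import_path(import_path: str):
    """Split "module.submodule:ClassName" and import the named class."""
    module_path, class_name = import_path.split(":")
    module = importlib.import_module(module_path)
    return getattr(module, class_name)


# The same format works for any importable class, e.g. from the stdlib:
cls = resolve_import_path("json.decoder:JSONDecoder")
```

This is why the `__init__.py` files and folder names matter: if `importlib.import_module("agents.my_agent.cli_agent")` would fail on your machine from the payload root, it will fail in Jetty too.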

Package __init__.py Files

agents/__init__.py

from .my_agent import MyCustomAgent

__all__ = ["MyCustomAgent"]

agents/my_agent/__init__.py

from .cli_agent import MyCustomAgent

__all__ = ["MyCustomAgent"]

Submitting to Jetty

Creating the Zip File

# From your payload directory
cd /path/to/my-payload

# Create zip excluding unnecessary files
zip -r my-payload.zip . \
    -x "*.DS_Store" \
    -x "*__pycache__*" \
    -x "*.pyc" \
    -x "*.git*"

API Request

Submit your payload using a multipart form request:

curl -X POST "https://api.jetty.io/api/v1/run-sync/{collection}/{task}" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -F 'init_params={
    "agent": "my_agent",
    "model": "anthropic/claude-sonnet-4-20250514",
    "dataset": "my_dataset",
    "task_name": "my_task"
  }' \
  -F "files=@my-payload.zip"

Request Parameters

| Parameter | Description | Example |
| --- | --- | --- |
| agent | Agent name (key in agents.json) | "my_agent" |
| model | LLM model for the agent | "anthropic/claude-sonnet-4-20250514" |
| dataset | Dataset folder name | "my_dataset" |
| task_name | Task folder name within the dataset | "my_task" |
| env | Environment type | "sandbox" (default) |
| debug | Enable debug logging | false |
| network_enabled | Allow network access in the container | true |

Async Execution

For long-running tasks, use the async endpoint:

# Start the workflow
# Start the workflow
curl -X POST "https://api.jetty.io/api/v1/run/{collection}/{task}" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -F 'init_params={
    "agent": "my_agent",
    "model": "openai/gpt-4o",
    "dataset": "my_dataset",
    "task_name": "my_task"
  }' \
  -F "files=@my-payload.zip"

# Response includes workflow_id for polling
# {"workflow_id": "collection-task--abc12345", ...}

# Check status
curl "https://api.jetty.io/api/v1/flows/github-action-status/{collection}/{task}/{trajectory_id}" \
  -H "Authorization: Bearer YOUR_API_TOKEN"

Response Structure

Successful Execution

{
  "message": "Bakery workflow executed successfully.",
  "workflow_id": "jettyiodev-my-task--abc12345",
  "trajectory": {
    "status": "completed",
    "steps": {
      "harbor_tbench": {
        "inputs": {
          "agent": "agents.my_agent.cli_agent:MyCustomAgent",
          "dataset": "my_dataset",
          "task_name": "my_task"
        },
        "outputs": {
          "success": true,
          "agent_source": "uploaded",
          "dataset_source": "uploaded",
          "mean_reward": 1.0,
          "n_trials": 1,
          "n_errors": 0,
          "results": [...]
        }
      }
    }
  }
}

Output Fields

| Field | Description |
| --- | --- |
| success | Whether the job completed without errors |
| agent_source | "uploaded" for custom agents, "standard" for built-in |
| dataset_source | "uploaded" for custom datasets, "registry" for built-in |
| mean_reward | Average reward across trials (1.0 = all tests passed) |
| n_trials | Number of trials executed |
| n_errors | Number of errors encountered |
| results | Detailed trial results, including verifier output |
| files | Artifacts saved to storage (logs, results, etc.) |
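
When scripting runs, you typically want just the headline metrics out of the response. This follows the response shape shown above; `summarize` is an illustrative helper, not part of any Jetty client:

```python
def summarize(response: dict) -> str:
    """Extract the headline metrics from a run-sync response body."""
    outputs = response["trajectory"]["steps"]["harbor_tbench"]["outputs"]
    return (
        f"success={outputs['success']} "
        f"mean_reward={outputs['mean_reward']} "
        f"trials={outputs['n_trials']} errors={outputs['n_errors']}"
    )
```

For the example response above this yields `success=True mean_reward=1.0 trials=1 errors=0`, which is convenient for CI logs or pass/fail gates.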

Best Practices

Dataset Design

  1. Clear Instructions: Write unambiguous task instructions
  2. Comprehensive Tests: Test all requirements, not just the happy path
  3. Minimal Environment: Include only necessary dependencies in Dockerfile
  4. Reproducible Setup: Ensure setup scripts are idempotent

Agent Implementation

  1. Error Handling: Handle failures gracefully in your agent
  2. Timeout Awareness: Design for potential timeouts
  3. Clean Outputs: Avoid excessive logging that clutters results
  4. Environment Variables: Use standard API key environment variables

Packaging

  1. Exclude Unnecessary Files: Don't include .git, __pycache__, etc.
  2. Test Locally First: Verify your agent works before uploading
  3. Version Your Payloads: Keep track of payload versions
  4. Document Dependencies: List all required packages

Troubleshooting

Common Errors

"No module named 'agents.xxx'"

Cause: Missing or incorrect __init__.py files in your agents directory.

Solution: Ensure all directories have __init__.py files with proper imports:

# agents/__init__.py
from .my_agent import MyCustomAgent
__all__ = ["MyCustomAgent"]

"Dataset 'xxx' not found in uploaded zip"

Cause: The dataset folder name doesn't match the dataset parameter.

Solution: Verify the folder structure:

datasets/
└── my_dataset/      # Must match "dataset": "my_dataset"
    └── my_task/     # Must match "task_name": "my_task"

"'MyAgent' object has no attribute 'kwargs'"

Cause: Agent class doesn't store kwargs from __init__.

Solution: Add self.kwargs = kwargs in your agent's __init__:

def __init__(self, **kwargs):
    super().__init__(**kwargs)
    self.kwargs = kwargs  # Add this line

"Invalid agent import path"

Cause: PYTHONPATH issue or incorrect import path in agents.json.

Solution:

  1. Verify the import path matches your file structure
  2. Ensure all __init__.py files exist
  3. Check for circular imports

Debugging Tips

  1. Enable Debug Mode: Set "debug": true in init_params
  2. Check Artifacts: Review saved log files in the response
  3. Test Components Separately: Test your agent locally first
  4. Verify Docker Environment: Ensure your Dockerfile builds correctly

Example: Complete Payload

Here's a complete example payload structure:

example-payload/
├── agents/
│   ├── __init__.py
│   ├── agents.json
│   └── simple_agent/
│       ├── __init__.py
│       ├── agent.py
│       └── install.sh.j2
└── datasets/
    ├── __init__.py
    └── simple/
        └── hello-world/
            ├── task.toml
            ├── instruction.md
            ├── environment/
            │   ├── Dockerfile
            │   └── setup.sh
            ├── tests/
            │   ├── test.sh
            │   └── test_outputs.py
            └── solution/
                └── solve.sh

agents.json:

{
  "simple": {
    "import_path": "agents.simple_agent.agent:SimpleAgent",
    "kwargs": {}
  }
}

Run command:

curl -X POST "https://api.jetty.io/api/v1/run-sync/myorg/benchmark" \
  -H "Authorization: Bearer $JETTY_API_TOKEN" \
  -F 'init_params={"agent": "simple", "dataset": "simple", "task_name": "hello-world"}' \
  -F "files=@example-payload.zip"