# Custom Benchmarks: Runtime Agent and Dataset Upload
This guide explains how to create and submit custom agents and datasets to Jetty for execution via the Terminal-Bench evaluation framework. You can upload a zip file containing your custom agent code and dataset, allowing you to run benchmarks without pre-registering them.
## Overview
Jetty's `harbor_terminal_bench` step supports runtime upload of both custom agents and datasets. This allows you to:
- Run benchmarks on datasets you've created locally
- Test custom agents against any task
- Iterate quickly without publishing to a registry
- Bundle everything needed for evaluation in a single zip file
| Component | Purpose | Format |
|---|---|---|
| Dataset | Task definitions with environments, tests, and solutions | Directory structure with task.toml |
| Agent | Custom AI agent implementation | Python class extending Harbor's base agent |
| agents.json | Agent registry defining import paths and configuration | JSON configuration file |
## Payload Structure

Your zip file should contain the following structure:

```text
my-payload/
├── agents/
│   ├── __init__.py            # Package init (re-export your agents)
│   ├── agents.json            # Agent definitions
│   └── my_agent/              # Agent implementation folder
│       ├── __init__.py
│       ├── cli_agent.py       # Agent class implementation
│       └── install.sh.j2      # Jinja2 template for agent installation
└── datasets/
    ├── __init__.py
    └── my_dataset/            # Dataset name (matches "dataset" param)
        └── my_task/           # Task name (matches "task_name" param)
            ├── task.toml
            ├── instruction.md
            ├── environment/
            │   ├── Dockerfile
            │   └── setup.sh
            ├── tests/
            │   ├── test.sh
            │   └── test_outputs.py
            └── solution/
                └── solve.sh
```
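Before zipping, it can help to sanity-check this layout locally. A minimal sketch; the `validate_payload` helper and its exact checks are illustrative, not part of Jetty:

```python
from pathlib import Path

# Files every task directory should contain, per the layout above
REQUIRED_TASK_FILES = ["task.toml", "instruction.md"]

def validate_payload(root_dir: str, dataset: str, task_name: str) -> list[str]:
    """Return a list of problems found in a payload directory (empty = OK)."""
    root = Path(root_dir)
    problems = []
    if not (root / "agents" / "agents.json").is_file():
        problems.append("missing agents/agents.json")
    if not (root / "agents" / "__init__.py").is_file():
        problems.append("missing agents/__init__.py")
    task_dir = root / "datasets" / dataset / task_name
    if not task_dir.is_dir():
        problems.append(f"missing task directory datasets/{dataset}/{task_name}")
    else:
        for name in REQUIRED_TASK_FILES:
            if not (task_dir / name).is_file():
                problems.append(f"missing datasets/{dataset}/{task_name}/{name}")
    return problems
```

Running this before every upload catches the most common "not found in uploaded zip" errors early.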
## Creating a Custom Dataset

### Task Structure

Each task is a directory containing everything needed to evaluate an agent:

```text
my_task/
├── task.toml              # Task metadata and configuration
├── instruction.md         # The task instruction given to the agent
├── environment/           # Docker environment setup
│   ├── Dockerfile         # Base environment definition
│   └── setup.sh           # Additional setup commands
├── tests/                 # Verification tests
│   ├── test.sh            # Test runner script
│   └── test_outputs.py    # Python test implementation
└── solution/              # Reference solution (optional)
    └── solve.sh           # Commands that solve the task
```
### task.toml

The task metadata file defines how the task should be run:

```toml
[task]
name = "hello-world"
description = "Create a simple hello world HTML page"
difficulty = "easy"

[task.environment]
type = "docker"
dockerfile = "environment/Dockerfile"
setup_script = "environment/setup.sh"

[task.verification]
test_script = "tests/test.sh"
timeout_seconds = 120
```
### instruction.md

The instruction given to the agent. Be clear and specific:

```markdown
Create an `index.html` file in the `/app` directory with a simple
"Hello, World!" message.

The page should:

1. Be valid HTML5
2. Display "Hello, World!" as the main heading
3. Include a basic document structure

Save your work to `/app/index.html`.
```
### Environment Setup

#### Dockerfile

Define the base environment your task runs in:

```dockerfile
FROM python:3.12-slim

WORKDIR /app

# Install any required dependencies
RUN apt-get update && apt-get install -y \
    curl \
    && rm -rf /var/lib/apt/lists/*

# Create working directory
RUN mkdir -p /app
```
#### setup.sh

Additional setup that runs after the container starts:

```bash
#!/bin/bash
# Setup script for the task environment
echo "Environment ready"
```
### Tests

#### test.sh

The main test runner that returns exit code 0 on success:

```bash
#!/bin/bash
set -e

# Run Python tests
python /task/tests/test_outputs.py

echo "All tests passed!"
```
#### test_outputs.py

Python tests to verify the agent's work:

```python
"""Tests for hello-world task."""

import os
import sys


def test_index_html_exists():
    """Check that index.html was created."""
    assert os.path.exists("/app/index.html"), "index.html not found"


def test_hello_world_content():
    """Check that index.html contains Hello World."""
    with open("/app/index.html", "r") as f:
        content = f.read()
    assert "Hello" in content, "Missing 'Hello' in content"
    assert "World" in content, "Missing 'World' in content"


if __name__ == "__main__":
    try:
        test_index_html_exists()
        test_hello_world_content()
        print("All tests passed!")
        sys.exit(0)
    except AssertionError as e:
        print(f"Test failed: {e}")
        sys.exit(1)
```
## Creating a Custom Agent

### Agent Class

Implement a Python class extending Harbor's `BaseInstalledAgent`:

```python
"""Custom agent implementation."""

import logging
import os
import shlex
from pathlib import Path

from harbor.agents.installed.base import BaseInstalledAgent, ExecInput
from harbor.models.agent.context import AgentContext


class MyCustomAgent(BaseInstalledAgent):
    """Custom agent for Terminal-Bench evaluation."""

    def __init__(
        self,
        project_dir: Path = Path("/app"),
        **kwargs,
    ):
        """Initialize the agent.

        Args:
            project_dir: The working directory for the agent.
            **kwargs: Additional configuration passed from agents.json.
        """
        super().__init__(**kwargs)
        self._project_dir = project_dir
        self._logger = logging.getLogger(__name__)
        self.kwargs = kwargs  # Store for use in _template_variables

    @staticmethod
    def name() -> str:
        """Return the agent name for logging/identification."""
        return "my-custom-agent"

    @property
    def _install_agent_template_path(self) -> Path:
        """Path to the Jinja2 installation template."""
        return Path(__file__).parent / "install.sh.j2"

    @property
    def _template_variables(self) -> dict[str, str]:
        """Variables available in the install template."""
        variables = super()._template_variables
        # Add custom variables from kwargs.
        # Example: pass a wheel URL for installing dependencies.
        if "wheel_url" in self.kwargs:
            variables["wheel_url"] = self.kwargs["wheel_url"]
        return variables

    def populate_context_post_run(self, context: AgentContext) -> None:
        """Called after agent execution to update context."""
        pass

    def create_run_agent_commands(self, instruction: str) -> list[ExecInput]:
        """Create the commands to run the agent.

        Args:
            instruction: The task instruction to execute.

        Returns:
            List of ExecInput commands to execute.
        """
        # Build environment with API keys
        env = {}
        api_keys = [
            "ANTHROPIC_API_KEY",
            "OPENAI_API_KEY",
            "OPENROUTER_API_KEY",
        ]
        for key in api_keys:
            value = os.environ.get(key)
            if value:
                env[key] = value

        # Create the command to run your agent.
        # This example writes the instruction to a file and runs a CLI tool.
        prompt_file = "/tmp/agent_prompt.txt"
        workdir = shlex.quote(str(self._project_dir))
        full_command = (
            f"mkdir -p {workdir} && "
            f"echo {shlex.quote(instruction)} > {prompt_file} && "
            f"my-agent-cli run --prompt {prompt_file} --workdir {workdir}"
        )

        return [
            ExecInput(
                command=full_command,
                cwd=str(self._project_dir),
                timeout_sec=None,
                env=env,
            ),
        ]
```
### Installation Template (install.sh.j2)

A Jinja2 template that installs your agent in the Docker environment:

```bash
#!/bin/bash
# Install script for custom agent
set -e

echo "Installing custom agent..."

# Install system dependencies
apt-get update
apt-get install -y curl python3-pip

{% if wheel_url %}
# Install from wheel URL if provided
pip install "{{ wheel_url }}"
{% else %}
# Install from PyPI
pip install my-agent-package
{% endif %}

# Verify installation
my-agent-cli --version

echo "Agent installation complete!"
```
### agents.json

Define your agents in `agents/agents.json`:

```json
{
  "my_agent": {
    "import_path": "agents.my_agent.cli_agent:MyCustomAgent",
    "kwargs": {
      "wheel_url": "https://example.com/my-agent-0.1.0-py3-none-any.whl",
      "custom_option": "value"
    }
  }
}
```

| Field | Description |
|---|---|
| `import_path` | Python import path in the format `module.submodule:ClassName` |
| `kwargs` | Dictionary of configuration passed to the agent's `__init__` |
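The `module.submodule:ClassName` convention is straightforward to resolve with `importlib`. A sketch of how such a path might be loaded, demonstrated against a standard-library class rather than Harbor's loader, whose internals may differ:

```python
import importlib

def resolve_import_path(import_path: str):
    """Load 'pkg.module:ClassName' and return the class object."""
    module_name, _, class_name = import_path.partition(":")
    if not class_name:
        raise ValueError(f"expected 'module:ClassName', got {import_path!r}")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)

# Example with a standard-library class:
decoder_cls = resolve_import_path("json.decoder:JSONDecoder")
```

Trying this locally against your own `agents.my_agent.cli_agent:MyCustomAgent` path is a quick way to catch import errors before upload.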
### Package `__init__.py` Files

`agents/__init__.py`:

```python
from .my_agent import MyCustomAgent

__all__ = ["MyCustomAgent"]
```

`agents/my_agent/__init__.py`:

```python
from .cli_agent import MyCustomAgent

__all__ = ["MyCustomAgent"]
```
## Submitting to Jetty

### Creating the Zip File

```bash
# From your payload directory
cd /path/to/my-payload

# Create zip excluding unnecessary files
zip -r my-payload.zip . \
  -x "*.DS_Store" \
  -x "*__pycache__*" \
  -x "*.pyc" \
  -x "*.git*"
```
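If you prefer to build the archive from Python, the standard `zipfile` module makes the exclusions explicit. A sketch mirroring the `zip -x` patterns above:

```python
import zipfile
from pathlib import Path

# Path components to skip, matching the zip -x flags above
EXCLUDE_PARTS = {"__pycache__", ".git", ".DS_Store"}

def build_payload_zip(payload_dir: str, zip_path: str) -> list[str]:
    """Zip payload_dir, skipping caches and VCS files; return archived names."""
    payload = Path(payload_dir)
    written = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(payload.rglob("*")):
            if path.is_dir():
                continue
            rel = path.relative_to(payload)
            if path.suffix == ".pyc" or EXCLUDE_PARTS & set(rel.parts):
                continue
            zf.write(path, rel.as_posix())  # POSIX names keep the zip portable
            written.append(rel.as_posix())
    return written
```

Because the exclusions live in one set, it is easy to extend them (e.g. with `.venv`) without re-reading `zip`'s pattern syntax.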
### API Request

Submit your payload using a multipart form request:

```bash
curl -X POST "https://api.jetty.io/api/v1/run-sync/{collection}/{task}" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -F 'init_params={
    "agent": "my_agent",
    "model": "anthropic/claude-sonnet-4-20250514",
    "dataset": "my_dataset",
    "task_name": "my_task"
  }' \
  -F "files=@my-payload.zip"
```
### Request Parameters

| Parameter | Description | Example |
|---|---|---|
| `agent` | Agent name (key in `agents.json`) | `"my_agent"` |
| `model` | LLM model for the agent | `"anthropic/claude-sonnet-4-20250514"` |
| `dataset` | Dataset folder name | `"my_dataset"` |
| `task_name` | Task folder name within the dataset | `"my_task"` |
| `env` | Environment type | `"sandbox"` (default) |
| `debug` | Enable debug logging | `false` |
| `network_enabled` | Allow network access in the container | `true` |
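The same submission can be scripted from Python. This sketch assumes the third-party `requests` package; the endpoint and form-field names are taken from the `curl` example above, and `build_request`/`submit` are illustrative helpers, not a Jetty client API:

```python
import json

JETTY_URL = "https://api.jetty.io/api/v1/run-sync/{collection}/{task}"

def build_request(collection: str, task: str, token: str, init_params: dict) -> dict:
    """Assemble the URL, headers, and form data for a run-sync submission."""
    return {
        "url": JETTY_URL.format(collection=collection, task=task),
        "headers": {"Authorization": f"Bearer {token}"},
        "data": {"init_params": json.dumps(init_params)},
    }

def submit(collection: str, task: str, token: str,
           init_params: dict, zip_path: str) -> dict:
    """Send the multipart request (requires `pip install requests`)."""
    import requests
    req = build_request(collection, task, token, init_params)
    with open(zip_path, "rb") as f:
        resp = requests.post(req["url"], headers=req["headers"],
                             data=req["data"], files={"files": f})
    resp.raise_for_status()
    return resp.json()
```

Keeping `build_request` separate from the network call makes the request assembly unit-testable.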
### Async Execution

For long-running tasks, use the async endpoint:

```bash
# Start the workflow
curl -X POST "https://api.jetty.io/api/v1/run/{collection}/{task}" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -F 'init_params={
    "agent": "my_agent",
    "model": "openai/gpt-4o",
    "dataset": "my_dataset",
    "task_name": "my_task"
  }' \
  -F "files=@my-payload.zip"

# Response includes workflow_id for polling
# {"workflow_id": "collection-task--abc12345", ...}

# Check status
curl "https://api.jetty.io/api/v1/flows/github-action-status/{collection}/{task}/{trajectory_id}" \
  -H "Authorization: Bearer YOUR_API_TOKEN"
```
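For the async endpoint, a polling loop stays testable if the status lookup is injected as a callable. The terminal state names below are assumptions; check the actual status values your workflow returns:

```python
import time

def wait_for_completion(fetch_status, poll_interval: float = 5.0,
                        timeout: float = 3600.0,
                        sleep=time.sleep, clock=time.monotonic) -> str:
    """Poll fetch_status() until it returns a terminal state or timeout expires."""
    terminal = {"completed", "failed", "error"}  # assumed terminal states
    deadline = clock() + timeout
    while True:
        status = fetch_status()
        if status in terminal:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"workflow still {status!r} after {timeout}s")
        sleep(poll_interval)
```

Passing `sleep` and `clock` as parameters lets tests run the loop instantly with fakes, while production code uses the defaults.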
## Response Structure

### Successful Execution

```json
{
  "message": "Bakery workflow executed successfully.",
  "workflow_id": "jettyiodev-my-task--abc12345",
  "trajectory": {
    "status": "completed",
    "steps": {
      "harbor_tbench": {
        "inputs": {
          "agent": "agents.my_agent.cli_agent:MyCustomAgent",
          "dataset": "my_dataset",
          "task_name": "my_task"
        },
        "outputs": {
          "success": true,
          "agent_source": "uploaded",
          "dataset_source": "uploaded",
          "mean_reward": 1.0,
          "n_trials": 1,
          "n_errors": 0,
          "results": [...]
        }
      }
    }
  }
}
```
### Output Fields

| Field | Description |
|---|---|
| `success` | Whether the job completed without errors |
| `agent_source` | `"uploaded"` for custom agents, `"standard"` for built-in |
| `dataset_source` | `"uploaded"` for custom datasets, `"registry"` for built-in |
| `mean_reward` | Average reward across trials (1.0 = all tests passed) |
| `n_trials` | Number of trials executed |
| `n_errors` | Number of errors encountered |
| `results` | Detailed trial results including verifier output |
| `files` | Artifacts saved to storage (logs, results, etc.) |
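Once you have the JSON response, pulling out the headline metrics is a dictionary walk. A sketch keyed to the sample response above; the default step name `harbor_tbench` comes from that sample:

```python
def summarize(response: dict, step: str = "harbor_tbench") -> dict:
    """Extract the headline metrics from a run-sync response."""
    outputs = response["trajectory"]["steps"][step]["outputs"]
    return {
        "success": outputs["success"],
        "mean_reward": outputs["mean_reward"],
        "n_trials": outputs["n_trials"],
        "n_errors": outputs["n_errors"],
    }
```

This is handy in CI, where you typically only want to fail the build when `success` is false or `mean_reward` drops below a threshold.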
## Best Practices

### Dataset Design
- Clear Instructions: Write unambiguous task instructions
- Comprehensive Tests: Test all requirements, not just the happy path
- Minimal Environment: Include only necessary dependencies in Dockerfile
- Reproducible Setup: Ensure setup scripts are idempotent
### Agent Implementation
- Error Handling: Handle failures gracefully in your agent
- Timeout Awareness: Design for potential timeouts
- Clean Outputs: Avoid excessive logging that clutters results
- Environment Variables: Use standard API key environment variables
### Packaging

- Exclude Unnecessary Files: Don't include `.git`, `__pycache__`, etc.
- Test Locally First: Verify your agent works before uploading
- Version Your Payloads: Keep track of payload versions
- Document Dependencies: List all required packages
## Troubleshooting

### Common Errors

#### "No module named 'agents.xxx'"

Cause: Missing or incorrect `__init__.py` files in your agents directory.

Solution: Ensure all directories have `__init__.py` files with proper imports:

```python
# agents/__init__.py
from .my_agent import MyCustomAgent

__all__ = ["MyCustomAgent"]
```
#### "Dataset 'xxx' not found in uploaded zip"

Cause: The dataset folder name doesn't match the `dataset` parameter.

Solution: Verify the folder structure:

```text
datasets/
└── my_dataset/    # Must match "dataset": "my_dataset"
    └── my_task/   # Must match "task_name": "my_task"
```
#### "'MyAgent' object has no attribute 'kwargs'"

Cause: The agent class doesn't store kwargs from `__init__`.

Solution: Add `self.kwargs = kwargs` in your agent's `__init__`:

```python
def __init__(self, **kwargs):
    super().__init__(**kwargs)
    self.kwargs = kwargs  # Add this line
```
#### "Invalid agent import path"

Cause: PYTHONPATH issue or incorrect import path in `agents.json`.

Solution:

- Verify the import path matches your file structure
- Ensure all `__init__.py` files exist
- Check for circular imports
### Debugging Tips

- Enable Debug Mode: Set `"debug": true` in `init_params`
- Check Artifacts: Review saved log files in the response
- Test Components Separately: Test your agent locally first
- Verify Docker Environment: Ensure your Dockerfile builds correctly
## Example: Complete Payload

Here's a complete example payload structure:

```text
example-payload/
├── agents/
│   ├── __init__.py
│   ├── agents.json
│   └── simple_agent/
│       ├── __init__.py
│       ├── agent.py
│       └── install.sh.j2
└── datasets/
    ├── __init__.py
    └── simple/
        └── hello-world/
            ├── task.toml
            ├── instruction.md
            ├── environment/
            │   ├── Dockerfile
            │   └── setup.sh
            ├── tests/
            │   ├── test.sh
            │   └── test_outputs.py
            └── solution/
                └── solve.sh
```
`agents.json`:

```json
{
  "simple": {
    "import_path": "agents.simple_agent.agent:SimpleAgent",
    "kwargs": {}
  }
}
```
Run command:

```bash
curl -X POST "https://api.jetty.io/api/v1/run-sync/myorg/benchmark" \
  -H "Authorization: Bearer $JETTY_API_TOKEN" \
  -F 'init_params={"agent": "simple", "dataset": "simple", "task_name": "hello-world"}' \
  -F "files=@example-payload.zip"
```
## Related Documentation
- Step Library - Understanding Jetty workflows