Writing Runbooks

A runbook is a structured markdown document that tells a coding agent (Claude Code, Cursor, Codex, Gemini CLI) how to accomplish a complex, multi-step task end-to-end — with built-in evaluation loops, iteration, and quality gates.

If you haven't run an agent on Jetty yet, start with the Agentic Workflows Quickstart first.

When You Need a Runbook

A simple system prompt works fine for straightforward tasks. A runbook becomes necessary when:

  • The first attempt is rarely sufficient — the agent needs to evaluate and iterate
  • There's a quality bar beyond "did it run?" — pass/fail criteria or a scoring rubric
  • The task produces multiple artifacts that must be consistent with each other
  • The task involves external API calls that fail in domain-specific ways
  • You want to encode domain expertise as tips, common fixes, and evaluation criteria

Where Runbooks Fit

| | Skill | Workflow | Runbook |
|---|-------|----------|---------|
| Format | Markdown (SKILL.md) | JSON (step configs) | Markdown (RUNBOOK.md) |
| Executed by | Coding agent | Jetty engine | Coding agent, calling workflows/APIs |
| Complexity | Single tool or short procedure | Fixed DAG of step templates | Multi-phase process with judgment |
| Iteration | None — one-shot | None — runs to completion | Built-in: evaluate → refine → re-evaluate |
| Output | Varies | Trajectory (structured) | Defined file manifest + summary |

A skill says "here's how to call the Jetty API." A workflow says "run this pipeline." A runbook says "accomplish this outcome — and here's how to know when you're done."

A runbook often calls workflows or raw APIs as sub-steps within its larger process.

Anatomy of a Runbook

Every well-formed runbook follows a canonical structure. Here's each section and why it matters.

Frontmatter

```yaml
---
version: "1.0.0"
evaluation: programmatic
agent: claude-code       # claude-code | codex | gemini-cli
model: claude-sonnet-4-6 # Model for the agent runtime
snapshot: python312-uv   # python312-uv | prism-playwright | custom image URL
secrets:
  # EXAMPLE_API_KEY:
  #   env: EXAMPLE_API_KEY
  #   description: "API key for ..."
  #   required: true
---
```

Two required fields:

  • version — Semantic version. Bump when evaluation criteria, steps, or the output manifest change.
  • evaluation — Either programmatic (objective pass/fail) or rubric (scored against criteria). See Evaluation Patterns below.

Three recommended fields:

  • agent — The agent runtime: claude-code, codex, or gemini-cli. If omitted, Jetty infers from the model name.
  • model — The LLM model powering the agent. Examples: claude-sonnet-4-6, gpt-5.4, gemini-3.1-pro-preview.
  • snapshot — The sandbox environment: python312-uv (default, lightweight) or prism-playwright (includes Playwright + Chromium for browser tasks). You can also specify a custom container image URL.

Optional field:

  • secrets — Declares API keys and credentials the runbook needs. See Secrets below.
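For illustration, the required fields can be machine-checked before a run. This is a hypothetical sketch (not part of the Jetty toolchain) that extracts flat frontmatter keys with the standard library:

```python
import re

def parse_frontmatter(text: str) -> dict:
    """Extract flat key: value pairs from the frontmatter block between --- markers."""
    match = re.match(r"^---\n(.*?)\n---", text, re.DOTALL)
    if not match:
        raise ValueError("runbook is missing frontmatter")
    fields = {}
    for line in match.group(1).splitlines():
        line = line.split("#", 1)[0]                       # drop inline comments
        if ":" in line and not line.startswith((" ", "\t")):
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip().strip('"')
    return fields

fm = parse_frontmatter('---\nversion: "1.0.0"\nevaluation: programmatic\n---\n## Objective')
```

A real validator would also handle nested keys like `secrets`; this only covers the flat required and recommended fields.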

Objective

```markdown
## Objective

Pull failed NL-to-SQL queries from Langfuse, replay them against the
translation API, execute the resulting SQL on Snowflake, evaluate
correctness, and produce a regression report.
```

2-5 sentences answering: what am I doing, what am I producing, and for whom? This is the agent's north star.

Output Manifest

```markdown
## REQUIRED OUTPUT FILES (MANDATORY)

**You MUST write all of the following files to `{{results_dir}}`.
The task is NOT complete until every file exists and is non-empty. No exceptions.**

| File | Description |
|------|-------------|
| `{{results_dir}}/results.csv` | Per-query evaluation results |
| `{{results_dir}}/summary.md` | Executive summary |
| `{{results_dir}}/validation_report.json` | Structured validation results |
```

The aggressive tone ("MANDATORY", "No exceptions") is intentional — agents tend to stop early when they hit errors. This section is the forcing function that prevents partial completion.

Every runbook must include `validation_report.json` as its machine-readable results file. This is the standardized filename — don't use `scores.json`, `results.json`, or other variants.

Parameters

```markdown
## Parameters

| Parameter | Template Variable | Default | Description |
|-----------|------------------|---------|-------------|
| Results directory | `{{results_dir}}` | `/app/results` (Jetty) / `./results` (local) | Output directory |
| Sample size | `{{sample_size}}` | `10` | Number of queries to evaluate |
| Tenant filter | `{{tenant_filter}}` | _(none)_ | Optional tenant to scope queries |
```

Template variables (`{{param}}`) are injected at runtime. Every parameter should have a sensible default so the runbook can run with minimal configuration.

Convention: `{{results_dir}}` defaults to `/app/results` on Jetty and `./results` locally.
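The injection itself can be pictured as a simple substitution with fallback defaults — a minimal sketch, not Jetty's actual templating engine:

```python
import re

DEFAULTS = {"results_dir": "./results", "sample_size": "10"}

def render(template: str, params: dict) -> str:
    """Replace {{param}} placeholders, falling back to declared defaults."""
    merged = {**DEFAULTS, **params}
    return re.sub(r"\{\{(\w+)\}\}",
                  lambda m: str(merged.get(m.group(1), m.group(0))),
                  template)

print(render("mkdir -p {{results_dir}}", {}))                             # mkdir -p ./results
print(render("mkdir -p {{results_dir}}", {"results_dir": "/app/results"}))  # mkdir -p /app/results
```

Unknown placeholders are left intact rather than erased, which makes missing parameter declarations easy to spot.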

Dependencies

```markdown
## Dependencies

| Dependency | Type | Required | Description |
|------------|------|----------|-------------|
| `my-org/nl-to-sql` | Jetty workflow | Yes | Translates natural language to SQL |
| `LANGFUSE_SECRET_KEY` | Credential | Yes | Auth for Langfuse API |
| `pandas` | Python package | Yes | Data analysis |
```

Declares everything the runbook needs beyond the base agent environment: workflows, APIs, credentials, and packages. This makes the runbook portable — a new user can scan dependencies to understand what they need before running.

Steps

A runbook contains these step types in order:

1. Environment Setup

````markdown
## Step 1: Environment Setup

Install dependencies, create directories, verify inputs exist.

```bash
pip install pandas mlcroissant
mkdir -p {{results_dir}}
```
````

Idempotent — running it twice should not break anything.

2. Processing Steps (variable count)

````markdown
## Step 2: Fetch Failed Queries

Call the Langfuse API to retrieve queries with error traces...

### API Call

```bash
curl -s "https://cloud.langfuse.com/api/public/traces" \
  -u "{{langfuse_public_key}}:{{langfuse_secret_key}}" \
  -H "Content-Type: application/json"
```

### Expected Response

```json
{ "data": [...], "meta": { "totalItems": 42 } }
```
````

Each processing step should include: **what** to do, **how** (concrete API calls or code), **expected output**, **error handling**, and **what to record** for downstream steps.
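A sketch of the "error handling" and "what to record" halves, assuming a hypothetical JSON-lines record file (the names and record shape are illustrative, not a Jetty convention):

```python
import json

def record(path: str, step: str, **fields) -> None:
    """Append one JSON line so downstream steps can reuse this step's findings."""
    with open(path, "a") as f:
        f.write(json.dumps({"step": step, **fields}) + "\n")

response = {"data": [], "meta": {"totalItems": 0}}  # stand-in for the API response
if response["data"]:
    record("records.jsonl", "fetch_failed_queries",
           status="PASS", total=response["meta"]["totalItems"])
else:
    record("records.jsonl", "fetch_failed_queries",
           status="PARTIAL", message="no traces returned; check the tenant filter")
```

Recording a structured status (rather than only printing) is what lets the iteration and report steps act on earlier failures.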

3. Evaluation Step

This is the heart of the iteration loop. See [Evaluation Patterns](#evaluation-patterns) for the two approaches.

4. Iteration Step

```markdown
## Step 5: Iterate on Errors (max 3 rounds)

If any outputs received FAIL or PARTIAL status:

1. Read the specific error message
2. Apply the targeted fix from the Common Fixes table
3. Re-run the failed item through Step 3
4. Re-evaluate with Step 4 criteria
5. Repeat up to 3 times total

After 3 rounds, keep the best result and flag remaining failures.

### Common Fixes

| Issue | Fix |
|-------|-----|
| SQL syntax error on date functions | Use Snowflake's `DATE_TRUNC`, not Spark's |
| Empty result set | Check table name capitalization |
```

Always bounded (typically 3 rounds max). The Common Fixes table encodes domain expertise that accelerates convergence — without it, the agent may thrash or give up.
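The loop can be sketched generically; `run`, `evaluate`, and `apply_fix` are hypothetical stand-ins for the step-specific logic:

```python
def iterate(item, run, evaluate, apply_fix, max_rounds=3):
    """Evaluate -> fix -> re-run, always bounded at max_rounds."""
    result = run(item)
    for _ in range(max_rounds):
        status, message = evaluate(result)
        if status == "PASS":
            return result, status
        item = apply_fix(item, message)   # targeted fix from the Common Fixes table
        result = run(item)
    return result, evaluate(result)[0]    # out of rounds: return the last attempt, flagged

# Toy demo: an item "passes" once its value reaches 3.
evaluate = lambda r: ("PASS", "") if r >= 3 else ("FAIL", "too low")
result, status = iterate(1, run=lambda x: x, evaluate=evaluate,
                         apply_fix=lambda x, msg: x + 1)
```

The hard bound is the point: without `max_rounds` the agent can thrash indefinitely on an unfixable item.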

5. Report + Validation Report

Every runbook produces both a human-readable `summary.md` and a machine-readable `validation_report.json`. The runbook provides templates for both so the agent doesn't have to guess at structure.

The `validation_report.json` always includes:

```json
{
  "version": "1.0.0",
  "run_date": "2026-03-26T...",
  "parameters": { },
  "stages": [
    { "name": "...", "passed": true, "message": "..." }
  ],
  "results": { },
  "overall_passed": true
}
```
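Assembling the report programmatically keeps `overall_passed` consistent with the stage results — a minimal sketch with illustrative stage names:

```python
import json

report = {
    "version": "1.0.0",
    "run_date": "2026-03-26T00:00:00Z",
    "parameters": {"sample_size": 10},
    "stages": [
        {"name": "fetch", "passed": True, "message": "42 traces retrieved"},
        {"name": "evaluate", "passed": True, "message": "all queries scored"},
    ],
    "results": {},
    "overall_passed": False,  # derived below, never hand-set
}
report["overall_passed"] = all(stage["passed"] for stage in report["stages"])

with open("validation_report.json", "w") as f:
    json.dump(report, f, indent=2)
```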

6. Final Checklist

````markdown
## Step 8: Final Checklist (MANDATORY — do not skip)

### Verification Script

```bash
echo "=== FINAL OUTPUT VERIFICATION ==="
RESULTS_DIR="{{results_dir}}"
for f in "$RESULTS_DIR/results.csv" "$RESULTS_DIR/summary.md" "$RESULTS_DIR/validation_report.json"; do
  if [ ! -s "$f" ]; then
    echo "FAIL: $f is missing or empty"
  else
    echo "PASS: $f ($(wc -c < "$f") bytes)"
  fi
done
```

### Checklist

- `results.csv` exists and has data rows
- `summary.md` exists and follows the template
- `validation_report.json` exists with `stages`, `results`, and `overall_passed`
- Verification script printed PASS for all files

If ANY item fails, go back and fix it. Do NOT finish until all items pass.
````


This is the runbook's exit gate. The imperative language overrides the agent's natural tendency to wrap up.

Tips

```markdown
## Tips

- Langfuse auth uses HTTP Basic (username:password), not Bearer tokens
- Snowflake function names differ from Spark — check `DATE_TRUNC` vs `TRUNC`
- The Jetty workflow returns results at `.outputs.results[0]`, not `.outputs.result`
```

Hard-won operational knowledge from watching agents run (and fail). The agent should read these before starting.

Evaluation Patterns

Declare the pattern in frontmatter with `evaluation: programmatic` or `evaluation: rubric`.

Programmatic Validation

Best for structured output — JSON, CSV, SQL, code, schemas.

```markdown
## Step 4: Evaluate Outputs

| Status | Criteria |
|--------|----------|
| `PASS` | SQL executes successfully and returns ≥1 row matching expected output |
| `PARTIAL` | SQL executes but results don't match expected output |
| `FAIL` | SQL has syntax errors or doesn't execute |
```

Characteristics:

  • Pass/fail is objective (schema validates, SQL executes, tests pass)
  • Error messages are specific and actionable
  • Iteration converges quickly (usually 1-2 rounds)
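The status table maps directly onto code — a sketch, assuming `rows` and `expected` are lists of result tuples:

```python
def classify(executed: bool, rows: list, expected: list) -> str:
    """Apply the PASS/PARTIAL/FAIL criteria from the evaluation table."""
    if not executed:
        return "FAIL"        # syntax error or did not execute
    if rows and rows == expected:
        return "PASS"        # executed with >= 1 row matching expected output
    return "PARTIAL"         # executed, but results don't match

print(classify(True, [(1, "a")], [(1, "a")]))  # PASS
print(classify(True, [], [(1, "a")]))          # PARTIAL
print(classify(False, [], []))                 # FAIL
```

Because each branch is objective, the error messages the agent sees are specific, which is what makes iteration converge quickly.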

Rubric-Based Judgment

Best for creative or complex output — reports, images, generated content.

```markdown
## Step 4: Evaluate Outputs

### Rubric

| # | Criterion | 5 (Excellent) | 3 (Acceptable) | 1 (Poor) |
|---|-----------|---------------|----------------|----------|
| 1 | Accuracy | All facts verified | Minor inaccuracies | Major errors |
| 2 | Completeness | All sections filled | Key sections present | Missing sections |
| 3 | Clarity | Clear, well-structured | Readable | Confusing |

**Pass threshold: >= 4.0 overall, no individual criterion below 3.**
```

Characteristics:

  • Quality is subjective, assessed via rubric (1-5 scale)
  • The agent is both producer and judge (self-evaluation)
  • Iteration targets the weakest criteria, guided by the Common Fixes table
  • Can also delegate judgment to a Jetty workflow with judge steps

Don't mix patterns. Use programmatic for structured output, rubric for creative output. Don't rubric-score a JSON file or schema-validate a social graphic.

Creating a Runbook with the Agent Skill

The agent-skill package includes a guided runbook creation wizard. If you have the Jetty Claude Code plugin or MCP server installed, run:

```
/create-runbook
```

The wizard walks you through:

  1. Choosing an evaluation pattern — programmatic or rubric
  2. Defining your objective — what the task does end-to-end
  3. Setting up the output manifest — which files the runbook must produce
  4. Declaring parameters and dependencies — what varies between runs and what's needed
  5. Designing processing steps — the substantive work
  6. Writing evaluation criteria — pass/fail statuses or a scoring rubric
  7. Adding common fixes and tips — domain knowledge for the agent

It scaffolds a complete RUNBOOK.md from one of two starter templates (programmatic or rubric), then validates the result against the canonical structure.

The templates live in the agent-skill repo at:

  • `agent-skill/skills/create-runbook/templates/programmatic.md`
  • `agent-skill/skills/create-runbook/templates/rubric.md`

You can also create runbooks manually using these templates as a starting point.

Running a Runbook

Locally

Open the runbook in a new agent conversation and tell the agent to follow it:

```
Follow the runbook in ./RUNBOOK.md.
Use these parameters: results_dir=./results, sample_size=10
```

On Jetty

Send the runbook as the system message with the `jetty` block:

```json
{
  "model": "claude-sonnet-4-6",
  "messages": [
    {
      "role": "system",
      "content": "<contents of your RUNBOOK.md>"
    },
    {
      "role": "user",
      "content": "Execute the runbook"
    }
  ],
  "stream": true,
  "jetty": {
    "runbook": true,
    "collection": "my-org",
    "task": "nl-to-sql-regression",
    "agent": "claude-code",
    "file_paths": ["uploads/test-queries.csv"]
  }
}
```

The agent sandbox receives the runbook as its instruction set. Everything written to `/app/results/` is persisted to cloud storage and available via the trajectory.
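For example, the request body can be assembled in a few lines; the endpoint and auth are omitted here (see the Chat Completions Reference for the actual API spec), and `runbook_text` stands in for your file's contents:

```python
import json

runbook_text = "<contents of your RUNBOOK.md>"  # e.g. pathlib.Path("RUNBOOK.md").read_text()

body = {
    "model": "claude-sonnet-4-6",
    "messages": [
        {"role": "system", "content": runbook_text},
        {"role": "user", "content": "Execute the runbook"},
    ],
    "stream": True,
    "jetty": {
        "runbook": True,
        "collection": "my-org",
        "task": "nl-to-sql-regression",
        "agent": "claude-code",
        "file_paths": ["uploads/test-queries.csv"],
    },
}
payload = json.dumps(body)
```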

See Chat Completions Reference for the full API spec.

Validating a Runbook

Before running, validate your runbook's structure. The agent-skill package includes a validation script that checks:

| Check | Severity |
|-------|----------|
| Frontmatter with `version` and `evaluation` | Error |
| `evaluation` is `programmatic` or `rubric` | Error |
| `## Objective` section present | Error |
| `## REQUIRED OUTPUT FILES` section present | Error |
| `validation_report.json` in manifest | Error |
| `summary.md` in manifest | Warning |
| All `{{template_vars}}` declared in Parameters | Error |
| At least one evaluation step | Error |
| Iteration step with max rounds | Error |
| `## Final Checklist` with verification script | Error |
| `## Dependencies` section | Warning |
| `## Tips` section | Warning |

The `/create-runbook` wizard runs this automatically at the end. You can also run it manually — the full validation script is in `agent-skill/skills/create-runbook/SKILL.md`.
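As a flavor of what such a script does, here is an illustrative subset of the structural checks (the real script in the agent-skill repo is more thorough):

```python
import re

def check_runbook(text: str) -> list:
    """Return error strings for a few of the structural checks above."""
    errors = []
    if not re.match(r"^---\n.*?\bevaluation:\s*(programmatic|rubric)\b.*?\n---",
                    text, re.DOTALL):
        errors.append("frontmatter must declare evaluation: programmatic | rubric")
    for section in ("## Objective", "## REQUIRED OUTPUT FILES"):
        if section not in text:
            errors.append("missing section: " + section)
    if "validation_report.json" not in text:
        errors.append("output manifest must include validation_report.json")
    return errors
```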

Evolving Runbooks Over Time

Runbooks improve through use:

  • Tips accumulate as agents encounter new failure modes
  • Common Fixes tables grow as patterns emerge
  • Rubrics get refined as the quality bar becomes clearer
  • Parameters get added as new use cases arise
  • Evaluation criteria tighten as the system matures

Bump the version in frontmatter when you make structural changes that affect output or evaluation. This lets you track which version produced a given trajectory.

Authoring Checklist

Do:

  • Be specific about API calls — include full curl examples with expected request/response shapes
  • Show the expected output structure — JSON templates, CSV columns, markdown skeletons
  • Encode domain knowledge in Tips — save the agent significant debugging time
  • Make evaluation criteria concrete — "score >= 4.0, no criterion below 3" not "good quality"
  • Bound iteration — always specify a max round count
  • Use imperative language in the output manifest and final checklist
  • List all dependencies — workflows, APIs, credentials, packages

Don't:

  • Over-specify intermediate steps — the agent should have room to adapt
  • Skip the verification script — it's the only reliable way to ensure all outputs exist
  • Assume the agent remembers earlier steps — re-state key context when needed
  • Mix evaluation patterns — programmatic for structured output, rubric for creative output

Next Steps