Writing Runbooks

A runbook is a structured markdown document that tells a coding agent (Claude Code, Cursor, Codex, Gemini CLI) how to accomplish a complex, multi-step task end-to-end — with built-in evaluation loops, iteration, and quality gates.

If you haven't run an agent on Jetty yet, start with the Agentic Workflows Quickstart first.

When You Need a Runbook

A simple system prompt works fine for straightforward tasks. A runbook becomes necessary when:

  • The first attempt is rarely sufficient — the agent needs to evaluate and iterate
  • There's a quality bar beyond "did it run?" — pass/fail criteria or a scoring rubric
  • The task produces multiple artifacts that must be consistent with each other
  • The task involves external API calls that fail in domain-specific ways
  • You want to encode domain expertise as tips, common fixes, and evaluation criteria

Where Runbooks Fit

| | Skill | Workflow | Runbook |
|---|-------|----------|---------|
| Format | Markdown (SKILL.md) | JSON (step configs) | Markdown (RUNBOOK.md) |
| Executed by | Coding agent | Jetty engine | Coding agent, calling workflows/APIs |
| Complexity | Single tool or short procedure | Fixed DAG of step templates | Multi-phase process with judgment |
| Iteration | None — one-shot | None — runs to completion | Built-in: evaluate → refine → re-evaluate |
| Output | Varies | Trajectory (structured) | Defined file manifest + summary |

A skill says "here's how to call the Jetty API." A workflow says "run this pipeline." A runbook says "accomplish this outcome — and here's how to know when you're done."

A runbook often calls workflows or raw APIs as sub-steps within its larger process.

Anatomy of a Runbook

Every well-formed runbook follows a canonical structure. Here's each section and why it matters.

Frontmatter

```yaml
---
version: "1.0.0"
evaluation: programmatic
agent: claude-code       # claude-code | codex | gemini-cli
model: claude-sonnet-4-6 # Model for the agent runtime
snapshot: python312-uv   # python312-uv | prism-playwright | custom image URL
secrets:
  # EXAMPLE_API_KEY:
  #   env: EXAMPLE_API_KEY
  #   description: "API key for ..."
  #   required: true
---
```

Two required fields:

  • version — Semantic version. Bump when evaluation criteria, steps, or the output manifest change.
  • evaluation — Either programmatic (objective pass/fail) or rubric (scored against criteria). See Evaluation Patterns below.

Three recommended fields:

  • agent — The agent runtime: claude-code, codex, or gemini-cli. If omitted, Jetty infers from the model name.
  • model — The LLM model powering the agent. Examples: claude-sonnet-4-6, gpt-5.4, gemini-3.1-pro-preview.
  • snapshot — The sandbox environment: python312-uv (default, lightweight) or prism-playwright (includes Playwright + Chromium for browser tasks). You can also specify a custom container image URL.

Optional field:

  • secrets — Declares API keys and credentials the runbook needs. See Secrets below.
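For illustration, the required fields can be machine-checked before a run. This is a hypothetical sketch (not part of the Jetty toolchain) that extracts flat frontmatter keys with the standard library:

```python
import re

def parse_frontmatter(text: str) -> dict:
    """Extract flat key: value pairs from the frontmatter block between --- markers."""
    match = re.match(r"^---\n(.*?)\n---", text, re.DOTALL)
    if not match:
        raise ValueError("runbook is missing frontmatter")
    fields = {}
    for line in match.group(1).splitlines():
        line = line.split("#", 1)[0]                       # drop inline comments
        if ":" in line and not line.startswith((" ", "\t")):
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip().strip('"')
    return fields

fm = parse_frontmatter('---\nversion: "1.0.0"\nevaluation: programmatic\n---\n## Objective')
```

A real validator would also handle nested keys like `secrets`; this only covers the flat required and recommended fields.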

Objective

```markdown
## Objective

Pull failed NL-to-SQL queries from Langfuse, replay them against the
translation API, execute the resulting SQL on Snowflake, evaluate
correctness, and produce a regression report.
```

2-5 sentences answering: what am I doing, what am I producing, and for whom? This is the agent's north star.

Output Manifest

```markdown
## REQUIRED OUTPUT FILES (MANDATORY)

**You MUST write all of the following files to `{{results_dir}}`.
The task is NOT complete until every file exists and is non-empty. No exceptions.**

| File | Description |
|------|-------------|
| `{{results_dir}}/results.csv` | Per-query evaluation results |
| `{{results_dir}}/summary.md` | Executive summary |
| `{{results_dir}}/validation_report.json` | Structured validation results |
```

The aggressive tone ("MANDATORY", "No exceptions") is intentional — agents tend to stop early when they hit errors. This section is the forcing function that prevents partial completion.

Every runbook must include `validation_report.json` as its machine-readable results file. This is the standardized filename — don't use `scores.json`, `results.json`, or other variants.

Parameters

```markdown
## Parameters

| Parameter | Template Variable | Default | Description |
|-----------|------------------|---------|-------------|
| Results directory | `{{results_dir}}` | `/app/results` (Jetty) / `./results` (local) | Output directory |
| Sample size | `{{sample_size}}` | `10` | Number of queries to evaluate |
| Tenant filter | `{{tenant_filter}}` | _(none)_ | Optional tenant to scope queries |
```

Template variables (`{{param}}`) are injected at runtime. Every parameter should have a sensible default so the runbook can run with minimal configuration.

Convention: `{{results_dir}}` defaults to `/app/results` on Jetty and `./results` locally.
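The injection itself can be pictured as a simple substitution with fallback defaults — a minimal sketch, not Jetty's actual templating engine:

```python
import re

DEFAULTS = {"results_dir": "./results", "sample_size": "10"}

def render(template: str, params: dict) -> str:
    """Replace {{param}} placeholders, falling back to declared defaults."""
    merged = {**DEFAULTS, **params}
    return re.sub(r"\{\{(\w+)\}\}",
                  lambda m: str(merged.get(m.group(1), m.group(0))),
                  template)

print(render("mkdir -p {{results_dir}}", {}))                             # mkdir -p ./results
print(render("mkdir -p {{results_dir}}", {"results_dir": "/app/results"}))  # mkdir -p /app/results
```

Unknown placeholders are left intact rather than erased, which makes missing parameter declarations easy to spot.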

Dependencies

```markdown
## Dependencies

| Dependency | Type | Required | Description |
|------------|------|----------|-------------|
| `my-org/nl-to-sql` | Jetty workflow | Yes | Translates natural language to SQL |
| `LANGFUSE_SECRET_KEY` | Credential | Yes | Auth for Langfuse API |
| `pandas` | Python package | Yes | Data analysis |
```

Declares everything the runbook needs beyond the base agent environment: workflows, APIs, credentials, and packages. This makes the runbook portable — a new user can scan dependencies to understand what they need before running.

Steps

A runbook contains these step types in order:

1. Environment Setup

````markdown
## Step 1: Environment Setup

Install dependencies, create directories, verify inputs exist.

```bash
pip install pandas mlcroissant
mkdir -p {{results_dir}}
```
````

Idempotent — running it twice should not break anything.

2. Processing Steps (variable count)

````markdown
## Step 2: Fetch Failed Queries

Call the Langfuse API to retrieve queries with error traces...

### API Call

```bash
curl -s "https://cloud.langfuse.com/api/public/traces" \
  -u "{{langfuse_public_key}}:{{langfuse_secret_key}}" \
  -H "Content-Type: application/json"
```

### Expected Response

```json
{ "data": [...], "meta": { "totalItems": 42 } }
```
````

Each processing step should include: **what** to do, **how** (concrete API calls or code), **expected output**, **error handling**, and **what to record** for downstream steps.
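A sketch of the "error handling" and "what to record" halves, assuming a hypothetical JSON-lines record file (the names and record shape are illustrative, not a Jetty convention):

```python
import json

def record(path: str, step: str, **fields) -> None:
    """Append one JSON line so downstream steps can reuse this step's findings."""
    with open(path, "a") as f:
        f.write(json.dumps({"step": step, **fields}) + "\n")

response = {"data": [], "meta": {"totalItems": 0}}  # stand-in for the API response
if response["data"]:
    record("records.jsonl", "fetch_failed_queries",
           status="PASS", total=response["meta"]["totalItems"])
else:
    record("records.jsonl", "fetch_failed_queries",
           status="PARTIAL", message="no traces returned; check the tenant filter")
```

Recording a structured status (rather than only printing) is what lets the iteration and report steps act on earlier failures.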

3. Evaluation Step

This is the heart of the iteration loop. See [Evaluation Patterns](#evaluation-patterns) for the two approaches.

4. Iteration Step

```markdown
## Step 5: Iterate on Errors (max 3 rounds)

If any outputs received FAIL or PARTIAL status:

1. Read the specific error message
2. Apply the targeted fix from the Common Fixes table
3. Re-run the failed item through Step 3
4. Re-evaluate with Step 4 criteria
5. Repeat up to 3 times total

After 3 rounds, keep the best result and flag remaining failures.

### Common Fixes

| Issue | Fix |
|-------|-----|
| SQL syntax error on date functions | Use Snowflake's `DATE_TRUNC`, not Spark's |
| Empty result set | Check table name capitalization |
```

Always bounded (typically 3 rounds max). The Common Fixes table encodes domain expertise that accelerates convergence — without it, the agent may thrash or give up.
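The loop can be sketched generically; `run`, `evaluate`, and `apply_fix` are hypothetical stand-ins for the step-specific logic:

```python
def iterate(item, run, evaluate, apply_fix, max_rounds=3):
    """Evaluate -> fix -> re-run, always bounded at max_rounds."""
    result = run(item)
    for _ in range(max_rounds):
        status, message = evaluate(result)
        if status == "PASS":
            return result, status
        item = apply_fix(item, message)   # targeted fix from the Common Fixes table
        result = run(item)
    return result, evaluate(result)[0]    # out of rounds: return the last attempt, flagged

# Toy demo: an item "passes" once its value reaches 3.
evaluate = lambda r: ("PASS", "") if r >= 3 else ("FAIL", "too low")
result, status = iterate(1, run=lambda x: x, evaluate=evaluate,
                         apply_fix=lambda x, msg: x + 1)
```

The hard bound is the point: without `max_rounds` the agent can thrash indefinitely on an unfixable item.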

5. Report + Validation Report

Every runbook produces both a human-readable `summary.md` and a machine-readable `validation_report.json`. The runbook provides templates for both so the agent doesn't have to guess at structure.

The `validation_report.json` always includes:

```json
{
  "version": "1.0.0",
  "run_date": "2026-03-26T...",
  "parameters": { },
  "stages": [
    { "name": "...", "passed": true, "message": "..." }
  ],
  "results": { },
  "overall_passed": true
}
```
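Assembling the report programmatically keeps `overall_passed` consistent with the stage results — a minimal sketch with illustrative stage names:

```python
import json

report = {
    "version": "1.0.0",
    "run_date": "2026-03-26T00:00:00Z",
    "parameters": {"sample_size": 10},
    "stages": [
        {"name": "fetch", "passed": True, "message": "42 traces retrieved"},
        {"name": "evaluate", "passed": True, "message": "all queries scored"},
    ],
    "results": {},
    "overall_passed": False,  # derived below, never hand-set
}
report["overall_passed"] = all(stage["passed"] for stage in report["stages"])

with open("validation_report.json", "w") as f:
    json.dump(report, f, indent=2)
```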

6. Final Checklist

````markdown
## Step 8: Final Checklist (MANDATORY — do not skip)

### Verification Script

```bash
echo "=== FINAL OUTPUT VERIFICATION ==="
RESULTS_DIR="{{results_dir}}"
for f in "$RESULTS_DIR/results.csv" "$RESULTS_DIR/summary.md" "$RESULTS_DIR/validation_report.json"; do
  if [ ! -s "$f" ]; then
    echo "FAIL: $f is missing or empty"
  else
    echo "PASS: $f ($(wc -c < "$f") bytes)"
  fi
done
```

### Checklist

- `results.csv` exists and has data rows
- `summary.md` exists and follows the template
- `validation_report.json` exists with `stages`, `results`, and `overall_passed`
- Verification script printed PASS for all files

If ANY item fails, go back and fix it. Do NOT finish until all items pass.
````


This is the runbook's exit gate. The imperative language overrides the agent's natural tendency to wrap up.

Tips

```markdown
## Tips

- Langfuse auth uses HTTP Basic (username:password), not Bearer tokens
- Snowflake function names differ from Spark — check `DATE_TRUNC` vs `TRUNC`
- The Jetty workflow returns results at `.outputs.results[0]`, not `.outputs.result`
```

Hard-won operational knowledge from watching agents run (and fail). The agent should read these before starting.

Evaluation Patterns

Declare the pattern in frontmatter with `evaluation: programmatic` or `evaluation: rubric`.

Programmatic Validation

Best for structured output — JSON, CSV, SQL, code, schemas.

```markdown
## Step 4: Evaluate Outputs

| Status | Criteria |
|--------|----------|
| `PASS` | SQL executes successfully and returns ≥1 row matching expected output |
| `PARTIAL` | SQL executes but results don't match expected output |
| `FAIL` | SQL has syntax errors or doesn't execute |
```

Characteristics:

  • Pass/fail is objective (schema validates, SQL executes, tests pass)
  • Error messages are specific and actionable
  • Iteration converges quickly (usually 1-2 rounds)
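The status table maps directly onto code — a sketch, assuming `rows` and `expected` are lists of result tuples:

```python
def classify(executed: bool, rows: list, expected: list) -> str:
    """Apply the PASS/PARTIAL/FAIL criteria from the evaluation table."""
    if not executed:
        return "FAIL"        # syntax error or did not execute
    if rows and rows == expected:
        return "PASS"        # executed with >= 1 row matching expected output
    return "PARTIAL"         # executed, but results don't match

print(classify(True, [(1, "a")], [(1, "a")]))  # PASS
print(classify(True, [], [(1, "a")]))          # PARTIAL
print(classify(False, [], []))                 # FAIL
```

Because each branch is objective, the error messages the agent sees are specific, which is what makes iteration converge quickly.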

Rubric-Based Judgment

Best for creative or complex output — reports, images, generated content.

```markdown
## Step 4: Evaluate Outputs

### Rubric

| # | Criterion | 5 (Excellent) | 3 (Acceptable) | 1 (Poor) |
|---|-----------|---------------|----------------|----------|
| 1 | Accuracy | All facts verified | Minor inaccuracies | Major errors |
| 2 | Completeness | All sections filled | Key sections present | Missing sections |
| 3 | Clarity | Clear, well-structured | Readable | Confusing |

**Pass threshold: >= 4.0 overall, no individual criterion below 3.**
```

Characteristics:

  • Quality is subjective, assessed via rubric (1-5 scale)
  • The agent is both producer and judge (self-evaluation)
  • Iteration targets the weakest criteria, guided by the Common Fixes table
  • Can also delegate judgment to a Jetty workflow with judge steps

Don't mix patterns. Use programmatic for structured output, rubric for creative output. Don't rubric-score a JSON file or schema-validate a social graphic.

Creating a Runbook with the Agent Skill

The agent-skill package includes a guided runbook creation wizard. If you have the Jetty Claude Code plugin or MCP server installed, run:

```
/create-runbook
```

The wizard walks you through:

  1. Choosing an evaluation pattern — programmatic or rubric
  2. Defining your objective — what the task does end-to-end
  3. Setting up the output manifest — which files the runbook must produce
  4. Declaring parameters and dependencies — what varies between runs and what's needed
  5. Designing processing steps — the substantive work
  6. Writing evaluation criteria — pass/fail statuses or a scoring rubric
  7. Adding common fixes and tips — domain knowledge for the agent

It scaffolds a complete RUNBOOK.md from one of two starter templates (programmatic or rubric), then validates the result against the canonical structure.

The templates live in the agent-skill repo at:

  • `agent-skill/skills/create-runbook/templates/programmatic.md`
  • `agent-skill/skills/create-runbook/templates/rubric.md`

You can also create runbooks manually using these templates as a starting point.

Running a Runbook

Locally

Open the runbook in a new agent conversation and tell the agent to follow it:

```
Follow the runbook in ./RUNBOOK.md.
Use these parameters: results_dir=./results, sample_size=10
```

On Jetty

Send the runbook as the system message with the `jetty` block:

```json
{
  "model": "claude-sonnet-4-6",
  "messages": [
    {
      "role": "system",
      "content": "<contents of your RUNBOOK.md>"
    },
    {
      "role": "user",
      "content": "Execute the runbook"
    }
  ],
  "stream": true,
  "jetty": {
    "runbook": true,
    "collection": "my-org",
    "task": "nl-to-sql-regression",
    "agent": "claude-code",
    "file_paths": ["uploads/test-queries.csv"]
  }
}
```

The agent sandbox receives the runbook as its instruction set. Everything written to `/app/results/` is persisted to cloud storage and available via the trajectory.
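For example, the request body can be assembled in a few lines; the endpoint and auth are omitted here (see the Chat Completions Reference for the actual API spec), and `runbook_text` stands in for your file's contents:

```python
import json

runbook_text = "<contents of your RUNBOOK.md>"  # e.g. pathlib.Path("RUNBOOK.md").read_text()

body = {
    "model": "claude-sonnet-4-6",
    "messages": [
        {"role": "system", "content": runbook_text},
        {"role": "user", "content": "Execute the runbook"},
    ],
    "stream": True,
    "jetty": {
        "runbook": True,
        "collection": "my-org",
        "task": "nl-to-sql-regression",
        "agent": "claude-code",
        "file_paths": ["uploads/test-queries.csv"],
    },
}
payload = json.dumps(body)
```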

See Chat Completions Reference for the full API spec.

Validating a Runbook

Before running, validate your runbook's structure. The agent-skill package includes a validation script that checks:

| Check | Severity |
|-------|----------|
| Frontmatter with `version` and `evaluation` | Error |
| `evaluation` is `programmatic` or `rubric` | Error |
| `## Objective` section present | Error |
| `## REQUIRED OUTPUT FILES` section present | Error |
| `validation_report.json` in manifest | Error |
| `summary.md` in manifest | Warning |
| All `{{template_vars}}` declared in Parameters | Error |
| At least one evaluation step | Error |
| Iteration step with max rounds | Error |
| `## Final Checklist` with verification script | Error |
| `## Dependencies` section | Warning |
| `## Tips` section | Warning |

The `/create-runbook` wizard runs this automatically at the end. You can also run it manually — the full validation script is in `agent-skill/skills/create-runbook/SKILL.md`.
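As a flavor of what such a script does, here is an illustrative subset of the structural checks (the real script in the agent-skill repo is more thorough):

```python
import re

def check_runbook(text: str) -> list:
    """Return error strings for a few of the structural checks above."""
    errors = []
    if not re.match(r"^---\n.*?\bevaluation:\s*(programmatic|rubric)\b.*?\n---",
                    text, re.DOTALL):
        errors.append("frontmatter must declare evaluation: programmatic | rubric")
    for section in ("## Objective", "## REQUIRED OUTPUT FILES"):
        if section not in text:
            errors.append("missing section: " + section)
    if "validation_report.json" not in text:
        errors.append("output manifest must include validation_report.json")
    return errors
```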

Evolving Runbooks Over Time

Runbooks improve through use:

  • Tips accumulate as agents encounter new failure modes
  • Common Fixes tables grow as patterns emerge
  • Rubrics get refined as the quality bar becomes clearer
  • Parameters get added as new use cases arise
  • Evaluation criteria tighten as the system matures

Bump the version in frontmatter when you make structural changes that affect output or evaluation. This lets you track which version produced a given trajectory.

Authoring Checklist

Do:

  • Be specific about API calls — include full curl examples with expected request/response shapes
  • Show the expected output structure — JSON templates, CSV columns, markdown skeletons
  • Encode domain knowledge in Tips — save the agent significant debugging time
  • Make evaluation criteria concrete — "score >= 4.0, no criterion below 3" not "good quality"
  • Bound iteration — always specify a max round count
  • Use imperative language in the output manifest and final checklist
  • List all dependencies — workflows, APIs, credentials, packages

Don't:

  • Over-specify intermediate steps — the agent should have room to adapt
  • Skip the verification script — it's the only reliable way to ensure all outputs exist
  • Assume the agent remembers earlier steps — re-state key context when needed
  • Mix evaluation patterns — programmatic for structured output, rubric for creative output

Next Steps