# Writing Runbooks
A runbook is a structured markdown document that tells a coding agent (Claude Code, Cursor, Codex, Gemini CLI) how to accomplish a complex, multi-step task end-to-end — with built-in evaluation loops, iteration, and quality gates.
If you haven't run an agent on Jetty yet, start with the Agentic Workflows Quickstart first.
## When You Need a Runbook
A simple system prompt works fine for straightforward tasks. A runbook becomes necessary when:
- The first attempt is rarely sufficient — the agent needs to evaluate and iterate
- There's a quality bar beyond "did it run?" — pass/fail criteria or a scoring rubric
- The task produces multiple artifacts that must be consistent with each other
- The task involves external API calls that fail in domain-specific ways
- You want to encode domain expertise as tips, common fixes, and evaluation criteria
## Where Runbooks Fit

| | Skill | Workflow | Runbook |
|---|---|---|---|
| Format | Markdown (SKILL.md) | JSON (step configs) | Markdown (RUNBOOK.md) |
| Executed by | Coding agent | Jetty engine | Coding agent, calling workflows/APIs |
| Complexity | Single tool or short procedure | Fixed DAG of step templates | Multi-phase process with judgment |
| Iteration | None — one-shot | None — runs to completion | Built-in: evaluate → refine → re-evaluate |
| Output | Varies | Trajectory (structured) | Defined file manifest + summary |
A skill says "here's how to call the Jetty API." A workflow says "run this pipeline." A runbook says "accomplish this outcome — and here's how to know when you're done."
A runbook often calls workflows or raw APIs as sub-steps within its larger process.
## Anatomy of a Runbook
Every well-formed runbook follows a canonical structure. Here's each section and why it matters.
### Frontmatter

```yaml
---
version: "1.0.0"
evaluation: programmatic
agent: claude-code        # claude-code | codex | gemini-cli
model: claude-sonnet-4-6  # Model for the agent runtime
snapshot: python312-uv    # python312-uv | prism-playwright | custom image URL
secrets:
  # EXAMPLE_API_KEY:
  #   env: EXAMPLE_API_KEY
  #   description: "API key for ..."
  #   required: true
---
```
Two required fields:
- `version` — Semantic version. Bump it when evaluation criteria, steps, or the output manifest change.
- `evaluation` — Either `programmatic` (objective pass/fail) or `rubric` (scored against criteria). See Evaluation Patterns below.
Three recommended fields:
- `agent` — The agent runtime: `claude-code`, `codex`, or `gemini-cli`. If omitted, Jetty infers it from the model name.
- `model` — The LLM model powering the agent. Examples: `claude-sonnet-4-6`, `gpt-5.4`, `gemini-3.1-pro-preview`.
- `snapshot` — The sandbox environment: `python312-uv` (default, lightweight) or `prism-playwright` (includes Playwright + Chromium for browser tasks). You can also specify a custom container image URL.
Optional field:
- `secrets` — Declares the API keys and credentials the runbook needs. See Secrets below.
### Objective

```markdown
## Objective

Pull failed NL-to-SQL queries from Langfuse, replay them against the
translation API, execute the resulting SQL on Snowflake, evaluate
correctness, and produce a regression report.
```
2-5 sentences answering: what am I doing, what am I producing, and for whom? This is the agent's north star.
### Output Manifest

```markdown
## REQUIRED OUTPUT FILES (MANDATORY)

**You MUST write all of the following files to `{{results_dir}}`.
The task is NOT complete until every file exists and is non-empty. No exceptions.**

| File | Description |
|------|-------------|
| `{{results_dir}}/results.csv` | Per-query evaluation results |
| `{{results_dir}}/summary.md` | Executive summary |
| `{{results_dir}}/validation_report.json` | Structured validation results |
```
The aggressive tone ("MANDATORY", "No exceptions") is intentional — agents tend to stop early when they hit errors. This section is the forcing function that prevents partial completion.
Every runbook must include `validation_report.json` as its machine-readable results file. This is the standardized filename — don't use `scores.json`, `results.json`, or other variants.
### Parameters

```markdown
## Parameters

| Parameter | Template Variable | Default | Description |
|-----------|------------------|---------|-------------|
| Results directory | `{{results_dir}}` | `/app/results` (Jetty) / `./results` (local) | Output directory |
| Sample size | `{{sample_size}}` | `10` | Number of queries to evaluate |
| Tenant filter | `{{tenant_filter}}` | _(none)_ | Optional tenant to scope queries |
```
Template variables (`{{param}}`) are injected at runtime. Every parameter should have a sensible default so the runbook can run with minimal configuration.

Convention: `{{results_dir}}` defaults to `/app/results` on Jetty and `./results` locally.
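The substitution itself can be pictured as a simple defaults-aware string replace. A minimal sketch of the convention (not Jetty's actual injection code; the `render` helper is hypothetical):

```python
import re

def render(runbook_text: str, params: dict, defaults: dict) -> str:
    """Substitute {{param}} placeholders, falling back to declared defaults.
    Unknown placeholders are left as-is."""
    merged = {**defaults, **params}
    return re.sub(
        r"\{\{(\w+)\}\}",
        lambda m: str(merged.get(m.group(1), m.group(0))),
        runbook_text,
    )

text = "mkdir -p {{results_dir}} && head -n {{sample_size}} queries.csv"
print(render(text, {"sample_size": 25}, {"results_dir": "./results", "sample_size": 10}))
# → mkdir -p ./results && head -n 25 queries.csv
```

This is why every parameter should carry a default: a caller who supplies nothing still gets a runnable document.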
### Dependencies

```markdown
## Dependencies

| Dependency | Type | Required | Description |
|------------|------|----------|-------------|
| `my-org/nl-to-sql` | Jetty workflow | Yes | Translates natural language to SQL |
| `LANGFUSE_SECRET_KEY` | Credential | Yes | Auth for Langfuse API |
| `pandas` | Python package | Yes | Data analysis |
```
Declares everything the runbook needs beyond the base agent environment: workflows, APIs, credentials, and packages. This makes the runbook portable — a new user can scan dependencies to understand what they need before running.
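Because everything is declared up front, the setup step can verify the table before doing any real work. A sketch of such a preflight check (the `preflight` helper and its messages are hypothetical, not part of the runbook spec):

```python
import importlib.util
import os

def preflight(packages, env_vars):
    """Return a list of missing dependencies; an empty list means ready to run."""
    missing = []
    for pkg in packages:
        # find_spec returns None when a top-level module is not installed
        if importlib.util.find_spec(pkg) is None:
            missing.append(f"python package: {pkg}")
    for var in env_vars:
        if not os.environ.get(var):
            missing.append(f"credential: {var}")
    return missing

# Abort early with a clear message instead of failing mid-run.
for problem in preflight(["pandas"], ["LANGFUSE_SECRET_KEY"]):
    print(f"MISSING {problem}")
```

Failing fast here is cheaper than discovering a missing credential in Step 4.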
### Steps
A runbook contains these step types in order:
#### 1. Environment Setup

````markdown
## Step 1: Environment Setup

Install dependencies, create directories, verify inputs exist.

```bash
pip install pandas mlcroissant
mkdir -p {{results_dir}}
```
````
Idempotent — running it twice should not break anything.
#### 2. Processing Steps (variable count)
````markdown
## Step 2: Fetch Failed Queries

Call the Langfuse API to retrieve queries with error traces...

### API Call

```bash
curl -s "https://cloud.langfuse.com/api/public/traces" \
  -u "{{langfuse_public_key}}:{{langfuse_secret_key}}" \
  -H "Content-Type: application/json"
```

### Expected Response

```json
{ "data": [...], "meta": { "totalItems": 42 } }
```
````
Each processing step should include: **what** to do, **how** (concrete API calls or code), **expected output**, **error handling**, and **what to record** for downstream steps.
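"What to record" deserves the same precision as the API call: name the exact fields downstream steps read. A sketch against the expected response shape above (the trace fields here are illustrative):

```python
import json

# Same shape as the Expected Response above; field values are illustrative.
raw = '{"data": [{"id": "t1", "input": "top customers by revenue"}], "meta": {"totalItems": 42}}'
resp = json.loads(raw)

# Record what later steps need: the trace ids to replay and the total count.
trace_ids = [trace["id"] for trace in resp["data"]]
total_items = resp["meta"]["totalItems"]
print(trace_ids, total_items)
# → ['t1'] 42
```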
#### 3. Evaluation Step
This is the heart of the iteration loop. See [Evaluation Patterns](#evaluation-patterns) for the two approaches.
#### 4. Iteration Step
```markdown
## Step 5: Iterate on Errors (max 3 rounds)

If any outputs received FAIL or PARTIAL status:

1. Read the specific error message
2. Apply the targeted fix from the Common Fixes table
3. Re-run the failed item through Step 3
4. Re-evaluate with Step 4 criteria
5. Repeat up to 3 times total

After 3 rounds, keep the best result and flag remaining failures.

### Common Fixes

| Issue | Fix |
|-------|-----|
| SQL syntax error on date functions | Use Snowflake's `DATE_TRUNC`, not Spark's |
| Empty result set | Check table name capitalization |
```
Always bounded (typically 3 rounds max). The Common Fixes table encodes domain expertise that accelerates convergence — without it, the agent may thrash or give up.
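The loop can be sketched in a few lines; `run_item`, `evaluate`, and the fixes mapping below are hypothetical stand-ins for the runbook's real processing and evaluation steps:

```python
MAX_ROUNDS = 3

def iterate(items, run_item, evaluate, fixes):
    """Bounded evaluate -> fix -> re-run loop over failed items."""
    results = {item: run_item(item, hint=None) for item in items}
    for _ in range(MAX_ROUNDS):
        failed = [i for i in items if evaluate(results[i]) != "PASS"]
        if not failed:
            break
        for item in failed:
            status = evaluate(results[item])
            # Apply the targeted fix for this failure mode, if one is known
            results[item] = run_item(item, hint=fixes.get(status))
    return results

# Toy demo: an item that fails until it gets the right hint.
def run_item(item, hint=None):
    return "ok" if hint == "use DATE_TRUNC" else "syntax error"

def evaluate(output):
    return "PASS" if output == "ok" else "FAIL"

print(iterate(["q1"], run_item, evaluate, {"FAIL": "use DATE_TRUNC"}))
# → {'q1': 'ok'}
```

The bound guarantees termination; the fixes table is what makes each round better than the last.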
#### 5. Report + Validation Report

Every runbook produces both a human-readable `summary.md` and a machine-readable `validation_report.json`. The runbook provides templates for both so the agent doesn't have to guess at the structure.

The `validation_report.json` always includes:
```json
{
  "version": "1.0.0",
  "run_date": "2026-03-26T...",
  "parameters": { },
  "stages": [
    { "name": "...", "passed": true, "message": "..." }
  ],
  "results": { },
  "overall_passed": true
}
```
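A fixed shape makes the report mechanically checkable. A sketch of the kind of consistency check a consumer of the report might run (an illustration, not an official Jetty validator):

```python
import json

REQUIRED_KEYS = {"version", "run_date", "parameters", "stages", "results", "overall_passed"}

def check_report(text: str) -> bool:
    """True if the report parses, has all required keys, and overall_passed
    agrees with the per-stage results."""
    report = json.loads(text)
    if not REQUIRED_KEYS <= report.keys():
        return False
    stages_ok = all(stage["passed"] for stage in report["stages"])
    return report["overall_passed"] == stages_ok

sample = json.dumps({
    "version": "1.0.0",
    "run_date": "2026-03-26T00:00:00Z",
    "parameters": {},
    "stages": [{"name": "fetch", "passed": True, "message": "ok"}],
    "results": {},
    "overall_passed": True,
})
print(check_report(sample))
# → True
```

The standardized filename and shape are what allow checks like this to work across runbooks.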
#### 6. Final Checklist

````markdown
## Step 8: Final Checklist (MANDATORY — do not skip)

### Verification Script

```bash
echo "=== FINAL OUTPUT VERIFICATION ==="
RESULTS_DIR="{{results_dir}}"
for f in "$RESULTS_DIR/results.csv" "$RESULTS_DIR/summary.md" "$RESULTS_DIR/validation_report.json"; do
  if [ ! -s "$f" ]; then
    echo "FAIL: $f is missing or empty"
  else
    echo "PASS: $f ($(wc -c < "$f") bytes)"
  fi
done
```

### Checklist

- [ ] `results.csv` exists and has data rows
- [ ] `summary.md` exists and follows the template
- [ ] `validation_report.json` exists with `stages`, `results`, and `overall_passed`
- [ ] Verification script printed PASS for all files

If ANY item fails, go back and fix it. Do NOT finish until all items pass.
````
This is the runbook's exit gate. The imperative language overrides the agent's natural tendency to wrap up.
### Tips
```markdown
## Tips

- Langfuse auth uses HTTP Basic (username:password), not Bearer tokens
- Snowflake function names differ from Spark — check `DATE_TRUNC` vs `TRUNC`
- The Jetty workflow returns results at `.outputs.results[0]`, not `.outputs.result`
```
Hard-won operational knowledge from watching agents run (and fail). The agent should read these before starting.
## Evaluation Patterns

Declare the pattern in frontmatter with `evaluation: programmatic` or `evaluation: rubric`.
### Programmatic Validation
Best for structured output — JSON, CSV, SQL, code, schemas.
```markdown
## Step 4: Evaluate Outputs

| Status | Criteria |
|--------|----------|
| `PASS` | SQL executes successfully and returns ≥1 row matching expected output |
| `PARTIAL` | SQL executes but results don't match expected output |
| `FAIL` | SQL has syntax errors or doesn't execute |
```
Characteristics:
- Pass/fail is objective (schema validates, SQL executes, tests pass)
- Error messages are specific and actionable
- Iteration converges quickly (usually 1-2 rounds)
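The criteria table translates directly into a status function. A minimal sketch (`classify` is hypothetical; in the real step, the inputs would come from executing the SQL on Snowflake):

```python
def classify(executed: bool, rows: list, expected: list) -> str:
    """PASS / PARTIAL / FAIL per the criteria table above."""
    if not executed:
        return "FAIL"      # syntax error or execution failure
    if rows and rows == expected:
        return "PASS"      # executed and >= 1 row matching expected output
    return "PARTIAL"       # executed, but results don't match

print(classify(True, [("acme", 42)], [("acme", 42)]))  # → PASS
print(classify(True, [], [("acme", 42)]))              # → PARTIAL
print(classify(False, [], [("acme", 42)]))             # → FAIL
```

Because each status maps to a concrete cause, the error messages fed back into iteration stay specific and actionable.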
### Rubric-Based Judgment
Best for creative or complex output — reports, images, generated content.
```markdown
## Step 4: Evaluate Outputs

### Rubric

| # | Criterion | 5 (Excellent) | 3 (Acceptable) | 1 (Poor) |
|---|-----------|---------------|----------------|----------|
| 1 | Accuracy | All facts verified | Minor inaccuracies | Major errors |
| 2 | Completeness | All sections filled | Key sections present | Missing sections |
| 3 | Clarity | Clear, well-structured | Readable | Confusing |

**Pass threshold: >= 4.0 overall, no individual criterion below 3.**
```
Characteristics:
- Quality is subjective, assessed via rubric (1-5 scale)
- The agent is both producer and judge (self-evaluation)
- Iteration targets the weakest criteria, guided by the Common Fixes table
- Can also delegate judgment to a Jetty workflow with judge steps
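The pass threshold from the rubric above reduces to two checks: the mean score and the worst criterion. A minimal sketch:

```python
def rubric_passes(scores: dict) -> bool:
    """Pass: mean >= 4.0 overall and no individual criterion below 3."""
    values = list(scores.values())
    return sum(values) / len(values) >= 4.0 and min(values) >= 3

print(rubric_passes({"accuracy": 5, "completeness": 4, "clarity": 4}))  # → True
print(rubric_passes({"accuracy": 5, "completeness": 5, "clarity": 2}))  # → False (clarity below 3)
```

The per-criterion floor matters: a single weak dimension fails the output even when the average clears the bar, which is what directs iteration at the weakest criteria.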
Don't mix patterns. Use programmatic for structured output, rubric for creative output. Don't rubric-score a JSON file or schema-validate a social graphic.
## Creating a Runbook with the Agent Skill

The `agent-skill` package includes a guided runbook creation wizard. If you have the Jetty Claude Code plugin or MCP server installed, run:

```
/create-runbook
```
The wizard walks you through:
- Choosing an evaluation pattern — programmatic or rubric
- Defining your objective — what the task does end-to-end
- Setting up the output manifest — which files the runbook must produce
- Declaring parameters and dependencies — what varies between runs and what's needed
- Designing processing steps — the substantive work
- Writing evaluation criteria — pass/fail statuses or a scoring rubric
- Adding common fixes and tips — domain knowledge for the agent
It scaffolds a complete RUNBOOK.md from one of two starter templates (programmatic or rubric), then validates the result against the canonical structure.
The templates live in the agent-skill repo at:

- `agent-skill/skills/create-runbook/templates/programmatic.md`
- `agent-skill/skills/create-runbook/templates/rubric.md`
You can also create runbooks manually using these templates as a starting point.
## Running a Runbook

### Locally

Open the runbook in a new agent conversation and tell the agent to follow it:

```
Follow the runbook in ./RUNBOOK.md.
Use these parameters: results_dir=./results, sample_size=10
```
### On Jetty

Send the runbook as the system message with the `jetty` block:
```json
{
  "model": "claude-sonnet-4-6",
  "messages": [
    {
      "role": "system",
      "content": "<contents of your RUNBOOK.md>"
    },
    {
      "role": "user",
      "content": "Execute the runbook"
    }
  ],
  "stream": true,
  "jetty": {
    "runbook": true,
    "collection": "my-org",
    "task": "nl-to-sql-regression",
    "agent": "claude-code",
    "file_paths": ["uploads/test-queries.csv"]
  }
}
```
The agent sandbox receives the runbook as its instruction set. Everything written to `/app/results/` is persisted to cloud storage and available via the trajectory.
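Assembling that request body from a local RUNBOOK.md is easy to script. A sketch (the payload mirrors the shape above; actually sending it is left out, so see the Chat Completions Reference for the endpoint URL and auth details):

```python
import json
from pathlib import Path

def build_payload(runbook_path, user_msg="Execute the runbook"):
    """Embed a runbook file as the system message of a chat completions request."""
    return {
        "model": "claude-sonnet-4-6",
        "messages": [
            {"role": "system", "content": Path(runbook_path).read_text()},
            {"role": "user", "content": user_msg},
        ],
        "stream": True,
        "jetty": {
            "runbook": True,
            "collection": "my-org",
            "task": "nl-to-sql-regression",
            "agent": "claude-code",
        },
    }

# body = json.dumps(build_payload("RUNBOOK.md"))  # then POST to the chat completions endpoint
```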
See Chat Completions Reference for the full API spec.
## Validating a Runbook
Before running, validate your runbook's structure. The agent-skill package includes a validation script that checks:
| Check | Severity |
|---|---|
| Frontmatter with `version` and `evaluation` | Error |
| `evaluation` is `programmatic` or `rubric` | Error |
| `## Objective` section present | Error |
| `## REQUIRED OUTPUT FILES` section present | Error |
| `validation_report.json` in manifest | Error |
| `summary.md` in manifest | Warning |
| All `{{template_vars}}` declared in Parameters | Error |
| At least one evaluation step | Error |
| Iteration step with max rounds | Error |
| `## Final Checklist` with verification script | Error |
| `## Dependencies` section | Warning |
| `## Tips` section | Warning |
The `/create-runbook` wizard runs this automatically at the end. You can also run it manually — the full validation script is in `agent-skill/skills/create-runbook/SKILL.md`.
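The core of such a validator is a handful of pattern checks over the raw markdown. A simplified sketch of the idea (not the actual script from the skill repo; this check list is abbreviated):

```python
import re

# (severity, pattern, description) — a subset of the checks in the table above
CHECKS = [
    ("error", r"(?m)^version:\s*\S+", "frontmatter version"),
    ("error", r"(?m)^evaluation:\s*(programmatic|rubric)\s*$", "evaluation pattern"),
    ("error", r"(?m)^## Objective", "Objective section"),
    ("error", r"REQUIRED OUTPUT FILES", "output manifest"),
    ("error", r"validation_report\.json", "validation report in manifest"),
    ("warning", r"(?m)^## Tips", "Tips section"),
]

def validate(runbook_text):
    """Return (severity, message) for every failed check; empty list means valid."""
    return [(sev, f"missing {desc}") for sev, pattern, desc in CHECKS
            if not re.search(pattern, runbook_text)]
```

Running `validate` on an empty string reports every check as missing; a well-formed runbook returns an empty list.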
## Evolving Runbooks Over Time
Runbooks improve through use:
- Tips accumulate as agents encounter new failure modes
- Common Fixes tables grow as patterns emerge
- Rubrics get refined as the quality bar becomes clearer
- Parameters get added as new use cases arise
- Evaluation criteria tighten as the system matures
Bump the `version` in frontmatter when you make structural changes that affect output or evaluation. This lets you track which version produced a given trajectory.
## Authoring Checklist
Do:
- Be specific about API calls — include full curl examples with expected request/response shapes
- Show the expected output structure — JSON templates, CSV columns, markdown skeletons
- Encode domain knowledge in Tips — save the agent significant debugging time
- Make evaluation criteria concrete — "score >= 4.0, no criterion below 3" not "good quality"
- Bound iteration — always specify a max round count
- Use imperative language in the output manifest and final checklist
- List all dependencies — workflows, APIs, credentials, packages
Don't:
- Over-specify intermediate steps — the agent should have room to adapt
- Skip the verification script — it's the only reliable way to ensure all outputs exist
- Assume the agent remembers earlier steps — re-state key context when needed
- Mix evaluation patterns — programmatic for structured output, rubric for creative output
## Next Steps
- Agentic Workflows — Sandbox execution lifecycle and API reference
- Quickstart: Agentic Workflows — Run your first agent in 5 minutes
- CI Integration — Trigger runbooks from GitHub Actions
- Agent Skill repo — Templates, MCP server, and the `/create-runbook` wizard
- Runbook PRD — Full specification and design decisions