
Workflow Orchestration

Build AI data pipelines with built-in evaluation.

The Idea

Single-agent runs are powerful, but production workloads often need multiple steps: extract data, process it with an LLM, evaluate quality, branch on results. Jetty's workflow engine composes these into DAGs with 47+ step types, path expressions to wire data between steps, and durable execution that survives crashes.

[Diagram: multi-step workflow DAG]

Step Types

| Category | Step Types | Purpose |
| --- | --- | --- |
| AI Models | `litellm_chat`, image generation, vision, embeddings | LLM calls with any provider |
| Evaluation | `simple_judge`, structured evals, scoring | LLM-as-judge, criteria-based assessment |
| Control Flow | Branching, loops, iteration over collections | Conditional logic and repetition |
| Data Processing | Text templates, tool execution, file transforms | Transform and reshape data |
| Agent Execution | `runbook_agent` | Full sandboxed agent runs (see Agentic Workflows) |
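
A workflow built from these step types can be thought of as a plain data structure whose steps reference each other's outputs. A minimal sketch — the field names and schema here are illustrative assumptions, not Jetty's actual workflow format:

```python
# Hypothetical workflow definition: extract text, then judge it.
# Step IDs, field names, and the overall schema are illustrative.
workflow = {
    "name": "extract-and-evaluate",
    "steps": [
        {
            "id": "extract",
            "type": "litellm_chat",
            "inputs": {"prompt": "Extract the key facts from: {document}"},
        },
        {
            "id": "evaluate",
            "type": "simple_judge",
            # A path expression wires extract's output into the judge
            "inputs": {
                "content": "extract.outputs.content",
                "criteria": "All facts are grounded in the source.",
            },
        },
    ],
}

# A DAG edge exists wherever an input references another step's outputs
edges = [
    (ref.split(".")[0], step["id"])
    for step in workflow["steps"]
    for ref in step["inputs"].values()
    if ".outputs." in ref
]
print(edges)  # [('extract', 'evaluate')]
```

The engine can derive the execution order from these references alone, which is what lets it sequence, retry, and resume steps without extra configuration.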

Path Expressions

Steps wire together through path expressions. One step's output becomes the next step's input:

step_a.outputs.files[0].path → becomes input for step_b
extract.outputs.content → text extracted by the LLM
evaluate.outputs.results[0] → first evaluation result

The workflow engine handles sequencing, error recovery, retry logic, and artifact management automatically.
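
Resolving a path expression amounts to walking nested outputs by name and index. A sketch of how such a resolver could work — this is an illustration, not Jetty's actual engine:

```python
import re

def resolve(expr: str, root: dict):
    """Resolve a path expression such as 'step_a.outputs.files[0].path'
    against nested step outputs. Illustrative sketch only."""
    value = root
    # Tokens are either dotted names or [N] list indices
    for name, index in re.findall(r"(\w+)|\[(\d+)\]", expr):
        value = value[name] if name else value[int(index)]
    return value

outputs = {
    "step_a": {"outputs": {"files": [{"path": "/tmp/report.txt"}]}},
}
print(resolve("step_a.outputs.files[0].path", outputs))  # /tmp/report.txt
```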

Eval-Driven Quality Gates

The real power of orchestration is closing the loop between execution and evaluation.

The Pattern

  1. Runbook Agent — Executes the task: generate code, analyze data, process documents. Outputs files + structured data.
  2. LLM-as-Judge — Evaluates output against your criteria. Returns a score, explanation, pass/fail, and criteria breakdown.
  3. Quality Gate — Branches on pass/fail. Pass = proceed. Fail = block + explain. Post status to PR.
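
The gate step reduces to a branch on the judge's verdict. A minimal sketch, assuming a judge result shaped like the docs describe (score, explanation, pass/fail) — the function and field names are assumptions, not Jetty's API:

```python
def quality_gate(judge_result: dict, threshold: int = 3) -> dict:
    """Branch on the judge's verdict: pass -> proceed, fail -> block.
    Result shape (score, explanation) is an assumed schema."""
    passed = judge_result["score"] >= threshold
    return {
        "status": "proceed" if passed else "block",
        "explanation": judge_result["explanation"],
    }

result = quality_gate({"score": 2, "explanation": "Unhandled SQL injection risk."})
print(result["status"])  # block
```

In a real pipeline the `explanation` would be posted back to the PR or upload record so the failure is actionable, not just a red X.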

Example Use Cases

| Use Case | Agent Does | Judge Checks | Gate Action |
| --- | --- | --- | --- |
| AI Code Review | Diffs PR for bugs, security, style | Scores quality 1–5 | Block merge if score < 3 |
| Document Validation | Extracts info from uploads | Checks accuracy, completeness | Fail if critical fields missing |
| Test Generation | Generates tests for new code, runs in sandbox | Evaluates correctness, coverage | Gate on pass rate |

Building Evaluation Datasets

Every workflow run produces a trajectory with full execution history. Over time, you accumulate a labeled dataset of quality evaluations:

  • Compare runs across models
  • Track quality trends
  • Replay failures
  • Train on outcomes (which recommendations were accepted vs. rejected)
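
Once runs accumulate, cross-model comparison is a simple aggregation over trajectory records. A sketch with fabricated example records — the record shape and model names are purely illustrative:

```python
from collections import defaultdict

# Hypothetical trajectory records accumulated from workflow runs
trajectories = [
    {"model": "model-a", "passed": True},
    {"model": "model-a", "passed": False},
    {"model": "model-b", "passed": True},
    {"model": "model-b", "passed": True},
]

# Pass rate per model: the kind of comparison the dataset enables
totals = defaultdict(lambda: [0, 0])  # model -> [passes, runs]
for t in trajectories:
    totals[t["model"]][0] += t["passed"]
    totals[t["model"]][1] += 1

rates = {model: passes / runs for model, (passes, runs) in totals.items()}
print(rates)  # {'model-a': 0.5, 'model-b': 1.0}
```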

CI for AI

Workflows integrate directly into your CI pipeline via GitHub Actions.

Two Endpoint Modes

| Mode | Endpoint | Behavior |
| --- | --- | --- |
| Sync | `/run-github-action/{collection}/{task}` | Blocks until complete. Best for fast checks within the CI timeout. |
| Async | `/run-github-action-async/{collection}/{task}` | Returns immediately with a poll URL; webhook callback when done. For long-running tasks. |
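
From a CI job, choosing a mode is just choosing a URL. A sketch of building the two endpoint URLs — the host is a placeholder and only the endpoint paths come from the docs above:

```python
BASE = "https://example-jetty-host"  # hypothetical host, replace with yours

def endpoint(collection: str, task: str, async_mode: bool = False) -> str:
    """Build the sync or async GitHub Action endpoint URL."""
    mode = "run-github-action-async" if async_mode else "run-github-action"
    return f"{BASE}/{mode}/{collection}/{task}"

print(endpoint("reviews", "pr-check"))
# https://example-jetty-host/run-github-action/reviews/pr-check
print(endpoint("reviews", "pr-check", async_mode=True))
# https://example-jetty-host/run-github-action-async/reviews/pr-check
```

Use sync for anything that reliably finishes inside your job's timeout; switch to async plus the webhook callback for long agent runs so the CI job is not held open.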

See CI Integration Guide for a step-by-step setup walkthrough.

Next Steps