# Run Batch Evaluations in 10 Minutes

Process hundreds of items in parallel using fan-out/fan-in patterns.
## What You'll Build

Dataset → Fan-out (parallel processing) → Fan-in (collect results) → Report
## The Challenge

You have a list of prompts. You want to:
- Generate images for each prompt
- Evaluate each image with LLM-as-Judge
- Collect all results into a report
## The Workflow

### Parent Workflow: evaluate-dataset
{
"init_params": {
"prompts": [
{"id": "1", "text": "A cat sitting on a couch"},
{"id": "2", "text": "A dog playing in a park"},
{"id": "3", "text": "A bird flying over mountains"}
]
},
"step_configs": {
"evaluate_all": {
"activity": "list_emit_await",
"items_path": "init_params.prompts",
"task_reference": {
"task_name": "single-evaluation"
},
"data_mapping": {
"prompt_id": "{{ $item.id }}",
"prompt_text": "{{ $item.text }}"
},
"execution_config": {
"max_parallel": 5
}
},
"collect_results": {
"activity": "extract_from_trajectories",
"trajectory_list_path": "evaluate_all.outputs.trajectory_references",
"extract_keys": {
"prompt_id": "init_params.prompt_id",
"prompt_text": "init_params.prompt_text",
"image": "generate.outputs.images[0].path",
"score": "judge.outputs.average_score"
}
}
},
"steps": ["evaluate_all", "collect_results"]
}
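Each `data_mapping` value is a template rendered once per dataset item to build that child's init params. The engine's actual templating is not shown in this guide; as a rough illustration of the substitution idea only, here is a Python sketch (the `render_mapping` helper is hypothetical):

```python
import re

def render_mapping(mapping: dict, item: dict) -> dict:
    """Substitute {{ $item.key }} placeholders with values from one
    dataset item, producing init params for a single child run.
    Illustrative only -- the real engine's templating may differ."""
    def render(template: str) -> str:
        return re.sub(
            r"\{\{\s*\$item\.(\w+)\s*\}\}",
            lambda m: str(item[m.group(1)]),
            template,
        )
    return {key: render(value) for key, value in mapping.items()}

mapping = {"prompt_id": "{{ $item.id }}", "prompt_text": "{{ $item.text }}"}
item = {"id": "1", "text": "A cat sitting on a couch"}
print(render_mapping(mapping, item))
# → {'prompt_id': '1', 'prompt_text': 'A cat sitting on a couch'}
```

With three prompts in `init_params.prompts`, this rendering happens three times, yielding one child workflow invocation per item.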
### Child Workflow: single-evaluation

Create this as a separate task in your collection:
{
"init_params": {
"prompt_id": "",
"prompt_text": ""
},
"step_configs": {
"generate": {
"model": "black-forest-labs/flux-schnell",
"activity": "replicate_text2image",
"prompt_path": "init_params.prompt_text"
},
"judge": {
"model": "gpt-4o",
"activity": "simple_judge",
"items_path": "generate.outputs.images[0].path",
"judge_type": "scale",
"instruction": "Does this image accurately depict the prompt?",
"scale_range": [1, 5],
"model_provider": "openai"
}
},
"steps": ["generate", "judge"]
}
## Key Concepts

| Concept | What It Does |
|---|---|
| `list_emit_await` | Fan-out: runs the child workflow for each item in parallel |
| `task_reference` | Specifies which child workflow to run |
| `data_mapping` | Defines how data passes from parent to child |
| `extract_from_trajectories` | Fan-in: collects results from all children |
| `max_parallel` | Limits concurrent executions (avoids rate limits) |
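The fan-out/fan-in shape itself is not specific to this engine. As a conceptual model only (the `run_child` stub and its output are invented for illustration), here is the same pattern in plain Python with `asyncio`, where a semaphore plays the role of `max_parallel`:

```python
import asyncio

async def run_child(item: dict, sem: asyncio.Semaphore) -> dict:
    """Stand-in for one child workflow run (generate -> judge)."""
    async with sem:                      # fan-out bounded by max_parallel
        await asyncio.sleep(0)           # placeholder for the real work
        return {"prompt_id": item["id"], "score": 4.0}

async def evaluate_all(prompts: list[dict], max_parallel: int = 5) -> list[dict]:
    sem = asyncio.Semaphore(max_parallel)
    tasks = [run_child(p, sem) for p in prompts]
    return await asyncio.gather(*tasks)  # fan-in: collect every result, in order

prompts = [{"id": "1"}, {"id": "2"}, {"id": "3"}]
results = asyncio.run(evaluate_all(prompts))
print([r["prompt_id"] for r in results])
# → ['1', '2', '3']
```

`asyncio.gather` returns results in submission order, which mirrors why `collect_results` can line trajectories back up with their input items.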
## How Data Flows
Parent Workflow
│
├─→ Child 1: {prompt_id: "1", prompt_text: "A cat..."} → generate → judge
├─→ Child 2: {prompt_id: "2", prompt_text: "A dog..."} → generate → judge
└─→ Child 3: {prompt_id: "3", prompt_text: "A bird..."} → generate → judge
│
└── collect_results: Gather all scores and images
## The Output
{
"collect_results": {
"outputs": {
"extracted_data": [
{"prompt_id": "1", "prompt_text": "A cat sitting on a couch", "score": 4.0, "image": "..."},
{"prompt_id": "2", "prompt_text": "A dog playing in a park", "score": 5.0, "image": "..."},
{"prompt_id": "3", "prompt_text": "A bird flying over mountains", "score": 3.0, "image": "..."}
]
}
}
}
## Add a Summary Step

Generate a report from the collected data:
{
"init_params": {
"prompts": [...]
},
"step_configs": {
"evaluate_all": { ... },
"collect_results": { ... },
"summarize": {
"model": "openai/gpt-4o",
"activity": "litellm_chat",
"user_prompt": "Analyze these evaluation results and provide a summary report:\n\n{{ collect_results.outputs.extracted_data | tojson }}\n\nInclude: average score, highest/lowest performing prompts, and recommendations."
}
},
"steps": ["evaluate_all", "collect_results", "summarize"]
}
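If you want the headline numbers to be exact rather than LLM-generated, you can also compute them deterministically from `extracted_data` before (or instead of) the summary step. A minimal sketch, assuming the result shape shown in the output above:

```python
# Sample rows in the shape of collect_results.outputs.extracted_data
extracted = [
    {"prompt_id": "1", "score": 4.0},
    {"prompt_id": "2", "score": 5.0},
    {"prompt_id": "3", "score": 3.0},
]

average = sum(r["score"] for r in extracted) / len(extracted)
best = max(extracted, key=lambda r: r["score"])   # highest-scoring prompt
worst = min(extracted, key=lambda r: r["score"])  # lowest-scoring prompt

print(average, best["prompt_id"], worst["prompt_id"])
# → 4.0 2 3
```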
## Scale to Thousands

### Control parallelism
{
"execution_config": {
"max_parallel": 10,
"timeout_seconds": 3600
}
}
| max_parallel | Use Case |
|---|---|
| 5 | Avoid API rate limits |
| 10-20 | Standard batch processing |
| 50+ | High throughput (check your provider's limits) |
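`timeout_seconds` caps how long execution may run. The engine's timeout handling is not shown here, but the effect is analogous to wrapping long-running work with a deadline, as in this Python sketch (the `slow_child` stub is invented for illustration):

```python
import asyncio

async def slow_child() -> str:
    await asyncio.sleep(10)  # pretend this is a long-running child workflow
    return "done"

async def run_with_timeout(timeout_seconds: float) -> str:
    try:
        return await asyncio.wait_for(slow_child(), timeout=timeout_seconds)
    except asyncio.TimeoutError:
        return "timed out"  # record the failure instead of hanging forever

print(asyncio.run(run_with_timeout(0.01)))
# → timed out
```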
### Handle failures gracefully
Failed child workflows are tracked. You can:
- Review failures in the trajectory
- Re-run only failed items
- Set retry policies in the child workflow
## Load Data from CSV

Instead of hardcoding prompts, load them from a file:
{
"init_params": {},
"step_configs": {
"load_data": {
"activity": "read_text_file",
"text_path": "init_params.file_paths[0]"
},
"parse_csv": {
"activity": "litellm_chat",
"model": "openai/gpt-4o-mini",
"user_prompt": "Parse this CSV and return as JSON array: {{ load_data.outputs.text }}"
},
"evaluate_all": {
"activity": "list_emit_await",
"items_path": "parse_csv.outputs.content",
...
}
},
"steps": ["load_data", "parse_csv", "evaluate_all", "collect_results"]
}
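Having an LLM parse CSV works, but it is nondeterministic and costs tokens. If you can preprocess the file outside the workflow, Python's standard `csv` module does the same conversion exactly. A minimal sketch with inline sample data:

```python
import csv
import io
import json

raw = "id,text\n1,A cat sitting on a couch\n2,A dog playing in a park\n"

# DictReader uses the header row as keys, giving one dict per data row
rows = list(csv.DictReader(io.StringIO(raw)))
print(json.dumps(rows))
# → [{"id": "1", "text": "A cat sitting on a couch"}, {"id": "2", "text": "A dog playing in a park"}]
```

The resulting JSON array is in the shape `list_emit_await` expects for `items_path`.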
## Batch LLM Evaluation

Evaluate text responses across a dataset:
{
"init_params": {
"test_cases": [
{"prompt": "Explain gravity", "expected": "gravitational force"},
{"prompt": "What is photosynthesis?", "expected": "plants convert sunlight"},
{"prompt": "Define democracy", "expected": "government by the people"}
],
"model": "openai/gpt-4o-mini"
},
"step_configs": {
"run_all": {
"activity": "list_emit_await",
"items_path": "init_params.test_cases",
"task_reference": {
"task_name": "single-qa-eval"
},
"data_mapping": {
"prompt": "{{ $item.prompt }}",
"expected": "{{ $item.expected }}",
"model": "{{ init_params.model }}"
}
},
"collect": {
"activity": "extract_from_trajectories",
"trajectory_list_path": "run_all.outputs.trajectory_references",
"extract_keys": {
"prompt": "init_params.prompt",
"response": "generate.outputs.content",
"score": "evaluate.outputs.average_score",
"passed": "evaluate.outputs.rating"
}
}
},
"steps": ["run_all", "collect"]
}
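The `evaluate` step in the child workflow scores answers with an LLM judge. For comparison, the simplest deterministic baseline is a keyword check against `expected`; this toy helper is not part of the workflow engine:

```python
def keyword_match(response: str, expected: str) -> bool:
    """Toy pass/fail check: did the response mention the expected phrase?
    Illustrative baseline only -- the workflow above uses an LLM judge."""
    return expected.lower() in response.lower()

print(keyword_match("Gravity is the gravitational force between masses.",
                    "gravitational force"))
# → True
```

Keyword checks are brittle (a correct paraphrase fails them), which is exactly why an LLM judge is often worth the extra cost.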
## Best Practices

- Start small: Test with 3-5 items before scaling to hundreds
- Set `max_parallel`: Respect API rate limits
- Use timeouts: Prevent runaway workflows
- Check child trajectories: Debug failures individually
- Extract specific fields: Only collect the data you need
## Next Steps
- Agent Benchmarking - Run agent evaluations at scale
- LLM Evaluation - Design better evaluation criteria
- Image Generation - Image-specific batch workflows