Run Batch Evaluations in 10 Minutes

Process hundreds of items in parallel using fan-out/fan-in patterns.

What You'll Build

Dataset → Fan-out (parallel processing) → Fan-in (collect results) → Report

The Challenge

You have a list of prompts. You want to:

  1. Generate images for each prompt
  2. Evaluate each image with LLM-as-Judge
  3. Collect all results into a report

The Workflow

Parent Workflow: evaluate-dataset

{
  "init_params": {
    "prompts": [
      {"id": "1", "text": "A cat sitting on a couch"},
      {"id": "2", "text": "A dog playing in a park"},
      {"id": "3", "text": "A bird flying over mountains"}
    ]
  },
  "step_configs": {
    "evaluate_all": {
      "activity": "list_emit_await",
      "items_path": "init_params.prompts",
      "task_reference": {
        "task_name": "single-evaluation"
      },
      "data_mapping": {
        "prompt_id": "{{ $item.id }}",
        "prompt_text": "{{ $item.text }}"
      },
      "execution_config": {
        "max_parallel": 5
      }
    },
    "collect_results": {
      "activity": "extract_from_trajectories",
      "trajectory_list_path": "evaluate_all.outputs.trajectory_references",
      "extract_keys": {
        "prompt_id": "init_params.prompt_id",
        "prompt_text": "init_params.prompt_text",
        "image": "generate.outputs.images[0].path",
        "score": "judge.outputs.average_score"
      }
    }
  },
  "steps": ["evaluate_all", "collect_results"]
}

Child Workflow: single-evaluation

Create this as a separate task in your collection:

{
  "init_params": {
    "prompt_id": "",
    "prompt_text": ""
  },
  "step_configs": {
    "generate": {
      "model": "black-forest-labs/flux-schnell",
      "activity": "replicate_text2image",
      "prompt_path": "init_params.prompt_text"
    },
    "judge": {
      "model": "gpt-4o",
      "activity": "simple_judge",
      "items_path": "generate.outputs.images[0].path",
      "judge_type": "scale",
      "instruction": "Does this image accurately depict the prompt?",
      "scale_range": [1, 5],
      "model_provider": "openai"
    }
  },
  "steps": ["generate", "judge"]
}

Key Concepts

Concept                       What It Does
list_emit_await               Fan-out: run a child workflow for each item in parallel
task_reference                Specifies which child workflow to run
data_mapping                  How to pass data from parent to child
extract_from_trajectories     Fan-in: collect results from all children
max_parallel                  Limit concurrent executions (avoid rate limits)
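The fan-out/fan-in pattern these activities implement can be sketched in plain Python with asyncio. This is a conceptual sketch only, not the engine's actual machinery: child_workflow here is a hypothetical stand-in for the single-evaluation task, and the semaphore plays the role of max_parallel.

```python
import asyncio

async def fan_out_fan_in(items, worker, max_parallel=5):
    """Run `worker` for each item, at most `max_parallel` at a time,
    then gather all results in input order (fan-in)."""
    sem = asyncio.Semaphore(max_parallel)

    async def run_one(item):
        async with sem:          # caps concurrency, like max_parallel
            return await worker(item)

    return await asyncio.gather(*(run_one(i) for i in items))

async def child_workflow(prompt):
    # Stand-in for the single-evaluation child (generate -> judge).
    await asyncio.sleep(0)       # placeholder for the real API calls
    return {"prompt_id": prompt["id"], "score": 4.0}

prompts = [{"id": "1"}, {"id": "2"}, {"id": "3"}]
results = asyncio.run(fan_out_fan_in(prompts, child_workflow))
print([r["prompt_id"] for r in results])  # gather preserves input order
```

Note that asyncio.gather returns results in the same order as the inputs, which is why the fan-in step can line results back up with their prompts.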

How Data Flows

Parent Workflow
  ├─→ Child 1: {prompt_id: "1", prompt_text: "A cat..."} → generate → judge
  ├─→ Child 2: {prompt_id: "2", prompt_text: "A dog..."} → generate → judge
  ├─→ Child 3: {prompt_id: "3", prompt_text: "A bird..."} → generate → judge
  └── collect_results: gather all scores and images

The Output

{
  "collect_results": {
    "outputs": {
      "extracted_data": [
        {"prompt_id": "1", "prompt_text": "A cat sitting on a couch", "score": 4.0, "image": "..."},
        {"prompt_id": "2", "prompt_text": "A dog playing in a park", "score": 5.0, "image": "..."},
        {"prompt_id": "3", "prompt_text": "A bird flying over mountains", "score": 3.0, "image": "..."}
      ]
    }
  }
}

Add a Summary Step

Generate a report from the collected data:

{
  "init_params": {
    "prompts": [...]
  },
  "step_configs": {
    "evaluate_all": { ... },
    "collect_results": { ... },
    "summarize": {
      "model": "openai/gpt-4o",
      "activity": "litellm_chat",
      "user_prompt": "Analyze these evaluation results and provide a summary report:\n\n{{ collect_results.outputs.extracted_data | tojson }}\n\nInclude: average score, highest/lowest performing prompts, and recommendations."
    }
  },
  "steps": ["evaluate_all", "collect_results", "summarize"]
}
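If you only need the aggregate numbers rather than a narrative report, the same summary can be computed deterministically without an extra LLM call. A minimal sketch, assuming rows shaped like the extracted_data output shown earlier:

```python
from statistics import mean

# Rows shaped like collect_results.outputs.extracted_data
extracted_data = [
    {"prompt_id": "1", "prompt_text": "A cat sitting on a couch", "score": 4.0},
    {"prompt_id": "2", "prompt_text": "A dog playing in a park", "score": 5.0},
    {"prompt_id": "3", "prompt_text": "A bird flying over mountains", "score": 3.0},
]

average = mean(row["score"] for row in extracted_data)
best = max(extracted_data, key=lambda r: r["score"])
worst = min(extracted_data, key=lambda r: r["score"])

print(f"average={average:.2f} best={best['prompt_id']} worst={worst['prompt_id']}")
```

A reasonable split: compute the statistics in code, and reserve the LLM summarize step for qualitative recommendations.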

Scale to Thousands

Control parallelism

{
  "execution_config": {
    "max_parallel": 10,
    "timeout_seconds": 3600
  }
}
max_parallel    Use Case
5               Avoid API rate limits
10-20           Standard batch processing
50+             High-throughput (check your limits)

Handle failures gracefully

Failed child workflows are tracked. You can:

  • Review failures in the trajectory
  • Re-run only failed items
  • Set retry policies in the child workflow
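The "re-run only failed items" idea is easy to picture in plain Python, even though the engine handles this for you. A sketch, not the engine's API: collect exceptions alongside results, then re-submit just the items that raised.

```python
import asyncio

async def run_with_retry(items, worker, retries=1):
    """Run worker over items; after each pass, re-run only the failures."""
    pending = list(items)
    results = {}
    for _ in range(retries + 1):
        if not pending:
            break
        outcomes = await asyncio.gather(
            *(worker(i) for i in pending), return_exceptions=True
        )
        failed = []
        for item, out in zip(pending, outcomes):
            if isinstance(out, Exception):
                failed.append(item)    # keep for the next pass
            else:
                results[item] = out
        pending = failed
    return results, pending            # pending = items that still fail

# A worker that fails once on "b", then succeeds on the retry pass.
calls = {"b": 0}
async def flaky(item):
    if item == "b" and calls["b"] == 0:
        calls["b"] += 1
        raise RuntimeError("transient")
    return item.upper()

ok, still_failed = asyncio.run(run_with_retry(["a", "b", "c"], flaky))
```

The key detail is return_exceptions=True, which makes gather hand back exceptions as values instead of cancelling the whole batch, so one bad item never loses the others' results.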

Load Data from CSV

Instead of hardcoding prompts, load from a file:

{
  "init_params": {},
  "step_configs": {
    "load_data": {
      "activity": "read_text_file",
      "text_path": "init_params.file_paths[0]"
    },
    "parse_csv": {
      "activity": "litellm_chat",
      "model": "openai/gpt-4o-mini",
      "user_prompt": "Parse this CSV and return as JSON array: {{ load_data.outputs.text }}"
    },
    "evaluate_all": {
      "activity": "list_emit_await",
      "items_path": "parse_csv.outputs.content",
      ...
    }
  },
  "steps": ["load_data", "parse_csv", "evaluate_all", "collect_results"]
}
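If your CSV is well-formed, the standard library parses it more cheaply and reliably than an LLM call. A minimal sketch, assuming a header row with id and text columns mirroring the prompts list above:

```python
import csv
import io

csv_text = "id,text\n1,A cat sitting on a couch\n2,A dog playing in a park\n"

# csv.DictReader keys each row by the header row, giving JSON-ready dicts
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(rows[0])  # {'id': '1', 'text': 'A cat sitting on a couch'}
```

The LLM parsing step remains useful for messy or irregular input, but for clean exports a deterministic parser avoids both cost and the risk of the model reshaping your data.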

Batch LLM Evaluation

Evaluate text responses across a dataset:

{
  "init_params": {
    "test_cases": [
      {"prompt": "Explain gravity", "expected": "gravitational force"},
      {"prompt": "What is photosynthesis?", "expected": "plants convert sunlight"},
      {"prompt": "Define democracy", "expected": "government by the people"}
    ],
    "model": "openai/gpt-4o-mini"
  },
  "step_configs": {
    "run_all": {
      "activity": "list_emit_await",
      "items_path": "init_params.test_cases",
      "task_reference": {
        "task_name": "single-qa-eval"
      },
      "data_mapping": {
        "prompt": "{{ $item.prompt }}",
        "expected": "{{ $item.expected }}",
        "model": "{{ init_params.model }}"
      }
    },
    "collect": {
      "activity": "extract_from_trajectories",
      "trajectory_list_path": "run_all.outputs.trajectory_references",
      "extract_keys": {
        "prompt": "init_params.prompt",
        "response": "generate.outputs.content",
        "score": "evaluate.outputs.average_score",
        "passed": "evaluate.outputs.rating"
      }
    }
  },
  "steps": ["run_all", "collect"]
}

Best Practices

  1. Start small: Test with 3-5 items before scaling to hundreds
  2. Set max_parallel: Respect API rate limits
  3. Use timeouts: Prevent runaway workflows
  4. Check child trajectories: Debug failures individually
  5. Extract specific fields: Only collect data you need

Next Steps