# Run Batch Evaluations in 10 Minutes

Process hundreds of items in parallel using fan-out/fan-in patterns.
## What You'll Build

Dataset → Fan-out (parallel processing) → Fan-in (collect results) → Report
## The Challenge

You have a list of prompts. You want to:
- Generate images for each prompt
- Evaluate each image with LLM-as-Judge
- Collect all results into a report
## The Workflow

### Parent Workflow: evaluate-dataset
{
"init_params": {
"prompts": [
{"id": "1", "text": "A cat sitting on a couch"},
{"id": "2", "text": "A dog playing in a park"},
{"id": "3", "text": "A bird flying over mountains"}
]
},
"step_configs": {
"evaluate_all": {
"activity": "list_emit_await",
"items_path": "init_params.prompts",
"task_reference": {
"task_name": "single-evaluation"
},
"data_mapping": {
"prompt_id": "{{ $item.id }}",
"prompt_text": "{{ $item.text }}"
},
"execution_config": {
"max_parallel": 5
}
},
"collect_results": {
"activity": "extract_from_trajectories",
"trajectory_list_path": "evaluate_all.outputs.trajectory_references",
"extract_keys": {
"prompt_id": "init_params.prompt_id",
"prompt_text": "init_params.prompt_text",
"image": "generate.outputs.images[0].path",
"score": "judge.outputs.average_score"
}
}
},
"steps": ["evaluate_all", "collect_results"]
}
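Each `data_mapping` value is a template rendered once per dataset item to build that child's init params. The engine's actual templating is not shown in this guide; as a rough illustration of the substitution idea only, here is a Python sketch (the `render_mapping` helper is hypothetical):

```python
import re

def render_mapping(mapping: dict, item: dict) -> dict:
    """Substitute {{ $item.key }} placeholders with values from one
    dataset item, producing init params for a single child run.
    Illustrative only -- the real engine's templating may differ."""
    def render(template: str) -> str:
        return re.sub(
            r"\{\{\s*\$item\.(\w+)\s*\}\}",
            lambda m: str(item[m.group(1)]),
            template,
        )
    return {key: render(value) for key, value in mapping.items()}

mapping = {"prompt_id": "{{ $item.id }}", "prompt_text": "{{ $item.text }}"}
item = {"id": "1", "text": "A cat sitting on a couch"}
print(render_mapping(mapping, item))
# → {'prompt_id': '1', 'prompt_text': 'A cat sitting on a couch'}
```

With three prompts in `init_params.prompts`, this rendering happens three times, yielding one child workflow invocation per item.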
### Child Workflow: single-evaluation

Create this as a separate task in your collection:
{
"init_params": {
"prompt_id": "",
"prompt_text": ""
},
"step_configs": {
"generate": {
"model": "black-forest-labs/flux-schnell",
"activity": "replicate_text2image",
"prompt_path": "init_params.prompt_text"
},
"judge": {
"model": "gpt-4o",
"activity": "simple_judge",
"items_path": "generate.outputs.images[0].path",
"judge_type": "scale",
"instruction": "Does this image accurately depict the prompt?",
"scale_range": [1, 5],
"model_provider": "openai"
}
},
"steps": ["generate", "judge"]
}
## Key Concepts

| Concept | What It Does |
|---|---|
| `list_emit_await` | Fan-out: runs the child workflow for each item in parallel |
| `task_reference` | Specifies which child workflow to run |
| `data_mapping` | Defines how data passes from parent to child |
| `extract_from_trajectories` | Fan-in: collects results from all children |
| `max_parallel` | Limits concurrent executions (avoids rate limits) |
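The fan-out/fan-in shape itself is not specific to this engine. As a conceptual model only (the `run_child` stub and its output are invented for illustration), here is the same pattern in plain Python with `asyncio`, where a semaphore plays the role of `max_parallel`:

```python
import asyncio

async def run_child(item: dict, sem: asyncio.Semaphore) -> dict:
    """Stand-in for one child workflow run (generate -> judge)."""
    async with sem:                      # fan-out bounded by max_parallel
        await asyncio.sleep(0)           # placeholder for the real work
        return {"prompt_id": item["id"], "score": 4.0}

async def evaluate_all(prompts: list[dict], max_parallel: int = 5) -> list[dict]:
    sem = asyncio.Semaphore(max_parallel)
    tasks = [run_child(p, sem) for p in prompts]
    return await asyncio.gather(*tasks)  # fan-in: collect every result, in order

prompts = [{"id": "1"}, {"id": "2"}, {"id": "3"}]
results = asyncio.run(evaluate_all(prompts))
print([r["prompt_id"] for r in results])
# → ['1', '2', '3']
```

`asyncio.gather` returns results in submission order, which mirrors why `collect_results` can line trajectories back up with their input items.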
## How Data Flows
Parent Workflow
│
├─→ Child 1: {prompt_id: "1", prompt_text: "A cat..."} → generate → judge
├─→ Child 2: {prompt_id: "2", prompt_text: "A dog..."} → generate → judge
└─→ Child 3: {prompt_id: "3", prompt_text: "A bird..."} → generate → judge
│
└── collect_results: Gather all scores and images
## The Output
{
"collect_results": {
"outputs": {
"extracted_data": [
{"prompt_id": "1", "prompt_text": "A cat sitting on a couch", "score": 4.0, "image": "..."},
{"prompt_id": "2", "prompt_text": "A dog playing in a park", "score": 5.0, "image": "..."},
{"prompt_id": "3", "prompt_text": "A bird flying over mountains", "score": 3.0, "image": "..."}
]
}
}
}
## Add a Summary Step

Generate a report from the collected data:
{
"init_params": {
"prompts": [...]
},
"step_configs": {
"evaluate_all": { ... },
"collect_results": { ... },
"summarize": {
"model": "openai/gpt-4o",
"activity": "litellm_chat",
"user_prompt": "Analyze these evaluation results and provide a summary report:\n\n{{ collect_results.outputs.extracted_data | tojson }}\n\nInclude: average score, highest/lowest performing prompts, and recommendations."
}
},
"steps": ["evaluate_all", "collect_results", "summarize"]
}
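If you want the headline numbers to be exact rather than LLM-generated, you can also compute them deterministically from `extracted_data` before (or instead of) the summary step. A minimal sketch, assuming the result shape shown in the output above:

```python
# Sample rows in the shape of collect_results.outputs.extracted_data
extracted = [
    {"prompt_id": "1", "score": 4.0},
    {"prompt_id": "2", "score": 5.0},
    {"prompt_id": "3", "score": 3.0},
]

average = sum(r["score"] for r in extracted) / len(extracted)
best = max(extracted, key=lambda r: r["score"])   # highest-scoring prompt
worst = min(extracted, key=lambda r: r["score"])  # lowest-scoring prompt

print(average, best["prompt_id"], worst["prompt_id"])
# → 4.0 2 3
```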
## Scale to Thousands

### Control parallelism
{
"execution_config": {
"max_parallel": 10,
"timeout_seconds": 3600
}
}
| max_parallel | Use Case |
|---|---|
| 5 | Avoid API rate limits |
| 10-20 | Standard batch processing |
| 50+ | High throughput (check your provider's limits) |
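`timeout_seconds` caps how long execution may run. The engine's timeout handling is not shown here, but the effect is analogous to wrapping long-running work with a deadline, as in this Python sketch (the `slow_child` stub is invented for illustration):

```python
import asyncio

async def slow_child() -> str:
    await asyncio.sleep(10)  # pretend this is a long-running child workflow
    return "done"

async def run_with_timeout(timeout_seconds: float) -> str:
    try:
        return await asyncio.wait_for(slow_child(), timeout=timeout_seconds)
    except asyncio.TimeoutError:
        return "timed out"  # record the failure instead of hanging forever

print(asyncio.run(run_with_timeout(0.01)))
# → timed out
```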
### Handle failures gracefully
Failed child workflows are tracked. You can:
- Review failures in the trajectory
- Re-run only failed items
- Set retry policies in the child workflow
## Load Data from CSV

Instead of hardcoding prompts, load them from a file:
{
"init_params": {},
"step_configs": {
"load_data": {
"activity": "read_text_file",
"text_path": "init_params.file_paths[0]"
},
"parse_csv": {
"activity": "litellm_chat",
"model": "openai/gpt-4o-mini",
"user_prompt": "Parse this CSV and return as JSON array: {{ load_data.outputs.text }}"
},
"evaluate_all": {
"activity": "list_emit_await",
"items_path": "parse_csv.outputs.content",
...
}
},
"steps": ["load_data", "parse_csv", "evaluate_all", "collect_results"]
}
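Having an LLM parse CSV works, but it is nondeterministic and costs tokens. If you can preprocess the file outside the workflow, Python's standard `csv` module does the same conversion exactly. A minimal sketch with inline sample data:

```python
import csv
import io
import json

raw = "id,text\n1,A cat sitting on a couch\n2,A dog playing in a park\n"

# DictReader uses the header row as keys, giving one dict per data row
rows = list(csv.DictReader(io.StringIO(raw)))
print(json.dumps(rows))
# → [{"id": "1", "text": "A cat sitting on a couch"}, {"id": "2", "text": "A dog playing in a park"}]
```

The resulting JSON array is in the shape `list_emit_await` expects for `items_path`.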
## Batch LLM Evaluation

Evaluate text responses across a dataset:
{
"init_params": {
"test_cases": [
{"prompt": "Explain gravity", "expected": "gravitational force"},
{"prompt": "What is photosynthesis?", "expected": "plants convert sunlight"},
{"prompt": "Define democracy", "expected": "government by the people"}
],
"model": "openai/gpt-4o-mini"
},
"step_configs": {
"run_all": {
"activity": "list_emit_await",
"items_path": "init_params.test_cases",
"task_reference": {
"task_name": "single-qa-eval"
},
"data_mapping": {
"prompt": "{{ $item.prompt }}",
"expected": "{{ $item.expected }}",
"model": "{{ init_params.model }}"
}
},
"collect": {
"activity": "extract_from_trajectories",
"trajectory_list_path": "run_all.outputs.trajectory_references",
"extract_keys": {
"prompt": "init_params.prompt",
"response": "generate.outputs.content",
"score": "evaluate.outputs.average_score",
"passed": "evaluate.outputs.rating"
}
}
},
"steps": ["run_all", "collect"]
}
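The `evaluate` step in the child workflow scores answers with an LLM judge. For comparison, the simplest deterministic baseline is a keyword check against `expected`; this toy helper is not part of the workflow engine:

```python
def keyword_match(response: str, expected: str) -> bool:
    """Toy pass/fail check: did the response mention the expected phrase?
    Illustrative baseline only -- the workflow above uses an LLM judge."""
    return expected.lower() in response.lower()

print(keyword_match("Gravity is the gravitational force between masses.",
                    "gravitational force"))
# → True
```

Keyword checks are brittle (a correct paraphrase fails them), which is exactly why an LLM judge is often worth the extra cost.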
## Best Practices

- Start small: Test with 3-5 items before scaling to hundreds
- Set `max_parallel`: Respect API rate limits
- Use timeouts: Prevent runaway workflows
- Check child trajectories: Debug failures individually
- Extract specific fields: Only collect the data you need
## Next Steps
- Agent Benchmarking - Run agent evaluations at scale
- LLM Evaluation - Design better evaluation criteria
- Image Generation - Image-specific batch workflows