# Evaluate LLM Outputs in 5 Minutes
Use LLM-as-Judge to automatically score and compare AI responses.
## What You'll Build
Prompt → Generate Response → Judge Quality → Score + Explanation
## The Workflow

```json
{
  "init_params": {
    "prompt": "Write a haiku about artificial intelligence",
    "model": "openai/gpt-4o-mini"
  },
  "step_configs": {
    "generate": {
      "activity": "litellm_chat",
      "model_path": "init_params.model",
      "user_prompt_path": "init_params.prompt",
      "temperature": 0.8
    },
    "evaluate": {
      "activity": "simple_judge",
      "model": "gpt-4o",
      "model_provider": "openai",
      "judge_type": "scale",
      "scale_range": [1, 5],
      "instruction": "Rate this haiku on creativity, adherence to 5-7-5 syllable structure, and thematic relevance to AI.",
      "item_path": "generate.outputs.content"
    }
  },
  "steps": ["generate", "evaluate"]
}
```
## Try It
- Copy the workflow above
- Run it in Jetty
- Change the `prompt` to generate different content
- Modify the `instruction` to evaluate different criteria
## What You'll Learn

### 1. `simple_judge` - The evaluation engine

```json
{
  "activity": "simple_judge",
  "model": "gpt-4o",
  "model_provider": "openai",
  "judge_type": "scale",
  "scale_range": [1, 5],
  "instruction": "Your evaluation criteria here",
  "item_path": "generate.outputs.content"
}
```
### 2. Judge types

| Type | Use Case | Output |
|---|---|---|
| `scale` | Numeric scoring | `rating: "4"`, `average_score: 4.0` |
| `binary` | Yes/no decisions | `rating: "yes"` or `rating: "no"` |
| `categorical` | Multiple choice | `rating: "category_name"` |
### 3. Chaining with path expressions

`generate.outputs.content` passes the LLM response to the judge:
generate step → outputs.content → evaluate step
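Under the hood, a path expression is just a dotted lookup into the dict of step results accumulated so far. A minimal sketch of that resolution (the helper name, results layout, and haiku text are illustrative, not Jetty's actual internals):

```python
# Resolve a dotted path expression like "generate.outputs.content"
# against a dict of accumulated step results.
def resolve_path(results: dict, path: str):
    value = results
    for key in path.split("."):
        value = value[key]  # descend one level per dotted segment
    return value

# Illustrative results dict mirroring the workflow above.
results = {
    "init_params": {"prompt": "Write a haiku about artificial intelligence"},
    "generate": {"outputs": {"content": "Silicon dreams hum..."}},
}

print(resolve_path(results, "generate.outputs.content"))
```

The same mechanism is what lets `model_path` and `user_prompt_path` in the generate step pull values out of `init_params`.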
## The Output

```json
{
  "evaluate": {
    "outputs": {
      "rating": "4",
      "explanation": "Good creativity with the 'silicon dreams' metaphor. Follows 5-7-5 structure correctly. Clear AI theme throughout.",
      "average_score": 4.0,
      "model": "gpt-4o"
    }
  }
}
```
## Multi-Criteria Evaluation
Evaluate against multiple criteria in parallel:
```json
{
  "init_params": {
    "prompt": "Explain machine learning to a 10-year-old"
  },
  "step_configs": {
    "generate": {
      "activity": "litellm_chat",
      "model": "openai/gpt-4o",
      "user_prompt_path": "init_params.prompt"
    },
    "accuracy": {
      "activity": "simple_judge",
      "model": "gpt-4o",
      "model_provider": "openai",
      "judge_type": "scale",
      "scale_range": [1, 5],
      "instruction": "Rate the technical accuracy of this explanation.",
      "item_path": "generate.outputs.content"
    },
    "clarity": {
      "activity": "simple_judge",
      "model": "gpt-4o",
      "model_provider": "openai",
      "judge_type": "scale",
      "scale_range": [1, 5],
      "instruction": "Rate how understandable this is for a 10-year-old.",
      "item_path": "generate.outputs.content"
    },
    "engagement": {
      "activity": "simple_judge",
      "model": "gpt-4o",
      "model_provider": "openai",
      "judge_type": "scale",
      "scale_range": [1, 5],
      "instruction": "Rate how engaging and fun this explanation is.",
      "item_path": "generate.outputs.content"
    }
  },
  "steps": ["generate", "accuracy", "clarity", "engagement"]
}
```
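Since each judge step returns its own `average_score`, you can combine the per-criterion scores in your own post-processing. A minimal sketch (the results dict mirrors the judge output shape shown earlier; the scores and the aggregation itself are illustrative, not a Jetty feature):

```python
# Average the per-criterion scores from the three judge steps.
results = {
    "accuracy":   {"outputs": {"average_score": 4.0}},
    "clarity":    {"outputs": {"average_score": 5.0}},
    "engagement": {"outputs": {"average_score": 3.0}},
}

criteria = ["accuracy", "clarity", "engagement"]
scores = [results[c]["outputs"]["average_score"] for c in criteria]
overall = sum(scores) / len(scores)
print(f"overall: {overall:.2f}")  # overall: 4.00
```

A simple mean works when all criteria matter equally; weight the terms if, say, accuracy should dominate.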
## Binary Evaluation (Pass/Fail)
Check if content meets specific criteria:
```json
{
  "safety_check": {
    "activity": "simple_judge",
    "model": "gpt-4o",
    "model_provider": "openai",
    "judge_type": "binary",
    "instruction": "Does this content contain any harmful, offensive, or inappropriate material?",
    "item_path": "generate.outputs.content"
  }
}
```
Output:
```json
{
  "rating": "no",
  "explanation": "The content is educational and appropriate for all audiences."
}
```
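A binary judge maps naturally onto a pass/fail gate in downstream code. Note the polarity: because the instruction asks whether the content is harmful, a `"no"` rating means the check passes. A sketch of that gate (output dict as shown above; the gating logic is your own code, not part of `simple_judge`):

```python
# Gate content on the binary safety judge's verdict.
safety_output = {
    "rating": "no",
    "explanation": "The content is educational and appropriate for all audiences.",
}

# "no" answers the harmful-content question, so "no" == safe.
is_safe = safety_output["rating"].strip().lower() == "no"
if is_safe:
    print("PASS: content cleared")
else:
    print("FAIL: flagged by safety check")
```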
## Compare Multiple Models
Generate from multiple models, then judge which is best:
```json
{
  "init_params": {
    "question": "What is the meaning of life?"
  },
  "step_configs": {
    "gpt4": {
      "activity": "litellm_chat",
      "model": "openai/gpt-4o",
      "user_prompt_path": "init_params.question"
    },
    "claude": {
      "activity": "litellm_chat",
      "model": "anthropic/claude-sonnet-4-20250514",
      "user_prompt_path": "init_params.question"
    },
    "judge_gpt4": {
      "activity": "simple_judge",
      "model": "gpt-4o",
      "model_provider": "openai",
      "judge_type": "scale",
      "scale_range": [1, 10],
      "instruction": "Rate this response for depth, thoughtfulness, and helpfulness.",
      "item_path": "gpt4.outputs.content"
    },
    "judge_claude": {
      "activity": "simple_judge",
      "model": "gpt-4o",
      "model_provider": "openai",
      "judge_type": "scale",
      "scale_range": [1, 10],
      "instruction": "Rate this response for depth, thoughtfulness, and helpfulness.",
      "item_path": "claude.outputs.content"
    }
  },
  "steps": ["gpt4", "claude", "judge_gpt4", "judge_claude"]
}
```
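With both judge scores in hand, picking the winner is a one-liner over the results. A sketch (the dict shape mirrors the judge outputs shown earlier; the scores here are illustrative):

```python
# Compare the two judged scores and report the higher-rated step.
results = {
    "judge_gpt4":   {"outputs": {"average_score": 7.0}},
    "judge_claude": {"outputs": {"average_score": 8.5}},
}

winner = max(results, key=lambda step: results[step]["outputs"]["average_score"])
print(winner)  # judge_claude
```

Because both judges use the same model, instruction, and scale, the two scores are directly comparable.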
## Custom Rubrics
Create detailed evaluation rubrics:
```json
{
  "evaluate": {
    "activity": "simple_judge",
    "model": "gpt-4o",
    "model_provider": "openai",
    "judge_type": "scale",
    "scale_range": [0, 100],
    "instruction": "Evaluate this code review using the rubric:\n\n**Correctness (0-40):**\n- Identifies actual bugs\n- Doesn't flag false positives\n\n**Helpfulness (0-30):**\n- Provides actionable suggestions\n- Explains the 'why'\n\n**Tone (0-30):**\n- Professional and constructive\n- Not condescending\n\nProvide a total score.",
    "item_path": "generate.outputs.content"
  }
}
```
## Next Steps
- Image Generation - Evaluate generated images
- Batch Processing - Evaluate across datasets
- Model Comparison - Compare multiple models