Simple Judge - LLM-as-a-Judge Evaluation
The Simple Judge step provides a lightweight LLM-as-a-Judge framework for evaluating content using language models. It supports both categorical judgments and numeric scale ratings, with automatic handling of text and image inputs.
Activity Name
simple_judge
Overview
Simple Judge uses LiteLLM under the hood to access 100+ model providers, making it easy to use any LLM as an evaluator. The step handles:
- Text evaluation: Analyze text content against custom criteria
- Image evaluation: Assess images using vision-capable models
- Categorical judgments: Choose from predefined categories (e.g., "yes/no", "excellent/good/poor")
- Scale ratings: Numeric scores within a defined range (e.g., 1-10)
- Batch processing: Evaluate multiple items in a single step
Environment Variables
| Variable | Description |
|---|---|
| OPENAI_API_KEY | OpenAI API key (for GPT models) |
| ANTHROPIC_API_KEY | Anthropic API key (for Claude models) |
| GEMINI_API_KEY | Google API key (for Gemini models) |
| LITELLM_API_KEY | LiteLLM proxy API key |
Configuration Parameters
Required Parameters
| Parameter | Type | Description |
|---|---|---|
| instruction | string | The evaluation criteria and instructions for the judge |
Optional Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| judge_type | string | "categorical" | Type of judgment: "categorical" or "scale" |
| model | string | "gpt-4o" | LLM model to use for evaluation |
| temperature | float | 0.3 | Model temperature (lower = more consistent) |
| max_tokens | int | null | Maximum response tokens |
| with_explanation | bool | true | Include reasoning in output |
| save_raw_response | bool | true | Save raw LLM responses as files |
| system_prompt | string | null | Custom system prompt for the model |
Categorical Judgment Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| categories | array | ["yes", "no"] | List of category options (minimum 2) |
Scale Judgment Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| scale_range | array | [1, 5] | Two-element array [min, max] for the numeric scale |
Input Parameters
Items to evaluate can be provided in multiple ways:
| Parameter | Type | Description |
|---|---|---|
| items | array | Direct list of items to evaluate |
| items_path | string | Path expression to extract items from trajectory |
| item | any | Single item to evaluate |
| item_path | string | Path expression to extract a single item |
Input Types
Simple Judge automatically handles different input types:
- Plain text: Evaluated as-is
- Storage paths: Files are read from storage and converted appropriately
- Image files: Converted to data URLs for vision models (PNG, JPG, WebP, GIF)
- Text files: Decoded as UTF-8 (TXT, MD, CSV, JSON, code files)
- Data URLs: Passed directly to vision models
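The conversion logic above can be sketched roughly as follows. This is an illustrative assumption, not the step's actual implementation: the function name, the extension sets, and the OpenAI-style message-part shape are all placeholders.

```python
import base64
import mimetypes
from pathlib import Path

# Illustrative extension set; the real step may recognize more formats.
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".webp", ".gif"}

def to_model_input(item: str) -> dict:
    """Convert one item into a chat-message content part (sketch only)."""
    if item.startswith("data:"):
        # Already a data URL: pass through to the vision model.
        return {"type": "image_url", "image_url": {"url": item}}
    path = Path(item)
    if path.is_file():
        if path.suffix.lower() in IMAGE_EXTS:
            # Image files become base64 data URLs for vision models.
            mime = mimetypes.guess_type(path.name)[0] or "image/png"
            b64 = base64.b64encode(path.read_bytes()).decode("ascii")
            return {"type": "image_url",
                    "image_url": {"url": f"data:{mime};base64,{b64}"}}
        # Text-like files are decoded as UTF-8.
        return {"type": "text", "text": path.read_text(encoding="utf-8")}
    # Anything else is treated as plain text and evaluated as-is.
    return {"type": "text", "text": item}
```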
Examples
Categorical Evaluation (Yes/No)
```json
{
  "name": "check_compliance",
  "activity": "simple_judge",
  "config": {
    "instruction": "Does this text follow professional communication guidelines? Consider tone, grammar, and appropriateness.",
    "judge_type": "categorical",
    "categories": ["yes", "no"],
    "model": "gpt-4o",
    "item_path": "previous_step.outputs.text"
  }
}
```
Multi-Category Evaluation
```json
{
  "name": "quality_assessment",
  "activity": "simple_judge",
  "config": {
    "instruction": "Evaluate the quality of this content for publication.",
    "judge_type": "categorical",
    "categories": ["excellent", "good", "fair", "poor", "reject"],
    "model": "claude-3-5-sonnet-20241022",
    "items_path": "content_generator.outputs.articles"
  }
}
```
Scale-Based Scoring
```json
{
  "name": "score_responses",
  "activity": "simple_judge",
  "config": {
    "instruction": "Rate how helpful and accurate this customer support response is.",
    "judge_type": "scale",
    "scale_range": [1, 10],
    "model": "gpt-4o",
    "temperature": 0.2,
    "items_path": "support_responses.outputs.replies"
  }
}
```
Image Evaluation
```json
{
  "name": "brand_check",
  "activity": "simple_judge",
  "config": {
    "instruction": "Does this image follow our brand guidelines? Check for: correct logo placement, approved color palette, and professional appearance.",
    "judge_type": "categorical",
    "categories": ["approved", "needs_revision", "rejected"],
    "model": "gpt-4o",
    "item_path": "image_generator.outputs.images[0].path"
  }
}
```
Batch Evaluation with Explanations
```json
{
  "name": "evaluate_submissions",
  "activity": "simple_judge",
  "config": {
    "instruction": "Evaluate this code submission for correctness, efficiency, and code quality.",
    "judge_type": "scale",
    "scale_range": [0, 100],
    "model": "claude-3-5-sonnet-20241022",
    "with_explanation": true,
    "save_raw_response": true,
    "items": [
      "def add(a, b): return a + b",
      "def add(a, b): return a - b",
      "def add(a, b):\n    # Add two numbers\n    result = a + b\n    return result"
    ]
  }
}
```
Output Structure
Categorical Judgment Output
```json
{
  "outputs": {
    "results": [
      {
        "item": "The evaluated content...",
        "judgment": "yes",
        "score": null,
        "explanation": "The content follows all guidelines because...",
        "success": true,
        "error": null,
        "raw_result": "{\"judgment\": \"yes\", \"explanation\": \"...\"}"
      }
    ],
    "successful_count": 1,
    "failed_count": 0,
    "total_count": 1,
    "success_rate": 1.0,
    "model_used": "gpt-4o",
    "judge_type": "categorical",
    "category_distribution": {
      "yes": 1
    },
    "raw_response_path": "collection/flow/0001/simple_judge_1.json"
  }
}
```
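The category_distribution summary is simply a tally of judgments over successful results, which downstream steps can also recompute themselves. A minimal sketch, assuming the results list has the shape shown above:

```python
from collections import Counter

def category_distribution(results: list[dict]) -> dict[str, int]:
    """Count judgments across successful results only."""
    return dict(Counter(r["judgment"] for r in results if r["success"]))

# Hypothetical results list for illustration.
results = [
    {"judgment": "yes", "success": True},
    {"judgment": "no", "success": True},
    {"judgment": "yes", "success": True},
    {"judgment": None, "success": False},
]
```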
Scale Judgment Output
```json
{
  "outputs": {
    "results": [
      {
        "item": "The evaluated content...",
        "judgment": null,
        "score": 8.5,
        "explanation": "High quality because...",
        "success": true,
        "error": null,
        "raw_result": "{\"score\": 8.5, \"explanation\": \"...\"}"
      }
    ],
    "successful_count": 1,
    "failed_count": 0,
    "total_count": 1,
    "success_rate": 1.0,
    "model_used": "gpt-4o",
    "judge_type": "scale",
    "average_score": 8.5,
    "min_score": 8.5,
    "max_score": 8.5,
    "raw_response_path": "collection/flow/0001/simple_judge_1.json"
  }
}
```
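Likewise, the scale-mode aggregates (average_score, min_score, max_score) are derived from the successful scores. A sketch of that aggregation, assuming the result shape shown above:

```python
def scale_summary(results: list[dict]) -> dict:
    """Aggregate scores over successful results, mirroring the summary fields."""
    scores = [r["score"] for r in results
              if r["success"] and r["score"] is not None]
    if not scores:
        # No successful scores: aggregates are undefined.
        return {"average_score": None, "min_score": None, "max_score": None}
    return {
        "average_score": sum(scores) / len(scores),
        "min_score": min(scores),
        "max_score": max(scores),
    }
```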
Advanced Usage
Chaining with Content Generation
```json
{
  "steps": ["generate_content", "evaluate_content"],
  "step_configs": {
    "generate_content": {
      "activity": "litellm_chat",
      "model": "gpt-4o",
      "prompt": "Write a product description for wireless earbuds"
    },
    "evaluate_content": {
      "activity": "simple_judge",
      "instruction": "Rate this product description for marketing effectiveness, clarity, and persuasiveness.",
      "judge_type": "scale",
      "scale_range": [1, 10],
      "model": "claude-3-5-sonnet-20241022",
      "item_path": "generate_content.outputs.text"
    }
  }
}
```
Multi-Stage Evaluation Pipeline
```json
{
  "steps": ["initial_screen", "detailed_review"],
  "step_configs": {
    "initial_screen": {
      "activity": "simple_judge",
      "instruction": "Does this content meet minimum quality standards?",
      "judge_type": "categorical",
      "categories": ["pass", "fail"],
      "model": "gpt-3.5-turbo",
      "items_path": "init_params.submissions"
    },
    "detailed_review": {
      "activity": "simple_judge",
      "instruction": "Provide a detailed quality score considering originality, accuracy, and presentation.",
      "judge_type": "scale",
      "scale_range": [0, 100],
      "model": "gpt-4o",
      "items_path": "initial_screen.outputs.results[?success==true].item"
    }
  }
}
```
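The filter expression in the second stage selects only items that passed the first stage. If the path-expression engine does not support filters of this form (the exact syntax depends on the engine in use), the same selection can be done in plain Python between stages; this sketch assumes the output shape documented above:

```python
def passing_items(screen_outputs: dict) -> list:
    """Keep items whose initial screen succeeded and was judged 'pass'."""
    return [
        r["item"]
        for r in screen_outputs["results"]
        if r["success"] and r["judgment"] == "pass"
    ]
```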
Best Practices
Writing Effective Instructions
- Be specific: Clearly define what criteria the judge should consider
- Provide examples: When possible, describe what "good" and "bad" look like
- Set context: Explain the purpose and use case for the evaluation
- Define edge cases: Specify how ambiguous situations should be handled
Choosing Judge Type
- Categorical: Best for yes/no decisions, classification, or discrete quality levels
- Scale: Best for nuanced scoring, comparison rankings, or continuous metrics
Model Selection
- GPT-4o: Best for complex reasoning and image evaluation
- Claude 3.5 Sonnet: Excellent for nuanced text analysis
- GPT-3.5 Turbo: Fast and cost-effective for simple evaluations
Temperature Settings
- Use low temperature (0.1-0.3) for consistent, reproducible judgments
- Use higher temperature (0.5-0.7) when you want more varied perspectives
Error Handling
The step handles errors gracefully:
- Failed evaluations are marked with `success: false` and include error details
- Partial results are returned even if some items fail
- Raw responses are saved for debugging when `save_raw_response: true`
Common Issues
| Issue | Cause | Solution |
|---|---|---|
| JSON parse error | Model didn't return valid JSON | Lower temperature, use more capable model |
| Image not found | Invalid storage path | Check path expression syntax |
| Rate limiting | Too many requests | Use batch mode, add delays |
| Token limit | Response too long | Increase max_tokens or simplify instruction |
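For the JSON parse error case, a tolerant parser helps before reaching for a more capable model: models often wrap their answer in a markdown fence or add surrounding prose. This is a generic sketch, not the step's actual parsing code:

```python
import json
import re

def parse_judge_json(raw: str) -> dict:
    """Tolerant parse: strip a markdown fence, then fall back to the
    outermost {...} span before giving up."""
    text = raw.strip()
    fence = re.match(r"```(?:json)?\s*(.*?)\s*```", text, re.DOTALL)
    if fence:
        text = fence.group(1)
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Fall back: try the widest brace-delimited span.
        start, end = text.find("{"), text.rfind("}")
        if start != -1 and end > start:
            return json.loads(text[start:end + 1])
        raise
```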
Related Steps
- LiteLLM Chat - Generate content before evaluation
- Select Trajectories - Filter trajectories for batch evaluation
- Extract From Trajectories - Extract data from child workflows for evaluation