Simple Judge - LLM-as-a-Judge Evaluation

The Simple Judge step provides a lightweight LLM-as-a-Judge framework for evaluating content using language models. It supports both categorical judgments and numeric scale ratings, with automatic handling of text and image inputs.

Activity Name

simple_judge

Overview

Simple Judge uses LiteLLM under the hood to access 100+ model providers, making it easy to use any LLM as an evaluator. The step handles:

  • Text evaluation: Analyze text content against custom criteria
  • Image evaluation: Assess images using vision-capable models
  • Categorical judgments: Choose from predefined categories (e.g., "yes/no", "excellent/good/poor")
  • Scale ratings: Numeric scores within a defined range (e.g., 1-10)
  • Batch processing: Evaluate multiple items in a single step

Environment Variables

| Variable | Description |
| --- | --- |
| OPENAI_API_KEY | OpenAI API key (for GPT models) |
| ANTHROPIC_API_KEY | Anthropic API key (for Claude models) |
| GEMINI_API_KEY | Google API key (for Gemini models) |
| LITELLM_API_KEY | LiteLLM proxy API key |
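Only the key for the provider you actually call needs to be set. A typical setup for OpenAI models might look like this (the key value is a placeholder, not a real credential):

```shell
# Export the provider key before running the flow (placeholder value shown).
export OPENAI_API_KEY="sk-your-key-here"
```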

Configuration Parameters

Required Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| instruction | string | The evaluation criteria and instructions for the judge |

Optional Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| judge_type | string | "categorical" | Type of judgment: "categorical" or "scale" |
| model | string | "gpt-4o" | LLM model to use for evaluation |
| temperature | float | 0.3 | Model temperature (lower = more consistent) |
| max_tokens | int | null | Maximum response tokens |
| with_explanation | bool | true | Include reasoning in output |
| save_raw_response | bool | true | Save raw LLM responses as files |
| system_prompt | string | null | Custom system prompt for the model |

Categorical Judgment Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| categories | array | ["yes", "no"] | List of category options (minimum 2) |

Scale Judgment Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| scale_range | array | [1, 5] | Two-element array [min, max] for the numeric scale |

Input Parameters

Items to evaluate can be provided in multiple ways:

| Parameter | Type | Description |
| --- | --- | --- |
| items | array | Direct list of items to evaluate |
| items_path | string | Path expression to extract items from the trajectory |
| item | any | Single item to evaluate |
| item_path | string | Path expression to extract a single item |

Input Types

Simple Judge automatically handles different input types:

  • Plain text: Evaluated as-is
  • Storage paths: Files are read from storage and converted appropriately
  • Image files: Converted to data URLs for vision models (PNG, JPG, WebP, GIF)
  • Text files: Decoded as UTF-8 (TXT, MD, CSV, JSON, code files)
  • Data URLs: Passed directly to vision models
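The dispatch logic described above can be sketched roughly as follows. This is an illustration of the documented behavior, not the step's actual internals; the function name and extension lists are assumptions:

```python
import base64
import mimetypes
from pathlib import Path

# Illustrative extension sets; the step may recognize more formats.
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".webp", ".gif"}
TEXT_EXTS = {".txt", ".md", ".csv", ".json", ".py"}

def prepare_item(item: str) -> dict:
    """Classify an input item as the docs describe: data URLs and plain
    text pass through; existing files are read and converted."""
    if item.startswith("data:"):
        return {"type": "image_url", "value": item}  # already a data URL
    path = Path(item)
    suffix = path.suffix.lower()
    if suffix in IMAGE_EXTS and path.exists():
        # Convert image bytes to a data URL for vision models.
        mime = mimetypes.guess_type(item)[0] or "image/png"
        payload = base64.b64encode(path.read_bytes()).decode("ascii")
        return {"type": "image_url", "value": f"data:{mime};base64,{payload}"}
    if suffix in TEXT_EXTS and path.exists():
        return {"type": "text", "value": path.read_text(encoding="utf-8")}
    return {"type": "text", "value": item}  # plain text, evaluated as-is
```

Note that a string that looks like a path but does not resolve to a file falls back to being treated as plain text.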

Examples

Categorical Evaluation (Yes/No)

{
  "name": "check_compliance",
  "activity": "simple_judge",
  "config": {
    "instruction": "Does this text follow professional communication guidelines? Consider tone, grammar, and appropriateness.",
    "judge_type": "categorical",
    "categories": ["yes", "no"],
    "model": "gpt-4o",
    "item_path": "previous_step.outputs.text"
  }
}

Multi-Category Evaluation

{
  "name": "quality_assessment",
  "activity": "simple_judge",
  "config": {
    "instruction": "Evaluate the quality of this content for publication.",
    "judge_type": "categorical",
    "categories": ["excellent", "good", "fair", "poor", "reject"],
    "model": "claude-3-5-sonnet-20241022",
    "items_path": "content_generator.outputs.articles"
  }
}

Scale-Based Scoring

{
  "name": "score_responses",
  "activity": "simple_judge",
  "config": {
    "instruction": "Rate how helpful and accurate this customer support response is.",
    "judge_type": "scale",
    "scale_range": [1, 10],
    "model": "gpt-4o",
    "temperature": 0.2,
    "items_path": "support_responses.outputs.replies"
  }
}

Image Evaluation

{
  "name": "brand_check",
  "activity": "simple_judge",
  "config": {
    "instruction": "Does this image follow our brand guidelines? Check for: correct logo placement, approved color palette, and professional appearance.",
    "judge_type": "categorical",
    "categories": ["approved", "needs_revision", "rejected"],
    "model": "gpt-4o",
    "item_path": "image_generator.outputs.images[0].path"
  }
}

Batch Evaluation with Explanations

{
  "name": "evaluate_submissions",
  "activity": "simple_judge",
  "config": {
    "instruction": "Evaluate this code submission for correctness, efficiency, and code quality.",
    "judge_type": "scale",
    "scale_range": [0, 100],
    "model": "claude-3-5-sonnet-20241022",
    "with_explanation": true,
    "save_raw_response": true,
    "items": [
      "def add(a, b): return a + b",
      "def add(a, b): return a - b",
      "def add(a, b):\n    # Add two numbers\n    result = a + b\n    return result"
    ]
  }
}

Output Structure

Categorical Judgment Output

{
  "outputs": {
    "results": [
      {
        "item": "The evaluated content...",
        "judgment": "yes",
        "score": null,
        "explanation": "The content follows all guidelines because...",
        "success": true,
        "error": null,
        "raw_result": "{\"judgment\": \"yes\", \"explanation\": \"...\"}"
      }
    ],
    "successful_count": 1,
    "failed_count": 0,
    "total_count": 1,
    "success_rate": 1.0,
    "model_used": "gpt-4o",
    "judge_type": "categorical",
    "category_distribution": {
      "yes": 1
    },
    "raw_response_path": "collection/flow/0001/simple_judge_1.json"
  }
}

Scale Judgment Output

{
  "outputs": {
    "results": [
      {
        "item": "The evaluated content...",
        "judgment": null,
        "score": 8.5,
        "explanation": "High quality because...",
        "success": true,
        "error": null,
        "raw_result": "{\"score\": 8.5, \"explanation\": \"...\"}"
      }
    ],
    "successful_count": 1,
    "failed_count": 0,
    "total_count": 1,
    "success_rate": 1.0,
    "model_used": "gpt-4o",
    "judge_type": "scale",
    "average_score": 8.5,
    "min_score": 8.5,
    "max_score": 8.5,
    "raw_response_path": "collection/flow/0001/simple_judge_1.json"
  }
}
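The aggregate fields (success_rate, average_score, min_score, max_score, and for categorical runs category_distribution) can also be recomputed downstream from the per-item results. A sketch, assuming the result dictionaries shown above:

```python
from collections import Counter

def summarize(results: list[dict]) -> dict:
    """Recompute the aggregate output fields from per-item judge results."""
    ok = [r for r in results if r["success"]]
    summary = {
        "successful_count": len(ok),
        "failed_count": len(results) - len(ok),
        "total_count": len(results),
        "success_rate": len(ok) / len(results) if results else 0.0,
    }
    scores = [r["score"] for r in ok if r["score"] is not None]
    if scores:  # scale runs
        summary.update(average_score=sum(scores) / len(scores),
                       min_score=min(scores), max_score=max(scores))
    judgments = [r["judgment"] for r in ok if r["judgment"] is not None]
    if judgments:  # categorical runs
        summary["category_distribution"] = dict(Counter(judgments))
    return summary
```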

Advanced Usage

Chaining with Content Generation

{
  "steps": ["generate_content", "evaluate_content"],
  "step_configs": {
    "generate_content": {
      "activity": "litellm_chat",
      "model": "gpt-4o",
      "prompt": "Write a product description for wireless earbuds"
    },
    "evaluate_content": {
      "activity": "simple_judge",
      "instruction": "Rate this product description for marketing effectiveness, clarity, and persuasiveness.",
      "judge_type": "scale",
      "scale_range": [1, 10],
      "model": "claude-3-5-sonnet-20241022",
      "item_path": "generate_content.outputs.text"
    }
  }
}

Multi-Stage Evaluation Pipeline

{
  "steps": ["initial_screen", "detailed_review"],
  "step_configs": {
    "initial_screen": {
      "activity": "simple_judge",
      "instruction": "Does this content meet minimum quality standards?",
      "judge_type": "categorical",
      "categories": ["pass", "fail"],
      "model": "gpt-3.5-turbo",
      "items_path": "init_params.submissions"
    },
    "detailed_review": {
      "activity": "simple_judge",
      "instruction": "Provide a detailed quality score considering originality, accuracy, and presentation.",
      "judge_type": "scale",
      "scale_range": [0, 100],
      "model": "gpt-4o",
      "items_path": "initial_screen.outputs.results[?success==true].item"
    }
  }
}

Best Practices

Writing Effective Instructions

  1. Be specific: Clearly define what criteria the judge should consider
  2. Provide examples: When possible, describe what "good" and "bad" look like
  3. Set context: Explain the purpose and use case for the evaluation
  4. Define edge cases: Specify how ambiguous situations should be handled
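Putting those four points together, a config fragment with a concrete instruction might look like this (the wording is illustrative, not a recommended template):

```json
{
  "instruction": "You are reviewing customer support replies before they are published to our help center. Rate how helpful and accurate each reply is. A top score means the reply fully resolves the question with correct steps; the lowest score means it is off-topic or factually wrong. If a reply is partially correct but incomplete, score it in the middle of the range.",
  "judge_type": "scale",
  "scale_range": [1, 10]
}
```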

Choosing Judge Type

  • Categorical: Best for yes/no decisions, classification, or discrete quality levels
  • Scale: Best for nuanced scoring, comparison rankings, or continuous metrics

Model Selection

  • GPT-4o: Best for complex reasoning and image evaluation
  • Claude 3.5 Sonnet: Excellent for nuanced text analysis
  • GPT-3.5 Turbo: Fast and cost-effective for simple evaluations

Temperature Settings

  • Use low temperature (0.1-0.3) for consistent, reproducible judgments
  • Use higher temperature (0.5-0.7) when you want more varied perspectives

Error Handling

The step handles errors gracefully:

  • Failed evaluations are marked with success: false and include error details
  • Partial results are returned even if some items fail
  • Raw responses are saved for debugging when save_raw_response: true
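Because partial results are returned, downstream steps should filter on the success flag rather than assume every item was judged. A minimal sketch, assuming the result dictionaries shown earlier:

```python
def split_results(results: list[dict]) -> tuple[list[dict], list[dict]]:
    """Separate judged items from failures so a pipeline can retry or report."""
    passed = [r for r in results if r["success"]]
    failed = [r for r in results if not r["success"]]
    for r in failed:
        # Each failed result carries its error detail for debugging.
        print(f"judge failed on item {r['item']!r}: {r['error']}")
    return passed, failed
```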

Common Issues

| Issue | Cause | Solution |
| --- | --- | --- |
| JSON parse error | Model didn't return valid JSON | Lower the temperature or use a more capable model |
| Image not found | Invalid storage path | Check the path expression syntax |
| Rate limiting | Too many requests | Use batch mode or add delays |
| Token limit | Response too long | Increase max_tokens or simplify the instruction |
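For the rate-limiting row, a simple exponential backoff around the judged call is usually enough. The helper below is a hypothetical wrapper you would add around your own invocation, not part of the step itself; substitute your provider's rate-limit exception for RuntimeError:

```python
import time

def with_backoff(fn, retries: int = 3, base_delay: float = 1.0):
    """Call fn, retrying with exponential backoff; re-raise on the last attempt."""
    for attempt in range(retries):
        try:
            return fn()
        except RuntimeError:  # stand-in for the provider's rate-limit error
            if attempt == retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
```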