Simple Judge - LLM-as-a-Judge Evaluation

The Simple Judge step provides a lightweight LLM-as-a-Judge framework for evaluating content using language models. It supports both categorical judgments and numeric scale ratings, with automatic handling of text and image inputs.

Activity Name

simple_judge

Overview

Simple Judge uses LiteLLM under the hood to access 100+ model providers, making it easy to use any LLM as an evaluator. The step handles:

  • Text evaluation: Analyze text content against custom criteria
  • Image evaluation: Assess images using vision-capable models
  • Categorical judgments: Choose from predefined categories (e.g., "yes/no", "excellent/good/poor")
  • Scale ratings: Numeric scores within a defined range (e.g., 1-10)
  • Batch processing: Evaluate multiple items in a single step

Environment Variables

| Variable | Description |
| --- | --- |
| OPENAI_API_KEY | OpenAI API key (for GPT models) |
| ANTHROPIC_API_KEY | Anthropic API key (for Claude models) |
| GEMINI_API_KEY | Google API key (for Gemini models) |
| LITELLM_API_KEY | LiteLLM proxy API key |
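Only the key for the provider you actually call needs to be set. A typical setup for OpenAI models might look like this (the key value is a placeholder, not a real credential):

```shell
# Export the provider key before running the flow (placeholder value shown).
export OPENAI_API_KEY="sk-your-key-here"
```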

Configuration Parameters

Required Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| instruction | string | The evaluation criteria and instructions for the judge |

Optional Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| judge_type | string | "categorical" | Type of judgment: "categorical" or "scale" |
| model | string | "gpt-4o" | LLM model to use for evaluation |
| temperature | float | 0.3 | Model temperature (lower = more consistent) |
| max_tokens | int | null | Maximum response tokens |
| with_explanation | bool | true | Include reasoning in output |
| save_raw_response | bool | true | Save raw LLM responses as files |
| system_prompt | string | null | Custom system prompt for the model |

Categorical Judgment Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| categories | array | ["yes", "no"] | List of category options (minimum 2) |

Scale Judgment Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| scale_range | array | [1, 5] | Two-element array [min, max] for the numeric scale |

Input Parameters

Items to evaluate can be provided in multiple ways:

| Parameter | Type | Description |
| --- | --- | --- |
| items | array | Direct list of items to evaluate |
| items_path | string | Path expression to extract items from the trajectory |
| item | any | Single item to evaluate |
| item_path | string | Path expression to extract a single item |

Input Types

Simple Judge automatically handles different input types:

  • Plain text: Evaluated as-is
  • Storage paths: Files are read from storage and converted appropriately
  • Image files: Converted to data URLs for vision models (PNG, JPG, WebP, GIF)
  • Text files: Decoded as UTF-8 (TXT, MD, CSV, JSON, code files)
  • Data URLs: Passed directly to vision models
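The dispatch logic described above can be sketched roughly as follows. This is an illustration of the documented behavior, not the step's actual internals; the function name and extension lists are assumptions:

```python
import base64
import mimetypes
from pathlib import Path

# Illustrative extension sets; the step may recognize more formats.
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".webp", ".gif"}
TEXT_EXTS = {".txt", ".md", ".csv", ".json", ".py"}

def prepare_item(item: str) -> dict:
    """Classify an input item as the docs describe: data URLs and plain
    text pass through; existing files are read and converted."""
    if item.startswith("data:"):
        return {"type": "image_url", "value": item}  # already a data URL
    path = Path(item)
    suffix = path.suffix.lower()
    if suffix in IMAGE_EXTS and path.exists():
        # Convert image bytes to a data URL for vision models.
        mime = mimetypes.guess_type(item)[0] or "image/png"
        payload = base64.b64encode(path.read_bytes()).decode("ascii")
        return {"type": "image_url", "value": f"data:{mime};base64,{payload}"}
    if suffix in TEXT_EXTS and path.exists():
        return {"type": "text", "value": path.read_text(encoding="utf-8")}
    return {"type": "text", "value": item}  # plain text, evaluated as-is
```

Note that a string that looks like a path but does not resolve to a file falls back to being treated as plain text.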

Examples

Categorical Evaluation (Yes/No)

{
  "name": "check_compliance",
  "activity": "simple_judge",
  "config": {
    "instruction": "Does this text follow professional communication guidelines? Consider tone, grammar, and appropriateness.",
    "judge_type": "categorical",
    "categories": ["yes", "no"],
    "model": "gpt-4o",
    "item_path": "previous_step.outputs.text"
  }
}

Multi-Category Evaluation

{
  "name": "quality_assessment",
  "activity": "simple_judge",
  "config": {
    "instruction": "Evaluate the quality of this content for publication.",
    "judge_type": "categorical",
    "categories": ["excellent", "good", "fair", "poor", "reject"],
    "model": "claude-3-5-sonnet-20241022",
    "items_path": "content_generator.outputs.articles"
  }
}

Scale-Based Scoring

{
  "name": "score_responses",
  "activity": "simple_judge",
  "config": {
    "instruction": "Rate how helpful and accurate this customer support response is.",
    "judge_type": "scale",
    "scale_range": [1, 10],
    "model": "gpt-4o",
    "temperature": 0.2,
    "items_path": "support_responses.outputs.replies"
  }
}

Image Evaluation

{
  "name": "brand_check",
  "activity": "simple_judge",
  "config": {
    "instruction": "Does this image follow our brand guidelines? Check for: correct logo placement, approved color palette, and professional appearance.",
    "judge_type": "categorical",
    "categories": ["approved", "needs_revision", "rejected"],
    "model": "gpt-4o",
    "item_path": "image_generator.outputs.images[0].path"
  }
}

Batch Evaluation with Explanations

{
  "name": "evaluate_submissions",
  "activity": "simple_judge",
  "config": {
    "instruction": "Evaluate this code submission for correctness, efficiency, and code quality.",
    "judge_type": "scale",
    "scale_range": [0, 100],
    "model": "claude-3-5-sonnet-20241022",
    "with_explanation": true,
    "save_raw_response": true,
    "items": [
      "def add(a, b): return a + b",
      "def add(a, b): return a - b",
      "def add(a, b):\n    # Add two numbers\n    result = a + b\n    return result"
    ]
  }
}

Output Structure

Categorical Judgment Output

{
  "outputs": {
    "results": [
      {
        "item": "The evaluated content...",
        "judgment": "yes",
        "score": null,
        "explanation": "The content follows all guidelines because...",
        "success": true,
        "error": null,
        "raw_result": "{\"judgment\": \"yes\", \"explanation\": \"...\"}"
      }
    ],
    "successful_count": 1,
    "failed_count": 0,
    "total_count": 1,
    "success_rate": 1.0,
    "model_used": "gpt-4o",
    "judge_type": "categorical",
    "category_distribution": {
      "yes": 1
    },
    "raw_response_path": "collection/flow/0001/simple_judge_1.json"
  }
}

Scale Judgment Output

{
  "outputs": {
    "results": [
      {
        "item": "The evaluated content...",
        "judgment": null,
        "score": 8.5,
        "explanation": "High quality because...",
        "success": true,
        "error": null,
        "raw_result": "{\"score\": 8.5, \"explanation\": \"...\"}"
      }
    ],
    "successful_count": 1,
    "failed_count": 0,
    "total_count": 1,
    "success_rate": 1.0,
    "model_used": "gpt-4o",
    "judge_type": "scale",
    "average_score": 8.5,
    "min_score": 8.5,
    "max_score": 8.5,
    "raw_response_path": "collection/flow/0001/simple_judge_1.json"
  }
}
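The aggregate fields (success_rate, average_score, min_score, max_score, and for categorical runs category_distribution) can also be recomputed downstream from the per-item results. A sketch, assuming the result dictionaries shown above:

```python
from collections import Counter

def summarize(results: list[dict]) -> dict:
    """Recompute the aggregate output fields from per-item judge results."""
    ok = [r for r in results if r["success"]]
    summary = {
        "successful_count": len(ok),
        "failed_count": len(results) - len(ok),
        "total_count": len(results),
        "success_rate": len(ok) / len(results) if results else 0.0,
    }
    scores = [r["score"] for r in ok if r["score"] is not None]
    if scores:  # scale runs
        summary.update(average_score=sum(scores) / len(scores),
                       min_score=min(scores), max_score=max(scores))
    judgments = [r["judgment"] for r in ok if r["judgment"] is not None]
    if judgments:  # categorical runs
        summary["category_distribution"] = dict(Counter(judgments))
    return summary
```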

Advanced Usage

Chaining with Content Generation

{
  "steps": ["generate_content", "evaluate_content"],
  "step_configs": {
    "generate_content": {
      "activity": "litellm_chat",
      "model": "gpt-4o",
      "prompt": "Write a product description for wireless earbuds"
    },
    "evaluate_content": {
      "activity": "simple_judge",
      "instruction": "Rate this product description for marketing effectiveness, clarity, and persuasiveness.",
      "judge_type": "scale",
      "scale_range": [1, 10],
      "model": "claude-3-5-sonnet-20241022",
      "item_path": "generate_content.outputs.text"
    }
  }
}

Multi-Stage Evaluation Pipeline

{
  "steps": ["initial_screen", "detailed_review"],
  "step_configs": {
    "initial_screen": {
      "activity": "simple_judge",
      "instruction": "Does this content meet minimum quality standards?",
      "judge_type": "categorical",
      "categories": ["pass", "fail"],
      "model": "gpt-3.5-turbo",
      "items_path": "init_params.submissions"
    },
    "detailed_review": {
      "activity": "simple_judge",
      "instruction": "Provide a detailed quality score considering originality, accuracy, and presentation.",
      "judge_type": "scale",
      "scale_range": [0, 100],
      "model": "gpt-4o",
      "items_path": "initial_screen.outputs.results[?success==true].item"
    }
  }
}

Best Practices

Writing Effective Instructions

  1. Be specific: Clearly define what criteria the judge should consider
  2. Provide examples: When possible, describe what "good" and "bad" look like
  3. Set context: Explain the purpose and use case for the evaluation
  4. Define edge cases: Specify how ambiguous situations should be handled
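Putting those four points together, a config fragment with a concrete instruction might look like this (the wording is illustrative, not a recommended template):

```json
{
  "instruction": "You are reviewing customer support replies before they are published to our help center. Rate how helpful and accurate each reply is. A top score means the reply fully resolves the question with correct steps; the lowest score means it is off-topic or factually wrong. If a reply is partially correct but incomplete, score it in the middle of the range.",
  "judge_type": "scale",
  "scale_range": [1, 10]
}
```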

Choosing Judge Type

  • Categorical: Best for yes/no decisions, classification, or discrete quality levels
  • Scale: Best for nuanced scoring, comparison rankings, or continuous metrics

Model Selection

  • GPT-4o: Best for complex reasoning and image evaluation
  • Claude 3.5 Sonnet: Excellent for nuanced text analysis
  • GPT-3.5 Turbo: Fast and cost-effective for simple evaluations

Temperature Settings

  • Use low temperature (0.1-0.3) for consistent, reproducible judgments
  • Use higher temperature (0.5-0.7) when you want more varied perspectives

Error Handling

The step handles errors gracefully:

  • Failed evaluations are marked with success: false and include error details
  • Partial results are returned even if some items fail
  • Raw responses are saved for debugging when save_raw_response: true
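Because partial results are returned, downstream steps should filter on the success flag rather than assume every item was judged. A minimal sketch, assuming the result dictionaries shown earlier:

```python
def split_results(results: list[dict]) -> tuple[list[dict], list[dict]]:
    """Separate judged items from failures so a pipeline can retry or report."""
    passed = [r for r in results if r["success"]]
    failed = [r for r in results if not r["success"]]
    for r in failed:
        # Each failed result carries its error detail for debugging.
        print(f"judge failed on item {r['item']!r}: {r['error']}")
    return passed, failed
```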

Common Issues

| Issue | Cause | Solution |
| --- | --- | --- |
| JSON parse error | Model didn't return valid JSON | Lower the temperature or use a more capable model |
| Image not found | Invalid storage path | Check the path expression syntax |
| Rate limiting | Too many requests | Use batch mode or add delays |
| Token limit | Response too long | Increase max_tokens or simplify the instruction |
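For the rate-limiting row, a simple exponential backoff around the judged call is usually enough. The helper below is a hypothetical wrapper you would add around your own invocation, not part of the step itself; substitute your provider's rate-limit exception for RuntimeError:

```python
import time

def with_backoff(fn, retries: int = 3, base_delay: float = 1.0):
    """Call fn, retrying with exponential backoff; re-raise on the last attempt."""
    for attempt in range(retries):
        try:
            return fn()
        except RuntimeError:  # stand-in for the provider's rate-limit error
            if attempt == retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
```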