Evaluate LLM Outputs in 5 Minutes

Use LLM-as-Judge to automatically score and compare AI responses.

What You'll Build

Prompt → Generate Response → Judge Quality → Score + Explanation

The Workflow

{
  "init_params": {
    "prompt": "Write a haiku about artificial intelligence",
    "model": "openai/gpt-4o-mini"
  },
  "step_configs": {
    "generate": {
      "activity": "litellm_chat",
      "model_path": "init_params.model",
      "user_prompt_path": "init_params.prompt",
      "temperature": 0.8
    },
    "evaluate": {
      "activity": "simple_judge",
      "model": "gpt-4o",
      "model_provider": "openai",
      "judge_type": "scale",
      "scale_range": [1, 5],
      "instruction": "Rate this haiku on creativity, adherence to 5-7-5 syllable structure, and thematic relevance to AI.",
      "item_path": "generate.outputs.content"
    }
  },
  "steps": ["generate", "evaluate"]
}

Try It

  1. Copy the workflow above
  2. Run it in Jetty
  3. Change the prompt to generate different content
  4. Modify the instruction to evaluate different criteria

What You'll Learn

1. simple_judge - The evaluation engine

{
  "activity": "simple_judge",
  "model": "gpt-4o",
  "model_provider": "openai",
  "judge_type": "scale",
  "scale_range": [1, 5],
  "instruction": "Your evaluation criteria here",
  "item_path": "generate.outputs.content"
}

2. Judge types

| Type | Use Case | Output |
| --- | --- | --- |
| scale | Numeric scoring | rating: "4", average_score: 4.0 |
| binary | Yes/no decisions | rating: "yes" or rating: "no" |
| categorical | Multiple choice | rating: "category_name" |
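Downstream code typically needs to turn these string ratings into usable Python values. Here is a minimal, hypothetical helper (not part of Jetty itself) that normalizes each judge type's outputs, assuming the field names shown in this guide (rating, average_score):

```python
def interpret_rating(judge_type, outputs):
    """Normalize a simple_judge step's outputs into a Python value.

    Hypothetical helper; assumes the output fields shown in this guide.
    """
    rating = outputs["rating"]
    if judge_type == "scale":
        # Prefer the numeric average_score when present.
        return float(outputs.get("average_score", rating))
    if judge_type == "binary":
        return rating == "yes"
    if judge_type == "categorical":
        return rating
    raise ValueError(f"unknown judge_type: {judge_type}")
```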

3. Chaining with path expressions

The path expression generate.outputs.content pulls the LLM response out of the generate step and passes it to the judge:

generate step → outputs.content → evaluate step
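Conceptually, a path expression is just a dotted lookup into the nested dictionary of step results. The sketch below illustrates the idea; the actual resolver inside the engine may differ:

```python
def resolve_path(path, state):
    """Resolve a dotted path expression such as 'generate.outputs.content'
    against the workflow state (a nested dict of step results).

    Illustrative sketch only, not the engine's real implementation.
    """
    value = state
    for key in path.split("."):
        value = value[key]
    return value
```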

The Output

{
  "evaluate": {
    "outputs": {
      "rating": "4",
      "explanation": "Good creativity with the 'silicon dreams' metaphor. Follows 5-7-5 structure correctly. Clear AI theme throughout.",
      "average_score": 4.0,
      "model": "gpt-4o"
    }
  }
}

Multi-Criteria Evaluation

Evaluate against multiple criteria in parallel:

{
  "init_params": {
    "prompt": "Explain machine learning to a 10-year-old"
  },
  "step_configs": {
    "generate": {
      "activity": "litellm_chat",
      "model": "openai/gpt-4o",
      "user_prompt_path": "init_params.prompt"
    },
    "accuracy": {
      "activity": "simple_judge",
      "model": "gpt-4o",
      "model_provider": "openai",
      "judge_type": "scale",
      "scale_range": [1, 5],
      "instruction": "Rate the technical accuracy of this explanation.",
      "item_path": "generate.outputs.content"
    },
    "clarity": {
      "activity": "simple_judge",
      "model": "gpt-4o",
      "model_provider": "openai",
      "judge_type": "scale",
      "scale_range": [1, 5],
      "instruction": "Rate how understandable this is for a 10-year-old.",
      "item_path": "generate.outputs.content"
    },
    "engagement": {
      "activity": "simple_judge",
      "model": "gpt-4o",
      "model_provider": "openai",
      "judge_type": "scale",
      "scale_range": [1, 5],
      "instruction": "Rate how engaging and fun this explanation is.",
      "item_path": "generate.outputs.content"
    }
  },
  "steps": ["generate", "accuracy", "clarity", "engagement"]
}
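Once the workflow finishes, the three judge steps each contribute an independent score. A small post-processing sketch like the following (hypothetical; assumes each step's outputs carry an average_score field, as in the example output earlier) can combine them into a per-criterion breakdown plus an overall average:

```python
def aggregate_scores(result, criteria):
    """Average the scale scores from several parallel judge steps.

    Hypothetical helper; assumes each step's outputs include
    average_score, as shown in the example output above.
    """
    per_criterion = {c: result[c]["outputs"]["average_score"] for c in criteria}
    overall = sum(per_criterion.values()) / len(per_criterion)
    return per_criterion, overall
```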

Binary Evaluation (Pass/Fail)

Check if content meets specific criteria:

{
  "safety_check": {
    "activity": "simple_judge",
    "model": "gpt-4o",
    "model_provider": "openai",
    "judge_type": "binary",
    "instruction": "Does this content contain any harmful, offensive, or inappropriate material?",
    "item_path": "generate.outputs.content"
  }
}

Output:

{
  "rating": "no",
  "explanation": "The content is educational and appropriate for all audiences."
}
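Note the polarity: because the instruction asks whether harmful material is present, a "no" rating means the content passes. A hypothetical gate in your own code would look like this:

```python
def passes_safety_check(outputs):
    """Interpret the binary judge's rating for the safety check above.

    The instruction asks whether harmful material is PRESENT,
    so a rating of 'no' means the content passes.
    """
    return outputs["rating"].strip().lower() == "no"
```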

Compare Multiple Models

Generate from multiple models, then judge which is best:

{
  "init_params": {
    "question": "What is the meaning of life?"
  },
  "step_configs": {
    "gpt4": {
      "activity": "litellm_chat",
      "model": "openai/gpt-4o",
      "user_prompt_path": "init_params.question"
    },
    "claude": {
      "activity": "litellm_chat",
      "model": "anthropic/claude-sonnet-4-20250514",
      "user_prompt_path": "init_params.question"
    },
    "judge_gpt4": {
      "activity": "simple_judge",
      "model": "gpt-4o",
      "model_provider": "openai",
      "judge_type": "scale",
      "scale_range": [1, 10],
      "instruction": "Rate this response for depth, thoughtfulness, and helpfulness.",
      "item_path": "gpt4.outputs.content"
    },
    "judge_claude": {
      "activity": "simple_judge",
      "model": "gpt-4o",
      "model_provider": "openai",
      "judge_type": "scale",
      "scale_range": [1, 10],
      "instruction": "Rate this response for depth, thoughtfulness, and helpfulness.",
      "item_path": "claude.outputs.content"
    }
  },
  "steps": ["gpt4", "claude", "judge_gpt4", "judge_claude"]
}
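Picking the winner is then a comparison over the two judge steps' scores. A minimal sketch (hypothetical helper; assumes the average_score output field shown earlier):

```python
def pick_winner(result, judges):
    """Return the model whose response scored highest.

    judges maps a model label to its judge step name,
    e.g. {"gpt4": "judge_gpt4", "claude": "judge_claude"}.
    Ties go to the first listed model.
    """
    return max(
        judges,
        key=lambda model: result[judges[model]]["outputs"]["average_score"],
    )
```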

Custom Rubrics

Create detailed evaluation rubrics:

{
  "evaluate": {
    "activity": "simple_judge",
    "model": "gpt-4o",
    "model_provider": "openai",
    "judge_type": "scale",
    "scale_range": [0, 100],
    "instruction": "Evaluate this code review using the rubric:\n\n**Correctness (0-40):**\n- Identifies actual bugs\n- Doesn't flag false positives\n\n**Helpfulness (0-30):**\n- Provides actionable suggestions\n- Explains the 'why'\n\n**Tone (0-30):**\n- Professional and constructive\n- Not condescending\n\nProvide a total score.",
    "item_path": "generate.outputs.content"
  }
}
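Since the rubric's section maxima (40 + 30 + 30) must sum to the top of the 0-100 scale, it can be worth sanity-checking the judge's total before trusting it. A hypothetical validation sketch:

```python
def check_rubric_score(outputs, section_maxima=(40, 30, 30)):
    """Sanity-check a rubric total: it must be numeric and fall within
    the 0..sum(section_maxima) range implied by the rubric.

    Hypothetical helper for the rubric shown above.
    """
    try:
        total = float(outputs["rating"])
    except (KeyError, ValueError):
        return False
    return 0 <= total <= sum(section_maxima)
```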

Next Steps