Evaluate LLM Outputs in 5 Minutes

Use LLM-as-Judge to automatically score and compare AI responses.

What You'll Build

Prompt → Generate Response → Judge Quality → Score + Explanation

The Workflow

{
  "init_params": {
    "prompt": "Write a haiku about artificial intelligence",
    "model": "openai/gpt-4o-mini"
  },
  "step_configs": {
    "generate": {
      "activity": "litellm_chat",
      "model_path": "init_params.model",
      "user_prompt_path": "init_params.prompt",
      "temperature": 0.8
    },
    "evaluate": {
      "activity": "simple_judge",
      "model": "gpt-4o",
      "model_provider": "openai",
      "judge_type": "scale",
      "scale_range": [1, 5],
      "instruction": "Rate this haiku on creativity, adherence to 5-7-5 syllable structure, and thematic relevance to AI.",
      "item_path": "generate.outputs.content"
    }
  },
  "steps": ["generate", "evaluate"]
}

Try It

  1. Copy the workflow above
  2. Run it in Jetty
  3. Change the prompt to generate different content
  4. Modify the instruction to evaluate different criteria

What You'll Learn

1. simple_judge - The evaluation engine

{
  "activity": "simple_judge",
  "model": "gpt-4o",
  "model_provider": "openai",
  "judge_type": "scale",
  "scale_range": [1, 5],
  "instruction": "Your evaluation criteria here",
  "item_path": "generate.outputs.content"
}

2. Judge types

| Type | Use Case | Output |
| --- | --- | --- |
| scale | Numeric scoring | rating: "4", average_score: 4.0 |
| binary | Yes/no decisions | rating: "yes" or rating: "no" |
| categorical | Multiple choice | rating: "category_name" |
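Downstream code typically needs to turn these string ratings into usable Python values. Here is a minimal, hypothetical helper (not part of Jetty itself) that normalizes each judge type's outputs, assuming the field names shown in this guide (rating, average_score):

```python
def interpret_rating(judge_type, outputs):
    """Normalize a simple_judge step's outputs into a Python value.

    Hypothetical helper; assumes the output fields shown in this guide.
    """
    rating = outputs["rating"]
    if judge_type == "scale":
        # Prefer the numeric average_score when present.
        return float(outputs.get("average_score", rating))
    if judge_type == "binary":
        return rating == "yes"
    if judge_type == "categorical":
        return rating
    raise ValueError(f"unknown judge_type: {judge_type}")
```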

3. Chaining with path expressions

The path expression generate.outputs.content pulls the LLM response out of the generate step and passes it to the judge:

generate step → outputs.content → evaluate step
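Conceptually, a path expression is just a dotted lookup into the nested dictionary of step results. The sketch below illustrates the idea; the actual resolver inside the engine may differ:

```python
def resolve_path(path, state):
    """Resolve a dotted path expression such as 'generate.outputs.content'
    against the workflow state (a nested dict of step results).

    Illustrative sketch only, not the engine's real implementation.
    """
    value = state
    for key in path.split("."):
        value = value[key]
    return value
```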

The Output

{
  "evaluate": {
    "outputs": {
      "rating": "4",
      "explanation": "Good creativity with the 'silicon dreams' metaphor. Follows 5-7-5 structure correctly. Clear AI theme throughout.",
      "average_score": 4.0,
      "model": "gpt-4o"
    }
  }
}

Multi-Criteria Evaluation

Evaluate against multiple criteria in parallel:

{
  "init_params": {
    "prompt": "Explain machine learning to a 10-year-old"
  },
  "step_configs": {
    "generate": {
      "activity": "litellm_chat",
      "model": "openai/gpt-4o",
      "user_prompt_path": "init_params.prompt"
    },
    "accuracy": {
      "activity": "simple_judge",
      "model": "gpt-4o",
      "model_provider": "openai",
      "judge_type": "scale",
      "scale_range": [1, 5],
      "instruction": "Rate the technical accuracy of this explanation.",
      "item_path": "generate.outputs.content"
    },
    "clarity": {
      "activity": "simple_judge",
      "model": "gpt-4o",
      "model_provider": "openai",
      "judge_type": "scale",
      "scale_range": [1, 5],
      "instruction": "Rate how understandable this is for a 10-year-old.",
      "item_path": "generate.outputs.content"
    },
    "engagement": {
      "activity": "simple_judge",
      "model": "gpt-4o",
      "model_provider": "openai",
      "judge_type": "scale",
      "scale_range": [1, 5],
      "instruction": "Rate how engaging and fun this explanation is.",
      "item_path": "generate.outputs.content"
    }
  },
  "steps": ["generate", "accuracy", "clarity", "engagement"]
}
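Once the workflow finishes, the three judge steps each contribute an independent score. A small post-processing sketch like the following (hypothetical; assumes each step's outputs carry an average_score field, as in the example output earlier) can combine them into a per-criterion breakdown plus an overall average:

```python
def aggregate_scores(result, criteria):
    """Average the scale scores from several parallel judge steps.

    Hypothetical helper; assumes each step's outputs include
    average_score, as shown in the example output above.
    """
    per_criterion = {c: result[c]["outputs"]["average_score"] for c in criteria}
    overall = sum(per_criterion.values()) / len(per_criterion)
    return per_criterion, overall
```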

Binary Evaluation (Pass/Fail)

Check if content meets specific criteria:

{
  "safety_check": {
    "activity": "simple_judge",
    "model": "gpt-4o",
    "model_provider": "openai",
    "judge_type": "binary",
    "instruction": "Does this content contain any harmful, offensive, or inappropriate material?",
    "item_path": "generate.outputs.content"
  }
}

Output:

{
  "rating": "no",
  "explanation": "The content is educational and appropriate for all audiences."
}
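Note the polarity: because the instruction asks whether harmful material is present, a "no" rating means the content passes. A hypothetical gate in your own code would look like this:

```python
def passes_safety_check(outputs):
    """Interpret the binary judge's rating for the safety check above.

    The instruction asks whether harmful material is PRESENT,
    so a rating of 'no' means the content passes.
    """
    return outputs["rating"].strip().lower() == "no"
```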

Compare Multiple Models

Generate from multiple models, then judge which is best:

{
  "init_params": {
    "question": "What is the meaning of life?"
  },
  "step_configs": {
    "gpt4": {
      "activity": "litellm_chat",
      "model": "openai/gpt-4o",
      "user_prompt_path": "init_params.question"
    },
    "claude": {
      "activity": "litellm_chat",
      "model": "anthropic/claude-sonnet-4-20250514",
      "user_prompt_path": "init_params.question"
    },
    "judge_gpt4": {
      "activity": "simple_judge",
      "model": "gpt-4o",
      "model_provider": "openai",
      "judge_type": "scale",
      "scale_range": [1, 10],
      "instruction": "Rate this response for depth, thoughtfulness, and helpfulness.",
      "item_path": "gpt4.outputs.content"
    },
    "judge_claude": {
      "activity": "simple_judge",
      "model": "gpt-4o",
      "model_provider": "openai",
      "judge_type": "scale",
      "scale_range": [1, 10],
      "instruction": "Rate this response for depth, thoughtfulness, and helpfulness.",
      "item_path": "claude.outputs.content"
    }
  },
  "steps": ["gpt4", "claude", "judge_gpt4", "judge_claude"]
}
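Picking the winner is then a comparison over the two judge steps' scores. A minimal sketch (hypothetical helper; assumes the average_score output field shown earlier):

```python
def pick_winner(result, judges):
    """Return the model whose response scored highest.

    judges maps a model label to its judge step name,
    e.g. {"gpt4": "judge_gpt4", "claude": "judge_claude"}.
    Ties go to the first listed model.
    """
    return max(
        judges,
        key=lambda model: result[judges[model]]["outputs"]["average_score"],
    )
```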

Custom Rubrics

Create detailed evaluation rubrics:

{
  "evaluate": {
    "activity": "simple_judge",
    "model": "gpt-4o",
    "model_provider": "openai",
    "judge_type": "scale",
    "scale_range": [0, 100],
    "instruction": "Evaluate this code review using the rubric:\n\n**Correctness (0-40):**\n- Identifies actual bugs\n- Doesn't flag false positives\n\n**Helpfulness (0-30):**\n- Provides actionable suggestions\n- Explains the 'why'\n\n**Tone (0-30):**\n- Professional and constructive\n- Not condescending\n\nProvide a total score.",
    "item_path": "generate.outputs.content"
  }
}
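Since the rubric's section maxima (40 + 30 + 30) must sum to the top of the 0-100 scale, it can be worth sanity-checking the judge's total before trusting it. A hypothetical validation sketch:

```python
def check_rubric_score(outputs, section_maxima=(40, 30, 30)):
    """Sanity-check a rubric total: it must be numeric and fall within
    the 0..sum(section_maxima) range implied by the rubric.

    Hypothetical helper for the rubric shown above.
    """
    try:
        total = float(outputs["rating"])
    except (KeyError, ValueError):
        return False
    return 0 <= total <= sum(section_maxima)
```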

Next Steps