
Evaluating LLMs with Jetty

Build automated evaluation pipelines to assess LLM outputs for quality, accuracy, and consistency.

Overview

This guide walks through building a complete LLM evaluation pipeline:

  1. Generate outputs from one or more models
  2. Score outputs using LLM-as-Judge
  3. Aggregate and analyze results

Based on production workflows: llm-judge, goose-detector, txt-prompt-attack

Prerequisites

  • Jetty account with API access
  • API keys for target models (OpenAI, Anthropic, etc.)

Step 1: Single Model Evaluation

Start by evaluating outputs from a single model.

Workflow: Basic Quality Check

{
  "init_params": {
    "prompt": "Explain quantum computing in simple terms",
    "model": "gpt-4o-mini"
  },
  "step_configs": {
    "generate": {
      "activity": "litellm_chat",
      "model_path": "init_params.model",
      "user_prompt_path": "init_params.prompt",
      "temperature": 0.7
    },
    "evaluate": {
      "activity": "simple_judge",
      "instruction": "Rate this explanation for clarity and accuracy. Score 1-5.",
      "item_path": "generate.outputs.content",
      "model": "gpt-4o",
      "judge_type": "scale",
      "scale_range": [1, 5],
      "model_provider": "openai"
    }
  },
  "steps": ["generate", "evaluate"]
}

Understanding simple_judge

The simple_judge activity is the core of Jetty's evaluation system:

{
  "evaluate": {
    "activity": "simple_judge",
    "model": "gpt-4o",
    "model_provider": "openai",
    "judge_type": "scale",
    "scale_range": [1, 5],
    "instruction": "Your evaluation criteria here",
    "item_path": "previous_step.outputs.content"
  }
}

Parameters:

  • judge_type: "scale" for numeric scores, "binary" for yes/no
  • scale_range: [min, max] for scale judgments
  • instruction: What to evaluate and how to score
  • item_path: Path to content being evaluated (text or image)

Output:

{
  "outputs": {
    "rating": "4",
    "explanation": "Clear explanation with good analogies...",
    "average_score": 4.0,
    "model": "gpt-4o"
  }
}
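For pass/fail checks, simple_judge also supports judge_type "binary". A minimal sketch reusing the parameters above (the instruction wording is illustrative):

```json
{
  "evaluate": {
    "activity": "simple_judge",
    "model": "gpt-4o",
    "model_provider": "openai",
    "judge_type": "binary",
    "instruction": "Does this response directly answer the question? Answer yes or no.",
    "item_path": "previous_step.outputs.content"
  }
}
```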

Step 2: Multi-Criteria Evaluation

Evaluate content against multiple criteria in parallel.

Workflow: Comprehensive Assessment

Based on production workflow: llm-judge-plus

{
  "init_params": {
    "content": "Your content to evaluate"
  },
  "step_configs": {
    "accuracy": {
      "activity": "simple_judge",
      "model": "gpt-4o",
      "judge_type": "scale",
      "scale_range": [1, 5],
      "instruction": "Rate the factual accuracy of this content.",
      "item_path": "init_params.content",
      "model_provider": "openai"
    },
    "clarity": {
      "activity": "simple_judge",
      "model": "gpt-4o",
      "judge_type": "scale",
      "scale_range": [1, 5],
      "instruction": "Rate how clear and understandable this content is.",
      "item_path": "init_params.content",
      "model_provider": "openai"
    },
    "completeness": {
      "activity": "simple_judge",
      "model": "gpt-4o",
      "judge_type": "scale",
      "scale_range": [1, 5],
      "instruction": "Rate how complete and thorough this content is.",
      "item_path": "init_params.content",
      "model_provider": "openai"
    }
  },
  "steps": ["accuracy", "clarity", "completeness"]
}
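The three judge steps produce independent scores that are not combined automatically. One way to aggregate them is a follow-on litellm_chat step; this sketch assumes the {{ }} templating shown in Step 4 can also reference step outputs (the step name "aggregate" and the prompt wording are illustrative):

```json
"aggregate": {
  "activity": "litellm_chat",
  "model": "gpt-4o",
  "user_prompt": "Combine these ratings into a short overall verdict:\nAccuracy: {{ accuracy.outputs.average_score }}\nClarity: {{ clarity.outputs.average_score }}\nCompleteness: {{ completeness.outputs.average_score }}"
}
```

Append "aggregate" to the steps list so it runs after the three judges complete.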

Step 3: Vision-Based Evaluation

Evaluate images or use vision models to describe content before judging.

Workflow: Image Description + Evaluation

Based on production workflow: goose-detector

{
  "init_params": {
    "prompt": "A serene lake with birds swimming"
  },
  "step_configs": {
    "generate_image": {
      "model": "black-forest-labs/flux-schnell",
      "activity": "replicate_text2image",
      "prompt_path": "init_params.prompt"
    },
    "describe_image": {
      "model": "openai/gpt-4o",
      "activity": "litellm_vision",
      "image_path": "generate_image.outputs.images[0].path",
      "prompt": "Describe this image in detail. Focus on any birds or waterfowl present.",
      "detail": "high",
      "temperature": 0.3
    },
    "evaluate": {
      "model": "openai/gpt-4o",
      "activity": "simple_judge",
      "item_path": "describe_image.outputs.text",
      "judge_type": "scale",
      "scale_range": [1, 5],
      "instruction": "Based on this description, score the presence of waterfowl.",
      "explanation_required": true
    }
  },
  "steps": ["generate_image", "describe_image", "evaluate"]
}
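If you only need a yes/no answer (e.g. "is a goose present?"), the final step can be switched to a binary judgment instead of a 1-5 scale. A sketch, with illustrative instruction wording:

```json
"evaluate": {
  "model": "openai/gpt-4o",
  "activity": "simple_judge",
  "item_path": "describe_image.outputs.text",
  "judge_type": "binary",
  "instruction": "Based on this description, is a goose or other waterfowl present? Answer yes or no.",
  "explanation_required": true
}
```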

Step 4: Batch Evaluation Pipeline

Evaluate a model across a dataset of test cases.

Workflow: Dataset Evaluation

Based on production workflow: txt-prompt-attack

{
  "init_params": {
    "test_prompts": [
      "Explain photosynthesis",
      "What causes rain?",
      "How do computers work?"
    ],
    "model": "gpt-4o-mini"
  },
  "step_configs": {
    "run_all": {
      "activity": "list_emit_await",
      "items_path": "init_params.test_prompts",
      "task_reference": {
        "task_name": "single-evaluation"
      },
      "data_mapping": {
        "prompt": "{{ $item }}",
        "model": "{{ init_params.model }}"
      }
    },
    "collect": {
      "activity": "extract_from_trajectories",
      "trajectory_list_path": "run_all.outputs.trajectory_references",
      "extract_keys": {
        "prompt": "init_params.prompt",
        "response": "generate.outputs.text",
        "score": "evaluate.outputs.average_score"
      }
    },
    "summarize": {
      "activity": "litellm_chat",
      "model": "gpt-4o",
      "user_prompt": "Analyze these evaluation results and provide a summary:\n\n{{ collect.outputs.extracted_data }}"
    }
  },
  "steps": ["run_all", "collect", "summarize"]
}
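Given the extract_keys above, collect.outputs.extracted_data should contain one record per child trajectory, along these lines (values are illustrative and the exact shape may differ):

```json
[
  { "prompt": "Explain photosynthesis", "response": "...", "score": 4.0 },
  { "prompt": "What causes rain?", "response": "...", "score": 5.0 }
]
```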

Child Workflow: single-evaluation

{
  "init_params": {
    "prompt": "",
    "model": "gpt-4o-mini"
  },
  "step_configs": {
    "generate": {
      "activity": "litellm_chat",
      "model_path": "init_params.model",
      "user_prompt_path": "init_params.prompt"
    },
    "evaluate": {
      "activity": "simple_judge",
      "model": "gpt-4o",
      "judge_type": "scale",
      "scale_range": [1, 5],
      "instruction": "Rate this response for accuracy and helpfulness.",
      "item_path": "generate.outputs.text",
      "model_provider": "openai"
    }
  },
  "steps": ["generate", "evaluate"]
}

Step 5: Human Score Correlation

Compare automated scores with human annotations.

Workflow: Score Correlation

Based on production workflow: human_score_evaluator

{
  "init_params": {},
  "step_configs": {
    "filter_labeled": {
      "activity": "select_trajectories",
      "task_name": "jettyio/my-eval-task",
      "filter_by": {
        "labels": {
          "human_score": {
            "$in": ["1", "2", "3", "4", "5"]
          }
        }
      }
    },
    "correlate": {
      "activity": "visualize_correlation",
      "trajectory_path": "filter_labeled.outputs.selected_trajectories",
      "x": "labels[0].value",
      "y": "steps.evaluate.outputs.average_score"
    }
  },
  "steps": ["filter_labeled", "correlate"]
}

Custom Evaluation Rubrics

Example: Code Quality Rubric

{
  "evaluate_code": {
    "activity": "simple_judge",
    "model": "gpt-4o",
    "judge_type": "scale",
    "scale_range": [0, 100],
    "instruction": "Evaluate this code using the rubric:\n\n**Correctness (0-40):**\n- Handles all test cases\n- No logical errors\n\n**Efficiency (0-30):**\n- Optimal time complexity\n- Minimal space usage\n\n**Readability (0-30):**\n- Clear variable names\n- Appropriate comments\n\nProvide total score.",
    "item_path": "generate.outputs.content",
    "model_provider": "openai"
  }
}

Best Practices

1. Use Consistent Prompts

  • Keep evaluation instructions identical across comparisons
  • Version your prompts alongside workflows

2. Choose Appropriate Judge Models

  • Use stronger models (GPT-4, Claude-3-Opus) for nuanced evaluation
  • Faster models work for simple pass/fail checks

3. Handle Rate Limits

  • Set appropriate max_parallel values in list_emit_await
  • Use exponential backoff for retries
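For example, the run_all step from Step 4 could be throttled by adding max_parallel (the value 5 is illustrative; tune it to your provider's limits):

```json
"run_all": {
  "activity": "list_emit_await",
  "items_path": "init_params.test_prompts",
  "max_parallel": 5,
  "task_reference": { "task_name": "single-evaluation" },
  "data_mapping": {
    "prompt": "{{ $item }}",
    "model": "{{ init_params.model }}"
  }
}
```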

4. Track Evaluation Metrics

  • Store trajectory IDs for reproducibility
  • Monitor pass rates over time using trajectory selection

Next Steps