Evaluating LLMs with Jetty
Build automated evaluation pipelines to assess LLM outputs for quality, accuracy, and consistency.
Overview
This guide walks through building a complete LLM evaluation pipeline:
- Generate outputs from one or more models
- Score outputs using LLM-as-Judge
- Aggregate and analyze results
Based on production workflows: llm-judge, goose-detector, txt-prompt-attack
Prerequisites
- Jetty account with API access
- API keys for target models (OpenAI, Anthropic, etc.)
Step 1: Single Model Evaluation
Start by evaluating outputs from a single model.
Workflow: Basic Quality Check
```json
{
  "init_params": {
    "prompt": "Explain quantum computing in simple terms",
    "model": "gpt-4o-mini"
  },
  "step_configs": {
    "generate": {
      "activity": "litellm_chat",
      "model_path": "init_params.model",
      "user_prompt_path": "init_params.prompt",
      "temperature": 0.7
    },
    "evaluate": {
      "activity": "simple_judge",
      "instruction": "Rate this explanation for clarity and accuracy. Score 1-5.",
      "item_path": "generate.outputs.content",
      "model": "gpt-4o",
      "judge_type": "scale",
      "scale_range": [1, 5],
      "model_provider": "openai"
    }
  },
  "steps": ["generate", "evaluate"]
}
```
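Under the hood, a scale-type judge essentially wraps your instruction and the item to evaluate into a prompt for the judge model. Jetty's actual prompt template is internal; the sketch below uses an assumed message format just to show the general shape:

```python
# Illustrative sketch only: the real prompt template simple_judge uses is
# internal to Jetty. This shows the assumed shape of a scale-type judge call.
def build_judge_messages(instruction: str, item: str, scale_range=(1, 5)):
    lo, hi = scale_range
    system = (
        f"You are an evaluator. {instruction} "
        f"Respond with a rating from {lo} to {hi} and a brief explanation."
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": item},
    ]

messages = build_judge_messages(
    "Rate this explanation for clarity and accuracy. Score 1-5.",
    "Quantum computers use qubits, which can represent 0 and 1 at once...",
)
```

These messages would then be sent to the judge model (`gpt-4o` above), whose reply is parsed into the rating and explanation fields.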
Understanding simple_judge
The simple_judge activity is the core of Jetty's evaluation system:
```json
{
  "evaluate": {
    "activity": "simple_judge",
    "model": "gpt-4o",
    "model_provider": "openai",
    "judge_type": "scale",
    "scale_range": [1, 5],
    "instruction": "Your evaluation criteria here",
    "item_path": "previous_step.outputs.content"
  }
}
```
Parameters:
- `judge_type`: `"scale"` for numeric scores, `"binary"` for yes/no
- `scale_range`: `[min, max]` for scale judgments
- `instruction`: what to evaluate and how to score
- `item_path`: path to the content being evaluated (text or image)
Output:
```json
{
  "outputs": {
    "rating": "4",
    "explanation": "Clear explanation with good analogies...",
    "average_score": 4.0,
    "model": "gpt-4o"
  }
}
```
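Note that `rating` comes back as a string while `average_score` is numeric. If you post-process results outside Jetty, a small parse-and-clamp step keeps malformed or out-of-range ratings from skewing aggregates. A minimal sketch, assuming the output format shown above:

```python
def parse_rating(raw: str, scale_range=(1, 5)) -> float:
    """Parse a judge's rating string and clamp it to the configured scale."""
    lo, hi = scale_range
    score = float(raw.strip())
    return max(float(lo), min(float(hi), score))

output = {"rating": "4", "explanation": "Clear explanation with good analogies..."}
score = parse_rating(output["rating"])  # 4.0
```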
Step 2: Multi-Criteria Evaluation
Evaluate content against multiple criteria in parallel.
Workflow: Comprehensive Assessment
Based on production workflow: llm-judge-plus
```json
{
  "init_params": {
    "content": "Your content to evaluate"
  },
  "step_configs": {
    "accuracy": {
      "activity": "simple_judge",
      "model": "gpt-4o",
      "judge_type": "scale",
      "scale_range": [1, 5],
      "instruction": "Rate the factual accuracy of this content.",
      "item_path": "init_params.content",
      "model_provider": "openai"
    },
    "clarity": {
      "activity": "simple_judge",
      "model": "gpt-4o",
      "judge_type": "scale",
      "scale_range": [1, 5],
      "instruction": "Rate how clear and understandable this content is.",
      "item_path": "init_params.content",
      "model_provider": "openai"
    },
    "completeness": {
      "activity": "simple_judge",
      "model": "gpt-4o",
      "judge_type": "scale",
      "scale_range": [1, 5],
      "instruction": "Rate how complete and thorough this content is.",
      "item_path": "init_params.content",
      "model_provider": "openai"
    }
  },
  "steps": ["accuracy", "clarity", "completeness"]
}
```
Step 3: Vision-Based Evaluation
Evaluate images or use vision models to describe content before judging.
Workflow: Image Description + Evaluation
Based on production workflow: goose-detector
```json
{
  "init_params": {
    "prompt": "A serene lake with birds swimming"
  },
  "step_configs": {
    "generate_image": {
      "model": "black-forest-labs/flux-schnell",
      "activity": "replicate_text2image",
      "prompt_path": "init_params.prompt"
    },
    "describe_image": {
      "model": "openai/gpt-4o",
      "activity": "litellm_vision",
      "image_path": "generate_image.outputs.images[0].path",
      "prompt": "Describe this image in detail. Focus on any birds or waterfowl present.",
      "detail": "high",
      "temperature": 0.3
    },
    "evaluate": {
      "model": "openai/gpt-4o",
      "activity": "simple_judge",
      "item_path": "describe_image.outputs.text",
      "judge_type": "scale",
      "scale_range": [1, 5],
      "instruction": "Based on this description, score the presence of waterfowl.",
      "explanation_required": true
    }
  },
  "steps": ["generate_image", "describe_image", "evaluate"]
}
```
Step 4: Batch Evaluation Pipeline
Evaluate a model across a dataset of test cases.
Workflow: Dataset Evaluation
Based on production workflow: txt-prompt-attack
```json
{
  "init_params": {
    "test_prompts": [
      "Explain photosynthesis",
      "What causes rain?",
      "How do computers work?"
    ],
    "model": "gpt-4o-mini"
  },
  "step_configs": {
    "run_all": {
      "activity": "list_emit_await",
      "items_path": "init_params.test_prompts",
      "task_reference": {
        "task_name": "single-evaluation"
      },
      "data_mapping": {
        "prompt": "{{ $item }}",
        "model": "{{ init_params.model }}"
      }
    },
    "collect": {
      "activity": "extract_from_trajectories",
      "trajectory_list_path": "run_all.outputs.trajectory_references",
      "extract_keys": {
        "prompt": "init_params.prompt",
        "response": "generate.outputs.text",
        "score": "evaluate.outputs.average_score"
      }
    },
    "summarize": {
      "activity": "litellm_chat",
      "model": "gpt-4o",
      "user_prompt": "Analyze these evaluation results and provide a summary:\n\n{{ collect.outputs.extracted_data }}"
    }
  },
  "steps": ["run_all", "collect", "summarize"]
}
```
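The `data_mapping` is rendered once per list item before each child workflow is launched. Jetty's exact templating semantics are internal; the sketch below assumes `{{ $item }}` substitutes the current list item and any other `{{ path }}` expression resolves a dotted path against the workflow context:

```python
import re

def expand_mapping(mapping: dict, item, context: dict) -> dict:
    """Render a data_mapping for one list item (assumed semantics, not Jetty's code).

    '{{ $item }}' becomes the item itself; other '{{ path }}' expressions
    are looked up in the context by dotted path; plain strings pass through.
    """
    def render(value: str):
        match = re.fullmatch(r"\{\{\s*(.+?)\s*\}\}", value)
        if not match:
            return value
        expr = match.group(1)
        if expr == "$item":
            return item
        node = context
        for part in expr.split("."):
            node = node[part]
        return node

    return {key: render(value) for key, value in mapping.items()}

context = {"init_params": {"model": "gpt-4o-mini"}}
mapping = {"prompt": "{{ $item }}", "model": "{{ init_params.model }}"}
expand_mapping(mapping, "Explain photosynthesis", context)
# {'prompt': 'Explain photosynthesis', 'model': 'gpt-4o-mini'}
```

Each expanded mapping becomes the `init_params` of one `single-evaluation` child run.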
Child Workflow: single-evaluation
```json
{
  "init_params": {
    "prompt": "",
    "model": "gpt-4o-mini"
  },
  "step_configs": {
    "generate": {
      "activity": "litellm_chat",
      "model_path": "init_params.model",
      "user_prompt_path": "init_params.prompt"
    },
    "evaluate": {
      "activity": "simple_judge",
      "model": "gpt-4o",
      "judge_type": "scale",
      "scale_range": [1, 5],
      "instruction": "Rate this response for accuracy and helpfulness.",
      "item_path": "generate.outputs.text",
      "model_provider": "openai"
    }
  },
  "steps": ["generate", "evaluate"]
}
```
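The `collect` step's `extract_keys` resolve dotted paths against each completed child trajectory. A simplified sketch of that behavior using plain dict lookups (Jetty's real path syntax is richer, e.g. list indices like `images[0]`):

```python
def get_path(data: dict, path: str):
    """Resolve a dotted path like 'generate.outputs.text' in a nested dict."""
    node = data
    for part in path.split("."):
        node = node[part]
    return node

def extract(trajectories: list[dict], extract_keys: dict[str, str]) -> list[dict]:
    """Pull the named paths out of each trajectory, one result row per run."""
    return [{key: get_path(t, path) for key, path in extract_keys.items()}
            for t in trajectories]

# A toy trajectory shaped like one single-evaluation run:
trajectory = {
    "init_params": {"prompt": "Explain photosynthesis"},
    "generate": {"outputs": {"text": "Plants convert light into chemical energy..."}},
    "evaluate": {"outputs": {"average_score": 4.0}},
}
extract([trajectory], {
    "prompt": "init_params.prompt",
    "response": "generate.outputs.text",
    "score": "evaluate.outputs.average_score",
})
```

The resulting list of rows is what `summarize` receives as `collect.outputs.extracted_data`.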
Step 5: Human Score Correlation
Compare automated scores with human annotations.
Workflow: Score Correlation
Based on production workflow: human_score_evaluator
```json
{
  "init_params": {},
  "step_configs": {
    "filter_labeled": {
      "activity": "select_trajectories",
      "task_name": "jettyio/my-eval-task",
      "filter_by": {
        "labels": {
          "human_score": {
            "$in": ["1", "2", "3", "4", "5"]
          }
        }
      }
    },
    "correlate": {
      "activity": "visualize_correlation",
      "trajectory_path": "filter_labeled.outputs.selected_trajectories",
      "x": "labels[0].value",
      "y": "steps.evaluate.outputs.average_score"
    }
  },
  "steps": ["filter_labeled", "correlate"]
}
```
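If you export the paired scores, the same correlation can be computed directly. A self-contained Pearson correlation sketch:

```python
from math import sqrt

def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between human labels and automated judge scores."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

human = [1, 2, 3, 4, 5]
judge = [1.5, 2.0, 3.5, 4.0, 4.5]
r = pearson(human, judge)  # close to 1.0 means the judge tracks human ratings
```

A low correlation is a signal to revise the judge instruction or switch to a stronger judge model before trusting automated scores at scale.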
Custom Evaluation Rubrics
Example: Code Quality Rubric
```json
{
  "evaluate_code": {
    "activity": "simple_judge",
    "model": "gpt-4o",
    "judge_type": "scale",
    "scale_range": [0, 100],
    "instruction": "Evaluate this code using the rubric:\n\n**Correctness (0-40):**\n- Handles all test cases\n- No logical errors\n\n**Efficiency (0-30):**\n- Optimal time complexity\n- Minimal space usage\n\n**Readability (0-30):**\n- Clear variable names\n- Appropriate comments\n\nProvide total score.",
    "item_path": "generate.outputs.content",
    "model_provider": "openai"
  }
}
```
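With a component rubric like this, you can also prompt the judge for per-component scores and total them yourself, clamping each component to its cap so a miscounting judge can't exceed 100. A sketch using the caps from the rubric above:

```python
# Component caps from the rubric above: correctness 0-40, efficiency 0-30,
# readability 0-30, for a 0-100 total.
RUBRIC_CAPS = {"correctness": 40.0, "efficiency": 30.0, "readability": 30.0}

def rubric_total(components: dict[str, float]) -> float:
    """Clamp each component score to its cap and sum into a 0-100 total."""
    total = 0.0
    for name, cap in RUBRIC_CAPS.items():
        total += max(0.0, min(cap, components.get(name, 0.0)))
    return total

rubric_total({"correctness": 35, "efficiency": 25, "readability": 28})  # 88.0
```

Totaling client-side also gives you per-component breakdowns for free, which a single 0-100 score hides.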
Best Practices
1. Use Consistent Prompts
- Keep evaluation instructions identical across comparisons
- Version your prompts alongside workflows
2. Choose Appropriate Judge Models
- Use stronger models (GPT-4, Claude-3-Opus) for nuanced evaluation
- Faster models work for simple pass/fail checks
3. Handle Rate Limits
- Set appropriate `max_parallel` values in `list_emit_await`
- Use exponential backoff for retries
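The backoff schedule itself is independent of any Jetty API. A sketch of exponential backoff with full jitter (delay ceiling doubles per attempt, capped):

```python
import random

def backoff_delays(retries: int, base: float = 1.0, cap: float = 30.0) -> list[float]:
    """Exponential backoff with full jitter.

    The ceiling doubles each attempt (base, 2*base, 4*base, ...) up to cap;
    the actual delay is drawn uniformly below the ceiling to spread retries out.
    """
    delays = []
    for attempt in range(retries):
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(random.uniform(0, ceiling))
    return delays

backoff_delays(3)  # three jittered delays under ceilings of 1s, 2s, 4s
```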
4. Track Evaluation Metrics
- Store trajectory IDs for reproducibility
- Monitor pass rates over time using trajectory selection