
Evaluation Pipeline Patterns

Advanced evaluation workflows built on Jetty's verdict system and LLM-as-judge steps, which combine multiple assessment stages for content evaluation and quality assurance.

Overview

Evaluation pipelines assess content quality, model performance, and decision-making by chaining structured evaluation stages, for example scoring items, categorizing them, and summarizing the results.

| Pipeline Type | Purpose | Complexity | Reliability |
|---|---|---|---|
| Single Judge | Quick evaluation with one assessor | Low | Medium |
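To make the multi-stage idea concrete, here is a minimal Python sketch of a sequential pipeline runner. It is not Jetty's engine: run_pipeline, score_items, categorize, and summarize are hypothetical, the judges are toy stubs instead of LLM calls, and the context layout (init_params plus steps.<name>.outputs) is an assumption modeled on the reference strings used in the single-judge example on this page.

```python
from typing import Any, Callable

# Hypothetical step signature: takes the shared context, returns that step's outputs.
StepFn = Callable[[dict[str, Any]], dict[str, Any]]


def run_pipeline(init_params: dict[str, Any],
                 steps: list[tuple[str, StepFn]]) -> dict[str, Any]:
    """Run steps in order; each step sees init_params plus all earlier outputs."""
    context: dict[str, Any] = {"init_params": init_params, "steps": {}}
    for name, step_fn in steps:
        # Store outputs under steps.<name>.outputs so later stages can read them.
        context["steps"][name] = {"outputs": step_fn(context)}
    return context


# Stub "judges" standing in for LLM calls; the scoring rules are toy heuristics.
def score_items(ctx: dict[str, Any]) -> dict[str, Any]:
    items = ctx["init_params"]["content_to_evaluate"]
    # Score = word count, clamped to a 1-10 scale.
    return {"results": [max(1, min(10, len(t.split()))) for t in items]}


def categorize(ctx: dict[str, Any]) -> dict[str, Any]:
    items = ctx["init_params"]["content_to_evaluate"]
    # Reads the raw items, independent of score_items, like the two judge steps
    # in the single-judge example.
    return {"results": ["good" if len(t.split()) >= 5 else "fair" for t in items]}


def summarize(ctx: dict[str, Any]) -> dict[str, Any]:
    # The final stage reads the outputs of both earlier stages.
    scores = ctx["steps"]["technical_assessment"]["outputs"]["results"]
    labels = ctx["steps"]["categorize_quality"]["outputs"]["results"]
    mean = sum(scores) / len(scores)
    return {"results": f"mean score {mean:.1f}; categories {labels}"}


ctx = run_pipeline(
    {"content_to_evaluate": ["a b c", "a b c d e f g h"]},
    [("technical_assessment", score_items),
     ("categorize_quality", categorize),
     ("evaluation_summary", summarize)],
)
print(ctx["steps"]["evaluation_summary"]["outputs"]["results"])
# prints: mean score 5.5; categories ['fair', 'good']
```

Keeping every step's outputs in one shared context is what lets a later stage, such as a summary, combine results from several independent judges without those judges knowing about each other.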

Core Evaluation Patterns

Single Judge Evaluation

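Simple, direct evaluation using a single LLM judge per step. The pipeline below scores each statement on a 1 to 10 scale, assigns each one a quality category, and then uses a litellm_chat step to summarize both sets of results: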

```json
{
  "name": "single_judge_evaluation",
  "description": "Basic evaluation pattern with single LLM judge",
  "init_params": {
    "content_to_evaluate": [
      "The implementation uses recursion which may cause stack overflow for large inputs.",
      "The algorithm employs dynamic programming with memoization for optimal performance.",
      "This solution has O(n²) complexity but could be optimized to O(n log n)."
    ],
    "evaluation_criteria": "Assess the technical accuracy and quality of these code analysis statements"
  },
  "steps": [
    {
      "name": "technical_assessment",
      "step_type": "simple_judge",
      "config": {
        "items": "init_params.content_to_evaluate",
        "instruction": "init_params.evaluation_criteria",
        "judge_type": "scale",
        "scale_range": [1, 10],
        "with_explanation": true,
        "model": "gpt-4",
        "temperature": 0.3
      }
    },
    {
      "name": "categorize_quality",
      "step_type": "simple_judge",
      "config": {
        "items": "init_params.content_to_evaluate",
        "instruction": "Categorize the overall quality of this technical statement",
        "judge_type": "categorical",
        "categories": ["excellent", "good", "fair", "poor"],
        "with_explanation": true,
        "model": "gpt-4",
        "temperature": 0.2
      }
    },
    {
      "name": "evaluation_summary",
      "step_type": "litellm_chat",
      "config": {
        "model": "gpt-4",
        "messages": [
          {
            "role": "system",
            "content": "You are an expert at analyzing evaluation results and providing actionable feedback."
          },
          {
            "role": "user",
            "content": "Summarize the evaluation results:\n\nTechnical Assessment: steps.technical_assessment.outputs.results\n\nQuality Categories: steps.categorize_quality.outputs.results\n\nProvide: 1) Overall quality assessment, 2) Key strengths and weaknesses, 3) Specific improvement recommendations"
          }
        ],
        "temperature": 0.4,
        "max_tokens": 800
      }
    }
  ]
}
```
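The config wires steps together with dotted reference strings such as init_params.content_to_evaluate and steps.technical_assessment.outputs.results. The Python sketch below shows one plausible way such references could resolve against a context dictionary, and how a judge verdict could be checked against scale_range or categories. Both resolve_ref and validate_verdict are hypothetical helpers, not part of Jetty's API. They also assume each verdict is a bare score or label; Jetty's real output shape (for example with with_explanation enabled) may differ.

```python
from typing import Any


def resolve_ref(context: dict[str, Any], ref: str) -> Any:
    """Follow a dotted path such as 'init_params.content_to_evaluate' (hypothetical helper)."""
    value: Any = context
    for key in ref.split("."):
        value = value[key]  # KeyError signals a reference to a missing field
    return value


def validate_verdict(config: dict[str, Any], verdict: Any) -> bool:
    """Check one verdict against a simple_judge-style config (assumes a bare score or label)."""
    judge_type = config["judge_type"]
    if judge_type == "scale":
        low, high = config["scale_range"]
        # Reject bools explicitly, since bool is a subclass of int in Python.
        is_number = isinstance(verdict, (int, float)) and not isinstance(verdict, bool)
        return is_number and low <= verdict <= high
    if judge_type == "categorical":
        return verdict in config["categories"]
    raise ValueError(f"unsupported judge_type: {judge_type}")


# Hypothetical context after the technical_assessment step has run.
context = {
    "init_params": {"content_to_evaluate": ["statement A", "statement B"]},
    "steps": {"technical_assessment": {"outputs": {"results": [7, 4]}}},
}
print(resolve_ref(context, "init_params.content_to_evaluate"))  # ['statement A', 'statement B']
print(resolve_ref(context, "steps.technical_assessment.outputs.results"))  # [7, 4]

scale_cfg = {"judge_type": "scale", "scale_range": [1, 10]}
category_cfg = {"judge_type": "categorical", "categories": ["excellent", "good", "fair", "poor"]}
print(validate_verdict(scale_cfg, 7), validate_verdict(scale_cfg, 11))  # True False
print(validate_verdict(category_cfg, "good"), validate_verdict(category_cfg, "average"))  # True False
```

Failing fast on a bad reference or an out-of-range verdict keeps a malformed judge response from silently flowing into a downstream step such as evaluation_summary.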