Evaluation Pipeline Patterns

This guide covers advanced evaluation workflows built on Jetty's verdict system and LLM-as-judge capabilities, combined into multi-step assessment pipelines for content evaluation and quality assurance.

Overview

Evaluation pipelines provide systematic approaches to assessing content quality, model performance, and decision-making through structured, multi-stage evaluation processes.

Pipeline Type	Purpose	Complexity	Reliability
Single Judge	Quick evaluation with one assessor	Low	Medium

Core Evaluation Patterns

Single Judge Evaluation

Simple, direct evaluation using a single LLM judge:

{
  "name": "single_judge_evaluation",
  "description": "Basic evaluation pattern with single LLM judge",
  "init_params": {
    "content_to_evaluate": [
      "The implementation uses recursion which may cause stack overflow for large inputs.",
      "The algorithm employs dynamic programming with memoization for optimal performance.",
      "This solution has O(n²) complexity but could be optimized to O(n log n)."
    ],
    "evaluation_criteria": "Assess the technical accuracy and quality of these code analysis statements"
  },
  "steps": [
    {
      "name": "technical_assessment",
      "step_type": "simple_judge",
      "config": {
        "items": "init_params.content_to_evaluate",
        "instruction": "init_params.evaluation_criteria",
        "judge_type": "scale",
        "scale_range": [1, 10],
        "with_explanation": true,
        "model": "gpt-4",
        "temperature": 0.3
      }
    },
    {
      "name": "categorize_quality",
      "step_type": "simple_judge",
      "config": {
        "items": "init_params.content_to_evaluate",
        "instruction": "Categorize the overall quality of this technical statement",
        "judge_type": "categorical",
        "categories": ["excellent", "good", "fair", "poor"],
        "with_explanation": true,
        "model": "gpt-4",
        "temperature": 0.2
      }
    },
    {
      "name": "evaluation_summary",
      "step_type": "litellm_chat",
      "config": {
        "model": "gpt-4",
        "messages": [
          {
            "role": "system",
            "content": "You are an expert at analyzing evaluation results and providing actionable feedback."
          },
          {
            "role": "user",
            "content": "Summarize the evaluation results:\n\nTechnical Assessment: steps.technical_assessment.outputs.results\n\nQuality Categories: steps.categorize_quality.outputs.results\n\nProvide: 1) Overall quality assessment, 2) Key strengths and weaknesses, 3) Specific improvement recommendations"
          }
        ],
        "temperature": 0.4,
        "max_tokens": 800
      }
    }
  ]
}
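The workflow above wires steps together with dotted path strings such as `init_params.content_to_evaluate` and `steps.technical_assessment.outputs.results`. A typo in one of those paths is easy to miss, so it can help to check them before submitting a workflow. The following is a minimal Python sketch of such a check. It assumes the reference syntax is exactly what the example shows; how Jetty actually resolves these paths is not described here. The names `check_references` and `iter_strings` are hypothetical helpers, not part of any Jetty API.

```python
import re

# Reference patterns inferred from the example workflow (an assumption,
# not a documented grammar): "steps.<name>.outputs..." and "init_params.<key>".
STEP_REF = re.compile(r"\bsteps\.([A-Za-z_][A-Za-z0-9_]*)\.outputs\b")
INIT_REF = re.compile(r"\binit_params\.([A-Za-z_][A-Za-z0-9_]*)")


def iter_strings(value):
    """Yield every string nested anywhere inside a JSON-like value."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, dict):
        for item in value.values():
            yield from iter_strings(item)
    elif isinstance(value, list):
        for item in value:
            yield from iter_strings(item)


def check_references(workflow):
    """Return messages for references to unknown init_params keys
    or to steps that are not defined earlier in the step list."""
    problems = []
    init_keys = set(workflow.get("init_params", {}))
    seen_steps = set()
    for step in workflow.get("steps", []):
        for text in iter_strings(step.get("config", {})):
            for name in STEP_REF.findall(text):
                if name not in seen_steps:
                    problems.append(
                        f"{step['name']}: step '{name}' is not defined earlier"
                    )
            for key in INIT_REF.findall(text):
                if key not in init_keys:
                    problems.append(
                        f"{step['name']}: init_params.{key} is not defined"
                    )
        seen_steps.add(step["name"])
    return problems


# A trimmed copy of the workflow above with the categorize_quality step
# left out on purpose, so the summary step refers to a missing step.
workflow = {
    "init_params": {
        "content_to_evaluate": ["..."],
        "evaluation_criteria": "...",
    },
    "steps": [
        {
            "name": "technical_assessment",
            "config": {
                "items": "init_params.content_to_evaluate",
                "instruction": "init_params.evaluation_criteria",
            },
        },
        {
            "name": "evaluation_summary",
            "config": {
                "messages": [
                    {
                        "role": "user",
                        "content": "Technical Assessment: steps.technical_assessment.outputs.results\n\n"
                        "Quality Categories: steps.categorize_quality.outputs.results",
                    }
                ]
            },
        },
    ],
}

for problem in check_references(workflow):
    print(problem)
```

Run against the full workflow shown above, the check should report nothing, because each step only refers to `init_params` keys that exist and to steps that come before it.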

Related Documentation

AI Model Comparison - Multi-provider evaluation strategies
Simple Judge - LLM-as-judge capabilities
Statistical Analysis - Data analysis and metrics
Workflow Patterns - General composition patterns
