
Evaluating LLMs with Jetty

Build automated evaluation pipelines to assess LLM outputs for quality, accuracy, and consistency.

Overview

This guide walks through building a complete LLM evaluation pipeline:

  1. Generate outputs from one or more models
  2. Score outputs using LLM-as-Judge
  3. Aggregate and analyze results

Based on production workflows: llm-judge, goose-detector, txt-prompt-attack

Prerequisites

  • Jetty account with API access
  • API keys for target models (OpenAI, Anthropic, etc.)

Step 1: Single Model Evaluation

Start by evaluating outputs from a single model.

Workflow: Basic Quality Check

{
  "init_params": {
    "prompt": "Explain quantum computing in simple terms",
    "model": "gpt-4o-mini"
  },
  "step_configs": {
    "generate": {
      "activity": "litellm_chat",
      "model_path": "init_params.model",
      "user_prompt_path": "init_params.prompt",
      "temperature": 0.7
    },
    "evaluate": {
      "activity": "simple_judge",
      "instruction": "Rate this explanation for clarity and accuracy. Score 1-5.",
      "item_path": "generate.outputs.content",
      "model": "gpt-4o",
      "judge_type": "scale",
      "scale_range": [1, 5],
      "model_provider": "openai"
    }
  },
  "steps": ["generate", "evaluate"]
}

Understanding simple_judge

The simple_judge activity is the core of Jetty's evaluation system:

{
  "evaluate": {
    "activity": "simple_judge",
    "model": "gpt-4o",
    "model_provider": "openai",
    "judge_type": "scale",
    "scale_range": [1, 5],
    "instruction": "Your evaluation criteria here",
    "item_path": "previous_step.outputs.content"
  }
}

Parameters:

  • judge_type: "scale" for numeric scores, "binary" for yes/no
  • scale_range: [min, max] for scale judgments
  • instruction: What to evaluate and how to score
  • item_path: Path to content being evaluated (text or image)

Output:

{
  "outputs": {
    "rating": "4",
    "explanation": "Clear explanation with good analogies...",
    "average_score": 4.0,
    "model": "gpt-4o"
  }
}
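For pass/fail checks, simple_judge also supports judge_type "binary". A minimal sketch reusing the parameters above (the instruction wording is illustrative):

```json
{
  "evaluate": {
    "activity": "simple_judge",
    "model": "gpt-4o",
    "model_provider": "openai",
    "judge_type": "binary",
    "instruction": "Does this response directly answer the question? Answer yes or no.",
    "item_path": "previous_step.outputs.content"
  }
}
```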

Step 2: Multi-Criteria Evaluation

Evaluate content against multiple criteria in parallel.

Workflow: Comprehensive Assessment

Based on production workflow: llm-judge-plus

{
  "init_params": {
    "content": "Your content to evaluate"
  },
  "step_configs": {
    "accuracy": {
      "activity": "simple_judge",
      "model": "gpt-4o",
      "judge_type": "scale",
      "scale_range": [1, 5],
      "instruction": "Rate the factual accuracy of this content.",
      "item_path": "init_params.content",
      "model_provider": "openai"
    },
    "clarity": {
      "activity": "simple_judge",
      "model": "gpt-4o",
      "judge_type": "scale",
      "scale_range": [1, 5],
      "instruction": "Rate how clear and understandable this content is.",
      "item_path": "init_params.content",
      "model_provider": "openai"
    },
    "completeness": {
      "activity": "simple_judge",
      "model": "gpt-4o",
      "judge_type": "scale",
      "scale_range": [1, 5],
      "instruction": "Rate how complete and thorough this content is.",
      "item_path": "init_params.content",
      "model_provider": "openai"
    }
  },
  "steps": ["accuracy", "clarity", "completeness"]
}
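The three judge steps produce independent scores that are not combined automatically. One way to aggregate them is a follow-on litellm_chat step; this sketch assumes the {{ }} templating shown in Step 4 can also reference step outputs (the step name "aggregate" and the prompt wording are illustrative):

```json
"aggregate": {
  "activity": "litellm_chat",
  "model": "gpt-4o",
  "user_prompt": "Combine these ratings into a short overall verdict:\nAccuracy: {{ accuracy.outputs.average_score }}\nClarity: {{ clarity.outputs.average_score }}\nCompleteness: {{ completeness.outputs.average_score }}"
}
```

Append "aggregate" to the steps list so it runs after the three judges complete.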

Step 3: Vision-Based Evaluation

Evaluate images or use vision models to describe content before judging.

Workflow: Image Description + Evaluation

Based on production workflow: goose-detector

{
  "init_params": {
    "prompt": "A serene lake with birds swimming"
  },
  "step_configs": {
    "generate_image": {
      "model": "black-forest-labs/flux-schnell",
      "activity": "replicate_text2image",
      "prompt_path": "init_params.prompt"
    },
    "describe_image": {
      "model": "openai/gpt-4o",
      "activity": "litellm_vision",
      "image_path": "generate_image.outputs.images[0].path",
      "prompt": "Describe this image in detail. Focus on any birds or waterfowl present.",
      "detail": "high",
      "temperature": 0.3
    },
    "evaluate": {
      "model": "openai/gpt-4o",
      "activity": "simple_judge",
      "item_path": "describe_image.outputs.text",
      "judge_type": "scale",
      "scale_range": [1, 5],
      "instruction": "Based on this description, score the presence of waterfowl.",
      "explanation_required": true
    }
  },
  "steps": ["generate_image", "describe_image", "evaluate"]
}
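If you only need a yes/no answer (e.g. "is a goose present?"), the final step can be switched to a binary judgment instead of a 1-5 scale. A sketch, with illustrative instruction wording:

```json
"evaluate": {
  "model": "openai/gpt-4o",
  "activity": "simple_judge",
  "item_path": "describe_image.outputs.text",
  "judge_type": "binary",
  "instruction": "Based on this description, is a goose or other waterfowl present? Answer yes or no.",
  "explanation_required": true
}
```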

Step 4: Batch Evaluation Pipeline

Evaluate a model across a dataset of test cases.

Workflow: Dataset Evaluation

Based on production workflow: txt-prompt-attack

{
  "init_params": {
    "test_prompts": [
      "Explain photosynthesis",
      "What causes rain?",
      "How do computers work?"
    ],
    "model": "gpt-4o-mini"
  },
  "step_configs": {
    "run_all": {
      "activity": "list_emit_await",
      "items_path": "init_params.test_prompts",
      "task_reference": {
        "task_name": "single-evaluation"
      },
      "data_mapping": {
        "prompt": "{{ $item }}",
        "model": "{{ init_params.model }}"
      }
    },
    "collect": {
      "activity": "extract_from_trajectories",
      "trajectory_list_path": "run_all.outputs.trajectory_references",
      "extract_keys": {
        "prompt": "init_params.prompt",
        "response": "generate.outputs.text",
        "score": "evaluate.outputs.average_score"
      }
    },
    "summarize": {
      "activity": "litellm_chat",
      "model": "gpt-4o",
      "user_prompt": "Analyze these evaluation results and provide a summary:\n\n{{ collect.outputs.extracted_data }}"
    }
  },
  "steps": ["run_all", "collect", "summarize"]
}
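Given the extract_keys above, collect.outputs.extracted_data should contain one record per child trajectory, along these lines (values are illustrative and the exact shape may differ):

```json
[
  { "prompt": "Explain photosynthesis", "response": "...", "score": 4.0 },
  { "prompt": "What causes rain?", "response": "...", "score": 5.0 }
]
```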

Child Workflow: single-evaluation

{
  "init_params": {
    "prompt": "",
    "model": "gpt-4o-mini"
  },
  "step_configs": {
    "generate": {
      "activity": "litellm_chat",
      "model_path": "init_params.model",
      "user_prompt_path": "init_params.prompt"
    },
    "evaluate": {
      "activity": "simple_judge",
      "model": "gpt-4o",
      "judge_type": "scale",
      "scale_range": [1, 5],
      "instruction": "Rate this response for accuracy and helpfulness.",
      "item_path": "generate.outputs.text",
      "model_provider": "openai"
    }
  },
  "steps": ["generate", "evaluate"]
}

Step 5: Human Score Correlation

Compare automated scores with human annotations.

Workflow: Score Correlation

Based on production workflow: human_score_evaluator

{
  "init_params": {},
  "step_configs": {
    "filter_labeled": {
      "activity": "select_trajectories",
      "task_name": "jettyio/my-eval-task",
      "filter_by": {
        "labels": {
          "human_score": {
            "$in": ["1", "2", "3", "4", "5"]
          }
        }
      }
    },
    "correlate": {
      "activity": "visualize_correlation",
      "trajectory_path": "filter_labeled.outputs.selected_trajectories",
      "x": "labels[0].value",
      "y": "steps.evaluate.outputs.average_score"
    }
  },
  "steps": ["filter_labeled", "correlate"]
}

Custom Evaluation Rubrics

Example: Code Quality Rubric

{
  "evaluate_code": {
    "activity": "simple_judge",
    "model": "gpt-4o",
    "judge_type": "scale",
    "scale_range": [0, 100],
    "instruction": "Evaluate this code using the rubric:\n\n**Correctness (0-40):**\n- Handles all test cases\n- No logical errors\n\n**Efficiency (0-30):**\n- Optimal time complexity\n- Minimal space usage\n\n**Readability (0-30):**\n- Clear variable names\n- Appropriate comments\n\nProvide total score.",
    "item_path": "generate.outputs.content",
    "model_provider": "openai"
  }
}

Best Practices

1. Use Consistent Prompts

  • Keep evaluation instructions identical across comparisons
  • Version your prompts alongside workflows

2. Choose Appropriate Judge Models

  • Use stronger models (GPT-4, Claude-3-Opus) for nuanced evaluation
  • Faster models work for simple pass/fail checks

3. Handle Rate Limits

  • Set appropriate max_parallel values in list_emit_await
  • Use exponential backoff for retries
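For example, the run_all step from Step 4 could be throttled by adding max_parallel (the value 5 is illustrative; tune it to your provider's limits):

```json
"run_all": {
  "activity": "list_emit_await",
  "items_path": "init_params.test_prompts",
  "max_parallel": 5,
  "task_reference": { "task_name": "single-evaluation" },
  "data_mapping": {
    "prompt": "{{ $item }}",
    "model": "{{ init_params.model }}"
  }
}
```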

4. Track Evaluation Metrics

  • Store trajectory IDs for reproducibility
  • Monitor pass rates over time using trajectory selection

Next Steps