Evaluation Overview

Evaluation steps assess the quality of Jetty workflow runs, covering quality measurement, trajectory analysis, and LLM-as-judge evaluation. They support everything from basic trajectory filtering to batch evaluation pipelines.

Evaluation Framework

Core Concepts

Trajectory Evaluation: Assess workflow execution results, quality metrics, and performance indicators across multiple trajectories.

Quality Measurement: Quantitative and qualitative assessment using configurable metrics, scoring systems, and evaluation criteria.

LLM-as-Judge: Evaluation that uses a language model as the judge for criteria that are hard to capture in fixed metrics, such as content quality or coherence.

Evaluation Pipeline Architecture

Trajectory Selection → Evaluation Criteria → Assessment → Result Analysis
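
A minimal pipeline that wires these stages together could look like the sketch below. The step names, the config wrapper on select_trajectories, and the pick_candidates.outputs.items reference are assumptions about how steps chain; the individual parameters match the examples later on this page:

{
  "steps": [
    {
      "name": "pick_candidates",
      "activity": "select_trajectories",
      "config": {
        "filter_by": {"status": "completed"},
        "limit": 100
      }
    },
    {
      "name": "judge_candidates",
      "activity": "simple_judge",
      "config": {
        "items_path": "pick_candidates.outputs.items",
        "instruction": "Rate the overall quality of this trajectory",
        "judge_type": "scale",
        "scale_range": [1, 10]
      }
    }
  ]
}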

Available Steps

Basic Evaluation (2 steps)

Fundamental evaluation and analysis tools:

  • select_trajectories - Advanced trajectory filtering and selection
  • visualize_correlation - Correlation analysis and visualization

LLM-as-Judge (1 step)

LLM-based evaluation framework:

  • simple_judge - Categorical or scale-based evaluation using any LLM

Common Patterns

Trajectory Selection

Filter and select trajectories for evaluation:

{
  "activity": "select_trajectories",
  "filter_by": {
    "status": "completed",
    "labels": {"reviewed": {"$eq": "true"}}
  },
  "limit": 500
}

LLM-as-Judge Configuration

Configure AI-powered evaluation:

{
  "activity": "simple_judge",
  "instruction": "Rate the quality of this content",
  "judge_type": "scale",
  "scale_range": [1, 10],
  "model": "gpt-4o",
  "temperature": 0.2
}

Categorical Evaluation

Binary or multi-category assessments:

{
  "activity": "simple_judge",
  "instruction": "Does this content meet quality standards?",
  "judge_type": "categorical",
  "categories": ["pass", "fail"],
  "model": "gpt-4o"
}

Evaluation Approaches

Quantitative Evaluation

Numerical metrics and statistical analysis:

  • Performance metrics
  • Accuracy measurements
  • Statistical correlations
  • Trend analysis

Qualitative Evaluation

LLM-based assessment for complex criteria:

  • Content quality
  • Coherence and consistency
  • Relevance and completeness
  • Creative evaluation

Comparative Evaluation

Compare multiple trajectories or outputs:

  • A/B testing (see the sketch after this list)
  • Baseline comparisons
  • Performance benchmarking
  • Quality rankings
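
One way to run an A/B comparison with the steps on this page is a categorical judge whose categories name the candidates. The step name and the comparison_builder.outputs.pair path are illustrative; judge_type, categories, and model are the documented simple_judge parameters:

{
  "steps": [
    {
      "name": "ab_comparison",
      "activity": "simple_judge",
      "config": {
        "item_path": "comparison_builder.outputs.pair",
        "instruction": "Given outputs A and B for the same prompt, which better satisfies the requirements?",
        "judge_type": "categorical",
        "categories": ["output_a", "output_b", "tie"]
      }
    }
  ]
}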

Assessment Patterns

Single-Point Evaluation

Evaluate individual outputs or trajectories:

{
  "steps": [
    {
      "name": "evaluate_output",
      "activity": "simple_judge",
      "config": {
        "item_path": "generator.outputs.text",
        "instruction": "Rate the quality of this content",
        "judge_type": "scale",
        "scale_range": [1, 10]
      }
    }
  ]
}

Batch Evaluation

Evaluate multiple items efficiently:

{
  "steps": [
    {
      "name": "batch_assessment",
      "activity": "simple_judge",
      "config": {
        "items_path": "collector.outputs.items",
        "instruction": "Evaluate each item for compliance",
        "judge_type": "categorical",
        "categories": ["pass", "fail"]
      }
    }
  ]
}

Quality Gate Pattern
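
Gate content on a categorical decision before it moves to downstream steps: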

{
  "steps": [
    {
      "name": "quality_gate",
      "activity": "simple_judge",
      "config": {
        "item_path": "generator.outputs.text",
        "instruction": "Does this meet publication quality standards?",
        "judge_type": "categorical",
        "categories": ["approved", "needs_revision", "rejected"]
      }
    }
  ]
}

Quality Measurement Approaches

Metric-Based Assessment

  • Accuracy: Correctness of outputs
  • Completeness: Coverage of requirements
  • Consistency: Uniformity across evaluations
  • Performance: Speed and resource usage

Criteria-Based Scoring

  • Weighted Scoring: Multi-criteria evaluation with weights (sketched after this list)
  • Threshold-Based: Pass/fail based on minimum scores
  • Comparative Ranking: Relative quality assessment
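
A weighted, multi-criteria setup can be approximated by running one scale judge per criterion and combining the scores afterwards. The step names below are illustrative and the aggregation itself happens in whatever downstream step your workflow uses; only the simple_judge parameters are documented on this page:

{
  "steps": [
    {
      "name": "score_accuracy",
      "activity": "simple_judge",
      "config": {
        "item_path": "generator.outputs.text",
        "instruction": "Rate the factual accuracy of this content",
        "judge_type": "scale",
        "scale_range": [1, 10]
      }
    },
    {
      "name": "score_completeness",
      "activity": "simple_judge",
      "config": {
        "item_path": "generator.outputs.text",
        "instruction": "Rate how completely this content covers the requirements",
        "judge_type": "scale",
        "scale_range": [1, 10]
      }
    }
  ]
}

A weighted sum of the two scores (for example, 0.7 × accuracy + 0.3 × completeness) gives the combined quality score; applying a minimum cutoff to that sum yields the threshold-based variant.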

Statistical Analysis

  • Correlation Analysis: Relationship between variables (see the sketch after this list)
  • Distribution Analysis: Pattern identification
  • Trend Detection: Performance over time
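
The visualize_correlation step listed under Basic Evaluation is the natural fit for correlation analysis. Its parameters are not documented on this page, so the field names below are placeholders to verify against the step's own documentation; only the activity name comes from this page:

{
  "activity": "visualize_correlation",
  "x_path": "trajectory.metrics.latency",
  "y_path": "trajectory.outputs.quality_score",
  "group_by": "model"
}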

Best Practices

Evaluation Design

  • Define clear evaluation criteria upfront
  • Use appropriate evaluation types for your use case
  • Balance automated and manual evaluation
  • Implement consistent scoring systems

Performance Optimization

  • Use batch evaluation for multiple items
  • Choose appropriate models for task complexity
  • Lower temperature for consistent judgments
  • Monitor evaluation costs and latency

Quality Assurance

  • Validate evaluation criteria regularly
  • Document evaluation methodologies
  • Monitor result distributions
  • Test with known examples

Integration Points

With Data Processing

  • Evaluate processed data quality
  • Assess transformation accuracy
  • Validate pipeline outputs

With AI Models

  • Evaluate model outputs
  • Compare model performance
  • Assess generation quality

With Control Flow

  • Evaluate child workflow outputs
  • Aggregate scores across trajectories
  • Implement evaluation loops

Next Steps

Getting Help

  • Review individual step documentation for parameters
  • Check the Flow Library for evaluation examples
  • See Step Library for available workflow steps