Evaluation Overview
Evaluation steps provide comprehensive assessment capabilities for Jetty workflows, enabling quality measurement, trajectory analysis, and LLM-as-judge evaluation systems. These steps support everything from basic trajectory filtering to batch evaluation pipelines.
Evaluation Framework
Core Concepts
Trajectory Evaluation: Assess workflow execution results, quality metrics, and performance indicators across multiple trajectories.
Quality Measurement: Quantitative and qualitative assessment using configurable metrics, scoring systems, and evaluation criteria.
LLM-as-Judge: Advanced evaluation using language models as intelligent judges for complex assessment tasks.
Evaluation Pipeline Architecture
Trajectory Selection → Evaluation Criteria → Assessment → Result Analysis
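As an illustrative sketch, these stages can be chained in a single workflow. The step wiring and the `pick_candidates.outputs.items` path are assumptions for illustration; consult the individual step docs for exact parameters:

```json
{
  "steps": [
    {
      "name": "pick_candidates",
      "activity": "select_trajectories",
      "filter_by": {"status": "completed"},
      "limit": 100
    },
    {
      "name": "score_candidates",
      "activity": "simple_judge",
      "config": {
        "items_path": "pick_candidates.outputs.items",
        "instruction": "Rate the overall quality of this trajectory",
        "judge_type": "scale",
        "scale_range": [1, 10]
      }
    }
  ]
}
```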
Available Steps
Basic Evaluation (2 steps)
Fundamental evaluation and analysis tools:
- select_trajectories - Advanced trajectory filtering and selection
- visualize_correlation - Correlation analysis and visualization
LLM-as-Judge (1 step)
LLM-based evaluation framework:
- simple_judge - Categorical or scale-based evaluation using any LLM
Common Patterns
Trajectory Selection
Filter and select trajectories for evaluation:
{
  "activity": "select_trajectories",
  "filter_by": {
    "status": "completed",
    "labels": {"reviewed": {"$eq": "true"}}
  },
  "limit": 500
}
LLM-as-Judge Configuration
Configure AI-powered evaluation:
{
  "activity": "simple_judge",
  "instruction": "Rate the quality of this content",
  "judge_type": "scale",
  "scale_range": [1, 10],
  "model": "gpt-4o",
  "temperature": 0.2
}
Categorical Evaluation
Binary or multi-category assessments:
{
  "activity": "simple_judge",
  "instruction": "Does this content meet quality standards?",
  "judge_type": "categorical",
  "categories": ["pass", "fail"],
  "model": "gpt-4o"
}
Evaluation Approaches
Quantitative Evaluation
Numerical metrics and statistical analysis:
- Performance metrics
- Accuracy measurements
- Statistical correlations
- Trend analysis
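For example, an accuracy measurement over judged outputs reduces to a simple match rate against reference labels. This is a generic sketch, not a built-in Jetty step:

```python
def accuracy(predictions, references):
    """Fraction of predictions that exactly match the reference labels."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    matches = sum(p == r for p, r in zip(predictions, references))
    return matches / len(references)

# Three of the four judged labels agree with the gold labels.
print(accuracy(["pass", "fail", "pass", "pass"],
               ["pass", "fail", "fail", "pass"]))  # 0.75
```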
Qualitative Evaluation
LLM-based assessment for complex criteria:
- Content quality
- Coherence and consistency
- Relevance and completeness
- Creative evaluation
Comparative Evaluation
Compare multiple trajectories or outputs:
- A/B testing
- Baseline comparisons
- Performance benchmarking
- Quality rankings
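A baseline comparison can be as simple as contrasting mean judge scores between two trajectory groups. A minimal sketch, assuming scores have already been extracted from the trajectories:

```python
from statistics import mean

def compare_to_baseline(candidate_scores, baseline_scores):
    """Report mean scores and the candidate's lift over the baseline."""
    cand, base = mean(candidate_scores), mean(baseline_scores)
    return {"candidate": cand, "baseline": base, "lift": cand - base}

result = compare_to_baseline([8, 9, 7, 8], [6, 7, 6, 7])
print(result)  # {'candidate': 8.0, 'baseline': 6.5, 'lift': 1.5}
```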
Assessment Patterns
Single-Point Evaluation
Evaluate individual outputs or trajectories:
{
  "steps": [
    {
      "name": "evaluate_output",
      "activity": "simple_judge",
      "config": {
        "item_path": "generator.outputs.text",
        "instruction": "Rate the quality of this content",
        "judge_type": "scale",
        "scale_range": [1, 10]
      }
    }
  ]
}
Batch Evaluation
Evaluate multiple items efficiently:
{
  "steps": [
    {
      "name": "batch_assessment",
      "activity": "simple_judge",
      "config": {
        "items_path": "collector.outputs.items",
        "instruction": "Evaluate each item for compliance",
        "judge_type": "categorical",
        "categories": ["pass", "fail"]
      }
    }
  ]
}
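Downstream of a batch run, the per-item verdicts can be tallied into a summary. A sketch in plain Python; the flat list of verdict strings is an assumption about how the batch results are collected:

```python
from collections import Counter

def summarize_verdicts(verdicts):
    """Tally categorical judge outputs and compute the pass rate."""
    counts = Counter(verdicts)
    total = sum(counts.values())
    return {"counts": dict(counts), "pass_rate": counts["pass"] / total}

# Hypothetical per-item results from a categorical batch judge run.
summary = summarize_verdicts(["pass", "pass", "fail", "pass"])
print(summary)  # {'counts': {'pass': 3, 'fail': 1}, 'pass_rate': 0.75}
```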
Quality Gate Pattern
Gate downstream steps on an explicit quality decision:
{
  "steps": [
    {
      "name": "quality_gate",
      "activity": "simple_judge",
      "config": {
        "item_path": "generator.outputs.text",
        "instruction": "Does this meet publication quality standards?",
        "judge_type": "categorical",
        "categories": ["approved", "needs_revision", "rejected"]
      }
    }
  ]
}
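The gate's verdict can then drive routing logic. A minimal sketch, assuming the judge returns one of the three category strings; the action names are illustrative:

```python
def route(verdict):
    """Map a quality-gate verdict to the next action in the workflow."""
    actions = {
        "approved": "publish",
        "needs_revision": "send_back_to_author",
        "rejected": "archive",
    }
    try:
        return actions[verdict]
    except KeyError:
        raise ValueError(f"unexpected verdict: {verdict!r}")

print(route("needs_revision"))  # send_back_to_author
```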
Quality Measurement Approaches
Metric-Based Assessment
- Accuracy: Correctness of outputs
- Completeness: Coverage of requirements
- Consistency: Uniformity across evaluations
- Performance: Speed and resource usage
Criteria-Based Scoring
- Weighted Scoring: Multi-criteria evaluation with weights
- Threshold-Based: Pass/fail based on minimum scores
- Comparative Ranking: Relative quality assessment
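Weighted scoring and threshold-based gating combine naturally. A generic sketch with illustrative criteria and weights:

```python
def weighted_score(scores, weights):
    """Combine per-criterion scores using normalized weights."""
    total_weight = sum(weights.values())
    return sum(scores[c] * w for c, w in weights.items()) / total_weight

scores = {"accuracy": 9, "completeness": 7, "consistency": 8}
weights = {"accuracy": 0.5, "completeness": 0.3, "consistency": 0.2}
overall = weighted_score(scores, weights)
print(round(overall, 2))  # 8.2
print(overall >= 7.5)     # True (threshold-based pass/fail)
```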
Statistical Analysis
- Correlation Analysis: Relationship between variables
- Distribution Analysis: Pattern identification
- Trend Detection: Performance over time
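For instance, correlation analysis can test whether one trajectory variable tracks another, such as whether longer outputs tend to receive higher judge scores. A self-contained Pearson correlation sketch with made-up data:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical data: output length vs. judge score per trajectory.
lengths = [120, 340, 560, 780, 1000]
scores = [4, 5, 7, 8, 9]
print(round(pearson(lengths, scores), 3))  # 0.991
```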
Best Practices
Evaluation Design
- Define clear evaluation criteria upfront
- Use appropriate evaluation types for your use case
- Balance automated and manual evaluation
- Implement consistent scoring systems
Performance Optimization
- Use batch evaluation for multiple items
- Choose appropriate models for task complexity
- Lower temperature for consistent judgments
- Monitor evaluation costs and latency
Quality Assurance
- Validate evaluation criteria regularly
- Document evaluation methodologies
- Monitor result distributions
- Test with known examples
Integration Points
With Data Processing
- Evaluate processed data quality
- Assess transformation accuracy
- Validate pipeline outputs
With AI Models
- Evaluate model outputs
- Compare model performance
- Assess generation quality
With Control Flow
- Evaluate child workflow outputs
- Aggregate scores across trajectories
- Implement evaluation loops
Next Steps
- Simple Judge - LLM-as-judge evaluation
- Basic Evaluation Steps - Trajectory selection and correlation
- Evaluation Patterns - Best practices and workflow examples
Getting Help
- Review individual step documentation for parameters
- Check the Flow Library for evaluation examples
- See Step Library for available workflow steps