Evaluation Patterns
This guide provides comprehensive patterns and best practices for building evaluation workflows in Jetty. Learn how to combine basic evaluation steps with the Simple Judge to create sophisticated assessment pipelines.
Core Evaluation Patterns
Quality Gate Pattern
Implement pass/fail checkpoints in your workflows to enforce quality standards before content moves downstream.
{
"name": "content_quality_gate",
"steps": [
{
"name": "generate_content",
"activity": "litellm_chat",
"config": {
"prompt": "Write an article about {{topic}}"
}
},
{
"name": "quality_check",
"activity": "simple_judge",
"config": {
"item_path": "generate_content.outputs.text",
"instruction": "Does this article meet these criteria:\n1. Minimum 500 words\n2. Factually accurate\n3. Well-structured with intro, body, conclusion\n4. No grammatical errors",
"judge_type": "categorical",
"categories": ["pass", "fail"]
}
},
{
"name": "proceed_if_passed",
"condition": "quality_check.outputs.results[0].judgment == 'pass'",
"activity": "publish_content"
},
{
"name": "retry_if_failed",
"condition": "quality_check.outputs.results[0].judgment == 'fail'",
"activity": "regenerate_with_feedback",
"config": {
"feedback": "quality_check.outputs.results[0].explanation"
}
}
]
}
Multi-Stage Evaluation Pipeline
Assess different quality dimensions with sequential judges, then branch on the results.
{
"name": "comprehensive_evaluation",
"steps": [
{
"name": "technical_review",
"activity": "simple_judge",
"config": {
"item_path": "document.outputs.text",
"instruction": "Evaluate technical accuracy and correctness",
"judge_type": "scale",
"scale_range": [1, 10]
}
},
{
"name": "readability_assessment",
"activity": "simple_judge",
"config": {
"item_path": "document.outputs.text",
"instruction": "Assess readability for target audience (technical professionals)",
"judge_type": "scale",
"scale_range": [1, 10]
}
},
{
"name": "completeness_check",
"activity": "simple_judge",
"config": {
"item_path": "document.outputs.text",
"instruction": "Check if all required sections are present and complete",
"judge_type": "scale",
"scale_range": [1, 10]
}
},
{
"name": "final_decision",
"activity": "conditional_branch",
"config": {
"value_path": "technical_review.outputs.average_score",
"condition_type": "greater_equal",
"condition_value": 7.0,
"true_output": {"decision": "approved"},
"false_output": {"decision": "needs_revision"}
}
}
]
}
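The final_decision step above branches only on the technical score. If all three dimensions should gate approval, one option is a weighted composite. The sketch below shows the idea in plain Python; the weights, threshold, and stage names are illustrative assumptions, not Jetty defaults.

```python
# Sketch: combine several stage scores into one gating decision.
# Weights and the 7.0 threshold are illustrative, not Jetty defaults.

def composite_decision(scores, weights, threshold=7.0):
    """Weighted average of stage scores -> approved / needs_revision."""
    total = sum(scores[name] * w for name, w in weights.items())
    composite = total / sum(weights.values())
    return {"score": composite,
            "decision": "approved" if composite >= threshold else "needs_revision"}

result = composite_decision(
    {"technical_review": 8, "readability_assessment": 7, "completeness_check": 6},
    {"technical_review": 0.5, "readability_assessment": 0.25, "completeness_check": 0.25},
)
```

With these example weights the composite is 7.25, so the document clears a 7.0 bar even though one dimension scored below it.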
A/B Testing Pattern
Compare different versions or approaches systematically.
{
"name": "ab_test_evaluation",
"steps": [
{
"name": "generate_version_a",
"activity": "litellm_chat",
"config": {
"model": "gpt-4o",
"prompt": "Write a product description..."
}
},
{
"name": "generate_version_b",
"activity": "litellm_chat",
"config": {
"model": "claude-3-5-sonnet-20241022",
"prompt": "Write a product description..."
}
},
{
"name": "evaluate_version_a",
"activity": "simple_judge",
"config": {
"item_path": "generate_version_a.outputs.text",
"instruction": "Rate the quality and effectiveness of this content",
"judge_type": "scale",
"scale_range": [1, 10],
"with_explanation": true
}
},
{
"name": "evaluate_version_b",
"activity": "simple_judge",
"config": {
"item_path": "generate_version_b.outputs.text",
"instruction": "Rate the quality and effectiveness of this content",
"judge_type": "scale",
"scale_range": [1, 10],
"with_explanation": true
}
}
]
}
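The workflow above produces two scores but leaves the comparison implicit. A downstream step might pick the winner along these lines; treating small score gaps as a tie guards against reading noise as a real difference. The function and the 0.5 gap are assumptions for illustration.

```python
# Sketch: decide an A/B winner from two judge scores, treating small
# gaps as a tie. The min_gap value is an illustrative assumption.

def pick_winner(score_a, score_b, min_gap=0.5):
    if abs(score_a - score_b) < min_gap:
        return "tie"
    return "version_a" if score_a > score_b else "version_b"

clear_win = pick_winner(8.2, 6.9)
too_close = pick_winner(7.0, 7.2)
```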
Progressive Improvement Pattern
Iteratively improve outputs based on evaluation feedback.
{
"name": "progressive_improvement",
"steps": [
{
"name": "initial_generation",
"activity": "generate_content"
},
{
"name": "improvement_loop",
"type": "loop",
"max_iterations": 3,
"steps": [
{
"name": "evaluate_current",
"activity": "verdict_judge",
"config": {
"input_data": "current_version.outputs.text",
"evaluation_prompt": "Evaluate and provide specific improvement suggestions",
"evaluation_type": "scale",
"scale_range": [1, 10],
"include_rationale": true,
"require_suggestions": true
}
},
{
"name": "check_quality",
"condition": "evaluate_current.outputs.score >= 8",
"activity": "exit_loop"
},
{
"name": "apply_improvements",
"activity": "improve_content",
"config": {
"original": "current_version.outputs.text",
"suggestions": "evaluate_current.outputs.suggestions"
}
}
]
}
]
}
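The loop above can be sketched in plain code: evaluate, exit once the quality bar is reached, otherwise feed the suggestions back into a revision. The `evaluate` and `improve` callables below are stand-ins for the judge and improvement activities, not real Jetty APIs.

```python
# Sketch of the improvement loop: evaluate, exit at the quality bar,
# otherwise apply suggestions. `evaluate` and `improve` are stand-ins
# for the judge and improvement activities.

def progressive_improve(text, evaluate, improve, target=8, max_iterations=3):
    for _ in range(max_iterations):
        verdict = evaluate(text)
        if verdict["score"] >= target:
            break
        text = improve(text, verdict["suggestions"])
    return text

# Toy example: each revision raises the score (v1 -> v2 -> v3).
scores = {"v1": 5, "v2": 7, "v3": 9}
evaluate = lambda t: {"score": scores[t], "suggestions": "tighten prose"}
improve = lambda t, s: {"v1": "v2", "v2": "v3"}[t]
final = progressive_improve("v1", evaluate, improve)
```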
Consensus Evaluation Pattern
Use multiple judges to achieve reliable assessments.
{
"name": "consensus_evaluation",
"steps": [
{
"name": "multi_judge_evaluation",
"activity": "verdict_pipeline",
"config": {
"pipeline_type": "multi-judge",
"input_data": "submission.outputs.content",
"stages": [
{
"name": "technical_expert",
"model": "gpt-4",
"evaluation_prompt": "As a technical expert, evaluate accuracy and depth",
"evaluation_type": "categorical",
"categories": ["excellent", "good", "fair", "poor"]
},
{
"name": "domain_expert",
"model": "gpt-4",
"evaluation_prompt": "As a domain expert, evaluate relevance and applicability",
"evaluation_type": "categorical",
"categories": ["excellent", "good", "fair", "poor"]
},
{
"name": "communication_expert",
"model": "gpt-4",
"evaluation_prompt": "As a communication expert, evaluate clarity and structure",
"evaluation_type": "categorical",
"categories": ["excellent", "good", "fair", "poor"]
}
],
"aggregation_method": "majority_vote",
"require_consensus_threshold": 2
}
}
]
}
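The `majority_vote` aggregation with `require_consensus_threshold: 2` amounts to the following logic, sketched here with a hypothetical no-consensus return value of `None`:

```python
# Sketch: majority vote with a consensus threshold. Returns None when
# no category reaches the threshold (the exact fallback behavior in
# Jetty is an assumption here).
from collections import Counter

def majority_vote(judgments, threshold=2):
    category, count = Counter(judgments).most_common(1)[0]
    return category if count >= threshold else None

verdict = majority_vote(["good", "excellent", "good"])       # 2 of 3 agree
no_consensus = majority_vote(["good", "excellent", "poor"])  # all disagree
```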
Advanced Evaluation Workflows
Hierarchical Quality Assessment
Implement tiered evaluation based on initial screening.
{
"name": "hierarchical_assessment",
"steps": [
{
"name": "quick_screen",
"activity": "verdict_judge",
"config": {
"input_data": "submission.outputs.content",
"evaluation_prompt": "Quick quality check: Does this meet minimum standards?",
"evaluation_type": "binary",
"model": "gpt-3.5-turbo",
"temperature": 0
}
},
{
"name": "detailed_evaluation",
"condition": "quick_screen.outputs.result == 'pass'",
"activity": "verdict_pipeline",
"config": {
"pipeline_type": "custom",
"stages": [
{
"name": "deep_analysis",
"model": "gpt-4",
"evaluation_prompt": "Perform detailed analysis across multiple dimensions",
"evaluation_type": "scale",
"scale_range": [1, 100]
},
{
"name": "expert_review",
"condition": "deep_analysis.score >= 80",
"evaluation_prompt": "Provide expert-level assessment and recommendations"
}
]
}
}
]
}
Comparative Analysis Pattern
Score multiple options against weighted criteria and rank the results.
{
"name": "comparative_analysis",
"steps": [
{
"name": "collect_options",
"activity": "gather_alternatives",
"config": {
"count": 5
}
},
{
"name": "evaluate_all_options",
"activity": "verdict_batch",
"config": {
"items_path": "collect_options.outputs.alternatives",
"evaluation_config": {
"evaluation_prompt": "Evaluate this option against criteria:\n1. Cost-effectiveness\n2. Implementation complexity\n3. Expected impact\n4. Risk level",
"evaluation_type": "multi_criteria",
"criteria": {
"cost_effectiveness": {"weight": 0.3, "scale": [1, 10]},
"complexity": {"weight": 0.2, "scale": [1, 10], "inverse": true},
"impact": {"weight": 0.3, "scale": [1, 10]},
"risk": {"weight": 0.2, "scale": [1, 10], "inverse": true}
}
}
}
},
{
"name": "rank_options",
"activity": "calculate_rankings",
"config": {
"scores": "evaluate_all_options.outputs.scores",
"method": "weighted_sum"
}
}
]
}
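The `weighted_sum` ranking above, including the `inverse: true` criteria where lower raw scores are better, can be sketched as follows. Flipping an inverse score within its scale is one reasonable interpretation, not a documented Jetty behavior.

```python
# Sketch of the weighted_sum method above. For criteria marked
# inverse, the raw score is flipped within its scale so that lower
# raw values contribute higher weighted scores (an assumed convention).

def weighted_score(option, criteria):
    total = 0.0
    for name, spec in criteria.items():
        lo, hi = spec["scale"]
        raw = option[name]
        value = (hi + lo - raw) if spec.get("inverse") else raw
        total += spec["weight"] * value
    return total

criteria = {
    "cost_effectiveness": {"weight": 0.3, "scale": [1, 10]},
    "complexity":         {"weight": 0.2, "scale": [1, 10], "inverse": True},
    "impact":             {"weight": 0.3, "scale": [1, 10]},
    "risk":               {"weight": 0.2, "scale": [1, 10], "inverse": True},
}
option = {"cost_effectiveness": 8, "complexity": 3, "impact": 7, "risk": 2}
score = weighted_score(option, criteria)
```

Here low complexity (3) and low risk (2) flip to 8 and 9 on the 1-10 scale, giving a total of 7.9.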
Time Series Evaluation
Track quality metrics over time.
{
"name": "time_series_quality_tracking",
"steps": [
{
"name": "select_historical_outputs",
"activity": "select_trajectories",
"config": {
"filters": {
"workflow_id": "content_generation",
"date_range": {
"start": "{{30_days_ago}}",
"end": "{{today}}"
}
},
"sort_by": "created_at"
}
},
{
"name": "evaluate_historical_batch",
"activity": "verdict_batch",
"config": {
"items_path": "select_historical_outputs.outputs.trajectories",
"evaluation_config": {
"evaluation_prompt": "Rate content quality",
"evaluation_type": "scale",
"scale_range": [1, 10]
},
"batch_size": 50
}
},
{
"name": "visualize_trends",
"activity": "visualize_correlation",
"config": {
"trajectories": "evaluate_historical_batch.outputs.results",
"x_metric": "created_at",
"y_metric": "quality_score",
"plot_type": "line",
"plot_config": {
"title": "Quality Trend Over Time",
"show_moving_average": true,
"window_size": 7
}
}
}
]
}
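The `show_moving_average` plot option with `window_size: 7` implies a trailing moving average over the daily scores. A minimal sketch, using a partial window at the start of the series (whether Jetty does the same is an assumption):

```python
# Sketch: trailing moving average over a score series, with a partial
# window at the start rather than dropped points.

def moving_average(values, window=7):
    result = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        result.append(sum(chunk) / len(chunk))
    return result

smoothed = moving_average([6, 8, 7, 9, 10], window=3)
```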
Evaluation Criteria Design
Multi-Dimensional Rubrics
Create comprehensive evaluation rubrics for complex assessments.
{
"evaluation_rubric": {
"technical_accuracy": {
"weight": 0.3,
"levels": {
"excellent": "All technical details are correct, with citations",
"good": "Minor technical inaccuracies that don't affect understanding",
"fair": "Some technical errors that may confuse readers",
"poor": "Significant technical errors throughout"
}
},
"completeness": {
"weight": 0.25,
"levels": {
"excellent": "All required topics covered in depth",
"good": "Most topics covered adequately",
"fair": "Key topics present but lacking detail",
"poor": "Missing critical information"
}
},
"clarity": {
"weight": 0.25,
"levels": {
"excellent": "Crystal clear, well-organized, easy to follow",
"good": "Generally clear with minor confusion points",
"fair": "Somewhat unclear, requires effort to understand",
"poor": "Confusing, poorly organized, hard to follow"
}
},
"engagement": {
"weight": 0.2,
"levels": {
"excellent": "Highly engaging, maintains reader interest",
"good": "Reasonably engaging with good examples",
"fair": "Somewhat dry but informative",
"poor": "Boring, difficult to stay focused"
}
}
}
}
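To aggregate a rubric like this into a single number, the categorical levels need a numeric mapping. The sketch below uses an assumed 1-4 mapping and the weights from the rubric above; neither the mapping nor the helper is part of Jetty.

```python
# Sketch: convert categorical rubric levels into a weighted numeric
# score. The level-to-number mapping is an illustrative assumption.

LEVEL_VALUES = {"excellent": 4, "good": 3, "fair": 2, "poor": 1}

def rubric_score(ratings, rubric):
    return sum(rubric[dim]["weight"] * LEVEL_VALUES[level]
               for dim, level in ratings.items())

rubric = {
    "technical_accuracy": {"weight": 0.3},
    "completeness": {"weight": 0.25},
    "clarity": {"weight": 0.25},
    "engagement": {"weight": 0.2},
}
score = rubric_score(
    {"technical_accuracy": "excellent", "completeness": "good",
     "clarity": "good", "engagement": "fair"},
    rubric,
)
```

With these ratings the weighted score is 3.1 on the 1-4 scale, i.e. between "good" and "excellent" overall.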
Contextual Evaluation
Adapt evaluation criteria based on context.
{
"contextual_evaluation": {
"audience_based": {
"technical_audience": {
"prioritize": ["accuracy", "depth", "precision"],
"de_emphasize": ["simplicity", "broad_appeal"]
},
"general_audience": {
"prioritize": ["clarity", "relevance", "engagement"],
"de_emphasize": ["technical_depth", "jargon"]
}
},
"purpose_based": {
"educational": {
"criteria": ["clarity", "completeness", "examples", "progression"]
},
"reference": {
"criteria": ["accuracy", "comprehensiveness", "organization", "searchability"]
},
"persuasive": {
"criteria": ["argument_strength", "evidence", "emotional_appeal", "call_to_action"]
}
}
}
}
Performance Optimization Patterns
Efficient Batch Processing
Optimize evaluation of large datasets.
{
"name": "optimized_batch_evaluation",
"steps": [
{
"name": "prepare_batches",
"activity": "segment_data",
"config": {
"total_items": 10000,
"strategy": "smart_batching",
"criteria": {
"similar_length": true,
"similar_complexity": true,
"max_tokens_per_batch": 4000
}
}
},
{
"name": "parallel_evaluation",
"activity": "verdict_batch",
"config": {
"items_path": "prepare_batches.outputs.batches",
"batch_size": "dynamic",
"max_concurrent": 20,
"evaluation_config": {
"model": "gpt-3.5-turbo",
"temperature": 0,
"cache_results": true
}
}
},
{
"name": "aggregate_results",
"activity": "combine_batch_results",
"config": {
"maintain_order": true,
"calculate_statistics": true
}
}
]
}
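The `max_tokens_per_batch` constraint amounts to greedily packing items into batches under a token budget. The sketch below uses a whitespace word count as a crude stand-in for a real tokenizer; Jetty's actual batching strategy may differ.

```python
# Sketch: greedily pack items into batches under a token budget,
# as implied by max_tokens_per_batch. Word count stands in for a
# real tokenizer here.

def pack_batches(items, max_tokens=4000):
    batches, current, used = [], [], 0
    for item in items:
        tokens = len(item.split())  # crude tokenizer stand-in
        if current and used + tokens > max_tokens:
            batches.append(current)
            current, used = [], 0
        current.append(item)
        used += tokens
    if current:
        batches.append(current)
    return batches

batches = pack_batches(["a b c", "d e", "f g h i", "j"], max_tokens=5)
```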
Cached Evaluation Pipeline
Implement caching for repeated evaluations.
{
"name": "cached_evaluation",
"steps": [
{
"name": "check_cache",
"activity": "lookup_evaluation_cache",
"config": {
"content_hash": "{{content.hash}}",
"evaluation_version": "v2.1"
}
},
{
"name": "evaluate_if_needed",
"condition": "check_cache.outputs.cache_miss",
"activity": "verdict_judge",
"config": {
"input_data": "content.outputs.text",
"cache_result": true,
"cache_ttl": 86400
}
},
{
"name": "use_result",
"activity": "process_evaluation",
"config": {
"evaluation": "{{check_cache.outputs.cached_result || evaluate_if_needed.outputs.result}}"
}
}
]
}
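The `content_hash` and `cache_ttl` fields above suggest a key derived from the content plus the evaluation version, with time-based expiry. A minimal in-memory sketch (a real deployment would use a shared store, and the key format is an assumption):

```python
# Sketch: content-hash cache key plus TTL check, mirroring the
# content_hash / cache_ttl fields above. In-memory dict only.
import hashlib
import time

def cache_key(content, evaluation_version="v2.1"):
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    return f"{evaluation_version}:{digest}"

def get_cached(cache, key, ttl=86400, now=None):
    now = time.time() if now is None else now
    entry = cache.get(key)
    if entry and now - entry["stored_at"] < ttl:
        return entry["result"]
    return None  # cache miss

cache = {}
key = cache_key("Some article text")
cache[key] = {"result": {"score": 8}, "stored_at": 1000.0}
hit = get_cached(cache, key, ttl=86400, now=2000.0)
miss = get_cached(cache, key, ttl=500, now=2000.0)  # entry expired
```

Keying on a content hash means identical inputs always hit the same entry, while bumping `evaluation_version` invalidates the whole cache at once.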
Error Handling Patterns
Robust Evaluation with Fallbacks
Handle evaluation failures gracefully.
{
"name": "robust_evaluation",
"steps": [
{
"name": "primary_evaluation",
"activity": "verdict_judge",
"config": {
"model": "gpt-4",
"timeout": 30,
"retry_count": 2
},
"error_handler": {
"on_timeout": "fallback_evaluation",
"on_rate_limit": "queue_for_retry",
"on_parse_error": "simplified_evaluation"
}
},
{
"name": "fallback_evaluation",
"condition": "primary_evaluation.error",
"activity": "verdict_judge",
"config": {
"model": "gpt-3.5-turbo",
"simplified_prompt": true
}
},
{
"name": "simplified_evaluation",
"condition": "fallback_evaluation.error",
"activity": "rule_based_evaluation",
"config": {
"rules": ["length_check", "keyword_presence", "format_validation"]
}
}
]
}
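The fallback chain above can be sketched as trying evaluators in order of capability, ending with a rule-based check that cannot fail. The evaluator callables are stand-ins for the judge activities, not real Jetty APIs.

```python
# Sketch of the fallback chain: try evaluators in order, fall through
# on failure, end with a rule-based check. Evaluator callables are
# stand-ins for the judge activities above.

def evaluate_with_fallbacks(text, evaluators):
    for name, evaluator in evaluators:
        try:
            return {"evaluator": name, "result": evaluator(text)}
        except Exception:
            continue  # fall through to the next, simpler evaluator
    raise RuntimeError("all evaluators failed")

def rule_based(text):
    # Deterministic last resort: simple length check.
    return {"pass": len(text.split()) >= 3}

def flaky_primary(text):
    raise TimeoutError("model timed out")

outcome = evaluate_with_fallbacks(
    "short but valid text",
    [("primary", flaky_primary), ("rules", rule_based)],
)
```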
Integration Patterns
End-to-End Workflow Integration
Combine data processing, generation, and evaluation.
{
"name": "integrated_content_pipeline",
"steps": [
{
"name": "collect_data",
"activity": "read_text_file",
"config": {
"file_path": "sources/research_data.json"
}
},
{
"name": "generate_content",
"activity": "generate_article",
"config": {
"data": "collect_data.outputs.content",
"style": "technical"
}
},
{
"name": "initial_evaluation",
"activity": "verdict_judge",
"config": {
"input_data": "generate_content.outputs.article",
"evaluation_type": "scale",
"scale_range": [1, 10]
}
},
{
"name": "enhance_if_needed",
"condition": "initial_evaluation.outputs.score < 8",
"activity": "enhance_content",
"config": {
"feedback": "initial_evaluation.outputs.rationale"
}
},
{
"name": "final_evaluation",
"activity": "verdict_pipeline",
"config": {
"pipeline_type": "judge-verify",
"input_data": "{{enhance_if_needed.outputs.enhanced || generate_content.outputs.article}}"
}
},
{
"name": "save_approved",
"condition": "final_evaluation.outputs.final_score >= 8",
"activity": "save_text_file",
"config": {
"content": "final_evaluation.outputs.evaluated_content",
"file_path": "approved/{{workflow.run_id}}.md"
}
}
]
}
Comprehensive Data Processing + Evaluation Pipelines
Document Processing and Quality Assessment Pipeline
Complete workflow for processing multiple documents with comprehensive evaluation.
{
"name": "document_processing_quality_pipeline",
"description": "Process multiple documents, concatenate content, and evaluate quality with comprehensive assessment",
"steps": [
{
"name": "load_source_documents",
"activity": "read_text_file",
"config": {
"file_path": "sources/document_list.json"
}
},
{
"name": "process_document_batch",
"activity": "text_concatenate",
"config": {
"input_files": "load_source_documents.outputs.file_list",
"separator": "\n\n---\n\n",
"output_format": "markdown",
"include_source_headers": true
}
},
{
"name": "add_processing_metadata",
"activity": "add_image_metadata",
"config": {
"file_path": "process_document_batch.outputs.file_path",
"metadata": {
"processed_date": "{{timestamp}}",
"source_count": "{{load_source_documents.outputs.file_count}}",
"processing_pipeline": "document_processing_quality_pipeline"
}
}
},
{
"name": "comprehensive_quality_evaluation",
"activity": "verdict_pipeline",
"config": {
"pipeline_type": "custom",
"input_data": "process_document_batch.outputs.text",
"stages": [
{
"name": "content_completeness",
"evaluation_prompt": "Evaluate the completeness of this combined document. Check if:\n1. All sections are present and coherent\n2. No critical information is missing\n3. Document flow is logical\n4. Sources are properly integrated",
"evaluation_type": "scale",
"scale_range": [1, 10],
"weight": 0.3
},
{
"name": "technical_accuracy",
"evaluation_prompt": "Assess the technical accuracy of the content:\n1. Factual correctness\n2. Technical terminology usage\n3. Data consistency across sources\n4. Citation accuracy",
"evaluation_type": "scale",
"scale_range": [1, 10],
"weight": 0.4
},
{
"name": "readability_assessment",
"evaluation_prompt": "Evaluate document readability and structure:\n1. Clear section organization\n2. Consistent formatting\n3. Appropriate language level\n4. Logical information flow",
"evaluation_type": "scale",
"scale_range": [1, 10],
"weight": 0.3
}
],
"aggregation_method": "weighted_average"
}
},
{
"name": "quality_gate_decision",
"activity": "verdict_judge",
"config": {
"input_data": "process_document_batch.outputs.text",
"evaluation_prompt": "Based on the quality scores ({{comprehensive_quality_evaluation.outputs.final_score}}), determine if this document meets publication standards. Minimum score: 7.5",
"evaluation_type": "categorical",
"categories": ["publish", "revise", "reject"],
"category_descriptions": {
"publish": "Document meets all quality standards",
"revise": "Document needs improvements before publication",
"reject": "Document does not meet minimum standards"
}
}
},
{
"name": "save_approved_document",
"condition": "quality_gate_decision.outputs.result == 'publish'",
"activity": "save_text_file",
"config": {
"content": "process_document_batch.outputs.text",
"file_path": "published/documents/{{workflow.run_id}}_approved.md",
"metadata": {
"quality_score": "comprehensive_quality_evaluation.outputs.final_score",
"decision": "quality_gate_decision.outputs.result",
"evaluation_date": "{{timestamp}}"
}
}
},
{
"name": "flag_for_revision",
"condition": "quality_gate_decision.outputs.result == 'revise'",
"activity": "webhook_notify",
"config": {
"webhook_url": "https://api.example.com/revision-queue",
"payload": {
"document_id": "{{workflow.run_id}}",
"status": "needs_revision",
"quality_score": "comprehensive_quality_evaluation.outputs.final_score",
"feedback": "comprehensive_quality_evaluation.outputs.stage_results",
"file_path": "process_document_batch.outputs.file_path"
}
}
}
]
}
Data Analysis and Evaluation Workflow
Comprehensive pipeline for shipping data analysis with multi-criteria evaluation.
{
"name": "shipping_data_analysis_evaluation",
"description": "Analyze shipping rate data, perform aggregations, and evaluate results with business criteria",
"steps": [
{
"name": "load_shipping_data",
"activity": "read_text_file",
"config": {
"file_path": "data/shipping_rates_2024.csv"
}
},
{
"name": "analyze_port_pairs",
"activity": "port_pairs_rate_aggregator",
"config": {
"data_source": "load_shipping_data.outputs.content",
"analysis_type": "comprehensive",
"aggregation_methods": ["mean", "median", "percentile_95"],
"group_by": ["origin_port", "destination_port", "service_type"],
"time_period": "2024"
}
},
{
"name": "calculate_rate_priorities",
"activity": "rate_option_prioritizer",
"config": {
"rate_data": "analyze_port_pairs.outputs.aggregated_rates",
"criteria": {
"cost_effectiveness": 0.4,
"service_reliability": 0.3,
"transit_time": 0.2,
"capacity_availability": 0.1
},
"optimization_target": "balanced"
}
},
{
"name": "generate_analysis_report",
"activity": "text_concatenate",
"config": {
"sections": [
{
"title": "Executive Summary",
"content": "analyze_port_pairs.outputs.summary"
},
{
"title": "Rate Analysis Results",
"content": "analyze_port_pairs.outputs.detailed_analysis"
},
{
"title": "Prioritized Recommendations",
"content": "calculate_rate_priorities.outputs.recommendations"
}
],
"output_format": "markdown"
}
},
{
"name": "evaluate_analysis_quality",
"activity": "verdict_pipeline",
"config": {
"pipeline_type": "multi-judge",
"input_data": "generate_analysis_report.outputs.text",
"stages": [
{
"name": "data_analyst_review",
"evaluation_prompt": "As a shipping data analyst, evaluate this analysis for:\n1. Data completeness and accuracy\n2. Statistical methodology correctness\n3. Industry relevance of insights\n4. Actionability of recommendations",
"evaluation_type": "categorical",
"categories": ["excellent", "good", "acceptable", "poor"],
"model": "gpt-4"
},
{
"name": "business_stakeholder_review",
"evaluation_prompt": "From a business perspective, assess this analysis for:\n1. Strategic value and insights\n2. Cost optimization opportunities\n3. Risk assessment coverage\n4. Implementation feasibility",
"evaluation_type": "categorical",
"categories": ["excellent", "good", "acceptable", "poor"],
"model": "gpt-4"
},
{
"name": "technical_reviewer",
"evaluation_prompt": "Evaluate the technical aspects:\n1. Data processing methodology\n2. Statistical rigor\n3. Visualization quality\n4. Documentation completeness",
"evaluation_type": "categorical",
"categories": ["excellent", "good", "acceptable", "poor"],
"model": "gpt-4"
}
],
"aggregation_method": "majority_vote",
"require_consensus_threshold": 2
}
},
{
"name": "business_impact_assessment",
"activity": "verdict_judge",
"config": {
"input_data": "calculate_rate_priorities.outputs.recommendations",
"evaluation_prompt": "Evaluate the potential business impact of these recommendations:\n1. Estimated cost savings potential\n2. Implementation complexity\n3. Risk level and mitigation strategies\n4. Timeline for realizing benefits",
"evaluation_type": "scale",
"scale_range": [1, 10],
"include_rationale": true
}
},
{
"name": "select_historical_comparisons",
"activity": "select_trajectories",
"config": {
"filters": {
"workflow_type": "shipping_analysis",
"date_range": {
"start": "2023-01-01",
"end": "2023-12-31"
},
"status": "completed"
},
"sort_by": "quality_score",
"limit": 5
}
},
{
"name": "benchmark_against_historical",
"activity": "visualize_correlation",
"config": {
"current_analysis": "evaluate_analysis_quality.outputs.consensus_score",
"historical_data": "select_historical_comparisons.outputs.trajectories",
"metrics": ["quality_score", "business_impact", "implementation_success"],
"visualization_type": "comparative_analysis",
"output_format": "interactive_chart"
}
},
{
"name": "final_approval_decision",
"activity": "verdict_judge",
"config": {
"input_data": "generate_analysis_report.outputs.text",
"evaluation_prompt": "Make final approval decision based on:\n- Analysis quality: {{evaluate_analysis_quality.outputs.consensus_result}}\n- Business impact: {{business_impact_assessment.outputs.score}}\n- Historical performance: {{benchmark_against_historical.outputs.correlation_score}}\n\nMinimum thresholds: Quality >= 'good', Impact >= 7, Historical correlation >= 0.7",
"evaluation_type": "binary",
"positive_label": "approved",
"negative_label": "requires_revision"
}
},
{
"name": "publish_approved_analysis",
"condition": "final_approval_decision.outputs.result == 'approved'",
"activity": "save_text_file",
"config": {
"content": "generate_analysis_report.outputs.text",
"file_path": "reports/shipping_analysis/{{date}}_{{workflow.run_id}}.md",
"metadata": {
"analysis_quality": "evaluate_analysis_quality.outputs.consensus_result",
"business_impact_score": "business_impact_assessment.outputs.score",
"approval_status": "approved",
"publication_date": "{{timestamp}}"
}
}
},
{
"name": "notify_stakeholders",
"condition": "final_approval_decision.outputs.result == 'approved'",
"activity": "webhook_notify",
"config": {
"webhook_url": "https://api.company.com/notifications/shipping-analysis",
"payload": {
"analysis_id": "{{workflow.run_id}}",
"status": "published",
"quality_metrics": {
"analysis_quality": "evaluate_analysis_quality.outputs.consensus_result",
"business_impact": "business_impact_assessment.outputs.score",
"historical_benchmark": "benchmark_against_historical.outputs.correlation_score"
},
"report_url": "reports/shipping_analysis/{{date}}_{{workflow.run_id}}.md",
"recommendations_summary": "calculate_rate_priorities.outputs.top_recommendations"
}
}
}
]
}
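The thresholds in `final_approval_decision` (quality at least "good", impact at least 7, correlation at least 0.7) can be expressed as a plain check. The ordering of the categorical quality levels below is an assumption based on the categories used earlier in this pipeline.

```python
# Sketch: the approval thresholds from final_approval_decision as a
# plain predicate. The categorical ordering is an assumption.

QUALITY_ORDER = ["poor", "acceptable", "good", "excellent"]

def approve(quality, impact, correlation,
            min_quality="good", min_impact=7, min_correlation=0.7):
    quality_ok = QUALITY_ORDER.index(quality) >= QUALITY_ORDER.index(min_quality)
    return quality_ok and impact >= min_impact and correlation >= min_correlation

decision = approve("good", 8, 0.75)      # meets all three thresholds
blocked = approve("excellent", 6, 0.9)   # impact below the minimum
```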
Multi-Modal Content Processing Pipeline
Advanced workflow combining image processing, data analysis, and comprehensive evaluation.
{
"name": "multimodal_content_quality_pipeline",
"description": "Process images, analyze content, and perform comprehensive quality evaluation",
"steps": [
{
"name": "download_content_images",
"activity": "download_image",
"config": {
"image_urls": [
"https://example.com/content/chart1.png",
"https://example.com/content/infographic.jpg",
"https://example.com/content/diagram.svg"
],
"output_directory": "images/{{workflow.run_id}}/",
"quality": "high",
"format": "original"
}
},
{
"name": "extract_image_metadata",
"activity": "add_image_metadata",
"config": {
"image_paths": "download_content_images.outputs.downloaded_files",
"extract_metadata": {
"technical": ["dimensions", "format", "color_profile", "compression"],
"content": ["text_content", "accessibility_description", "visual_elements"],
"quality": ["resolution", "clarity_score", "visual_appeal"]
}
}
},
{
"name": "load_accompanying_text",
"activity": "read_text_file",
"config": {
"file_path": "content/article_{{workflow.run_id}}.md"
}
},
{
"name": "combine_multimodal_content",
"activity": "text_concatenate",
"config": {
"sections": [
{
"title": "Article Content",
"content": "load_accompanying_text.outputs.content"
},
{
"title": "Visual Elements",
"content": "extract_image_metadata.outputs.content_descriptions"
},
{
"title": "Technical Specifications",
"content": "extract_image_metadata.outputs.technical_metadata"
}
],
"output_format": "structured_markdown"
}
},
{
"name": "comprehensive_content_evaluation",
"activity": "verdict_pipeline",
"config": {
"pipeline_type": "hierarchical",
"input_data": "combine_multimodal_content.outputs.text",
"stages": [
{
"name": "technical_quality_check",
"evaluation_prompt": "Evaluate technical quality of images and content:\n1. Image resolution and clarity\n2. Text readability and formatting\n3. Technical accuracy of visual elements\n4. Accessibility compliance",
"evaluation_type": "binary",
"positive_label": "meets_technical_standards",
"negative_label": "technical_issues_found"
},
{
"name": "content_quality_assessment",
"condition": "technical_quality_check.result == 'meets_technical_standards'",
"evaluation_prompt": "Assess content quality across modalities:\n1. Text-image coherence and complementarity\n2. Information accuracy and completeness\n3. Visual appeal and engagement\n4. Educational or informational value",
"evaluation_type": "scale",
"scale_range": [1, 10]
},
{
"name": "audience_suitability_review",
"condition": "content_quality_assessment.score >= 7",
"evaluation_prompt": "Evaluate suitability for target audience:\n1. Appropriate complexity level\n2. Cultural sensitivity and inclusivity\n3. Engagement potential\n4. Learning objective alignment",
"evaluation_type": "categorical",
"categories": ["highly_suitable", "suitable", "needs_adaptation", "unsuitable"]
}
]
}
},
{
"name": "accessibility_compliance_check",
"activity": "verdict_judge",
"config": {
"input_data": "extract_image_metadata.outputs.accessibility_analysis",
"evaluation_prompt": "Evaluate accessibility compliance:\n1. Alternative text quality and descriptiveness\n2. Color contrast ratios for text/background\n3. Image complexity and description adequacy\n4. WCAG 2.1 AA compliance level",
"evaluation_type": "categorical",
"categories": ["fully_compliant", "mostly_compliant", "partially_compliant", "non_compliant"],
"include_rationale": true
}
},
{
"name": "batch_evaluate_historical_content",
"activity": "verdict_batch",
"config": {
"items_path": "select_similar_content.outputs.historical_items",
"evaluation_config": {
"evaluation_prompt": "Compare this historical content to current quality standards:\n1. Technical quality improvements\n2. Content depth and accuracy\n3. Visual design evolution\n4. Accessibility improvements",
"evaluation_type": "scale",
"scale_range": [1, 10],
"include_rationale": true
},
"batch_size": 5,
"max_concurrent": 3
}
},
{
"name": "quality_trend_analysis",
"activity": "visualize_correlation",
"config": {
"current_score": "comprehensive_content_evaluation.outputs.final_score",
"historical_scores": "batch_evaluate_historical_content.outputs.scores",
"metrics": ["technical_quality", "content_quality", "accessibility_score"],
"analysis_type": "trend_analysis",
"time_dimension": "content_creation_date"
}
},
{
"name": "publication_decision",
"activity": "verdict_judge",
"config": {
"input_data": "combine_multimodal_content.outputs.text",
"evaluation_prompt": "Make publication decision based on comprehensive evaluation:\n- Content Quality: {{comprehensive_content_evaluation.outputs.final_score}}\n- Accessibility: {{accessibility_compliance_check.outputs.result}}\n- Historical Benchmark: {{quality_trend_analysis.outputs.trend_score}}\n\nPublication criteria: Content >= 8.0, Accessibility >= 'mostly_compliant', Trend >= baseline",
"evaluation_type": "categorical",
"categories": ["publish_immediately", "publish_with_minor_edits", "major_revision_required", "reject"],
"include_rationale": true
}
},
{
"name": "save_approved_content",
"condition": "publication_decision.outputs.result == 'publish_immediately'",
"activity": "save_text_file",
"config": {
"content": "combine_multimodal_content.outputs.text",
"file_path": "published/multimodal/{{date}}/{{workflow.run_id}}.md",
"metadata": {
"content_quality_score": "comprehensive_content_evaluation.outputs.final_score",
"accessibility_rating": "accessibility_compliance_check.outputs.result",
"publication_decision": "publication_decision.outputs.result",
"image_count": "download_content_images.outputs.file_count",
"quality_trend": "quality_trend_analysis.outputs.trend_direction"
}
}
},
{
"name": "notify_content_team",
"activity": "webhook_notify",
"config": {
"webhook_url": "https://api.cms.company.com/content-status",
"payload": {
"content_id": "{{workflow.run_id}}",
"decision": "publication_decision.outputs.result",
"quality_metrics": {
"overall_score": "comprehensive_content_evaluation.outputs.final_score",
"accessibility_compliance": "accessibility_compliance_check.outputs.result",
"technical_quality": "comprehensive_content_evaluation.outputs.stage_scores.technical_quality_check",
"trend_performance": "quality_trend_analysis.outputs.trend_score"
},
"next_actions": "publication_decision.outputs.rationale",
"published_location": "{{save_approved_content.outputs.file_path || 'pending_revision'}}"
}
}
}
]
}
Monitoring and Analytics
Evaluation Metrics Dashboard
Track evaluation performance over time.
{
"name": "evaluation_analytics",
"steps": [
{
"name": "collect_metrics",
"activity": "select_trajectories",
"config": {
"filters": {
"workflow_id": "evaluation_workflow",
"date_range": {"days": 7}
}
}
},
{
"name": "calculate_statistics",
"activity": "aggregate_metrics",
"config": {
"metrics": [
"average_score",
"score_distribution",
"pass_rate",
"evaluation_time",
"cost_per_evaluation"
],
"group_by": ["model", "evaluation_type"]
}
},
{
"name": "generate_report",
"activity": "create_dashboard",
"config": {
"visualizations": [
"score_trend_line",
"pass_rate_bar",
"cost_breakdown_pie",
"model_performance_heatmap"
]
}
}
]
}
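Two of the dashboard metrics above, average score and pass rate, reduce to simple aggregations over the collected evaluation records. The record shape and the pass threshold in this sketch are assumptions.

```python
# Sketch: average score and pass rate over evaluation records.
# The record field names and the threshold are assumptions.

def summarize(records, pass_threshold=7):
    scores = [r["score"] for r in records]
    return {
        "average_score": sum(scores) / len(scores),
        "pass_rate": sum(s >= pass_threshold for s in scores) / len(scores),
    }

stats = summarize([{"score": 9}, {"score": 6}, {"score": 8}, {"score": 5}])
```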
Best Practices Summary
Design Principles
- Clear Criteria: Define specific, measurable evaluation criteria
- Appropriate Granularity: Match evaluation complexity to use case
- Consistent Prompts: Use structured, reproducible evaluation prompts
- Balanced Scoring: Weight criteria appropriately for context
Implementation Guidelines
- Start Simple: Begin with basic evaluations, add complexity as needed
- Test Thoroughly: Validate evaluation prompts with known examples
- Monitor Consistency: Track inter-rater reliability over time
- Optimize Costs: Use appropriate models and caching strategies
Performance Tips
- Batch Similar Items: Group similar evaluations for efficiency
- Cache Aggressively: Store results for repeated evaluations
- Use Appropriate Models: Match model complexity to evaluation needs
- Implement Fallbacks: Handle failures gracefully
Quality Assurance
- Validate Results: Periodically verify evaluation accuracy
- Track Metrics: Monitor evaluation performance and costs
- Iterate Prompts: Refine evaluation criteria based on results
- Document Decisions: Record evaluation design rationale
Related Documentation
Data Processing Integration
- Data Processing Overview - Data pipeline patterns and concepts
- Tools Steps - File operations and utility steps
Evaluation Framework
- Simple Judge - LLM-as-judge evaluation
- Basic Evaluation Steps - Trajectory selection and correlation analysis
- Evaluation Overview - Framework concepts and assessment approaches
Control Flow
- Control Flow Steps - Parallel processing and orchestration
- Iteration Steps - Loops and conditionals
External Resources
- Workflow Examples - Complete evaluation workflows
- Step Library - Available workflow steps