Evaluation Patterns

This guide provides patterns and best practices for building evaluation workflows in Jetty. It shows how to combine basic evaluation steps with the Simple Judge to create sophisticated assessment pipelines.

Core Evaluation Patterns

Quality Gate Pattern

Implement pass/fail checkpoints in your workflows to ensure quality standards.

{
"name": "content_quality_gate",
"steps": [
{
"name": "generate_content",
"activity": "litellm_chat",
"config": {
"prompt": "Write an article about {{topic}}"
}
},
{
"name": "quality_check",
"activity": "simple_judge",
"config": {
"item_path": "generate_content.outputs.text",
"instruction": "Does this article meet these criteria:\n1. Minimum 500 words\n2. Factually accurate\n3. Well-structured with intro, body, conclusion\n4. No grammatical errors",
"judge_type": "categorical",
"categories": ["pass", "fail"]
}
},
{
"name": "proceed_if_passed",
"condition": "quality_check.outputs.results[0].judgment == 'pass'",
"activity": "publish_content"
},
{
"name": "retry_if_failed",
"condition": "quality_check.outputs.results[0].judgment == 'fail'",
"activity": "regenerate_with_feedback",
"config": {
"feedback": "quality_check.outputs.results[0].explanation"
}
}
]
}

Multi-Stage Evaluation Pipeline

Build comprehensive evaluations by chaining sequential judges, each assessing a different quality dimension.

{
"name": "comprehensive_evaluation",
"steps": [
{
"name": "technical_review",
"activity": "simple_judge",
"config": {
"item_path": "document.outputs.text",
"instruction": "Evaluate technical accuracy and correctness",
"judge_type": "scale",
"scale_range": [1, 10]
}
},
{
"name": "readability_assessment",
"activity": "simple_judge",
"config": {
"item_path": "document.outputs.text",
"instruction": "Assess readability for target audience (technical professionals)",
"judge_type": "scale",
"scale_range": [1, 10]
}
},
{
"name": "completeness_check",
"activity": "simple_judge",
"config": {
"item_path": "document.outputs.text",
"instruction": "Check if all required sections are present and complete",
"judge_type": "scale",
"scale_range": [1, 10]
}
},
{
"name": "final_decision",
"activity": "conditional_branch",
"config": {
"value_path": "technical_review.outputs.average_score",
"condition_type": "greater_equal",
"condition_value": 7.0,
"true_output": {"decision": "approved"},
"false_output": {"decision": "needs_revision"}
}
}
]
}
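In the pipeline above, the final decision branches only on the technical score. If you want the decision to reflect all three stages instead, the aggregation might look like the following sketch; the 0.5/0.25/0.25 weights are illustrative assumptions, not part of the config above, and only the 7.0 threshold is taken from it.

```python
def final_decision(technical, readability, completeness, threshold=7.0):
    """Combine the three 1-10 stage scores into a single weighted score.

    The weights here are assumed for illustration; tune them per use case.
    """
    weighted = 0.5 * technical + 0.25 * readability + 0.25 * completeness
    return {
        "score": weighted,
        "decision": "approved" if weighted >= threshold else "needs_revision",
    }
```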

A/B Testing Pattern

Compare different versions or approaches systematically.

{
"name": "ab_test_evaluation",
"steps": [
{
"name": "generate_version_a",
"activity": "litellm_chat",
"config": {
"model": "gpt-4o",
"prompt": "Write a product description..."
}
},
{
"name": "generate_version_b",
"activity": "litellm_chat",
"config": {
"model": "claude-3-5-sonnet-20241022",
"prompt": "Write a product description..."
}
},
{
"name": "evaluate_version_a",
"activity": "simple_judge",
"config": {
"item_path": "generate_version_a.outputs.text",
"instruction": "Rate the quality and effectiveness of this content",
"judge_type": "scale",
"scale_range": [1, 10],
"with_explanation": true
}
},
{
"name": "evaluate_version_b",
"activity": "simple_judge",
"config": {
"item_path": "generate_version_b.outputs.text",
"instruction": "Rate the quality and effectiveness of this content",
"judge_type": "scale",
"scale_range": [1, 10],
"with_explanation": true
}
}
]
}
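The workflow above scores both versions but stops short of picking a winner. A minimal sketch of the comparison step you would add downstream (the tie-handling convention is an assumption, not a Jetty built-in):

```python
def pick_winner(results):
    """Given {'version': score} pairs, return the highest-scoring version.

    Ties are reported as 'tie' so a human (or another judge) can break them.
    """
    best = max(results, key=results.get)
    top_score = results[best]
    if sum(1 for s in results.values() if s == top_score) > 1:
        return "tie"
    return best
```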

Progressive Improvement Pattern

Iteratively improve outputs based on evaluation feedback.

{
"name": "progressive_improvement",
"steps": [
{
"name": "initial_generation",
"activity": "generate_content"
},
{
"name": "improvement_loop",
"type": "loop",
"max_iterations": 3,
"steps": [
{
"name": "evaluate_current",
"activity": "verdict_judge",
"config": {
"input_data": "current_version.outputs.text",
"evaluation_prompt": "Evaluate and provide specific improvement suggestions",
"evaluation_type": "scale",
"scale_range": [1, 10],
"include_rationale": true,
"require_suggestions": true
}
},
{
"name": "check_quality",
"condition": "evaluate_current.outputs.score >= 8",
"activity": "exit_loop"
},
{
"name": "apply_improvements",
"activity": "improve_content",
"config": {
"original": "current_version.outputs.text",
"suggestions": "evaluate_current.outputs.suggestions"
}
}
]
}
]
}
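The loop above can be sketched in plain Python to make the control flow explicit. The `generate`, `evaluate`, and `improve` callables stand in for the workflow activities; the exit threshold of 8 and the cap of 3 iterations come from the config.

```python
def improve_until_good(generate, evaluate, improve, target=8, max_iterations=3):
    """Generate once, then evaluate and improve up to max_iterations times,
    stopping early once the score reaches the target."""
    text = generate()
    for _ in range(max_iterations):
        verdict = evaluate(text)  # e.g. {'score': 6, 'suggestions': [...]}
        if verdict["score"] >= target:
            break
        text = improve(text, verdict["suggestions"])
    return text
```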

Consensus Evaluation Pattern

Aggregate verdicts from multiple judges to get more reliable assessments than any single judge provides.

{
"name": "consensus_evaluation",
"steps": [
{
"name": "multi_judge_evaluation",
"activity": "verdict_pipeline",
"config": {
"pipeline_type": "multi-judge",
"input_data": "submission.outputs.content",
"stages": [
{
"name": "technical_expert",
"model": "gpt-4",
"evaluation_prompt": "As a technical expert, evaluate accuracy and depth",
"evaluation_type": "categorical",
"categories": ["excellent", "good", "fair", "poor"]
},
{
"name": "domain_expert",
"model": "gpt-4",
"evaluation_prompt": "As a domain expert, evaluate relevance and applicability",
"evaluation_type": "categorical",
"categories": ["excellent", "good", "fair", "poor"]
},
{
"name": "communication_expert",
"model": "gpt-4",
"evaluation_prompt": "As a communication expert, evaluate clarity and structure",
"evaluation_type": "categorical",
"categories": ["excellent", "good", "fair", "poor"]
}
],
"aggregation_method": "majority_vote",
"require_consensus_threshold": 2
}
}
]
}
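The `majority_vote` aggregation with `require_consensus_threshold` behaves roughly as follows; this is a sketch of the semantics, not Jetty's implementation.

```python
from collections import Counter

def majority_vote(judgments, threshold=2):
    """Return the winning category if at least `threshold` judges agree,
    otherwise None to signal that no consensus was reached."""
    category, count = Counter(judgments).most_common(1)[0]
    return category if count >= threshold else None
```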

Advanced Evaluation Workflows

Hierarchical Quality Assessment

Implement tiered evaluation based on initial screening.

{
"name": "hierarchical_assessment",
"steps": [
{
"name": "quick_screen",
"activity": "verdict_judge",
"config": {
"input_data": "submission.outputs.content",
"evaluation_prompt": "Quick quality check: Does this meet minimum standards?",
"evaluation_type": "binary",
"model": "gpt-3.5-turbo",
"temperature": 0
}
},
{
"name": "detailed_evaluation",
"condition": "quick_screen.outputs.result == 'pass'",
"activity": "verdict_pipeline",
"config": {
"pipeline_type": "custom",
"stages": [
{
"name": "deep_analysis",
"model": "gpt-4",
"evaluation_prompt": "Perform detailed analysis across multiple dimensions",
"evaluation_type": "scale",
"scale_range": [1, 100]
},
{
"name": "expert_review",
"condition": "deep_analysis.score >= 80",
"evaluation_prompt": "Provide expert-level assessment and recommendations"
}
]
}
}
]
}

Comparative Analysis Pattern

Compare multiple options against criteria.

{
"name": "comparative_analysis",
"steps": [
{
"name": "collect_options",
"activity": "gather_alternatives",
"config": {
"count": 5
}
},
{
"name": "evaluate_all_options",
"activity": "verdict_batch",
"config": {
"items_path": "collect_options.outputs.alternatives",
"evaluation_config": {
"evaluation_prompt": "Evaluate this option against criteria:\n1. Cost-effectiveness\n2. Implementation complexity\n3. Expected impact\n4. Risk level",
"evaluation_type": "multi_criteria",
"criteria": {
"cost_effectiveness": {"weight": 0.3, "scale": [1, 10]},
"complexity": {"weight": 0.2, "scale": [1, 10], "inverse": true},
"impact": {"weight": 0.3, "scale": [1, 10]},
"risk": {"weight": 0.2, "scale": [1, 10], "inverse": true}
}
}
}
},
{
"name": "rank_options",
"activity": "calculate_rankings",
"config": {
"scores": "evaluate_all_options.outputs.scores",
"method": "weighted_sum"
}
}
]
}
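The `weighted_sum` ranking with `inverse` criteria works as sketched below: dimensions where lower is better (complexity, risk) are flipped onto the same scale before weighting. The function is illustrative; the weights and scales mirror the config above.

```python
def weighted_score(scores, criteria):
    """scores: {'cost_effectiveness': 8, ...} on each criterion's scale.

    Criteria marked inverse=True are flipped so that a *lower* raw score
    contributes *more* to the total.
    """
    total = 0.0
    for name, cfg in criteria.items():
        lo, hi = cfg["scale"]
        value = scores[name]
        if cfg.get("inverse"):
            value = hi + lo - value  # e.g. 2 on a 1-10 scale becomes 9
        total += cfg["weight"] * value
    return total
```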

Time Series Evaluation

Track quality metrics over time.

{
"name": "time_series_quality_tracking",
"steps": [
{
"name": "select_historical_outputs",
"activity": "select_trajectories",
"config": {
"filters": {
"workflow_id": "content_generation",
"date_range": {
"start": "{{30_days_ago}}",
"end": "{{today}}"
}
},
"sort_by": "created_at"
}
},
{
"name": "evaluate_historical_batch",
"activity": "verdict_batch",
"config": {
"items_path": "select_historical_outputs.outputs.trajectories",
"evaluation_config": {
"evaluation_prompt": "Rate content quality",
"evaluation_type": "scale",
"scale_range": [1, 10]
},
"batch_size": 50
}
},
{
"name": "visualize_trends",
"activity": "visualize_correlation",
"config": {
"trajectories": "evaluate_historical_batch.outputs.results",
"x_metric": "created_at",
"y_metric": "quality_score",
"plot_type": "line",
"plot_config": {
"title": "Quality Trend Over Time",
"show_moving_average": true,
"window_size": 7
}
}
}
]
}
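The `show_moving_average` option with `window_size: 7` smooths the trend line roughly like this trailing-average sketch (early points average whatever history exists so far, one common convention among several):

```python
def moving_average(scores, window=7):
    """Trailing moving average for the quality-trend line.

    The first window-1 points average all available history up to that point.
    """
    out = []
    for i in range(len(scores)):
        start = max(0, i - window + 1)
        chunk = scores[start:i + 1]
        out.append(sum(chunk) / len(chunk))
    return out
```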

Evaluation Criteria Design

Multi-Dimensional Rubrics

Create comprehensive evaluation rubrics for complex assessments.

{
"evaluation_rubric": {
"technical_accuracy": {
"weight": 0.3,
"levels": {
"excellent": "All technical details are correct, with citations",
"good": "Minor technical inaccuracies that don't affect understanding",
"fair": "Some technical errors that may confuse readers",
"poor": "Significant technical errors throughout"
}
},
"completeness": {
"weight": 0.25,
"levels": {
"excellent": "All required topics covered in depth",
"good": "Most topics covered adequately",
"fair": "Key topics present but lacking detail",
"poor": "Missing critical information"
}
},
"clarity": {
"weight": 0.25,
"levels": {
"excellent": "Crystal clear, well-organized, easy to follow",
"good": "Generally clear with minor confusion points",
"fair": "Somewhat unclear, requires effort to understand",
"poor": "Confusing, poorly organized, hard to follow"
}
},
"engagement": {
"weight": 0.2,
"levels": {
"excellent": "Highly engaging, maintains reader interest",
"good": "Reasonably engaging with good examples",
"fair": "Somewhat dry but informative",
"poor": "Boring, difficult to stay focused"
}
}
}
}
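To turn categorical rubric judgments into a single number, map each level to points and take the weighted sum. The 4/3/2/1 point mapping below is an assumed convention, not part of the rubric itself.

```python
LEVEL_POINTS = {"excellent": 4, "good": 3, "fair": 2, "poor": 1}  # assumed mapping

def rubric_score(judgments, rubric):
    """judgments: {'technical_accuracy': 'good', ...}.

    Returns the weighted score normalized to 0-1 against the 4-point maximum.
    """
    total = sum(rubric[dim]["weight"] * LEVEL_POINTS[level]
                for dim, level in judgments.items())
    return total / max(LEVEL_POINTS.values())
```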

Contextual Evaluation

Adapt evaluation criteria based on context.

{
"contextual_evaluation": {
"audience_based": {
"technical_audience": {
"prioritize": ["accuracy", "depth", "precision"],
"de_emphasize": ["simplicity", "broad_appeal"]
},
"general_audience": {
"prioritize": ["clarity", "relevance", "engagement"],
"de_emphasize": ["technical_depth", "jargon"]
}
},
"purpose_based": {
"educational": {
"criteria": ["clarity", "completeness", "examples", "progression"]
},
"reference": {
"criteria": ["accuracy", "comprehensiveness", "organization", "searchability"]
},
"persuasive": {
"criteria": ["argument_strength", "evidence", "emotional_appeal", "call_to_action"]
}
}
}
}

Performance Optimization Patterns

Efficient Batch Processing

Optimize evaluation of large datasets.

{
"name": "optimized_batch_evaluation",
"steps": [
{
"name": "prepare_batches",
"activity": "segment_data",
"config": {
"total_items": 10000,
"strategy": "smart_batching",
"criteria": {
"similar_length": true,
"similar_complexity": true,
"max_tokens_per_batch": 4000
}
}
},
{
"name": "parallel_evaluation",
"activity": "verdict_batch",
"config": {
"items_path": "prepare_batches.outputs.batches",
"batch_size": "dynamic",
"max_concurrent": 20,
"evaluation_config": {
"model": "gpt-3.5-turbo",
"temperature": 0,
"cache_results": true
}
}
},
{
"name": "aggregate_results",
"activity": "combine_batch_results",
"config": {
"maintain_order": true,
"calculate_statistics": true
}
}
]
}
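A greedy version of the smart-batching step might look like this: sort items by estimated size so each batch holds items of similar length, then pack until the token budget would be exceeded. The 4-characters-per-token estimate is a rough assumption.

```python
def smart_batches(items, max_tokens_per_batch=4000, estimate=lambda s: len(s) // 4):
    """Greedily pack items into batches that stay under the token budget.

    Sorting by estimated size first groups items of similar length together.
    """
    batches, current, used = [], [], 0
    for item in sorted(items, key=estimate):
        cost = estimate(item)
        if current and used + cost > max_tokens_per_batch:
            batches.append(current)
            current, used = [], 0
        current.append(item)
        used += cost
    if current:
        batches.append(current)
    return batches
```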

Cached Evaluation Pipeline

Implement caching for repeated evaluations.

{
"name": "cached_evaluation",
"steps": [
{
"name": "check_cache",
"activity": "lookup_evaluation_cache",
"config": {
"content_hash": "{{content.hash}}",
"evaluation_version": "v2.1"
}
},
{
"name": "evaluate_if_needed",
"condition": "check_cache.outputs.cache_miss",
"activity": "verdict_judge",
"config": {
"input_data": "content.outputs.text",
"cache_result": true,
"cache_ttl": 86400
}
},
{
"name": "use_result",
"activity": "process_evaluation",
"config": {
"evaluation": "{{check_cache.outputs.cached_result || evaluate_if_needed.outputs.result}}"
}
}
]
}
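The cache lookup keyed by content hash, evaluation version, and TTL can be sketched as follows. A real deployment would back this with Redis or a database table; the in-memory dict is for illustration only.

```python
import hashlib
import time

class EvaluationCache:
    """Content-hash keyed evaluation cache with a TTL, as in check_cache."""

    def __init__(self, ttl=86400):
        self.ttl = ttl
        self._store = {}

    @staticmethod
    def key(content, version="v2.1"):
        # Version the key so criteria changes invalidate old entries.
        return version + ":" + hashlib.sha256(content.encode()).hexdigest()

    def get(self, content):
        entry = self._store.get(self.key(content))
        if entry and time.time() - entry["at"] < self.ttl:
            return entry["result"]
        return None  # cache miss (or expired)

    def put(self, content, result):
        self._store[self.key(content)] = {"result": result, "at": time.time()}
```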

Error Handling Patterns

Robust Evaluation with Fallbacks

Handle evaluation failures gracefully.

{
"name": "robust_evaluation",
"steps": [
{
"name": "primary_evaluation",
"activity": "verdict_judge",
"config": {
"model": "gpt-4",
"timeout": 30,
"retry_count": 2
},
"error_handler": {
"on_timeout": "fallback_evaluation",
"on_rate_limit": "queue_for_retry",
"on_parse_error": "simplified_evaluation"
}
},
{
"name": "fallback_evaluation",
"condition": "primary_evaluation.error",
"activity": "verdict_judge",
"config": {
"model": "gpt-3.5-turbo",
"simplified_prompt": true
}
},
{
"name": "simplified_evaluation",
"condition": "fallback_evaluation.error",
"activity": "rule_based_evaluation",
"config": {
"rules": ["length_check", "keyword_presence", "format_validation"]
}
}
]
}
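The error-handler chain above (primary judge, cheaper fallback, rule-based last resort) reduces to trying evaluators in order until one succeeds. A minimal sketch:

```python
def evaluate_with_fallbacks(content, evaluators):
    """Try each evaluator in order; the first that doesn't raise wins.

    Mirrors the primary -> fallback -> rule-based chain in the workflow.
    """
    last_error = None
    for evaluate in evaluators:
        try:
            return evaluate(content)
        except Exception as err:  # timeout, rate limit, parse error...
            last_error = err
    raise RuntimeError("all evaluators failed") from last_error
```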

Integration Patterns

End-to-End Workflow Integration

Combine data processing, generation, and evaluation.

{
"name": "integrated_content_pipeline",
"steps": [
{
"name": "collect_data",
"activity": "read_text_file",
"config": {
"file_path": "sources/research_data.json"
}
},
{
"name": "generate_content",
"activity": "generate_article",
"config": {
"data": "collect_data.outputs.content",
"style": "technical"
}
},
{
"name": "initial_evaluation",
"activity": "verdict_judge",
"config": {
"input_data": "generate_content.outputs.article",
"evaluation_type": "scale",
"scale_range": [1, 10]
}
},
{
"name": "enhance_if_needed",
"condition": "initial_evaluation.outputs.score < 8",
"activity": "enhance_content",
"config": {
"feedback": "initial_evaluation.outputs.rationale"
}
},
{
"name": "final_evaluation",
"activity": "verdict_pipeline",
"config": {
"pipeline_type": "judge-verify",
"input_data": "{{enhance_if_needed.outputs.enhanced || generate_content.outputs.article}}"
}
},
{
"name": "save_approved",
"condition": "final_evaluation.outputs.final_score >= 8",
"activity": "save_text_file",
"config": {
"content": "final_evaluation.outputs.evaluated_content",
"file_path": "approved/{{workflow.run_id}}.md"
}
}
]
}

Comprehensive Data Processing + Evaluation Pipelines

Document Processing and Quality Assessment Pipeline

Complete workflow for processing multiple documents with comprehensive evaluation.

{
"name": "document_processing_quality_pipeline",
"description": "Process multiple documents, concatenate content, and evaluate quality with comprehensive assessment",
"steps": [
{
"name": "load_source_documents",
"activity": "read_text_file",
"config": {
"file_path": "sources/document_list.json"
}
},
{
"name": "process_document_batch",
"activity": "text_concatenate",
"config": {
"input_files": "load_source_documents.outputs.file_list",
"separator": "\n\n---\n\n",
"output_format": "markdown",
"include_source_headers": true
}
},
{
"name": "add_processing_metadata",
"activity": "add_image_metadata",
"config": {
"file_path": "process_document_batch.outputs.file_path",
"metadata": {
"processed_date": "{{timestamp}}",
"source_count": "{{load_source_documents.outputs.file_count}}",
"processing_pipeline": "document_processing_quality_pipeline"
}
}
},
{
"name": "comprehensive_quality_evaluation",
"activity": "verdict_pipeline",
"config": {
"pipeline_type": "custom",
"input_data": "process_document_batch.outputs.text",
"stages": [
{
"name": "content_completeness",
"evaluation_prompt": "Evaluate the completeness of this combined document. Check if:\n1. All sections are present and coherent\n2. No critical information is missing\n3. Document flow is logical\n4. Sources are properly integrated",
"evaluation_type": "scale",
"scale_range": [1, 10],
"weight": 0.3
},
{
"name": "technical_accuracy",
"evaluation_prompt": "Assess the technical accuracy of the content:\n1. Factual correctness\n2. Technical terminology usage\n3. Data consistency across sources\n4. Citation accuracy",
"evaluation_type": "scale",
"scale_range": [1, 10],
"weight": 0.4
},
{
"name": "readability_assessment",
"evaluation_prompt": "Evaluate document readability and structure:\n1. Clear section organization\n2. Consistent formatting\n3. Appropriate language level\n4. Logical information flow",
"evaluation_type": "scale",
"scale_range": [1, 10],
"weight": 0.3
}
],
"aggregation_method": "weighted_average"
}
},
{
"name": "quality_gate_decision",
"activity": "verdict_judge",
"config": {
"input_data": "process_document_batch.outputs.text",
"evaluation_prompt": "Based on the quality scores ({{comprehensive_quality_evaluation.outputs.final_score}}), determine if this document meets publication standards. Minimum score: 7.5",
"evaluation_type": "categorical",
"categories": ["publish", "revise", "reject"],
"category_descriptions": {
"publish": "Document meets all quality standards",
"revise": "Document needs improvements before publication",
"reject": "Document does not meet minimum standards"
}
}
},
{
"name": "save_approved_document",
"condition": "quality_gate_decision.outputs.result == 'publish'",
"activity": "save_text_file",
"config": {
"content": "process_document_batch.outputs.text",
"file_path": "published/documents/{{workflow.run_id}}_approved.md",
"metadata": {
"quality_score": "comprehensive_quality_evaluation.outputs.final_score",
"decision": "quality_gate_decision.outputs.result",
"evaluation_date": "{{timestamp}}"
}
}
},
{
"name": "flag_for_revision",
"condition": "quality_gate_decision.outputs.result == 'revise'",
"activity": "webhook_notify",
"config": {
"webhook_url": "https://api.example.com/revision-queue",
"payload": {
"document_id": "{{workflow.run_id}}",
"status": "needs_revision",
"quality_score": "comprehensive_quality_evaluation.outputs.final_score",
"feedback": "comprehensive_quality_evaluation.outputs.stage_results",
"file_path": "process_document_batch.outputs.file_path"
}
}
}
]
}

Data Analysis and Evaluation Workflow

Comprehensive pipeline for shipping data analysis with multi-criteria evaluation.

{
"name": "shipping_data_analysis_evaluation",
"description": "Analyze shipping rate data, perform aggregations, and evaluate results with business criteria",
"steps": [
{
"name": "load_shipping_data",
"activity": "read_text_file",
"config": {
"file_path": "data/shipping_rates_2024.csv"
}
},
{
"name": "analyze_port_pairs",
"activity": "port_pairs_rate_aggregator",
"config": {
"data_source": "load_shipping_data.outputs.content",
"analysis_type": "comprehensive",
"aggregation_methods": ["mean", "median", "percentile_95"],
"group_by": ["origin_port", "destination_port", "service_type"],
"time_period": "2024"
}
},
{
"name": "calculate_rate_priorities",
"activity": "rate_option_prioritizer",
"config": {
"rate_data": "analyze_port_pairs.outputs.aggregated_rates",
"criteria": {
"cost_effectiveness": 0.4,
"service_reliability": 0.3,
"transit_time": 0.2,
"capacity_availability": 0.1
},
"optimization_target": "balanced"
}
},
{
"name": "generate_analysis_report",
"activity": "text_concatenate",
"config": {
"sections": [
{
"title": "Executive Summary",
"content": "analyze_port_pairs.outputs.summary"
},
{
"title": "Rate Analysis Results",
"content": "analyze_port_pairs.outputs.detailed_analysis"
},
{
"title": "Prioritized Recommendations",
"content": "calculate_rate_priorities.outputs.recommendations"
}
],
"output_format": "markdown"
}
},
{
"name": "evaluate_analysis_quality",
"activity": "verdict_pipeline",
"config": {
"pipeline_type": "multi-judge",
"input_data": "generate_analysis_report.outputs.text",
"stages": [
{
"name": "data_analyst_review",
"evaluation_prompt": "As a shipping data analyst, evaluate this analysis for:\n1. Data completeness and accuracy\n2. Statistical methodology correctness\n3. Industry relevance of insights\n4. Actionability of recommendations",
"evaluation_type": "categorical",
"categories": ["excellent", "good", "acceptable", "poor"],
"model": "gpt-4"
},
{
"name": "business_stakeholder_review",
"evaluation_prompt": "From a business perspective, assess this analysis for:\n1. Strategic value and insights\n2. Cost optimization opportunities\n3. Risk assessment coverage\n4. Implementation feasibility",
"evaluation_type": "categorical",
"categories": ["excellent", "good", "acceptable", "poor"],
"model": "gpt-4"
},
{
"name": "technical_reviewer",
"evaluation_prompt": "Evaluate the technical aspects:\n1. Data processing methodology\n2. Statistical rigor\n3. Visualization quality\n4. Documentation completeness",
"evaluation_type": "categorical",
"categories": ["excellent", "good", "acceptable", "poor"],
"model": "gpt-4"
}
],
"aggregation_method": "majority_vote",
"require_consensus_threshold": 2
}
},
{
"name": "business_impact_assessment",
"activity": "verdict_judge",
"config": {
"input_data": "calculate_rate_priorities.outputs.recommendations",
"evaluation_prompt": "Evaluate the potential business impact of these recommendations:\n1. Estimated cost savings potential\n2. Implementation complexity\n3. Risk level and mitigation strategies\n4. Timeline for realizing benefits",
"evaluation_type": "scale",
"scale_range": [1, 10],
"include_rationale": true
}
},
{
"name": "select_historical_comparisons",
"activity": "select_trajectories",
"config": {
"filters": {
"workflow_type": "shipping_analysis",
"date_range": {
"start": "2023-01-01",
"end": "2023-12-31"
},
"status": "completed"
},
"sort_by": "quality_score",
"limit": 5
}
},
{
"name": "benchmark_against_historical",
"activity": "visualize_correlation",
"config": {
"current_analysis": "evaluate_analysis_quality.outputs.consensus_score",
"historical_data": "select_historical_comparisons.outputs.trajectories",
"metrics": ["quality_score", "business_impact", "implementation_success"],
"visualization_type": "comparative_analysis",
"output_format": "interactive_chart"
}
},
{
"name": "final_approval_decision",
"activity": "verdict_judge",
"config": {
"input_data": "generate_analysis_report.outputs.text",
"evaluation_prompt": "Make final approval decision based on:\n- Analysis quality: {{evaluate_analysis_quality.outputs.consensus_result}}\n- Business impact: {{business_impact_assessment.outputs.score}}\n- Historical performance: {{benchmark_against_historical.outputs.correlation_score}}\n\nMinimum thresholds: Quality >= 'good', Impact >= 7, Historical correlation >= 0.7",
"evaluation_type": "binary",
"positive_label": "approved",
"negative_label": "requires_revision"
}
},
{
"name": "publish_approved_analysis",
"condition": "final_approval_decision.outputs.result == 'approved'",
"activity": "save_text_file",
"config": {
"content": "generate_analysis_report.outputs.text",
"file_path": "reports/shipping_analysis/{{date}}_{{workflow.run_id}}.md",
"metadata": {
"analysis_quality": "evaluate_analysis_quality.outputs.consensus_result",
"business_impact_score": "business_impact_assessment.outputs.score",
"approval_status": "approved",
"publication_date": "{{timestamp}}"
}
}
},
{
"name": "notify_stakeholders",
"condition": "final_approval_decision.outputs.result == 'approved'",
"activity": "webhook_notify",
"config": {
"webhook_url": "https://api.company.com/notifications/shipping-analysis",
"payload": {
"analysis_id": "{{workflow.run_id}}",
"status": "published",
"quality_metrics": {
"analysis_quality": "evaluate_analysis_quality.outputs.consensus_result",
"business_impact": "business_impact_assessment.outputs.score",
"historical_benchmark": "benchmark_against_historical.outputs.correlation_score"
},
"report_url": "reports/shipping_analysis/{{date}}_{{workflow.run_id}}.md",
"recommendations_summary": "calculate_rate_priorities.outputs.top_recommendations"
}
}
}
]
}

Multi-Modal Content Processing Pipeline

Advanced workflow combining image processing, data analysis, and comprehensive evaluation.

{
"name": "multimodal_content_quality_pipeline",
"description": "Process images, analyze content, and perform comprehensive quality evaluation",
"steps": [
{
"name": "download_content_images",
"activity": "download_image",
"config": {
"image_urls": [
"https://example.com/content/chart1.png",
"https://example.com/content/infographic.jpg",
"https://example.com/content/diagram.svg"
],
"output_directory": "images/{{workflow.run_id}}/",
"quality": "high",
"format": "original"
}
},
{
"name": "extract_image_metadata",
"activity": "add_image_metadata",
"config": {
"image_paths": "download_content_images.outputs.downloaded_files",
"extract_metadata": {
"technical": ["dimensions", "format", "color_profile", "compression"],
"content": ["text_content", "accessibility_description", "visual_elements"],
"quality": ["resolution", "clarity_score", "visual_appeal"]
}
}
},
{
"name": "load_accompanying_text",
"activity": "read_text_file",
"config": {
"file_path": "content/article_{{workflow.run_id}}.md"
}
},
{
"name": "combine_multimodal_content",
"activity": "text_concatenate",
"config": {
"sections": [
{
"title": "Article Content",
"content": "load_accompanying_text.outputs.content"
},
{
"title": "Visual Elements",
"content": "extract_image_metadata.outputs.content_descriptions"
},
{
"title": "Technical Specifications",
"content": "extract_image_metadata.outputs.technical_metadata"
}
],
"output_format": "structured_markdown"
}
},
{
"name": "comprehensive_content_evaluation",
"activity": "verdict_pipeline",
"config": {
"pipeline_type": "hierarchical",
"input_data": "combine_multimodal_content.outputs.text",
"stages": [
{
"name": "technical_quality_check",
"evaluation_prompt": "Evaluate technical quality of images and content:\n1. Image resolution and clarity\n2. Text readability and formatting\n3. Technical accuracy of visual elements\n4. Accessibility compliance",
"evaluation_type": "binary",
"positive_label": "meets_technical_standards",
"negative_label": "technical_issues_found"
},
{
"name": "content_quality_assessment",
"condition": "technical_quality_check.result == 'meets_technical_standards'",
"evaluation_prompt": "Assess content quality across modalities:\n1. Text-image coherence and complementarity\n2. Information accuracy and completeness\n3. Visual appeal and engagement\n4. Educational or informational value",
"evaluation_type": "scale",
"scale_range": [1, 10]
},
{
"name": "audience_suitability_review",
"condition": "content_quality_assessment.score >= 7",
"evaluation_prompt": "Evaluate suitability for target audience:\n1. Appropriate complexity level\n2. Cultural sensitivity and inclusivity\n3. Engagement potential\n4. Learning objective alignment",
"evaluation_type": "categorical",
"categories": ["highly_suitable", "suitable", "needs_adaptation", "unsuitable"]
}
]
}
},
{
"name": "accessibility_compliance_check",
"activity": "verdict_judge",
"config": {
"input_data": "extract_image_metadata.outputs.accessibility_analysis",
"evaluation_prompt": "Evaluate accessibility compliance:\n1. Alternative text quality and descriptiveness\n2. Color contrast ratios for text/background\n3. Image complexity and description adequacy\n4. WCAG 2.1 AA compliance level",
"evaluation_type": "categorical",
"categories": ["fully_compliant", "mostly_compliant", "partially_compliant", "non_compliant"],
"include_rationale": true
}
},
{
"name": "batch_evaluate_historical_content",
"activity": "verdict_batch",
"config": {
"items_path": "select_similar_content.outputs.historical_items",
"evaluation_config": {
"evaluation_prompt": "Compare this historical content to current quality standards:\n1. Technical quality improvements\n2. Content depth and accuracy\n3. Visual design evolution\n4. Accessibility improvements",
"evaluation_type": "scale",
"scale_range": [1, 10],
"include_rationale": true
},
"batch_size": 5,
"max_concurrent": 3
}
},
{
"name": "quality_trend_analysis",
"activity": "visualize_correlation",
"config": {
"current_score": "comprehensive_content_evaluation.outputs.final_score",
"historical_scores": "batch_evaluate_historical_content.outputs.scores",
"metrics": ["technical_quality", "content_quality", "accessibility_score"],
"analysis_type": "trend_analysis",
"time_dimension": "content_creation_date"
}
},
{
"name": "publication_decision",
"activity": "verdict_judge",
"config": {
"input_data": "combine_multimodal_content.outputs.text",
"evaluation_prompt": "Make publication decision based on comprehensive evaluation:\n- Content Quality: {{comprehensive_content_evaluation.outputs.final_score}}\n- Accessibility: {{accessibility_compliance_check.outputs.result}}\n- Historical Benchmark: {{quality_trend_analysis.outputs.trend_score}}\n\nPublication criteria: Content >= 8.0, Accessibility >= 'mostly_compliant', Trend >= baseline",
"evaluation_type": "categorical",
"categories": ["publish_immediately", "publish_with_minor_edits", "major_revision_required", "reject"],
"include_rationale": true
}
},
{
"name": "save_approved_content",
"condition": "publication_decision.outputs.result == 'publish_immediately'",
"activity": "save_text_file",
"config": {
"content": "combine_multimodal_content.outputs.text",
"file_path": "published/multimodal/{{date}}/{{workflow.run_id}}.md",
"metadata": {
"content_quality_score": "comprehensive_content_evaluation.outputs.final_score",
"accessibility_rating": "accessibility_compliance_check.outputs.result",
"publication_decision": "publication_decision.outputs.result",
"image_count": "download_content_images.outputs.file_count",
"quality_trend": "quality_trend_analysis.outputs.trend_direction"
}
}
},
{
"name": "notify_content_team",
"activity": "webhook_notify",
"config": {
"webhook_url": "https://api.cms.company.com/content-status",
"payload": {
"content_id": "{{workflow.run_id}}",
"decision": "publication_decision.outputs.result",
"quality_metrics": {
"overall_score": "comprehensive_content_evaluation.outputs.final_score",
"accessibility_compliance": "accessibility_compliance_check.outputs.result",
"technical_quality": "comprehensive_content_evaluation.outputs.stage_scores.technical_quality_check",
"trend_performance": "quality_trend_analysis.outputs.trend_score"
},
"next_actions": "publication_decision.outputs.rationale",
"published_location": "{{save_approved_content.outputs.file_path || 'pending_revision'}}"
}
}
}
]
}

Monitoring and Analytics

Evaluation Metrics Dashboard

Track evaluation performance over time.

{
"name": "evaluation_analytics",
"steps": [
{
"name": "collect_metrics",
"activity": "select_trajectories",
"config": {
"filters": {
"workflow_id": "evaluation_workflow",
"date_range": {"days": 7}
}
}
},
{
"name": "calculate_statistics",
"activity": "aggregate_metrics",
"config": {
"metrics": [
"average_score",
"score_distribution",
"pass_rate",
"evaluation_time",
"cost_per_evaluation"
],
"group_by": ["model", "evaluation_type"]
}
},
{
"name": "generate_report",
"activity": "create_dashboard",
"config": {
"visualizations": [
"score_trend_line",
"pass_rate_bar",
"cost_breakdown_pie",
"model_performance_heatmap"
]
}
}
]
}
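The dashboard metrics reduce to simple aggregates over evaluation records. A sketch, assuming each record carries a `score` and an optional `cost` field (field names are illustrative):

```python
def evaluation_stats(results, pass_threshold=7):
    """Summary metrics for the dashboard: mean score, pass rate, mean cost.

    `results` is a list of {'score': ..., 'cost': ...} records.
    """
    scores = [r["score"] for r in results]
    return {
        "average_score": sum(scores) / len(scores),
        "pass_rate": sum(s >= pass_threshold for s in scores) / len(scores),
        "cost_per_evaluation": sum(r.get("cost", 0) for r in results) / len(results),
    }
```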

Best Practices Summary

Design Principles

  1. Clear Criteria: Define specific, measurable evaluation criteria
  2. Appropriate Granularity: Match evaluation complexity to use case
  3. Consistent Prompts: Use structured, reproducible evaluation prompts
  4. Balanced Scoring: Weight criteria appropriately for context

Implementation Guidelines

  1. Start Simple: Begin with basic evaluations, add complexity as needed
  2. Test Thoroughly: Validate evaluation prompts with known examples
  3. Monitor Consistency: Track inter-rater reliability over time
  4. Optimize Costs: Use appropriate models and caching strategies

Performance Tips

  1. Batch Similar Items: Group similar evaluations for efficiency
  2. Cache Aggressively: Store results for repeated evaluations
  3. Use Appropriate Models: Match model complexity to evaluation needs
  4. Implement Fallbacks: Handle failures gracefully

Quality Assurance

  1. Validate Results: Periodically verify evaluation accuracy
  2. Track Metrics: Monitor evaluation performance and costs
  3. Iterate Prompts: Refine evaluation criteria based on results
  4. Document Decisions: Record evaluation design rationale
