Observability for CrewAI Multi-Agent Collaboration
Trace agent handoffs, track per-agent costs, and evaluate content quality across a multi-agent pipeline
This cookbook walks you through adding complete observability to a CrewAI multi-agent pipeline—tracing agent-to-agent handoffs, measuring individual agent performance, and evaluating content quality at each stage.
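Tracing only captures agent handoffs if CrewAI is instrumented before any crew runs. A minimal setup sketch, assuming Arize Phoenix as the backend with the OpenInference CrewAI instrumentor (an assumption; any OpenTelemetry-compatible platform can be swapped in), looks like this:

```python
# Minimal tracing setup sketch (assumes Arize Phoenix + OpenInference;
# substitute your own OpenTelemetry-compatible backend as needed).
# pip install arize-phoenix openinference-instrumentation-crewai
from phoenix.otel import register
from openinference.instrumentation.crewai import CrewAIInstrumentor

# Register a tracer provider pointed at the Phoenix collector, then hook
# CrewAI so agent and task executions are emitted as spans.
tracer_provider = register(project_name="crewai-content-pipeline")
CrewAIInstrumentor().instrument(tracer_provider=tracer_provider)
```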
Define the four content-team agents, each with a configurable model:

```python
from crewai import Agent
from langchain_openai import ChatOpenAI


def create_agents(config: dict = None):
    """Create the content team agents with configurable models."""
    config = config or {
        "researcher": "gpt-4o",
        "writer": "gpt-4o",
        "editor": "gpt-3.5-turbo",
        "seo": "gpt-3.5-turbo",
    }

    researcher = Agent(
        role="Research Specialist",
        goal="Gather accurate facts, statistics, and expert opinions for the article",
        backstory=(
            "You are an expert researcher with 10 years of experience in content research. "
            "You excel at finding reliable sources, key statistics, and expert quotes "
            "that make articles authoritative and engaging."
        ),
        llm=ChatOpenAI(model=config["researcher"]),
        verbose=True,
    )

    writer = Agent(
        role="Content Writer",
        goal="Write engaging, well-structured blog articles that inform and captivate readers",
        backstory=(
            "You are a professional copywriter with expertise in creating compelling content. "
            "You know how to structure articles with clear introductions, informative body sections, "
            "and memorable conclusions that drive engagement."
        ),
        llm=ChatOpenAI(model=config["writer"]),
        verbose=True,
    )

    editor = Agent(
        role="Quality Editor",
        goal="Polish articles for clarity, grammar, flow, and readability",
        backstory=(
            "You are a senior editor with a keen eye for detail. "
            "You improve sentence structure, fix grammatical errors, enhance flow, "
            "and ensure the content is clear and professional."
        ),
        llm=ChatOpenAI(model=config["editor"]),
        verbose=True,
    )

    seo_specialist = Agent(
        role="SEO Optimizer",
        goal="Optimize content for search engines without sacrificing readability",
        backstory=(
            "You are an SEO expert who balances keyword optimization with user experience. "
            "You add meta descriptions, optimize headings, suggest internal links, "
            "and ensure content ranks well while remaining engaging."
        ),
        llm=ChatOpenAI(model=config["seo"]),
        verbose=True,
    )

    return {
        "researcher": researcher,
        "writer": writer,
        "editor": editor,
        "seo": seo_specialist,
    }
```
Create tasks that chain together with dependencies:
```python
from crewai import Task


def create_tasks(agents: dict, topic: str):
    """Create the content pipeline tasks."""
    research_task = Task(
        description=(
            f"Research the topic: '{topic}'. "
            "Find 5-7 key facts, relevant statistics, and expert opinions. "
            "Include sources where possible. Focus on accuracy and relevance."
        ),
        expected_output=(
            "A research brief containing:\n"
            "- Key facts and statistics\n"
            "- Expert opinions or quotes\n"
            "- Source references\n"
            "- Main themes to cover"
        ),
        agent=agents["researcher"],
    )

    writing_task = Task(
        description=(
            "Write an 800-1000 word blog article based on the research provided. "
            "Include:\n"
            "- An engaging introduction that hooks the reader\n"
            "- 3-4 body sections with clear subheadings\n"
            "- A conclusion with key takeaways\n"
            "Format the article in markdown."
        ),
        expected_output="A draft blog article in markdown format with introduction, body sections, and conclusion",
        agent=agents["writer"],
        context=[research_task],
    )

    editing_task = Task(
        description=(
            "Edit the article for:\n"
            "- Grammar and spelling errors\n"
            "- Sentence structure and flow\n"
            "- Clarity and readability\n"
            "- Consistent tone and style\n"
            "Make improvements while preserving the author's voice."
        ),
        expected_output="A polished blog article with improved clarity, grammar, and flow",
        agent=agents["editor"],
        context=[writing_task],
    )

    seo_task = Task(
        description=(
            "Optimize the article for SEO:\n"
            "- Add a compelling meta description (150-160 characters)\n"
            "- Optimize the title and headings for keywords\n"
            "- Suggest 3-5 target keywords\n"
            "- Ensure proper heading hierarchy (H1, H2, H3)\n"
            "- Add a suggested slug for the URL\n"
            "Return the optimized article with SEO metadata."
        ),
        expected_output=(
            "SEO-optimized article with:\n"
            "- Meta description\n"
            "- Target keywords\n"
            "- Optimized headings\n"
            "- Suggested URL slug"
        ),
        agent=agents["seo"],
        context=[editing_task],
    )

    return [research_task, writing_task, editing_task, seo_task]
```
Assemble the crew and generate a test article:

```python
# Create crew with default configuration
crew_data = create_content_crew()

# Run a test article
result = run_content_crew(
    crew_data,
    topic="The Future of AI in Healthcare",
)

print("Article created successfully!")
print(result.raw[:500] + "...")
```
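The `create_content_crew` and `run_content_crew` helpers used above are defined earlier in the cookbook. A minimal sketch consistent with how they are called here (the exact definitions are an assumption) builds the agents once, then assembles a sequential crew per topic:

```python
# Sketch of the crew helpers, assuming the create_agents / create_tasks
# functions defined above; the cookbook's real definitions may differ.
from crewai import Crew, Process


def create_content_crew(config: dict = None):
    """Build the agents once and return them for reuse across runs."""
    return {"agents": create_agents(config)}


def run_content_crew(crew_data: dict, topic: str):
    """Create topic-specific tasks, assemble a sequential crew, and run it."""
    agents = crew_data["agents"]
    tasks = create_tasks(agents, topic)
    crew = Crew(
        agents=list(agents.values()),
        tasks=tasks,
        process=Process.sequential,
        verbose=True,
    )
    return crew.kickoff()
```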
Test across multiple topics to get statistically meaningful results:
```python
TEST_TOPICS = [
    "The Future of AI in Healthcare",
    "Remote Work Best Practices for 2026",
    "Sustainable Investing for Beginners",
    "How to Build a Personal Brand Online",
    "The Rise of Electric Vehicles",
]


def run_config_comparison():
    """Run all configurations across all topics."""
    results = []
    for config_key, config in CONFIGS.items():
        for topic in TEST_TOPICS:
            print(f"Running {config_key} for: {topic[:30]}...")
            result = create_article(
                topic=topic,
                config_name=config_key,
                config=config,
            )
            results.append({
                "config": config_key,
                "topic": topic,
                "output": result["output"],
                "output_length": len(result["output"]),
            })
    return results

# Run comparison (this will take a while)
# comparison_results = run_config_comparison()
```
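`CONFIGS` and `create_article` also come from earlier in the cookbook. For reference, a plausible shape (the exact model mix and wrapper are assumptions) maps each configuration name to the per-agent models accepted by `create_agents`, while `create_article` runs one full pipeline per topic/config pair:

```python
# Assumed shapes for CONFIGS and create_article, shown only for reference.
CONFIGS = {
    "premium": {"researcher": "gpt-4o", "writer": "gpt-4o",
                "editor": "gpt-4o", "seo": "gpt-4o"},
    "balanced": {"researcher": "gpt-4o", "writer": "gpt-4o",
                 "editor": "gpt-3.5-turbo", "seo": "gpt-3.5-turbo"},
    "budget": {"researcher": "gpt-3.5-turbo", "writer": "gpt-3.5-turbo",
               "editor": "gpt-3.5-turbo", "seo": "gpt-3.5-turbo"},
}


def create_article(topic: str, config_name: str, config: dict) -> dict:
    """Run the full pipeline once for a given topic and model configuration."""
    crew_data = create_content_crew(config)
    result = run_content_crew(crew_data, topic=topic)
    return {"config_name": config_name, "output": result.raw}
```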
Use the Answer Correctness template with a custom prompt:
```
Evaluate the quality of this blog article draft.

Score from 0 to 1 based on:
- Coherence: Does the article flow logically? (0.3)
- Coverage: Does it cover the topic comprehensively? (0.3)
- Engagement: Is it interesting to read? (0.2)
- Structure: Does it have clear intro, body, conclusion? (0.2)

Article:
{output}

Topic:
{input}
```
A second prompt scores the final, SEO-optimized article for publication readiness:

```
Evaluate this blog article for publication readiness.

Score from 0 to 1 based on:
- Informativeness: Does it provide valuable information? (0.25)
- Readability: Is it easy to read and understand? (0.25)
- SEO Optimization: Does it have proper structure and keywords? (0.25)
- Professionalism: Is it publication-ready? (0.25)

Article:
{output}

Original Topic:
{input}
```
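These rubric prompts are meant to be pasted into the evaluator template; if you prefer to score runs locally, the same rubrics can be driven by a plain LLM-as-a-judge call. A sketch, where the `judge_article` helper and the judge model are assumptions:

```python
# Local LLM-as-a-judge sketch; helper name and judge model are assumptions.
from openai import OpenAI

client = OpenAI()


def judge_article(rubric_prompt: str, article: str, topic: str) -> float:
    """Fill a rubric prompt (with {output} and {input} placeholders) and ask a
    judge model for a single 0-1 score."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": rubric_prompt.format(output=article, input=topic)
                       + "\n\nRespond with only the numeric score.",
        }],
    )
    return float(response.choices[0].message.content.strip())
```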
Execute all configurations and collect results for evaluation:
```python
def run_quality_evaluation():
    """Run all test cases across all configurations."""
    results = []
    for config_key, config in CONFIGS.items():
        for test_case in TEST_CASES:
            print(f"Running {config_key} for {test_case['id']}...")
            result = create_article(
                topic=test_case["topic"],
                config_name=f"{config_key}-{test_case['id']}",
                config=config,
            )
            results.append({
                "test_id": test_case["id"],
                "config": config_key,
                "topic": test_case["topic"],
                "output": result["output"],
                "keywords": test_case["target_keywords"],
            })
    return results

# Run evaluation
# evaluation_results = run_quality_evaluation()
```
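`TEST_CASES` is defined earlier in the cookbook; based on the fields accessed above, each entry carries an id, a topic, and target keywords, roughly like this (the concrete entries are illustrative assumptions):

```python
# Assumed shape of TEST_CASES, shown only for reference.
TEST_CASES = [
    {
        "id": "ai-healthcare",
        "topic": "The Future of AI in Healthcare",
        "target_keywords": ["AI in healthcare", "medical AI", "healthcare technology"],
    },
    {
        "id": "remote-work",
        "topic": "Remote Work Best Practices for 2026",
        "target_keywords": ["remote work", "distributed teams", "work from home"],
    },
]
```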
Based on trace analysis, refine agent backstories and task descriptions:
```python
# Example: Improved editor agent after observing poor edit quality
editor_improved = Agent(
    role="Quality Editor",
    goal="Polish articles for clarity, grammar, flow, and readability",
    backstory=(
        "You are a senior editor with 15 years of experience at major publications. "
        "You have a keen eye for detail and always improve content while preserving "
        "the author's voice. You focus on:\n"
        "1. Fixing grammatical errors and typos\n"
        "2. Improving sentence structure for better flow\n"
        "3. Ensuring consistent tone throughout\n"
        "4. Making complex ideas more accessible\n"
        "You make targeted improvements, not wholesale rewrites."
    ),
    llm=ChatOpenAI(model="gpt-3.5-turbo"),
    verbose=True,
)
```
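To verify the refinement, swap the improved editor into the agent set and re-run the pipeline so the new traces and evaluation scores can be compared against the earlier baseline. A sketch, reusing the `create_agents` and `create_tasks` helpers defined above:

```python
from crewai import Crew, Process

# Rebuild the pipeline with the refined editor (sketch) and run one topic
# to generate fresh traces for a before/after comparison.
agents = create_agents()
agents["editor"] = editor_improved

tasks = create_tasks(agents, topic="The Future of AI in Healthcare")
crew = Crew(
    agents=list(agents.values()),
    tasks=tasks,
    process=Process.sequential,
    verbose=True,
)
result = crew.kickoff()
```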