This cookbook walks you through adding complete observability to a CrewAI multi-agent pipeline—tracing agent-to-agent handoffs, measuring individual agent performance, and evaluating content quality at each stage.

Open in Google Colab to run the complete notebook in your browser.
All company names (ContentCraft) and scenarios in this cookbook are entirely fictional and used for demonstration purposes only.

What You’ll Learn

This cookbook guides you through 5 key stages of building an observable multi-agent system:

  1. Building the content creation pipeline with CrewAI
  2. Adding observability with Netra
  3. Running content creation experiments across model configurations
  4. Evaluating content quality at each stage
  5. Analyzing multi-agent performance and iterating on agent prompts

Prerequisites

  • A Netra API key
  • An OpenAI API key
  • A Python environment with pip for installing the packages below

High-Level Concepts

Why Trace Multi-Agent Systems?

Multi-agent systems introduce complexity that single-agent workflows don’t have:
| Failure Mode | Symptom | What Tracing Reveals |
| --- | --- | --- |
| Agent bottleneck | Pipeline slow | Which agent takes longest |
| Handoff failure | Context lost | Message content between agents |
| Cost explosion | Budget exceeded | Which agent uses most tokens |
| Quality degradation | Poor output | Where quality drops in the pipeline |
| Model mismatch | Inconsistent results | Which model for which role |
Without per-agent visibility, you can’t optimize individual roles or identify where the pipeline breaks down.

CrewAI Architecture

CrewAI organizes multi-agent work into three components:
| Component | Description | Example |
| --- | --- | --- |
| Agent | Autonomous unit with role, goal, backstory | Research Specialist, Content Writer |
| Task | Work item with description and expected output | “Research the topic”, “Write the draft” |
| Crew | Team of agents executing tasks | Content creation team |
Processes:
  • Sequential: Tasks execute one after another (A → B → C)
  • Hierarchical: Manager agent delegates to workers
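
As a point of reference, here is a minimal sketch of choosing between the two processes when assembling a crew. It assumes the agents and tasks built later in this cookbook, and the hierarchical branch relies on CrewAI’s manager_llm option:
from crewai import Agent, Crew, Process, Task
from langchain_openai import ChatOpenAI

def build_crew(agents: list[Agent], tasks: list[Task], hierarchical: bool = False) -> Crew:
    """Assemble a crew using either CrewAI process type."""
    if hierarchical:
        # A manager LLM plans the work and delegates tasks to the worker agents
        return Crew(
            agents=agents,
            tasks=tasks,
            process=Process.hierarchical,
            manager_llm=ChatOpenAI(model="gpt-4o"),
        )
    # Sequential: tasks run in declaration order, each output feeding the next task
    return Crew(agents=agents, tasks=tasks, process=Process.sequential)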

ContentCraft Scenario

ContentCraft is a fictional AI content agency that uses a 4-agent pipeline to create SEO-optimized blog articles:
Topic Input
        │
        ▼
┌─────────────────┐
│   Researcher    │ ──► Research findings
│   (GPT-4)       │
└─────────────────┘
        │
        ▼
┌─────────────────┐
│    Writer       │ ──► Draft article
│ (GPT-4/GPT-3.5) │
└─────────────────┘
        │
        ▼
┌─────────────────┐
│    Editor       │ ──► Polished article
│ (GPT-3.5/GPT-4) │
└─────────────────┘
        │
        ▼
┌─────────────────┐
│  SEO Specialist │ ──► Final optimized article
│   (GPT-3.5)     │
└─────────────────┘
        │
        ▼
Published Content
Each agent specializes in one task, and the output of one becomes the input for the next.

Building the Content Pipeline

Let’s build the multi-agent crew first, then add tracing and evaluation.

Installation

Install the required packages:
pip install netra-sdk crewai crewai-tools openai langchain-openai

Environment Setup

Configure your API keys:
export NETRA_API_KEY="your-netra-api-key"
export OPENAI_API_KEY="your-openai-api-key"
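
If you are working in a notebook instead of a shell, one minimal alternative is to load the keys from a local .env file with python-dotenv (an extra dependency, not part of the install command above):
import os
from dotenv import load_dotenv  # pip install python-dotenv

# Load NETRA_API_KEY and OPENAI_API_KEY from a .env file in the working directory
load_dotenv()

assert os.getenv("NETRA_API_KEY"), "NETRA_API_KEY is not set"
assert os.getenv("OPENAI_API_KEY"), "OPENAI_API_KEY is not set"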

Define the Agents

Create specialized agents with distinct roles:
from crewai import Agent
from langchain_openai import ChatOpenAI

def create_agents(config: dict = None):
    """Create the content team agents with configurable models."""
    config = config or {
        "researcher": "gpt-4o",
        "writer": "gpt-4o",
        "editor": "gpt-3.5-turbo",
        "seo": "gpt-3.5-turbo",
    }

    researcher = Agent(
        role="Research Specialist",
        goal="Gather accurate facts, statistics, and expert opinions for the article",
        backstory=(
            "You are an expert researcher with 10 years of experience in content research. "
            "You excel at finding reliable sources, key statistics, and expert quotes "
            "that make articles authoritative and engaging."
        ),
        llm=ChatOpenAI(model=config["researcher"]),
        verbose=True,
    )

    writer = Agent(
        role="Content Writer",
        goal="Write engaging, well-structured blog articles that inform and captivate readers",
        backstory=(
            "You are a professional copywriter with expertise in creating compelling content. "
            "You know how to structure articles with clear introductions, informative body sections, "
            "and memorable conclusions that drive engagement."
        ),
        llm=ChatOpenAI(model=config["writer"]),
        verbose=True,
    )

    editor = Agent(
        role="Quality Editor",
        goal="Polish articles for clarity, grammar, flow, and readability",
        backstory=(
            "You are a senior editor with a keen eye for detail. "
            "You improve sentence structure, fix grammatical errors, enhance flow, "
            "and ensure the content is clear and professional."
        ),
        llm=ChatOpenAI(model=config["editor"]),
        verbose=True,
    )

    seo_specialist = Agent(
        role="SEO Optimizer",
        goal="Optimize content for search engines without sacrificing readability",
        backstory=(
            "You are an SEO expert who balances keyword optimization with user experience. "
            "You add meta descriptions, optimize headings, suggest internal links, "
            "and ensure content ranks well while remaining engaging."
        ),
        llm=ChatOpenAI(model=config["seo"]),
        verbose=True,
    )

    return {
        "researcher": researcher,
        "writer": writer,
        "editor": editor,
        "seo": seo_specialist,
    }

Define the Tasks

Create tasks that chain together with dependencies:
from crewai import Task

def create_tasks(agents: dict, topic: str):
    """Create the content pipeline tasks."""

    research_task = Task(
        description=(
            f"Research the topic: '{topic}'. "
            "Find 5-7 key facts, relevant statistics, and expert opinions. "
            "Include sources where possible. Focus on accuracy and relevance."
        ),
        expected_output=(
            "A research brief containing:\n"
            "- Key facts and statistics\n"
            "- Expert opinions or quotes\n"
            "- Source references\n"
            "- Main themes to cover"
        ),
        agent=agents["researcher"],
    )

    writing_task = Task(
        description=(
            "Write a 800-1000 word blog article based on the research provided. "
            "Include:\n"
            "- An engaging introduction that hooks the reader\n"
            "- 3-4 body sections with clear subheadings\n"
            "- A conclusion with key takeaways\n"
            "Format the article in markdown."
        ),
        expected_output="A draft blog article in markdown format with introduction, body sections, and conclusion",
        agent=agents["writer"],
        context=[research_task],
    )

    editing_task = Task(
        description=(
            "Edit the article for:\n"
            "- Grammar and spelling errors\n"
            "- Sentence structure and flow\n"
            "- Clarity and readability\n"
            "- Consistent tone and style\n"
            "Make improvements while preserving the author's voice."
        ),
        expected_output="A polished blog article with improved clarity, grammar, and flow",
        agent=agents["editor"],
        context=[writing_task],
    )

    seo_task = Task(
        description=(
            "Optimize the article for SEO:\n"
            "- Add a compelling meta description (150-160 characters)\n"
            "- Optimize the title and headings for keywords\n"
            "- Suggest 3-5 target keywords\n"
            "- Ensure proper heading hierarchy (H1, H2, H3)\n"
            "- Add a suggested slug for the URL\n"
            "Return the optimized article with SEO metadata."
        ),
        expected_output=(
            "SEO-optimized article with:\n"
            "- Meta description\n"
            "- Target keywords\n"
            "- Optimized headings\n"
            "- Suggested URL slug"
        ),
        agent=agents["seo"],
        context=[editing_task],
    )

    return [research_task, writing_task, editing_task, seo_task]

Create the Crew

Assemble the agents and tasks into a crew:
from crewai import Crew, Process

def create_content_crew(config: dict = None):
    """Create a content creation crew with the specified configuration."""
    agents = create_agents(config)

    # Tasks will be created when running
    return {
        "agents": agents,
        "config": config or {
            "researcher": "gpt-4o",
            "writer": "gpt-4o",
            "editor": "gpt-3.5-turbo",
            "seo": "gpt-3.5-turbo",
        },
    }

def run_content_crew(crew_data: dict, topic: str):
    """Execute the content creation pipeline."""
    agents = crew_data["agents"]
    tasks = create_tasks(agents, topic)

    crew = Crew(
        agents=list(agents.values()),
        tasks=tasks,
        process=Process.sequential,
        verbose=True,
    )

    result = crew.kickoff()
    return result

Test the Basic Pipeline

Verify the pipeline works before adding tracing:
# Create crew with default configuration
crew_data = create_content_crew()

# Run a test article
result = run_content_crew(
    crew_data,
    topic="The Future of AI in Healthcare"
)

print("Article created successfully!")
print(result.raw[:500] + "...")

Adding Observability with Netra

Now let’s instrument the pipeline for full observability.

Initialize Netra with CrewAI Instrumentation

Netra provides auto-instrumentation for CrewAI that captures agent execution automatically:
from netra import Netra
from netra.instrumentation.instruments import InstrumentSet

# Initialize Netra with CrewAI and OpenAI instrumentation
Netra.init(
    app_name="contentcraft",
    instruments=set([InstrumentSet.CREWAI, InstrumentSet.OPENAI]),
    trace_content=True,
)
With auto-instrumentation enabled, Netra automatically captures:
  • Agent execution spans with role and backstory
  • Task execution with descriptions and outputs
  • LLM calls with prompts, completions, and token usage
  • Cost calculations per agent

Tracing the Pipeline with Decorators

For more control, wrap your pipeline execution with the @workflow decorator:
from netra.decorators import workflow

@workflow(name="content-pipeline")
def create_article(topic: str, config_name: str = "default", config: dict = None):
    """Run the content creation pipeline with full tracing."""

    # Set custom attributes for filtering and analysis
    Netra.set_custom_attributes(key="topic", value=topic)
    Netra.set_custom_attributes(key="config_name", value=config_name)

    # Create and run the crew
    crew_data = create_content_crew(config)
    result = run_content_crew(crew_data, topic)

    return {
        "topic": topic,
        "config": config_name,
        "output": result.raw,
        "token_usage": getattr(result, "token_usage", None),
    }

Adding Custom Span Attributes

Track additional metadata for each pipeline run:
from netra import Netra, SpanType

@workflow(name="content-pipeline-detailed")
def create_article_detailed(topic: str, config_name: str, config: dict):
    """Run pipeline with detailed custom tracing."""

    with Netra.start_span("pipeline-setup") as setup_span:
        setup_span.set_attribute("topic", topic)
        setup_span.set_attribute("config_name", config_name)
        setup_span.set_attribute("model.researcher", config["researcher"])
        setup_span.set_attribute("model.writer", config["writer"])
        setup_span.set_attribute("model.editor", config["editor"])
        setup_span.set_attribute("model.seo", config["seo"])

        crew_data = create_content_crew(config)

    with Netra.start_span("pipeline-execution", as_type=SpanType.AGENT) as exec_span:
        result = run_content_crew(crew_data, topic)
        exec_span.set_attribute("output_length", len(result.raw))

    return {
        "topic": topic,
        "config": config_name,
        "output": result.raw,
    }

Viewing Multi-Agent Traces

After running the pipeline, navigate to Observability → Traces in Netra. You’ll see the full agent execution flow:
Netra trace view showing multi-agent pipeline with Researcher, Writer, Editor, and SEO agent spans
The trace shows:
  • Pipeline span: Overall execution time
  • Agent spans: Each agent’s task execution
  • LLM calls: Nested under each agent with prompts and completions
  • Token usage: Per-agent and total

Running Content Creation Experiments

Test different model configurations to find the optimal cost/quality balance.

Configuration Definitions

# Model configurations to test
CONFIGS = {
    "premium": {
        "name": "Premium (All GPT-4)",
        "researcher": "gpt-4o",
        "writer": "gpt-4o",
        "editor": "gpt-4o",
        "seo": "gpt-4o",
    },
    "budget": {
        "name": "Budget (Mixed)",
        "researcher": "gpt-4o",
        "writer": "gpt-4o",
        "editor": "gpt-3.5-turbo",
        "seo": "gpt-3.5-turbo",
    },
    "economy": {
        "name": "Economy (Minimal GPT-4)",
        "researcher": "gpt-4o",
        "writer": "gpt-3.5-turbo",
        "editor": "gpt-3.5-turbo",
        "seo": "gpt-3.5-turbo",
    },
}

Experiment 1: Premium Config (All GPT-4)

print("Running Premium configuration (all GPT-4)...")

result_premium = create_article(
    topic="The Future of AI in Healthcare",
    config_name="premium-all-gpt4",
    config=CONFIGS["premium"],
)

print(f"Premium article created: {len(result_premium['output'])} characters")

Experiment 2: Budget Config (Mixed Models)

print("Running Budget configuration (mixed models)...")

result_budget = create_article(
    topic="The Future of AI in Healthcare",
    config_name="budget-mixed",
    config=CONFIGS["budget"],
)

print(f"Budget article created: {len(result_budget['output'])} characters")

Experiment 3: Economy Config (Minimal GPT-4)

print("Running Economy configuration (minimal GPT-4)...")

result_economy = create_article(
    topic="The Future of AI in Healthcare",
    config_name="economy-minimal",
    config=CONFIGS["economy"],
)

print(f"Economy article created: {len(result_economy['output'])} characters")

Comparing Costs Across Configurations

After running all configurations, compare the costs in the Netra dashboard:
| Config | Researcher | Writer | Editor | SEO | Total Cost |
| --- | --- | --- | --- | --- | --- |
| Premium | ~$0.05 | ~$0.08 | ~$0.04 | ~$0.02 | ~$0.19 |
| Budget | ~$0.05 | ~$0.08 | ~$0.01 | ~$0.005 | ~$0.145 |
| Economy | ~$0.05 | ~$0.02 | ~$0.01 | ~$0.005 | ~$0.085 |
Per-agent cost breakdown showing token usage and costs for each configuration

Running Multiple Topics

Test across multiple topics to get statistically meaningful results:
TEST_TOPICS = [
    "The Future of AI in Healthcare",
    "Remote Work Best Practices for 2026",
    "Sustainable Investing for Beginners",
    "How to Build a Personal Brand Online",
    "The Rise of Electric Vehicles",
]

def run_config_comparison():
    """Run all configurations across all topics."""
    results = []

    for config_key, config in CONFIGS.items():
        for topic in TEST_TOPICS:
            print(f"Running {config_key} for: {topic[:30]}...")

            result = create_article(
                topic=topic,
                config_name=config_key,
                config=config,
            )

            results.append({
                "config": config_key,
                "topic": topic,
                "output": result["output"],
                "output_length": len(result["output"]),
            })

    return results

# Run comparison (this will take a while)
# comparison_results = run_config_comparison()

Evaluating Content Quality

Measuring cost isn’t enough—you need to ensure quality doesn’t degrade.

Why Evaluate Multi-Agent Output?

Each stage can introduce or fix quality issues:
| Stage | What to Evaluate | Why It Matters |
| --- | --- | --- |
| Research | Accuracy, coverage | Foundation for the article |
| Draft | Coherence, engagement | Reader experience |
| Edited | Improvement delta | Editor effectiveness |
| Final | SEO score, readability | Publication readiness |

Creating Evaluators

In Netra, navigate to Evaluation → Evaluators to create custom evaluators.

Writer Quality Evaluator (LLM as Judge)

Use the Answer Correctness template with a custom prompt:
Evaluate the quality of this blog article draft.

Score from 0 to 1 based on:
- Coherence: Does the article flow logically? (0.3)
- Coverage: Does it cover the topic comprehensively? (0.3)
- Engagement: Is it interesting to read? (0.2)
- Structure: Does it have clear intro, body, conclusion? (0.2)

Article:
{output}

Topic:
{input}
Set Pass Criteria to >= 0.7.
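
The evaluator itself runs inside Netra, but if you want to sanity-check the rubric locally first, here is a rough sketch of the same judge using the OpenAI client directly. The judge_writer_quality helper and the numeric-only response format are assumptions for illustration, not part of Netra:
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """Evaluate the quality of this blog article draft.

Score from 0 to 1 based on:
- Coherence: Does the article flow logically? (0.3)
- Coverage: Does it cover the topic comprehensively? (0.3)
- Engagement: Is it interesting to read? (0.2)
- Structure: Does it have clear intro, body, conclusion? (0.2)

Respond with only the numeric score.

Article:
{output}

Topic:
{input}"""

def judge_writer_quality(topic: str, article: str) -> float:
    """Score a draft against the writer-quality rubric; returns a value in [0, 1]."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(output=article, input=topic)}],
        temperature=0,
    )
    return float(response.choices[0].message.content.strip())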

Editor Effectiveness Evaluator (Code Evaluator)

Create a Code Evaluator to measure if editing improved the content:
function handler(input, output, expectedOutput) {
    // input.draft = the original draft
    // output = the edited version

    const draft = input?.draft || "";
    const edited = output || "";

    // No changes = no improvement
    if (draft === edited) {
        return 0;
    }

    // Calculate similarity (simple length-based heuristic)
    const lengthRatio = edited.length / Math.max(draft.length, 1);

    // Edited should stay within a similar length (roughly 0.7x to 1.3x)
    if (lengthRatio < 0.7 || lengthRatio > 1.3) {
        return 0.5; // Major length change might indicate issues
    }

    // Check for improvements
    let score = 0.5; // Base score for making changes

    // Reward similar length (tight editing)
    if (lengthRatio >= 0.9 && lengthRatio <= 1.1) {
        score += 0.3;
    }

    // Reward if output is valid markdown
    if (edited.includes("# ") || edited.includes("## ")) {
        score += 0.2;
    }

    return Math.min(score, 1);
}

SEO Score Evaluator (Code Evaluator)

Measure SEO optimization quality:
function handler(input, output, expectedOutput) {
    const article = output || "";
    const keywords = expectedOutput?.keywords || [];

    let score = 0;

    // Check for meta description (30%)
    const hasMetaDesc = article.toLowerCase().includes("meta description") ||
                        article.toLowerCase().includes("meta:") ||
                        article.includes("**Meta Description**");
    if (hasMetaDesc) {
        score += 0.3;
    }

    // Check for keyword usage (40%)
    if (keywords.length > 0) {
        const articleLower = article.toLowerCase();
        const keywordsFound = keywords.filter(kw =>
            articleLower.includes(kw.toLowerCase())
        );
        score += (keywordsFound.length / keywords.length) * 0.4;
    } else {
        score += 0.2; // Partial credit if no keywords specified
    }

    // Check for proper heading structure (30%)
    const hasH1 = /^# /m.test(article);    // heading at the start of a line
    const hasH2 = /^## /m.test(article);
    const hasH3 = /^### /m.test(article);

    if (hasH1) score += 0.1;
    if (hasH2) score += 0.1;
    if (hasH3) score += 0.1;

    return Math.min(score, 1);
}

End-to-End Quality Evaluator (LLM as Judge)

Evaluate the final article holistically:
Evaluate this blog article for publication readiness.

Score from 0 to 1 based on:
- Informativeness: Does it provide valuable information? (0.25)
- Readability: Is it easy to read and understand? (0.25)
- SEO Optimization: Does it have proper structure and keywords? (0.25)
- Professionalism: Is it publication-ready? (0.25)

Article:
{output}

Original Topic:
{input}

Creating Test Dataset

Define test cases with expected attributes:
TEST_CASES = [
    {
        "id": "TC-001",
        "topic": "The Future of AI in Healthcare",
        "target_keywords": ["AI healthcare", "medical AI", "diagnosis", "treatment"],
        "expected_sections": ["introduction", "benefits", "challenges", "future"],
    },
    {
        "id": "TC-002",
        "topic": "Remote Work Best Practices for 2026",
        "target_keywords": ["remote work", "productivity", "work from home", "hybrid"],
        "expected_sections": ["introduction", "tools", "communication", "conclusion"],
    },
    {
        "id": "TC-003",
        "topic": "Sustainable Investing for Beginners",
        "target_keywords": ["ESG investing", "sustainable", "green funds", "portfolio"],
        "expected_sections": ["introduction", "what is ESG", "how to start", "risks"],
    },
]

Running Quality Comparison

Execute all configurations and collect results for evaluation:
def run_quality_evaluation():
    """Run all test cases across all configurations."""
    results = []

    for config_key, config in CONFIGS.items():
        for test_case in TEST_CASES:
            print(f"Running {config_key} for {test_case['id']}...")

            result = create_article(
                topic=test_case["topic"],
                config_name=f"{config_key}-{test_case['id']}",
                config=config,
            )

            results.append({
                "test_id": test_case["id"],
                "config": config_key,
                "topic": test_case["topic"],
                "output": result["output"],
                "keywords": test_case["target_keywords"],
            })

    return results

# Run evaluation
# evaluation_results = run_quality_evaluation()

Viewing Quality vs. Cost Matrix

After running evaluations, view the results in Evaluation → Experiments:
Quality vs. cost comparison showing scores for each configuration
Example results:
| Config | Quality Score | SEO Score | Total Cost | Cost per Quality Point |
| --- | --- | --- | --- | --- |
| Premium | 0.92 | 0.88 | $0.19 | $0.21 |
| Budget | 0.88 | 0.85 | $0.145 | $0.16 |
| Economy | 0.78 | 0.82 | $0.085 | $0.11 |
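
The last column is simply total cost divided by quality score; a quick sketch for reproducing it from your own runs:
# Figures taken from the example table above
runs = {
    "premium": {"quality": 0.92, "cost": 0.19},
    "budget": {"quality": 0.88, "cost": 0.145},
    "economy": {"quality": 0.78, "cost": 0.085},
}

for name, run in runs.items():
    cost_per_quality = run["cost"] / run["quality"]
    print(f"{name}: ${cost_per_quality:.2f} per quality point")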

Analyzing Multi-Agent Performance

Use traces to identify optimization opportunities.

Per-Agent Metrics

Track these metrics for each agent:
| Metric | How to Measure | What It Reveals |
| --- | --- | --- |
| Task Duration | Span latency | Bottleneck identification |
| Token Usage | LLM token counts | Cost driver |
| Output Length | Character count | Content volume |
| Quality Score | Evaluation results | Output value |
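
How you export these numbers depends on your trace backend, so the sketch below assumes the spans have already been pulled into plain dicts (the field names and figures are hypothetical) and shows the aggregation plus a simple bottleneck check:
from collections import defaultdict

# Hypothetical per-task span records exported from the trace backend
spans = [
    {"agent": "Research Specialist", "duration_s": 18.2, "tokens": 2100, "output_chars": 1800},
    {"agent": "Content Writer", "duration_s": 41.7, "tokens": 3900, "output_chars": 5200},
    {"agent": "Quality Editor", "duration_s": 22.4, "tokens": 2600, "output_chars": 5100},
    {"agent": "SEO Optimizer", "duration_s": 15.1, "tokens": 1700, "output_chars": 5400},
]

# Aggregate duration, tokens, and output size per agent
totals = defaultdict(lambda: {"duration_s": 0.0, "tokens": 0, "output_chars": 0})
for span in spans:
    agg = totals[span["agent"]]
    agg["duration_s"] += span["duration_s"]
    agg["tokens"] += span["tokens"]
    agg["output_chars"] += span["output_chars"]

# The slowest agent is the first candidate for optimization
bottleneck = max(totals, key=lambda agent: totals[agent]["duration_s"])
print(f"Bottleneck agent: {bottleneck}")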

Identifying Bottlenecks

Use the trace view to answer:
  1. Which agent takes longest? → Target for optimization
  2. Which agent uses most tokens? → Consider cheaper model
  3. Where does quality drop? → Improve prompts or upgrade model

Optimization Strategies

| Problem | Symptom | Solution |
| --- | --- | --- |
| Writer too slow | High latency on writing task | Use faster model or shorter prompts |
| Editor not improving | Low edit delta score | Improve editor prompts or upgrade model |
| SEO weak | Missing meta/keywords | Add more specific SEO instructions |
| Research shallow | Low quality scores | Keep GPT-4 for researcher |
| High total cost | Budget exceeded | Downgrade non-critical agents (editor, SEO) |

Iterating on Agent Prompts

Based on trace analysis, refine agent backstories and task descriptions:
# Example: Improved editor agent after observing poor edit quality
editor_improved = Agent(
    role="Quality Editor",
    goal="Polish articles for clarity, grammar, flow, and readability",
    backstory=(
        "You are a senior editor with 15 years of experience at major publications. "
        "You have a keen eye for detail and always improve content while preserving "
        "the author's voice. You focus on:\n"
        "1. Fixing grammatical errors and typos\n"
        "2. Improving sentence structure for better flow\n"
        "3. Ensuring consistent tone throughout\n"
        "4. Making complex ideas more accessible\n"
        "You make targeted improvements, not wholesale rewrites."
    ),
    llm=ChatOpenAI(model="gpt-3.5-turbo"),
    verbose=True,
)

Summary

Key Takeaways

  1. Multi-agent systems need per-agent visibility to identify bottlenecks and cost drivers
  2. Cost allocation by role reveals which agents benefit from premium models
  3. Quality evaluation at each stage catches degradation before it compounds
  4. Configuration experiments find the optimal cost/quality balance for your use case
  5. Trace analysis enables data-driven prompt optimization

What You Built

  • 4-agent content creation pipeline with CrewAI (Researcher → Writer → Editor → SEO)
  • Full observability with agent handoff tracing
  • Per-agent cost and performance tracking
  • Quality evaluators for each stage (Writer, Editor, SEO, End-to-End)
  • Configuration comparison framework (Premium, Budget, Economy)
