This cookbook walks you through adding complete observability to a CrewAI multi-agent pipeline—tracing agent-to-agent handoffs, measuring individual agent performance, and evaluating content quality at each stage.

Open in Google Colab to run the complete notebook in your browser.
All company names (ContentCraft) and scenarios in this cookbook are entirely fictional and used for demonstration purposes only.

What You’ll Learn

This cookbook guides you through 5 key stages of building an observable multi-agent system:

  1. Building the content creation pipeline with CrewAI
  2. Adding observability with Netra
  3. Running content creation experiments across model configurations
  4. Evaluating content quality at each stage
  5. Analyzing multi-agent performance and iterating on agent prompts

Prerequisites

  • A Netra API key
  • An OpenAI API key
  • A Python environment with pip for installing the packages below

High-Level Concepts

Why Trace Multi-Agent Systems?

Multi-agent systems introduce complexity that single-agent workflows don’t have:
| Failure Mode | Symptom | What Tracing Reveals |
| --- | --- | --- |
| Agent bottleneck | Pipeline slow | Which agent takes longest |
| Handoff failure | Context lost | Message content between agents |
| Cost explosion | Budget exceeded | Which agent uses most tokens |
| Quality degradation | Poor output | Where quality drops in the pipeline |
| Model mismatch | Inconsistent results | Which model for which role |
Without per-agent visibility, you can’t optimize individual roles or identify where the pipeline breaks down.

CrewAI Architecture

CrewAI organizes multi-agent work into three components:
| Component | Description | Example |
| --- | --- | --- |
| Agent | Autonomous unit with role, goal, backstory | Research Specialist, Content Writer |
| Task | Work item with description and expected output | “Research the topic”, “Write the draft” |
| Crew | Team of agents executing tasks | Content creation team |
Processes:
  • Sequential: Tasks execute one after another (A → B → C)
  • Hierarchical: Manager agent delegates to workers
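
As a point of reference, here is a minimal sketch of choosing between the two processes when assembling a crew. It assumes the agents and tasks built later in this cookbook, and the hierarchical branch relies on CrewAI’s manager_llm option:
from crewai import Agent, Crew, Process, Task
from langchain_openai import ChatOpenAI

def build_crew(agents: list[Agent], tasks: list[Task], hierarchical: bool = False) -> Crew:
    """Assemble a crew using either CrewAI process type."""
    if hierarchical:
        # A manager LLM plans the work and delegates tasks to the worker agents
        return Crew(
            agents=agents,
            tasks=tasks,
            process=Process.hierarchical,
            manager_llm=ChatOpenAI(model="gpt-4o"),
        )
    # Sequential: tasks run in declaration order, each output feeding the next task
    return Crew(agents=agents, tasks=tasks, process=Process.sequential)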

ContentCraft Scenario

ContentCraft is a fictional AI content agency that uses a 4-agent pipeline to create SEO-optimized blog articles:
Topic Input
        │
        ▼
┌─────────────────┐
│   Researcher    │ ──► Research findings
│   (GPT-4)       │
└─────────────────┘
        │
        ▼
┌─────────────────┐
│    Writer       │ ──► Draft article
│ (GPT-4/GPT-3.5) │
└─────────────────┘
        │
        ▼
┌─────────────────┐
│    Editor       │ ──► Polished article
│ (GPT-3.5/GPT-4) │
└─────────────────┘
        │
        ▼
┌─────────────────┐
│  SEO Specialist │ ──► Final optimized article
│   (GPT-3.5)     │
└─────────────────┘
        │
        ▼
Published Content
Each agent specializes in one task, and the output of one becomes the input for the next.

Building the Content Pipeline

Let’s build the multi-agent crew first, then add tracing and evaluation.

Installation

Install the required packages:
pip install netra-sdk crewai crewai-tools openai langchain-openai

Environment Setup

Configure your API keys:
export NETRA_API_KEY="your-netra-api-key"
export OPENAI_API_KEY="your-openai-api-key"
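
If you are working in a notebook instead of a shell, one minimal alternative is to load the keys from a local .env file with python-dotenv (an extra dependency, not part of the install command above):
import os
from dotenv import load_dotenv  # pip install python-dotenv

# Load NETRA_API_KEY and OPENAI_API_KEY from a .env file in the working directory
load_dotenv()

assert os.getenv("NETRA_API_KEY"), "NETRA_API_KEY is not set"
assert os.getenv("OPENAI_API_KEY"), "OPENAI_API_KEY is not set"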

Define the Agents

Create specialized agents with distinct roles:
from crewai import Agent
from langchain_openai import ChatOpenAI

def create_agents(config: dict = None):
    """Create the content team agents with configurable models."""
    config = config or {
        "researcher": "gpt-4o",
        "writer": "gpt-4o",
        "editor": "gpt-3.5-turbo",
        "seo": "gpt-3.5-turbo",
    }

    researcher = Agent(
        role="Research Specialist",
        goal="Gather accurate facts, statistics, and expert opinions for the article",
        backstory=(
            "You are an expert researcher with 10 years of experience in content research. "
            "You excel at finding reliable sources, key statistics, and expert quotes "
            "that make articles authoritative and engaging."
        ),
        llm=ChatOpenAI(model=config["researcher"]),
        verbose=True,
    )

    writer = Agent(
        role="Content Writer",
        goal="Write engaging, well-structured blog articles that inform and captivate readers",
        backstory=(
            "You are a professional copywriter with expertise in creating compelling content. "
            "You know how to structure articles with clear introductions, informative body sections, "
            "and memorable conclusions that drive engagement."
        ),
        llm=ChatOpenAI(model=config["writer"]),
        verbose=True,
    )

    editor = Agent(
        role="Quality Editor",
        goal="Polish articles for clarity, grammar, flow, and readability",
        backstory=(
            "You are a senior editor with a keen eye for detail. "
            "You improve sentence structure, fix grammatical errors, enhance flow, "
            "and ensure the content is clear and professional."
        ),
        llm=ChatOpenAI(model=config["editor"]),
        verbose=True,
    )

    seo_specialist = Agent(
        role="SEO Optimizer",
        goal="Optimize content for search engines without sacrificing readability",
        backstory=(
            "You are an SEO expert who balances keyword optimization with user experience. "
            "You add meta descriptions, optimize headings, suggest internal links, "
            "and ensure content ranks well while remaining engaging."
        ),
        llm=ChatOpenAI(model=config["seo"]),
        verbose=True,
    )

    return {
        "researcher": researcher,
        "writer": writer,
        "editor": editor,
        "seo": seo_specialist,
    }

Define the Tasks

Create tasks that chain together with dependencies:
from crewai import Task

def create_tasks(agents: dict, topic: str):
    """Create the content pipeline tasks."""

    research_task = Task(
        description=(
            f"Research the topic: '{topic}'. "
            "Find 5-7 key facts, relevant statistics, and expert opinions. "
            "Include sources where possible. Focus on accuracy and relevance."
        ),
        expected_output=(
            "A research brief containing:\n"
            "- Key facts and statistics\n"
            "- Expert opinions or quotes\n"
            "- Source references\n"
            "- Main themes to cover"
        ),
        agent=agents["researcher"],
    )

    writing_task = Task(
        description=(
            "Write a 800-1000 word blog article based on the research provided. "
            "Include:\n"
            "- An engaging introduction that hooks the reader\n"
            "- 3-4 body sections with clear subheadings\n"
            "- A conclusion with key takeaways\n"
            "Format the article in markdown."
        ),
        expected_output="A draft blog article in markdown format with introduction, body sections, and conclusion",
        agent=agents["writer"],
        context=[research_task],
    )

    editing_task = Task(
        description=(
            "Edit the article for:\n"
            "- Grammar and spelling errors\n"
            "- Sentence structure and flow\n"
            "- Clarity and readability\n"
            "- Consistent tone and style\n"
            "Make improvements while preserving the author's voice."
        ),
        expected_output="A polished blog article with improved clarity, grammar, and flow",
        agent=agents["editor"],
        context=[writing_task],
    )

    seo_task = Task(
        description=(
            "Optimize the article for SEO:\n"
            "- Add a compelling meta description (150-160 characters)\n"
            "- Optimize the title and headings for keywords\n"
            "- Suggest 3-5 target keywords\n"
            "- Ensure proper heading hierarchy (H1, H2, H3)\n"
            "- Add a suggested slug for the URL\n"
            "Return the optimized article with SEO metadata."
        ),
        expected_output=(
            "SEO-optimized article with:\n"
            "- Meta description\n"
            "- Target keywords\n"
            "- Optimized headings\n"
            "- Suggested URL slug"
        ),
        agent=agents["seo"],
        context=[editing_task],
    )

    return [research_task, writing_task, editing_task, seo_task]

Create the Crew

Assemble the agents and tasks into a crew:
from crewai import Crew, Process

def create_content_crew(config: dict = None):
    """Create a content creation crew with the specified configuration."""
    agents = create_agents(config)

    # Tasks will be created when running
    return {
        "agents": agents,
        "config": config or {
            "researcher": "gpt-4o",
            "writer": "gpt-4o",
            "editor": "gpt-3.5-turbo",
            "seo": "gpt-3.5-turbo",
        },
    }

def run_content_crew(crew_data: dict, topic: str):
    """Execute the content creation pipeline."""
    agents = crew_data["agents"]
    tasks = create_tasks(agents, topic)

    crew = Crew(
        agents=list(agents.values()),
        tasks=tasks,
        process=Process.sequential,
        verbose=True,
    )

    result = crew.kickoff()
    return result

Test the Basic Pipeline

Verify the pipeline works before adding tracing:
# Create crew with default configuration
crew_data = create_content_crew()

# Run a test article
result = run_content_crew(
    crew_data,
    topic="The Future of AI in Healthcare"
)

print("Article created successfully!")
print(result.raw[:500] + "...")

Adding Observability with Netra

Now let’s instrument the pipeline for full observability.

Initialize Netra with CrewAI Instrumentation

Netra provides auto-instrumentation for CrewAI that captures agent execution automatically:
from netra import Netra
from netra.instrumentation.instruments import InstrumentSet

# Initialize Netra with CrewAI and OpenAI instrumentation
Netra.init(
    app_name="contentcraft",
    instruments=set([InstrumentSet.CREWAI, InstrumentSet.OPENAI]),
    trace_content=True,
)
With auto-instrumentation enabled, Netra automatically captures:
  • Agent execution spans with role and backstory
  • Task execution with descriptions and outputs
  • LLM calls with prompts, completions, and token usage
  • Cost calculations per agent

Tracing the Pipeline with Decorators

For more control, wrap your pipeline execution with the @workflow decorator:
from netra.decorators import workflow

@workflow(name="content-pipeline")
def create_article(topic: str, config_name: str = "default", config: dict = None):
    """Run the content creation pipeline with full tracing."""

    # Set custom attributes for filtering and analysis
    Netra.set_custom_attributes(key="topic", value=topic)
    Netra.set_custom_attributes(key="config_name", value=config_name)

    # Create and run the crew
    crew_data = create_content_crew(config)
    result = run_content_crew(crew_data, topic)

    return {
        "topic": topic,
        "config": config_name,
        "output": result.raw,
        "token_usage": getattr(result, "token_usage", None),
    }

Adding Custom Span Attributes

Track additional metadata for each pipeline run:
from netra import Netra, SpanType

@workflow(name="content-pipeline-detailed")
def create_article_detailed(topic: str, config_name: str, config: dict):
    """Run pipeline with detailed custom tracing."""

    with Netra.start_span("pipeline-setup") as setup_span:
        setup_span.set_attribute("topic", topic)
        setup_span.set_attribute("config_name", config_name)
        setup_span.set_attribute("model.researcher", config["researcher"])
        setup_span.set_attribute("model.writer", config["writer"])
        setup_span.set_attribute("model.editor", config["editor"])
        setup_span.set_attribute("model.seo", config["seo"])

        crew_data = create_content_crew(config)

    with Netra.start_span("pipeline-execution", as_type=SpanType.AGENT) as exec_span:
        result = run_content_crew(crew_data, topic)
        exec_span.set_attribute("output_length", len(result.raw))

    return {
        "topic": topic,
        "config": config_name,
        "output": result.raw,
    }

Viewing Multi-Agent Traces

After running the pipeline, navigate to Observability → Traces in Netra. You’ll see the full agent execution flow:
Netra trace view showing multi-agent pipeline with Researcher, Writer, Editor, and SEO agent spans
The trace shows:
  • Pipeline span: Overall execution time
  • Agent spans: Each agent’s task execution
  • LLM calls: Nested under each agent with prompts and completions
  • Token usage: Per-agent and total

Running Content Creation Experiments

Test different model configurations to find the optimal cost/quality balance.

Configuration Definitions

# Model configurations to test
CONFIGS = {
    "premium": {
        "name": "Premium (All GPT-4)",
        "researcher": "gpt-4o",
        "writer": "gpt-4o",
        "editor": "gpt-4o",
        "seo": "gpt-4o",
    },
    "budget": {
        "name": "Budget (Mixed)",
        "researcher": "gpt-4o",
        "writer": "gpt-4o",
        "editor": "gpt-3.5-turbo",
        "seo": "gpt-3.5-turbo",
    },
    "economy": {
        "name": "Economy (Minimal GPT-4)",
        "researcher": "gpt-4o",
        "writer": "gpt-3.5-turbo",
        "editor": "gpt-3.5-turbo",
        "seo": "gpt-3.5-turbo",
    },
}

Experiment 1: Premium Config (All GPT-4)

print("Running Premium configuration (all GPT-4)...")

result_premium = create_article(
    topic="The Future of AI in Healthcare",
    config_name="premium-all-gpt4",
    config=CONFIGS["premium"],
)

print(f"Premium article created: {len(result_premium['output'])} characters")

Experiment 2: Budget Config (Mixed Models)

print("Running Budget configuration (mixed models)...")

result_budget = create_article(
    topic="The Future of AI in Healthcare",
    config_name="budget-mixed",
    config=CONFIGS["budget"],
)

print(f"Budget article created: {len(result_budget['output'])} characters")

Experiment 3: Economy Config (Minimal GPT-4)

print("Running Economy configuration (minimal GPT-4)...")

result_economy = create_article(
    topic="The Future of AI in Healthcare",
    config_name="economy-minimal",
    config=CONFIGS["economy"],
)

print(f"Economy article created: {len(result_economy['output'])} characters")

Comparing Costs Across Configurations

After running all configurations, compare the costs in the Netra dashboard:
| Config | Researcher | Writer | Editor | SEO | Total Cost |
| --- | --- | --- | --- | --- | --- |
| Premium | ~$0.05 | ~$0.08 | ~$0.04 | ~$0.02 | ~$0.19 |
| Budget | ~$0.05 | ~$0.08 | ~$0.01 | ~$0.005 | ~$0.145 |
| Economy | ~$0.05 | ~$0.02 | ~$0.01 | ~$0.005 | ~$0.085 |
Per-agent cost breakdown showing token usage and costs for each configuration

Running Multiple Topics

Test across multiple topics to get statistically meaningful results:
TEST_TOPICS = [
    "The Future of AI in Healthcare",
    "Remote Work Best Practices for 2026",
    "Sustainable Investing for Beginners",
    "How to Build a Personal Brand Online",
    "The Rise of Electric Vehicles",
]

def run_config_comparison():
    """Run all configurations across all topics."""
    results = []

    for config_key, config in CONFIGS.items():
        for topic in TEST_TOPICS:
            print(f"Running {config_key} for: {topic[:30]}...")

            result = create_article(
                topic=topic,
                config_name=config_key,
                config=config,
            )

            results.append({
                "config": config_key,
                "topic": topic,
                "output": result["output"],
                "output_length": len(result["output"]),
            })

    return results

# Run comparison (this will take a while)
# comparison_results = run_config_comparison()

Evaluating Content Quality

Measuring cost isn’t enough—you need to ensure quality doesn’t degrade.

Why Evaluate Multi-Agent Output?

Each stage can introduce or fix quality issues:
| Stage | What to Evaluate | Why It Matters |
| --- | --- | --- |
| Research | Accuracy, coverage | Foundation for the article |
| Draft | Coherence, engagement | Reader experience |
| Edited | Improvement delta | Editor effectiveness |
| Final | SEO score, readability | Publication readiness |

Creating Evaluators

In Netra, navigate to Evaluation → Evaluators to create custom evaluators.

Writer Quality Evaluator (LLM as Judge)

Use the Answer Correctness template with a custom prompt:
Evaluate the quality of this blog article draft.

Score from 0 to 1 based on:
- Coherence: Does the article flow logically? (0.3)
- Coverage: Does it cover the topic comprehensively? (0.3)
- Engagement: Is it interesting to read? (0.2)
- Structure: Does it have clear intro, body, conclusion? (0.2)

Article:
{output}

Topic:
{input}
Set Pass Criteria to >= 0.7.
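
The evaluator itself runs inside Netra, but if you want to sanity-check the rubric locally first, here is a rough sketch of the same judge using the OpenAI client directly. The judge_writer_quality helper and the numeric-only response format are assumptions for illustration, not part of Netra:
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """Evaluate the quality of this blog article draft.

Score from 0 to 1 based on:
- Coherence: Does the article flow logically? (0.3)
- Coverage: Does it cover the topic comprehensively? (0.3)
- Engagement: Is it interesting to read? (0.2)
- Structure: Does it have clear intro, body, conclusion? (0.2)

Respond with only the numeric score.

Article:
{output}

Topic:
{input}"""

def judge_writer_quality(topic: str, article: str) -> float:
    """Score a draft against the writer-quality rubric; returns a value in [0, 1]."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(output=article, input=topic)}],
        temperature=0,
    )
    return float(response.choices[0].message.content.strip())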

Editor Effectiveness Evaluator (Code Evaluator)

Create a Code Evaluator to measure if editing improved the content:
function handler(input, output, expectedOutput) {
    // input.draft = the original draft
    // output = the edited version

    const draft = input?.draft || "";
    const edited = output || "";

    // No changes = no improvement
    if (draft === edited) {
        return 0;
    }

    // Calculate similarity (simple length-based heuristic)
    const lengthRatio = edited.length / Math.max(draft.length, 1);

    // Edited should stay within a similar length (roughly 0.7x to 1.3x)
    if (lengthRatio < 0.7 || lengthRatio > 1.3) {
        return 0.5; // Major length change might indicate issues
    }

    // Check for improvements
    let score = 0.5; // Base score for making changes

    // Reward similar length (tight editing)
    if (lengthRatio >= 0.9 && lengthRatio <= 1.1) {
        score += 0.3;
    }

    // Reward if output is valid markdown
    if (edited.includes("# ") || edited.includes("## ")) {
        score += 0.2;
    }

    return Math.min(score, 1);
}

SEO Score Evaluator (Code Evaluator)

Measure SEO optimization quality:
function handler(input, output, expectedOutput) {
    const article = output || "";
    const keywords = expectedOutput?.keywords || [];

    let score = 0;

    // Check for meta description (30%)
    const hasMetaDesc = article.toLowerCase().includes("meta description") ||
                        article.toLowerCase().includes("meta:") ||
                        article.includes("**Meta Description**");
    if (hasMetaDesc) {
        score += 0.3;
    }

    // Check for keyword usage (40%)
    if (keywords.length > 0) {
        const articleLower = article.toLowerCase();
        const keywordsFound = keywords.filter(kw =>
            articleLower.includes(kw.toLowerCase())
        );
        score += (keywordsFound.length / keywords.length) * 0.4;
    } else {
        score += 0.2; // Partial credit if no keywords specified
    }

    // Check for proper heading structure (30%)
    const hasH1 = /^# /m.test(article);    // heading at the start of a line
    const hasH2 = /^## /m.test(article);
    const hasH3 = /^### /m.test(article);

    if (hasH1) score += 0.1;
    if (hasH2) score += 0.1;
    if (hasH3) score += 0.1;

    return Math.min(score, 1);
}

End-to-End Quality Evaluator (LLM as Judge)

Evaluate the final article holistically:
Evaluate this blog article for publication readiness.

Score from 0 to 1 based on:
- Informativeness: Does it provide valuable information? (0.25)
- Readability: Is it easy to read and understand? (0.25)
- SEO Optimization: Does it have proper structure and keywords? (0.25)
- Professionalism: Is it publication-ready? (0.25)

Article:
{output}

Original Topic:
{input}

Creating Test Dataset

Define test cases with expected attributes:
TEST_CASES = [
    {
        "id": "TC-001",
        "topic": "The Future of AI in Healthcare",
        "target_keywords": ["AI healthcare", "medical AI", "diagnosis", "treatment"],
        "expected_sections": ["introduction", "benefits", "challenges", "future"],
    },
    {
        "id": "TC-002",
        "topic": "Remote Work Best Practices for 2026",
        "target_keywords": ["remote work", "productivity", "work from home", "hybrid"],
        "expected_sections": ["introduction", "tools", "communication", "conclusion"],
    },
    {
        "id": "TC-003",
        "topic": "Sustainable Investing for Beginners",
        "target_keywords": ["ESG investing", "sustainable", "green funds", "portfolio"],
        "expected_sections": ["introduction", "what is ESG", "how to start", "risks"],
    },
]

Running Quality Comparison

Execute all configurations and collect results for evaluation:
def run_quality_evaluation():
    """Run all test cases across all configurations."""
    results = []

    for config_key, config in CONFIGS.items():
        for test_case in TEST_CASES:
            print(f"Running {config_key} for {test_case['id']}...")

            result = create_article(
                topic=test_case["topic"],
                config_name=f"{config_key}-{test_case['id']}",
                config=config,
            )

            results.append({
                "test_id": test_case["id"],
                "config": config_key,
                "topic": test_case["topic"],
                "output": result["output"],
                "keywords": test_case["target_keywords"],
            })

    return results

# Run evaluation
# evaluation_results = run_quality_evaluation()

Viewing Quality vs. Cost Matrix

After running evaluations, view the results in Evaluation → Experiments:
Quality vs. cost comparison showing scores for each configuration
Example results:
| Config | Quality Score | SEO Score | Total Cost | Cost per Quality Point |
| --- | --- | --- | --- | --- |
| Premium | 0.92 | 0.88 | $0.19 | $0.21 |
| Budget | 0.88 | 0.85 | $0.145 | $0.16 |
| Economy | 0.78 | 0.82 | $0.085 | $0.11 |
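
The last column is simply total cost divided by quality score; a quick sketch for reproducing it from your own runs:
# Figures taken from the example table above
runs = {
    "premium": {"quality": 0.92, "cost": 0.19},
    "budget": {"quality": 0.88, "cost": 0.145},
    "economy": {"quality": 0.78, "cost": 0.085},
}

for name, run in runs.items():
    cost_per_quality = run["cost"] / run["quality"]
    print(f"{name}: ${cost_per_quality:.2f} per quality point")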

Analyzing Multi-Agent Performance

Use traces to identify optimization opportunities.

Per-Agent Metrics

Track these metrics for each agent:
| Metric | How to Measure | What It Reveals |
| --- | --- | --- |
| Task Duration | Span latency | Bottleneck identification |
| Token Usage | LLM token counts | Cost driver |
| Output Length | Character count | Content volume |
| Quality Score | Evaluation results | Output value |
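
How you export these numbers depends on your trace backend, so the sketch below assumes the spans have already been pulled into plain dicts (the field names and figures are hypothetical) and shows the aggregation plus a simple bottleneck check:
from collections import defaultdict

# Hypothetical per-task span records exported from the trace backend
spans = [
    {"agent": "Research Specialist", "duration_s": 18.2, "tokens": 2100, "output_chars": 1800},
    {"agent": "Content Writer", "duration_s": 41.7, "tokens": 3900, "output_chars": 5200},
    {"agent": "Quality Editor", "duration_s": 22.4, "tokens": 2600, "output_chars": 5100},
    {"agent": "SEO Optimizer", "duration_s": 15.1, "tokens": 1700, "output_chars": 5400},
]

# Aggregate duration, tokens, and output size per agent
totals = defaultdict(lambda: {"duration_s": 0.0, "tokens": 0, "output_chars": 0})
for span in spans:
    agg = totals[span["agent"]]
    agg["duration_s"] += span["duration_s"]
    agg["tokens"] += span["tokens"]
    agg["output_chars"] += span["output_chars"]

# The slowest agent is the first candidate for optimization
bottleneck = max(totals, key=lambda agent: totals[agent]["duration_s"])
print(f"Bottleneck agent: {bottleneck}")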

Identifying Bottlenecks

Use the trace view to answer:
  1. Which agent takes longest? → Target for optimization
  2. Which agent uses most tokens? → Consider cheaper model
  3. Where does quality drop? → Improve prompts or upgrade model

Optimization Strategies

| Problem | Symptom | Solution |
| --- | --- | --- |
| Writer too slow | High latency on writing task | Use faster model or shorter prompts |
| Editor not improving | Low edit delta score | Improve editor prompts or upgrade model |
| SEO weak | Missing meta/keywords | Add more specific SEO instructions |
| Research shallow | Low quality scores | Keep GPT-4 for researcher |
| High total cost | Budget exceeded | Downgrade non-critical agents (editor, SEO) |

Iterating on Agent Prompts

Based on trace analysis, refine agent backstories and task descriptions:
# Example: Improved editor agent after observing poor edit quality
editor_improved = Agent(
    role="Quality Editor",
    goal="Polish articles for clarity, grammar, flow, and readability",
    backstory=(
        "You are a senior editor with 15 years of experience at major publications. "
        "You have a keen eye for detail and always improve content while preserving "
        "the author's voice. You focus on:\n"
        "1. Fixing grammatical errors and typos\n"
        "2. Improving sentence structure for better flow\n"
        "3. Ensuring consistent tone throughout\n"
        "4. Making complex ideas more accessible\n"
        "You make targeted improvements, not wholesale rewrites."
    ),
    llm=ChatOpenAI(model="gpt-3.5-turbo"),
    verbose=True,
)

Summary

Key Takeaways

  1. Multi-agent systems need per-agent visibility to identify bottlenecks and cost drivers
  2. Cost allocation by role reveals which agents benefit from premium models
  3. Quality evaluation at each stage catches degradation before it compounds
  4. Configuration experiments find the optimal cost/quality balance for your use case
  5. Trace analysis enables data-driven prompt optimization

What You Built

  • 4-agent content creation pipeline with CrewAI (Researcher → Writer → Editor → SEO)
  • Full observability with agent handoff tracing
  • Per-agent cost and performance tracking
  • Quality evaluators for each stage (Writer, Editor, SEO, End-to-End)
  • Configuration comparison framework (Premium, Budget, Economy)
