This cookbook walks you through adding complete observability and evaluation to a LangChain ReAct agent—tracing each step of the reasoning loop, capturing tool invocations with latency breakdowns, and measuring tool selection accuracy.

Open in Google Colab

Run the complete notebook in your browser
All company names (TaskBot, ShopFlow) and scenarios in this cookbook are entirely fictional and used for demonstration purposes only.

What You’ll Learn

This cookbook guides you through five key stages of building an observable LangChain agent: building the TaskBot agent, adding observability with Netra, running sample requests, evaluating agent performance, and analyzing results to iterate.

Prerequisites

  • Python 3.9+
  • OpenAI API key
  • Netra API key (Get started here)
  • LangChain installed

High-Level Concepts

Why Trace Agents?

Unlike simple LLM calls, agents involve multi-step reasoning that can fail in subtle ways:
| Failure Mode | Symptom | What Tracing Reveals |
| --- | --- | --- |
| Wrong tool selection | Agent uses incorrect tool | Tool call sequence, decision reasoning |
| Infinite loops | Agent repeats actions | Iteration count, repeated patterns |
| Hallucinated tools | Agent calls non-existent tool | Tool names vs. available tools |
| Premature termination | Agent stops before completion | Final state, missing steps |
| Over-escalation | Agent escalates simple queries | Escalation triggers, query classification |
Without visibility into the reasoning loop, debugging these failures requires guesswork.
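As a concrete example of what trace data makes possible, here is a minimal sketch (not part of the Netra SDK; the helper name is illustrative) that flags a likely infinite loop by counting repeated (tool, input) pairs in a recorded call sequence:

```python
from collections import Counter

def detect_repeated_calls(tool_calls: list[tuple[str, str]], threshold: int = 3) -> bool:
    """Flag a likely loop when the same (tool, input) pair repeats too often.

    tool_calls: sequence of (tool_name, tool_input) pairs pulled from a trace.
    """
    counts = Counter(tool_calls)
    return any(count >= threshold for count in counts.values())

# A healthy trace: each call appears once
assert not detect_repeated_calls(
    [("search_kb", "return policy"), ("check_order_status", "ORD-12345")]
)

# A looping trace: the agent keeps retrying the same failed lookup
assert detect_repeated_calls([("lookup_ticket", "TKT-404")] * 3)
```

The same pattern extends to the other failure modes: hallucinated tools are call names absent from the registered tool list, and premature termination shows up as a final answer with zero tool calls where the test case expected some.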

The ReAct Pattern

ReAct (Reasoning + Acting) agents follow an iterative loop:
┌─────────────────────────────────────────────────────┐
│                    User Query                        │
└─────────────────────┬───────────────────────────────┘
                      │
                      ▼
         ┌────────────────────────┐
         │   Thought: Reason      │◄──────────────┐
         │   about what to do     │               │
         └───────────┬────────────┘               │
                     │                            │
                     ▼                            │
         ┌────────────────────────┐               │
         │   Action: Select and   │               │
         │   invoke a tool        │               │
         └───────────┬────────────┘               │
                     │                            │
                     ▼                            │
         ┌────────────────────────┐               │
         │   Observation: Get     │───────────────┘
         │   tool result          │    (loop until done)
         └───────────┬────────────┘
                      │
                      ▼
         ┌────────────────────────┐
         │   Final Answer         │
         └────────────────────────┘
Netra captures each iteration as nested spans, giving you visibility into the agent’s decision-making process.
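To make the loop concrete, here is a stripped-down sketch of the ReAct control flow with the LLM replaced by a stub policy function. This is purely illustrative of the pattern, not how LangChain implements it:

```python
def react_loop(query: str, tools: dict, policy, max_iterations: int = 5) -> str:
    """Minimal ReAct skeleton: think, act, observe, repeat until done."""
    observations = []
    for _ in range(max_iterations):
        # Thought + Action: the policy decides the next step from the history
        action, action_input = policy(query, observations)
        if action == "final_answer":
            return action_input
        # Observation: run the selected tool and feed the result back
        observations.append(tools[action](action_input))
    return "Stopped: max iterations reached"

# Stub tool and a hard-coded two-step policy for demonstration
tools = {"search_kb": lambda q: "Returns accepted within 30 days."}

def policy(query, observations):
    if not observations:
        return ("search_kb", "return policy")
    return ("final_answer", observations[-1])

result = react_loop("What is your return policy?", tools, policy)
assert result == "Returns accepted within 30 days."
```

In the real agent, `policy` is an LLM call that parses the Thought/Action/Observation transcript; the `max_iterations` guard is the same safety valve you will see later on the `AgentExecutor`.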

TaskBot Scenario

TaskBot is a fictional AI assistant for ShopFlow, an e-commerce platform. It handles user queries using five tools:
| Tool | Description | When to Use |
| --- | --- | --- |
| lookup_ticket | Retrieve ticket details by ID | User references a ticket number |
| search_kb | Search knowledge base | General product/policy questions |
| check_order_status | Get order status and tracking | Order-related inquiries |
| process_refund | Initiate a refund | Refund requests (with validation) |
| escalate_to_human | Transfer to human operator | Complex issues, urgent requests |

Building the TaskBot Agent

Let’s build the ReAct agent first, then add tracing and evaluation.

Installation

Install the required packages:
pip install netra-sdk langchain langchain-openai openai

Environment Setup

Configure your API keys:
export NETRA_API_KEY="your-netra-api-key"
export OPENAI_API_KEY="your-openai-api-key"

Mock Data

First, let’s define mock data that our tools will operate on:
from typing import Dict, List

# Mock ticket database
TICKETS: Dict[str, dict] = {
    "TKT-001": {
        "id": "TKT-001",
        "subject": "Return policy question",
        "status": "open",
        "created_at": "2026-01-15T10:30:00Z",
        "order_id": None,
        "priority": "low",
    },
    "TKT-002": {
        "id": "TKT-002",
        "subject": "Damaged item received",
        "status": "open",
        "created_at": "2026-01-20T14:15:00Z",
        "order_id": "ORD-12345",
        "priority": "high",
    },
    "TKT-003": {
        "id": "TKT-003",
        "subject": "Urgent: 3 week delay",
        "status": "open",
        "created_at": "2026-01-10T09:00:00Z",
        "order_id": "ORD-99999",
        "priority": "critical",
    },
}

# Mock order database
ORDERS: Dict[str, dict] = {
    "ORD-12345": {
        "id": "ORD-12345",
        "status": "delivered",
        "items": ["Wireless Headphones"],
        "total": 79.99,
        "tracking_number": "1Z999AA10123456784",
        "delivered_at": "2026-01-18T16:30:00Z",
    },
    "ORD-99999": {
        "id": "ORD-99999",
        "status": "processing",
        "items": ["Gaming Monitor"],
        "total": 349.99,
        "tracking_number": None,
        "estimated_ship_date": "2026-02-01T00:00:00Z",
    },
}

# Mock knowledge base
KNOWLEDGE_BASE: List[dict] = [
    {
        "title": "Return Policy",
        "content": "Items can be returned within 30 days of delivery for a full refund. "
                   "Items must be in original packaging and unused condition. "
                   "Electronics must include all accessories.",
    },
    {
        "title": "Refund Processing",
        "content": "Refunds are processed within 5-7 business days after we receive "
                   "the returned item. Refunds are credited to the original payment method.",
    },
    {
        "title": "Shipping Times",
        "content": "Standard shipping: 5-7 business days. Express shipping: 2-3 business days. "
                   "Processing time is 1-2 business days before shipping.",
    },
    {
        "title": "Damaged Items",
        "content": "If you received a damaged item, please contact us within 48 hours. "
                   "We will arrange a replacement or full refund including shipping costs.",
    },
]

Define the Tools

Create LangChain tools with proper type annotations and docstrings:
from langchain.tools import tool

@tool
def lookup_ticket(ticket_id: str) -> str:
    """Look up a ticket by its ID to get details about the issue.

    Args:
        ticket_id: The ticket ID (e.g., TKT-001)

    Returns:
        Ticket details including subject, status, priority, and associated order
    """
    ticket = TICKETS.get(ticket_id.upper())
    if not ticket:
        return f"No ticket found with ID: {ticket_id}"

    return (
        f"Ticket {ticket['id']}:\n"
        f"  Subject: {ticket['subject']}\n"
        f"  Status: {ticket['status']}\n"
        f"  Priority: {ticket['priority']}\n"
        f"  Created: {ticket['created_at']}\n"
        f"  Associated Order: {ticket['order_id'] or 'None'}"
    )

@tool
def search_kb(query: str) -> str:
    """Search the knowledge base for information about policies, procedures, or FAQs.

    Args:
        query: The search query (e.g., "return policy", "shipping times")

    Returns:
        Relevant knowledge base articles matching the query
    """
    query_lower = query.lower()
    results = []

    for article in KNOWLEDGE_BASE:
        if (query_lower in article["title"].lower() or
            query_lower in article["content"].lower()):
            results.append(f"**{article['title']}**\n{article['content']}")

    if not results:
        return "No relevant articles found. Try different search terms."

    return "\n\n".join(results)

@tool
def check_order_status(order_id: str) -> str:
    """Check the status of an order including shipping and tracking information.

    Args:
        order_id: The order ID (e.g., ORD-12345)

    Returns:
        Order status, items, tracking number, and delivery information
    """
    order = ORDERS.get(order_id.upper())
    if not order:
        return f"No order found with ID: {order_id}"

    status_info = f"Order {order['id']}:\n"
    status_info += f"  Status: {order['status']}\n"
    status_info += f"  Items: {', '.join(order['items'])}\n"
    status_info += f"  Total: ${order['total']:.2f}\n"

    if order.get("tracking_number"):
        status_info += f"  Tracking: {order['tracking_number']}\n"
    if order.get("delivered_at"):
        status_info += f"  Delivered: {order['delivered_at']}\n"
    if order.get("estimated_ship_date"):
        status_info += f"  Est. Ship Date: {order['estimated_ship_date']}\n"

    return status_info

@tool
def process_refund(order_id: str, reason: str) -> str:
    """Process a refund for an order. Only use after verifying the order status.

    Args:
        order_id: The order ID to refund
        reason: The reason for the refund

    Returns:
        Confirmation of refund initiation or error message
    """
    order = ORDERS.get(order_id.upper())
    if not order:
        return f"Cannot process refund: No order found with ID {order_id}"

    if order["status"] not in ["delivered", "shipped"]:
        return f"Cannot process refund: Order status is '{order['status']}'. Refunds are only available for shipped or delivered orders."

    return (
        f"Refund initiated for Order {order_id}:\n"
        f"  Amount: ${order['total']:.2f}\n"
        f"  Reason: {reason}\n"
        f"  Status: Processing\n"
        f"  Expected completion: 5-7 business days\n"
        f"  Refund will be credited to original payment method."
    )

@tool
def escalate_to_human(ticket_id: str, reason: str) -> str:
    """Escalate a ticket to a human operator. Use for complex issues or urgent requests.

    Args:
        ticket_id: The ticket ID to escalate
        reason: The reason for escalation

    Returns:
        Confirmation of escalation
    """
    ticket = TICKETS.get(ticket_id.upper()) if ticket_id else None

    return (
        f"Ticket escalated to human operator:\n"
        f"  Ticket ID: {ticket_id or 'New ticket created'}\n"
        f"  Reason: {reason}\n"
        f"  Priority: Urgent\n"
        f"  Expected response: Within 1 hour\n"
        f"  A specialist will contact you shortly."
    )
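One behavior worth knowing before testing: search_kb uses plain case-insensitive substring matching, so a short phrase matches an article but a full sentence passed verbatim will not. The agent therefore relies on the LLM distilling the user's question into search terms. A sketch of the matching rule in isolation:

```python
def kb_match(query: str, title: str, content: str) -> bool:
    """Same matching rule as search_kb: case-insensitive substring check."""
    q = query.lower()
    return q in title.lower() or q in content.lower()

article = (
    "Return Policy",
    "Items can be returned within 30 days of delivery for a full refund.",
)

assert kb_match("return policy", *article)                   # short phrase matches the title
assert not kb_match("what is your return policy", *article)  # full sentence matches nothing
```

If traces show search_kb returning "No relevant articles found" for reasonable questions, this matching rule is the first place to look.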

Create the ReAct Agent

Build the agent using LangChain’s ReAct implementation:
from langchain_openai import ChatOpenAI
from langchain.agents import AgentExecutor, create_react_agent
from langchain_core.prompts import PromptTemplate

# Initialize the LLM
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# Define the tools
tools = [lookup_ticket, search_kb, check_order_status, process_refund, escalate_to_human]

# Create the ReAct prompt
react_prompt = PromptTemplate.from_template("""You are TaskBot, an AI assistant for ShopFlow e-commerce platform.

You help users with:
- Order status and tracking
- Return and refund requests
- Policy questions
- Escalating complex issues

Always be helpful and professional. Use tools to look up information before responding.
For refund requests, always check the order status first.
Escalate to human operators when: the user is frustrated, the issue is complex, or you cannot resolve it.

You have access to these tools:
{tools}

Use this format:

Question: the user's question
Thought: think about what to do
Action: the tool name (one of [{tool_names}])
Action Input: the input to the tool
Observation: the tool's output
... (repeat Thought/Action/Action Input/Observation as needed)
Thought: I now know the final answer
Final Answer: the response to the user

Begin!

Question: {input}
Thought: {agent_scratchpad}""")

# Create the agent
agent = create_react_agent(llm, tools, react_prompt)

# Create the executor
agent_executor = AgentExecutor(
    agent=agent,
    tools=tools,
    verbose=True,
    handle_parsing_errors=True,
    max_iterations=10,
    return_intermediate_steps=True,  # expose tool calls in the result for evaluation
)

Test the Basic Agent

Verify the agent works before adding tracing:
# Simple FAQ query
response = agent_executor.invoke({
    "input": "What is your return policy?"
})
print(response["output"])

# Order status query
response = agent_executor.invoke({
    "input": "Where is my order ORD-12345?"
})
print(response["output"])

Adding Observability with Netra

Now let’s instrument the agent for full observability.

Initialize Netra with LangChain Instrumentation

Netra provides auto-instrumentation for LangChain that captures agent execution automatically:
from netra import Netra
from netra.instrumentation.instruments import InstrumentSet

# Initialize Netra with LangChain and OpenAI instrumentation
Netra.init(
    app_name="taskbot",
    instruments={InstrumentSet.OPENAI, InstrumentSet.LANGCHAIN},
    trace_content=True,
)
With auto-instrumentation enabled, Netra automatically captures:
  • Agent execution spans
  • LLM calls with prompts and completions
  • Tool invocations with inputs and outputs
  • Token usage and costs

Tracing Agent Execution with Decorators

For more control, wrap your agent handler with the @agent decorator:
from typing import Optional

from netra.decorators import agent

@agent(name="taskbot-agent")
def handle_request(query: str, ticket_id: Optional[str] = None, user_id: Optional[str] = None) -> dict:
    """Handle a user request with full tracing."""

    # Set user context if provided
    if user_id:
        Netra.set_user_id(user_id)

    # Add custom attributes
    if ticket_id:
        Netra.set_custom_attributes(key="ticket_id", value=ticket_id)

    # Execute the agent
    result = agent_executor.invoke({"input": query})

    return {
        "query": query,
        "response": result["output"],
        "ticket_id": ticket_id,
    }

Tracing Tool Calls

The auto-instrumentation captures tool calls, but you can add custom tracing for business logic:
from netra import Netra, SpanType

@tool
def lookup_ticket_traced(ticket_id: str) -> str:
    """Look up a ticket with custom span attributes."""
    with Netra.start_span("ticket-lookup", as_type=SpanType.TOOL) as span:
        span.set_attribute("ticket_id", ticket_id)

        ticket = TICKETS.get(ticket_id.upper())

        if ticket:
            span.set_attribute("ticket_status", ticket["status"])
            span.set_attribute("ticket_priority", ticket["priority"])
            span.set_attribute("found", True)
        else:
            span.set_attribute("found", False)

        # Return the result
        if not ticket:
            return f"No ticket found with ID: {ticket_id}"

        return (
            f"Ticket {ticket['id']}:\n"
            f"  Subject: {ticket['subject']}\n"
            f"  Status: {ticket['status']}\n"
            f"  Priority: {ticket['priority']}"
        )

Manual Span Tracing for Custom Workflows

For fine-grained control over trace structure, use manual spans:
from typing import Optional

from netra import Netra, SpanType

def handle_complex_request(query: str, ticket_id: Optional[str] = None):
    """Handle a request with detailed manual tracing."""

    with Netra.start_span("request-handler", as_type=SpanType.AGENT) as parent_span:
        parent_span.set_attribute("query", query)
        parent_span.set_attribute("ticket_id", ticket_id)

        # Classification step
        with Netra.start_span("query-classification") as class_span:
            # Classify the query type
            query_type = classify_query(query)
            class_span.set_attribute("query_type", query_type)

        # Agent execution
        with Netra.start_span("agent-execution", as_type=SpanType.AGENT) as agent_span:
            result = agent_executor.invoke({"input": query})
            agent_span.set_attribute("iterations", len(result.get("intermediate_steps", [])))

        # Post-processing
        with Netra.start_span("response-formatting") as format_span:
            formatted_response = format_response(result["output"])
            format_span.set_attribute("response_length", len(formatted_response))

        return formatted_response

def classify_query(query: str) -> str:
    """Classify query type for routing."""
    query_lower = query.lower()
    if "refund" in query_lower:
        return "refund"
    elif "order" in query_lower or "tracking" in query_lower:
        return "order_status"
    elif "urgent" in query_lower or "help" in query_lower:
        return "escalation"
    else:
        return "general"

def format_response(response: str) -> str:
    """Format the response for the user."""
    return response.strip()

Viewing Agent Traces

After running requests, navigate to Observability → Traces in Netra. You’ll see the full agent execution flow:
Netra trace view showing nested agent spans with thought, action, and observation steps
The trace shows:
  • Parent span: The overall agent execution
  • LLM calls: Each reasoning step with prompts and completions
  • Tool calls: Each tool invocation with inputs, outputs, and latency
  • Token usage: Cumulative token counts and costs

Running Sample Requests

Let’s test the agent with different query types to see tracing in action.

Simple Query: FAQ Lookup

# Single-tool query - should use search_kb
response = handle_request(
    query="What is your return policy?",
    user_id="user-001",
)
print(response["response"])
Expected behavior: Agent uses search_kb once and returns the policy information.

Single-Tool Query: Order Status

# Order status query - should use check_order_status
response = handle_request(
    query="Where is my order ORD-12345?",
    user_id="user-002",
)
print(response["response"])
Expected behavior: Agent uses check_order_status and provides tracking information.

Multi-Step Query: Refund Request

# Multi-step workflow - should use multiple tools
response = handle_request(
    query="I want a refund for order ORD-12345, the item arrived damaged",
    ticket_id="TKT-002",
    user_id="user-003",
)
print(response["response"])
Expected behavior: Agent uses check_order_status to verify the order, then process_refund to initiate the refund.

Edge Case: Escalation Required

# Escalation scenario - should detect urgency
response = handle_request(
    query="I've been waiting 3 weeks and need urgent help! I want to speak to someone immediately!",
    ticket_id="TKT-003",
    user_id="user-004",
)
print(response["response"])
Expected behavior: Agent uses lookup_ticket to get context, then escalate_to_human due to the urgent tone.

Comparing Traces

After running these requests, compare the traces in the Netra dashboard:
Comparison of agent traces showing different tool call patterns for simple vs complex queries
Notice how:
  • Simple queries have 1-2 tool calls
  • Complex queries have multiple tool calls in sequence
  • Escalation queries show the agent’s decision-making process

Evaluating Agent Performance

Systematic evaluation ensures your agent behaves correctly across different scenarios.

Why Evaluate Agents?

Agent evaluation differs from simple LLM evaluation:
| Dimension | What to Measure | Why It Matters |
| --- | --- | --- |
| Tool Selection | Did it call the right tools? | Wrong tools = wrong answers |
| Tool Sequence | Did it call tools in the right order? | Order matters for multi-step workflows |
| Completion | Did it resolve the query? | Premature stops frustrate users |
| Escalation Accuracy | Did it escalate appropriately? | Over/under-escalation impacts operations |

Creating Test Datasets

Define test cases with expected tool calls:
TEST_CASES = [
    # Simple FAQ queries
    {
        "id": "TC-001",
        "category": "faq",
        "query": "What is your return policy?",
        "expected_tools": ["search_kb"],
        "forbidden_tools": ["process_refund", "escalate_to_human"],
        "should_escalate": False,
    },
    {
        "id": "TC-002",
        "category": "faq",
        "query": "How long does shipping take?",
        "expected_tools": ["search_kb"],
        "forbidden_tools": ["process_refund", "escalate_to_human"],
        "should_escalate": False,
    },

    # Order status queries
    {
        "id": "TC-003",
        "category": "order",
        "query": "Where is my order ORD-12345?",
        "expected_tools": ["check_order_status"],
        "forbidden_tools": ["process_refund"],
        "should_escalate": False,
    },
    {
        "id": "TC-004",
        "category": "order",
        "query": "Can you check the tracking for order ORD-12345?",
        "expected_tools": ["check_order_status"],
        "forbidden_tools": ["process_refund"],
        "should_escalate": False,
    },

    # Refund requests (multi-step)
    {
        "id": "TC-005",
        "category": "refund",
        "query": "I want a refund for order ORD-12345, item was damaged",
        "expected_tools": ["check_order_status", "process_refund"],
        "forbidden_tools": [],
        "should_escalate": False,
    },
    {
        "id": "TC-006",
        "category": "refund",
        "query": "Please process a refund for my damaged headphones, order ORD-12345",
        "expected_tools": ["check_order_status", "process_refund"],
        "forbidden_tools": [],
        "should_escalate": False,
    },

    # Escalation scenarios
    {
        "id": "TC-007",
        "category": "escalation",
        "query": "This is ridiculous! I've been waiting 3 weeks! I need to speak to someone NOW!",
        "expected_tools": ["escalate_to_human"],
        "forbidden_tools": [],
        "should_escalate": True,
    },
    {
        "id": "TC-008",
        "category": "escalation",
        "query": "I've tried everything and nothing works. I need human help.",
        "expected_tools": ["escalate_to_human"],
        "forbidden_tools": [],
        "should_escalate": True,
    },
]
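To score these cases you first need the tools the agent actually called. When the `AgentExecutor` is created with `return_intermediate_steps=True`, each entry of `result["intermediate_steps"]` is an `(AgentAction, observation)` pair, and the action's `.tool` attribute holds the tool name. A small helper (illustrative) to pull out the ordered call sequence, shown here against a mocked result:

```python
from types import SimpleNamespace

def called_tools(result: dict) -> list[str]:
    """Extract the ordered tool names from an AgentExecutor result."""
    return [action.tool for action, _observation in result.get("intermediate_steps", [])]

# Mocked result for demonstration; real entries are LangChain AgentAction objects
mock_result = {
    "output": "Refund initiated.",
    "intermediate_steps": [
        (SimpleNamespace(tool="check_order_status"), "Order ORD-12345: delivered"),
        (SimpleNamespace(tool="process_refund"), "Refund initiated"),
    ],
}

assert called_tools(mock_result) == ["check_order_status", "process_refund"]
assert called_tools({"output": "Hi!"}) == []  # no tools called
```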

Using the Tool Correctness Evaluator

Netra provides a Tool Correctness evaluator that validates tool selection. Configure it in Evaluation → Evaluators:
| Setting | Value |
| --- | --- |
| Name | Tool Selection Accuracy |
| Type | Tool Correctness |
| Pass Criteria | Score >= 0.8 |
The evaluator checks:
  • Expected tools called: Did the agent call all required tools?
  • Forbidden tools avoided: Did it avoid calling tools it shouldn’t?
  • Sequence correctness: Were tools called in the expected order?
Tool Correctness evaluator configuration in Netra
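For offline analysis, the same three checks can be sketched in plain Python against a test case and the list of tools the agent actually called. This is a hypothetical scoring scheme for illustration, not the evaluator's exact implementation:

```python
def tool_correctness(called: list[str], expected: list[str], forbidden: list[str]) -> float:
    """Score tool selection: all expected tools called in order, no forbidden tools."""
    if any(tool in called for tool in forbidden):
        return 0.0
    # Expected tools must appear as an ordered subsequence of the calls
    remaining = iter(called)
    in_order = all(tool in remaining for tool in expected)
    if not in_order:
        # Partial credit when the right tools were called but out of order
        return 0.5 if set(expected) <= set(called) else 0.0
    return 1.0

# Correct multi-step refund flow
assert tool_correctness(["check_order_status", "process_refund"],
                        ["check_order_status", "process_refund"], []) == 1.0
# Right tools, wrong order: refunded before verifying the order
assert tool_correctness(["process_refund", "check_order_status"],
                        ["check_order_status", "process_refund"], []) == 0.5
# Forbidden tool called on a plain FAQ query
assert tool_correctness(["search_kb", "process_refund"],
                        ["search_kb"], ["process_refund"]) == 0.0
```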

Creating a Code Evaluator for Escalation

For custom business logic, create a Code Evaluator to measure escalation precision:
// handler function is required
function handler(input, output, expectedOutput) {
    const shouldEscalate = expectedOutput?.should_escalate || false;

    // Check if the agent called escalate_to_human
    const outputLower = output.toLowerCase();
    const didEscalate = outputLower.includes("escalate") ||
                        outputLower.includes("human operator") ||
                        outputLower.includes("specialist will contact");

    // Score based on correct escalation decision
    if (shouldEscalate === didEscalate) {
        return 1; // Correct decision
    } else if (shouldEscalate && !didEscalate) {
        return 0; // False negative - should have escalated
    } else {
        return 0.5; // False positive - over-escalation (less severe)
    }
}
Set Output Type to Numerical and Pass Criteria to >= 0.8.
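For offline analysis of results collected in Python, the same escalation scoring can be mirrored locally (a sketch of the logic only; the hosted evaluator runs the JavaScript handler):

```python
def escalation_score(output: str, should_escalate: bool) -> float:
    """1.0 = correct decision, 0.0 = missed escalation, 0.5 = over-escalation."""
    text = output.lower()
    did_escalate = any(marker in text for marker in
                       ("escalate", "human operator", "specialist will contact"))
    if should_escalate == did_escalate:
        return 1.0
    if should_escalate and not did_escalate:
        return 0.0  # false negative: should have escalated
    return 0.5      # false positive: over-escalation (less severe)

assert escalation_score("A specialist will contact you shortly.", True) == 1.0
assert escalation_score("Your refund has been initiated.", True) == 0.0
assert escalation_score("Ticket escalated to human operator.", False) == 0.5
```

The asymmetric penalty reflects the operational trade-off from the table above: a missed escalation strands a frustrated user, while an unnecessary one only costs operator time.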

Running Evaluation Experiments

Create a script to run all test cases and collect results:
from netra import Netra
from netra.decorators import agent
import json

def run_evaluation():
    """Run all test cases and collect results."""
    results = []

    for test_case in TEST_CASES:
        print(f"Running {test_case['id']}: {test_case['query'][:50]}...")

        try:
            # Run the agent
            response = handle_request(
                query=test_case["query"],
                user_id=f"eval-{test_case['id']}",
            )

            # Collect the result
            results.append({
                "test_id": test_case["id"],
                "category": test_case["category"],
                "query": test_case["query"],
                "response": response["response"],
                "expected_tools": test_case["expected_tools"],
                "forbidden_tools": test_case["forbidden_tools"],
                "should_escalate": test_case["should_escalate"],
                "status": "success",
            })

        except Exception as e:
            results.append({
                "test_id": test_case["id"],
                "category": test_case["category"],
                "query": test_case["query"],
                "error": str(e),
                "status": "error",
            })

    return results

# Run the evaluation
evaluation_results = run_evaluation()

# Print summary
for category in ["faq", "order", "refund", "escalation"]:
    category_results = [r for r in evaluation_results if r["category"] == category]
    success_count = len([r for r in category_results if r["status"] == "success"])
    print(f"{category}: {success_count}/{len(category_results)} successful")

Viewing Evaluation Results

Navigate to Evaluation → Experiments in Netra to see the results:
Evaluation results showing pass/fail rates by category
The dashboard shows:
  • Pass rate by category: Which query types are handled correctly
  • Tool accuracy: How often the agent selects the right tools
  • Failure analysis: Which test cases failed and why

Analyzing Results and Iterating

Use traces to debug failures and improve your agent.

Using Traces to Debug Failures

When a test case fails:
  1. Find the trace: Filter by the test case user ID (e.g., eval-TC-007)
  2. Examine the reasoning: Look at the thought steps to understand the decision
  3. Check tool calls: Verify which tools were called and in what order
  4. Identify the root cause: Was it a prompt issue, tool description issue, or LLM limitation?
Example debugging flow:
# If escalation is under-triggering, examine the trace:
# 1. Look at the "Thought" span - did the agent recognize urgency?
# 2. Check if the prompt includes escalation criteria
# 3. Add explicit examples of when to escalate

# Updated prompt with better escalation guidance:
escalation_guidance = """
Escalate to human operators when ANY of these conditions are met:
- User expresses frustration (words like "ridiculous", "unacceptable", "furious")
- User has been waiting more than 2 weeks
- User explicitly asks to speak to a human
- The issue involves policy exceptions
- You cannot resolve the issue with available tools
"""
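One way to apply the fix is to splice the guidance into the prompt template text before rebuilding the agent. A sketch (variable names are illustrative; the template placeholders must survive intact for LangChain to fill them):

```python
escalation_guidance = """
Escalate to human operators when ANY of these conditions are met:
- User expresses frustration (words like "ridiculous", "unacceptable", "furious")
- User has been waiting more than 2 weeks
- User explicitly asks to speak to a human
"""

base_template = "You are TaskBot, an AI assistant for ShopFlow e-commerce platform.\n"

# Insert the guidance ahead of the tool list so the model reads it before selecting a tool
updated_template = (
    base_template
    + escalation_guidance
    + "\nYou have access to these tools:\n{tools}\n"
)

assert "ridiculous" in updated_template
assert "{tools}" in updated_template  # placeholder preserved for PromptTemplate
```

Rebuild with `PromptTemplate.from_template(updated_template)` and `create_react_agent` as before, then re-run the escalation test cases to confirm TC-007 and TC-008 now pass.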

Iterating on the Agent

After identifying issues, iterate on:
  1. Prompt engineering: Add clearer instructions for tool selection
  2. Tool descriptions: Make tool purposes more explicit
  3. Examples: Add few-shot examples for edge cases
  4. Guardrails: Add validation before certain tool calls

Summary

Key Takeaways

  1. ReAct agents need visibility into the reasoning loop—trace each thought, action, and observation
  2. Tool call tracing reveals latency bottlenecks and decision patterns
  3. Tool Correctness evaluator validates that agents call the right tools in the right order
  4. Test cases by category ensure coverage across simple, complex, and edge scenarios
  5. Trace analysis enables systematic debugging of agent failures

What You Built

  • A LangChain ReAct agent with 5 tools for e-commerce assistance
  • Full observability with Netra auto-instrumentation
  • Custom span tracing for business logic
  • Evaluation suite with tool correctness checks
  • Debugging workflow using trace analysis

Last modified on February 3, 2026