This cookbook shows you how to systematically evaluate AI agent decision-making—measuring whether agents call the right tools, make appropriate escalation decisions, and complete multi-step workflows correctly.

Open in Google Colab

Run the complete notebook in your browser

What You’ll Learn

  • Tool Selection Evaluation: Validate that agents call the correct tools for each query type
  • Escalation Accuracy: Measure over-escalation and under-escalation rates
  • Workflow Completion: Verify that multi-step tasks are completed in the correct order
  • Custom Agent Evaluators: Build domain-specific evaluators for your agent's behavior
Prerequisites:
  • Python >=3.10, <3.14 or Node.js 18+
  • OpenAI API key
  • Netra API key (Get started here)
  • A test dataset with expected outputs

Why Evaluate Agent Decisions?

Agent evaluation differs from simple LLM evaluation. Agents make multi-step decisions that compound:
| Dimension | What to Measure | Why It Matters |
| --- | --- | --- |
| Tool Selection | Did it call the right tools? | Wrong tools = wrong answers |
| Tool Sequence | Did it call tools in the right order? | Order matters for workflows |
| Completion | Did it resolve the query? | Premature stops frustrate users |
| Escalation Accuracy | Did it escalate appropriately? | Over/under-escalation impacts operations |
A 95% accurate tool selection across 3 steps means only 86% of full workflows succeed (0.95^3). Evaluation catches these compounding errors.
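
The arithmetic is worth making explicit; a quick check in plain Python:

# Per-step accuracy compounds across a multi-step workflow
per_step_accuracy = 0.95
steps = 3
workflow_success = per_step_accuracy ** steps
print(f"{workflow_success:.1%}")  # 85.7% -- roughly 86% of 3-step workflows succeed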

Creating Test Datasets

Test Case Structure for Agents

Agent test cases need more structure than simple LLM tests. Include:
{
  "query": "I want a refund for order ORD-12345, the item was damaged",
  "expected_tools": ["check_order_status", "process_refund"],
  "forbidden_tools": ["escalate_to_human"],
  "should_escalate": false,
  "query_category": "refund",
  "expected_outcome": "Refund initiated successfully"
}

Test Categories

Build test cases across different agent behaviors:
| Category | Example Queries | Expected Behavior |
| --- | --- | --- |
| Single-tool | "What is your return policy?" | One tool call (search_kb) |
| Multi-tool | "Check ticket TKT-001 and the related order" | Sequence of tools |
| Escalation | "I'm furious! Need to speak to a manager!" | Escalate to human |
| No-tool | "Thanks for your help!" | Direct response, no tools |
| Edge cases | "Refund order that doesn't exist" | Appropriate error handling |
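
If you draft test cases in code before adding them to Netra, a minimal sketch like the one below covers a few of these categories using the field names from the structure above. The tool names (search_kb, lookup_ticket, check_order_status, escalate_to_human) are the same illustrative ones used throughout this cookbook; substitute your agent's actual tools.

# Illustrative test cases, one per category; field names match the test case structure above
test_cases = [
    {
        "query": "What is your return policy?",
        "expected_tools": ["search_kb"],
        "forbidden_tools": ["escalate_to_human"],
        "should_escalate": False,
        "query_category": "policy",
        "expected_outcome": "Return policy explained",
    },
    {
        "query": "Check ticket TKT-001 and the related order",
        "expected_tools": ["lookup_ticket", "check_order_status"],
        "forbidden_tools": ["escalate_to_human"],
        "should_escalate": False,
        "query_category": "order_status",
        "expected_outcome": "Ticket and related order summarized",
    },
    {
        "query": "I'm furious! Need to speak to a manager!",
        "expected_tools": ["escalate_to_human"],
        "forbidden_tools": [],
        "should_escalate": True,
        "query_category": "escalation",
        "expected_outcome": "Escalated to a human operator",
    },
]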

Creating the Dataset in Netra

  1. Navigate to Evaluation → Datasets
  2. Click Create Dataset
  3. Add test cases with:
    • query: The user input
    • expected_tools: List of tools that should be called
    • forbidden_tools: Tools that should not be called
    • should_escalate: Boolean for escalation expectation
    • expected_outcome: Description of correct outcome
Start with 30-50 test cases covering each category. Expand as you discover new edge cases in production.

Defining Agent Evaluators

Create evaluators in Evaluation → Evaluators → Add Evaluator.

Tool Correctness Evaluator

Use the Tool Correctness template to validate tool selection:
| Setting | Value |
| --- | --- |
| Template | Tool Correctness |
| Output Type | Numerical |
| Pass Criteria | score >= 0.8 |
This evaluator checks:
  • Were all expected tools called?
  • Were forbidden tools avoided?
  • Was the sequence correct (if order matters)?

Escalation Accuracy Evaluator

Create a Code Evaluator for escalation decisions:
| Setting | Value |
| --- | --- |
| Type | Code Evaluator |
| Output Type | Numerical |
| Pass Criteria | score >= 0.8 |
// Evaluator for escalation accuracy
function handler(input, output, expectedOutput) {
    const shouldEscalate = expectedOutput?.should_escalate || false;

    // Check if the agent escalated
    const outputLower = output.toLowerCase();
    const didEscalate = outputLower.includes("escalate") ||
                        outputLower.includes("human operator") ||
                        outputLower.includes("specialist will contact") ||
                        outputLower.includes("transferring");

    // Score based on correct escalation decision
    if (shouldEscalate === didEscalate) {
        return 1; // Correct decision
    } else if (shouldEscalate && !didEscalate) {
        return 0; // False negative - missed escalation (severe)
    } else {
        return 0.5; // False positive - over-escalation (less severe)
    }
}

Workflow Completion Evaluator

Create a code evaluator that validates multi-step workflow completion:
// Evaluator for workflow completion
function handler(input, output, expectedOutput) {
    const requiredSteps = expectedOutput?.required_steps || [];

    if (requiredSteps.length === 0) {
        return 1; // No specific steps required
    }

    const outputLower = output.toLowerCase();
    let completedSteps = 0;

    for (const step of requiredSteps) {
        // Check if evidence of step completion exists in output
        if (outputLower.includes(step.toLowerCase())) {
            completedSteps++;
        }
    }

    return completedSteps / requiredSteps.length;
}
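
Note that this evaluator expects a required_steps field on the test case, which is not part of the minimal structure shown earlier, and that each step is matched as a substring of the agent's final response. A hypothetical refund test case would therefore list evidence phrases you expect in that response rather than tool names:

{
  "query": "I want a refund for order ORD-12345, the item was damaged",
  "required_steps": ["ORD-12345", "refund", "initiated"],
  "expected_outcome": "Refund initiated successfully"
}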

Query Category Accuracy

Evaluate whether the agent handles different query types correctly:
// Evaluator for query category handling
function handler(input, output, expectedOutput) {
    const category = expectedOutput?.query_category;
    const outputLower = output.toLowerCase();

    const categorySignals = {
        "refund": ["refund", "credited", "processed"],
        "order_status": ["status", "tracking", "delivered", "shipped"],
        "policy": ["policy", "days", "return", "terms"],
        "escalation": ["escalate", "specialist", "human", "contact you"],
    };

    const signals = categorySignals[category] || [];

    if (signals.length === 0) {
        return 1; // Unknown category, pass by default
    }

    // Check if any category signals appear in output
    const hasSignal = signals.some(signal => outputLower.includes(signal));

    return hasSignal ? 1 : 0;
}

Running Agent Evaluations

Setup

Ensure your agent is configured with Netra tracing:
from netra import Netra
from netra.instrumentation.instruments import InstrumentSet

Netra.init(
    app_name="agent-evaluation",
    environment="evaluation",
    trace_content=True,
    instruments={InstrumentSet.OPENAI, InstrumentSet.LANGCHAIN},
)

# Your agent setup (assumes `tools` and `system_prompt` are defined elsewhere)
from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent

model = ChatOpenAI(model="gpt-4o-mini", temperature=0)
agent = create_react_agent(model, tools, prompt=system_prompt)

def run_agent(query: str) -> str:
    """Execute agent and return response."""
    result = agent.invoke({"messages": [{"role": "user", "content": query}]})
    return result["messages"][-1].content
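
Before running the full suite, a quick single-query smoke test confirms that tracing and the agent loop are wired up correctly:

# Single-query sanity check before the full evaluation run
print(run_agent("What is your return policy?"))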

Running the Evaluation Suite

# Get dataset from Netra
DATASET_ID = "your-agent-test-dataset-id"

dataset = Netra.evaluation.get_dataset(DATASET_ID)

Netra.evaluation.run_test_suite(
    name="Agent Decision Evaluation",
    data=dataset,
    task=lambda eval_input: run_agent(eval_input["query"])
)

print("Evaluation complete! View results in Netra dashboard.")
Netra.shutdown()

Analyzing Results

Interpreting Evaluation Metrics

After running evaluations, review results in Evaluation → Test Runs:
| Metric | Good | Warning | Action Needed |
| --- | --- | --- | --- |
| Tool Correctness | > 90% | 80-90% | < 80% |
| Escalation Accuracy | > 95% | 90-95% | < 90% |
| Workflow Completion | > 85% | 75-85% | < 75% |

Common Failure Patterns

| Low Score In | Likely Cause | How to Fix |
| --- | --- | --- |
| Tool Correctness | Ambiguous tool descriptions | Add clearer docstrings, few-shot examples |
| Escalation (false negatives) | Missing urgency signals | Add more escalation triggers to prompt |
| Escalation (false positives) | Over-cautious agent | Narrow escalation criteria |
| Workflow Completion | Premature termination | Add explicit completion checks |

Debugging with Traces

Click View Trace on any failed test case to see:
  • The exact reasoning steps (thought → action → observation)
  • Which tools were called and in what order
  • The full prompt and response at each iteration
  • Where the agent deviated from expected behavior

Improving Agent Performance

Prompt Engineering Based on Evaluation

Use evaluation insights to improve your agent prompt:
# Before: Vague escalation guidance
system_prompt = """
Escalate complex issues to human operators.
"""

# After: Specific escalation criteria from evaluation failures
system_prompt = """
Escalate to human operators when ANY of these conditions are met:
- User expresses frustration (words like "ridiculous", "unacceptable", "furious")
- User has been waiting more than 2 weeks
- User explicitly asks to speak to a human
- The issue involves policy exceptions
- You cannot resolve the issue after 2 tool calls

Do NOT escalate for:
- Simple FAQ questions
- Routine order status checks
- Standard refund requests within policy
"""

Tool Description Improvements

If tool selection is failing, improve tool descriptions:
# Before: Generic description
@tool
def search_kb(query: str) -> str:
    """Search the knowledge base."""
    ...

# After: Specific, actionable description
@tool
def search_kb(query: str) -> str:
    """Search the knowledge base for policy, procedure, or FAQ information.

    Use this tool when:
    - User asks about return/refund policies
    - User asks about shipping times or costs
    - User asks general "how to" questions

    Do NOT use for:
    - Order-specific questions (use check_order_status)
    - Ticket lookups (use lookup_ticket)
    """
    ...

Continuous Evaluation

Set up regular evaluation runs:
  1. On prompt changes — Re-run full test suite
  2. Weekly regression tests — Catch drift in agent behavior
  3. After tool additions — Ensure new tools don’t disrupt existing behavior
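
For the weekly regression runs, a scheduled CI job is one simple option. The sketch below assumes GitHub Actions, a hypothetical run_agent_evals.py script that wraps the test-suite code from the Setup section, and OPENAI_API_KEY / NETRA_API_KEY stored as repository secrets (the secret names are assumptions); adapt it to whatever scheduler you already use.

# Hypothetical weekly regression workflow (GitHub Actions)
name: agent-regression
on:
  schedule:
    - cron: "0 6 * * 1"   # every Monday at 06:00 UTC
  workflow_dispatch: {}    # allow manual runs after prompt or tool changes

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -r requirements.txt
      - run: python run_agent_evals.py   # wraps the run_test_suite call shown earlier
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          NETRA_API_KEY: ${{ secrets.NETRA_API_KEY }}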

Summary

You’ve learned how to systematically evaluate agent decision-making:
  • Tool Correctness validates that agents call the right tools
  • Escalation Accuracy catches over/under-escalation
  • Workflow Completion ensures multi-step tasks succeed
  • Trace debugging reveals exactly where agents go wrong

Key Takeaways

  1. Agent errors compound—95% accuracy per step means 86% for 3-step workflows
  2. Test cases need structure: expected tools, forbidden tools, escalation flags
  3. Custom code evaluators handle domain-specific logic
  4. Use evaluation insights to improve prompts and tool descriptions
