This cookbook shows you how to systematically evaluate AI agent decision-making—measuring whether agents call the right tools, make appropriate escalation decisions, and complete multi-step workflows correctly.

Open in Google Colab

Run the complete notebook in your browser

What You’ll Learn

  • Tool Selection Evaluation: Validate that agents call the correct tools for each query type
  • Escalation Accuracy: Measure over-escalation and under-escalation rates
  • Workflow Completion: Verify that multi-step tasks are completed in the correct order
  • Custom Agent Evaluators: Build domain-specific evaluators for your agent's behavior
Prerequisites:
  • Python >=3.10, <3.14 or Node.js 18+
  • OpenAI API key
  • Netra API key (Get started here)
  • A test dataset with expected outputs

Why Evaluate Agent Decisions?

Agent evaluation differs from simple LLM evaluation. Agents make multi-step decisions that compound:
| Dimension | What to Measure | Why It Matters |
| --- | --- | --- |
| Tool Selection | Did it call the right tools? | Wrong tools = wrong answers |
| Tool Sequence | Did it call tools in the right order? | Order matters for workflows |
| Completion | Did it resolve the query? | Premature stops frustrate users |
| Escalation Accuracy | Did it escalate appropriately? | Over/under-escalation impacts operations |
A 95% accurate tool selection across 3 steps means only 86% of full workflows succeed (0.95^3). Evaluation catches these compounding errors.
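
The arithmetic is worth making explicit; a quick check in plain Python:

# Per-step accuracy compounds across a multi-step workflow
per_step_accuracy = 0.95
steps = 3
workflow_success = per_step_accuracy ** steps
print(f"{workflow_success:.1%}")  # 85.7% -- roughly 86% of 3-step workflows succeed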

Creating Test Datasets

Test Case Structure for Agents

Agent test cases need more structure than simple LLM tests. Include:
{
  "query": "I want a refund for order ORD-12345, the item was damaged",
  "expected_tools": ["check_order_status", "process_refund"],
  "forbidden_tools": ["escalate_to_human"],
  "should_escalate": false,
  "query_category": "refund",
  "expected_outcome": "Refund initiated successfully"
}

Test Categories

Build test cases across different agent behaviors:
| Category | Example Queries | Expected Behavior |
| --- | --- | --- |
| Single-tool | "What is your return policy?" | One tool call (search_kb) |
| Multi-tool | "Check ticket TKT-001 and the related order" | Sequence of tools |
| Escalation | "I'm furious! Need to speak to a manager!" | Escalate to human |
| No-tool | "Thanks for your help!" | Direct response, no tools |
| Edge cases | "Refund order that doesn't exist" | Appropriate error handling |
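
If you draft test cases in code before adding them to Netra, a minimal sketch like the one below covers a few of these categories using the field names from the structure above. The tool names (search_kb, lookup_ticket, check_order_status, escalate_to_human) are the same illustrative ones used throughout this cookbook; substitute your agent's actual tools.

# Illustrative test cases, one per category; field names match the test case structure above
test_cases = [
    {
        "query": "What is your return policy?",
        "expected_tools": ["search_kb"],
        "forbidden_tools": ["escalate_to_human"],
        "should_escalate": False,
        "query_category": "policy",
        "expected_outcome": "Return policy explained",
    },
    {
        "query": "Check ticket TKT-001 and the related order",
        "expected_tools": ["lookup_ticket", "check_order_status"],
        "forbidden_tools": ["escalate_to_human"],
        "should_escalate": False,
        "query_category": "order_status",
        "expected_outcome": "Ticket and related order summarized",
    },
    {
        "query": "I'm furious! Need to speak to a manager!",
        "expected_tools": ["escalate_to_human"],
        "forbidden_tools": [],
        "should_escalate": True,
        "query_category": "escalation",
        "expected_outcome": "Escalated to a human operator",
    },
]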

Creating the Dataset in Netra

  1. Navigate to Evaluation → Datasets
  2. Click Create Dataset
  3. Add test cases with:
    • query: The user input
    • expected_tools: List of tools that should be called
    • forbidden_tools: Tools that should not be called
    • should_escalate: Boolean for escalation expectation
    • expected_outcome: Description of correct outcome
Start with 30-50 test cases covering each category. Expand as you discover new edge cases in production.

Defining Agent Evaluators

Create evaluators in Evaluation → Evaluators → Add Evaluator.

Tool Correctness Evaluator

Use the Tool Correctness template to validate tool selection:
| Setting | Value |
| --- | --- |
| Template | Tool Correctness |
| Output Type | Numerical |
| Pass Criteria | score >= 0.8 |
This evaluator checks:
  • Were all expected tools called?
  • Were forbidden tools avoided?
  • Was the sequence correct (if order matters)?

Escalation Accuracy Evaluator

Create a Code Evaluator for escalation decisions:
| Setting | Value |
| --- | --- |
| Type | Code Evaluator |
| Output Type | Numerical |
| Pass Criteria | score >= 0.8 |
// Evaluator for escalation accuracy
function handler(input, output, expectedOutput) {
    const shouldEscalate = expectedOutput?.should_escalate || false;

    // Check if the agent escalated
    const outputLower = output.toLowerCase();
    const didEscalate = outputLower.includes("escalate") ||
                        outputLower.includes("human operator") ||
                        outputLower.includes("specialist will contact") ||
                        outputLower.includes("transferring");

    // Score based on correct escalation decision
    if (shouldEscalate === didEscalate) {
        return 1; // Correct decision
    } else if (shouldEscalate && !didEscalate) {
        return 0; // False negative - missed escalation (severe)
    } else {
        return 0.5; // False positive - over-escalation (less severe)
    }
}

Workflow Completion Evaluator

Create a code evaluator that validates multi-step workflow completion:
// Evaluator for workflow completion
function handler(input, output, expectedOutput) {
    const requiredSteps = expectedOutput?.required_steps || [];

    if (requiredSteps.length === 0) {
        return 1; // No specific steps required
    }

    const outputLower = output.toLowerCase();
    let completedSteps = 0;

    for (const step of requiredSteps) {
        // Check if evidence of step completion exists in output
        if (outputLower.includes(step.toLowerCase())) {
            completedSteps++;
        }
    }

    return completedSteps / requiredSteps.length;
}
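
Note that this evaluator expects a required_steps field on the test case, which is not part of the minimal structure shown earlier, and that each step is matched as a substring of the agent's final response. A hypothetical refund test case would therefore list evidence phrases you expect in that response rather than tool names:

{
  "query": "I want a refund for order ORD-12345, the item was damaged",
  "required_steps": ["ORD-12345", "refund", "initiated"],
  "expected_outcome": "Refund initiated successfully"
}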

Query Category Accuracy

Evaluate whether the agent handles different query types correctly:
// Evaluator for query category handling
function handler(input, output, expectedOutput) {
    const category = expectedOutput?.query_category;
    const outputLower = output.toLowerCase();

    const categorySignals = {
        "refund": ["refund", "credited", "processed"],
        "order_status": ["status", "tracking", "delivered", "shipped"],
        "policy": ["policy", "days", "return", "terms"],
        "escalation": ["escalate", "specialist", "human", "contact you"],
    };

    const signals = categorySignals[category] || [];

    if (signals.length === 0) {
        return 1; // Unknown category, pass by default
    }

    // Check if any category signals appear in output
    const hasSignal = signals.some(signal => outputLower.includes(signal));

    return hasSignal ? 1 : 0;
}

Running Agent Evaluations

Setup

Ensure your agent is configured with Netra tracing:
from netra import Netra
from netra.instrumentation.instruments import InstrumentSet

Netra.init(
    app_name="agent-evaluation",
    environment="evaluation",
    trace_content=True,
    instruments={InstrumentSet.OPENAI, InstrumentSet.LANGCHAIN},
)

# Your agent setup (assumes `tools` and `system_prompt` are defined elsewhere)
from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent

model = ChatOpenAI(model="gpt-4o-mini", temperature=0)
agent = create_react_agent(model, tools, prompt=system_prompt)

def run_agent(query: str) -> str:
    """Execute agent and return response."""
    result = agent.invoke({"messages": [{"role": "user", "content": query}]})
    return result["messages"][-1].content
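
Before running the full suite, a quick single-query smoke test confirms that tracing and the agent loop are wired up correctly:

# Single-query sanity check before the full evaluation run
print(run_agent("What is your return policy?"))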

Running the Evaluation Suite

# Get dataset from Netra
DATASET_ID = "your-agent-test-dataset-id"

dataset = Netra.evaluation.get_dataset(DATASET_ID)

Netra.evaluation.run_test_suite(
    name="Agent Decision Evaluation",
    data=dataset,
    task=lambda eval_input: run_agent(eval_input["query"])
)

print("Evaluation complete! View results in Netra dashboard.")
Netra.shutdown()

Analyzing Results

Interpreting Evaluation Metrics

After running evaluations, review results in Evaluation → Test Runs:
| Metric | Good | Warning | Action Needed |
| --- | --- | --- | --- |
| Tool Correctness | > 90% | 80-90% | < 80% |
| Escalation Accuracy | > 95% | 90-95% | < 90% |
| Workflow Completion | > 85% | 75-85% | < 75% |

Common Failure Patterns

| Low Score In | Likely Cause | How to Fix |
| --- | --- | --- |
| Tool Correctness | Ambiguous tool descriptions | Add clearer docstrings, few-shot examples |
| Escalation (false negatives) | Missing urgency signals | Add more escalation triggers to prompt |
| Escalation (false positives) | Over-cautious agent | Narrow escalation criteria |
| Workflow Completion | Premature termination | Add explicit completion checks |

Debugging with Traces

Click View Trace on any failed test case to see:
  • The exact reasoning steps (thought → action → observation)
  • Which tools were called and in what order
  • The full prompt and response at each iteration
  • Where the agent deviated from expected behavior

Improving Agent Performance

Prompt Engineering Based on Evaluation

Use evaluation insights to improve your agent prompt:
# Before: Vague escalation guidance
system_prompt = """
Escalate complex issues to human operators.
"""

# After: Specific escalation criteria from evaluation failures
system_prompt = """
Escalate to human operators when ANY of these conditions are met:
- User expresses frustration (words like "ridiculous", "unacceptable", "furious")
- User has been waiting more than 2 weeks
- User explicitly asks to speak to a human
- The issue involves policy exceptions
- You cannot resolve the issue after 2 tool calls

Do NOT escalate for:
- Simple FAQ questions
- Routine order status checks
- Standard refund requests within policy
"""

Tool Description Improvements

If tool selection is failing, improve tool descriptions:
# Before: Generic description
@tool
def search_kb(query: str) -> str:
    """Search the knowledge base."""
    ...

# After: Specific, actionable description
@tool
def search_kb(query: str) -> str:
    """Search the knowledge base for policy, procedure, or FAQ information.

    Use this tool when:
    - User asks about return/refund policies
    - User asks about shipping times or costs
    - User asks general "how to" questions

    Do NOT use for:
    - Order-specific questions (use check_order_status)
    - Ticket lookups (use lookup_ticket)
    """
    ...

Continuous Evaluation

Set up regular evaluation runs:
  1. On prompt changes — Re-run full test suite
  2. Weekly regression tests — Catch drift in agent behavior
  3. After tool additions — Ensure new tools don’t disrupt existing behavior
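
For the weekly regression runs, a scheduled CI job is one simple option. The sketch below assumes GitHub Actions, a hypothetical run_agent_evals.py script that wraps the test-suite code from the Setup section, and OPENAI_API_KEY / NETRA_API_KEY stored as repository secrets (the secret names are assumptions); adapt it to whatever scheduler you already use.

# Hypothetical weekly regression workflow (GitHub Actions)
name: agent-regression
on:
  schedule:
    - cron: "0 6 * * 1"   # every Monday at 06:00 UTC
  workflow_dispatch: {}    # allow manual runs after prompt or tool changes

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -r requirements.txt
      - run: python run_agent_evals.py   # wraps the run_test_suite call shown earlier
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          NETRA_API_KEY: ${{ secrets.NETRA_API_KEY }}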

Summary

You’ve learned how to systematically evaluate agent decision-making:
  • Tool Correctness validates that agents call the right tools
  • Escalation Accuracy catches over/under-escalation
  • Workflow Completion ensures multi-step tasks succeed
  • Trace debugging reveals exactly where agents go wrong

Key Takeaways

  1. Agent errors compound—95% accuracy per step means 86% for 3-step workflows
  2. Test cases need structure: expected tools, forbidden tools, escalation flags
  3. Custom code evaluators handle domain-specific logic
  4. Use evaluation insights to improve prompts and tool descriptions
