A traced agent tells you what happened — which tools were called, how long each step took, and what the LLM generated. Evaluation tells you whether the agent made the right decisions. Without structured scoring, you can’t tell if the agent is selecting the wrong tools, over-escalating simple requests, or stopping before the workflow is complete. These failures don’t throw errors — they just produce worse outcomes. This cookbook walks you through Netra’s evaluation workflow: creating evaluators for agent-specific quality dimensions, building test datasets from your traces, running test suites, and interpreting results to improve your agent.
Prerequisite: You need a Netra API key (Get started here) and an AI agent to evaluate. The test cases below use the customer support agent from the Tracing LangChain Agents cookbook as a reference.

What You’ll Learn

Build a Test Dataset

Create structured test cases with inputs, expected outputs, and metadata for your evaluators

Configure Agent Evaluators

Set up evaluators for tool correctness, escalation accuracy, and workflow completion

Run Test Suites

Execute evaluations via the SDK and collect quality metrics

Analyze Results & Iterate

Interpret scores, debug failures using trace integration, and improve your agent

Why Agent Decisions Need Evaluation

Agent evaluation differs from simple LLM evaluation. Agents make multi-step decisions that compound — a 95% accurate tool selection across 3 steps means only 86% of full workflows succeed (0.95^3):
| Failure Mode | What Goes Wrong | Why You Can’t Spot-Check It |
| --- | --- | --- |
| Wrong tool selection | Agent uses `search_kb` when it should use `check_order_status` | The answer may still sound reasonable despite using the wrong data source |
| Over-escalation | Agent escalates a simple FAQ to a human operator | Each escalation looks cautious and safe in isolation — you need aggregate metrics to see the pattern |
| Under-escalation | Agent tries to handle a frustrated customer instead of escalating | Only visible when you compare the agent’s decision against the expected action |
| Incomplete workflow | Agent looks up the ticket but never checks the related order | The partial answer addresses part of the question, so it looks acceptable on a quick read |
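The compounding effect is easy to verify with a few lines of arithmetic: a workflow succeeds only if every step succeeds, so per-step accuracy multiplies, and the gap widens quickly as workflows get longer.

```python
# Per-step accuracy compounds multiplicatively across a workflow:
# the full workflow succeeds only if every single step succeeds.
def workflow_success_rate(step_accuracy: float, num_steps: int) -> float:
    return step_accuracy ** num_steps

print(round(workflow_success_rate(0.95, 3), 3))   # 3 steps:  0.857
print(round(workflow_success_rate(0.95, 10), 3))  # 10 steps: 0.599
```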
Netra’s evaluation framework addresses this with Datasets (test cases with inputs, expected outputs, and metadata), Evaluators (library and custom code-based scoring for tool usage, escalation, and completion), and Test Runs (execution results with pass/fail rates, scores, and linked traces). The workflow is: create evaluators, build test cases, run, and review. See the Evaluation Overview for a deeper look at the framework.
Now, let’s walk through the process of evaluating agent decisions:

Step 1: Create Evaluators

You need three evaluators — one from the library and two custom LLM as Judge evaluators.

Tool Correctness (Library)

Go to Evaluation → Evaluators, switch to the Library tab, and add Tool Correctness from the Tool Use category.
| Evaluator | What It Measures |
| --- | --- |
| Tool Correctness | Did the agent call the right tools, avoid forbidden tools, and use the correct sequence? |

Escalation Accuracy (LLM as Judge)

Click Add Evaluator and create an LLM as Judge evaluator.
| Setting | Value |
| --- | --- |
| Type | LLM as Judge |
| Output Type | Numerical |
| Pass Criteria | `score >= 0.8` |
Use the following prompt template:
```
You are evaluating whether a customer support agent made the correct escalation decision.

Metadata for this test case:
{{metadata}}

Agent's response:
{{agent_response}}

Evaluate the escalation decision based on the "should_escalate" field in the metadata:
- If should_escalate is true, the agent must have escalated (e.g., transferred to a human, mentioned a specialist, offered to connect to a manager).
- If should_escalate is false, the agent must NOT have escalated.

Score:
- 1.0 if the agent's escalation decision is correct
- 0.5 if the agent escalated unnecessarily (false positive — less severe)
- 0.0 if the agent failed to escalate when required (false negative — severe)
```
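Before relying on the LLM judge, it can help to encode the same rubric deterministically for quick local spot checks. The sketch below is a rough stand-in only: it uses simple keyword matching with an illustrative phrase list, whereas the LLM judge handles paraphrases that keyword matching misses.

```python
# Rough deterministic stand-in for the escalation rubric above.
# ESCALATION_PHRASES is illustrative, not an exhaustive list.
ESCALATION_PHRASES = ["transferring you", "specialist", "human", "manager"]

def escalation_score(metadata: dict, agent_response: str) -> float:
    escalated = any(p in agent_response.lower() for p in ESCALATION_PHRASES)
    if metadata.get("should_escalate"):
        return 1.0 if escalated else 0.0  # missed escalation: severe
    return 0.5 if escalated else 1.0      # unnecessary escalation: less severe

print(escalation_score({"should_escalate": True},
                       "I'm transferring you to a specialist."))  # 1.0
print(escalation_score({"should_escalate": False},
                       "Your order has shipped."))                # 1.0
```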

Workflow Completion (LLM as Judge)

Create another LLM as Judge evaluator that validates whether the agent completed all required steps.
| Setting | Value |
| --- | --- |
| Type | LLM as Judge |
| Output Type | Boolean |
| Pass Criteria | `true` |
Use the following prompt template:
```
You are evaluating whether a customer support agent completed all required workflow steps.

Metadata for this test case:
{{metadata}}

Agent's response:
{{agent_response}}

Check the "required_steps" field in the metadata. If required_steps is empty or missing, return true.

Otherwise, verify that the agent's response addresses every step listed. Steps can be addressed using different wording — check for semantic equivalence, not exact keyword matches.

Return true if ALL required steps are covered, false otherwise.
```
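For local sanity checks, the same rubric can be approximated deterministically. The sketch below substitutes substring matching for the semantic matching an LLM judge performs, so treat it as a rough lower bound, not the evaluator itself.

```python
# Rough stand-in for the workflow-completion rubric: substring checks
# instead of the LLM judge's semantic matching.
def workflow_complete(metadata: dict, agent_response: str) -> bool:
    required = metadata.get("required_steps") or []
    if not required:
        return True  # empty or missing required_steps passes
    response = agent_response.lower()
    return all(step.lower() in response for step in required)

meta = {"required_steps": ["ticket", "order", "delivered"]}
print(workflow_complete(
    meta, "Ticket TKT-002 is open; order ORD-12345 was delivered."))  # True
print(workflow_complete(meta, "Ticket TKT-002 is open."))             # False
```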
You can test each evaluator in the Playground before using it in a dataset. See Evaluators for the full reference.

Step 2: Create a Dataset

Go to Evaluation → Datasets and click Create Dataset. Name it “Agent Decisions Dataset” and attach the three evaluators from Step 1.

Configure Variable Mappings

For each evaluator, map the variables to their data source so the evaluator receives the correct inputs at runtime:

**Tool Correctness**

| Variable | Maps To |
| --- | --- |
| `expected_tools` | Dataset item → metadata.expected_tools |
| `actual_tools` | Execution data → summary metrics → tools |

**Escalation Accuracy**

| Variable | Maps To |
| --- | --- |
| `agent_response` | Agent response |
| `metadata` | Dataset item → metadata |

**Workflow Completion**

| Variable | Maps To |
| --- | --- |
| `agent_response` | Agent response |
| `metadata` | Dataset item → metadata |
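To make the mappings concrete, here is a hypothetical Python sketch of the resolved inputs the Escalation Accuracy evaluator would receive for the angry-customer test case. The names `dataset_item` and `evaluator_inputs` are illustrative only, not the Netra SDK's internal payload format.

```python
# Illustrative only: how the variable mappings above resolve for one item.
dataset_item = {
    "input": "This is ridiculous! I've been waiting 3 weeks and nobody "
             "has helped me. I need to speak to a manager right now!",
    "metadata": {
        "should_escalate": True,
        "expected_tools": ["escalate_to_human.tool"],
    },
}

# Produced by the agent at test-run time.
agent_response = "I understand your frustration. I'm transferring you to a specialist."

# "metadata" resolves from the dataset item; "agent_response" resolves
# from the live agent output, matching the mapping tables above.
evaluator_inputs = {
    "metadata": dataset_item["metadata"],
    "agent_response": agent_response,
}
print(evaluator_inputs["metadata"]["should_escalate"])  # True
```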

Add Test Cases

Add the following five test cases manually:

**1. Single-tool — Policy lookup**

| Field | Value |
| --- | --- |
| Input | What is your return policy? |
| Expected Output | Items can be returned within 30 days of purchase. The item must be unused and in its original packaging. Refunds are processed in 5-7 business days to your original payment method. |
| Metadata | `{"expected_tools": ["search_kb.tool"], "should_escalate": false}` |

**2. Multi-tool — Ticket with related order**

| Field | Value |
| --- | --- |
| Input | Check ticket TKT-002 and tell me the status of the related order |
| Expected Output | Ticket TKT-002 is open regarding a damaged item. The related order ORD-12345 has been delivered. The order contains Headphones totaling $79.99. |
| Metadata | `{"expected_tools": ["lookup_ticket.tool", "check_order_status.tool"], "should_escalate": false, "required_steps": ["ticket", "order", "delivered"]}` |

**3. Escalation — Angry customer**

| Field | Value |
| --- | --- |
| Input | This is ridiculous! I’ve been waiting 3 weeks and nobody has helped me. I need to speak to a manager right now! |
| Expected Output | I understand your frustration, and I’m sorry for the delay. I’m transferring you to a specialist who can resolve this immediately. |
| Metadata | `{"should_escalate": true, "expected_tools": ["escalate_to_human.tool"]}` |

**4. No-tool — Simple thank you**

| Field | Value |
| --- | --- |
| Input | Thanks for your help, that’s all I needed! |
| Expected Output | You’re welcome! If you need anything else, don’t hesitate to reach out. Have a great day! |
| Metadata | `{"expected_tools": [], "should_escalate": false}` |

**5. Edge case — Non-existent order**

| Field | Value |
| --- | --- |
| Input | I want a refund for order ORD-99999 |
| Expected Output | I wasn’t able to find an order with ID ORD-99999. Could you double-check the order number? You can find it in your confirmation email. |
| Metadata | `{"expected_tools": ["check_order_status.tool"], "should_escalate": false, "required_steps": ["not found", "order"]}` |
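If you want the same cases under version control before entering them in the UI, they can be drafted as plain Python data mirroring the Field/Value tables above (expected outputs omitted for brevity; this layout is a local convention, not a Netra SDK format). A small consistency check catches slips such as an escalation case that forgets the escalation tool:

```python
# Local draft of the five test cases; metadata mirrors the dataset tables.
test_cases = [
    {"input": "What is your return policy?",
     "metadata": {"expected_tools": ["search_kb.tool"], "should_escalate": False}},
    {"input": "Check ticket TKT-002 and tell me the status of the related order",
     "metadata": {"expected_tools": ["lookup_ticket.tool", "check_order_status.tool"],
                  "should_escalate": False,
                  "required_steps": ["ticket", "order", "delivered"]}},
    {"input": "This is ridiculous! I've been waiting 3 weeks and nobody has "
              "helped me. I need to speak to a manager right now!",
     "metadata": {"expected_tools": ["escalate_to_human.tool"], "should_escalate": True}},
    {"input": "Thanks for your help, that's all I needed!",
     "metadata": {"expected_tools": [], "should_escalate": False}},
    {"input": "I want a refund for order ORD-99999",
     "metadata": {"expected_tools": ["check_order_status.tool"], "should_escalate": False,
                  "required_steps": ["not found", "order"]}},
]

# Sanity check: every case flagged should_escalate must expect the escalation tool.
for case in test_cases:
    meta = case["metadata"]
    if meta["should_escalate"]:
        assert "escalate_to_human.tool" in meta["expected_tools"], case["input"]
print(f"{len(test_cases)} test cases validated")
```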
Under Evaluation → Datasets, you should now see the “Agent Decisions Dataset” with five items and three evaluators under the Evaluators tab.

Step 3: Trigger a Test Run

Copy the Dataset ID from the dataset page and use the code below.
```python
from netra import Netra
from netra.instrumentation.instruments import InstrumentSet

Netra.init(
    app_name="agent-evaluation",
    instruments={InstrumentSet.OPENAI, InstrumentSet.LANGCHAIN},
)

# Your agent logic — wrap your agent in a function that takes
# an input string and returns the generated response.
# Tip: if you followed the tracing cookbook, you can call agent.invoke() here.
def run_agent(input_data):
    result = agent.invoke({"messages": [{"role": "user", "content": input_data}]})
    return result["messages"][-1].content

dataset = Netra.evaluation.get_dataset(dataset_id="your-dataset-id")

result = Netra.evaluation.run_test_suite(
    name="Agent Decision Evaluation",
    data=dataset,
    task=run_agent,
)
```
For more details on the evaluation API, refer to the SDK documentation.

Step 4: View Results

Go to Evaluation → Test Runs to see your test run with its status. Click on the test run to see the result for each evaluator, for each dataset item — whether it passed or failed. You can also click View Trace on any result to see the exact reasoning steps (thought → action → observation), which tools were called and in what order, and where the agent deviated from expected behavior. See Test Runs for the full reference.

Interpreting Scores and Improving Quality

When evaluator scores are low, use this table to identify the likely cause and fix:
| Low Score In | Likely Cause | How to Fix |
| --- | --- | --- |
| Tool Correctness | Ambiguous tool descriptions | Add clearer docstrings with explicit “use when” / “do not use for” guidance |
| Escalation (false negatives) | Agent misses urgency signals | Add more escalation triggers to the system prompt (e.g., specific keywords, wait times) |
| Escalation (false positives) | Agent is over-cautious | Narrow escalation criteria — list what should NOT be escalated |
| Workflow Completion | Agent stops before finishing all steps | Add explicit completion checks to the prompt (e.g., “verify all related records before responding”) |
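Patterns such as over-escalation only become visible in aggregate, so it is worth rolling raw results up per evaluator. A minimal sketch, assuming you have exported per-item results as (evaluator, passed) pairs; the sample data is hypothetical and this is not the SDK's result object:

```python
from collections import defaultdict

# Hypothetical exported results: (evaluator_name, passed) per dataset item.
results = [
    ("Tool Correctness", True), ("Tool Correctness", True),
    ("Tool Correctness", False), ("Tool Correctness", True),
    ("Tool Correctness", True),
    ("Escalation Accuracy", True), ("Escalation Accuracy", False),
    ("Escalation Accuracy", True), ("Escalation Accuracy", True),
    ("Escalation Accuracy", True),
]

def pass_rates(results):
    # Aggregate pass counts and totals per evaluator.
    totals, passes = defaultdict(int), defaultdict(int)
    for evaluator, passed in results:
        totals[evaluator] += 1
        passes[evaluator] += passed
    return {e: passes[e] / totals[e] for e in totals}

for evaluator, rate in sorted(pass_rates(results).items()):
    print(f"{evaluator}: {rate:.0%}")
# Escalation Accuracy: 80%
# Tool Correctness: 80%
```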

Prompt Improvements Based on Evaluation

Use evaluation failures to refine your agent prompt:
```python
# Before: Vague escalation guidance
system_prompt = """Escalate complex issues to human operators."""

# After: Specific criteria derived from evaluation failures
system_prompt = """
Escalate to human operators when ANY of these conditions are met:
- User expresses frustration ("ridiculous", "unacceptable", "furious")
- User has been waiting more than 2 weeks
- User explicitly asks to speak to a human
- The issue involves policy exceptions

Do NOT escalate for:
- Simple FAQ questions
- Routine order status checks
- Standard refund requests within policy
"""
```
After making changes, re-run the evaluation against the same dataset and compare results across test runs. Netra tracks all runs so you can see whether your changes improved quality.

Continuous Evaluation Strategy

For production agents, run evaluations regularly:
  1. On every prompt change — Re-run the full test suite to catch regressions
  2. After tool additions — Ensure new tools don’t disrupt existing tool selection patterns
  3. Weekly benchmarks — Track quality trends over time to catch gradual degradation
  4. After model upgrades — Verify that a new model version doesn’t change escalation or tool selection behavior

See Also

Trace Your LangChain Agent

Set up comprehensive tracing for your agent before evaluating

Evaluation Overview

Deep dive into Netra’s evaluation framework: datasets, evaluators, and test runs

Simulating Customer Support

Test your agent through multi-turn simulated conversations

A/B Testing Configurations

Compare different pipeline configurations systematically
Last modified on March 17, 2026