Evaluate AI agent decisions with Netra. Measure tool selection accuracy, escalation logic, and workflow completion using structured evaluation datasets.
A traced agent tells you what happened — which tools were called, how long each step took, and what the LLM generated. Evaluation tells you whether the agent made the right decisions. Without structured scoring, you can’t tell if the agent is selecting the wrong tools, over-escalating simple requests, or stopping before the workflow is complete. These failures don’t throw errors — they just produce worse outcomes.

This cookbook walks you through Netra’s evaluation workflow: creating evaluators for agent-specific quality dimensions, building test datasets from your traces, running test suites, and interpreting results to improve your agent.
Prerequisite: You need a Netra API key (Get started here) and an AI agent to evaluate. The test cases below use the customer support agent from the Tracing LangChain Agents cookbook as a reference.
Agent evaluation differs from simple LLM evaluation. Agents make multi-step decisions that compound — a 95% accurate tool selection across 3 steps means only 86% of full workflows succeed (0.95^3):
| Failure Mode | What Goes Wrong | Why You Can’t Spot-Check It |
| --- | --- | --- |
| Wrong tool selection | Agent uses `search_kb` when it should use `check_order_status` | The answer may still sound reasonable despite using the wrong data source |
| Over-escalation | Agent escalates a simple FAQ to a human operator | Each escalation looks cautious and safe in isolation — you need aggregate metrics to see the pattern |
| Under-escalation | Agent tries to handle a frustrated customer instead of escalating | Only visible when you compare the agent’s decision against the expected action |
| Incomplete workflow | Agent looks up the ticket but never checks the related order | The partial answer addresses part of the question, so it looks acceptable on a quick read |
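To see how quickly step-level accuracy erodes full-workflow success, here is a quick back-of-the-envelope calculation (plain Python, nothing Netra-specific):

```python
# Per-step decision accuracy compounds across a workflow:
# even a strong per-step rate leaves many complete workflows failing.
step_accuracy = 0.95
for steps in (1, 3, 5, 10):
    workflow_success = step_accuracy ** steps
    print(f"{steps} steps -> {workflow_success:.0%} of workflows fully correct")
# 1 step: 95%, 3 steps: 86%, 5 steps: 77%, 10 steps: 60%
```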
Netra’s evaluation framework addresses this with Datasets (test cases with inputs, expected outputs, and metadata), Evaluators (library and custom code-based scoring for tool usage, escalation, and completion), and Test Runs (execution results with pass/fail rates, scores, and linked traces). The workflow is: create evaluators, build test cases, run, and review. See the Evaluation Overview for a deeper look at the framework.

Now, let’s walk through the process of evaluating agent decisions:
Click Add Evaluator and create an LLM as Judge evaluator that scores the agent’s escalation decisions.
| Setting | Value |
| --- | --- |
| Type | LLM as Judge |
| Output Type | Numerical |
| Pass Criteria | `score >= 0.8` |
Use the following prompt template:
```
You are evaluating whether a customer support agent made the correct escalation decision.

Metadata for this test case:
{{metadata}}

Agent's response:
{{agent_response}}

Evaluate the escalation decision based on the "should_escalate" field in the metadata:
- If should_escalate is true, the agent must have escalated (e.g., transferred to a human, mentioned a specialist, offered to connect to a manager).
- If should_escalate is false, the agent must NOT have escalated.

Score:
- 1.0 if the agent's escalation decision is correct
- 0.5 if the agent escalated unnecessarily (false positive — less severe)
- 0.0 if the agent failed to escalate when required (false negative — severe)
```
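The `{{metadata}}` placeholder is presumably filled from the test case’s metadata at run time. A minimal sketch of what that metadata might contain for an escalation case (the `should_escalate` field name comes from the prompt above; the value and the extra key are hypothetical):

```python
# Hypothetical metadata attached to an escalation test case.
# "should_escalate" is the field the evaluator prompt reads;
# "reason" is an optional annotation for your own reporting.
metadata = {
    "should_escalate": True,
    "reason": "customer is frustrated and has waited over two weeks",
}
```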
Create another LLM as Judge evaluator that validates whether the agent completed all required steps.
| Setting | Value |
| --- | --- |
| Type | LLM as Judge |
| Output Type | Boolean |
| Pass Criteria | `true` |
Use the following prompt template:
```
You are evaluating whether a customer support agent completed all required workflow steps.

Metadata for this test case:
{{metadata}}

Agent's response:
{{agent_response}}

Check the "required_steps" field in the metadata. If required_steps is empty or missing, return true.

Otherwise, verify that the agent's response addresses every step listed. Steps can be addressed using different wording — check for semantic equivalence, not exact keyword matches.

Return true if ALL required steps are covered, false otherwise.
```
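Similarly, a sketch of the metadata a multi-step test case might carry (the `required_steps` field name comes from the prompt above; the step descriptions are made up for illustration):

```python
# Hypothetical metadata for the workflow-completion evaluator.
# "required_steps" is the field the prompt checks; step wording is illustrative.
metadata = {
    "required_steps": [
        "look up the support ticket",
        "check the status of the related order",
        "summarize next steps for the customer",
    ],
}
```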
You can test each evaluator in the Playground before using it in a dataset. See Evaluators for the full reference.
Add the following five test cases manually:

1. Single-tool — Policy lookup
| Field | Value |
| --- | --- |
| Input | What is your return policy? |
| Expected Output | Items can be returned within 30 days of purchase. The item must be unused and in its original packaging. Refunds are processed in 5-7 business days to your original payment method. |
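Test cases that exercise escalation or multi-step workflows also carry metadata for the evaluators to read. A hypothetical example of such an item, shown as a Python dict for readability (field names mirror the dataset fields above; all values are illustrative):

```python
# Hypothetical escalation test case showing how the evaluator-facing
# metadata fields sit alongside the input and expected output.
test_case = {
    "input": "This is ridiculous. I've been waiting three weeks for my refund!",
    "expected_output": "Apologize and hand the conversation off to a human specialist.",
    "metadata": {
        "should_escalate": True,
        "required_steps": [
            "acknowledge the customer's frustration",
            "escalate to a human operator",
        ],
    },
}
```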
Copy the Dataset ID from the dataset page and use the code below.
```python
from netra import Netra
from netra.instrumentation.instruments import InstrumentSet

Netra.init(
    app_name="agent-evaluation",
    instruments={InstrumentSet.OPENAI, InstrumentSet.LANGCHAIN},
)

# Your agent logic — wrap your agent in a function that takes
# an input string and returns the generated response.
# Tip: if you followed the tracing cookbook, you can call agent.invoke() here.
def run_agent(input_data):
    result = agent.invoke({"messages": [{"role": "user", "content": input_data}]})
    return result["messages"][-1].content

dataset = Netra.evaluation.get_dataset(dataset_id="your-dataset-id")

result = Netra.evaluation.run_test_suite(
    name="Agent Decision Evaluation",
    data=dataset,
    task=run_agent,
)
```
For more details on the evaluation API, refer to the SDK documentation.
Go to Evaluation → Test Runs to see your test run with its status. Click on the test run to see the result for each evaluator, for each dataset item — whether it passed or failed.

You can also click View Trace on any result to see the exact reasoning steps (thought → action → observation), which tools were called and in what order, and where the agent deviated from expected behavior. See Test Runs for the full reference.
Use evaluation failures to refine your agent prompt:
```python
# Before: Vague escalation guidance
system_prompt = """Escalate complex issues to human operators."""

# After: Specific criteria derived from evaluation failures
system_prompt = """Escalate to human operators when ANY of these conditions are met:
- User expresses frustration ("ridiculous", "unacceptable", "furious")
- User has been waiting more than 2 weeks
- User explicitly asks to speak to a human
- The issue involves policy exceptions

Do NOT escalate for:
- Simple FAQ questions
- Routine order status checks
- Standard refund requests within policy"""
```
After making changes, re-run the evaluation against the same dataset and compare results across test runs. Netra tracks all runs so you can see whether your changes improved quality.
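A minimal sketch of such a comparison run, reusing the dataset and run_agent function from the earlier snippet (only the run name, an arbitrary label, is new):

```python
# Re-run the same dataset against the updated agent so the two test runs
# can be compared side by side in the Test Runs view.
result_v2 = Netra.evaluation.run_test_suite(
    name="Agent Decision Evaluation - refined escalation prompt",
    data=dataset,
    task=run_agent,
)
```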