Prerequisite: You need a Netra API key (Get started here) and an AI agent to evaluate. The test cases below use the customer support agent from the Tracing LangChain Agents cookbook as a reference.
What You’ll Learn
Build a Test Dataset
Create structured test cases with inputs, expected outputs, and metadata for your evaluators
Configure Agent Evaluators
Set up evaluators for tool correctness, escalation accuracy, and workflow completion
Run Test Suites
Execute evaluations via the SDK and collect quality metrics
Analyze Results & Iterate
Interpret scores, debug failures using trace integration, and improve your agent
Why Agent Decisions Need Evaluation
Agent evaluation differs from simple LLM evaluation. Agents make multi-step decisions that compound: tool selection that is 95% accurate at each of three steps means only ~86% of full workflows succeed (0.95^3 ≈ 0.857).
| Failure Mode | What Goes Wrong | Why You Can’t Spot-Check It |
|---|---|---|
| Wrong tool selection | Agent uses search_kb when it should use check_order_status | The answer may still sound reasonable despite using the wrong data source |
| Over-escalation | Agent escalates a simple FAQ to a human operator | Each escalation looks cautious and safe in isolation — you need aggregate metrics to see the pattern |
| Under-escalation | Agent tries to handle a frustrated customer instead of escalating | Only visible when you compare the agent’s decision against the expected action |
| Incomplete workflow | Agent looks up the ticket but never checks the related order | The partial answer addresses part of the question, so it looks acceptable on a quick read |
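The compounding effect is easy to verify directly:

```python
# Per-step success probabilities multiply across a multi-step workflow.
per_step_accuracy = 0.95
steps = 3
workflow_success = per_step_accuracy ** steps
print(f"{workflow_success:.3f}")  # 0.857 -> only ~86% of 3-step workflows succeed
```

At five steps the same 95% per-step accuracy drops full-workflow success to about 77%, which is why per-step spot-checks understate the problem.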
Now, let’s walk through the process of evaluating agent decisions:
Step 1: Create Evaluators
You need three evaluators: one from the library and two custom LLM as Judge evaluators.
Tool Correctness (Library)
Go to Evaluation → Evaluators, switch to the Library tab, and add Tool Correctness from the Tool Use category.
| Evaluator | What It Measures |
|---|---|
| Tool Correctness | Did the agent call the right tools, avoid forbidden tools, and use the correct sequence? |
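The evaluator's internals aren't shown here, but the core comparison it performs can be sketched as follows (a simplified illustration, not the library implementation):

```python
def tool_correctness(expected_tools: list[str], actual_tools: list[str]) -> bool:
    """Simplified check: right tools, no extras, correct sequence.

    Exact sequence equality implies no forbidden or missing tools; the
    real evaluator may score partial matches rather than pass/fail.
    """
    return actual_tools == expected_tools

# Agent used search_kb when it should have checked the order status:
print(tool_correctness(["check_order_status.tool"], ["search_kb.tool"]))  # False
print(tool_correctness(["lookup_ticket.tool", "check_order_status.tool"],
                       ["lookup_ticket.tool", "check_order_status.tool"]))  # True
```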
Escalation Accuracy (LLM as Judge)
Click Add Evaluator and create an LLM as Judge evaluator.
| Setting | Value |
|---|---|
| Type | LLM as Judge |
| Output Type | Numerical |
| Pass Criteria | score >= 0.8 |
Workflow Completion (LLM as Judge)
Create another LLM as Judge evaluator that validates whether the agent completed all required steps.
| Setting | Value |
|---|---|
| Type | LLM as Judge |
| Output Type | Boolean |
| Pass Criteria | true |
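The judge prompt itself is up to you. A hedged starting point for the escalation evaluator (the wording is illustrative, not prescribed by Netra; the `{{agent_response}}` and `{{metadata}}` placeholders assume the variable names mapped in Step 2):

```python
# Illustrative judge prompt for the Escalation Accuracy evaluator.
ESCALATION_JUDGE_PROMPT = """\
You are evaluating a customer support agent's escalation decision.

Agent response: {{agent_response}}
Expected behavior: {{metadata}}

Score from 0.0 to 1.0:
- 1.0 if the agent escalated when metadata.should_escalate is true,
  and handled the request itself when it is false.
- 0.0 if the agent escalated a routine request, or failed to escalate
  a frustrated customer.
Return only the numeric score.
"""
```

With the numerical output type and a pass criterion of `score >= 0.8`, borderline judgments (e.g., a partial hand-off) can still pass while clear misses fail.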
Step 2: Create a Dataset
Go to Evaluation → Datasets and click Create Dataset. Name it “Agent Decisions Dataset” and attach the three evaluators from Step 1.
Configure Variable Mappings
For each evaluator, map the variables to their data source so the evaluator receives the correct inputs at runtime.
Tool Correctness
| Variable | Maps To |
|---|---|
| expected_tools | Dataset item → metadata.expected_tools |
| actual_tools | Execution data → summary metrics → tools |
Escalation Accuracy
| Variable | Maps To |
|---|---|
| agent_response | Agent response |
| metadata | Dataset item → metadata |
Workflow Completion
| Variable | Maps To |
|---|---|
| agent_response | Agent response |
| metadata | Dataset item → metadata |
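Conceptually, each mapping resolves an evaluator variable to a field in the run data. Represented as plain data, the mappings above look roughly like this (the dotted-path schema is an assumption for illustration; the actual configuration lives in the UI):

```python
# Illustrative view of the variable mappings configured above.
variable_mappings = {
    "Tool Correctness": {
        "expected_tools": "dataset_item.metadata.expected_tools",
        "actual_tools": "execution.summary_metrics.tools",
    },
    "Escalation Accuracy": {
        "agent_response": "agent_response",
        "metadata": "dataset_item.metadata",
    },
    "Workflow Completion": {
        "agent_response": "agent_response",
        "metadata": "dataset_item.metadata",
    },
}
print(len(variable_mappings))  # 3
```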
Add Test Cases
Add the following five test cases manually:
1. Single-tool: policy lookup
| Field | Value |
|---|---|
| Input | What is your return policy? |
| Expected Output | Items can be returned within 30 days of purchase. The item must be unused and in its original packaging. Refunds are processed in 5-7 business days to your original payment method. |
| Metadata | {"expected_tools": ["search_kb.tool"], "should_escalate": false} |
2. Multi-tool: ticket and order lookup
| Field | Value |
|---|---|
| Input | Check ticket TKT-002 and tell me the status of the related order |
| Expected Output | Ticket TKT-002 is open regarding a damaged item. The related order ORD-12345 has been delivered. The order contains Headphones totaling $79.99. |
| Metadata | {"expected_tools": ["lookup_ticket.tool", "check_order_status.tool"], "should_escalate": false, "required_steps": ["ticket", "order", "delivered"]} |
3. Escalation: frustrated customer
| Field | Value |
|---|---|
| Input | This is ridiculous! I’ve been waiting 3 weeks and nobody has helped me. I need to speak to a manager right now! |
| Expected Output | I understand your frustration, and I’m sorry for the delay. I’m transferring you to a specialist who can resolve this immediately. |
| Metadata | {"should_escalate": true, "expected_tools": ["escalate_to_human.tool"]} |
4. No tools: conversation close
| Field | Value |
|---|---|
| Input | Thanks for your help, that’s all I needed! |
| Expected Output | You’re welcome! If you need anything else, don’t hesitate to reach out. Have a great day! |
| Metadata | {"expected_tools": [], "should_escalate": false} |
5. Error handling: order not found
| Field | Value |
|---|---|
| Input | I want a refund for order ORD-99999 |
| Expected Output | I wasn’t able to find an order with ID ORD-99999. Could you double-check the order number? You can find it in your confirmation email. |
| Metadata | {"expected_tools": ["check_order_status.tool"], "should_escalate": false, "required_steps": ["not found", "order"]} |
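If you prefer to prepare the dataset programmatically rather than in the UI, the five cases above map to structured items like the following (expected outputs abbreviated here; the exact item schema the SDK accepts is an assumption, so check the Datasets reference):

```python
# The five test cases as structured items: input, expected output, metadata.
test_cases = [
    {
        "input": "What is your return policy?",
        "expected_output": "Items can be returned within 30 days of purchase...",
        "metadata": {"expected_tools": ["search_kb.tool"], "should_escalate": False},
    },
    {
        "input": "Check ticket TKT-002 and tell me the status of the related order",
        "expected_output": "Ticket TKT-002 is open regarding a damaged item...",
        "metadata": {
            "expected_tools": ["lookup_ticket.tool", "check_order_status.tool"],
            "should_escalate": False,
            "required_steps": ["ticket", "order", "delivered"],
        },
    },
    {
        "input": "This is ridiculous! I've been waiting 3 weeks and nobody has "
                 "helped me. I need to speak to a manager right now!",
        "expected_output": "I understand your frustration, and I'm sorry for the delay...",
        "metadata": {"should_escalate": True,
                     "expected_tools": ["escalate_to_human.tool"]},
    },
    {
        "input": "Thanks for your help, that's all I needed!",
        "expected_output": "You're welcome! If you need anything else...",
        "metadata": {"expected_tools": [], "should_escalate": False},
    },
    {
        "input": "I want a refund for order ORD-99999",
        "expected_output": "I wasn't able to find an order with ID ORD-99999...",
        "metadata": {
            "expected_tools": ["check_order_status.tool"],
            "should_escalate": False,
            "required_steps": ["not found", "order"],
        },
    },
]
print(len(test_cases))  # 5
```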
Step 3: Trigger a Test Run
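A hedged sketch of what the trigger looks like, using a stand-in client (the `NetraClient` class and `run_dataset` method names are assumptions, not the documented Netra SDK API; consult the SDK reference for the real calls):

```python
# Stand-in sketch -- NetraClient and run_dataset are ASSUMED names,
# not the real Netra SDK surface.
class NetraClient:
    def __init__(self, api_key: str):
        self.api_key = api_key

    def run_dataset(self, dataset_id: str, agent_fn) -> dict:
        # The real SDK would run each dataset item through agent_fn and
        # score the results with the evaluators attached to the dataset.
        return {"dataset_id": dataset_id, "status": "queued"}

def support_agent(user_input: str) -> str:
    # Your agent entry point (e.g., the LangChain customer support agent
    # from the tracing cookbook).
    return "agent response"

client = NetraClient(api_key="YOUR_NETRA_API_KEY")
run = client.run_dataset("YOUR_DATASET_ID", support_agent)
print(run["status"])  # queued
```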
Copy the Dataset ID from the dataset page and pass it to your SDK trigger call.
Step 4: View Results
Go to Evaluation → Test Runs to see your test run and its status. Click on the test run to see, for each evaluator and each dataset item, whether it passed or failed. You can also click View Trace on any result to see the exact reasoning steps (thought → action → observation), which tools were called and in what order, and where the agent deviated from expected behavior. See Test Runs for the full reference.
Interpreting Scores and Improving Quality
When evaluator scores are low, use this table to identify the likely cause and fix:
| Low Score In | Likely Cause | How to Fix |
|---|---|---|
| Tool Correctness | Ambiguous tool descriptions | Add clearer docstrings with explicit “use when” / “do not use for” guidance |
| Escalation (false negatives) | Agent misses urgency signals | Add more escalation triggers to the system prompt (e.g., specific keywords, wait times) |
| Escalation (false positives) | Agent is over-cautious | Narrow escalation criteria — list what should NOT be escalated |
| Workflow Completion | Agent stops before finishing all steps | Add explicit completion checks to the prompt (e.g., “verify all related records before responding”) |
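Two of these fixes sketched concretely (the wording is illustrative, not prescribed): a tool docstring with explicit routing guidance, and prompt additions targeting escalation misses and incomplete workflows:

```python
# 1. Clearer tool docstring: explicit "use when" / "do not use for" guidance
#    helps the agent pick the right tool.
def check_order_status(order_id: str) -> str:
    """Look up the current status of a specific order.

    Use when: the customer references an order ID (e.g., ORD-12345) or asks
    about shipping, delivery, or refund status for a specific purchase.
    Do not use for: general policy questions (use search_kb for those).
    """
    ...

# 2. Prompt additions addressing under-escalation and incomplete workflows.
PROMPT_ADDITIONS = """
Escalate to a human immediately if the customer expresses frustration,
mentions long wait times, or asks for a manager.
Do NOT escalate routine questions the knowledge base can answer.
Before responding, verify you have checked all related records
(for example, the order linked to a ticket).
"""
```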
Prompt Improvements Based on Evaluation
Use evaluation failures to refine your agent prompt.
Continuous Evaluation Strategy
For production agents, run evaluations regularly:
- On every prompt change — Re-run the full test suite to catch regressions
- After tool additions — Ensure new tools don’t disrupt existing tool selection patterns
- Weekly benchmarks — Track quality trends over time to catch gradual degradation
- After model upgrades — Verify that a new model version doesn’t change escalation or tool selection behavior
See Also
Trace Your LangChain Agent
Set up comprehensive tracing for your agent before evaluating
Evaluation Overview
Deep dive into Netra’s evaluation framework: datasets, evaluators, and test runs
Simulating Customer Support
Test your agent through multi-turn simulated conversations
A/B Testing Configurations
Compare different pipeline configurations systematically
