What You’ll Learn
- Tool Selection Evaluation: validate that agents call the correct tools for each query type
- Escalation Accuracy: measure over-escalation and under-escalation rates
- Workflow Completion: verify that multi-step tasks are completed in the correct order
- Custom Agent Evaluators: build domain-specific evaluators for your agent’s behavior
Prerequisites:
- Python >=3.10, <3.14 or Node.js 18+
- OpenAI API key
- Netra API key (Get started here)
- A test dataset with expected outputs
Why Evaluate Agent Decisions?
Agent evaluation differs from simple LLM evaluation. Agents make multi-step decisions that compound:
| Dimension | What to Measure | Why It Matters |
|---|---|---|
| Tool Selection | Did it call the right tools? | Wrong tools = wrong answers |
| Tool Sequence | Did it call tools in the right order? | Order matters for workflows |
| Completion | Did it resolve the query? | Premature stops frustrate users |
| Escalation Accuracy | Did it escalate appropriately? | Over/under-escalation impacts operations |
Creating Test Datasets
Test Case Structure for Agents
Agent test cases need more structure than simple LLM tests: each case should record the expected tools, any forbidden tools, an escalation flag, and the expected outcome (an example test case appears after the field list in Creating the Dataset in Netra below).
Test Categories
Build test cases across different agent behaviors:
| Category | Example Queries | Expected Behavior |
|---|---|---|
| Single-tool | “What is your return policy?” | One tool call (search_kb) |
| Multi-tool | “Check ticket TKT-001 and the related order” | Sequence of tools |
| Escalation | “I’m furious! Need to speak to a manager!” | Escalate to human |
| No-tool | “Thanks for your help!” | Direct response, no tools |
| Edge cases | “Refund order that doesn’t exist” | Appropriate error handling |
Creating the Dataset in Netra
- Navigate to Evaluation → Datasets
- Click Create Dataset
- Add test cases with:
  - query: The user input
  - expected_tools: List of tools that should be called
  - forbidden_tools: Tools that should not be called
  - should_escalate: Boolean for escalation expectation
  - expected_outcome: Description of correct outcome
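For reference, a test case with these fields might look like the following. The field names mirror the list above; the tool names (search_kb, get_ticket, get_order, escalate_to_human) are illustrative stand-ins for whatever tools your agent actually exposes.

```python
# Illustrative test cases; adjust tool names to match your agent.
test_cases = [
    {
        "query": "What is your return policy?",
        "expected_tools": ["search_kb"],
        "forbidden_tools": ["escalate_to_human"],
        "should_escalate": False,
        "expected_outcome": "Answers from the knowledge base return-policy article",
    },
    {
        "query": "Check ticket TKT-001 and the related order",
        "expected_tools": ["get_ticket", "get_order"],  # ticket first, then the order
        "forbidden_tools": [],
        "should_escalate": False,
        "expected_outcome": "Reports the ticket status and the linked order details",
    },
    {
        "query": "I'm furious! Need to speak to a manager!",
        "expected_tools": ["escalate_to_human"],
        "forbidden_tools": [],
        "should_escalate": True,
        "expected_outcome": "Acknowledges the frustration and escalates to a human",
    },
]
```

Aim for several cases in every category from the Test Categories table, including edge cases.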
Defining Agent Evaluators
Create evaluators in Evaluation → Evaluators → Add Evaluator.
Tool Correctness Evaluator
Use the Tool Correctness template to validate tool selection:
| Setting | Value |
|---|---|
| Template | Tool Correctness |
| Output Type | Numerical |
| Pass Criteria | score >= 0.8 |
The evaluator checks the following (a rough sketch of the logic appears after this list):
- Were all expected tools called?
- Were forbidden tools avoided?
- Was the sequence correct (if order matters)?
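Conceptually, the check compares the tools recorded in the trace against the test case. The template handles this for you; the sketch below only illustrates the kind of logic involved, with an arbitrary 0.5/0.25/0.25 weighting that is not Netra's actual scoring.

```python
def tool_correctness(called, expected, forbidden, order_matters=False):
    """Score tool selection: full marks only if all expected tools were called,
    no forbidden tools appeared, and (optionally) the order matched."""
    called_set, expected_set = set(called), set(expected)
    score = 0.0
    if expected_set <= called_set:                 # all expected tools were called
        score += 0.5
    if not (set(forbidden) & called_set):          # no forbidden tools appeared
        score += 0.25
    if not order_matters or [t for t in called if t in expected_set] == list(expected):
        score += 0.25                              # sequence matches when order matters
    return score

# Example: only search_kb was expected, but the agent also escalated.
print(tool_correctness(["search_kb", "escalate_to_human"],
                       expected=["search_kb"],
                       forbidden=["escalate_to_human"]))  # 0.75
```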
Escalation Accuracy Evaluator
Create a Code Evaluator for escalation decisions (a sketch of the evaluator body follows the settings table):
| Setting | Value |
|---|---|
| Type | Code Evaluator |
| Output Type | Numerical |
| Pass Criteria | score >= 0.8 |
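The evaluator body can be a short function that compares the expected escalation flag against whether an escalation tool actually appears in the trace. The exact signature Netra passes to a Code Evaluator may differ from this sketch, and escalate_to_human is an assumed tool name.

```python
ESCALATION_TOOLS = {"escalate_to_human"}  # assumed tool name; adjust to your agent

def escalation_accuracy(should_escalate: bool, called_tools: list[str]) -> float:
    """Return 1.0 when the escalation decision matches the expectation.

    - Expected escalation that never happened (under-escalation) scores 0.0.
    - Escalating a query that should have been handled directly
      (over-escalation) also scores 0.0.
    """
    did_escalate = bool(ESCALATION_TOOLS & set(called_tools))
    return 1.0 if did_escalate == should_escalate else 0.0

# Over-escalation example: a routine policy question sent to a human.
print(escalation_accuracy(False, ["search_kb", "escalate_to_human"]))  # 0.0
```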
Workflow Completion Evaluator
Create a code evaluator that validates multi-step workflow completion:
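A minimal sketch of the idea: treat the workflow as an ordered list of required tool calls and score the fraction completed in order. The helper below is illustrative, not Netra's API.

```python
def workflow_completion(called_tools: list[str], required_steps: list[str]) -> float:
    """Fraction of required steps that appear in the trace in the expected order."""
    completed = 0
    position = 0
    for step in required_steps:
        try:
            position = called_tools.index(step, position) + 1
            completed += 1
        except ValueError:
            break  # a missing step also blocks everything after it
    return completed / len(required_steps) if required_steps else 1.0

# "Check ticket TKT-001 and the related order": both lookups must happen, ticket first.
print(workflow_completion(["get_ticket", "get_order"], ["get_ticket", "get_order"]))  # 1.0
print(workflow_completion(["get_order"], ["get_ticket", "get_order"]))                # 0.0
```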
Query Category Accuracy
Evaluate whether the agent handles different query types correctly:
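One way to sketch this is to tag each test case with its category and score the behavior that category implies. The rules below mirror the Test Categories table; the function shape is illustrative.

```python
def category_accuracy(category: str, called_tools: list[str], escalated: bool) -> float:
    """Score whether the observed behavior matches the test case's category."""
    if category == "no-tool":
        return 1.0 if not called_tools else 0.0   # should answer directly, no tools
    if category == "single-tool":
        return 1.0 if len(called_tools) == 1 else 0.0
    if category == "multi-tool":
        return 1.0 if len(called_tools) >= 2 else 0.0
    if category == "escalation":
        return 1.0 if escalated else 0.0
    # Edge cases usually need a case-specific check, so they are not scored here.
    return 0.0

print(category_accuracy("no-tool", [], escalated=False))                 # 1.0
print(category_accuracy("escalation", ["search_kb"], escalated=False))   # 0.0
```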
Running Agent Evaluations
Setup
Ensure your agent is configured with Netra tracing so each test run records the tools called at every step; the run sketch in the next section assumes tracing is already in place.
Running the Evaluation Suite
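A minimal sketch of a run loop over the dataset, reusing the evaluator sketches above. It assumes you have a traced run_agent(query) entry point that returns the agent's answer and the tools it called; the Netra tracing initialization itself is left as a comment because the exact SDK calls are not shown here.

```python
from dataclasses import dataclass

@dataclass
class AgentResult:
    answer: str
    called_tools: list[str]

def run_suite(test_cases, run_agent) -> dict[str, float]:
    """Run every test case through the traced agent and average the scores.

    Uses the tool_correctness, escalation_accuracy, and workflow_completion
    sketches defined earlier in this guide.
    """
    # Initialize Netra tracing here (see the Netra docs) so every
    # run_agent call below produces a trace you can inspect later.
    totals = {"tool_correctness": 0.0,
              "escalation_accuracy": 0.0,
              "workflow_completion": 0.0}
    for case in test_cases:
        result = run_agent(case["query"])
        totals["tool_correctness"] += tool_correctness(
            result.called_tools, case["expected_tools"], case["forbidden_tools"]
        )
        totals["escalation_accuracy"] += escalation_accuracy(
            case["should_escalate"], result.called_tools
        )
        # Treat expected_tools as the ordered steps of the workflow.
        totals["workflow_completion"] += workflow_completion(
            result.called_tools, case["expected_tools"]
        )
    return {name: total / len(test_cases) for name, total in totals.items()}
```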
Analyzing Results
Interpreting Evaluation Metrics
After running evaluations, review results in Evaluation → Test Runs:
| Metric | Good | Warning | Action Needed |
|---|---|---|---|
| Tool Correctness | > 90% | 80-90% | < 80% |
| Escalation Accuracy | > 95% | 90-95% | < 90% |
| Workflow Completion | > 85% | 75-85% | < 75% |
Common Failure Patterns
| Low Score In | Likely Cause | How to Fix |
|---|---|---|
| Tool Correctness | Ambiguous tool descriptions | Add clearer docstrings, few-shot examples |
| Escalation (false negatives) | Missing urgency signals | Add more escalation triggers to prompt |
| Escalation (false positives) | Over-cautious agent | Narrow escalation criteria |
| Workflow Completion | Premature termination | Add explicit completion checks |
Debugging with Traces
Click View Trace on any failed test case to see:
- The exact reasoning steps (thought → action → observation)
- Which tools were called and in what order
- The full prompt and response at each iteration
- Where the agent deviated from expected behavior
Improving Agent Performance
Prompt Engineering Based on Evaluation
Use evaluation insights to improve your agent prompt:
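For example, if evaluations show under-escalation and premature termination, you might extend the system prompt along these lines; the wording is illustrative and should be adapted to your agent.

```python
# Additions to the agent's system prompt, motivated by the failure patterns above.
SYSTEM_PROMPT_ADDITIONS = """
Escalation triggers (added after escalation false negatives):
- Escalate when the user is angry, threatens to cancel, or explicitly asks for a human.

Completion checks (added after premature-termination failures):
- Before ending your turn, confirm that every part of the user's request has been
  addressed; if a step is still pending, continue the workflow instead of replying.
"""
```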
Tool Description Improvements
If tool selection is failing, improve tool descriptions:
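A before/after sketch of a tool docstring. The tool name matches the earlier examples, the bodies and framework wiring are omitted, and the point is the added specificity plus an example call.

```python
# Before: a vague description that invites wrong tool selection.
def search_kb(query: str) -> str:
    """Search stuff."""
    ...

# After: says what the tool covers, when (not) to use it, and shows an example call.
def search_kb(query: str) -> str:  # redefined here only to show the improved version
    """Search the customer-support knowledge base for policies and how-to articles.

    Use this for general questions about returns, shipping, or account settings.
    Do NOT use it for ticket- or order-specific lookups (use the ticket/order tools).
    Example: search_kb("return policy for opened items")
    """
    ...
```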
Continuous Evaluation
Set up regular evaluation runs (a regression-gate sketch follows this list):
- On prompt changes — Re-run full test suite
- Weekly regression tests — Catch drift in agent behavior
- After tool additions — Ensure new tools don’t disrupt existing behavior
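One lightweight way to automate this is a gate script that fails the build when averaged scores fall below the thresholds from the Interpreting Evaluation Metrics table. Here run_suite refers to the earlier sketch, and the threshold values are the "Action Needed" boundaries.

```python
import sys

# Minimum acceptable averages, taken from the "Action Needed" column above.
THRESHOLDS = {
    "tool_correctness": 0.80,
    "escalation_accuracy": 0.90,
    "workflow_completion": 0.75,
}

def regression_gate(scores: dict[str, float]) -> int:
    """Return a process exit code: 0 when every metric clears its threshold."""
    failures = {m: s for m, s in scores.items() if s < THRESHOLDS.get(m, 0.0)}
    for metric, score in sorted(failures.items()):
        print(f"FAIL {metric}: {score:.2f} < {THRESHOLDS[metric]:.2f}")
    return 1 if failures else 0

if __name__ == "__main__":
    # scores = run_suite(test_cases, run_agent)  # from the earlier sketch
    # Example values: escalation_accuracy trips the gate, so the build fails.
    scores = {"tool_correctness": 0.92,
              "escalation_accuracy": 0.88,
              "workflow_completion": 0.81}
    sys.exit(regression_gate(scores))
```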
Summary
You’ve learned how to systematically evaluate agent decision-making:
- Tool Correctness validates that agents call the right tools
- Escalation Accuracy catches over/under-escalation
- Workflow Completion ensures multi-step tasks succeed
- Trace debugging reveals exactly where agents go wrong
Key Takeaways
- Agent errors compound—95% accuracy per step means 86% for 3-step workflows
- Test cases need structure: expected tools, forbidden tools, escalation flags
- Custom code evaluators handle domain-specific logic
- Use evaluation insights to improve prompts and tool descriptions