
# Evaluating Agent Decisions

> Evaluate AI agent decisions with Netra. Measure tool selection accuracy, escalation logic, and workflow completion using structured evaluation datasets.

A traced agent tells you what happened — which tools were called, how long each step took, and what the LLM generated. Evaluation tells you whether the agent made the right decisions. Without structured scoring, you can't tell if the agent is selecting the wrong tools, over-escalating simple requests, or stopping before the workflow is complete. These failures don't throw errors — they just produce worse outcomes.

This cookbook walks you through Netra's evaluation workflow: creating evaluators for agent-specific quality dimensions, building test datasets from your traces, running test suites, and interpreting results to improve your agent.

<Info>
  **Prerequisite:** You need a Netra API key ([Get started here](/quick-start/Overview)) and an AI agent to evaluate. The test cases below use the customer support agent from the [Tracing LangChain Agents](/Cookbooks/observability/tracing-langchain-agents) cookbook as a reference.
</Info>

## What You'll Learn

<CardGroup cols={2}>
  <Card title="Build a Test Dataset" icon="database">
    Create structured test cases with inputs, expected outputs, and metadata for your evaluators
  </Card>

  <Card title="Configure Agent Evaluators" icon="scale-balanced">
    Set up evaluators for tool correctness, escalation accuracy, and workflow completion
  </Card>

  <Card title="Run Test Suites" icon="flask-vial">
    Execute evaluations via the SDK and collect quality metrics
  </Card>

  <Card title="Analyze Results & Iterate" icon="chart-line">
    Interpret scores, debug failures using trace integration, and improve your agent
  </Card>
</CardGroup>

***

## Why Agent Decisions Need Evaluation

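Agent evaluation differs from single-response LLM evaluation because agents make multi-step decisions whose errors compound. If each tool selection is 95% accurate and a workflow needs three correct selections, only about 86% of workflows succeed end to end (0.95^3 ≈ 0.857, assuming the steps fail independently). The failure modes below are also hard to catch by spot-checking: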

| Failure Mode             | What Goes Wrong                                                   | Why You Can't Spot-Check It                                                                          |
| ------------------------ | ----------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------- |
| **Wrong tool selection** | Agent uses `search_kb` when it should use `check_order_status`    | The answer may still sound reasonable despite using the wrong data source                            |
| **Over-escalation**      | Agent escalates a simple FAQ to a human operator                  | Each escalation looks cautious and safe in isolation — you need aggregate metrics to see the pattern |
| **Under-escalation**     | Agent tries to handle a frustrated customer instead of escalating | Only visible when you compare the agent's decision against the expected action                       |
| **Incomplete workflow**  | Agent looks up the ticket but never checks the related order      | The partial answer addresses part of the question, so it looks acceptable on a quick read            |

Netra's evaluation framework addresses this with [Datasets](/Evaluation/Datasets) (test cases with inputs, expected outputs, and metadata), [Evaluators](/Evaluation/Evaluators) (library and custom code-based scoring for tool usage, escalation, and completion), and [Test Runs](/Evaluation/TestRuns) (execution results with pass/fail rates, scores, and linked traces). The workflow is: create evaluators, build test cases, run, and review. See the [Evaluation Overview](/Evaluation/Evaluation-overview) for a deeper look at the framework.
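
The compounding figure quoted at the start of this section can be checked with a few lines of Python. This is a minimal sketch that assumes every step succeeds independently with the same accuracy; `workflow_success_rate` is an illustrative helper, not part of the Netra SDK.

```python
def workflow_success_rate(step_accuracy: float, steps: int) -> float:
    """Probability that every step in a workflow is correct.

    Assumes steps succeed independently and with equal accuracy.
    """
    if not 0.0 <= step_accuracy <= 1.0:
        raise ValueError("step_accuracy must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return step_accuracy ** steps


# Show how end-to-end success drops as workflows get longer.
for steps in (1, 3, 5):
    rate = workflow_success_rate(0.95, steps)
    print(f"{steps} step(s): {rate:.1%}")
```

Longer workflows lose reliability quickly under this assumption, which is why per-step accuracy alone is a misleading quality signal for agents.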

***

The steps below walk through evaluating agent decisions.

## Step 1: Create Evaluators

You need three evaluators — one from the library and two custom LLM as Judge evaluators.

### Tool Correctness (Library)

Go to **Evaluation → Evaluators**, switch to the **Library** tab, and add **Tool Correctness** from the Tool Use category.

| Evaluator            | What It Measures                                                                         |
| -------------------- | ---------------------------------------------------------------------------------------- |
| **Tool Correctness** | Did the agent call the right tools, avoid forbidden tools, and use the correct sequence? |

### Escalation Accuracy (LLM as Judge)

Click **Add Evaluator** and create an LLM as Judge evaluator.

| Setting           | Value        |
| ----------------- | ------------ |
| **Type**          | LLM as Judge |
| **Output Type**   | Numerical    |
| **Pass Criteria** | score >= 0.8 |

Use the following prompt template:

```
You are evaluating whether a customer support agent made the correct escalation decision.

Metadata for this test case:
{{metadata}}

Agent's response:
{{agent_response}}

Evaluate the escalation decision based on the "should_escalate" field in the metadata:
- If should_escalate is true, the agent must have escalated (e.g., transferred to a human, mentioned a specialist, offered to connect to a manager).
- If should_escalate is false, the agent must NOT have escalated.

Score:
- 1.0 if the agent's escalation decision is correct
- 0.5 if the agent escalated unnecessarily (false positive — less severe)
- 0.0 if the agent failed to escalate when required (false negative — severe)
```
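
The rubric in this prompt can also be written as a small deterministic function, which is handy for reasoning about which outcomes pass. This is only a sketch of the scoring rule: it assumes you already know whether the agent escalated (`did_escalate`), whereas the LLM judge has to infer that from the response text. The function names are illustrative, not Netra APIs.

```python
PASS_THRESHOLD = 0.8  # mirrors the "score >= 0.8" pass criteria above


def escalation_score(should_escalate: bool, did_escalate: bool) -> float:
    """Score an escalation decision using the rubric from the prompt template."""
    if should_escalate == did_escalate:
        return 1.0  # correct decision
    if did_escalate:
        return 0.5  # false positive: escalated unnecessarily
    return 0.0  # false negative: failed to escalate when required


def passes(score: float) -> bool:
    """Apply the evaluator's pass criteria to a score."""
    return score >= PASS_THRESHOLD


for should, did in [(True, True), (False, True), (True, False)]:
    score = escalation_score(should, did)
    print(f"should={should} did={did} score={score} pass={passes(score)}")
```

Both kinds of mistake fail the 0.8 threshold. The 0.5 partial credit only affects average scores, where it keeps false positives from weighing as heavily as false negatives.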

### Workflow Completion (LLM as Judge)

Create another LLM as Judge evaluator that validates whether the agent completed all required steps.

| Setting           | Value        |
| ----------------- | ------------ |
| **Type**          | LLM as Judge |
| **Output Type**   | Boolean      |
| **Pass Criteria** | true         |

Use the following prompt template:

```
You are evaluating whether a customer support agent completed all required workflow steps.

Metadata for this test case:
{{metadata}}

Agent's response:
{{agent_response}}

Check the "required_steps" field in the metadata. If required_steps is empty or missing, return true.

Otherwise, verify that the agent's response addresses every step listed. Steps can be addressed using different wording — check for semantic equivalence, not exact keyword matches.

Return true if ALL required steps are covered, false otherwise.
```
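
To see why the prompt asks for semantic equivalence rather than keyword matching, compare it with a naive substring baseline. The sketch below is an illustrative stand-in, not how Netra runs the evaluator; `completed_required_steps` is a hypothetical helper. The sample response is the expected output of the non-existent-order test case defined in Step 2.

```python
def completed_required_steps(metadata: dict, agent_response: str) -> bool:
    """Naive baseline: True when every required step appears in the response.

    Uses case-insensitive substring matching only, so paraphrases are missed.
    """
    required_steps = metadata.get("required_steps") or []
    if not required_steps:
        return True  # empty or missing required_steps passes, as in the prompt
    response = agent_response.lower()
    return all(step.lower() in response for step in required_steps)


metadata = {"required_steps": ["not found", "order"]}
response = (
    "I wasn't able to find an order with ID ORD-99999. "
    "Could you double-check the order number? "
    "You can find it in your confirmation email."
)

# The response says "wasn't able to find", not "not found",
# so the keyword baseline rejects an answer a human would accept.
print(completed_required_steps(metadata, response))
```

The baseline returns `False` here even though the response clearly communicates that the order was not found, which is the gap the LLM judge is meant to close.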

You can test each evaluator in the **Playground** before using it in a dataset. See [Evaluators](/Evaluation/Evaluators) for the full reference.

<video autoPlay muted loop playsInline className="w-full aspect-video rounded-xl" src="https://mintcdn.com/netra/flCEWIb7m_86sZYE/videos/cookbook-agent-decisions-evaluators.mp4?fit=max&auto=format&n=flCEWIb7m_86sZYE&q=85&s=58455fc7d36c72ab4f902c79f62ffa21" data-path="videos/cookbook-agent-decisions-evaluators.mp4" />

***

## Step 2: Create a Dataset

<video autoPlay muted loop playsInline className="w-full aspect-video rounded-xl" src="https://mintlify.s3.us-west-1.amazonaws.com/netra/videos/cookbook-agent-decisions-dataset.mp4" />

Go to **Evaluation → Datasets** and click **Create Dataset**. Name it "Agent Decisions Dataset" and attach the three evaluators from Step 1.

### Configure Variable Mappings

For each evaluator, map the variables to their data source so the evaluator receives the correct inputs at runtime:

**Tool Correctness**

| Variable         | Maps To                                  |
| ---------------- | ---------------------------------------- |
| `expected_tools` | Dataset item → `metadata.expected_tools` |
| `actual_tools`   | Execution data → summary metrics → tools |
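
Conceptually, this mapping lets the evaluator compare two lists of tool names. The sketch below illustrates one way such a comparison could work. The scoring logic of the library evaluator is not documented here, so treat `compare_tools` and its output fields as assumptions made for illustration.

```python
def compare_tools(expected_tools: list[str], actual_tools: list[str]) -> dict:
    """Report how the tools an agent called differ from the expected list."""
    return {
        # Expected tools the agent never called.
        "missing": [t for t in expected_tools if t not in actual_tools],
        # Tools the agent called that were not expected.
        "unexpected": [t for t in actual_tools if t not in expected_tools],
        # True only when the same tools were called in the same order.
        "exact_match": expected_tools == actual_tools,
    }


# Hypothetical run: two tools were expected, but the agent stopped after one.
report = compare_tools(
    ["lookup_ticket.tool", "check_order_status.tool"],
    ["lookup_ticket.tool"],
)
print(report)
```

A non-empty `missing` list corresponds to the incomplete-workflow failure mode, while a non-empty `unexpected` list points to wrong tool selection.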

**Escalation Accuracy**

| Variable         | Maps To                 |
| ---------------- | ----------------------- |
| `agent_response` | Agent response          |
| `metadata`       | Dataset item → metadata |

**Workflow Completion**

| Variable         | Maps To                 |
| ---------------- | ----------------------- |
| `agent_response` | Agent response          |
| `metadata`       | Dataset item → metadata |

### Add Test Cases

Add the following five test cases manually:

**1. Single-tool — Policy lookup**

| Field               | Value                                                                                                                                                                                |
| ------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| **Input**           | What is your return policy?                                                                                                                                                          |
| **Expected Output** | Items can be returned within 30 days of purchase. The item must be unused and in its original packaging. Refunds are processed in 5-7 business days to your original payment method. |
| **Metadata**        | `{"expected_tools": ["search_kb.tool"], "should_escalate": false}`                                                                                                                   |

**2. Multi-tool — Ticket with related order**

| Field               | Value                                                                                                                                                 |
| ------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Input**           | Check ticket TKT-002 and tell me the status of the related order                                                                                      |
| **Expected Output** | Ticket TKT-002 is open regarding a damaged item. The related order ORD-12345 has been delivered. The order contains Headphones totaling \$79.99.      |
| **Metadata**        | `{"expected_tools": ["lookup_ticket.tool", "check_order_status.tool"], "should_escalate": false, "required_steps": ["ticket", "order", "delivered"]}` |

**3. Escalation — Angry customer**

| Field               | Value                                                                                                                              |
| ------------------- | ---------------------------------------------------------------------------------------------------------------------------------- |
| **Input**           | This is ridiculous! I've been waiting 3 weeks and nobody has helped me. I need to speak to a manager right now!                    |
| **Expected Output** | I understand your frustration, and I'm sorry for the delay. I'm transferring you to a specialist who can resolve this immediately. |
| **Metadata**        | `{"should_escalate": true, "expected_tools": ["escalate_to_human.tool"]}`                                                          |

**4. No-tool — Simple thank you**

| Field               | Value                                                                                     |
| ------------------- | ----------------------------------------------------------------------------------------- |
| **Input**           | Thanks for your help, that's all I needed!                                                |
| **Expected Output** | You're welcome! If you need anything else, don't hesitate to reach out. Have a great day! |
| **Metadata**        | `{"expected_tools": [], "should_escalate": false}`                                        |

**5. Edge case — Non-existent order**

| Field               | Value                                                                                                                                  |
| ------------------- | -------------------------------------------------------------------------------------------------------------------------------------- |
| **Input**           | I want a refund for order ORD-99999                                                                                                    |
| **Expected Output** | I wasn't able to find an order with ID ORD-99999. Could you double-check the order number? You can find it in your confirmation email. |
| **Metadata**        | `{"expected_tools": ["check_order_status.tool"], "should_escalate": false, "required_steps": ["not found", "order"]}`                  |
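
Because metadata is entered as raw JSON, a typo can silently break an evaluator. The sketch below checks the five metadata strings above locally before you paste them in. The validation rules (which keys are required and their types) are assumptions based on how the evaluators in this cookbook use the fields, not a schema enforced by Netra.

```python
import json

# Metadata strings copied from the five test cases above.
TEST_CASE_METADATA = {
    "policy_lookup": '{"expected_tools": ["search_kb.tool"], "should_escalate": false}',
    "ticket_with_order": (
        '{"expected_tools": ["lookup_ticket.tool", "check_order_status.tool"], '
        '"should_escalate": false, "required_steps": ["ticket", "order", "delivered"]}'
    ),
    "angry_customer": '{"should_escalate": true, "expected_tools": ["escalate_to_human.tool"]}',
    "simple_thank_you": '{"expected_tools": [], "should_escalate": false}',
    "nonexistent_order": (
        '{"expected_tools": ["check_order_status.tool"], '
        '"should_escalate": false, "required_steps": ["not found", "order"]}'
    ),
}


def validate_metadata(raw: str) -> dict:
    """Parse a metadata string and check the fields the evaluators read."""
    meta = json.loads(raw)  # raises json.JSONDecodeError on malformed JSON
    if not isinstance(meta.get("expected_tools"), list):
        raise ValueError("expected_tools must be a list")
    if not isinstance(meta.get("should_escalate"), bool):
        raise ValueError("should_escalate must be true or false")
    steps = meta.get("required_steps", [])  # optional field
    if not isinstance(steps, list) or not all(isinstance(s, str) for s in steps):
        raise ValueError("required_steps must be a list of strings")
    return meta


for name, raw in TEST_CASE_METADATA.items():
    meta = validate_metadata(raw)
    print(f"{name}: ok ({len(meta['expected_tools'])} expected tool(s))")
```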

In **Evaluation → Datasets**, you should now see "Agent Decisions Dataset" with five items, and its three evaluators listed on the **Evaluators** tab.

***

## Step 3: Trigger a Test Run

Copy the **Dataset ID** from the dataset page and substitute it for `your-dataset-id` in the code below.

<CodeGroup>
  ```python Python theme={null}
  from netra import Netra
  from netra.instrumentation.instruments import InstrumentSet

  Netra.init(
      app_name="agent-evaluation",
      instruments={InstrumentSet.OPENAI, InstrumentSet.LANGCHAIN},
  )

  # Wrap your agent in a function that takes an input string and returns
  # the generated response. `agent` is not defined in this snippet: it is
  # assumed to be the agent you built in the tracing cookbook.
  def run_agent(input_data):
      result = agent.invoke({"messages": [{"role": "user", "content": input_data}]})
      return result["messages"][-1].content

  dataset = Netra.evaluation.get_dataset(dataset_id="your-dataset-id")

  result = Netra.evaluation.run_test_suite(
      name="Agent Decision Evaluation",
      data=dataset,
      task=run_agent,
  )
  ```

  ```typescript TypeScript theme={null}
  import { Netra, NetraInstruments } from "netra-sdk";

  await Netra.init({
    appName: "agent-evaluation",
    instruments: new Set([NetraInstruments.OPENAI, NetraInstruments.LANGCHAIN]),
  });

  // Wrap your agent in a function that takes an input string and returns
  // the generated response. `agent` is not defined in this snippet: it is
  // assumed to be the agent you built in the tracing cookbook.
  async function runAgent(inputData: string): Promise<string> {
    const result = await agent.invoke({ messages: [{ role: "user", content: inputData }] });
    return result.messages[result.messages.length - 1].content;
  }

  const dataset = await Netra.evaluation.getDataset("your-dataset-id");

  const result = await Netra.evaluation.runTestSuite(
    "Agent Decision Evaluation",
    dataset,
    runAgent,
  );
  ```
</CodeGroup>

For more details on the evaluation API, refer to the [SDK documentation](/sdk-reference/evaluation/python).
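
Before triggering a full test run, it can help to confirm that your task function returns a plain string. The sketch below exercises `run_agent` against a hypothetical `StubAgent` that mimics only the `invoke()` return shape used above (a dict whose `messages` list ends with an object exposing `.content`). It makes no Netra or LLM calls. Swap in your real agent once the wiring works.

```python
from types import SimpleNamespace


class StubAgent:
    """Hypothetical stand-in that mimics the agent.invoke() return shape."""

    def invoke(self, payload: dict) -> dict:
        user_text = payload["messages"][-1]["content"]
        reply = SimpleNamespace(content=f"(stub reply to: {user_text})")
        return {"messages": [*payload["messages"], reply]}


agent = StubAgent()


def run_agent(input_data: str) -> str:
    """Same task function as above, here backed by the stub agent."""
    result = agent.invoke({"messages": [{"role": "user", "content": input_data}]})
    return result["messages"][-1].content


reply = run_agent("What is your return policy?")
# The evaluators receive this value as the agent response, so it must be text.
assert isinstance(reply, str) and reply
print(reply)
```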

***

## Step 4: View Results

Go to **Evaluation → Test Runs** to see your test run and its status. Click the test run to see, for each dataset item, whether each evaluator passed or failed.

<video autoPlay muted loop playsInline className="w-full aspect-video rounded-xl" src="https://mintcdn.com/netra/flCEWIb7m_86sZYE/videos/cookbook-agent-decisions-results.mp4?fit=max&auto=format&n=flCEWIb7m_86sZYE&q=85&s=90b62555b8d37410376ad32219e96927" data-path="videos/cookbook-agent-decisions-results.mp4" />

You can also click **View Trace** on any result to see the exact reasoning steps (thought → action → observation), which tools were called and in what order, and where the agent deviated from expected behavior. See [Test Runs](/Evaluation/TestRuns) for the full reference.

***

## Interpreting Scores and Improving Quality

When evaluator scores are low, use this table to identify the likely cause and fix:

| Low Score In                     | Likely Cause                           | How to Fix                                                                                          |
| -------------------------------- | -------------------------------------- | --------------------------------------------------------------------------------------------------- |
| **Tool Correctness**             | Ambiguous tool descriptions            | Add clearer docstrings with explicit "use when" / "do not use for" guidance                         |
| **Escalation (false negatives)** | Agent misses urgency signals           | Add more escalation triggers to the system prompt (e.g., specific keywords, wait times)             |
| **Escalation (false positives)** | Agent is over-cautious                 | Narrow escalation criteria — list what should NOT be escalated                                      |
| **Workflow Completion**          | Agent stops before finishing all steps | Add explicit completion checks to the prompt (e.g., "verify all related records before responding") |
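
To tell the two escalation rows apart, separate false positives from false negatives. With the rubric from Step 1, a score of 0.5 marks an unnecessary escalation and 0.0 marks a missed one. The sketch below assumes you have collected the Escalation Accuracy scores into a list yourself, for example by reading them off the test run page. It does not call any Netra API, and the sample scores are made up for illustration.

```python
PASS_THRESHOLD = 0.8  # matches the Escalation Accuracy pass criteria


def summarize_escalation(scores: list[float]) -> dict:
    """Break escalation scores down by failure type."""
    total = len(scores)
    passed = sum(1 for s in scores if s >= PASS_THRESHOLD)
    return {
        "total": total,
        "pass_rate": passed / total if total else 0.0,
        "false_positives": sum(1 for s in scores if s == 0.5),
        "false_negatives": sum(1 for s in scores if s == 0.0),
    }


# Hypothetical scores for illustration only, not real results.
summary = summarize_escalation([1.0, 1.0, 0.5, 1.0, 0.0])
print(summary)
```

A high false-positive count suggests narrowing the escalation criteria, while any false negatives suggest adding triggers, as described in the table above.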

### Prompt Improvements Based on Evaluation

Use evaluation failures to refine your agent prompt:

```python theme={null}
# Before: Vague escalation guidance
system_prompt = """Escalate complex issues to human operators."""

# After: Specific criteria derived from evaluation failures
system_prompt = """
Escalate to human operators when ANY of these conditions are met:
- User expresses frustration ("ridiculous", "unacceptable", "furious")
- User has been waiting more than 2 weeks
- User explicitly asks to speak to a human
- The issue involves policy exceptions

Do NOT escalate for:
- Simple FAQ questions
- Routine order status checks
- Standard refund requests within policy
"""
```

After making changes, re-run the evaluation against the same dataset and compare results across test runs. Netra tracks all runs so you can see whether your changes improved quality.
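
When comparing two runs, the overall pass rate can hide a test case that used to pass and now fails. The sketch below assumes you have recorded per-test-case pass/fail results for each run as plain dictionaries. `find_regressions` is an illustrative helper, and the sample results are hypothetical.

```python
def find_regressions(before: dict[str, bool], after: dict[str, bool]) -> dict[str, list[str]]:
    """Compare per-test-case pass/fail results from two runs."""
    shared = before.keys() & after.keys()  # only compare cases present in both runs
    return {
        "regressed": sorted(k for k in shared if before[k] and not after[k]),
        "fixed": sorted(k for k in shared if not before[k] and after[k]),
    }


# Hypothetical results for illustration only.
before = {"policy_lookup": True, "angry_customer": False, "simple_thank_you": True}
after = {"policy_lookup": True, "angry_customer": True, "simple_thank_you": False}

print(find_regressions(before, after))
```

In this made-up example the prompt change fixes the escalation case but breaks the no-tool case, even though the overall pass rate is unchanged.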

***

## Continuous Evaluation Strategy

For production agents, run evaluations regularly:

1. **On every prompt change** — Re-run the full test suite to catch regressions
2. **After tool additions** — Ensure new tools don't disrupt existing tool selection patterns
3. **Weekly benchmarks** — Track quality trends over time to catch gradual degradation
4. **After model upgrades** — Verify that a new model version doesn't change escalation or tool selection behavior
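
One way to enforce these checks is a small gate in your CI pipeline that fails the build when an evaluator's pass rate drops below a minimum. The sketch below is an assumption-heavy illustration: the thresholds are hypothetical, and how you obtain `pass_rates` from a test run is left to you because no specific Netra API is assumed.

```python
def check_thresholds(pass_rates: dict[str, float], minimums: dict[str, float]) -> list[str]:
    """Return a message for every evaluator whose pass rate is below its minimum."""
    failures = []
    for name, minimum in minimums.items():
        rate = pass_rates.get(name)
        if rate is None:
            failures.append(f"{name}: no result found")
        elif rate < minimum:
            failures.append(f"{name}: {rate:.0%} is below the {minimum:.0%} minimum")
    return failures


def main() -> int:
    # Hypothetical thresholds and pass rates; replace with your own values.
    minimums = {"Tool Correctness": 0.9, "Escalation Accuracy": 0.9, "Workflow Completion": 0.8}
    pass_rates = {"Tool Correctness": 1.0, "Escalation Accuracy": 0.8, "Workflow Completion": 1.0}
    failures = check_thresholds(pass_rates, minimums)
    for message in failures:
        print(f"FAIL {message}")
    return 1 if failures else 0  # non-zero exit code fails the CI job


exit_code = main()
print(f"exit code: {exit_code}")
```

In a real pipeline you would pass the return value of `main()` to `sys.exit` so the job status reflects the result.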

***

## See Also

<CardGroup cols={2}>
  <Card title="Trace Your LangChain Agent" icon="robot" href="/Cookbooks/observability/tracing-langchain-agents">
    Set up comprehensive tracing for your agent before evaluating
  </Card>

  <Card title="Evaluation Overview" icon="gauge-high" href="/Evaluation/Evaluation-overview">
    Deep dive into Netra's evaluation framework: datasets, evaluators, and test runs
  </Card>

  <Card title="Simulating Customer Support" icon="headset" href="/Cookbooks/simulation/simulating-customer-support">
    Test your agent through multi-turn simulated conversations
  </Card>

  <Card title="A/B Testing Configurations" icon="flask" href="/Cookbooks/evaluation/ab-testing-configurations">
    Compare different pipeline configurations systematically
  </Card>
</CardGroup>
