
# Evaluation

> Python API reference for Netra evaluation. Run test suites, create and manage datasets, add items, and measure AI output quality programmatically.

The Netra SDK exposes an `evaluation` client that lets you:

* **Manage datasets** - Create datasets and add test items
* **Run test suites** - Execute tasks against datasets with automatic tracing
* **Apply evaluators** - Score outputs using built-in or custom evaluators
* **Fetch results** - Retrieve detailed run results programmatically

Each method below is documented with a usage example, its signature, its parameters, and the fields of its response.

## Getting Started

The `evaluation` client is available on the main `Netra` entry point after initialization.

```python theme={null}
from netra import Netra

Netra.init(app_name="sample-app")

# Access the evaluation client
Netra.evaluation.create_dataset(...)
Netra.evaluation.add_dataset_item(...)
Netra.evaluation.get_dataset(...)
Netra.evaluation.run_test_suite(...)
Netra.evaluation.get_run_results(...)
```

***

## create\_dataset

Create an empty dataset that can hold test items for evaluation runs.

<CodeGroup>
  ```python Usage theme={null}
  from netra import Netra
  from netra.evaluation import TurnType

  Netra.init(app_name="sample-app")

  result = Netra.evaluation.create_dataset(
      name="Customer Support QA",
      tags=["support", "qa", "v1"],
      turn_type=TurnType.SINGLE,  # or TurnType.MULTI for multi-turn
  )

  print(f"Dataset created: {result.id}")
  print(f"Name: {result.name}")
  print(f"Tags: {result.tags}")
  ```

  ```python Signature theme={null}
  create_dataset(
      name: str,
      tags: Optional[List[str]] = None,
      turn_type: TurnType = TurnType.SINGLE,
  ) -> CreateDatasetResponse | None
  ```
</CodeGroup>

### Parameters

| Parameter   | Type         | Description                                                                     |
| ----------- | ------------ | ------------------------------------------------------------------------------- |
| `name`      | `str`        | Name of the dataset (required)                                                  |
| `tags`      | `list[str]?` | Optional tags for categorization (default: `None`)                              |
| `turn_type` | `TurnType`   | `SINGLE` for single-turn or `MULTI` for multi-turn datasets (default: `SINGLE`) |

### Response: CreateDatasetResponse

| Field             | Type        | Description                          |
| ----------------- | ----------- | ------------------------------------ |
| `id`              | `str`       | Unique dataset identifier            |
| `name`            | `str`       | Dataset name                         |
| `tags`            | `list[str]` | Associated tags                      |
| `project_id`      | `str`       | Project identifier                   |
| `organization_id` | `str`       | Organization identifier              |
| `created_by`      | `str`       | Creator identifier                   |
| `updated_by`      | `str`       | Last updater identifier              |
| `created_at`      | `str`       | Creation timestamp                   |
| `updated_at`      | `str`       | Last update timestamp                |
| `deleted_at`      | `str?`      | Deletion timestamp (if soft-deleted) |

<AccordionGroup>
  <Accordion title="TurnType" icon="comments">
    | Value             | Description                                     |
    | ----------------- | ----------------------------------------------- |
    | `TurnType.SINGLE` | Single-turn evaluation (one input → one output) |
    | `TurnType.MULTI`  | Multi-turn evaluation (conversation sequences)  |
  </Accordion>
</AccordionGroup>
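
The signature above allows `create_dataset` to return `None`. The following is a minimal sketch of a guard for that case, assuming `None` signals a failed request (as this page documents for `get_run_results`). `require_response` is a hypothetical helper, not part of the Netra SDK.

```python
from typing import Optional, TypeVar

T = TypeVar("T")


def require_response(response: Optional[T], action: str) -> T:
    """Return `response` unchanged, or raise if the SDK call returned None.

    Hypothetical helper, not part of the Netra SDK.
    """
    if response is None:
        raise RuntimeError(f"Netra evaluation call failed: {action}")
    return response


# Intended usage (needs an initialized Netra SDK, so it is left commented out):
# dataset = require_response(
#     Netra.evaluation.create_dataset(name="Customer Support QA"),
#     "create_dataset",
# )
# print(dataset.id)
```

Failing fast here keeps a later `dataset.id` access from raising a less informative `AttributeError`.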

***

## add\_dataset\_item

Add a single test item to an existing dataset.

<CodeGroup>
  ```python Usage theme={null}
  from netra import Netra
  from netra.evaluation import DatasetItem

  Netra.init(app_name="sample-app")

  result = Netra.evaluation.add_dataset_item(
      dataset_id="dataset-123",
      item=DatasetItem(
          input="What is the return policy for electronics?",
          expected_output="Electronics can be returned within 30 days with original packaging.",
          tags=["policy", "returns"],
          metadata={"category": "electronics", "priority": "high"},
      ),
  )

  print(f"Item added: {result.id}")
  print(f"Input: {result.input}")
  ```

  ```python Signature theme={null}
  add_dataset_item(
      dataset_id: str,
      item: DatasetItem,
  ) -> AddDatasetItemResponse | None
  ```
</CodeGroup>

### Parameters

| Parameter    | Type          | Description              |
| ------------ | ------------- | ------------------------ |
| `dataset_id` | `str`         | ID of the target dataset |
| `item`       | `DatasetItem` | The test item to add     |

### DatasetItem

| Field             | Type         | Description                               |
| ----------------- | ------------ | ----------------------------------------- |
| `input`           | `Any`        | The input to pass to your task (required) |
| `expected_output` | `Any?`       | Expected output for comparison            |
| `tags`            | `list[str]?` | Optional tags for the item                |
| `metadata`        | `dict?`      | Optional metadata for evaluators          |

### Response: AddDatasetItemResponse

| Field             | Type        | Description                          |
| ----------------- | ----------- | ------------------------------------ |
| `id`              | `str`       | Unique item identifier               |
| `dataset_id`      | `str`       | Parent dataset ID                    |
| `project_id`      | `str`       | Project identifier                   |
| `organization_id` | `str`       | Organization identifier              |
| `source`          | `str`       | Source of the item                   |
| `source_id`       | `str?`      | Source reference ID                  |
| `input`           | `Any`       | The input value                      |
| `expected_output` | `Any`       | The expected output                  |
| `is_active`       | `bool`      | Whether the item is active           |
| `tags`            | `list[str]` | Associated tags                      |
| `metadata`        | `dict?`     | Item metadata                        |
| `created_by`      | `str`       | Creator identifier                   |
| `updated_by`      | `str`       | Last updater identifier              |
| `created_at`      | `str`       | Creation timestamp                   |
| `updated_at`      | `str`       | Last update timestamp                |
| `deleted_at`      | `str?`      | Deletion timestamp (if soft-deleted) |
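
When items come from an existing data source, it can help to map each raw record onto the `DatasetItem` fields before calling `add_dataset_item`. The sketch below assumes a hypothetical row shape (`question`, `answer`, `labels`, plus arbitrary extra columns); only the output keys (`input`, `expected_output`, `tags`, `metadata`) come from the table above.

```python
from typing import Any, Dict, List

# Hypothetical column names in the raw data source
KNOWN_KEYS = {"question", "answer", "labels"}


def to_item_kwargs(row: Dict[str, Any]) -> Dict[str, Any]:
    """Map one raw record onto the documented DatasetItem fields."""
    kwargs: Dict[str, Any] = {"input": row["question"]}
    if row.get("answer") is not None:
        kwargs["expected_output"] = row["answer"]
    if row.get("labels"):
        kwargs["tags"] = list(row["labels"])
    # Keep any remaining columns as metadata
    extra = {key: value for key, value in row.items() if key not in KNOWN_KEYS}
    if extra:
        kwargs["metadata"] = extra
    return kwargs


rows: List[Dict[str, Any]] = [
    {
        "question": "What is the return policy?",
        "answer": "30 days.",
        "labels": ["policy"],
        "category": "electronics",
    },
    {"question": "Do you ship abroad?"},
]
item_kwargs = [to_item_kwargs(row) for row in rows]
print(item_kwargs)

# With the SDK initialized, each entry could then be passed on:
# for kwargs in item_kwargs:
#     Netra.evaluation.add_dataset_item(
#         dataset_id="dataset-123", item=DatasetItem(**kwargs)
#     )
```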

***

## get\_dataset

Retrieve all items in a dataset by the dataset's ID.

<CodeGroup>
  ```python Usage theme={null}
  from netra import Netra

  Netra.init(app_name="sample-app")

  dataset = Netra.evaluation.get_dataset(dataset_id="dataset-123")

  print(f"Total items: {len(dataset.items)}")

  for item in dataset.items:
      print(f"ID: {item.id}")
      print(f"Input: {item.input}")
      print(f"Expected: {item.expected_output}")
      print("---")
  ```

  ```python Signature theme={null}
  get_dataset(
      dataset_id: str,
  ) -> GetDatasetItemsResponse | None
  ```
</CodeGroup>

### Parameters

| Parameter    | Type  | Description                   |
| ------------ | ----- | ----------------------------- |
| `dataset_id` | `str` | ID of the dataset to retrieve |

### Response: GetDatasetItemsResponse

| Field   | Type                  | Description           |
| ------- | --------------------- | --------------------- |
| `items` | `list[DatasetRecord]` | List of dataset items |

### DatasetRecord

| Field             | Type  | Description         |
| ----------------- | ----- | ------------------- |
| `id`              | `str` | Item identifier     |
| `dataset_id`      | `str` | Parent dataset ID   |
| `input`           | `Any` | The input value     |
| `expected_output` | `Any` | The expected output |

***

## run\_test\_suite

Execute a test suite against a dataset, running your task function on each item and optionally applying evaluators.

<CodeGroup>
  ```python Usage theme={null}
  from netra import Netra
  from openai import OpenAI

  Netra.init(app_name="sample-app")

  client = OpenAI()

  def my_task(input_data):
      """Task function that processes each dataset item."""
      response = client.chat.completions.create(
          model="gpt-4o-mini",
          messages=[
              {"role": "system", "content": "You are a helpful assistant."},
              {"role": "user", "content": input_data},
          ],
      )
      return response.choices[0].message.content

  # Get dataset
  dataset = Netra.evaluation.get_dataset(dataset_id="dataset-123")

  # Run test suite
  result = Netra.evaluation.run_test_suite(
      name="GPT-4o Mini Evaluation",
      data=dataset,
      task=my_task,
      evaluators=["correctness", "relevance"],  # Optional evaluator IDs
      max_concurrency=10,
  )

  print(f"Run ID: {result['runId']}")
  print(f"Items processed: {len(result['items'])}")
  ```

  ```python Signature theme={null}
  run_test_suite(
      name: str,
      data: Dataset,
      task: Callable[[Any], Any],
      evaluators: Optional[List[Any]] = None,
      max_concurrency: int = 50,
  ) -> Optional[Dict[str, Any]]
  ```
</CodeGroup>

### Parameters

| Parameter         | Type       | Description                                  |
| ----------------- | ---------- | -------------------------------------------- |
| `name`            | `str`      | Name for this test run (required)            |
| `data`            | `Dataset`  | Dataset from `get_dataset()`                 |
| `task`            | `Callable` | Function that takes input and returns output |
| `evaluators`      | `list?`    | Optional evaluator IDs or configs            |
| `max_concurrency` | `int`      | Max parallel task executions (default: 50)   |

### Response

| Field   | Type         | Description                     |
| ------- | ------------ | ------------------------------- |
| `runId` | `str`        | Unique run identifier           |
| `items` | `list[dict]` | Results for each processed item |

### Item Result

| Field           | Type  | Description                    |
| --------------- | ----- | ------------------------------ |
| `index`         | `int` | Item index in dataset          |
| `status`        | `str` | `"completed"` or `"failed"`    |
| `traceId`       | `str` | Trace ID for observability     |
| `spanId`        | `str` | Span ID for the task execution |
| `testRunItemId` | `str` | Backend item identifier        |

<Tip>
  The `task` function receives the `input` field of each dataset item and should return the output that evaluators compare against `expected_output`.
</Tip>
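
The item results can be summarized with plain dictionary access. The sketch below counts statuses and collects the trace IDs of failed items; it assumes the response has the shape documented above, and `summarize_items` and the sample values are hypothetical.

```python
from collections import Counter
from typing import Any, Dict, List, Optional


def summarize_items(result: Optional[Dict[str, Any]]) -> Dict[str, Any]:
    """Count item statuses and collect the trace IDs of failed items."""
    if result is None:  # run_test_suite is typed as Optional
        return {"counts": {}, "failed_trace_ids": []}
    items: List[Dict[str, Any]] = result.get("items", [])
    counts = Counter(item.get("status") for item in items)
    failed = [item.get("traceId") for item in items if item.get("status") == "failed"]
    return {"counts": dict(counts), "failed_trace_ids": failed}


# Illustrative response in the documented shape (values are made up)
sample_result = {
    "runId": "run-123",
    "items": [
        {"index": 0, "status": "completed", "traceId": "trace-a",
         "spanId": "span-a", "testRunItemId": "tri-a"},
        {"index": 1, "status": "failed", "traceId": "trace-b",
         "spanId": "span-b", "testRunItemId": "tri-b"},
    ],
}
summary = summarize_items(sample_result)
print(summary)
```

The collected trace IDs can then be used to look up the failing executions in your traces.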

***

## get\_run\_results

Fetch detailed results for a completed test run by its run ID. Use this after `run_test_suite` to retrieve evaluation scores, item-level outcomes, and other run metadata from the backend.

<CodeGroup>
  ```python Usage theme={null}
  from netra import Netra

  Netra.init(app_name="sample-app")

  # After running a test suite (`dataset` and `my_task` are defined
  # as in the run_test_suite example above)
  result = Netra.evaluation.run_test_suite(
      name="GPT-4o Mini Evaluation",
      data=dataset,
      task=my_task,
      evaluators=["correctness", "relevance"],
  )

  run_id = result["runId"]

  # Fetch the full run results
  run_results = Netra.evaluation.get_run_results(run_id=run_id)

  print(f"Run results: {run_results}")
  ```

  ```python Signature theme={null}
  get_run_results(
      run_id: str,
  ) -> Any | None
  ```
</CodeGroup>

### Parameters

| Parameter | Type  | Description                                                           |
| --------- | ----- | --------------------------------------------------------------------- |
| `run_id`  | `str` | The unique identifier of the test run to fetch results for (required) |

### Response

Returns the JSON response from the backend containing the full run results, including evaluation scores and item-level details. Returns `None` if the `run_id` is empty or the request fails.

The top-level response wraps the run data:

| Field     | Type        | Description                                              |
| --------- | ----------- | -------------------------------------------------------- |
| `success` | `bool`      | Whether the request succeeded                            |
| `data`    | `RunResult` | The run result object (see below)                        |
| `error`   | `Any?`      | Error details, `null` on success                         |
| `meta`    | `dict`      | Request metadata (timestamp, path, version, status code) |

### RunResult

| Field              | Type             | Description                                   |
| ------------------ | ---------------- | --------------------------------------------- |
| `id`               | `str`            | Unique run identifier                         |
| `name`             | `str`            | Name of the test run                          |
| `projectId`        | `str`            | Project identifier                            |
| `organizationId`   | `str`            | Organization identifier                       |
| `status`           | `str`            | Run status                                    |
| `evaluationStatus` | `str`            | Evaluation status                             |
| `time`             | `str`            | Timestamp when the run was initiated          |
| `turnType`         | `str`            | `"single"` or `"multi"`                       |
| `createdAt`        | `str`            | Creation timestamp                            |
| `updatedAt`        | `str`            | Last update timestamp                         |
| `deletedAt`        | `str?`           | Deletion timestamp (if soft-deleted)          |
| `runContext`       | `Any?`           | Optional run context                          |
| `testRunSummary`   | `TestRunSummary` | Aggregated summary of the run                 |
| `redirectUrl`      | `str`            | Direct link to the run in the Netra dashboard |

### TestRunSummary

| Field              | Type    | Description                              |
| ------------------ | ------- | ---------------------------------------- |
| `totalItems`       | `int`   | Total number of items in the run         |
| `passedItems`      | `int`   | Number of items that passed              |
| `failedItems`      | `int`   | Number of items that failed              |
| `durationMs`       | `float` | Total run duration in milliseconds       |
| `totalCostUsd`     | `float` | Total cost of the run in USD             |
| `averageLatencyMs` | `float` | Average latency per item in milliseconds |

<Accordion title="Example Response" icon="code">
  ```json theme={null}
  {
      "success": true,
      "data": {
          "id": "4dd93b23-a769-401e-96c3-db42f408b65b",
          "name": "Image Fidelity Test",
          "projectId": "b0e6f0f3-b3fb-4d73-aea4-a75fbb2d72f3",
          "organizationId": "b4a00d6b-52ff-4f2b-be84-db8bd2d5f657",
          "status": "completed",
          "evaluationStatus": "completed",
          "time": "2026-04-02T10:45:36.751Z",
          "turnType": "single",
          "createdAt": "2026-04-02T10:45:36.761Z",
          "updatedAt": "2026-04-02T10:47:48.271Z",
          "deletedAt": null,
          "runContext": null,
          "testRunSummary": {
              "totalItems": 1,
              "passedItems": 1,
              "failedItems": 0,
              "durationMs": 7978,
              "totalCostUsd": 0.024752200000000002,
              "averageLatencyMs": 136562
          },
          "redirectUrl": "https://demo.getnetra.ai/test-runs/4dd93b23-a769-401e-96c3-db42f408b65b?currentOrg=b4a00d6b-52ff-4f2b-be84-db8bd2d5f657&currentProject=b0e6f0f3-b3fb-4d73-aea4-a75fbb2d72f3"
      },
      "error": null,
      "meta": {
          "timestamp": "2026-04-20T05:51:22.075Z",
          "path": "/evaluations/run/4dd93b23-a769-401e-96c3-db42f408b65b",
          "version": "1.0.0",
          "statusCode": 200
      }
  }
  ```
</Accordion>

<Tip>
  Pair `get_run_results` with `run_test_suite` to programmatically inspect evaluation outcomes. The `run_id` is available in the `runId` field of the `run_test_suite` response. Use `redirectUrl` from the response to jump directly to the run in the Netra dashboard.
</Tip>
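
For CI assertions, the `testRunSummary` fields are enough to compute a pass rate. This is a minimal sketch, assuming `get_run_results` returns the parsed JSON as a `dict` in the shape documented above; `pass_rate`, `MIN_PASS_RATE`, and the sample values are hypothetical.

```python
from typing import Any, Dict, Optional


def pass_rate(run_results: Optional[Dict[str, Any]]) -> float:
    """Return passedItems / totalItems from a get_run_results response.

    Raises ValueError when the response is missing, unsuccessful, or has no items.
    """
    if not run_results or not run_results.get("success"):
        raise ValueError("run results unavailable")
    summary = run_results["data"]["testRunSummary"]
    total = summary["totalItems"]
    if total == 0:
        raise ValueError("run contains no items")
    return summary["passedItems"] / total


# Illustrative response in the documented shape (values are made up)
sample = {
    "success": True,
    "data": {
        "status": "completed",
        "evaluationStatus": "completed",
        "testRunSummary": {"totalItems": 4, "passedItems": 3, "failedItems": 1},
    },
    "error": None,
}

MIN_PASS_RATE = 0.7  # hypothetical threshold for a CI gate
rate = pass_rate(sample)
print(f"Pass rate: {rate:.2f} (gate passes: {rate >= MIN_PASS_RATE})")
```

In a pipeline, a failing gate would typically exit with a non-zero status so the build is marked as failed.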

***

## When to Use Which API

<CardGroup cols={2}>
  <Card title="Dataset Management" icon="database">
    **`create_dataset` / `add_dataset_item` / `get_dataset`**

    Build and manage test datasets programmatically. Use for CI/CD pipelines or when generating test cases from production data.
  </Card>

  <Card title="Test Execution" icon="flask-vial">
    **`run_test_suite`**

    Execute your AI task against a dataset with automatic tracing and evaluation. Use for regression testing and model comparisons.
  </Card>

  <Card title="Results Retrieval" icon="chart-simple">
    **`get_run_results`**

    Fetch detailed results for a completed run, including evaluation scores and item-level outcomes. Use for post-run analysis and CI/CD assertions.
  </Card>

  <Card title="Advanced Workflows" icon="gears">
    **`create_run`**

    Create runs without immediate execution. Use when you need custom orchestration or want to manage run lifecycle separately.
  </Card>

  <Card title="Evaluators" icon="clipboard-check">
    **Evaluator IDs or Configs**

    Pass evaluator IDs to `run_test_suite` to automatically score outputs. Configure custom evaluators in the Netra dashboard.
  </Card>
</CardGroup>

***

## Complete Example

```python theme={null}
from netra import Netra
from netra.evaluation import DatasetItem, TurnType
from openai import OpenAI

# Initialize
Netra.init(
    app_name="evaluation-demo",
    headers="x-api-key=your-api-key",
)
client = OpenAI()

# 1. Create a dataset
dataset_response = Netra.evaluation.create_dataset(
    name="Product FAQ Evaluation",
    tags=["faq", "products", "v2"],
    turn_type=TurnType.SINGLE,
)
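if dataset_response is None:  # create_dataset is typed to allow None
    raise RuntimeError("Dataset creation failed")
dataset_id = dataset_response.id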
print(f"Created dataset: {dataset_id}")

# 2. Add test items
test_cases = [
    {
        "input": "What is your return policy?",
        "expected_output": "Items can be returned within 30 days.",
    },
    {
        "input": "How long does shipping take?",
        "expected_output": "Standard shipping takes 3-5 business days.",
    },
    {
        "input": "Do you offer international shipping?",
        "expected_output": "Yes, we ship to over 50 countries.",
    },
]

for case in test_cases:
    Netra.evaluation.add_dataset_item(
        dataset_id=dataset_id,
        item=DatasetItem(
            input=case["input"],
            expected_output=case["expected_output"],
        ),
    )
print(f"Added {len(test_cases)} test items")

# 3. Define the task
def faq_agent(query: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a customer support agent. Answer concisely."},
            {"role": "user", "content": query},
        ],
    )
    return response.choices[0].message.content

# 4. Run the test suite
dataset = Netra.evaluation.get_dataset(dataset_id=dataset_id)

result = Netra.evaluation.run_test_suite(
    name="FAQ Agent v2 Evaluation",
    data=dataset,
    task=faq_agent,
    evaluators=["correctness", "relevance"],
    max_concurrency=5,
)

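# 5. Review results
if result is None:  # run_test_suite is typed as Optional
    raise RuntimeError("Test run did not return a result")
print(f"\nRun completed: {result['runId']}")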
for item in result["items"]:
    print(f"  Item {item['index']}: {item['status']} (trace: {item['traceId']})")

# 6. Fetch detailed run results
run_results = Netra.evaluation.get_run_results(run_id=result["runId"])
print(f"\nDetailed run results: {run_results}")

print("\nView detailed results in Netra dashboard → Evaluation → Test Runs")
```

## Next Steps

* [Dashboard Query](/sdk-reference/dashboard-query/python) - Query dashboard metrics
* [Usage Utilities](/usage/usage-utilities) - Query traces and spans
* [Evaluators](/Evaluation/Evaluators) - Configure custom evaluators
* [Test Runs](/Evaluation/TestRuns) - View and analyze test run results