# Evaluation

> TypeScript API reference for Netra evaluation. Run test suites, create and manage datasets, add items, and measure AI output quality programmatically.

The Netra SDK exposes an `evaluation` client that lets you:

* **Manage datasets** - Create datasets and add test items
* **Run test suites** - Execute tasks against datasets with automatic tracing
* **Apply evaluators** - Score outputs using built-in or custom evaluators

This page documents each `client.evaluation` method with a usage example, its signature, its parameters, and the shape of its response.

## Getting Started

The `evaluation` client is available on the main `Netra` entry point after initialization.

```typescript theme={null}
import { Netra } from "netra-sdk-js";

const client = new Netra({
  apiKey: "your-api-key",
});

// Methods on the evaluation client (each returns a Promise; arguments elided):
//   client.evaluation.createDataset(...)
//   client.evaluation.addDatasetItem(...)
//   client.evaluation.getDataset(...)
//   client.evaluation.runTestSuite(...)
```
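
The snippet above hardcodes the API key for brevity. As a minimal sketch, the key could instead be read from the environment. The helper `readApiKey` and the variable name `NETRA_API_KEY` are illustrative assumptions, not part of the SDK.

```typescript
// Hypothetical helper: not part of netra-sdk-js.
// Reads a required key from an environment-like object and fails fast if it is missing.
function readApiKey(
  env: Record<string, string | undefined>,
  name: string = "NETRA_API_KEY" // assumed variable name
): string {
  const value = env[name]?.trim();
  if (!value) {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

// Usage (in Node.js):
//   const client = new Netra({ apiKey: readApiKey(process.env) });
```

Failing at startup keeps a missing key from surfacing later as a confusing request error.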

***

## createDataset

Create an empty dataset that can hold test items for evaluation runs.

<CodeGroup>
  ```typescript Usage theme={null}
  import { Netra } from "netra-sdk-js";

  const client = new Netra({ apiKey: "..." });

  const result = await client.evaluation.createDataset(
    "Customer Support QA",           // name
    ["support", "qa", "v1"]          // tags (optional)
  );

  if (result) {
    console.log(`Dataset created: ${result.id}`);
    console.log(`Name: ${result.name}`);
    console.log(`Tags: ${result.tags}`);
  }
  ```

  ```typescript Signature theme={null}
  createDataset(
    name: string,
    tags?: string[]
  ): Promise<CreateDatasetResponse | null>
  ```
</CodeGroup>

### Parameters

| Parameter | Type        | Description                      |
| --------- | ----------- | -------------------------------- |
| `name`    | `string`    | Name of the dataset (required)   |
| `tags`    | `string[]?` | Optional tags for categorization |

### Response: CreateDatasetResponse

| Field            | Type             | Description                          |
| ---------------- | ---------------- | ------------------------------------ |
| `id`             | `string`         | Unique dataset identifier            |
| `name`           | `string`         | Dataset name                         |
| `tags`           | `string[]`       | Associated tags                      |
| `projectId`      | `string`         | Project identifier                   |
| `organizationId` | `string`         | Organization identifier              |
| `createdBy`      | `string`         | Creator identifier                   |
| `updatedBy`      | `string`         | Last updater identifier              |
| `createdAt`      | `string`         | Creation timestamp                   |
| `updatedAt`      | `string`         | Last update timestamp                |
| `deletedAt`      | `string \| null` | Deletion timestamp (if soft-deleted) |
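
`createDataset` resolves to `CreateDatasetResponse | null`, so callers need to handle `null` before reading `id`. The sketch below shows one way to turn a `null` result into an exception. `orThrow` is a hypothetical helper, and treating `null` as a failure is an assumption based on how the Complete Example on this page handles it.

```typescript
// Hypothetical helper: not part of netra-sdk-js.
// Narrows `T | null | undefined` to `T`, throwing a descriptive error otherwise.
function orThrow<T>(value: T | null | undefined, message: string): T {
  if (value === null || value === undefined) {
    throw new Error(message);
  }
  return value;
}

// Usage with the evaluation client (assumes `client` from the example above):
//   const dataset = orThrow(
//     await client.evaluation.createDataset("Customer Support QA", ["support"]),
//     "createDataset returned null"
//   );
//   console.log(dataset.id); // no further null check needed
```

The same helper works for the other methods on this page, since they all resolve to a value or `null`.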

***

## addDatasetItem

Add a single test item to an existing dataset.

<CodeGroup>
  ```typescript Usage theme={null}
  import { Netra } from "netra-sdk-js";

  const client = new Netra({ apiKey: "..." });

  const result = await client.evaluation.addDatasetItem(
    "dataset-123",  // datasetId
    {               // item
      input: "What is the return policy for electronics?",
      expectedOutput: "Electronics can be returned within 30 days with original packaging.",
      tags: ["policy", "returns"],
      metadata: { category: "electronics", priority: "high" },
    }
  );

  if (result) {
    console.log(`Item added: ${result.id}`);
    console.log(`Input: ${result.input}`);
  }
  ```

  ```typescript Signature theme={null}
  addDatasetItem(
    datasetId: string,
    item: DatasetEntry
  ): Promise<AddDatasetItemResponse | null>

  // DatasetEntry interface
  interface DatasetEntry {
    input: any;
    expectedOutput?: any;
    tags?: string[];
    metadata?: Record<string, any>;
  }
  ```
</CodeGroup>

### Parameters

| Parameter   | Type           | Description              |
| ----------- | -------------- | ------------------------ |
| `datasetId` | `string`       | ID of the target dataset |
| `item`      | `DatasetEntry` | The test item to add     |

### DatasetEntry

| Field            | Type                   | Description                               |
| ---------------- | ---------------------- | ----------------------------------------- |
| `input`          | `any`                  | The input to pass to your task (required) |
| `expectedOutput` | `any?`                 | Expected output for comparison            |
| `tags`           | `string[]?`            | Optional tags for the item                |
| `metadata`       | `Record<string, any>?` | Optional metadata for evaluators          |

### Response: AddDatasetItemResponse

| Field            | Type                   | Description                          |
| ---------------- | ---------------------- | ------------------------------------ |
| `id`             | `string`               | Unique item identifier               |
| `datasetId`      | `string`               | Parent dataset ID                    |
| `projectId`      | `string`               | Project identifier                   |
| `organizationId` | `string`               | Organization identifier              |
| `source`         | `string`               | Source of the item                   |
| `sourceId`       | `string?`              | Source reference ID                  |
| `input`          | `any`                  | The input value                      |
| `expectedOutput` | `any`                  | The expected output                  |
| `isActive`       | `boolean`              | Whether the item is active           |
| `tags`           | `string[]`             | Associated tags                      |
| `metadata`       | `Record<string, any>?` | Item metadata                        |
| `createdBy`      | `string`               | Creator identifier                   |
| `updatedBy`      | `string`               | Last updater identifier              |
| `createdAt`      | `string`               | Creation timestamp                   |
| `updatedAt`      | `string`               | Last update timestamp                |
| `deletedAt`      | `string?`              | Deletion timestamp (if soft-deleted) |
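
`addDatasetItem` adds one item per call, so loading a batch means looping. The sketch below adds entries sequentially and records which ones came back as `null`. `addAllItems` and `DatasetItemAdder` are hypothetical names; the `DatasetEntry` shape is copied from the signature above so the block is self-contained.

```typescript
// Shape copied from the DatasetEntry signature on this page.
interface DatasetEntry {
  input: any;
  expectedOutput?: any;
  tags?: string[];
  metadata?: Record<string, any>;
}

// Structural type covering only the method this helper needs,
// so it accepts `client.evaluation` or a test double.
interface DatasetItemAdder {
  addDatasetItem(
    datasetId: string,
    item: DatasetEntry
  ): Promise<{ id: string } | null>;
}

// Hypothetical helper: adds entries one at a time and reports which ones failed.
async function addAllItems(
  evaluation: DatasetItemAdder,
  datasetId: string,
  entries: DatasetEntry[]
): Promise<{ addedIds: string[]; failedIndexes: number[] }> {
  const addedIds: string[] = [];
  const failedIndexes: number[] = [];
  let index = 0;
  for (const entry of entries) {
    const result = await evaluation.addDatasetItem(datasetId, entry);
    if (result) {
      addedIds.push(result.id);
    } else {
      // Assumption: a null result means the item was not added.
      failedIndexes.push(index);
    }
    index++;
  }
  return { addedIds, failedIndexes };
}

// Usage:
//   const { addedIds, failedIndexes } =
//     await addAllItems(client.evaluation, "dataset-123", entries);
```

Sequential calls keep the order predictable. Errors thrown by `addDatasetItem` are not caught here and propagate to the caller.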

***

## getDataset

Retrieve a dataset and all its items by ID.

<CodeGroup>
  ```typescript Usage theme={null}
  import { Netra } from "netra-sdk-js";

  const client = new Netra({ apiKey: "..." });

  const dataset = await client.evaluation.getDataset("dataset-123");

  if (dataset) {
    console.log(`Total items: ${dataset.items.length}`);

    for (const item of dataset.items) {
      console.log(`ID: ${item.id}`);
      console.log(`Input: ${item.input}`);
      console.log(`Expected: ${item.expectedOutput}`);
      console.log("---");
    }
  }
  ```

  ```typescript Signature theme={null}
  getDataset(
    datasetId: string
  ): Promise<GetDatasetItemsResponse | null>
  ```
</CodeGroup>

### Parameters

| Parameter   | Type     | Description                   |
| ----------- | -------- | ----------------------------- |
| `datasetId` | `string` | ID of the dataset to retrieve |

### Response: GetDatasetItemsResponse

| Field   | Type              | Description           |
| ------- | ----------------- | --------------------- |
| `items` | `DatasetRecord[]` | List of dataset items |

### DatasetRecord

| Field            | Type     | Description         |
| ---------------- | -------- | ------------------- |
| `id`             | `string` | Item identifier     |
| `datasetId`      | `string` | Parent dataset ID   |
| `input`          | `any`    | The input value     |
| `expectedOutput` | `any`    | The expected output |
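
Because `input` and `expectedOutput` are typed `any`, interpolating them into a template string, as the usage example does, prints `[object Object]` for structured values. A minimal sketch of a logging helper follows; `formatValue` is a hypothetical name.

```typescript
// Hypothetical helper: renders an `any`-typed dataset field for logging.
// Strings pass through unchanged; everything else is JSON-encoded.
function formatValue(value: unknown): string {
  if (typeof value === "string") {
    return value;
  }
  try {
    // JSON.stringify yields undefined for undefined and functions,
    // so fall back to String() in that case.
    return JSON.stringify(value) ?? String(value);
  } catch (_e) {
    // Circular structures and BigInt values make JSON.stringify throw.
    return String(value);
  }
}

// Usage:
//   console.log(`Input: ${formatValue(item.input)}`);
```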

***

## runTestSuite

Execute a test suite against a dataset, running your task function on each item and optionally applying evaluators.

<CodeGroup>
  ```typescript Usage theme={null}
  import { Netra } from "netra-sdk-js";
  import OpenAI from "openai";

  const client = new Netra({ apiKey: "..." });
  const openai = new OpenAI();

  // Task function that processes each dataset item
  async function myTask(inputData: any): Promise<string> {
    const response = await openai.chat.completions.create({
      model: "gpt-4o-mini",
      messages: [
        { role: "system", content: "You are a helpful assistant." },
        { role: "user", content: inputData },
      ],
    });
    return response.choices[0].message.content || "";
  }

  // Get dataset
  const dataset = await client.evaluation.getDataset("dataset-123");

  if (dataset) {
    // Run test suite
    const result = await client.evaluation.runTestSuite(
      "GPT-4o Mini Evaluation",      // name
      dataset,                       // data
      myTask,                        // task
      ["correctness", "relevance"],  // evaluators (optional)
      10                             // maxConcurrency (optional, default: 50)
    );

    if (result) {
      console.log(`Run ID: ${result.runId}`);
      console.log(`Items processed: ${result.items.length}`);
    }
  }
  ```

  ```typescript Signature theme={null}
  runTestSuite(
    name: string,
    data: Dataset,
    task: TaskFunction,
    evaluators?: any[],
    maxConcurrency?: number
  ): Promise<Record<string, any> | null>

  // TaskFunction type
  type TaskFunction = (input: any) => any | Promise<any>
  ```
</CodeGroup>

### Parameters

| Parameter        | Type           | Description                                  |
| ---------------- | -------------- | -------------------------------------------- |
| `name`           | `string`       | Name for this test run (required)            |
| `data`           | `Dataset`      | Dataset from `getDataset()`                  |
| `task`           | `TaskFunction` | Function that takes input and returns output |
| `evaluators`     | `any[]?`       | Optional evaluator IDs or configs            |
| `maxConcurrency` | `number?`      | Max parallel task executions (default: 50)   |

### Response

| Field   | Type       | Description                     |
| ------- | ---------- | ------------------------------- |
| `runId` | `string`   | Unique run identifier           |
| `items` | `object[]` | Results for each processed item |

### Item Result

| Field           | Type     | Description                    |
| --------------- | -------- | ------------------------------ |
| `index`         | `number` | Item index in dataset          |
| `status`        | `string` | `"completed"` or `"failed"`    |
| `traceId`       | `string` | Trace ID for observability     |
| `spanId`        | `string` | Span ID for the task execution |
| `testRunItemId` | `string` | Backend item identifier        |
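
Since `runTestSuite` resolves to an untyped `Record<string, any>`, a small typed summary can make the per-item results easier to act on. The sketch below tallies statuses using the fields from the Item Result table; `ItemResult` and `summarizeRun` are hypothetical names.

```typescript
// Subset of the fields listed in the Item Result table.
interface ItemResult {
  index: number;
  status: string; // "completed" or "failed"
  traceId: string;
}

// Hypothetical helper: tallies item statuses and collects trace IDs of failures.
function summarizeRun(items: ItemResult[]): {
  completed: number;
  failed: number;
  failedTraceIds: string[];
} {
  const failedItems = items.filter((item) => item.status === "failed");
  return {
    completed: items.filter((item) => item.status === "completed").length,
    failed: failedItems.length,
    failedTraceIds: failedItems.map((item) => item.traceId),
  };
}

// Usage after runTestSuite:
//   const summary = summarizeRun(result.items);
//   if (summary.failed > 0) process.exitCode = 1; // for example, fail a CI job
```

The collected trace IDs can be used to look up the failing executions in the dashboard.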

<Tip>
  The `task` function receives the `input` field from each dataset item. Return the output that should be compared against `expectedOutput` by evaluators.
</Tip>
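
The usage example passes `input` straight to the model, which assumes every dataset item stores a string. When items store structured inputs, the task has to validate the shape itself, because `input` is typed `any`. The sketch below assumes a `{ question, context? }` input shape; `QaInput`, `isQaInput`, `makeQaTask`, and the injected `answer` function are all hypothetical.

```typescript
// TaskFunction type as given in the signature on this page.
type TaskFunction = (input: any) => any | Promise<any>;

// Assumed input shape for this example; your dataset items may differ.
interface QaInput {
  question: string;
  context?: string;
}

// Runtime type guard for the assumed shape.
function isQaInput(value: unknown): value is QaInput {
  if (typeof value !== "object" || value === null) {
    return false;
  }
  const candidate = value as Record<string, unknown>;
  return (
    typeof candidate["question"] === "string" &&
    (candidate["context"] === undefined ||
      typeof candidate["context"] === "string")
  );
}

// Builds a task around any prompt-to-answer function (for example, a model call).
function makeQaTask(answer: (prompt: string) => Promise<string>): TaskFunction {
  return async (input: any) => {
    if (!isQaInput(input)) {
      throw new Error("Dataset item input does not match { question, context? }");
    }
    const prompt = input.context
      ? `${input.context}\n\n${input.question}`
      : input.question;
    return answer(prompt);
  };
}

// Usage:
//   const task = makeQaTask(async (prompt) => callYourModel(prompt));
//   await client.evaluation.runTestSuite("QA run", dataset, task);
```

Injecting `answer` keeps the task testable without a live model. How `runTestSuite` reports a task that throws is not specified on this page, so verify that behavior before relying on it.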

***

## When to Use Which API

<CardGroup cols={2}>
  <Card title="Dataset Management" icon="database">
    **`createDataset` / `addDatasetItem` / `getDataset`**

    Build and manage test datasets programmatically. Use for CI/CD pipelines or when generating test cases from production data.
  </Card>

  <Card title="Test Execution" icon="flask-vial">
    **`runTestSuite`**

    Execute your AI task against a dataset with automatic tracing and evaluation. Use for regression testing and model comparisons.
  </Card>

  <Card title="Advanced Workflows" icon="gears">
    **`createRun`**

    Create runs without immediate execution. Use when you need custom orchestration or want to manage run lifecycle separately.
  </Card>

  <Card title="Evaluators" icon="clipboard-check">
    **Evaluator IDs or Configs**

    Pass evaluator IDs to `runTestSuite` to automatically score outputs. Configure custom evaluators in the Netra dashboard.
  </Card>
</CardGroup>

***

## Complete Example

```typescript theme={null}
import { Netra } from "netra-sdk-js";
import OpenAI from "openai";

async function main() {
  // Initialize
  const client = new Netra({
    apiKey: "your-api-key",
  });
  const openai = new OpenAI();

  // 1. Create a dataset
  const datasetResponse = await client.evaluation.createDataset(
    "Product FAQ Evaluation",
    ["faq", "products", "v2"]
  );

  if (!datasetResponse) {
    console.error("Failed to create dataset");
    return;
  }

  const datasetId = datasetResponse.id;
  console.log(`Created dataset: ${datasetId}`);

  // 2. Add test items
  const testCases = [
    {
      input: "What is your return policy?",
      expectedOutput: "Items can be returned within 30 days.",
    },
    {
      input: "How long does shipping take?",
      expectedOutput: "Standard shipping takes 3-5 business days.",
    },
    {
      input: "Do you offer international shipping?",
      expectedOutput: "Yes, we ship to over 50 countries.",
    },
  ];

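  let addedCount = 0;
  for (const testCase of testCases) {
    const item = await client.evaluation.addDatasetItem(datasetId, {
      input: testCase.input,
      expectedOutput: testCase.expectedOutput,
    });
    // addDatasetItem can resolve to null, so count only confirmed items
    if (item) addedCount++;
  }
  console.log(`Added ${addedCount} of ${testCases.length} test items`);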

  // 3. Define the task
  async function faqAgent(query: string): Promise<string> {
    const response = await openai.chat.completions.create({
      model: "gpt-4o-mini",
      messages: [
        {
          role: "system",
          content: "You are a customer support agent. Answer concisely.",
        },
        { role: "user", content: query },
      ],
    });
    return response.choices[0].message.content || "";
  }

  // 4. Run the test suite
  const dataset = await client.evaluation.getDataset(datasetId);

  if (dataset) {
    const result = await client.evaluation.runTestSuite(
      "FAQ Agent v2 Evaluation",
      dataset,
      faqAgent,
      ["correctness", "relevance"],
      5
    );

    // 5. Review results
    if (result) {
      console.log(`\nRun completed: ${result.runId}`);
      for (const item of result.items) {
        console.log(
          `  Item ${item.index}: ${item.status} (trace: ${item.traceId})`
        );
      }

      console.log(
        "\nView detailed results in the Netra dashboard → Evaluation → Test Runs"
      );
    }
  }
}

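// Surface unexpected errors instead of leaving the promise rejection unhandled
main().catch(console.error);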
```

## Next Steps

* [Dashboard Query](/sdk-reference/dashboard-query/typescript) - Query dashboard metrics
* [Usage Utilities](/usage/usage-utilities) - Query traces and spans
* [Evaluators](/Evaluation/Evaluators) - Configure custom evaluators
* [Test Runs](/Evaluation/TestRuns) - View and analyze test run results
