Why Datasets Matter
Datasets transform ad-hoc testing into systematic quality assurance:

| Benefit | Description |
|---|---|
| Reproducibility | Run the same tests across model updates, prompt changes, and code releases |
| Real-World Coverage | Convert production traces into test cases that reflect actual user behavior |
| Regression Detection | Compare results over time to catch quality degradation early |
| Objective Benchmarking | Measure performance against defined criteria, not gut feeling |
Dataset Dashboard
Navigate to Evaluation → Datasets from the left navigation panel to access your datasets.
| Column | Description |
|---|---|
| Dataset Name | Unique identifier for the test suite |
| Tags | Metadata labels for filtering and organization |
| Created At | Timestamp for version tracking |
| Actions | Quick access to edit or delete datasets |
Creating a Dataset
There are two ways to create a dataset:

- From Traces (Recommended): Convert real production interactions into test cases
- Manual Creation: Build test suites from scratch in the dashboard
Creating Dataset from Traces
The fastest way to build meaningful test cases is to capture real interactions from your production system. This ensures your evaluations reflect actual user behavior.

Find a Trace
Navigate to Observability → Traces and locate an interaction you want to use as a test case.
Add to Dataset
Click the Add to Dataset button on the trace. Choose to create a new dataset or add to an existing one.
Configure Test Case
In the creation form:
- Enter a dataset name (e.g., “Customer Support QA”)
- Add optional tags for organization
- Review and edit the input prompt
- Provide the expected output
- Include any relevant metadata from the trace
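
For reference, a single test case assembled from a trace might look something like the sketch below. The field names are illustrative assumptions, not the platform's exact schema.

```python
# Illustrative shape of a test case captured from a trace.
# Field names here are assumptions, not the platform's exact schema.
test_case = {
    "dataset": "Customer Support QA",      # dataset name entered in the form
    "tags": ["support", "refunds"],        # optional tags for organization
    "input": "How do I request a refund for order #1042?",
    "expected_output": "Explain the 30-day refund policy and link to the refund form.",
    "metadata": {                          # metadata carried over from the trace
        "model": "gpt-4o",
        "latency_ms": 1240,
        "trace_id": "tr_9f3b2c",
    },
}
```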
Select Evaluators
Click Next and choose evaluators to score this test case:
- Browse the evaluator library
- Or select from your saved evaluators in My Evaluators
Map Variables
Configure how evaluator variables connect to your data:
| Source | Use Case |
|---|---|
| Dataset field | Use values defined in your test case (input, expected output) |
| Agent response | Use the actual LLM output at evaluation time |
| Execution data | Use metadata from the trace (latency, tokens, model) |
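
As an illustration, a mapping for an evaluator that compares the agent's answer against the expected output could be expressed like this. The source names mirror the table above, but the configuration keys are assumptions rather than the platform's exact format.

```python
# Illustrative variable mapping for one evaluator.
# Keys and source identifiers are assumptions, not the real configuration schema.
variable_mapping = {
    "input":           {"source": "dataset_field",  "field": "input"},
    "expected_output": {"source": "dataset_field",  "field": "expected_output"},
    "actual_output":   {"source": "agent_response"},                   # resolved at evaluation time
    "latency_ms":      {"source": "execution_data", "field": "latency"},  # trace metadata
}
```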
Creating Dataset from Dashboard
For comprehensive test coverage, create datasets manually with carefully crafted test cases.

Configure Dataset

| Field | Description |
|---|---|
| Name | A descriptive identifier for your test suite |
| Tags | Labels for filtering (e.g., “production”, “edge-cases”, “v2-prompts”) |
| Type | Single Turn for request/response pairs |
| Data Source | Add manually to create items one by one |
Scenario (multi-turn conversations), Import from traces, and Import from CSV are coming soon.
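
Conceptually, a manually built single-turn dataset is a named, tagged collection of input/expected-output items. The sketch below is illustrative only; the field names and the "single_turn" type value are assumptions, not the exact schema.

```python
# Rough sketch of a manually created single-turn dataset (illustrative structure).
dataset = {
    "name": "Checkout Flow - Edge Cases",
    "tags": ["edge-cases", "v2-prompts"],
    "type": "single_turn",
    "items": [
        {
            "input": "Apply a 120% discount code to my cart.",
            "expected_output": "Politely explain that discount codes cannot exceed 100%.",
        },
        {
            "input": "asdfgh",  # garbage input to test graceful handling
            "expected_output": "Ask the user to clarify what they need help with.",
        },
    ],
}
```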
Select Evaluators
Choose the evaluators that will score your test cases, either from the evaluator library or from your saved evaluators in My Evaluators.
Running an Evaluation
Once your dataset is configured with evaluators, you can run evaluations:

Trigger Evaluation
Use the Dataset ID in your evaluation pipeline code. The evaluation runs automatically when traces are created.
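
The snippet below is a minimal sketch of this wiring. `EvalClient` and its `trace` method are placeholders standing in for whatever SDK or HTTP API your platform actually provides; only the pattern of attaching the Dataset ID to traced interactions is the point.

```python
import os


class EvalClient:
    """Placeholder for the platform's real SDK client (assumption, not a real API)."""

    def __init__(self, api_key: str):
        self.api_key = api_key

    def trace(self, *, dataset_id: str, input_text: str, output_text: str) -> None:
        # A real client would send this interaction to the platform, which then
        # scores it against the dataset's configured evaluators automatically.
        print(f"trace -> dataset {dataset_id}: {input_text!r} => {output_text!r}")


DATASET_ID = "ds_abc123"  # copy the real ID from the dataset dashboard (illustrative value)

client = EvalClient(api_key=os.environ.get("EVAL_API_KEY", "demo-key"))

# Each traced interaction tagged with the dataset is evaluated when the trace is created.
client.trace(
    dataset_id=DATASET_ID,
    input_text="How do I reset my password?",
    output_text="Go to Settings -> Security -> Reset password.",
)
```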
View Results
Monitor progress and results in Test Runs.
Best Practices
Organizing Datasets
- Use descriptive names: “Customer Support - Refund Requests” is better than “Dataset 1”
- Tag consistently: Create a tagging convention (e.g., by feature, model version, or test type)
- Version your datasets: Include version numbers in tags when testing prompt iterations
Building Effective Test Cases
- Cover edge cases: Include unusual inputs, long prompts, and potential failure scenarios
- Balance quantity and quality: A smaller dataset of high-quality test cases beats a large dataset of weak ones
- Include negative tests: Add cases where the expected behavior is to refuse or ask for clarification
Maintaining Datasets
- Update regularly: Add new test cases from production traces as you discover new patterns
- Remove outdated cases: Delete test cases that no longer reflect current requirements
- Review failed cases: Investigate failures to determine if the AI is wrong or the expected output needs updating
Related
- Evaluation Overview - Understand the full evaluation framework
- Evaluators - Configure scoring logic for your datasets
- Test Runs - Analyze evaluation results
- Traces - Source data for creating datasets from production

