Skip to main content
Datasets are the foundation of your evaluation pipeline. They define what you’re testing—the inputs, expected outputs, and metadata that Evaluators use to score your AI system’s performance. Build them from real production traces or create them manually for comprehensive coverage.

Why Datasets Matter

Datasets transform ad-hoc testing into systematic quality assurance:
BenefitDescription
ReproducibilityRun the same tests across model updates, prompt changes, and code releases
Real-World CoverageConvert production traces into test cases that reflect actual user behavior
Regression DetectionCompare results over time to catch quality degradation early
Objective BenchmarkingMeasure performance against defined criteria, not gut feeling

Dataset Dashboard

Navigate to Evaluation → Datasets from the left navigation panel to access your datasets. Dataset Dashboard
ColumnDescription
Dataset NameUnique identifier for the test suite
TagsMetadata labels for filtering and organization
Created AtTimestamp for version tracking
ActionsQuick access to edit or delete datasets

Creating a Dataset

There are two ways to create a dataset:

Creating Dataset from Traces

The fastest way to build meaningful test cases is to capture real interactions from your production system. This ensures your evaluations reflect actual user behavior.
1

Find a Trace

Navigate to Observability → Traces and locate an interaction you want to use as a test case.
2

Add to Dataset

Click the Add to Dataset button on the trace.Choose to create a new dataset or add to an existing one.
3

Configure Test Case

In the creation form:
  • Enter a dataset name (e.g., “Customer Support QA”)
  • Add optional tags for organization
  • Review and edit the input prompt
  • Provide the expected output
  • Include any relevant metadata from the trace
4

Select Evaluators

Click Next and choose evaluators to score this test case:
5

Map Variables

Configure how evaluator variables connect to your data:
SourceUse Case
Dataset fieldUse values defined in your test case (input, expected output)
Agent responseUse the actual LLM output at evaluation time
Execution dataUse metadata from the trace (latency, tokens, model)
6

Create Dataset

Click Create Dataset to finalize.

Creating Dataset from Dashboard

For comprehensive test coverage, create datasets manually with carefully crafted test cases.
1

Open Creation Form

Click the Create Dataset button in the top right corner of the Datasets page.
2

Configure Dataset

Create datasetFill in the dataset details:
FieldDescription
NameA descriptive identifier for your test suite
TagsLabels for filtering (e.g., “production”, “edge-cases”, “v2-prompts”)
TypeSingle Turn for request/response pairs
Data SourceAdd manually to create items one by one
Scenario (multi-turn conversations), Import from traces, and Import from CSV are coming soon.
3

Select Evaluators

Create datasetClick Next and select evaluators from the library or your saved configurations.
4

Map Variables

Create datasetConfigure variable mappings to connect evaluator inputs to your dataset fields.
5

Finalize

Click Create Dataset to complete the process.

Running an Evaluation

Once your dataset is configured with evaluators, you can run evaluations:
1

Get Dataset ID

Open your dataset and copy the Dataset ID displayed at the top of the page.Dataset ID
2

Trigger Evaluation

Use the Dataset ID in your evaluation pipeline code. The evaluation runs automatically when traces are created.
3

View Results

Monitor progress and results in Test Runs.
Evaluations execute automatically when the associated code is triggered. You don’t need to manually start each run—just ensure your traces are flowing to Netra.

Best Practices

Organizing Datasets

  • Use descriptive names: “Customer Support - Refund Requests” is better than “Dataset 1”
  • Tag consistently: Create a tagging convention (e.g., by feature, model version, or test type)
  • Version your datasets: Include version numbers in tags when testing prompt iterations

Building Effective Test Cases

  • Cover edge cases: Include unusual inputs, long prompts, and potential failure scenarios
  • Balance quantity and quality: A smaller dataset of high-quality test cases beats a large dataset of weak ones
  • Include negative tests: Add cases where the expected behavior is to refuse or ask for clarification

Maintaining Datasets

  • Update regularly: Add new test cases from production traces as you discover new patterns
  • Remove outdated cases: Delete test cases that no longer reflect current requirements
  • Review failed cases: Investigate failures to determine if the AI is wrong or the expected output needs updating
Last modified on January 28, 2026