This guide walks you through setting up evaluations to measure your AI system’s accuracy, quality, and reliability.

1. Prerequisites

Before setting up evaluations, ensure your AI system is instrumented and sending traces to Netra, since evaluations run against the traces your system creates.

2. Create a Dataset

Datasets are collections of test cases that define inputs and expected outputs for your AI system. You can create them from real-world traces (Option A) or build them manually (Option B).

Option A: Create from Traces

Convert real-world interactions into test cases:

Step 1: Navigate to Traces

Go to Observability → Traces and find a trace you want to use as a test case.
Step 2: Add to Dataset

Click the Add to Dataset button on the trace.
Step 3: Configure the Test Case

  • Enter a dataset name (e.g., “Customer Support QA”)
  • Add optional tags for organization
  • Review the input prompt
  • Provide the expected output
  • Click Next
Step 4: Select Evaluators

Choose evaluators to score your AI’s performance (see 3. Configure Evaluators below).

Option B: Create Manually

Step 1: Open Dataset Dashboard

Navigate to Evaluation → Datasets and click Create Dataset.
Step 2: Configure Dataset

  • Enter a dataset name
  • Select Single Turn for request/response pairs
  • Choose Add manually
Step 3: Add Test Cases

For each test case, provide the following (a minimal sketch follows this list):
  • Input: The prompt or question
  • Expected Output: The ideal response
  • Metadata (optional): Additional context
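A test case is conceptually just an input/expected-output pair with optional metadata. A minimal sketch of that shape, assuming you stage test cases in Python before entering them; the field names mirror the list above and are illustrative, not a required import schema:

```python
# Illustrative shape of single-turn test cases; field names mirror the list
# above and are not a required Netra import format.
test_cases = [
    {
        "input": "How do I reset my password?",                                     # Input: the prompt or question
        "expected_output": "Go to Settings > Security and choose Reset Password.",  # Expected Output: the ideal response
        "metadata": {"category": "account"},                                        # Metadata (optional): extra context
    },
    {
        "input": "Which plans do you offer?",
        "expected_output": "We offer Free, Pro, and Enterprise plans.",
        "metadata": {"category": "billing"},
    },
]
```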

3. Configure Evaluators

Evaluators score your AI’s outputs against defined criteria. Netra offers two types:

LLM as Judge

Best for subjective quality assessment (a sketch of this pattern follows the list):
  • Answer Correctness: Does the response match the expected answer?
  • Relevance: Is the response relevant to the question?
  • Hallucination Detection: Does the response contain fabricated information?
  • Toxicity: Is the content safe and appropriate?
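Conceptually, an LLM-as-judge evaluator wraps the expected output and the actual response in a grading prompt and asks a model for a score. A minimal sketch of that pattern, assuming a generic `call_llm(prompt) -> str` function that you supply; the prompt wording and 0–1 scale are illustrative, not Netra’s built-in rubric:

```python
from typing import Callable

JUDGE_PROMPT = """You are grading an AI assistant's answer.

Question: {question}
Expected answer: {expected}
Actual answer: {actual}

Rate how correct the actual answer is on a scale from 0.0 to 1.0,
where 1.0 means fully correct and 0.0 means completely wrong.
Reply with only the number."""


def judge_correctness(question: str, expected: str, actual: str,
                      call_llm: Callable[[str], str]) -> float:
    """Score answer correctness with an LLM judge; returns a value in [0, 1]."""
    prompt = JUDGE_PROMPT.format(question=question, expected=expected, actual=actual)
    reply = call_llm(prompt)
    try:
        # Clamp to [0, 1] in case the judge drifts slightly out of range.
        return max(0.0, min(1.0, float(reply.strip())))
    except ValueError:
        return 0.0  # An unparseable reply counts as a failed grade.


if __name__ == "__main__":
    fake_llm = lambda prompt: "0.8"  # stand-in for a real LLM call
    print(judge_correctness("What is 2 + 2?", "4", "The answer is 4.", fake_llm))
```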

Code Evaluators

Best for deterministic checks (a sketch follows the list):
  • JSON Validation: Verify JSON structure and schema
  • Regex Matching: Pattern-based validation
  • Custom Logic: Write JavaScript or Python for specific rules
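Code evaluators are plain functions that return a pass/fail result or a score. A minimal Python sketch of the JSON-validation and regex-matching checks listed above; the function signatures are illustrative, so adapt them to however your custom evaluator receives the model output:

```python
import json
import re


def validate_json(output: str, required_keys: list[str]) -> bool:
    """JSON Validation: output must parse as a JSON object containing the required keys."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and all(key in data for key in required_keys)


def matches_pattern(output: str, pattern: str) -> bool:
    """Regex Matching: output must contain a match for the given pattern."""
    return re.search(pattern, output) is not None


# Example: require a JSON object with an order ID that looks like ORD-12345.
print(validate_json('{"order_id": "ORD-12345"}', ["order_id"]))    # True
print(matches_pattern('{"order_id": "ORD-12345"}', r"ORD-\d{5}"))  # True
```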
Step 1: Add Evaluators

When creating your dataset, click Next to reach the evaluator selection screen.
Step 2: Select from Library

Browse pre-built evaluators in categories:
  • Quality
  • Performance
  • Agentic
  • Guardrails
Step 3: Map Variables

Configure how evaluator variables map to your data (a sketch follows this list):
  • Dataset field: Use values from your test cases
  • Agent response: Use the actual LLM output
  • Execution data: Use trace metadata
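The mapping simply tells each evaluator where its inputs come from. As a hedged illustration only (a conceptual Python dict, not Netra’s actual configuration format), an answer-correctness evaluator might be wired to the three sources above like this:

```python
# Conceptual illustration of variable mapping, not an actual Netra config format.
# Each evaluator variable is bound to one of the three sources described above.
evaluator_mapping = {
    "question": {"source": "dataset_field", "field": "input"},               # from the test case
    "expected_answer": {"source": "dataset_field", "field": "expected_output"},
    "actual_answer": {"source": "agent_response"},                           # the live LLM output
    "latency_ms": {"source": "execution_data", "field": "duration_ms"},      # from trace metadata
}
```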

4. Run an Evaluation

Once your dataset is configured with evaluators:
Step 1: Get Dataset ID

Open your dataset and copy the Dataset ID displayed at the top.
Step 2: Trigger Evaluation

Run your AI system with the dataset inputs. Evaluations execute automatically when traces are created.
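What this looks like in practice depends on your setup, but the shape is: loop over the dataset inputs, call your AI system with tracing enabled, and attach the Dataset ID so the resulting traces can be matched back to the test cases. A hedged sketch where `start_trace` and `run_my_agent` are stand-ins for your own tracing instrumentation and application code, not Netra API calls:

```python
from contextlib import contextmanager

DATASET_ID = "ds_abc123"  # placeholder; copy the real Dataset ID from the dataset page


@contextmanager
def start_trace(metadata: dict):
    """Stand-in for your tracing instrumentation; replace with your real tracer."""
    print(f"trace started with metadata: {metadata}")
    yield
    print("trace finished")


def run_my_agent(question: str) -> str:
    """Stand-in for your AI system; replace with your real agent call."""
    return f"(answer to: {question})"


dataset_inputs = [
    "How do I reset my password?",
    "Which plans do you offer?",
]

for question in dataset_inputs:
    # Tag each trace with the dataset so the evaluation can match it to a test case.
    with start_trace({"dataset_id": DATASET_ID, "input": question}):
        print(run_my_agent(question))
```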
Step 3: View Results

Navigate to Evaluation → Test Runs to see your evaluation results.

5. Analyze Test Run Results

Click on a test run to view detailed results:

Summary Metrics

  • Total Cost: Aggregate cost of all LLM calls
  • Average Latency: Response time across test cases
  • Pass/Fail Rate: Overall success rate (a sketch of how these aggregate follows this list)
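These summary metrics are simple aggregates of the per-test-case results. A rough sketch of how they relate, using illustrative field names rather than an actual Netra export format:

```python
# Illustrative per-test-case results; field names are not a Netra export format.
results = [
    {"passed": True,  "latency_ms": 850,  "cost_usd": 0.0021},
    {"passed": False, "latency_ms": 1200, "cost_usd": 0.0034},
    {"passed": True,  "latency_ms": 640,  "cost_usd": 0.0018},
]

total_cost = sum(r["cost_usd"] for r in results)
average_latency = sum(r["latency_ms"] for r in results) / len(results)
pass_rate = sum(r["passed"] for r in results) / len(results)

print(f"Total Cost: ${total_cost:.4f}")
print(f"Average Latency: {average_latency:.0f} ms")
print(f"Pass/Fail Rate: {pass_rate:.0%} passed")
```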

Per-Test-Case Results

Each test case shows:
  • Input: The prompt sent to the AI
  • Expected Output: Your defined ideal response
  • Task Output: The actual AI response
  • Status: Pass/Fail indicator
  • Evaluator Scores: Individual scores from each evaluator
  • View Trace: Link to the full execution trace

Troubleshooting

  • No test runs appearing: Ensure your dataset has evaluators configured and traces are being sent
  • Evaluator errors: Test your evaluator in the Playground before adding it to datasets
  • Unexpected failures: Check the variable mappings in the evaluator configuration

Next Steps

Last modified on January 28, 2026