1. Prerequisites
Before setting up evaluations, ensure you have:
- The Netra SDK installed and initialized (a minimal initialization sketch follows this list)
- At least one traced LLM call in your dashboard
- Your API key configured
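If the SDK is not yet initialized, the call is typically a one-liner at application startup. The sketch below is an assumption: the `netra` package name, the `init()` function, and the `NETRA_API_KEY` variable are illustrative placeholders rather than the documented API, so follow the SDK's own installation guide for the exact call.

```python
# Hypothetical initialization sketch. The `netra` package name and the
# `init()` signature are assumptions, not the documented API; check the
# Netra SDK reference for the real call.
import os

import netra  # assumed package name

netra.init(api_key=os.environ["NETRA_API_KEY"])  # assumed initialization call
```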
2. Create a Dataset
Datasets are collections of test cases that define inputs and expected outputs for your AI system.
Option A: Create from Traces (Recommended)
Convert real-world interactions into test cases:
Configure the Test Case
- Enter a dataset name (e.g., “Customer Support QA”)
- Add optional tags for organization
- Review the input prompt
- Provide the expected output
- Click Next
Option B: Create Manually
Configure Dataset
- Enter a dataset name
- Select Single Turn for request/response pairs
- Choose Add manually
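To make the single-turn fields concrete, here is what one test case might look like as plain data. The field names (`input`, `expected_output`, `tags`) are illustrative assumptions, not Netra's exact dataset schema.

```python
# Illustrative single-turn test case; the field names are assumptions,
# not Netra's exact dataset schema.
test_case = {
    "input": "How do I reset my password?",
    "expected_output": "Go to Settings > Security and choose Reset password.",
    "tags": ["customer-support", "account"],
}
```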
3. Configure Evaluators
Evaluators score your AI’s outputs against defined criteria. Netra offers two types:
LLM as Judge
Best for subjective quality assessment (a sample judge prompt is sketched after this list):
- Answer Correctness: Does the response match the expected answer?
- Relevance: Is the response relevant to the question?
- Hallucination Detection: Does the response contain fabricated information?
- Toxicity: Is the content safe and appropriate?
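To illustrate how LLM-as-judge scoring works in general, the sketch below builds a correctness-judging prompt from a question, the expected answer, and the actual answer. It shows the technique only; it is not the prompt Netra uses internally.

```python
# Generic LLM-as-judge sketch for answer correctness; not Netra's internal prompt.
JUDGE_PROMPT = """You are grading an AI assistant's answer.

Question: {question}
Expected answer: {expected}
Actual answer: {actual}

Does the actual answer convey the same meaning as the expected answer?
Reply with a single word: PASS or FAIL."""


def build_judge_prompt(question: str, expected: str, actual: str) -> str:
    """Fill the template; the returned string is what gets sent to a judge LLM."""
    return JUDGE_PROMPT.format(question=question, expected=expected, actual=actual)
```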
Code Evaluators
Best for deterministic checks (a sample code evaluator is sketched after this list):
- JSON Validation: Verify JSON structure and schema
- Regex Matching: Pattern-based validation
- Custom Logic: Write JavaScript or Python for specific rules
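As an example of the kind of deterministic rule a code evaluator can express, the Python sketch below combines JSON validation with a regex check on a hypothetical `order_id` field. The `evaluate(output) -> dict` signature is an assumption; adapt it to whatever signature the evaluator editor expects.

```python
# Sketch of a deterministic code evaluator: the output must be a JSON object
# containing a well-formed order_id. The function signature is an assumption.
import json
import re

ORDER_ID_PATTERN = re.compile(r"^ORD-\d{6}$")


def evaluate(output: str) -> dict:
    """Return a pass/fail result with a reason, based on the raw model output."""
    try:
        payload = json.loads(output)
    except json.JSONDecodeError:
        return {"passed": False, "reason": "output is not valid JSON"}

    if not isinstance(payload, dict):
        return {"passed": False, "reason": "output is not a JSON object"}

    order_id = str(payload.get("order_id", ""))
    if not ORDER_ID_PATTERN.match(order_id):
        return {"passed": False, "reason": f"unexpected order_id: {order_id!r}"}

    return {"passed": True, "reason": "valid JSON with a well-formed order_id"}
```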
Select from Library
Browse pre-built evaluators in categories:
- Quality
- Performance
- Agentic
- Guardrails
4. Run an Evaluation
Once your dataset is configured with evaluators:
Trigger Evaluation
Run your AI system with the dataset inputs. Evaluations execute automatically when traces are created.
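A minimal sketch of that step, assuming a placeholder `call_model` function standing in for your own traced LLM call (it is not part of the Netra SDK): loop over the dataset inputs so that each call produces a trace, which in turn triggers the configured evaluators.

```python
# Sketch: replay dataset inputs through your application so traces are emitted.
# `call_model` is a stand-in for your own traced LLM call, not a Netra API.
dataset_inputs = [
    "How do I reset my password?",
    "What is your refund policy?",
]


def call_model(prompt: str) -> str:
    """Placeholder for your traced LLM call (e.g. an OpenAI or Anthropic request)."""
    return f"(model response to: {prompt})"


for prompt in dataset_inputs:
    response = call_model(prompt)  # the trace emitted here is what triggers evaluation
    print(response)
```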
View Results
Navigate to Evaluation → Test Runs to see your evaluation results.
5. Analyze Test Run Results
Click on a test run to view detailed results:
Summary Metrics
- Total Cost: Aggregate cost of all LLM calls
- Average Latency: Response time across test cases
- Pass/Fail Rate: Overall success rate
Per-Test-Case Results
Each test case shows:
| Field | Description |
|---|---|
| Input | The prompt sent to the AI |
| Expected Output | Your defined ideal response |
| Task Output | The actual AI response |
| Status | Pass/Fail indicator |
| Evaluator Scores | Individual scores from each evaluator |
| View Trace | Link to the full execution trace |
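If you want to sanity-check the dashboard numbers, the summary metrics are simple aggregates over the per-test-case records. The field names in the sketch below mirror the table above but are illustrative, not an official export format.

```python
# Illustrative aggregation of per-test-case results into the summary metrics;
# the field names mirror the table above but are not an official export format.
results = [
    {"status": "pass", "cost_usd": 0.0021, "latency_ms": 840},
    {"status": "fail", "cost_usd": 0.0018, "latency_ms": 910},
    {"status": "pass", "cost_usd": 0.0024, "latency_ms": 780},
]

total_cost = sum(r["cost_usd"] for r in results)
avg_latency = sum(r["latency_ms"] for r in results) / len(results)
pass_rate = sum(r["status"] == "pass" for r in results) / len(results)

print(f"Total cost: ${total_cost:.4f}")
print(f"Average latency: {avg_latency:.0f} ms")
print(f"Pass rate: {pass_rate:.0%}")
```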
Troubleshooting
| Issue | Solution |
|---|---|
| No test runs appearing | Ensure your dataset has evaluators configured and traces are being sent |
| Evaluator errors | Test your evaluator in the Playground before adding to datasets |
| Unexpected failures | Check variable mappings in evaluator configuration |