Quick Start: Evaluation
New to evaluations? Get your first evaluation running in minutes.
Why Evaluation Matters
Without systematic evaluation, you're flying blind. Netra helps you answer critical questions with confidence:

| Question | What Netra Measures |
|---|---|
| Is my system producing correct answers? | Answer correctness, semantic similarity, factual accuracy |
| Did this update introduce a regression? | Side-by-side comparison across test runs |
| Are costs creeping up unnoticed? | Token usage, latency, and cost per evaluation |
| Are my agents executing correctly? | Tool call sequences, decision paths, guardrail compliance |
Core Building Blocks
The Evaluation suite is built on three interconnected pillars:

Datasets
Datasets are collections of test cases that define what you want to evaluate.

| Feature | Description |
|---|---|
| Create from Traces | Convert real production interactions into test cases with one click |
| Manual Creation | Build test suites from scratch in the dashboard |
| Variable Mapping | Map evaluator inputs to dataset fields, agent responses, or trace metadata |
| Metadata & Tags | Organize datasets by feature, model, or release version |
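
To make this concrete, here is a minimal sketch of what a single test case might contain. The field names are illustrative assumptions only; the actual schema comes from how you configure the dataset and map variables in the dashboard.

```python
# Illustrative only: field names are assumptions, not Netra's dataset schema.
test_case = {
    "input": "What is the refund window for annual plans?",
    "expected_output": "Annual plans can be refunded within 30 days of purchase.",
    "metadata": {
        "feature": "billing-faq",   # organize by feature, model, or release
        "release": "2025-06",
    },
}

dataset = [test_case]  # a dataset is simply a collection of such cases
```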
Evaluators
Evaluators are the scoring logic that assesses your AI's performance. Netra offers two approaches:

- LLM as Judge: Best for subjective quality, semantic correctness, and nuanced criteria. Use prebuilt templates or write custom prompts with providers like OpenAI, Anthropic, and Google.
- Code Evaluators: Best for deterministic checks in JavaScript or Python, such as JSON schema validation, regex matching, mathematical calculations, and custom business logic.

Netra provides a Library of preconfigured evaluators covering Quality, Performance, Agentic behavior, and Guardrails. Customize any evaluator and save it to My Evaluators for reuse across datasets.
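
As an illustration of the kind of deterministic check a code evaluator can run, the sketch below validates that a response is well-formed JSON with a required field and applies a simple guardrail. The `evaluate` function name and its return shape are assumptions made for this example, not Netra's evaluator contract.

```python
import json
import re

def evaluate(output: str) -> dict:
    """Deterministic check: output must be valid JSON with an 'answer'
    field that contains no email addresses (a simple guardrail)."""
    try:
        payload = json.loads(output)
    except json.JSONDecodeError:
        return {"passed": False, "reason": "output is not valid JSON"}

    answer = payload.get("answer")
    if not isinstance(answer, str):
        return {"passed": False, "reason": "missing 'answer' field"}

    if re.search(r"[\w.+-]+@[\w-]+\.[\w.]+", answer):
        return {"passed": False, "reason": "answer leaks an email address"}

    return {"passed": True, "reason": "schema and guardrail checks passed"}
```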
Test any evaluator interactively before adding it to your pipeline:

- Input sample data and run evaluations in real-time
- Refine prompt templates and adjust pass/fail thresholds
- Verify edge case handling before adding to your pipeline
Test Runs
Test Runs execute your datasets through the evaluation pipeline, providing point-in-time snapshots of system health.

| Feature | Description |
|---|---|
| Deep Diagnostics | Compare expected output vs. actual output side-by-side |
| Trace Integration | Link directly to execution traces to debug the “why” behind failures |
| Aggregated Metrics | View total cost, average latency, and pass/fail rates across the run |
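
The aggregated metrics are computed for you in the dashboard; the sketch below only illustrates the arithmetic behind those numbers, using hypothetical per-case results.

```python
# Hypothetical per-case results; the values and field names are assumptions.
results = [
    {"passed": True,  "latency_ms": 820,  "cost_usd": 0.0031},
    {"passed": False, "latency_ms": 1140, "cost_usd": 0.0044},
    {"passed": True,  "latency_ms": 905,  "cost_usd": 0.0029},
]

pass_rate = sum(r["passed"] for r in results) / len(results)
avg_latency_ms = sum(r["latency_ms"] for r in results) / len(results)
total_cost_usd = sum(r["cost_usd"] for r in results)

print(f"pass rate: {pass_rate:.0%}, avg latency: {avg_latency_ms:.0f} ms, "
      f"total cost: ${total_cost_usd:.4f}")
```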
Use Cases
Regression Testing
Catch quality degradation before it reaches production (a minimal comparison sketch follows these steps):

- Create a dataset from your golden test cases
- Run evaluations after each model or prompt change
- Compare results across test runs to identify regressions
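
The side-by-side comparison happens in Test Runs; conceptually, a regression is a case that passed in the baseline run and fails in the new one. The sketch below shows that check with hypothetical results.

```python
# Hypothetical per-case pass/fail maps for two test runs of the same dataset.
baseline  = {"case-1": True, "case-2": True,  "case-3": False}
candidate = {"case-1": True, "case-2": False, "case-3": False}

# A regression is a case that passed before and fails now.
regressions = [case for case, passed in baseline.items()
               if passed and not candidate.get(case, False)]

print("regressions:", regressions)  # -> ['case-2']
```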
Continuous Quality Monitoring
Track quality metrics over time (a simple threshold check is sketched after these steps):

- Convert production traces into datasets
- Schedule regular evaluation runs
- Set up alerts when pass rates drop below thresholds
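
How you wire up alerting is up to you; the sketch below is a minimal threshold check you might run after a scheduled evaluation. The 90% threshold and the `notify` hook are placeholders, not Netra features.

```python
# Placeholder threshold and notification hook; adjust to your own tooling.
PASS_RATE_THRESHOLD = 0.90

def notify(message: str) -> None:
    print(f"ALERT: {message}")  # swap in your Slack/email/pager integration

def check_run(pass_rate: float) -> None:
    if pass_rate < PASS_RATE_THRESHOLD:
        notify(f"pass rate {pass_rate:.0%} fell below {PASS_RATE_THRESHOLD:.0%}")

check_run(pass_rate=0.84)
```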
Model Comparison
Evaluate different models or prompts objectively (a comparison sketch follows these steps):

- Create a standardized dataset
- Run the same inputs through different model configurations
- Compare scores across test runs to make data-driven decisions
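
For example, once each configuration has its own test run over the same dataset, the decision can be reduced to a simple ranking of aggregate scores. The configuration names and numbers below are hypothetical.

```python
# Hypothetical aggregate scores per configuration, one test run each.
runs = {
    "gpt-4o / prompt-v1":      {"pass_rate": 0.91, "avg_cost_usd": 0.0042},
    "gpt-4o-mini / prompt-v1": {"pass_rate": 0.86, "avg_cost_usd": 0.0009},
}

# Rank by quality first, then prefer the cheaper configuration.
best = max(runs.items(),
           key=lambda kv: (kv[1]["pass_rate"], -kv[1]["avg_cost_usd"]))
print("best configuration:", best[0])
```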
Getting Started
Create a Dataset
Start by creating a dataset from traces or manually in the dashboard.
Configure Evaluators
Add evaluators to define your scoring criteria—choose from the library or create custom ones.
Run Evaluations
Execute your dataset and view results in Test Runs.
Related
- Quick Start: Evaluation - Get started with evaluations in minutes
- Datasets - Create and manage test case collections
- Evaluators - Configure scoring logic and criteria
- Test Runs - Analyze evaluation results and track regressions
- Traces - Understand how evaluations connect to trace data