AI systems don’t fail loudly. They drift, regress, and quietly degrade over time. Netra’s Evaluation framework makes that invisible failure visible, giving you a structured, repeatable way to measure how your AI behaves—not just once, but continuously across releases, prompts, models, and environments.

Quick Start: Evaluation

New to evaluations? Get your first evaluation running in minutes.

Why Evaluation Matters

Without systematic evaluation, you’re flying blind. Netra helps you answer critical questions with confidence:
Question | What Netra Measures
Is my system producing correct answers? | Answer correctness, semantic similarity, factual accuracy
Did this update introduce a regression? | Side-by-side comparison across test runs
Are costs creeping up unnoticed? | Token usage, latency, and cost per evaluation
Are my agents executing correctly? | Tool call sequences, decision paths, guardrail compliance

Core Building Blocks

The Evaluation suite is built on three interconnected pillars:

Datasets

Datasets are collections of test cases that define what you want to evaluate.
Feature | Description
Create from Traces | Convert real production interactions into test cases with one click
Manual Creation | Build test suites from scratch in the dashboard
Variable Mapping | Map evaluator inputs to dataset fields, agent responses, or trace metadata
Metadata & Tags | Organize datasets by feature, model, or release version
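
To make these pieces concrete, here is a minimal sketch of a single test case and a variable mapping. The field names and mapping keys (input, expected_output, dataset.*, trace.*) are illustrative assumptions, not Netra's actual schema; see the Datasets page for the real format.

```python
# Illustrative sketch only: these field names are assumptions, not Netra's schema.
test_case = {
    "input": "What is the refund window for annual plans?",
    "expected_output": "Annual plans can be refunded within 30 days of purchase.",
    "metadata": {
        "feature": "billing-faq",   # organize by feature
        "release": "2025.06",       # release version tag
    },
}

# Variable mapping: connect evaluator inputs to dataset fields or trace data.
variable_mapping = {
    "question": "dataset.input",             # evaluator input drawn from the dataset
    "reference": "dataset.expected_output",  # ground truth for correctness checks
    "answer": "trace.agent_response",        # actual response captured in the trace
}
```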

Evaluators

Evaluators are the scoring logic that assesses your AI’s performance. Netra offers two approaches:
  • LLM as Judge: Best for subjective quality, semantic correctness, and nuanced criteria. Use prebuilt templates or write custom prompts with providers like OpenAI, Anthropic, and Google.
  • Code Evaluators: Best for deterministic checks written in JavaScript or Python, such as JSON schema validation, regex matching, mathematical calculations, and custom business logic (see the sketch below).
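
To illustrate the Code Evaluator approach, the sketch below implements a deterministic check in Python: it verifies that an agent's response is well-formed JSON and that an order ID matches an expected pattern. The function name and return shape are assumptions for illustration; the exact signature Netra expects is documented on the Evaluators page.

```python
import json
import re

def evaluate(output: str) -> dict:
    """Deterministic check: the response must be valid JSON containing an
    'order_id' that matches the expected pattern. Return shape is illustrative."""
    try:
        payload = json.loads(output)
    except json.JSONDecodeError:
        return {"score": 0.0, "passed": False, "reason": "output is not valid JSON"}

    if not isinstance(payload, dict):
        return {"score": 0.0, "passed": False, "reason": "output is not a JSON object"}

    order_id = str(payload.get("order_id", ""))
    if not re.fullmatch(r"ORD-\d{6}", order_id):
        return {"score": 0.0, "passed": False, "reason": "order_id missing or malformed"}

    return {"score": 1.0, "passed": True, "reason": "schema and format checks passed"}

# Example: evaluate('{"order_id": "ORD-123456"}') passes; anything else fails.
```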
Netra provides a Library of preconfigured evaluators covering Quality, Performance, Agentic behavior, and Guardrails. Customize any evaluator and save it to My Evaluators for reuse across datasets.
Playground Testing

Before deploying an evaluator, test it in the integrated Playground:
  • Input sample data and run evaluations in real time
  • Refine prompt templates and adjust pass/fail thresholds
  • Verify edge case handling before adding to your pipeline

Test Runs

Test Runs execute your datasets through the evaluation pipeline, providing point-in-time snapshots of system health.
Feature | Description
Deep Diagnostics | Compare expected output vs. actual output side-by-side
Trace Integration | Link directly to execution traces to debug the “why” behind failures
Aggregated Metrics | View total cost, average latency, and pass/fail rates across the run
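
As a rough illustration of what the aggregated metrics summarize, the snippet below computes pass rate, average latency, and total cost from a handful of per-case results. The result fields are assumed for the example; in practice the Test Runs view computes these for you.

```python
# Illustrative only: the per-case result fields are assumed, not Netra's export format.
results = [
    {"passed": True,  "latency_ms": 820,  "cost_usd": 0.0031},
    {"passed": False, "latency_ms": 1150, "cost_usd": 0.0044},
    {"passed": True,  "latency_ms": 690,  "cost_usd": 0.0027},
]

pass_rate = sum(r["passed"] for r in results) / len(results)
avg_latency_ms = sum(r["latency_ms"] for r in results) / len(results)
total_cost_usd = sum(r["cost_usd"] for r in results)

print(f"pass rate: {pass_rate:.0%}, avg latency: {avg_latency_ms:.0f} ms, "
      f"total cost: ${total_cost_usd:.4f}")
```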

Use Cases

Regression Testing

Catch quality degradation before it reaches production:
  1. Create a dataset from your golden test cases
  2. Run evaluations after each model or prompt change
  3. Compare results across test runs to identify regressions (see the sketch below)
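
A minimal sketch of step 3, assuming you have exported per-case scores keyed by test case ID from a baseline run and a candidate run; the data shape is an assumption, not Netra's export format.

```python
# Illustrative only: per-case scores keyed by test case ID.
baseline = {"case-001": 0.95, "case-002": 0.88, "case-003": 0.91}
candidate = {"case-001": 0.96, "case-002": 0.71, "case-003": 0.90}

THRESHOLD = 0.05  # flag score drops larger than 5 points

regressions = {
    case_id: (baseline[case_id], candidate.get(case_id, 0.0))
    for case_id in baseline
    if baseline[case_id] - candidate.get(case_id, 0.0) > THRESHOLD
}

for case_id, (before, after) in regressions.items():
    print(f"regression in {case_id}: {before:.2f} -> {after:.2f}")
```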

Continuous Quality Monitoring

Track quality metrics over time:
  1. Convert production traces into datasets
  2. Schedule regular evaluation runs
  3. Set up alerts when pass rates drop below thresholds (see the sketch below)
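
A minimal sketch of the alerting idea in step 3, assuming you have the pass rate for each scheduled run; the run names and the alert action are placeholders.

```python
# Illustrative only: threshold check over scheduled runs; the alert hook is a placeholder.
MIN_PASS_RATE = 0.90

def check_run(run_id: str, pass_rate: float) -> None:
    """Flag any scheduled run whose pass rate falls below the threshold."""
    if pass_rate < MIN_PASS_RATE:
        print(f"ALERT: run {run_id} pass rate {pass_rate:.0%} is below {MIN_PASS_RATE:.0%}")

check_run("nightly-2025-06-01", 0.93)  # above threshold, no alert
check_run("nightly-2025-06-02", 0.84)  # triggers the alert
```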

Model Comparison

Evaluate different models or prompts objectively:
  1. Create a standardized dataset
  2. Run the same inputs through different model configurations
  3. Compare scores across test runs to make data-driven decisions (see the sketch below)
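
A minimal sketch of step 3, assuming per-case scores grouped by model or prompt configuration; the configuration names and data shape are illustrative.

```python
from statistics import mean

# Illustrative only: evaluation scores grouped by model/prompt configuration.
runs = {
    "model-a / prompt-v1": [0.91, 0.88, 0.94, 0.90],
    "model-b / prompt-v1": [0.86, 0.89, 0.84, 0.87],
    "model-a / prompt-v2": [0.93, 0.95, 0.92, 0.94],
}

# Rank configurations by mean score, highest first.
for config, scores in sorted(runs.items(), key=lambda kv: mean(kv[1]), reverse=True):
    print(f"{config}: mean score {mean(scores):.3f} over {len(scores)} cases")
```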

Getting Started

1. Create a Dataset

Start by creating a dataset from traces or manually in the dashboard.
2. Configure Evaluators

Add evaluators to define your scoring criteria—choose from the library or create custom ones.
3. Run Evaluations

Execute your dataset and view results in Test Runs.
4. Iterate and Improve

Use insights from test runs to refine your prompts, models, and evaluation criteria.

Next Steps

  • Quick Start: Evaluation - Get started with evaluations in minutes
  • Datasets - Create and manage test case collections
  • Evaluators - Configure scoring logic and criteria
  • Test Runs - Analyze evaluation results and track regressions
  • Traces - Understand how evaluations connect to trace data