Why Evaluators Matter
Without systematic scoring, you can’t measure improvement or catch regressions:

| Challenge | How Evaluators Help |
|---|---|
| Subjective quality | LLM as Judge provides consistent, scalable assessment |
| Format validation | Code Evaluators enforce JSON schemas, regex patterns, and business rules |
| Safety compliance | Guardrail evaluators detect toxic, harmful, or off-topic content |
| Tool execution | Agentic evaluators verify correct function calling sequences |
Evaluator Types
Netra offers two approaches to scoring, each suited for different use cases:

LLM as Judge
Best for subjective quality, semantic correctness, and nuanced criteria. Uses AI models to evaluate AI outputs.
Code Evaluator
Best for deterministic checks—JSON validation, regex matching, calculations, and custom business logic in JavaScript or Python.
Evaluators Dashboard
Navigate to Evaluation → Evaluators from the left navigation panel. The interface has two tabs:

| Tab | Description |
|---|---|
| Library | Netra’s preconfigured evaluators organized by category |
| My Evaluators | Your saved custom configurations for reuse across datasets |

Creating Custom Evaluators
Click the Add Evaluator button in the top right corner to create a new evaluator.

LLM as Judge Configuration
Use LLM as Judge when you need to evaluate subjective criteria like answer quality, relevance, or helpfulness.
Configure Prompt Template
- Select a prebuilt template or write your own evaluation prompt
- Define variables using {{variable_name}} syntax (see the sample template below)
- Variables map to dataset fields, agent responses, or trace metadata
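For illustration, an evaluation prompt for answer relevance might look like the sketch below; the variable names {{question}} and {{response}} are hypothetical placeholders that you would map to your own dataset fields.

```text
You are grading an AI assistant's answer for relevance.

Question: {{question}}
Assistant response: {{response}}

Rate the relevance of the response to the question on a scale of 1-10,
where 1 is completely off-topic and 10 is fully on-topic.
Return only the number.
```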
Set Output & Pass Criteria
| Output Type | Configuration |
|---|---|
| Numerical | Set threshold and operator (e.g., > 7 to pass) |
| Boolean | Simple pass/fail evaluation |
Select LLM Provider
Choose your preferred provider and model:
- OpenAI (GPT-4, GPT-3.5)
- Anthropic (Claude)
- Google (Gemini)
- Mistral
Code Evaluator Configuration
Use Code Evaluators for deterministic checks that don’t require AI judgment.
Write Your Code
Use the code editor to write JavaScript or Python. A handler function is required.
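A Python example (a minimal sketch; the handler signature shown here, a single string argument returning a numeric score, is an assumption to adapt to the inputs you map in the configuration):

```python
import json

# Minimal sketch of a Python code evaluator.
# Assumption: the handler receives the agent output as a string and
# returns a numeric score between 0 and 1.
def handler(output: str) -> float:
    """Score 1.0 if the output is valid JSON containing an 'answer' key, else 0.0."""
    try:
        parsed = json.loads(output)
    except (json.JSONDecodeError, TypeError):
        return 0.0
    return 1.0 if isinstance(parsed, dict) and "answer" in parsed else 0.0
```

A JavaScript example (the same check sketched as a boolean evaluator, under the same assumptions about the handler signature):

```javascript
// Minimal sketch of a JavaScript code evaluator that returns true/false directly.
// Assumption: the handler receives the agent output as a string.
function handler(output) {
  try {
    const parsed = JSON.parse(output);
    return typeof parsed === "object" && parsed !== null && "answer" in parsed;
  } catch (err) {
    return false;
  }
}
```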
Set Output & Pass Criteria
| Output Type | Configuration |
|---|---|
| Numerical | Set threshold and operator (e.g., >= 0.8 to pass) |
| Boolean | Return true/false directly from your code |
Once created, your evaluator appears in My Evaluators and becomes available when creating datasets.
Library
The Library contains pre-built evaluators ready to use or customize.
| Category | Description | Type |
|---|---|---|
| Quality | Answer correctness, relevance, completeness | LLM as Judge |
| Tool Use | Validates proper function/tool calling | LLM as Judge |
| Performance | Response time, token efficiency | Code |
| Semantic | Meaning preservation, context understanding | LLM as Judge |
| Agentic | Decision-making, multi-step reasoning | LLM as Judge |
| Guardrails | Content safety, toxicity, compliance | LLM as Judge |
| JSON Evaluator | Schema validation, structure checks | Code |
| Regex Evaluator | Pattern matching, format validation | Code |
Customizing Pre-built Evaluators
Start with a library evaluator and tailor it to your needs.

Using Evaluators in Datasets
Once created, evaluators become available when building datasets:

- Create or edit a dataset
- In the evaluator selection step, choose from Library or My Evaluators
- Map variables to connect evaluator inputs to your data
- Run evaluations and view results in Test Runs
Best Practices
Choosing the Right Evaluator Type
| Use Case | Recommended Type |
|---|---|
| “Is this answer correct?” | LLM as Judge |
| “Is the JSON valid?” | Code Evaluator |
| “Is the response helpful?” | LLM as Judge |
| “Does it match this regex?” | Code Evaluator |
| “Is content safe for users?” | LLM as Judge (Guardrails) |
| “Did the agent call the right tools?” | LLM as Judge (Agentic) |
Writing Effective LLM Prompts
- Be specific: Define exactly what “correct” or “good” means
- Provide examples: Include sample inputs and expected scores (see the sketch after this list)
- Set clear scales: “Rate 1-10” is better than “rate quality”
- Test edge cases: Validate with ambiguous or tricky inputs
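As an illustration of these practices, a rubric-style prompt fragment might look like the following sketch; the criteria and the worked example are hypothetical.

```text
Score the response for factual correctness on a scale of 1-10.

Scoring guide:
- 1-3: contains factual errors or contradicts the provided context
- 4-6: mostly correct but omits key details
- 7-10: fully correct and complete

Example: if the context says the refund window is 30 days and the response
says "60 days", score it 2.

Return only the number.
```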
Testing Before Deployment
Always use the Playground before adding evaluators to production datasets:

- Test with representative samples from your actual data
- Include edge cases and potential failure scenarios
- Verify pass/fail thresholds produce expected results
Related
- Evaluation Overview - Understand the full evaluation framework
- Datasets - Create test cases that use your evaluators
- Test Runs - View evaluation results and scores
- Quick Start: Evaluation - Get started with evaluations