Evaluators are the scoring logic that determines whether your AI system meets quality standards. They transform subjective assessments into measurable metrics—from semantic correctness and tool execution accuracy to safety guardrails and custom business logic. Use them with Datasets to build automated quality pipelines.

Why Evaluators Matter

Without systematic scoring, you can’t measure improvement or catch regressions:
Challenge          | How Evaluators Help
Subjective quality | LLM as Judge provides consistent, scalable assessment
Format validation  | Code Evaluators enforce JSON schemas, regex patterns, and business rules
Safety compliance  | Guardrail evaluators detect toxic, harmful, or off-topic content
Tool execution     | Agentic evaluators verify correct function calling sequences

Evaluator Types

Netra offers two approaches to scoring, each suited for different use cases:

LLM as Judge

Best for subjective quality, semantic correctness, and nuanced criteria. Uses AI models to evaluate AI outputs.

Code Evaluator

Best for deterministic checks—JSON validation, regex matching, calculations, and custom business logic in JavaScript or Python.

Evaluators Dashboard

Navigate to Evaluation → Evaluators from the left navigation panel. The interface has two tabs:
Tab           | Description
Library       | Netra’s preconfigured evaluators organized by category
My Evaluators | Your saved custom configurations for reuse across datasets
[Screenshot: Evaluators page showing Library and My Evaluators tabs]

Creating Custom Evaluators

Click the Add Evaluator button in the top right corner to create a new evaluator.
You can also customize any pre-built evaluator from the Library by clicking the Add button next to it.

LLM as Judge Configuration

Use LLM as Judge when you need to evaluate subjective criteria like answer quality, relevance, or helpfulness.

[Screenshot: LLM as Judge configuration window]
1. Name Your Evaluator

Provide a descriptive name (e.g., “Answer Correctness - Customer Support”).
2. Configure Prompt Template

  • Select a pre-built template or write your own evaluation prompt
  • Define variables using {{variable_name}} syntax
  • Variables map to dataset fields, agent responses, or trace metadata
Example prompt:
Compare the following response to the expected answer.

Expected: {{expected_output}}
Actual: {{agent_response}}

Rate the correctness from 0-10.
3. Set Output & Pass Criteria

Output Type | Configuration
Numerical   | Set threshold and operator (e.g., > 7 to pass)
Boolean     | Simple pass/fail evaluation
4. Select LLM Provider

Choose your preferred provider and model:
  • OpenAI (GPT-4, GPT-3.5)
  • Anthropic (Claude)
  • Google (Gemini)
  • Mistral
5. Test in Playground

  • Input sample data for each variable
  • Run the evaluator in real-time
  • Refine your prompt until results are consistent

Code Evaluator Configuration

Use Code Evaluators for deterministic checks that don’t require AI judgment.

[Screenshot: Code Evaluator configuration window]
1. Name Your Evaluator

Provide a descriptive name (e.g., “JSON Schema Validator”).
2. Write Your Code

Use the code editor to write JavaScript or Python. A handler function is required.
JavaScript example:
function handler(input) {
  try {
    const parsed = JSON.parse(input.agent_response);
    return parsed.hasOwnProperty('name') && parsed.hasOwnProperty('email');
  } catch {
    return false;
  }
}
Python example:
import json

def handler(input):
    try:
        parsed = json.loads(input["agent_response"])
        return "name" in parsed and "email" in parsed
    except (json.JSONDecodeError, KeyError, TypeError):
        return False
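In both examples, the handler receives a single input object whose fields correspond to the variables you map from your data (here, agent_response); see Using Evaluators in Datasets below for how that mapping is configured.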
3. Set Output & Pass Criteria

Output Type | Configuration
Numerical   | Set threshold and operator (e.g., >= 0.8 to pass)
Boolean     | Return true/false directly from your code
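With a Numerical output type, the handler returns a score and the configured threshold decides pass/fail. The sketch below is illustrative only: it assumes the same handler(input) contract as the examples above, and required_fields is a hypothetical stand-in for your own schema.
import json

# Hypothetical list of fields the response is expected to contain.
required_fields = ["name", "email", "phone", "address"]

def handler(input):
    try:
        parsed = json.loads(input["agent_response"])
    except (json.JSONDecodeError, KeyError, TypeError):
        return 0.0
    if not isinstance(parsed, dict):
        return 0.0
    # Fraction of required fields present; the platform compares this
    # score against the configured threshold (e.g., >= 0.8 to pass).
    present = sum(1 for field in required_fields if field in parsed)
    return present / len(required_fields)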
4. Test in Playground

  • Input sample data
  • Execute your code in real-time
  • Debug and refine until it handles edge cases correctly
Once created, your evaluator appears in My Evaluators and becomes available when creating datasets.

Library

The Library contains pre-built evaluators ready to use or customize.

[Screenshot: Library tab]
Category        | Description                                  | Type
Quality         | Answer correctness, relevance, completeness  | LLM as Judge
Tool Use        | Validates proper function/tool calling       | LLM as Judge
Performance     | Response time, token efficiency              | Code
Semantic        | Meaning preservation, context understanding  | LLM as Judge
Agentic         | Decision-making, multi-step reasoning        | LLM as Judge
Guardrails      | Content safety, toxicity, compliance         | LLM as Judge
JSON Evaluator  | Schema validation, structure checks          | Code
Regex Evaluator | Pattern matching, format validation          | Code

Customizing Pre-built Evaluators

Start with a library evaluator and tailor it to your needs:
1. Browse the Library

Find an evaluator that matches your use case.
2. Click Add

Opens the configuration window with pre-filled settings.
3. Customize

  • Modify the prompt template
  • Adjust variables and mappings
  • Change pass/fail thresholds
4. Test in Playground

Validate your changes with sample data.
5. Save

Click Create to save to My Evaluators.

Using Evaluators in Datasets

Once created, evaluators become available when building datasets:
  1. Create or edit a dataset
  2. In the evaluator selection step, choose from Library or My Evaluators
  3. Map variables to connect evaluator inputs to your data
  4. Run evaluations and view results in Test Runs

Best Practices

Choosing the Right Evaluator Type

Use Case                              | Recommended Type
“Is this answer correct?”             | LLM as Judge
“Is the JSON valid?”                  | Code Evaluator
“Is the response helpful?”            | LLM as Judge
“Does it match this regex?”           | Code Evaluator
“Is content safe for users?”          | LLM as Judge (Guardrails)
“Did the agent call the right tools?” | LLM as Judge (Agentic)

Writing Effective LLM Prompts

  • Be specific: Define exactly what “correct” or “good” means
  • Provide examples: Include sample inputs and expected scores
  • Set clear scales: “Rate 1-10” is better than “rate quality”
  • Test edge cases: Validate with ambiguous or tricky inputs
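For example, a prompt that applies these guidelines (using the same variable syntax as above; the criteria and scale anchors are illustrative) might look like:
Compare the agent's answer to the expected answer for factual correctness.

Expected: {{expected_output}}
Actual: {{agent_response}}

Rate correctness from 1-10, where 1 = contradicts the expected answer,
5 = partially correct with important omissions, and 10 = fully correct
and complete. Respond with only the number.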

Testing Before Deployment

Always use the Playground before adding evaluators to production datasets:
  • Test with representative samples from your actual data
  • Include edge cases and potential failure scenarios
  • Verify pass/fail thresholds produce expected results