This cookbook shows you how to build custom evaluators for domain-specific quality measurement—from content quality scoring to multi-stage pipeline validation and business logic checks.

Open in Google Colab to run the complete notebook in your browser.

What You’ll Learn

  • LLM-as-Judge Evaluators: Build evaluators that use LLMs to assess quality, relevance, and accuracy
  • Code Evaluators: Create deterministic evaluators with custom JavaScript logic
  • Multi-Stage Evaluation: Measure quality at each stage of a multi-step pipeline
  • Composite Scoring: Combine multiple metrics into a single quality score

Prerequisites:
  • Python >=3.10, <3.14
  • OpenAI API key
  • Netra API key (Get started here)
  • A test dataset with expected outputs

Evaluator Types

Netra supports two evaluator types:
Type           | Use Case                                         | Strengths
LLM-as-Judge   | Subjective quality, nuance, semantic meaning     | Handles ambiguity, understands context
Code Evaluator | Deterministic checks, business rules, structure  | Fast, consistent, no LLM cost
Choose based on what you’re measuring:
  • LLM-as-Judge: “Is this answer helpful?” “Is the tone professional?”
  • Code Evaluator: “Does it contain required fields?” “Is latency under threshold?”

LLM-as-Judge Patterns

Pattern 1: Content Quality Scoring

Evaluate content across multiple dimensions with weighted scoring.

Create in Dashboard:
  1. Go to Evaluation → Evaluators → Add Evaluator
  2. Select LLM-as-Judge type
  3. Configure the prompt:
Evaluate the quality of this content on a scale of 0 to 1.

Score based on these weighted criteria:
- Coherence (0.25): Does the content flow logically?
- Accuracy (0.25): Is the information factually correct?
- Completeness (0.25): Does it cover the topic adequately?
- Engagement (0.25): Is it interesting and well-written?

Content to evaluate:
{output}

Original request:
{input}

Respond with only a decimal number between 0 and 1.
Setting       | Value
Output Type   | Numerical
Pass Criteria | >= 0.7
LLM Provider  | OpenAI (or your preference)
Model         | gpt-4o-mini
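
To make the template variables concrete, here is a hypothetical dataset row this evaluator might score; the field names mirror the {input} and {output} placeholders above, and the values are purely illustrative:
// Hypothetical dataset row; {input} and {output} in the prompt are filled from these fields
const exampleRow = {
    input: "Write a short explainer on retrieval-augmented generation for a developer blog.",
    output: "Retrieval-augmented generation (RAG) pairs a language model with a search step...",
};
The judge renders the prompt with these values and returns a decimal that is compared against the pass criteria (>= 0.7).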

Pattern 2: Factual Accuracy Check

Compare generated content against a reference for factual accuracy:
Compare the generated answer against the reference answer for factual accuracy.

Generated Answer:
{output}

Reference Answer:
{expected_output}

Evaluate on these criteria:
1. Are all key facts from the reference present in the generated answer?
2. Does the generated answer contain any factual errors?
3. Does the generated answer add any unverified claims?

Score from 0 to 1 where:
- 1.0 = Fully accurate, all key facts present
- 0.7-0.9 = Mostly accurate, minor omissions
- 0.4-0.6 = Partially accurate, significant gaps
- 0.0-0.3 = Inaccurate or missing key facts

Respond with only a decimal number.
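
If you also want a cheap deterministic signal next to this judge, a code evaluator can approximate fact coverage with simple token overlap. This is a rough illustrative sketch (it treats expected_output as the plain reference string and uses deliberately naive tokenization), not a substitute for the LLM check:
function handler(input, output, expectedOutput) {
    // Naive fact-coverage check: what fraction of the reference's
    // longer words appear somewhere in the generated answer?
    const reference = String(expectedOutput || "").toLowerCase();
    const generated = String(output || "").toLowerCase();

    const referenceTerms = reference
        .split(/\W+/)
        .filter(term => term.length > 4); // skip short, common words

    if (referenceTerms.length === 0) {
        return 1; // nothing to check against
    }

    const covered = referenceTerms.filter(term => generated.includes(term));
    return covered.length / referenceTerms.length;
}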

Pattern 3: Tone and Style Evaluation

Assess whether content matches expected tone:
Evaluate whether this content matches the expected tone and style.

Content:
{output}

Expected tone: {expected_output.tone}
Target audience: {expected_output.audience}

Evaluate:
1. Is the vocabulary appropriate for the audience?
2. Is the tone consistent throughout?
3. Does it avoid inappropriate language or jargon?

Score from 0 to 1 where:
- 1.0 = Perfect tone match
- 0.7-0.9 = Good match with minor adjustments needed
- 0.4-0.6 = Tone partially correct
- 0.0-0.3 = Tone mismatch

Respond with only a decimal number.
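
The {expected_output.tone} and {expected_output.audience} placeholders assume each dataset row stores structured metadata in its expected output. A hypothetical row's expected output might look like this:
// Hypothetical expected_output for a tone/style test case
const expectedOutput = {
    tone: "professional but friendly",
    audience: "non-technical small business owners",
};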

Code Evaluator Patterns

Pattern 4: JSON Structure Validation

Validate that output contains required fields:
function handler(input, output, expectedOutput) {
    // Parse JSON output
    let result;
    try {
        result = JSON.parse(output);
    } catch (e) {
        return 0; // Invalid JSON
    }

    // Define required fields
    const requiredFields = expectedOutput?.required_fields ||
                          ["summary", "action_items"];

    // Check for each required field
    let fieldsPresent = 0;
    for (const field of requiredFields) {
        if (field in result && result[field]) {
            fieldsPresent++;
        }
    }

    return fieldsPresent / requiredFields.length;
}
Setting       | Value
Output Type   | Numerical
Pass Criteria | >= 1.0
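
As a quick illustration with made-up values, an output that parses but is missing one of two required fields scores 0.5 and fails the >= 1.0 criterion:
// Output with only "summary" present: 1 of 2 required fields -> 0.5
const sampleOutput = JSON.stringify({ summary: "Weekly sync notes" });
handler("", sampleOutput, { required_fields: ["summary", "action_items"] }); // 0.5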

Pattern 5: Length and Format Validation

Check content meets length requirements:
function handler(input, output, expectedOutput) {
    const minLength = expectedOutput?.min_length || 100;
    const maxLength = expectedOutput?.max_length || 2000;
    const requiredFormat = expectedOutput?.format || "markdown";

    let score = 0;

    // Check length (50% of score)
    const length = output.length;
    if (length >= minLength && length <= maxLength) {
        score += 0.5;
    } else if (length >= minLength * 0.8 && length <= maxLength * 1.2) {
        score += 0.25; // Partial credit for close
    }

    // Check format (50% of score)
    if (requiredFormat === "markdown") {
        const hasHeadings = output.includes("# ") || output.includes("## ");
        const hasParagraphs = output.split("\n\n").length > 1;

        if (hasHeadings) score += 0.25;
        if (hasParagraphs) score += 0.25;
    } else if (requiredFormat === "json") {
        try {
            JSON.parse(output);
            score += 0.5;
        } catch (e) {
            // Invalid JSON
        }
    }

    return score;
}
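
As a quick illustration with made-up values, a short markdown answer inside the length bounds, with a heading and two paragraphs, earns full credit:
// Within bounds (min 10, max 500), heading present, two paragraphs -> 1.0
handler("", "## Title\n\nBody paragraph with enough detail.", {
    min_length: 10,
    max_length: 500,
}); // 1.0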

Pattern 6: Keyword and SEO Validation

Check for required keywords and SEO elements:
function handler(input, output, expectedOutput) {
    const keywords = expectedOutput?.keywords || [];
    const outputLower = output.toLowerCase();

    let score = 0;

    // Check meta description (30%)
    const hasMetaDesc = outputLower.includes("meta description") ||
                        outputLower.includes("meta:") ||
                        output.includes("**Meta Description**");
    if (hasMetaDesc) {
        score += 0.3;
    }

    // Check keyword usage (40%)
    if (keywords.length > 0) {
        const keywordsFound = keywords.filter(kw =>
            outputLower.includes(kw.toLowerCase())
        );
        score += (keywordsFound.length / keywords.length) * 0.4;
    } else {
        score += 0.2; // Partial credit if no keywords specified
    }

    // Check heading structure (30%)
    const hasH1 = output.includes("# ") && !output.startsWith("## ");
    const hasH2 = output.includes("## ");

    if (hasH1) score += 0.15;
    if (hasH2) score += 0.15;

    return Math.min(score, 1);
}
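
For instance (hypothetical keywords and content), a post with a meta description, both heading levels, and one of two target keywords scores about 0.8:
// Meta description (0.3) + 1 of 2 keywords (0.2) + H1 and H2 (0.3) ≈ 0.8
const blogPost = "# AI Observability Guide\n\n**Meta Description**: Learn the basics.\n\n## Why it matters\n\nObservability helps you debug production agents.";
handler("", blogPost, { keywords: ["observability", "tracing"] }); // ≈ 0.8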

Pattern 7: Latency Threshold Check

Validate response time meets SLA:
function handler(input, output, expectedOutput) {
    const latencyMs = expectedOutput?.latency_ms;
    const threshold = expectedOutput?.threshold_ms || 2000;

    if (!latencyMs) {
        return 1; // No latency data, pass by default
    }

    if (latencyMs <= threshold) {
        return 1;
    } else if (latencyMs <= threshold * 1.5) {
        return 0.5; // Partial credit for close
    } else {
        return 0;
    }
}
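
This assumes each dataset row records the measured latency in its expected output. With hypothetical numbers, a response that misses the SLA but stays within 1.5x of it gets partial credit:
// 2600 ms against a 2000 ms SLA: over threshold but under 1.5x -> 0.5
handler("", "", { latency_ms: 2600, threshold_ms: 2000 }); // 0.5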

Multi-Stage Evaluation Patterns

Pattern 8: Pipeline Stage Quality

Evaluate quality at each stage of a multi-step pipeline.

Writer Quality Evaluator (LLM-as-Judge):
Evaluate the quality of this draft article.

Article:
{output}

Topic:
{input}

Score from 0 to 1 based on:
- Coherence: Does the article flow logically? (0.3 weight)
- Coverage: Does it cover the topic comprehensively? (0.3 weight)
- Engagement: Is it interesting to read? (0.2 weight)
- Structure: Does it have clear intro, body, conclusion? (0.2 weight)

Respond with only a decimal number.
Editor Effectiveness Evaluator (Code):
function handler(input, output, expectedOutput) {
    // input.draft = the original draft
    // output = the edited version

    const draft = input?.draft || "";
    const edited = output || "";

    // No changes = no improvement
    if (draft === edited) {
        return 0;
    }

    // Calculate changes
    const lengthRatio = edited.length / Math.max(draft.length, 1);

    // Edited length should stay within 0.7x to 1.3x of the draft
    if (lengthRatio < 0.7 || lengthRatio > 1.3) {
        return 0.5; // Major length change might indicate issues
    }

    let score = 0.5; // Base score for making changes

    // Reward similar length (tight editing)
    if (lengthRatio >= 0.9 && lengthRatio <= 1.1) {
        score += 0.3;
    }

    // Reward if output is valid markdown
    if (edited.includes("# ") || edited.includes("## ")) {
        score += 0.2;
    }

    return Math.min(score, 1);
}
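
A hypothetical invocation: an edit that keeps the length close to the draft and preserves markdown headings scores 1.0:
const draftText = "## Intro\n\nOur agent platform traces every step of the pipeline for you.";
const editedText = "## Intro\n\nOur agent platform traces each step of the pipeline for you.";
handler({ draft: draftText }, editedText, null); // 1.0 (similar length, markdown kept)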

Pattern 9: End-to-End Pipeline Quality

Evaluate the final output holistically:
Evaluate this content for publication readiness.

Content:
{output}

Original Request:
{input}

Score from 0 to 1 based on:
- Informativeness (0.25): Does it provide valuable information?
- Readability (0.25): Is it easy to read and understand?
- SEO Optimization (0.25): Does it have proper structure and keywords?
- Professionalism (0.25): Is it publication-ready?

Respond with only a decimal number.

Composite Scoring Patterns

Pattern 10: Weighted Multi-Evaluator Score

Combine multiple evaluator results into a single score:
function handler(input, output, expectedOutput) {
    // Individual scores from other evaluators (passed via expectedOutput)
    const scores = expectedOutput?.evaluator_scores || {};

    const weights = {
        content_quality: 0.4,
        factual_accuracy: 0.3,
        seo_score: 0.2,
        format_validation: 0.1,
    };

    let totalWeight = 0;
    let weightedSum = 0;

    for (const [evaluator, weight] of Object.entries(weights)) {
        if (evaluator in scores) {
            weightedSum += scores[evaluator] * weight;
            totalWeight += weight;
        }
    }

    if (totalWeight === 0) {
        return 0;
    }

    return weightedSum / totalWeight;
}
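
For example, with hypothetical scores where the format_validation result is missing, the remaining weights are renormalized:
// (0.9*0.4 + 0.8*0.3 + 0.6*0.2) / (0.4 + 0.3 + 0.2) = 0.72 / 0.9 ≈ 0.8
handler("", "", {
    evaluator_scores: { content_quality: 0.9, factual_accuracy: 0.8, seo_score: 0.6 },
}); // ≈ 0.8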

Pattern 11: Threshold-Based Pass/Fail

Convert numerical scores to binary pass/fail with multiple conditions:
function handler(input, output, expectedOutput) {
    const scores = expectedOutput?.scores || {};

    // Define minimum thresholds
    const thresholds = {
        accuracy: 0.8,    // Must be >= 80%
        completeness: 0.7, // Must be >= 70%
        latency: 0.5,     // Must be >= 50% (under threshold)
    };

    // All conditions must pass
    for (const [metric, threshold] of Object.entries(thresholds)) {
        if (metric in scores && scores[metric] < threshold) {
            return 0; // Fail if any metric below threshold
        }
    }

    return 1; // Pass if all conditions met
}
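
A quick illustration with made-up scores: a single metric below its threshold fails the whole check:
// completeness 0.65 < 0.7 -> fail
handler("", "", { scores: { accuracy: 0.85, completeness: 0.65, latency: 0.9 } }); // 0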

Best Practices

1. Start Simple, Add Complexity

Begin with basic evaluators and refine based on failure patterns:
// Start: Simple length check
function handler(input, output, expectedOutput) {
    return output.length > 100 ? 1 : 0;
}

// Evolve: Add more criteria as you learn
function handler(input, output, expectedOutput) {
    let score = 0;
    if (output.length > 100) score += 0.3;
    if (output.includes("## ")) score += 0.3;
    // ... add more criteria based on observed failures
    return score;
}

2. Use LLM-as-Judge for Subjective Criteria

Reserve LLM evaluators for things that require understanding:
Use LLM-as-Judge                | Use Code Evaluator
"Is this helpful?"              | "Is it valid JSON?"
"Is the tone professional?"     | "Is it under 1000 chars?"
"Does it answer the question?"  | "Does it contain required fields?"

3. Calibrate Pass Criteria

Start with lenient thresholds and tighten based on results:
Phase 1: >= 0.5 (learn what fails)
Phase 2: >= 0.6 (after prompt improvements)
Phase 3: >= 0.7 (production threshold)

4. Include Clear Scoring Rubrics

Make LLM-as-Judge evaluators consistent by providing explicit criteria:
Score exactly as follows:
- 1.0 = Meets all criteria perfectly
- 0.8 = Meets all criteria with minor issues
- 0.6 = Meets most criteria
- 0.4 = Meets some criteria
- 0.2 = Meets few criteria
- 0.0 = Fails to meet criteria

Do not use values between these levels.

Summary

You’ve learned how to build custom evaluators for domain-specific quality measurement:
  • LLM-as-Judge evaluators handle subjective quality assessment
  • Code Evaluators handle deterministic checks and business rules
  • Multi-stage evaluation catches quality issues at each pipeline step
  • Composite scoring combines multiple metrics into actionable scores

Key Takeaways

  1. Match evaluator type to what you’re measuring
  2. Start simple and add complexity based on observed failures
  3. Provide clear rubrics for consistent LLM-as-Judge scoring
  4. Combine multiple evaluators for comprehensive coverage
