This cookbook shows you how to build custom evaluators for domain-specific quality measurement—from content quality scoring to multi-stage pipeline validation and business logic checks.

Open in Google Colab to run the complete notebook in your browser.

What You’ll Learn

  • LLM-as-Judge Evaluators: Build evaluators that use LLMs to assess quality, relevance, and accuracy
  • Code Evaluators: Create deterministic evaluators with custom JavaScript logic
  • Multi-Stage Evaluation: Measure quality at each stage of a multi-step pipeline
  • Composite Scoring: Combine multiple metrics into a single quality score

Prerequisites:
  • Python >=3.10, <3.14
  • OpenAI API key
  • Netra API key (Get started here)
  • A test dataset with expected outputs

Evaluator Types

Netra supports two evaluator types:
Type           | Use Case                                         | Strengths
LLM-as-Judge   | Subjective quality, nuance, semantic meaning     | Handles ambiguity, understands context
Code Evaluator | Deterministic checks, business rules, structure  | Fast, consistent, no LLM cost
Choose based on what you’re measuring:
  • LLM-as-Judge: “Is this answer helpful?” “Is the tone professional?”
  • Code Evaluator: “Does it contain required fields?” “Is latency under threshold?”

LLM-as-Judge Patterns

Pattern 1: Content Quality Scoring

Evaluate content across multiple dimensions with weighted scoring.

Create in Dashboard:
  1. Go to Evaluation → Evaluators → Add Evaluator
  2. Select LLM-as-Judge type
  3. Configure the prompt:
Evaluate the quality of this content on a scale of 0 to 1.

Score based on these weighted criteria:
- Coherence (0.25): Does the content flow logically?
- Accuracy (0.25): Is the information factually correct?
- Completeness (0.25): Does it cover the topic adequately?
- Engagement (0.25): Is it interesting and well-written?

Content to evaluate:
{output}

Original request:
{input}

Respond with only a decimal number between 0 and 1.
Setting       | Value
Output Type   | Numerical
Pass Criteria | >= 0.7
LLM Provider  | OpenAI (or your preference)
Model         | gpt-4o-mini
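
To make the template variables concrete, here is a hypothetical dataset row this evaluator might score; the field names mirror the {input} and {output} placeholders above, and the values are purely illustrative:
// Hypothetical dataset row; {input} and {output} in the prompt are filled from these fields
const exampleRow = {
    input: "Write a short explainer on retrieval-augmented generation for a developer blog.",
    output: "Retrieval-augmented generation (RAG) pairs a language model with a search step...",
};
The judge renders the prompt with these values and returns a decimal that is compared against the pass criteria (>= 0.7).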

Pattern 2: Factual Accuracy Check

Compare generated content against a reference for factual accuracy:
Compare the generated answer against the reference answer for factual accuracy.

Generated Answer:
{output}

Reference Answer:
{expected_output}

Evaluate on these criteria:
1. Are all key facts from the reference present in the generated answer?
2. Does the generated answer contain any factual errors?
3. Does the generated answer add any unverified claims?

Score from 0 to 1 where:
- 1.0 = Fully accurate, all key facts present
- 0.7-0.9 = Mostly accurate, minor omissions
- 0.4-0.6 = Partially accurate, significant gaps
- 0.0-0.3 = Inaccurate or missing key facts

Respond with only a decimal number.
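
If you also want a cheap deterministic signal next to this judge, a code evaluator can approximate fact coverage with simple token overlap. This is a rough illustrative sketch (it treats expected_output as the plain reference string and uses deliberately naive tokenization), not a substitute for the LLM check:
function handler(input, output, expectedOutput) {
    // Naive fact-coverage check: what fraction of the reference's
    // longer words appear somewhere in the generated answer?
    const reference = String(expectedOutput || "").toLowerCase();
    const generated = String(output || "").toLowerCase();

    const referenceTerms = reference
        .split(/\W+/)
        .filter(term => term.length > 4); // skip short, common words

    if (referenceTerms.length === 0) {
        return 1; // nothing to check against
    }

    const covered = referenceTerms.filter(term => generated.includes(term));
    return covered.length / referenceTerms.length;
}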

Pattern 3: Tone and Style Evaluation

Assess whether content matches expected tone:
Evaluate whether this content matches the expected tone and style.

Content:
{output}

Expected tone: {expected_output.tone}
Target audience: {expected_output.audience}

Evaluate:
1. Is the vocabulary appropriate for the audience?
2. Is the tone consistent throughout?
3. Does it avoid inappropriate language or jargon?

Score from 0 to 1 where:
- 1.0 = Perfect tone match
- 0.7-0.9 = Good match with minor adjustments needed
- 0.4-0.6 = Tone partially correct
- 0.0-0.3 = Tone mismatch

Respond with only a decimal number.
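
The {expected_output.tone} and {expected_output.audience} placeholders assume each dataset row stores structured metadata in its expected output. A hypothetical row's expected output might look like this:
// Hypothetical expected_output for a tone/style test case
const expectedOutput = {
    tone: "professional but friendly",
    audience: "non-technical small business owners",
};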

Code Evaluator Patterns

Pattern 4: JSON Structure Validation

Validate that output contains required fields:
function handler(input, output, expectedOutput) {
    // Parse JSON output
    let result;
    try {
        result = JSON.parse(output);
    } catch (e) {
        return 0; // Invalid JSON
    }

    // Define required fields
    const requiredFields = expectedOutput?.required_fields ||
                          ["summary", "action_items"];

    // Check for each required field
    let fieldsPresent = 0;
    for (const field of requiredFields) {
        if (field in result && result[field]) {
            fieldsPresent++;
        }
    }

    return fieldsPresent / requiredFields.length;
}
Setting       | Value
Output Type   | Numerical
Pass Criteria | >= 1.0
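
As a quick illustration with made-up values, an output that parses but is missing one of two required fields scores 0.5 and fails the >= 1.0 criterion:
// Output with only "summary" present: 1 of 2 required fields -> 0.5
const sampleOutput = JSON.stringify({ summary: "Weekly sync notes" });
handler("", sampleOutput, { required_fields: ["summary", "action_items"] }); // 0.5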

Pattern 5: Length and Format Validation

Check content meets length requirements:
function handler(input, output, expectedOutput) {
    const minLength = expectedOutput?.min_length || 100;
    const maxLength = expectedOutput?.max_length || 2000;
    const requiredFormat = expectedOutput?.format || "markdown";

    let score = 0;

    // Check length (50% of score)
    const length = output.length;
    if (length >= minLength && length <= maxLength) {
        score += 0.5;
    } else if (length >= minLength * 0.8 && length <= maxLength * 1.2) {
        score += 0.25; // Partial credit for close
    }

    // Check format (50% of score)
    if (requiredFormat === "markdown") {
        const hasHeadings = output.includes("# ") || output.includes("## ");
        const hasParagraphs = output.split("\n\n").length > 1;

        if (hasHeadings) score += 0.25;
        if (hasParagraphs) score += 0.25;
    } else if (requiredFormat === "json") {
        try {
            JSON.parse(output);
            score += 0.5;
        } catch (e) {
            // Invalid JSON
        }
    }

    return score;
}
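
As a quick illustration with made-up values, a short markdown answer inside the length bounds, with a heading and two paragraphs, earns full credit:
// Within bounds (min 10, max 500), heading present, two paragraphs -> 1.0
handler("", "## Title\n\nBody paragraph with enough detail.", {
    min_length: 10,
    max_length: 500,
}); // 1.0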

Pattern 6: Keyword and SEO Validation

Check for required keywords and SEO elements:
function handler(input, output, expectedOutput) {
    const keywords = expectedOutput?.keywords || [];
    const outputLower = output.toLowerCase();

    let score = 0;

    // Check meta description (30%)
    const hasMetaDesc = outputLower.includes("meta description") ||
                        outputLower.includes("meta:") ||
                        output.includes("**Meta Description**");
    if (hasMetaDesc) {
        score += 0.3;
    }

    // Check keyword usage (40%)
    if (keywords.length > 0) {
        const keywordsFound = keywords.filter(kw =>
            outputLower.includes(kw.toLowerCase())
        );
        score += (keywordsFound.length / keywords.length) * 0.4;
    } else {
        score += 0.2; // Partial credit if no keywords specified
    }

    // Check heading structure (30%)
    const hasH1 = output.includes("# ") && !output.startsWith("## ");
    const hasH2 = output.includes("## ");

    if (hasH1) score += 0.15;
    if (hasH2) score += 0.15;

    return Math.min(score, 1);
}
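
For instance (hypothetical keywords and content), a post with a meta description, both heading levels, and one of two target keywords scores about 0.8:
// Meta description (0.3) + 1 of 2 keywords (0.2) + H1 and H2 (0.3) ≈ 0.8
const blogPost = "# AI Observability Guide\n\n**Meta Description**: Learn the basics.\n\n## Why it matters\n\nObservability helps you debug production agents.";
handler("", blogPost, { keywords: ["observability", "tracing"] }); // ≈ 0.8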

Pattern 7: Latency Threshold Check

Validate response time meets SLA:
function handler(input, output, expectedOutput) {
    const latencyMs = expectedOutput?.latency_ms;
    const threshold = expectedOutput?.threshold_ms || 2000;

    if (!latencyMs) {
        return 1; // No latency data, pass by default
    }

    if (latencyMs <= threshold) {
        return 1;
    } else if (latencyMs <= threshold * 1.5) {
        return 0.5; // Partial credit for close
    } else {
        return 0;
    }
}
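
This assumes each dataset row records the measured latency in its expected output. With hypothetical numbers, a response that misses the SLA but stays within 1.5x of it gets partial credit:
// 2600 ms against a 2000 ms SLA: over threshold but under 1.5x -> 0.5
handler("", "", { latency_ms: 2600, threshold_ms: 2000 }); // 0.5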

Multi-Stage Evaluation Patterns

Pattern 8: Pipeline Stage Quality

Evaluate quality at each stage of a multi-step pipeline.

Writer Quality Evaluator (LLM-as-Judge):
Evaluate the quality of this draft article.

Article:
{output}

Topic:
{input}

Score from 0 to 1 based on:
- Coherence: Does the article flow logically? (0.3 weight)
- Coverage: Does it cover the topic comprehensively? (0.3 weight)
- Engagement: Is it interesting to read? (0.2 weight)
- Structure: Does it have clear intro, body, conclusion? (0.2 weight)

Respond with only a decimal number.
Editor Effectiveness Evaluator (Code):
function handler(input, output, expectedOutput) {
    // input.draft = the original draft
    // output = the edited version

    const draft = input?.draft || "";
    const edited = output || "";

    // No changes = no improvement
    if (draft === edited) {
        return 0;
    }

    // Calculate changes
    const lengthRatio = edited.length / Math.max(draft.length, 1);

    // Edited length should stay within 0.7x to 1.3x of the draft
    if (lengthRatio < 0.7 || lengthRatio > 1.3) {
        return 0.5; // Major length change might indicate issues
    }

    let score = 0.5; // Base score for making changes

    // Reward similar length (tight editing)
    if (lengthRatio >= 0.9 && lengthRatio <= 1.1) {
        score += 0.3;
    }

    // Reward if output is valid markdown
    if (edited.includes("# ") || edited.includes("## ")) {
        score += 0.2;
    }

    return Math.min(score, 1);
}
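
A hypothetical invocation: an edit that keeps the length close to the draft and preserves markdown headings scores 1.0:
const draftText = "## Intro\n\nOur agent platform traces every step of the pipeline for you.";
const editedText = "## Intro\n\nOur agent platform traces each step of the pipeline for you.";
handler({ draft: draftText }, editedText, null); // 1.0 (similar length, markdown kept)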

Pattern 9: End-to-End Pipeline Quality

Evaluate the final output holistically:
Evaluate this content for publication readiness.

Content:
{output}

Original Request:
{input}

Score from 0 to 1 based on:
- Informativeness (0.25): Does it provide valuable information?
- Readability (0.25): Is it easy to read and understand?
- SEO Optimization (0.25): Does it have proper structure and keywords?
- Professionalism (0.25): Is it publication-ready?

Respond with only a decimal number.

Composite Scoring Patterns

Pattern 10: Weighted Multi-Evaluator Score

Combine multiple evaluator results into a single score:
function handler(input, output, expectedOutput) {
    // Individual scores from other evaluators (passed via expectedOutput)
    const scores = expectedOutput?.evaluator_scores || {};

    const weights = {
        content_quality: 0.4,
        factual_accuracy: 0.3,
        seo_score: 0.2,
        format_validation: 0.1,
    };

    let totalWeight = 0;
    let weightedSum = 0;

    for (const [evaluator, weight] of Object.entries(weights)) {
        if (evaluator in scores) {
            weightedSum += scores[evaluator] * weight;
            totalWeight += weight;
        }
    }

    if (totalWeight === 0) {
        return 0;
    }

    return weightedSum / totalWeight;
}
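
For example, with hypothetical scores where the format_validation result is missing, the remaining weights are renormalized:
// (0.9*0.4 + 0.8*0.3 + 0.6*0.2) / (0.4 + 0.3 + 0.2) = 0.72 / 0.9 ≈ 0.8
handler("", "", {
    evaluator_scores: { content_quality: 0.9, factual_accuracy: 0.8, seo_score: 0.6 },
}); // ≈ 0.8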

Pattern 11: Threshold-Based Pass/Fail

Convert numerical scores to binary pass/fail with multiple conditions:
function handler(input, output, expectedOutput) {
    const scores = expectedOutput?.scores || {};

    // Define minimum thresholds
    const thresholds = {
        accuracy: 0.8,    // Must be >= 80%
        completeness: 0.7, // Must be >= 70%
        latency: 0.5,     // Must be >= 50% (under threshold)
    };

    // All conditions must pass
    for (const [metric, threshold] of Object.entries(thresholds)) {
        if (metric in scores && scores[metric] < threshold) {
            return 0; // Fail if any metric below threshold
        }
    }

    return 1; // Pass if all conditions met
}
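
A quick illustration with made-up scores: a single metric below its threshold fails the whole check:
// completeness 0.65 < 0.7 -> fail
handler("", "", { scores: { accuracy: 0.85, completeness: 0.65, latency: 0.9 } }); // 0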

Best Practices

1. Start Simple, Add Complexity

Begin with basic evaluators and refine based on failure patterns:
// Start: Simple length check
function handler(input, output, expectedOutput) {
    return output.length > 100 ? 1 : 0;
}

// Evolve: Add more criteria as you learn
function handler(input, output, expectedOutput) {
    let score = 0;
    if (output.length > 100) score += 0.3;
    if (output.includes("## ")) score += 0.3;
    // ... add more criteria based on observed failures
    return score;
}

2. Use LLM-as-Judge for Subjective Criteria

Reserve LLM evaluators for things that require understanding:
Use LLM-as-Judge                | Use Code Evaluator
"Is this helpful?"              | "Is it valid JSON?"
"Is the tone professional?"     | "Is it under 1000 chars?"
"Does it answer the question?"  | "Does it contain required fields?"

3. Calibrate Pass Criteria

Start with lenient thresholds and tighten based on results:
Phase 1: >= 0.5 (learn what fails)
Phase 2: >= 0.6 (after prompt improvements)
Phase 3: >= 0.7 (production threshold)

4. Include Clear Scoring Rubrics

Make LLM-as-Judge evaluators consistent by providing explicit criteria:
Score exactly as follows:
- 1.0 = Meets all criteria perfectly
- 0.8 = Meets all criteria with minor issues
- 0.6 = Meets most criteria
- 0.4 = Meets some criteria
- 0.2 = Meets few criteria
- 0.0 = Fails to meet criteria

Do not use values between these levels.

Summary

You’ve learned how to build custom evaluators for domain-specific quality measurement:
  • LLM-as-Judge evaluators handle subjective quality assessment
  • Code Evaluators handle deterministic checks and business rules
  • Multi-stage evaluation catches quality issues at each pipeline step
  • Composite scoring combines multiple metrics into actionable scores

Key Takeaways

  1. Match evaluator type to what you’re measuring
  2. Start simple and add complexity based on observed failures
  3. Provide clear rubrics for consistent LLM-as-Judge scoring
  4. Combine multiple evaluators for comprehensive coverage
