Open in Google Colab
Run the complete notebook in your browser
What You’ll Learn
LLM-as-Judge Evaluators
Build evaluators that use LLMs to assess quality, relevance, and accuracy
Code Evaluators
Create deterministic evaluators with custom JavaScript logic
Multi-Stage Evaluation
Measure quality at each stage of a multi-step pipeline
Composite Scoring
Combine multiple metrics into a single quality score
Prerequisites:
- Python >=3.10, <3.14
- OpenAI API key
- Netra API key (Get started here)
- A test dataset with expected outputs
Evaluator Types
Netra supports two evaluator types:
| Type | Use Case | Strengths |
|---|---|---|
| LLM-as-Judge | Subjective quality, nuance, semantic meaning | Handles ambiguity, understands context |
| Code Evaluator | Deterministic checks, business rules, structure | Fast, consistent, no LLM cost |
- LLM-as-Judge: “Is this answer helpful?” “Is the tone professional?”
- Code Evaluator: “Does it contain required fields?” “Is latency under threshold?”
LLM-as-Judge Patterns
Pattern 1: Content Quality Scoring
Evaluate content across multiple dimensions with weighted scoring.
Create in Dashboard:
- Go to Evaluation → Evaluators → Add Evaluator
- Select LLM-as-Judge type
- Configure the prompt:
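The prompt itself is up to you; the sketch below keeps the output numerical so it lines up with the >= 0.7 pass criterion, and the `{{output}}` placeholder is an assumption standing in for whatever variable your evaluator exposes:

```
You are evaluating a piece of generated content.

Score each dimension from 0 to 1:
- Clarity (weight 0.3): is the writing easy to follow?
- Relevance (weight 0.4): does it address the stated topic and audience?
- Completeness (weight 0.3): does it cover the key points without padding?

Return a single number between 0 and 1: the weighted sum of the three scores.

Content to evaluate:
{{output}}
```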
| Setting | Value |
|---|---|
| Output Type | Numerical |
| Pass Criteria | >= 0.7 |
| LLM Provider | OpenAI (or your preference) |
| Model | gpt-4o-mini |
Pattern 2: Factual Accuracy Check
Compare generated content against a reference for factual accuracy:
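Treat this prompt as a starting point rather than a fixed template; `{{reference}}` and `{{output}}` are assumed placeholders for the expected and generated text:

```
Compare the generated answer against the reference text.

Reference:
{{reference}}

Generated answer:
{{output}}

Return a single number:
1.0 if every factual claim is supported by the reference,
0.5 if it is mostly accurate but includes minor unsupported details,
0.0 if it contradicts the reference or fabricates facts.
```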
Pattern 3: Tone and Style Evaluation
Assess whether content matches the expected tone:
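A minimal sketch, assuming the expected tone is spelled out in the prompt itself; swap in whatever tone your style guide calls for:

```
Evaluate whether the content below matches the expected tone:
professional, concise, and friendly.

Content:
{{output}}

Return a score between 0 and 1, where 1 means the tone matches throughout
and 0 means it clearly does not.
```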
Code Evaluator Patterns
Pattern 4: JSON Structure Validation
Validate that output contains required fields; a code sketch follows the settings table below.
| Setting | Value |
|---|---|
| Output Type | Numerical |
| Pass Criteria | >= 1.0 |
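Netra's exact code-evaluator contract isn't reproduced here, so treat the following as a sketch: it assumes the evaluator receives the raw output as a string and returns a number, with 1 satisfying the >= 1.0 pass criterion. The field names are placeholders.

```javascript
// Sketch of a JSON structure check. Assumed contract: output string in, number out.
function evaluate(output) {
  const requiredFields = ["title", "summary", "tags"]; // hypothetical required fields

  let parsed;
  try {
    parsed = JSON.parse(output);
  } catch (err) {
    return 0; // not valid JSON at all
  }

  // Pass only if every required field is present and non-empty.
  const allPresent = requiredFields.every(
    (field) => parsed[field] !== undefined && parsed[field] !== ""
  );
  return allPresent ? 1 : 0;
}
```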
Pattern 5: Length and Format Validation
Check that content meets length requirements:
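A sketch under the same assumed contract; the word-count bounds and formatting checks are examples, not requirements:

```javascript
// Sketch: word count plus basic formatting checks, each worth a third of the score.
function evaluate(output) {
  const words = output.trim().split(/\s+/).length;

  const withinLength = words >= 300 && words <= 1200; // example bounds
  const hasHeading = /^#\s+/m.test(output);           // at least one markdown heading
  const hasParagraphBreaks = output.includes("\n\n"); // more than one paragraph

  const checks = [withinLength, hasHeading, hasParagraphBreaks];
  return checks.filter(Boolean).length / checks.length;
}
```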
Pattern 6: Keyword and SEO Validation
Check for required keywords and SEO elements:
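Again a sketch: the keyword list and the title-length heuristic are stand-ins for your own SEO checklist.

```javascript
// Sketch: keyword coverage weighted against a simple title heuristic.
function evaluate(output) {
  const requiredKeywords = ["observability", "llm evaluation"]; // hypothetical keywords
  const text = output.toLowerCase();

  const hits = requiredKeywords.filter((kw) => text.includes(kw)).length;
  const keywordScore = hits / requiredKeywords.length;

  // SEO heuristic: first line acts as a title and stays within ~70 characters.
  const title = output.split("\n")[0] || "";
  const titleScore = title.length > 0 && title.length <= 70 ? 1 : 0;

  return 0.7 * keywordScore + 0.3 * titleScore; // keyword coverage weighted higher
}
```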
Pattern 7: Latency Threshold Check
Validate that response time meets the SLA:
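This sketch assumes the evaluator can see a latency value alongside the output; the `metadata` parameter and `latencyMs` field are assumptions, so wire in whatever your traces actually record:

```javascript
// Sketch: binary SLA check on recorded latency.
function evaluate(output, metadata) {
  const SLA_MS = 2000; // example SLA
  const latency = metadata && metadata.latencyMs; // field name is an assumption

  if (typeof latency !== "number") {
    return 0; // treat a missing latency value as a failure
  }
  return latency <= SLA_MS ? 1 : 0;
}
```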
Multi-Stage Evaluation Patterns
Pattern 8: Pipeline Stage Quality
Evaluate quality at each stage of a multi-step pipeline.
Writer Quality Evaluator (LLM-as-Judge):
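A sketch of what a writer-stage judge prompt could look like; `{{stage_input}}` and `{{stage_output}}` are assumed placeholders for the stage's input and output:

```
You are judging the "writer" stage of a multi-step content pipeline.

Research notes given to the writer:
{{stage_input}}

Writer output:
{{stage_output}}

Score from 0 to 1 how well the draft uses the research notes:
coverage of the key facts, logical structure, and no invented claims.
Return only the number.
```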
Pattern 9: End-to-End Pipeline Quality
Evaluate the final output holistically:
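A sketch for the end-to-end judge; it deliberately ignores intermediate stages and scores only the final output against the original request:

```
You are judging the final output of a multi-step pipeline.

Original request:
{{input}}

Final output:
{{output}}

Ignore how the output was produced. Score from 0 to 1 how well it satisfies
the original request: correctness, completeness, and readability.
Return only the number.
```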
Composite Scoring Patterns
Pattern 10: Weighted Multi-Evaluator Score
Combine multiple evaluator results into a single score:
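How the individual scores reach the combiner depends on your setup; the sketch below only shows the weighting logic, with evaluator names and weights as placeholders:

```javascript
// Sketch: weighted combination of individual evaluator scores (names are placeholders).
function compositeScore(scores) {
  const weights = {
    accuracy: 0.4,
    quality: 0.3,
    structure: 0.2,
    latency: 0.1,
  };

  let total = 0;
  for (const [name, weight] of Object.entries(weights)) {
    total += weight * (scores[name] ?? 0); // a missing score counts as 0
  }
  return total; // stays within 0-1 if the inputs do
}
```

For example, scores of 0.9 (accuracy), 0.8 (quality), 1.0 (structure), and 1.0 (latency) combine to 0.90.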
Pattern 11: Threshold-Based Pass/Fail
Convert numerical scores to binary pass/fail with multiple conditions:
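A sketch of the pass/fail conversion: every metric must clear its own threshold, and the thresholds here are illustrative only:

```javascript
// Sketch: a run passes (returns 1) only if every metric meets its threshold.
function passFail(scores) {
  const thresholds = {
    accuracy: 0.8,  // hard requirement on factual accuracy
    quality: 0.7,
    structure: 1.0, // structural checks must fully pass
  };

  const passed = Object.entries(thresholds).every(
    ([name, min]) => (scores[name] ?? 0) >= min
  );
  return passed ? 1 : 0;
}
```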
Best Practices
1. Start Simple, Add Complexity
Begin with basic evaluators (for example, a single relevance check) and add dimensions as failure patterns emerge.
2. Use LLM-as-Judge for Subjective Criteria
Reserve LLM evaluators for criteria that require semantic understanding:
| Use LLM-as-Judge | Use Code Evaluator |
|---|---|
| "Is this helpful?" | "Is it valid JSON?" |
| "Is the tone professional?" | "Is it under 1000 chars?" |
| "Does it answer the question?" | "Does it contain required fields?“ |
3. Calibrate Pass Criteria
Start with lenient thresholds and tighten them as results come in; for example, begin with a pass criterion of >= 0.6 and raise it toward >= 0.8 once most runs clear it comfortably.
4. Include Clear Scoring Rubrics
Make LLM-as-Judge evaluators consistent by providing explicit criteria:
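One way to do this is to bake the rubric directly into the judge prompt; the bands below are an example, not a canonical scale:

```
Score the response from 0 to 1 using this rubric:

1.0: fully answers the question, factually correct, appropriate tone
0.75: answers the question with minor omissions or slightly off tone
0.5: partially answers the question, or contains one factual error
0.25: mostly misses the question, or contains several errors
0.0: does not address the question, or is factually wrong throughout

Return only the number.
```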
Summary
You’ve learned how to build custom evaluators for domain-specific quality measurement:
- LLM-as-Judge evaluators handle subjective quality assessment
- Code Evaluators handle deterministic checks and business rules
- Multi-stage evaluation catches quality issues at each pipeline step
- Composite scoring combines multiple metrics into actionable scores
Key Takeaways
- Match evaluator type to what you’re measuring
- Start simple and add complexity based on observed failures
- Provide clear rubrics for consistent LLM-as-Judge scoring
- Combine multiple evaluators for comprehensive coverage