Open in Google Colab
Run the complete notebook in your browser
What You’ll Learn
Build Comprehensive Test Datasets
Create test cases with expected answers to benchmark your RAG system
Configure LLM-as-Judge Evaluators
Set up evaluators for retrieval quality, answer correctness, and faithfulness
Execute Systematic Test Runs
Run evaluation suites and collect metrics across your entire dataset
Analyze Results & Iterate
Interpret results, identify failure patterns, and improve your pipeline
Prerequisites:
- Python >=3.10, <3.14
- OpenAI API key
- Netra API key (Get started here)
- A RAG pipeline with Netra tracing configured
Step 0: Install Packages
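The original install cell isn't reproduced here, so the cell below is a sketch: `openai` is the official OpenAI SDK, while `netra-sdk`, `fpdf2`, and `pypdf` are assumed package names for the Netra client and the PDF utilities used later in this notebook. Check the Netra docs for the exact package to install.

```python
# Assumed package names — verify `netra-sdk` (and the PDF helpers) against
# the official Netra installation instructions before running.
%pip install --quiet openai netra-sdk fpdf2 pypdf numpy
```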
Step 1: Set Environment Variables
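A typical notebook cell prompts for any key that isn't already exported. The variable names below (`OPENAI_API_KEY`, `NETRA_API_KEY`) are the conventional ones; confirm the exact names the Netra SDK expects.

```python
import os
from getpass import getpass

# Prompt only for keys that are not already set in the environment.
for var in ("OPENAI_API_KEY", "NETRA_API_KEY"):
    if not os.environ.get(var):
        os.environ[var] = getpass(f"Enter {var}: ")
```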
Step 2: Initialize Netra
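The initialization call depends on the Netra SDK version and isn't shown on this page, so treat the snippet below purely as a placeholder shape: the import path, class name, and parameter names are assumptions to be replaced with the entry point documented in the Netra SDK reference.

```python
import os

# Assumed import path and initialization signature — replace with the
# actual call documented for your version of the Netra SDK.
from netra import Netra

Netra.init(
    api_key=os.environ["NETRA_API_KEY"],   # assumed parameter name
    app_name="rag-evaluation-notebook",    # assumed parameter name
)
```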
Step 3: Create or Import Your RAG Pipeline
You can use an existing RAG pipeline or build one for demonstration. For this example, we’ll create a minimal RAG pipeline.
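If you don't already have an instrumented pipeline, a small in-memory one is enough to follow along. The sketch below assumes the `openai` Python SDK (v1) and `numpy`; the `MiniRAG` class and its `retrieve`/`answer` methods are illustrative names for this notebook, not part of any SDK, and the model choices (`text-embedding-3-small`, `gpt-4o-mini`) are examples you can swap.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(texts: list[str]) -> np.ndarray:
    """Embed a batch of texts with an OpenAI embedding model."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in resp.data])

class MiniRAG:
    """A deliberately small in-memory RAG pipeline for demonstration."""

    def __init__(self, chunks: list[str]):
        self.chunks = chunks
        self.vectors = embed(chunks)

    def retrieve(self, question: str, top_k: int = 3) -> list[str]:
        """Return the top_k chunks by cosine similarity to the question."""
        q = embed([question])[0]
        sims = self.vectors @ q / (
            np.linalg.norm(self.vectors, axis=1) * np.linalg.norm(q) + 1e-10
        )
        return [self.chunks[i] for i in np.argsort(sims)[::-1][:top_k]]

    def answer(self, question: str) -> dict:
        """Retrieve context, ask the LLM, and return both answer and context."""
        context = self.retrieve(question)
        context_text = "\n\n".join(context)
        completion = client.chat.completions.create(
            model="gpt-4o-mini",
            temperature=0,
            messages=[
                {"role": "system",
                 "content": "Answer using only the provided context. "
                            "If the context is insufficient, say so."},
                {"role": "user",
                 "content": f"Context:\n{context_text}\n\nQuestion: {question}"},
            ],
        )
        return {"answer": completion.choices[0].message.content,
                "context": context}
```

If the Netra SDK auto-instruments the OpenAI client, each `answer` call should also produce a trace you can score later from the dashboard.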
Step 4: Create a Test Dataset
Start by building a dataset of question-answer pairs that represent real usage patterns. You can:
- Create from Traces - Go to Traces in the Netra dashboard, find good question-answer pairs, and click “Add to Dataset”
- Create from Dashboard - Go to Evaluation → Datasets, click “Create Dataset”, and add test cases
- Create Programmatically - Use this notebook to create test cases
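For the programmatic route, a list of question/expected-answer pairs is enough to drive the rest of this notebook. The field names below are a local convention for this example, not a required Netra schema, and the questions assume the machine-learning sample document created in Step 6.

```python
# Small illustrative dataset; in practice, aim for questions that mirror
# real user traffic, including edge cases and questions the corpus
# cannot answer.
test_dataset = [
    {
        "question": "What is supervised learning?",
        "expected_answer": "Learning a mapping from inputs to outputs using labeled examples.",
    },
    {
        "question": "What is overfitting?",
        "expected_answer": "When a model memorizes the training data and fails to generalize to new data.",
    },
    {
        "question": "How does unsupervised learning differ from supervised learning?",
        "expected_answer": "Unsupervised learning finds structure in unlabeled data instead of learning from labeled examples.",
    },
]
```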
Step 5: Define Evaluators
Create evaluators in the Netra dashboard under Evaluation → Evaluators → Add Evaluator. For RAG pipelines, we recommend:
- Context Relevance - Checks if retrieved chunks contain relevant information (score >= 0.7)
- Answer Correctness - Compares generated answer against expected answer (score >= 0.7)
- Faithfulness - Verifies answer is grounded in retrieved context (score >= 0.8)
Step 6: Create Sample PDF and Initialize Chatbot
For this example, we’ll create a sample PDF with machine learning content.
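The cell below is one way to do this, assuming the `fpdf2` and `pypdf` packages from the install step; any PDF writer and reader would work. The chunk size and the `MiniRAG` class come from the earlier sketch and are illustrative, not prescriptive.

```python
from fpdf import FPDF        # fpdf2 package
from pypdf import PdfReader

ML_CONTENT = (
    "Machine learning is the study of algorithms that improve through experience. "
    "Supervised learning learns a mapping from inputs to outputs using labeled "
    "examples, while unsupervised learning finds structure in unlabeled data. "
    "Overfitting occurs when a model memorizes the training data and fails to "
    "generalize to new data; regularization and more data help prevent it."
)

# Write the sample PDF.
pdf = FPDF()
pdf.add_page()
pdf.set_font("Helvetica", size=12)
pdf.multi_cell(0, 8, ML_CONTENT)
pdf.output("ml_sample.pdf")

# Read it back, chunk it naively, and initialize the chatbot/pipeline.
reader = PdfReader("ml_sample.pdf")
text = "\n".join(page.extract_text() or "" for page in reader.pages)
chunks = [text[i:i + 300] for i in range(0, len(text), 300)]  # no overlap, for brevity
rag = MiniRAG(chunks)
```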
Step 7: Run Evaluation on Test Dataset
Execute your RAG pipeline against each test case in your dataset.
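A minimal run loops over the test cases and records what the pipeline produced next to what was expected; if Netra tracing is active, each iteration should also emit a trace that the dashboard evaluators can score. The result field names below are local to this notebook.

```python
results = []
for case in test_dataset:
    output = rag.answer(case["question"])
    results.append({
        "question": case["question"],
        "expected_answer": case["expected_answer"],
        "generated_answer": output["answer"],
        "retrieved_context": output["context"],
    })
    print(f"Q: {case['question']}")
    print(f"A: {output['answer']}\n")
```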
Step 8: Analyze Results
Review the evaluation results and identify patterns in failures.
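The LLM-as-Judge scores live in the Netra dashboard, but a crude local check helps you spot obvious failures before opening traces. The lexical-overlap heuristic below is only a sanity check, not a replacement for the evaluators configured in Step 5.

```python
def token_overlap(expected: str, generated: str) -> float:
    """Fraction of expected-answer tokens that appear in the generated answer."""
    expected_tokens = set(expected.lower().split())
    generated_tokens = set(generated.lower().split())
    return len(expected_tokens & generated_tokens) / max(len(expected_tokens), 1)

for r in results:
    score = token_overlap(r["expected_answer"], r["generated_answer"])
    flag = "REVIEW" if score < 0.3 else "ok"
    print(f"[{flag}] overlap={score:.2f}  {r['question']}")
```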
Step 9: Using the Netra Dashboard for Full Evaluation
For complete evaluation with LLM-as-Judge scoring:
- Create a dataset in the Netra dashboard (Evaluation → Datasets)
- Configure evaluators (Evaluation → Evaluators)
- Run test suite: Get your dataset ID and use the API below
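The exact request isn't reproduced on this page, so the snippet below only shows the general shape of triggering a test run over HTTP: the URL, path, and payload field names are hypothetical placeholders and must be replaced with the endpoint and schema from the Netra API reference (or the equivalent SDK helper, if one exists).

```python
import os
import requests

DATASET_ID = "your-dataset-id"  # copy from Evaluation → Datasets in the dashboard

# Hypothetical endpoint and payload — substitute the real values from the
# Netra API reference before running.
response = requests.post(
    "https://<your-netra-host>/api/v1/test-runs",
    headers={"Authorization": f"Bearer {os.environ['NETRA_API_KEY']}"},
    json={"dataset_id": DATASET_ID, "name": "rag-baseline-run"},
    timeout=30,
)
response.raise_for_status()
print(response.json())
```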
Step 10: Interpreting Evaluation Scores
When analyzing your evaluation results, look for patterns:

| Low Score In | Likely Cause | How to Fix |
|---|---|---|
| Context Relevance | Wrong chunks retrieved | Increase top_k, reduce chunk size, add overlap |
| Answer Correctness | LLM misinterprets context | Improve system prompt, lower temperature |
| Faithfulness | Model hallucinates | Add explicit grounding instructions, use stronger model |
For each failing case, open its trace in the Netra dashboard to inspect:
- The exact chunks that were retrieved
- Similarity scores for each chunk
- The full prompt sent to the LLM
- Token usage and latency
Continuous Evaluation Strategy
For production RAG systems, run evaluations regularly:
- On every deployment — Run your test suite in CI/CD before releasing changes (see the pytest sketch after this list)
- Weekly benchmarks — Track quality trends over time
- After prompt changes — Measure the impact of system prompt modifications
- After parameter tuning — Validate that chunk size or top_k changes improve quality
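One lightweight way to enforce the deployment check is a pytest regression test that reruns the dataset and fails the build when quality drops below a baseline. The module path, threshold, and helpers below are illustrative; they assume the pipeline, dataset, and `token_overlap` helper from this notebook are importable from your project.

```python
# test_rag_regression.py — run with `pytest` in CI before each release.
import pytest

# Assumed project module exposing the objects defined earlier in this notebook.
from my_rag_project import rag, test_dataset, token_overlap

MIN_OVERLAP = 0.3  # illustrative threshold; calibrate against your own baseline

@pytest.mark.parametrize("case", test_dataset, ids=lambda c: c["question"][:40])
def test_answer_quality(case):
    output = rag.answer(case["question"])
    score = token_overlap(case["expected_answer"], output["answer"])
    assert score >= MIN_OVERLAP, f"Quality regression on: {case['question']}"
```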
Documentation Links
Summary
You’ve learned how to systematically evaluate RAG pipeline quality:
- Build comprehensive test datasets with expected answers to benchmark your RAG system
- Configure LLM-as-Judge evaluators for retrieval quality, answer correctness, and faithfulness
- Execute systematic test runs and collect metrics across your entire dataset
- Interpret results, identify failure patterns, and improve your pipeline