A working RAG pipeline is only the starting point. Without systematic evaluation, you have no way to know whether your retriever is fetching the right chunks, whether the LLM is faithfully using the retrieved context, or whether the generated answer actually addresses the user’s question. These failures are subtle — they don’t throw errors, they just produce worse answers. This cookbook walks you through Netra’s evaluation workflow: creating test datasets from your traces, configuring evaluators that score RAG-specific quality dimensions, running test suites, and interpreting results to improve your pipeline.
Prerequisite: You need a RAG pipeline with Netra tracing configured and at least one trace visible in your dashboard. If you haven’t set this up yet, follow the Tracing a RAG Pipeline cookbook first.

What You’ll Learn

- Create Datasets from Traces: Turn real RAG interactions into reusable test cases directly from the dashboard
- Configure Evaluators: Set up LLM-as-Judge evaluators for retrieval quality, answer correctness, and faithfulness
- Run Test Suites: Execute evaluations via the dashboard or SDK and collect quality metrics
- Analyze Results & Iterate: Interpret scores, debug failures using trace integration, and improve your pipeline

Why RAG Pipelines Need Evaluation

RAG systems have multiple failure modes that are invisible without structured evaluation:
| Failure Mode | What Goes Wrong | Why You Can’t Spot-Check It |
| --- | --- | --- |
| Irrelevant retrieval | The retriever fetches chunks that don’t contain the answer | Similarity scores look reasonable, but the content is off-topic |
| Hallucination | The LLM generates information not present in the retrieved context | The answer sounds fluent and confident, but fabricates details |
| Missed intent | The answer is factually correct but doesn’t address the user’s actual question | Only noticeable when you compare against a known-good response |
| Inconsistency | The same question gets different quality answers depending on retrieved chunks | Requires running the same inputs multiple times to detect |
Netra’s evaluation framework addresses this with Datasets (test cases with inputs and expected outputs), Evaluators (LLM-as-Judge or code-based scoring for relevance, correctness, and faithfulness), and Test Runs (execution results with pass/fail rates, scores, and linked traces). The workflow is: create test cases, attach evaluators, run, and review. See the Evaluation Overview for a deeper look at the framework.
Now, let’s walk through the process of evaluating a RAG pipeline:

Step 1: Create Evaluators

Go to Evaluation → Evaluators and add the following three evaluators from the library:
| Evaluator | What It Measures |
| --- | --- |
| Answer Correctness | Is the generated answer factually correct compared to the expected output? |
| Context Relevance | Are the retrieved chunks relevant to the question being asked? |
| Faithfulness | Is the answer grounded in the retrieved context, without hallucination? |
You can tune the prompt for each evaluator and test it in the Playground to see how it scores before using it in a dataset. See Evaluators for the full reference.
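To make the tuning step concrete, here is a rough sketch of what an LLM-as-Judge faithfulness prompt might look like. This is a generic illustration, not Netra’s built-in evaluator prompt; the template text and variable names (`context`, `answer`) are assumptions.

```python
# Illustrative LLM-as-Judge prompt for faithfulness. This is a generic
# sketch, NOT Netra's built-in evaluator prompt; the wording and the
# variable names are placeholders you would tune in the Playground.
FAITHFULNESS_PROMPT = """You are grading a RAG answer for faithfulness.

Context:
{context}

Answer:
{answer}

Does the answer contain only claims supported by the context?
Reply with a score from 0.0 (fully hallucinated) to 1.0 (fully grounded),
followed by a one-sentence justification."""


def build_judge_prompt(context: str, answer: str) -> str:
    """Fill the judge template with a retrieved context and a generated answer."""
    return FAITHFULNESS_PROMPT.format(context=context, answer=answer)
```

The key design point is that the judge sees only the retrieved context and the answer, not the original documents, so it scores grounding rather than general correctness.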

Step 2: Create a Dataset from Traces

1. Select a trace: Go to Observability → Traces and select a trace from the Tracing a RAG Pipeline cookbook, or any other RAG pipeline trace you have.
2. Add to Dataset: Click on the trace, then click Add to Dataset. Create a new dataset called “RAG Quality Dataset”.
3. Attach evaluators: Add the three evaluators you created in Step 1 to the dataset.
4. Configure variable mapping: For each evaluator, map the prompt variables to their source: Dataset item (input, expected output), Agent response (actual RAG output), or Execution data (trace metadata).
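Conceptually, each evaluator prompt variable is wired to one of the three sources. A sketch of what one such mapping expresses, shown as a plain Python dict (the field names here are illustrative, not Netra’s actual schema):

```python
# Illustrative variable mapping for a Faithfulness evaluator. Keys are the
# evaluator's prompt variables; values name where each one comes from.
# This mirrors the dashboard UI conceptually -- the dotted paths are
# placeholders, NOT Netra's real schema.
faithfulness_mapping = {
    "question": "dataset_item.input",               # from the dataset item
    "expected": "dataset_item.expected_output",     # from the dataset item
    "answer":   "agent_response.output",            # actual RAG output at run time
    "context":  "execution_data.retrieved_chunks",  # trace metadata
}


def sources_used(mapping: dict) -> set:
    """Return the distinct source objects a mapping draws from."""
    return {value.split(".")[0] for value in mapping.values()}
```

A complete mapping typically draws from all three sources: the dataset item supplies the fixed test inputs, while the agent response and execution data are filled in fresh on every run.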

Step 3: Add More Test Cases

Add one more trace with a different question to the same dataset by repeating the Add to Dataset flow. Under Evaluation → Datasets, you should now see the “RAG Quality Dataset” with two dataset items and three evaluators under the Evaluators tab, as shown in the video above. You can add more evaluators or dataset items manually from this page.

Step 4: Trigger a Test Run

Currently in Netra, test runs are triggered via code. Copy the Dataset ID from the dataset page and use the code below.
from netra import Netra

Netra.init(app_name="rag-evaluation")

# Your RAG logic — wrap your pipeline in a function that takes
# an input string and returns the generated answer.
# Tip: if you followed the tracing cookbook, you can call chatbot.chat() here.
def rag_pipeline(input_data):
    response = chatbot.chat(input_data)
    return response["answer"]

dataset = Netra.evaluation.get_dataset(dataset_id="your-dataset-id")

result = Netra.evaluation.run_test_suite(
    name="RAG Quality Evaluation",
    data=dataset,
    task=rag_pipeline,
)
For more details on the evaluation API, refer to the SDK documentation.
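If you don’t have the `chatbot` object from the tracing cookbook handy, you can smoke-test the suite wiring first with a stand-in task. This stub is purely illustrative; it will (correctly) score poorly on the evaluators, but it confirms that dataset items flow through `run_test_suite` end to end.

```python
# Deliberately trivial stand-in for rag_pipeline, useful only to verify
# the test-suite plumbing before plugging in a real RAG pipeline.
def stub_pipeline(input_data: str) -> str:
    """Echo the question back as a placeholder answer."""
    return f"Placeholder answer for: {input_data}"
```

Once the run completes with the stub, swap in your real pipeline function and re-run to get meaningful scores.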

Step 5: View Results

Go to Evaluation → Test Runs to see your test run with its status. Click on the test run to see the result for each evaluator, for each dataset item — whether it passed or failed. You can also click View Trace on any result to debug what went wrong or verify what was correct. See Test Runs for the full reference.

Interpreting Scores and Improving Quality

When evaluator scores are low, use this table to identify the likely cause and fix:
| Low Score In | Likely Cause | How to Fix |
| --- | --- | --- |
| Context Relevance | Wrong chunks retrieved | Increase top_k, reduce chunk size, add overlap between chunks |
| Answer Correctness | LLM misinterprets context | Improve the system prompt, lower temperature, use a stronger model |
| Coherence | Disjointed or repetitive response | Refine the system prompt to request structured answers |
| Faithfulness | Model hallucinating beyond context | Add explicit grounding instructions (e.g., “Only answer using the provided context”) |
After making changes to your pipeline, re-run the evaluation against the same dataset and compare results across test runs. Netra tracks all runs so you can see whether your changes improved quality.
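The faithfulness fix, for example, can be as simple as prepending a grounding instruction to your system prompt. A minimal sketch; the exact wording is up to you:

```python
# Minimal sketch of an explicit grounding instruction to reduce
# hallucination. The wording is illustrative, not prescriptive.
GROUNDING_RULE = (
    "Only answer using the provided context. "
    "If the context does not contain the answer, say you don't know."
)


def grounded_system_prompt(base_prompt: str) -> str:
    """Prepend the grounding rule to an existing system prompt."""
    return f"{GROUNDING_RULE}\n\n{base_prompt}"
```

After changing the prompt this way, a re-run against the same dataset tells you whether the Faithfulness score actually moved.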

Continuous Evaluation Strategy

For production RAG systems, run evaluations regularly:
  1. On every deployment — Run your test suite in CI/CD before releasing changes to retrieval logic or prompts
  2. Weekly benchmarks — Track quality trends over time to catch gradual degradation
  3. After prompt changes — Measure the impact of system prompt modifications on all quality dimensions
  4. After parameter tuning — Validate that changes to chunk size, top_k, or overlap actually improve quality
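For the CI/CD case, the usual pattern is to gate the build on the run’s outcome. A minimal sketch: since the exact fields of the `run_test_suite` result object depend on the SDK version, this helper takes plain per-item pass/fail booleans that you extract yourself.

```python
# Minimal CI gate sketch: fail the build if the pass rate drops below a
# threshold. Extracting per-item pass/fail flags from the test-run result
# is SDK-version dependent, so this takes plain booleans.
def ci_exit_code(item_passed: list, min_pass_rate: float = 0.9) -> int:
    """Return 0 if the pass rate meets the threshold, 1 otherwise."""
    if not item_passed:
        return 1  # no results at all counts as a failure
    pass_rate = sum(item_passed) / len(item_passed)
    return 0 if pass_rate >= min_pass_rate else 1

# In a CI script you would call: sys.exit(ci_exit_code(flags))
```

Pinning the threshold below 100% leaves headroom for known-flaky items while still catching real regressions.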

See Also

- Trace Your RAG Pipeline: Set up comprehensive tracing for your RAG pipeline before evaluating
- Evaluation Overview: Deep dive into Netra’s evaluation framework: datasets, evaluators, and test runs
- Evaluating Agent Decisions: Evaluate tool selection, escalation, and workflow completion in agents
- A/B Testing Configurations: Compare different pipeline configurations systematically
Last modified on March 17, 2026