# Evaluating RAG Quality

> Evaluate RAG pipeline quality with Netra. Measure retrieval relevance, answer correctness, and faithfulness using LLM-as-Judge and code evaluators.

A working RAG pipeline is only the starting point. Without systematic evaluation, you have no way to know whether your retriever is fetching the right chunks, whether the LLM is faithfully using the retrieved context, or whether the generated answer actually addresses the user's question. These failures are subtle: they don't throw errors, they just produce worse answers.

This cookbook walks you through Netra's evaluation workflow: creating test datasets from your traces, configuring evaluators that score RAG-specific quality dimensions, running test suites, and interpreting results to improve your pipeline.

**Prerequisite:** You need a RAG pipeline with Netra tracing configured and at least one trace visible in your dashboard. If you haven't set this up yet, follow the [Tracing a RAG Pipeline](/Cookbooks/observability/tracing-rag-pipeline) cookbook first.

## What You'll Learn

* Turn real RAG interactions into reusable test cases directly from the dashboard
* Set up LLM-as-Judge evaluators for retrieval quality, answer correctness, and faithfulness
* Execute evaluations via the dashboard or SDK and collect quality metrics
* Interpret scores, debug failures using trace integration, and improve your pipeline

***

## Why RAG Pipelines Need Evaluation

RAG systems have multiple failure modes that are invisible without structured evaluation:

| Failure Mode             | What Goes Wrong                                                                | Why You Can't Spot-Check It                                     |
| ------------------------ | ------------------------------------------------------------------------------ | --------------------------------------------------------------- |
| **Irrelevant retrieval** | The retriever fetches chunks that don't contain the answer                     | Similarity scores look reasonable, but the content is off-topic |
| **Hallucination**        | The LLM generates information not present in the retrieved context             | The answer sounds fluent and confident, but fabricates details  |
| **Missed intent**        | The answer is factually correct but doesn't address the user's actual question | Only noticeable when you compare against a known-good response  |
| **Inconsistency**        | The same question gets different quality answers depending on retrieved chunks | Requires running the same inputs multiple times to detect       |

Netra's evaluation framework addresses this with [Datasets](/Evaluation/Datasets) (test cases with inputs and expected outputs), [Evaluators](/Evaluation/Evaluators) (LLM-as-Judge or code-based scoring for relevance, correctness, and faithfulness), and [Test Runs](/Evaluation/TestRuns) (execution results with pass/fail rates, scores, and linked traces). The workflow is: create test cases, attach evaluators, run, and review. See the [Evaluation Overview](/Evaluation/Evaluation-overview) for a deeper look at the framework.

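
To make the relationship between test cases, evaluators, and run results concrete, here is a minimal plain-Python sketch of the same workflow. It does not use the Netra SDK: `TestCase`, `contains_expected`, `run_suite`, and `fake_pipeline` are hypothetical names invented for illustration, and the pass threshold of 0.5 is an arbitrary assumption.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class TestCase:
    """Hypothetical test case: an input plus the expected output."""
    question: str
    expected_output: str


# An evaluator maps (test case, actual answer) to a score in [0, 1].
Evaluator = Callable[[TestCase, str], float]


def contains_expected(case: TestCase, answer: str) -> float:
    """Code-based evaluator: 1.0 if the expected text appears in the answer."""
    return 1.0 if case.expected_output.strip().lower() in answer.lower() else 0.0


def run_suite(
    cases: List[TestCase],
    pipeline: Callable[[str], str],
    evaluators: Dict[str, Evaluator],
    threshold: float = 0.5,  # assumed pass threshold, not a Netra default
) -> dict:
    """Run every case through the pipeline, score it, and compute a pass rate."""
    results = []
    for case in cases:
        answer = pipeline(case.question)
        scores = {name: ev(case, answer) for name, ev in evaluators.items()}
        passed = all(score >= threshold for score in scores.values())
        results.append({"question": case.question, "scores": scores, "passed": passed})
    pass_rate = sum(r["passed"] for r in results) / len(results) if results else 0.0
    return {"results": results, "pass_rate": pass_rate}


def fake_pipeline(question: str) -> str:
    """Stand-in for a real RAG pipeline; only knows one answer."""
    answers = {"What is the capital of France?": "The capital of France is Paris."}
    return answers.get(question, "I don't know.")


cases = [
    TestCase("What is the capital of France?", "Paris"),
    TestCase("Who wrote Hamlet?", "Shakespeare"),
]
report = run_suite(cases, fake_pipeline, {"contains_expected": contains_expected})
print(report["pass_rate"])  # 0.5: one of the two cases passes
```

In a real setup the substring check would typically be replaced by an LLM-as-Judge evaluator, since exact text matching misses answers that are correct but phrased differently.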
***

Now, let's walk through the process of evaluating a RAG pipeline:

## Step 1: Create Evaluators

Go to **Evaluation → Evaluators** and add the following three evaluators from the [library](/Evaluation/Evaluators#library):

| Evaluator              | What It Measures                                                           |
| ---------------------- | -------------------------------------------------------------------------- |
| **Answer Correctness** | Is the generated answer factually correct compared to the expected output? |
| **Context Relevance**  | Are the retrieved chunks relevant to the question being asked?             |
| **Faithfulness**       | Is the answer grounded in the retrieved context, without hallucination?    |
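
For intuition about what the Context Relevance and Faithfulness dimensions measure, the sketch below approximates both with simple word overlap. This is a crude lexical proxy written for illustration, not how Netra's library evaluators work (the source does not describe their internals); the function names, the stopword list, and the sample texts are all assumptions.

```python
import re
from typing import List, Set

# Small, assumed stopword list; a real evaluator would need something better.
STOPWORDS = {"the", "a", "an", "is", "are", "of", "in", "to", "and", "on", "for"}


def tokens(text: str) -> Set[str]:
    """Lowercase content words in the text, with stopwords removed."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}


def context_relevance(question: str, chunks: List[str]) -> float:
    """Fraction of retrieved chunks sharing at least one content word with the question."""
    if not chunks:
        return 0.0
    question_words = tokens(question)
    hits = sum(1 for chunk in chunks if question_words & tokens(chunk))
    return hits / len(chunks)


def faithfulness(answer: str, chunks: List[str]) -> float:
    """Fraction of the answer's content words that appear somewhere in the context."""
    answer_words = tokens(answer)
    if not answer_words:
        return 0.0
    context_words: Set[str] = set()
    for chunk in chunks:
        context_words |= tokens(chunk)
    return len(answer_words & context_words) / len(answer_words)


question = "How many days are refunds accepted?"
chunks = [
    "Refunds are accepted within 30 days of purchase.",
    "Our office is closed on public holidays.",
]

print(context_relevance(question, chunks))  # only the first chunk overlaps
print(faithfulness("Refunds are accepted within 30 days.", chunks))  # fully grounded
print(faithfulness("Refunds are accepted within 90 days.", chunks))  # "90" is unsupported
```

Note how weakly the overlap score penalizes the fabricated "90 days": a single wrong number barely moves it. That gap is the reason these dimensions are usually scored by an LLM-as-Judge rather than by lexical heuristics.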