Open in Google Colab
The complete notebook for tracing a RAG pipeline is available here
What You’ll Learn
This cookbook guides you through 5 key stages of building a production-ready RAG system:
1. Build the RAG Pipeline
Create a complete RAG chatbot that loads PDFs, chunks documents, generates embeddings, and retrieves relevant context for answering questions.
2. Add Comprehensive Tracing
Instrument every stage—chunking, embedding, retrieval, and generation—with Netra spans to capture the full execution flow.
3. Track Costs & Performance
Monitor token usage, API costs, and latency at each step to identify bottlenecks and optimize your pipeline.
4. Build Evaluation Suite
Create LLM-as-Judge evaluators for retrieval quality, answer correctness, and faithfulness detection.
5. Run Systematic Quality Checks
Build test datasets and run evaluations to measure quality over time and catch regressions before they reach production.
Prerequisites
- Python 3.9+ or Node.js 18+
- OpenAI API key
- Netra API key (Steps mentioned here)
High-Level Concepts
RAG Architecture
A RAG chatbot works in two phases:
Ingestion (one-time):
- Load and chunk the PDF into smaller text segments
- Generate embeddings for each chunk
- Store embeddings in a vector database
Query (per question):
- Convert the user’s question to an embedding
- Find the most similar chunks (retrieval)
- Pass retrieved chunks + question to an LLM
- Return the generated answer

Why Observability Matters for RAG
RAG systems can fail silently in multiple ways:
| Problem | Symptom | What Tracing Reveals |
|---|---|---|
| Poor chunking | Incomplete answers | Chunk sizes, content boundaries |
| Wrong retrieval | Irrelevant answers | Similarity scores, retrieved chunks |
| Hallucination | Fabricated info | Context vs. generated content |
| High costs | Budget overruns | Token usage per stage |
Why Evaluation Matters for RAG
Spot-checking a few queries isn’t enough. Systematic evaluation lets you:
- Measure retrieval quality (did we find the right chunks?)
- Verify answer correctness (does it match the PDF?)
- Detect hallucinations (is it grounded in context?)
- Track quality over time as you iterate on prompts and parameters
Creating the Chat Agent
Let’s build the RAG chatbot first, then add tracing and evaluation.
Installation
Start by installing the required packages. We’ll use OpenAI for embeddings and generation, ChromaDB as our vector store, and a PDF parsing library.
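For example (the PDF parsing library and the Netra SDK package name below are assumptions; substitute whatever packages your setup actually uses):

```bash
pip install openai chromadb pypdf netra-sdk
```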
Environment Setup
Configure your API keys. You’ll need both an OpenAI key for the LLM operations and a Netra key for observability.
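A minimal setup, assuming the keys are read from environment variables (the NETRA_API_KEY variable name is an assumption; use whatever name the Netra quickstart specifies):

```python
import os

# Placeholder values: set these to your real keys or export them in your shell.
os.environ["OPENAI_API_KEY"] = "sk-..."
os.environ["NETRA_API_KEY"] = "..."  # variable name assumed; check the Netra docs
```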
Loading and Chunking Documents
The first step in any RAG pipeline is extracting text from your documents and splitting it into manageable chunks. We use overlapping chunks to ensure context isn’t lost at chunk boundaries—this helps when relevant information spans multiple segments.
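A minimal sketch of that step, assuming pypdf for extraction; the chunk size and overlap values are illustrative defaults rather than the notebook’s exact settings:

```python
from pypdf import PdfReader  # pypdf is an assumed choice of PDF parser


def load_pdf(path: str) -> str:
    """Extract raw text from every page of the PDF."""
    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into overlapping chunks so information spanning boundaries survives."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks


chunks = chunk_text(load_pdf("document.pdf"))
```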
Generating Embeddings and Indexing
Next, we convert each chunk into a vector embedding and store it in ChromaDB. These embeddings capture the semantic meaning of each chunk, allowing us to find relevant content based on meaning rather than just keywords.
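A sketch of the indexing step with OpenAI’s embeddings API and an in-memory ChromaDB collection (the embedding model and collection name are illustrative choices):

```python
import chromadb
from openai import OpenAI

client = OpenAI()
chroma = chromadb.Client()
collection = chroma.create_collection("pdf_chunks")


def generate_embeddings(texts: list[str]) -> list[list[float]]:
    """Embed a batch of chunks with the OpenAI embeddings API."""
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return [item.embedding for item in response.data]


# Store every chunk in ChromaDB alongside its embedding.
collection.add(
    ids=[f"chunk-{i}" for i in range(len(chunks))],
    documents=chunks,
    embeddings=generate_embeddings(chunks),
)
```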
Building the Query Pipeline
Now we implement the core RAG logic: given a user question, retrieve the most relevant chunks from our vector store, then pass them as context to the LLM to generate an answer. The top_k parameter controls how many chunks we retrieve—more chunks provide more context but also increase cost and latency.
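A sketch of the retrieve-then-generate step, reusing the client and collection defined above (the model name and system prompt are illustrative):

```python
def retrieve(question: str, top_k: int = 3) -> list[str]:
    """Embed the question and return the top_k most similar chunks."""
    query_embedding = generate_embeddings([question])[0]
    results = collection.query(query_embeddings=[query_embedding], n_results=top_k)
    return results["documents"][0]


def answer(question: str, top_k: int = 3) -> str:
    """Generate an answer grounded in the retrieved chunks."""
    context = "\n\n".join(retrieve(question, top_k))
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```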
Adding Session Support
For production use, we wrap everything in a class that maintains conversation history and session state. This enables multi-turn conversations where the chatbot remembers previous exchanges, and allows us to track usage per user and session.
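One possible shape for that wrapper, building on the answer() function above:

```python
import uuid


class RAGChatbot:
    """Wraps the query pipeline with per-user, per-session conversation history."""

    def __init__(self, user_id: str):
        self.user_id = user_id
        self.session_id = str(uuid.uuid4())
        self.history: list[dict] = []

    def chat(self, question: str) -> str:
        reply = answer(question)
        self.history.append({"question": question, "answer": reply})
        return reply


bot = RAGChatbot(user_id="user-123")
print(bot.chat("What does the document say about pricing?"))
```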
Tracing the Agent
Now let’s add Netra observability to see what’s happening inside the RAG pipeline. The good news: with auto-instrumentation, you get full visibility with minimal code changes.
Initializing Netra
Add these two lines at application startup. Auto-instrumentation captures all OpenAI and ChromaDB operations automatically—no decorators or manual spans required.
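The exact import and initialization call depend on your Netra SDK version; a typical pattern looks roughly like the following (the names here are assumptions, so copy the precise two lines from the Netra quickstart):

```python
# Assumed module and method names; follow the Netra quickstart for the exact call.
from netra import Netra

Netra.init(app_name="rag-chatbot")  # picks up NETRA_API_KEY from the environment (assumed)
```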
What Gets Auto-Traced
With the initialization above, your existing code from the Creating the Chat Agent section is automatically traced. Here’s what appears in your Netra dashboard:
Document Ingestion
The generate_embeddings() call to OpenAI and the collection.add() call to ChromaDB are captured automatically.

Retrieval Operations
Query embedding generation and vector search operations appear as child spans with timing and metadata.
LLM Generation
OpenAI chat completions are fully traced with model, tokens, cost, latency, and full prompt/response content.
Adding User and Session Tracking
To analyze usage per user and track conversation flows, add user and session context. This is the one piece that requires explicit code—everything else is auto-traced.
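How you attach that context depends on the SDK version; as a rough sketch (the helper name below is a placeholder, not the confirmed Netra API; check the SDK reference for the real call):

```python
# Placeholder API: replace set_user_session with whatever helper the Netra SDK
# actually exposes for attaching user and session identifiers to spans.
from netra import Netra  # import path assumed

Netra.set_user_session(user_id=bot.user_id, session_id=bot.session_id)
```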
What You’ll See in the Dashboard
After running the chatbot, you’ll see traces in the Netra dashboard with:
- OpenAI spans showing model, tokens, cost, and full prompt/response
- ChromaDB spans showing query timing and results
- User and session IDs attached to all spans for filtering
Using Decorators
Auto-instrumentation handles most tracing needs, but if you want more structure, you can use decorators to create parent spans that group related operations. This is useful when you want a single trace for an entire pipeline rather than individual OpenAI/ChromaDB calls.
| Decorator | Use Case |
|---|---|
| @workflow | Top-level pipeline or request handler |
| @task | Discrete unit of work within a workflow |
| @span | Fine-grained tracing for specific operations |
Complete Example with Decorators
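A sketch of how the decorators from the table might compose with the functions defined earlier (the import path is an assumption; adapt it to the Netra SDK’s actual module layout):

```python
# Decorator names come from the table above; the import path is assumed.
from netra.decorators import workflow, task


@task
def retrieve_context(question: str, top_k: int = 3) -> list[str]:
    """Child span: embed the question and fetch the top_k most similar chunks."""
    return retrieve(question, top_k)


@task
def generate_answer(question: str, context: list[str]) -> str:
    """Child span: call the LLM with the retrieved chunks as grounding."""
    context_block = "\n\n".join(context)
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Context:\n{context_block}\n\nQuestion: {question}"}],
    )
    return response.choices[0].message.content


@workflow
def rag_query(question: str) -> str:
    """Parent span: one trace covering the whole retrieve-then-generate pipeline."""
    return generate_answer(question, retrieve_context(question))
```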

Evaluating the Agent
Now let’s set up systematic evaluation to measure and improve RAG quality. While tracing tells you what happened, evaluation tells you how well it worked.
Creating a Test Dataset
Start by building a dataset of question-answer pairs from your PDF. Include a mix of straightforward questions, edge cases, and negative tests (questions that shouldn’t be answerable from the document). You can create this through the Netra dashboard, either by adding to a dataset from the traces page or by creating a dataset from the datasets tab.
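For a sense of what that mix looks like, here are a few illustrative cases (the questions and expected answers are made-up placeholders, not content from any particular PDF):

```python
# Illustrative test cases; in practice these live in a Netra dataset, not in code.
test_cases = [
    {  # straightforward factual question
        "question": "What is the refund policy described in the document?",
        "expected_answer": "Refunds are available within 30 days of purchase.",
    },
    {  # edge case that spans multiple sections
        "question": "How do the pricing tiers differ in support response time?",
        "expected_answer": "Standard support responds within 48 hours; premium within 4 hours.",
    },
    {  # negative test: should not be answerable from the document
        "question": "Who is the company's CEO?",
        "expected_answer": "The document does not contain this information.",
    },
]
```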
Defining Evaluators
Create evaluators in the Netra dashboard under Evaluation → Evaluators → Add Evaluator. For RAG pipelines, we recommend three evaluators that cover different failure modes. You can use the built-in templates to get started quickly:
Context Relevance
Use the Context Relevance template to check whether the retrieved chunks contain information relevant to answering the question. Low scores here indicate a retrieval problem—you might need to adjust chunk size, overlap, or the number of retrieved chunks.
| Setting | Value |
|---|---|
| Template | Context Relevance |
| Output Type | Numerical |
| Pass Criteria | score >= 0.7 |
Answer Correctness
Use the Answer Correctness template to compare the generated answer against the expected answer. It catches cases where the retrieval was good but the LLM misinterpreted or missed information.
| Setting | Value |
|---|---|
| Template | Answer Correctness |
| Output Type | Numerical |
| Pass Criteria | score >= 0.7 |
Faithfulness
Use the Faithfulness template to check whether the answer is grounded in the retrieved context. High correctness but low faithfulness indicates the model is “getting lucky” by using its training data rather than the provided context—a reliability risk.
| Setting | Value |
|---|---|
| Template | Faithfulness |
| Output Type | Numerical |
| Pass Criteria | score >= 0.8 |
Running Evaluation Experiments
With your dataset and evaluators configured, use Netra’s built-in evaluation API to run test suites. The run_test_suite method fetches test cases from your dataset and executes your task function against each one.
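As a rough sketch of that wiring (apart from run_test_suite itself, which the paragraph above names, every class, parameter, and field name here is an assumption; check the Netra evaluation docs for the exact signature):

```python
# Sketch only: entry point and parameter names are assumed, not confirmed Netra API.
from netra import Netra


def run_rag_task(test_case: dict) -> dict:
    """Execute the RAG pipeline for one test case fetched from the dataset."""
    return {"answer": bot.chat(test_case["question"])}


Netra.run_test_suite(
    dataset="rag-pdf-qa",  # dataset name created in the dashboard (assumed)
    task=run_rag_task,     # called once per test case
)
```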
Analyzing Results and Iterating
After running evaluations, review results in Evaluation → Test Runs. Look for patterns in failures:
| Low Score In | Likely Cause | How to Fix |
|---|---|---|
| Retrieval Quality | Wrong chunks retrieved | Increase top_k, reduce chunk size, add overlap |
| Answer Correctness | LLM misinterprets context | Improve system prompt, lower temperature |
| Faithfulness | Model hallucinates | Add explicit grounding instructions, use stronger model |
Each test run report also includes:
- Pass/Fail rates for each evaluator
- Score distributions across test cases
- Trace links for debugging failures
- Cost and latency metrics per test run
Summary
You’ve built a fully observable RAG pipeline with systematic evaluation. Your chatbot now has:
- End-to-end tracing across document ingestion, retrieval, and generation
- Cost and performance tracking at each pipeline stage
- Quality evaluation using Context Relevance, Answer Correctness, and Faithfulness metrics
- Debugging capabilities to trace failures back to specific chunks and prompts