
# A/B Testing Model Configurations

> A/B test AI model configurations with Netra's evaluation framework. Compare prompts, models, and parameters by running the same dataset against each setup.

In the [Multi-Tenant Cost Tracking](/Cookbooks/observability/multi-tenant-cost-tracking) cookbook, you set up tier-based configurations for a meeting summarization pipeline: Enterprise on GPT-4, Professional on GPT-4-turbo, and Starter on GPT-3.5-turbo. But how do you know whether the Enterprise tier's output is enough of an improvement to justify its extra cost? Without structured evaluation, you're guessing.

This cookbook walks you through the next step: using Netra's evaluation framework to A/B test those configurations. You'll run the same test cases against two tiers, score both with the same evaluators, and compare results side by side to make a data-driven decision.

<Info>
  **Prerequisite:** You need a Netra API key ([Get started here](/quick-start/Overview)) and the meeting summarization pipeline from the [Multi-Tenant Cost Tracking](/Cookbooks/observability/multi-tenant-cost-tracking) cookbook. The code below reuses the `MultiTenantMeetingSummarizer` class and tenant configurations from that cookbook.
</Info>

## What You'll Learn

<CardGroup cols={2}>
  <Card icon="database" title="Build a Shared Test Dataset">
    Create test cases that both configurations will be evaluated against
  </Card>

  <Card icon="scale-balanced" title="Configure Quality Evaluators">
    Set up evaluators for answer correctness and conciseness
  </Card>

  <Card icon="flask-vial" title="Run Parallel Test Suites">
    Trigger separate evaluation runs for each configuration via the SDK
  </Card>

  <Card icon="chart-line" title="Compare Results & Decide">
    Interpret scores across runs to make data-driven configuration decisions
  </Card>
</CardGroup>

***

## Why A/B Test AI Configurations?

Each configuration trades quality against cost and latency differently. Systematic A/B testing answers questions like the following with data:

| Scenario                | What to Compare                          | What You'll Learn                                        |
| ----------------------- | ---------------------------------------- | -------------------------------------------------------- |
| **Model upgrade**       | GPT-3.5-turbo vs GPT-4-turbo             | Does the quality improvement justify the cost increase?  |
| **Prompt optimization** | Original prompt vs revised prompt        | Does the new prompt improve quality with the same model? |
| **Parameter tuning**    | temperature=0.1 vs temperature=0.3       | Which setting produces more consistent results?          |
| **Tier validation**     | Enterprise config vs Professional config | Does the quality gap justify the price gap?              |

Netra's evaluation framework makes this straightforward: create one dataset, run it against each configuration as a separate [Test Run](/Evaluation/TestRuns), and compare evaluator scores in the dashboard. See the [Evaluation Overview](/Evaluation/Evaluation-overview) for a deeper look at the framework.

***

The steps below walk through A/B testing two configurations.

## Step 1: Create Evaluators

You need two evaluators from the library.

### Answer Correctness (Library)

Go to **Evaluation → Evaluators**, switch to the **Library** tab, and add **Answer Correctness** from the Quality category.

### Conciseness (Library)

Add **Conciseness** from the Quality category.

| Evaluator              | What It Measures                                                           |
| ---------------------- | -------------------------------------------------------------------------- |
| **Answer Correctness** | Is the generated output factually correct compared to the expected output? |
| **Conciseness**        | Is the output appropriately brief without losing key information?          |

You can test each evaluator in the **Playground** before using it in a dataset. See [Evaluators](/Evaluation/Evaluators) for the full reference.

<video autoPlay={true} muted={true} loop={true} playsInline={true} className="w-full aspect-video rounded-xl" src="https://mintcdn.com/netra/flCEWIb7m_86sZYE/videos/cookbook-ab-testing-evaluators.mp4?fit=max&auto=format&n=flCEWIb7m_86sZYE&q=85&s=ffe69e3a973a8f7b9bb7c99402eae02a" data-path="videos/cookbook-ab-testing-evaluators.mp4" />

***

## Step 2: Create a Dataset

Go to **Evaluation → Datasets** and click **Create Dataset**. Name it "A/B Test Dataset" and attach the two evaluators from Step 1.

<video autoPlay={true} muted={true} loop={true} playsInline={true} className="w-full aspect-video rounded-xl" src="https://mintcdn.com/netra/flCEWIb7m_86sZYE/videos/cookbook-ab-testing-create-dataset.mp4?fit=max&auto=format&n=flCEWIb7m_86sZYE&q=85&s=1c51cccb80c0fb78b2d674d727b2d7bd" data-path="videos/cookbook-ab-testing-create-dataset.mp4" />

You already have traces from running the meeting summarization pipeline in the [Multi-Tenant Cost Tracking](/Cookbooks/observability/multi-tenant-cost-tracking) cookbook. Add them to your dataset directly:

<Steps>
  <Step title="Select a trace">
    Go to **Observability → Traces** and select a trace from the [Multi-Tenant Cost Tracking](/Cookbooks/observability/multi-tenant-cost-tracking) cookbook. Choose traces with different meeting types (short standups, planning sessions, open-ended discussions) to get a diverse set of test cases.
  </Step>

  <Step title="Add to Dataset">
    Click on the trace, then click **Add to Dataset**. Select the "A/B Test Dataset" you just created. Fill in the **Expected Output** with the correct summary for that meeting transcript.
  </Step>

  <Step title="Repeat for more traces">
    Add 3–5 traces covering different meeting types — short standups, complex planning sessions, and ambiguous discussions. More diverse test cases give you a clearer comparison between configurations.
  </Step>
</Steps>

You can also add test cases manually if you want to include specific edge cases that aren't in your traces.

For each evaluator, configure the variable mappings so the evaluator receives the correct inputs at runtime — map evaluator variables like `query` and `expected_output` to **Dataset item** fields, and `agent_response` to **Agent response**. See [Datasets](/Evaluation/Datasets) for the full mapping reference.

***

## Step 3: Trigger Test Runs

The key to A/B testing is running the **same dataset** against **different configurations** as separate test runs. Copy the **Dataset ID** from the dataset page and trigger one run per configuration.

<CodeGroup>
  ```python Python theme={null}
  from netra import Netra
  from netra.instrumentation.instruments import InstrumentSet

  Netra.init(
      app_name="ab-testing",
      instruments={InstrumentSet.OPENAI},
  )

  # Reuse the summarizer from the Multi-Tenant Cost Tracking cookbook.
  # MultiTenantMeetingSummarizer must be defined or imported from that cookbook's code.
  summarizer = MultiTenantMeetingSummarizer()
  dataset = Netra.evaluation.get_dataset(dataset_id="your-dataset-id")

  # --- Configuration A: Enterprise tier (GPT-4) ---
  def enterprise_pipeline(input_data):
      result = summarizer.summarize_meeting(
          tenant_id="apex-legal",
          meeting_transcript=input_data,
      )
      return result["summary"]

  Netra.evaluation.run_test_suite(
      name="Meeting Summary — Enterprise (GPT-4)",
      data=dataset,
      task=enterprise_pipeline,
  )

  # --- Configuration B: Professional tier (GPT-4-turbo) ---
  def professional_pipeline(input_data):
      result = summarizer.summarize_meeting(
          tenant_id="stratex-consulting",
          meeting_transcript=input_data,
      )
      return result["summary"]

  Netra.evaluation.run_test_suite(
      name="Meeting Summary — Professional (GPT-4-turbo)",
      data=dataset,
      task=professional_pipeline,
  )
  ```

  ```typescript TypeScript theme={null}
  import { Netra, NetraInstruments } from "netra-sdk";

  await Netra.init({
    appName: "ab-testing",
    instruments: new Set([NetraInstruments.OPENAI]),
  });

  // Reuse the summarizer from the Multi-Tenant Cost Tracking cookbook.
  // MultiTenantMeetingSummarizer must be defined or imported from that cookbook's code.
  const summarizer = new MultiTenantMeetingSummarizer();
  const dataset = await Netra.evaluation.getDataset("your-dataset-id");

  // --- Configuration A: Enterprise tier (GPT-4) ---
  async function enterprisePipeline(inputData: string): Promise<string> {
    const result = await summarizer.summarizeMeeting(
      "apex-legal",
      inputData,
    );
    return result.summary;
  }

  await Netra.evaluation.runTestSuite(
    "Meeting Summary — Enterprise (GPT-4)",
    dataset,
    enterprisePipeline,
  );

  // --- Configuration B: Professional tier (GPT-4-turbo) ---
  async function professionalPipeline(inputData: string): Promise<string> {
    const result = await summarizer.summarizeMeeting(
      "stratex-consulting",
      inputData,
    );
    return result.summary;
  }

  await Netra.evaluation.runTestSuite(
    "Meeting Summary — Professional (GPT-4-turbo)",
    dataset,
    professionalPipeline,
  );
  ```
</CodeGroup>

<Tip>
  You can test any number of configurations (model swaps, prompt variations, temperature changes) by adding more `run_test_suite` calls (`runTestSuite` in TypeScript) against the same dataset.
</Tip>
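
One way to keep many configurations manageable is to generate the task functions from a small mapping instead of writing one function per tier. The sketch below assumes the `summarize_meeting` signature used in Step 3. `StubSummarizer`, `make_pipeline`, `CONFIGS`, and the run names are hypothetical; the stub stands in for `MultiTenantMeetingSummarizer` so the example runs without API access.

```python
from typing import Callable, Dict


class StubSummarizer:
    """Hypothetical stand-in for MultiTenantMeetingSummarizer (runs offline)."""

    def summarize_meeting(self, tenant_id: str, meeting_transcript: str) -> Dict[str, str]:
        # The real class calls an LLM; the stub just echoes its inputs.
        return {"summary": f"[{tenant_id}] {meeting_transcript[:40]}"}


def make_pipeline(summarizer, tenant_id: str) -> Callable[[str], str]:
    """Return a task function bound to one tenant configuration."""

    def pipeline(input_data: str) -> str:
        result = summarizer.summarize_meeting(
            tenant_id=tenant_id,
            meeting_transcript=input_data,
        )
        return result["summary"]

    return pipeline


# Run name -> tenant ID. The tenant IDs are the ones used in Step 3;
# the run names are illustrative.
CONFIGS = {
    "Meeting Summary (Enterprise)": "apex-legal",
    "Meeting Summary (Professional)": "stratex-consulting",
}

summarizer = StubSummarizer()
tasks = {name: make_pipeline(summarizer, tenant) for name, tenant in CONFIGS.items()}

# With the real summarizer and SDK, each task would be passed to
# run_test_suite exactly as in Step 3:
# for name, task in tasks.items():
#     Netra.evaluation.run_test_suite(name=name, data=dataset, task=task)
```

Building each task through a factory function binds the tenant ID when the task is created, which avoids the late-binding surprise of defining closures directly inside a loop.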

For more details on the evaluation API, refer to the [SDK documentation](/sdk-reference/evaluation/python).

***

## Step 4: Compare Results

Go to **Evaluation → Test Runs** to see both runs listed. Click into each run to see per-evaluator, per-item results.

### Build a Comparison Table

Pull the evaluator scores from each run and compare them. Delta is the Professional score minus the Enterprise score:

| Evaluator          | Enterprise (GPT-4) | Professional (GPT-4-turbo) | Delta |
| ------------------ | ------------------ | -------------------------- | ----- |
| Answer Correctness | 0.95               | 0.89                       | -0.06 |
| Conciseness        | 0.80               | 0.88                       | +0.08 |

You can also click **View Trace** on any result to inspect the exact LLM input, output, and token usage for that test case. This is useful for understanding why one configuration scored higher on a specific item.
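
If you copy the run-level scores out of the dashboard, the comparison table can be generated rather than typed by hand. This is a minimal sketch, not an SDK feature: `comparison_rows` is a hypothetical helper, and the scores are the example values from the table above.

```python
from typing import Dict, List


def comparison_rows(baseline: Dict[str, float], candidate: Dict[str, float]) -> List[str]:
    """Format one Markdown table row per evaluator, with a signed delta."""
    rows = []
    for evaluator, base_score in baseline.items():
        cand_score = candidate[evaluator]
        delta = cand_score - base_score  # candidate minus baseline
        rows.append(f"| {evaluator} | {base_score:.2f} | {cand_score:.2f} | {delta:+.2f} |")
    return rows


enterprise = {"Answer Correctness": 0.95, "Conciseness": 0.80}
professional = {"Answer Correctness": 0.89, "Conciseness": 0.88}

rows = comparison_rows(enterprise, professional)
print("\n".join(rows))
```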

<video autoPlay={true} muted={true} loop={true} playsInline={true} className="w-full aspect-video rounded-xl" src="https://mintcdn.com/netra/flCEWIb7m_86sZYE/videos/cookbook-ab-testing-results.mp4?fit=max&auto=format&n=flCEWIb7m_86sZYE&q=85&s=94797554987b863c28d0e1dedce80e98" data-path="videos/cookbook-ab-testing-results.mp4" />

***

## Interpreting Scores and Making Decisions

### Quality vs. Cost Analysis

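Combine evaluator scores with cost data from your traces to see the full picture. In this table, Delta is the relative change from Enterprise to Professional, expressed as a percentage: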

| Metric             | Enterprise (GPT-4) | Professional (GPT-4-turbo) | Delta |
| ------------------ | ------------------ | -------------------------- | ----- |
| Answer Correctness | 0.95               | 0.89                       | -6%   |
| Conciseness        | 0.80               | 0.88                       | +10%  |
| Avg Cost per Item  | \$0.023            | \$0.008                    | -65%  |
| Avg Latency        | 2.1s               | 1.4s                       | -33%  |
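
The percentages in this table are consistent with a relative change, computed as (Professional - Enterprise) / Enterprise. A small helper makes that calculation explicit; `relative_change_pct` is a hypothetical name, and the inputs are the example values from the table.

```python
def relative_change_pct(baseline: float, candidate: float) -> int:
    """Relative change from baseline to candidate, rounded to a whole percent."""
    return round((candidate - baseline) / baseline * 100)


# (Enterprise, Professional) pairs taken from the table above.
metrics = {
    "Answer Correctness": (0.95, 0.89),
    "Conciseness": (0.80, 0.88),
    "Avg Cost per Item": (0.023, 0.008),
    "Avg Latency": (2.1, 1.4),
}

changes = {name: relative_change_pct(base, cand) for name, (base, cand) in metrics.items()}

for name, pct in changes.items():
    print(f"{name}: {pct:+d}%")
```

Relative change puts scores, dollars, and seconds on the same footing, which is why it suits a mixed table like this one better than an absolute difference.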

### Decision Framework

Use the comparison data to make an informed decision:

| Condition                                                            | Action                                                                     |
| -------------------------------------------------------------------- | -------------------------------------------------------------------------- |
| Quality scores equivalent, one configuration is cheaper or faster    | Switch to the cheaper/faster configuration                                 |
| One configuration scores higher on your most important evaluator     | Keep the higher-quality configuration if the cost difference is acceptable |
| Scores are mixed (one wins on correctness, the other on conciseness) | Prioritize the evaluator that matters most for your use case               |
| Quality drops below your pass threshold (e.g., 0.7)                  | Do not switch — the cost savings aren't worth the quality loss             |

After making a decision, re-run the evaluation periodically to confirm the quality gap hasn't changed — model behavior can shift with provider updates.
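
The rules in the decision table can be encoded as a small function so the same criteria are applied on every re-run. This is one possible reading of the table, not part of the Netra SDK: `recommend`, the `primary` evaluator argument, and the `tolerance` used to decide when two scores count as equivalent are all assumptions. The 0.7 default mirrors the example pass threshold above.

```python
from typing import Dict


def recommend(
    current: Dict[str, float],
    candidate: Dict[str, float],
    primary: str,
    pass_threshold: float = 0.7,
    tolerance: float = 0.02,  # assumed definition of "equivalent" scores
) -> str:
    """Apply the decision table to two sets of evaluator scores."""
    # Any candidate score below the pass threshold rules out a switch.
    if any(score < pass_threshold for score in candidate.values()):
        return "do not switch"

    # Mixed results are resolved by looking only at the primary evaluator.
    diff = candidate[primary] - current[primary]

    # Scores within the tolerance are treated as equivalent.
    if abs(diff) <= tolerance:
        return "switch if the candidate is cheaper or faster"
    if diff > 0:
        return "switch if the cost difference is acceptable"
    return "keep current if the cost difference is acceptable"


enterprise = {"Answer Correctness": 0.95, "Conciseness": 0.80}
professional = {"Answer Correctness": 0.89, "Conciseness": 0.88}

print(recommend(enterprise, professional, primary="Answer Correctness"))
print(recommend(enterprise, professional, primary="Conciseness"))
```

The two calls show why the choice of primary evaluator matters: with the example scores, the recommendation flips depending on whether correctness or conciseness is treated as most important.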

***

## Continuous A/B Testing Strategy

Run A/B tests regularly as part of your development workflow:

1. **Before model upgrades** — Compare the new model against your current one before switching in production
2. **After prompt changes** — Measure the impact of prompt modifications across all quality dimensions
3. **When optimizing cost** — Verify that a cheaper configuration maintains acceptable quality
4. **For tier validation** — Confirm that premium tiers deliver measurably better results than lower tiers

***

## See Also

<CardGroup cols={2}>
  <Card icon="building" href="/Cookbooks/observability/multi-tenant-cost-tracking" title="Multi-Tenant Cost Tracking">
    Set up the tier-based meeting summarization pipeline this cookbook evaluates
  </Card>

  <Card icon="gauge-high" href="/Evaluation/Evaluation-overview" title="Evaluation Overview">
    Deep dive into Netra's evaluation framework: datasets, evaluators, and test runs
  </Card>

  <Card icon="robot" href="/Cookbooks/evaluation/evaluating-agent-decisions" title="Evaluating Agent Decisions">
    Evaluate tool selection, escalation, and workflow completion in agents
  </Card>
</CardGroup>
