In the Multi-Tenant Cost Tracking cookbook, you set up tier-based configurations for a meeting summarization pipeline — Enterprise on GPT-4, Professional on GPT-4-turbo, and Starter on GPT-3.5-turbo. But how do you know whether the Enterprise tier’s output is enough better to justify its cost? Without structured evaluation, you’re guessing. This cookbook walks through the next step: using Netra’s evaluation framework to A/B test those configurations. You’ll run the same test cases against two tiers, score both with the same evaluators, and compare the results side by side to make a data-driven decision.
Prerequisite: You need a Netra API key (Get started here) and the meeting summarization pipeline from the Multi-Tenant Cost Tracking cookbook. The code below reuses the MultiTenantMeetingSummarizer class and tenant configurations from that cookbook.

What You’ll Learn

Build a Shared Test Dataset

Create test cases that both configurations will be evaluated against

Configure Quality Evaluators

Set up evaluators for answer correctness and conciseness

Run Parallel Test Suites

Trigger separate evaluation runs for each configuration via the SDK

Compare Results & Decide

Interpret scores across runs to make data-driven configuration decisions

Why A/B Test AI Configurations?

Different configurations serve different trade-offs. Systematic A/B testing answers these questions with data:
| Scenario | What to Compare | What You’ll Learn |
| --- | --- | --- |
| Model upgrade | GPT-3.5-turbo vs GPT-4-turbo | Does the quality improvement justify the cost increase? |
| Prompt optimization | Original prompt vs revised prompt | Does the new prompt improve quality with the same model? |
| Parameter tuning | temperature=0.1 vs temperature=0.3 | Which setting produces more consistent results? |
| Tier validation | Enterprise config vs Professional config | Does the quality gap justify the price gap? |
Netra’s evaluation framework makes this straightforward: create one dataset, run it against each configuration as a separate Test Run, and compare evaluator scores in the dashboard. See the Evaluation Overview for a deeper look at the framework.
Now, let’s walk through the process of A/B testing two configurations:

Step 1: Create Evaluators

You need two evaluators from the library.

Answer Correctness (Library)

Go to Evaluation → Evaluators, switch to the Library tab, and add Answer Correctness from the Quality category.

Conciseness (Library)

Add Conciseness from the Quality category.
| Evaluator | What It Measures |
| --- | --- |
| Answer Correctness | Is the generated output factually correct compared to the expected output? |
| Conciseness | Is the output appropriately brief without losing key information? |
You can test each evaluator in the Playground before using it in a dataset. See Evaluators for the full reference.

Step 2: Create a Dataset

Go to Evaluation → Datasets and click Create Dataset. Name it “A/B Test Dataset” and attach the two evaluators from Step 1. You already have traces from running the meeting summarization pipeline in the Multi-Tenant Cost Tracking cookbook. Add them to your dataset directly:
1. Select a trace — Go to Observability → Traces and select a trace from the Multi-Tenant Cost Tracking cookbook. Choose traces with different meeting types (short standups, planning sessions, open-ended discussions) to get a diverse set of test cases.
2. Add to Dataset — Click on the trace, then click Add to Dataset. Select the “A/B Test Dataset” you just created. Fill in the Expected Output with the correct summary for that meeting transcript.
3. Repeat for more traces — Add 3–5 traces covering different meeting types: short standups, complex planning sessions, and ambiguous discussions. More diverse test cases give you a clearer comparison between configurations.
You can also add test cases manually if you want to include specific edge cases that aren’t in your traces. For each evaluator, configure the variable mappings so the evaluator receives the correct inputs at runtime — map evaluator variables like query and expected_output to Dataset item fields, and agent_response to Agent response. See Datasets for the full mapping reference.
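If you script manual test cases before uploading them, a plain data structure is enough. The sketch below is illustrative: the field names `input` and `expected_output` are assumptions meant to mirror the dataset item fields described above, so match them to your dataset’s actual schema.

```python
# Sketch of manually authored test cases for the A/B dataset.
# Field names ("input", "expected_output") are assumptions -- align
# them with your dataset's actual item schema before uploading.
manual_cases = [
    {
        "input": "Standup: Priya finished the auth migration; Marco is blocked on QA access.",
        "expected_output": "Priya completed the auth migration. Marco is blocked awaiting QA access.",
    },
    {
        "input": "Planning: team agreed to ship v2 search by June 5 and defer analytics to Q3.",
        "expected_output": "Decisions: ship v2 search by June 5; defer analytics work to Q3.",
    },
]

# Basic sanity check: every case needs both fields populated.
for case in manual_cases:
    assert case["input"] and case["expected_output"]
```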

Step 3: Trigger Test Runs

The key to A/B testing is running the same dataset against different configurations as separate test runs. Copy the Dataset ID from the dataset page and trigger one run per configuration.
from netra import Netra
from netra.instrumentation.instruments import InstrumentSet

Netra.init(
    app_name="ab-testing",
    instruments={InstrumentSet.OPENAI},
)

# Reuse the summarizer from the Multi-Tenant Cost Tracking cookbook
summarizer = MultiTenantMeetingSummarizer()
dataset = Netra.evaluation.get_dataset(dataset_id="your-dataset-id")

# --- Configuration A: Enterprise tier (GPT-4) ---
def enterprise_pipeline(input_data):
    result = summarizer.summarize_meeting(
        tenant_id="apex-legal",
        meeting_transcript=input_data,
    )
    return result["summary"]

Netra.evaluation.run_test_suite(
    name="Meeting Summary — Enterprise (GPT-4)",
    data=dataset,
    task=enterprise_pipeline,
)

# --- Configuration B: Professional tier (GPT-4-turbo) ---
def professional_pipeline(input_data):
    result = summarizer.summarize_meeting(
        tenant_id="stratex-consulting",
        meeting_transcript=input_data,
    )
    return result["summary"]

Netra.evaluation.run_test_suite(
    name="Meeting Summary — Professional (GPT-4-turbo)",
    data=dataset,
    task=professional_pipeline,
)
You can test any number of configurations — model swaps, prompt variations, temperature changes — by adding more run_test_suite calls against the same dataset.
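One way to avoid duplicating a wrapper function per tier is a small factory that closes over the tenant ID. This is an illustrative sketch, not part of the Netra API: `make_pipeline` works with any summarize callable shaped like the cookbook’s `summarize_meeting`.

```python
# Sketch: generate one pipeline function per configuration instead of
# hand-writing a wrapper for each tier.
def make_pipeline(summarize, tenant_id):
    """Wrap a summarize callable so each test run targets one tenant config."""
    def pipeline(input_data):
        result = summarize(tenant_id=tenant_id, meeting_transcript=input_data)
        return result["summary"]
    return pipeline

# (run name, tenant ID) pairs from the cookbook's tenant configurations.
configs = [
    ("Meeting Summary — Enterprise (GPT-4)", "apex-legal"),
    ("Meeting Summary — Professional (GPT-4-turbo)", "stratex-consulting"),
]

# Usage against the SDK (not executed here):
# for name, tenant_id in configs:
#     Netra.evaluation.run_test_suite(
#         name=name,
#         data=dataset,
#         task=make_pipeline(summarizer.summarize_meeting, tenant_id),
#     )
```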
For more details on the evaluation API, refer to the SDK documentation.

Step 4: Compare Results

Go to Evaluation → Test Runs to see both runs listed. Click into each run to see per-evaluator, per-item results.

Build a Comparison Table

Pull the evaluator scores from each run and compare:
| Evaluator | Enterprise (GPT-4) | Professional (GPT-4-turbo) | Delta |
| --- | --- | --- | --- |
| Answer Correctness | 0.95 | 0.89 | -0.06 |
| Conciseness | 0.80 | 0.88 | +0.08 |
You can also click View Trace on any result to inspect the exact LLM input, output, and token usage for that test case. This is useful for understanding why one configuration scored higher on a specific item.
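If you prefer to build the comparison table in code, averaging per-item evaluator scores and taking the difference is all it takes. The score lists below are hypothetical stand-ins for the values you would pull from each run’s results.

```python
# Hypothetical per-item scores pulled from each test run's results.
enterprise = {"answer_correctness": [0.90, 1.00, 0.95], "conciseness": [0.80, 0.80, 0.80]}
professional = {"answer_correctness": [0.85, 0.90, 0.92], "conciseness": [0.90, 0.85, 0.89]}

def mean(xs):
    return sum(xs) / len(xs)

# Delta > 0 means the Professional run scored higher on that evaluator.
comparison = {
    name: round(mean(professional[name]) - mean(enterprise[name]), 3)
    for name in enterprise
}
```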

Interpreting Scores and Making Decisions

Quality vs. Cost Analysis

Combine evaluator scores with cost data from your traces to see the full picture:
| Metric | Enterprise (GPT-4) | Professional (GPT-4-turbo) | Delta |
| --- | --- | --- | --- |
| Answer Correctness | 0.95 | 0.89 | -6% |
| Conciseness | 0.80 | 0.88 | +10% |
| Avg Cost per Item | $0.023 | $0.008 | -65% |
| Avg Latency | 2.1s | 1.4s | -33% |
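The percent deltas in the table above can be reproduced with a few lines of arithmetic, where each delta is Professional relative to Enterprise:

```python
# Percent deltas for the quality-vs-cost table, using the example numbers
# above. Delta is Professional relative to Enterprise: (prof - ent) / ent.
metrics = {
    "answer_correctness": (0.95, 0.89),
    "conciseness": (0.80, 0.88),
    "avg_cost_per_item": (0.023, 0.008),
    "avg_latency_s": (2.1, 1.4),
}

deltas = {
    name: round((prof - ent) / ent * 100)
    for name, (ent, prof) in metrics.items()
}
```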

Decision Framework

Use the comparison data to make an informed decision:
| Condition | Action |
| --- | --- |
| Quality scores equivalent, one configuration is cheaper or faster | Switch to the cheaper/faster configuration |
| One configuration scores higher on your most important evaluator | Keep the higher-quality configuration if the cost difference is acceptable |
| Scores are mixed (one wins on correctness, the other on conciseness) | Prioritize the evaluator that matters most for your use case |
| Quality drops below your pass threshold (e.g., 0.7) | Do not switch — the cost savings aren’t worth the quality loss |
After making a decision, re-run the evaluation periodically to confirm the quality gap hasn’t changed — model behavior can shift with provider updates.
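The single-evaluator rows of the decision framework can be sketched as a small helper. The pass threshold and the “equivalent” margin below are assumptions to tune for your use case; the mixed-scores row still calls for human judgment about which evaluator matters most.

```python
# Sketch of the decision framework's single-evaluator rows as a helper.
# pass_threshold and margin are assumptions -- tune them for your use case.
def decide(ent_score, prof_score, prof_cheaper=True,
           pass_threshold=0.7, margin=0.02):
    """Return a recommendation for the cheaper (Professional) configuration."""
    if prof_score < pass_threshold:
        return "do not switch: below pass threshold"
    if abs(ent_score - prof_score) <= margin and prof_cheaper:
        return "switch: equivalent quality, lower cost"
    if prof_score < ent_score:
        return "keep enterprise if cost difference is acceptable"
    return "switch: professional scores higher"
```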

Continuous A/B Testing Strategy

Run A/B tests regularly as part of your development workflow:
  1. Before model upgrades — Compare the new model against your current one before switching in production
  2. After prompt changes — Measure the impact of prompt modifications across all quality dimensions
  3. When optimizing cost — Verify that a cheaper configuration maintains acceptable quality
  4. For tier validation — Confirm that premium tiers deliver measurably better results than lower tiers

See Also

Multi-Tenant Cost Tracking

Set up the tier-based meeting summarization pipeline this cookbook evaluates

Evaluation Overview

Deep dive into Netra’s evaluation framework: datasets, evaluators, and test runs

Evaluating Agent Decisions

Evaluate tool selection, escalation, and workflow completion in agents
Last modified on March 17, 2026