Prerequisite: You need a Netra API key (Get started here) and the meeting summarization pipeline from the Multi-Tenant Cost Tracking cookbook. The code below reuses the
MultiTenantMeetingSummarizer class and tenant configurations from that cookbook.
What You’ll Learn
- Build a Shared Test Dataset: Create test cases that both configurations will be evaluated against
- Configure Quality Evaluators: Set up evaluators for answer correctness and conciseness
- Run Parallel Test Suites: Trigger separate evaluation runs for each configuration via the SDK
- Compare Results & Decide: Interpret scores across runs to make data-driven configuration decisions
Why A/B Test AI Configurations?
Different configurations serve different trade-offs. Systematic A/B testing answers these questions with data:

| Scenario | What to Compare | What You’ll Learn |
|---|---|---|
| Model upgrade | GPT-3.5-turbo vs GPT-4-turbo | Does the quality improvement justify the cost increase? |
| Prompt optimization | Original prompt vs revised prompt | Does the new prompt improve quality with the same model? |
| Parameter tuning | temperature=0.1 vs temperature=0.3 | Which setting produces more consistent results? |
| Tier validation | Enterprise config vs Professional config | Does the quality gap justify the price gap? |
Now, let’s walk through the process of A/B testing two configurations:
Step 1: Create Evaluators
You need two evaluators from the library.
Answer Correctness (Library)
Go to Evaluation → Evaluators, switch to the Library tab, and add Answer Correctness from the Quality category.
Conciseness (Library)
Add Conciseness from the Quality category.

| Evaluator | What It Measures |
|---|---|
| Answer Correctness | Is the generated output factually correct compared to the expected output? |
| Conciseness | Is the output appropriately brief without losing key information? |
Step 2: Create a Dataset
Go to Evaluation → Datasets and click Create Dataset. Name it “A/B Test Dataset” and attach the two evaluators from Step 1. You already have traces from running the meeting summarization pipeline in the Multi-Tenant Cost Tracking cookbook. Add them to your dataset directly:
Select a trace
Go to Observability → Traces and select a trace from the Multi-Tenant Cost Tracking cookbook. Choose traces with different meeting types (short standups, planning sessions, open-ended discussions) to get a diverse set of test cases.
Add to Dataset
Click on the trace, then click Add to Dataset. Select the “A/B Test Dataset” you just created. Fill in the Expected Output with the correct summary for that meeting transcript.
Map query and expected_output to Dataset item fields, and agent_response to Agent response. See Datasets for the full mapping reference.
Step 3: Trigger Test Runs
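As a sketch of what the trigger loop looks like in code, the snippet below builds one run request per configuration against the shared dataset. The payload fields, configuration names, and placeholder Dataset ID are assumptions for illustration, not the documented Netra API; consult the SDK reference for the actual call that submits each request.

```python
import json

# Placeholder: copy the real Dataset ID from the dataset page in the UI.
DATASET_ID = "YOUR_DATASET_ID"

# The two configurations under test (hypothetical names and settings).
CONFIGS = {
    "enterprise-gpt4": {"model": "gpt-4", "temperature": 0.1},
    "professional-gpt4-turbo": {"model": "gpt-4-turbo", "temperature": 0.1},
}

def build_run_request(dataset_id: str, run_name: str, config: dict) -> dict:
    """Assemble one test-run payload: same dataset, different configuration."""
    return {
        "dataset_id": dataset_id,
        "name": run_name,
        # Tag the run with its configuration so results stay attributable.
        "metadata": config,
    }

# One request per configuration; submit each via the SDK or your HTTP client.
run_requests = [
    build_run_request(DATASET_ID, name, cfg) for name, cfg in CONFIGS.items()
]
for req in run_requests:
    print(json.dumps(req))
```

The essential point is that both requests reference the same `dataset_id`, so every configuration is scored against identical test cases.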
The key to A/B testing is running the same dataset against different configurations as separate test runs. Copy the Dataset ID from the dataset page and trigger one run per configuration.
Step 4: Compare Results
Go to Evaluation → Test Runs to see both runs listed. Click into each run to see per-evaluator, per-item results.
Build a Comparison Table
Pull the evaluator scores from each run and compare:

| Evaluator | Enterprise (GPT-4) | Professional (GPT-4-turbo) | Delta |
|---|---|---|---|
| Answer Correctness | 0.95 | 0.89 | -0.06 |
| Conciseness | 0.80 | 0.88 | +0.08 |
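Computing the delta column programmatically keeps the comparison reproducible. A minimal sketch using the example scores above (in practice you would read each run's per-evaluator scores from its results page or export):

```python
# Example evaluator scores per run, taken from the comparison table above.
enterprise = {"answer_correctness": 0.95, "conciseness": 0.80}
professional = {"answer_correctness": 0.89, "conciseness": 0.88}

# Delta = Professional minus Enterprise, matching the table's convention.
deltas = {
    name: round(professional[name] - enterprise[name], 2) for name in enterprise
}
print(deltas)  # {'answer_correctness': -0.06, 'conciseness': 0.08}
```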
Interpreting Scores and Making Decisions
Quality vs. Cost Analysis
Combine evaluator scores with cost data from your traces to see the full picture:

| Metric | Enterprise (GPT-4) | Professional (GPT-4-turbo) | Delta |
|---|---|---|---|
| Answer Correctness | 0.95 | 0.89 | -6% |
| Conciseness | 0.80 | 0.88 | +10% |
| Avg Cost per Item | $0.023 | $0.008 | -65% |
| Avg Latency | 2.1s | 1.4s | -33% |
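The percentage deltas in this table fall straight out of the raw numbers. A small sketch, with the values hardcoded from the table above (for cost and latency, a negative delta is an improvement):

```python
# (Enterprise, Professional) value pairs from the table above.
metrics = {
    "answer_correctness": (0.95, 0.89),
    "conciseness": (0.80, 0.88),
    "avg_cost_per_item_usd": (0.023, 0.008),
    "avg_latency_s": (2.1, 1.4),
}

# Relative change of Professional vs. Enterprise, as a signed percentage.
for name, (enterprise, professional) in metrics.items():
    pct = (professional - enterprise) / enterprise * 100
    print(f"{name}: {pct:+.0f}%")
```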
Decision Framework
Use the comparison data to make an informed decision:

| Condition | Action |
|---|---|
| Quality scores equivalent, one configuration is cheaper or faster | Switch to the cheaper/faster configuration |
| One configuration scores higher on your most important evaluator | Keep the higher-quality configuration if the cost difference is acceptable |
| Scores are mixed (one wins on correctness, the other on conciseness) | Prioritize the evaluator that matters most for your use case |
| Quality drops below your pass threshold (e.g., 0.7) | Do not switch — the cost savings aren’t worth the quality loss |
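Encoding the framework as a simple guard makes the decision repeatable across test cycles. A quality-only sketch, assuming a pass threshold of 0.7 and `answer_correctness` as the primary evaluator (both are assumptions to adapt, and cost should be folded in per the first row of the table):

```python
PASS_THRESHOLD = 0.7  # assumed quality floor; below this, never switch

def should_switch(current: dict, candidate: dict,
                  primary: str = "answer_correctness") -> bool:
    """Decide whether to adopt the candidate configuration.

    `current` and `candidate` map evaluator names to scores.
    """
    # Never switch if the candidate drops below the pass threshold anywhere.
    if any(score < PASS_THRESHOLD for score in candidate.values()):
        return False
    # With mixed results, let the primary evaluator break the tie.
    return candidate[primary] >= current[primary]

# Example with the scores from this cookbook's comparison:
enterprise = {"answer_correctness": 0.95, "conciseness": 0.80}
professional = {"answer_correctness": 0.89, "conciseness": 0.88}
print(should_switch(enterprise, professional))  # False: correctness is primary
```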
Continuous A/B Testing Strategy
Run A/B tests regularly as part of your development workflow:
- Before model upgrades — Compare the new model against your current one before switching in production
- After prompt changes — Measure the impact of prompt modifications across all quality dimensions
- When optimizing cost — Verify that a cheaper configuration maintains acceptable quality
- For tier validation — Confirm that premium tiers deliver measurably better results than lower tiers
See Also
Multi-Tenant Cost Tracking
Set up the tier-based meeting summarization pipeline this cookbook evaluates
Evaluation Overview
Deep dive into Netra’s evaluation framework: datasets, evaluators, and test runs
Evaluating Agent Decisions
Evaluate tool selection, escalation, and workflow completion in agents
