Open in Google Colab to run the complete notebook in your browser.
What You’ll Learn
- Per-Tier Evaluation: Measure quality across different pricing tiers using the same test dataset
- A/B Test Models: Compare model performance (GPT-4 vs GPT-4-turbo vs GPT-3.5) with controlled experiments
- Quality-Cost Analysis: Make data-driven decisions about tier configurations based on quality and cost metrics
- Trace Comparison: Use Netra’s comparison tools to analyze differences between configurations
Prerequisites:
- Python >=3.10, <3.14 or Node.js 18+
- OpenAI API key
- Netra API key (Get started here)
- A test dataset with expected outputs
Why A/B Test AI Configurations?
Different customer tiers often use different models to balance quality and cost:
| Tier | Typical Model | Question |
|---|---|---|
| Enterprise | GPT-4 | Is the quality worth 10x the cost? |
| Professional | GPT-4-turbo | Could we downgrade to save costs? |
| Starter | GPT-3.5-turbo | Should we upgrade to improve retention? |
Common A/B Testing Scenarios
| Scenario | What to Compare | Success Metric |
|---|---|---|
| Model upgrade | GPT-3.5 → GPT-4-turbo | Quality improvement vs. cost increase |
| Prompt optimization | Original vs. revised prompt | Quality with same model |
| Parameter tuning | temperature=0.1 vs 0.3 | Consistency vs. creativity |
| Tier validation | Enterprise vs. Professional output | Quality gap justifies price gap |
Setting Up the Experiment
Tier Configuration
First, define the configurations you want to compare:
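For example, in Python the tier configurations might look like the sketch below; the tier names, models, and parameter values are illustrative assumptions, not settings taken from the notebook:

```python
# Illustrative tier configurations (assumed tiers, models, and parameters;
# adjust these to match your own pricing tiers).
TIER_CONFIGS = {
    "enterprise": {"model": "gpt-4", "temperature": 0.1, "max_tokens": 1024},
    "professional": {"model": "gpt-4-turbo", "temperature": 0.2, "max_tokens": 768},
    "starter": {"model": "gpt-3.5-turbo", "temperature": 0.3, "max_tokens": 512},
}
```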
Test Dataset
Create a consistent test dataset that you’ll run against all configurations:
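A minimal sketch of such a dataset, assuming the meeting-summary scenario used later in this guide; the field names (`transcript`, `expected_summary`, `required_fields`) are assumptions, not a required schema:

```python
# A small, fixed dataset reused against every configuration so results stay comparable.
TEST_CASES = [
    {
        "id": "standup-01",
        "transcript": "Alice: shipped the billing fix. Bob: still blocked on the API review...",
        "expected_summary": "Billing fix shipped; Bob is blocked on the API review.",
        "required_fields": ["decisions", "action_items", "blockers"],
    },
    {
        "id": "planning-02",
        "transcript": "Team agreed to move the launch to March 3rd. Marketing will update the plan...",
        "expected_summary": "Launch moved to March 3rd; marketing to update the plan.",
        "required_fields": ["decisions", "action_items"],
    },
]
```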
Creating Evaluators
Set up evaluators in Netra to measure quality consistently across configurations.
Using LLM-as-Judge Templates
Navigate to Evaluation → Evaluators and add these evaluators from the Library:
| Evaluator | Purpose | Pass Criteria |
|---|---|---|
| Answer Correctness | Compare output against expected summary | score >= 0.7 |
| Conciseness | Ensure outputs are appropriately brief | score >= 0.7 |
| Completeness | Check that all required fields are present | score >= 0.8 |
Custom Tier Completeness Evaluator
Create a code evaluator that validates outputs based on tier requirements:
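The exact signature of a Netra code evaluator may differ from what is shown here; the sketch below assumes a simple function that receives the model output and the tier name and returns a score between 0 and 1, counting a field as present only when it has at least one entry:

```python
import json

# Assumed tier requirements; the (output, tier) signature and the score/reason
# return shape are assumptions about the code-evaluator interface, not Netra's
# documented API.
REQUIRED_FIELDS = {
    "enterprise": ["decisions", "action_items", "blockers", "sentiment"],
    "professional": ["decisions", "action_items", "blockers"],
    "starter": ["decisions", "action_items"],
}

def tier_completeness(output: str, tier: str) -> dict:
    """Score how many tier-required fields the JSON output actually contains."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return {"score": 0.0, "reason": "output is not valid JSON"}

    required = REQUIRED_FIELDS[tier]
    # A field counts as present only if it exists and has at least 1 entry.
    present = [f for f in required if len(data.get(f, [])) >= 1]
    score = len(present) / len(required)
    return {"score": score, "reason": f"{len(present)}/{len(required)} required fields present"}
```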
Running the A/B Test
Per-Tier Evaluation
Run the same test cases against each tier configuration:
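A sketch of that loop using the OpenAI Python SDK; the system prompt and the `experiment`/`tier` tags are illustrative, and how those tags reach Netra depends on your instrumentation:

```python
from openai import OpenAI

client = OpenAI()  # uses OPENAI_API_KEY from the environment

def run_tier_evaluation(tier: str, config: dict, test_cases: list[dict]) -> list[dict]:
    """Run every test case against one tier configuration and collect outputs."""
    results = []
    for case in test_cases:
        response = client.chat.completions.create(
            model=config["model"],
            temperature=config["temperature"],
            max_tokens=config["max_tokens"],
            messages=[
                {"role": "system", "content": "Summarize the meeting as JSON with decisions, action_items, and blockers."},
                {"role": "user", "content": case["transcript"]},
            ],
        )
        results.append({
            "case_id": case["id"],
            "tier": tier,                      # tag used later for filtering and comparison
            "experiment": "tier-evaluation",   # illustrative experiment name
            "output": response.choices[0].message.content,
            "usage": response.usage.total_tokens,
        })
    return results

all_results = {
    tier: run_tier_evaluation(tier, config, TEST_CASES)
    for tier, config in TIER_CONFIGS.items()
}
```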
Comparing Specific Models
To A/B test a potential model upgrade for a specific tier:
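Building on the sketch above, a Starter-tier upgrade test can reuse the same helper with only the model changed; the `model-ab-test` tag matches the experiment filter used in the trace-comparison step below:

```python
# A/B test a candidate upgrade for the Starter tier: same prompts, same test
# cases, only the model changes between the two arms.
CANDIDATES = {
    "control": {**TIER_CONFIGS["starter"]},                          # gpt-3.5-turbo
    "variant": {**TIER_CONFIGS["starter"], "model": "gpt-4-turbo"},  # proposed upgrade
}

ab_results = {}
for arm, config in CANDIDATES.items():
    results = run_tier_evaluation("starter", config, TEST_CASES)
    for r in results:
        r["experiment"] = "model-ab-test"  # filterable tag for the comparison view
        r["variant"] = arm
    ab_results[arm] = results
```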
Using Trace Comparison
Netra’s trace comparison feature lets you analyze A/B test results visually.
Comparing Traces in the Dashboard
- Navigate to Observability → Traces
- Filter by `experiment = model-ab-test`
- Select one trace from each model variant
- Click Compare
A typical comparison might look like this:
| Metric | GPT-3.5-turbo | GPT-4-turbo | Delta |
|---|---|---|---|
| Latency | 800ms | 1200ms | +50% |
| Cost | $0.002 | $0.008 | +300% |
| Tokens | 450 | 520 | +16% |
Running Evaluations on Test Results
Connect your A/B test to Netra’s evaluation framework:
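The exact calls depend on Netra’s client library, so the sketch below uses hypothetical names (`NetraClient`, `create_test_run`, `add_result`) purely to illustrate the shape of the integration: upload each output with its tier and experiment tags, and attach the evaluators created earlier by name.

```python
# HYPOTHETICAL sketch: NetraClient, create_test_run, and add_result are placeholder
# names, not Netra's actual SDK. Check the Netra docs for the real client; the point
# here is the shape of the data you submit.

def submit_test_run(client, name: str, results: list[dict], evaluators: list[str]):
    """Upload A/B test outputs as a test run and attach evaluators by name."""
    run = client.create_test_run(name=name, evaluators=evaluators)
    for r in results:
        run.add_result(
            input=r["case_id"],
            output=r["output"],
            metadata={"tier": r["tier"], "experiment": r["experiment"], "variant": r.get("variant")},
        )
    return run

# client = NetraClient(api_key=os.environ["NETRA_API_KEY"])
# submit_test_run(client, "model-ab-test", ab_results["variant"],
#                 evaluators=["Answer Correctness", "Conciseness", "Completeness"])
```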
Analyzing Results
Quality vs. Cost Analysis
After running evaluations, compare results in Evaluation → Test Runs:
| Tier | Quality Score | Avg Cost | Cost per Quality Point |
|---|---|---|---|
| Enterprise | 94% | $0.023 | $0.024 |
| Professional | 89% | $0.012 | $0.013 |
| Starter | 76% | $0.002 | $0.003 |
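For reference, the last column is simply the average cost divided by the quality score expressed as a fraction, as in this quick check:

```python
# Cost per quality point = average cost / quality score (as a fraction).
def cost_per_quality_point(avg_cost: float, quality: float) -> float:
    return avg_cost / quality

print(round(cost_per_quality_point(0.023, 0.94), 3))  # Enterprise: 0.024
print(round(cost_per_quality_point(0.012, 0.89), 3))  # Professional: 0.013
print(round(cost_per_quality_point(0.002, 0.76), 3))  # Starter: 0.003
```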
Decision Framework
Use this framework to decide on tier changes: Should you upgrade Starter from GPT-3.5 to GPT-4-turbo?
| Factor | GPT-3.5 | GPT-4-turbo | Verdict |
|---|---|---|---|
| Quality | 76% | 89% | +13 pts improvement |
| Cost | $0.002 | $0.008 | 4x increase |
| Customer Price | $0.01/meeting | $0.01/meeting | No change |
| Margin Impact | $0.008 | $0.002 | 75% margin reduction |
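The margin arithmetic behind the last row, assuming the $0.01-per-meeting customer price from the table:

```python
# Margin per meeting = customer price - model cost (values from the table above).
price = 0.01
margin_gpt35 = price - 0.002          # $0.008 with GPT-3.5
margin_gpt4t = price - 0.008          # $0.002 with GPT-4-turbo
reduction = (margin_gpt35 - margin_gpt4t) / margin_gpt35
print(f"Margin reduction: {reduction:.0%}")  # Margin reduction: 75%
```

Whether that trade-off is acceptable depends on business metrics such as retention and revenue, as noted in the takeaways below.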
Summary
You’ve learned how to systematically A/B test AI configurations:
- Per-tier evaluation measures quality across pricing segments
- Model comparisons use controlled experiments with consistent test data
- Trace comparison visualizes differences in Netra’s dashboard
- Quality-cost analysis supports data-driven tier decisions
Key Takeaways
- Always use the same test dataset when comparing configurations
- Tag experiments with custom attributes for easy filtering
- Consider quality, cost, AND latency in your analysis
- Connect A/B tests to business metrics (retention, revenue) for final decisions