This cookbook shows you how to systematically compare AI configurations across different customer segments—testing models, prompts, and parameters to optimize quality-cost tradeoffs per pricing tier.

Open in Google Colab

Run the complete notebook in your browser

What You’ll Learn

  • Per-Tier Evaluation: Measure quality across different pricing tiers using the same test dataset
  • A/B Test Models: Compare model performance (GPT-4 vs GPT-4-turbo vs GPT-3.5) with controlled experiments
  • Quality-Cost Analysis: Make data-driven decisions about tier configurations based on quality and cost metrics
  • Trace Comparison: Use Netra’s comparison tools to analyze differences between configurations

Prerequisites:
  • Python >=3.10, <3.14 or Node.js 18+
  • OpenAI API key
  • Netra API key (Get started here)
  • A test dataset with expected outputs

Why A/B Test AI Configurations?

Different customer tiers often use different models to balance quality and cost:
| Tier | Typical Model | Question |
| --- | --- | --- |
| Enterprise | GPT-4 | Is the quality worth 10x the cost? |
| Professional | GPT-4-turbo | Could we downgrade to save costs? |
| Starter | GPT-3.5-turbo | Should we upgrade to improve retention? |
Systematic A/B testing answers these questions with data, not guesses.

Common A/B Testing Scenarios

| Scenario | What to Compare | Success Metric |
| --- | --- | --- |
| Model upgrade | GPT-3.5 → GPT-4-turbo | Quality improvement vs. cost increase |
| Prompt optimization | Original vs. revised prompt | Quality with same model |
| Parameter tuning | temperature=0.1 vs 0.3 | Consistency vs. creativity |
| Tier validation | Enterprise vs. Professional output | Quality gap justifies price gap |
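
The parameter-tuning scenario is quick to try before committing to a full evaluation run. The sketch below sends the same prompt at two temperatures so you can eyeball the difference in consistency; the helper name and sample prompt are illustrative and not part of this cookbook's notebook:
from openai import OpenAI

client = OpenAI()

def compare_temperatures(prompt: str, temperatures=(0.1, 0.3), model="gpt-3.5-turbo"):
    """Run the same prompt at each temperature and return the outputs."""
    outputs = {}
    for temp in temperatures:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=temp,
        )
        outputs[temp] = response.choices[0].message.content
    return outputs

# Example: compare_temperatures("Summarize: the dashboard ships in March.")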

Setting Up the Experiment

Tier Configuration

First, define the configurations you want to compare:
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class TierConfig:
    """Configuration for a pricing tier."""
    tier: str
    model: str
    features: List[str]
    temperature: float = 0.1

# Configurations to compare
TIER_CONFIGS = {
    "enterprise": TierConfig(
        tier="enterprise",
        model="gpt-4",
        features=["summary", "action_items", "decisions"],
        temperature=0.1
    ),
    "professional": TierConfig(
        tier="professional",
        model="gpt-4-turbo",
        features=["summary", "action_items"],
        temperature=0.1
    ),
    "starter": TierConfig(
        tier="starter",
        model="gpt-3.5-turbo",
        features=["summary"],
        temperature=0.1
    ),
}

# Model pricing (per 1M tokens)
MODEL_PRICING = {
    "gpt-4": {"input": 30.0, "output": 60.0},
    "gpt-4-turbo": {"input": 10.0, "output": 30.0},
    "gpt-3.5-turbo": {"input": 0.5, "output": 1.5},
}
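
A small helper turns these per-1M-token prices into a per-request cost. This is a sketch based on the MODEL_PRICING table above, not code from the notebook:
def estimate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Return the USD cost of one request, using the MODEL_PRICING table above."""
    pricing = MODEL_PRICING[model]
    return (
        (prompt_tokens / 1_000_000) * pricing["input"]
        + (completion_tokens / 1_000_000) * pricing["output"]
    )

# Example: a gpt-4 call with 500 prompt tokens and 200 completion tokens
# estimate_cost("gpt-4", 500, 200)  # (500/1e6)*30 + (200/1e6)*60 = 0.027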

Test Dataset

Create a consistent test dataset that you’ll run against all configurations:
# Sample test cases with expected outputs
TEST_CASES = [
    {
        "input": """
        Meeting: Q4 Planning Session
        Attendees: Alice (PM), Bob (Engineering), Carol (Design)

        Alice: Let's review Q4 priorities. The dashboard must ship by March.
        Bob: Backend is 80% done. Need two more weeks for API endpoints.
        Carol: Designs will be ready by Friday.
        Alice: Bob, also look into last week's performance issues.
        Bob: I'll fix those by Wednesday.
        """,
        "expected_summary": "Q4 planning meeting discussed dashboard launch timeline for March, with backend at 80% completion and designs due Friday.",
        "expected_action_items": ["Bob: Complete API endpoints (2 weeks)", "Carol: Finalize designs (Friday)", "Bob: Fix performance issues (Wednesday)"],
    },
    # Add more test cases...
]
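
If your test cases live in a file rather than inline, a small loader keeps every configuration running against the exact same inputs. The file name test_cases.jsonl below is an assumption for illustration:
import json

def load_test_cases(path: str = "test_cases.jsonl") -> list[dict]:
    """Load one JSON test case per line from a JSON Lines file (path is an assumed name)."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

# TEST_CASES.extend(load_test_cases())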

Creating Evaluators

Set up evaluators in Netra to measure quality consistently across configurations.

Using LLM-as-Judge Templates

Navigate to Evaluation → Evaluators and add these evaluators from the Library:
| Evaluator | Purpose | Pass Criteria |
| --- | --- | --- |
| Answer Correctness | Compare output against expected summary | score >= 0.7 |
| Conciseness | Ensure outputs are appropriately brief | score >= 0.7 |
| Completeness | Check that all required fields are present | score >= 0.8 |
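
These templates run inside Netra, but the idea is easy to prototype locally before you configure them. The sketch below approximates the Answer Correctness check with an OpenAI judge call; it is not Netra's implementation:
from openai import OpenAI

judge_client = OpenAI()

def judge_correctness(output: str, expected: str, model: str = "gpt-4-turbo") -> float:
    """Ask a judge model for a 0-1 score of how well output matches expected."""
    prompt = (
        "Score how well the candidate summary matches the expected summary "
        "on a scale from 0 to 1. Respond with only the number.\n\n"
        f"Expected: {expected}\nCandidate: {output}"
    )
    response = judge_client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    text = (response.choices[0].message.content or "").strip()
    try:
        return float(text)
    except ValueError:
        return 0.0  # treat an unparseable judge reply as a failing score

# A score >= 0.7 would satisfy the pass criteria in the table above.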

Custom Tier Completeness Evaluator

Create a code evaluator that validates outputs based on tier requirements:
// handler function validates output has all tier-required fields
function handler(input, output, expectedOutput) {
    let result;
    try {
        result = JSON.parse(output);
    } catch {
        return 0; // Fail if not valid JSON
    }

    // Get tier from test metadata
    const tier = expectedOutput?.tier || "starter";

    const required = {
        "enterprise": ["summary", "action_items", "decisions"],
        "professional": ["summary", "action_items"],
        "starter": ["summary"],
    };

    const requiredKeys = required[tier] || ["summary"];
    const hasAllKeys = requiredKeys.every(key => key in result);

    return hasAllKeys ? 1 : 0;
}
Set Output Type to Numerical and Pass Criteria to >= 1.
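
Before pasting the evaluator into Netra, you can spot-check the same logic locally. Here is a rough Python mirror of the handler above; it is for local prototyping only, not how Netra executes the evaluator:
import json

REQUIRED_FIELDS = {
    "enterprise": ["summary", "action_items", "decisions"],
    "professional": ["summary", "action_items"],
    "starter": ["summary"],
}

def tier_completeness(output: str, tier: str = "starter") -> int:
    """Return 1 if the JSON output contains every field the tier requires, else 0."""
    try:
        result = json.loads(output)
    except json.JSONDecodeError:
        return 0
    required = REQUIRED_FIELDS.get(tier, ["summary"])
    return int(all(key in result for key in required))

# tier_completeness('{"summary": "...", "action_items": []}', "professional")  # -> 1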

Running the A/B Test

Per-Tier Evaluation

Run the same test cases against each tier configuration:
from netra import Netra
from netra.instrumentation.instruments import InstrumentSet
from openai import OpenAI
import json
import time

# Initialize Netra
Netra.init(
    app_name="ab-testing",
    environment="evaluation",
    trace_content=True,
    instruments={InstrumentSet.OPENAI},
)

openai_client = OpenAI()

def run_tier_test(tier: str, test_input: str) -> dict:
    """Run a single test case against a tier configuration."""
    config = TIER_CONFIGS[tier]

    # Set context for filtering
    Netra.set_custom_attributes(key="tier", value=tier)
    Netra.set_custom_attributes(key="model", value=config.model)
    Netra.set_custom_attributes(key="experiment", value="tier-comparison")

    start_time = time.time()

    # Build prompt based on tier features
    feature_instructions = []
    if "summary" in config.features:
        feature_instructions.append("- Summary: 2-3 sentence overview")
    if "action_items" in config.features:
        feature_instructions.append("- Action Items: list with owners")
    if "decisions" in config.features:
        feature_instructions.append("- Decisions: key decisions made")

    prompt = f"""Extract the following from this meeting transcript:
{chr(10).join(feature_instructions)}

Transcript:
{test_input}

Respond in JSON format."""

    response = openai_client.chat.completions.create(
        model=config.model,
        messages=[
            {"role": "system", "content": "You are a meeting analyst. Respond with valid JSON only."},
            {"role": "user", "content": prompt}
        ],
        temperature=config.temperature,
    )

    latency_ms = (time.time() - start_time) * 1000
    prompt_tokens = response.usage.prompt_tokens
    completion_tokens = response.usage.completion_tokens

    # Calculate cost
    pricing = MODEL_PRICING[config.model]
    cost = (prompt_tokens / 1_000_000) * pricing["input"] + \
           (completion_tokens / 1_000_000) * pricing["output"]

    return {
        "tier": tier,
        "model": config.model,
        "output": response.choices[0].message.content,
        "latency_ms": latency_ms,
        "cost_usd": cost,
        "tokens": prompt_tokens + completion_tokens,
    }

def run_tier_comparison():
    """Compare all tiers against the test dataset."""
    results = {tier: [] for tier in TIER_CONFIGS.keys()}

    for i, test_case in enumerate(TEST_CASES, start=1):
        print(f"\nRunning test case {i}...")
        for tier in TIER_CONFIGS.keys():
            result = run_tier_test(tier, test_case["input"])
            results[tier].append(result)
            print(f"  {tier}: {result['latency_ms']:.0f}ms, ${result['cost_usd']:.6f}")

    return results

# Run the comparison
results = run_tier_comparison()
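
With the raw results in hand, a short aggregation makes the per-tier averages easier to compare. This summary helper is a sketch built on the result dicts above, not part of the notebook:
def summarize_results(results: dict) -> None:
    """Print average latency and cost per tier from run_tier_comparison output."""
    for tier, runs in results.items():
        if not runs:
            continue
        avg_latency = sum(r["latency_ms"] for r in runs) / len(runs)
        avg_cost = sum(r["cost_usd"] for r in runs) / len(runs)
        print(f"{tier}: avg latency {avg_latency:.0f}ms, avg cost ${avg_cost:.6f}")

summarize_results(results)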

Comparing Specific Models

To A/B test a potential model upgrade for a specific tier:
def run_model_ab_test(test_input: str, model_a: str, model_b: str):
    """Compare two models head-to-head."""
    results = {}

    for model in [model_a, model_b]:
        Netra.set_custom_attributes(key="experiment", value="model-ab-test")
        Netra.set_custom_attributes(key="model_variant", value=model)

        start_time = time.time()

        response = openai_client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": "Summarize the meeting concisely."},
                {"role": "user", content: test_input}
            ],
            temperature=0.1,
        )

        latency_ms = (time.time() - start_time) * 1000
        pricing = MODEL_PRICING[model]
        cost = (response.usage.prompt_tokens / 1_000_000) * pricing["input"] + \
               (response.usage.completion_tokens / 1_000_000) * pricing["output"]

        results[model] = {
            "output": response.choices[0].message.content,
            "latency_ms": latency_ms,
            "cost_usd": cost,
        }

    return results

# Test upgrading Starter tier from GPT-3.5 to GPT-4-turbo
ab_results = run_model_ab_test(
    TEST_CASES[0]["input"],
    model_a="gpt-3.5-turbo",
    model_b="gpt-4-turbo"
)

print("\n=== Model A/B Test Results ===")
for model, data in ab_results.items():
    print(f"{model}: {data['latency_ms']:.0f}ms, ${data['cost_usd']:.6f}")

Using Trace Comparison

Netra’s trace comparison feature lets you analyze A/B test results visually.

Comparing Traces in the Dashboard

  1. Navigate to Observability → Traces
  2. Filter by experiment = model-ab-test
  3. Select one trace from each model variant
  4. Click Compare
You’ll see a side-by-side comparison of:
| Metric | GPT-3.5-turbo | GPT-4-turbo | Delta |
| --- | --- | --- | --- |
| Latency | 800ms | 1200ms | +50% |
| Cost | $0.002 | $0.008 | +300% |
| Tokens | 450 | 520 | +16% |

Running Evaluations on Test Results

Connect your A/B test to Netra’s evaluation framework:
# Get dataset ID from Netra dashboard
DATASET_ID = "your-ab-test-dataset-id"

# Run evaluation for each configuration
for tier in TIER_CONFIGS.keys():
    dataset = Netra.evaluation.get_dataset(DATASET_ID)

    Netra.evaluation.run_test_suite(
        name=f"Tier Evaluation - {tier}",
        data=dataset,
        # Bind tier as a default argument so each suite uses its own tier
        task=lambda eval_input, tier=tier: run_tier_test(tier, eval_input["input"])["output"]
    )

print("Evaluations complete! View results in Netra dashboard.")
Netra.shutdown()

Analyzing Results

Quality vs. Cost Analysis

After running evaluations, compare results in Evaluation → Test Runs:
| Tier | Quality Score | Avg Cost | Cost per Quality Point |
| --- | --- | --- | --- |
| Enterprise | 94% | $0.023 | $0.024 |
| Professional | 89% | $0.012 | $0.013 |
| Starter | 76% | $0.002 | $0.003 |
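
The Cost per Quality Point column is the average cost divided by the quality score expressed as a fraction. The quick check below reproduces the table's figures; the dict is hard-coded from the table, not pulled from Netra:
tier_metrics = {
    "enterprise": {"quality": 0.94, "avg_cost": 0.023},
    "professional": {"quality": 0.89, "avg_cost": 0.012},
    "starter": {"quality": 0.76, "avg_cost": 0.002},
}

for tier, m in tier_metrics.items():
    # cost per quality point = average cost / quality score
    print(f"{tier}: ${m['avg_cost'] / m['quality']:.3f} per quality point")
# enterprise: $0.024, professional: $0.013, starter: $0.003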

Decision Framework

Use this framework to decide on tier changes. For example: should you upgrade Starter from GPT-3.5 to GPT-4-turbo?
| Factor | GPT-3.5 | GPT-4-turbo | Verdict |
| --- | --- | --- | --- |
| Quality | 76% | 89% | +13% improvement |
| Cost | $0.002 | $0.008 | 4x increase |
| Customer Price | $0.01/meeting | $0.01/meeting | No change |
| Margin Impact | $0.008 | $0.002 | 75% margin reduction |
Recommendation: If quality improvement drives higher retention, the margin reduction may be worth it. Run a cohort analysis on customer retention vs. quality scores to make the final decision.
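
The margin row follows directly from the customer price and the per-meeting model cost. A quick calculation, using the table's figures, makes the tradeoff explicit:
customer_price = 0.01                       # price charged per meeting
cost_gpt35, cost_gpt4_turbo = 0.002, 0.008  # model cost per meeting

margin_before = customer_price - cost_gpt35       # 0.008
margin_after = customer_price - cost_gpt4_turbo   # 0.002
reduction = 1 - margin_after / margin_before      # 0.75

print(f"Margin: ${margin_before:.3f} -> ${margin_after:.3f} ({reduction:.0%} reduction)")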

Summary

You’ve learned how to systematically A/B test AI configurations:
  • Per-tier evaluation measures quality across pricing segments
  • Model comparisons use controlled experiments with consistent test data
  • Trace comparison visualizes differences in Netra’s dashboard
  • Quality-cost analysis supports data-driven tier decisions

Key Takeaways

  1. Always use the same test dataset when comparing configurations
  2. Tag experiments with custom attributes for easy filtering
  3. Consider quality, cost, AND latency in your analysis
  4. Connect A/B tests to business metrics (retention, revenue) for final decisions
