# A/B Testing Model Configurations

> A/B test AI model configurations with Netra's evaluation framework. Compare prompts, models, and parameters by running the same dataset against each setup.

In the [Multi-Tenant Cost Tracking](/Cookbooks/observability/multi-tenant-cost-tracking) cookbook, you set up tier-based configurations for a meeting summarization pipeline: Enterprise on GPT-4, Professional on GPT-4-turbo, and Starter on GPT-3.5-turbo. But how do you know whether the Enterprise tier's output is better by a wide enough margin to justify the cost? Without structured evaluation, you're guessing.

This cookbook walks you through the next step: using Netra's evaluation framework to A/B test those configurations. You'll run the same test cases against two tiers, score both with the same evaluators, and compare results side by side to make a data-driven decision.

**Prerequisite:** You need a Netra API key ([Get started here](/quick-start/Overview)) and the meeting summarization pipeline from the [Multi-Tenant Cost Tracking](/Cookbooks/observability/multi-tenant-cost-tracking) cookbook. The code below reuses the `MultiTenantMeetingSummarizer` class and tenant configurations from that cookbook.

## What You'll Learn

- Create test cases that both configurations will be evaluated against
- Set up evaluators for answer correctness and conciseness
- Trigger separate evaluation runs for each configuration via the SDK
- Interpret scores across runs to make data-driven configuration decisions

***

## Why A/B Test AI Configurations?

Different configurations serve different trade-offs.
Systematic A/B testing answers questions like the following with data:

| Scenario                | What to Compare                          | What You'll Learn                                        |
| ----------------------- | ---------------------------------------- | -------------------------------------------------------- |
| **Model upgrade**       | GPT-3.5-turbo vs GPT-4-turbo             | Does the quality improvement justify the cost increase?  |
| **Prompt optimization** | Original prompt vs revised prompt        | Does the new prompt improve quality with the same model? |
| **Parameter tuning**    | temperature=0.1 vs temperature=0.3       | Which setting produces more consistent results?          |
| **Tier validation**     | Enterprise config vs Professional config | Does the quality gap justify the price gap?              |

Netra's evaluation framework makes this straightforward: create one dataset, run it against each configuration as a separate [Test Run](/Evaluation/TestRuns), and compare evaluator scores in the dashboard. See the [Evaluation Overview](/Evaluation/Evaluation-overview) for a deeper look at the framework.

***

Now, let's walk through the process of A/B testing two configurations.

## Step 1: Create Evaluators

You need two evaluators from the library.

### Answer Correctness (Library)

Go to **Evaluation → Evaluators**, switch to the **Library** tab, and add **Answer Correctness** from the Quality category.

### Conciseness (Library)

On the same **Library** tab, add **Conciseness** from the Quality category.

| Evaluator              | What It Measures                                                           |
| ---------------------- | -------------------------------------------------------------------------- |
| **Answer Correctness** | Is the generated output factually correct compared to the expected output? |
| **Conciseness**        | Is the output appropriately brief without losing key information?          |

You can test each evaluator in the **Playground** before using it in a dataset. See [Evaluators](/Evaluation/Evaluators) for the full reference.
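
Once both test runs have been scored by these two evaluators, comparing configurations comes down to comparing per-evaluator averages. The sketch below shows that arithmetic in plain Python. It does not call the Netra SDK: the `scores` dictionary, the 0.0 to 1.0 scale, and the helper names `summarize` and `compare` are assumptions made for illustration, and the numbers are placeholders rather than real results.

```python
from statistics import mean

# Hypothetical per-test-case scores on an assumed 0.0-1.0 scale.
# Placeholder values only; in practice, read them from each Test Run.
scores = {
    "enterprise": {
        "answer_correctness": [1.0, 0.75, 0.5, 0.75],
        "conciseness": [0.5, 0.5, 0.75, 0.25],
    },
    "professional": {
        "answer_correctness": [0.75, 0.5, 0.5, 0.25],
        "conciseness": [0.75, 0.75, 0.5, 1.0],
    },
}


def summarize(run_scores):
    """Return the mean score per evaluator for a single run."""
    return {evaluator: mean(values) for evaluator, values in run_scores.items()}


def compare(baseline, candidate):
    """Return candidate mean minus baseline mean for each evaluator."""
    base = summarize(baseline)
    cand = summarize(candidate)
    return {evaluator: cand[evaluator] - base[evaluator] for evaluator in base}


if __name__ == "__main__":
    # Positive delta: the Enterprise run scored higher on that evaluator.
    deltas = compare(scores["professional"], scores["enterprise"])
    for evaluator, delta in deltas.items():
        print(f"{evaluator}: {delta:+.2f}")
```

Keeping every evaluator as its own delta, rather than folding them into one number, makes trade-offs visible: a configuration can gain on correctness while losing on conciseness, and that is the kind of gap the "Tier validation" scenario asks you to weigh against price.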