Agent changes are risky. A prompt tweak that improves refund conversations might break order status inquiries. A model upgrade that reduces cost might degrade conversational memory. Without a systematic way to compare agent versions across multi-turn scenarios, you’re deploying blind. This cookbook shows you how to use simulation as a regression testing workflow: establish a baseline dataset with all 8 evaluators, run simulations before and after agent changes, compare results across test runs, and decide whether to deploy.
Prerequisite: You should be familiar with Netra’s simulation workflow. If this is your first simulation cookbook, start with Simulating Customer Support Conversations.

What You’ll Learn

Build a Regression Suite: Create a reusable dataset with diverse scenarios, personas, and all 8 evaluators

Establish a Baseline: Run your current agent version to set performance benchmarks

Compare Agent Versions: Run the same dataset against a modified agent and compare scores side by side

Make Deploy/No-Deploy Decisions: Use evaluator scores, exit reasons, and cost data to decide whether a change is safe to ship

Why Regression Testing Matters for Agents

The A/B Testing Configurations cookbook covers single-turn model comparison. But conversational agents need multi-turn regression testing — a change that improves individual responses might break behavior across a full conversation:
| Change Type | Single-Turn Impact | Multi-Turn Impact |
| --- | --- | --- |
| Prompt update | Responses may improve in tone or accuracy | May break conversation memory or guideline adherence across turns |
| Model upgrade | Faster or cheaper responses | May change how the agent handles frustrated users or manages context |
| Tool addition | New capability available | Agent may overuse the new tool or forget the existing conversation flow |
| Temperature change | More or less creative individual responses | May affect consistency across a multi-turn session |
Simulation regression testing catches these cross-turn impacts before they reach users.

Step 1: Create a Comprehensive Dataset

Build a regression test suite with diverse scenarios that cover your agent’s core use cases. Use all 8 evaluators for comprehensive coverage.

Select All 8 Evaluators

Go to Evaluation → Evaluators, switch to the Library tab, and filter by Multi turn. Add all eight evaluators.

Quality evaluators:

| Evaluator | What It Measures |
| --- | --- |
| Guideline Adherence | Did the agent follow its instructions throughout the conversation? |
| Conversation Completeness | Were all of the user’s intents addressed? |
| Profile Utilization | Did the agent use the provided user profile to adapt its responses? |
| Conversational Flow | Did the conversation progress logically without stalling or looping? |
| Conversation Memory | Did the agent remember and correctly reference information from earlier turns? |
| Factual Accuracy | Did the agent communicate facts correctly, consistent with the provided reference data? |
Agentic evaluators:
| Evaluator | What It Measures |
| --- | --- |
| Goal Fulfillment | Did the conversation achieve the user’s stated objective? |
| Information Elicitation | Did the agent effectively gather the required information from the user? |

Create Diverse Scenarios

Go to Evaluation → Datasets, click Create Dataset, and select Multi-turn. Name it “Agent Regression Suite” and add 4-5 scenarios that cover different situations:
| # | Scenario Goal | Persona | Max Turns | Key Features |
| --- | --- | --- | --- | --- |
| 1 | Get a refund for a damaged product | Frustrated | 5 | Fact checkers for refund timeline and method |
| 2 | Understand the return policy for an online order | Confused | 6 | User data with purchase details |
| 3 | Check order status and request expedited shipping | Neutral | 4 | Goal requires two actions in one conversation |
| 4 | Report a billing discrepancy and get it resolved | Friendly | 5 | Fact checkers for billing correction process |
| 5 | Get technical help setting up a product | Custom | 7 | Custom “Non-Technical User” persona, no fact checkers |
Attach all 8 evaluators to the dataset with the appropriate variable mappings.
This dataset is reusable. Once created, you run it against every agent version — the scenarios, personas, and evaluators stay the same. Only the agent changes.
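The dataset itself lives in the dashboard, but it can help to keep a plain-data mirror of the suite in your repository, so reviewers can see exactly what every agent version is tested against. A minimal sketch; the field names here are ours for illustration, not part of the Netra API:

```python
# Illustrative, version-controlled record of the regression suite.
# These dicts mirror the dashboard configuration; they are not consumed
# by the Netra SDK, which only needs the dataset_id at run time.
REGRESSION_SUITE = [
    {"goal": "Get a refund for a damaged product", "persona": "Frustrated", "max_turns": 5},
    {"goal": "Understand the return policy for an online order", "persona": "Confused", "max_turns": 6},
    {"goal": "Check order status and request expedited shipping", "persona": "Neutral", "max_turns": 4},
    {"goal": "Report a billing discrepancy and get it resolved", "persona": "Friendly", "max_turns": 5},
    {"goal": "Get technical help setting up a product", "persona": "Non-Technical User", "max_turns": 7},
]

# A quick sanity check that the mirror still matches what you expect.
print(f"{len(REGRESSION_SUITE)} scenarios, "
      f"max {max(s['max_turns'] for s in REGRESSION_SUITE)} turns")
```

If the suite changes in the dashboard, update this record in the same pull request so the test matrix stays reviewable alongside the agent code.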

Step 2: Run the Baseline (Version A)

Run the current agent version to establish performance benchmarks.
from netra import Netra
from netra.simulation.task import BaseTask
from netra.simulation.models import TaskResult
from openai import OpenAI
import uuid

Netra.init(app_name="agent-regression")
client = OpenAI()

conversations: dict[str, list] = {}

class AgentV1(BaseTask):
    """Current production agent — Version A (baseline)."""

    def run(self, message: str, session_id: str | None = None) -> TaskResult:
        session = session_id or str(uuid.uuid4())

        if session not in conversations:
            conversations[session] = [
                {
                    "role": "system",
                    "content": (
                        "You are a customer support agent. Help customers with "
                        "refunds, orders, billing, and technical setup. Be professional "
                        "and empathetic. Confirm resolution before ending."
                    ),
                }
            ]

        conversations[session].append({"role": "user", "content": message})

        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=conversations[session],
        )

        content = response.choices[0].message.content
        conversations[session].append({"role": "assistant", "content": content})

        return TaskResult(message=content, session_id=session)

# Run baseline
baseline = Netra.simulation.run_simulation(
    name="Regression — v1.0 Baseline",
    dataset_id="your-dataset-id",
    task=AgentV1(),
    context={"version": "v1.0", "model": "gpt-4o-mini"},
    max_concurrency=3,
)

print(f"Baseline: {len(baseline['completed'])} completed, {len(baseline['failed'])} failed")

Netra.shutdown()

Step 3: Make an Agent Change (Version B)

Modify your agent — update the system prompt, switch models, or add new capabilities. Here’s an example that upgrades the model and refines the prompt:
conversations_v2: dict[str, list] = {}

class AgentV2(BaseTask):
    """Updated agent — Version B (prompt refinement + model upgrade)."""

    def run(self, message: str, session_id: str | None = None) -> TaskResult:
        session = session_id or str(uuid.uuid4())

        if session not in conversations_v2:
            conversations_v2[session] = [
                {
                    "role": "system",
                    "content": (
                        "You are a customer support agent. Help customers with "
                        "refunds, orders, billing, and technical setup.\n\n"
                        "Guidelines:\n"
                        "- Be professional, empathetic, and concise.\n"
                        "- Always confirm the action taken before ending.\n"
                        "- If the customer seems confused, simplify your language.\n"
                        "- Reference earlier parts of the conversation when relevant.\n"
                        "- Check whether the customer has additional questions."
                    ),
                }
            ]

        conversations_v2[session].append({"role": "user", "content": message})

        response = client.chat.completions.create(
            model="gpt-4o",  # Upgraded from gpt-4o-mini
            messages=conversations_v2[session],
        )

        content = response.choices[0].message.content
        conversations_v2[session].append({"role": "assistant", "content": content})

        return TaskResult(message=content, session_id=session)

# Run updated agent against the same dataset
updated = Netra.simulation.run_simulation(
    name="Regression — v1.1 Prompt + Model Upgrade",
    dataset_id="your-dataset-id",  # Same dataset as baseline
    task=AgentV2(),
    context={"version": "v1.1", "model": "gpt-4o"},
    max_concurrency=3,
)

print(f"Updated: {len(updated['completed'])} completed, {len(updated['failed'])} failed")

Netra.shutdown()

Step 4: Compare Results

Go to Evaluation → Test Runs, filter by Multi turn, and open both runs side by side.

Build a Comparison Table

Pull the evaluator scores from each run and compare:
| Evaluator | v1.0 (Baseline) | v1.1 (Updated) | Change |
| --- | --- | --- | --- |
| Goal Fulfillment | 0.70 | 0.85 | +0.15 |
| Factual Accuracy | 0.80 | 0.80 | 0.00 |
| Guideline Adherence | 0.65 | 0.90 | +0.25 |
| Conversation Completeness | 0.60 | 0.80 | +0.20 |
| Conversation Memory | 0.55 | 0.75 | +0.20 |
| Conversational Flow | 0.70 | 0.85 | +0.15 |
| Profile Utilization | 0.50 | 0.70 | +0.20 |
| Information Elicitation | 0.60 | 0.65 | +0.05 |
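If you copy the per-evaluator scores out of the dashboard, a few lines of Python turn them into a delta report and flag any regression automatically. This is a hand-rolled sketch: the scores below are transcribed from the dashboard by hand, since the SDK result object reports execution status rather than evaluator scores:

```python
# Scores transcribed from the Test Runs dashboard for each version.
baseline_scores = {
    "Goal Fulfillment": 0.70, "Factual Accuracy": 0.80,
    "Guideline Adherence": 0.65, "Conversation Completeness": 0.60,
    "Conversation Memory": 0.55, "Conversational Flow": 0.70,
    "Profile Utilization": 0.50, "Information Elicitation": 0.60,
}
updated_scores = {
    "Goal Fulfillment": 0.85, "Factual Accuracy": 0.80,
    "Guideline Adherence": 0.90, "Conversation Completeness": 0.80,
    "Conversation Memory": 0.75, "Conversational Flow": 0.85,
    "Profile Utilization": 0.70, "Information Elicitation": 0.65,
}

# Per-evaluator deltas, and any evaluator that got worse.
deltas = {name: round(updated_scores[name] - baseline_scores[name], 2)
          for name in baseline_scores}
regressions = {name: d for name, d in deltas.items() if d < 0}

# Print worst movers first so regressions surface at the top.
for name, d in sorted(deltas.items(), key=lambda kv: kv[1]):
    print(f"{name:28s} {baseline_scores[name]:.2f} -> "
          f"{updated_scores[name]:.2f} ({d:+.2f})")
print("Regressions:", regressions or "none")
```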

Compare Exit Reasons

| Exit Reason | v1.0 | v1.1 |
| --- | --- | --- |
| Goal Achieved | 2/5 | 4/5 |
| Goal Failed | 1/5 | 0/5 |
| Max Turns Reached | 2/5 | 1/5 |

Compare Cost and Latency

| Metric | v1.0 | v1.1 | Change |
| --- | --- | --- | --- |
| Total Cost | $0.12 | $0.45 | +275% |
| Avg Latency per Turn | 1.2s | 2.8s | +133% |
| Avg Turns to Goal | 4.5 | 3.2 | -29% |

Step 5: Make the Deploy Decision

Use the comparison data to make an informed decision:

Decision Framework

| Condition | Action |
| --- | --- |
| All evaluator scores stable or improved, cost acceptable | Deploy |
| Some scores improved but others regressed | Investigate the regressed scenarios before deploying |
| Any evaluator dropped below your pass threshold (default 0.6) | Do not deploy; fix the regression first |
| Scores improved but cost increase is unacceptable | Consider a hybrid approach (upgraded model for complex scenarios only) |
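The framework can also be encoded as a small gate function for scripting. This is a sketch under our own assumptions: the 0.6 pass threshold mirrors the default mentioned above, while `max_cost_ratio` is a hypothetical project-specific limit you would tune yourself:

```python
# Sketch of the decision framework as a gate function. Thresholds are
# assumptions: pass_threshold mirrors the default 0.6; max_cost_ratio
# is a per-project cost ceiling (2.0 = allow up to double the cost).
def deploy_decision(baseline: dict[str, float],
                    updated: dict[str, float],
                    pass_threshold: float = 0.6,
                    max_cost_ratio: float = 2.0,
                    cost_ratio: float = 1.0) -> str:
    if any(score < pass_threshold for score in updated.values()):
        return "block"        # an evaluator fell below the pass threshold
    regressed = [name for name in baseline if updated[name] < baseline[name]]
    if regressed:
        return "investigate"  # improved overall, but some scores dropped
    if cost_ratio > max_cost_ratio:
        return "hybrid"       # scores fine, but cost increase unacceptable
    return "deploy"

# The v1.0 -> v1.1 example above: every score held or improved, but cost
# rose 275% (a 3.75x ratio), so the gate suggests a hybrid approach.
print(deploy_decision({"Goal Fulfillment": 0.70},
                      {"Goal Fulfillment": 0.85},
                      cost_ratio=3.75))
```

In a CI job you would feed this function the transcribed scores and cost data, then map "block" to a nonzero exit code so the pipeline fails the deploy.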

Investigating Regressions

If any evaluator score dropped between versions:
  1. Open both test runs and find the specific scenarios where scores changed
  2. Compare the Conversation tabs side by side — what did the agent say differently?
  3. Click View Trace on the turns where behavior diverged to inspect LLM inputs and outputs
  4. Check the Scenario Details tab to confirm the test conditions were identical

Continuous Regression Workflow

Integrate simulation regression testing into your development process:
  1. Before every deployment — Run the regression suite against the candidate version. Compare against the last known-good baseline.
  2. After prompt changes — Even small prompt tweaks can have outsized effects on multi-turn behavior. Always simulate.
  3. After model upgrades — New model versions may change conversational patterns. Run the full suite.
  4. Monthly baseline refresh — Periodically add new scenarios that reflect recent support patterns or edge cases.

Automating the Workflow

The simulation SDK returns structured results you can parse programmatically:
result = Netra.simulation.run_simulation(
    name="CI — Pre-Deploy Check",
    dataset_id="your-dataset-id",
    task=AgentV2(),
    context={"version": "v1.1", "ci_run": "build-456"},
)

# Check for failures
if result["failed"]:
    print(f"FAILED: {len(result['failed'])} scenarios did not complete")
    for f in result["failed"]:
        print(f"  - {f['run_item_id']}: {f['error']}")
    exit(1)

print(f"PASSED: All {result['total_items']} scenarios completed")
The SDK result tells you whether scenarios completed or failed at the execution level. Evaluator scores are available in the Test Runs dashboard. For fully automated pass/fail decisions based on evaluator scores, check the dashboard after each run.


Last modified on February 24, 2026