Agent changes are risky. A prompt tweak that improves refund conversations might break order status inquiries. A model upgrade that reduces cost might degrade conversational memory. Without a systematic way to compare agent versions across multi-turn scenarios, you’re deploying blind. This cookbook shows you how to use simulation as a regression testing workflow: establish a baseline dataset with all 8 evaluators, run simulations before and after agent changes, compare results across test runs, and decide whether to deploy.
Prerequisite: You should be familiar with Netra’s simulation workflow. If this is your first simulation cookbook, start with Simulating Customer Support Conversations.

What You’ll Learn

Build a Regression Suite: Create a reusable dataset with diverse scenarios, personas, and all 8 evaluators

Establish a Baseline: Run your current agent version to set performance benchmarks

Compare Agent Versions: Run the same dataset against a modified agent and compare scores side by side

Make Deploy/No-Deploy Decisions: Use evaluator scores, exit reasons, and cost data to decide whether a change is safe to ship

Why Regression Testing Matters for Agents

The A/B Testing Configurations cookbook covers single-turn model comparison. But conversational agents need multi-turn regression testing — a change that improves individual responses might break behavior across a full conversation:
| Change Type | Single-Turn Impact | Multi-Turn Impact |
| --- | --- | --- |
| Prompt update | Responses may improve in tone or accuracy | May break conversation memory or guideline adherence across turns |
| Model upgrade | Faster or cheaper responses | May change how the agent handles frustrated users or manages context |
| Tool addition | New capability available | Agent may overuse the new tool or forget the existing conversation flow |
| Temperature change | More or less creative individual responses | May affect consistency across a multi-turn session |
Simulation regression testing catches these cross-turn impacts before they reach users.

Step 1: Create a Comprehensive Dataset

Build a regression test suite with diverse scenarios that cover your agent’s core use cases. Use all 8 evaluators for comprehensive coverage.

Select All 8 Evaluators

Go to Evaluation → Evaluators, switch to the Library tab, and filter by Multi turn. Add all eight evaluators.

Quality evaluators:

| Evaluator | What It Measures |
| --- | --- |
| Guideline Adherence | Did the agent follow its instructions throughout the conversation? |
| Conversation Completeness | Were all of the user’s intents addressed? |
| Profile Utilization | Did the agent use the provided user profile to adapt its responses? |
| Conversational Flow | Did the conversation progress logically without stalling or looping? |
| Conversation Memory | Did the agent remember and correctly reference information from earlier turns? |
| Factual Accuracy | Did the agent communicate facts correctly, consistent with the provided reference data? |
Agentic evaluators:
| Evaluator | What It Measures |
| --- | --- |
| Goal Fulfillment | Did the conversation achieve the user’s stated objective? |
| Information Elicitation | Did the agent effectively gather the required information from the user? |

Create Diverse Scenarios

Go to Evaluation → Datasets, click Create Dataset, and select Multi-turn. Name it “Agent Regression Suite” and add 4-5 scenarios that cover different situations:
| # | Scenario Goal | Persona | Max Turns | Key Features |
| --- | --- | --- | --- | --- |
| 1 | Get a refund for a damaged product | Frustrated | 5 | Fact checkers for refund timeline and method |
| 2 | Understand the return policy for an online order | Confused | 6 | User data with purchase details |
| 3 | Check order status and request expedited shipping | Neutral | 4 | Goal requires two actions in one conversation |
| 4 | Report a billing discrepancy and get it resolved | Friendly | 5 | Fact checkers for billing correction process |
| 5 | Get technical help setting up a product | Custom | 7 | Custom “Non-Technical User” persona, no fact checkers |
Attach all 8 evaluators to the dataset with the appropriate variable mappings.
This dataset is reusable. Once created, you run it against every agent version — the scenarios, personas, and evaluators stay the same. Only the agent changes.
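The dataset itself lives in the dashboard, but it can help to keep a plain-data mirror of the suite in your repository, so reviewers can see exactly what every agent version is tested against. A minimal sketch; the field names here are ours for illustration, not part of the Netra API:

```python
# Illustrative, version-controlled record of the regression suite.
# These dicts mirror the dashboard configuration; they are not consumed
# by the Netra SDK, which only needs the dataset_id at run time.
REGRESSION_SUITE = [
    {"goal": "Get a refund for a damaged product", "persona": "Frustrated", "max_turns": 5},
    {"goal": "Understand the return policy for an online order", "persona": "Confused", "max_turns": 6},
    {"goal": "Check order status and request expedited shipping", "persona": "Neutral", "max_turns": 4},
    {"goal": "Report a billing discrepancy and get it resolved", "persona": "Friendly", "max_turns": 5},
    {"goal": "Get technical help setting up a product", "persona": "Non-Technical User", "max_turns": 7},
]

# A quick sanity check that the mirror still matches what you expect.
print(f"{len(REGRESSION_SUITE)} scenarios, "
      f"max {max(s['max_turns'] for s in REGRESSION_SUITE)} turns")
```

If the suite changes in the dashboard, update this record in the same pull request so the test matrix stays reviewable alongside the agent code.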

Step 2: Run the Baseline (Version A)

Run the current agent version to establish performance benchmarks.
from netra import Netra
from netra.simulation.task import BaseTask
from netra.simulation.models import TaskResult
from openai import OpenAI
import uuid

Netra.init(app_name="agent-regression")
client = OpenAI()

conversations: dict[str, list] = {}

class AgentV1(BaseTask):
    """Current production agent — Version A (baseline)."""

    def run(self, message: str, session_id: str | None = None) -> TaskResult:
        session = session_id or str(uuid.uuid4())

        if session not in conversations:
            conversations[session] = [
                {
                    "role": "system",
                    "content": (
                        "You are a customer support agent. Help customers with "
                        "refunds, orders, billing, and technical setup. Be professional "
                        "and empathetic. Confirm resolution before ending."
                    ),
                }
            ]

        conversations[session].append({"role": "user", "content": message})

        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=conversations[session],
        )

        content = response.choices[0].message.content
        conversations[session].append({"role": "assistant", "content": content})

        return TaskResult(message=content, session_id=session)

# Run baseline
baseline = Netra.simulation.run_simulation(
    name="Regression — v1.0 Baseline",
    dataset_id="your-dataset-id",
    task=AgentV1(),
    context={"version": "v1.0", "model": "gpt-4o-mini"},
    max_concurrency=3,
)

print(f"Baseline: {len(baseline['completed'])} completed, {len(baseline['failed'])} failed")

Netra.shutdown()

Step 3: Make an Agent Change (Version B)

Modify your agent — update the system prompt, switch models, or add new capabilities. Here’s an example that upgrades the model and refines the prompt:
conversations_v2: dict[str, list] = {}

class AgentV2(BaseTask):
    """Updated agent — Version B (prompt refinement + model upgrade)."""

    def run(self, message: str, session_id: str | None = None) -> TaskResult:
        session = session_id or str(uuid.uuid4())

        if session not in conversations_v2:
            conversations_v2[session] = [
                {
                    "role": "system",
                    "content": (
                        "You are a customer support agent. Help customers with "
                        "refunds, orders, billing, and technical setup.\n\n"
                        "Guidelines:\n"
                        "- Be professional, empathetic, and concise.\n"
                        "- Always confirm the action taken before ending.\n"
                        "- If the customer seems confused, simplify your language.\n"
                        "- Reference earlier parts of the conversation when relevant.\n"
                        "- Check whether the customer has additional questions."
                    ),
                }
            ]

        conversations_v2[session].append({"role": "user", "content": message})

        response = client.chat.completions.create(
            model="gpt-4o",  # Upgraded from gpt-4o-mini
            messages=conversations_v2[session],
        )

        content = response.choices[0].message.content
        conversations_v2[session].append({"role": "assistant", "content": content})

        return TaskResult(message=content, session_id=session)

# Run updated agent against the same dataset
updated = Netra.simulation.run_simulation(
    name="Regression — v1.1 Prompt + Model Upgrade",
    dataset_id="your-dataset-id",  # Same dataset as baseline
    task=AgentV2(),
    context={"version": "v1.1", "model": "gpt-4o"},
    max_concurrency=3,
)

print(f"Updated: {len(updated['completed'])} completed, {len(updated['failed'])} failed")

Netra.shutdown()

Step 4: Compare Results

Go to Evaluation → Test Runs, filter by Multi turn, and open both runs side by side.

Build a Comparison Table

Pull the evaluator scores from each run and compare:
| Evaluator | v1.0 (Baseline) | v1.1 (Updated) | Change |
| --- | --- | --- | --- |
| Goal Fulfillment | 0.70 | 0.85 | +0.15 |
| Factual Accuracy | 0.80 | 0.80 | 0.00 |
| Guideline Adherence | 0.65 | 0.90 | +0.25 |
| Conversation Completeness | 0.60 | 0.80 | +0.20 |
| Conversation Memory | 0.55 | 0.75 | +0.20 |
| Conversational Flow | 0.70 | 0.85 | +0.15 |
| Profile Utilization | 0.50 | 0.70 | +0.20 |
| Information Elicitation | 0.60 | 0.65 | +0.05 |
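If you copy the per-evaluator scores out of the dashboard, a few lines of Python turn them into a delta report and flag any regression automatically. This is a hand-rolled sketch: the scores below are transcribed from the dashboard by hand, since the SDK result object reports execution status rather than evaluator scores:

```python
# Scores transcribed from the Test Runs dashboard for each version.
baseline_scores = {
    "Goal Fulfillment": 0.70, "Factual Accuracy": 0.80,
    "Guideline Adherence": 0.65, "Conversation Completeness": 0.60,
    "Conversation Memory": 0.55, "Conversational Flow": 0.70,
    "Profile Utilization": 0.50, "Information Elicitation": 0.60,
}
updated_scores = {
    "Goal Fulfillment": 0.85, "Factual Accuracy": 0.80,
    "Guideline Adherence": 0.90, "Conversation Completeness": 0.80,
    "Conversation Memory": 0.75, "Conversational Flow": 0.85,
    "Profile Utilization": 0.70, "Information Elicitation": 0.65,
}

# Per-evaluator deltas, and any evaluator that got worse.
deltas = {name: round(updated_scores[name] - baseline_scores[name], 2)
          for name in baseline_scores}
regressions = {name: d for name, d in deltas.items() if d < 0}

# Print worst movers first so regressions surface at the top.
for name, d in sorted(deltas.items(), key=lambda kv: kv[1]):
    print(f"{name:28s} {baseline_scores[name]:.2f} -> "
          f"{updated_scores[name]:.2f} ({d:+.2f})")
print("Regressions:", regressions or "none")
```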

Compare Exit Reasons

| Exit Reason | v1.0 | v1.1 |
| --- | --- | --- |
| Goal Achieved | 2/5 | 4/5 |
| Goal Failed | 1/5 | 0/5 |
| Max Turns Reached | 2/5 | 1/5 |

Compare Cost and Latency

| Metric | v1.0 | v1.1 | Change |
| --- | --- | --- | --- |
| Total Cost | $0.12 | $0.45 | +275% |
| Avg Latency per Turn | 1.2s | 2.8s | +133% |
| Avg Turns to Goal | 4.5 | 3.2 | -29% |

Step 5: Make the Deploy Decision

Use the comparison data to make an informed decision:

Decision Framework

| Condition | Action |
| --- | --- |
| All evaluator scores stable or improved, cost acceptable | Deploy |
| Some scores improved but others regressed | Investigate the regressed scenarios before deploying |
| Any evaluator dropped below your pass threshold (default 0.6) | Do not deploy; fix the regression first |
| Scores improved but cost increase is unacceptable | Consider a hybrid approach (upgraded model for complex scenarios only) |
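The framework can also be encoded as a small gate function for scripting. This is a sketch under our own assumptions: the 0.6 pass threshold mirrors the default mentioned above, while `max_cost_ratio` is a hypothetical project-specific limit you would tune yourself:

```python
# Sketch of the decision framework as a gate function. Thresholds are
# assumptions: pass_threshold mirrors the default 0.6; max_cost_ratio
# is a per-project cost ceiling (2.0 = allow up to double the cost).
def deploy_decision(baseline: dict[str, float],
                    updated: dict[str, float],
                    pass_threshold: float = 0.6,
                    max_cost_ratio: float = 2.0,
                    cost_ratio: float = 1.0) -> str:
    if any(score < pass_threshold for score in updated.values()):
        return "block"        # an evaluator fell below the pass threshold
    regressed = [name for name in baseline if updated[name] < baseline[name]]
    if regressed:
        return "investigate"  # improved overall, but some scores dropped
    if cost_ratio > max_cost_ratio:
        return "hybrid"       # scores fine, but cost increase unacceptable
    return "deploy"

# The v1.0 -> v1.1 example above: every score held or improved, but cost
# rose 275% (a 3.75x ratio), so the gate suggests a hybrid approach.
print(deploy_decision({"Goal Fulfillment": 0.70},
                      {"Goal Fulfillment": 0.85},
                      cost_ratio=3.75))
```

In a CI job you would feed this function the transcribed scores and cost data, then map "block" to a nonzero exit code so the pipeline fails the deploy.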

Investigating Regressions

If any evaluator score dropped between versions:
  1. Open both test runs and find the specific scenarios where scores changed
  2. Compare the Conversation tabs side by side — what did the agent say differently?
  3. Click View Trace on the turns where behavior diverged to inspect LLM inputs and outputs
  4. Check the Scenario Details tab to confirm the test conditions were identical

Continuous Regression Workflow

Integrate simulation regression testing into your development process:
  1. Before every deployment — Run the regression suite against the candidate version. Compare against the last known-good baseline.
  2. After prompt changes — Even small prompt tweaks can have outsized effects on multi-turn behavior. Always simulate.
  3. After model upgrades — New model versions may change conversational patterns. Run the full suite.
  4. Monthly baseline refresh — Periodically add new scenarios that reflect recent support patterns or edge cases.

Automating the Workflow

The simulation SDK returns structured results you can parse programmatically:
result = Netra.simulation.run_simulation(
    name="CI — Pre-Deploy Check",
    dataset_id="your-dataset-id",
    task=AgentV2(),
    context={"version": "v1.1", "ci_run": "build-456"},
)

# Check for failures
if result["failed"]:
    print(f"FAILED: {len(result['failed'])} scenarios did not complete")
    for f in result["failed"]:
        print(f"  - {f['run_item_id']}: {f['error']}")
    exit(1)

print(f"PASSED: All {result['total_items']} scenarios completed")
The SDK result tells you whether scenarios completed or failed at the execution level. Evaluator scores are available in the Test Runs dashboard. For fully automated pass/fail decisions based on evaluator scores, check the dashboard after each run.


Last modified on February 24, 2026