Use simulation as a quality gate to compare agent versions before and after changes
Agent changes are risky. A prompt tweak that improves refund conversations might break order status inquiries. A model upgrade that reduces cost might degrade conversational memory. Without a systematic way to compare agent versions across multi-turn scenarios, you're deploying blind.

This cookbook shows you how to use simulation as a regression testing workflow: establish a baseline dataset with all 8 evaluators, run simulations before and after agent changes, compare results across test runs, and decide whether to deploy.
Prerequisite: You should be familiar with Netra’s simulation workflow. If this is your first simulation cookbook, start with Simulating Customer Support Conversations first.
The A/B Testing Configurations cookbook covers single-turn model comparison. But conversational agents need multi-turn regression testing — a change that improves individual responses might break behavior across a full conversation:
| Change Type | Single-Turn Impact | Multi-Turn Impact |
| --- | --- | --- |
| Prompt update | Responses may improve in tone or accuracy | May break conversation memory or guideline adherence across turns |
| Model upgrade | Faster or cheaper responses | May change how the agent handles frustrated users or manages context |
| Tool addition | New capability available | Agent may over-use the new tool or forget existing conversation flow |
| Temperature change | More/less creative individual responses | May affect consistency across a multi-turn session |
Simulation regression testing catches these cross-turn impacts before they reach users.
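To make the cross-turn failure mode concrete, here is a toy illustration (plain Python, no real agent or SDK; the `reply_v1`/`reply_v2` functions are invented for this sketch). The "upgraded" version drops old conversation history, so every individual reply still looks reasonable while memory across turns silently breaks:

```python
def reply_v1(history: list[str], message: str) -> str:
    """Baseline: keeps the full conversation history."""
    history.append(message)
    # With the whole conversation visible, the agent remembers the order ID.
    return "order #42 status" if any("order #42" in m for m in history) else "which order?"

def reply_v2(history: list[str], message: str) -> str:
    """'Upgrade': truncates history to the latest message to save tokens."""
    history.clear()  # the change that single-turn tests never notice
    history.append(message)
    return "order #42 status" if any("order #42" in m for m in history) else "which order?"

h1: list[str] = []
h2: list[str] = []
# Turn 1: identical behavior, so a single-turn check passes for both.
reply_v1(h1, "I'm asking about order #42")
reply_v2(h2, "I'm asking about order #42")
# Turn 2: the user assumes the agent remembers which order.
print(reply_v1(h1, "Can you expedite it?"))  # still remembers the order
print(reply_v2(h2, "Can you expedite it?"))  # asks "which order?" again
```

Turn 1 outputs are identical for both versions; only a multi-turn scenario exposes that v2 forgot the order by turn 2.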
Go to Evaluation → Datasets, click Create Dataset, and select Multi-turn. Name it “Agent Regression Suite” and add 4-5 scenarios that cover different situations:
| # | Scenario Goal | Persona | Max Turns | Key Features |
| --- | --- | --- | --- | --- |
| 1 | Get a refund for a damaged product | Frustrated | 5 | Fact checkers for refund timeline and method |
| 2 | Understand the return policy for an online order | Confused | 6 | User data with purchase details |
| 3 | Check order status and request expedited shipping | Neutral | 4 | Goal requires two actions in one conversation |
| 4 | Report a billing discrepancy and get it resolved | Friendly | 5 | Fact checkers for billing correction process |
| 5 | Get technical help setting up a product | Custom | 7 | Custom “Non-Technical User” persona, no fact checkers |
Attach all 8 evaluators to the dataset with the appropriate variable mappings.
This dataset is reusable. Once created, you run it against every agent version — the scenarios, personas, and evaluators stay the same. Only the agent changes.
Modify your agent — update the system prompt, switch models, or add new capabilities. Here’s an example that upgrades the model and refines the prompt:
```python
import uuid

# BaseTask, TaskResult, Netra, and the OpenAI `client` are set up as in
# the baseline agent from the simulation cookbook.

conversations_v2: dict[str, list] = {}

class AgentV2(BaseTask):
    """Updated agent — Version B (prompt refinement + model upgrade)."""

    def run(self, message: str, session_id: str | None = None) -> TaskResult:
        session = session_id or str(uuid.uuid4())
        if session not in conversations_v2:
            conversations_v2[session] = [
                {
                    "role": "system",
                    "content": (
                        "You are a customer support agent. Help customers with "
                        "refunds, orders, billing, and technical setup.\n\n"
                        "Guidelines:\n"
                        "- Be professional, empathetic, and concise.\n"
                        "- Always confirm the action taken before ending.\n"
                        "- If the customer seems confused, simplify your language.\n"
                        "- Reference earlier parts of the conversation when relevant.\n"
                        "- Check whether the customer has additional questions."
                    ),
                }
            ]
        conversations_v2[session].append({"role": "user", "content": message})
        response = client.chat.completions.create(
            model="gpt-4o",  # Upgraded from gpt-4o-mini
            messages=conversations_v2[session],
        )
        content = response.choices[0].message.content
        conversations_v2[session].append({"role": "assistant", "content": content})
        return TaskResult(message=content, session_id=session)

# Run the updated agent against the same dataset
updated = Netra.simulation.run_simulation(
    name="Regression — v1.1 Prompt + Model Upgrade",
    dataset_id="your-dataset-id",  # Same dataset as the baseline run
    task=AgentV2(),
    context={"version": "v1.1", "model": "gpt-4o"},
    max_concurrency=3,
)
print(f"Updated: {len(updated['completed'])} completed, {len(updated['failed'])} failed")
Netra.shutdown()
```
The simulation SDK returns structured results you can parse programmatically:
```python
result = Netra.simulation.run_simulation(
    name="CI — Pre-Deploy Check",
    dataset_id="your-dataset-id",
    task=AgentV2(),
    context={"version": "v1.1", "ci_run": "build-456"},
)

# Check for execution-level failures
if result["failed"]:
    print(f"FAILED: {len(result['failed'])} scenarios did not complete")
    for f in result["failed"]:
        print(f"  - {f['run_item_id']}: {f['error']}")
    exit(1)

print(f"PASSED: All {result['total_items']} scenarios completed")
```
The SDK result reports whether each scenario completed or failed at the execution level; evaluator scores live in the Test Runs dashboard. To gate a deploy on evaluator scores rather than completion alone, review the dashboard comparison after each run.
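The execution-level comparison between a baseline run and an updated run can itself be scripted. Below is a minimal sketch in plain Python: the `compare_runs` helper is hypothetical, the result dicts are hand-built toys rather than real SDK output, and it assumes completed items carry a `run_item_id` key like the failed items shown above do:

```python
def compare_runs(baseline: dict, updated: dict) -> dict:
    """Compare two simulation results at the execution level.

    Assumes both dicts carry the 'completed', 'failed', and 'total_items'
    keys shown in this cookbook, with a 'run_item_id' on each item.
    Evaluator-score comparison still happens in the Test Runs dashboard.
    """
    base_rate = len(baseline["completed"]) / baseline["total_items"]
    new_rate = len(updated["completed"]) / updated["total_items"]
    # A scenario that completed in the baseline but failed after the
    # change is a regression worth blocking the deploy on.
    base_ok = {item["run_item_id"] for item in baseline["completed"]}
    regressions = [
        f["run_item_id"] for f in updated["failed"] if f["run_item_id"] in base_ok
    ]
    return {
        "baseline_rate": base_rate,
        "updated_rate": new_rate,
        "regressions": regressions,
        "deploy": new_rate >= base_rate and not regressions,
    }

# Toy example with hand-built results, not real SDK output:
baseline = {
    "completed": [{"run_item_id": "s1"}, {"run_item_id": "s2"}],
    "failed": [{"run_item_id": "s3", "error": "timeout"}],
    "total_items": 3,
}
updated = {
    "completed": [{"run_item_id": "s1"}, {"run_item_id": "s3"}],
    "failed": [{"run_item_id": "s2", "error": "agent error"}],
    "total_items": 3,
}
verdict = compare_runs(baseline, updated)
print(verdict["deploy"], verdict["regressions"])  # blocks: s2 regressed
```

Note the design choice: a fix (s3 now completes) does not offset a regression (s2 now fails); any scenario that went from passing to failing blocks the deploy.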