A traced support agent tells you what happened — which tools were called, how long each step took, and what the LLM generated. Simulation tells you whether the agent actually resolved the customer’s issue. Without multi-turn testing, you can’t tell if the agent communicates the correct refund timeline, handles a frustrated customer gracefully, or gives up before achieving the goal. This cookbook walks you through Netra’s simulation workflow: selecting evaluators for customer support quality, creating multi-turn scenarios with different user personas, wrapping your agent in a BaseTask, running simulations, and interpreting conversation transcripts.
Prerequisite: You need a customer support agent integrated with Netra. If you haven’t set this up yet, follow the Tracing LangChain Agents cookbook first.

What You’ll Learn

Configure Simulation Evaluators

Select session-level evaluators that measure goal achievement, fact accuracy, and conversation quality

Create Multi-Turn Scenarios

Build scenarios with goals, user data, fact checkers, and different personas

Wrap Your Agent in BaseTask

Implement the BaseTask interface to connect your agent to the simulation engine

Compare Persona Performance

Run the same scenario with Neutral, Friendly, Frustrated, and Confused personas and compare results

Why Simulate Customer Support?

Customer support agents engage in goal-oriented, multi-turn conversations where subtle failures compound:
| Failure Mode | What Goes Wrong | Why Single-Turn Evaluation Misses It |
|---|---|---|
| Incomplete resolution | Agent answers the question but never confirms the action was taken | Each individual response looks correct, but the goal is never achieved |
| Fact miscommunication | Agent states the wrong refund timeline or return policy | Only detectable when you define specific facts the agent must communicate |
| Persona sensitivity | Agent handles friendly users well but breaks down with frustrated customers | Single-turn tests don’t model emotional progression across a conversation |
| Premature closure | Agent ends the conversation before the customer’s issue is fully resolved | Only visible in multi-turn context with goal tracking |
Simulation addresses this by creating realistic conversations with a simulated user who has a goal, a persona, and context data — then scoring the entire session.
Now, let’s walk through the process of simulating customer support conversations:

Step 1: Select Simulation Evaluators

Go to Evaluation → Evaluators, switch to the Library tab, and filter by Multi turn. Add the following four evaluators:
| Evaluator | What It Measures |
|---|---|
| Goal Fulfillment | Did the conversation achieve the customer’s objective (e.g., process the refund)? |
| Factual Accuracy | Did the agent communicate the correct refund timeline, return policy, and other facts? |
| Conversation Completeness | Were all of the customer’s questions and intents addressed during the conversation? |
| Guideline Adherence | Did the agent follow its instructions throughout — staying professional, not making promises it shouldn’t? |
See Simulation Evaluators for the full library and configuration options.

Step 2: Create a Multi-Turn Dataset

Go to Evaluation → Datasets and click Create Dataset. Select Multi-turn as the type.
1. Configure basics

Set the dataset name to “Customer Support Scenarios” and add tags like support, refunds. Select Multi-turn as the type and Add manually as the data source.
2. Configure the scenario

Define the first scenario:

Scenario Goal:

The customer wants to get a refund for a product they purchased
15 days ago because it arrived damaged.

Behavior Instructions (optional):

Start politely, but become slightly impatient if the agent
asks for information already provided.

Max Turns: 5
User Persona: Frustrated
Provider and Model: Choose the LLM that will generate simulated user responses (e.g., OpenAI / GPT-4.1).
3. Add user data and facts

Simulated User Data — context the simulated user can reference:
| Key | Value |
|---|---|
| order_number | ORD-12345 |
| purchase_date | 2024-01-15 |
| product_name | Wireless Headphones |
| order_total | $79.99 |
| issue | Arrived damaged — left earcup cracked |
Fact Checker — facts the agent must communicate correctly:
| Fact | Expected Value |
|---|---|
| refund_processing_time | 5-7 business days |
| refund_method | Original payment method |
| return_label_delivery | Within 24 hours via email |
4. Select evaluators

Add the four evaluators from Step 1 — Goal Fulfillment, Factual Accuracy, Conversation Completeness, and Guideline Adherence. Configure variable mappings to connect evaluator inputs to scenario fields, agent responses, and conversation metadata.
5. Configure evaluators

Select a provider and model for each evaluator (e.g., OpenAI / GPT-4.1). Optionally rename evaluators to match your use case (e.g., “Refund Goal Fulfillment”). Review and click Create Dataset.
See Simulation Datasets for the full dataset creation reference.
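To build intuition for what the Fact Checker verifies, here is a minimal illustrative sketch of the idea: checking whether the expected values appear in the agent’s side of the conversation. This is not Netra’s implementation (the platform scores facts with its own evaluators); the function and the substring check are purely illustrative of what “facts the agent must communicate” means in practice.

```python
# Illustrative sketch only — NOT Netra's fact-checking implementation.
# Expected values from the Fact Checker table above.
FACTS = {
    "refund_processing_time": "5-7 business days",
    "refund_method": "original payment method",
    "return_label_delivery": "within 24 hours via email",
}

def facts_communicated(transcript: str) -> dict[str, bool]:
    """Return, for each expected fact, whether its value appears in the transcript."""
    text = transcript.lower()
    return {key: value.lower() in text for key, value in FACTS.items()}

transcript = (
    "Your refund will be processed in 5-7 business days to your "
    "original payment method. A return label will arrive within "
    "24 hours via email."
)
print(facts_communicated(transcript))
# All three facts are present, so every value is True
```

A real evaluator tolerates paraphrases (“about a week”) where a substring check would not, which is why fact checking is delegated to an LLM-based evaluator rather than string matching.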

Step 3: Add More Scenarios

Add two more scenarios to the same dataset, each with a different persona and goal:

Scenario 2 — Order Status Inquiry (Neutral persona)

| Field | Value |
|---|---|
| Goal | The customer wants to know the current status of their order and the expected delivery date. |
| Max Turns | 4 |
| Persona | Neutral |
| User Data | order_number: ORD-67890, product_name: Standing Desk |
| Facts | estimated_delivery: March 15, carrier: FedEx, tracking_available: Yes |
Scenario 3 — Return Policy Question (Confused persona)

| Field | Value |
|---|---|
| Goal | The customer wants to understand the return policy for an item they bought three weeks ago. They are unsure whether they are still within the return window. |
| Max Turns | 6 |
| Persona | Confused |
| User Data | order_number: ORD-11111, product_name: Bluetooth Speaker, purchase_date: 2024-02-01 |
| Facts | return_window: 30 days from purchase, return_condition: Item must be unused and in original packaging |
Under Evaluation → Datasets, you should now see the “Customer Support Scenarios” dataset with three scenarios and four evaluators.

Step 4: Implement the BaseTask Wrapper

Wrap your customer support agent in a BaseTask so the simulation engine can call it turn by turn. The run() method receives the simulated user’s message and a session_id for conversation continuity.
```python
from netra import Netra
from netra.simulation.task import BaseTask
from netra.simulation.models import TaskResult
from openai import OpenAI
import uuid

Netra.init(app_name="support-simulation")
client = OpenAI()

# Store conversation history per session
conversations: dict[str, list] = {}

class SupportAgentTask(BaseTask):
    """Wraps a customer support agent for simulation."""

    def run(self, message: str, session_id: str | None = None) -> TaskResult:
        session = session_id or str(uuid.uuid4())

        if session not in conversations:
            conversations[session] = [
                {
                    "role": "system",
                    "content": (
                        "You are a customer support agent for an e-commerce store. "
                        "Help customers with refunds, order status, and return policies. "
                        "Be professional and empathetic. Always confirm the action taken "
                        "before ending the conversation."
                    ),
                }
            ]

        conversations[session].append({"role": "user", "content": message})

        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=conversations[session],
        )

        content = response.choices[0].message.content
        conversations[session].append({"role": "assistant", "content": content})

        return TaskResult(message=content, session_id=session)
```
If you built a LangChain agent in the Tracing LangChain Agents cookbook, you can wrap it the same way — call your agent’s .invoke() method inside run() and return the response as a TaskResult.

Step 5: Trigger the Simulation

Copy the Dataset ID from the dataset page and run the simulation.
```python
result = Netra.simulation.run_simulation(
    name="Customer Support — All Personas",
    dataset_id="your-dataset-id",
    task=SupportAgentTask(),
    context={"model": "gpt-4o-mini", "agent_version": "v1"},
    max_concurrency=3,
)

print(f"Total scenarios: {result['total_items']}")
print(f"Completed: {len(result['completed'])}")
print(f"Failed: {len(result['failed'])}")

for failure in result["failed"]:
    print(f"  Failed {failure['run_item_id']}: {failure['error']}")

Netra.shutdown()
```
For the full API reference, see the SDK documentation for Python and TypeScript.

Step 6: Analyze Results

Go to Evaluation → Test Runs and filter by Multi turn to find your simulation run.

Summary Metrics

The top of the detail view shows aggregated data — total scenarios, pass/fail counts, total cost, and average latency. Use this for a quick health check before diving into individual scenarios.

Conversation Transcripts

Click on any scenario to open the detail view. The Conversation tab shows the full turn-by-turn dialogue between the simulated user and your agent. Look for:
  • Where the conversation stalled — did the agent ask for information the user already provided?
  • Fact accuracy — did the agent state the correct refund timeline?
  • Resolution confirmation — did the agent explicitly confirm the action before ending?

Exit Reasons

Each scenario ends with one of four exit reasons:
| Exit Reason | What It Means |
|---|---|
| Goal Achieved | The customer’s objective was successfully completed |
| Goal Failed | The conversation ended without achieving the goal |
| Abandoned | The simulated user gave up or stopped engaging |
| Max Turns Reached | Hit the turn limit before goal completion |
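When a run contains many scenarios, tallying exit reasons gives a quick triage signal before reading transcripts. A small sketch, assuming you have collected each scenario’s exit reason from the run detail view (the pairs below are illustrative):

```python
from collections import Counter

# Illustrative data: exit reason per scenario, as read from the run detail view.
exit_reasons = [
    ("Refund request", "Goal Achieved"),
    ("Order status", "Goal Achieved"),
    ("Return policy", "Max Turns Reached"),
]

counts = Counter(reason for _, reason in exit_reasons)
for reason, count in counts.most_common():
    print(f"{reason}: {count}")
# Goal Achieved: 2
# Max Turns Reached: 1
```

A cluster of Max Turns Reached exits usually points at conversations that loop (for example, the agent re-asking for data the user already gave) rather than at a wrong answer.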

Evaluation Scores

The Evaluation Results tab shows scores for each evaluator. Compare scores across the three scenarios to spot persona-specific weaknesses:
| Scenario | Persona | Goal Fulfillment | Factual Accuracy | Completeness | Guideline Adherence |
|---|---|---|---|---|---|
| Refund request | Frustrated | 0.8 | 1.0 | 0.75 | 0.6 |
| Order status | Neutral | 1.0 | 1.0 | 1.0 | 1.0 |
| Return policy | Confused | 0.6 | 0.75 | 0.5 | 0.8 |
In this example, the Confused persona scenario scores lowest on Completeness — the agent may not be explaining things clearly enough for users who need extra clarification.
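Comparisons like this can also be scripted. A minimal sketch that flags any evaluator score below a chosen threshold, using the numbers from the example table above (the 0.7 cutoff is an arbitrary illustration, not a Netra default):

```python
# Scores from the example table, keyed by (scenario, persona).
scores = {
    ("Refund request", "Frustrated"): {"Goal Fulfillment": 0.8, "Factual Accuracy": 1.0,
                                       "Completeness": 0.75, "Guideline Adherence": 0.6},
    ("Order status", "Neutral"):      {"Goal Fulfillment": 1.0, "Factual Accuracy": 1.0,
                                       "Completeness": 1.0, "Guideline Adherence": 1.0},
    ("Return policy", "Confused"):    {"Goal Fulfillment": 0.6, "Factual Accuracy": 0.75,
                                       "Completeness": 0.5, "Guideline Adherence": 0.8},
}

THRESHOLD = 0.7  # arbitrary example cutoff

# Flag every (persona, scenario, evaluator) cell below the threshold.
for (scenario, persona), evals in scores.items():
    for evaluator, score in evals.items():
        if score < THRESHOLD:
            print(f"{persona} / {scenario}: {evaluator} = {score}")
# Frustrated / Refund request: Guideline Adherence = 0.6
# Confused / Return policy: Goal Fulfillment = 0.6
# Confused / Return policy: Completeness = 0.5
```

Grouping flags by persona, as here, surfaces persona-specific weaknesses that a single aggregate average would hide.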

Debugging with Traces

Click View Trace on any conversation turn to inspect the full execution — LLM inputs, tool calls (if applicable), token usage, and latency. This connects simulation results directly to your observability traces.

Interpreting Scores and Improving Quality

When evaluator scores are low, use this table to identify the likely cause and fix:
| Low Score In | Likely Cause | How to Fix |
|---|---|---|
| Goal Fulfillment | Agent answered questions but never completed the action (e.g., processed the refund) | Add explicit instructions in the system prompt to confirm resolution before ending |
| Factual Accuracy | Agent stated incorrect policy details or timelines | Include accurate policy data in the system prompt or connect to a knowledge base tool |
| Conversation Completeness | Agent addressed the primary question but missed follow-up intents | Improve the system prompt to check whether the customer has additional questions |
| Guideline Adherence | Agent deviated from tone or made unauthorized promises | Tighten the system prompt guidelines and add guardrails for what the agent should not promise |
After making changes, re-run the simulation against the same dataset and compare results across test runs.
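A quick way to compare two test runs is to diff their average evaluator scores. A sketch with illustrative numbers; in practice you would read the averages from the two runs’ summary views:

```python
# Average evaluator scores from two test runs (illustrative numbers).
before = {"Goal Fulfillment": 0.80, "Factual Accuracy": 0.92,
          "Completeness": 0.75, "Guideline Adherence": 0.80}
after  = {"Goal Fulfillment": 0.93, "Factual Accuracy": 0.92,
          "Completeness": 0.88, "Guideline Adherence": 0.77}

# Report each evaluator's delta and flag regressions.
for evaluator in before:
    delta = after[evaluator] - before[evaluator]
    flag = "regression" if delta < 0 else "ok"
    print(f"{evaluator}: {before[evaluator]:.2f} -> {after[evaluator]:.2f} "
          f"({delta:+.2f}, {flag})")
```

In this illustration the prompt change lifts Goal Fulfillment and Completeness but slightly hurts Guideline Adherence, a common trade-off when you add more aggressive resolution instructions.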

Continuous Simulation Strategy

For production support agents, run simulations regularly:
  1. On every prompt change — Verify that updated instructions don’t break existing conversation patterns
  2. After adding new tools — Ensure the agent correctly integrates new capabilities into conversations
  3. After model upgrades — Compare conversation quality across model versions
  4. Weekly regression runs — Catch gradual degradation in goal achievement or fact accuracy
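For scheduled runs, a simple gate over the result dictionary returned by `run_simulation` (the shape shown in Step 5) can fail a CI job when any scenario errors out. Evaluator-score thresholds still live in the UI; this sketch only catches execution failures:

```python
def regression_passed(result: dict) -> bool:
    """True when every scenario in a simulation run completed without error.

    Operates on the result shape shown in Step 5:
    {"total_items": ..., "completed": [...], "failed": [...]}.
    """
    for failure in result["failed"]:
        print(f"FAILED {failure['run_item_id']}: {failure['error']}")
    return len(result["failed"]) == 0

# Example: a clean run of three scenarios passes the gate.
print(regression_passed({"total_items": 3,
                         "completed": ["a", "b", "c"],
                         "failed": []}))
# True
```

In a CI script you would call `regression_passed(result)` after `run_simulation` and exit non-zero when it returns False, so the pipeline fails visibly.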

Last modified on February 24, 2026