BaseTask, running simulations, and interpreting conversation transcripts.
Prerequisite: You need a customer support agent integrated with Netra. If you haven’t set this up yet, follow the Tracing LangChain Agents cookbook first.
What You’ll Learn
Configure Simulation Evaluators
Select session-level evaluators that measure goal achievement, fact accuracy, and conversation quality
Create Multi-Turn Scenarios
Build scenarios with goals, user data, fact checkers, and different personas
Wrap Your Agent in BaseTask
Implement the BaseTask interface to connect your agent to the simulation engine
Compare Persona Performance
Run the same scenario with Neutral, Friendly, Frustrated, and Confused personas and compare results
Why Simulate Customer Support?
Customer support agents engage in goal-oriented, multi-turn conversations where subtle failures compound:

| Failure Mode | What Goes Wrong | Why Single-Turn Evaluation Misses It |
|---|---|---|
| Incomplete resolution | Agent answers the question but never confirms the action was taken | Each individual response looks correct, but the goal is never achieved |
| Fact miscommunication | Agent states the wrong refund timeline or return policy | Only detectable when you define specific facts the agent must communicate |
| Persona sensitivity | Agent handles friendly users well but breaks down with frustrated customers | Single-turn tests don’t model emotional progression across a conversation |
| Premature closure | Agent ends the conversation before the customer’s issue is fully resolved | Only visible in multi-turn context with goal tracking |
Now, let’s walk through the process of simulating customer support conversations:
Step 1: Select Simulation Evaluators
Go to Evaluation → Evaluators, switch to the Library tab, and filter by Multi turn. Add the following four evaluators:

| Evaluator | What It Measures |
|---|---|
| Goal Fulfillment | Did the conversation achieve the customer’s objective (e.g., process the refund)? |
| Factual Accuracy | Did the agent communicate the correct refund timeline, return policy, and other facts? |
| Conversation Completeness | Were all of the customer’s questions and intents addressed during the conversation? |
| Guideline Adherence | Did the agent follow its instructions throughout — staying professional, not making promises it shouldn’t? |
Step 2: Create a Multi-Turn Dataset
Go to Evaluation → Datasets and click Create Dataset.

Configure basics

Set the dataset name to “Customer Support Scenarios” and add tags like support, refunds. Select Multi-turn as the type and Add manually as the data source.

Configure the scenario
Define the first scenario:

- Scenario Goal:
- Behavior Instructions (optional):
- Max Turns: 5
- User Persona: Frustrated
- Provider and Model: Choose the LLM that will generate simulated user responses (e.g., OpenAI / GPT-4.1).
Add user data and facts
Simulated User Data — context the simulated user can reference:

| Key | Value |
|---|---|
| order_number | ORD-12345 |
| purchase_date | 2024-01-15 |
| product_name | Wireless Headphones |
| order_total | $79.99 |
| issue | Arrived damaged — left earcup cracked |

Fact Checker — facts the agent must communicate correctly:

| Fact | Expected Value |
|---|---|
| refund_processing_time | 5-7 business days |
| refund_method | Original payment method |
| return_label_delivery | Within 24 hours via email |
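For reference, the scenario configured above can be represented as plain Python data. This is illustrative only: the field names mirror the UI labels, not a documented Netra schema, and the goal text is a placeholder you would replace with your own.

```python
# Illustrative only: field names mirror the UI labels, not a documented
# Netra API. The "goal" text is a placeholder.
refund_scenario = {
    "goal": "The customer wants a refund for a damaged order.",  # placeholder
    "max_turns": 5,
    "persona": "Frustrated",
    "user_data": {
        "order_number": "ORD-12345",
        "purchase_date": "2024-01-15",
        "product_name": "Wireless Headphones",
        "order_total": "$79.99",
        "issue": "Arrived damaged — left earcup cracked",
    },
    "facts": {
        "refund_processing_time": "5-7 business days",
        "refund_method": "Original payment method",
        "return_label_delivery": "Within 24 hours via email",
    },
}
```

Keeping scenario data in one structure like this makes it easy to review facts alongside user data before entering them in the UI.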
Select evaluators
Add the four evaluators from Step 1 — Goal Fulfillment, Factual Accuracy, Conversation Completeness, and Guideline Adherence. Configure variable mappings to connect evaluator inputs to scenario fields, agent responses, and conversation metadata.
Step 3: Add More Scenarios
Add two more scenarios to the same dataset, each with a different persona and goal:

Scenario 2 — Order Status Inquiry (Neutral persona)

| Field | Value |
|---|---|
| Goal | The customer wants to know the current status of their order and the expected delivery date. |
| Max Turns | 4 |
| Persona | Neutral |
| User Data | order_number: ORD-67890, product_name: Standing Desk |
| Facts | estimated_delivery: March 15, carrier: FedEx, tracking_available: Yes |
Scenario 3 — Return Policy Question (Confused persona)

| Field | Value |
|---|---|
| Goal | The customer wants to understand the return policy for an item they bought three weeks ago. They are unsure whether they are still within the return window. |
| Max Turns | 6 |
| Persona | Confused |
| User Data | order_number: ORD-11111, product_name: Bluetooth Speaker, purchase_date: 2024-02-01 |
| Facts | return_window: 30 days from purchase, return_condition: Item must be unused and in original packaging |
Step 4: Implement the BaseTask Wrapper
Wrap your customer support agent in a BaseTask so the simulation engine can call it turn by turn. The run() method receives the simulated user’s message and a session_id for conversation continuity.
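A minimal sketch of the wrapper is shown below. The real BaseTask lives in the Netra SDK; the stand-in class here only mirrors the interface this guide describes, so the example is self-contained. The callable passed in stands in for your LangChain agent from the tracing cookbook.

```python
# Sketch only: Netra's real BaseTask comes from its SDK. The stand-in
# below just mirrors the run(message, session_id) interface described
# in this guide so the example runs on its own.
class BaseTask:  # stand-in for the SDK base class
    def run(self, message: str, session_id: str) -> str:
        raise NotImplementedError


class SupportAgentTask(BaseTask):
    """Adapts a customer support agent to the simulation engine."""

    def __init__(self, agent):
        # `agent` is any callable(message, session_id) -> reply,
        # e.g. your LangChain agent wrapped with session-scoped history.
        self.agent = agent

    def run(self, message: str, session_id: str) -> str:
        # session_id preserves multi-turn context across simulated turns.
        return self.agent(message, session_id)
```

In a real integration you would subclass the SDK's BaseTask directly and invoke your agent inside run(), passing session_id through so each simulated conversation keeps its own history.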
Step 5: Trigger the Simulation
Copy the Dataset ID from the dataset page and run the simulation.

Step 6: Analyze Results
Go to Evaluation → Test Runs and filter by Multi turn to find your simulation run.

Summary Metrics
The top of the detail view shows aggregated data — total scenarios, pass/fail counts, total cost, and average latency. Use this for a quick health check before diving into individual scenarios.

Conversation Transcripts
Click on any scenario to open the detail view. The Conversation tab shows the full turn-by-turn dialogue between the simulated user and your agent. Look for:

- Where the conversation stalled — did the agent ask for information the user already provided?
- Fact accuracy — did the agent state the correct refund timeline?
- Resolution confirmation — did the agent explicitly confirm the action before ending?
Exit Reasons
Each scenario ends with one of four exit reasons:

| Exit Reason | What It Means |
|---|---|
| Goal Achieved | The customer’s objective was successfully completed |
| Goal Failed | The conversation ended without achieving the goal |
| Abandoned | The simulated user gave up or stopped engaging |
| Max Turns Reached | Hit the turn limit before goal completion |
Evaluation Scores
The Evaluation Results tab shows scores for each evaluator. Compare scores across the three scenarios to spot persona-specific weaknesses:

| Scenario | Persona | Goal Fulfillment | Factual Accuracy | Completeness | Guideline Adherence |
|---|---|---|---|---|---|
| Refund request | Frustrated | 0.8 | 1.0 | 0.75 | 0.6 |
| Order status | Neutral | 1.0 | 1.0 | 1.0 | 1.0 |
| Return policy | Confused | 0.6 | 0.75 | 0.5 | 0.8 |
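To see at a glance which persona needs attention, the example scores above can be averaged across evaluators with a few lines of Python:

```python
# Example scores from the table above, keyed by (scenario, persona),
# listed in evaluator order: Goal Fulfillment, Factual Accuracy,
# Completeness, Guideline Adherence.
scores = {
    ("Refund request", "Frustrated"): [0.8, 1.0, 0.75, 0.6],
    ("Order status", "Neutral"): [1.0, 1.0, 1.0, 1.0],
    ("Return policy", "Confused"): [0.6, 0.75, 0.5, 0.8],
}

# Mean across the four evaluators for each scenario.
averages = {key: sum(vals) / len(vals) for key, vals in scores.items()}

# Lowest average first: here the Confused persona scenario stands out.
worst = min(averages, key=averages.get)
```

In this example the Confused-persona return-policy scenario averages lowest, which matches the per-evaluator table: it is weakest on Completeness and Goal Fulfillment.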
Debugging with Traces
Click View Trace on any conversation turn to inspect the full execution — LLM inputs, tool calls (if applicable), token usage, and latency. This connects simulation results directly to your observability traces.

Interpreting Scores and Improving Quality
When evaluator scores are low, use this table to identify the likely cause and fix:

| Low Score In | Likely Cause | How to Fix |
|---|---|---|
| Goal Fulfillment | Agent answered questions but never completed the action (e.g., processed the refund) | Add explicit instructions in the system prompt to confirm resolution before ending |
| Factual Accuracy | Agent stated incorrect policy details or timelines | Include accurate policy data in the system prompt or connect to a knowledge base tool |
| Conversation Completeness | Agent addressed the primary question but missed follow-up intents | Improve the system prompt to check whether the customer has additional questions |
| Guideline Adherence | Agent deviated from tone or made unauthorized promises | Tighten the system prompt guidelines and add guardrails for what the agent should not promise |
Continuous Simulation Strategy
For production support agents, run simulations regularly:

- On every prompt change — Verify that updated instructions don’t break existing conversation patterns
- After adding new tools — Ensure the agent correctly integrates new capabilities into conversations
- After model upgrades — Compare conversation quality across model versions
- Weekly regression runs — Catch gradual degradation in goal achievement or fact accuracy
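A regression run is most useful when it can fail a build. As a sketch (how you export evaluator scores from Netra test runs is up to your setup; the input dict here uses the example scores from Step 6), a CI check might look like:

```python
# Sketch: fail a CI job when Goal Fulfillment regresses. How you fetch
# scores from your Netra test runs is up to your pipeline; this only
# shows the threshold check itself.
GOAL_FULFILLMENT_THRESHOLD = 0.8


def check_regression(run_scores):
    """Return scenarios whose Goal Fulfillment fell below the threshold."""
    return [
        scenario
        for scenario, score in run_scores.items()
        if score < GOAL_FULFILLMENT_THRESHOLD
    ]


# Using the example scores from Step 6:
failing = check_regression(
    {"Refund request": 0.8, "Order status": 1.0, "Return policy": 0.6}
)
# In CI, a non-empty `failing` list would fail the build (e.g., sys.exit(1)).
```

Wiring this into a weekly scheduled job gives you the regression safety net described above without manual transcript review on every run.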