BaseTask, running simulations, and interpreting conversation transcripts.
Prerequisite: You need a customer support agent integrated with Netra. If you haven’t set this up yet, follow the Tracing LangChain Agents cookbook first.
What You’ll Learn
Configure Simulation Evaluators
Select session-level evaluators that measure goal achievement, fact accuracy, and conversation quality
Create Multi-Turn Scenarios
Build scenarios with goals, user data, fact checkers, and different personas
Wrap Your Agent in BaseTask
Implement the BaseTask interface to connect your agent to the simulation engine
Compare Persona Performance
Run the same scenario with Neutral, Friendly, Frustrated, and Confused personas and compare results
Why Simulate Customer Support?
Customer support agents engage in goal-oriented, multi-turn conversations where subtle failures compound:

| Failure Mode | What Goes Wrong | Why Single-Turn Evaluation Misses It |
|---|---|---|
| Incomplete resolution | Agent answers the question but never confirms the action was taken | Each individual response looks correct, but the goal is never achieved |
| Fact miscommunication | Agent states the wrong refund timeline or return policy | Only detectable when you define specific facts the agent must communicate |
| Persona sensitivity | Agent handles friendly users well but breaks down with frustrated customers | Single-turn tests don’t model emotional progression across a conversation |
| Premature closure | Agent ends the conversation before the customer’s issue is fully resolved | Only visible in multi-turn context with goal tracking |
Now, let’s walk through the process of simulating customer support conversations:
Step 1: Select Simulation Evaluators
Go to Evaluation → Evaluators, switch to the Library tab, and filter by Multi turn. Add the following four evaluators:

| Evaluator | What It Measures |
|---|---|
| Goal Fulfillment | Did the conversation achieve the customer’s objective (e.g., process the refund)? |
| Factual Accuracy | Did the agent communicate the correct refund timeline, return policy, and other facts? |
| Conversation Completeness | Were all of the customer’s questions and intents addressed during the conversation? |
| Guideline Adherence | Did the agent follow its instructions throughout — staying professional, not making promises it shouldn’t? |
Step 2: Create a Multi-Turn Dataset
Go to Evaluation → Datasets and click Create Dataset.

Configure basics

Set the dataset name to “Customer Support Scenarios” and add tags like support, refunds. Select Multi-turn as the type and Add manually as the data source.

Configure the scenario
Define the first scenario:

- Scenario Goal:
- Behavior Instructions (optional):
- Max Turns: 5
- User Persona: Frustrated
- Provider and Model: Choose the LLM that will generate simulated user responses (e.g., OpenAI / GPT-4.1).
Add user data and facts
Simulated User Data — context the simulated user can reference:

| Key | Value |
|---|---|
| order_number | ORD-12345 |
| purchase_date | 2024-01-15 |
| product_name | Wireless Headphones |
| order_total | $79.99 |
| issue | Arrived damaged — left earcup cracked |

Fact Checker — facts the agent must communicate correctly:

| Fact | Expected Value |
|---|---|
| refund_processing_time | 5-7 business days |
| refund_method | Original payment method |
| return_label_delivery | Within 24 hours via email |
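For reference, the scenario configured above can be represented as plain Python data. This is illustrative only: the field names mirror the UI labels, not a documented Netra schema, and the goal text is a placeholder you would replace with your own.

```python
# Illustrative only: field names mirror the UI labels, not a documented
# Netra API. The "goal" text is a placeholder.
refund_scenario = {
    "goal": "The customer wants a refund for a damaged order.",  # placeholder
    "max_turns": 5,
    "persona": "Frustrated",
    "user_data": {
        "order_number": "ORD-12345",
        "purchase_date": "2024-01-15",
        "product_name": "Wireless Headphones",
        "order_total": "$79.99",
        "issue": "Arrived damaged — left earcup cracked",
    },
    "facts": {
        "refund_processing_time": "5-7 business days",
        "refund_method": "Original payment method",
        "return_label_delivery": "Within 24 hours via email",
    },
}
```

Keeping scenario data in one structure like this makes it easy to review facts alongside user data before entering them in the UI.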
Select evaluators
Add the four evaluators from Step 1 — Goal Fulfillment, Factual Accuracy, Conversation Completeness, and Guideline Adherence. Configure variable mappings to connect evaluator inputs to scenario fields, agent responses, and conversation metadata.
Step 3: Add More Scenarios
Add two more scenarios to the same dataset, each with a different persona and goal:

Scenario 2 — Order Status Inquiry (Neutral persona)

| Field | Value |
|---|---|
| Goal | The customer wants to know the current status of their order and the expected delivery date. |
| Max Turns | 4 |
| Persona | Neutral |
| User Data | order_number: ORD-67890, product_name: Standing Desk |
| Facts | estimated_delivery: March 15, carrier: FedEx, tracking_available: Yes |
Scenario 3 — Return Policy Question (Confused persona)

| Field | Value |
|---|---|
| Goal | The customer wants to understand the return policy for an item they bought three weeks ago. They are unsure whether they are still within the return window. |
| Max Turns | 6 |
| Persona | Confused |
| User Data | order_number: ORD-11111, product_name: Bluetooth Speaker, purchase_date: 2024-02-01 |
| Facts | return_window: 30 days from purchase, return_condition: Item must be unused and in original packaging |
Step 4: Implement the BaseTask Wrapper
Wrap your customer support agent in a BaseTask so the simulation engine can call it turn by turn. The run() method receives the simulated user’s message and a session_id for conversation continuity.
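A minimal sketch of the wrapper is shown below. The real BaseTask lives in the Netra SDK; the stand-in class here only mirrors the interface this guide describes, so the example is self-contained. The callable passed in stands in for your LangChain agent from the tracing cookbook.

```python
# Sketch only: Netra's real BaseTask comes from its SDK. The stand-in
# below just mirrors the run(message, session_id) interface described
# in this guide so the example runs on its own.
class BaseTask:  # stand-in for the SDK base class
    def run(self, message: str, session_id: str) -> str:
        raise NotImplementedError


class SupportAgentTask(BaseTask):
    """Adapts a customer support agent to the simulation engine."""

    def __init__(self, agent):
        # `agent` is any callable(message, session_id) -> reply,
        # e.g. your LangChain agent wrapped with session-scoped history.
        self.agent = agent

    def run(self, message: str, session_id: str) -> str:
        # session_id preserves multi-turn context across simulated turns.
        return self.agent(message, session_id)
```

In a real integration you would subclass the SDK's BaseTask directly and invoke your agent inside run(), passing session_id through so each simulated conversation keeps its own history.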
Step 5: Trigger the Simulation
Copy the Dataset ID from the dataset page and run the simulation.

Step 6: Analyze Results
Go to Evaluation → Test Runs and filter by Multi turn to find your simulation run.

Summary Metrics
The top of the detail view shows aggregated data — total scenarios, pass/fail counts, total cost, and average latency. Use this for a quick health check before diving into individual scenarios.

Conversation Transcripts
Click on any scenario to open the detail view. The Conversation tab shows the full turn-by-turn dialogue between the simulated user and your agent. Look for:

- Where the conversation stalled — did the agent ask for information the user already provided?
- Fact accuracy — did the agent state the correct refund timeline?
- Resolution confirmation — did the agent explicitly confirm the action before ending?
Exit Reasons
Each scenario ends with one of four exit reasons:

| Exit Reason | What It Means |
|---|---|
| Goal Achieved | The customer’s objective was successfully completed |
| Goal Failed | The conversation ended without achieving the goal |
| Abandoned | The simulated user gave up or stopped engaging |
| Max Turns Reached | Hit the turn limit before goal completion |
Evaluation Scores
The Evaluation Results tab shows scores for each evaluator. Compare scores across the three scenarios to spot persona-specific weaknesses:

| Scenario | Persona | Goal Fulfillment | Factual Accuracy | Completeness | Guideline Adherence |
|---|---|---|---|---|---|
| Refund request | Frustrated | 0.8 | 1.0 | 0.75 | 0.6 |
| Order status | Neutral | 1.0 | 1.0 | 1.0 | 1.0 |
| Return policy | Confused | 0.6 | 0.75 | 0.5 | 0.8 |
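To see at a glance which persona needs attention, the example scores above can be averaged across evaluators with a few lines of Python:

```python
# Example scores from the table above, keyed by (scenario, persona),
# listed in evaluator order: Goal Fulfillment, Factual Accuracy,
# Completeness, Guideline Adherence.
scores = {
    ("Refund request", "Frustrated"): [0.8, 1.0, 0.75, 0.6],
    ("Order status", "Neutral"): [1.0, 1.0, 1.0, 1.0],
    ("Return policy", "Confused"): [0.6, 0.75, 0.5, 0.8],
}

# Mean across the four evaluators for each scenario.
averages = {key: sum(vals) / len(vals) for key, vals in scores.items()}

# Lowest average first: here the Confused persona scenario stands out.
worst = min(averages, key=averages.get)
```

In this example the Confused-persona return-policy scenario averages lowest, which matches the per-evaluator table: it is weakest on Completeness and Goal Fulfillment.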
Debugging with Traces
Click View Trace on any conversation turn to inspect the full execution — LLM inputs, tool calls (if applicable), token usage, and latency. This connects simulation results directly to your observability traces.

Interpreting Scores and Improving Quality
When evaluator scores are low, use this table to identify the likely cause and fix:

| Low Score In | Likely Cause | How to Fix |
|---|---|---|
| Goal Fulfillment | Agent answered questions but never completed the action (e.g., processed the refund) | Add explicit instructions in the system prompt to confirm resolution before ending |
| Factual Accuracy | Agent stated incorrect policy details or timelines | Include accurate policy data in the system prompt or connect to a knowledge base tool |
| Conversation Completeness | Agent addressed the primary question but missed follow-up intents | Improve the system prompt to check whether the customer has additional questions |
| Guideline Adherence | Agent deviated from tone or made unauthorized promises | Tighten the system prompt guidelines and add guardrails for what the agent should not promise |
Continuous Simulation Strategy
For production support agents, run simulations regularly:

- On every prompt change — Verify that updated instructions don’t break existing conversation patterns
- After adding new tools — Ensure the agent correctly integrates new capabilities into conversations
- After model upgrades — Compare conversation quality across model versions
- Weekly regression runs — Catch gradual degradation in goal achievement or fact accuracy
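A regression run is most useful when it can fail a build. As a sketch (how you export evaluator scores from Netra test runs is up to your setup; the input dict here uses the example scores from Step 6), a CI check might look like:

```python
# Sketch: fail a CI job when Goal Fulfillment regresses. How you fetch
# scores from your Netra test runs is up to your pipeline; this only
# shows the threshold check itself.
GOAL_FULFILLMENT_THRESHOLD = 0.8


def check_regression(run_scores):
    """Return scenarios whose Goal Fulfillment fell below the threshold."""
    return [
        scenario
        for scenario, score in run_scores.items()
        if score < GOAL_FULFILLMENT_THRESHOLD
    ]


# Using the example scores from Step 6:
failing = check_regression(
    {"Refund request": 0.8, "Order status": 1.0, "Return policy": 0.6}
)
# In CI, a non-empty `failing` list would fail the build (e.g., sys.exit(1)).
```

Wiring this into a weekly scheduled job gives you the regression safety net described above without manual transcript review on every run.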