Test Runs for simulation show the execution results of your multi-turn datasets. Each run provides a complete conversation transcript between the simulated user and your agent, along with evaluation results, scenario details, and performance metrics.
Why Simulation Test Runs Matter
Simulation test runs provide deep insights into conversational agent performance:
| Capability | Benefit |
|---|---|
| Conversation Transcripts | See the full multi-turn dialogue to understand how your agent performed |
| Scenario Details | View goal, persona, user data, and fact checker configuration |
| Turn-by-Turn Tracing | Jump directly to execution traces for each conversation turn |
| Evaluation Results | Review turn-level and session-level evaluator scores |
| Exit Reason Tracking | Understand why conversations ended (goal achieved, failed, abandoned, max turns) |
| Aggregated Metrics | Monitor cost, latency, and success rates across simulations |
Test Runs Dashboard
Navigate to Evaluation → Test Runs from the left navigation panel. Filter by the Multi-turn type to see simulation test runs.
| Column | Description |
|---|---|
| Name | The agent that was tested in the simulation |
| Type | Multi-turn for simulation test runs |
| Started At | Timestamp when the simulation began |
| Status | Current state: Completed, In Progress, or Failed |
| Dataset | The dataset used for this simulation |
Filtering and Search
- Turn Type: Filter to show only Multi-turn (simulation) test runs
- Date Range: Filter runs by time period to compare performance over time
- Search: Find specific test runs by agent or dataset name
- Sort: Order by date, status, or dataset
Viewing Test Run Details
Click on any simulation test run to access detailed results.
Summary Metrics
The top of the detail view shows aggregated performance data:
| Metric | Description |
|---|---|
| Total Items | Number of scenarios run in this test |
| Passed Items | Count of scenarios that achieved their goals |
| Failed Items | Count of scenarios that did not achieve goals |
| Total Cost | Aggregate token/API cost for all scenarios |
| Total Duration | End-to-end time for the simulation run |
| Average Latency | Mean response time across all turns |
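If you export per-scenario results (for example as JSON), the same aggregates are easy to recompute offline. The sketch below is illustrative only; the `passed`, `cost`, and `latencies_ms` fields are hypothetical and not a documented export schema.

```python
from statistics import mean

# Hypothetical export of per-scenario results for one test run.
# Field names here are illustrative, not a documented schema.
results = [
    {"passed": True,  "cost": 0.042, "latencies_ms": [820, 640, 910]},
    {"passed": True,  "cost": 0.031, "latencies_ms": [750, 700]},
    {"passed": False, "cost": 0.055, "latencies_ms": [900, 880, 1020, 990]},
]

total_items = len(results)
passed_items = sum(r["passed"] for r in results)
total_cost = sum(r["cost"] for r in results)
avg_latency = mean(l for r in results for l in r["latencies_ms"])  # mean across all turns

print(f"{passed_items}/{total_items} passed, total cost ${total_cost:.3f}, "
      f"avg latency {avg_latency:.0f} ms")
```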
Viewing Scenario Details
Click on any test run item to view the detailed scenario results. This opens a modal with three tabs.
Tab 1: Conversation
The Conversation tab shows the full multi-turn dialogue between the simulated user and your agent.
Features:
- Turn-by-Turn Display: Each conversation turn is clearly separated
- User Messages: Shows what the simulated user said
- Agent Responses: Shows what your agent replied
- Trace Links: Click View Trace on any turn to see detailed execution traces
- Turn Index: Track which turn number you’re viewing (Turn 1, Turn 2, etc.)
- Exit Reason: Shows why the conversation ended
Exit Reasons:
| Exit Reason | Description |
|---|---|
| Goal Achieved | The scenario objective was successfully completed |
| Goal Failed | The objective could not be achieved |
| Abandoned | The simulated user gave up or stopped engaging |
| Max Turns Reached | Hit the turn limit before goal completion |
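To see which exit reasons dominate a run, tally them across scenarios. A minimal sketch, assuming exit reasons have been exported as a simple list of strings:

```python
from collections import Counter

# Hypothetical exit reasons collected from one test run's scenarios.
exit_reasons = [
    "Goal Achieved", "Goal Achieved", "Max Turns Reached",
    "Goal Failed", "Goal Achieved", "Abandoned",
]

for reason, count in Counter(exit_reasons).most_common():
    print(f"{reason}: {count}")
```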
Conversation Flow:
```
Turn 1:
User: "Hi, I need help with a refund for order ORD-123456."
Agent: "I'd be happy to help you with that refund. Let me look up your order..."
[View Trace]

Turn 2:
User: "The product arrived damaged, and I'd like a full refund."
Agent: "I'm sorry to hear that. I've located your order for Wireless Headphones..."
[View Trace]

Turn 3:
...
```
Use the View Trace link to debug specific turns where the agent’s response
was unexpected or incorrect. Traces show the full LLM call, tool usage, and
latency breakdown.
Tab 2: Evaluation Results
The Evaluation Results tab shows scores from all configured evaluators.
Overall conversation scores:
| Evaluator | Score | Pass/Fail |
|---|---|---|
| Goal Fulfillment | 1.0 | Pass |
| Factual Accuracy | 1.0 | Pass |
| Conversation Completeness | 1.0 | Pass |
| Profile Utilization | 0.75 | Pass |
| Guideline Adherence | 0.5 | Fail |
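The Pass/Fail verdict typically comes from comparing each evaluator's score against a threshold. The sketch below is only an illustration; the 0.6 threshold is an assumption, not a platform default.

```python
# Hypothetical session-level scores for one scenario; names mirror the table above.
scores = {
    "Goal Fulfillment": 1.0,
    "Factual Accuracy": 1.0,
    "Conversation Completeness": 1.0,
    "Profile Utilization": 0.75,
    "Guideline Adherence": 0.5,
}

PASS_THRESHOLD = 0.6  # assumed threshold for this sketch, not a platform default

for evaluator, score in scores.items():
    verdict = "Pass" if score >= PASS_THRESHOLD else "Fail"
    print(f"{evaluator}: {score} -> {verdict}")
```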
Tab 3: Scenario Details
The Scenario Details tab shows the complete configuration used for this simulation.
Scenario Section:
| Field | Value |
|---|---|
| Goal | The scenario objective (e.g., “Get a refund for a damaged product”) |
| Max Turns | Maximum turns allowed (e.g., 5) |
| User Persona | The persona used (e.g., Frustrated 😤) |
User Data Section:
Shows all context data provided to the simulated user:
```json
{
  "order id": "3",
  "product name": "laptop stand"
}
```
Fact Checker Section:
Shows facts the agent needed to communicate:
```json
{
  "item usage": "unused",
  "refund window": "7 days",
  "days since delivery": "28"
}
```
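In this example the order was delivered 28 days ago while the refund window is 7 days, so a fact-accurate agent should explain that the order falls outside the window rather than promise a full refund. A crude way to spot-check fact coverage offline is to scan the agent's messages for the expected values; real fact-checker evaluators are more sophisticated, and the snippet below is a rough sketch with hypothetical data.

```python
# Facts the agent was expected to communicate (Fact Checker section above).
facts = {
    "item usage": "unused",
    "refund window": "7 days",
    "days since delivery": "28",
}

# Hypothetical agent messages taken from a conversation transcript.
agent_messages = [
    "I can see the laptop stand is unused, but it was delivered 28 days ago.",
    "Our refund window is 7 days, so this order falls outside the window.",
]

transcript = " ".join(agent_messages).lower()
missing = [name for name, value in facts.items() if value.lower() not in transcript]
print("Missing facts:", missing or "none")
```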
Provider Configuration Section:
| Field | Value |
|---|---|
| Provider | The LLM provider used (e.g., openai) |
| Model | The model used for simulation (e.g., gpt-5) |
The Scenario Details tab is crucial for understanding the context of each
simulation. It shows exactly what data the simulated user had access to and
what facts the agent was expected to communicate.
Analyzing Simulation Results
Identifying Patterns
When reviewing simulation test runs, look for:
- Goal achievement rates: What percentage of simulations achieved their goals?
- Persona differences: Does your agent perform better with certain personas?
- Turn efficiency: Are conversations longer than necessary?
- Common failure points: Which turns typically cause issues?
- Fact accuracy: Are specific facts consistently missed?
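For example, grouping results by persona makes persona-specific weaknesses easy to spot. A minimal sketch over a hypothetical export; the `persona` and `passed` fields are assumptions:

```python
from collections import defaultdict

# Hypothetical per-scenario results; "persona" and "passed" are assumed field names.
results = [
    {"persona": "Friendly", "passed": True},
    {"persona": "Friendly", "passed": True},
    {"persona": "Frustrated", "passed": False},
    {"persona": "Frustrated", "passed": True},
    {"persona": "Confused", "passed": False},
]

by_persona = defaultdict(list)
for r in results:
    by_persona[r["persona"]].append(r["passed"])

for persona, outcomes in by_persona.items():
    rate = 100 * sum(outcomes) / len(outcomes)
    print(f"{persona}: {rate:.0f}% goal achievement")
```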
Debugging Failed Simulations
For each failed scenario:
- Review the Conversation tab: Identify where the conversation went wrong
- Check the Evaluation Results tab: See which evaluators failed and why
- Examine the Scenario Details tab: Verify the user data and facts were correct
- Click View Trace: Inspect the full execution flow for problematic turns
- Review LLM inputs: Check what context and prompts were sent to the LLM
Comparing Across Runs
To track improvement or regression:
- Run simulations after each agent update (abilities, constraints, or prompt changes)
- Compare goal achievement rates across runs
- Investigate scenarios that changed from pass to fail
- Track turn efficiency and cost trends over time
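A concrete way to catch regressions is to diff two runs by scenario and flag the ones that flipped from pass to fail. The sketch below assumes each run can be exported as a mapping from scenario name to pass/fail; the data shapes and names are hypothetical.

```python
# Hypothetical exports: scenario name -> passed, for a baseline run and a newer run.
baseline = {"refund_damaged_item": True, "order_status_lookup": True, "address_change": False}
latest = {"refund_damaged_item": False, "order_status_lookup": True, "address_change": True}

regressions = [name for name, passed in baseline.items() if passed and not latest.get(name, False)]
improvements = [name for name, passed in latest.items() if passed and not baseline.get(name, False)]

print("Regressions (pass -> fail):", regressions)
print("Improvements (fail -> pass):", improvements)
```

Scenarios that regress are the first candidates for trace-level debugging.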
Use Cases
Agent Capability Testing
Validate new agent abilities:
- Create scenarios that exercise the new capability
- Run simulations with multiple personas
- Verify goal achievement and fact accuracy
- Analyze conversation quality across different user types
Constraint Compliance Testing
Ensure your agent respects boundaries:
- Create scenarios that attempt to violate constraints
- Use turn-level constraint adherence evaluators
- Verify the agent refuses or escalates appropriately
- Check that violations are caught in every turn
Persona Optimization
Optimize performance across user types:
- Run the same scenarios with all personas (neutral, friendly, frustrated, confused)
- Compare goal achievement rates by persona
- Identify which personas cause the most issues
- Refine agent abilities to handle challenging personas
Cost and Efficiency Monitoring
Track simulation costs and turn counts:
- Monitor total cost trends across test runs
- Compare turn counts for successful vs failed scenarios
- Identify scenarios that consume excessive turns
- Optimize prompts to reduce turn count while maintaining quality
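One simple signal is whether failed scenarios consume more turns than successful ones. A minimal sketch over hypothetical exported results; `passed` and `turns` are assumed field names:

```python
from statistics import mean

# Hypothetical per-scenario results; "passed" and "turns" are assumed field names.
results = [
    {"passed": True, "turns": 3},
    {"passed": True, "turns": 4},
    {"passed": False, "turns": 9},
    {"passed": False, "turns": 10},
]

passed_turns = [r["turns"] for r in results if r["passed"]]
failed_turns = [r["turns"] for r in results if not r["passed"]]

print(f"Average turns (passed): {mean(passed_turns):.1f}")
print(f"Average turns (failed): {mean(failed_turns):.1f}")
```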
Best Practices
Regular Simulation Testing
- Test after every agent change: Run simulations when updating abilities or constraints
- Create baseline runs: Establish performance benchmarks before making changes
- Track metrics over time: Monitor goal achievement rates, costs, and turn counts
Scenario Coverage
- Test all personas: Run key scenarios with each persona type
- Include edge cases: Create scenarios that challenge your agent’s boundaries
- Vary complexity: Mix simple (2-3 turn) and complex (7-10 turn) scenarios
- Update scenarios: Add new scenarios as you discover failure patterns
Trace-Level Debugging
- Always check traces for failures: Don’t just look at the conversation—inspect the execution
- Verify LLM context: Ensure the agent received the right information
- Check tool calls: Verify the agent used tools correctly
- Review latency: Identify slow turns that might frustrate real users
See also:
- Simulation Overview - Understand the full simulation framework
- Datasets - Create scenarios that generate test runs
- Evaluators - Configure scoring logic for simulations
- Agents - Define agents to test
- Traces - Debug simulation turns with execution traces