Test Runs

Test Runs for simulation show the execution results of your multi-turn datasets. Each run provides a complete conversation transcript between the simulated user and your agent, along with evaluation results, scenario details, and performance metrics.

Why Simulation Test Runs Matter

Simulation test runs provide deep insights into conversational agent performance:

Capability	Benefit
Conversation Transcripts	See the full multi-turn dialogue to understand how your agent performed
Scenario Details	View goal, persona, user data, and fact checker configuration
Turn-by-Turn Tracing	Jump directly to execution traces for each conversation turn
Evaluation Results	Review turn-level and session-level evaluator scores
Exit Reason Tracking	Understand why conversations ended (goal achieved, failed, abandoned, max turns)
Aggregated Metrics	Monitor cost, latency, and success rates across simulations

Test Runs Dashboard

Navigate to Evaluation → Test Runs from the left navigation panel. Filter by Multi turn type to see simulation test runs.

Column	Description
Name	The agent that was tested in the simulation
Type	Multi-turn for simulation test runs
Started At	Timestamp when the simulation began
Status	Current state: Completed, In Progress, or Failed
Dataset	The dataset used for this simulation

Filtering and Search

Turn Type: Filter to show only Multi-turn (simulation) test runs
Date Range: Filter runs by time period to compare performance over time
Search: Find specific test runs by agent or dataset name
Sort: Order by date, status, or dataset

Viewing Test Run Details

Click on any simulation test run to access detailed results.

Summary Metrics

The top of the detail view shows aggregated performance data:

Metric	Description
Total Items	Number of scenarios run in this test
Passed Items	Count of scenarios that achieved their goals
Failed Items	Count of scenarios that did not achieve goals
Total Cost	Aggregate token/API cost for all scenarios
Total Duration	End-to-end time for the simulation run
Average Latency	Mean response time across all turns

Viewing Scenario Details

Click on any test run item to view the detailed scenario results. This opens a modal with three tabs.

Tab 1: Conversation

The Conversation tab shows the full multi-turn dialogue between the simulated user and your agent.

Features:

Turn-by-Turn Display: Each conversation turn is clearly separated
User Messages: Shows what the simulated user said
Agent Responses: Shows what your agent replied
Trace Links: Click View Trace on any turn to see detailed execution traces
Turn Index: Track which turn number you’re viewing (Turn 1, Turn 2, etc.)
Exit Reason: Shows why the conversation ended

Exit Reasons:

Exit Reason	Description
Goal Achieved	The scenario objective was successfully completed
Goal Failed	The objective could not be achieved
Abandoned	The simulated user gave up or stopped engaging
Max Turns Reached	Hit the turn limit before goal completion

Conversation Flow:

Turn 1:
User: "Hi, I need help with a refund for order ORD-123456."
Agent: "I'd be happy to help you with that refund. Let me look up your order..."
[View Trace]

Turn 2:
User: "The product arrived damaged, and I'd like a full refund."
Agent: "I'm sorry to hear that. I've located your order for Wireless Headphones..."
[View Trace]

Turn 3:
...

Use the View Trace link to debug specific turns where the agent’s response was unexpected or incorrect. Traces show the full LLM call, tool usage, and latency breakdown.

Tab 2: Evaluation Results

The Evaluation Results tab shows scores from all configured evaluators.

Evaluation Results: Overall conversation scores:

Evaluator	Score	Pass/Fail
Goal Fulfillment	1	Pass
Factual Accuracy	1	Pass
Conversation Completeness	1	Pass
Profile Utilization	0.75	Pass
Guidline Adherence	0.5	Fail

Tab 3: Scenario Details

The Scenario Details tab shows the complete configuration used for this simulation.

Scenario Section:

Field	Value
Goal	The scenario objective (e.g., “Get a refund for a damaged product”)
Max Turns	Maximum turns allowed (e.g., 5)
User Persona	The persona used (e.g., Frustrated 😤)

User Data Section: Shows all context data provided to the simulated user:

{
  "order id": "3",
  "product name": "laptop stand"
}

Fact Checker Section: Shows facts the agent needed to communicate:

{
  "item usage": "unused",
  "refund window": "7 days",
  "days since delivery": "28"
}

Provider Configuration Section:

Field	Value
Provider	The LLM provider used (e.g., openai)
Model	The model used for simulation (e.g., gpt-5)

The Scenario Details tab is crucial for understanding the context of each simulation. It shows exactly what data the simulated user had access to and what facts the agent was expected to communicate.

Analyzing Simulation Results

Identifying Patterns

When reviewing simulation test runs, look for:

Goal achievement rates: What percentage of simulations achieved their goals?
Persona differences: Does your agent perform better with certain personas?
Turn efficiency: Are conversations longer than necessary?
Common failure points: Which turns typically cause issues?
Fact accuracy: Are specific facts consistently missed?

Debugging Failed Simulations

For each failed scenario:

Review the Conversation tab: Identify where the conversation went wrong
Check the Evaluation Results tab: See which evaluators failed and why
Examine the Scenario Details tab: Verify the user data and facts were correct
Click View Trace: Inspect the full execution flow for problematic turns
Review LLM inputs: Check what context and prompts were sent to the LLM

Comparing Across Runs

To track improvement or regression:

Run simulations after each agent update (abilities, constraints, or prompt changes)
Compare goal achievement rates across runs
Investigate scenarios that changed from pass to fail
Track turn efficiency and cost trends over time

Use Cases

Agent Capability Testing

Validate new agent abilities:

Create scenarios that exercise the new capability
Run simulations with multiple personas
Verify goal achievement and fact accuracy
Analyze conversation quality across different user types

Constraint Compliance Testing

Ensure your agent respects boundaries:

Create scenarios that attempt to violate constraints
Use turn-level constraint adherence evaluators
Verify the agent refuses or escalates appropriately
Check that violations are caught in every turn

Persona Optimization

Optimize performance across user types:

Run the same scenarios with all personas (neutral, friendly, frustrated, confused)
Compare goal achievement rates by persona
Identify which personas cause the most issues
Refine agent abilities to handle challenging personas

Cost and Efficiency Monitoring

Track simulation costs and turn counts:

Monitor total cost trends across test runs
Compare turn counts for successful vs failed scenarios
Identify scenarios that consume excessive turns
Optimize prompts to reduce turn count while maintaining quality

Best Practices

Regular Simulation Testing

Test after every agent change: Run simulations when updating abilities or constraints
Create baseline runs: Establish performance benchmarks before making changes
Track metrics over time: Monitor goal achievement rates, costs, and turn counts

Scenario Coverage

Test all personas: Run key scenarios with each persona type
Include edge cases: Create scenarios that challenge your agent’s boundaries
Vary complexity: Mix simple (2-3 turn) and complex (7-10 turn) scenarios
Update scenarios: Add new scenarios as you discover failure patterns

Trace-Level Debugging

Always check traces for failures: Don’t just look at the conversation—inspect the execution
Verify LLM context: Ensure the agent received the right information
Check tool calls: Verify the agent used tools correctly
Review latency: Identify slow turns that might frustrate real users

Simulation Overview - Understand the full simulation framework
Datasets - Create scenarios that generate test runs
Evaluators - Configure scoring logic for simulations
Agents - Define agents to test
Traces - Debug simulation turns with execution traces

Get Started

Observability

Evaluation

Simulation

Monitoring & Dashboard

Account settings

Why Simulation Test Runs Matter

Test Runs Dashboard

Filtering and Search

Viewing Test Run Details

Summary Metrics

Viewing Scenario Details

Tab 1: Conversation

Tab 2: Evaluation Results

Tab 3: Scenario Details

Analyzing Simulation Results

Identifying Patterns

Debugging Failed Simulations

Comparing Across Runs

Use Cases

Agent Capability Testing

Constraint Compliance Testing

Persona Optimization

Cost and Efficiency Monitoring

Best Practices

Regular Simulation Testing

Scenario Coverage

Trace-Level Debugging

Get Started

Observability

Evaluation

Simulation

Monitoring & Dashboard

Account settings

​Why Simulation Test Runs Matter

​Test Runs Dashboard

​Filtering and Search

​Viewing Test Run Details

​Summary Metrics

​Viewing Scenario Details

​Tab 1: Conversation

​Tab 2: Evaluation Results

​Tab 3: Scenario Details

​Analyzing Simulation Results

​Identifying Patterns

​Debugging Failed Simulations

​Comparing Across Runs

​Use Cases

​Agent Capability Testing

​Constraint Compliance Testing

​Persona Optimization

​Cost and Efficiency Monitoring

​Best Practices

​Regular Simulation Testing

​Scenario Coverage

​Trace-Level Debugging

​Related

Why Simulation Test Runs Matter

Test Runs Dashboard

Filtering and Search

Viewing Test Run Details

Summary Metrics

Viewing Scenario Details

Tab 1: Conversation

Tab 2: Evaluation Results

Tab 3: Scenario Details

Analyzing Simulation Results

Identifying Patterns

Debugging Failed Simulations

Comparing Across Runs

Use Cases

Agent Capability Testing

Constraint Compliance Testing

Persona Optimization

Cost and Efficiency Monitoring

Best Practices

Regular Simulation Testing

Scenario Coverage

Trace-Level Debugging

Related