Test Runs are the results of executing your Datasets through the evaluation pipeline. Each run provides a point-in-time snapshot of your AI system's performance, showing exactly which test cases passed, which failed, and why. Use them to track quality over time, catch regressions, and debug issues.

Why Test Runs Matter

Test Runs transform raw evaluation data into actionable insights:
  • Historical Tracking: Compare results across releases to detect regressions
  • Deep Diagnostics: See expected vs. actual output for every test case
  • Trace Integration: Jump directly to execution traces to debug failures
  • Aggregated Metrics: Monitor cost, latency, and pass rates at a glance

Test Runs Dashboard

Navigate to Evaluation → Test Runs from the left navigation panel.
(Screenshot: Test Runs dashboard showing a list of evaluations.)
The dashboard lists every run with the following columns:
  • Agent Name: The agent or application that was evaluated
  • Dataset: The dataset used for this evaluation
  • Status: The current state of the run: Completed, In Progress, or Failed
  • Started At: The timestamp when the evaluation began
The dashboard also provides controls to filter and organize the list:
  • Date Range: Filter runs by time period to compare performance over time
  • Search: Find specific test runs by agent or dataset name
  • Sort: Order by date, status, or dataset to find what you need quickly

Viewing Test Run Details

Click on any test run to access detailed results and diagnostics.

Summary Metrics

The top of the detail view shows aggregated performance data:
  • Total Cost: Aggregate token/API cost for all test cases
  • Total Duration: End-to-end time for the evaluation run
  • Average Latency: Mean response time across test cases
  • Pass/Fail Rate: Percentage and count of passing vs. failing cases
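
If you export per-test-case results for offline analysis, these summary metrics reduce to simple aggregates. The sketch below assumes a hypothetical list of records with cost_usd, latency_ms, and passed fields; it illustrates the arithmetic rather than the platform's actual export format.

```python
# Minimal sketch: deriving the summary metrics from per-test-case results.
# The record fields (cost_usd, latency_ms, passed) are illustrative only.
from statistics import mean

test_cases = [
    {"cost_usd": 0.012, "latency_ms": 850, "passed": True},
    {"cost_usd": 0.009, "latency_ms": 1200, "passed": False},
    {"cost_usd": 0.011, "latency_ms": 640, "passed": True},
]

total_cost = sum(tc["cost_usd"] for tc in test_cases)
average_latency = mean(tc["latency_ms"] for tc in test_cases)
passed = sum(1 for tc in test_cases if tc["passed"])
pass_rate = passed / len(test_cases)

print(f"Total cost:      ${total_cost:.3f}")
print(f"Average latency: {average_latency:.0f} ms")
print(f"Pass rate:       {pass_rate:.0%} ({passed}/{len(test_cases)})")
```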

Per-Test-Case Results

Each test case displays:
  • Input: The prompt or query sent to your AI system
  • Expected Output: The reference answer defined in your dataset
  • Task Output: The actual response generated by your AI
  • Status: Pass or Fail based on evaluator criteria
  • Evaluator Scores: Individual scores from each configured evaluator
  • View Trace: Link to the full execution trace for debugging
Click View Trace on any failed test case to see the complete execution timeline, including LLM calls, tool invocations, and latency breakdowns.
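
If you pull run results into your own tooling, it can help to model each row explicitly. A minimal sketch, assuming field names that simply mirror the list above (they are not the platform's schema):

```python
# Minimal sketch: one way to model a per-test-case result for offline triage.
# Field names mirror the dashboard columns but are assumptions, not a schema.
from typing import TypedDict

class TestCaseResult(TypedDict):
    input: str
    expected_output: str
    task_output: str
    status: str               # "pass" or "fail"
    evaluator_scores: dict[str, float]
    trace_url: str

def failed_cases(results: list[TestCaseResult]) -> list[TestCaseResult]:
    """Return only the failed cases, ready for trace-level debugging."""
    return [r for r in results if r["status"] == "fail"]
```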

Managing Datasets from Test Runs

Test Run details provide direct access to the underlying dataset configuration.

Items Tab

View and manage test cases in the dataset:
  • Input/Output: The test case prompt and expected response
  • Metadata: Additional context attached to the item
  • Source: Where the test case originated (manual, trace, or import)
  • Tags: Labels for filtering and organization
  • Created At: When the test case was added

Evaluators Tab

View and modify evaluators attached to the dataset (an illustrative configuration sketch follows this list):
  • See all active evaluators and their configurations
  • Edit variable mappings
  • Adjust pass/fail thresholds
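
Conceptually, each attached evaluator carries a variable mapping and a pass threshold. The sketch below shows a hypothetical configuration shape purely to illustrate those two settings; the keys and paths (dataset.input, task.output, and so on) are assumptions, not the platform's format.

```python
# Hypothetical evaluator configuration, for illustration only.
# "variable_mapping" connects the evaluator's template variables to dataset
# fields or agent output; "pass_threshold" is the score a case must reach.
evaluator_config = {
    "name": "answer_correctness",
    "variable_mapping": {
        "question": "dataset.input",
        "reference": "dataset.expected_output",
        "answer": "task.output",
    },
    "pass_threshold": 0.8,
}
```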

Adding to Existing Datasets

Enhance your datasets directly from the Test Run view:

Add New Test Cases

  1. Click Add Item: Opens the test case creation form.
  2. Provide Test Data: Enter the input prompt, define the expected output, and add optional metadata and tags.
  3. Save: The new item is added to the dataset and included in future runs.

Add New Evaluators

  1. Click Add Evaluator: Opens the evaluator selection modal.
  2. Select or Create: Choose from the Library, My Evaluators, or create a new one.
  3. Configure Mappings: Map evaluator variables to dataset fields, agent responses, or trace data.
  4. Save: The evaluator is added and will score all test cases in future runs.

Analyzing Results

Identifying Patterns

When reviewing test runs, look for:
  • Consistent failures: Same test cases failing across multiple runs may indicate a systematic issue (see the counting sketch after this list)
  • New failures: Test cases that previously passed but now fail signal a regression
  • Score trends: Declining evaluator scores over time suggest gradual quality degradation
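
If you can export each run's per-case outcomes, consistent failures are easy to surface by counting failures per test case across recent runs. A minimal sketch, with illustrative run data:

```python
# Minimal sketch: count how often each test case fails across recent runs.
# The run data below is illustrative; substitute your own exported results.
from collections import Counter

runs = {
    "run_2026_01_20": {"case_1": "pass", "case_2": "fail", "case_3": "pass"},
    "run_2026_01_24": {"case_1": "pass", "case_2": "fail", "case_3": "fail"},
    "run_2026_01_28": {"case_1": "pass", "case_2": "fail", "case_3": "pass"},
}

failure_counts = Counter(
    case_id
    for results in runs.values()
    for case_id, status in results.items()
    if status == "fail"
)

# Cases failing in every run point to a systematic issue rather than flakiness.
for case_id, count in failure_counts.most_common():
    print(f"{case_id}: failed in {count}/{len(runs)} runs")
```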

Debugging Failures

For each failed test case:
  1. Compare Expected Output vs Task Output to understand the discrepancy (a diff sketch follows these steps)
  2. Check Evaluator Scores to see which criteria failed
  3. Click View Trace to inspect the full execution flow
  4. Review LLM inputs, tool calls, and intermediate steps in the trace view
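
For step 1, a plain text diff often makes the discrepancy obvious. A minimal sketch using Python's standard difflib on placeholder expected/actual strings, independent of any platform API:

```python
# Minimal sketch: diff the expected output against the task output for a
# failed case. The two strings are placeholders for your exported values.
import difflib

expected_output = "Refunds are available within 30 days of purchase."
task_output = "Refunds are available within 14 days of purchase."

diff = difflib.unified_diff(
    expected_output.splitlines(),
    task_output.splitlines(),
    fromfile="expected_output",
    tofile="task_output",
    lineterm="",
)
print("\n".join(diff))
```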

Comparing Across Runs

To track regression or improvement:
  1. Run evaluations after each significant change (model update, prompt revision, code release)
  2. Compare pass rates and evaluator scores across runs
  3. Investigate any test cases that changed from pass to fail (see the sketch after this list)
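
Once you have per-case outcomes for two runs, pass-to-fail flips can be isolated with a short script. The outcome dictionaries below are illustrative stand-ins for exported results:

```python
# Minimal sketch: find test cases that passed in the baseline run but fail
# in the candidate run. Outcome data shown here is illustrative.
baseline = {"case_1": "pass", "case_2": "pass", "case_3": "fail"}
candidate = {"case_1": "pass", "case_2": "fail", "case_3": "fail"}

regressions = [
    case_id
    for case_id, status in candidate.items()
    if status == "fail" and baseline.get(case_id) == "pass"
]

print("Regressed test cases:", regressions)  # ['case_2']
```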

Use Cases

CI/CD Integration

Run evaluations as part of your deployment pipeline:
  1. Trigger evaluation when code is pushed
  2. Block deployment if pass rate drops below threshold (see the gate sketch after this list)
  3. Review failed cases before merging
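
One common pattern is a small gate script that runs after the evaluation finishes and exits non-zero when the pass rate falls below your threshold, which blocks the deployment step. The sketch below assumes the pass rate has already been written to a results file (test_run_results.json with a pass_rate field is purely illustrative); how you fetch it depends on your platform and pipeline.

```python
# Minimal CI gate sketch: fail the pipeline when the pass rate is too low.
# The results file name and its "pass_rate" field are assumptions used only
# for illustration; obtain the value however your setup exposes it.
import json
import sys

THRESHOLD = 0.90  # block deployment below 90% passing

with open("test_run_results.json") as f:
    results = json.load(f)

pass_rate = results["pass_rate"]
if pass_rate < THRESHOLD:
    print(f"Pass rate {pass_rate:.0%} is below threshold {THRESHOLD:.0%}; blocking deployment.")
    sys.exit(1)

print(f"Pass rate {pass_rate:.0%} meets threshold; proceeding.")
```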

Model Comparison

Evaluate different models objectively:
  1. Run the same dataset with different model configurations
  2. Compare test runs side-by-side
  3. Make data-driven decisions about which model to deploy

Prompt Iteration

Measure the impact of prompt changes:
  1. Create a baseline test run with your current prompt
  2. Update your prompt and run again
  3. Compare results to validate improvement
Last modified on January 28, 2026