Results and Analysis

Benchmark results provide valuable insights into your chatbot’s performance across individual test cases and over time. Blast’s results interface helps you understand evaluation outcomes, track performance trends, and identify specific areas for improvement.

Benchmark Dashboard Navigation

Benchmarks Overview Page

The main benchmarks page provides a high-level view of all your benchmarks:

Dashboard Columns

  • Name: Benchmark title and identifier
  • Description: Purpose and scope of the benchmark
  • Number of Rows: Total test cases in the benchmark
  • Tags: Organizational tags for filtering and grouping
  • Last Updated: Most recent modification date
  • Created: Original creation date

Dashboard Actions

  • View Benchmark: Click to access individual benchmark results
  • Create New: Add new benchmarks to your collection
  • Filter/Search: Find specific benchmarks by name or tags

Individual Benchmark Results Page

Click into any benchmark to view detailed test results:

Test Results Table

  • Prompt: The specific user input being tested
  • Scores on Last Run: Pass/fail ratio (e.g., “4/5 pass”, “2/6 pass”)
  • Runs: Total number of times this test has been executed
  • Last Run: Timestamp of most recent execution
  • Created: When this test was added to the benchmark

Test Results Actions

  • Delete Test: Remove individual test cases from the benchmark
  • Run All: Execute all tests in the benchmark
  • Bulk Delete: Remove multiple selected test cases
  • Export Results: Download test results in CSV format (a parsing sketch follows this list)
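
If you work with exported results outside the dashboard, a short script can summarize them. The snippet below is a minimal sketch that assumes a hypothetical export with prompt, evaluator, and status columns; adjust the column names to match the CSV that Blast actually produces.

  import csv
  from collections import Counter

  def summarize_export(path):
      """Tally pass/fail counts per evaluator from an exported CSV.

      Assumes hypothetical columns: prompt, evaluator, status
      ("pass"/"fail"); adapt to the real export schema.
      """
      counts = {}
      with open(path, newline="", encoding="utf-8") as f:
          for row in csv.DictReader(f):
              tally = counts.setdefault(row["evaluator"], Counter())
              tally[row["status"].lower()] += 1
      return counts

  for evaluator, tally in summarize_export("benchmark_results.csv").items():
      total = tally["pass"] + tally["fail"]
      print(f"{evaluator}: {tally['pass']}/{total} pass")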

Individual Test Result Details

Accessing Detailed Results

Click on any test row to view comprehensive evaluation details:

Run Navigation

  • Run History: Navigate between different execution runs (e.g., “Run 7 of 7”)
  • Timestamp: Exact execution time and date
  • Re-run Test: Execute the test again

Evaluation Results Breakdown

Overall Score Summary

  • Pass/Fail Count: Clear indication like “1 Failed (5 passed)”
  • Evaluator Status: Individual pass/fail status for each evaluator
  • Show/Hide Details: Toggle detailed critique visibility

Evaluators

Each test is assessed by the same evaluators used in simulations:
  1. Language Detection: Validates consistency between input and output language (a sample check is sketched after this list)
  2. Product Relevance: Ensures product recommendations are relevant to user queries
  3. Product Specs Contradiction: Detects inconsistencies with vendor data sheet
  4. Response Style: Evaluates stylistic elements of the output (e.g., that it uses prose rather than numbered lists where prose is expected)
  5. Search Term Relevance: Validates appropriateness of recommended product categories
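
To make these checks concrete, here is a minimal sketch of how a language-consistency evaluator could work. It is an illustration only, not Blast's actual implementation, and it assumes the third-party langdetect package:

  from langdetect import detect  # pip install langdetect

  def evaluate_language_consistency(user_input, bot_output):
      """Illustrative check: pass when input and output share a language.

      detect() returns an ISO 639-1 code such as "en" or "de".
      """
      input_lang = detect(user_input)
      output_lang = detect(bot_output)
      passed = input_lang == output_lang
      critique = (
          f"Input language '{input_lang}' "
          f"{'matches' if passed else 'differs from'} "
          f"output language '{output_lang}'."
      )
      return passed, critique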

Detailed Critiques

Each evaluator provides detailed feedback: a clear pass/fail status indicator together with reasoning that explains the decision, so you can see exactly why a particular test case passed or failed.
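
If you consume these critiques programmatically (for example, from an export), it can help to model each one as a small record. The structure below is a hypothetical sketch, not Blast's actual schema:

  from dataclasses import dataclass

  @dataclass
  class EvaluatorCritique:
      """Hypothetical record for one evaluator's verdict on one run."""
      evaluator: str   # e.g., "Product Specs Contradiction"
      passed: bool     # the pass/fail status indicator
      reasoning: str   # the detailed explanation shown in the UI

  critique = EvaluatorCritique(
      evaluator="Product Specs Contradiction",
      passed=False,
      reasoning="The assistant inaccurately describes the weed killers as 'natural'...",
  )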

Example Detailed Analysis

Input: "What types of pet-safe weed control products do you offer?"

Evaluation Results: 1 Failed (5 passed)

✗ Product Specs Contradiction: Failed
  "The assistant inaccurately describes the weed killers as 'natural,' 
   but the vendor data sheet indicates they are not categorized as 
   natural products."

✓ Product Relevance: Passed
  "Both products are explicitly labeled as pet-friendly weed killers 
   in the vendor data sheet, directly matching the user's request for 
   pet-safe weed control options."

✓ Response Style: Passed
  "The product (Just For Pets Pet-Friendly Weed Killer) is mentioned 
   in plain text with sizes, and all information is presented in 
   continuous prose without bullets, numbers, or markdown formatting."

Comparing Performance Over Time

Historical Performance Tracking

Monitor benchmark performance across multiple runs:

Run Comparison

  • Previous Runs: Access historical execution results
  • Score Trends: Track pass/fail ratios over time (see the sketch after this list)
  • Performance Changes: Identify improvements or regressions
  • Consistency Monitoring: Observe result stability
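
One way to track these trends outside the dashboard is to compute a per-run pass rate from exported results. The sketch below assumes a hypothetical export with run_timestamp and status columns; substitute the column names from Blast's real CSV export.

  import csv
  from collections import defaultdict

  def pass_rate_by_run(path):
      """Compute the pass rate per run from an exported CSV.

      Assumes hypothetical columns: run_timestamp, status
      ("pass"/"fail"); adapt to the actual export schema.
      """
      passes, totals = defaultdict(int), defaultdict(int)
      with open(path, newline="", encoding="utf-8") as f:
          for row in csv.DictReader(f):
              run = row["run_timestamp"]
              totals[run] += 1
              passes[run] += row["status"].lower() == "pass"
      return {run: passes[run] / totals[run] for run in sorted(totals)}

  for run, rate in pass_rate_by_run("benchmark_results.csv").items():
      print(f"{run}: {rate:.0%} pass")  # a falling rate flags a regression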

Next Steps