Results and Analysis

Benchmark results provide valuable insights into your chatbot’s performance across individual test cases and over time. Blast’s results interface helps you understand evaluation outcomes, track performance trends, and identify specific areas for improvement.

Benchmark Dashboard Navigation

Benchmarks Overview Page

The main benchmarks page provides a high-level view of all your benchmarks:

Dashboard Columns

  • Name: Benchmark title and identifier
  • Description: Purpose and scope of the benchmark
  • Number of Rows: Total test cases in the benchmark
  • Tags: Organizational tags for filtering and grouping
  • Last Updated: Most recent modification date
  • Created: Original creation date

Dashboard Actions

  • View Benchmark: Click to access individual benchmark results
  • Create New: Add new benchmarks to your collection
  • Filter/Search: Find specific benchmarks by name or tags

Individual Benchmark Results Page

Click into any benchmark to view detailed test results:

Test Results Table

  • Prompt: The specific user input being tested
  • Scores on Last Run: Pass/fail ratio (e.g., “4/5 pass”, “2/6 pass”)
  • Runs: Total number of times this test has been executed
  • Last Run: Timestamp of most recent execution
  • Created: When this test was added to the benchmark

Test Results Actions

  • Delete Test: Remove individual test cases from the benchmark
  • Run All: Execute all tests in the benchmark
  • Bulk Delete: Remove multiple selected test cases
  • Export Results: Download test results in CSV format (a parsing sketch follows this list)
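
If you work with exported results outside the dashboard, a short script can summarize them. The snippet below is a minimal sketch that assumes a hypothetical export with prompt, evaluator, and status columns; adjust the column names to match the CSV that Blast actually produces.

  import csv
  from collections import Counter

  def summarize_export(path):
      """Tally pass/fail counts per evaluator from an exported CSV.

      Assumes hypothetical columns: prompt, evaluator, status
      ("pass"/"fail"); adapt to the real export schema.
      """
      counts = {}
      with open(path, newline="", encoding="utf-8") as f:
          for row in csv.DictReader(f):
              tally = counts.setdefault(row["evaluator"], Counter())
              tally[row["status"].lower()] += 1
      return counts

  for evaluator, tally in summarize_export("benchmark_results.csv").items():
      total = tally["pass"] + tally["fail"]
      print(f"{evaluator}: {tally['pass']}/{total} pass")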

Individual Test Result Details

Accessing Detailed Results

Click on any test row to view comprehensive evaluation details:

Run Navigation

  • Run History: Navigate between different execution runs (e.g., “Run 7 of 7”)
  • Timestamp: Exact execution time and date
  • Re-run Test: Execute the test again

Evaluation Results Breakdown

Overall Score Summary

  • Pass/Fail Count: Clear indication like “1 Failed (5 passed)”
  • Evaluator Status: Individual pass/fail status for each evaluator
  • Show/Hide Details: Toggle detailed critique visibility

Evaluators

Each test is assessed by the same evaluators used in simulations:
  1. Language Detection: Validates consistency between input and output language (a sample check is sketched after this list)
  2. Product Relevance: Ensures product recommendations are relevant to user queries
  3. Product Specs Contradiction: Detects inconsistencies with vendor data sheet
  4. Response Style: Evaluates stylistic elements of the output (e.g., that it uses prose rather than numbered lists where prose is expected)
  5. Search Term Relevance: Validates appropriateness of recommended product categories
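
To make these checks concrete, here is a minimal sketch of how a language-consistency evaluator could work. It is an illustration only, not Blast's actual implementation, and it assumes the third-party langdetect package:

  from langdetect import detect  # pip install langdetect

  def evaluate_language_consistency(user_input, bot_output):
      """Illustrative check: pass when input and output share a language.

      detect() returns an ISO 639-1 code such as "en" or "de".
      """
      input_lang = detect(user_input)
      output_lang = detect(bot_output)
      passed = input_lang == output_lang
      critique = (
          f"Input language '{input_lang}' "
          f"{'matches' if passed else 'differs from'} "
          f"output language '{output_lang}'."
      )
      return passed, critique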

Detailed Critiques

Each evaluator provides detailed feedback: a clear pass/fail status indicator together with reasoning that explains the decision, so you can see exactly why a particular test case passed or failed.
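
If you consume these critiques programmatically (for example, from an export), it can help to model each one as a small record. The structure below is a hypothetical sketch, not Blast's actual schema:

  from dataclasses import dataclass

  @dataclass
  class EvaluatorCritique:
      """Hypothetical record for one evaluator's verdict on one run."""
      evaluator: str   # e.g., "Product Specs Contradiction"
      passed: bool     # the pass/fail status indicator
      reasoning: str   # the detailed explanation shown in the UI

  critique = EvaluatorCritique(
      evaluator="Product Specs Contradiction",
      passed=False,
      reasoning="The assistant inaccurately describes the weed killers as 'natural'...",
  )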

Example Detailed Analysis

Input: "What types of pet-safe weed control products do you offer?"

Evaluation Results: 1 Failed (5 passed)

✗ Product Specs Contradiction: Failed
  "The assistant inaccurately describes the weed killers as 'natural,' 
   but the vendor data sheet indicates they are not categorized as 
   natural products."

✓ Product Relevance: Passed
  "Both products are explicitly labeled as pet-friendly weed killers 
   in the vendor data sheet, directly matching the user's request for 
   pet-safe weed control options."

✓ Response Style: Passed
  "The product (Just For Pets Pet-Friendly Weed Killer) is mentioned 
   in plain text with sizes, and all information is presented in 
   continuous prose without bullets, numbers, or markdown formatting."

Comparing Performance Over Time

Historical Performance Tracking

Monitor benchmark performance across multiple runs:

Run Comparison

  • Previous Runs: Access historical execution results
  • Score Trends: Track pass/fail ratios over time (see the sketch after this list)
  • Performance Changes: Identify improvements or regressions
  • Consistency Monitoring: Observe result stability
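
One way to track these trends outside the dashboard is to compute a per-run pass rate from exported results. The sketch below assumes a hypothetical export with run_timestamp and status columns; substitute the column names from Blast's real CSV export.

  import csv
  from collections import defaultdict

  def pass_rate_by_run(path):
      """Compute the pass rate per run from an exported CSV.

      Assumes hypothetical columns: run_timestamp, status
      ("pass"/"fail"); adapt to the actual export schema.
      """
      passes, totals = defaultdict(int), defaultdict(int)
      with open(path, newline="", encoding="utf-8") as f:
          for row in csv.DictReader(f):
              run = row["run_timestamp"]
              totals[run] += 1
              passes[run] += row["status"].lower() == "pass"
      return {run: passes[run] / totals[run] for run in sorted(totals)}

  for run, rate in pass_rate_by_run("benchmark_results.csv").items():
      print(f"{run}: {rate:.0%} pass")  # a falling rate flags a regression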

Next Steps