Running Benchmark Tests

Once you’ve created benchmarks and added test cases, you can execute them to validate your chatbot’s performance. Blast provides straightforward execution options for both individual test cases and entire benchmarks, with immediate result feedback.

Individual Test Execution

Running Single Tests

Execute individual test cases for focused validation:
  1. Navigate to Benchmark: Go to the relevant benchmark
  2. Select Test: Find the test case you want to run
  3. Run Test: Click the test's row, then click the Run Test button
  4. View Results: Review the immediate evaluation results

Running Entire Benchmarks

Benchmark Execution

Execute all test cases in a benchmark:
  1. Select Benchmark: Choose the benchmark from your dashboard
  2. Run All: Click the “Run All” button to execute all tests
  3. Monitor Progress: Watch real-time execution progress
  4. Review Results: Analyze overall benchmark performance

Execution Process

  • Parallel Processing: Tests run efficiently in parallel while avoiding excessive load on the AI endpoint being tested (a conceptual sketch follows this list)
  • Real-time Updates: See results as individual tests complete
  • Progress Tracking: Monitor overall completion status
  • Automatic Evaluation: Every test is evaluated by the same set of evaluators (see Available Evaluators below)
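
Blast handles the scheduling for you, but the pattern is roughly the throttled-parallel execution sketched below. This is a conceptual illustration only: run_test, the concurrency limit, and the result strings are hypothetical and not part of Blast's API.

import asyncio

# Hypothetical stand-in for sending one test case to the chatbot endpoint
# and collecting its evaluation; Blast's real internals are not shown here.
async def run_test(test_case: str) -> str:
    await asyncio.sleep(0.1)  # simulate the chatbot round-trip
    return f"{test_case}: completed"

async def run_benchmark(test_cases: list[str], max_concurrency: int = 4) -> list[str]:
    # A semaphore caps how many tests hit the AI endpoint at once,
    # giving parallelism without overloading the system under test.
    semaphore = asyncio.Semaphore(max_concurrency)

    async def throttled(case: str) -> str:
        async with semaphore:
            return await run_test(case)

    return await asyncio.gather(*(throttled(c) for c in test_cases))

if __name__ == "__main__":
    for line in asyncio.run(run_benchmark([f"test-{i}" for i in range(10)])):
        print(line)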

Understanding Test Results

Evaluation Scoring

Each test is evaluated by the following evaluators:

Available Evaluators

  • Language Detection: Detects and validates language switching in conversations
  • Product Relevance: Evaluates if recommended products match user requirements
  • Product Specs Contradiction: Detects contradictions in product specifications
  • Response Style: Evaluates tone, clarity, and style of responses
  • Search Term Relevance: Validates that search terms match user intent
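
Conceptually, each evaluator takes the test input and the chatbot's output and returns a pass/fail verdict plus a critique. The sketch below only illustrates that idea; the EvaluatorResult shape and the example check are hypothetical, not Blast's actual evaluator interface.

from dataclasses import dataclass

@dataclass
class EvaluatorResult:
    # Hypothetical shape: one verdict per evaluator, plus the critique text
    # surfaced in the detailed results view.
    evaluator: str
    passed: bool
    critique: str

def evaluate_response_style(user_input: str, chatbot_output: str) -> EvaluatorResult:
    # Toy stand-in: a real Response Style evaluator would judge tone, clarity,
    # and style, not just response length.
    concise = len(chatbot_output) < 1200
    return EvaluatorResult(
        evaluator="Response Style",
        passed=concise,
        critique=(
            "Response is clear and appropriately concise."
            if concise
            else "Response is overly long and hard to follow."
        ),
    )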

Score Display

Results show pass/fail ratios similar to simulations:
  • Format: “4/5 pass” means 4 out of 5 evaluators passed
  • Individual Results: Each evaluator’s pass/fail status
  • Overall Status: A test passes if a majority of evaluators pass
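
To make the scoring rule concrete, the snippet below computes the displayed ratio and the majority-rule overall status. The dictionary shape is an assumption for illustration, not Blast's data model.

def summarize(evaluator_results: dict[str, bool]) -> str:
    # evaluator_results maps evaluator name -> True (pass) / False (fail)
    passed = sum(evaluator_results.values())
    total = len(evaluator_results)
    overall = "pass" if passed > total / 2 else "fail"  # majority rule
    return f"{passed}/{total} pass -> overall {overall}"

print(summarize({
    "Language Detection": True,
    "Product Relevance": True,
    "Product Specs Contradiction": False,
    "Response Style": True,
    "Search Term Relevance": True,
}))  # prints "4/5 pass -> overall pass"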

Detailed Results Analysis

Individual Test Details

Click on any test result to see comprehensive feedback:

Response Information

  • Input: The exact prompt that was tested
  • Output: Complete chatbot response
  • Run Information: Timestamp and run number

Evaluation Breakdown

  • Pass/Fail Status: Clear indication for each evaluator
  • Detailed Critiques: Explanatory text for each evaluation decision
  • Failure Reasons: Specific explanations when evaluators fail
  • Success Validation: Confirmation of what worked well

Example Detailed Result

Input: "What categories of pet-safe weed control products do you offer?"
Output: [Full chatbot response with product recommendations]

Evaluation Results: 1 Failed (5 passed)

✗ Product Specs Contradiction: Failed
  "The assistant inaccurately describes the weed killers as 'natural,' 
   but the vendor data sheet indicates they are not categorized as natural products."

✓ Product Relevance: Passed
  "Both products are explicitly labeled as pet-friendly weed killers in the vendor 
   data sheet, directly matching the user's request."

✓ Response Style: Passed

Filtering and Navigation

Result Filtering

Use the benchmark page filters and controls to focus on specific results:
  • Show All: Display all test cases with their latest results
  • Show Failed Only: Filter to display only tests that failed evaluation
  • Test List: Scroll through all tests in the benchmark
  • Result History: View previous runs for each test
  • Export Options: Download results for external analysis (see the example script after this list)
  • Refresh: Update results after making chatbot changes
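
If you export results for external analysis, a short script like the one below can reproduce the "Show Failed Only" view offline. The file name and field names are assumptions for illustration; adjust them to match the actual export format.

import json

# Hypothetical export format: a list of test records, each with a test name,
# per-evaluator verdicts, and an overall status.
with open("benchmark_results.json") as f:
    results = json.load(f)

failed = [r for r in results if r.get("overall_status") != "pass"]

for record in failed:
    failing = [e["name"] for e in record.get("evaluations", []) if not e.get("passed")]
    print(f'{record.get("test_name")}: failed {", ".join(failing)}')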

Next Steps