Running Benchmark Tests

Once you’ve created benchmarks and added test cases, you can execute them to validate your chatbot’s performance. Blast provides straightforward execution options for both individual test cases and entire benchmarks, with immediate result feedback.

Individual Test Execution

Running Single Tests

Execute individual test cases for focused validation:
  1. Navigate to Benchmark: Go to the relevant benchmark
  2. Select Test: Find the test case you want to run
  3. Run Test: Click the test's row, then click the Run Test button
  4. View Results: Review the immediate evaluation results

Running Entire Benchmarks

Benchmark Execution

Execute all test cases in a benchmark:
  1. Select Benchmark: Choose the benchmark from your dashboard
  2. Run All: Click the “Run All” button to execute all tests
  3. Monitor Progress: Watch real-time execution progress
  4. Review Results: Analyze overall benchmark performance

Execution Process

  • Parallel Processing: Tests run efficiently in parallel while avoiding excessive load on the AI endpoint being tested (a conceptual sketch follows this list)
  • Real-time Updates: See results as individual tests complete
  • Progress Tracking: Monitor overall completion status
  • Automatic Evaluation: Every test is evaluated by the same set of evaluators (see Available Evaluators below)
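
Blast handles the scheduling for you, but the pattern is roughly the throttled-parallel execution sketched below. This is a conceptual illustration only: run_test, the concurrency limit, and the result strings are hypothetical and not part of Blast's API.

import asyncio

# Hypothetical stand-in for sending one test case to the chatbot endpoint
# and collecting its evaluation; Blast's real internals are not shown here.
async def run_test(test_case: str) -> str:
    await asyncio.sleep(0.1)  # simulate the chatbot round-trip
    return f"{test_case}: completed"

async def run_benchmark(test_cases: list[str], max_concurrency: int = 4) -> list[str]:
    # A semaphore caps how many tests hit the AI endpoint at once,
    # giving parallelism without overloading the system under test.
    semaphore = asyncio.Semaphore(max_concurrency)

    async def throttled(case: str) -> str:
        async with semaphore:
            return await run_test(case)

    return await asyncio.gather(*(throttled(c) for c in test_cases))

if __name__ == "__main__":
    for line in asyncio.run(run_benchmark([f"test-{i}" for i in range(10)])):
        print(line)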

Understanding Test Results

Evaluation Scoring

Each test is evaluated by the following evaluators:

Available Evaluators

  • Language Detection: Detects and validates language switching in conversations
  • Product Relevance: Evaluates if recommended products match user requirements
  • Product Specs Contradiction: Detects contradictions in product specifications
  • Response Style: Evaluates tone, clarity, and style of responses
  • Search Term Relevance: Validates that search terms match user intent
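
Conceptually, each evaluator takes the test input and the chatbot's output and returns a pass/fail verdict plus a critique. The sketch below only illustrates that idea; the EvaluatorResult shape and the example check are hypothetical, not Blast's actual evaluator interface.

from dataclasses import dataclass

@dataclass
class EvaluatorResult:
    # Hypothetical shape: one verdict per evaluator, plus the critique text
    # surfaced in the detailed results view.
    evaluator: str
    passed: bool
    critique: str

def evaluate_response_style(user_input: str, chatbot_output: str) -> EvaluatorResult:
    # Toy stand-in: a real Response Style evaluator would judge tone, clarity,
    # and style, not just response length.
    concise = len(chatbot_output) < 1200
    return EvaluatorResult(
        evaluator="Response Style",
        passed=concise,
        critique=(
            "Response is clear and appropriately concise."
            if concise
            else "Response is overly long and hard to follow."
        ),
    )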

Score Display

Results show pass/fail ratios similar to simulations:
  • Format: “4/5 pass” means 4 out of 5 evaluators passed
  • Individual Results: Each evaluator’s pass/fail status
  • Overall Status: A test passes if a majority of evaluators pass
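
To make the scoring rule concrete, the snippet below computes the displayed ratio and the majority-rule overall status. The dictionary shape is an assumption for illustration, not Blast's data model.

def summarize(evaluator_results: dict[str, bool]) -> str:
    # evaluator_results maps evaluator name -> True (pass) / False (fail)
    passed = sum(evaluator_results.values())
    total = len(evaluator_results)
    overall = "pass" if passed > total / 2 else "fail"  # majority rule
    return f"{passed}/{total} pass -> overall {overall}"

print(summarize({
    "Language Detection": True,
    "Product Relevance": True,
    "Product Specs Contradiction": False,
    "Response Style": True,
    "Search Term Relevance": True,
}))  # prints "4/5 pass -> overall pass"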

Detailed Results Analysis

Individual Test Details

Click on any test result to see comprehensive feedback:

Response Information

  • Input: The exact prompt that was tested
  • Output: Complete chatbot response
  • Run Information: Timestamp and run number

Evaluation Breakdown

  • Pass/Fail Status: Clear indication for each evaluator
  • Detailed Critiques: Explanatory text for each evaluation decision
  • Failure Reasons: Specific explanations when evaluators fail
  • Success Validation: Confirmation of what worked well

Example Detailed Result

Input: "What categories of pet-safe weed control products do you offer?"
Output: [Full chatbot response with product recommendations]

Evaluation Results: 1 Failed (5 passed)

✗ Product Specs Contradiction: Failed
  "The assistant inaccurately describes the weed killers as 'natural,' 
   but the vendor data sheet indicates they are not categorized as natural products."

✓ Product Relevance: Passed
  "Both products are explicitly labeled as pet-friendly weed killers in the vendor 
   data sheet, directly matching the user's request."

✓ Response Style: Passed

Filtering and Navigation

Result Filtering

Use the benchmark page filters and controls to focus on specific results:
  • Show All: Display all test cases with their latest results
  • Show Failed Only: Filter to display only tests that failed evaluation
  • Test List: Scroll through all tests in the benchmark
  • Result History: View previous runs for each test
  • Export Options: Download results for external analysis (see the example script after this list)
  • Refresh: Update results after making chatbot changes
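
If you export results for external analysis, a short script like the one below can reproduce the "Show Failed Only" view offline. The file name and field names are assumptions for illustration; adjust them to match the actual export format.

import json

# Hypothetical export format: a list of test records, each with a test name,
# per-evaluator verdicts, and an overall status.
with open("benchmark_results.json") as f:
    results = json.load(f)

failed = [r for r in results if r.get("overall_status") != "pass"]

for record in failed:
    failing = [e["name"] for e in record.get("evaluations", []) if not e.get("passed")]
    print(f'{record.get("test_name")}: failed {", ".join(failing)}')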

Next Steps