Running Benchmark Tests
Once you’ve created benchmarks and added test cases, you can execute them to validate your chatbot’s performance. Blast provides straightforward execution options for both individual test cases and entire benchmarks, with immediate result feedback.
Individual Test Execution
Running Single Tests
Execute individual test cases for focused validation:
- Navigate to Benchmark: Go to the relevant benchmark
- Select Test: Find the test case you want to run
- Run Test: Click the test’s row, then click the Run Test button
- View Results: Review the immediate evaluation results
Running Entire Benchmarks
Benchmark Execution
Execute all test cases in a benchmark:
- Select Benchmark: Choose the benchmark from your dashboard
- Run All: Click the “Run All” button to execute all tests
- Monitor Progress: Watch real-time execution progress
- Review Results: Analyze overall benchmark performance
Execution Process
- Parallel Processing: Tests run in parallel while avoiding excessive load on the AI endpoint being tested (see the sketch after this list)
- Real-time Updates: See results as individual tests complete
- Progress Tracking: Monitor overall completion status
- Automatic Evaluation: All tests are evaluated with the same set of evaluators (listed below)
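Blast’s runner is internal to the product, but the pattern described above — parallel execution with a cap on in-flight requests so the endpoint isn’t overloaded — typically looks like the following Python sketch. The concurrency limit and function names here are assumptions for illustration, not Blast’s implementation:

```python
# Illustrative sketch only: parallel test execution with bounded
# concurrency, so the chatbot endpoint under test is not overloaded.
import asyncio

MAX_CONCURRENT = 5  # assumed limit; the real value is product-defined

async def run_test(test_case: str) -> str:
    """Placeholder for sending one test prompt to the chatbot endpoint."""
    await asyncio.sleep(0.1)  # stands in for the network call
    return f"result for {test_case!r}"

async def run_benchmark(test_cases: list[str]) -> list[str]:
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)

    async def run_bounded(case: str) -> str:
        async with semaphore:  # at most MAX_CONCURRENT calls in flight
            return await run_test(case)

    # Tests run in parallel; results arrive as individual tests complete.
    return await asyncio.gather(*(run_bounded(c) for c in test_cases))

if __name__ == "__main__":
    results = asyncio.run(run_benchmark([f"test-{i}" for i in range(12)]))
    print(results)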
Understanding Test Results
Evaluation Scoring
Each test is scored by the evaluators listed below; a sketch of the result each evaluator produces follows the list.
Available Evaluators
- Language Detection: Detects and validates language switching in conversations
- Product Relevance: Evaluates if recommended products match user requirements
- Product Specs Contradiction: Detects contradictions in product specifications
- Response Style: Evaluates tone, clarity, and style of responses
- Search Term Relevance: Validates that search terms match user intent
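For illustration, each evaluator can be thought of as producing a named pass/fail result with an explanatory critique. This shape is an assumption based on the fields described later in this guide, not Blast’s actual data model:

```python
# Illustrative sketch: the shape an evaluator result might take.
from dataclasses import dataclass

@dataclass
class EvaluatorResult:
    name: str      # e.g. "Product Relevance"
    passed: bool   # pass/fail status
    critique: str  # explanatory text for the evaluation decision
```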
Score Display
Results show pass/fail ratios similar to simulations:
- Format: “4/5 pass” means 4 out of 5 evaluators passed
- Individual Results: Each evaluator’s pass/fail status
- Overall Status: A test passes if a majority of its evaluators pass (see the sketch below)
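A minimal sketch of how the score string and the majority-based overall status follow from per-evaluator results (the evaluator names are from the list above; the function name is illustrative):

```python
# Illustrative sketch: deriving a "4/5 pass" score and a majority-based
# overall status from per-evaluator pass/fail results.

def summarize(evaluator_results: dict[str, bool]) -> tuple[str, bool]:
    passed = sum(evaluator_results.values())
    total = len(evaluator_results)
    score = f"{passed}/{total} pass"  # e.g. "4/5 pass"
    overall = passed > total / 2      # test passes on a majority
    return score, overall

example = {
    "Language Detection": True,
    "Product Relevance": True,
    "Product Specs Contradiction": False,
    "Response Style": True,
    "Search Term Relevance": True,
}
print(summarize(example))  # ("4/5 pass", True)
```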
Detailed Results Analysis
Individual Test Details
Click on any test result to see comprehensive feedback:
Response Information
- Input: The exact prompt that was tested
- Output: Complete chatbot response
- Run Information: Timestamp and run number
Evaluation Breakdown
- Pass/Fail Status: Clear indication for each evaluator
- Detailed Critiques: Explanatory text for each evaluation decision
- Failure Reasons: Specific explanations when evaluators fail
- Success Validation: Confirmation of what worked well
Example Detailed Result
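For illustration, a detailed result might look like the following; the wording and values are a representative mock-up assembled from the fields described above, not verbatim product output:

```
Input:  "Do you have waterproof hiking boots under $100?"
Output: "Yes — the TrailGuard 2 is waterproof and currently $89..."
Run:    #3, run at 2024-05-14 10:32 UTC

Language Detection           PASS  Conversation stayed in the expected language.
Product Relevance            PASS  Recommendation matches the stated budget and category.
Product Specs Contradiction  FAIL  Response claims "fully waterproof"; spec sheet says "water-resistant".
Response Style               PASS  Tone is clear, helpful, and concise.
Search Term Relevance        PASS  Search terms reflect the user's intent.

Overall: 4/5 pass
```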
Filtering and Navigation
Result Filtering
Use the benchmark page filters to focus on specific results:
- Show All: Display all test cases with their latest results
- Show Failed Only: Filter to display only tests that failed evaluation
Navigation Features
- Test List: Scroll through all tests in the benchmark
- Result History: View previous runs for each test
- Export Options: Download results for external analysis (a short analysis sketch follows this list)
- Refresh: Update results after making chatbot changes
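Once results are downloaded, external analysis can be as simple as the following sketch. The record fields and file name here are assumptions for illustration, not Blast’s actual export schema:

```python
# Illustrative sketch: writing exported results to CSV and pulling out
# failures, analogous to the "Show Failed Only" filter.
import csv

results = [
    {"test": "greeting-in-french", "score": "5/5 pass", "status": "pass"},
    {"test": "budget-boots-query", "score": "4/5 pass", "status": "pass"},
    {"test": "spec-contradiction", "score": "2/5 pass", "status": "fail"},
]

with open("benchmark_results.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["test", "score", "status"])
    writer.writeheader()
    writer.writerows(results)

# Failed-only view for follow-up work:
failed = [r for r in results if r["status"] == "fail"]
print(failed)
```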
Next Steps
- Analyze detailed results to understand performance trends
- Create additional benchmarks based on findings
- Compare results over time to track improvements