Results and Analysis
Benchmark results provide valuable insights into your chatbot’s performance across individual test cases and over time. Blast’s results interface helps you understand evaluation outcomes, track performance trends, and identify specific areas for improvement.
Benchmark Dashboard Navigation
Benchmarks Overview Page
The main benchmarks page provides a high-level view of all your benchmarks:
Dashboard Columns
- Name: Benchmark title and identifier
- Description: Purpose and scope of the benchmark
- Number of Rows: Total test cases in the benchmark
- Tags: Organizational tags for filtering and grouping
- Last Updated: Most recent modification date
- Created: Original creation date
Dashboard Actions
- View Benchmark: Click to access individual benchmark results
- Create New: Add new benchmarks to your collection
- Filter/Search: Find specific benchmarks by name or tags
Individual Benchmark Results Page
Click into any benchmark to view detailed test results:
Test Results Table
- Prompt: The specific user input being tested
- Scores on Last Run: Pass/fail ratio (e.g., “4/5 pass”, “2/6 pass”)
- Runs: Total number of times this test has been executed
- Last Run: Timestamp of most recent execution
- Created: When this test was added to the benchmark
Test Results Actions
- Delete Test: Remove individual test cases from the benchmark
- Run All: Execute all tests in the benchmark
- Bulk Delete: Remove multiple selected test cases
- Export Results: Download test results in CSV format (see the parsing sketch after this list)
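Exported CSVs can be post-processed to rank prompts by pass rate. The sketch below is a minimal example under stated assumptions: the column names (`prompt`, `scores_on_last_run`) and the `"4/5 pass"` score format are guesses based on the dashboard columns described above, so check the headers in your actual export and adjust accordingly.

```python
import csv

# Assumed column names and score format ("4/5 pass") -- verify against the
# headers in your actual CSV export before use.
PROMPT_COL = "prompt"
SCORE_COL = "scores_on_last_run"

def pass_rates(path: str) -> dict[str, float]:
    """Return a pass rate per prompt from an exported results CSV."""
    rates: dict[str, float] = {}
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            passed, total = row[SCORE_COL].removesuffix(" pass").split("/")
            rates[row[PROMPT_COL]] = int(passed) / int(total)
    return rates

if __name__ == "__main__":
    # Print the weakest prompts first so they are easy to triage.
    results = pass_rates("benchmark_export.csv")
    for prompt, rate in sorted(results.items(), key=lambda kv: kv[1]):
        print(f"{rate:>4.0%}  {prompt}")
```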
Individual Test Result Details
Accessing Detailed Results
Click on any test row to view comprehensive evaluation details:
Run Navigation
- Run History: Navigate between different execution runs (e.g., “Run 7 of 7”)
- Timestamp: Exact execution time and date
- Re-run Test: Execute the test again
Evaluation Results Breakdown
Overall Score Summary
- Pass/Fail Count: Clear indication like “1 Failed (5 passed)”
- Evaluator Status: Individual pass/fail result for each evaluator
- Show/Hide Details: Toggle detailed critique visibility
Evaluators
Each test is assessed by the same evaluators used in simulations:
- Language Detection: Validates consistency between input and output language
- Product Relevance: Ensures product recommendations are relevant to user queries
- Product Specs Contradiction: Detects inconsistencies with the vendor’s data sheet
- Response Style: Evaluates stylistic elements of the output (e.g., that the response uses prose rather than numbered lists where prose is expected)
- Search Term Relevance: Validates appropriateness of recommended product categories
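To make the relationship between these per-evaluator verdicts and the overall score summary concrete, here is a minimal sketch. The `EvaluatorResult` record and the sample critique text are illustrative assumptions, not Blast’s actual data model or output.

```python
from dataclasses import dataclass

# Hypothetical representation of a single evaluator verdict; the real
# export/API format may differ.
@dataclass
class EvaluatorResult:
    name: str
    passed: bool
    critique: str = ""

def summarize(results: list[EvaluatorResult]) -> str:
    """Collapse per-evaluator verdicts into a summary like '1 Failed (4 passed)'."""
    failed = [r for r in results if not r.passed]
    return f"{len(failed)} Failed ({len(results) - len(failed)} passed)"

if __name__ == "__main__":
    # Made-up run for illustration only.
    run = [
        EvaluatorResult("Language Detection", True),
        EvaluatorResult("Product Relevance", True),
        EvaluatorResult("Product Specs Contradiction", False,
                        "Claimed spec conflicts with the vendor data sheet."),
        EvaluatorResult("Response Style", True),
        EvaluatorResult("Search Term Relevance", True),
    ]
    print(summarize(run))  # -> 1 Failed (4 passed)
    for r in run:
        if not r.passed:
            print(f"FAIL {r.name}: {r.critique}")
```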
Detailed Critiques
Each evaluator provides detailed feedback: a clear pass/fail status indicator and the reasoning behind the decision. This feedback helps you understand exactly why an evaluator passed or failed a particular test case.
Example Detailed Analysis
Comparing Performance Over Time
Historical Performance Tracking
Monitor benchmark performance across multiple runs:
Run Comparison
- Previous Runs: Access historical execution results
- Score Trends: Track pass/fail ratios over time
- Performance Changes: Identify improvements or regressions
- Consistency Monitoring: Observe result stability
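If you export results from several runs, a short script can flag regressions in the pass/fail ratio. The run data below is a made-up list of (run label, passed, total) tuples; in practice you would assemble it from your own exports rather than hard-coding it.

```python
# Hedged sketch of regression spotting across runs; the RUNS values are
# placeholders, not actual Blast data.
RUNS = [
    ("Run 5", 3, 6),
    ("Run 6", 5, 6),
    ("Run 7", 4, 6),
]

def pass_rate(passed: int, total: int) -> float:
    return passed / total if total else 0.0

previous = None
for label, passed, total in RUNS:
    rate = pass_rate(passed, total)
    trend = ""
    if previous is not None:
        if rate < previous:
            trend = "  <- regression"
        elif rate > previous:
            trend = "  <- improvement"
    print(f"{label}: {passed}/{total} ({rate:.0%}){trend}")
    previous = rate
```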
Next Steps
- Create additional benchmarks based on analysis insights
- Run targeted tests to validate specific improvements
- Compare with simulation results for comprehensive understanding