Evaluations
Automated testing for your AI workflows
Evaluations let you systematically test your workflows by running them against predefined test cases and measuring the results with scorers.
Evaluations is currently in beta. Contact us to get it enabled for your account.
What is an Evaluation?
An evaluation is a reusable test suite that:
- Defines a set of test cases, each with an input, an optional expected output, and one or more scorers
- Runs every test case through your workflow
- Scores the outputs and records aggregate metrics
This enables you to catch regressions before they reach production and measure improvements as you iterate on your prompts and logic.
Key Concepts
Test Cases
Each test case defines the following (a sketch of the shape follows the list):
- Input: The data to send to your workflow
- Expected Output: What the workflow should produce (optional, depending on the scorer)
- Scorers: Which scorers to run against the output
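To make the shape concrete, here is a hypothetical sketch of a test case in TypeScript. The field names are illustrative only and do not reflect Scout's actual schema.

```typescript
// Hypothetical shape of a test case. Field names are illustrative,
// not Scout's actual schema.
interface TestCase {
  input: Record<string, unknown>; // data sent to the workflow
  expectedOutput?: string;        // optional, depending on the scorer
  scorers: string[];              // scorers to run against the output
}

const example: TestCase = {
  input: { question: "What is your refund policy?" },
  expectedOutput: "Refunds are available within 30 days of purchase.",
  scorers: ["exact-match", "semantic-similarity"],
};
```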
Evaluation Runs
When you run an evaluation, Scout executes each test case and records the following (a sketch of a result record appears after the list):
- Pass/fail status based on scorer thresholds
- Individual scorer results and composite scores
- Latency metrics (average, p50, p95)
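As an illustration, a single test case result might be captured in a structure like the sketch below; the actual fields Scout records may differ.

```typescript
// Illustrative per-test-case result record; not Scout's actual schema.
interface TestCaseResult {
  passed: boolean;                        // pass/fail based on scorer thresholds
  scorerResults: Record<string, number>;  // individual scorer scores
  compositeScore: number;                 // combined score across scorers
  latencyMs: number;                      // workflow execution time
}
```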
Metrics
After a run completes, you get aggregate metrics across all test cases, such as the overall pass rate, average composite score, and latency percentiles.
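The sketch below shows how such aggregates could be derived from per-case results. It assumes the hypothetical `TestCaseResult` shape from the previous sketch and is not Scout's implementation.

```typescript
// Derives aggregate metrics from the hypothetical TestCaseResult records above.
function summarize(results: TestCaseResult[]) {
  const latencies = results.map((r) => r.latencyMs).sort((a, b) => a - b);
  // Nearest-rank percentile over the sorted latencies.
  const percentile = (p: number) =>
    latencies[Math.min(latencies.length - 1, Math.floor((p / 100) * latencies.length))];

  return {
    passRate: results.filter((r) => r.passed).length / results.length,
    avgCompositeScore:
      results.reduce((sum, r) => sum + r.compositeScore, 0) / results.length,
    avgLatencyMs: latencies.reduce((sum, l) => sum + l, 0) / latencies.length,
    p50LatencyMs: percentile(50),
    p95LatencyMs: percentile(95),
  };
}
```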
Example Use Cases
Regression Testing
Run evaluations before deploying prompt changes to ensure quality doesn’t degrade.
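For example, a deploy pipeline could run the evaluation and block the release when the pass rate drops below a threshold. The sketch below assumes a hypothetical `runEvaluation` helper and a Node environment; substitute whatever mechanism you use to trigger a run and read its metrics.

```typescript
// runEvaluation is a hypothetical helper that triggers an evaluation run and
// returns its aggregate metrics; wire it up to however you invoke Scout.
declare function runEvaluation(evaluationId: string): Promise<{ passRate: number }>;

async function gateDeploy(): Promise<void> {
  const { passRate } = await runEvaluation("support-bot-regression");
  if (passRate < 0.95) {
    console.error(`Pass rate ${passRate} is below the 95% threshold; blocking deploy.`);
    process.exit(1);
  }
  console.log("Evaluation passed; safe to deploy.");
}

gateDeploy();
```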
Prompt Iteration
Compare evaluation results across different prompt versions to measure improvements.
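One lightweight approach is to run the same evaluation against each prompt version and diff the aggregate metrics. The sketch below reuses the illustrative metric shape from the earlier summary example.

```typescript
// Diffs aggregate metrics from two runs of the same evaluation
// (e.g., baseline prompt vs. a candidate revision).
interface RunSummary {
  passRate: number;
  avgCompositeScore: number;
  p95LatencyMs: number;
}

function compareRuns(baseline: RunSummary, candidate: RunSummary) {
  return {
    passRateDelta: candidate.passRate - baseline.passRate,
    scoreDelta: candidate.avgCompositeScore - baseline.avgCompositeScore,
    p95LatencyDeltaMs: candidate.p95LatencyMs - baseline.p95LatencyMs,
  };
}
```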