Evaluations

Automated testing for your AI workflows

Evaluations let you systematically test your workflows by running them against predefined test cases and measuring the results with scorers.

Evaluations is currently in beta. Contact us to get it enabled for your account.

What is an Evaluation?

An evaluation is a reusable test suite that:

1. Runs test cases: executes your workflow against a set of test cases (input/expected output pairs).

2. Scores outputs: uses scorers to evaluate each output against your quality criteria.

3. Tracks metrics: records metrics like pass rate, latency, and composite scores.

This enables you to catch regressions before they reach production and measure improvements as you iterate on your prompts and logic.
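To make the loop concrete, here is a minimal sketch of what an evaluation does conceptually. The `run_evaluation`, `run_workflow`, and scorer callables are hypothetical placeholders for this sketch, not part of any Scout SDK.

```python
from statistics import mean

def run_evaluation(test_cases, run_workflow, scorers, threshold=0.7):
    """Conceptual sketch of an evaluation loop; not the Scout API.

    test_cases:   dicts with "input" and (optionally) "expected_output".
    run_workflow: callable that maps an input to an output.
    scorers:      callables that return a score between 0.0 and 1.0.
    """
    results = []
    for case in test_cases:
        # 1. Run the test case through the workflow.
        output = run_workflow(case["input"])
        # 2. Score the output against the expected output (if any).
        scores = [score(output, case.get("expected_output")) for score in scorers]
        composite = mean(scores)
        results.append({"composite_score": composite, "passed": composite >= threshold})
    # 3. Track an aggregate metric across all test cases.
    pass_rate = sum(r["passed"] for r in results) / len(results)
    return {"pass_rate": pass_rate, "results": results}
```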

Key Concepts

Test Cases

Each test case defines:

  • Input: The data to send to your workflow
  • Expected Output: What the workflow should produce (optional, depending on the scorer)
  • Scorers: Which scorers to run against the output
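For illustration, a test case might be represented as a structure like the one below. The field names and scorer names are assumptions made for this sketch, not a documented Scout schema.

```python
# Hypothetical test-case structure (field names are illustrative, not a Scout schema).
test_case = {
    "input": {"question": "What is your refund policy?"},
    "expected_output": "Refunds are available within 30 days of purchase.",
    "scorers": ["exact_match", "semantic_similarity"],
}
```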

Evaluation Runs

When you run an evaluation, Scout executes each test case and records:

  • Pass/fail status based on scorer thresholds
  • Individual scorer results and composite scores
  • Latency metrics (average, p50, p95)
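As a rough sketch, a single recorded test-case result could look like the following, with pass/fail derived from per-scorer thresholds. The record shape and threshold logic are illustrative assumptions, not the actual run format.

```python
# Illustrative record for one test case in a run (not the Scout API shape).
case_result = {
    "scorer_results": {"exact_match": 1.0, "semantic_similarity": 0.82},
    "composite_score": 0.91,
    "latency_ms": 1240,
}

# Hypothetical threshold check: the case passes only if every scorer clears its threshold.
thresholds = {"exact_match": 1.0, "semantic_similarity": 0.75}
passed = all(
    score >= thresholds[name]
    for name, score in case_result["scorer_results"].items()
)
```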

Metrics

After a run completes, you get aggregate metrics:

  • Pass Rate: Percentage of test cases that passed all scorers
  • Avg Composite Score: Weighted average of all scorer results
  • Latency: Average and percentile response times
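To show how these aggregates relate to per-case results, here is a small sketch that computes them from a list of case records. The field names are assumptions, and the composite average here is unweighted for simplicity.

```python
from statistics import mean, quantiles

def aggregate(case_results):
    """Compute aggregate metrics from per-case results (illustrative only).

    Each item is assumed to have "passed", "composite_score", and "latency_ms".
    """
    pass_rate = sum(1 for r in case_results if r["passed"]) / len(case_results)
    avg_composite = mean(r["composite_score"] for r in case_results)
    latencies = [r["latency_ms"] for r in case_results]
    # quantiles(..., n=100) returns the 1st through 99th percentiles.
    pct = quantiles(latencies, n=100)
    return {
        "pass_rate": pass_rate,
        "avg_composite_score": avg_composite,
        "latency_avg_ms": mean(latencies),
        "latency_p50_ms": pct[49],
        "latency_p95_ms": pct[94],
    }
```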

Example Use Cases

Regression Testing

Run evaluations before deploying prompt changes to ensure quality doesn’t degrade.
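One way to apply this is a simple CI gate that fails the build when a new run's pass rate drops below a known-good baseline. The metric name and the baseline value below are assumptions for the sketch.

```python
import sys

BASELINE_PASS_RATE = 0.90  # assumed quality bar from a previous known-good run

def regression_gate(current_metrics):
    """Fail a CI job if quality degraded relative to the baseline (illustrative)."""
    if current_metrics["pass_rate"] < BASELINE_PASS_RATE:
        print(f"Regression: pass rate {current_metrics['pass_rate']:.2%} "
              f"is below baseline {BASELINE_PASS_RATE:.2%}")
        sys.exit(1)
    print("Evaluation passed; safe to deploy.")
```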

Prompt Iteration

Compare evaluation results across different prompt versions to measure improvements.
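Comparing versions can be as simple as running the same evaluation against each prompt and diffing the aggregates. The metric names below follow the earlier sketches and are illustrative.

```python
def compare_runs(baseline, candidate):
    """Print metric deltas between two evaluation runs (illustrative)."""
    for metric in ("pass_rate", "avg_composite_score", "latency_p95_ms"):
        delta = candidate[metric] - baseline[metric]
        print(f"{metric}: {baseline[metric]:.3f} -> {candidate[metric]:.3f} ({delta:+.3f})")
```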

Next Steps