Evaluations

Automated testing for your AI workflows

Evaluations let you systematically test your workflows by running them against predefined test cases and measuring the results with scorers.

Evaluations is currently in beta. Contact us to get it enabled for your account.

What is an Evaluation?

An evaluation is a reusable test suite that:

1. Runs test cases: executes your workflow against a set of test cases (input/expected output pairs).

2. Scores outputs: uses scorers to evaluate each output against your quality criteria.

3. Tracks metrics: records metrics like pass rate, latency, and composite scores.

This enables you to catch regressions before they reach production and measure improvements as you iterate on your prompts and logic.
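To make the loop concrete, here is a minimal sketch of what an evaluation does conceptually. The `run_evaluation`, `run_workflow`, and scorer callables are hypothetical placeholders for this sketch, not part of any Scout SDK.

```python
from statistics import mean

def run_evaluation(test_cases, run_workflow, scorers, threshold=0.7):
    """Conceptual sketch of an evaluation loop; not the Scout API.

    test_cases:   dicts with "input" and (optionally) "expected_output".
    run_workflow: callable that maps an input to an output.
    scorers:      callables that return a score between 0.0 and 1.0.
    """
    results = []
    for case in test_cases:
        # 1. Run the test case through the workflow.
        output = run_workflow(case["input"])
        # 2. Score the output against the expected output (if any).
        scores = [score(output, case.get("expected_output")) for score in scorers]
        composite = mean(scores)
        results.append({"composite_score": composite, "passed": composite >= threshold})
    # 3. Track an aggregate metric across all test cases.
    pass_rate = sum(r["passed"] for r in results) / len(results)
    return {"pass_rate": pass_rate, "results": results}
```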

Key Concepts

Test Cases

Each test case defines:

  • Input: The data to send to your workflow
  • Expected Output: What the workflow should produce (optional, depending on the scorer)
  • Scorers: Which scorers to run against the output
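For illustration, a test case might be represented as a structure like the one below. The field names and scorer names are assumptions made for this sketch, not a documented Scout schema.

```python
# Hypothetical test-case structure (field names are illustrative, not a Scout schema).
test_case = {
    "input": {"question": "What is your refund policy?"},
    "expected_output": "Refunds are available within 30 days of purchase.",
    "scorers": ["exact_match", "semantic_similarity"],
}
```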

Evaluation Runs

When you run an evaluation, Scout executes each test case and records:

  • Pass/fail status based on scorer thresholds
  • Individual scorer results and composite scores
  • Latency metrics (average, p50, p95)
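As a rough sketch, a single recorded test-case result could look like the following, with pass/fail derived from per-scorer thresholds. The record shape and threshold logic are illustrative assumptions, not the actual run format.

```python
# Illustrative record for one test case in a run (not the Scout API shape).
case_result = {
    "scorer_results": {"exact_match": 1.0, "semantic_similarity": 0.82},
    "composite_score": 0.91,
    "latency_ms": 1240,
}

# Hypothetical threshold check: the case passes only if every scorer clears its threshold.
thresholds = {"exact_match": 1.0, "semantic_similarity": 0.75}
passed = all(
    score >= thresholds[name]
    for name, score in case_result["scorer_results"].items()
)
```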

Metrics

After a run completes, you get aggregate metrics:

  • Pass Rate: Percentage of test cases that passed all scorers
  • Avg Composite Score: Weighted average of all scorer results
  • Latency: Average and percentile response times
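To show how these aggregates relate to per-case results, here is a small sketch that computes them from a list of case records. The field names are assumptions, and the composite average here is unweighted for simplicity.

```python
from statistics import mean, quantiles

def aggregate(case_results):
    """Compute aggregate metrics from per-case results (illustrative only).

    Each item is assumed to have "passed", "composite_score", and "latency_ms".
    """
    pass_rate = sum(1 for r in case_results if r["passed"]) / len(case_results)
    avg_composite = mean(r["composite_score"] for r in case_results)
    latencies = [r["latency_ms"] for r in case_results]
    # quantiles(..., n=100) returns the 1st through 99th percentiles.
    pct = quantiles(latencies, n=100)
    return {
        "pass_rate": pass_rate,
        "avg_composite_score": avg_composite,
        "latency_avg_ms": mean(latencies),
        "latency_p50_ms": pct[49],
        "latency_p95_ms": pct[94],
    }
```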

Example Use Cases

Regression Testing

Run evaluations before deploying prompt changes to ensure quality doesn’t degrade.
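One way to apply this is a simple CI gate that fails the build when a new run's pass rate drops below a known-good baseline. The metric name and the baseline value below are assumptions for the sketch.

```python
import sys

BASELINE_PASS_RATE = 0.90  # assumed quality bar from a previous known-good run

def regression_gate(current_metrics):
    """Fail a CI job if quality degraded relative to the baseline (illustrative)."""
    if current_metrics["pass_rate"] < BASELINE_PASS_RATE:
        print(f"Regression: pass rate {current_metrics['pass_rate']:.2%} "
              f"is below baseline {BASELINE_PASS_RATE:.2%}")
        sys.exit(1)
    print("Evaluation passed; safe to deploy.")
```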

Prompt Iteration

Compare evaluation results across different prompt versions to measure improvements.
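Comparing versions can be as simple as running the same evaluation against each prompt and diffing the aggregates. The metric names below follow the earlier sketches and are illustrative.

```python
def compare_runs(baseline, candidate):
    """Print metric deltas between two evaluation runs (illustrative)."""
    for metric in ("pass_rate", "avg_composite_score", "latency_p95_ms"):
        delta = candidate[metric] - baseline[metric]
        print(f"{metric}: {baseline[metric]:.3f} -> {candidate[metric]:.3f} ({delta:+.3f})")
```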

Next Steps