Evaluate

Measure and improve the quality of your AI outputs

Evaluation is how you understand and improve the quality of your AI outputs. Scout provides three tools that work together to create a continuous improvement loop.

Feedback

Feedback captures human signals about your workflow outputs. When you deploy a workflow, your end-users or internal team can rate outputs with thumbs up/down and leave comments explaining what went wrong.

This is your source of truth for real-world quality. Positive feedback tells you what’s working. Negative feedback surfaces issues you might not have anticipated, and the comments explain why.
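
To make this concrete, here is a rough sketch of what a feedback record can look like. The `Feedback` class and its field names are illustrative assumptions, not Scout's actual schema:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical shape of a feedback record; field names are illustrative,
# not Scout's actual data model.
@dataclass
class Feedback:
    run_id: str                     # which workflow run the feedback refers to
    rating: bool                    # True = thumbs up, False = thumbs down
    comment: Optional[str] = None   # free text, most useful on negative ratings

negative = Feedback(
    run_id="run_123",
    rating=False,
    comment="The summary dropped the customer's order number.",
)
```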

Scorers

Scorers define what “good” looks like. They’re the criteria used to evaluate outputs, whether by humans or automated systems.

Scout provides built-in scorers for common checks like exact matching, substring containment, regex patterns, and type validation. You can also create custom scorers, including LLM-based scoring for subjective quality assessment.

Scorers are used in two places: with feedback (the thumbs scorer for human ratings) and with evaluations (automated scoring of test cases).
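
As a rough illustration of how the built-in checks behave, the plain-Python sketch below implements exact matching, substring containment, regex matching, and type validation as simple scoring functions. The function names and the 0-to-1 scoring convention are assumptions for the example, not Scout's scorer API:

```python
import re

# Illustrative scorers: each takes an output and an expectation and
# returns a score between 0.0 and 1.0. These mirror the built-in checks
# conceptually; they are not Scout's implementation.

def exact_match(output: str, expected: str) -> float:
    return 1.0 if output == expected else 0.0

def contains(output: str, expected_substring: str) -> float:
    return 1.0 if expected_substring in output else 0.0

def regex_match(output: str, pattern: str) -> float:
    return 1.0 if re.search(pattern, output) else 0.0

def is_type(output: object, expected_type: type) -> float:
    return 1.0 if isinstance(output, expected_type) else 0.0
```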

Evaluations

Evaluations let you run automated test suites against your workflows. You define test cases with inputs and expected outputs, attach scorers to measure quality, and run them to get pass rates and metrics.

Use evaluations during development as a TDD-style workflow: define what success looks like, iterate until your workflow passes, then deploy with confidence. In production, evaluations become your regression tests that run before each deployment.
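
Conceptually, running an evaluation means feeding each test case through the workflow, scoring the output, and rolling the results up into a pass rate. The sketch below shows that flow in plain Python; `run_workflow`, the test-case shape, and the pass threshold are hypothetical stand-ins, not Scout's evaluation API:

```python
# Hypothetical evaluation loop: run each test case through the workflow,
# score the output, and report a pass rate. `run_workflow` stands in for
# your actual workflow; the structure, not the API, is the point.
test_cases = [
    {"input": "Summarize order #4521", "expected": "#4521"},
    {"input": "Summarize order #9877", "expected": "#9877"},
]

def evaluate(run_workflow, test_cases, scorer, threshold=0.5):
    passed = []
    for case in test_cases:
        output = run_workflow(case["input"])
        score = scorer(output, case["expected"])
        passed.append(score >= threshold)
    return sum(passed) / len(passed)   # pass rate across the suite

# e.g. pass_rate = evaluate(run_workflow, test_cases, contains)
```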

How They Work Together

These tools form a continuous improvement loop:

1. Deploy: Deploy your workflow and collect feedback from real users.

2. Identify patterns: Analyze negative feedback to understand what's failing and why.

3. Create test cases: Turn failures into test cases and add them to your evaluations (see the sketch after these steps).

4. Define scorers: Create scorers that catch those failure modes.

5. Iterate: Refine your workflow until evaluations pass.

6. Deploy improvements: Ship the fix and continue collecting feedback.
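
For example, step 3 of this loop, converting a piece of negative feedback into a test case, might look like the following sketch; the field names and values are hypothetical:

```python
# Illustrative sketch: turn a piece of negative feedback ("the summary
# dropped the customer's order number") into a regression test case.
# Field names are hypothetical, not Scout's schema.
new_case = {
    "input": "Summarize order #4521 for the customer",  # replay the failing request
    "expected": "#4521",                                 # the detail the feedback flagged
}
# Adding new_case to the evaluation suite means this failure is caught
# automatically before the next deployment.
```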

Over time, your evaluation suite grows to cover more edge cases, and feedback keeps surfacing new ones. This flywheel drives continuous quality improvement.