Evaluate
Evaluation is how you understand and improve the quality of your AI outputs. Scout provides three tools that work together to create a continuous improvement loop: Feedback, Scorers, and Evaluations.
Feedback
Feedback captures human signals about your workflow outputs. When you deploy a workflow, your end-users or internal team can rate outputs with thumbs up/down and leave comments explaining what went wrong.
This is your source of truth for real-world quality. Positive feedback tells you what’s working. Negative feedback surfaces issues you might not have anticipated, and the comments explain why.
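Conceptually, a piece of feedback is just a rating and an optional comment tied to a specific workflow output. The sketch below is illustrative only; the field names are assumptions, not Scout's actual schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FeedbackRecord:
    """Illustrative shape of one piece of human feedback on a workflow output."""
    workflow_run_id: str           # which output the rating refers to
    rating: bool                   # True = thumbs up, False = thumbs down
    comment: Optional[str] = None  # optional explanation of what went wrong

# Example: an end-user flags a bad answer and explains why.
negative = FeedbackRecord(
    workflow_run_id="run_123",
    rating=False,
    comment="The summary dropped the final paragraph of the document.",
)
```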
Scorers
Scorers define what “good” looks like. They’re the criteria used to evaluate outputs, whether by humans or automated systems.
Scout provides built-in scorers for common checks like exact matching, substring containment, regex patterns, and type validation. You can also create custom scorers, including LLM-based scoring for subjective quality assessment.
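To make the distinction concrete, here is a rough sketch of how those checks could be expressed as plain scoring functions. These are illustrative implementations under assumed conventions (a score between 0 and 1), not Scout's actual scorer API.

```python
import re

# Built-in style checks: each takes an output (and an expected value or
# pattern) and returns a score between 0 and 1.
def exact_match(output: str, expected: str) -> float:
    return 1.0 if output.strip() == expected.strip() else 0.0

def contains(output: str, substring: str) -> float:
    return 1.0 if substring in output else 0.0

def matches_pattern(output: str, pattern: str) -> float:
    return 1.0 if re.search(pattern, output) else 0.0

# A custom scorer can encode any rule you care about, e.g. a length limit.
# An LLM-based scorer would instead prompt a model to grade the output
# against a rubric and return its judgment as the score.
def under_word_limit(output: str, limit: int = 200) -> float:
    return 1.0 if len(output.split()) <= limit else 0.0
```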
Scorers are used in two places: with feedback (the thumbs scorer for human ratings) and with evaluations (automated scoring of test cases).
Evaluations
Evaluations let you run automated test suites against your workflows. You define test cases with inputs and expected outputs, attach scorers to measure quality, and run them to get pass rates and metrics.
Use evaluations during development as a TDD-style workflow: define what success looks like, iterate until your workflow passes, then deploy with confidence. In production, evaluations become your regression tests that run before each deployment.
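As a rough sketch of that workflow, the example below defines test cases, scores a workflow's outputs, and gates on a pass-rate threshold. The function names, test-case shape, and threshold are assumptions for illustration, not Scout's actual API.

```python
from typing import Callable

def exact_match(output: str, expected: str) -> float:
    # Minimal scorer, repeated here so the sketch is self-contained.
    return 1.0 if output.strip() == expected.strip() else 0.0

def run_evaluation(
    workflow: Callable[[str], str],
    cases: list[dict],
    scorer: Callable[[str, str], float],
    threshold: float = 0.9,
) -> bool:
    """Run every test case, report the pass rate, and gate on a threshold."""
    scores = [scorer(workflow(case["input"]), case["expected"]) for case in cases]
    pass_rate = sum(1 for s in scores if s == 1.0) / len(scores)
    print(f"pass rate: {pass_rate:.0%}")
    return pass_rate >= threshold

# Example: a stand-in workflow that just looks up a canned answer.
cases = [
    {"input": "What is the capital of France?", "expected": "Paris"},
    {"input": "What is 2 + 2?", "expected": "4"},
]
answers = {c["input"]: c["expected"] for c in cases}
ok = run_evaluation(lambda q: answers.get(q, ""), cases, exact_match)
print("deploy" if ok else "block deployment")
```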
How They Work Together
These tools form a continuous improvement loop: you deploy a workflow and collect feedback on its outputs, turn negative feedback into evaluation test cases, attach scorers that capture what went wrong, and run the evaluations before the next deployment to confirm the fix holds.
Over time, your evaluation suite grows to cover more edge cases, and feedback keeps surfacing new ones. This flywheel drives continuous quality improvement.
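The key step in the flywheel is turning a reported failure into a regression test. The snippet below is purely illustrative; the dictionary shapes are assumptions, not Scout's schema.

```python
def feedback_to_test_case(feedback: dict, corrected_output: str) -> dict:
    """Pair the input that produced a bad output with the output you wanted."""
    return {
        "input": feedback["workflow_input"],
        "expected": corrected_output,
        "note": feedback.get("comment", ""),
    }

# A negative rating with a comment becomes a new case in the evaluation suite.
new_case = feedback_to_test_case(
    {
        "workflow_input": "Summarize the attached contract.",
        "comment": "The summary dropped the termination clause.",
    },
    corrected_output="A corrected reference summary that keeps the termination clause.",
)
evaluation_suite: list[dict] = []  # your existing test cases would live here
evaluation_suite.append(new_case)
```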