> ## Documentation Index
> Fetch the complete documentation index at: https://relevanceai.com/docs/llms.txt
> Use this file to discover all available pages before exploring further.

# Evals

> Test and evaluate your AI Agents with scenario-based evaluations and reusable Checks

**Rollout Status**: Evals is rolling out progressively, starting with Enterprise customers. If you don't see this feature in your account yet, reach out to your account manager to discuss access.

The Evals section is your command center for testing and evaluating AI Agent performance. Located in the **Evaluate** tab (next to the Build and Use tabs) in the Agent builder, Evals lets you create test sets, define reusable Checks, run automated evaluations, and monitor live Agent quality, all without manual testing.

Evals apply to both individual Agents and Workforces, including the sub-agents and tools inside a Workforce.

Evaluate tab showing the Evals sidebar (Test, Runs, Checks, Publish, Monitor) and a Monitor dashboard with overall score, total runs, and Checks breakdown

## What you can do with Evals

* Build test sets with scenarios that simulate real user interactions, then attach Checks to score every conversation automatically.
* Define evaluation criteria once in the Checks tab and attach them to scenarios, Monitor dashboards, or ad-hoc evaluations of completed tasks.
* Create Monitor dashboards that score live Agent tasks against your Checks, with sample-rate controls and per-Check trend charts over time.

***

## Evals sections

The Evals area has five sections, shown in the left sidebar of the Evaluate tab:

* **Test**: Create and manage test sets. Each test set holds scenarios that simulate users; running a scenario produces a conversation with your Agent that gets scored by attached Checks.
* **Runs**: Past evaluation run results. Browse average scores, tasks evaluated, progress status, cost (Credits and Actions), and creation date for every run.
* **Checks**: The reusable set of evaluation criteria. Create a Check once, then attach it to scenarios, to Monitor dashboards, or to one-off evaluations of completed tasks.
* **Publish**: Choose which test sets must pass before your Agent can be published. Set a minimum pass rate and optionally block publishing on failure.
* **Monitor**: Track live Agent quality on real tasks. Create one or more Monitor dashboards, attach Checks, set a sample rate, and watch scores trend over time.

***

## Understanding Checks

Checks are the reusable evaluation criteria that score Agent conversations. You create a Check once in the **Checks** tab and then attach it wherever you need it:

* **To a scenario** in a test set: the Check runs every time that scenario is evaluated.
* **To a Monitor dashboard**: the Check runs on a sampled portion of live Agent tasks.
* **To a one-off evaluation** of already-completed tasks selected from the Agent's task list.
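To make the Monitor sample rate and the Publish minimum pass rate more concrete, here is a minimal Python sketch of the arithmetic they imply. It is an illustration only, not how Relevance AI implements these settings: the names `CheckResult`, `sample_tasks`, `pass_rate`, and `can_publish` are hypothetical, and the sketch assumes the sample rate is a probability between 0.0 and 1.0 and that the pass rate is the fraction of Check results that passed.

```python
import random
from dataclasses import dataclass


@dataclass
class CheckResult:
    """Outcome of one Check on one conversation (hypothetical shape)."""
    check_name: str
    passed: bool


def sample_tasks(task_ids, sample_rate, seed=0):
    """Keep each task with probability `sample_rate` (assumed 0.0 to 1.0)."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0.0 and 1.0")
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    return [task_id for task_id in task_ids if rng.random() < sample_rate]


def pass_rate(results):
    """Fraction of Check results that passed; 0.0 when there are none."""
    if not results:
        return 0.0
    return sum(r.passed for r in results) / len(results)


def can_publish(results, minimum_pass_rate, block_on_failure=True):
    """Allow publishing unless the gate is blocking and the rate is too low."""
    if not block_on_failure:
        return True
    return pass_rate(results) >= minimum_pass_rate
```

For example, with four Check results of which three passed, `pass_rate` returns 0.75, so `can_publish(results, 0.7)` allows publishing while `can_publish(results, 0.8)` does not, unless blocking is turned off.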
The Checks tab has filters that show where each Check is currently used (**All checks**, **Scenarios**, **Dashboard**, and **Unused**), so you can quickly find Checks that aren't attached anywhere yet.

### Check types

When creating a Check, you choose one of the following types:

Uses an LLM to evaluate conversations against a prompt you define.

| Field                           | Description |
| ------------------------------- | ----------- |
| **Evaluation Prompt**           | Describe the criteria for passing |
| **Judge model**                 | Select which model evaluates the conversation |
| **Truncate long conversations** | When enabled, conversations that exceed the judge model's context window are trimmed from the oldest messages first, and the eval runs on the remaining portion. When disabled, oversized conversations fail with an error instead. Note that trimming removes early context, which can affect score accuracy if your evaluation criteria depend on the beginning of the conversation. |

Checks whether the Agent's response includes specific text.

| Field             | Description                               |
| ----------------- | ----------------------------------------- |
| **Required text** | The text that must appear in the response |

Checks whether the Agent's response exactly matches an expected value.

| Field              | Description                                  |
| ------------------ | -------------------------------------------- |
| **Expected value** | The exact message the Agent should have sent |

Checks whether a specific tool was used during the conversation.
| Field          | Description                                                      |
| -------------- | ---------------------------------------------------------------- |
| **Tool**       | Select the tool to check for                                     |
| **Position**   | Whether the tool was used anywhere, used first, or used last     |
| **Comparison** | Check if the tool was used at least, exactly, or at most X times |

When evaluating a Workforce, you can scope a Tool Usage Check to a specific node (a sub-agent or tool in the Workforce) so you can assert that a particular sub-agent used a given tool.

To create a Check from the Checks tab: