> ## Documentation Index
> Fetch the complete documentation index at: https://relevanceai.com/docs/llms.txt
> Use this file to discover all available pages before exploring further.

# Evals

> Test and evaluate your AI Agents with scenario-based evaluations and automated Evaluators

<Info>
  **Rollout Status**: Evals is currently being rolled out progressively, starting with Enterprise customers. If you're an Enterprise customer and don't see this feature in your account yet, reach out to your account manager to discuss access.
</Info>

The Evals section is your command center for testing and evaluating AI Agent performance. Located in the **Monitor** tab (next to the Run tab) in the Agent builder, Evals enables you to create Test Suites, define evaluation criteria (Evaluators), run automated evaluations, and monitor ongoing performance—all without manual testing.

<img src="https://mintcdn.com/relevanceai/smi3kUH7crY_f5Ld/images/agent/agent-evals.png?fit=max&auto=format&n=smi3kUH7crY_f5Ld&q=85&s=771efc45c07b38c37a37dc5121bb2fe5" alt="Evals section showing Test Suites, Evaluators, Runs, and Performance" width="3696" height="2248" data-path="images/agent/agent-evals.png" />

## What you can do with Evals

<CardGroup cols={3}>
  <Card title="Conduct Tests" icon="flask-vial">
    Create Test Suites with scenarios that simulate real user interactions. Combine scenarios with Evaluators to measure accuracy and evaluate Agent performance automatically.
  </Card>

  <Card title="Create Evaluators" icon="scale-balanced">
    Define evaluation criteria that automatically assess Agent responses. Evaluators look for specific conditions and score conversations based on your defined rules.
  </Card>

  <Card title="Monitor Performance" icon="chart-line">
    Automatically evaluate live Agent conversations using global Evaluators. Track scores, view insights, and monitor quality over time without running manual tests.
  </Card>
</CardGroup>

***

## Evals sections

The Evals section contains five main sections, accessible from the left sidebar:

* **Test Suites** — Create and manage groups of Test scenarios for your Agent. Each Test Suite can contain multiple scenarios with different prompts and evaluation criteria.
* **Evaluators** — Configure global evaluation criteria that can be applied across any Test Suite or scenario without needing to set them up each time.
* **Runs** — View your evaluation run history and results. See average scores, number of conversations evaluated, progress status, credit spend, and creation dates for all past runs.
* **Publish Checks** — Configure which Test Suites must pass before your Agent can be published. Set a pass threshold and optionally block publishing if evaluations fail.
* **Performance** — Automatically evaluate live Agent conversations by selecting a global Evaluator, setting a sample rate, and filtering by conversation status.

***

## Understanding Evaluators

Evaluators are evaluation criteria that automatically assess Agent conversations. There are two types of Evaluators:

### Scenario Evaluators

Scenario Evaluators are created within individual Test scenarios. They evaluate the specific conversation generated by that scenario's prompt.

* Created inside a Test scenario
* Only apply to the scenario they're defined in
* Scenario-specific evaluation criteria

### Global Evaluators

Global Evaluators are configured in the **Evaluators** tab. They can be selected to run on any Test Suite or scenario without needing to configure them each time — think of them as reusable defaults.

* Created in the **Evaluators** tab (separate from Test Suites)
* Can be selected when running any Test Suite or individual scenario
* Useful for standard criteria you want checked across scenarios, such as professional tone, no hallucinations, or brand voice compliance
* Also used in the **Performance** tab to automatically evaluate live conversations

### Evaluator types

When creating an Evaluator (either scenario-level or global), you choose from the following types:

<AccordionGroup>
  <Accordion title="LLM Judge" icon="brain-circuit">
    Uses an LLM to evaluate conversations against a prompt you define.

    | Field                           | Description                                                                                                                                                                                                                                                                                                                                                                            |
    | ------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
    | **Evaluation Prompt**           | Describe the criteria for passing                                                                                                                                                                                                                                                                                                                                                      |
    | **Judge model**                 | Select which model evaluates the conversation                                                                                                                                                                                                                                                                                                                                          |
    | **Truncate long conversations** | When enabled, conversations that exceed the judge model's context window are trimmed from the oldest messages first, and the eval runs on the remaining portion. When disabled, oversized conversations fail with an error instead. Note that trimming removes early context, which can affect score accuracy if your evaluation criteria depend on the beginning of the conversation. |
  </Accordion>

  <Accordion title="String Contains" icon="text">
    Checks whether the Agent's response includes specific text.

    | Field             | Description                               |
    | ----------------- | ----------------------------------------- |
    | **Required text** | The text that must appear in the response |
  </Accordion>

  <Accordion title="String Equals" icon="equals">
    Checks whether the Agent's response exactly matches an expected value.

    | Field              | Description                                  |
    | ------------------ | -------------------------------------------- |
    | **Expected value** | The exact message the Agent should have sent |
  </Accordion>

  <Accordion title="Tool Usage" icon="screwdriver-wrench">
    Checks whether a specific tool was used during the conversation.

    | Field          | Description                                                      |
    | -------------- | ---------------------------------------------------------------- |
    | **Tool**       | Select the tool to check for                                     |
    | **Position**   | Whether the tool was used anywhere, used first, or used last     |
    | **Comparison** | Check if the tool was used at least, exactly, or at most X times |
  </Accordion>
</AccordionGroup>

To create a global Evaluator:

<div style={{ width: '100%', position: 'relative', paddingTop: '56.25%' }}>
  <iframe src="https://app.supademo.com/embed/cmmmtwq7z1lsj9cvj5kwwifwi" frameBorder="0" title="Creating a global Evaluator" allow="clipboard-write; fullscreen" webkitAllowFullscreen="true" mozAllowFullscreen="true" allowFullscreen style={{ position: 'absolute', top: 0, left: 0, width: '100%', height: '100%', border: '3px solid #5E43CE', borderRadius: '10px' }} />
</div>

1. Go to the **Monitor** tab and select **Evals**, then select **Evaluators**
2. Click **+ New Evaluator**
3. Select a **Type** (LLM Judge, String Contains, String Equals, or Tool Usage)
4. Enter a **Name** for the Evaluator (e.g., "Professional Tone")
5. Configure the type-specific settings (see table above)
6. Click **Create Evaluator**

<Note>
  When you run a Test scenario, scenario-level Evaluators are always included automatically. You can also add or remove global Evaluators (from the Evaluators tab) before each run, allowing you to mix standard criteria with scenario-specific evaluation rules.
</Note>

***

## Creating a Test Suite with a scenario

<div style={{ width: '100%', position: 'relative', paddingTop: '56.25%' }}>
  <iframe src="https://app.supademo.com/embed/cmmmvldns1nlq9cvjzy4nkpe0" frameBorder="0" title="Creating a Test Suite" allow="clipboard-write; fullscreen" webkitAllowFullscreen="true" mozAllowFullscreen="true" allowFullscreen style={{ position: 'absolute', top: 0, left: 0, width: '100%', height: '100%', border: '3px solid #5E43CE', borderRadius: '10px' }} />
</div>

Follow these steps to create your first evaluation Test Suite:

1. Open your Agent in the builder and click the **Monitor** tab (next to the Run tab). Select **Evals** from the left sidebar, then select **Test Suites**.

2. Click the **+ New test suite** button. Enter a name for your Test Suite and click **Create**.

3. Click on the Test Suite you just created to open it.

4. Click the **+ New Test** button to create a scenario within your Test Suite.

5. Fill in the scenario details:

   | Field                   | Description                                                                              | Example                                                                   |
   | ----------------------- | ---------------------------------------------------------------------------------------- | ------------------------------------------------------------------------- |
   | **Scenario name**       | A descriptive name for this Test case                                                    | "Response Empathy"                                                        |
   | **Persona & situation** | The persona or situation the simulated user will adopt                                   | "You are an impatient customer who wants quick answers about their bill." |
   | **First message**       | A fixed message the simulated user sends to your Agent as the opening message (optional) | "Hi, I need help with my bill."                                           |
   | **Max turns**           | Maximum conversation turns (1-50)                                                        | 10                                                                        |
   | **Number of runs**      | How many times this scenario should be executed                                          | 3                                                                         |

6. Add Evaluators to define how this specific scenario should be evaluated:

   | Field                    | Description                                                                 | Example                                                                                                                    |
   | ------------------------ | --------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------- |
   | **Type**                 | The Evaluator type                                                          | LLM Judge                                                                                                                  |
   | **Name**                 | Name of the evaluation criterion                                            | "Empathy Shown"                                                                                                            |
   | **Type-specific config** | Settings based on the chosen type (see [Evaluator types](#evaluator-types)) | *Evaluation Prompt*: "Did the Agent acknowledge the customer's frustration and express empathy before offering solutions?" |

   Click **Create Evaluator** to save it. You can then create additional Evaluators to add more evaluation criteria to the scenario.

7. (Optional) Add **Tool simulations** to emulate tool usage without actually calling the tools. Tool simulations are configured at the scenario level:
   * Select a tool to simulate
   * Provide a prompt describing what the tool should return (a fake response is generated based on your prompt)
   * In the **Advanced** dropdown, you can select a **Simulation model** to control which model generates the simulated response

8. Click **Save Test scenario** to save your configuration.

<Tip>
  You can add multiple scenarios to a single Test Suite to evaluate different aspects of your Agent's behavior. Each scenario can have its own prompt, max turns, number of runs, Evaluators, and tool simulations.
</Tip>

### Managing test scenarios

Test scenarios can be reorganized across Test Suites as your testing strategy evolves. Each scenario has a dropdown menu (accessible via the three-dot icon next to the scenario name) that provides three operations:

| Operation     | What it does                                              | When to use it                                         |
| ------------- | --------------------------------------------------------- | ------------------------------------------------------ |
| **Move**      | Relocates the scenario to another Test Suite              | Reorganizing suites or consolidating related scenarios |
| **Copy**      | Creates a duplicate of the scenario in another Test Suite | Reusing a scenario as a baseline in a different suite  |
| **Duplicate** | Creates a copy of the scenario in the same Test Suite     | Quickly creating a variation of an existing scenario   |

### Example scenarios

Here are some example Test scenarios you might create:

<AccordionGroup>
  <Accordion title="Customer Support - Empathy test">
    **Scenario name**: Response Empathy

    **Persona & situation**: You are a long-time customer who was recently charged twice for the same order. You've already contacted support once without resolution and are feeling frustrated but willing to give the Agent a chance to help. Express your concerns clearly and see if the Agent acknowledges your situation before jumping to solutions.

    **Max turns**: 10

    **Evaluator**: Empathy Shown (LLM Judge)

    * *Evaluation Prompt*: Did the Agent acknowledge the customer's frustration and express empathy before offering solutions? The response should show understanding of the emotional state and validate their concerns.
  </Accordion>

  <Accordion title="Sales - Product knowledge test">
    **Scenario name**: Product Expertise

    **Persona & situation**: You are a procurement manager at a mid-sized company evaluating solutions for your team. You need specific details about enterprise pricing tiers, integration capabilities with existing tools like Salesforce and HubSpot, and data security certifications. Ask clarifying questions and compare features against competitors you're also considering.

    **Max turns**: 15

    **Evaluator**: Accurate Information (LLM Judge)

    * *Evaluation Prompt*: Did the Agent provide accurate product information without making claims that cannot be verified? Responses should be factual, reference actual product capabilities, and acknowledge when information needs to be confirmed by a sales representative.
  </Accordion>

  <Accordion title="Support - Escalation handling">
    **Scenario name**: Escalation Request

    **Persona & situation**: You are a paying customer who has experienced a service outage affecting your business operations. You've already troubleshooted with the knowledge base articles and need to speak with a senior support engineer or account manager. Be firm but professional in your request, and provide context about the business impact.

    **Max turns**: 5

    **Evaluator**: Appropriate Escalation (LLM Judge)

    * *Evaluation Prompt*: Did the Agent acknowledge the severity of the situation, validate the customer's need for escalation, and initiate a handoff to a human representative while maintaining a professional and empathetic tone throughout?
  </Accordion>
</AccordionGroup>

***

## Running evaluations

You can run an entire Test Suite or an individual Test scenario from within a Test Suite by clicking the **Run** button on either.

You can select specific Test scenarios within a Test Suite to run certain ones at once, or run all scenarios in the Test Suite together. Note that you cannot bulk select and run multiple Test Suites at the same time.

1. Enter a name for the evaluation run (e.g., "Scenario Run - Jan 14, 12:14 PM"). A default name with timestamp is provided.
2. Select which global Evaluators to include in the run — you can add or remove global Evaluators before starting. Scenario-level Evaluators are always included automatically.
3. Click **Run** to begin. The system will simulate conversations with your Agent based on your scenario prompts and evaluate them with your selected Evaluators.

***

## Understanding results

After running an evaluation, you'll see a detailed results screen:

### Run summary

The top of the results page shows key metrics:

| Metric                      | Description                                           |
| --------------------------- | ----------------------------------------------------- |
| **Average Score**           | Overall pass rate across all scenarios and Evaluators |
| **Number of Conversations** | How many Test conversations were evaluated            |
| **Agent Version**           | The version of the Agent that was tested              |

### Scenario results

Each scenario displays:

| Column         | Description                                                    |
| -------------- | -------------------------------------------------------------- |
| **Status**     | Running, Completed, or Failed                                  |
| **Name**       | The scenario name                                              |
| **Score**      | Percentage of Evaluators that passed (shown with progress bar) |
| **Evaluators** | Pass/fail count (e.g., "1/1 passed")                           |
| **Credits**    | Credits consumed for this scenario                             |

### Viewing conversation details

Click **View Conversation** on any scenario to see:

1. **The full conversation** between the simulated user and your Agent
2. **Evaluator verdicts** from all Evaluators included in the run, with detailed explanations of why each Evaluator passed or failed

For example, an "Empathy Shown" Evaluator might show:

> **Pass**: The Agent demonstrated strong empathy throughout the conversation. Key examples include: acknowledging the customer's frustration with being transferred multiple times ("I completely understand how upsetting it must be to feel like you're not getting the help you need"), validating her experience with the double charge ("I truly understand how frustrating it is to be charged twice"), and directly addressing her skepticism by saying "I completely understand your concerns, especially given your previous experience."

***

## Performance tab

The Performance tab lets you automatically evaluate live Agent conversations without manually running Test Suites. This is useful for ongoing quality monitoring.

### Setting up Performance monitoring

1. Go to the **Monitor** tab, select **Evals**, then select **Performance**.
2. Select a **global Evaluator** you've created in the Evaluators tab.
3. Set a **Sample rate** — the percentage of conversations to evaluate.
4. (Optional) Set a **Conversation status filter** to only evaluate conversations with specific statuses (e.g., completed, escalated). Leave blank to evaluate all conversations.
5. Save your settings.

Once configured, conversations matching your criteria will be automatically evaluated at the sample rate you've set.

### Viewing Performance insights

After the Performance evaluator has processed conversations, you can view:

| Metric            | Description                                        |
| ----------------- | -------------------------------------------------- |
| **Overall Score** | Aggregate score across all evaluated conversations |
| **Total Runs**    | Number of conversations evaluated                  |
| **Evaluators**    | Which Evaluators are active                        |

The Performance tab also includes:

* **Data points** for the overall score over time
* **Evaluator breakdown** showing individual scoring per Evaluator
* **Graphs** visualizing Evaluator performance trends
* **List of evaluation runs** with score, name, and the ability to view the full conversation

<Tip>
  To adjust Performance settings after initial setup, click the **Settings** button in the top right corner of the Performance tab.
</Tip>

***

## Publish Checks

Publish Checks let you choose which Test Suites to run before your Agent is published. If the results don't meet your threshold, publishing can be blocked.

You can configure Publish Checks from the **Publish Checks** section in Evals.

### Test sets to run

Select which Test Suites to run before publishing. Click **Add test sets** to choose them — all scenarios in the selected Test Suites will be evaluated.

### Publish settings

Configure how evaluations affect the publish process:

| Setting                               | Description                                                                                                                                                                                       |
| ------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Pass threshold (%)**                | The minimum score percentage required for the evaluation to pass (e.g., 100%)                                                                                                                     |
| **Block publish if evaluation fails** | When checked, the Agent will only be published if the evaluation score meets or exceeds the pass threshold. If unchecked, the Agent will be published even if the evaluation fails the threshold. |

Once configured, click **Save**. When you next publish your Agent, the selected Test Suites will run automatically and the results will be checked against your threshold.

***

## Best practices

<CardGroup cols={2}>
  <Card title="Start simple" icon="seedling">
    Begin with a few core scenarios that test your Agent's primary use cases. Add complexity as you learn what matters most.
  </Card>

  <Card title="Be specific with Evaluators" icon="bullseye">
    Write detailed evaluation rules. Vague criteria lead to inconsistent results. Include specific examples of what passing looks like.
  </Card>

  <Card title="Test edge cases" icon="triangle-exclamation">
    Create scenarios for difficult situations: angry customers, off-topic requests, requests to bypass rules, etc.
  </Card>

  <Card title="Use Performance monitoring" icon="chart-line">
    Set up global Evaluators in the Performance tab to continuously monitor live Agent conversations without manual testing.
  </Card>

  <Card title="Organize scenarios across suites" icon="folder-tree">
    Group related scenarios into dedicated Test Suites (e.g., one for edge cases, one for happy paths). Use Move, Copy, and Duplicate to reorganize as your testing strategy evolves.
  </Card>
</CardGroup>

***

## Frequently asked questions (FAQs)

<AccordionGroup>
  <Accordion title="How many scenarios can I have in a Test Suite?">
    You can add as many scenarios as needed to a single Test Suite. Each scenario is evaluated independently and can have its own Evaluators.
  </Accordion>

  <Accordion title="How are credits calculated for evaluations?">
    Credits consumed for each scenario are calculated by adding together:

    * The Agent task run (the conversation with your Agent)
    * The simulator (the persona/user simulation) - uses an LLM to simulate the user persona
    * The Evaluator evaluations (both scenario Evaluators and global Evaluators) - each Evaluator uses an LLM to evaluate the conversation

    Each scenario shows its total credit usage in the results.
  </Accordion>

  <Accordion title="Can I rerun a previous evaluation?">
    Yes, you can run the same Test scenarios again at any time. Each run is saved in your Runs history, allowing you to compare results across different Agent versions.
  </Accordion>

  <Accordion title="What's the difference between scenario Evaluators and global Evaluators?">
    Scenario Evaluators are created within Test scenarios and only evaluate conversations generated by that specific scenario. Global Evaluators are created in the Evaluators tab and can be selected to run on any Test Suite or scenario, providing reusable evaluation criteria across all your tests. Global Evaluators are also used in the Performance tab for monitoring live conversations.
  </Accordion>

  <Accordion title="Can the LLM Judge evaluate long conversations?">
    Yes, with configuration. The LLM Judge Evaluator includes a **Truncate long conversations** toggle in the **Advanced** section when creating an Evaluator. When enabled, conversations that exceed the judge model's context window are trimmed and evaluated. When disabled, those conversations fail with an error rather than producing a partial result.
  </Accordion>

  <Accordion title="What happens when a conversation is truncated?">
    The oldest messages are removed from the start of the conversation until it fits within the judge model's context window. The judge is notified that truncation occurred and evaluates the remaining portion. If your evaluation criteria depend on early context — such as the user's original request or instructions given at the start of the conversation — the result may be less accurate. In those cases, disabling truncation and selecting a model with a larger context window is preferable.
  </Accordion>

  <Accordion title="Can I move test scenarios between test suites?">
    Yes. Each scenario has a dropdown menu (three-dot icon) with three options: **Move** relocates the scenario to another Test Suite, **Copy** creates a duplicate in another suite, and **Duplicate** creates a copy in the same suite.
  </Accordion>

  <Accordion title="I don't see the Evals section. How do I get access?">
    Evals is being rolled out progressively, starting with Enterprise customers. If you're an Enterprise customer and don't see the Evals section in the Monitor tab yet, reach out to your account manager to discuss access.
  </Accordion>
</AccordionGroup>
