Rollout Status: Evals is currently being rolled out progressively, starting with Enterprise customers. If you’re an Enterprise customer and don’t see this feature in your account yet, reach out to your account manager to discuss access.
The Evals section is your command center for testing and evaluating AI Agent performance. Located in the Monitor tab (next to the Run tab) in the Agent builder, Evals enables you to create Test scenarios, define evaluation criteria (Judges), and run automated evaluations—all without manual testing.
[Screenshot: Monitor tab showing the Evals section with Tests, Judges, and Runs]

What you can do with Evals

Conduct Tests

Create Test scenarios that simulate real user interactions. Combine scenarios with Judges to measure accuracy and evaluate Agent performance automatically.

Create Judges

Define evaluation criteria that automatically assess Agent responses. Judges look for specific conditions and score conversations based on your defined rules.

View Test history

Access your complete evaluation run history. Review past Test results, compare performance across runs, and track improvements over time.

Evals sections

The Evals section is organized into three areas, accessible from the left sidebar:
  • Tests — Create and manage Test scenarios for your Agent. Each Test can contain multiple scenarios with different prompts and evaluation criteria.
  • Judges — Configure standard evaluation criteria that automatically run on all Test scenarios.
  • Runs — View your evaluation run history and results. See average scores, number of conversations evaluated, progress status, and creation dates for all past runs.

Understanding Judges

Judges are evaluation criteria that automatically assess Agent conversations. There are two types of Judges:

Scenario Judges

Scenario Judges are created within individual Test scenarios. They evaluate the specific conversation generated by that scenario’s prompt.
  • Created inside a Test scenario using the + Add Judge button
  • Only apply to the scenario they’re defined in
  • Scenario-specific evaluation criteria

Agent-level Judges

Agent-level Judges are standard evaluators configured in the Judges tab. They automatically run on every Test scenario you execute.
  • Created in the Judges tab (separate from Tests)
  • Automatically applied to all Test scenarios when you run evaluations
  • Useful for standard criteria you want checked across all scenarios, such as professional tone, no hallucinations, or brand voice compliance
To create an Agent-level Judge:
  1. Go to the Monitor tab and select Evals, then select Judges
  2. Click + New Judge
  3. Enter a Name for the Judge (e.g., “Professional Tone”)
  4. Enter a Rule describing the criteria for passing
  5. Save the Judge
When you run a Test scenario, both the scenario-specific Judges (defined within that scenario) and all Agent-level Judges (from the Judges tab) will evaluate the conversation. This allows you to have standard criteria checked on every Test while also having scenario-specific evaluation rules.
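
For intuition, here is a minimal sketch in Python of how a single conversation might be scored when both kinds of Judges apply. Everything in it (the Judge dataclass, the call_llm helper, and the prompt wording) is hypothetical and for illustration only; it is not the product's API or implementation.

```python
# Illustration only: scenario Judges and Agent-level Judges both score the
# same conversation. The names below (Judge, call_llm, evaluate) are
# hypothetical and not part of the product's API.
from dataclasses import dataclass


@dataclass
class Judge:
    name: str  # e.g. "Professional Tone"
    rule: str  # the passing criteria written in the builder


def call_llm(prompt: str) -> str:
    """Placeholder for whatever LLM call performs the judging."""
    return "PASS"  # stub so the sketch runs end to end


def evaluate(conversation: str,
             scenario_judges: list[Judge],
             agent_level_judges: list[Judge]) -> dict[str, bool]:
    """Every Judge from both lists evaluates the same conversation."""
    verdicts = {}
    for judge in scenario_judges + agent_level_judges:
        answer = call_llm(
            f"Rule: {judge.rule}\n\nConversation:\n{conversation}\n\n"
            "Does the Agent satisfy the rule? Answer PASS or FAIL."
        )
        verdicts[judge.name] = answer.strip().upper().startswith("PASS")
    return verdicts
```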

Creating a Test

Follow these steps to create your first evaluation Test:
  1. Open your Agent in the builder and click the Monitor tab (next to the Run tab). Select Evals from the left sidebar, then select Tests.
  2. Click the + New Test button. Enter a name for your Test and click Create.
  3. Click on the Test you just created to open it.
  4. Click the + New Test scenario button to create a scenario within your Test.
  5. Fill in the scenario details:
    Field | Description | Example
    Scenario name | A descriptive name for this Test case | “Response Empathy”
    Scenario prompt | The persona or situation the simulated user will adopt | “You are a long-time customer who was recently charged twice for the same order. You’ve already contacted support once without resolution and are feeling frustrated but willing to give the Agent a chance to help.”
    Max turns | Maximum conversation turns (1-50) | 10
  6. Add scenario Judges to define how this specific scenario should be evaluated:
    Field | Description | Example
    Judge name | Name of the evaluation criterion | “Empathy Shown”
    Evaluation rule | Detailed criteria for passing/failing | “Did the Agent acknowledge the customer’s frustration and express empathy before offering solutions? The response should show understanding of the emotional state and validate their concerns.”
    Click + Add Judge to add more evaluation criteria to the scenario.
  7. Click Save Test scenario to save your configuration.
You can add multiple scenarios to a single Test to evaluate different aspects of your Agent’s behavior. Each scenario can have its own prompt, max turns, and Judges.
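
If it helps to picture a Test in structural terms, the sketch below represents the “Response Empathy” example from this page as a plain Python data structure. The field names simply mirror the builder’s form fields; the structure itself is illustrative, not an export format or schema of the product.

```python
# Illustrative only: a Test with one scenario, expressed as plain data.
# Field names mirror the builder's form fields; this is not a product schema.
response_empathy_test = {
    "test_name": "Customer Support Quality",  # hypothetical Test name
    "scenarios": [
        {
            "scenario_name": "Response Empathy",
            "scenario_prompt": (
                "You are a long-time customer who was recently charged twice "
                "for the same order. You've already contacted support once "
                "without resolution and are feeling frustrated but willing to "
                "give the Agent a chance to help."
            ),
            "max_turns": 10,  # allowed range: 1-50
            "judges": [
                {
                    "judge_name": "Empathy Shown",
                    "evaluation_rule": (
                        "Did the Agent acknowledge the customer's frustration "
                        "and express empathy before offering solutions?"
                    ),
                },
            ],
        },
    ],
}
```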

Example scenarios

Here are some example Test scenarios you might create:
Scenario name: Response Empathy
Scenario prompt: You are a long-time customer who was recently charged twice for the same order. You’ve already contacted support once without resolution and are feeling frustrated but willing to give the Agent a chance to help. Express your concerns clearly and see if the Agent acknowledges your situation before jumping to solutions.
Max turns: 10
Judge: Empathy Shown
  • Evaluation rule: Did the Agent acknowledge the customer’s frustration and express empathy before offering solutions? The response should show understanding of the emotional state and validate their concerns.
Scenario name: Product Expertise
Scenario prompt: You are a procurement manager at a mid-sized company evaluating solutions for your team. You need specific details about enterprise pricing tiers, integration capabilities with existing tools like Salesforce and HubSpot, and data security certifications. Ask clarifying questions and compare features against competitors you’re also considering.
Max turns: 15
Judge: Accurate Information
  • Evaluation rule: Did the Agent provide accurate product information without making claims that cannot be verified? Responses should be factual, reference actual product capabilities, and acknowledge when information needs to be confirmed by a sales representative.
Scenario name: Escalation Request
Scenario prompt: You are a paying customer who has experienced a service outage affecting your business operations. You’ve already worked through the knowledge base articles and need to speak with a senior support engineer or account manager. Be firm but professional in your request, and provide context about the business impact.
Max turns: 5
Judge: Appropriate Escalation
  • Evaluation rule: Did the Agent acknowledge the severity of the situation, validate the customer’s need for escalation, and initiate a handoff to a human representative while maintaining a professional and empathetic tone throughout?

Running evaluations

Run Test scenarios to simulate conversations with your Agent and verify behavior before real interactions.

Running a Test scenario

Test scenarios simulate conversations with your Agent using the personas you’ve defined. Each conversation is evaluated by both scenario Judges (defined within the scenario) and Agent-level Judges (from the Judges tab):
  1. From the Tests tab, select the Test scenarios you want to include in the evaluation run. Running multiple scenarios at once groups them all under one evaluation result.
  2. Click the Run button to start the evaluation.
  3. Enter a name for the evaluation run (e.g., “Scenario Run - Jan 14, 12:14 PM”). A default name with timestamp is provided.
  4. Click Run to begin. The system will simulate conversations with your Agent based on your scenario prompts. Both scenario Judges (defined within each scenario) and Agent-level Judges (from the Judges tab) will evaluate each conversation.
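
Conceptually, a run like the one configured in the steps above alternates a simulated user (driven by the scenario prompt) with your Agent for up to the scenario’s max turns, then hands the finished conversation to the Judges. The sketch below is a rough Python illustration of that loop; the simulate_user and agent_reply helpers are hypothetical stand-ins, not the product’s implementation.

```python
# Rough illustration of an evaluation run: simulate a conversation, then judge it.
# simulate_user and agent_reply are hypothetical stand-ins for the persona
# simulator and your Agent; they are not the product's implementation.
def simulate_user(scenario_prompt: str, history: list[str]) -> str:
    """Placeholder: an LLM plays the persona described in the scenario prompt."""
    return "simulated user message"


def agent_reply(history: list[str]) -> str:
    """Placeholder: your Agent responds to the simulated user."""
    return "agent response"


def run_scenario(scenario_prompt: str, max_turns: int) -> list[str]:
    """Alternate simulated-user and Agent messages for up to max_turns turns."""
    history: list[str] = []
    for _ in range(max_turns):
        history.append(simulate_user(scenario_prompt, history))
        history.append(agent_reply(history))
    return history  # this transcript is what the Judges evaluate
```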

Understanding results

After running an evaluation, you’ll see a detailed results screen:

Run summary

The top of the results page shows key metrics:
Metric | Description
Average Score | Overall pass rate across all scenarios and Judges
Number of Conversations | How many Test conversations were evaluated
Agent Version | The version of the Agent that was tested

Scenario results

Each scenario displays:
Column | Description
Status | Running, Completed, or Failed
Name | The scenario name
Score | Percentage of Judges that passed (shown with a progress bar)
Judges | Pass/fail count (e.g., “1/1 passed”)
Credits | Credits consumed for this scenario
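
As a rough guide to how these numbers relate (an assumption about the arithmetic, not a documented formula): a scenario’s Score is the share of its Judges that passed, and the run’s Average Score is the overall pass rate across every Judge verdict in the run.

```python
# Illustrative arithmetic only; the exact aggregation used by the product may differ.
def scenario_score(verdicts: list[bool]) -> float:
    """Percentage of Judges that passed for one scenario."""
    return 100 * sum(verdicts) / len(verdicts)


def average_score(verdicts_per_scenario: list[list[bool]]) -> float:
    """Overall pass rate across all scenarios and Judges in the run."""
    flat = [v for scenario in verdicts_per_scenario for v in scenario]
    return 100 * sum(flat) / len(flat)


# Example: one scenario passed 1 of 2 Judges, another passed its only Judge
print(scenario_score([True, False]))           # 50.0
print(average_score([[True], [True, False]]))  # ~66.7
```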

Viewing conversation details

Click View Conversation on any scenario to see:
  1. The full conversation between the simulated user and your Agent
  2. Judge verdicts from all Judges added to the evaluation run, with detailed explanations of why each Judge passed or failed
For example, an “Empathy Shown” Judge might show:
Pass: The Agent demonstrated strong empathy throughout the conversation. Key examples include: acknowledging the customer’s frustration with being transferred multiple times (“I completely understand how upsetting it must be to feel like you’re not getting the help you need”), validating her experience with the double charge (“I truly understand how frustrating it is to be charged twice”), and directly addressing her skepticism by saying “I completely understand your concerns, especially given your previous experience.”

Saving an Agent with Evals configured

When you have Evals configured on an Agent, the save flow changes. When you click Publish to save your Agent, you’ll be prompted to run an evaluation before publishing.

Selecting Tests and scenarios

You can choose which evaluations to run in two ways:
  • Select entire Tests — Check the boxes next to Test names to include all scenarios within those Tests
  • Select specific scenarios — Click on a Test name to expand it, then select individual scenarios from within that Test
All associated Judges (both scenario Judges and Agent-level Judges) will be checked against the threshold.

Pass threshold and block save

Configure how evaluations affect the save process:
Setting | Description
Pass Threshold (%) | The minimum score percentage required for the evaluation to pass (e.g., 70%)
Block save if evaluation fails | When checked, the Agent will only be published if the evaluation score meets or exceeds the pass threshold. If unchecked, the Agent will be published even if the evaluation score falls below the threshold.
These settings work together: if “Block save if evaluation fails” is enabled and your evaluation score is below the threshold, publishing will be blocked.
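
Put another way, the publish decision can be read as a single condition. The sketch below is only an illustration of the logic described above, not the product’s code.

```python
# Illustration of the "Pass Threshold" + "Block save if evaluation fails" logic.
def publish_allowed(average_score: float, pass_threshold: float,
                    block_save_if_failed: bool) -> bool:
    if average_score >= pass_threshold:
        return True                      # evaluation passed: publishing proceeds
    return not block_save_if_failed      # failed: publish only if blocking is off


print(publish_allowed(65.0, 70.0, True))   # False: publishing is blocked
print(publish_allowed(65.0, 70.0, False))  # True: published despite failing
```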

Force publish

To bypass evaluation entirely, click the dropdown arrow on the Publish button and select Force Publish. This will save and publish your Agent without running any evaluations.

Runs history

Access your complete evaluation history from the Runs section. This shows:
Column | Description
Run name | Name of the evaluation run
Average score | Overall pass rate, shown with a progress bar
# Conversations | Number of conversations in the run
Progress | Completion status (e.g., “1/1”)
Date created | When the run was executed
Click on any past run to view its detailed results.

Best practices

Start simple

Begin with a few core scenarios that test your Agent’s primary use cases. Add complexity as you learn what matters most.

Be specific with Judges

Write detailed evaluation rules. Vague criteria lead to inconsistent results, so include specific examples of what passing looks like. For example, “Responses should be empathetic” leaves the Judge to guess, while “Did the Agent acknowledge the customer’s frustration and validate their concerns before offering a solution?” gives it a concrete standard to check.

Test edge cases

Create scenarios for difficult situations: angry customers, off-topic requests, requests to bypass rules, etc.

Run regularly

Evaluate your Agent after making changes to prompts, tools, or knowledge. Use runs history to track improvements over time.

Frequently asked questions (FAQs)

How many scenarios can I add to a Test?
You can add as many scenarios as needed to a single Test. Each scenario is evaluated independently and can have its own Judges.

How are credits calculated for each scenario?
Credits consumed for each scenario are calculated by adding together:
  • The Agent task run (the conversation with your Agent)
  • The simulator, which uses an LLM to play the persona described in the scenario prompt
  • The Judge evaluations: each Judge, whether a scenario Judge or an Agent-level Judge, uses an LLM to assess the conversation
Each scenario shows its total credit usage in the results.
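
As a rough illustration of that sum (the component costs below are made-up placeholders, not real credit prices):

```python
# Made-up numbers purely to show the structure of the sum; actual credit costs
# depend on your Agent, the models involved, and your plan.
agent_task_run_credits = 4       # hypothetical
simulator_credits = 2            # hypothetical
judge_credits = [1, 1, 1]        # hypothetical: one entry per Judge that ran

scenario_total = agent_task_run_credits + simulator_credits + sum(judge_credits)
print(scenario_total)  # 9
```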

Can I run the same Test scenarios more than once?
Yes, you can run the same Test scenarios again at any time. Each run is saved in your Runs history, allowing you to compare results across different Agent versions.

What is the difference between scenario Judges and Agent-level Judges?
Scenario Judges are created within Test scenarios and only evaluate conversations generated by that specific scenario. Agent-level Judges are created in the Judges tab and automatically run on every Test scenario you execute, providing standard evaluation criteria across all your Tests.

Why don’t I see Evals in my account?
Evals is being rolled out progressively, starting with Enterprise customers. If you’re an Enterprise customer and don’t see the Evals section in the Monitor tab yet, reach out to your account manager to discuss access.