Rollout Status: Evals is currently being rolled out progressively, starting with Enterprise customers. If you’re an Enterprise customer and don’t see this feature in your account yet, reach out to your account manager to discuss access.

What you can do with Evals
Conduct Tests
Create Test scenarios that simulate real user interactions. Combine scenarios with Judges to measure accuracy and evaluate Agent performance automatically.
Create Judges
Define evaluation criteria that automatically assess Agent responses. Judges look for specific conditions and score conversations based on your defined rules.
View Test history
Access your complete evaluation run history. Review past Test results, compare performance across runs, and track improvements over time.
Evals sections
The Evals section is organized into three areas, accessible from the left sidebar:
- Tests — Create and manage Test scenarios for your Agent. Each Test can contain multiple scenarios with different prompts and evaluation criteria.
- Judges — Configure standard evaluation criteria that automatically run on all Test scenarios.
- Runs — View your evaluation run history and results. See average scores, number of conversations evaluated, progress status, and creation dates for all past runs.
Understanding Judges
Judges are evaluation criteria that automatically assess Agent conversations. There are two types of Judges:

Scenario Judges
Scenario Judges are created within individual Test scenarios. They evaluate the specific conversation generated by that scenario’s prompt.
- Created inside a Test scenario using the + Add Judge button
- Only apply to the scenario they’re defined in
- Scenario-specific evaluation criteria
Agent-level Judges
Agent-level Judges are standard evaluators configured in the Judges tab. They automatically run on every Test scenario you execute.
- Created in the Judges tab (separate from Tests)
- Automatically applied to all Test scenarios when you run evaluations
- Useful for standard criteria you want checked across all scenarios, such as professional tone, no hallucinations, or brand voice compliance
Creating an Agent-level Judge
- Go to the Monitor tab and select Evals, then select Judges
- Click + New Judge
- Enter a Name for the Judge (e.g., “Professional Tone”)
- Enter a Rule describing the criteria for passing
- Save the Judge
When you run a Test scenario, both the scenario-specific Judges (defined within that scenario) and all Agent-level Judges (from the Judges tab) will evaluate the conversation. This allows you to have standard criteria checked on every Test while also having scenario-specific evaluation rules.
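In other words, the Judges applied to any one conversation are simply the union of that scenario’s own Judges and every Agent-level Judge. A minimal sketch, assuming each Judge is just a name plus a rule (illustrative only, not a published API):

```python
# Conceptual sketch: a Judge here is a dict with a name and the rule
# its LLM evaluator applies. Not an actual platform data structure.
def judges_for_conversation(scenario_judges: list[dict],
                            agent_level_judges: list[dict]) -> list[dict]:
    """Every run checks a conversation against the scenario's own Judges
    plus all Agent-level Judges from the Judges tab."""
    return scenario_judges + agent_level_judges

# e.g. judges_for_conversation(
#     [{"name": "Empathy Shown", "rule": "Did the Agent acknowledge..."}],
#     [{"name": "Professional Tone", "rule": "Responses stay courteous..."}])
# -> both Judges evaluate the same conversation
```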
Creating a Test
- Open your Agent in the builder and click the Monitor tab (next to the Run tab). Select Evals from the left sidebar, then select Tests.
- Click the + New Test button. Enter a name for your Test and click Create.
- Click on the Test you just created to open it.
- Click the + New Test scenario button to create a scenario within your Test.
- Fill in the scenario details:

| Field | Description | Example |
|---|---|---|
| Scenario name | A descriptive name for this Test case | “Response Empathy” |
| Scenario prompt | The persona or situation the simulated user will adopt | “You are a long-time customer who was recently charged twice for the same order. You’ve already contacted support once without resolution and are feeling frustrated but willing to give the Agent a chance to help.” |
| Max turns | Maximum conversation turns (1-50) | 10 |
- Add scenario Judges to define how this specific scenario should be evaluated:

| Field | Description | Example |
|---|---|---|
| Judge name | Name of the evaluation criterion | “Empathy Shown” |
| Evaluation rule | Detailed criteria for passing/failing | “Did the Agent acknowledge the customer’s frustration and express empathy before offering solutions? The response should show understanding of the emotional state and validate their concerns.” |

Click + Add Judge to add more evaluation criteria to the scenario.

- Click Save Test scenario to save your configuration.
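Taken together, a scenario can be pictured as a small record of the fields above. A hypothetical sketch that mirrors the UI form (the field names follow the table above; this is not an actual configuration format):

```python
from dataclasses import dataclass

@dataclass
class TestScenario:
    name: str           # e.g. "Response Empathy"
    prompt: str         # the persona the simulated user will adopt
    max_turns: int      # maximum conversation turns, 1-50
    judges: list[dict]  # scenario Judges: {"name": ..., "rule": ...}

empathy = TestScenario(
    name="Response Empathy",
    prompt="You are a long-time customer who was recently charged twice "
           "for the same order and are feeling frustrated...",
    max_turns=10,
    judges=[{"name": "Empathy Shown",
             "rule": "Did the Agent acknowledge the customer's frustration "
                     "and express empathy before offering solutions?"}],
)
```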
Example scenarios
Here are some example Test scenarios you might create:

Customer Support - Empathy test
Scenario name: Response Empathy
Scenario prompt: You are a long-time customer who was recently charged twice for the same order. You’ve already contacted support once without resolution and are feeling frustrated but willing to give the Agent a chance to help. Express your concerns clearly and see if the Agent acknowledges your situation before jumping to solutions.
Max turns: 10
Judge: Empathy Shown
- Evaluation rule: Did the Agent acknowledge the customer’s frustration and express empathy before offering solutions? The response should show understanding of the emotional state and validate their concerns.
Sales - Product knowledge test
Scenario name: Product Expertise
Scenario prompt: You are a procurement manager at a mid-sized company evaluating solutions for your team. You need specific details about enterprise pricing tiers, integration capabilities with existing tools like Salesforce and HubSpot, and data security certifications. Ask clarifying questions and compare features against competitors you’re also considering.
Max turns: 15
Judge: Accurate Information
- Evaluation rule: Did the Agent provide accurate product information without making claims that cannot be verified? Responses should be factual, reference actual product capabilities, and acknowledge when information needs to be confirmed by a sales representative.
Support - Escalation handling
Scenario name: Escalation Request
Scenario prompt: You are a paying customer who has experienced a service outage affecting your business operations. You’ve already tried the troubleshooting steps in the knowledge base articles and need to speak with a senior support engineer or account manager. Be firm but professional in your request, and provide context about the business impact.
Max turns: 5
Judge: Appropriate Escalation
- Evaluation rule: Did the Agent acknowledge the severity of the situation, validate the customer’s need for escalation, and initiate a handoff to a human representative while maintaining a professional and empathetic tone throughout?
Running evaluations
Run Test scenarios to simulate conversations with your Agent and verify behavior before real interactions.

Running a Test scenario
- From the Tests tab, select the Test scenarios you want to include in the evaluation run. Running multiple scenarios at once groups them all under one evaluation result.
- Click the Run button to start the evaluation.
- Enter a name for the evaluation run (e.g., “Scenario Run - Jan 14, 12:14 PM”). A default name with a timestamp is provided.
- Click Run to begin. The system will simulate conversations with your Agent based on your scenario prompts. Both scenario Judges (defined within each scenario) and Agent-level Judges (from the Judges tab) will evaluate each conversation.
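Conceptually, each run plays out the scenario persona against your Agent for up to Max turns, then applies every Judge to the resulting transcript. A rough sketch building on the TestScenario sketch above; the helper functions are hypothetical stand-ins for the platform’s internal LLM calls, not a real API:

```python
# Hypothetical stubs standing in for the platform's LLM calls.
def simulate_user(persona_prompt: str, transcript: list[str]) -> str:
    return "[simulated user message driven by the persona prompt]"

def agent_reply(transcript: list[str]) -> str:
    return "[Agent response]"

def judge_verdict(judge: dict, transcript: list[str]) -> dict:
    # In the product, an LLM checks the transcript against judge["rule"].
    return {"judge": judge["name"], "passed": True, "explanation": "..."}

def run_scenario(scenario, agent_level_judges: list[dict]) -> list[dict]:
    """Outline of one evaluation run: simulate the conversation up to
    max_turns, then score it with scenario and Agent-level Judges."""
    transcript: list[str] = []
    for _ in range(scenario.max_turns):
        transcript.append(simulate_user(scenario.prompt, transcript))
        transcript.append(agent_reply(transcript))
    all_judges = scenario.judges + agent_level_judges
    return [judge_verdict(j, transcript) for j in all_judges]
```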
Understanding results
After running an evaluation, you’ll see a detailed results screen.

Run summary
The top of the results page shows key metrics:

| Metric | Description |
|---|---|
| Average Score | Overall pass rate across all scenarios and Judges |
| Number of Conversations | How many Test conversations were evaluated |
| Agent Version | The version of the Agent that was tested |
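Assuming the Average Score is the share of Judge verdicts that passed across every scenario in the run, it can be computed like this (illustrative sketch):

```python
def average_score(verdicts: list[dict]) -> float:
    """Overall pass rate as a percentage: passed verdicts / total verdicts.
    For example, 3 passes out of 4 verdicts gives 75.0."""
    if not verdicts:
        return 0.0
    passed = sum(1 for v in verdicts if v["passed"])
    return 100.0 * passed / len(verdicts)
```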
Scenario results
Each scenario displays:

| Column | Description |
|---|---|
| Status | Running, Completed, or Failed |
| Name | The scenario name |
| Score | Percentage of Judges that passed (shown with progress bar) |
| Judges | Pass/fail count (e.g., “1/1 passed”) |
| Credits | Credits consumed for this scenario |
Viewing conversation details
Click View Conversation on any scenario to see:
- The full conversation between the simulated user and your Agent
- Judge verdicts from all Judges added to the evaluation run, with detailed explanations of why each Judge passed or failed
For example, a passing verdict might read:

Pass: The Agent demonstrated strong empathy throughout the conversation. Key examples include: acknowledging the customer’s frustration with being transferred multiple times (“I completely understand how upsetting it must be to feel like you’re not getting the help you need”), validating her experience with the double charge (“I truly understand how frustrating it is to be charged twice”), and directly addressing her skepticism by saying “I completely understand your concerns, especially given your previous experience.”
Saving an Agent with Evals configured
When you have Evals configured on an Agent, the save flow changes. When you click Publish to save your Agent, you’ll be prompted to run an evaluation before publishing.

Selecting Tests and scenarios
You can choose which evaluations to run in two ways:
- Select entire Tests — Check the boxes next to Test names to include all scenarios within those Tests
- Select specific scenarios — Click on a Test name to expand it, then select individual scenarios from within that Test
Pass threshold and block save
Configure how evaluations affect the save process:

| Setting | Description |
|---|---|
| Pass Threshold (%) | The minimum score percentage required for the evaluation to pass (e.g., 70%) |
| Block save if evaluation fails | When checked, the Agent will only be published if the evaluation score meets or exceeds the pass threshold. If unchecked, the Agent will be published even if the evaluation fails the threshold. |
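Together, the two settings act as a simple gate on publishing. A sketch of the logic as described above (parameter names are illustrative):

```python
def may_publish(score_pct: float, pass_threshold_pct: float,
                block_save_if_failed: bool) -> bool:
    """If 'Block save if evaluation fails' is checked, publishing requires
    the run's score to meet or exceed the pass threshold; if unchecked, the
    evaluation result is informational and the publish always proceeds."""
    return (not block_save_if_failed) or (score_pct >= pass_threshold_pct)

# e.g. may_publish(65.0, 70.0, block_save_if_failed=True)  -> False (blocked)
#      may_publish(65.0, 70.0, block_save_if_failed=False) -> True  (published)
```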
Force publish
To bypass evaluation entirely, click the dropdown arrow on the Publish button and select Force Publish. This will save and publish your Agent without running any evaluations.

Runs history
Access your complete evaluation history from the Runs section. This shows:

| Column | Description |
|---|---|
| Run name | Name of the evaluation run |
| Average score | Overall pass rate with visual progress bar |
| # Conversations | Number of conversations in the run |
| Progress | Completion status (e.g., “1/1”) |
| Date created | When the run was executed |
Best practices
Start simple
Begin with a few core scenarios that test your Agent’s primary use cases. Add complexity as you learn what matters most.
Be specific with Judges
Write detailed evaluation rules. Vague criteria lead to inconsistent results. Include specific examples of what passing looks like.
Test edge cases
Create scenarios for difficult situations: angry customers, off-topic requests, requests to bypass rules, etc.
Run regularly
Evaluate your Agent after making changes to prompts, tools, or knowledge. Use runs history to track improvements over time.
Frequently asked questions (FAQs)
How many scenarios can I have in a Test?
You can add as many scenarios as needed to a single Test. Each scenario is evaluated independently and can have its own Judges.
How are credits calculated for evaluations?
Credits consumed for each scenario are calculated by adding together:
- The Agent task run (the conversation with your Agent)
- The simulator (an LLM that plays the user persona)
- The Judge evaluations (each Judge, scenario-level or Agent-level, uses an LLM call to assess the conversation)
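As a hypothetical illustration of this arithmetic (the per-component credit costs below are placeholders, not real pricing):

```python
def scenario_credits(agent_task_run: int, simulator: int,
                     judge_evaluations: list[int]) -> int:
    """Credits for one scenario: Agent task run + user simulator
    + one LLM evaluation per Judge (scenario and Agent-level)."""
    return agent_task_run + simulator + sum(judge_evaluations)

# e.g. one scenario Judge plus two Agent-level Judges:
# scenario_credits(agent_task_run=5, simulator=3,
#                  judge_evaluations=[1, 1, 1])  # -> 11 (placeholder numbers)
```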
Can I rerun a previous evaluation?
Yes, you can run the same Test scenarios again at any time. Each run is saved in your Runs history, allowing you to compare results across different Agent versions.
What's the difference between scenario Judges and Agent-level Judges?
Scenario Judges are created within Test scenarios and only evaluate conversations generated by that specific scenario. Agent-level Judges are created in the Judges tab and automatically run on every Test scenario you execute, providing standard evaluation criteria across all your Tests.
I don't see the Evals section. How do I get access?
Evals is being rolled out progressively, starting with Enterprise customers. If you’re an Enterprise customer and don’t see the Evals section in the Monitor tab yet, reach out to your account manager to discuss access.

