Rollout Status: Evals is currently being rolled out progressively, starting with Enterprise customers. If you’re an Enterprise customer and don’t see this feature in your account yet, reach out to your account manager to discuss access.
The Evals section is your command center for testing and evaluating AI Agent performance. Located in the Monitor tab (next to the Run tab) in the Agent builder, Evals enables you to create Test scenarios, define evaluation criteria (Judges), and run automated evaluations—all without manual testing.
[Screenshot: Monitor tab showing the Evals section with Tests, Judges, and Runs]

What you can do with Evals

Conduct Tests

Create Test scenarios that simulate real user interactions. Combine scenarios with Judges to measure accuracy and evaluate Agent performance automatically.

Create Judges

Define evaluation criteria that automatically assess Agent responses. Judges look for specific conditions and score conversations based on your defined rules.

View Test history

Access your complete evaluation run history. Review past Test results, compare performance across runs, and track improvements over time.

Evals sections

The Evals section is organized into three areas, accessible from the left sidebar:
  • Tests — Create and manage Test scenarios for your Agent. Each Test can contain multiple scenarios with different prompts and evaluation criteria.
  • Judges — Configure standard evaluation criteria that automatically run on all Test scenarios.
  • Runs — View your evaluation run history and results. See average scores, number of conversations evaluated, progress status, and creation dates for all past runs.

Understanding Judges

Judges are evaluation criteria that automatically assess Agent conversations. There are two types of Judges:

Scenario Judges

Scenario Judges are created within individual Test scenarios. They evaluate the specific conversation generated by that scenario’s prompt.
  • Created inside a Test scenario using the + Add Judge button
  • Only apply to the scenario they’re defined in
  • Scenario-specific evaluation criteria

Agent-level Judges

Agent-level Judges are standard evaluators configured in the Judges tab. They automatically run on every Test scenario you execute.
  • Created in the Judges tab (separate from Tests)
  • Automatically applied to all Test scenarios when you run evaluations
  • Useful for standard criteria you want checked across all scenarios, such as professional tone, no hallucinations, or brand voice compliance
To create an Agent-level Judge:
  1. Go to the Monitor tab and select Evals, then select Judges
  2. Click + New Judge
  3. Enter a Name for the Judge (e.g., “Professional Tone”)
  4. Enter a Rule describing the criteria for passing
  5. Save the Judge
When you run a Test scenario, both the scenario-specific Judges (defined within that scenario) and all Agent-level Judges (from the Judges tab) will evaluate the conversation. This allows you to have standard criteria checked on every Test while also having scenario-specific evaluation rules.
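
For intuition, here is a minimal sketch in Python of how a single conversation might be scored when both kinds of Judges apply. Everything in it (the Judge dataclass, the call_llm helper, and the prompt wording) is hypothetical and for illustration only; it is not the product's API or implementation.

```python
# Illustration only: scenario Judges and Agent-level Judges both score the
# same conversation. The names below (Judge, call_llm, evaluate) are
# hypothetical and not part of the product's API.
from dataclasses import dataclass


@dataclass
class Judge:
    name: str  # e.g. "Professional Tone"
    rule: str  # the passing criteria written in the builder


def call_llm(prompt: str) -> str:
    """Placeholder for whatever LLM call performs the judging."""
    return "PASS"  # stub so the sketch runs end to end


def evaluate(conversation: str,
             scenario_judges: list[Judge],
             agent_level_judges: list[Judge]) -> dict[str, bool]:
    """Every Judge from both lists evaluates the same conversation."""
    verdicts = {}
    for judge in scenario_judges + agent_level_judges:
        answer = call_llm(
            f"Rule: {judge.rule}\n\nConversation:\n{conversation}\n\n"
            "Does the Agent satisfy the rule? Answer PASS or FAIL."
        )
        verdicts[judge.name] = answer.strip().upper().startswith("PASS")
    return verdicts
```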

Creating a Test

Follow these steps to create your first evaluation Test:
  1. Open your Agent in the builder and click the Monitor tab (next to the Run tab). Select Evals from the left sidebar, then select Tests.
  2. Click the + New Test button. Enter a name for your Test and click Create.
  3. Click on the Test you just created to open it.
  4. Click the + New Test scenario button to create a scenario within your Test.
  5. Fill in the scenario details:
    Field | Description | Example
    Scenario name | A descriptive name for this Test case | “Response Empathy”
    Scenario prompt | The persona or situation the simulated user will adopt | “You are a long-time customer who was recently charged twice for the same order. You’ve already contacted support once without resolution and are feeling frustrated but willing to give the Agent a chance to help.”
    Max turns | Maximum conversation turns (1-50) | 10
  6. Add scenario Judges to define how this specific scenario should be evaluated:
    Field | Description | Example
    Judge name | Name of the evaluation criterion | “Empathy Shown”
    Evaluation rule | Detailed criteria for passing/failing | “Did the Agent acknowledge the customer’s frustration and express empathy before offering solutions? The response should show understanding of the emotional state and validate their concerns.”
    Click + Add Judge to add more evaluation criteria to the scenario.
  7. Click Save Test scenario to save your configuration.
You can add multiple scenarios to a single Test to evaluate different aspects of your Agent’s behavior. Each scenario can have its own prompt, max turns, and Judges.
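
If it helps to picture a Test in structural terms, the sketch below represents the “Response Empathy” example from this page as a plain Python data structure. The field names simply mirror the builder’s form fields; the structure itself is illustrative, not an export format or schema of the product.

```python
# Illustrative only: a Test with one scenario, expressed as plain data.
# Field names mirror the builder's form fields; this is not a product schema.
response_empathy_test = {
    "test_name": "Customer Support Quality",  # hypothetical Test name
    "scenarios": [
        {
            "scenario_name": "Response Empathy",
            "scenario_prompt": (
                "You are a long-time customer who was recently charged twice "
                "for the same order. You've already contacted support once "
                "without resolution and are feeling frustrated but willing to "
                "give the Agent a chance to help."
            ),
            "max_turns": 10,  # allowed range: 1-50
            "judges": [
                {
                    "judge_name": "Empathy Shown",
                    "evaluation_rule": (
                        "Did the Agent acknowledge the customer's frustration "
                        "and express empathy before offering solutions?"
                    ),
                },
            ],
        },
    ],
}
```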

Example scenarios

Here are some example Test scenarios you might create:
Scenario name: Response Empathy
Scenario prompt: You are a long-time customer who was recently charged twice for the same order. You’ve already contacted support once without resolution and are feeling frustrated but willing to give the Agent a chance to help. Express your concerns clearly and see if the Agent acknowledges your situation before jumping to solutions.
Max turns: 10
Judge: Empathy Shown
  • Evaluation rule: Did the Agent acknowledge the customer’s frustration and express empathy before offering solutions? The response should show understanding of the emotional state and validate their concerns.
Scenario name: Product Expertise
Scenario prompt: You are a procurement manager at a mid-sized company evaluating solutions for your team. You need specific details about enterprise pricing tiers, integration capabilities with existing tools like Salesforce and HubSpot, and data security certifications. Ask clarifying questions and compare features against competitors you’re also considering.
Max turns: 15
Judge: Accurate Information
  • Evaluation rule: Did the Agent provide accurate product information without making claims that cannot be verified? Responses should be factual, reference actual product capabilities, and acknowledge when information needs to be confirmed by a sales representative.
Scenario name: Escalation Request
Scenario prompt: You are a paying customer who has experienced a service outage affecting your business operations. You’ve already worked through the knowledge base articles and need to speak with a senior support engineer or account manager. Be firm but professional in your request, and provide context about the business impact.
Max turns: 5
Judge: Appropriate Escalation
  • Evaluation rule: Did the Agent acknowledge the severity of the situation, validate the customer’s need for escalation, and initiate a handoff to a human representative while maintaining a professional and empathetic tone throughout?

Running evaluations

Run Test scenarios to simulate conversations with your Agent and verify behavior before real interactions.

Running a Test scenario

Test scenarios simulate conversations with your Agent using the personas you’ve defined. Each conversation is evaluated by both scenario Judges (defined within the scenario) and Agent-level Judges (from the Judges tab):
  1. From the Tests tab, select the Test scenarios you want to include in the evaluation run. Running multiple scenarios at once groups them all under one evaluation result.
  2. Click the Run button to start the evaluation.
  3. Enter a name for the evaluation run (e.g., “Scenario Run - Jan 14, 12:14 PM”). A default name with timestamp is provided.
  4. Click Run to begin. The system will simulate conversations with your Agent based on your scenario prompts. Both scenario Judges (defined within each scenario) and Agent-level Judges (from the Judges tab) will evaluate each conversation.
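
Conceptually, a run like the one configured in the steps above alternates a simulated user (driven by the scenario prompt) with your Agent for up to the scenario’s max turns, then hands the finished conversation to the Judges. The sketch below is a rough Python illustration of that loop; the simulate_user and agent_reply helpers are hypothetical stand-ins, not the product’s implementation.

```python
# Rough illustration of an evaluation run: simulate a conversation, then judge it.
# simulate_user and agent_reply are hypothetical stand-ins for the persona
# simulator and your Agent; they are not the product's implementation.
def simulate_user(scenario_prompt: str, history: list[str]) -> str:
    """Placeholder: an LLM plays the persona described in the scenario prompt."""
    return "simulated user message"


def agent_reply(history: list[str]) -> str:
    """Placeholder: your Agent responds to the simulated user."""
    return "agent response"


def run_scenario(scenario_prompt: str, max_turns: int) -> list[str]:
    """Alternate simulated-user and Agent messages for up to max_turns turns."""
    history: list[str] = []
    for _ in range(max_turns):
        history.append(simulate_user(scenario_prompt, history))
        history.append(agent_reply(history))
    return history  # this transcript is what the Judges evaluate
```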

Understanding results

After running an evaluation, you’ll see a detailed results screen:

Run summary

The top of the results page shows key metrics:
Metric | Description
Average Score | Overall pass rate across all scenarios and Judges
Number of Conversations | How many Test conversations were evaluated
Agent Version | The version of the Agent that was tested

Scenario results

Each scenario displays:
Column | Description
Status | Running, Completed, or Failed
Name | The scenario name
Score | Percentage of Judges that passed (shown with a progress bar)
Judges | Pass/fail count (e.g., “1/1 passed”)
Credits | Credits consumed for this scenario
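
As a rough guide to how these numbers relate (an assumption about the arithmetic, not a documented formula): a scenario’s Score is the share of its Judges that passed, and the run’s Average Score is the overall pass rate across every Judge verdict in the run.

```python
# Illustrative arithmetic only; the exact aggregation used by the product may differ.
def scenario_score(verdicts: list[bool]) -> float:
    """Percentage of Judges that passed for one scenario."""
    return 100 * sum(verdicts) / len(verdicts)


def average_score(verdicts_per_scenario: list[list[bool]]) -> float:
    """Overall pass rate across all scenarios and Judges in the run."""
    flat = [v for scenario in verdicts_per_scenario for v in scenario]
    return 100 * sum(flat) / len(flat)


# Example: one scenario passed 1 of 2 Judges, another passed its only Judge
print(scenario_score([True, False]))           # 50.0
print(average_score([[True], [True, False]]))  # ~66.7
```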

Viewing conversation details

Click View Conversation on any scenario to see:
  1. The full conversation between the simulated user and your Agent
  2. Judge verdicts from all Judges added to the evaluation run, with detailed explanations of why each Judge passed or failed
For example, an “Empathy Shown” Judge might show:
Pass: The Agent demonstrated strong empathy throughout the conversation. Key examples include: acknowledging the customer’s frustration with being transferred multiple times (“I completely understand how upsetting it must be to feel like you’re not getting the help you need”), validating her experience with the double charge (“I truly understand how frustrating it is to be charged twice”), and directly addressing her skepticism by saying “I completely understand your concerns, especially given your previous experience.”

Saving an Agent with Evals configured

When you have Evals configured on an Agent, the save flow changes. When you click Publish to save your Agent, you’ll be prompted to run an evaluation before publishing.

Selecting Tests and scenarios

You can choose which evaluations to run in two ways:
  • Select entire Tests — Check the boxes next to Test names to include all scenarios within those Tests
  • Select specific scenarios — Click on a Test name to expand it, then select individual scenarios from within that Test
All associated Judges (both scenario Judges and Agent-level Judges) will be checked against the threshold.

Pass threshold and block save

Configure how evaluations affect the save process:
Setting | Description
Pass Threshold (%) | The minimum score percentage required for the evaluation to pass (e.g., 70%)
Block save if evaluation fails | When checked, the Agent will only be published if the evaluation score meets or exceeds the pass threshold. If unchecked, the Agent will be published even if the evaluation score falls below the threshold.
These settings work together: if “Block save if evaluation fails” is enabled and your evaluation score is below the threshold, publishing will be blocked.
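
Put another way, the publish decision can be read as a single condition. The sketch below is only an illustration of the logic described above, not the product’s code.

```python
# Illustration of the "Pass Threshold" + "Block save if evaluation fails" logic.
def publish_allowed(average_score: float, pass_threshold: float,
                    block_save_if_failed: bool) -> bool:
    if average_score >= pass_threshold:
        return True                      # evaluation passed: publishing proceeds
    return not block_save_if_failed      # failed: publish only if blocking is off


print(publish_allowed(65.0, 70.0, True))   # False: publishing is blocked
print(publish_allowed(65.0, 70.0, False))  # True: published despite failing
```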

Force publish

To bypass evaluation entirely, click the dropdown arrow on the Publish button and select Force Publish. This will save and publish your Agent without running any evaluations.

Runs history

Access your complete evaluation history from the Runs section. This shows:
Column | Description
Run name | Name of the evaluation run
Average score | Overall pass rate, shown with a progress bar
# Conversations | Number of conversations in the run
Progress | Completion status (e.g., “1/1”)
Date created | When the run was executed
Click on any past run to view its detailed results.

Best practices

Start simple

Begin with a few core scenarios that test your Agent’s primary use cases. Add complexity as you learn what matters most.

Be specific with Judges

Write detailed evaluation rules. Vague criteria lead to inconsistent results, so include specific examples of what passing looks like. For example, “Responses should be empathetic” leaves the Judge to guess, while “Did the Agent acknowledge the customer’s frustration and validate their concerns before offering a solution?” gives it a concrete standard to check.

Test edge cases

Create scenarios for difficult situations: angry customers, off-topic requests, requests to bypass rules, etc.

Run regularly

Evaluate your Agent after making changes to prompts, tools, or knowledge. Use runs history to track improvements over time.

Frequently asked questions (FAQs)

How many scenarios can I add to a Test?
You can add as many scenarios as needed to a single Test. Each scenario is evaluated independently and can have its own Judges.

How are credits calculated for each scenario?
Credits consumed for each scenario are calculated by adding together:
  • The Agent task run (the conversation with your Agent)
  • The simulator, which uses an LLM to play the persona described in the scenario prompt
  • The Judge evaluations: each Judge, whether a scenario Judge or an Agent-level Judge, uses an LLM to assess the conversation
Each scenario shows its total credit usage in the results.
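
As a rough illustration of that sum (the component costs below are made-up placeholders, not real credit prices):

```python
# Made-up numbers purely to show the structure of the sum; actual credit costs
# depend on your Agent, the models involved, and your plan.
agent_task_run_credits = 4       # hypothetical
simulator_credits = 2            # hypothetical
judge_credits = [1, 1, 1]        # hypothetical: one entry per Judge that ran

scenario_total = agent_task_run_credits + simulator_credits + sum(judge_credits)
print(scenario_total)  # 9
```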

Can I run the same Test scenarios more than once?
Yes, you can run the same Test scenarios again at any time. Each run is saved in your Runs history, allowing you to compare results across different Agent versions.

What is the difference between scenario Judges and Agent-level Judges?
Scenario Judges are created within Test scenarios and only evaluate conversations generated by that specific scenario. Agent-level Judges are created in the Judges tab and automatically run on every Test scenario you execute, providing standard evaluation criteria across all your Tests.

Why don’t I see Evals in my account?
Evals is being rolled out progressively, starting with Enterprise customers. If you’re an Enterprise customer and don’t see the Evals section in the Monitor tab yet, reach out to your account manager to discuss access.