The reason enterprise is struggling to scale agents
The agent architecture winning demos isn't the one that wins at enterprise scale. Here's the trade-off most teams miss, and how we build agents that hold up across millions of tasks a month.
Does this sound familiar?
Have you been asking what everyone else is asking: "Tools like Cowork and Codex are so obviously powerful! So why aren't they working at enterprise scale?"
The pattern keeps making headlines: enterprises spend heavily on general-purpose AI, then can't tie it to a return, or watch the bill outrun the value it delivers.
- Running the work on AI is turning out to cost more than the employees it was meant to replace.
- Enterprises are scrambling to rein in agent token bills now running into tens of millions a month.
- Only one in four enterprise AI initiatives has delivered the ROI leaders expected.
- The share of companies scrapping most of their AI initiatives more than doubled in a single year.
We call it Claude Chaos.
The leap from a fantastic copilot to an agent you can trust to run unattended, at volume, isn't a capability problem the next release will solve. It's architectural. The very design that makes these systems extraordinary with a human beside them is what works against you the moment the human steps away and the numbers get large.
You've seen what a great agent can do
You've watched Codex refactor an entire codebase. You've messaged an OpenClaw agent and watched it triage your inbox like a personal assistant. You've seen a teammate use Cowork to do the work of a team in an afternoon.
The next thought is obvious: can we scale this up? Let's point that same autonomy at a high-volume function like our inbound funnel and let it process millions of tasks a month.
The answer many companies are arriving at is "not quite." The non-obvious reason: the way those agents are designed doesn't hold up at scale.
These systems share one architecture: a general harness. The latest frontier model, a general-purpose prompt, a flexible toolset, and skills that unlock capability on demand. One powerful generalist, handed a goal and the freedom to figure out the rest.
It's a brilliant, incredibly flexible design. Drawing on the power of frontier models, an agent built like this can do almost anything. It can pull different skills together to solve novel problems and use its general-purpose tooling to create files, run custom scripts, and browse the internet freely.
The game it's built for
It's built for long-horizon work with a human in the loop. Hand it a sprawling refactor or an open research thread and let it run: explore, backtrack, pull in a skill the moment it needs one, and hold the whole problem in one context. A person sits beside it the whole time, steering, reviewing the work, and catching the rare wrong turn before it costs anything.
This is where these tools genuinely shine, and where the labs are pushing hardest:
- A refactor that touches a hundred files across a codebase no one fully remembers.
- Open-ended research, pulled from dozens of sources and synthesized into a single memo.
- Exploratory data analysis, chasing the non-obvious cut that actually explains the number.
- A complex or time-consuming task you'll never need to repeat quite the same way.
The common thread: the work is open-ended, every run is different, and what you want is the best possible outcome on this one attempt. This is where agents can demonstrate the most brilliance, carving out a path through uncertainty.
The trade-off is predictability. Depending on the approach it takes to a task and what it finds along the way, a single run might cost anywhere from a couple of dollars to thousands. It might succeed, or it might end up spinning its wheels.
The beauty is in the possibility, and the risk is in the unknown.
It's the kind of work an agent has to master to reach anything like AGI. That's why it's the focus of the frontier labs.
At enterprise scale, the math inverts
For early-stage businesses, brilliance pays. The upside of a breakthrough is huge and the downside of a misfire is small, so you chase the ceiling.
Run the same work across thousands of repetitions and the calculus flips. A flawlessly personalized outbound email is worth barely more than a merely competent one, but a single message that leaks customer data, hits the wrong segment, or misstates a price can be catastrophic. The upside is capped. The downside compounds.
That's the nature of high-volume enterprise work: asymmetric consequences. Which means the thing worth optimizing isn't frontier capability, the ceiling of what's possible on a great run. It's reliability, the floor you can guarantee on every run.
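To make the asymmetry concrete, here is a back-of-the-envelope sketch. Every number in it is invented for illustration: the value of an email, the failure rates, and the cost of a bad message are assumptions, not measurements, and `expected_value_per_run` is a hypothetical helper.

```python
def expected_value_per_run(upside: float, failure_rate: float, failure_cost: float) -> float:
    """Expected value of one run: the capped gain on success minus the expected loss on failure."""
    return (1 - failure_rate) * upside - failure_rate * failure_cost


# All figures below are made-up assumptions for illustration only.
FAILURE_COST = 5_000.0  # hypothetical cost of one message that leaks data or misstates a price

# "Brilliant" agent: slightly better emails, but one bad message per 1,000 runs.
brilliant = expected_value_per_run(upside=1.20, failure_rate=1 / 1_000, failure_cost=FAILURE_COST)

# "Reliable" agent: merely competent emails, but one bad message per 100,000 runs.
reliable = expected_value_per_run(upside=1.00, failure_rate=1 / 100_000, failure_cost=FAILURE_COST)

print(f"brilliant: {brilliant:+.2f} per run")
print(f"reliable:  {reliable:+.2f} per run")
```

Under these assumed numbers, the agent with the higher ceiling loses money on every run, while the one with the higher floor stays positive. Change the inputs and the gap moves, but the structure of the calculation is the point: the upside term barely changes, and the failure term dominates.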
Most enterprise work isn't open-ended
Now picture the work you actually want to hand to agents at volume: qualifying inbound leads, triaging support tickets, screening invoices, onboarding a vendor, reconciling a payment. None of it is open-ended. The path is largely known and written down somewhere already: a sequence of steps, a policy, a definition of done.
But it isn't pure software either. If it were, you'd have scripted it years ago. What stops you is the long tail: the edge cases where the rules conflict, the data is ambiguous, or an exception needs a genuine call. That tail requires reasoning, and reasoning is what agents supply.
So the shape is distinctive: a well-defined process wrapped around a small, irreducible core of judgement. That is nothing like a sprawling refactor or an open research thread, and it doesn't need an agent free to improvise the whole job. It needs most of the work pinned down as structure, with reasoning reserved for the cases that genuinely require it.
How we build agents at Relevance
Instead of one generalist improvising the whole job, we break the workflow into specialist agents, each with a single, bounded responsibility. An enrichment agent. A scoring agent. A writer agent. A sending agent.
Each specialist carries its own evals: a test set that defines what "correct" means for that one job. Those evals do real work: they drive which model each agent runs on and how its prompt is tuned. The writer agent and the sending agent can sit on different models, because each is chosen against its own bar.
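A minimal sketch of evals driving model choice might look like the following. It assumes exact-match scoring and uses plain functions as stand-ins for model calls. The names `EvalCase`, `Candidate`, `pass_rate`, and `choose_model`, along with the prices, are hypothetical and not any platform's API.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class EvalCase:
    input_text: str
    expected: str  # what "correct" means for this one job


@dataclass(frozen=True)
class Candidate:
    name: str                  # hypothetical model label
    cost_per_task: float       # assumed price, for illustration only
    run: Callable[[str], str]  # stand-in for a real model call


def pass_rate(candidate: Candidate, cases: Sequence[EvalCase]) -> float:
    """Fraction of eval cases the candidate answers exactly as expected."""
    if not cases:
        raise ValueError("an eval set needs at least one case")
    passed = sum(1 for case in cases if candidate.run(case.input_text) == case.expected)
    return passed / len(cases)


def choose_model(candidates: Sequence[Candidate], cases: Sequence[EvalCase],
                 bar: float) -> Optional[Candidate]:
    """Return the cheapest candidate whose pass rate clears the bar, or None."""
    passing = [c for c in candidates if pass_rate(c, cases) >= bar]
    return min(passing, key=lambda c: c.cost_per_task) if passing else None


# Toy eval set for a scoring agent: classify a lead as "hot" or "cold".
SCORING_EVALS = [
    EvalCase("budget approved, wants a demo this week", "hot"),
    EvalCase("student researching for a class project", "cold"),
    EvalCase("VP asked for pricing and a security review", "hot"),
    EvalCase("unsubscribe me", "cold"),
]


# Stub "models": keyword rules standing in for real model calls.
def small_model(text: str) -> str:
    return "hot" if "demo" in text or "pricing" in text else "cold"


def large_model(text: str) -> str:
    return "hot" if any(w in text for w in ("demo", "pricing", "budget")) else "cold"


def careless_model(text: str) -> str:
    return "hot"


CANDIDATES = [
    Candidate("large", 0.020, large_model),
    Candidate("small", 0.002, small_model),
    Candidate("careless", 0.001, careless_model),
]

chosen = choose_model(CANDIDATES, SCORING_EVALS, bar=1.0)
```

Here the cheapest candidate fails half the cases, so it never qualifies, and the small model wins because it clears the same bar as the large one at a lower assumed price. A different agent with a different eval set could land on a different model from the same candidate list.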
And each gets specialist tools with bounded scopes. The writer agent can read a prospect and write a draft. It has no ability to send, not by policy but by construction. The capability simply isn't wired to it.
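One way to picture "by construction" is an agent whose only capabilities are the tools handed to it when it is created. The sketch below is an assumption about how that could look. `SpecialistAgent`, `ToolNotWired`, and the three tool functions are hypothetical names, and the tools are stubs.

```python
from dataclasses import dataclass
from types import MappingProxyType
from typing import Callable, Mapping


class ToolNotWired(LookupError):
    """Raised when an agent asks for a tool it was never given."""


@dataclass(frozen=True)
class SpecialistAgent:
    name: str
    tools: Mapping[str, Callable[[str], str]]  # the agent's entire capability surface

    def call_tool(self, tool_name: str, argument: str) -> str:
        tool = self.tools.get(tool_name)
        if tool is None:
            # There is no rule to bypass here: the function is simply absent.
            raise ToolNotWired(f"{self.name} has no tool named {tool_name!r}")
        return tool(argument)


# Hypothetical tools; real ones would talk to a CRM or an email provider.
def read_prospect(prospect_id: str) -> str:
    return f"profile for {prospect_id}"


def write_draft(profile: str) -> str:
    return f"draft based on {profile}"


def send_email(draft: str) -> str:
    return f"sent: {draft}"


# Read-only mappings so the tool set cannot be widened after construction.
writer = SpecialistAgent(
    "writer",
    MappingProxyType({"read_prospect": read_prospect, "write_draft": write_draft}),
)
sender = SpecialistAgent("sender", MappingProxyType({"send_email": send_email}))
```

The writer (drafting) agent can chain `read_prospect` into `write_draft`, but a call to `send_email` raises `ToolNotWired` however the request is phrased, because the function was never part of its tool set.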
This might sound obvious, but it actually goes against the conventional wisdom. Building this sort of structure and scope into your agents limits their flexibility. It makes these agents actively worse copilots, and the majority of agent discourse is focused on copilots. But it's required to run them autonomously at scale.
This is easier to optimize
When an agent has one job, its traces are clean: you can see exactly what came in, what it decided, and whether it was right. So improvement has somewhere specific to land.
Each agent is versioned and evaluated on its own, which means you can set a different posture for each. Adopt a new frontier model on the writer agent the day it ships. Its evals tell you instantly if quality held. Keep the sending agent pinned until it clears its own bar. You capture new intelligence where it's safe and hold the line where it isn't, instead of betting the whole pipeline on one upgrade.
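As a rough sketch, per-agent upgrade posture can be expressed as a config per agent plus a gate that compares a new model's eval score to that agent's own bar. The model names, scores, and the `AgentConfig` and `try_upgrade` helpers are illustrative assumptions.

```python
from dataclasses import dataclass, replace
from typing import Dict


@dataclass(frozen=True)
class AgentConfig:
    model: str       # hypothetical model identifier
    version: int     # bumped on every accepted change
    eval_bar: float  # minimum eval score this agent must hold


def try_upgrade(config: AgentConfig, new_model: str, eval_score: float) -> AgentConfig:
    """Adopt new_model only if its eval score clears this agent's own bar; otherwise stay pinned."""
    if eval_score >= config.eval_bar:
        return replace(config, model=new_model, version=config.version + 1)
    return config


pipeline: Dict[str, AgentConfig] = {
    "writer": AgentConfig(model="model-a", version=3, eval_bar=0.90),
    "sender": AgentConfig(model="model-a", version=7, eval_bar=0.999),
}

# Assumed eval results for a newly released "model-b" on each agent's own test set.
new_scores = {"writer": 0.94, "sender": 0.991}

upgraded = {
    name: try_upgrade(config, "model-b", new_scores[name])
    for name, config in pipeline.items()
}
```

With these made-up scores, the writer moves to `model-b` and its version increments, while the sender stays on `model-a` because 0.991 does not clear its stricter bar. The two decisions are independent, which is the property the paragraph above describes.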
The same surface lets you optimize cost. You can right-size the model to each job: run the heavy reasoning steps on a frontier model and the simple, well-defined ones on something smaller and cheaper.
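The arithmetic behind right-sizing is simple enough to sketch. The prices, the monthly volume, and the assignment of steps to tiers below are all assumptions chosen for round numbers. Real costs depend on the model and on token counts per task.

```python
from typing import Dict

# Assumed price in whole dollars per 1,000 tasks for each model tier (illustrative only).
PRICE_PER_THOUSAND: Dict[str, int] = {"frontier": 30, "small": 2}

TASKS_PER_MONTH = 1_000_000  # hypothetical volume

# Illustrative assignment: only the step assumed to need heavy reasoning gets the frontier tier.
RIGHT_SIZED: Dict[str, str] = {
    "enrichment": "small",
    "scoring": "small",
    "writer": "frontier",
    "sending": "small",
}

# Baseline for comparison: every step on the frontier tier.
ALL_FRONTIER: Dict[str, str] = {step: "frontier" for step in RIGHT_SIZED}


def monthly_cost(assignment: Dict[str, str], tasks: int) -> int:
    """Dollars per month when every task passes through every step once."""
    per_thousand = sum(PRICE_PER_THOUSAND[tier] for tier in assignment.values())
    return tasks // 1_000 * per_thousand


baseline = monthly_cost(ALL_FRONTIER, TASKS_PER_MONTH)
right_sized = monthly_cost(RIGHT_SIZED, TASKS_PER_MONTH)
savings_percent = 100 * (baseline - right_sized) // baseline
```

Under these assumptions the bill falls from 120,000 to 36,000 dollars a month, a 70 percent reduction. The saving is only available because each step is a separate agent with its own model setting. A single generalist has one model for everything.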
And it's easier to govern
Specialist boundaries are real, addressable objects, so they give you control a single harness can't.
- Bounded scope. An agent that can't send can't blast your list, no matter how badly it reasons. The dangerous combination of reading customer data and sending external email is split across two agents, so no single agent can do both. Misuse becomes structurally impossible rather than merely discouraged.
- Incremental permissioning. Authority is granted, widened, or revoked one agent at a time, the way a new hire earns scope instead of getting root access on day one.
- Attribution. Every action traces to a specific agent and the team that owns it, so you can fix one component in isolation without touching the rest.
This is the shape enterprises already trust: named owners, bounded roles, clear audit trails, and scope that's earned over time. Specialist agents conform to it. A general harness asks you to abandon it.
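The permissioning and attribution points above can be sketched as a small registry that grants or revokes scope one agent at a time and records every attempted action against the agent and its owning team. `PermissionRegistry`, `AuditEntry`, and the team names are hypothetical. This shows the bookkeeping only, and it complements rather than replaces the structural isolation described in the first bullet.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass(frozen=True)
class AuditEntry:
    agent: str
    owner_team: str
    action: str
    allowed: bool


@dataclass
class PermissionRegistry:
    owners: Dict[str, str]  # agent name -> owning team
    scopes: Dict[str, Set[str]] = field(default_factory=dict)
    audit_log: List[AuditEntry] = field(default_factory=list)

    def grant(self, agent: str, scope: str) -> None:
        """Widen one agent's authority by a single scope."""
        self.scopes.setdefault(agent, set()).add(scope)

    def revoke(self, agent: str, scope: str) -> None:
        """Remove a single scope from one agent; other agents are untouched."""
        self.scopes.get(agent, set()).discard(scope)

    def attempt(self, agent: str, action: str) -> bool:
        """Check one action and record it against the agent and its owning team."""
        allowed = action in self.scopes.get(agent, set())
        self.audit_log.append(AuditEntry(agent, self.owners[agent], action, allowed))
        return allowed


registry = PermissionRegistry(owners={"writer": "growth", "sender": "ops"})
registry.grant("writer", "read_prospect")
registry.grant("sender", "send_email")

writer_can_send = registry.attempt("writer", "send_email")
sender_can_send = registry.attempt("sender", "send_email")

registry.revoke("sender", "send_email")
sender_after_revoke = registry.attempt("sender", "send_email")
```

Each entry in `audit_log` names the agent and the team that owns it, so a denied or mistaken action points at one component rather than at the pipeline as a whole.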
You've seen this before
This isn't a new way to organize work. It's the way every enterprise already organizes its people, for exactly the same reasons.
No company hands one brilliant generalist the keys to the whole business and tells them to figure it out. It hires into bounded roles. An analyst enriches the data, a rep qualifies the lead, a writer owns the message, and ops owns the send, each with a manager, a remit, and a clear definition of done.
Authority is earned, not granted on day one. A new hire gets read access before write access, and write access long before the ability to email the entire customer base. And every action carries a name, so when something breaks you know which seat to fix without reorganizing the company.
We build agents the same way. The reasons enterprises arrived at this structure don't go away just because the worker is on a server: accountability, safety, and the freedom to improve one part without disturbing the rest.
Reliability beats brilliance
The frontier labs are chasing AGI: open-ended work that, if agents can master it, changes what's possible for humanity. The general harness is the right tool for that.
But enterprise process is the opposite, and it's the work that has to get done today: well-defined, run at volume, with a long tail of rules that must be honored exactly and no human checking every output. The goal isn't to push the ceiling. It's to solidify the floor.
Markets have always rewarded the firms that make their processes predictable, reliable, measurable, and repeatable. We're betting the same holds for agents. So while the labs build for the world that's coming, we're building specialist agents for how enterprise works right now.
See what that looks like for your team. Book a demo.