Why enterprises are struggling to scale agents

The agent architecture winning demos isn't the one that wins at enterprise scale. Here's the trade-off most teams miss, and how we build agents that hold up across millions of tasks a month.

Does this sound familiar?

You've probably wondered what everyone else has: "Tools like Cowork and Codex are so obviously powerful! So why aren't they working at enterprise scale?"

The pattern keeps making headlines: enterprises spend heavily on general-purpose AI, then can't tie it to a return, or watch the bill outrun the value it delivers.

Running the work on AI is turning out to cost more than the employees it was meant to replace.

Fortune, 2026

Enterprises are scrambling to rein in agent token bills now running into tens of millions a month.

TechCrunch, 2026

Only one in four enterprise AI initiatives has delivered the ROI leaders expected.

IBM, 2025 (CEO study of 2,000 leaders)

The share of companies scrapping most of their AI initiatives more than doubled in a single year.

S&P Global, 2025

95% of enterprise AI pilots show no measurable P&L impact (MIT, 2025)

14% of agent pilots reach organization-wide production (2026 industry survey)

40% of agent projects will be canceled by 2027 on cost, unclear value, or weak risk controls (Gartner, 2025)

We call it Claude Chaos.

The leap from a fantastic copilot to an agent you can trust to run unattended, at volume, isn't a capability problem the next release will solve. It's architectural. The very design that makes these systems extraordinary with a human beside them is what works against you the moment the human steps away and the numbers get large.

You've seen what a great agent can do

You've watched Codex refactor an entire codebase. You've messaged an OpenClaw agent and watched it triage your inbox like a personal assistant. You've seen a teammate use Cowork to do the work of a team in an afternoon.

The next thought is obvious: can we scale this up? Let's point that same autonomy at a high-volume function like our inbound funnel and let it process millions of tasks a month.

The answer many companies are arriving at is "not quite." The non-obvious reason: the way those agents are designed doesn't hold up at scale.

These systems share one architecture: a general harness. The latest frontier model, a general-purpose prompt, a flexible toolset, and skills that unlock capability on demand. One powerful generalist, handed a goal and the freedom to figure out the rest.

A frontier model with one general prompt, plus skills, tools, and MCP connections on demand.

One harness, one skill library, recomposed for each job. The same skills get reused across very different workflows.

Claude Cowork loads each skill in turn, doing every job in one shared context.

It's a brilliant, incredibly flexible design. Backed by a frontier model, an agent built like this can do almost anything. It can combine skills to solve novel problems and use its general-purpose tooling to create files, run custom scripts, and browse the internet freely.

The game it's built for

It's built for long-horizon work with a human in the loop. Hand it a sprawling refactor or an open research thread and let it run: explore, backtrack, pull in a skill the moment it needs one, and hold the whole problem in one context. A person sits beside it the whole time, steering, reviewing the work, and catching the rare wrong turn before it costs anything.

This is where these tools genuinely shine, and where the labs are pushing hardest:

A refactor that touches a hundred files across a codebase no one fully remembers.
Open-ended research, pulled from dozens of sources and synthesized into a single memo.
Exploratory data analysis, chasing the non-obvious cut that actually explains the number.
A complex or time-consuming task you'll never need to repeat quite the same way.

One model, one context window: free to branch, hit dead ends, backtrack, and discover a path that's different every run. That open-ended freedom is what the architecture is built for.

The common thread: the work is open-ended, every run is different, and what you want is the best possible outcome on this one attempt. This is where agents can demonstrate the most brilliance, carving out a path through uncertainty.

The trade-off: it is far less predictable. Depending on the approach it takes and what it finds along the way, a single task might cost anywhere from a couple of dollars to thousands. It might succeed, or it might end up spinning its wheels.

The beauty is in the possibility, and the risk is in the unknown.

It's the kind of work an agent has to master to reach anything like AGI. That's why it's the focus of the frontier labs.

At enterprise scale, the math inverts

For early-stage businesses, brilliance pays. The upside of a breakthrough is huge and the downside of a misfire is small, so you chase the ceiling.

Run the same work across thousands of repetitions and the calculus flips. A flawlessly personalized outbound email is worth barely more than a merely competent one, but a single message that leaks customer data, hits the wrong segment, or misstates a price can be catastrophic. The upside is capped. The downside compounds.

That's the nature of high-volume enterprise work: asymmetric consequences. Which means the thing worth optimizing isn't frontier capability, the ceiling of what's possible on a great run. It's reliability, the floor you can guarantee on every run.

Capped gains and compounding losses, across thousands of runs.
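To make the asymmetry concrete, here is a rough back-of-the-envelope sketch. Every number in it is an assumption chosen for illustration (the volume, the value of a good email, the cost of a bad one); none comes from real data. It compares making each good run slightly better against making bad runs rarer.

```python
def net_value(runs: int, gain_per_success: float, failures: int, loss_per_failure: float) -> float:
    """Net value of a batch: a capped gain on each success, minus the cost of each failure."""
    successes = runs - failures
    return successes * gain_per_success - failures * loss_per_failure


RUNS = 100_000  # hypothetical monthly volume

# Illustrative assumptions: $1.00 of value per good email, $5,000 per bad one.
baseline = net_value(RUNS, gain_per_success=1.00, failures=100, loss_per_failure=5_000)

# "Brilliance": every good email is worth 10% more, failure count unchanged.
brilliant = net_value(RUNS, gain_per_success=1.10, failures=100, loss_per_failure=5_000)

# "Reliability": same email quality, one tenth as many failures.
reliable = net_value(RUNS, gain_per_success=1.00, failures=10, loss_per_failure=5_000)

for label, value in [("baseline", baseline), ("brilliant", brilliant), ("reliable", reliable)]:
    print(f"{label:<10}{value:>12,.0f}")
```

Under these assumed numbers, the 10% quality gain recovers roughly $10,000, while the tenfold drop in failures recovers roughly $450,000. The exact figures will differ for any real workflow; the point is only that the floor moves the total far more than the ceiling does.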

Most enterprise work isn't open-ended

Now picture the work you actually want to hand to agents at volume: qualifying inbound leads, triaging support tickets, screening invoices, onboarding a vendor, reconciling a payment. None of it is open-ended. The path is largely known and written down somewhere already: a sequence of steps, a policy, a definition of done.

But it isn't pure software either. If it were, you'd have scripted it years ago. What stops you is the long tail: the edge cases where the rules conflict, the data is ambiguous, or an exception needs a genuine call. That tail requires reasoning, and reasoning is what agents make possible.

So the shape is distinctive: a well-defined process wrapped around a small, irreducible core of judgement. That is nothing like a sprawling refactor or an open research thread, and it doesn't need an agent free to improvise the whole job. It needs most of the work pinned down as structure, with reasoning reserved for the cases that genuinely require it.

Mostly codified rules, with a long tail of edge cases that genuinely need judgement. Pin down the head as structure; reserve reasoning for the tail.
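One way to picture that shape is a sketch like the following, using invoice screening as the example. The `Invoice` fields, the two rules, and the `stub_reasoner` function are all hypothetical stand-ins: the rules handle the codified head of the process, and only the cases they cannot settle are passed to a reasoning step, which in practice would be a model call.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class Invoice:
    vendor_known: bool
    amount_cents: int
    po_amount_cents: Optional[int]  # None when no purchase order is attached


def rule_decision(inv: Invoice) -> Optional[str]:
    """Codified head of the process. Returns None when the rules do not settle the case."""
    if not inv.vendor_known:
        return "reject"
    if inv.po_amount_cents is not None and inv.amount_cents == inv.po_amount_cents:
        return "approve"
    return None  # mismatch or missing purchase order: needs a judgement call


def screen(inv: Invoice, reason: Callable[[Invoice], str]) -> Tuple[str, str]:
    """Return (decision, path), where path records whether rules or reasoning decided."""
    decision = rule_decision(inv)
    if decision is not None:
        return decision, "rules"
    return reason(inv), "reasoning"


def stub_reasoner(inv: Invoice) -> str:
    """Stand-in for an agent call; a real system would prompt a model here."""
    return "escalate"


print(screen(Invoice(True, 10_000, 10_000), stub_reasoner))  # settled by rules
print(screen(Invoice(True, 12_000, 10_000), stub_reasoner))  # falls through to reasoning
```

Recording which path produced each decision also shows how much of the volume actually reaches the reasoning step.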

How we build agents at Relevance

Instead of one generalist improvising the whole job, we break the workflow into specialist agents, each with a single, bounded responsibility. An enrichment agent. A scoring agent. A writer agent. A sending agent.

Each specialist carries its own evals: a test set that defines what "correct" means for that one job. Those evals do real work: they drive which model each agent runs on and how its prompt is tuned. The writer agent and the sending agent can sit on different models, because each is chosen against its own bar.
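As a minimal illustration of what a per-agent eval can look like, the sketch below scores a candidate against a small test set and compares the result to a bar. The eval cases, the `candidate_scorer` function, and the 0.95 threshold are hypothetical, and exact-match scoring is a simplification; real evals for a writer agent would need a richer grader.

```python
from typing import Callable, List, Tuple

EvalCase = Tuple[str, str]  # (input, expected output)


def pass_rate(agent: Callable[[str], str], cases: List[EvalCase]) -> float:
    """Fraction of eval cases where the agent's output matches the expected one."""
    passed = sum(1 for given, expected in cases if agent(given) == expected)
    return passed / len(cases)


# Hypothetical eval set for a lead-scoring agent.
SCORING_EVALS: List[EvalCase] = [
    ("enterprise, 5000 seats", "hot"),
    ("student, personal email", "cold"),
    ("mid-market, 200 seats", "warm"),
    ("enterprise, 12000 seats", "hot"),
]


def candidate_scorer(lead: str) -> str:
    """Stand-in for a model call; a real agent would prompt a model here."""
    if lead.startswith("enterprise"):
        return "hot"
    if lead.startswith("mid-market"):
        return "warm"
    return "cold"


SCORING_BAR = 0.95  # assumed threshold for this one job

rate = pass_rate(candidate_scorer, SCORING_EVALS)
clears_bar = rate >= SCORING_BAR
print(f"pass rate {rate:.2f}, clears bar: {clears_bar}")
```

Because the test set belongs to one agent, the same function can compare two models or two prompts for that job without involving the rest of the pipeline.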

And each gets specialist tools with bounded scopes. The writer agent can read a prospect and write a draft. It has no ability to send, not by policy, but by construction. The capability simply isn't wired to it.
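A small sketch of what "by construction" can mean in code, under assumptions: the `Specialist` class, the tool functions, and the in-memory `outbox` are invented for illustration and are not Relevance's implementation. The agent that writes holds no reference to the send function, so there is nothing for it to call.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


class ToolNotWired(Exception):
    """Raised when an agent asks for a tool it was never given."""


@dataclass(frozen=True)
class Specialist:
    name: str
    tools: Dict[str, Callable[[str], str]]

    def call(self, tool: str, arg: str) -> str:
        if tool not in self.tools:
            raise ToolNotWired(f"{self.name} has no tool named {tool!r}")
        return self.tools[tool](arg)


outbox: List[str] = []  # stands in for a real email provider


def read_prospect(prospect_id: str) -> str:
    return f"profile for {prospect_id}"


def write_draft(profile: str) -> str:
    return f"Draft based on {profile}"


def send_email(body: str) -> str:
    outbox.append(body)
    return "sent"


# The writer is wired to read and draft; only the sender is wired to send.
writer = Specialist("writer", {"read_prospect": read_prospect, "write_draft": write_draft})
sender = Specialist("sender", {"send_email": send_email})

draft = writer.call("write_draft", writer.call("read_prospect", "p-001"))

try:
    writer.call("send_email", draft)
    blocked = ""
except ToolNotWired as err:
    blocked = str(err)

print(blocked)
print(len(outbox))  # nothing was sent by the writer
```

The check inside `call` only produces a clear error message. The boundary itself comes from the wiring: the writer's tool table never contains `send_email`.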

One workforce: a manager delegates to bounded specialists that hand work along in turn.

Zoom into each specialist: every one owns its own evals and its own bounded tools, and nothing more.

This might sound obvious, but it cuts against conventional wisdom. Building this sort of structure and scope into your agents limits their flexibility. It makes them actively worse copilots, and most agent discourse is focused on copilots. But it's what running them autonomously at scale requires.

This is easier to optimize

When an agent has one job, its traces are clean: you can see exactly what came in, what it decided, and whether it was right. So improvement has somewhere specific to land.

Each agent is versioned and evaluated on its own, which means you can set a different posture for each. Adopt a new frontier model on the writer agent the day it ships. Its evals tell you instantly if quality held. Keep the sending agent pinned until it clears its own bar. You capture new intelligence where it's safe and hold the line where it isn't, instead of betting the whole pipeline on one upgrade.
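The per-agent posture described above could be sketched as a simple gate. The `AgentConfig` fields, the model names `model-v1` and `model-v2`, the bars, and the eval scores are all hypothetical; the idea is only that a new model is adopted by an agent when it clears that agent's own bar, and otherwise the agent stays pinned.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class AgentConfig:
    name: str
    model: str       # the model version this agent is currently pinned to
    eval_bar: float  # minimum eval pass rate this agent requires


def try_upgrade(agent: AgentConfig, new_model: str, eval_scores: Dict[str, float]) -> bool:
    """Move the agent to new_model only if that model clears this agent's own bar."""
    score = eval_scores.get(new_model)
    if score is None or score < agent.eval_bar:
        return False  # stay pinned to the current model
    agent.model = new_model
    return True


writer = AgentConfig("writer", model="model-v1", eval_bar=0.90)
sender = AgentConfig("sender", model="model-v1", eval_bar=0.999)

# Assumed results of running each agent's own eval set against the new model.
writer_scores = {"model-v2": 0.94}
sender_scores = {"model-v2": 0.991}

writer_upgraded = try_upgrade(writer, "model-v2", writer_scores)
sender_upgraded = try_upgrade(sender, "model-v2", sender_scores)

print(writer.model, sender.model)
```

In this example the same release lands on the writer immediately and leaves the sender untouched, because each decision is made against a separate bar.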

The same surface lets you optimize cost. You can right-size the model to each job: run the heavy reasoning steps on a frontier model and the simple, well-defined ones on something smaller and cheaper.

Upgrade where it's safe; hold the line where it isn't.
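Right-sizing can be sketched the same way: for each agent, pick the cheapest model that still clears that agent's bar. The model names, per-task costs, and pass rates below are invented for illustration only.

```python
from typing import Dict, List, NamedTuple, Optional


class ModelOption(NamedTuple):
    name: str
    cost_per_task: float  # assumed dollars per task


def cheapest_passing(
    options: List[ModelOption], pass_rates: Dict[str, float], bar: float
) -> Optional[ModelOption]:
    """Return the lowest-cost model whose eval pass rate meets the bar, or None."""
    passing = [m for m in options if pass_rates.get(m.name, 0.0) >= bar]
    return min(passing, key=lambda m: m.cost_per_task) if passing else None


OPTIONS = [
    ModelOption("small", 0.002),
    ModelOption("mid", 0.010),
    ModelOption("frontier", 0.080),
]

# Hypothetical pass rates from each agent's own eval set.
enrichment_rates = {"small": 0.97, "mid": 0.98, "frontier": 0.99}
writer_rates = {"small": 0.71, "mid": 0.88, "frontier": 0.96}

enrichment_choice = cheapest_passing(OPTIONS, enrichment_rates, bar=0.95)
writer_choice = cheapest_passing(OPTIONS, writer_rates, bar=0.95)

print(enrichment_choice.name, writer_choice.name)
```

With these assumed numbers the simple enrichment step runs on the small model and the writer runs on the frontier one. A `None` result means no candidate clears the bar, which is a signal to revisit the prompt or the bar rather than to ship.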

And it's easier to govern

Specialist boundaries are real, addressable objects, so they give you control a single harness can't.

Bounded scope. An agent that can't send can't blast your list, no matter how badly it reasons. The dangerous combination of reading customer data and sending external email is split across two agents, so no single agent can do both: the failure is structurally impossible rather than merely discouraged.
Incremental permissioning. Authority is granted, widened, or revoked one agent at a time, the way a new hire earns scope instead of getting root access on day one.
Attribution. Every action traces to a specific agent and the team that owns it, so you can fix one component in isolation without touching the rest.
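Incremental permissioning and attribution can be pictured with a sketch like this one. The `Governance` class, the permission strings, and the team names are hypothetical. Note that this sketch shows runtime grants and an audit trail; it complements, rather than replaces, the structural split described under bounded scope.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

AuditEntry = Tuple[str, str, str, bool]  # (agent, owning team, permission, allowed)


@dataclass
class Governance:
    owners: Dict[str, str]  # agent name -> owning team
    grants: Dict[str, Set[str]] = field(default_factory=dict)
    audit_log: List[AuditEntry] = field(default_factory=list)

    def grant(self, agent: str, permission: str) -> None:
        self.grants.setdefault(agent, set()).add(permission)

    def revoke(self, agent: str, permission: str) -> None:
        self.grants.get(agent, set()).discard(permission)

    def attempt(self, agent: str, permission: str) -> bool:
        """Check one agent's permission and record the attempt against its owner."""
        allowed = permission in self.grants.get(agent, set())
        self.audit_log.append((agent, self.owners[agent], permission, allowed))
        return allowed


gov = Governance(owners={"writer": "content-team", "sender": "ops-team"})

gov.grant("sender", "send:internal")            # starts with narrow scope
first = gov.attempt("sender", "send:external")  # not yet granted
gov.grant("sender", "send:external")            # scope widened for this one agent
second = gov.attempt("sender", "send:external")
gov.revoke("sender", "send:external")           # and revoked again
third = gov.attempt("sender", "send:external")

print(first, second, third)
for entry in gov.audit_log:
    print(entry)
```

Each log entry names the agent and the team that owns it, so a denied or unexpected action points at one component to fix.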

This is the shape enterprises already trust: named owners, bounded roles, clear audit trails, and scope that's earned over time. Specialist agents conform to it. A general harness asks you to abandon it.

Every agent gets the same treatment: its own bounded scope, inside its own permissions and logs.

You've seen this before

This isn't a new way to organize work. It's the way every enterprise already organizes its people, for exactly the same reasons.

No company hands one brilliant generalist the keys to the whole business and tells them to figure it out. It hires into bounded roles. An analyst enriches the data, a rep qualifies the lead, a writer owns the message, and ops owns the send, each with a manager, a remit, and a clear definition of done.

Authority is earned, not granted on day one. A new hire gets read access before write access, and write access long before the ability to email the entire customer base. And every action carries a name, so when something breaks you know which seat to fix without reorganizing the company.

We build agents the same way. The reasons enterprises arrived at this structure don't go away just because the worker is on a server: accountability, safety, and the freedom to improve one part without disturbing the rest.

The same shape every org chart already uses: one owner up top, bounded roles beneath, each accountable for a single job.

Reliability beats brilliance

The frontier labs are chasing AGI: open-ended work that, if agents can master it, changes what's possible for humanity. The general harness is the right tool for that.

But enterprise process is the opposite, and it's the work that has to get done today: well-defined, run at volume, with a long tail of rules that must be honored exactly and no human checking every output. The goal isn't to push the ceiling. It's to solidify the floor.

Markets have always rewarded the firms that make their processes predictable, reliable, measurable, and repeatable. We're betting the same holds for agents. So while the labs build for the world that's coming, we're building specialist agents for how enterprise works right now.

See what that looks like for your team. Book a demo.