At Relevance AI we run more than 6,000 agents per day, performing over 30,000 complex tasks. Our agents can trigger any of more than 9,000 tools and call each other when configured as a workforce.
At this scale, context management is not an implementation detail. It is infrastructure.
Keeping agents reliable, performant, and accurate over long-running, tool-heavy tasks depends on how well they manage context. It is an active field of research, and the industry is still searching for a silver-bullet approach.
In our experience, the difference between a toy agent and a production agent is almost always context management.
Why Do We Need Context Management?
Large language models (LLMs) have a hard limit on how much information they can process in a single call. Exceed it, and the request fails.
More importantly, LLMs do not retain memory between calls. Every invocation starts from a blank slate. If you want the model to remember previous turns, you must resend the relevant context each time.
For example:
User: "Hi, can we talk about apples?"
LLM: "Sure, what would you like to know?"
User: "How many should a child eat every week?"
When generating the final answer, the model must see the entire exchange. The last question makes no sense without the earlier turns.
In simple demos, the naive approach works. You resend the entire conversation every time.
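The naive approach can be sketched in a few lines. This is illustrative only: `call_llm` is a hypothetical stand-in for any provider's chat API, and the message format is the common role/content convention.

```python
# Naive history replay: every call resends the full conversation.
# `call_llm` is a placeholder for a real chat-completion API call.
def call_llm(messages: list[dict]) -> str:
    # A real implementation would send `messages` to a model provider.
    return f"(reply to {len(messages)} messages)"

history: list[dict] = []

def ask(user_message: str) -> str:
    history.append({"role": "user", "content": user_message})
    reply = call_llm(history)  # the ENTIRE history goes out on every call
    history.append({"role": "assistant", "content": reply})
    return reply

ask("Hi, can we talk about apples?")
ask("How many should a child eat every week?")  # meaningless without the earlier turns
```

Every turn appends to `history`, and every call pays for everything that came before it.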
In production systems, this quickly breaks down for two reasons.
1. Hard Limits
Eventually you hit the model's context window limit and the call fails. How quickly depends on the size of inputs and outputs. A single turn involving multiple large tool outputs, such as transcripts, tables, or generated documents, can exhaust the window immediately.
2. Performance Degradation
Even before hitting the limit, too much irrelevant history degrades model performance. The model loses focus. This is closely related to phenomena like "context rot" and the "lost in the middle" effect.
Context management is therefore the art and science of constructing the minimal, most relevant context for each model call, no more and no less.
Beyond Truncation and Basic Compaction
When we set out to improve our agents' context handling, we already had a basic truncation and compaction mechanism in place. Once the context window reached a threshold, older history would be compressed.
In practice, this was not sufficient.
We observed:
- Significant performance degradation once compaction kicked in
- Loss of task-critical details
- Agents crashing when a single turn with multiple large tool outputs overwhelmed the context window
We needed something adaptive. The system had to treat context not as a log to trim, but as a constrained resource to manage deliberately.
We redesigned the architecture around two mechanisms.
1. Real-Time Smart Tool Output Compaction
Large tool outputs are one of the biggest threats to context health.
Imagine retrieving a full meeting transcript just to extract a few key details. Injecting the entire transcript into context is wasteful and destabilizing.
We introduced a goal-aware, real-time compaction mechanism.
If a tool output exceeds 4,000 tokens, we pause before committing it to history and:
- Show the agent a short preview
- Offer three options
Option 1 - Compact the Output
The agent can request compaction and provide explicit instructions, for example:
"Extract the following details: A, B, C."
The full output is then sent to a fast model for targeted compression based on the agent's instructions.
Instead of blindly summarizing tool output, we let the agent define what is relevant to its current goal.
Option 2 - Keep the Preview and Proceed
Sometimes the preview is sufficient. For example, in the case of a large stack trace, the agent may only need to know that an error occurred and why.
Option 3 - Keep the Full Output
If the agent genuinely needs the entire output, such as when combining rows from a large dataset, it can opt to include it. This option is available only if sufficient context window space remains.
Importantly:
- This interaction is ephemeral
- Only the selected output, whether compacted, previewed, or full, is committed to history
- Compaction runs asynchronously and does not block parallel tool calls
Even when multiple large tool outputs require compaction, the additional latency is capped at 10 seconds per parallel batch, regardless of how many tools are compacted within it.
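The gating logic above can be condensed into a small decision function. This is a simplified sketch: the token counter, preview size, and the compaction call are all illustrative assumptions, not our production code.

```python
# Sketch of the tool-output gate: outputs over the threshold are paused
# and only the agent's chosen form is committed to history.
TOKEN_LIMIT = 4_000

def count_tokens(text: str) -> int:
    # Rough stand-in: ~4 characters per token.
    return len(text) // 4

def preview(text: str, chars: int = 600) -> str:
    return text[:chars] + ("…" if len(text) > chars else "")

def gate_tool_output(output: str, choice: str = "preview",
                     instructions: str = "",
                     window_budget_ok: bool = True) -> str:
    """Return what gets committed to history for one tool output."""
    if count_tokens(output) <= TOKEN_LIMIT:
        return output                      # small outputs pass through untouched
    if choice == "compact":
        # In production this would send `output` + `instructions`
        # (e.g. "Extract the following details: A, B, C.") to a fast model.
        return f"[compacted per: {instructions}] " + preview(output, 200)
    if choice == "full" and window_budget_ok:
        return output                      # allowed only if window space remains
    return preview(output)                 # default: keep the ephemeral preview
```

The key property is that the decision happens before the output ever enters history, so the full text of an oversized result is never replayed on later turns.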
The impact was immediate.
For agents with heavy tool usage, we achieved:
- Over 80% reduction in tool output tokens
- Previously failing workflows now completing successfully
- Heavy tasks that used to exhaust the context window now staying under 30% utilization
This alone unlocked workflows that were previously unstable or impossible to complete.
2. Two-Phase History Compaction via Observational Memory
Tool outputs are only part of the problem. Conversation history itself grows over time.
Instead of treating history as a continuous log to shrink, we adopted a layered compaction strategy inspired by the idea of Observational Memory and adapted it to our platform.
The guiding principles are straightforward.
1. Never Compact the Previous Turn, Unless in Panic Mode
Follow-up prompts often depend heavily on the most recent output:
- "Edit the second paragraph."
- "Keep everything the same except the introduction."
Compacting the last turn too early causes avoidable degradation.
We only compact the previous turn if we are in panic mode, meaning failure is imminent without intervention.
2. Compact Out of Band
Rather than compacting mid-turn when a threshold is crossed, we compact at the end of a turn while the user is reading or typing their next message.
This avoids disrupting the agent's reasoning and keeps the user experience smooth.
3. Preserve Critical Details
During compaction we explicitly preserve:
- File names, uploaded or generated
- Artifacts created during the session
- Key constraints or decisions
- Lessons learned that help avoid repeated mistakes
This prevents context amnesia, a common failure mode in naive summarization systems.
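One way to enforce this is in the compaction prompt itself. The wording below is an assumption for illustration; the point is that the preserved categories are stated explicitly rather than left to the summarizer's judgment.

```python
# Illustrative compaction prompt: the preserve-list is spelled out so the
# fast model cannot silently drop task-critical details.
COMPACTION_PROMPT = """Compress the conversation history below into a dense,
prioritized list of facts. You MUST preserve verbatim:
- file names (uploaded or generated)
- artifacts created during the session
- key constraints or decisions
- lessons learned that help avoid repeated mistakes
Drop pleasantries and superseded intermediate results.

History:
{history}
"""

def build_compaction_prompt(history: str) -> str:
    return COMPACTION_PROMPT.format(history=history)
```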
4. Two Compaction Thresholds
We separate raw history from structured observations.
Low Threshold
When raw history exceeds 30% of the context window, we:
- Compact unobserved history into a dense, prioritized list of facts, referred to as observations
- Append those observations to existing structured entries
High Threshold
When structured observations exceed 50% of the context window, we:
- Reflect on all accumulated observations
- Remove repetition and redundancy
- Further compress them into a refined state
This creates a layered abstraction:
Raw Turns -> Observations -> Refined Observations
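The two thresholds can be expressed as a simple check over the token counts of each layer. The window size here is an assumed figure for illustration; only the 30% and 50% ratios come from the description above.

```python
# Schematic of the two compaction thresholds over the layered history.
CONTEXT_WINDOW = 200_000   # assumed window size, purely illustrative
LOW_THRESHOLD = 0.30       # raw history -> observations
HIGH_THRESHOLD = 0.50      # observations -> refined observations

def compaction_actions(raw_tokens: int, observation_tokens: int) -> list[str]:
    actions = []
    if raw_tokens > LOW_THRESHOLD * CONTEXT_WINDOW:
        actions.append("observe")   # compact unobserved turns into observations
    if observation_tokens > HIGH_THRESHOLD * CONTEXT_WINDOW:
        actions.append("reflect")   # dedupe and compress into a refined state
    return actions
```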
If the fast, small compaction model cannot fit the required context, whether from a large tool output, raw history during observation, or accumulated observations during reflection, we fall back to a 1M context window model. If that is still not enough, we apply hard truncation. This fallback path is needed in less than 5% of sessions.
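The escalation order is easy to sketch. The window sizes for the two models are assumptions; only the ordering of the fallbacks mirrors the text.

```python
# Fallback chain for compaction: fast model -> 1M-window model -> truncation.
FAST_MODEL_WINDOW = 128_000     # assumed window of the fast compaction model
LARGE_MODEL_WINDOW = 1_000_000  # the 1M context window fallback

def compaction_route(required_tokens: int) -> str:
    if required_tokens <= FAST_MODEL_WINDOW:
        return "fast-model"     # normal path, the vast majority of sessions
    if required_tokens <= LARGE_MODEL_WINDOW:
        return "1m-model"       # fallback, needed in <5% of sessions
    return "hard-truncate"      # last resort when even 1M is not enough
```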
Results
Across heavy, tool-driven workflows, this architecture fundamentally changed what our agents could handle.
Workflows that previously:
- Crashed due to context overflow
- Degraded unpredictably mid-task
- Required manual intervention
Now complete reliably, often using less than 30% of the available context window.
Fallback to large-context models is rare, at less than 5% of sessions, and tool-heavy agents see over 80% reduction in tool output tokens.
Most importantly, this system enabled a new class of general-purpose agents internally. These agents were failing our internal benchmarks prior to these changes. Without adaptive context management, they simply could not sustain long, tool-rich workflows.
With it, they can.
Closing Thoughts
Larger context windows will continue to raise the ceiling of what agents can process.
Production reliability is not achieved by increasing limits. It is achieved by managing them intelligently.
As agents become more capable, calling tools, generating artifacts, and coordinating with other agents, naive history replay stops working. Systems need adaptive context construction, goal-aware compaction, and deliberate window management.
In our experience, context management is one of the core engineering challenges in applied AI systems. It is also one of the clearest differences between demos and dependable production agents.
This is our approach. And it is only the beginning.

