At Relevance AI we run more than 6,000 agents per day, performing over 30,000 complex tasks. Our agents can trigger any of more than 9,000 tools and call each other when configured as a workforce.
At this scale, context management is not an implementation detail. It is infrastructure.
Keeping agents reliable, performant, and accurate over long-running, tool-heavy tasks depends on how well they manage context. It is an active field of research, and the industry is still searching for a silver-bullet approach.
In our experience, the difference between a toy agent and a production agent is almost always context management.
Why Do We Need Context Management?
Large language models (LLMs) have a hard limit on how much information they can process in a single call. Exceed it, and the request fails.
More importantly, LLMs do not retain memory between calls. Every invocation starts from a blank slate. If you want the model to remember previous turns, you must resend the relevant context each time.
For example:
User: "Hi, can we talk about apples?"
LLM: "Sure, what would you like to know?"
User: "How many should a child eat every week?"
When generating the final answer, the model must see the entire exchange. The last question makes no sense without the earlier turns.
In simple demos, the naive approach works. You resend the entire conversation every time.
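The naive approach can be sketched in a few lines. This is illustrative only: `call_llm` is a hypothetical stand-in for any provider's chat API, and the message format is the common role/content convention.

```python
# Naive history replay: every call resends the full conversation.
# `call_llm` is a placeholder for a real chat-completion API call.
def call_llm(messages: list[dict]) -> str:
    # A real implementation would send `messages` to a model provider.
    return f"(reply to {len(messages)} messages)"

history: list[dict] = []

def ask(user_message: str) -> str:
    history.append({"role": "user", "content": user_message})
    reply = call_llm(history)  # the ENTIRE history goes out on every call
    history.append({"role": "assistant", "content": reply})
    return reply

ask("Hi, can we talk about apples?")
ask("How many should a child eat every week?")  # meaningless without the earlier turns
```

Every turn appends to `history`, and every call pays for everything that came before it.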
In production systems, this quickly breaks down for two reasons.
1. Hard Limits
Eventually you hit the model's context window limit and the call fails. How quickly depends on the size of inputs and outputs. A single turn involving multiple large tool outputs, such as transcripts, tables, or generated documents, can exhaust the window immediately.
2. Performance Degradation
Even before hitting the limit, too much irrelevant history degrades model performance. The model loses focus. This is closely related to phenomena like "context rot" and the "lost in the middle" effect.
Context management is therefore the art and science of constructing the minimal, most relevant context for each model call, no more and no less.
Beyond Truncation and Basic Compaction
When we set out to improve our agents' context handling, we already had a basic truncation and compaction mechanism in place. Once the context window reached a threshold, older history would be compressed.
In practice, this was not sufficient.
We observed:
- Significant performance degradation once compaction kicked in
- Loss of task-critical details
- Agents crashing when a single turn with multiple large tool outputs overwhelmed the context window
We needed something adaptive. The system had to treat context not as a log to trim, but as a constrained resource to manage deliberately.
We redesigned the architecture around two mechanisms.
1. Real-Time Smart Tool Output Compaction
Large tool outputs are one of the biggest threats to context health.
Imagine retrieving a full meeting transcript just to extract a few key details. Injecting the entire transcript into context is wasteful and destabilizing.
We introduced a goal-aware, real-time compaction mechanism.
If a tool output exceeds 4,000 tokens, we pause before committing it to history and:
- Show the agent a short preview
- Offer three options
Option 1 - Compact the Output
The agent can request compaction and provide explicit instructions, for example:
"Extract the following details: A, B, C."
The full output is then sent to a fast model for targeted compression based on the agent's instructions.
Instead of blindly summarizing tool output, we let the agent define what is relevant to its current goal.
Option 2 - Keep the Preview and Proceed
Sometimes the preview is sufficient. For example, in the case of a large stack trace, the agent may only need to know that an error occurred and why.
Option 3 - Keep the Full Output
If the agent genuinely needs the entire output, such as when combining rows from a large dataset, it can opt to include it. This option is available only if sufficient context window space remains.
Importantly:
- This interaction is ephemeral
- Only the selected output, whether compacted, previewed, or full, is committed to history
- Compaction runs asynchronously and does not block parallel tool calls
Even when multiple large tool outputs require compaction, the additional latency is capped at 10 seconds per parallel batch, regardless of how many tools are compacted within it.
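The gating logic above can be condensed into a small decision function. This is a simplified sketch: the token counter, preview size, and the compaction call are all illustrative assumptions, not our production code.

```python
# Sketch of the tool-output gate: outputs over the threshold are paused
# and only the agent's chosen form is committed to history.
TOKEN_LIMIT = 4_000

def count_tokens(text: str) -> int:
    # Rough stand-in: ~4 characters per token.
    return len(text) // 4

def preview(text: str, chars: int = 600) -> str:
    return text[:chars] + ("…" if len(text) > chars else "")

def gate_tool_output(output: str, choice: str = "preview",
                     instructions: str = "",
                     window_budget_ok: bool = True) -> str:
    """Return what gets committed to history for one tool output."""
    if count_tokens(output) <= TOKEN_LIMIT:
        return output                      # small outputs pass through untouched
    if choice == "compact":
        # In production this would send `output` + `instructions`
        # (e.g. "Extract the following details: A, B, C.") to a fast model.
        return f"[compacted per: {instructions}] " + preview(output, 200)
    if choice == "full" and window_budget_ok:
        return output                      # allowed only if window space remains
    return preview(output)                 # default: keep the ephemeral preview
```

The key property is that the decision happens before the output ever enters history, so the full text of an oversized result is never replayed on later turns.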
The impact was immediate.
For agents with heavy tool usage, we achieved:
- Over 80% reduction in tool output tokens
- Previously failing workflows now completing successfully
- Heavy tasks that used to exhaust the context window now staying under 30% utilization
This alone unlocked workflows that were previously unstable or impossible to complete.
2. Two-Phase History Compaction via Observational Memory
Tool outputs are only part of the problem. Conversation history itself grows over time.
Instead of treating history as a continuous log to shrink, we adopted a layered compaction strategy inspired by the idea of Observational Memory and adapted it to our platform.
The guiding principles are straightforward.
1. Never Compact the Previous Turn, Unless in Panic Mode
Follow-up prompts often depend heavily on the most recent output:
- "Edit the second paragraph."
- "Keep everything the same except the introduction."
Compacting the last turn too early causes avoidable degradation.
We only compact the previous turn if we are in panic mode, meaning failure is imminent without intervention.
2. Compact Out of Band
Rather than compacting mid-turn when a threshold is crossed, we compact at the end of a turn while the user is reading or typing their next message.
This avoids disrupting the agent's reasoning and keeps the user experience smooth.
3. Preserve Critical Details
During compaction we explicitly preserve:
- File names, uploaded or generated
- Artifacts created during the session
- Key constraints or decisions
- Lessons learned that help avoid repeated mistakes
This prevents context amnesia, a common failure mode in naive summarization systems.
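One way to enforce this is in the compaction prompt itself. The wording below is an assumption for illustration; the point is that the preserved categories are stated explicitly rather than left to the summarizer's judgment.

```python
# Illustrative compaction prompt: the preserve-list is spelled out so the
# fast model cannot silently drop task-critical details.
COMPACTION_PROMPT = """Compress the conversation history below into a dense,
prioritized list of facts. You MUST preserve verbatim:
- file names (uploaded or generated)
- artifacts created during the session
- key constraints or decisions
- lessons learned that help avoid repeated mistakes
Drop pleasantries and superseded intermediate results.

History:
{history}
"""

def build_compaction_prompt(history: str) -> str:
    return COMPACTION_PROMPT.format(history=history)
```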
4. Two Compaction Thresholds
We separate raw history from structured observations.
Low Threshold
When raw history exceeds 30% of the context window, we:
- Compact unobserved history into a dense, prioritized list of facts, referred to as observations
- Append those observations to existing structured entries
High Threshold
When structured observations exceed 50% of the context window, we:
- Reflect on all accumulated observations
- Remove repetition and redundancy
- Further compress them into a refined state
This creates a layered abstraction:
Raw Turns -> Observations -> Refined Observations
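The two thresholds can be expressed as a simple check over the token counts of each layer. The window size here is an assumed figure for illustration; only the 30% and 50% ratios come from the description above.

```python
# Schematic of the two compaction thresholds over the layered history.
CONTEXT_WINDOW = 200_000   # assumed window size, purely illustrative
LOW_THRESHOLD = 0.30       # raw history -> observations
HIGH_THRESHOLD = 0.50      # observations -> refined observations

def compaction_actions(raw_tokens: int, observation_tokens: int) -> list[str]:
    actions = []
    if raw_tokens > LOW_THRESHOLD * CONTEXT_WINDOW:
        actions.append("observe")   # compact unobserved turns into observations
    if observation_tokens > HIGH_THRESHOLD * CONTEXT_WINDOW:
        actions.append("reflect")   # dedupe and compress into a refined state
    return actions
```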
If the fast, small compaction model cannot fit the required context, whether from a large tool output, raw history during observation, or accumulated observations during reflection, we fall back to a 1M context window model. If that is still not enough, we apply hard truncation. This fallback path is needed in less than 5% of sessions.
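The escalation order is easy to sketch. The window sizes for the two models are assumptions; only the ordering of the fallbacks mirrors the text.

```python
# Fallback chain for compaction: fast model -> 1M-window model -> truncation.
FAST_MODEL_WINDOW = 128_000     # assumed window of the fast compaction model
LARGE_MODEL_WINDOW = 1_000_000  # the 1M context window fallback

def compaction_route(required_tokens: int) -> str:
    if required_tokens <= FAST_MODEL_WINDOW:
        return "fast-model"     # normal path, the vast majority of sessions
    if required_tokens <= LARGE_MODEL_WINDOW:
        return "1m-model"       # fallback, needed in <5% of sessions
    return "hard-truncate"      # last resort when even 1M is not enough
```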
Results
Across heavy, tool-driven workflows, this architecture fundamentally changed what our agents could handle.
Workflows that previously:
- Crashed due to context overflow
- Degraded unpredictably mid-task
- Required manual intervention
Now complete reliably, often using less than 30% of the available context window.
Fallback to large-context models is rare, at less than 5% of sessions, and tool-heavy agents see over 80% reduction in tool output tokens.
Most importantly, this system enabled a new class of general-purpose agents internally. These agents were failing our internal benchmarks prior to these changes. Without adaptive context management, they simply could not sustain long, tool-rich workflows.
With it, they can.
Closing Thoughts
Larger context windows will continue to raise the ceiling of what agents can process.
Production reliability is not achieved by increasing limits. It is achieved by managing them intelligently.
As agents become more capable, calling tools, generating artifacts, and coordinating with other agents, naive history replay stops working. Systems need adaptive context construction, goal-aware compaction, and deliberate window management.
In our experience, context management is one of the core engineering challenges in applied AI systems. It is also one of the clearest differences between demos and dependable production agents.
This is our approach. And it is only the beginning.

