Token Budgeting

Every AI coding agent operates within a fixed context window — the total number of tokens the model can see in a single request. A 200k-token model sounds generous until you start filling it: the system prompt takes a slice, the repo map takes another, chat history accumulates, read-only files add bulk, and each tool result (file contents, shell output, diagnostics) consumes more. Without deliberate budget management, the agent silently drops information or hits a hard API error when the window overflows.

Token budgeting is the discipline of partitioning this finite resource across competing demands. The core questions are:

  1. How much space does each component get? — Fixed allocation vs. dynamic adjustment vs. first-come-first-served.
  2. What gets cut first when the budget overflows? — Oldest chat history? Repo map shrinkage? Tool output truncation?
  3. How are tokens counted? — Exact tokenization (tiktoken) vs. byte-based heuristics vs. character estimation.
  4. When does compaction trigger? — Pre-emptive before the API call, or reactive after an overflow error?

Getting this wrong produces some of the worst user experiences in coding agents: context silently drifts (the model “forgets” earlier edits), the repo map disappears when files are large, or the model hallucinates because it lost track of what was already changed. For history compaction strategies and message-pruning behavior, see Chat History. For provider-side prefix reuse economics, see Prompt Caching.


Aider

Reference commit: b9050e1d5faf8096eae7a46a9ecc05a86231384b

Aider uses a chunked message assembly model with explicit sub-budgets for the repo map and chat history, while letting other components (system prompt, file content, examples) consume space on a first-come basis. Token counting is delegated to litellm.

Aider counts tokens via litellm.token_counter(), which dispatches to the correct tokenizer (tiktoken for OpenAI models, model-specific tokenizers for others):

# aider/models.py, line 635
def token_count(self, messages):
    if type(messages) is list:
        try:
            return litellm.token_counter(model=self.name, messages=messages)
        except Exception as err:
            print(f"Unable to count tokens: {err}")

This is exact tokenization rather than a heuristic, giving Aider the most accurate client-side budget calculations of the tools compared here.

All messages are assembled in format_chat_chunks() (aider/coders/base_coder.py, line 1226) into a ChatChunks dataclass (aider/coders/chat_chunks.py):

@dataclass
class ChatChunks:
    system: List          # System prompt (main_sys + system_reminder)
    examples: List        # Few-shot example conversations
    readonly_files: List  # --read files (AGENTS.md, etc.)
    repo: List            # Repo map output
    done: List            # Completed chat history (may be summarized)
    chat_files: List      # Files currently in the chat (/add'd files)
    cur: List             # Current user message
    reminder: List        # System reminder (repeated at end)

The all_messages() method concatenates them in a fixed order: system → examples → readonly_files → repo → done → chat_files → cur → reminder. This order is significant for prompt caching — the stable prefix (system + examples + readonly_files + repo) stays cache-hot across turns.

The repo map gets a dedicated token budget computed from the model’s input limit:

# aider/models.py, line 767
def get_repo_map_tokens(self):
    map_tokens = 1024
    max_inp_tokens = self.info.get("max_input_tokens")
    if max_inp_tokens:
        map_tokens = max_inp_tokens / 8     # 12.5% of input window
        map_tokens = min(map_tokens, 4096)  # Hard cap at 4096
        map_tokens = max(map_tokens, 1024)  # Floor at 1024
    return map_tokens

For a 128k-token model: 128000 / 8 = 16000, clamped to 4096. For a 4k-token model: 4000 / 8 = 500, raised to 1024. The repo map generation in RepoMap.get_ranked_tags_map_uncached() (aider/repomap.py, line 629) uses binary search to find the maximum number of ranked tags that fit within this budget, counting tokens on each candidate tree via self.token_count(tree).
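The binary search over ranked tags can be sketched as follows. This is a simplified model, not Aider's actual code: `render` and the 4-chars-per-token `token_count` are illustrative stand-ins for the real tree renderer and tokenizer.

```python
def best_tag_count(tags, budget, render, token_count):
    """Binary search for the largest prefix of ranked tags whose
    rendered tree still fits within the token budget."""
    lo, hi, best = 0, len(tags), 0
    while lo <= hi:
        mid = (lo + hi) // 2
        if token_count(render(tags[:mid])) <= budget:
            best = mid      # fits: try including more tags
            lo = mid + 1
        else:
            hi = mid - 1    # too big: include fewer tags
    return best

# Toy usage: each tag renders to one 40-char line (~10 tokens)
tags = [f"symbol_{i}" for i in range(100)]
render = lambda ts: "\n".join(t.ljust(40) for t in ts)
count = lambda text: len(text) // 4
print(best_tag_count(tags, budget=256, render=render, token_count=count))  # → 25
```

Because "fits within budget" is monotone in the number of tags, the search needs only O(log n) render-and-count passes rather than one per candidate size.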

When no files are in the chat, the map budget is multiplied by --map-multiplier-no-files (default 2), giving the model twice the codebase overview.

Chat history gets a separate budget:

# aider/models.py, line 346
self.max_chat_history_tokens = min(max(max_input_tokens / 16, 1024), 8192)

For a 128k model: 128000 / 16 = 8000, clamped to 8192. For a 4k model: floor at 1024. Users can override this via --max-chat-history-tokens.

When done_messages exceeds this budget, summarization kicks in asynchronously. The ChatSummary class (aider/history.py) implements a recursive split-and-summarize algorithm:

  1. Check overflow: too_big() sums token counts of all done messages; if total exceeds max_tokens, proceed.
  2. Split: Walk backward from the end, accumulating tokens until half the budget is consumed. This becomes the “tail” (kept verbatim). Everything before the split is the “head.”
  3. Summarize head: Concatenate head messages into # USER\n{content}\n# ASSISTANT\n{content} blocks, send to the weak model with a summarize prompt. Result is a single user message prefixed with "[chat summary] ".
  4. Recurse: If summary + tail still exceeds the budget, recurse with depth + 1 (max depth 3).
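The split step (2) above can be sketched as follows — a minimal model with hypothetical helper names, where per-message token counts are supplied by the caller:

```python
def split_for_summary(messages, max_tokens, count):
    """Walk backward from the end, accumulating tokens until half the
    budget is consumed. The accumulated suffix is the tail (kept
    verbatim); everything before it is the head (to be summarized)."""
    tail, used = [], 0
    i = len(messages)
    while i > 0 and used + count(messages[i - 1]) <= max_tokens // 2:
        i -= 1
        used += count(messages[i])
        tail.insert(0, messages[i])
    return messages[:i], tail

msgs = ["a" * 400, "b" * 400, "c" * 400, "d" * 400]  # ~100 tokens each
head, tail = split_for_summary(msgs, max_tokens=400, count=lambda m: len(m) // 4)
# head goes to the weak model for summarization; tail is kept verbatim
```

With a 400-token budget, the tail absorbs the last two messages (200 tokens, exactly half the budget) and the first two become the head.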

The weak model (self.main_model.weak_model) handles summarization — typically a smaller, cheaper model. Summarization runs in a background thread (summarizer_thread) so it doesn’t block the user’s next message.

Before sending, check_tokens() (line 1396) compares the total assembled message count against max_input_tokens. If it exceeds the limit, Aider warns the user and asks whether to proceed — it does not auto-truncate. The user is expected to /drop files or /clear history.

A system reminder is appended at the end of the message list only if there’s room (total_tokens < max_input_tokens). The reminder can be injected as a system role message or stuffed into the last user message, depending on self.main_model.reminder setting.

Aider implements a cache-warming mechanism (line 1340) that sends minimal requests (max_tokens=1) every ~5 minutes to keep the prefix cached on Anthropic’s servers. Cache control headers are placed at the boundaries of examples, repo map, and chat files.


Codex

Reference commit: 4ab44e2c5cc54ed47e47a6729dfd8aa5a3dc2476

Codex uses a byte-based heuristic for token estimation, a percentage-based safety margin on the context window, and an iterative trim-from-front strategy when the window overflows. There is no dedicated repo map, so the budget partitioning is simpler than Aider’s.

Codex estimates tokens from byte counts rather than using a tokenizer:

// codex-rs/core/src/truncate.rs, line 10
const APPROX_BYTES_PER_TOKEN: usize = 4;

pub(crate) fn approx_token_count(text: &str) -> usize {
    let len = text.len();
    len.saturating_add(APPROX_BYTES_PER_TOKEN.saturating_sub(1)) / APPROX_BYTES_PER_TOKEN
}

This is a deliberate trade-off: no dependency on tiktoken, fast computation, and “close enough” for budget decisions. The 4-byte heuristic is reasonable for GPT tokenizers (typically 3.5–5 bytes per token for English code).
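The Rust expression above is just a ceiling division of the byte length by 4. A Python equivalent (illustrative, not Codex code; note that Rust's `len()` is a byte count, so the faithful mirror counts UTF-8 bytes, not characters):

```python
APPROX_BYTES_PER_TOKEN = 4

def approx_token_count(text: str) -> int:
    """Ceiling division of the UTF-8 byte length by 4."""
    n = len(text.encode("utf-8"))
    return (n + APPROX_BYTES_PER_TOKEN - 1) // APPROX_BYTES_PER_TOKEN

print(approx_token_count(""))       # 0
print(approx_token_count("abcd"))   # 1
print(approx_token_count("abcde"))  # 2
```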

For items in the conversation history, Codex estimates by serializing the item to JSON and counting the resulting bytes:

// codex-rs/core/src/context_manager/history.rs
pub(crate) fn estimate_response_item_model_visible_bytes(item: &ResponseItem) -> i64 {
    match item {
        ResponseItem::GhostSnapshot { .. } => 0, // Not sent to model
        ResponseItem::Reasoning { encrypted_content: Some(content), .. }
        | ResponseItem::Compaction { encrypted_content: content } => {
            // Encrypted reasoning: estimate 75% of base64 length
            i64::try_from(estimate_reasoning_length(content.len())).unwrap_or(i64::MAX)
        }
        item => {
            serde_json::to_string(item)
                .map(|serialized| i64::try_from(serialized.len()).unwrap_or(i64::MAX))
                .unwrap_or_default()
        }
    }
}

Each model defines a raw context_window and an effective_context_window_percent (default 95%):

// codex-rs/core/src/codex.rs
pub(crate) fn model_context_window(&self) -> Option<i64> {
    self.model_info.context_window.map(|context_window| {
        context_window.saturating_mul(effective_context_window_percent) / 100
    })
}

A 200k model becomes 190k effective — the 5% margin absorbs API response overhead and estimation error.

When recording tool outputs, Codex applies a TruncationPolicy with a 20% serialization budget overhead:

// codex-rs/core/src/context_manager/history.rs, line 328
let policy_with_serialization_budget = policy * 1.2;

The truncation algorithm preserves prefix and suffix, cutting from the middle:

// codex-rs/core/src/truncate.rs, line 186
fn truncate_with_byte_estimate(s: &str, policy: TruncationPolicy) -> String {
    let (left_budget, right_budget) = split_budget(max_bytes); // 50/50 split
    // ... preserve start and end, insert "…N tokens truncated…" marker
}

This middle-out strategy preserves the beginning (often a header or summary) and the end (often the result or error message) while cutting repetitive middle content.
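A minimal sketch of middle-out truncation (illustrative Python, not the Codex implementation; the marker format and byte-level slicing mirror the description above):

```python
def truncate_middle(s: str, max_bytes: int) -> str:
    """Keep a prefix and suffix (50/50 split of the byte budget) and
    replace the middle with a '…N tokens truncated…' marker."""
    data = s.encode("utf-8")
    if len(data) <= max_bytes:
        return s
    left = max_bytes // 2
    right = max_bytes - left
    cut_tokens = (len(data) - max_bytes + 3) // 4  # rough 4-bytes/token estimate
    head = data[:left].decode("utf-8", errors="ignore")
    tail = data[-right:].decode("utf-8", errors="ignore")
    return f"{head}…{cut_tokens} tokens truncated…{tail}"

out = truncate_middle("header\n" + "x" * 1000 + "\nresult: ok", max_bytes=64)
print(out.startswith("header"), out.endswith("result: ok"))  # True True
```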

Codex triggers compaction based on auto_compact_token_limit, a per-model threshold:

let auto_compact_limit = model_info.auto_compact_token_limit().unwrap_or(i64::MAX);
if total_usage_tokens >= auto_compact_limit {
    run_auto_compact(sess, turn_context).await?;
}

Compaction runs post-sampling (after the model finishes its response). The compaction flow (codex-rs/core/src/compact.rs, line 130):

  1. Extract trailing model-switch updates (preserved across compaction).
  2. Serialize current history as a Prompt and send to the LLM for summarization.
  3. Build compacted history: initial context + compacted summary + ghost snapshots.
  4. If the compacted prompt still exceeds the context window (CodexErr::ContextWindowExceeded), iteratively remove_first_item() from history and retry.

The iterative trim removes the oldest items first, which preserves the prompt prefix for cache benefits.
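The retry loop reduces to something like this sketch (hypothetical names; `estimate` stands in for the byte-based item estimator):

```python
def trim_until_fits(history, estimate, limit):
    """Mirror of the remove_first_item() retry loop: drop the oldest
    history item and re-check until the estimated total fits."""
    history = list(history)
    while history and sum(estimate(item) for item in history) > limit:
        history.pop(0)  # remove the oldest item first
    return history

items = ["old tool output" * 50, "recent edit", "latest message"]
kept = trim_until_fits(items, estimate=lambda s: len(s) // 4, limit=50)
print(kept)  # the oversized oldest item is dropped
```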

Codex tracks actual API-reported token usage alongside its byte-based estimates:

// codex-rs/protocol/src/protocol.rs
pub struct TokenUsage {
    pub input_tokens: i64,
    pub cached_input_tokens: i64,
    pub output_tokens: i64,
    pub reasoning_output_tokens: i64,
    pub total_tokens: i64,
}

pub struct TokenUsageInfo {
    pub total_token_usage: TokenUsage,
    pub last_token_usage: TokenUsage,
    pub model_context_window: Option<i64>,
}

The get_total_token_usage() method (line 264) combines the last API response’s total_tokens with byte-estimated tokens for items added since that response — a hybrid approach that uses real counts when available and falls back to heuristics for new items.
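The hybrid combination can be sketched as (illustrative, not the actual `get_total_token_usage()` body):

```python
def total_token_usage(last_api_total: int, new_items, estimate) -> int:
    """API-reported total for everything up to the last response, plus
    heuristic estimates for items appended since that response."""
    return last_api_total + sum(estimate(item) for item in new_items)

estimate = lambda item: (len(item) + 3) // 4  # ~4 bytes/token heuristic
print(total_token_usage(12_000, ["tool output...", "user reply"], estimate))  # 12007
```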


OpenCode

Reference commit: 7ed449974864361bad2c1f1405769fd2c2fcdf42

OpenCode uses a character-based estimation for pruning decisions, actual API-reported tokens for overflow detection, and a two-mechanism approach: tool output pruning (remove old tool results) and session compaction (LLM-generated summary).

OpenCode uses the simplest estimation of the three tools:

// packages/opencode/src/util/token.ts
export namespace Token {
  const CHARS_PER_TOKEN = 4
  export function estimate(input: string) {
    return Math.max(0, Math.round((input || "").length / CHARS_PER_TOKEN))
  }
}

This character-to-token estimate is used only for pruning decisions. For overflow detection, OpenCode uses actual token counts from the LLM API response:

// packages/opencode/src/session/message-v2.ts, line 246
tokens: z.object({
  total: z.number().optional(),
  input: z.number(),
  output: z.number(),
  reasoning: z.number(),
  cache: z.object({ read: z.number(), write: z.number() }),
})

The isOverflow() function (packages/opencode/src/session/compaction.ts, line 32) determines when compaction is needed:

export async function isOverflow(input: {
  tokens: MessageV2.Assistant["tokens"]
  model: Provider.Model
}) {
  const config = await Config.get()
  if (config.compaction?.auto === false) return false
  const context = input.model.limit.context
  if (context === 0) return false
  const count =
    input.tokens.total ||
    input.tokens.input + input.tokens.output + input.tokens.cache.read + input.tokens.cache.write
  const reserved =
    config.compaction?.reserved ??
    Math.min(COMPACTION_BUFFER, ProviderTransform.maxOutputTokens(input.model))
  const usable = input.model.limit.input
    ? input.model.limit.input - reserved
    : context - ProviderTransform.maxOutputTokens(input.model)
  return count >= usable
}

Key constants: COMPACTION_BUFFER = 20_000 tokens. OUTPUT_TOKEN_MAX = 32_000. The usable budget is either limit.input - reserved (if the model has a specific input cap) or context - maxOutputTokens(model).

Following the code above: for a model with a 200k input cap and the default 20k reserve, usable = 200000 − 20000 = 180000, so overflow triggers when accumulated tokens reach 180k. For a model with no explicit input cap, a 200k context, and 32k max output, usable = 200000 − 32000 = 168000.

The prune() function (compaction.ts, line 50) removes old tool output text while preserving tool structure:

export const PRUNE_MINIMUM = 20_000 // Min tokens saved to justify pruning
export const PRUNE_PROTECT = 40_000 // Protect most recent 40k tokens of tool output
const PRUNE_PROTECTED_TOOLS = ["skill"] // Never prune these

The algorithm walks backward through message history, collecting completed tool calls. It protects the most recent 40k tokens of tool output and only applies pruning if at least 20k tokens would be freed. Pruned tool outputs are replaced with "[Old tool result content cleared]" in subsequent LLM calls (line 620 of message-v2.ts), and part.state.time.compacted = Date.now() marks the pruning timestamp.
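A simplified model of that backward walk (illustrative Python, not OpenCode's TypeScript; tool results are `(tool_name, output)` pairs and `estimate` is the 4-chars/token heuristic):

```python
PRUNE_MINIMUM = 20_000          # min tokens saved to justify pruning
PRUNE_PROTECT = 40_000          # protect most recent 40k tokens of output
PRUNE_PROTECTED_TOOLS = {"skill"}

def select_prunable(tool_results, estimate):
    """Walk backward over completed tool results; the most recent
    PRUNE_PROTECT tokens stay intact, protected tools are never pruned,
    and pruning only happens if savings meet PRUNE_MINIMUM."""
    protected, prunable = 0, []
    for name, output in reversed(tool_results):
        tokens = estimate(output)
        if protected < PRUNE_PROTECT:
            protected += tokens      # recent output stays intact
        elif name not in PRUNE_PROTECTED_TOOLS:
            prunable.append((name, tokens))
    savings = sum(t for _, t in prunable)
    return prunable if savings >= PRUNE_MINIMUM else []

results = [("bash", "x" * 100_000), ("skill", "y" * 100_000), ("read", "z" * 200_000)]
print([name for name, _ in select_prunable(results, lambda s: len(s) // 4)])  # ['bash']
```

The oldest `bash` output is selected for clearing; the newest `read` output falls inside the protected window, and the `skill` output is exempt by tool name.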

When overflow is detected after a model response, compaction triggers (prompt.ts, line 542):

if (
  lastFinished &&
  lastFinished.summary !== true &&
  (await SessionCompaction.isOverflow({ tokens: lastFinished.tokens, model }))
) {
  await SessionCompaction.create({
    sessionID,
    agent: lastUser.agent,
    model: lastUser.model,
    auto: true,
  })
  continue
}

The compaction process (compaction.ts, line 101):

  1. Converts the full message history into model messages via MessageV2.toModelMessages().
  2. Sends a structured summarization prompt asking for: Goal, Instructions, Discoveries, Accomplished work, and Relevant files/directories.
  3. The LLM generates a summary with zero tools available (summary-only mode).
  4. The summary becomes a new message in the history, and filterCompacted() ensures subsequent turns only see messages from the last compaction point forward.

Compaction is an explicit conversation artifact — it appears in the message history and the user can see it. This differs from Aider’s transparent summarization.

After compaction, filterCompacted() (message-v2.ts, line 794) filters the message stream:

export async function filterCompacted(stream: AsyncIterable<MessageV2.WithParts>) {
  const result = [] as MessageV2.WithParts[]
  const completed = new Set<string>()
  for await (const msg of stream) {
    result.push(msg)
    if (
      msg.info.role === "user" &&
      completed.has(msg.info.id) &&
      msg.parts.some((part) => part.type === "compaction")
    )
      break
    if (msg.info.role === "assistant" && msg.info.summary && msg.info.finish)
      completed.add(msg.info.parentID)
  }
  result.reverse()
  return result
}

This returns only messages from the last compaction point forward, ensuring the LLM sees the summary plus recent context rather than the full uncompacted history.

Users can tune compaction behavior in opencode.json:

{
  "compaction": {
    "auto": true,       // Enable/disable auto-compaction
    "reserved": 20000,  // Reserved token buffer
    "prune": true       // Enable/disable tool output pruning
  }
}

A plugin hook (experimental.session.compacting) can inject custom context or replace the compaction prompt entirely.


Claude Code

Claude Code’s token budgeting is architecturally distinct from the other three references because it delegates most budget enforcement to the Anthropic API’s server-side features rather than implementing client-side budget partitioning. The client tracks costs and context usage for display purposes, while the API handles overflow prevention.

Claude Code uses the Anthropic token counting API (/v1/messages/count_tokens) — a free endpoint that accepts the same input format as the Messages API and returns the exact input token count:

POST /v1/messages/count_tokens
{ "model": "claude-4-6-opus", "system": "...", "messages": [...], "tools": [...] }
Response: { "input_tokens": 15234 }

Key properties:

  • Free — no cost, separate rate limits from message creation (100-8000 RPM by tier)
  • Supports all input types: system prompts, tools, images, PDFs, extended thinking
  • Extended thinking: previous turn thinking blocks are ignored and don’t count
  • Returns an estimate (may differ slightly from actual usage)
  • Supports context_management parameter to preview token count after context editing is applied

This eliminates the need for client-side token counting heuristics (Codex’s 4-byte estimate) or tokenizer dependencies (Aider’s litellm). The client can validate prompt size before sending.

| Model | Standard | Extended (Beta) |
| --- | --- | --- |
| Opus 4.6, Sonnet 4.6, Sonnet 4.5, Sonnet 4 | 200K | 1M (tier 4+, beta header context-1m-2025-08-07) |
| Haiku 4.5 | 200K | N/A |

Extended context pricing: 2x input, 1.5x output for requests exceeding 200K tokens. Separate rate limits.

Newer models (Sonnet 3.7+) return a validation error on overflow instead of silently truncating. This is a hard boundary — the client must stay within limits.

Claude models (Sonnet 4.6, Sonnet 4.5, Haiku 4.5) receive their token budget at session start and usage updates after each tool call:

<!-- At session start -->
<budget:token_budget>200000</budget:token_budget>
<!-- After each tool call -->
<system_warning>Token usage: 35000/200000; 165000 remaining</system_warning>

This is a model-training feature — Claude is trained to use this information for planning task execution within available context. The model can decide to be more concise, prioritize remaining work, or request compaction proactively.

Extended thinking has specific budget interactions:

  • Thinking blocks are output tokens, counted toward context window during generation
  • Previous turn thinking blocks are automatically stripped from subsequent turn inputs by the API
  • Effective calculation: context_window = (input_tokens - previous_thinking_tokens) + current_turn_tokens
  • Exception: during a tool use cycle, thinking blocks MUST be preserved until the cycle completes (API verifies via cryptographic signatures)

Thinking block management is handled server-side via the clear_thinking_20251015 context editing strategy:

| Configuration | Effect |
| --- | --- |
| Default (no strategy configured) | Keep only last turn’s thinking |
| keep: { type: "thinking_turns", value: N } | Keep last N turns |
| keep: "all" | Keep all (maximizes prompt cache hits) |

Claude Code uses the API’s context editing (beta context-management-2025-06-27) for graduated context management:

Tier 1: Tool result clearing (clear_tool_uses_20250919)

Clears oldest tool results when context exceeds threshold. Cheapest form of context management — no LLM summary call needed.

| Parameter | Default | Description |
| --- | --- | --- |
| trigger | 100K input tokens | When to activate |
| keep | 3 tool uses | Recent pairs to preserve |
| clear_at_least | None | Minimum tokens to clear per activation |
| exclude_tools | None | Never-clear tool names |
| clear_tool_inputs | false | Also clear tool call parameters |

Cleared results are replaced with placeholder text. The model sees that a tool was called but not what it returned.
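A request-level configuration for this tier might look like the following sketch. The field shapes are modeled on the parameters in the table above; treat the exact nesting (and the `read_file` tool name) as assumptions to verify against the current context-management beta documentation.

```json
{
  "context_management": {
    "edits": [
      {
        "type": "clear_tool_uses_20250919",
        "trigger": { "type": "input_tokens", "value": 100000 },
        "keep": { "type": "tool_uses", "value": 3 },
        "clear_at_least": { "type": "input_tokens", "value": 5000 },
        "exclude_tools": ["read_file"],
        "clear_tool_inputs": false
      }
    ]
  }
}
```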

Tier 2: Thinking block clearing (described above)

Tier 3: Server-side compaction (compact_20260112)

Full conversation summary when context editing isn’t sufficient. See Chat History for compaction details.

The API response includes what was cleared:

{
  "context_management": {
    "applied_edits": [
      { "type": "clear_thinking_20251015", "cleared_thinking_turns": 3, "cleared_input_tokens": 15000 },
      { "type": "clear_tool_uses_20250919", "cleared_tool_uses": 8, "cleared_input_tokens": 50000 }
    ]
  }
}

Claude Code exposes real-time context and cost data to users via the status line — a shell script receiving JSON on stdin:

  • context_window.used_percentage — calculated from input tokens only (input + cache_creation + cache_read)
  • context_window.context_window_size — 200K or 1M
  • context_window.current_usage — per-call breakdown (input, output, cache creation, cache read)
  • cost.total_cost_usd — accumulated session cost
  • exceeds_200k_tokens — fixed threshold indicator

The used_percentage reflects the last API call’s context state, not cumulative session totals. Cumulative totals (total_input_tokens, total_output_tokens) can exceed the context window because they sum across all calls in the session.

When context editing is enabled alongside the Anthropic memory tool, Claude receives an automatic warning when approaching the clearing threshold. This allows it to save important tool results to persistent memory files before they’re cleared from context.

Claude Code provides built-in cost tracking and organizational spend governance that integrates directly with token budgeting decisions.

Per-Session Cost Visibility

The /cost command surfaces per-session token usage and cost breakdown including total cost, API duration, wall duration, and code changes (lines added/removed). This gives developers immediate feedback on how context-heavy their session has become.

Observed Cost Baselines

  • Average cost: ~$6/developer/day, with <$12 covering 90th percentile usage
  • Monthly estimates: ~$100–200/developer/month with Sonnet 4.6
  • Background token usage: ~$0.04/session for automatic conversation summarization
  • Agent teams consume ~7x tokens compared to standard sessions (each teammate maintains its own context window as a separate Claude instance)

Workspace Spend Limits

Organization-level spend limits are configurable via the Anthropic Console. These act as a hard ceiling independent of per-session token budgets — even if the context window has room, the API will reject requests once the spend limit is reached.

Rate Limit Recommendations by Team Size

Recommended per-user rate limits (tokens per minute) decrease as team size grows, because only a fraction of users in a larger organization is active concurrently:

| Team Size | Recommended TPM |
| --- | --- |
| 1–5 users | 200k–300k |
| 6–50 users | 80k–120k |
| 51–200 users | 30k–50k |
| 200–500 users | 15k–25k |
| 500+ users | 10k–15k |

OpenTelemetry Cost Metrics

When OpenTelemetry export is enabled, Claude Code emits two cost-relevant metrics:

  • claude_code.cost.usage — USD cost with per-model breakdown attributes
  • claude_code.token.usage — token counts broken down by dimension: input, output, cacheRead, cacheCreation

These integrate with standard observability stacks (Prometheus, Grafana, Datadog) for organizational cost dashboards.

Programmatic Cost Reporting APIs

  • Usage & Cost Admin API — enables programmatic cost reporting with configurable aggregation buckets (1-minute, 1-hour, 1-day granularity)
  • Claude Code Usage Report API — provides per-user daily metrics including commits, lines of code, sessions, pull requests, and per-model cost breakdown

Cost Reduction Strategies

Several strategies reduce token consumption and cost, directly tied to token budgeting:

  1. Model selection — use Sonnet over Opus for routine tasks (lower per-token cost)
  2. /clear between tasks — resets context window, avoids paying for stale history
  3. Custom compaction instructions — tune what the summary preserves vs. discards
  4. Extended thinking budget tuning — default is 31,999 tokens; reducing this for simpler tasks saves output tokens
  5. Subagent delegation — offload verbose operations (large file reads, test runs) to subagents, keeping the primary context lean
  6. MCP tool search auto-activation — tools are only loaded into context when needed, not eagerly
  7. Code intelligence plugins — reduce token-heavy file reads by providing targeted symbol information
  8. Preprocessing hooks — filter or transform tool outputs before they enter the context window

| Aspect | Aider | Codex | OpenCode | Claude Code |
| --- | --- | --- | --- | --- |
| Token counting | Exact (litellm tokenizer) | Byte heuristic (4 bytes/token) | Char heuristic (4 chars/token) | Free API endpoint (exact) |
| Pre-send validation | check_tokens() warns user | 5% safety margin | None (post-response only) | Token counting API + validation error |
| Budget partitioning | Explicit (repo map 12.5%, history 6.25%) | Effective window 95% | COMPACTION_BUFFER 20K | Server-side (trigger thresholds) |
| Overflow response | Warn and ask user | Iterative oldest-item removal | Post-response compaction | Validation error + server-side editing/compaction |
| Tool output management | Part of history summarization | Middle-out truncation | Two-phase prune + summarize | Server-side clear_tool_uses |
| Thinking management | N/A | Encrypted reasoning (75% estimate) | Reasoning parts | Server-side clear_thinking |
| Context awareness | None | None | None | Model-level budget injection |
| Real-time monitoring | None | Token usage in TUI | None | Status line with JSON session data |
| Cost tracking | None | TokenUsage struct | Per-message cost | Status line cost.total_cost_usd |

Aider’s litellm-based exact counting is the most accurate but adds a dependency and is slower. Codex and OpenCode both use ~4 bytes/chars per token heuristics, which can be 15–25% off for non-English text, base64-encoded content, or heavily punctuated code. Codex mitigates this with the 5% safety margin on the context window. OpenCode mitigates by using actual API-reported tokens for the overflow decision and only using the heuristic for pruning.

Codex’s middle-out truncation inserts a "…N tokens truncated…" marker but doesn’t alert the user. If a large tool output is truncated, the model may not realize crucial information was cut from the middle. Aider warns the user explicitly and asks them to decide.

Aider’s recursive summarization can fail if the weak model is unavailable or returns garbage. The fallback is to keep the full unsummarized history, which may cause the next API call to exceed the context window. Aider logs a warning but doesn’t prevent the send.

OpenCode’s compaction runs after the model response, meaning the response that triggered overflow was already generated. The overflow detection uses the token count from that response. If the model’s next response would be much larger (e.g., a long code generation), the compaction may not free enough space.

Both Codex and OpenCode replace conversation history during compaction, which invalidates prompt caches. Aider’s approach of running summarization in a background thread and replacing done_messages atomically is more cache-friendly — the prefix (system + examples + repo) stays stable.

Aider’s repo map budget is clamped at 4096 tokens — only 3% of a 128k context window. For large repos, this may force the map to show only a handful of the most relevant files. The --map-tokens flag lets users override this, but the default is conservative.

Unlike Aider (which calls check_tokens() before sending), OpenCode doesn’t validate the total prompt size before the API call. It relies on post-response overflow detection. This means the first sign of trouble is either a successful-but-bloated response or an API error.

OpenCode’s tool output pruning replaces content with "[Old tool result content cleared]", but the tool call/result structure remains. The model still sees that a tool was called and returned “something” — it just can’t see what. This can confuse models that try to reference old tool results.

9. Server-Side Compaction Billing Complexity

Claude Code’s server-side compaction adds a separate sampling iteration for summary generation. The usage.iterations array shows both compaction and message costs, but the top-level input_tokens/output_tokens fields exclude compaction iterations. This means naive cost tracking that only reads top-level usage fields will undercount. Developers must sum across the full iterations array.
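A correct accounting sums across the iterations array rather than reading the top-level fields. The sketch below uses hypothetical field names modeled on the response shape described above:

```python
def billed_tokens(usage: dict) -> tuple:
    """Sum input/output across all sampling iterations so that
    compaction passes are not undercounted; fall back to the
    top-level fields when no iterations array is present."""
    iterations = usage.get("iterations") or [usage]
    inp = sum(it["input_tokens"] for it in iterations)
    out = sum(it["output_tokens"] for it in iterations)
    return inp, out

usage = {
    "input_tokens": 1_000, "output_tokens": 200,  # top-level: message iteration only
    "iterations": [
        {"input_tokens": 180_000, "output_tokens": 2_000},  # compaction pass
        {"input_tokens": 1_000, "output_tokens": 200},      # actual message
    ],
}
print(billed_tokens(usage))  # (181000, 2200)
```

Reading only the top-level fields here would report 1,000 input tokens while roughly 181,000 were actually billed.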

10. Token Counting Estimates Are Not Exact

The Anthropic token counting endpoint returns an estimate that may include system-added tokens (not billed). The actual message creation may use slightly different token counts. For budget-critical decisions near the context limit, the 1-2% margin of error matters.

11. Context Editing Invalidates Prompt Caches

Server-side tool result clearing invalidates prompt caches at the clearing point. The clear_at_least parameter exists specifically to amortize this cost — clearing only a few tokens isn’t worth the cache rebuild. But the interaction between clearing, caching, and compaction creates non-obvious cost tradeoffs.

12. Context Awareness Is Model-Specific

Only specific Claude models (Sonnet 4.6, Sonnet 4.5, Haiku 4.5) support context awareness with budget injection. Opus models and older Sonnet models don’t receive these signals, so they can’t self-manage their context usage. An agent targeting multiple providers can’t rely on this feature.

13. Rate Limits Are Shared, Not Per-User

Organization-wide rate limits are shared, not per-user. TPM recommendations decrease with team size because only a fraction of users is active concurrently in larger orgs. But coordinated events (training sessions, hackathons) break this assumption and require temporary TPM increases. A team of 50 that normally works fine at 100k TPM will hit rate limits hard when 30 people start a workshop simultaneously.

14. Agent Teams Multiply Cost Non-Obviously

Agent teams use ~7x more tokens than standard sessions. Each teammate maintains its own context window and runs as a separate Claude instance. The cost is roughly proportional to team size times session length, making teams expensive for long-running tasks. A 3-agent team running for an hour can easily consume more tokens than a full day of single-agent usage.


Use a tiered approach:

  1. Fast path: Byte-based heuristic (len / 4) for real-time budget checks during message assembly. Zero dependencies, constant time.
  2. Accurate path: tiktoken-rs crate for pre-send validation. Call once before the API request with the fully assembled prompt. If the accurate count exceeds the budget, trigger compaction before sending.
pub trait TokenCounter {
    fn estimate_fast(&self, text: &str) -> usize;
    fn count_exact(&self, messages: &[Message]) -> Result<usize>;
}

The fast path guides assembly decisions; the accurate path gates API calls. This avoids Codex’s problem of only discovering overflow after the API rejects the request.

Define explicit budget slices as a BudgetConfig struct:

pub struct BudgetConfig {
    pub system_prompt_reserve: usize,   // Fixed: measured at startup
    pub repo_map_fraction: f32,         // Default: 0.125 (12.5%, matching Aider)
    pub repo_map_max: usize,            // Default: 8192 tokens
    pub repo_map_min: usize,            // Default: 1024 tokens
    pub chat_history_fraction: f32,     // Default: 0.0625 (6.25%)
    pub chat_history_max: usize,        // Default: 8192 tokens
    pub safety_margin_fraction: f32,    // Default: 0.05 (5%, matching Codex)
    pub output_reserve: usize,          // Default: 32000 tokens
}

Budget allocation order:

  1. Output reserve — subtracted first (model needs space to respond).
  2. Safety margin — 5% of remaining for estimation error.
  3. System prompt — measured once, cached.
  4. Repo map — dynamic, computed from remaining space (clamped to min/max).
  5. Chat files — added files consume from the remaining pool.
  6. Chat history — fills remaining space up to its cap, summarized if overflowing.
  7. Current message + reminder — always included; if they don’t fit, error.

Adopt Aider’s ChatChunks pattern with Rust types:

pub struct PromptChunks {
    pub system: Vec<Message>,
    pub repo_map: Vec<Message>,
    pub read_only_files: Vec<Message>,
    pub chat_history: Vec<Message>, // May be summarized
    pub chat_files: Vec<Message>,
    pub current: Vec<Message>,
    pub reminder: Vec<Message>,
}

Implement a three-tier approach:

  1. Tier 1: Tool output pruning (cheapest). Same concept as OpenCode — clear old tool results beyond a protected window. No LLM call needed.
  2. Tier 2: History summarization (moderate). Use the weak/editor model to summarize old chat_history messages. Run in a background tokio::task, swap atomically when complete. Mirror Aider’s recursive split-and-summarize.
  3. Tier 3: Session compaction (expensive). Full session summary via primary model. Only triggered if tiers 1–2 are insufficient.
| Crate | Purpose |
| --- | --- |
| `tiktoken-rs` | Exact token counting for pre-send validation |
| `serde_json` | Byte-based estimation via serialization (for fast path) |
| `tokio` | Background summarization tasks |

Before every API call, run a validation step:

```rust
async fn validate_prompt(&self, chunks: &PromptChunks) -> Result<(), BudgetError> {
    let total = self.counter.count_exact(&chunks.all_messages())?;
    let limit = self.model.effective_context_window();
    if total > limit {
        return Err(BudgetError::Overflow { total, limit });
    }
    Ok(())
}
```

On BudgetError::Overflow, attempt tier-1 pruning, then tier-2 summarization, then tier-3 compaction. If all fail, report the error to the user with actionable suggestions (drop files, clear history).
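That recovery cascade can be sketched as a loop over tier functions, re-counting after each one and stopping at the cheapest tier that fits. The function-pointer signature and the error message are illustrative, not an existing API.

```rust
// Run tiers in order (prune -> summarize -> compact). Each tier mutates the
// simulated token total; a real implementation would mutate the prompt
// chunks and re-run the exact counter.
fn recover_from_overflow(
    total: &mut usize,
    limit: usize,
    tiers: &[fn(&mut usize)],
) -> Result<(), String> {
    for tier in tiers {
        tier(total); // apply the next-cheapest reduction
        if *total <= limit {
            return Ok(()); // stop at the first tier that fits
        }
    }
    // Every tier failed: surface actionable suggestions to the user.
    Err("context overflow even after compaction: drop files or clear history".to_string())
}
```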

Provider-Aware Context Management (from Claude Code)


Support both server-side and client-side context management depending on the provider:

```rust
pub enum ContextManagementStrategy {
    /// Provider handles context management server-side (Anthropic)
    ServerSide {
        tool_clearing: Option<ToolClearingConfig>,
        thinking_clearing: Option<ThinkingClearingConfig>,
        compaction: Option<ServerCompactionConfig>,
    },
    /// Client handles context management (OpenAI, local models)
    ClientSide {
        prune_config: PruneConfig,
        summarization: SummarizationConfig,
    },
}

pub struct ToolClearingConfig {
    pub trigger_tokens: usize,      // default 100_000
    pub keep_tool_uses: usize,      // default 3
    pub clear_at_least: Option<usize>,
    pub exclude_tools: Vec<String>,
    pub clear_tool_inputs: bool,    // default false
}

pub struct ThinkingClearingConfig {
    pub keep: ThinkingKeepPolicy,   // Turns(N) or All
}

pub struct ServerCompactionConfig {
    pub trigger_tokens: usize,      // default 150_000
    pub pause_after: bool,
    pub instructions: Option<String>,
}
```

Use a three-tier token-counting approach (expanding the fast/accurate dual-mode design):

```rust
pub trait TokenCounter {
    /// Tier 1: Fast estimate for assembly decisions (no dependencies, constant time)
    fn estimate_fast(&self, text: &str) -> usize;
    /// Tier 2: Provider API token count (free for Anthropic, may not exist for others)
    async fn count_api(&self, messages: &[Message]) -> Result<Option<usize>>;
    /// Tier 3: Local exact count via tokenizer (tiktoken-rs fallback)
    fn count_exact(&self, messages: &[Message]) -> Result<usize>;
}
```

Pre-send validation order:

  1. If provider API supports token counting (Anthropic) → use count_api()
  2. Else → use count_exact() with local tokenizer
  3. Fast estimate only used during message assembly, never for final validation
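Two pieces of that stack are small enough to sketch. `estimate_fast` below uses the common roughly-4-bytes-per-token heuristic (an assumption, acceptable only because it never gates the final send), and `validated_count` encodes the dispatch order: provider count when available, local tokenizer otherwise. Both names are illustrative.

```rust
// Tier 1: byte-based heuristic, used only during assembly. Rounds up so a
// non-empty string never estimates to zero tokens.
fn estimate_fast(text: &str) -> usize {
    (text.len() + 3) / 4
}

// Pre-send dispatch: trust the provider's count if the API returned one,
// otherwise fall back to the local exact tokenizer. The closure defers the
// (potentially expensive) local count until it is actually needed.
fn validated_count(api_count: Option<usize>, exact: impl FnOnce() -> usize) -> usize {
    api_count.unwrap_or_else(exact)
}
```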

When the provider supports model-level context awareness, inject budget information:

```rust
pub struct ContextBudgetInjection {
    /// Total context budget (injected once at session start)
    pub total_budget: usize,
    /// Whether to inject usage updates after tool calls
    pub inject_usage_updates: bool,
}
```

For providers without native context awareness, OpenOxide can simulate it by injecting synthetic system messages with token usage information.
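One possible shape for such a synthetic message; the tag name and wording are invented here for illustration, not a defined protocol.

```rust
// Render a synthetic usage update for injection as a system message.
// Percentages use f64 so the display rounding is predictable.
fn usage_update(used: usize, total: usize) -> String {
    let pct = used as f64 / total as f64 * 100.0;
    format!(
        "<context_usage>{} of {} tokens used ({:.1}%); {} remaining</context_usage>",
        used,
        total,
        pct,
        total.saturating_sub(used)
    )
}
```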

Expose budget state for TUI display (inspired by Claude Code’s status line):

```rust
pub struct BudgetSnapshot {
    pub context_window_size: usize,
    pub used_tokens: usize,
    pub used_percentage: f32,
    pub remaining_tokens: usize,
    pub session_cost_usd: f64,
    pub session_duration_ms: u64,
    pub total_input_tokens: usize,   // cumulative across session
    pub total_output_tokens: usize,
    pub last_call_usage: Option<CallUsage>,
}

pub struct CallUsage {
    pub input_tokens: usize,
    pub output_tokens: usize,
    pub cache_creation_tokens: usize,
    pub cache_read_tokens: usize,
}
```

Update snapshot after each API response. TUI renders from this struct.
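The per-response update might look like the sketch below, with the snapshot trimmed to the fields the update touches. Treating the last call's input plus output as the current context occupancy is an assumption of this sketch.

```rust
// Trimmed snapshot: only the fields the per-response update writes.
struct BudgetSnapshot {
    context_window_size: usize,
    used_tokens: usize,
    used_percentage: f32,
    remaining_tokens: usize,
    total_input_tokens: usize,
    total_output_tokens: usize,
}

struct CallUsage {
    input_tokens: usize,
    output_tokens: usize,
}

impl BudgetSnapshot {
    fn new(context_window_size: usize) -> Self {
        Self {
            context_window_size,
            used_tokens: 0,
            used_percentage: 0.0,
            remaining_tokens: context_window_size,
            total_input_tokens: 0,
            total_output_tokens: 0,
        }
    }

    fn apply(&mut self, call: &CallUsage) {
        // Cumulative counters grow monotonically across the session.
        self.total_input_tokens += call.input_tokens;
        self.total_output_tokens += call.output_tokens;
        // Current occupancy: the last prompt plus the last response.
        self.used_tokens = call.input_tokens + call.output_tokens;
        self.remaining_tokens = self.context_window_size.saturating_sub(self.used_tokens);
        self.used_percentage =
            self.used_tokens as f32 / self.context_window_size as f32 * 100.0;
    }
}
```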

```rust
pub struct CostGovernance {
    /// Per-session cost tracking
    pub session_cost: SessionCost,
    /// Organization-level spend limits
    pub spend_limits: Option<SpendLimits>,
    /// OpenTelemetry metrics export
    pub otel_config: Option<OtelConfig>,
}

pub struct SessionCost {
    pub total_cost_usd: f64,
    pub per_model_breakdown: HashMap<String, ModelCost>,
    pub api_duration_ms: u64,
    pub wall_duration_ms: u64,
    pub lines_added: usize,
    pub lines_removed: usize,
}

pub struct ModelCost {
    pub input_tokens: usize,
    pub output_tokens: usize,
    pub cache_read_tokens: usize,
    pub cache_creation_tokens: usize,
    pub estimated_cost_usd: f64,
}

pub struct OtelConfig {
    pub metrics_exporter: Vec<MetricsExporter>,          // otlp, prometheus, console
    pub logs_exporter: Vec<LogsExporter>,                // otlp, console
    pub endpoint: String,
    pub export_interval_ms: u64,                         // default 60_000
    pub resource_attributes: HashMap<String, String>,    // team/dept tagging
}
```

Design decisions:

  1. Session-level cost tracking as first-class data, not derived. Every API call updates the session’s cost snapshot immediately. The SessionCost struct is the single source of truth for the /cost command equivalent in OpenOxide.
  2. Per-model breakdown in cost tracking. Users switch models mid-session; cost attribution must track which model generated which tokens. The per_model_breakdown map is keyed by model identifier and accumulates across the full session.
  3. OpenTelemetry export as optional module. When enabled, emit 8 metrics (session.count, lines_of_code.count, pull_request.count, commit.count, cost.usage, token.usage, code_edit_tool.decision, active_time.total) matching Claude Code’s metric names for dashboard compatibility. The OtelConfig supports multiple exporter backends and custom resource attributes for team/department tagging.
  4. Rate limit awareness. The token counter should expose TPM consumption rate so the TUI can display rate limit headroom alongside context window usage. This is especially important for teams where shared rate limits create contention that doesn’t show up in per-session context metrics.
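Decision 2 (per-model attribution) reduces to a keyed accumulation. A minimal sketch, with the cost model trimmed to three fields and the per-million-token prices passed in as placeholder parameters:

```rust
use std::collections::HashMap;

// Trimmed per-model cost entry; cache-token fields omitted for brevity.
#[derive(Default)]
struct ModelCost {
    input_tokens: usize,
    output_tokens: usize,
    estimated_cost_usd: f64,
}

// Accumulate one API call into the breakdown, keyed by model identifier,
// so cost attribution survives mid-session model switches.
fn record_call(
    breakdown: &mut HashMap<String, ModelCost>,
    model: &str,
    input: usize,
    output: usize,
    usd_per_m_input: f64,
    usd_per_m_output: f64,
) {
    let entry = breakdown.entry(model.to_string()).or_default();
    entry.input_tokens += input;
    entry.output_tokens += output;
    entry.estimated_cost_usd +=
        input as f64 / 1e6 * usd_per_m_input + output as f64 / 1e6 * usd_per_m_output;
}
```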