Token Budgeting
Feature Definition
Every AI coding agent operates within a fixed context window — the total number of tokens the model can see in a single request. A 200k-token model sounds generous until you start filling it: the system prompt takes a slice, the repo map takes another, chat history accumulates, read-only files add bulk, and each tool result (file contents, shell output, diagnostics) consumes more. Without deliberate budget management, the agent silently drops information or hits a hard API error when the window overflows.
Token budgeting is the discipline of partitioning this finite resource across competing demands. The core questions are:
- How much space does each component get? — Fixed allocation vs. dynamic adjustment vs. first-come-first-served.
- What gets cut first when the budget overflows? — Oldest chat history? Repo map shrinkage? Tool output truncation?
- How are tokens counted? — Exact tokenization (tiktoken) vs. byte-based heuristics vs. character estimation.
- When does compaction trigger? — Pre-emptive before the API call, or reactive after an overflow error?
Getting this wrong produces some of the worst user experiences in coding agents: context silently drifts (the model “forgets” earlier edits), the repo map disappears when files are large, or the model hallucinates because it lost track of what was already changed. For history compaction strategies and message-pruning behavior, see Chat History. For provider-side prefix reuse economics, see Prompt Caching.
Aider Implementation
Reference commit: b9050e1d5faf8096eae7a46a9ecc05a86231384b
Aider uses a chunked message assembly model with explicit sub-budgets for the repo map and chat history, while letting other components (system prompt, file content, examples) consume space on a first-come basis. Token counting is delegated to litellm.
Token Counting
Aider counts tokens via litellm.token_counter(), which dispatches to the correct tokenizer (tiktoken for OpenAI models, model-specific tokenizers for others):
```python
# aider/models.py, line 635
def token_count(self, messages):
    if type(messages) is list:
        try:
            return litellm.token_counter(model=self.name, messages=messages)
        except Exception as err:
            print(f"Unable to count tokens: {err}")
```

This is exact tokenization — not a heuristic — giving Aider the most accurate client-side budget calculations of the tools compared here.
ChatChunks: The Message Assembly Order
All messages are assembled in format_chat_chunks() (aider/coders/base_coder.py, line 1226) into a ChatChunks dataclass (aider/coders/chat_chunks.py):
```python
@dataclass
class ChatChunks:
    system: List          # System prompt (main_sys + system_reminder)
    examples: List        # Few-shot example conversations
    readonly_files: List  # --read files (AGENTS.md, etc.)
    repo: List            # Repo map output
    done: List            # Completed chat history (may be summarized)
    chat_files: List      # Files currently in the chat (/add'd files)
    cur: List             # Current user message
    reminder: List        # System reminder (repeated at end)
```

The all_messages() method concatenates them in a fixed order: system → examples → readonly_files → repo → done → chat_files → cur → reminder. This order is significant for prompt caching — the stable prefix (system + examples + readonly_files + repo) stays cache-hot across turns.
Repo Map Budget
The repo map gets a dedicated token budget computed from the model’s input limit:
```python
# aider/models.py, line 767
def get_repo_map_tokens(self):
    map_tokens = 1024
    max_inp_tokens = self.info.get("max_input_tokens")
    if max_inp_tokens:
        map_tokens = max_inp_tokens / 8     # 12.5% of input window
        map_tokens = min(map_tokens, 4096)  # Hard cap at 4096
        map_tokens = max(map_tokens, 1024)  # Floor at 1024
    return map_tokens
```

For a 128k-token model: 128000 / 8 = 16000, clamped to 4096. For a 4k-token model: 4000 / 8 = 500, raised to 1024. The repo map generation in RepoMap.get_ranked_tags_map_uncached() (aider/repomap.py, line 629) uses binary search to find the maximum number of ranked tags that fit within this budget, counting tokens on each candidate tree via self.token_count(tree).
When no files are in the chat, the map budget is multiplied by --map-multiplier-no-files (default 2), giving the model twice the codebase overview.
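The binary-search fitting described above can be sketched as follows — a hedged illustration in which the `render_tree` and `token_count` callables are hypothetical stand-ins for Aider's internals, not the actual RepoMap code:

```python
def best_tag_count(tags, render_tree, token_count, budget):
    """Binary search for the largest number of ranked tags whose rendered
    tree still fits the token budget (sketch of Aider's approach)."""
    lo, hi, best = 0, len(tags), 0
    while lo <= hi:
        mid = (lo + hi) // 2
        if token_count(render_tree(tags[:mid])) <= budget:
            best = mid        # fits: try to include more tags
            lo = mid + 1
        else:
            hi = mid - 1      # too big: shrink the candidate tree
    return best

# Toy demo: each tag renders to a 10-token line, budget allows 25 of them.
tags = list(range(100))
print(best_tag_count(tags, lambda ts: ts, lambda ts: 10 * len(ts), 255))
```

The search costs O(log n) render-and-count passes, which matters because each candidate tree must be tokenized in full.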
Chat History Budget and Summarization
Chat history gets a separate budget:
```python
# aider/models.py, line 346
self.max_chat_history_tokens = min(max(max_input_tokens / 16, 1024), 8192)
```

For a 128k model: 128000 / 16 = 8000, within the 8192 cap. For a 4k model, the 1024 floor applies. Users can override this via --max-chat-history-tokens.
When done_messages exceeds this budget, summarization kicks in asynchronously. The ChatSummary class (aider/history.py) implements a recursive split-and-summarize algorithm:
- Check overflow: `too_big()` sums token counts of all done messages; if the total exceeds `max_tokens`, proceed.
- Split: Walk backward from the end, accumulating tokens until half the budget is consumed. This becomes the “tail” (kept verbatim). Everything before the split is the “head.”
- Summarize head: Concatenate head messages into `# USER\n{content}\n# ASSISTANT\n{content}` blocks and send them to the weak model with a summarize prompt. The result is a single user message prefixed with `"[chat summary] "`.
- Recurse: If summary + tail still exceeds the budget, recurse with `depth + 1` (max depth 3).
The weak model (self.main_model.weak_model) handles summarization — typically a smaller, cheaper model. Summarization runs in a background thread (summarizer_thread) so it doesn’t block the user’s next message.
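The split-and-summarize recursion can be sketched roughly like this — illustrative names only, with a stub summarizer standing in for the weak-model call:

```python
def summarize_history(messages, count, summarize, max_tokens, depth=0):
    """Recursively compress chat history: keep a verbatim tail worth half
    the budget, summarize the head, and recurse if still too big (sketch)."""
    if sum(count(m) for m in messages) <= max_tokens or depth >= 3:
        return messages
    # Walk backward, keeping messages until half the budget is consumed.
    tail, used = [], 0
    for m in reversed(messages):
        if used + count(m) > max_tokens // 2:
            break
        tail.insert(0, m)
        used += count(m)
    head = messages[: len(messages) - len(tail)]
    summary = {"role": "user", "content": "[chat summary] " + summarize(head)}
    return summarize_history([summary] + tail, count, summarize,
                             max_tokens, depth + 1)

msgs = [{"role": "user", "content": "x" * 400} for _ in range(10)]
count = lambda m: len(m["content"]) // 4            # crude token estimate
summarize = lambda head: f"{len(head)} messages condensed"
compacted = summarize_history(msgs, count, summarize, max_tokens=500)
print(len(compacted))  # summary message + two verbatim tail messages
```

The depth cap bounds the recursion when the summarizer cannot compress enough, mirroring Aider's max depth of 3.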
Overflow Check
Before sending, check_tokens() (line 1396) compares the total token count of the assembled messages against max_input_tokens. If it exceeds the limit, Aider warns the user and asks whether to proceed — it does not auto-truncate. The user is expected to /drop files or /clear history.
Reminder Insertion
A system reminder is appended at the end of the message list only if there’s room (total_tokens < max_input_tokens). The reminder can be injected as a system role message or stuffed into the last user message, depending on the self.main_model.reminder setting.
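A minimal sketch of this conditional insertion, assuming hypothetical names (`with_reminder`, a crude character-based counter) rather than Aider's actual implementation:

```python
def with_reminder(messages, reminder, count, max_input_tokens, mode="sys"):
    """Append the reminder only if the prompt still fits; otherwise drop it.
    mode="sys" adds a system message; any other mode folds the reminder
    into the last user message (sketch, names illustrative)."""
    total = sum(count(m["content"]) for m in messages) + count(reminder)
    if total >= max_input_tokens:
        return messages                       # no room: skip the reminder
    if mode == "sys":
        return messages + [{"role": "system", "content": reminder}]
    out = [dict(m) for m in messages]         # copy, then fold into last msg
    out[-1]["content"] += "\n\n" + reminder
    return out

count = lambda s: len(s) // 4
msgs = [{"role": "user", "content": "fix the bug" * 10}]
print(len(with_reminder(msgs, "remember the edit format", count, 10_000)))
print(len(with_reminder(msgs, "remember the edit format", count, 10)))
```

Dropping the reminder silently is the cheapest degradation: the model loses a nudge, not task content.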
Prompt Cache Warming
Aider implements a cache-warming mechanism (line 1340) that sends minimal requests (max_tokens=1) every ~5 minutes to keep the prefix cached on Anthropic’s servers. Cache control headers are placed at the boundaries of examples, repo map, and chat files.
Codex Implementation
Reference commit: 4ab44e2c5cc54ed47e47a6729dfd8aa5a3dc2476
Codex uses a byte-based heuristic for token estimation, a percentage-based safety margin on the context window, and an iterative trim-from-front strategy when the window overflows. There is no dedicated repo map, so the budget partitioning is simpler than Aider’s.
Token Counting
Codex estimates tokens from byte counts rather than using a tokenizer:
```rust
// codex-rs/core/src/truncate.rs, line 10
const APPROX_BYTES_PER_TOKEN: usize = 4;

pub(crate) fn approx_token_count(text: &str) -> usize {
    let len = text.len();
    len.saturating_add(APPROX_BYTES_PER_TOKEN.saturating_sub(1)) / APPROX_BYTES_PER_TOKEN
}
```

This is a deliberate trade-off: no dependency on tiktoken, fast computation, and “close enough” for budget decisions. The 4-byte heuristic is reasonable for GPT tokenizers (typically 3.5–5 bytes per token for English code).
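The `saturating_add(APPROX_BYTES_PER_TOKEN.saturating_sub(1))` makes this a ceiling division, so even a one-byte string counts as a full token. The same rounding, re-derived in Python for illustration:

```python
def approx_token_count(text: str) -> int:
    """Ceiling division by 4 bytes, mirroring Codex's heuristic:
    (n + 3) // 4 rounds any partial token up to a whole one."""
    n = len(text.encode("utf-8"))
    return (n + 3) // 4

print(approx_token_count(""))       # 0 bytes -> 0 tokens
print(approx_token_count("a"))      # 1 byte  -> 1 token (floor would give 0)
print(approx_token_count("abcd"))   # 4 bytes -> 1 token
print(approx_token_count("abcde"))  # 5 bytes -> 2 tokens
```

Rounding up biases estimates slightly high, which is the safe direction for a budget check.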
For items in the conversation history, Codex estimates by serializing the item to JSON and counting the resulting bytes:
```rust
pub(crate) fn estimate_response_item_model_visible_bytes(item: &ResponseItem) -> i64 {
    match item {
        ResponseItem::GhostSnapshot { .. } => 0, // Not sent to model
        ResponseItem::Reasoning { encrypted_content: Some(content), .. }
        | ResponseItem::Compaction { encrypted_content: content } => {
            // Encrypted reasoning: estimate 75% of base64 length
            i64::try_from(estimate_reasoning_length(content.len())).unwrap_or(i64::MAX)
        }
        item => serde_json::to_string(item)
            .map(|serialized| i64::try_from(serialized.len()).unwrap_or(i64::MAX))
            .unwrap_or_default(),
    }
}
```

Effective Context Window
Each model defines a raw context_window and an effective_context_window_percent (default 95%):
```rust
pub(crate) fn model_context_window(&self) -> Option<i64> {
    self.model_info.context_window.map(|context_window| {
        context_window.saturating_mul(effective_context_window_percent) / 100
    })
}
```

A 200k model becomes 190k effective — the 5% margin absorbs API response overhead and estimation error.
Tool Output Truncation
When recording tool outputs, Codex applies a TruncationPolicy with a 20% serialization budget overhead:
```rust
// codex-rs/core/src/context_manager/history.rs, line 328
let policy_with_serialization_budget = policy * 1.2;
```

The truncation algorithm preserves prefix and suffix, cutting from the middle:
```rust
// codex-rs/core/src/truncate.rs, line 186
fn truncate_with_byte_estimate(s: &str, policy: TruncationPolicy) -> String {
    let (left_budget, right_budget) = split_budget(max_bytes); // 50/50 split
    // ... preserve start and end, insert "…N tokens truncated…" marker
}
```

This middle-out strategy preserves the beginning (often a header or summary) and the end (often the result or error message) while cutting repetitive middle content.
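A rough Python sketch of the middle-out idea — a hypothetical helper, not the Codex implementation, which budgets through TruncationPolicy and handles UTF-8 boundaries more carefully:

```python
def truncate_middle(s: str, max_bytes: int) -> str:
    """Keep the head and the tail, cut the middle, and mark how much was
    dropped (sketch of a middle-out truncation strategy)."""
    raw = s.encode("utf-8")
    if len(raw) <= max_bytes:
        return s
    left = max_bytes // 2                     # 50/50 split of the budget
    right = max_bytes - left
    cut_tokens = (len(raw) - max_bytes + 3) // 4
    marker = f"…{cut_tokens} tokens truncated…"
    head = raw[:left].decode("utf-8", errors="ignore")
    tail = raw[len(raw) - right:].decode("utf-8", errors="ignore")
    return head + marker + tail

out = truncate_middle("HEADER\n" + "x" * 1000 + "\nRESULT: ok", 40)
print(out.startswith("HEADER"), out.endswith("RESULT: ok"))
```

Keeping both ends is what makes the strategy safe for tool output, where the command line and the final status usually carry the signal.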
Auto-Compaction
Codex triggers compaction based on auto_compact_token_limit, a per-model threshold:
```rust
let auto_compact_limit = model_info.auto_compact_token_limit().unwrap_or(i64::MAX);

if total_usage_tokens >= auto_compact_limit {
    run_auto_compact(sess, turn_context).await?;
}
```

Compaction runs post-sampling (after the model finishes its response). The compaction flow (codex-rs/core/src/compact.rs, line 130):
- Extract trailing model-switch updates (preserved across compaction).
- Serialize the current history as a `Prompt` and send it to the LLM for summarization.
- Build the compacted history: initial context + compacted summary + ghost snapshots.
- If the compacted prompt still exceeds the context window (`CodexErr::ContextWindowExceeded`), iteratively `remove_first_item()` from history and retry.
The iterative trim removes the oldest items first, which preserves the prompt prefix for cache benefits.
Token Usage Tracking
Codex tracks actual API-reported token usage alongside its byte-based estimates:
```rust
pub struct TokenUsage {
    pub input_tokens: i64,
    pub cached_input_tokens: i64,
    pub output_tokens: i64,
    pub reasoning_output_tokens: i64,
    pub total_tokens: i64,
}

pub struct TokenUsageInfo {
    pub total_token_usage: TokenUsage,
    pub last_token_usage: TokenUsage,
    pub model_context_window: Option<i64>,
}
```

The get_total_token_usage() method (line 264) combines the last API response’s total_tokens with byte-estimated tokens for items added since that response — a hybrid approach that uses real counts when available and falls back to heuristics for new items.
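The hybrid can be sketched as follows — an illustrative helper, not the Codex signature:

```python
def total_tokens(last_api_total: int, new_items: list) -> int:
    """Hybrid count: trust the last API-reported total, then add ceiling
    byte estimates (4 bytes/token) for items appended since that response."""
    est = sum((len(item.encode("utf-8")) + 3) // 4 for item in new_items)
    return last_api_total + est

# After a response reporting 12_000 tokens, two tool outputs were appended:
print(total_tokens(12_000, ["ls output: 400 bytes" * 20, "ok"]))
```

The design accepts drift between calls: the estimate is corrected by the next API-reported total, so the heuristic only has to be good enough to avoid overflow in the interval between responses.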
OpenCode Implementation
Reference commit: 7ed449974864361bad2c1f1405769fd2c2fcdf42
OpenCode uses a character-based estimation for pruning decisions, actual API-reported tokens for overflow detection, and a two-mechanism approach: tool output pruning (remove old tool results) and session compaction (LLM-generated summary).
Token Counting
OpenCode uses the simplest estimation of the tools compared here:
```typescript
export namespace Token {
  const CHARS_PER_TOKEN = 4

  export function estimate(input: string) {
    return Math.max(0, Math.round((input || "").length / CHARS_PER_TOKEN))
  }
}
```

This character-to-token estimate is used only for pruning decisions. For overflow detection, OpenCode uses actual token counts from the LLM API response:
```typescript
// packages/opencode/src/session/message-v2.ts, line 246
tokens: z.object({
  total: z.number().optional(),
  input: z.number(),
  output: z.number(),
  reasoning: z.number(),
  cache: z.object({ read: z.number(), write: z.number() }),
})
```

Overflow Detection
The isOverflow() function (packages/opencode/src/session/compaction.ts, line 32) determines when compaction is needed:
```typescript
export async function isOverflow(input: {
  tokens: MessageV2.Assistant["tokens"]
  model: Provider.Model
}) {
  const config = await Config.get()
  if (config.compaction?.auto === false) return false
  const context = input.model.limit.context
  if (context === 0) return false

  const count =
    input.tokens.total ||
    input.tokens.input +
      input.tokens.output +
      input.tokens.cache.read +
      input.tokens.cache.write

  const reserved =
    config.compaction?.reserved ??
    Math.min(COMPACTION_BUFFER, ProviderTransform.maxOutputTokens(input.model))
  const usable = input.model.limit.input
    ? input.model.limit.input - reserved
    : context - ProviderTransform.maxOutputTokens(input.model)
  return count >= usable
}
```

Key constants: COMPACTION_BUFFER = 20_000 tokens. OUTPUT_TOKEN_MAX = 32_000. The usable budget is either limit.input - reserved (if the model has a specific input cap) or context - maxOutputTokens(model).
For a 200k-context model with 32k output: usable = 200000 - 32000 - 20000 = 148000. Overflow triggers when accumulated tokens reach 148k.
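That arithmetic, sketched as a small helper — hypothetical and following the worked example above; the real isOverflow also consults config overrides and model limits:

```python
COMPACTION_BUFFER = 20_000

def usable_budget(context, max_output, input_limit=None, reserved=None):
    """Usable-token budget per the worked example: reserve output space
    and a compaction buffer, or honor an explicit input cap (sketch)."""
    if reserved is None:
        reserved = min(COMPACTION_BUFFER, max_output)
    if input_limit:
        return input_limit - reserved          # model has a specific input cap
    return context - max_output - reserved     # derive from the full window

print(usable_budget(200_000, 32_000))  # 200000 - 32000 - 20000 = 148000
```

Overflow then reduces to a single comparison: accumulated tokens >= usable budget.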
Tool Output Pruning
The prune() function (compaction.ts, line 50) removes old tool output text while preserving tool structure:
```typescript
export const PRUNE_MINIMUM = 20_000 // Min tokens saved to justify pruning
export const PRUNE_PROTECT = 40_000 // Protect most recent 40k tokens of tool output
const PRUNE_PROTECTED_TOOLS = ["skill"] // Never prune these
```

The algorithm walks backward through message history, collecting completed tool calls. It protects the most recent 40k tokens of tool output and only applies pruning if at least 20k tokens would be freed. Pruned tool outputs are replaced with "[Old tool result content cleared]" in subsequent LLM calls (line 620 of message-v2.ts), and part.state.time.compacted = Date.now() marks the pruning timestamp.
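A sketch of the protect-then-prune walk — illustrative only; real OpenCode operates on message parts with tool metadata, not plain strings:

```python
PRUNE_MINIMUM = 20_000
PRUNE_PROTECT = 40_000

def select_prunable(tool_outputs, estimate):
    """Walk newest-to-oldest, protect the most recent PRUNE_PROTECT tokens
    of tool output, and prune the rest only if at least PRUNE_MINIMUM
    tokens would be freed (sketch of the algorithm described above)."""
    protected = 0
    prunable = []
    for out in reversed(tool_outputs):        # newest first
        if protected < PRUNE_PROTECT:
            protected += estimate(out)        # inside the protected window
        else:
            prunable.append(out)              # old enough to clear
    freed = sum(estimate(o) for o in prunable)
    return prunable if freed >= PRUNE_MINIMUM else []

est = lambda s: len(s) // 4
outputs = ["x" * 60_000] * 5                  # five ~15k-token tool results
print(len(select_prunable(outputs, est)))
```

The minimum-savings check exists because pruning invalidates the prompt cache from the first cleared item onward; clearing a trivial amount is not worth the cache rebuild.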
Session Compaction
When overflow is detected after a model response, compaction triggers (prompt.ts, line 542):
```typescript
if (
  lastFinished &&
  lastFinished.summary !== true &&
  (await SessionCompaction.isOverflow({ tokens: lastFinished.tokens, model }))
) {
  await SessionCompaction.create({
    sessionID,
    agent: lastUser.agent,
    model: lastUser.model,
    auto: true,
  })
  continue
}
```

The compaction process (compaction.ts, line 101):
- Converts the full message history into model messages via `MessageV2.toModelMessages()`.
- Sends a structured summarization prompt asking for: Goal, Instructions, Discoveries, Accomplished work, and Relevant files/directories.
- The LLM generates the summary with zero tools available (summary-only mode).
- The summary becomes a new message in the history, and `filterCompacted()` ensures subsequent turns only see messages from the last compaction point forward.
Compaction is an explicit conversation artifact — it appears in the message history and the user can see it. This differs from Aider’s transparent summarization.
Message History Filtering
After compaction, filterCompacted() (message-v2.ts, line 794) filters the message stream:
```typescript
export async function filterCompacted(stream: AsyncIterable<MessageV2.WithParts>) {
  const result = [] as MessageV2.WithParts[]
  const completed = new Set<string>()
  for await (const msg of stream) {
    result.push(msg)
    if (
      msg.info.role === "user" &&
      completed.has(msg.info.id) &&
      msg.parts.some((part) => part.type === "compaction")
    )
      break
    if (msg.info.role === "assistant" && msg.info.summary && msg.info.finish)
      completed.add(msg.info.parentID)
  }
  result.reverse()
  return result
}
```

This returns only messages from the last compaction point forward, ensuring the LLM sees the summary plus recent context rather than the full uncompacted history.
Configuration
Users can tune compaction behavior in opencode.json:
```jsonc
{
  "compaction": {
    "auto": true,       // Enable/disable auto-compaction
    "reserved": 20000,  // Reserved token buffer
    "prune": true       // Enable/disable tool output pruning
  }
}
```

A plugin hook (experimental.session.compacting) can inject custom context or replace the compaction prompt entirely.
Claude Code Implementation
Claude Code’s token budgeting is architecturally distinct from the other three tools because it delegates most budget enforcement to the Anthropic API’s server-side features rather than implementing client-side budget partitioning. The client tracks costs and context usage for display purposes, while the API handles overflow prevention.
Token Counting: Free Pre-Send API
Claude Code uses the Anthropic token counting API (/v1/messages/count_tokens) — a free endpoint that accepts the same input format as the Messages API and returns the exact input token count:
```
POST /v1/messages/count_tokens
{
  "model": "claude-4-6-opus",
  "system": "...",
  "messages": [...],
  "tools": [...]
}

Response: { "input_tokens": 15234 }
```

Key properties:
- Free — no cost, separate rate limits from message creation (100-8000 RPM by tier)
- Supports all input types: system prompts, tools, images, PDFs, extended thinking
- Extended thinking: previous turn thinking blocks are ignored and don’t count
- Returns an estimate (may differ slightly from actual usage)
- Supports the `context_management` parameter to preview the token count after context editing is applied
This eliminates the need for client-side token counting heuristics (Codex’s 4-byte estimate) or tokenizer dependencies (Aider’s litellm). The client can validate prompt size before sending.
Context Window Sizes
| Model | Standard | Extended (Beta) |
|---|---|---|
| Opus 4.6, Sonnet 4.6, Sonnet 4.5, Sonnet 4 | 200K | 1M (tier 4+, beta header context-1m-2025-08-07) |
| Haiku 4.5 | 200K | N/A |
Extended context pricing: 2x input, 1.5x output for requests exceeding 200K tokens. Separate rate limits.
Newer models (Sonnet 3.7+) return a validation error on overflow instead of silently truncating. This is a hard boundary — the client must stay within limits.
Context Awareness (Model-Level)
Claude models (Sonnet 4.6, Sonnet 4.5, Haiku 4.5) receive their token budget at session start and usage updates after each tool call:
```xml
<!-- At session start -->
<budget:token_budget>200000</budget:token_budget>

<!-- After each tool call -->
<system_warning>Token usage: 35000/200000; 165000 remaining</system_warning>
```

This is a model-training feature — Claude is trained to use this information for planning task execution within available context. The model can decide to be more concise, prioritize remaining work, or request compaction proactively.
Extended Thinking and Token Budget
Extended thinking has specific budget interactions:
- Thinking blocks are output tokens, counted toward context window during generation
- Previous turn thinking blocks are automatically stripped from subsequent turn inputs by the API
- Effective calculation: `context_window = (input_tokens - previous_thinking_tokens) + current_turn_tokens`
- Exception: during a tool use cycle, thinking blocks MUST be preserved until the cycle completes (the API verifies this via cryptographic signatures)
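The effective calculation above, as a trivial worked sketch (hypothetical helper):

```python
def effective_context(input_tokens, previous_thinking, current_turn):
    """Prior-turn thinking is stripped by the API before counting, so only
    the remainder plus the current turn occupies the window (sketch)."""
    return (input_tokens - previous_thinking) + current_turn

# 120k of accumulated input, of which 30k was prior-turn thinking,
# plus a 10k current turn:
print(effective_context(120_000, 30_000, 10_000))  # 100000
```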
Thinking block management is handled server-side via the clear_thinking_20251015 context editing strategy:
| Configuration | Effect |
|---|---|
| Default (no strategy configured) | Keep only last turn’s thinking |
| `keep: { type: "thinking_turns", value: N }` | Keep last N turns |
| `keep: "all"` | Keep all (maximizes prompt cache hits) |
Server-Side Context Editing
Claude Code uses the API’s context editing (beta context-management-2025-06-27) for graduated context management:
Tier 1: Tool result clearing (clear_tool_uses_20250919)
Clears oldest tool results when context exceeds threshold. Cheapest form of context management — no LLM summary call needed.
| Parameter | Default | Description |
|---|---|---|
| `trigger` | 100K input tokens | When to activate |
| `keep` | 3 tool uses | Recent pairs to preserve |
| `clear_at_least` | None | Minimum tokens to clear per activation |
| `exclude_tools` | None | Never-clear tool names |
| `clear_tool_inputs` | false | Also clear tool call parameters |
Cleared results are replaced with placeholder text. The model sees that a tool was called but not what it returned.
Tier 2: Thinking block clearing (described above)
Tier 3: Server-side compaction (compact_20260112)
Full conversation summary when context editing isn’t sufficient. See Chat History for compaction details.
The API response includes what was cleared:
```json
{
  "context_management": {
    "applied_edits": [
      {
        "type": "clear_thinking_20251015",
        "cleared_thinking_turns": 3,
        "cleared_input_tokens": 15000
      },
      {
        "type": "clear_tool_uses_20250919",
        "cleared_tool_uses": 8,
        "cleared_input_tokens": 50000
      }
    ]
  }
}
```

Budget Monitoring via Status Line
Claude Code exposes real-time context and cost data to users via the status line — a shell script receiving JSON on stdin:
- `context_window.used_percentage` — calculated from input tokens only (input + cache_creation + cache_read)
- `context_window.context_window_size` — 200K or 1M
- `context_window.current_usage` — per-call breakdown (input, output, cache creation, cache read)
- `cost.total_cost_usd` — accumulated session cost
- `exceeds_200k_tokens` — fixed threshold indicator
The used_percentage reflects the last API call’s context state, not cumulative session totals. Cumulative totals (total_input_tokens, total_output_tokens) can exceed the context window because they sum across all calls in the session.
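Deriving used_percentage from a single call's usage can be sketched as follows — the field names mirror the status line JSON above, but the helper itself is hypothetical:

```python
def used_percentage(usage, window):
    """Input-side tokens only (input + cache creation + cache read);
    output tokens do not occupy the next request's input window (sketch)."""
    used = usage["input"] + usage["cache_creation"] + usage["cache_read"]
    return round(100 * used / window, 1)

last_call = {"input": 12_000, "cache_creation": 8_000,
             "cache_read": 140_000, "output": 4_000}
print(used_percentage(last_call, 200_000))  # 80.0
```

Note that the 4k output tokens are excluded: they were billed, but they only enter the window as input on the following turn.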
Memory Tool Integration
When context editing is enabled alongside the Anthropic memory tool, Claude receives an automatic warning when approaching the clearing threshold. This allows it to save important tool results to persistent memory files before they’re cleared from context.
Cost Tracking and Budget Governance
Claude Code provides built-in cost tracking and organizational spend governance that integrates directly with token budgeting decisions.
Per-Session Cost Visibility
The /cost command surfaces per-session token usage and cost breakdown including total cost, API duration, wall duration, and code changes (lines added/removed). This gives developers immediate feedback on how context-heavy their session has become.
Observed Cost Baselines
- Average cost: ~$6/developer/day, with <$12 covering 90th percentile usage
- Monthly estimates: ~$100–200/developer/month with Sonnet 4.6
- Background token usage: ~$0.04/session for automatic conversation summarization
- Agent teams consume ~7x tokens compared to standard sessions (each teammate maintains its own context window as a separate Claude instance)
Workspace Spend Limits
Organization-level spend limits are configurable via the Anthropic Console. These act as a hard ceiling independent of per-session token budgets — even if the context window has room, the API will reject requests once the spend limit is reached.
Rate Limit Recommendations by Team Size
Rate limits (tokens per minute) scale inversely with team size because fewer users are concurrent in larger organizations:
| Team Size | Recommended TPM |
|---|---|
| 1–5 users | 200k–300k TPM |
| 6–50 users | 80k–120k TPM |
| 51–200 users | 30k–50k TPM |
| 200–500 users | 15k–25k TPM |
| 500+ users | 10k–15k TPM |
OpenTelemetry Cost Metrics
When OpenTelemetry export is enabled, Claude Code emits two cost-relevant metrics:
- `claude_code.cost.usage` — USD cost with per-model breakdown attributes
- `claude_code.token.usage` — token counts broken down by dimension: `input`, `output`, `cacheRead`, `cacheCreation`
These integrate with standard observability stacks (Prometheus, Grafana, Datadog) for organizational cost dashboards.
Programmatic Cost Reporting APIs
- Usage & Cost Admin API — enables programmatic cost reporting with configurable aggregation buckets (1-minute, 1-hour, 1-day granularity)
- Claude Code Usage Report API — provides per-user daily metrics including commits, lines of code, sessions, pull requests, and per-model cost breakdown
Cost Reduction Strategies
Several strategies reduce token consumption and cost, directly tied to token budgeting:
- Model selection — use Sonnet over Opus for routine tasks (lower per-token cost)
- `/clear` between tasks — resets the context window, avoids paying for stale history
- Custom compaction instructions — tune what the summary preserves vs. discards
- Extended thinking budget tuning — default is 31,999 tokens; reducing this for simpler tasks saves output tokens
- Subagent delegation — offload verbose operations (large file reads, test runs) to subagents, keeping the primary context lean
- MCP tool search auto-activation — tools are only loaded into context when needed, not eagerly
- Code intelligence plugins — reduce token-heavy file reads by providing targeted symbol information
- Preprocessing hooks — filter or transform tool outputs before they enter the context window
Comparison Table
| Aspect | Aider | Codex | OpenCode | Claude Code |
|---|---|---|---|---|
| Token counting | Exact (litellm tokenizer) | Byte heuristic (4 bytes/token) | Char heuristic (4 chars/token) | Free API endpoint (exact) |
| Pre-send validation | check_tokens() warns user | 5% safety margin | None (post-response only) | Token counting API + validation error |
| Budget partitioning | Explicit (repo map 12.5%, history 6.25%) | Effective window 95% | COMPACTION_BUFFER 20K | Server-side (trigger thresholds) |
| Overflow response | Warn and ask user | Iterative oldest-item removal | Post-response compaction | Validation error + server-side editing/compaction |
| Tool output management | Part of history summarization | Middle-out truncation | Two-phase prune + summarize | Server-side clear_tool_uses |
| Thinking management | N/A | Encrypted reasoning (75% estimate) | Reasoning parts | Server-side clear_thinking |
| Context awareness | None | None | None | Model-level budget injection |
| Real-time monitoring | None | Token usage in TUI | None | Status line with JSON session data |
| Cost tracking | None | TokenUsage struct | Per-message cost | Status line cost.total_cost_usd |
Pitfalls & Hard Lessons
1. Heuristic vs. Exact Token Counting
Aider’s litellm-based exact counting is the most accurate but adds a dependency and is slower. Codex and OpenCode both use ~4 bytes/chars-per-token heuristics, which can be 15–25% off for non-English text, base64-encoded content, or heavily punctuated code. Codex mitigates this with the 5% safety margin on the context window. OpenCode mitigates it by using actual API-reported tokens for the overflow decision and the heuristic only for pruning.
2. Silent Truncation
Codex’s middle-out truncation inserts a "…N tokens truncated…" marker but doesn’t alert the user. If a large tool output is truncated, the model may not realize crucial information was cut from the middle. Aider warns the user explicitly and asks them to decide.
3. Chat History Summarization Failures
Aider’s recursive summarization can fail if the weak model is unavailable or returns garbage. The fallback is to keep the full unsummarized history, which may cause the next API call to exceed the context window. Aider logs a warning but doesn’t prevent the send.
4. Compaction Timing
OpenCode’s compaction runs after the model response, meaning the response that triggered overflow was already generated. The overflow detection uses the token count from that response. If the model’s next response would be much larger (e.g., a long code generation), the compaction may not free enough space.
5. Cache Invalidation on Compaction
Both Codex and OpenCode replace conversation history during compaction, which invalidates prompt caches. Aider’s approach of running summarization in a background thread and replacing done_messages atomically is more cache-friendly — the prefix (system + examples + repo) stays stable.
6. Repo Map Squeeze
Aider’s repo map budget is clamped at 4096 tokens — only 3% of a 128k context window. For large repos, this may force the map to show only a handful of the most relevant files. The --map-tokens flag lets users override this, but the default is conservative.
7. No Pre-Send Validation in OpenCode
Unlike Aider (which calls check_tokens() before sending), OpenCode doesn’t validate the total prompt size before the API call. It relies on post-response overflow detection. This means the first sign of trouble is either a successful-but-bloated response or an API error.
8. Pruning Doesn’t Reduce Model Memory
OpenCode’s tool output pruning replaces content with "[Old tool result content cleared]", but the tool call/result structure remains. The model still sees that a tool was called and returned “something” — it just can’t see what. This can confuse models that try to reference old tool results.
9. Server-Side Compaction Billing Complexity
Claude Code’s server-side compaction adds a separate sampling iteration for summary generation. The usage.iterations array shows both compaction and message costs, but the top-level input_tokens/output_tokens fields exclude compaction iterations. This means naive cost tracking that only reads top-level usage fields will undercount. Developers must sum across the full iterations array.
10. Token Counting Estimates Are Not Exact
The Anthropic token counting endpoint returns an estimate that may include system-added tokens (not billed). The actual message creation may use slightly different token counts. For budget-critical decisions near the context limit, the 1-2% margin of error matters.
11. Cache Invalidation on Context Editing
Server-side tool result clearing invalidates prompt caches at the clearing point. The clear_at_least parameter exists specifically to amortize this cost — clearing only a few tokens isn’t worth the cache rebuild. But the interaction between clearing, caching, and compaction creates non-obvious cost tradeoffs.
12. Context Awareness Is Model-Specific
Only specific Claude models (Sonnet 4.6, Sonnet 4.5, Haiku 4.5) support context awareness with budget injection. Opus models and older Sonnet models don’t receive these signals, so they can’t self-manage their context usage. An agent targeting multiple providers can’t rely on this feature.
13. Rate Limit Scaling Is Non-Linear
Organization-wide rate limits are shared, not per-user. TPM recommendations decrease with team size because fewer users are concurrent in larger orgs. But coordinated events (training sessions, hackathons) break this assumption and require temporary TPM increases. A team of 50 that normally works fine at 100k TPM will hit rate limits hard when 30 people start a workshop simultaneously.
14. Agent Teams Multiply Cost Non-Obviously
Agent teams use ~7x more tokens than standard sessions. Each teammate maintains its own context window and runs as a separate Claude instance. The cost is roughly proportional to team size times session length, making teams expensive for long-running tasks. A 3-agent team running for an hour can easily consume more tokens than a full day of single-agent usage.
OpenOxide Blueprint
Token Counting: Dual-Mode
Use a tiered approach:
- Fast path: Byte-based heuristic (`len / 4`) for real-time budget checks during message assembly. Zero dependencies, constant time.
- Accurate path: the `tiktoken-rs` crate for pre-send validation. Call it once before the API request with the fully assembled prompt. If the accurate count exceeds the budget, trigger compaction before sending.
```rust
pub trait TokenCounter {
    fn estimate_fast(&self, text: &str) -> usize;
    fn count_exact(&self, messages: &[Message]) -> Result<usize>;
}
```

The fast path guides assembly decisions; the accurate path gates API calls. This avoids Codex’s problem of only discovering overflow after the API rejects the request.
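A minimal sketch of the fast path, assuming the ~4 bytes-per-token heuristic described above; `HeuristicCounter` is a hypothetical name, not part of the spec:

```rust
/// Hypothetical fast-path counter: approximates tokens as bytes / 4,
/// the common rule of thumb for English text and code.
pub struct HeuristicCounter;

impl HeuristicCounter {
    /// Constant-time estimate; no tokenizer dependency.
    pub fn estimate_fast(&self, text: &str) -> usize {
        // Round up so short strings never estimate to zero tokens.
        (text.len() + 3) / 4
    }
}
```

Rounding up biases toward overestimation, which is the safer direction when the estimate feeds budget decisions near the context limit.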
Budget Partitioning
Define explicit budget slices as a `BudgetConfig` struct:

```rust
pub struct BudgetConfig {
    pub system_prompt_reserve: usize,  // Fixed: measured at startup
    pub repo_map_fraction: f32,        // Default: 0.125 (12.5%, matching Aider)
    pub repo_map_max: usize,           // Default: 8192 tokens
    pub repo_map_min: usize,           // Default: 1024 tokens
    pub chat_history_fraction: f32,    // Default: 0.0625 (6.25%)
    pub chat_history_max: usize,       // Default: 8192 tokens
    pub safety_margin_fraction: f32,   // Default: 0.05 (5%, matching Codex)
    pub output_reserve: usize,         // Default: 32000 tokens
}
```

Budget allocation order:
1. Output reserve — subtracted first (model needs space to respond).
2. Safety margin — 5% of remaining for estimation error.
3. System prompt — measured once, cached.
4. Repo map — dynamic, computed from remaining space (clamped to min/max).
5. Chat files — added files consume from the remaining pool.
6. Chat history — fills remaining space up to its cap, summarized if overflowing.
7. Current message + reminder — always included; if they don’t fit, error.
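The allocation order above can be sketched as a pure function using the `BudgetConfig` defaults (integer division stands in for the fractional fields; the exact clamping math is an assumption, not a prescribed implementation):

```rust
pub struct Budgets {
    pub repo_map: usize,
    pub chat_history: usize,
    pub free_pool: usize, // remaining for chat files + current message
}

/// Sketch of allocation steps 1-7 with the default BudgetConfig values.
/// Illustrative only: real code would take the config as a parameter.
pub fn partition(window: usize, system_prompt: usize) -> Budgets {
    let after_output = window.saturating_sub(32_000); // 1. output reserve
    let safety = after_output / 20;                   // 2. 5% safety margin
    let usable = after_output
        .saturating_sub(safety)
        .saturating_sub(system_prompt);               // 3. system prompt
    let repo_map = (usable / 8).clamp(1_024, 8_192);  // 4. 12.5%, clamped
    let chat_history = (usable / 16).min(8_192);      // 6. 6.25%, capped
    // Steps 5 and 7 draw from whatever remains.
    let free_pool = usable.saturating_sub(repo_map + chat_history);
    Budgets { repo_map, chat_history, free_pool }
}
```

For a 200k window with a 2k system prompt, both the repo map and chat history hit their 8192-token caps, leaving the bulk of the window for chat files and the current message.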
Message Assembly
Adopt Aider’s ChatChunks pattern with Rust types:
```rust
pub struct PromptChunks {
    pub system: Vec<Message>,
    pub repo_map: Vec<Message>,
    pub read_only_files: Vec<Message>,
    pub chat_history: Vec<Message>, // May be summarized
    pub chat_files: Vec<Message>,
    pub current: Vec<Message>,
    pub reminder: Vec<Message>,
}
```

Compaction Strategy
Implement a three-tier approach:
- Tier 1: Tool output pruning (cheapest). Same concept as OpenCode — clear old tool results beyond a protected window. No LLM call needed.
- Tier 2: History summarization (moderate). Use the weak/editor model to summarize old `chat_history` messages. Run in a background `tokio::task`, swap atomically when complete. Mirror Aider’s recursive split-and-summarize.
- Tier 3: Session compaction (expensive). Full session summary via primary model. Only triggered if tiers 1–2 are insufficient.
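Tier 1 can be sketched as a single pass over the message list that blanks tool results outside a protected window. The message shape and the `keep_last` parameter are assumptions for illustration:

```rust
/// Minimal message model for the sketch; real agents carry richer types.
#[derive(Clone, Debug, PartialEq)]
pub enum Msg {
    User(String),
    Assistant(String),
    ToolResult(String),
}

/// Tier 1: replace tool results older than the last `keep_last` with a
/// placeholder. No LLM call; tokens are reclaimed immediately.
pub fn prune_tool_results(history: &mut [Msg], keep_last: usize) {
    let total_tools = history
        .iter()
        .filter(|m| matches!(m, Msg::ToolResult(_)))
        .count();
    let mut seen = 0;
    for m in history.iter_mut() {
        if let Msg::ToolResult(body) = m {
            seen += 1;
            // Protect the most recent `keep_last` tool results.
            if seen <= total_tools.saturating_sub(keep_last) {
                *body = "[cleared]".to_string();
            }
        }
    }
}
```

Replacing bodies with a short placeholder (rather than deleting messages) keeps the conversation structure intact, so tool-use/tool-result pairing is preserved for providers that validate it.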
Crates
| Crate | Purpose |
|---|---|
| `tiktoken-rs` | Exact token counting for pre-send validation |
| `serde_json` | Byte-based estimation via serialization (for fast path) |
| `tokio` | Background summarization tasks |
Pre-Send Validation
Before every API call, run a validation step:
```rust
async fn validate_prompt(&self, chunks: &PromptChunks) -> Result<(), BudgetError> {
    let total = self.counter.count_exact(&chunks.all_messages())?;
    let limit = self.model.effective_context_window();
    if total > limit {
        return Err(BudgetError::Overflow { total, limit });
    }
    Ok(())
}
```

On `BudgetError::Overflow`, attempt tier-1 pruning, then tier-2 summarization, then tier-3 compaction. If all fail, report the error to the user with actionable suggestions (drop files, clear history).
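The overflow handling can be sketched as an escalation loop over the tiers. The tier functions below are hypothetical stand-ins (each returns how many tokens it reclaimed); the reclaim percentages are purely illustrative:

```rust
#[derive(Debug, PartialEq)]
pub enum Outcome {
    Fits,
    CompactedAtTier(u8),
    Unrecoverable, // all three tiers failed; surface an error to the user
}

/// Escalate through the tiers, cheapest first, until the prompt fits.
pub fn recover(mut total: usize, limit: usize) -> Outcome {
    if total <= limit {
        return Outcome::Fits;
    }
    // Tiers in increasing order of cost: pruning, summarization, compaction.
    let tiers: [(u8, fn(usize) -> usize); 3] =
        [(1, tier1_prune), (2, tier2_summarize), (3, tier3_compact)];
    for (n, tier) in tiers {
        total = total.saturating_sub(tier(total));
        if total <= limit {
            return Outcome::CompactedAtTier(n);
        }
    }
    Outcome::Unrecoverable
}

// Placeholder reclaim amounts; real tiers would mutate the prompt chunks.
fn tier1_prune(total: usize) -> usize { total / 10 }      // ~10% back
fn tier2_summarize(total: usize) -> usize { total / 2 }   // ~50% back
fn tier3_compact(total: usize) -> usize { total * 3 / 4 } // ~75% back
```

Stopping at the first tier that fits keeps the common case cheap: most overflows are resolved by pruning alone, and the expensive LLM-backed tiers only run when needed.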
Provider-Aware Context Management (from Claude Code)
Support both server-side and client-side context management depending on the provider:
```rust
pub enum ContextManagementStrategy {
    /// Provider handles context management server-side (Anthropic)
    ServerSide {
        tool_clearing: Option<ToolClearingConfig>,
        thinking_clearing: Option<ThinkingClearingConfig>,
        compaction: Option<ServerCompactionConfig>,
    },
    /// Client handles context management (OpenAI, local models)
    ClientSide {
        prune_config: PruneConfig,
        summarization: SummarizationConfig,
    },
}

pub struct ToolClearingConfig {
    pub trigger_tokens: usize,   // default 100_000
    pub keep_tool_uses: usize,   // default 3
    pub clear_at_least: Option<usize>,
    pub exclude_tools: Vec<String>,
    pub clear_tool_inputs: bool, // default false
}

pub struct ThinkingClearingConfig {
    pub keep: ThinkingKeepPolicy, // Turns(N) or All
}

pub struct ServerCompactionConfig {
    pub trigger_tokens: usize, // default 150_000
    pub pause_after: bool,
    pub instructions: Option<String>,
}
```

Token Counting Strategy (Updated)
Use a three-tier approach (expanded from dual-mode):
```rust
pub trait TokenCounter {
    /// Tier 1: Fast estimate for assembly decisions (no dependencies, constant time)
    fn estimate_fast(&self, text: &str) -> usize;

    /// Tier 2: Provider API token count (free for Anthropic, may not exist for others)
    async fn count_api(&self, messages: &[Message]) -> Result<Option<usize>>;

    /// Tier 3: Local exact count via tokenizer (tiktoken-rs fallback)
    fn count_exact(&self, messages: &[Message]) -> Result<usize>;
}
```

Pre-send validation order:
- If the provider API supports token counting (Anthropic) → use `count_api()`.
- Else → use `count_exact()` with the local tokenizer.
- The fast estimate is only used during message assembly, never for final validation.
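The dispatch between the two authoritative sources reduces to a small fallback; both counts are stubbed as plain values here for illustration:

```rust
/// Sketch of the pre-send dispatch: prefer the provider's count when the
/// API returned one, otherwise fall back to the local tokenizer. The
/// second element labels the source for logging/debugging.
pub fn final_count(api_count: Option<usize>, local_count: usize) -> (usize, &'static str) {
    match api_count {
        Some(n) => (n, "provider"), // e.g. a provider token-count endpoint
        None => (local_count, "local"), // tiktoken-rs fallback
    }
}
```

Note the fast byte-based estimate never appears here: it influences assembly only, so a validation pass always rests on a real tokenizer or the provider's own count.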
Context Awareness Support
When the provider supports model-level context awareness, inject budget information:
```rust
pub struct ContextBudgetInjection {
    /// Total context budget (injected once at session start)
    pub total_budget: usize,
    /// Whether to inject usage updates after tool calls
    pub inject_usage_updates: bool,
}
```

For providers without native context awareness, OpenOxide can simulate it by injecting synthetic system messages with token usage information.
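Simulating context awareness can be sketched as formatting a synthetic system message from the budget numbers; the message wording below is an assumption, not a defined protocol:

```rust
/// Sketch: build a synthetic system message carrying token-usage info,
/// for providers without native context awareness. Wording is illustrative.
pub fn usage_update(used: usize, total_budget: usize) -> String {
    let pct = used * 100 / total_budget;
    format!(
        "[context] {used} of {total_budget} tokens used ({pct}%). \
         Prefer concise tool output; compaction triggers near the limit."
    )
}
```

Injecting such a message after each tool call gives non-Anthropic models a rough equivalent of the budget signals that context-aware Claude models receive natively.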
Real-Time Budget Monitoring
Expose budget state for TUI display (inspired by Claude Code’s status line):
```rust
pub struct BudgetSnapshot {
    pub context_window_size: usize,
    pub used_tokens: usize,
    pub used_percentage: f32,
    pub remaining_tokens: usize,
    pub session_cost_usd: f64,
    pub session_duration_ms: u64,
    pub total_input_tokens: usize, // cumulative across session
    pub total_output_tokens: usize,
    pub last_call_usage: Option<CallUsage>,
}

pub struct CallUsage {
    pub input_tokens: usize,
    pub output_tokens: usize,
    pub cache_creation_tokens: usize,
    pub cache_read_tokens: usize,
}
```

Update the snapshot after each API response; the TUI renders from this struct.
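The per-response update is straightforward accumulation; a minimal sketch over a subset of the fields above (cost and timing accumulation omitted for brevity):

```rust
/// Reduced snapshot for the sketch; a subset of BudgetSnapshot's fields.
#[derive(Default)]
pub struct Snapshot {
    pub context_window_size: usize,
    pub used_tokens: usize,
    pub used_percentage: f32,
    pub total_input_tokens: usize,
    pub total_output_tokens: usize,
}

pub struct Usage {
    pub input_tokens: usize,
    pub output_tokens: usize,
}

impl Snapshot {
    /// Fold one API call's usage into the running snapshot.
    pub fn record(&mut self, u: &Usage) {
        self.total_input_tokens += u.input_tokens;
        self.total_output_tokens += u.output_tokens;
        // Context usage after this call is its full input plus the reply,
        // since the next request will carry both.
        self.used_tokens = u.input_tokens + u.output_tokens;
        self.used_percentage =
            self.used_tokens as f32 / self.context_window_size as f32 * 100.0;
    }
}
```

Keeping cumulative totals separate from the current-context figure matters: the totals drive cost display, while `used_tokens` drives compaction triggers.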
Cost Governance Integration
```rust
pub struct CostGovernance {
    /// Per-session cost tracking
    pub session_cost: SessionCost,
    /// Organization-level spend limits
    pub spend_limits: Option<SpendLimits>,
    /// OpenTelemetry metrics export
    pub otel_config: Option<OtelConfig>,
}

pub struct SessionCost {
    pub total_cost_usd: f64,
    pub per_model_breakdown: HashMap<String, ModelCost>,
    pub api_duration_ms: u64,
    pub wall_duration_ms: u64,
    pub lines_added: usize,
    pub lines_removed: usize,
}

pub struct ModelCost {
    pub input_tokens: usize,
    pub output_tokens: usize,
    pub cache_read_tokens: usize,
    pub cache_creation_tokens: usize,
    pub estimated_cost_usd: f64,
}

pub struct OtelConfig {
    pub metrics_exporter: Vec<MetricsExporter>,       // otlp, prometheus, console
    pub logs_exporter: Vec<LogsExporter>,             // otlp, console
    pub endpoint: String,
    pub export_interval_ms: u64,                      // default 60_000
    pub resource_attributes: HashMap<String, String>, // team/dept tagging
}
```

Design decisions:
- Session-level cost tracking as first-class data, not derived. Every API call updates the session’s cost snapshot immediately. The `SessionCost` struct is the single source of truth for the `/cost` command equivalent in OpenOxide.
- Per-model breakdown in cost tracking. Users switch models mid-session; cost attribution must track which model generated which tokens. The `per_model_breakdown` map is keyed by model identifier and accumulates across the full session.
- OpenTelemetry export as optional module. When enabled, emit 8 metrics (`session.count`, `lines_of_code.count`, `pull_request.count`, `commit.count`, `cost.usage`, `token.usage`, `code_edit_tool.decision`, `active_time.total`) matching Claude Code’s metric names for dashboard compatibility. The `OtelConfig` supports multiple exporter backends and custom resource attributes for team/department tagging.
- Rate limit awareness. The token counter should expose the TPM consumption rate so the TUI can display rate-limit headroom alongside context window usage. This is especially important for teams where shared rate limits create contention that doesn’t show up in per-session context metrics.
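The rate-limit awareness point can be sketched as a sliding one-minute window over token samples; the window length and API shape are assumptions:

```rust
use std::collections::VecDeque;

/// Sketch: sliding one-minute window over (timestamp_ms, tokens) samples,
/// so the TUI can show tokens-per-minute next to context usage.
pub struct TpmTracker {
    window_ms: u64,
    samples: VecDeque<(u64, usize)>,
}

impl TpmTracker {
    pub fn new() -> Self {
        Self { window_ms: 60_000, samples: VecDeque::new() }
    }

    /// Record the token count of one API call at `now_ms`.
    pub fn record(&mut self, now_ms: u64, tokens: usize) {
        self.samples.push_back((now_ms, tokens));
    }

    /// Tokens consumed in the last minute as of `now_ms`.
    pub fn tpm(&mut self, now_ms: u64) -> usize {
        let cutoff = now_ms.saturating_sub(self.window_ms);
        // Evict samples that fell out of the window.
        while matches!(self.samples.front(), Some(&(t, _)) if t < cutoff) {
            self.samples.pop_front();
        }
        self.samples.iter().map(|&(_, n)| n).sum()
    }
}
```

Comparing this rolling figure against the organization's TPM limit gives the headroom number the TUI would display, surfacing shared-limit contention that per-session context metrics miss.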