Prompt Caching
Feature Definition
Every turn sent to an LLM repeats the same prefix: system prompt, instructions, repo map, read-only file contents. On a large codebase this prefix can be 50–100k tokens. Without caching, every request pays full input-token pricing for content the model has already seen. Prompt caching lets the provider store a hashed prefix server-side so subsequent requests that share the same prefix pay a fraction of the cost: typically 10% of the input price on Anthropic, and free on OpenAI after the first request.
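A quick back-of-envelope in Python makes the economics concrete. The prices here are illustrative assumptions (a $3.00-per-1M-token input rate and a 10% cached-read rate), not quoted from any provider's current price list:

```python
# Back-of-envelope savings for a cached 80k-token prefix.
# Assumed pricing: $3.00 per 1M input tokens, cache reads billed at 10%.
INPUT_PRICE = 3.00 / 1_000_000  # dollars per input token (assumption)

def turn_cost(prefix_tokens, new_tokens, cached):
    """Input cost of one turn, with or without a warm prefix cache."""
    if cached:
        # Prefix read back at 10% of the input price; only new tokens at full price.
        return prefix_tokens * INPUT_PRICE * 0.10 + new_tokens * INPUT_PRICE
    return (prefix_tokens + new_tokens) * INPUT_PRICE

uncached = turn_cost(80_000, 2_000, cached=False)
cached = turn_cost(80_000, 2_000, cached=True)
print(f"uncached: ${uncached:.3f}, cached: ${cached:.3f}")
```

With these assumed numbers the cached turn costs roughly an eighth of the uncached one, which is why prefix stability matters so much on long sessions.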
The hard part is not enabling caching — it is structuring prompts so the prefix remains stable across turns. Any change to a message that precedes the cache boundary invalidates the entire cache. Reordering messages, updating a system instruction, or changing the repo map between turns all break the prefix hash. A good implementation must:
- Separate stable content (system prompt, examples, repo map) from volatile content (user messages, tool results).
- Place explicit cache breakpoint markers at the right positions for providers that require them (Anthropic).
- Keep the prefix frozen even when conversation metadata changes (sandbox policy, reasoning effort).
- Optionally keep the cache warm between turns to prevent expiry (Anthropic caches expire after 5 minutes).
- Track cached vs uncached tokens separately for accurate cost reporting.
OpenAI’s Responses API uses automatic prefix caching — no client markers needed, but a prompt_cache_key hint improves hit rates. Anthropic requires explicit cache_control markers on specific messages. DeepSeek provides automatic caching with separate pricing tiers. Each provider has different mechanics, and a multi-provider agent must handle all of them.
For context-window partitioning and overflow policy (separate from provider cache behavior), see Token Budgeting.
Aider Implementation
Pin: b9050e1d
Aider has the most explicit and configurable prompt caching system of the three tools, centered on the ChatChunks dataclass and a background cache-warming thread.
Prompt Structure via ChatChunks
All messages are organized into eight ordered sections in aider/coders/chat_chunks.py:
```python
@dataclass
class ChatChunks:
    system: List          # System prompt
    examples: List        # Few-shot examples
    done: List            # Completed conversation turns
    repo: List            # Repository map
    readonly_files: List  # Read-only file contents
    chat_files: List      # Editable file contents
    cur: List             # Current turn messages
    reminder: List        # System reminder suffix
```

The format_chat_chunks() method in aider/coders/base_coder.py:1226–1336 populates each section. The ordering is critical: stable content comes first (system, examples, repo map), volatile content last (current turn, reminder).
Cache Breakpoint Placement
add_cache_control_headers() in chat_chunks.py:28–41 places three cache_control markers:
- After examples (or the system prompt if no examples): `self.add_cache_control(self.examples)`, falling back to `self.add_cache_control(self.system)`.
- After the repo map (or readonly files if no map): `self.add_cache_control(self.repo)`, falling back to `self.add_cache_control(self.readonly_files)`.
- After editable files: `self.add_cache_control(self.chat_files)`.
Each marker is placed on the last message of the section. The add_cache_control() helper at chat_chunks.py:43–55 converts string content to the structured format Anthropic requires:
```python
def add_cache_control(self, messages):
    if not messages:
        return
    content = messages[-1]["content"]
    if type(content) is str:
        content = dict(type="text", text=content)
    content["cache_control"] = {"type": "ephemeral"}
    messages[-1]["content"] = [content]
```

The "ephemeral" type tells Anthropic to cache this prefix for 5 minutes. There is no "persistent" option — all Anthropic caches are ephemeral.
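The transformation is easy to exercise standalone. This sketch mirrors the helper above as a free function and shows the resulting message shape:

```python
# Standalone sketch of Aider's add_cache_control() behavior: the last
# message's string content is wrapped in Anthropic's structured content
# format and tagged with an ephemeral cache_control marker.
def add_cache_control(messages):
    if not messages:
        return
    content = messages[-1]["content"]
    if isinstance(content, str):
        content = {"type": "text", "text": content}
    content["cache_control"] = {"type": "ephemeral"}
    messages[-1]["content"] = [content]

msgs = [{"role": "system", "content": "You are a coding assistant."}]
add_cache_control(msgs)
print(msgs[-1]["content"])
# [{'type': 'text', 'text': 'You are a coding assistant.',
#   'cache_control': {'type': 'ephemeral'}}]
```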
Cache Warming
Anthropic caches expire after 5 minutes of inactivity. Aider addresses this with a background daemon thread in base_coder.py:1340–1392:
```python
def warm_cache(self, chunks):
    delay = 5 * 60 - 5  # 295 seconds
    delay = float(os.environ.get("AIDER_CACHE_KEEPALIVE_DELAY", delay))
    # ...
    def warm_cache_worker():
        while self.ok_to_warm_cache:
            time.sleep(1)
            if now < self.next_cache_warm:
                continue
            kwargs["max_tokens"] = 1
            completion = litellm.completion(
                model=self.main_model.name,
                messages=self.cache_warming_chunks.cacheable_messages(),
                stream=False,
                **kwargs,
            )
```

The strategy: every 295 seconds (5 minutes minus 5 seconds for network latency), send a max_tokens=1 request using only the cacheable prefix (via cacheable_messages() at chat_chunks.py:57–64). This keeps the server-side cache hot between user interactions. The ping cost is minimal — 1 output token plus the cached-read discount on input tokens.
Configuration: --cache-prompts enables caching, --cache-keepalive-pings N sets the number of warming cycles. The warming thread is a daemon that dies with the process.
Provider Detection and Gating
Caching is only enabled when both conditions are met (base_coder.py:426–427):
```python
if cache_prompts and self.main_model.cache_control:
    self.add_cache_headers = True
```

The cache_control flag comes from aider/resources/model-settings.yml. Only Anthropic models have cache_control: true — Claude 3.5 Sonnet, Claude 3 Haiku, Claude Opus 4, and Bedrock Anthropic variants. DeepSeek models have a separate caches_by_default: true flag indicating they cache automatically without markers.
Cost Calculation
compute_costs_from_tokens() in base_coder.py:2071–2100 handles provider-specific pricing:
Anthropic:
- Cache write: `cache_write_tokens * input_cost * 1.25` (25% surcharge)
- Cache read: `cache_hit_tokens * input_cost * 0.10` (90% discount)
- Uncached: `prompt_tokens * input_cost` (full price)

DeepSeek:
- Detected by the presence of `input_cost_per_token_cache_hit` in model metadata
- Cache read: `cache_hit_tokens * input_cost_per_token_cache_hit`
- Cache miss (current implementation): `(prompt_tokens - input_cost_per_token_cache_hit) * input_cost_per_token`
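The Anthropic arithmetic above can be sketched as a small calculator. This is an illustration only: the 1.25x/0.10x multipliers are the documented defaults, and the $3-per-1M price is an assumed example:

```python
# Sketch of the Anthropic pricing arithmetic described above, with the
# multipliers as named constants rather than inline magic numbers.
CACHE_WRITE_MULT = 1.25  # 25% surcharge on cache writes
CACHE_READ_MULT = 0.10   # 90% discount on cache reads

def anthropic_input_cost(uncached, cache_write, cache_read, price_per_token):
    return price_per_token * (
        uncached
        + cache_write * CACHE_WRITE_MULT
        + cache_read * CACHE_READ_MULT
    )

# Example: 2k fresh tokens plus an 80k-token prefix written on turn 1...
first = anthropic_input_cost(2_000, 80_000, 0, 3e-6)
# ...and read back at the discounted rate on turn 2.
second = anthropic_input_cost(2_000, 0, 80_000, 3e-6)
print(f"turn 1: ${first:.3f}, turn 2: ${second:.3f}")
```

The write surcharge means the first cached turn costs slightly more than an uncached one; the savings start on the second turn.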
Token tracking reads from litellm’s unified response: usage.prompt_cache_hit_tokens (DeepSeek) or usage.cache_read_input_tokens (Anthropic), plus usage.cache_creation_input_tokens for write tokens.
Important limitation: Cache statistics are only available in non-streaming mode. If streaming is enabled, Aider warns the user at main.py:1120-1122.
CLI Surface
Section titled “CLI Surface”--cache-prompts Enable caching of prompts (default: False)--cache-keepalive-pings N Number of 5-min interval pings to keep cache warm (default: 0)An additional behavioral side effect: when caching is enabled in auto mode, Aider disables automatic repo map refresh (main.py:954-956) to keep the map section stable and avoid invalidating the cache.
Codex Implementation
Pin: 4ab44e2c5
Codex takes a simpler approach: it relies on OpenAI’s server-side automatic prefix caching via the Responses API, using a prompt_cache_key hint to improve hit rates.
prompt_cache_key
In codex-rs/core/src/client.rs:504–516, the request is constructed with:
```rust
let prompt_cache_key = Some(self.client.state.conversation_id.to_string());
let request = ResponsesApiRequest {
    model: model_info.slug.clone(),
    instructions: instructions.clone(),
    input,
    tools,
    prompt_cache_key,
    stream: true,
    // ...
};
```

The prompt_cache_key is set to the conversation ID (a ThreadId), which remains constant for the entire session. This tells OpenAI’s backend to attempt prefix matching against previous requests with the same key.
The ResponsesApiRequest struct in codex-rs/codex-api/src/common.rs:144–160:
```rust
pub struct ResponsesApiRequest {
    pub model: String,
    pub instructions: String,
    pub input: Vec<ResponseItem>,
    pub tools: Vec<serde_json::Value>,
    #[serde(skip_serializing_if = "Option::is_none")]
    pub prompt_cache_key: Option<String>,
    // ...
}
```

The field is marked skip_serializing_if = "Option::is_none" — it is only sent when set.
Prefix Stability Strategy
Tests in codex-rs/core/tests/suite/prompt_caching.rs reveal the caching architecture:
The prompt prefix has a fixed order:
- Permissions message (developer role) — may change between turns
- UI instructions (developer role) — stable across turns
- Environment context (user role) — stable across turns
- User message (user role) — new each turn
When turn context changes (sandbox policy, reasoning effort), updated messages are appended after the cached prefix rather than modifying it. The test at line 328 explicitly verifies “expected cached prefix to be reused” and at line 415 “ensuring cache hit potential.” This design means that even when the agent’s configuration changes mid-session, the prefix hash remains valid.
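The append-don't-modify rule can be illustrated in a few lines of Python. This is an illustration of the idea, not Codex's actual Rust code; `next_turn` and the message shapes are hypothetical:

```python
# Illustration of the append-don't-modify rule: when turn context
# changes, emit an updated message AFTER the cached prefix instead of
# rewriting the original one.
prefix = [
    {"role": "developer", "content": "permissions: on-request"},
    {"role": "developer", "content": "ui instructions"},
    {"role": "user", "content": "environment context"},
]

def next_turn(history, user_msg, permissions_update=None):
    turn = list(history)  # never mutate earlier messages
    if permissions_update:
        turn.append({"role": "developer", "content": permissions_update})
    turn.append({"role": "user", "content": user_msg})
    return turn

t1 = next_turn(prefix, "add a test")
t2 = next_turn(t1, "now run it", permissions_update="permissions: full-access")
# The first len(t1) messages of t2 are byte-identical to t1, so the
# provider's prefix hash still matches.
print(t2[:len(t1)] == t1)
```

Because every earlier turn is a strict prefix of every later one, the provider's prefix match survives mid-session configuration changes.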
Token Tracking
The TokenUsage struct in codex-rs/protocol/src/protocol.rs:1363–1375:
```rust
pub struct TokenUsage {
    pub input_tokens: i64,
    pub cached_input_tokens: i64,
    pub output_tokens: i64,
    pub reasoning_output_tokens: i64,
    pub total_tokens: i64,
}
```

Utility methods at lines 1487–1498:
```rust
pub fn cached_input(&self) -> i64 {
    self.cached_input_tokens.max(0)
}

pub fn non_cached_input(&self) -> i64 {
    (self.input_tokens - self.cached_input()).max(0)
}

pub fn blended_total(&self) -> i64 {
    (self.non_cached_input() + self.output_tokens.max(0)).max(0)
}
```

The blended_total() method excludes cached tokens from the “effective” token count displayed to users, since cached tokens are free on OpenAI.
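A direct Python rendering of the same accounting logic, useful for checking the arithmetic:

```python
# Python rendering of Codex's token-accounting helpers (the Rust
# originals are shown above). Cached input is excluded from the blended
# total because cached tokens are free on OpenAI.
def non_cached_input(input_tokens, cached_input_tokens):
    return max(input_tokens - max(cached_input_tokens, 0), 0)

def blended_total(input_tokens, cached_input_tokens, output_tokens):
    return non_cached_input(input_tokens, cached_input_tokens) + max(output_tokens, 0)

# 90k input tokens of which 80k were served from cache, plus 1.5k output:
print(blended_total(90_000, 80_000, 1_500))  # 11500
```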
No Cache Warming
Unlike Aider, Codex does not implement any cache warming mechanism. OpenAI’s automatic caching has a longer TTL than Anthropic’s 5-minute ephemeral cache, and the Responses API’s server-side implementation handles cache management transparently.
No Explicit Cache Markers
Codex does not inject cache_control fields into messages. The Responses API does not support client-side cache control markers — caching is entirely server-managed based on prefix matching and the prompt_cache_key hint.
OpenCode Implementation
Pin: 7ed449974
OpenCode has the most provider-aware caching implementation, handling five different provider APIs with different cache control mechanisms.
Transform Pipeline
All caching logic lives in packages/opencode/src/provider/transform.ts. The entry point is the ProviderTransform.message() export at lines 252–290, which applies three transforms in order:
- Filter unsupported parts — removes image/file parts the model can’t handle
- Normalize messages — fixes empty content, normalizes tool IDs
- Apply caching — injects cache control markers (Anthropic-family only)
The transform is applied via Vercel AI SDK middleware in session/llm.ts:238–251:
```ts
middleware: [{
  async transformParams(args) {
    if (args.type === "stream") {
      args.params.prompt = ProviderTransform.message(
        args.params.prompt,
        input.model,
        options,
      )
    }
    return args.params
  },
}]
```

Cache Marker Placement
The applyCaching() function at transform.ts:174–212 places markers on four messages:
- First 2 system messages: `msgs.filter(msg => msg.role === "system").slice(0, 2)`
- Last 2 non-system messages: `msgs.filter(msg => msg.role !== "system").slice(-2)`
The rationale: system messages are the most stable (they don’t change between turns), so caching them maximizes prefix reuse. The last 2 messages capture the most recent conversation context, which helps when the user sends multiple quick follow-ups with the same trailing context.
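The selection rule is compact enough to sketch directly. `cache_targets` is a hypothetical name; the filtering mirrors the two expressions above:

```python
# Sketch of OpenCode's selection rule: cache markers go on the first two
# system messages and the last two non-system messages.
def cache_targets(msgs):
    system = [m for m in msgs if m["role"] == "system"][:2]
    recent = [m for m in msgs if m["role"] != "system"][-2:]
    return system + recent

msgs = [
    {"role": "system", "content": "agent prompt"},
    {"role": "system", "content": "project rules"},
    {"role": "user", "content": "turn 1"},
    {"role": "assistant", "content": "reply 1"},
    {"role": "user", "content": "turn 2"},
]
print([m["content"] for m in cache_targets(msgs)])
# ['agent prompt', 'project rules', 'reply 1', 'turn 2']
```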
Provider-Specific Markers
Five different cache control formats at transform.ts:178–194:
| Provider | Key | Value | Level |
|---|---|---|---|
| Anthropic | cacheControl | {"type": "ephemeral"} | Message |
| Bedrock | cachePoint | {"type": "default"} | Message |
| OpenRouter | cacheControl | {"type": "ephemeral"} | Content |
| OpenAI-compatible | cache_control | {"type": "ephemeral"} | Content |
| Copilot | copilot_cache_control | {"type": "ephemeral"} | Content |
The placement level matters:
- Anthropic and Bedrock: markers go on the message object (`msg.providerOptions`)
- All others: markers go on the last content block (`msg.content[lastIndex].providerOptions`)
This distinction exists because Anthropic’s API reads cache control from the message level, while other providers following Anthropic’s convention expect it on individual content parts.
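The table above can be read as a dispatch map. This Python sketch applies a marker at the right level; the field names follow the table, but `apply_marker` and the message shape are illustrative, not OpenCode's actual types:

```python
# The five formats from the table above as a dispatch map:
# provider -> (key, value, placement level).
MARKERS = {
    "anthropic":         ("cacheControl",          {"type": "ephemeral"}, "message"),
    "bedrock":           ("cachePoint",            {"type": "default"},   "message"),
    "openrouter":        ("cacheControl",          {"type": "ephemeral"}, "content"),
    "openai-compatible": ("cache_control",         {"type": "ephemeral"}, "content"),
    "copilot":           ("copilot_cache_control", {"type": "ephemeral"}, "content"),
}

def apply_marker(msg, provider):
    key, value, level = MARKERS[provider]
    if level == "message":
        # Message-level: marker lives on the message object itself.
        msg.setdefault("providerOptions", {})[key] = value
    else:
        # Content-level: marker lives on the last content block.
        msg["content"][-1].setdefault("providerOptions", {})[key] = value
    return msg

msg = {"role": "user", "content": [{"type": "text", "text": "hi"}]}
apply_marker(msg, "bedrock")
print(msg["providerOptions"])  # {'cachePoint': {'type': 'default'}}
```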
Gating: Anthropic-Family Only
Cache markers are only applied when the model is identified as Anthropic-family (transform.ts:255–265):
```ts
if (
  (model.providerID === "anthropic" ||
    model.api.id.includes("anthropic") ||
    model.api.id.includes("claude") ||
    model.id.includes("anthropic") ||
    model.id.includes("claude") ||
    model.api.npm === "@ai-sdk/anthropic") &&
  model.api.npm !== "@ai-sdk/gateway"
) {
  msgs = applyCaching(msgs, model)
}
```

The Gateway SDK is explicitly excluded — it handles caching at its own layer.
prompt_cache_key for OpenAI-family
For providers that support server-side caching with a key hint, OpenCode sets the session ID as the cache key (transform.ts:699–768):
| Provider | Field | Value |
|---|---|---|
| OpenAI | promptCacheKey | Session ID |
| Venice | promptCacheKey | Session ID |
| OpenRouter | prompt_cache_key | Session ID |
| OpenCode GPT-5 | promptCacheKey | Session ID |
This mirrors Codex’s approach: the session ID is stable across turns, enabling prefix caching without explicit markers.
System Prompt 2-Part Structure
The system prompt is deliberately constructed as a 2-element array in session/llm.ts:67–97:
```ts
const system = []
system.push(
  [
    ...(input.agent.prompt ? [input.agent.prompt] : []),
    ...input.system,
    ...(input.user.system ? [input.user.system] : []),
  ]
    .filter(x => x)
    .join("\n"),
)

// After plugin transforms...
if (system.length > 2 && system[0] === header) {
  const rest = system.slice(1)
  system.length = 0
  system.push(header, rest.join("\n"))
}
```

The comment at line 92 explains: “maintain 2-part structure for caching if header unchanged.” By keeping the first system message stable and consolidating dynamic additions into the second, the first system message’s cache marker remains valid even when plugins inject additional instructions.
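The consolidation rule can be sketched as a pure function. `consolidate` is a hypothetical helper; the logic mirrors the TypeScript above:

```python
# Sketch of the 2-part consolidation: keep the stable header as element 0
# and fold everything appended after it into element 1, so the cache
# marker on the header stays valid.
def consolidate(system, header):
    if len(system) > 2 and system[0] == header:
        return [header, "\n".join(system[1:])]
    return system

header = "You are opencode."
system = [header, "project rules", "plugin-injected note"]
print(consolidate(system, header))
# ['You are opencode.', 'project rules\nplugin-injected note']
```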
Pitfalls & Hard Lessons
Cache Invalidation Is Silent
None of the three tools detect or report cache misses caused by prompt restructuring. If a developer changes the system prompt or reorders messages, the cache silently invalidates and costs jump. Aider partially addresses this by disabling auto map refresh when caching is on (main.py:954-956), but there’s no explicit “cache miss rate” metric exposed to users.
Streaming vs Cache Stats
Aider explicitly warns that cache statistics are unavailable during streaming (main.py:1120-1122). This is an Anthropic API limitation — the usage block with cache breakdown is only returned in non-streaming responses or as a final streaming event. Users who enable both --cache-prompts and --stream get caching but can’t verify it’s working.
Provider Fragmentation
The five different cache control formats in OpenCode illustrate the fragmentation problem. Anthropic uses cacheControl, Bedrock uses cachePoint, OpenRouter uses the same key as Anthropic but at content level, OpenAI-compatible providers use cache_control (underscore), and Copilot has its own copilot_cache_control. A multi-provider agent must maintain a mapping table and test each format independently.
Ephemeral Expiry Under Load
Anthropic’s 5-minute cache TTL means that during slow coding sessions (reading docs, thinking about approach), the cache expires between turns. Aider’s warming mechanism addresses this but adds cost — each ping is a 1-token completion that still incurs the cached-read rate on the full prefix. For a 100k-token prefix, that’s ~10k token-equivalents per ping at 10% rate. Over an hour of idle warming (12 pings), that’s 120k token-equivalents — potentially more than the savings.
Prefix Instability from Dynamic Content
The repo map, file contents, and tool definitions all change between turns. If the repo map is regenerated (new files, changed function signatures), the cache from the previous turn is invalidated. Aider’s decision to freeze the map during cached sessions is a pragmatic tradeoff: stale map vs cache hits. Codex avoids this by appending changed context after the cached prefix rather than replacing it.
Token Counting Accuracy
Accurate cost reporting requires distinguishing three token categories: cache write (1.25x), cache read (0.1x), and uncached (1x) for Anthropic. Aider hardcodes the multipliers rather than reading them from model metadata (base_coder.py:2087–2090). If Anthropic changes pricing, the cost display will be wrong until the code is updated.
OpenOxide Blueprint
Crate: openoxide-cache
A provider-agnostic caching abstraction that handles marker injection, cache key management, and cost tracking.
Core Trait
Section titled “Core Trait”pub trait CacheStrategy: Send + Sync { /// Inject provider-specific cache markers into the message list. fn apply_markers(&self, messages: &mut Vec<Message>);
/// Return the cache key for this session, if the provider supports it. fn cache_key(&self) -> Option<String>;
/// Compute cost given token breakdown. fn compute_cost(&self, usage: &CacheTokenUsage, pricing: &ModelPricing) -> f64;}Provider Implementations
Section titled “Provider Implementations”pub struct AnthropicCacheStrategy { /// Positions to place cache_control markers (indices into message vec). breakpoints: Vec<BreakpointRule>,}
pub struct OpenAICacheStrategy { session_id: String,}
pub struct NoCacheStrategy; // For providers without caching supportBreakpointRule would encode the logic: “last message in the system section”, “last message in the repo section”, etc. — mirroring Aider’s ChatChunks approach but as configuration rather than hardcoded logic.
Token Usage Tracking
Section titled “Token Usage Tracking”pub struct CacheTokenUsage { pub input_tokens: u64, pub cached_read_tokens: u64, pub cached_write_tokens: u64, pub output_tokens: u64,}Cost computation should read multipliers from model metadata (not hardcode them), falling back to known defaults: Anthropic write = 1.25x, Anthropic read = 0.10x, OpenAI read = free.
Cache Warming
Implement as an optional tokio::task that sends periodic max_tokens=1 requests:
```rust
pub struct CacheWarmer {
    interval: Duration,    // Default: 295 seconds
    max_pings: u32,        // 0 = disabled
    model: ModelConfig,
    prefix: Vec<Message>,  // Frozen prefix to send
}

impl CacheWarmer {
    pub fn spawn(self) -> JoinHandle<()> {
        tokio::spawn(async move {
            let mut remaining = self.max_pings;
            while remaining > 0 {
                tokio::time::sleep(self.interval).await;
                // Send minimal completion with frozen prefix
                remaining -= 1;
            }
        })
    }
}
```

The warmer should be cancellable (via tokio::select! with a shutdown signal) and should log cache hit rates from ping responses.
Message Assembly
Follow Codex’s approach of keeping a stable prefix and appending new content:
```rust
pub struct PromptBuilder {
    /// Frozen prefix — system prompt, instructions, repo map.
    /// Only rebuilt when explicitly invalidated.
    prefix: Vec<Message>,
    prefix_hash: u64,

    /// Dynamic suffix — user messages, tool results.
    suffix: Vec<Message>,
}

impl PromptBuilder {
    pub fn build(&self) -> Vec<Message> {
        let mut msgs = self.prefix.clone();
        msgs.extend(self.suffix.iter().cloned());
        msgs
    }

    pub fn invalidate_prefix(&mut self) {
        // Called when repo map changes, files are added, etc.
        self.prefix_hash = 0;
    }
}
```

Track the prefix hash to detect when cache invalidation occurs, and log a warning so users know their costs may increase.
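The invalidation check the blueprint calls for is language-agnostic; here it is sketched in Python (the crate itself would be Rust, and `prefix_hash` is a hypothetical helper using BLAKE2 from the standard library):

```python
# Sketch of prefix-change detection: hash the serialized prefix each
# turn and warn when it changes, since a changed prefix means the
# provider's cache will miss on the next request.
import hashlib
import json

def prefix_hash(prefix):
    # sort_keys makes the serialization deterministic across runs.
    blob = json.dumps(prefix, sort_keys=True).encode()
    return hashlib.blake2b(blob, digest_size=8).hexdigest()

last = prefix_hash([{"role": "system", "content": "v1"}])
current = prefix_hash([{"role": "system", "content": "v2"}])
if current != last:
    print("warning: prompt prefix changed; expect a cache miss this turn")
```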
Crates
- tree-sitter — Not directly involved, but repo map changes (which trigger cache invalidation) come from tree-sitter diffs.
- tokio — For the cache warming background task.
- serde/serde_json — For serializing provider-specific cache control markers.
- blake3 or xxhash — For fast prefix hashing to detect invalidation.