
Prompt Caching

Every turn sent to an LLM repeats the same prefix: system prompt, instructions, repo map, read-only file contents. On a large codebase this prefix can be 50–100k tokens. Without caching, every request pays full input-token pricing for content the model has already seen. Prompt caching lets the provider store a hashed prefix server-side so subsequent requests that share the same prefix pay a fraction of the cost (typically 10% of the input price for Anthropic cache reads, and 50–75% off for OpenAI cached input, depending on model).

The hard part is not enabling caching — it is structuring prompts so the prefix remains stable across turns. Any change to a message that precedes the cache boundary invalidates the entire cache. Reordering messages, updating a system instruction, or changing the repo map between turns all break the prefix hash. A good implementation must:

  1. Separate stable content (system prompt, examples, repo map) from volatile content (user messages, tool results).
  2. Place explicit cache breakpoint markers at the right positions for providers that require them (Anthropic).
  3. Keep the prefix frozen even when conversation metadata changes (sandbox policy, reasoning effort).
  4. Optionally keep the cache warm between turns to prevent expiry (Anthropic caches expire after 5 minutes).
  5. Track cached vs uncached tokens separately for accurate cost reporting.

OpenAI’s Responses API uses automatic prefix caching — no client markers needed, but a prompt_cache_key hint improves hit rates. Anthropic requires explicit cache_control markers on specific messages. DeepSeek provides automatic caching with separate pricing tiers. Each provider has different mechanics, and a multi-provider agent must handle all of them. For context-window partitioning and overflow policy (separate from provider cache behavior), see Token Budgeting.
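As a back-of-the-envelope illustration (hypothetical session shape; Anthropic-style 1.25x write / 0.10x read multipliers; ignoring conversation growth across turns), the savings compound quickly:

```python
def session_input_cost(prefix_tokens, per_turn_tokens, turns,
                       write_rate=1.25, read_rate=0.10):
    """Input cost of a session, in token-equivalents at the base input price.

    Turn 1 writes the cache (surcharge); turns 2..n read it (discount).
    Per-turn messages are always billed at the full input rate.
    """
    first = prefix_tokens * write_rate + per_turn_tokens
    rest = (turns - 1) * (prefix_tokens * read_rate + per_turn_tokens)
    return first + rest

# 80k stable prefix, 2k of new messages per turn, 10 turns:
with_cache = session_input_cost(80_000, 2_000, 10)   # 192k token-equivalents
without_cache = (80_000 + 2_000) * 10                # 820k token-equivalents
```

Even with the 25% write surcharge on the first turn, a ten-turn session pays roughly a quarter of the uncached input cost in this sketch.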


Pin: b9050e1d

Aider has the most explicit and configurable prompt caching system of the three tools, centered on the ChatChunks dataclass and a background cache-warming thread.

All messages are organized into eight ordered sections in aider/coders/chat_chunks.py:

@dataclass
class ChatChunks:
    system: List          # System prompt
    examples: List        # Few-shot examples
    done: List            # Completed conversation turns
    repo: List            # Repository map
    readonly_files: List  # Read-only file contents
    chat_files: List      # Editable file contents
    cur: List             # Current turn messages
    reminder: List        # System reminder suffix

The format_chat_chunks() method in aider/coders/base_coder.py:1226–1336 populates each section. The ordering is critical: stable content comes first (system, examples, repo map), volatile content last (current turn, reminder).

add_cache_control_headers() in chat_chunks.py:28–41 places three cache_control markers:

  1. After examples (or system prompt if no examples): self.add_cache_control(self.examples) — falls back to self.add_cache_control(self.system).
  2. After repo map (or readonly files if no map): self.add_cache_control(self.repo) — falls back to self.add_cache_control(self.readonly_files).
  3. After editable files: self.add_cache_control(self.chat_files).

Each marker is placed on the last message of the section. The add_cache_control() helper at chat_chunks.py:43–55 converts string content to the structured format Anthropic requires:

def add_cache_control(self, messages):
    if not messages:
        return
    content = messages[-1]["content"]
    if type(content) is str:
        content = dict(type="text", text=content)
    content["cache_control"] = {"type": "ephemeral"}
    messages[-1]["content"] = [content]

The "ephemeral" type tells Anthropic to cache this prefix for 5 minutes. There is no "persistent" option — all Anthropic caches are ephemeral.
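A standalone, runnable sketch of the same conversion (mirroring the helper above, outside Aider) shows what happens to the final message of a section:

```python
def add_cache_control(messages):
    """Mark the last message of a section as an Anthropic cache breakpoint,
    converting plain-string content to the structured format on the way."""
    if not messages:
        return
    content = messages[-1]["content"]
    if isinstance(content, str):
        content = {"type": "text", "text": content}
    content["cache_control"] = {"type": "ephemeral"}
    messages[-1]["content"] = [content]

section = [{"role": "user", "content": "repo map here"}]
add_cache_control(section)
# The message content is now a one-element list of structured blocks,
# with cache_control attached to the final block.
```

Everything before this marker (inclusive) becomes the cacheable prefix from Anthropic's point of view.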

Anthropic caches expire after 5 minutes of inactivity. Aider addresses this with a background daemon thread in base_coder.py:1340–1392:

def warm_cache(self, chunks):
    delay = 5 * 60 - 5  # 295 seconds
    delay = float(os.environ.get("AIDER_CACHE_KEEPALIVE_DELAY", delay))
    # ...

    def warm_cache_worker():
        while self.ok_to_warm_cache:
            time.sleep(1)
            if now < self.next_cache_warm:
                continue
            kwargs["max_tokens"] = 1
            completion = litellm.completion(
                model=self.main_model.name,
                messages=self.cache_warming_chunks.cacheable_messages(),
                stream=False,
                **kwargs,
            )

The strategy: every 295 seconds (5 minutes minus 5 seconds for network latency), send a max_tokens=1 request using only the cacheable prefix (via cacheable_messages() at chat_chunks.py:57–64). This keeps the server-side cache hot between user interactions. The ping cost is minimal — 1 output token plus the cached-read discount on input tokens.

Configuration: --cache-prompts enables caching, --cache-keepalive-pings N sets the number of warming cycles. The warming thread is a daemon that dies with the process.

Caching is only enabled when both conditions are met (base_coder.py:426–427):

if cache_prompts and self.main_model.cache_control:
    self.add_cache_headers = True

The cache_control flag comes from aider/resources/model-settings.yml. Only Anthropic models have cache_control: true — Claude 3.5 Sonnet, Claude 3 Haiku, Claude Opus 4, and Bedrock Anthropic variants. DeepSeek models have a separate caches_by_default: true flag indicating they cache automatically without markers.

compute_costs_from_tokens() in base_coder.py:2071–2100 handles provider-specific pricing:

Anthropic:

  • Cache write: cache_write_tokens * input_cost * 1.25 (25% surcharge)
  • Cache read: cache_hit_tokens * input_cost * 0.10 (90% discount)
  • Uncached: prompt_tokens * input_cost (full price)

DeepSeek:

  • Detected by presence of input_cost_per_token_cache_hit in model metadata
  • Cache read: cache_hit_tokens * input_cost_per_token_cache_hit
  • Cache miss (current implementation): (prompt_tokens - input_cost_per_token_cache_hit) * input_cost_per_token — note that this subtracts a per-token price from a token count; the intended expression is presumably (prompt_tokens - cache_hit_tokens) * input_cost_per_token
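The Anthropic branch of that computation reduces to a short formula. A sketch of the multipliers described above (not Aider's exact code; the function name is illustrative):

```python
def anthropic_input_cost(uncached_tokens, cache_write_tokens,
                         cache_hit_tokens, input_cost_per_token):
    """Input-side cost in dollars, using Anthropic's published multipliers."""
    return (cache_write_tokens * input_cost_per_token * 1.25   # 25% write surcharge
            + cache_hit_tokens * input_cost_per_token * 0.10   # 90% read discount
            + uncached_tokens * input_cost_per_token)          # full price

# e.g. a $3.00/MTok model, 50k-token cached read, 1k uncached, no cache write:
cost = anthropic_input_cost(1_000, 0, 50_000, 3.00 / 1_000_000)  # ≈ $0.018
```

Without the cache, the same 51k input tokens would cost about $0.153 — roughly 8.5x more.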

Token tracking reads from litellm’s unified response: usage.prompt_cache_hit_tokens (DeepSeek) or usage.cache_read_input_tokens (Anthropic), plus usage.cache_creation_input_tokens for write tokens.

Important limitation: Cache statistics are only available in non-streaming mode. If streaming is enabled, Aider warns the user at main.py:1120-1122.

--cache-prompts            Enable caching of prompts (default: False)
--cache-keepalive-pings N  Number of 5-min interval pings to keep cache warm (default: 0)

An additional behavioral side effect: when caching is enabled in auto mode, Aider disables automatic repo map refresh (main.py:954-956) to keep the map section stable and avoid invalidating the cache.


Pin: 4ab44e2c5

Codex takes a simpler approach: it relies on OpenAI’s server-side automatic prefix caching via the Responses API, using a prompt_cache_key hint to improve hit rates.

In codex-rs/core/src/client.rs:504–516, the request is constructed with:

let prompt_cache_key = Some(self.client.state.conversation_id.to_string());
let request = ResponsesApiRequest {
    model: model_info.slug.clone(),
    instructions: instructions.clone(),
    input,
    tools,
    prompt_cache_key,
    stream: true,
    // ...
};

The prompt_cache_key is set to the conversation ID (a ThreadId), which remains constant for the entire session. This tells OpenAI’s backend to attempt prefix matching against previous requests with the same key.
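The same idea fits in a few lines of Python. This is a hypothetical request builder — prompt_cache_key is a real Responses API field, but the surrounding shape (model slug, class names) is assumed for illustration:

```python
import uuid

class Conversation:
    """Hold one cache key for the whole session, so every request that
    shares the stable prefix can hit the same server-side cache entry."""

    def __init__(self):
        self.conversation_id = str(uuid.uuid4())

    def build_request(self, instructions, input_items):
        return {
            "model": "gpt-4.1",          # assumed model slug
            "instructions": instructions,
            "input": input_items,
            "prompt_cache_key": self.conversation_id,  # constant per session
            "stream": True,
        }

conv = Conversation()
r1 = conv.build_request("You are a coding agent.", [])
r2 = conv.build_request("You are a coding agent.",
                        [{"role": "user", "content": "hi"}])
# r1 and r2 carry the same prompt_cache_key
```

The key does not need to be secret or meaningful — it only has to be stable across turns and distinct across sessions.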

The ResponsesApiRequest struct in codex-rs/codex-api/src/common.rs:144–160:

pub struct ResponsesApiRequest {
    pub model: String,
    pub instructions: String,
    pub input: Vec<ResponseItem>,
    pub tools: Vec<serde_json::Value>,
    #[serde(skip_serializing_if = "Option::is_none")]
    pub prompt_cache_key: Option<String>,
    // ...
}

The field is skip_serializing_if = "Option::is_none" — it’s only sent when set.

Tests in codex-rs/core/tests/suite/prompt_caching.rs reveal the caching architecture:

The prompt prefix has a fixed order:

  1. Permissions message (developer role) — may change between turns
  2. UI instructions (developer role) — stable across turns
  3. Environment context (user role) — stable across turns
  4. User message (user role) — new each turn

When turn context changes (sandbox policy, reasoning effort), updated messages are appended after the cached prefix rather than modifying it. The test at line 328 explicitly verifies “expected cached prefix to be reused” and at line 415 “ensuring cache hit potential.” This design means that even when the agent’s configuration changes mid-session, the prefix hash remains valid.
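The invariant can be sketched as a plain list discipline (hypothetical message dicts and helper name): configuration updates are appended after the prefix, never written into it:

```python
def apply_config_change(history, prefix_len, new_policy_msg):
    """Append an updated turn-context message instead of editing the cached
    prefix; everything before prefix_len must stay byte-identical."""
    before = list(history[:prefix_len])
    history.append(new_policy_msg)
    assert history[:prefix_len] == before  # prefix untouched -> hash still valid
    return history

history = [
    {"role": "developer", "content": "permissions: on-request"},  # may change
    {"role": "developer", "content": "ui instructions"},          # stable
    {"role": "user", "content": "environment context"},           # stable
]
apply_config_change(history, 3,
                    {"role": "developer", "content": "permissions: never"})
```

The stale permissions message stays in place; the model is expected to honor the most recent instruction, while the provider's prefix hash continues to match.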

The TokenUsage struct in codex-rs/protocol/src/protocol.rs:1363–1375:

pub struct TokenUsage {
    pub input_tokens: i64,
    pub cached_input_tokens: i64,
    pub output_tokens: i64,
    pub reasoning_output_tokens: i64,
    pub total_tokens: i64,
}

Utility methods at lines 1487–1498:

pub fn cached_input(&self) -> i64 {
    self.cached_input_tokens.max(0)
}

pub fn non_cached_input(&self) -> i64 {
    (self.input_tokens - self.cached_input()).max(0)
}

pub fn blended_total(&self) -> i64 {
    (self.non_cached_input() + self.output_tokens.max(0)).max(0)
}

The blended_total() method excludes cached tokens from the “effective” token count displayed to users, reflecting that cached input is dramatically cheaper than fresh input on OpenAI.
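In Python terms, the same arithmetic (a direct transcription of the Rust methods above, with the struct fields as parameters):

```python
def cached_input(cached_input_tokens):
    # Clamp negative provider-reported values to zero.
    return max(cached_input_tokens, 0)

def non_cached_input(input_tokens, cached_input_tokens):
    return max(input_tokens - cached_input(cached_input_tokens), 0)

def blended_total(input_tokens, cached_input_tokens, output_tokens):
    # "Effective" tokens: uncached input plus output, cached input excluded.
    return max(non_cached_input(input_tokens, cached_input_tokens)
               + max(output_tokens, 0), 0)

# 120k input of which 100k was cached, plus 2k output
# -> 22k effective tokens shown to the user
```

The max(…, 0) clamps guard against providers reporting inconsistent or missing usage fields.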

Unlike Aider, Codex does not implement any cache warming mechanism. OpenAI’s automatic caching has a longer TTL than Anthropic’s 5-minute ephemeral cache, and the Responses API’s server-side implementation handles cache management transparently.

Codex does not inject cache_control fields into messages. The Responses API does not support client-side cache control markers — caching is entirely server-managed based on prefix matching and the prompt_cache_key hint.


Pin: 7ed449974

OpenCode has the most provider-aware caching implementation, handling five different provider APIs with different cache control mechanisms.

All caching logic lives in packages/opencode/src/provider/transform.ts. The entry point is the ProviderTransform.message() export at lines 252–290, which applies three transforms in order:

  1. Filter unsupported parts — removes image/file parts the model can’t handle
  2. Normalize messages — fixes empty content, normalizes tool IDs
  3. Apply caching — injects cache control markers (Anthropic-family only)

The transform is applied via Vercel AI SDK middleware in session/llm.ts:238–251:

middleware: [{
    async transformParams(args) {
        if (args.type === "stream") {
            args.params.prompt = ProviderTransform.message(
                args.params.prompt, input.model, options
            )
        }
        return args.params
    },
}]

The applyCaching() function at transform.ts:174–212 places markers on four messages:

  • First 2 system messages: msgs.filter(msg => msg.role === "system").slice(0, 2)
  • Last 2 non-system messages: msgs.filter(msg => msg.role !== "system").slice(-2)

The rationale: system messages are the most stable (they don’t change between turns), so caching them maximizes prefix reuse. The last 2 messages capture the most recent conversation context, which helps when the user sends multiple quick follow-ups with the same trailing context.
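The selection itself is simple. A Python sketch of the same filter-and-slice logic (illustrative function name; OpenCode's actual code is the TypeScript cited above):

```python
def cache_targets(msgs):
    """Pick the messages that receive cache markers: the first two system
    messages plus the last two non-system messages."""
    system = [m for m in msgs if m["role"] == "system"][:2]
    tail = [m for m in msgs if m["role"] != "system"][-2:]
    return system + tail

msgs = [
    {"role": "system", "content": "agent prompt"},
    {"role": "system", "content": "env info"},
    {"role": "user", "content": "turn 1"},
    {"role": "assistant", "content": "reply 1"},
    {"role": "user", "content": "turn 2"},
]
targets = cache_targets(msgs)
# -> both system messages, plus "reply 1" and "turn 2"
```

Four markers is also exactly Anthropic's per-request limit on cache breakpoints, which is presumably why the counts are 2 + 2.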

Five different cache control formats at transform.ts:178–194:

| Provider          | Key                   | Value                   | Level   |
|-------------------|-----------------------|-------------------------|---------|
| Anthropic         | cacheControl          | {"type": "ephemeral"}   | Message |
| Bedrock           | cachePoint            | {"type": "default"}     | Message |
| OpenRouter        | cacheControl          | {"type": "ephemeral"}   | Content |
| OpenAI-compatible | cache_control         | {"type": "ephemeral"}   | Content |
| Copilot           | copilot_cache_control | {"type": "ephemeral"}   | Content |

The placement level matters:

  • Anthropic and Bedrock: markers go on the message object (msg.providerOptions)
  • All others: markers go on the last content block (msg.content[lastIndex].providerOptions)

This distinction exists because Anthropic’s API reads cache control from the message level, while other providers following Anthropic’s convention expect it on individual content parts.

Cache markers are only applied when the model is identified as Anthropic-family (transform.ts:255–265):

if (
    (model.providerID === "anthropic" ||
        model.api.id.includes("anthropic") ||
        model.api.id.includes("claude") ||
        model.id.includes("anthropic") ||
        model.id.includes("claude") ||
        model.api.npm === "@ai-sdk/anthropic") &&
    model.api.npm !== "@ai-sdk/gateway"
) {
    msgs = applyCaching(msgs, model)
}

The Gateway SDK is explicitly excluded — it handles caching at its own layer.

For providers that support server-side caching with a key hint, OpenCode sets the session ID as the cache key (transform.ts:699–768):

| Provider       | Field            | Value      |
|----------------|------------------|------------|
| OpenAI         | promptCacheKey   | Session ID |
| Venice         | promptCacheKey   | Session ID |
| OpenRouter     | prompt_cache_key | Session ID |
| OpenCode GPT-5 | promptCacheKey   | Session ID |

This mirrors Codex’s approach: the session ID is stable across turns, enabling prefix caching without explicit markers.

The system prompt is deliberately constructed as a 2-element array in session/llm.ts:67–97:

const system = []
system.push(
    [
        ...(input.agent.prompt ? [input.agent.prompt] : []),
        ...input.system,
        ...(input.user.system ? [input.user.system] : []),
    ].filter(x => x).join("\n"),
)
// After plugin transforms...
if (system.length > 2 && system[0] === header) {
    const rest = system.slice(1)
    system.length = 0
    system.push(header, rest.join("\n"))
}

The comment at line 92 explains: “maintain 2-part structure for caching if header unchanged.” By keeping the first system message stable and consolidating dynamic additions into the second, the first system message’s cache marker remains valid even when plugins inject additional instructions.


None of the three tools detect or report cache misses caused by prompt restructuring. If a developer changes the system prompt or reorders messages, the cache silently invalidates and costs jump. Aider partially addresses this by disabling auto map refresh when caching is on (main.py:954-956), but there’s no explicit “cache miss rate” metric exposed to users.

Aider explicitly warns that cache statistics are unavailable during streaming (main.py:1120-1122). This is an Anthropic API limitation — the usage block with cache breakdown is only returned in non-streaming responses or as a final streaming event. Users who enable both --cache-prompts and --stream get caching but can’t verify it’s working.

The five different cache control formats in OpenCode illustrate the fragmentation problem. Anthropic uses cacheControl, Bedrock uses cachePoint, OpenRouter uses the same key as Anthropic but at content level, OpenAI-compatible providers use cache_control (underscore), and Copilot has its own copilot_cache_control. A multi-provider agent must maintain a mapping table and test each format independently.

Anthropic’s 5-minute cache TTL means that during slow coding sessions (reading docs, thinking about approach), the cache expires between turns. Aider’s warming mechanism addresses this but adds cost — each ping is a 1-token completion that still incurs the cached-read rate on the full prefix. For a 100k-token prefix, that’s ~10k token-equivalents per ping at 10% rate. Over an hour of idle warming (12 pings), that’s 120k token-equivalents — potentially more than the savings.
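The break-even arithmetic is worth making explicit (token-equivalents at the base input price; Anthropic's 0.10x read rate assumed):

```python
def idle_warming_cost(prefix_tokens, pings, read_rate=0.10):
    # Each keepalive ping re-reads the entire cached prefix at the
    # discounted rate (plus one output token, negligible here).
    return prefix_tokens * read_rate * pings

hourly = idle_warming_cost(100_000, 12)   # 120k token-equivalents per idle hour
saved_per_hit = 100_000 * (1 - 0.10)      # 90k saved per cache hit vs full price
# Warming pays off only if the session averages more than
# hourly / saved_per_hit ≈ 1.3 turns per hour.
```

Below that turn rate, letting the cache expire and paying the 1.25x re-write on the next turn is cheaper than keeping it warm.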

The repo map, file contents, and tool definitions all change between turns. If the repo map is regenerated (new files, changed function signatures), the cache from the previous turn is invalidated. Aider’s decision to freeze the map during cached sessions is a pragmatic tradeoff: stale map vs cache hits. Codex avoids this by appending changed context after the cached prefix rather than replacing it.

Accurate cost reporting requires distinguishing three token categories: cache write (1.25x), cache read (0.1x), and uncached (1x) for Anthropic. Aider hardcodes the multipliers rather than reading them from model metadata (base_coder.py:2087–2090). If Anthropic changes pricing, the cost display will be wrong until the code is updated.


A provider-agnostic caching abstraction that handles marker injection, cache key management, and cost tracking.

pub trait CacheStrategy: Send + Sync {
    /// Inject provider-specific cache markers into the message list.
    fn apply_markers(&self, messages: &mut Vec<Message>);

    /// Return the cache key for this session, if the provider supports it.
    fn cache_key(&self) -> Option<String>;

    /// Compute cost given token breakdown.
    fn compute_cost(&self, usage: &CacheTokenUsage, pricing: &ModelPricing) -> f64;
}

pub struct AnthropicCacheStrategy {
    /// Positions to place cache_control markers (indices into message vec).
    breakpoints: Vec<BreakpointRule>,
}

pub struct OpenAICacheStrategy {
    session_id: String,
}

pub struct NoCacheStrategy; // For providers without caching support

BreakpointRule would encode the logic: “last message in the system section”, “last message in the repo section”, etc. — mirroring Aider’s ChatChunks approach but as configuration rather than hardcoded logic.

pub struct CacheTokenUsage {
    pub input_tokens: u64,
    pub cached_read_tokens: u64,
    pub cached_write_tokens: u64,
    pub output_tokens: u64,
}

Cost computation should read multipliers from model metadata (not hardcode them), falling back to known defaults: Anthropic write = 1.25x, Anthropic read = 0.10x, OpenAI read = free.

Implement as an optional tokio::task that sends periodic max_tokens=1 requests:

pub struct CacheWarmer {
    interval: Duration,    // Default: 295 seconds
    max_pings: u32,        // 0 = disabled
    model: ModelConfig,
    prefix: Vec<Message>,  // Frozen prefix to send
}

impl CacheWarmer {
    pub fn spawn(self) -> JoinHandle<()> {
        tokio::spawn(async move {
            let mut remaining = self.max_pings;
            while remaining > 0 {
                tokio::time::sleep(self.interval).await;
                // Send minimal completion with frozen prefix
                remaining -= 1;
            }
        })
    }
}

The warmer should be cancellable (via tokio::select! with a shutdown signal) and should log cache hit rates from ping responses.

Follow Codex’s approach of keeping a stable prefix and appending new content:

pub struct PromptBuilder {
    /// Frozen prefix — system prompt, instructions, repo map.
    /// Only rebuilt when explicitly invalidated.
    prefix: Vec<Message>,
    prefix_hash: u64,
    /// Dynamic suffix — user messages, tool results.
    suffix: Vec<Message>,
}

impl PromptBuilder {
    pub fn build(&self) -> Vec<Message> {
        let mut msgs = self.prefix.clone();
        msgs.extend(self.suffix.iter().cloned());
        msgs
    }

    pub fn invalidate_prefix(&mut self) {
        // Called when repo map changes, files are added, etc.
        self.prefix_hash = 0;
    }
}

Track the prefix hash to detect when cache invalidation occurs, and log a warning so users know their costs may increase.
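A minimal sketch of that invalidation check in Python (stdlib hashlib stands in for the blake3/xxhash the Rust design would use; function names are illustrative):

```python
import hashlib
import json
import logging

def prefix_hash(prefix):
    """Stable digest of the serialized prefix messages."""
    blob = json.dumps(prefix, sort_keys=True, ensure_ascii=False).encode()
    return hashlib.blake2b(blob, digest_size=8).hexdigest()

def check_prefix(prev_hash, prefix):
    """Return the current prefix hash, warning if it changed since last turn."""
    h = prefix_hash(prefix)
    if prev_hash is not None and h != prev_hash:
        logging.warning("prompt prefix changed; provider cache will miss "
                        "and input costs will jump this turn")
    return h

h1 = check_prefix(None, [{"role": "system", "content": "v1"}])
h2 = check_prefix(h1, [{"role": "system", "content": "v2"}])  # logs a warning
```

This only approximates provider behavior — real prefix matching operates on serialized request bytes, not the message objects — but it is enough to surface accidental invalidations to the user.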

  • tree-sitter — Not directly involved, but repo map changes (which trigger cache invalidation) come from tree-sitter diffs.
  • tokio — For the cache warming background task.
  • serde/serde_json — For serializing provider-specific cache control markers.
  • blake3 or xxhash — For fast prefix hashing to detect invalidation.