
Prompt Caching

Every turn sent to an LLM repeats the same prefix: system prompt, instructions, repo map, read-only file contents. On a large codebase this prefix can be 50–100k tokens. Without caching, every request pays full input-token pricing for content the model has already seen. Prompt caching lets the provider store a hashed prefix server-side so subsequent requests that share the same prefix pay a fraction of the cost (typically 10% of the input price for Anthropic cache reads, and 50–75% off for OpenAI cached input, depending on model).

The hard part is not enabling caching — it is structuring prompts so the prefix remains stable across turns. Any change to a message that precedes the cache boundary invalidates the entire cache. Reordering messages, updating a system instruction, or changing the repo map between turns all break the prefix hash. A good implementation must:

  1. Separate stable content (system prompt, examples, repo map) from volatile content (user messages, tool results).
  2. Place explicit cache breakpoint markers at the right positions for providers that require them (Anthropic).
  3. Keep the prefix frozen even when conversation metadata changes (sandbox policy, reasoning effort).
  4. Optionally keep the cache warm between turns to prevent expiry (Anthropic caches expire after 5 minutes).
  5. Track cached vs uncached tokens separately for accurate cost reporting.

OpenAI’s Responses API uses automatic prefix caching — no client markers needed, but a prompt_cache_key hint improves hit rates. Anthropic requires explicit cache_control markers on specific messages. DeepSeek provides automatic caching with separate pricing tiers. Each provider has different mechanics, and a multi-provider agent must handle all of them. For context-window partitioning and overflow policy (separate from provider cache behavior), see Token Budgeting.
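As a back-of-the-envelope illustration (hypothetical session shape; Anthropic-style 1.25x write / 0.10x read multipliers; ignoring conversation growth across turns), the savings compound quickly:

```python
def session_input_cost(prefix_tokens, per_turn_tokens, turns,
                       write_rate=1.25, read_rate=0.10):
    """Input cost of a session, in token-equivalents at the base input price.

    Turn 1 writes the cache (surcharge); turns 2..n read it (discount).
    Per-turn messages are always billed at the full input rate.
    """
    first = prefix_tokens * write_rate + per_turn_tokens
    rest = (turns - 1) * (prefix_tokens * read_rate + per_turn_tokens)
    return first + rest

# 80k stable prefix, 2k of new messages per turn, 10 turns:
with_cache = session_input_cost(80_000, 2_000, 10)   # 192k token-equivalents
without_cache = (80_000 + 2_000) * 10                # 820k token-equivalents
```

Even with the 25% write surcharge on the first turn, a ten-turn session pays roughly a quarter of the uncached input cost in this sketch.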


Pin: b9050e1d

Aider has the most explicit and configurable prompt caching system of the three tools, centered on the ChatChunks dataclass and a background cache-warming thread.

All messages are organized into eight ordered sections in aider/coders/chat_chunks.py:

@dataclass
class ChatChunks:
    system: List          # System prompt
    examples: List        # Few-shot examples
    done: List            # Completed conversation turns
    repo: List            # Repository map
    readonly_files: List  # Read-only file contents
    chat_files: List      # Editable file contents
    cur: List             # Current turn messages
    reminder: List        # System reminder suffix

The format_chat_chunks() method in aider/coders/base_coder.py:1226–1336 populates each section. The ordering is critical: stable content comes first (system, examples, repo map), volatile content last (current turn, reminder).

add_cache_control_headers() in chat_chunks.py:28–41 places three cache_control markers:

  1. After examples (or system prompt if no examples): self.add_cache_control(self.examples) — falls back to self.add_cache_control(self.system).
  2. After repo map (or readonly files if no map): self.add_cache_control(self.repo) — falls back to self.add_cache_control(self.readonly_files).
  3. After editable files: self.add_cache_control(self.chat_files).

Each marker is placed on the last message of the section. The add_cache_control() helper at chat_chunks.py:43–55 converts string content to the structured format Anthropic requires:

def add_cache_control(self, messages):
    if not messages:
        return
    content = messages[-1]["content"]
    if type(content) is str:
        content = dict(type="text", text=content)
    content["cache_control"] = {"type": "ephemeral"}
    messages[-1]["content"] = [content]

The "ephemeral" type tells Anthropic to cache this prefix for 5 minutes. There is no "persistent" option — all Anthropic caches are ephemeral.
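A standalone, runnable sketch of the same conversion (mirroring the helper above, outside Aider) shows what happens to the final message of a section:

```python
def add_cache_control(messages):
    """Mark the last message of a section as an Anthropic cache breakpoint,
    converting plain-string content to the structured format on the way."""
    if not messages:
        return
    content = messages[-1]["content"]
    if isinstance(content, str):
        content = {"type": "text", "text": content}
    content["cache_control"] = {"type": "ephemeral"}
    messages[-1]["content"] = [content]

section = [{"role": "user", "content": "repo map here"}]
add_cache_control(section)
# The message content is now a one-element list of structured blocks,
# with cache_control attached to the final block.
```

Everything before this marker (inclusive) becomes the cacheable prefix from Anthropic's point of view.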

Anthropic caches expire after 5 minutes of inactivity. Aider addresses this with a background daemon thread in base_coder.py:1340–1392:

def warm_cache(self, chunks):
    delay = 5 * 60 - 5  # 295 seconds
    delay = float(os.environ.get("AIDER_CACHE_KEEPALIVE_DELAY", delay))
    # ...

    def warm_cache_worker():
        while self.ok_to_warm_cache:
            time.sleep(1)
            if now < self.next_cache_warm:
                continue
            kwargs["max_tokens"] = 1
            completion = litellm.completion(
                model=self.main_model.name,
                messages=self.cache_warming_chunks.cacheable_messages(),
                stream=False,
                **kwargs,
            )

The strategy: every 295 seconds (5 minutes minus 5 seconds for network latency), send a max_tokens=1 request using only the cacheable prefix (via cacheable_messages() at chat_chunks.py:57–64). This keeps the server-side cache hot between user interactions. The ping cost is minimal — 1 output token plus the cached-read discount on input tokens.

Configuration: --cache-prompts enables caching, --cache-keepalive-pings N sets the number of warming cycles. The warming thread is a daemon that dies with the process.

Caching is only enabled when both conditions are met (base_coder.py:426–427):

if cache_prompts and self.main_model.cache_control:
    self.add_cache_headers = True

The cache_control flag comes from aider/resources/model-settings.yml. Only Anthropic models have cache_control: true — Claude 3.5 Sonnet, Claude 3 Haiku, Claude Opus 4, and Bedrock Anthropic variants. DeepSeek models have a separate caches_by_default: true flag indicating they cache automatically without markers.

compute_costs_from_tokens() in base_coder.py:2071–2100 handles provider-specific pricing:

Anthropic:

  • Cache write: cache_write_tokens * input_cost * 1.25 (25% surcharge)
  • Cache read: cache_hit_tokens * input_cost * 0.10 (90% discount)
  • Uncached: prompt_tokens * input_cost (full price)

DeepSeek:

  • Detected by presence of input_cost_per_token_cache_hit in model metadata
  • Cache read: cache_hit_tokens * input_cost_per_token_cache_hit
  • Cache miss (current implementation): (prompt_tokens - input_cost_per_token_cache_hit) * input_cost_per_token — note that this subtracts a per-token price from a token count; the intended expression is presumably (prompt_tokens - cache_hit_tokens) * input_cost_per_token
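The Anthropic branch of that computation reduces to a short formula. A sketch of the multipliers described above (not Aider's exact code; the function name is illustrative):

```python
def anthropic_input_cost(uncached_tokens, cache_write_tokens,
                         cache_hit_tokens, input_cost_per_token):
    """Input-side cost in dollars, using Anthropic's published multipliers."""
    return (cache_write_tokens * input_cost_per_token * 1.25   # 25% write surcharge
            + cache_hit_tokens * input_cost_per_token * 0.10   # 90% read discount
            + uncached_tokens * input_cost_per_token)          # full price

# e.g. a $3.00/MTok model, 50k-token cached read, 1k uncached, no cache write:
cost = anthropic_input_cost(1_000, 0, 50_000, 3.00 / 1_000_000)  # ≈ $0.018
```

Without the cache, the same 51k input tokens would cost about $0.153 — roughly 8.5x more.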

Token tracking reads from litellm’s unified response: usage.prompt_cache_hit_tokens (DeepSeek) or usage.cache_read_input_tokens (Anthropic), plus usage.cache_creation_input_tokens for write tokens.

Important limitation: Cache statistics are only available in non-streaming mode. If streaming is enabled, Aider warns the user at main.py:1120-1122.

--cache-prompts            Enable caching of prompts (default: False)
--cache-keepalive-pings N  Number of 5-min interval pings to keep cache warm (default: 0)

An additional behavioral side effect: when caching is enabled in auto mode, Aider disables automatic repo map refresh (main.py:954-956) to keep the map section stable and avoid invalidating the cache.


Pin: 4ab44e2c5

Codex takes a simpler approach: it relies on OpenAI’s server-side automatic prefix caching via the Responses API, using a prompt_cache_key hint to improve hit rates.

In codex-rs/core/src/client.rs:504–516, the request is constructed with:

let prompt_cache_key = Some(self.client.state.conversation_id.to_string());
let request = ResponsesApiRequest {
    model: model_info.slug.clone(),
    instructions: instructions.clone(),
    input,
    tools,
    prompt_cache_key,
    stream: true,
    // ...
};

The prompt_cache_key is set to the conversation ID (a ThreadId), which remains constant for the entire session. This tells OpenAI’s backend to attempt prefix matching against previous requests with the same key.
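The same idea fits in a few lines of Python. This is a hypothetical request builder — prompt_cache_key is a real Responses API field, but the surrounding shape (model slug, class names) is assumed for illustration:

```python
import uuid

class Conversation:
    """Hold one cache key for the whole session, so every request that
    shares the stable prefix can hit the same server-side cache entry."""

    def __init__(self):
        self.conversation_id = str(uuid.uuid4())

    def build_request(self, instructions, input_items):
        return {
            "model": "gpt-4.1",          # assumed model slug
            "instructions": instructions,
            "input": input_items,
            "prompt_cache_key": self.conversation_id,  # constant per session
            "stream": True,
        }

conv = Conversation()
r1 = conv.build_request("You are a coding agent.", [])
r2 = conv.build_request("You are a coding agent.",
                        [{"role": "user", "content": "hi"}])
# r1 and r2 carry the same prompt_cache_key
```

The key does not need to be secret or meaningful — it only has to be stable across turns and distinct across sessions.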

The ResponsesApiRequest struct in codex-rs/codex-api/src/common.rs:144–160:

pub struct ResponsesApiRequest {
    pub model: String,
    pub instructions: String,
    pub input: Vec<ResponseItem>,
    pub tools: Vec<serde_json::Value>,
    #[serde(skip_serializing_if = "Option::is_none")]
    pub prompt_cache_key: Option<String>,
    // ...
}

The field is skip_serializing_if = "Option::is_none" — it’s only sent when set.

Tests in codex-rs/core/tests/suite/prompt_caching.rs reveal the caching architecture:

The prompt prefix has a fixed order:

  1. Permissions message (developer role) — may change between turns
  2. UI instructions (developer role) — stable across turns
  3. Environment context (user role) — stable across turns
  4. User message (user role) — new each turn

When turn context changes (sandbox policy, reasoning effort), updated messages are appended after the cached prefix rather than modifying it. The test at line 328 explicitly verifies “expected cached prefix to be reused” and at line 415 “ensuring cache hit potential.” This design means that even when the agent’s configuration changes mid-session, the prefix hash remains valid.
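The invariant can be sketched as a plain list discipline (hypothetical message dicts and helper name): configuration updates are appended after the prefix, never written into it:

```python
def apply_config_change(history, prefix_len, new_policy_msg):
    """Append an updated turn-context message instead of editing the cached
    prefix; everything before prefix_len must stay byte-identical."""
    before = list(history[:prefix_len])
    history.append(new_policy_msg)
    assert history[:prefix_len] == before  # prefix untouched -> hash still valid
    return history

history = [
    {"role": "developer", "content": "permissions: on-request"},  # may change
    {"role": "developer", "content": "ui instructions"},          # stable
    {"role": "user", "content": "environment context"},           # stable
]
apply_config_change(history, 3,
                    {"role": "developer", "content": "permissions: never"})
```

The stale permissions message stays in place; the model is expected to honor the most recent instruction, while the provider's prefix hash continues to match.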

The TokenUsage struct in codex-rs/protocol/src/protocol.rs:1363–1375:

pub struct TokenUsage {
    pub input_tokens: i64,
    pub cached_input_tokens: i64,
    pub output_tokens: i64,
    pub reasoning_output_tokens: i64,
    pub total_tokens: i64,
}

Utility methods at lines 1487–1498:

pub fn cached_input(&self) -> i64 {
    self.cached_input_tokens.max(0)
}

pub fn non_cached_input(&self) -> i64 {
    (self.input_tokens - self.cached_input()).max(0)
}

pub fn blended_total(&self) -> i64 {
    (self.non_cached_input() + self.output_tokens.max(0)).max(0)
}

The blended_total() method excludes cached tokens from the “effective” token count displayed to users, reflecting that cached input is dramatically cheaper than fresh input on OpenAI.
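In Python terms, the same arithmetic (a direct transcription of the Rust methods above, with the struct fields as parameters):

```python
def cached_input(cached_input_tokens):
    # Clamp negative provider-reported values to zero.
    return max(cached_input_tokens, 0)

def non_cached_input(input_tokens, cached_input_tokens):
    return max(input_tokens - cached_input(cached_input_tokens), 0)

def blended_total(input_tokens, cached_input_tokens, output_tokens):
    # "Effective" tokens: uncached input plus output, cached input excluded.
    return max(non_cached_input(input_tokens, cached_input_tokens)
               + max(output_tokens, 0), 0)

# 120k input of which 100k was cached, plus 2k output
# -> 22k effective tokens shown to the user
```

The max(…, 0) clamps guard against providers reporting inconsistent or missing usage fields.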

Unlike Aider, Codex does not implement any cache warming mechanism. OpenAI’s automatic caching has a longer TTL than Anthropic’s 5-minute ephemeral cache, and the Responses API’s server-side implementation handles cache management transparently.

Codex does not inject cache_control fields into messages. The Responses API does not support client-side cache control markers — caching is entirely server-managed based on prefix matching and the prompt_cache_key hint.


Pin: 7ed449974

OpenCode has the most provider-aware caching implementation, handling five different provider APIs with different cache control mechanisms.

All caching logic lives in packages/opencode/src/provider/transform.ts. The entry point is the ProviderTransform.message() export at lines 252–290, which applies three transforms in order:

  1. Filter unsupported parts — removes image/file parts the model can’t handle
  2. Normalize messages — fixes empty content, normalizes tool IDs
  3. Apply caching — injects cache control markers (Anthropic-family only)

The transform is applied via Vercel AI SDK middleware in session/llm.ts:238–251:

middleware: [{
    async transformParams(args) {
        if (args.type === "stream") {
            args.params.prompt = ProviderTransform.message(
                args.params.prompt, input.model, options
            )
        }
        return args.params
    },
}]

The applyCaching() function at transform.ts:174–212 places markers on four messages:

  • First 2 system messages: msgs.filter(msg => msg.role === "system").slice(0, 2)
  • Last 2 non-system messages: msgs.filter(msg => msg.role !== "system").slice(-2)

The rationale: system messages are the most stable (they don’t change between turns), so caching them maximizes prefix reuse. The last 2 messages capture the most recent conversation context, which helps when the user sends multiple quick follow-ups with the same trailing context.
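The selection itself is simple. A Python sketch of the same filter-and-slice logic (illustrative function name; OpenCode's actual code is the TypeScript cited above):

```python
def cache_targets(msgs):
    """Pick the messages that receive cache markers: the first two system
    messages plus the last two non-system messages."""
    system = [m for m in msgs if m["role"] == "system"][:2]
    tail = [m for m in msgs if m["role"] != "system"][-2:]
    return system + tail

msgs = [
    {"role": "system", "content": "agent prompt"},
    {"role": "system", "content": "env info"},
    {"role": "user", "content": "turn 1"},
    {"role": "assistant", "content": "reply 1"},
    {"role": "user", "content": "turn 2"},
]
targets = cache_targets(msgs)
# -> both system messages, plus "reply 1" and "turn 2"
```

Four markers is also exactly Anthropic's per-request limit on cache breakpoints, which is presumably why the counts are 2 + 2.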

Five different cache control formats at transform.ts:178–194:

| Provider          | Key                   | Value                   | Level   |
|-------------------|-----------------------|-------------------------|---------|
| Anthropic         | cacheControl          | {"type": "ephemeral"}   | Message |
| Bedrock           | cachePoint            | {"type": "default"}     | Message |
| OpenRouter        | cacheControl          | {"type": "ephemeral"}   | Content |
| OpenAI-compatible | cache_control         | {"type": "ephemeral"}   | Content |
| Copilot           | copilot_cache_control | {"type": "ephemeral"}   | Content |

The placement level matters:

  • Anthropic and Bedrock: markers go on the message object (msg.providerOptions)
  • All others: markers go on the last content block (msg.content[lastIndex].providerOptions)

This distinction exists because Anthropic’s API reads cache control from the message level, while other providers following Anthropic’s convention expect it on individual content parts.

Cache markers are only applied when the model is identified as Anthropic-family (transform.ts:255–265):

if (
    (model.providerID === "anthropic" ||
        model.api.id.includes("anthropic") ||
        model.api.id.includes("claude") ||
        model.id.includes("anthropic") ||
        model.id.includes("claude") ||
        model.api.npm === "@ai-sdk/anthropic") &&
    model.api.npm !== "@ai-sdk/gateway"
) {
    msgs = applyCaching(msgs, model)
}

The Gateway SDK is explicitly excluded — it handles caching at its own layer.

For providers that support server-side caching with a key hint, OpenCode sets the session ID as the cache key (transform.ts:699–768):

| Provider       | Field            | Value      |
|----------------|------------------|------------|
| OpenAI         | promptCacheKey   | Session ID |
| Venice         | promptCacheKey   | Session ID |
| OpenRouter     | prompt_cache_key | Session ID |
| OpenCode GPT-5 | promptCacheKey   | Session ID |

This mirrors Codex’s approach: the session ID is stable across turns, enabling prefix caching without explicit markers.

The system prompt is deliberately constructed as a 2-element array in session/llm.ts:67–97:

const system = []
system.push(
    [
        ...(input.agent.prompt ? [input.agent.prompt] : []),
        ...input.system,
        ...(input.user.system ? [input.user.system] : []),
    ].filter(x => x).join("\n"),
)
// After plugin transforms...
if (system.length > 2 && system[0] === header) {
    const rest = system.slice(1)
    system.length = 0
    system.push(header, rest.join("\n"))
}

The comment at line 92 explains: “maintain 2-part structure for caching if header unchanged.” By keeping the first system message stable and consolidating dynamic additions into the second, the first system message’s cache marker remains valid even when plugins inject additional instructions.


None of the three tools detect or report cache misses caused by prompt restructuring. If a developer changes the system prompt or reorders messages, the cache silently invalidates and costs jump. Aider partially addresses this by disabling auto map refresh when caching is on (main.py:954-956), but there’s no explicit “cache miss rate” metric exposed to users.

Aider explicitly warns that cache statistics are unavailable during streaming (main.py:1120-1122). This is an Anthropic API limitation — the usage block with cache breakdown is only returned in non-streaming responses or as a final streaming event. Users who enable both --cache-prompts and --stream get caching but can’t verify it’s working.

The five different cache control formats in OpenCode illustrate the fragmentation problem. Anthropic uses cacheControl, Bedrock uses cachePoint, OpenRouter uses the same key as Anthropic but at content level, OpenAI-compatible providers use cache_control (underscore), and Copilot has its own copilot_cache_control. A multi-provider agent must maintain a mapping table and test each format independently.

Anthropic’s 5-minute cache TTL means that during slow coding sessions (reading docs, thinking about approach), the cache expires between turns. Aider’s warming mechanism addresses this but adds cost — each ping is a 1-token completion that still incurs the cached-read rate on the full prefix. For a 100k-token prefix, that’s ~10k token-equivalents per ping at 10% rate. Over an hour of idle warming (12 pings), that’s 120k token-equivalents — potentially more than the savings.
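The break-even arithmetic is worth making explicit (token-equivalents at the base input price; Anthropic's 0.10x read rate assumed):

```python
def idle_warming_cost(prefix_tokens, pings, read_rate=0.10):
    # Each keepalive ping re-reads the entire cached prefix at the
    # discounted rate (plus one output token, negligible here).
    return prefix_tokens * read_rate * pings

hourly = idle_warming_cost(100_000, 12)   # 120k token-equivalents per idle hour
saved_per_hit = 100_000 * (1 - 0.10)      # 90k saved per cache hit vs full price
# Warming pays off only if the session averages more than
# hourly / saved_per_hit ≈ 1.3 turns per hour.
```

Below that turn rate, letting the cache expire and paying the 1.25x re-write on the next turn is cheaper than keeping it warm.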

The repo map, file contents, and tool definitions all change between turns. If the repo map is regenerated (new files, changed function signatures), the cache from the previous turn is invalidated. Aider’s decision to freeze the map during cached sessions is a pragmatic tradeoff: stale map vs cache hits. Codex avoids this by appending changed context after the cached prefix rather than replacing it.

Accurate cost reporting requires distinguishing three token categories: cache write (1.25x), cache read (0.1x), and uncached (1x) for Anthropic. Aider hardcodes the multipliers rather than reading them from model metadata (base_coder.py:2087–2090). If Anthropic changes pricing, the cost display will be wrong until the code is updated.


A provider-agnostic caching abstraction that handles marker injection, cache key management, and cost tracking.

pub trait CacheStrategy: Send + Sync {
    /// Inject provider-specific cache markers into the message list.
    fn apply_markers(&self, messages: &mut Vec<Message>);

    /// Return the cache key for this session, if the provider supports it.
    fn cache_key(&self) -> Option<String>;

    /// Compute cost given token breakdown.
    fn compute_cost(&self, usage: &CacheTokenUsage, pricing: &ModelPricing) -> f64;
}

pub struct AnthropicCacheStrategy {
    /// Positions to place cache_control markers (indices into message vec).
    breakpoints: Vec<BreakpointRule>,
}

pub struct OpenAICacheStrategy {
    session_id: String,
}

pub struct NoCacheStrategy; // For providers without caching support

BreakpointRule would encode the logic: “last message in the system section”, “last message in the repo section”, etc. — mirroring Aider’s ChatChunks approach but as configuration rather than hardcoded logic.

pub struct CacheTokenUsage {
    pub input_tokens: u64,
    pub cached_read_tokens: u64,
    pub cached_write_tokens: u64,
    pub output_tokens: u64,
}

Cost computation should read multipliers from model metadata (not hardcode them), falling back to known defaults: Anthropic write = 1.25x, Anthropic read = 0.10x, OpenAI read = free.

Implement as an optional tokio::task that sends periodic max_tokens=1 requests:

pub struct CacheWarmer {
    interval: Duration,    // Default: 295 seconds
    max_pings: u32,        // 0 = disabled
    model: ModelConfig,
    prefix: Vec<Message>,  // Frozen prefix to send
}

impl CacheWarmer {
    pub fn spawn(self) -> JoinHandle<()> {
        tokio::spawn(async move {
            let mut remaining = self.max_pings;
            while remaining > 0 {
                tokio::time::sleep(self.interval).await;
                // Send minimal completion with frozen prefix
                remaining -= 1;
            }
        })
    }
}

The warmer should be cancellable (via tokio::select! with a shutdown signal) and should log cache hit rates from ping responses.

Follow Codex’s approach of keeping a stable prefix and appending new content:

pub struct PromptBuilder {
    /// Frozen prefix — system prompt, instructions, repo map.
    /// Only rebuilt when explicitly invalidated.
    prefix: Vec<Message>,
    prefix_hash: u64,
    /// Dynamic suffix — user messages, tool results.
    suffix: Vec<Message>,
}

impl PromptBuilder {
    pub fn build(&self) -> Vec<Message> {
        let mut msgs = self.prefix.clone();
        msgs.extend(self.suffix.iter().cloned());
        msgs
    }

    pub fn invalidate_prefix(&mut self) {
        // Called when repo map changes, files are added, etc.
        self.prefix_hash = 0;
    }
}

Track the prefix hash to detect when cache invalidation occurs, and log a warning so users know their costs may increase.
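A minimal sketch of that invalidation check in Python (stdlib hashlib stands in for the blake3/xxhash the Rust design would use; function names are illustrative):

```python
import hashlib
import json
import logging

def prefix_hash(prefix):
    """Stable digest of the serialized prefix messages."""
    blob = json.dumps(prefix, sort_keys=True, ensure_ascii=False).encode()
    return hashlib.blake2b(blob, digest_size=8).hexdigest()

def check_prefix(prev_hash, prefix):
    """Return the current prefix hash, warning if it changed since last turn."""
    h = prefix_hash(prefix)
    if prev_hash is not None and h != prev_hash:
        logging.warning("prompt prefix changed; provider cache will miss "
                        "and input costs will jump this turn")
    return h

h1 = check_prefix(None, [{"role": "system", "content": "v1"}])
h2 = check_prefix(h1, [{"role": "system", "content": "v2"}])  # logs a warning
```

This only approximates provider behavior — real prefix matching operates on serialized request bytes, not the message objects — but it is enough to surface accidental invalidations to the user.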

  • tree-sitter — Not directly involved, but repo map changes (which trigger cache invalidation) come from tree-sitter diffs.
  • tokio — For the cache warming background task.
  • serde/serde_json — For serializing provider-specific cache control markers.
  • blake3 or xxhash — For fast prefix hashing to detect invalidation.