Chat History

Every multi-turn conversation with an LLM accumulates history. Each user message, assistant response, tool call, and tool result adds tokens to the context. After enough turns, the accumulated history exceeds the model’s context window, and the agent must decide what to keep and what to discard.

This is distinct from token budgeting, which covers how the total context window is allocated across system prompts, repo maps, file contents, and history. Chat history management is specifically about how the conversation history portion is stored, retrieved, pruned, and compressed. For persistence layout and resume mechanics, see Session Directory Layout and Session Resumption.

The hard parts are:

  • What to cut: Not all history is equally valuable. The most recent exchange is almost always important. Tool results from five turns ago are usually noise. Intermediate reasoning that led to a correct conclusion can be discarded, but reasoning that led to an error the model is still debugging cannot. There’s no static rule — the value of a message depends on the current conversation state.
  • Summarization vs truncation: Truncation (dropping old messages) is fast and deterministic but loses information permanently. Summarization (asking an LLM to compress old turns) preserves intent but is slow, costs tokens, and can introduce errors. Getting the balance right is critical.
  • Persistence format: The in-memory representation during a session and the on-disk format for persistence serve different needs. In-memory needs fast append and scan. On-disk needs durability, query support, and the ability to resume sessions days later.
  • Compaction markers: When history is summarized, the agent needs to know which messages have been compacted and which are still live. Without clear markers, resumed sessions can re-process already-summarized content or lose track of what the summary covers.
  • Background processing: Summarization involves an LLM call that can take seconds. If this blocks the main conversation loop, the user waits. If it runs asynchronously, there are race conditions between the summary completing and new messages arriving.

Reference: references/aider/aider/ | Commit: b9050e1d5faf8096eae7a46a9ecc05a86231384b

Aider maintains conversation history as two separate Python lists in base_coder.py (lines 395-403):

  • self.done_messages — Completed turns (accumulated history from prior turns)
  • self.cur_messages — Current turn’s messages (active exchange, not yet committed)

Each message is a simple dict: {"role": "user"|"assistant", "content": "..."}. There’s no structured type, no message ID, no timestamps.

When a turn completes, move_back_cur_messages() (lines 1036-1046) archives the current exchange:

def move_back_cur_messages(self, message=None):
    self.done_messages += self.cur_messages
    self.summarize_start()  # check if summarization needed

    if message:
        self.done_messages += [
            {"role": "user", "content": message},
            {"role": "assistant", "content": "Ok."},
        ]
    self.cur_messages = []

The optional message parameter allows injecting a synthetic exchange — used when the system needs to add context (like “the user added file X to the chat”) as a user/assistant pair. The "Ok." assistant response is a placeholder to maintain role alternation.

Chat history is persisted to a Markdown file via io.py:1117-1136:

def append_chat_history(self, text, linebreak=False, blockquote=False):
    # Appends to ~/.aider.chat.history.md

The format is append-only Markdown with timestamps:

# aider chat started at 2024-03-15 14:32:07
#### Fix the login handler to validate the JWT token before checking permissions.
I'll update the login handler...

User messages are prefixed with #### (H4). Assistant responses are raw content. There’s no structured separation — the file is meant to be human-readable, not machine-parseable.

On session resume, if restore_chat_history=True (lines 519-523):

history_md = self.io.read_text(self.io.chat_history_file)
done_messages = utils.split_chat_history_markdown(history_md)
self.done_messages = done_messages
self.summarize_start()

The split_chat_history_markdown() function parses the Markdown back into message dicts by splitting on #### boundaries. This is fragile — if an assistant response contains ####, the parser breaks.
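
The failure mode is easy to reproduce with a toy parser. The sketch below is hypothetical code, not Aider's actual `split_chat_history_markdown()`: it splits on the `#### ` marker and shows how an assistant reply that happens to contain `####` is misparsed as a new user prompt.

```python
# Toy re-implementation (assumed logic, not Aider's code) of parsing the
# Markdown history back into {"role", "content"} dicts.
def split_history(markdown: str) -> list[dict]:
    messages = []
    role, buf = None, []

    def flush():
        if role and buf:
            messages.append({"role": role, "content": "\n".join(buf).strip()})

    for line in markdown.splitlines():
        if line.startswith("#### "):              # user prompt marker
            flush()
            role, buf = "user", [line[5:]]
        elif line.startswith("# aider chat started"):
            flush()                               # session boundary: reset
            role, buf = None, []
        elif role == "user" and line.strip():
            flush()                               # first non-marker line after
            role, buf = "assistant", [line]       # a prompt starts the reply
        else:
            buf.append(line)
    flush()
    return messages
```

A reply containing a line that starts with `#### ` is flushed and re-labeled as a user message, which is exactly the fragility described above.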

The ChatSummary class in history.py handles context overflow:

class ChatSummary:
    def __init__(self, models, max_tokens):
        self.models = models          # [weak_model, primary_model]
        self.max_tokens = max_tokens  # from max_chat_history_tokens

Overflow detection (too_big(), line 23):

def too_big(self, messages):
    sized = self.tokenize(messages)
    total = sum(tokens for tokens, _msg in sized)
    return total > self.max_tokens

Token counting uses the model’s tokenizer (via litellm). The max_tokens budget is typically set to the model’s total context minus reserved space for system prompt, repo map, and file contents.

Summarization algorithm (summarize(), lines 30-113):

  1. If messages fit within budget, return unchanged
  2. Find the split point — walk backwards through messages to find the last assistant message boundary in the first half
  3. Split into head (older messages, to be summarized) and tail (recent messages, kept verbatim)
  4. Recursively summarize head if it’s still too large (depth limit = 3)
  5. Send head to the LLM with a summarization prompt
  6. Replace head with a single summary message: {"role": "user", "content": summary_text}
  7. Return [summary_message] + tail

The summarization prompt asks the LLM to preserve key information: what files were discussed, what changes were made, what errors occurred, and what the user’s intent was.
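
The seven steps above can be sketched as a standalone function. Assumed shapes: plain role/content dicts, a `count` callback for token sizing, and `llm_summarize` standing in for the LLM call; this is a sketch of the algorithm, not Aider's code.

```python
# Recursive head/tail summarization sketch (depth limit = 3).
def summarize(messages, max_tokens, llm_summarize, count, depth=0):
    if sum(count(m) for m in messages) <= max_tokens:
        return messages                            # step 1: already fits

    # Step 2: last assistant boundary within the first half.
    half = len(messages) // 2
    split = half
    while split > 0 and messages[split - 1]["role"] != "assistant":
        split -= 1
    split = split or half                          # fallback: plain midpoint

    head, tail = messages[:split], messages[split:]  # step 3

    # Step 4: recursively summarize the head if it is still too large.
    if depth < 3 and sum(count(m) for m in head) > max_tokens:
        head = summarize(head, max_tokens, llm_summarize, count, depth + 1)

    # Steps 5-6: compress head into a single user-role summary message.
    summary = {"role": "user", "content": llm_summarize(head)}
    return [summary] + tail                        # step 7
```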

Model selection: Summarization tries the weak model first (self.models[0]), falling back to the primary model if the weak model fails (lines 114-122). This saves cost — summarization doesn’t need the best model.

Summarization runs in a background thread to avoid blocking the user (lines 1002-1034 of base_coder.py):

def summarize_start(self):
    if self.summarizer.too_big(self.done_messages):
        self.summarizer_thread = threading.Thread(target=self.summarize_worker)
        self.summarizer_thread.start()

def summarize_worker(self):
    result = self.summarizer.summarize(self.done_messages_snapshot)
    self.summarized_done_messages = result

def summarize_end(self):
    if self.summarizer_thread:
        self.summarizer_thread.join()
        self.done_messages = self.summarized_done_messages

summarize_start() is called after each turn. summarize_end() is called before the next LLM call, blocking until the summary is ready. This gives the summarizer the time between turns to complete — usually a few seconds while the user types their next message.

Race condition: If the user sends a new message before summarization completes, summarize_end() blocks the conversation until the summary finishes. There’s no cancellation mechanism.


Reference: references/codex/codex-rs/core/src/ | Commit: 4ab44e2c5cc54ed47e47a6729dfd8aa5a3dc2476

Codex maintains conversation history as items: Vec<ResponseItem> in the ContextManager struct (context_manager/history.rs:25-29). Items are ordered oldest-to-newest. The ResponseItem type comes from the OpenAI Responses API and includes all message types: user messages, assistant responses, function calls, function outputs.

Adding items (record_items(), lines 65-80):

pub fn record_items(&mut self, items: Vec<ResponseItem>) {
    // Filter and process items
    // Append to self.items
}

Prompt construction (for_prompt(), lines 86-91):

pub fn for_prompt(&self) -> Vec<ResponseItem> {
    // Return normalized history for the LLM
    self.items.clone()
}

Codex persists history to ~/.codex/history.jsonl via message_history.rs (lines 48-140). Each entry is a single JSON line:

{"session_id":"550e8400-e29b-41d4-a716-446655440000","ts":1710523927,"text":"Fix the login handler"}

Atomic writes: All writes use O_APPEND and stay within PIPE_BUF size to guarantee atomicity on POSIX systems (lines 112-140). This means multiple concurrent Codex sessions can safely append to the same history file.
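
The technique is portable. A minimal Python sketch (hypothetical code, not Codex's Rust implementation): one `write()` on an `O_APPEND` descriptor, refused when the encoded line exceeds `PIPE_BUF`, so concurrent appenders never interleave bytes.

```python
import json
import os

PIPE_BUF = 4096  # POSIX guarantees at least 512; Linux uses 4096

def append_entry(path: str, session_id: str, ts: int, text: str) -> bool:
    """Append one JSONL entry atomically; refuse oversized entries."""
    line = json.dumps(
        {"session_id": session_id, "ts": ts, "text": text},
        separators=(",", ":"),
    ) + "\n"
    data = line.encode()
    if len(data) > PIPE_BUF:
        return False  # a larger write could interleave; caller must handle
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        os.write(fd, data)  # single syscall: atomic w.r.t. other appenders
    finally:
        os.close(fd)
    return True
```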

Advisory locking: Read operations acquire a shared advisory lock, write operations acquire an exclusive lock (line 346). This prevents corruption from concurrent trimming and appending.

Codex does not use LLM-based summarization. Instead, it trims the history file by byte size (message_history.rs:159-244):

const HISTORY_SOFT_CAP_RATIO: f64 = 0.8;

fn enforce_history_limit(&self, config: &Config) -> Result<()> {
    // If file exceeds config.history.max_bytes:
    // 1. Read entire file
    // 2. Drop oldest lines until file_size < max_bytes * 0.8
    // 3. Rewrite file with remaining lines
    // 4. Preserve the newest entry unconditionally
}

The soft cap at 80% of the hard limit prevents repeated trimming — once the file is trimmed, it has 20% headroom before the next trim is triggered. This is crude compared to Aider’s summarization (information is permanently lost, not compressed), but it’s fast and deterministic with no LLM dependency.
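
The trim step can be sketched as a pure function over lines already read from the file (hypothetical helper, mirroring the described behavior, not Codex's code):

```python
HISTORY_SOFT_CAP_RATIO = 0.8

def trim_lines(lines: list[bytes], max_bytes: int) -> list[bytes]:
    """Drop oldest lines until under the soft cap; always keep the newest."""
    if sum(len(l) for l in lines) <= max_bytes:
        return lines                          # under the hard limit: no trim
    target = int(max_bytes * HISTORY_SOFT_CAP_RATIO)
    kept = list(lines)
    while len(kept) > 1 and sum(len(l) for l in kept) > target:
        kept.pop(0)                           # drop oldest first
    return kept                               # newest entry always survives
```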

For within-session context management, Codex uses truncation, not summarization (context_manager/history.rs:125-144):

pub fn remove_first_item(&mut self) -> Option<ResponseItem> {
    // Remove oldest item from history
}

pub fn drop_last_n_user_turns(&mut self, n: usize) {
    // Remove the last N user turns (for rollback)
}

Token tracking uses a byte-based heuristic (4 bytes ≈ 1 token) for items added after the last API response, combined with exact token counts from the API’s usage field for prior items (lines 251-282):

pub fn estimate_token_count_with_base_instructions(&self) -> usize {
    // API-reported tokens for items before the last response
    // + byte estimate for items after the last response
    // + base instruction tokens
}

When the context overflows, Codex’s auto-compact feature iteratively removes the oldest items until the estimated token count fits within 95% of the model’s context window. This uses the same model at lower reasoning effort — it’s truncation with re-prompting, not summarization.

The history file supports random access by entry offset (message_history.rs:247-384):

pub fn history_metadata(config: &Config) -> (LogId, usize) {
    // Returns (log_id, entry_count)
}

pub fn lookup(log_id: LogId, offset: usize, config: &Config) -> Option<HistoryEntry> {
    // Retrieve a specific entry by position
}

LogId is the file’s inode (Unix) or creation time (Windows), used to detect file rotation. If the log_id changes between metadata and lookup, the offset is invalid.
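
A sketch of the rotation check, mirroring the Unix inode variant (hypothetical Python, not Codex's code):

```python
import os

def log_id(path: str) -> int:
    """The Unix LogId: the inode changes when the file is replaced."""
    return os.stat(path).st_ino

def lookup(path: str, expected_id: int, offset: int):
    """Return the entry at `offset`, or None if the file was rotated."""
    if log_id(path) != expected_id:
        return None               # rotation detected: offsets no longer valid
    with open(path) as f:
        for i, line in enumerate(f):
            if i == offset:
                return line.rstrip("\n")
    return None                   # offset past end of file
```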


Reference: references/opencode/packages/opencode/src/session/ | Commit: 7ed449974864361bad2c1f1405769fd2c2fcdf42

OpenCode stores all conversation data in SQLite with three tables (session.sql.ts:11-88):

SessionTable:

id, project_id, parent_id, slug, directory, title, version,
share_url, summary_additions, summary_deletions, summary_files, summary_diffs,
revert, permission,
time_created, time_updated, time_compacting, time_archived

MessageTable:

id, session_id, time_created, time_updated, data (JSON)

PartTable:

id, message_id, session_id, time_created, time_updated, data (JSON)

The data column in both MessageTable and PartTable stores JSON blobs — the structured message/part content serialized via Zod schemas. Cascade deletes ensure session deletion cleans up all associated messages and parts.
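
The cascade behavior can be reproduced with a reduced schema (columns abbreviated from the tables above; this is a sketch, not OpenCode's migration code). Deleting a session row removes its messages and parts in one statement:

```python
import sqlite3

# Reduced three-table schema with ON DELETE CASCADE on the foreign keys.
SCHEMA = """
CREATE TABLE session (
  id TEXT PRIMARY KEY,
  title TEXT,
  time_created INTEGER
);
CREATE TABLE message (
  id TEXT PRIMARY KEY,
  session_id TEXT REFERENCES session(id) ON DELETE CASCADE,
  data TEXT              -- JSON blob
);
CREATE TABLE part (
  id TEXT PRIMARY KEY,
  message_id TEXT REFERENCES message(id) ON DELETE CASCADE,
  session_id TEXT REFERENCES session(id) ON DELETE CASCADE,
  data TEXT              -- JSON blob
);
"""

def open_db() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:", isolation_level=None)  # autocommit
    db.executescript(SCHEMA)
    db.execute("PRAGMA foreign_keys = ON")  # required for cascades in SQLite
    return db
```

Note that SQLite only enforces cascades when `foreign_keys` is enabled per connection.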

Messages use a rich type system (message-v2.ts), with message info and message parts stored separately:

interface UserMessageInfo {
  id: string
  sessionID: string
  role: "user"
  time: { created: number }
  agent: string
  model: { providerID: string; modelID: string }
  system?: string
  tools?: Record<string, boolean>
  summary?: { title?: string; body?: string; diffs: FileDiff[] }
}

interface AssistantMessageInfo {
  id: string
  sessionID: string
  role: "assistant"
  time: { created: number; completed?: number }
  parentID: string
  modelID: string
  providerID: string
  agent: string
  path: { cwd: string; root: string }
  cost: number
  tokens: {
    input: number
    output: number
    reasoning: number
    total?: number
    cache: { read: number; write: number }
  }
  summary?: boolean
  error?: unknown
}

MessagePart[] entries (TextPart, ToolPart, CompactionPart, PatchPart, etc.) are stored in PartTable and joined back to each message when streamed.

MessagePart is a discriminated union with 10+ variants:

  • TextPart — user/assistant text
  • ReasoningPart — model chain-of-thought (Claude thinking blocks)
  • ToolPart — tool call with state (pending | running | completed | error)
  • FilePart — attached files
  • CompactionPart — summary marker (critical for compaction system)
  • SubtaskPart — delegated task reference
  • SnapshotPart — filesystem state at that point
  • PatchPart — diff hashes for undo

This is far more structured than Aider’s {"role", "content"} dicts or Codex’s ResponseItem enum. The trade-off is complexity — serialization, migration, and querying all become harder.

Messages are loaded via an async generator that streams from SQLite in batches (message-v2.ts:716-809):

async function* stream(sessionID: string) {
  // Query MessageTable descending by time_created
  // Batch size: 50 messages per query
  // For each batch, join PartTable
  // Yield {info, parts} tuples
}

The generator yields messages newest-first. The consumer — filterCompacted() — scans backwards from the newest message and stops when it hits a completed compaction boundary (lines 794-809):

function filterCompacted(messages: Message[]) {
  // Walk newest → oldest
  // Track: has an assistant message with summary=true been "finished"?
  // When a user message matches the completed summary's ID → stop
  // Reverse the collected messages → return oldest-first
}

This is how OpenCode avoids loading the entire history on session resume. The compaction marker tells the loader “everything before this point has been summarized — stop here.” Only messages after the last compaction are loaded into the LLM’s context.
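
A sketch of the backward scan (the message shape here is assumed: dicts with a `summary`/`completed` flag on assistant summaries and a `parent_id` linking the summary to the user turn it covers; OpenCode's actual types are richer):

```python
def filter_compacted(newest_first: list[dict]) -> list[dict]:
    """Collect messages newest -> oldest, stopping at a compaction boundary."""
    kept = []
    boundary_user_id = None
    for msg in newest_first:
        kept.append(msg)
        if msg.get("role") == "assistant" and msg.get("summary") and msg.get("completed"):
            boundary_user_id = msg.get("parent_id")   # boundary found
        elif boundary_user_id and msg.get("id") == boundary_user_id:
            break                                     # everything older is compacted
    kept.reverse()                                    # return oldest-first
    return kept
```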

OpenCode’s compaction system (compaction.ts, 262 lines) has two phases:

Phase 1: Pruning (prune(), lines 58-99):

Before summarizing, prune large tool outputs that are unlikely to be relevant:

function prune(messages: Message[]) {
  // Walk backwards through messages
  // Keep a minimum of 2 user turns untouched
  // For each tool output:
  //   If accumulated tokens > PRUNE_PROTECT (40,000): mark as compacted
  //   Skip protected tools (e.g., "skill")
  // Only prune if total pruned > PRUNE_MINIMUM (20,000)
  // Mark pruned parts with compacted: Date.now()
}

Pruned tool outputs aren’t deleted — they’re marked with compacted: Date.now(). When the prompt builder encounters a compacted tool result, it replaces the content with "[Old tool result content cleared]" (line 620 of prompt.ts). The original data stays in SQLite for forensic purposes.
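
The mark-then-substitute pattern reduces to two small operations (hypothetical part shape; the placeholder string is the one quoted from prompt.ts):

```python
import time

def mark_pruned(part: dict) -> None:
    """Mark a tool-result part as compacted; the data itself is kept."""
    part["compacted"] = int(time.time() * 1000)   # marked, not deleted

def render_tool_result(part: dict) -> str:
    """Prompt-time view: pruned parts render as a placeholder."""
    if part.get("compacted"):
        return "[Old tool result content cleared]"
    return part["content"]
```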

Phase 2: Summarization (process(), lines 101-229):

After pruning, if the context is still too large, a full LLM-based summary is generated:

  1. Create an assistant message with summary: true flag
  2. Build a prompt from the conversation above the overflow point
  3. Call the LLM to generate a summary via SessionProcessor
  4. Store the summary as a CompactionPart in the assistant message
  5. Optionally auto-continue with a synthetic user message (lines 202-225)
  6. Publish SessionCompaction.Event.Compacted event

Overflow detection (isOverflow()):

function isOverflow(messages: Message[], model: Model) {
  const usable = model.limit.input - COMPACTION_BUFFER // 20,000 reserved
  const used = totalTokens(messages) // input + output
  return used > usable
}

The COMPACTION_BUFFER of 20,000 tokens ensures compaction triggers before the context window is completely full, leaving room for the next turn’s system prompt and user message.

Compaction is triggered automatically in the main prompt loop (prompt.ts:542-554):

// After each LLM response completes:
if (lastFinished && isOverflow(messages, model)) {
  await compact(sessionID)
  // Re-load messages (now with compaction marker)
  // Continue prompt loop
}

This runs synchronously between turns — unlike Aider’s background thread approach. The user sees a “Compacting…” indicator in the TUI while the summary is generated.

// Disable auto-compaction:
config.compaction?.auto === false

// Constants:
COMPACTION_BUFFER = 20_000 // tokens reserved before overflow
PRUNE_PROTECT = 40_000     // token threshold before pruning starts
PRUNE_MINIMUM = 20_000     // minimum tokens to justify pruning

Claude Code takes a fundamentally different approach to chat history management by leveraging server-side API features for compaction and context editing, a hierarchical memory system for persistent knowledge, and a checkpoint system for session-level undo.

Memory System: Persistent Context Beyond History


Unlike Aider/Codex/OpenCode which treat chat history as the primary persistence layer, Claude Code separates instructions (CLAUDE.md files) from session history (conversation transcript) from learned knowledge (auto memory).

CLAUDE.md hierarchy (six levels, loaded at session start):

| Level          | Location                             | Scope                             |
|----------------|--------------------------------------|-----------------------------------|
| Managed policy | /etc/claude-code/CLAUDE.md (Linux)   | Organization-wide                 |
| Project memory | ./CLAUDE.md or ./.claude/CLAUDE.md   | Team via VCS                      |
| Project rules  | ./.claude/rules/*.md                 | Team, path-scoped via frontmatter |
| User memory    | ~/.claude/CLAUDE.md                  | Personal, all projects            |
| Project local  | ./CLAUDE.local.md                    | Personal, current project         |
| Auto memory    | ~/.claude/projects/<project>/memory/ | Per-project, auto-generated       |

CLAUDE.md files above CWD are loaded in full at launch. Files in child directories load on-demand when Claude reads files in those subtrees. More specific instructions take precedence over broader ones.

Auto memory is a separate subsystem where Claude records its own learnings:

  • Stored at ~/.claude/projects/<project>/memory/ with MEMORY.md as an index file and optional topic files (e.g., debugging.md, api-conventions.md)
  • Only the first 200 lines of MEMORY.md are loaded into the system prompt at session start
  • Topic files are loaded on demand via standard file tools
  • Each git repo gets one memory directory; git worktrees get separate directories
  • Control via CLAUDE_CODE_DISABLE_AUTO_MEMORY=0 (force on) or =1 (force off)

CLAUDE.md imports: Files can reference other files with @path/to/import syntax, resolved relative to the containing file. Recursive imports up to depth 5. First-encounter approval dialog per project.

Project rules (.claude/rules/*.md): Modular, topic-specific instructions. Optional YAML frontmatter with paths: field for glob-based conditional activation. Subdirectories supported, recursively discovered. Symlinks allowed.

Claude Code tracks file edits as checkpoints — one per user prompt. Checkpoints are not conversation history management per se, but they interact with it via the rewind menu.

  • Every user prompt creates a checkpoint
  • Persists across sessions, auto-cleaned after 30 days (configurable)
  • Only tracks changes made by Claude’s file editing tools — NOT bash commands, NOT external changes
  • Complements (does not replace) version control

Rewind menu (via Esc + Esc or /rewind):

| Action                        | Code                 | Conversation                         |
|-------------------------------|----------------------|--------------------------------------|
| Restore code and conversation | Revert to checkpoint | Rewind to that message               |
| Restore conversation          | Keep current         | Rewind to that message               |
| Restore code                  | Revert to checkpoint | Keep current                         |
| Summarize from here           | No change            | Compress from selected point forward |

“Summarize from here” is a targeted compaction — it keeps early context in full detail and only compresses from the selected point forward. The original messages are preserved in the session transcript for reference. Accepts optional instructions to guide the summary focus.

Claude Code uses the Anthropic API’s server-side compaction (beta compact-2026-01-12) rather than implementing client-side summarization like Aider or OpenCode:

  1. API detects input tokens exceed trigger threshold (default 150K, minimum 50K)
  2. Generates summary of conversation
  3. Creates a compaction content block containing the summary
  4. Continues the response with compacted context
  5. On subsequent requests, API auto-drops all message blocks prior to the last compaction block

Configuration:

  • trigger: when to compact (input token threshold)
  • pause_after_compaction: return immediately after summary so client can preserve recent messages before continuing
  • instructions: custom summarization prompt (completely replaces default)

The pause_after_compaction option enables a pattern where the client preserves the last N messages verbatim alongside the compaction summary, avoiding information loss for recent context.

In addition to compaction, Claude Code uses the API’s context editing features (beta context-management-2025-06-27) for lighter-weight context management:

Tool result clearing (clear_tool_uses_20250919):

  • Trigger: configurable input token threshold (default 100K)
  • Keeps: configurable number of recent tool use/result pairs (default 3)
  • Cleared results replaced with placeholder text
  • exclude_tools: never-clear list for important tool types
  • clear_at_least: minimum tokens to clear (amortizes cache invalidation cost)

Thinking block clearing (clear_thinking_20251015):

  • Manages extended thinking blocks to save context space
  • Default: keep only last assistant turn’s thinking
  • Can keep N turns or "all" (maximizes prompt cache hits)

Both strategies are server-side — the client maintains full unmodified history while the API applies edits before the prompt reaches Claude.

Claude models (Sonnet 4.6, Sonnet 4.5, Haiku 4.5) receive their token budget at session start:

<budget:token_budget>200000</budget:token_budget>

After each tool call, remaining capacity is updated:

<system_warning>Token usage: 35000/200000; 165000 remaining</system_warning>

This is a model-training feature — the model is trained to use this information for planning tasks within its available context.

Claude Code exposes context state to users via a customizable status line — a shell script that receives JSON session data on stdin after each assistant message (debounced at 300ms). Key context-related fields:

  • context_window.used_percentage / remaining_percentage — calculated from input tokens
  • context_window.context_window_size — 200K default, 1M for extended context
  • context_window.current_usage — token counts from last API call (input, output, cache creation, cache read)
  • cost.total_cost_usd — accumulated session cost
  • exceeds_200k_tokens — boolean threshold indicator
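
A status-line script reduces to reading JSON from stdin and printing one line. The sketch below assumes the field names listed above; the full payload schema is larger, and the exact shape is an assumption here.

```python
import json
import sys

def render_status(payload: dict) -> str:
    """Format a one-line status from the session JSON Claude Code provides."""
    ctx = payload.get("context_window", {})
    cost = payload.get("cost", {})
    return "ctx {used}% used | ${usd:.2f}".format(
        used=ctx.get("used_percentage", 0),
        usd=cost.get("total_cost_usd", 0.0),
    )

def main() -> None:
    # Claude Code pipes the session JSON on stdin after each assistant message.
    print(render_status(json.load(sys.stdin)))
```
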

| Aspect | Aider | Codex | OpenCode | Claude Code |
|---|---|---|---|---|
| Compaction location | Client (background thread) | Client (post-response) | Client (synchronous) | Server-side API |
| Compaction trigger | History token budget exceeded | auto_compact_token_limit | isOverflow() post-response | API input token threshold (default 150K) |
| Pruning | None (summarize everything) | Truncate oldest items | Two-phase (prune tools, then summarize) | Server-side tool result clearing (configurable) |
| Token counting | Exact (litellm) | Byte heuristic (4 bytes/token) | Char heuristic (4 chars/token) + API-reported | Free API endpoint pre-send |
| Persistent memory | None | None | None | 6-level CLAUDE.md hierarchy + auto memory |
| Session undo | None | None | None | Checkpoint per prompt with rewind menu |
| Context awareness | None | None | None | Model-level budget injection |
| Thinking management | N/A | Encrypted reasoning | Reasoning parts | Server-side thinking block clearing |

When an LLM summarizes conversation history, it can subtly distort the intent. “The user asked to fix the login handler” might become “The user wants to update the authentication system” — close but not identical. Over multiple rounds of summarization (Aider’s recursive approach), these distortions compound. The model ends up working from a summary-of-a-summary that no longer accurately reflects what the user originally asked for.

OpenCode’s two-phase approach (prune tool outputs first, then summarize if needed) is empirically better than Aider’s summarize-everything approach. Most context bloat comes from large tool outputs (file reads, search results, command output) that are only relevant for the turn they were generated. Pruning these before summarization reduces the summary’s job to compressing actual conversation, not reams of file contents.

Aider’s Markdown history format is human-readable but machine-fragile. The #### separator breaks when assistant responses contain Markdown headers. The timestamp-based session boundaries don’t support querying by session ID. There’s no transaction safety — a crash mid-write can corrupt the file. SQLite (OpenCode) and JSONL with atomic writes (Codex) are more robust, at the cost of human readability.

Aider runs summarization in a background thread, which avoids blocking the user but introduces a race condition: if the user sends a message before summarization completes, the main thread blocks on thread.join(). OpenCode runs compaction synchronously between turns, which is simpler and avoids races but makes the user wait. Codex avoids the problem entirely by not summarizing.

Aider uses exact token counts via the model’s tokenizer (litellm). Codex uses a 4-byte heuristic for new items and exact counts from the API for prior items. OpenCode uses exact API-reported tokens for completed turns and heuristics for in-progress content. The heuristic approaches are faster but can be off by 10-20%, which means compaction triggers earlier or later than optimal. For models with large context windows (200k+), this margin is acceptable. For smaller windows (8k-32k), exact counting matters more.

Aider’s “restore chat history” replays the entire Markdown file into done_messages, then summarizes if too large. This means resuming a long session triggers an immediate summarization that can take 10+ seconds. Codex’s JSONL lookup supports random access by offset but doesn’t reconstruct the conversation context — it’s a history browser, not a session resume mechanism. OpenCode’s compaction markers allow efficient resume: load messages after the last compaction boundary, and the summary message provides the compressed context for everything before.

OpenCode’s SQLite schema uses cascade deletes — deleting a session deletes all its messages and parts. This is clean but irreversible. There’s no soft-delete, no trash, no recovery mechanism. Codex’s JSONL file can be manually edited to recover deleted entries. Aider’s Markdown file is append-only and never deletes anything.

Claude Code’s server-side compaction requires an additional sampling step — the summary generation is billed as a separate API call within the same request. The usage.iterations array separates compaction costs from message costs, but developers must sum across all iterations for accurate billing. Top-level input_tokens/output_tokens exclude compaction iterations, which can be misleading.

Claude Code’s checkpointing only tracks file editing tools. If Claude runs rm file.txt or mv old.txt new.txt via bash, those changes are invisible to the checkpoint system and cannot be undone via rewind. This creates a false sense of safety when Claude uses shell commands for file operations.

Auto memory’s 200-line limit on MEMORY.md is a hard cutoff. If the index file grows beyond that, content is silently dropped from the system prompt. Claude is instructed to keep it concise by moving details to topic files, but there’s no enforcement mechanism — a verbose auto-save could push important content past the limit.

The pause_after_compaction pattern (return after summary, client preserves recent messages, re-send) adds a second API round-trip. If the client doesn’t handle the stop_reason: "compaction" case, the conversation breaks. This is an easy integration bug in headless/CI deployments.


OpenOxide should use a three-layer system:

  1. In-memory: A Vec<Message> ordered oldest-to-newest, with efficient append and backward scan
  2. On-disk: SQLite for structured persistence with message/part separation
  3. Compaction: Two-phase (prune tool outputs, then LLM summarization) with marker-based filtering

// SQLite tables

struct Session {
    id: Ulid,
    project_id: String,
    parent_id: Option<Ulid>,    // for forked sessions
    title: String,
    directory: PathBuf,
    created_at: i64,
    updated_at: i64,
    compacted_at: Option<i64>,  // last compaction timestamp
}

struct Message {
    id: Ulid,
    session_id: Ulid,
    role: Role,                 // User | Assistant
    created_at: i64,
    completed_at: Option<i64>,
}

struct Part {
    id: Ulid,
    message_id: Ulid,
    session_id: Ulid,           // denormalized for efficient session queries
    kind: PartKind,             // Text | Reasoning | ToolCall | ToolResult | Summary | Snapshot
    data: serde_json::Value,    // JSON blob
    compacted_at: Option<i64>,  // None = live, Some = pruned
    created_at: i64,
}

| Crate       | Purpose                                         |
|-------------|-------------------------------------------------|
| rusqlite    | SQLite bindings (WAL mode for concurrent reads) |
| ulid        | Time-ordered unique IDs                         |
| serde_json  | Part data serialization                         |
| tiktoken-rs | Exact token counting for OpenAI models          |

Follow OpenCode’s two-phase approach:

pub async fn compact(session_id: Ulid, model: &Model) -> Result<()> {
    let messages = load_after_last_compaction(session_id).await?;

    // Phase 1: Prune old tool outputs
    let pruned_tokens = prune_tool_outputs(&messages, PruneConfig {
        protect_threshold: 40_000, // start pruning after this many tokens
        minimum_savings: 20_000,   // don't bother if savings < this
        keep_recent_turns: 2,      // always keep the last 2 user turns
    }).await?;

    // Phase 2: Summarize if still over budget
    let budget = model.context_window - COMPACTION_BUFFER;
    if estimate_tokens(&messages) > budget {
        let summary = generate_summary(&messages, model).await?;
        insert_compaction_marker(session_id, summary).await?;
    }
    Ok(())
}

Use a streaming loader that stops at compaction boundaries:

pub async fn load_context(session_id: Ulid) -> Vec<Message> {
    // Query messages descending by created_at
    // Stop when a completed compaction marker is found
    // Reverse to oldest-first order
    // Replace compacted tool outputs with placeholder text
}

Use exact counting for the primary model’s tokenizer, with a 4-byte fallback for unknown models:

pub fn count_tokens(text: &str, model: &str) -> usize {
    match tiktoken_rs::get_bpe_from_model(model) {
        Ok(bpe) => bpe.encode_with_special_tokens(text).len(),
        Err(_) => text.len() / 4, // fallback heuristic
    }
}

Run compaction in a background tokio::task with a channel-based result delivery, similar to Aider’s threading but with proper cancellation:

let (tx, mut rx) = oneshot::channel();
tokio::spawn(async move {
    let result = compact(session_id, &model).await;
    let _ = tx.send(result);
});

// Before the next LLM call:
if let Ok(result) = rx.try_recv() {
    // Apply compaction result
} else {
    // Compaction still running — block or proceed with current context
}

This gives the user the time between turns for compaction to complete (like Aider) while supporting cancellation via tokio’s cooperative cancellation (unlike Aider’s uninterruptible thread).

Hierarchical Memory System (from Claude Code)


Adopt Claude Code’s multi-level memory hierarchy:

pub struct MemoryConfig {
    /// System-wide managed policy (e.g., /etc/openoxide/MEMORY.md)
    pub managed_policy: Option<PathBuf>,
    /// Project memory (./OPENOXIDE.md or ./.openoxide/MEMORY.md)
    pub project_memory: Option<PathBuf>,
    /// Project rules (./.openoxide/rules/*.md, with optional path globs)
    pub project_rules: Vec<RuleFile>,
    /// User memory (~/.openoxide/MEMORY.md)
    pub user_memory: Option<PathBuf>,
    /// Project local (./OPENOXIDE.local.md, gitignored)
    pub project_local: Option<PathBuf>,
    /// Auto memory (~/.openoxide/projects/<project>/memory/)
    pub auto_memory: AutoMemoryConfig,
}

pub struct RuleFile {
    pub path: PathBuf,
    pub path_globs: Option<Vec<String>>, // YAML frontmatter `paths`
    pub content: String,
}

pub struct AutoMemoryConfig {
    pub dir: PathBuf,
    pub index_file: PathBuf,       // MEMORY.md
    pub index_line_limit: usize,   // 200
    pub topic_files: Vec<PathBuf>, // loaded on demand
}

Loading strategy:

  1. At session start: load managed policy + project + user + local + auto memory index (first N lines)
  2. On file access in child directory: load any MEMORY.md found there
  3. On path match: activate path-scoped rules

pub struct Checkpoint {
    pub id: Ulid,
    pub session_id: Ulid,
    pub message_id: Ulid,            // the user prompt that triggered this checkpoint
    pub file_states: Vec<FileState>, // snapshot of files before the agent's edits
    pub created_at: i64,
    pub ttl_days: u32,               // default 30
}

pub struct FileState {
    pub path: PathBuf,
    pub content_hash: [u8; 32],   // SHA-256 of file content
    pub content: Option<Vec<u8>>, // stored if file was modified
}

pub enum RewindAction {
    RestoreCodeAndConversation,
    RestoreConversation,
    RestoreCode,
    SummarizeFromHere { instructions: Option<String> },
}

Only track files modified by the agent’s file editing tools, not shell commands. Store file content before each edit for restoration.

Server-Side Compaction Support (from Claude Code)


When the provider API supports server-side compaction (like Anthropic’s compact_20260112), prefer it over client-side compaction:

pub enum CompactionStrategy {
    /// Use the provider's server-side compaction API
    ServerSide {
        trigger_tokens: usize,        // default 150_000
        pause_after: bool,            // pause to preserve recent messages
        instructions: Option<String>, // custom summarization prompt
    },
    /// Client-side two-phase (prune + summarize)
    ClientSide {
        prune_config: PruneConfig,
        summary_model: Option<String>,
    },
    /// No compaction
    Disabled,
}

Server-side is preferred because:

  1. No extra client logic for summarization
  2. Summary quality benefits from provider-side optimization
  3. Integrated with prompt caching (compaction blocks can be cache-controlled)
  4. Usage tracking via iterations array in response