Chat History

Every multi-turn conversation with an LLM accumulates history. Each user message, assistant response, tool call, and tool result adds tokens to the context. After enough turns, the accumulated history exceeds the model’s context window, and the agent must decide what to keep and what to discard.

This is distinct from token budgeting, which covers how the total context window is allocated across system prompts, repo maps, file contents, and history. Chat history management is specifically about how the conversation history portion is stored, retrieved, pruned, and compressed. For persistence layout and resume mechanics, see Session Directory Layout and Session Resumption.

The hard parts are:

  • What to cut: Not all history is equally valuable. The most recent exchange is almost always important. Tool results from five turns ago are usually noise. Intermediate reasoning that led to a correct conclusion can be discarded, but reasoning that led to an error the model is still debugging cannot. There’s no static rule — the value of a message depends on the current conversation state.
  • Summarization vs truncation: Truncation (dropping old messages) is fast and deterministic but loses information permanently. Summarization (asking an LLM to compress old turns) preserves intent but is slow, costs tokens, and can introduce errors. Getting the balance right is critical.
  • Persistence format: The in-memory representation during a session and the on-disk format for persistence serve different needs. In-memory needs fast append and scan. On-disk needs durability, query support, and the ability to resume sessions days later.
  • Compaction markers: When history is summarized, the agent needs to know which messages have been compacted and which are still live. Without clear markers, resumed sessions can re-process already-summarized content or lose track of what the summary covers.
  • Background processing: Summarization involves an LLM call that can take seconds. If this blocks the main conversation loop, the user waits. If it runs asynchronously, there are race conditions between the summary completing and new messages arriving.

Reference: references/aider/aider/ | Commit: b9050e1d5faf8096eae7a46a9ecc05a86231384b

Aider maintains conversation history as two separate Python lists in base_coder.py (lines 395-403):

  • self.done_messages — Completed turns (accumulated history from prior turns)
  • self.cur_messages — Current turn’s messages (active exchange, not yet committed)

Each message is a simple dict: {"role": "user"|"assistant", "content": "..."}. There’s no structured type, no message ID, no timestamps.

When a turn completes, move_back_cur_messages() (lines 1036-1046) archives the current exchange:

def move_back_cur_messages(self, message=None):
    self.done_messages += self.cur_messages
    self.summarize_start()  # check if summarization needed

    if message:
        self.done_messages += [
            {"role": "user", "content": message},
            {"role": "assistant", "content": "Ok."},
        ]
    self.cur_messages = []

The optional message parameter allows injecting a synthetic exchange — used when the system needs to add context (like “the user added file X to the chat”) as a user/assistant pair. The "Ok." assistant response is a placeholder to maintain role alternation.

Chat history is persisted to a Markdown file via io.py:1117-1136:

def append_chat_history(self, text, linebreak=False, blockquote=False):
    # Appends to ~/.aider.chat.history.md

The format is append-only Markdown with timestamps:

# aider chat started at 2024-03-15 14:32:07
#### Fix the login handler to validate the JWT token before checking permissions.
I'll update the login handler...

User messages are prefixed with #### (H4). Assistant responses are raw content. There’s no structured separation — the file is meant to be human-readable, not machine-parseable.

On session resume, if restore_chat_history=True (lines 519-523):

history_md = self.io.read_text(self.io.chat_history_file)
done_messages = utils.split_chat_history_markdown(history_md)
self.done_messages = done_messages
self.summarize_start()

The split_chat_history_markdown() function parses the Markdown back into message dicts by splitting on #### boundaries. This is fragile — if an assistant response contains ####, the parser breaks.
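
The failure mode is easy to reproduce with a toy parser. The sketch below is hypothetical code, not Aider's actual `split_chat_history_markdown()`: it splits on the `#### ` marker and shows how an assistant reply that happens to contain `####` is misparsed as a new user prompt.

```python
# Toy re-implementation (assumed logic, not Aider's code) of parsing the
# Markdown history back into {"role", "content"} dicts.
def split_history(markdown: str) -> list[dict]:
    messages = []
    role, buf = None, []

    def flush():
        if role and buf:
            messages.append({"role": role, "content": "\n".join(buf).strip()})

    for line in markdown.splitlines():
        if line.startswith("#### "):              # user prompt marker
            flush()
            role, buf = "user", [line[5:]]
        elif line.startswith("# aider chat started"):
            flush()                               # session boundary: reset
            role, buf = None, []
        elif role == "user" and line.strip():
            flush()                               # first non-marker line after
            role, buf = "assistant", [line]       # a prompt starts the reply
        else:
            buf.append(line)
    flush()
    return messages
```

A reply containing a line that starts with `#### ` is flushed and re-labeled as a user message, which is exactly the fragility described above.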

The ChatSummary class in history.py handles context overflow:

class ChatSummary:
    def __init__(self, models, max_tokens):
        self.models = models          # [weak_model, primary_model]
        self.max_tokens = max_tokens  # from max_chat_history_tokens

Overflow detection (too_big(), line 23):

def too_big(self, messages):
    sized = self.tokenize(messages)
    total = sum(tokens for tokens, _msg in sized)
    return total > self.max_tokens

Token counting uses the model’s tokenizer (via litellm). The max_tokens budget is typically set to the model’s total context minus reserved space for system prompt, repo map, and file contents.

Summarization algorithm (summarize(), lines 30-113):

  1. If messages fit within budget, return unchanged
  2. Find the split point — walk backwards through messages to find the last assistant message boundary in the first half
  3. Split into head (older messages, to be summarized) and tail (recent messages, kept verbatim)
  4. Recursively summarize head if it’s still too large (depth limit = 3)
  5. Send head to the LLM with a summarization prompt
  6. Replace head with a single summary message: {"role": "user", "content": summary_text}
  7. Return [summary_message] + tail

The summarization prompt asks the LLM to preserve key information: what files were discussed, what changes were made, what errors occurred, and what the user’s intent was.
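
The seven steps above can be sketched as a standalone function. Assumed shapes: plain role/content dicts, a `count` callback for token sizing, and `llm_summarize` standing in for the LLM call; this is a sketch of the algorithm, not Aider's code.

```python
# Recursive head/tail summarization sketch (depth limit = 3).
def summarize(messages, max_tokens, llm_summarize, count, depth=0):
    if sum(count(m) for m in messages) <= max_tokens:
        return messages                            # step 1: already fits

    # Step 2: last assistant boundary within the first half.
    half = len(messages) // 2
    split = half
    while split > 0 and messages[split - 1]["role"] != "assistant":
        split -= 1
    split = split or half                          # fallback: plain midpoint

    head, tail = messages[:split], messages[split:]  # step 3

    # Step 4: recursively summarize the head if it is still too large.
    if depth < 3 and sum(count(m) for m in head) > max_tokens:
        head = summarize(head, max_tokens, llm_summarize, count, depth + 1)

    # Steps 5-6: compress head into a single user-role summary message.
    summary = {"role": "user", "content": llm_summarize(head)}
    return [summary] + tail                        # step 7
```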

Model selection: Summarization tries the weak model first (self.models[0]), falling back to the primary model if the weak model fails (lines 114-122). This saves cost — summarization doesn’t need the best model.

Summarization runs in a background thread to avoid blocking the user (lines 1002-1034 of base_coder.py):

def summarize_start(self):
    if self.summarizer.too_big(self.done_messages):
        self.summarizer_thread = threading.Thread(target=self.summarize_worker)
        self.summarizer_thread.start()

def summarize_worker(self):
    result = self.summarizer.summarize(self.done_messages_snapshot)
    self.summarized_done_messages = result

def summarize_end(self):
    if self.summarizer_thread:
        self.summarizer_thread.join()
        self.done_messages = self.summarized_done_messages

summarize_start() is called after each turn. summarize_end() is called before the next LLM call, blocking until the summary is ready. This gives the summarizer the time between turns to complete — usually a few seconds while the user types their next message.

Race condition: If the user sends a new message before summarization completes, summarize_end() blocks the conversation until the summary finishes. There’s no cancellation mechanism.


Reference: references/codex/codex-rs/core/src/ | Commit: 4ab44e2c5cc54ed47e47a6729dfd8aa5a3dc2476

Codex maintains conversation history as items: Vec<ResponseItem> in the ContextManager struct (context_manager/history.rs:25-29). Items are ordered oldest-to-newest. The ResponseItem type comes from the OpenAI Responses API and includes all message types: user messages, assistant responses, function calls, function outputs.

Adding items (record_items(), lines 65-80):

pub fn record_items(&mut self, items: Vec<ResponseItem>) {
    // Filter and process items
    // Append to self.items
}

Prompt construction (for_prompt(), lines 86-91):

pub fn for_prompt(&self) -> Vec<ResponseItem> {
    // Return normalized history for the LLM
    self.items.clone()
}

Codex persists history to ~/.codex/history.jsonl via message_history.rs (lines 48-140). Each entry is a single JSON line:

{"session_id":"550e8400-e29b-41d4-a716-446655440000","ts":1710523927,"text":"Fix the login handler"}

Atomic writes: All writes use O_APPEND and stay within PIPE_BUF size to guarantee atomicity on POSIX systems (lines 112-140). This means multiple concurrent Codex sessions can safely append to the same history file.
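
The technique is portable. A minimal Python sketch (hypothetical code, not Codex's Rust implementation): one `write()` on an `O_APPEND` descriptor, refused when the encoded line exceeds `PIPE_BUF`, so concurrent appenders never interleave bytes.

```python
import json
import os

PIPE_BUF = 4096  # POSIX guarantees at least 512; Linux uses 4096

def append_entry(path: str, session_id: str, ts: int, text: str) -> bool:
    """Append one JSONL entry atomically; refuse oversized entries."""
    line = json.dumps(
        {"session_id": session_id, "ts": ts, "text": text},
        separators=(",", ":"),
    ) + "\n"
    data = line.encode()
    if len(data) > PIPE_BUF:
        return False  # a larger write could interleave; caller must handle
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        os.write(fd, data)  # single syscall: atomic w.r.t. other appenders
    finally:
        os.close(fd)
    return True
```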

Advisory locking: Read operations acquire a shared advisory lock, write operations acquire an exclusive lock (line 346). This prevents corruption from concurrent trimming and appending.

Codex does not use LLM-based summarization. Instead, it trims the history file by byte size (message_history.rs:159-244):

const HISTORY_SOFT_CAP_RATIO: f64 = 0.8;

fn enforce_history_limit(&self, config: &Config) -> Result<()> {
    // If file exceeds config.history.max_bytes:
    // 1. Read entire file
    // 2. Drop oldest lines until file_size < max_bytes * 0.8
    // 3. Rewrite file with remaining lines
    // 4. Preserve the newest entry unconditionally
}

The soft cap at 80% of the hard limit prevents repeated trimming — once the file is trimmed, it has 20% headroom before the next trim is triggered. This is crude compared to Aider’s summarization (information is permanently lost, not compressed), but it’s fast and deterministic with no LLM dependency.
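
The trim step can be sketched as a pure function over lines already read from the file (hypothetical helper, mirroring the described behavior, not Codex's code):

```python
HISTORY_SOFT_CAP_RATIO = 0.8

def trim_lines(lines: list[bytes], max_bytes: int) -> list[bytes]:
    """Drop oldest lines until under the soft cap; always keep the newest."""
    if sum(len(l) for l in lines) <= max_bytes:
        return lines                          # under the hard limit: no trim
    target = int(max_bytes * HISTORY_SOFT_CAP_RATIO)
    kept = list(lines)
    while len(kept) > 1 and sum(len(l) for l in kept) > target:
        kept.pop(0)                           # drop oldest first
    return kept                               # newest entry always survives
```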

For within-session context management, Codex uses truncation, not summarization (context_manager/history.rs:125-144):

pub fn remove_first_item(&mut self) -> Option<ResponseItem> {
    // Remove oldest item from history
}

pub fn drop_last_n_user_turns(&mut self, n: usize) {
    // Remove the last N user turns (for rollback)
}

Token tracking uses a byte-based heuristic (4 bytes ≈ 1 token) for items added after the last API response, combined with exact token counts from the API’s usage field for prior items (lines 251-282):

pub fn estimate_token_count_with_base_instructions(&self) -> usize {
    // API-reported tokens for items before the last response
    // + byte estimate for items after the last response
    // + base instruction tokens
}

When the context overflows, Codex’s auto-compact feature iteratively removes the oldest items until the estimated token count fits within 95% of the model’s context window. This uses the same model at lower reasoning effort — it’s truncation with re-prompting, not summarization.

The history file supports random access by entry offset (message_history.rs:247-384):

pub fn history_metadata(config: &Config) -> (LogId, usize) {
    // Returns (log_id, entry_count)
}

pub fn lookup(log_id: LogId, offset: usize, config: &Config) -> Option<HistoryEntry> {
    // Retrieve a specific entry by position
}

LogId is the file’s inode (Unix) or creation time (Windows), used to detect file rotation. If the log_id changes between metadata and lookup, the offset is invalid.
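
A sketch of the rotation check, mirroring the Unix inode variant (hypothetical Python, not Codex's code):

```python
import os

def log_id(path: str) -> int:
    """The Unix LogId: the inode changes when the file is replaced."""
    return os.stat(path).st_ino

def lookup(path: str, expected_id: int, offset: int):
    """Return the entry at `offset`, or None if the file was rotated."""
    if log_id(path) != expected_id:
        return None               # rotation detected: offsets no longer valid
    with open(path) as f:
        for i, line in enumerate(f):
            if i == offset:
                return line.rstrip("\n")
    return None                   # offset past end of file
```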


Reference: references/opencode/packages/opencode/src/session/ | Commit: 7ed449974864361bad2c1f1405769fd2c2fcdf42

OpenCode stores all conversation data in SQLite with three tables (session.sql.ts:11-88):

SessionTable:

id, project_id, parent_id, slug, directory, title, version,
share_url, summary_additions, summary_deletions, summary_files, summary_diffs,
revert, permission,
time_created, time_updated, time_compacting, time_archived

MessageTable:

id, session_id, time_created, time_updated, data (JSON)

PartTable:

id, message_id, session_id, time_created, time_updated, data (JSON)

The data column in both MessageTable and PartTable stores JSON blobs — the structured message/part content serialized via Zod schemas. Cascade deletes ensure session deletion cleans up all associated messages and parts.
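
The cascade behavior can be reproduced with a reduced schema (columns abbreviated from the tables above; this is a sketch, not OpenCode's migration code). Deleting a session row removes its messages and parts in one statement:

```python
import sqlite3

# Reduced three-table schema with ON DELETE CASCADE on the foreign keys.
SCHEMA = """
CREATE TABLE session (
  id TEXT PRIMARY KEY,
  title TEXT,
  time_created INTEGER
);
CREATE TABLE message (
  id TEXT PRIMARY KEY,
  session_id TEXT REFERENCES session(id) ON DELETE CASCADE,
  data TEXT              -- JSON blob
);
CREATE TABLE part (
  id TEXT PRIMARY KEY,
  message_id TEXT REFERENCES message(id) ON DELETE CASCADE,
  session_id TEXT REFERENCES session(id) ON DELETE CASCADE,
  data TEXT              -- JSON blob
);
"""

def open_db() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:", isolation_level=None)  # autocommit
    db.executescript(SCHEMA)
    db.execute("PRAGMA foreign_keys = ON")  # required for cascades in SQLite
    return db
```

Note that SQLite only enforces cascades when `foreign_keys` is enabled per connection.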

Messages use a rich type system (message-v2.ts), with message info and message parts stored separately:

interface UserMessageInfo {
  id: string
  sessionID: string
  role: "user"
  time: { created: number }
  agent: string
  model: { providerID: string; modelID: string }
  system?: string
  tools?: Record<string, boolean>
  summary?: { title?: string; body?: string; diffs: FileDiff[] }
}

interface AssistantMessageInfo {
  id: string
  sessionID: string
  role: "assistant"
  time: { created: number; completed?: number }
  parentID: string
  modelID: string
  providerID: string
  agent: string
  path: { cwd: string; root: string }
  cost: number
  tokens: {
    input: number
    output: number
    reasoning: number
    total?: number
    cache: { read: number; write: number }
  }
  summary?: boolean
  error?: unknown
}

MessagePart[] entries (TextPart, ToolPart, CompactionPart, PatchPart, etc.) are stored in PartTable and joined back to each message when streamed.

MessagePart is a discriminated union with 10+ variants:

  • TextPart — user/assistant text
  • ReasoningPart — model chain-of-thought (Claude thinking blocks)
  • ToolPart — tool call with state (pending | running | completed | error)
  • FilePart — attached files
  • CompactionPart — summary marker (critical for compaction system)
  • SubtaskPart — delegated task reference
  • SnapshotPart — filesystem state at that point
  • PatchPart — diff hashes for undo

This is far more structured than Aider’s {"role", "content"} dicts or Codex’s ResponseItem enum. The trade-off is complexity — serialization, migration, and querying all become harder.

Messages are loaded via an async generator that streams from SQLite in batches (message-v2.ts:716-809):

async function* stream(sessionID: string) {
  // Query MessageTable descending by time_created
  // Batch size: 50 messages per query
  // For each batch, join PartTable
  // Yield {info, parts} tuples
}

The generator yields messages newest-first. The consumer — filterCompacted() — scans backwards from the newest message and stops when it hits a completed compaction boundary (lines 794-809):

function filterCompacted(messages: Message[]) {
  // Walk newest → oldest
  // Track: has an assistant message with summary=true been "finished"?
  // When a user message matches the completed summary's ID → stop
  // Reverse the collected messages → return oldest-first
}

This is how OpenCode avoids loading the entire history on session resume. The compaction marker tells the loader “everything before this point has been summarized — stop here.” Only messages after the last compaction are loaded into the LLM’s context.
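
A sketch of the backward scan (the message shape here is assumed: dicts with a `summary`/`completed` flag on assistant summaries and a `parent_id` linking the summary to the user turn it covers; OpenCode's actual types are richer):

```python
def filter_compacted(newest_first: list[dict]) -> list[dict]:
    """Collect messages newest -> oldest, stopping at a compaction boundary."""
    kept = []
    boundary_user_id = None
    for msg in newest_first:
        kept.append(msg)
        if msg.get("role") == "assistant" and msg.get("summary") and msg.get("completed"):
            boundary_user_id = msg.get("parent_id")   # boundary found
        elif boundary_user_id and msg.get("id") == boundary_user_id:
            break                                     # everything older is compacted
    kept.reverse()                                    # return oldest-first
    return kept
```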

OpenCode’s compaction system (compaction.ts, 262 lines) has two phases:

Phase 1: Pruning (prune(), lines 58-99):

Before summarizing, prune large tool outputs that are unlikely to be relevant:

function prune(messages: Message[]) {
  // Walk backwards through messages
  // Keep a minimum of 2 user turns untouched
  // For each tool output:
  //   If accumulated tokens > PRUNE_PROTECT (40,000): mark as compacted
  //   Skip protected tools (e.g., "skill")
  // Only prune if total pruned > PRUNE_MINIMUM (20,000)
  // Mark pruned parts with compacted: Date.now()
}

Pruned tool outputs aren’t deleted — they’re marked with compacted: Date.now(). When the prompt builder encounters a compacted tool result, it replaces the content with "[Old tool result content cleared]" (line 620 of prompt.ts). The original data stays in SQLite for forensic purposes.
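
The mark-then-substitute pattern reduces to two small operations (hypothetical part shape; the placeholder string is the one quoted from prompt.ts):

```python
import time

def mark_pruned(part: dict) -> None:
    """Mark a tool-result part as compacted; the data itself is kept."""
    part["compacted"] = int(time.time() * 1000)   # marked, not deleted

def render_tool_result(part: dict) -> str:
    """Prompt-time view: pruned parts render as a placeholder."""
    if part.get("compacted"):
        return "[Old tool result content cleared]"
    return part["content"]
```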

Phase 2: Summarization (process(), lines 101-229):

After pruning, if the context is still too large, a full LLM-based summary is generated:

  1. Create an assistant message with summary: true flag
  2. Build a prompt from the conversation above the overflow point
  3. Call the LLM to generate a summary via SessionProcessor
  4. Store the summary as a CompactionPart in the assistant message
  5. Optionally auto-continue with a synthetic user message (lines 202-225)
  6. Publish SessionCompaction.Event.Compacted event

Overflow detection (isOverflow()):

function isOverflow(messages: Message[], model: Model) {
  const usable = model.limit.input - COMPACTION_BUFFER // 20,000 reserved
  const used = totalTokens(messages) // input + output
  return used > usable
}

The COMPACTION_BUFFER of 20,000 tokens ensures compaction triggers before the context window is completely full, leaving room for the next turn’s system prompt and user message.

Compaction is triggered automatically in the main prompt loop (prompt.ts:542-554):

// After each LLM response completes:
if (lastFinished && isOverflow(messages, model)) {
  await compact(sessionID)
  // Re-load messages (now with compaction marker)
  // Continue prompt loop
}

This runs synchronously between turns — unlike Aider’s background thread approach. The user sees a “Compacting…” indicator in the TUI while the summary is generated.

// Disable auto-compaction:
config.compaction?.auto === false

// Constants:
COMPACTION_BUFFER = 20_000 // tokens reserved before overflow
PRUNE_PROTECT = 40_000     // token threshold before pruning starts
PRUNE_MINIMUM = 20_000     // minimum tokens to justify pruning

Claude Code takes a fundamentally different approach to chat history management by leveraging server-side API features for compaction and context editing, a hierarchical memory system for persistent knowledge, and a checkpoint system for session-level undo.

Memory System: Persistent Context Beyond History


Unlike Aider/Codex/OpenCode which treat chat history as the primary persistence layer, Claude Code separates instructions (CLAUDE.md files) from session history (conversation transcript) from learned knowledge (auto memory).

CLAUDE.md hierarchy (six levels, loaded at session start):

| Level          | Location                             | Scope                             |
|----------------|--------------------------------------|-----------------------------------|
| Managed policy | /etc/claude-code/CLAUDE.md (Linux)   | Organization-wide                 |
| Project memory | ./CLAUDE.md or ./.claude/CLAUDE.md   | Team via VCS                      |
| Project rules  | ./.claude/rules/*.md                 | Team, path-scoped via frontmatter |
| User memory    | ~/.claude/CLAUDE.md                  | Personal, all projects            |
| Project local  | ./CLAUDE.local.md                    | Personal, current project         |
| Auto memory    | ~/.claude/projects/<project>/memory/ | Per-project, auto-generated       |

CLAUDE.md files above CWD are loaded in full at launch. Files in child directories load on-demand when Claude reads files in those subtrees. More specific instructions take precedence over broader ones.

Auto memory is a separate subsystem where Claude records its own learnings:

  • Stored at ~/.claude/projects/<project>/memory/ with MEMORY.md as an index file and optional topic files (e.g., debugging.md, api-conventions.md)
  • Only the first 200 lines of MEMORY.md are loaded into the system prompt at session start
  • Topic files are loaded on demand via standard file tools
  • Each git repo gets one memory directory; git worktrees get separate directories
  • Control via CLAUDE_CODE_DISABLE_AUTO_MEMORY=0 (force on) or =1 (force off)

CLAUDE.md imports: Files can reference other files with @path/to/import syntax, resolved relative to the containing file. Recursive imports up to depth 5. First-encounter approval dialog per project.

Project rules (.claude/rules/*.md): Modular, topic-specific instructions. Optional YAML frontmatter with paths: field for glob-based conditional activation. Subdirectories supported, recursively discovered. Symlinks allowed.

Claude Code tracks file edits as checkpoints — one per user prompt. Checkpoints are not conversation history management per se, but they interact with it via the rewind menu.

  • Every user prompt creates a checkpoint
  • Persists across sessions, auto-cleaned after 30 days (configurable)
  • Only tracks changes made by Claude’s file editing tools — NOT bash commands, NOT external changes
  • Complements (does not replace) version control

Rewind menu (via Esc + Esc or /rewind):

| Action                        | Code                 | Conversation                         |
|-------------------------------|----------------------|--------------------------------------|
| Restore code and conversation | Revert to checkpoint | Rewind to that message               |
| Restore conversation          | Keep current         | Rewind to that message               |
| Restore code                  | Revert to checkpoint | Keep current                         |
| Summarize from here           | No change            | Compress from selected point forward |

“Summarize from here” is a targeted compaction — it keeps early context in full detail and only compresses from the selected point forward. The original messages are preserved in the session transcript for reference. Accepts optional instructions to guide the summary focus.

Claude Code uses the Anthropic API’s server-side compaction (beta compact-2026-01-12) rather than implementing client-side summarization like Aider or OpenCode:

  1. API detects input tokens exceed trigger threshold (default 150K, minimum 50K)
  2. Generates summary of conversation
  3. Creates a compaction content block containing the summary
  4. Continues the response with compacted context
  5. On subsequent requests, API auto-drops all message blocks prior to the last compaction block

Configuration:

  • trigger: when to compact (input token threshold)
  • pause_after_compaction: return immediately after summary so client can preserve recent messages before continuing
  • instructions: custom summarization prompt (completely replaces default)

The pause_after_compaction option enables a pattern where the client preserves the last N messages verbatim alongside the compaction summary, avoiding information loss for recent context.

In addition to compaction, Claude Code uses the API’s context editing features (beta context-management-2025-06-27) for lighter-weight context management:

Tool result clearing (clear_tool_uses_20250919):

  • Trigger: configurable input token threshold (default 100K)
  • Keeps: configurable number of recent tool use/result pairs (default 3)
  • Cleared results replaced with placeholder text
  • exclude_tools: never-clear list for important tool types
  • clear_at_least: minimum tokens to clear (amortizes cache invalidation cost)

Thinking block clearing (clear_thinking_20251015):

  • Manages extended thinking blocks to save context space
  • Default: keep only last assistant turn’s thinking
  • Can keep N turns or "all" (maximizes prompt cache hits)

Both strategies are server-side — the client maintains full unmodified history while the API applies edits before the prompt reaches Claude.

Claude models (Sonnet 4.6, Sonnet 4.5, Haiku 4.5) receive their token budget at session start:

<budget:token_budget>200000</budget:token_budget>

After each tool call, remaining capacity is updated:

<system_warning>Token usage: 35000/200000; 165000 remaining</system_warning>

This is a model-training feature — the model is trained to use this information for planning tasks within its available context.

Claude Code exposes context state to users via a customizable status line — a shell script that receives JSON session data on stdin after each assistant message (debounced at 300ms). Key context-related fields:

  • context_window.used_percentage / remaining_percentage — calculated from input tokens
  • context_window.context_window_size — 200K default, 1M for extended context
  • context_window.current_usage — token counts from last API call (input, output, cache creation, cache read)
  • cost.total_cost_usd — accumulated session cost
  • exceeds_200k_tokens — boolean threshold indicator
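
A status-line script reduces to reading JSON from stdin and printing one line. The sketch below assumes the field names listed above; the full payload schema is larger, and the exact shape is an assumption here.

```python
import json
import sys

def render_status(payload: dict) -> str:
    """Format a one-line status from the session JSON Claude Code provides."""
    ctx = payload.get("context_window", {})
    cost = payload.get("cost", {})
    return "ctx {used}% used | ${usd:.2f}".format(
        used=ctx.get("used_percentage", 0),
        usd=cost.get("total_cost_usd", 0.0),
    )

def main() -> None:
    # Claude Code pipes the session JSON on stdin after each assistant message.
    print(render_status(json.load(sys.stdin)))
```
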

| Aspect | Aider | Codex | OpenCode | Claude Code |
|---|---|---|---|---|
| Compaction location | Client (background thread) | Client (post-response) | Client (synchronous) | Server-side API |
| Compaction trigger | History token budget exceeded | auto_compact_token_limit | isOverflow() post-response | API input token threshold (default 150K) |
| Pruning | None (summarize everything) | Truncate oldest items | Two-phase (prune tools, then summarize) | Server-side tool result clearing (configurable) |
| Token counting | Exact (litellm) | Byte heuristic (4 bytes/token) | Char heuristic (4 chars/token) + API-reported | Free API endpoint pre-send |
| Persistent memory | None | None | None | 6-level CLAUDE.md hierarchy + auto memory |
| Session undo | None | None | None | Checkpoint per prompt with rewind menu |
| Context awareness | None | None | None | Model-level budget injection |
| Thinking management | N/A | Encrypted reasoning | Reasoning parts | Server-side thinking block clearing |

When an LLM summarizes conversation history, it can subtly distort the intent. “The user asked to fix the login handler” might become “The user wants to update the authentication system” — close but not identical. Over multiple rounds of summarization (Aider’s recursive approach), these distortions compound. The model ends up working from a summary-of-a-summary that no longer accurately reflects what the user originally asked for.

OpenCode’s two-phase approach (prune tool outputs first, then summarize if needed) is empirically better than Aider’s summarize-everything approach. Most context bloat comes from large tool outputs (file reads, search results, command output) that are only relevant for the turn they were generated. Pruning these before summarization reduces the summary’s job to compressing actual conversation, not reams of file contents.

Aider’s Markdown history format is human-readable but machine-fragile. The #### separator breaks when assistant responses contain Markdown headers. The timestamp-based session boundaries don’t support querying by session ID. There’s no transaction safety — a crash mid-write can corrupt the file. SQLite (OpenCode) and JSONL with atomic writes (Codex) are more robust, at the cost of human readability.

Aider runs summarization in a background thread, which avoids blocking the user but introduces a race condition: if the user sends a message before summarization completes, the main thread blocks on thread.join(). OpenCode runs compaction synchronously between turns, which is simpler and avoids races but makes the user wait. Codex avoids the problem entirely by not summarizing.

Aider uses exact token counts via the model’s tokenizer (litellm). Codex uses a 4-byte heuristic for new items and exact counts from the API for prior items. OpenCode uses exact API-reported tokens for completed turns and heuristics for in-progress content. The heuristic approaches are faster but can be off by 10-20%, which means compaction triggers earlier or later than optimal. For models with large context windows (200k+), this margin is acceptable. For smaller windows (8k-32k), exact counting matters more.

Aider’s “restore chat history” replays the entire Markdown file into done_messages, then summarizes if too large. This means resuming a long session triggers an immediate summarization that can take 10+ seconds. Codex’s JSONL lookup supports random access by offset but doesn’t reconstruct the conversation context — it’s a history browser, not a session resume mechanism. OpenCode’s compaction markers allow efficient resume: load messages after the last compaction boundary, and the summary message provides the compressed context for everything before.

OpenCode’s SQLite schema uses cascade deletes — deleting a session deletes all its messages and parts. This is clean but irreversible. There’s no soft-delete, no trash, no recovery mechanism. Codex’s JSONL file can be manually edited to recover deleted entries. Aider’s Markdown file is append-only and never deletes anything.

Claude Code’s server-side compaction requires an additional sampling step — the summary generation is billed as a separate API call within the same request. The usage.iterations array separates compaction costs from message costs, but developers must sum across all iterations for accurate billing. Top-level input_tokens/output_tokens exclude compaction iterations, which can be misleading.

Claude Code’s checkpointing only tracks file editing tools. If Claude runs rm file.txt or mv old.txt new.txt via bash, those changes are invisible to the checkpoint system and cannot be undone via rewind. This creates a false sense of safety when Claude uses shell commands for file operations.

Auto memory’s 200-line limit on MEMORY.md is a hard cutoff. If the index file grows beyond that, content is silently dropped from the system prompt. Claude is instructed to keep it concise by moving details to topic files, but there’s no enforcement mechanism — a verbose auto-save could push important content past the limit.

The pause_after_compaction pattern (return after summary, client preserves recent messages, re-send) adds a second API round-trip. If the client doesn’t handle the stop_reason: "compaction" case, the conversation breaks. This is an easy integration bug in headless/CI deployments.


OpenOxide should use a three-layer system:

  1. In-memory: A Vec<Message> ordered oldest-to-newest, with efficient append and backward scan
  2. On-disk: SQLite for structured persistence with message/part separation
  3. Compaction: Two-phase (prune tool outputs, then LLM summarization) with marker-based filtering

// SQLite tables

struct Session {
    id: Ulid,
    project_id: String,
    parent_id: Option<Ulid>,    // for forked sessions
    title: String,
    directory: PathBuf,
    created_at: i64,
    updated_at: i64,
    compacted_at: Option<i64>,  // last compaction timestamp
}

struct Message {
    id: Ulid,
    session_id: Ulid,
    role: Role,                 // User | Assistant
    created_at: i64,
    completed_at: Option<i64>,
}

struct Part {
    id: Ulid,
    message_id: Ulid,
    session_id: Ulid,           // denormalized for efficient session queries
    kind: PartKind,             // Text | Reasoning | ToolCall | ToolResult | Summary | Snapshot
    data: serde_json::Value,    // JSON blob
    compacted_at: Option<i64>,  // None = live, Some = pruned
    created_at: i64,
}

| Crate       | Purpose                                         |
|-------------|-------------------------------------------------|
| rusqlite    | SQLite bindings (WAL mode for concurrent reads) |
| ulid        | Time-ordered unique IDs                         |
| serde_json  | Part data serialization                         |
| tiktoken-rs | Exact token counting for OpenAI models          |

Follow OpenCode’s two-phase approach:

pub async fn compact(session_id: Ulid, model: &Model) -> Result<()> {
    let messages = load_after_last_compaction(session_id).await?;

    // Phase 1: Prune old tool outputs
    let pruned_tokens = prune_tool_outputs(&messages, PruneConfig {
        protect_threshold: 40_000, // start pruning after this many tokens
        minimum_savings: 20_000,   // don't bother if savings < this
        keep_recent_turns: 2,      // always keep the last 2 user turns
    }).await?;

    // Phase 2: Summarize if still over budget
    let budget = model.context_window - COMPACTION_BUFFER;
    if estimate_tokens(&messages) > budget {
        let summary = generate_summary(&messages, model).await?;
        insert_compaction_marker(session_id, summary).await?;
    }
    Ok(())
}

Use a streaming loader that stops at compaction boundaries:

pub async fn load_context(session_id: Ulid) -> Vec<Message> {
    // Query messages descending by created_at
    // Stop when a completed compaction marker is found
    // Reverse to oldest-first order
    // Replace compacted tool outputs with placeholder text
}

Use exact counting for the primary model’s tokenizer, with a 4-byte fallback for unknown models:

pub fn count_tokens(text: &str, model: &str) -> usize {
    match tiktoken_rs::get_bpe_from_model(model) {
        Ok(bpe) => bpe.encode_with_special_tokens(text).len(),
        Err(_) => text.len() / 4, // fallback heuristic
    }
}

Run compaction in a background tokio::task with a channel-based result delivery, similar to Aider’s threading but with proper cancellation:

let (tx, mut rx) = oneshot::channel();
tokio::spawn(async move {
    let result = compact(session_id, &model).await;
    let _ = tx.send(result);
});

// Before the next LLM call:
if let Ok(result) = rx.try_recv() {
    // Apply compaction result
} else {
    // Compaction still running — block or proceed with current context
}

This gives the user the time between turns for compaction to complete (like Aider) while supporting cancellation via tokio’s cooperative cancellation (unlike Aider’s uninterruptible thread).

Hierarchical Memory System (from Claude Code)


Adopt Claude Code’s multi-level memory hierarchy:

pub struct MemoryConfig {
    /// System-wide managed policy (e.g., /etc/openoxide/MEMORY.md)
    pub managed_policy: Option<PathBuf>,
    /// Project memory (./OPENOXIDE.md or ./.openoxide/MEMORY.md)
    pub project_memory: Option<PathBuf>,
    /// Project rules (./.openoxide/rules/*.md, with optional path globs)
    pub project_rules: Vec<RuleFile>,
    /// User memory (~/.openoxide/MEMORY.md)
    pub user_memory: Option<PathBuf>,
    /// Project local (./OPENOXIDE.local.md, gitignored)
    pub project_local: Option<PathBuf>,
    /// Auto memory (~/.openoxide/projects/<project>/memory/)
    pub auto_memory: AutoMemoryConfig,
}

pub struct RuleFile {
    pub path: PathBuf,
    pub path_globs: Option<Vec<String>>, // YAML frontmatter `paths`
    pub content: String,
}

pub struct AutoMemoryConfig {
    pub dir: PathBuf,
    pub index_file: PathBuf,       // MEMORY.md
    pub index_line_limit: usize,   // 200
    pub topic_files: Vec<PathBuf>, // loaded on demand
}

Loading strategy:

  1. At session start: load managed policy + project + user + local + auto memory index (first N lines)
  2. On file access in child directory: load any MEMORY.md found there
  3. On path match: activate path-scoped rules

pub struct Checkpoint {
    pub id: Ulid,
    pub session_id: Ulid,
    pub message_id: Ulid,            // the user prompt that triggered this checkpoint
    pub file_states: Vec<FileState>, // snapshot of files before the agent's edits
    pub created_at: i64,
    pub ttl_days: u32,               // default 30
}

pub struct FileState {
    pub path: PathBuf,
    pub content_hash: [u8; 32],   // SHA-256 of file content
    pub content: Option<Vec<u8>>, // stored if file was modified
}

pub enum RewindAction {
    RestoreCodeAndConversation,
    RestoreConversation,
    RestoreCode,
    SummarizeFromHere { instructions: Option<String> },
}

Only track files modified by the agent’s file editing tools, not shell commands. Store file content before each edit for restoration.

Server-Side Compaction Support (from Claude Code)


When the provider API supports server-side compaction (like Anthropic’s compact_20260112), prefer it over client-side compaction:

pub enum CompactionStrategy {
    /// Use the provider's server-side compaction API
    ServerSide {
        trigger_tokens: usize,        // default 150_000
        pause_after: bool,            // pause to preserve recent messages
        instructions: Option<String>, // custom summarization prompt
    },
    /// Client-side two-phase (prune + summarize)
    ClientSide {
        prune_config: PruneConfig,
        summary_model: Option<String>,
    },
    /// No compaction
    Disabled,
}

Server-side is preferred because:

  1. No extra client logic for summarization
  2. Summary quality benefits from provider-side optimization
  3. Integrated with prompt caching (compaction blocks can be cache-controlled)
  4. Usage tracking via iterations array in response