Streaming
Streaming is the pipeline that delivers tokens from the API to the user’s screen in real time. It involves SSE parsing, token accumulation, markdown rendering, tool call detection, and display throttling. Each reference tool takes a different approach: Aider uses a Python generator with Rich markdown rendering, Codex uses a Rust async stream with newline-gated buffering, and OpenCode uses the Vercel AI SDK’s event-driven model with database-backed persistence.
For reasoning-token stream semantics, see Reasoning & Thinking Tokens. For overflow and truncation pressure caused by long streams, see Token Budgeting and Chat History. For the render-layer details beneath this pipeline, see Markdown Rendering.
Aider Implementation
Pinned commit: b9050e1d
Aider’s streaming is a synchronous Python generator pipeline: the litellm completion returns an iterable, and Aider iterates over it, accumulating content and rendering markdown to the terminal.
The Streaming Path
Entry: `send()` at `aider/coders/base_coder.py:1783`
```python
def send(self, messages, model=None, functions=None):
    self.partial_response_content = ""
    self.partial_response_function_call = dict()

    completion = model.send_completion(
        messages, functions, self.stream, self.temperature
    )

    if self.stream:
        yield from self.show_send_output_stream(completion)
    else:
        self.show_send_output(completion)
```

`model.send_completion()` returns a streaming completion object when `self.stream` is True: a generator that yields OpenAI-compatible chunks.
Token Accumulation: show_send_output_stream()
File: `aider/coders/base_coder.py:1900`
Iterates over the completion generator:
```python
for chunk in completion:
    # Extract text delta
    text = chunk.choices[0].delta.content
    if text:
        self.partial_response_content += text

    # Extract function call delta (for function-calling coders)
    func = chunk.choices[0].delta.function_call
    if func:
        for k, v in func.items():
            if k in self.partial_response_function_call:
                self.partial_response_function_call[k] += v
            else:
                self.partial_response_function_call[k] = v

    # Extract reasoning content (thinking tags)
    # ...

    # Render incrementally
    self.live_incremental_response(False)

    # Check for finish_reason == "length" (context exhausted)
    if chunk.choices[0].finish_reason == "length":
        raise FinishReasonLength()
```

Markdown Rendering: MarkdownStream
File: `aider/mdstream.py:92`
The MarkdownStream class manages real-time markdown display using Rich’s Live display:
Key state:
- `self.printed` — lines already emitted to console scrollback (stable)
- `self.live` — Rich `Live` instance for the animated window
- `self.live_window = 6` — number of lines kept in the animated window
Update logic (`update()` at line 149):

- On first call, start Rich `Live` with a `1.0 / min_delay` refresh rate (20 FPS default)
- Throttle updates to maintain smooth rendering (lines 174-175)
- Measure render time and adjust `min_delay` dynamically (lines 179-184): `min_delay = max(0.05, min(2.0, render_time * 10))`
- Render the full accumulated markdown buffer through Rich’s markdown parser
- Split output into “stable” (older lines) and “unstable” (last 6 lines)
- Emit stable lines via `console.print()` — these go to terminal scrollback and cannot change
- Update the `Live` window with the unstable lines — these are re-rendered each frame
The 6-line live window is the key insight: markdown rendering can change earlier output as new tokens arrive (e.g., a heading becomes a code block fence), so only the last few lines are kept “live” for re-rendering. Lines above the window are committed permanently.
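To make the split concrete, here is a minimal Python sketch (hypothetical, not Aider’s actual code) of committing all but the last six rendered lines to scrollback:

```python
LIVE_WINDOW = 6  # mirrors Aider's self.live_window

def split_stable_unstable(rendered_lines, already_printed):
    """Split rendered markdown lines into lines safe to commit to
    scrollback and lines kept in the re-renderable live window.

    `already_printed` is how many lines were committed on earlier calls.
    """
    # Everything except the last LIVE_WINDOW lines is considered stable.
    stable_count = max(already_printed, len(rendered_lines) - LIVE_WINDOW)
    newly_stable = rendered_lines[already_printed:stable_count]
    unstable = rendered_lines[stable_count:]
    return newly_stable, unstable

# Example: 10 rendered lines, 2 already committed.
lines = [f"line {i}" for i in range(10)]
new_stable, live = split_stable_unstable(lines, already_printed=2)
# new_stable holds lines 2..3; live holds the last 6 lines.
```

Only `newly_stable` reaches the terminal permanently; `unstable` is re-rendered every frame so late-arriving tokens can still reshape it.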
Non-Streaming Path
show_send_output() (line 1836) handles non-streaming responses by extracting the full content, function calls, and reasoning content from the completed response object in one pass.
Streaming State
```python
self.partial_response_content = ""        # accumulated text
self.partial_response_function_call = {}  # accumulated function call
self.got_reasoning_content = False        # seen <think> tag
self.ended_reasoning_content = False      # seen </think> tag
```

Codex Implementation
Pinned commit: 4ab44e2c5
Codex streams from the OpenAI Responses API via SSE, processes events in a tokio async loop, and renders to the TUI using a newline-gated buffering system.
SSE Stream Processing
File: `codex-rs/core/src/codex.rs:5505`
try_run_sampling_request() calls client_session.stream() to get an async SSE stream, then processes events in a loop:
```rust
while let Some(event) = stream.next().await {
    match event {
        ResponseEvent::Created => { /* no-op */ }
        ResponseEvent::OutputItemAdded(item) => {
            // Emit TurnItemStarted
        }
        ResponseEvent::OutputTextDelta(delta) => {
            // Emit AgentMessageContentDelta
        }
        ResponseEvent::OutputItemDone(item) => {
            // Tool call extraction and dispatch
            if let Some(tool_call) = extract_tool_call(&item) {
                let future = tool_runtime.handle_tool_call(tool_call, cancel.clone());
                in_flight.push_back(future);
            }
        }
        ResponseEvent::Completed { usage, .. } => {
            // Update token tracking, break
        }
        ResponseEvent::ReasoningSummaryDelta(delta) => {
            // Emit reasoning event
        }
    }
}
```

Tool calls discovered in `OutputItemDone` events are dispatched immediately via `ToolCallRuntime`, with futures collected in a `FuturesOrdered`. After the stream completes, `drain_in_flight()` waits for all tool results.
Newline-Gated Buffering
File: `codex-rs/tui/src/markdown_stream.rs:7`
```rust
pub(crate) struct MarkdownStreamCollector {
    buffer: String,
    committed_line_count: usize,
    width: Option<usize>,
}
```

The core innovation: tokens are accumulated in a buffer, but lines are only “committed” (rendered and emitted to the TUI) when a newline character is encountered.
push_delta(delta: &str) — appends to the buffer string.
`commit_complete_lines()` (line 35):

- Find the last newline position in `buffer`
- If no newline exists, return empty (nothing to commit)
- Render the entire buffer up to the newline as markdown
- Count logical rendered lines
- Return only the newly completed lines (those after `self.committed_line_count`)
- Increment `committed_line_count`
This prevents partial markdown from reaching the display. A header like ## He will not render until the full line ## Hello World\n is received.
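The gating logic can be sketched in Python (Codex’s real implementation is Rust and renders markdown before counting lines; this sketch skips rendering):

```python
class NewlineGatedCollector:
    """Sketch of newline-gated buffering: lines are only released
    for rendering once a '\\n' completes them."""

    def __init__(self):
        self.buffer = ""
        self.committed_line_count = 0

    def push_delta(self, delta: str) -> None:
        self.buffer += delta

    def commit_complete_lines(self) -> list:
        # Find the last newline; nothing to commit without one.
        idx = self.buffer.rfind("\n")
        if idx == -1:
            return []
        complete = self.buffer[: idx + 1].split("\n")[:-1]
        new_lines = complete[self.committed_line_count:]
        self.committed_line_count = len(complete)
        return new_lines

c = NewlineGatedCollector()
c.push_delta("## He")
assert c.commit_complete_lines() == []  # partial heading held back
c.push_delta("llo World\nBody text")
assert c.commit_complete_lines() == ["## Hello World"]  # only the full line
```

Note how `Body text` stays in the buffer until its own newline arrives.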
Stream Controller
File: `codex-rs/tui/src/streaming/controller.rs:14`
```rust
pub(crate) struct StreamController {
    state: StreamState,
    finishing_after_drain: bool,
    header_emitted: bool,
}
```

The controller manages the flow from committed lines to TUI cells:
`push(delta)` (line 30):

- Appends the delta to the collector
- If the delta contains `\n`: commits complete lines, enqueues them with timestamps
- Returns `true` if new lines are available for display
`on_commit_tick()` (line 71):

- Pops one line from the front of the queue
- Converts it to an `AgentMessageCell` (a ratatui `Line`)
- Returns `(Option<HistoryCell>, bool idle)`
- Called once per frame — rate-limits output to the terminal refresh rate
`finalize()` (line 47):

- Flushes remaining content (adds a temporary `\n` if needed)
- Drains the entire queue
- Emits the final cell
Stream State
File: `codex-rs/tui/src/streaming/mod.rs:29`
```rust
pub(crate) struct StreamState {
    pub(crate) collector: MarkdownStreamCollector,
    queued_lines: VecDeque<QueuedLine>,
    pub(crate) has_seen_delta: bool,
}
```

`QueuedLine` includes an `enqueued_at: Instant` timestamp, enabling age-based drain policies for adaptive rendering.
Data Flow Summary
```
API SSE delta "Hello"  -> push("Hello")  -> no newline -> no display
API SSE delta " world" -> push(" world") -> no newline -> no display
API SSE delta "!\n"    -> push("!\n")    -> newline found
  -> commit_complete_lines() -> renders "Hello world!" as markdown -> enqueue
Next render tick -> on_commit_tick() -> dequeue one line -> display
```

OpenCode Implementation
Pinned commit: 7ed449974
OpenCode uses the Vercel AI SDK’s streamText() which provides a standardized event-driven stream across all providers. Events are processed in a SessionProcessor and persisted to SQLite in real-time.
LLM Stream Entry
File: `packages/opencode/src/session/llm.ts:176`
```ts
const result = streamText({
  model: languageModel,
  system: systemPrompts,
  messages: convertedMessages,
  tools: resolvedTools,
  temperature,
  topP,
  topK,
  maxOutputTokens,
  abortSignal,
  // ...
})
```

The AI SDK handles SSE parsing, provider-specific format translation, and tool call accumulation internally. It exposes a `result.fullStream` async iterable that yields typed events.
Event Processing
File: `packages/opencode/src/session/processor.ts:55`
The processor iterates stream.fullStream:
| Event | Processing |
|---|---|
| `start` | Set `SessionStatus` to “busy” |
| `text-start` | Create `TextPart`, persist via `Session.updatePart()` |
| `text-delta` | Accumulate: `currentText.text += value.text`; emit `Session.updatePartDelta()` |
| `text-end` | Finalize the text part, fire plugin hook `"experimental.text.complete"` |
| `reasoning-start` | Create `ReasoningPart` with timestamps |
| `reasoning-delta` | Accumulate reasoning text, emit delta |
| `reasoning-end` | Finalize the reasoning part |
| `tool-input-start` | Create `ToolPart` with status “pending” |
| `tool-call` | Transition to “running”, check the doom loop guard |
| `tool-result` | Transition to “completed” with output, timing, metadata |
| `tool-error` | Transition to “error”, check for permission rejection |
| `start-step` | Reset step state |
| `finish-step` | Record finish reason, update token usage, check overflow |
| `error` | Throw through |
| `finish` | No-op (handled by `finish-step`) |
Delta Broadcasting
File: `packages/opencode/src/session/message-v2.ts:465`
Text deltas are broadcast via the event bus:
```ts
PartDelta: BusEvent.define(
  "message.part.delta",
  z.object({
    sessionID: z.string(),
    messageID: z.string(),
    partID: z.string(),
    field: z.string(),
    delta: z.string(),
  })
)
```

The TUI subscribes to `"message.part.delta"` events and renders them incrementally. The HTTP/SSE server also subscribes and forwards to web clients.
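The fan-out pattern can be illustrated with a tiny synchronous bus sketch (hypothetical; OpenCode’s actual bus is async and zod-typed):

```python
from collections import defaultdict

class Bus:
    """Tiny synchronous event bus: every subscriber to a topic
    receives each published delta independently (TUI, SSE server, ...)."""

    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, payload):
        for handler in self.subscribers[topic]:
            handler(payload)

bus = Bus()
tui_seen, web_seen = [], []
bus.subscribe("message.part.delta", tui_seen.append)
bus.subscribe("message.part.delta", web_seen.append)
bus.publish("message.part.delta",
            {"partID": "p1", "field": "text", "delta": "Hel"})
assert tui_seen == web_seen  # both consumers saw the same delta
```

The key property is that the producer never knows how many consumers exist, so adding a new renderer costs nothing on the streaming path.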
Token Usage Tracking
At `finish-step` (line 244), the processor:
- Records `finishReason` on the assistant message
- Updates token usage via `Session.getUsage()` (computes cumulative totals from the provider response)
- Captures a snapshot patch if files were modified
- Checks `SessionCompaction.isOverflow()` and sets `needsCompaction = true` if exceeded
Retry Integration
File: `packages/opencode/src/session/retry.ts`
The processor wraps stream iteration in a try/catch. On retryable errors:
- Check `SessionRetry.retryable(error)` — respects the `isRetryable` flag and rate limit headers
- Compute the delay: `SessionRetry.delay(attempt, error)` — exponential backoff (2s base, 2x factor, 30s cap) or the `retry-after` header
- Set `SessionStatus` to “retry” with the attempt count and next retry timestamp
- Sleep and retry
Non-retryable errors (context overflow, auth errors) break the loop immediately.
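Under the stated parameters (2s base, 2x factor, 30s cap), the delay schedule can be sketched as follows (the function name and the treatment of attempt 0 are assumptions, not OpenCode’s exact code):

```python
def retry_delay(attempt, retry_after=None):
    """Exponential backoff: 2s base, 2x factor, capped at 30s.
    A server-provided retry-after value takes precedence."""
    if retry_after is not None:
        return retry_after
    return min(30.0, 2.0 * (2.0 ** attempt))

assert retry_delay(0) == 2.0
assert retry_delay(1) == 4.0
assert retry_delay(3) == 16.0
assert retry_delay(10) == 30.0  # capped
assert retry_delay(5, retry_after=7.0) == 7.0
```

The cap matters: without it, a handful of consecutive failures would push waits past several minutes.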
Persistence Model
Most stream state transitions are persisted with `Session.updatePart()` (an upsert on `PartTable`) and finalized with `Session.updateMessage()`:
- `Session.updatePart()` — creates or updates text/reasoning/tool/step/patch parts
- `Session.updatePartDelta()` — emits `message.part.delta` on the bus only (no DB write)
- `Session.updateMessage()` — writes assistant message state (finish/error/tokens/cost)
This gives durable turn state while still keeping high-frequency token deltas off the database write path.
Pitfalls & Hard Lessons
Partial Markdown Rendering
Aider’s 6-line live window and Codex’s newline-gating both solve the same problem: partial markdown tokens can produce invalid rendering. A half-received `## Hel` renders as a heading in markdown, but becomes body text when completed as `## Hello World\n\nSome text`. Aider solves this by keeping recent lines in a re-renderable Live window. Codex solves it by never committing incomplete lines. OpenCode delegates to the TUI’s markdown renderer, which handles partial content gracefully.
Throttling vs Latency
Rendering every token at 60 FPS is wasteful and can cause UI flickering. Aider dynamically adjusts its throttle based on measured render time (lines 179-184 of `mdstream.py`). Codex rate-limits output to one line per frame tick via `on_commit_tick()`. OpenCode relies on the event bus and React/Solid reconciliation to batch updates.
Streaming Tool Calls
Tool call arguments arrive as fragments across multiple `tool-input-delta` events. All three tools accumulate these fragments before dispatching:
- Aider: `self.partial_response_function_call[k] += v`
- Codex: accumulated in the SSE parser before `OutputItemDone` is emitted
- OpenCode: the AI SDK handles accumulation internally and emits a complete `tool-call` event
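Aider’s merge step can be sketched as a standalone function (illustrative; the field names follow the OpenAI function-call delta shape):

```python
def accumulate_tool_call(partial, delta):
    """Merge a streamed function-call delta into the accumulated call.
    String fields (name, arguments) arrive in fragments and are
    concatenated; unseen keys are initialized from the first fragment."""
    for key, fragment in delta.items():
        if key in partial:
            partial[key] += fragment
        else:
            partial[key] = fragment
    return partial

call = {}
accumulate_tool_call(call, {"name": "read_file", "arguments": '{"pa'})
accumulate_tool_call(call, {"arguments": 'th": "a.py"}'})
assert call == {"name": "read_file", "arguments": '{"path": "a.py"}'}
```

Note that `arguments` is not valid JSON until the final fragment lands, which is exactly why dispatch must wait for the complete call.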
Stream Interruption Recovery
If the SSE connection drops mid-stream:
- Aider: the Python exception propagates to the retry loop, which applies exponential backoff and re-sends the entire prompt
- Codex: a `ResponseStreamFailed` error is classified as retryable and can fall back from WebSocket to HTTPS transport
- OpenCode: an `APIError` with `isRetryable: true`; respects `retry-after` headers
All three restart from scratch — none support resuming a partial stream.
Token Count Consistency
Token usage is only available after the stream completes (the `Completed` / `finish-step` event includes it). During streaming, tools must rely on heuristic estimates (Codex: 4 bytes/token, OpenCode: 4 chars/token, Aider: litellm estimate). This means overflow detection is always slightly delayed.
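The byte-based and char-based heuristics diverge on non-ASCII text, which a short sketch makes concrete (illustrative helper names, not any tool’s API):

```python
def estimate_tokens_chars(text):
    """~4 characters per token (OpenCode-style heuristic)."""
    return len(text) // 4

def estimate_tokens_bytes(text):
    """~4 bytes per token (Codex-style heuristic); differs from the
    char-based estimate for multi-byte UTF-8 text."""
    return len(text.encode("utf-8")) // 4

ascii_text = "a" * 400
assert estimate_tokens_chars(ascii_text) == 100
assert estimate_tokens_bytes(ascii_text) == 100

cjk_text = "語" * 400  # 3 bytes per character in UTF-8
assert estimate_tokens_chars(cjk_text) == 100
assert estimate_tokens_bytes(cjk_text) == 300
```

Either heuristic is only a guard rail; the authoritative count still arrives with the final usage event.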
Database Write Pressure
OpenCode writes every tool state transition to SQLite during streaming. Under heavy tool use (many grep/read calls), this creates write pressure. WAL mode helps, but the per-event write pattern is inherently heavier than Aider’s in-memory accumulation or Codex’s deferred JSONL persistence.
OpenOxide Blueprint
Architecture: Async Stream with Event Emission
OpenOxide combines Codex’s newline-gated buffering with OpenCode’s event-driven broadcasting:
```rust
pub struct StreamProcessor {
    collector: MarkdownStreamCollector,          // from Codex
    tx_event: Sender<StreamEvent>,               // from OpenCode
    token_accumulator: String,
    tool_calls: HashMap<String, ToolCallState>,
}
```

SSE Parsing
Use `reqwest-eventsource` or a custom SSE parser to consume the API stream. Events are parsed into a `ResponseEvent` enum following Codex’s pattern. The parser handles reconnection for transient network errors.
Newline-Gated Rendering
Adopt Codex’s `MarkdownStreamCollector` pattern:
- Accumulate text deltas in a buffer
- On newline: render accumulated markdown, emit completed lines
- On finalize: flush remaining content
This prevents partial markdown from reaching the display.
Event Broadcasting
Stream events are emitted to an async broadcast channel:
```rust
pub enum StreamEvent {
    TextDelta { part_id: String, delta: String },
    TextComplete { part_id: String, text: String },
    ReasoningDelta { part_id: String, delta: String },
    ToolCallStart { call_id: String, tool_name: String },
    ToolCallResult { call_id: String, output: ToolOutput },
    ToolCallError { call_id: String, error: String },
    TokenUsage { input: u32, output: u32, cached: u32 },
    Finished { reason: FinishReason },
}
```

The TUI subscribes and renders. The MCP server subscribes and forwards to clients. The session store subscribes and persists. Each consumer operates independently.
Throttled Display
The TUI uses a frame-rate limiter (60 FPS cap) and drains one committed line per tick from the queue, following Codex’s `on_commit_tick()` pattern. This prevents render thrashing while maintaining perceived responsiveness.
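A minimal sketch of such a frame gate (a Python illustration; the blueprint itself would implement this in Rust):

```python
class FrameLimiter:
    """Allow at most one render tick per frame interval (e.g. 60 FPS cap).
    Callers pass a monotonic timestamp; the limiter is clock-agnostic."""

    def __init__(self, fps=60.0):
        self.interval = 1.0 / fps
        self.last_tick = float("-inf")

    def should_tick(self, now):
        if now - self.last_tick >= self.interval:
            self.last_tick = now
            return True
        return False

lim = FrameLimiter(fps=60.0)
assert lim.should_tick(0.0) is True    # first tick always allowed
assert lim.should_tick(0.005) is False # < ~16.7 ms since last tick
assert lim.should_tick(0.020) is True  # next frame boundary passed
```

Pairing this gate with the one-line-per-tick queue drain bounds both the render frequency and the per-frame work.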
Tool Call Accumulation
Tool call arguments are accumulated internally by the SSE parser. Only complete tool calls (after the `OutputItemDone` equivalent) are dispatched to the tool registry.
Crates
| Crate | Responsibility |
|---|---|
| `openoxide-stream` | SSE parser, `ResponseEvent` enum, `StreamProcessor` |
| `openoxide-markdown` | `MarkdownStreamCollector`, newline-gated buffering |
| `openoxide-tui` | `StreamController`, frame-rate limiter, widget rendering |
| `openoxide-provider` | API client, transport (HTTPS/WebSocket), retry logic |
Key Design Decisions
- Newline-gated buffering. Prevents partial markdown rendering artifacts.
- Event broadcast channel. Decouples streaming from display — TUI, MCP, and persistence are independent subscribers.
- Frame-rate limiter. One line per tick prevents render thrashing.
- No mid-stream persistence. Unlike OpenCode, persist only at turn boundaries (completed messages and tool results). Reduces write pressure, accepts that a crash mid-stream loses the partial turn.
- Transport fallback. WebSocket preferred for lower latency, HTTPS SSE as fallback.