EVOKE

oss

OS-like memory management for the LLM KV cache.

status: in-progress
started: 2025
role: creator
stack: Python, Rust, C++, llama.cpp fork, CUDA, Qwen

Long-running LLM agent sessions outgrow the physical KV cache budget within a few turns. EVOKE evicts low-relevance blocks under budget pressure and recovers them recompute-free via a custom save/restore primitive in a forked llama.cpp: 20–32× faster than re-prefilling the same tokens.

Demos

Eviction demo on Qwen 2.5

A 14-turn session with a 1024-token budget. A fact is planted at turn 1 (favorite number = 4242), 12 unrelated knowledge questions fill the session, and at turn 14 the fact is probed. The session survives 40 evictions and 13 recoveries, and the model recalls “4242”.

Eviction demo on Qwen 3.5

Same demo on a hybrid Mamba/Attention architecture. The model emits a <think>...</think> trace each turn. With EVOKE_SUPPRESS_THINKING_STRIP=1 the server keeps the thinking trace in the returned content so the cached state stays aligned with what the client echoes back, and no session resets fire. 26 evictions, 4 recoveries, fact recalled.

What it actually is

  • Two new C++ primitives in a forked llama.cpp: llama_kv_block_save and llama_kv_block_load. They serialise a position range’s K/V tensors to a host buffer and splice them back with per-cell RoPE re-anchoring, with no llama_decode call.
  • A third C primitive, llama_attn_capture_*, that taps per-head softmax attention weights from one or more chosen layers (up to 16) into a host buffer once per decode. The relevance scorer uses it to see what the model is actually attending to.
  • A Python policy layer (evoke/manager.py, evoke/scorer.py, evoke/attention_scorer.py) that drives eviction under a watermark policy via a multi-signal scorer (model attention + harness priority + task-focus coherence + recency) and routes recovery through three pluggable backends: discard, breadcrumb, or kv_restore (the recompute-free splice).
  • An OpenAI-compatible chat-completions server that exposes EVOKE as a stateful endpoint. The persistent KV cache survives across requests; only the new tail of each prompt is decoded.
  • Cross-architecture coverage: pure attention with standard RoPE (Qwen 2.5, Llama 3), hybrid Mamba/Attention (Qwen 3.5), MoE attention with mrope and thinking mode (Qwen 3.6 35B-A3B).
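The watermark policy described above can be sketched roughly as follows. This is a minimal illustration, not the code in evoke/manager.py: the Block dataclass, the select_evictions name, and the high/low watermark fractions are all assumptions made for the example. It only shows the general shape: once active tokens cross the high watermark, unpinned blocks are evicted lowest score first until the low watermark is reached.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical block record; field names are illustrative only."""
    block_id: int
    n_tokens: int
    score: float          # relevance score in [0, 1]
    pinned: bool = False  # pinned blocks are never evicted


def select_evictions(blocks, budget, high=1.0, low=0.75):
    """Return ids of blocks to evict, lowest score first.

    Eviction starts once active tokens exceed high * budget and stops
    when they drop to low * budget or no evictable block remains.
    The high/low fractions are assumed values, not EVOKE defaults.
    """
    active = sum(b.n_tokens for b in blocks)
    if active <= high * budget:
        return []
    target = low * budget
    victims = []
    evictable = sorted((b for b in blocks if not b.pinned), key=lambda b: b.score)
    for b in evictable:
        if active <= target:
            break
        victims.append(b.block_id)
        active -= b.n_tokens
    return victims


blocks = [
    Block(0, 300, 0.9, pinned=True),
    Block(1, 300, 0.2),
    Block(2, 300, 0.5),
    Block(3, 300, 0.1),
]
print(select_evictions(blocks, budget=1024))
```

With a 1024-token budget and 1200 active tokens, the two lowest-scoring unpinned blocks are selected, which brings the active count under the assumed low watermark. What happens to the evicted blocks (discard, breadcrumb, or kv_restore) is a separate decision and is not modelled here.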

How does the system know what’s relevant?

Four signals, combined into a per-block score in [0, 1]. Lowest scores get evicted first when the cache exceeds budget.

  • The model’s own attention. A second softmax for one or more chosen transformer layers runs alongside the main attention path, writing per-head softmax weights to a host buffer once per decode. The scorer maintains a sliding window of recent attention mass per block (last 64 decode steps, EWMA decay 0.95). Blocks the model is actually attending to score high. This is the strongest single signal: it is the most direct answer to “what’s relevant right now.”
  • Harness-supplied priority tags. A coding harness like opencode or Claude Code can set evoke_priority (a float multiplier) and evoke_pinned (boolean, excluded from eviction entirely) on each chat request. Useful when the harness knows things the model can’t see: a file read is the central artifact of the current task; a tool scratch output is one-shot. Defaults to 1.0 / false.
  • Task-focus coherence. The scorer tracks a single task-focus embedding that updates via EMA on new user messages but snaps to the new message when a topic shift is detected (cosine drop below 0.3) or signaled by the harness via evoke_task_boundary=true. Blocks from a prior task lose their coherence score in one pass instead of decaying over five turns.
  • Recency, sink protection, source-type floors. Stability priors: prevent thrashing on a single attention spike; protect StreamingLLM-style sink tokens; give USER and ASSISTANT turns a floor so the conversation backbone isn’t evicted before document content.
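The task-focus behaviour in the third signal (EMA update, with a snap on a detected or signaled topic shift) might look something like the sketch below. The TaskFocus class, its method names, and the EMA rate alpha=0.3 are assumptions for illustration; only the 0.3 cosine threshold and the boundary flag come from the description above. Embeddings are plain Python lists to keep the example dependency-free.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class TaskFocus:
    """Hypothetical tracker for a single task-focus embedding."""

    def __init__(self, alpha=0.3, shift_threshold=0.3):
        self.alpha = alpha                      # EMA rate: assumed value
        self.shift_threshold = shift_threshold  # cosine below this counts as a topic shift
        self.focus = None

    def update(self, msg_embedding, task_boundary=False):
        """Fold in a new user-message embedding; return True if the focus snapped."""
        if self.focus is None:
            self.focus = list(msg_embedding)
            return False
        shifted = task_boundary or cosine(self.focus, msg_embedding) < self.shift_threshold
        if shifted:
            # Topic shift: replace the focus outright instead of blending.
            self.focus = list(msg_embedding)
        else:
            a = self.alpha
            self.focus = [(1 - a) * f + a * m for f, m in zip(self.focus, msg_embedding)]
        return shifted

    def coherence(self, block_embedding):
        """Coherence of a block with the current focus, clamped to [0, 1]."""
        if self.focus is None:
            return 0.0
        return max(0.0, cosine(self.focus, block_embedding))


tf = TaskFocus()
tf.update([1.0, 0.0])            # first message seeds the focus
tf.update([0.9, 0.1])            # similar message: blended via EMA
snapped = tf.update([0.0, 1.0])  # unrelated message: focus snaps
print(snapped, tf.coherence([1.0, 0.0]))
```

Because the focus is replaced rather than blended on a shift, a block embedded under the previous task drops to near-zero coherence on the very next scoring pass, which is the one-pass behaviour the bullet describes.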

Final score: max(floor, min(priority · (w_attn·attn + w_rec·recency + w_coh·coherence) / Σw, 1.0)), where floor is the source-type floor (USER blocks 0.6, ASSISTANT blocks 0.5 by default, 0 otherwise). Pinned blocks are excluded from eviction regardless of score.
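As a worked illustration of the final-score formula, here is a small Python sketch. The function name, the default weights (0.5, 0.2, 0.3), and the choice to represent a pinned block as score 1.0 are assumptions for the example; the floors of 0.6 and 0.5 are the defaults stated above, and "lifted by a floor" is read here as taking the maximum of the score and the floor.

```python
# Default source-type floors from the text; other source types get no floor.
SOURCE_FLOORS = {"USER": 0.6, "ASSISTANT": 0.5}


def final_score(attn, recency, coherence, priority=1.0,
                source="DOCUMENT", pinned=False,
                weights=(0.5, 0.2, 0.3)):
    """Combine the signals into a per-block score in [0, 1].

    weights = (w_attn, w_rec, w_coh) are assumed values, not EVOKE defaults.
    """
    if pinned:
        # Assumption: model pinned-block protection as the maximum score.
        return 1.0
    w_attn, w_rec, w_coh = weights
    blended = (w_attn * attn + w_rec * recency + w_coh * coherence) / sum(weights)
    score = min(priority * blended, 1.0)
    # Apply the source-type floor so the conversation backbone outlives documents.
    return max(score, SOURCE_FLOORS.get(source, 0.0))


print(final_score(0.2, 0.5, 0.1))                 # document block, no floor
print(final_score(0.0, 0.0, 0.0, source="USER"))  # held up by the USER floor
print(final_score(1.0, 1.0, 1.0, priority=2.0))   # clamped at 1.0
```

The priority multiplier is applied before the clamp, so a harness can boost a block up to but not beyond 1.0, while the floor guarantees that USER and ASSISTANT blocks never score below their configured minimum.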

Latency

Measured on Qwen 2.5 7B, RTX 4070 Ti SUPER, Flash Attention enabled. kv_block_load is the EVOKE recovery path; re-prefill is the cost of re-encoding the same tokens via llama_decode.

Block (tokens)   save (ms)   load (ms)   re-prefill (ms)   speedup
20               1.10        0.48        11.90             25×
40               1.61        0.70        13.78             20×
160              4.69        1.50        32.60             22×
640              16.37       4.34        118.36            27×
1280             31.90       7.25        232.18            32×

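The absolute gap between re-prefill and load widens roughly linearly with block size: re-prefill is O(tokens × model_FLOPs), load is O(tokens × bytes). The speedup ratio stays in the 20–32× range and is highest for the largest blocks.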
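The speedup column and the absolute gap can be recomputed directly from the measurements in the table. The short script below does only that arithmetic; the ROWS layout and the summarize helper are names chosen for this example.

```python
# (block tokens, save ms, load ms, re-prefill ms), copied from the latency table.
ROWS = [
    (20, 1.10, 0.48, 11.90),
    (40, 1.61, 0.70, 13.78),
    (160, 4.69, 1.50, 32.60),
    (640, 16.37, 4.34, 118.36),
    (1280, 31.90, 7.25, 232.18),
]


def summarize(rows):
    """Derive speedup, absolute gap, and per-token costs for each block size."""
    out = []
    for tokens, _save, load, prefill in rows:
        out.append({
            "tokens": tokens,
            "speedup": prefill / load,
            "gap_ms": prefill - load,
            "load_us_per_token": 1000 * load / tokens,
            "prefill_us_per_token": 1000 * prefill / tokens,
        })
    return out


for r in summarize(ROWS):
    print(f"{r['tokens']:>5} tokens  speedup {r['speedup']:5.1f}x  "
          f"gap {r['gap_ms']:7.2f} ms  "
          f"load {r['load_us_per_token']:6.1f} us/tok  "
          f"re-prefill {r['prefill_us_per_token']:6.1f} us/tok")
```

Rounding the computed ratios reproduces the speedup column, and the gap in milliseconds grows with every step up in block size.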

Verified end-to-end

The mechanism has been run, measured, and stress-tested against a real coding agent.

  • Live opencode session against Qwen 3.5 9B (hybrid Mamba/Attention + thinking, budget = 2048). 250 cumulative evictions, 4 smart-recoveries, active_tokens held near 1414 (within budget) while cached_tokens grew to 32,902. The agent’s conversation was 23× larger than what was resident on the GPU at any moment.
  • A real bug was caught and fixed during this live integration. _evictable_blocks was over-pinning prompt-decoded DOCUMENT blocks under pin_generated, which silently zeroed evictions on the server path used by external harnesses. Root-cause traced via two reproducer scripts, fix in one targeted condition, 106 unit tests still pass.
  • All paper numbers are reproducible from the scripts in scripts/. Raw outputs for the agentic eval, attention-scorer ablation, and keepalive workload are checked into the repo.
  • Three model families verified at the primitive level (Qwen 2.5 7B, Qwen 3.5 9B, Qwen 3.6 35B-A3B). Full server-side evaluation is on Qwen 2.5 7B; cross-architecture latency and scorer ablations are explicit limitations in the paper.

Status

Research prototype, targeting both a working system and a paper draft. The mechanism is verified end-to-end across three model families. Recently closed: tools-aware Jinja chat template (so tool-using turns no longer trigger session resets), multi-session pool with state swap on a custom X-EVOKE-Session header, iSWA dual-cache support in the fork primitives, multi-layer attention capture (up to 16 layers per decode).

License: Apache-2.0 for the policy layer; MIT for the forked llama.cpp work.