# Memory & RAG Architectures Survey — For a Claude Code Transcript Indexer

Author: research-agent
Date: 2026-05-11
Scope: Survey production memory/RAG systems to inform the design of a transcript
indexer that supports semantic search, fact extraction, drift detection, and
multi-granularity summarization over ~1 month → 12+ months of Claude Code .jsonl
transcripts across 5 machines and ~18 agent variants.
Length cap: ~900 lines.
## 0. TL;DR — Recommendation Up Front

For Claude Code transcripts, the right shape is a three-layer hybrid, not any single framework verbatim:
- Raw layer — original `.jsonl` transcripts on disk, addressable by `session_id + message_index`. Source of truth. Never mutated.
- Structured layer — SQLite (or D1) with three table families: `episodes` (session-level summaries), `entities`/`facts` (Graphiti-style triplets with bi-temporal validity windows), and `chunks` (paragraph- and turn-level units with embeddings + BM25). Append-only; contradictions are invalidated, never deleted.
- Insight layer — periodic “dream”/“memify”-style consolidation jobs that prune, reweight, derive meta-facts, and compute drift signals.
Retrieval is hybrid by default: parallel BM25 + dense vector + 1-hop graph traversal, fused with RRF, then optionally reranked. This is the consensus pattern across Zep/Graphiti, Mem0, and Cognee.
Steal patterns from:
- Graphiti/Zep — bi-temporal model, episode→entity→fact pipeline, RRF retrieval. Best fit for drift detection.
- Mem0 — single-pass ADD-only extraction (cheap), four-scope memory (user/agent/run/app — maps cleanly to machine/variant/session/project).
- Letta — `core_memory` blocks as pinned agent state, sleep-time compute for async consolidation.
- Anthropic Memory tool — file-based `/memories` directory as a human-inspectable working layer.
- Cognee — Memify (post-ingest graph refinement) and multiple retrieval modes selected per question type.
Do not adopt any of these as the backend wholesale. Wes already runs Cloudflare D1, fleet-node KV, and a vault. The indexer should live on those substrates, not on Neo4j, a Letta server, or Mem0 Cloud.
## 1. Comparison Matrix

| Framework | Storage | Tiers | Extraction | Retrieval | Drift / Update | License | Stars | Fit |
|---|---|---|---|---|---|---|---|---|
| Letta (MemGPT) | Postgres + vector + (optional) graph | Core / Recall / Archival | Agent self-edits via tool calls | Agent picks tier; vector or graph search | Sleep-time compute agent refines core blocks | Apache-2.0 | 22.6k | Pattern-steal; runtime is too heavy |
| Mem0 | Vector + optional graph (Neo4j/Kuzu) | Flat with 4 scopes (user/agent/run/app) | Single-pass ADD-only LLM extraction | Hybrid: semantic + BM25 + entity | Async writes; staleness is open problem | Apache-2.0 | 55.4k | Closest fit conceptually; lift the extractor pattern |
| LangMem | LangGraph BaseStore (pluggable) | Semantic / Episodic / Procedural | Hot path or background; Pydantic schemas | Direct + semantic + metadata filter | Prompt-optimizer rewrites procedural memory | MIT | ~3k | Concept taxonomy is gold; SDK is not load-bearing |
| LlamaIndex Memory | Pluggable (vector, kv, etc.) | Short-term buffer + Memory Blocks | FactExtractionMemoryBlock is opt-in | Block-priority eviction; vector for VectorMemoryBlock | Manual via block priority | MIT | n/a (in framework) | Useful for in-session compaction, not corpus indexing |
| Anthropic Memory tool | Client-controlled files in /memories | Single tier (filesystem) | None — Claude writes free-form files | File view/read; agent navigates | Claude edits files with str_replace | Anthropic API | n/a | Use as working-memory surface, not corpus index |
| Anthropic Memory MCP | memory.jsonl (JSONL) | Single tier | Manual via tools | search_nodes (substring) | Manual delete/add observations | MIT | n/a | Reference impl; format is useful starting schema |
| Sleep-time Compute | n/a (technique) | n/a | Pre-computes anticipated answers offline | n/a | This is the update mechanism | Research (Letta) | n/a | Adopt as the background-job concept |
| OpenAI Memory | OpenAI-side (saved memories + chat history retrieval) | Two tiers (saved / referenced) | Auto-extraction from chat | “Memory sources” attribution UI | Auto-update with user control | Closed | n/a | Black box; UX patterns only |
| Cognee | Vector + graph (Neo4j/Kuzu/Memgraph) | Episodic graph + derived facts | 6-stage Cognify pipeline | 14 named retrieval modes | Memify: prune, reweight, derive | Apache-2.0 | 17.2k | Strong patterns for post-ingest refinement |
| Zep / Graphiti | Neo4j / FalkorDB / Kuzu / Neptune | Episode → Entity → Fact | Episode-extracted triplets w/ temporal windows | Semantic + BM25 + graph traversal, RRF/MMR | Bi-temporal invalidation (t_valid, t_invalid) | Apache-2.0 (Graphiti); Zep is SaaS | 25.9k | Highest-fit for drift detection |
Headline numbers to remember:
- Zep beats MemGPT on Deep Memory Retrieval, 94.8% vs 93.4%, and gains +18.5% absolute on LongMemEval (gpt-4o).
- Mem0 hits 91.6 on LoCoMo, 91% lower p95 latency than full-context, 90% fewer tokens.
- Anthropic reports 84% token reduction in extended Memory-tool workflows.
- Sleep-time compute reduces test-time compute 5x at equal accuracy, +13-18% absolute with more compute.
## 2. Per-Framework Deep Dives

### 2.1 Letta (formerly MemGPT)

The OS-inspired hierarchical-memory original. Three tiers:
- Core memory — in-context blocks (label + value + char limit) pinned to every model call. Agent edits via `memory_replace`/`memory_insert`/`memory_rethink`. Like RAM.
- Recall memory — searchable raw conversation history outside the context window. Auto-persisted. Like a disk cache.
- Archival memory — explicitly written, indexed knowledge in a vector or graph DB. Agent uses `archival_memory_insert`/`archival_memory_search`.
The agent itself decides what to promote/demote between tiers via tool calls. This is powerful but model-dependent — extraction quality fluctuates with reasoning quality. License Apache-2.0, 22.6k stars, v0.16.7 March 2026, full runtime with Python/TypeScript/Rust SDKs. Adopt the pattern, not the runtime. The “core memory block” idea is exactly what a Claude Code variant’s persistent identity card should look like (peer summary, current mission, recent decisions). The runtime swap-in cost is too high given Wes’s existing fleet primitives.
### 2.2 Mem0

The most popular open-source memory layer (55.4k stars, Apache-2.0). Sits between an agent framework and a backend store — pluggable, not a runtime.
- Extraction: single-pass ADD-only. One LLM call extracts a structured fact and appends it. No write-time UPDATE/DELETE; conflict resolution is deferred to retrieval. This is what saves 60-70% of write-time LLM cost.
- Retrieval: parallel semantic vectors + BM25 + entity linking, then rerank.
- Scopes: `user_id`, `agent_id`, `run_id`, `app_id` — four-axis namespacing. Maps almost 1:1 onto Wes’s fleet (machine + variant + session + client project).
- Graph: optional via Neo4j or Kuzu (Kuzu added Sept 2025). The `Mem0g` variant adds 1.5pp accuracy at 80% latency cost.
- Failure modes (admitted in their 2026 state-of-the-art post): staleness detection is an unsolved problem; cross-session identity resolution remains open; LoCoMo benchmarks don’t capture domain-specific performance.
This is the closest single framework to what we want, but it’s still a runtime service with its own server. Lift the extraction pipeline and the four-scope model; don’t run their server.
### 2.3 LangMem (LangChain)

A library, not a service. Best contribution is the taxonomy:
- Semantic — facts. Two structures: collections (unbounded) and profiles (single-doc, strict schema).
- Episodic — past interactions stored as few-shot examples (situation + thoughts + action + result).
- Procedural — rules and instructions, stored in the system prompt and updated by a prompt-optimizer.
It distinguishes hot path writes (during conversation, latency cost) vs
background writes (after, deeper analysis). Pydantic-typed schemas drive
extraction. LangGraph BaseStore is the persistence interface — pluggable to
whatever store you want.
Adopt the taxonomy. Procedural memory is exactly where Wes’s standing rules belong (e.g., the verify-deployed-version rule, the no-pasted-secrets rule). Episodic memory matches the per-incident vault notes. Semantic memory is the entities-and-facts layer.
### 2.4 LlamaIndex Memory

Less ambitious than the others. Two layers:

- Short-term: `Memory` class with a token-budgeted message buffer.
- Long-term: “Memory Blocks” with priorities. Three built-in block types: `StaticMemoryBlock`, `FactExtractionMemoryBlock`, `VectorMemoryBlock`. Blocks are inserted into the system message or the latest user message.
Useful for in-session context compaction. Not a corpus indexer. The block priority + token-budget eviction is a clean primitive. Skip.
### 2.5 Anthropic Memory tool (memory_20250818)

Beta-released Aug 2025, GA-eligible Sept 29 2025. Client-side: Anthropic
defines six operations (view, create, str_replace, insert, delete,
rename) and an automatic system-prompt instruction telling Claude to view
/memories before doing work. You implement the backend.
This is not a corpus indexer either — it’s a workspace where Claude writes
free-form markdown/XML files as it works. But the contract is interesting for
our purposes: it’s already the standard surface Claude knows how to use. If
we build the indexer underneath it, every Claude Code agent on the fleet can
read the indexed memory through the familiar view interface without learning
a new MCP.
Anthropic’s own usage pattern (the “Multi-session software development pattern” in their docs):
- Initializer session writes a progress log + feature checklist.
- Subsequent sessions read those files first.
- End-of-session updates the progress log.
This is essentially what Wes’s handoff protocol already does. The indexer should be queryable through a memory-tool-compatible surface so any agent can grep its own corpus.
### 2.6 Anthropic Memory MCP server

Reference impl in `modelcontextprotocol/servers/src/memory`. JSONL persistence (`memory.jsonl`), in-process knowledge graph with three primitives:

- Entities: `{ name, entityType, observations: string[] }`
- Relations: directed edges, active-voice
- Observations: atomic facts attached to entities
Nine tools: create_entities, create_relations, add_observations,
delete_entities, delete_observations, delete_relations, read_graph,
search_nodes, open_nodes. MIT. Search is substring — no embeddings.
This is the schema starting point. The JSONL append-only format is also a nice match for the raw layer. We do not want substring search alone — but the entity/relation/observation triple is a fine canonical schema.
### 2.7 Sleep-time Compute (Letta + Berkeley, arXiv 2504.13171, April 2025)

Not a framework — a technique. While the agent is idle, run a sleep-time agent that pre-computes structured representations of context. At test time, the active agent answers from those pre-computed quantities.
Numbers:
- 5x test-time compute reduction at equal accuracy.
- Up to 13–18% absolute accuracy gains when scaling sleep-time.
- 2.5x cost reduction across related queries on Multi-Query GSM-Symbolic.
The “auto-dream” feature people talk about for Claude Code is a community implementation of this same idea applied to memory consolidation. This is the right model for the consolidation job: nightly, the indexer runs a deeper extraction pass over the day’s transcripts, building summaries, refining entities, and updating drift signals.
### 2.8 OpenAI Memory (ChatGPT + Conversations API)

Two surfaces:
- Saved memories — user-directed (“remember I’m vegetarian”). Auto-managed by the model.
- Chat history references — model pulls relevant context from prior conversations at run time. May 2026 update added a “memory sources” UI showing which memories influenced a reply.
The Conversations API exposes “conversation context” as a first-class object combinable with files, vector stores, and containers. Architecture is opaque (closed-source). UX pattern worth stealing: show which memory rows influenced an answer. That’s exactly what we need for drift detection — provenance per claim.
### 2.9 Cognee (topoteretes/cognee)

Apache-2.0, 17.2k stars, v1.0.9 May 8 2026. Self-described “memory control plane.” Claims 70+ production users (Bayer, U-Wyoming, dltHub, others).
Pipeline is two-phase:
- Cognify (ingest): classify documents → check permissions → extract chunks → LLM extract entities + relationships → generate summaries → embed into vector store + commit edges to graph.
- Memify (refine): prune stale nodes, strengthen frequent connections, reweight edges based on usage signals, derive new facts.
Memify is the unique contribution. Most frameworks treat memory as append-only or hand contradiction-resolution to retrieval. Cognee actively rewrites the graph based on usage patterns. This is what we want for drift detection.
Cognee ships 14 named retrieval modes — GRAPH_COMPLETION, RAG_COMPLETION,
GRAPH_COMPLETION_COT, GRAPH_COMPLETION_CONTEXT_EXTENSION,
GRAPH_SUMMARY_COMPLETION, TRIPLET_COMPLETION, NATURAL_LANGUAGE,
CYPHER, CHUNKS, CHUNKS_LEXICAL, SUMMARIES, TEMPORAL,
CODING_RULES, FEELING_LUCKY. The right mode depends on the question. They
claim +25% to +1618% on multi-hop reasoning vs Mem0 and Graphiti. Their own
benchmark, take with salt — but the idea of selecting a retrieval mode per
question type is sound.
### 2.10 Zep / Graphiti

The most architecturally serious. Graphiti is the open-source temporal-graph engine (Apache-2.0, 25.9k stars); Zep is the hosted product on top.
Schema:
- Episodes — raw ingested data with timestamps. Ground truth. Every derived fact traces back to an episode.
- Entities — nodes with summaries that evolve over time.
- Facts — directed edges (subject, predicate, object) with bi-temporal validity (`t_valid`, `t_invalid`) and ingestion bookkeeping (`t'_created`, `t'_expired`).
Bi-temporal model:
- Event time `T` — when the fact was true in the world.
- Ingestion time `T'` — when we observed it.
- Contradictions don’t delete — they invalidate. The old fact gets `t_invalid = new_fact.t_valid`. You can always query “what was true at time X” or “what is true now.”
Extraction pipeline: for each episode, extract speaker + N preceding messages as context, run entity extraction with dedup via embedding + full-text + LLM resolution, then fact extraction (hyper-edges for multi-entity facts), then temporal-overlap check to invalidate contradicted edges.
Retrieval: three parallel methods (cosine semantic, BM25, breadth-first graph traversal up to n hops) → rerank (RRF / MMR / episode-mention frequency / graph distance / cross-encoder) → context construction. No LLM calls during retrieval. P95 ~300ms.
Benchmarks: 18.5% absolute on LongMemEval (gpt-4o, vs 60.2% baseline), 90% lower latency. On the older DMR benchmark, 94.8% vs MemGPT’s 93.4% (gpt-4-turbo).
Backend dependencies: Neo4j 5.26+, FalkorDB, Kuzu, or Neptune; Neptune deployments additionally need OpenSearch Serverless as the search backend. This is the backend tax that argues against adopting Graphiti directly — adding Neo4j to Wes’s fleet is real ops work. But the schema and the bi-temporal model are the single most important pattern to copy.
## 3. Recommended Hybrid Architecture for Claude Code Transcripts

### 3.1 Substrate (use what’s already on the fleet)

- SQLite (per-machine indexer) + D1 (fleet-wide read model). No Neo4j, no Postgres. SQLite handles 12 months of transcripts on a laptop. D1 holds the federated read model for cross-machine queries.
- Embedding store: local SQLite using the `sqlite-vec` extension (or `sqlite-vss`). For the cross-machine read model, Cloudflare Vectorize.
- BM25: SQLite FTS5. Already built in. Free.
- Graph: materialized as edge tables in SQLite. We don’t need real graph traversal beyond 2 hops; recursive CTEs handle that (see the sketch after this list). If the 12-month corpus reveals 3+ hop needs later, swap to Kuzu (embeddable, no server).
- Raw transcripts: stay on disk where Claude Code already writes them. The indexer just builds pointers (`session_id`, `message_index`, `byte_offset`).
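To make the “recursive CTEs handle 2 hops” claim concrete, a minimal sketch against the `facts` table defined in §3.3 (Python stdlib `sqlite3`; the depth cap and helper name are illustrative):

```python
import sqlite3

# 2-hop neighborhood of a seed entity over the facts table from section 3.3.
# UNION (not UNION ALL) dedups visited rows, so cycles terminate.
TWO_HOP = """
WITH RECURSIVE hops(entity_id, depth) AS (
  SELECT :seed, 0
  UNION
  SELECT CASE WHEN f.subject_id = h.entity_id
              THEN f.object_id ELSE f.subject_id END,
         h.depth + 1
  FROM facts f
  JOIN hops h ON h.entity_id IN (f.subject_id, f.object_id)
  WHERE h.depth < 2
    AND f.t_invalid IS NULL        -- currently-valid edges only
    AND f.object_id IS NOT NULL    -- skip literal-valued facts
)
SELECT DISTINCT entity_id FROM hops WHERE entity_id IS NOT NULL;
"""

def two_hop_neighbors(db: sqlite3.Connection, seed: str) -> list[str]:
    return [row[0] for row in db.execute(TWO_HOP, {"seed": seed})]
```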
### 3.2 Layer 1 — Raw (immutable, source of truth)

- `~/.claude/projects/*/<session_id>.jsonl` already exists. Don’t move it (MEMORY.md note already says don’t — Claude uses the path as cwd lookup key). Index in place.
- Per-message pointer table:

```sql
CREATE TABLE messages (
  id          TEXT PRIMARY KEY,     -- session_id:idx
  session_id  TEXT NOT NULL,
  idx         INTEGER NOT NULL,     -- position in jsonl
  machine     TEXT NOT NULL,        -- mac/clippy/grater/imac/m1
  variant     TEXT,                 -- pepper/nagatha/etc
  cwd         TEXT NOT NULL,
  role        TEXT NOT NULL,        -- user/assistant/tool/system
  ts          INTEGER NOT NULL,     -- unix ms
  tokens_est  INTEGER,
  byte_offset INTEGER NOT NULL,
  byte_len    INTEGER NOT NULL,
  UNIQUE(session_id, idx)
);
```

- Never edit. Re-index by re-walking jsonl files; cheap (a walker sketch follows this list).
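A minimal walker sketch for populating `messages`; the field names inside Claude Code `.jsonl` lines (`cwd`, `timestamp`, `message.role`) are assumptions to verify against real transcripts:

```python
import json
import sqlite3
from pathlib import Path

def index_session(db: sqlite3.Connection, path: Path,
                  machine: str, variant: str | None) -> None:
    """Record a byte-range pointer per .jsonl line. Record-field names
    (cwd, timestamp, message.role) are assumptions; verify against real
    transcripts before trusting the role/ts columns."""
    session_id = path.stem
    offset = 0
    with path.open("rb") as f:
        for idx, raw in enumerate(f):
            try:
                rec = json.loads(raw)
            except json.JSONDecodeError:
                rec = {}  # keep offsets honest even for corrupt lines
            msg = rec.get("message") or {}
            db.execute(
                "INSERT OR IGNORE INTO messages VALUES (?,?,?,?,?,?,?,?,?,?,?)",
                (f"{session_id}:{idx}", session_id, idx, machine, variant,
                 rec.get("cwd", ""), msg.get("role", rec.get("type", "")),
                 rec.get("timestamp", 0), None, offset, len(raw)),
            )
            offset += len(raw)
    db.commit()
```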
### 3.3 Layer 2 — Structured (the working index)

Three table families. All append-only; updates go through new rows with `t_invalid` set on the old.
#### Episodes (Graphiti pattern)

```sql
CREATE TABLE episodes (
  id             TEXT PRIMARY KEY,
  session_id     TEXT NOT NULL,
  machine        TEXT,
  variant        TEXT,
  cwd            TEXT,
  start_ts       INTEGER NOT NULL,
  end_ts         INTEGER NOT NULL,
  msg_start_idx  INTEGER NOT NULL,
  msg_end_idx    INTEGER NOT NULL,
  summary        TEXT,               -- 2-4 sentences
  topics         TEXT,               -- JSON array
  outcome        TEXT,               -- success/blocked/abandoned
  decision_count INTEGER DEFAULT 0,
  created_at     INTEGER NOT NULL
);
```

An episode is a coherent unit of work inside a session. Granularity: ~5–30 messages. We segment by tool-call density spikes, by `/clear` boundaries, and by topic-shift heuristics (a segmenter sketch follows). Episodes are the unit of summarization.
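A sketch of the segmenter covering the two cheap heuristics (`/clear` markers, idle gaps); the 30-minute threshold is an assumption to tune, the `text`/`ts` fields on each row are assumed to be joined in from the raw layer, and the topic-shift cosine cut would slot in once embeddings exist:

```python
def segment_episodes(msgs: list[dict],
                     gap_ms: int = 30 * 60 * 1000) -> list[tuple[int, int]]:
    """Cut one session's messages (ordered by idx) into episode ranges.
    Implements the two cheap cuts: explicit /clear markers and long idle
    gaps. The 30-minute gap is a guess to tune empirically; a topic-shift
    cosine drop between consecutive turns would be the third condition."""
    if not msgs:
        return []
    cuts = [0]
    for i in range(1, len(msgs)):
        if msgs[i].get("text", "").strip() == "/clear" or \
           msgs[i]["ts"] - msgs[i - 1]["ts"] > gap_ms:
            cuts.append(i)
    cuts.append(len(msgs))  # sentinel end
    # pair consecutive cut points into (msg_start_idx, msg_end_idx) ranges
    return [(cuts[j], cuts[j + 1] - 1) for j in range(len(cuts) - 1)]
```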
#### Entities and Facts (Graphiti pattern, simplified)

```sql
CREATE TABLE entities (
  id         TEXT PRIMARY KEY,  -- canonical_name
  type       TEXT NOT NULL,     -- client/project/file/person/tool/decision/rule
  summary    TEXT,
  aliases    TEXT,              -- JSON array
  created_at INTEGER NOT NULL,
  last_seen  INTEGER NOT NULL
);

CREATE TABLE facts (
  id         TEXT PRIMARY KEY,
  subject_id TEXT NOT NULL REFERENCES entities(id),
  predicate  TEXT NOT NULL,
  object_id  TEXT REFERENCES entities(id),
  object_lit TEXT,                          -- literal value when object isn't an entity
  episode_id TEXT REFERENCES episodes(id),  -- provenance
  t_valid    INTEGER NOT NULL,              -- when the fact became true (event time)
  t_invalid  INTEGER,                       -- NULL = still true
  t_created  INTEGER NOT NULL,              -- ingestion time
  t_expired  INTEGER,                       -- when superseded in our store
  confidence REAL DEFAULT 1.0,
  source_msg TEXT REFERENCES messages(id)
);
CREATE INDEX facts_subject ON facts(subject_id, t_invalid);
CREATE INDEX facts_pred ON facts(predicate, t_invalid);
```

Drift detection falls out for free. “When did the no-band-aid-fixes rule change?” = `SELECT * FROM facts WHERE predicate='rule' AND subject_id='wes' ORDER BY t_valid DESC` shows every version with its validity interval.
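Point-in-time reads are the other bi-temporal payoff; a query sketch against the schema above (`:as_of` in unix ms, same clock as `t_valid`/`t_invalid`):

```python
# "What was true about an entity at time X?"
AS_OF = """
SELECT subject_id, predicate, COALESCE(object_id, object_lit) AS object
FROM facts
WHERE subject_id = :entity
  AND t_valid <= :as_of
  AND (t_invalid IS NULL OR t_invalid > :as_of)
ORDER BY t_valid;
"""
```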
#### Chunks (Mem0 pattern, for semantic search)

```sql
CREATE TABLE chunks (
  id         TEXT PRIMARY KEY,
  episode_id TEXT REFERENCES episodes(id),
  text       TEXT NOT NULL,
  text_norm  TEXT NOT NULL,  -- for FTS5
  kind       TEXT NOT NULL,  -- turn/reasoning/decision/tool_io
  created_at INTEGER NOT NULL
);
CREATE VIRTUAL TABLE chunks_fts USING fts5(text_norm, content='chunks', content_rowid='rowid');

CREATE TABLE chunk_vec (
  chunk_id  TEXT PRIMARY KEY REFERENCES chunks(id),
  embedding BLOB NOT NULL  -- via sqlite-vec
);
```

Chunks are the embedding unit. One chunk per assistant turn for reasoning, plus separate chunks for code blocks, decisions, and tool outputs. Don’t embed everything — embedding 12 months of raw tool output is wasteful and poisons retrieval.
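Nearest-neighbor lookup over `chunk_vec`, assuming the `sqlite-vec` Python package; a brute-force cosine scan is fine at per-machine scale, with a `vec0` virtual table as the upgrade path if scans get slow:

```python
import sqlite3
import struct

import sqlite_vec  # pip install sqlite-vec

def nearest_chunks(db: sqlite3.Connection, query_vec: list[float], k: int = 20):
    """Brute-force cosine scan over chunk_vec using sqlite-vec's
    distance function. Sketch, not tuned."""
    db.enable_load_extension(True)
    sqlite_vec.load(db)
    q = struct.pack(f"{len(query_vec)}f", *query_vec)  # float32 blob, matching storage
    return db.execute(
        """SELECT chunk_id, vec_distance_cosine(embedding, ?) AS dist
           FROM chunk_vec ORDER BY dist LIMIT ?""",
        (q, k),
    ).fetchall()
```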
### 3.4 Layer 3 — Insight (the “dream” job)

Run nightly. Reads yesterday’s episodes; writes:
- Updated entity summaries (entities accrue new aliases, type changes).
- Cross-episode patterns (“this is the 4th time Wes asked about GraphQL reverse-engineering this month — promote to procedural memory”).
- Drift signals (predicate-versus-time graphs, contradiction counts).
- Pruning candidates (low-usage entities, fully-superseded facts older than N months — soft-deleted, not hard-removed).
Borrowed wholesale from Cognee’s Memify and the sleep-time-compute pattern.
Schedule under launchd on Mac, systemd on Linux, Task Scheduler on PC. Reuse
the existing /dream skill plumbing.
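A skeleton of the job’s concrete part: the drift signal, computed as invalidation counts per (subject, predicate) over the last day. The LLM steps from the list above (entity re-summaries, pattern promotion, prune candidates) would run in the same pass:

```python
import sqlite3
import time

DAY_MS = 86_400_000

def dream(db: sqlite3.Connection, now_ms: int | None = None) -> list[tuple]:
    """Nightly consolidation sketch. Returns (subject, predicate, flips)
    rows: how many facts were invalidated in the last day, grouped by
    (subject, predicate). High flip counts are drift candidates."""
    now = now_ms or int(time.time() * 1000)
    return db.execute(
        """SELECT subject_id, predicate, COUNT(*) AS flips
           FROM facts
           WHERE t_invalid IS NOT NULL AND t_invalid >= ?
           GROUP BY subject_id, predicate
           ORDER BY flips DESC""",
        (now - DAY_MS,),
    ).fetchall()
```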
### 3.5 Extraction Pipeline (Mem0 + Graphiti, in one)

Per-episode, single LLM call:

```
Input:  episode messages + recent-N entities + recent-N facts (for dedup hints)
Output: { new_entities: [...], new_facts: [...], summary: "...", outcome: "...", topics: [...] }
```

- Single-pass ADD-only. No UPDATE/DELETE at write time (Mem0 lesson).
- Dedup via embedding + name during write. If two candidates score >0.92 cosine and share an alias, merge.
- Contradiction detection happens at write time too (Graphiti lesson): for each new fact `(s, p, o)`, find existing facts with the same `(s, p)` and `t_invalid IS NULL`. If they conflict, set the old fact’s `t_invalid` to the new fact’s `t_valid`. Keep both rows.
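A write-path sketch of that invalidation rule (the helper name is illustrative; real code would first check that the object actually differs, so a re-assertion doesn’t flip its own history):

```python
import sqlite3
import time
import uuid

def add_fact(db: sqlite3.Connection, s: str, p: str,
             o: str | None, lit: str | None,
             episode_id: str, t_valid: int) -> None:
    """Invalidate any open fact with the same (subject, predicate),
    then append the new row. Both rows survive for temporal queries."""
    db.execute(
        """UPDATE facts SET t_invalid = ?
           WHERE subject_id = ? AND predicate = ? AND t_invalid IS NULL""",
        (t_valid, s, p),
    )
    db.execute(
        """INSERT INTO facts (id, subject_id, predicate, object_id,
                              object_lit, episode_id, t_valid, t_created)
           VALUES (?,?,?,?,?,?,?,?)""",
        (str(uuid.uuid4()), s, p, o, lit, episode_id, t_valid,
         int(time.time() * 1000)),
    )
    db.commit()
```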
Use Haiku for the extraction LLM. Sonnet for the nightly insight job. Opus only for question-answering at retrieval time.
### 3.6 Retrieval Pipeline (Zep pattern, fitted to question type)

Three parallel candidate generators → fuse → optionally rerank:
- BM25 over `chunks_fts` (FTS5).
- Dense vector over `chunk_vec` (sqlite-vec cosine).
- Graph traversal — start from entities mentioned in the query (matched by canonical name and aliases), traverse 1-2 hops through `facts` where `t_invalid IS NULL` (or at a specified `as_of` time).
Fuse with RRF (k=60). Then for >50 candidates or precision-sensitive queries, run a cross-encoder rerank (a small local model or one Sonnet call).
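RRF itself is ~10 lines; a sketch with the standard 1/(k + rank) scoring:

```python
def rrf_fuse(result_lists: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Reciprocal Rank Fusion: score(d) = sum of 1/(k + rank_d) over each
    candidate list (BM25, dense, graph) that returned d. Ranks are
    1-based; candidates absent from a list simply contribute nothing."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, chunk_id in enumerate(results, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Typical call: `rrf_fuse([bm25_ids, dense_ids, graph_ids])`, with the optional cross-encoder pass reordering only the top of that list.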
Per-question-type routing (Cognee idea, lighter version):
- “What did Wes decide about X?” → graph-first (facts with predicate=`decided`, subject=`wes`, object related to X).
- “Find the time we discussed Y.” → BM25-first (exact-term recall).
- “Have I seen this error before?” → vector-first (semantic similarity).
- “When did rule Z change?” → temporal scan on the facts table.
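A deliberately dumb router sketch for these four shapes; the regexes are illustrative placeholders, with a Haiku classifier as the upgrade path:

```python
import re

def pick_mode(q: str) -> str:
    """Map a question to a retrieval mode. Order matters: the decision
    check runs before the temporal check so "what did Wes decide" is
    graph-first even if it mentions time."""
    if re.search(r"\b(decided?|decision)\b", q, re.IGNORECASE):
        return "graph_first"
    if re.search(r"\bwhen did\b.*\b(change|start|stop)\b", q, re.IGNORECASE):
        return "temporal_scan"
    if re.search(r"\b(error|exception|traceback|failed)\b", q, re.IGNORECASE):
        return "vector_first"
    return "bm25_first"  # exact-term recall is the safest default
```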
### 3.7 When to embed vs not

- Embed: user prompts, assistant reasoning/decisions, episode summaries, entity summaries. These are semantically rich and short.
- Don’t embed: tool inputs (paths, args), tool outputs (file contents, command stdout), system reminders, hook output, raw JSON dumps. These poison the embedding space with low-signal, high-volume noise. Index them with BM25 only and point to the raw byte range.
Rough math: 5 machines × 18 variants × ~500 sessions/variant over 12 months × ~400 messages/session ≈ 18M messages. If we embed only assistant text + decisions + summaries (~20% of messages), that’s ~3.6M embeddings. At 1024 dims × 4 bytes = 4 KB each → ~14 GB. Manageable on a single SSD; on Vectorize, well under quota.
### 3.8 Schema for Wes’s specific entity types

Predefine these entity types — they cover the corpus:

- `client` (covenant, shaw, summit, hines-creative, wellness-oasis, …)
- `project` (covenant-hcp-sync, shaw-plumbing-site, hcp-prospector, …)
- `machine` (mac, clippy, grater, imac, m1)
- `variant` (pepper, stark, nagatha, bilby, …)
- `person` (wes, lance, scott, anne, christian, …)
- `tool` (mcp server names, cli names)
- `file` (canonical paths in vault and repos)
- `decision` (one-line statements: “use 11ty for shaw v3”)
- `rule` (standing rules — “ALTER TABLE, don’t DROP”)
- `incident` (named events — “Shaw incident 2026-03-23”)
- `deadline` (dated commitments)
Predefined predicates (think Graphiti EntityTypes + EdgeTypes): `decided`, `learned`, `blocked_on`, `worked_on`, `pinged`, `deployed_to`, `assigned_to`, `partners_with`, `superseded_by`, `tagged`, `due_on`, `lives_at`, `mentioned_in_episode`.
Constraining the schema is what makes drift detection tractable. With a fixed predicate set, “rule changes” is a single query.
### 3.9 Surface — how agents query it

Expose two interfaces:

- Memory MCP server matching Anthropic’s `memory_20250818` contract. `/memories/index/by-client/covenant.md` synthesizes recent facts about Covenant on read; `/memories/raw/<session_id>.txt` returns a pointer. Variants already know how to use this surface (sketch after this list).
- CLI / fleet-node endpoint — `transcripts query "..."` returns ranked results with provenance (episode + source message). The same query goes to fleet-node so any machine can hit the federated read model.
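A sketch of the read side of that surface: one route synthesized from the `facts` table on demand. The route shape mirrors this section; the handler and its formatting are illustrative, not part of Anthropic’s contract:

```python
import sqlite3
from pathlib import PurePosixPath

def memories_view(db: sqlite3.Connection, path: str) -> str:
    """Handle `view` on /memories/index/by-client/<name>.md by
    synthesizing a page from currently-valid facts about that client."""
    p = PurePosixPath(path)
    if len(p.parts) == 5 and p.parts[:4] == ("/", "memories", "index", "by-client"):
        rows = db.execute(
            """SELECT predicate, COALESCE(object_id, object_lit), t_valid
               FROM facts WHERE subject_id = ? AND t_invalid IS NULL
               ORDER BY t_valid DESC LIMIT 50""",
            (p.stem,),  # file stem is the canonical client entity id
        ).fetchall()
        return "\n".join(f"- {pred}: {obj} (since {ts})" for pred, obj, ts in rows)
    raise FileNotFoundError(path)
```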
### 3.10 Build order

- Raw layer + messages table. Walk every `.jsonl` on Mac, index it. No LLM yet. Verify counts. (1 day.)
- Episode segmentation + summarization. Haiku call per session, chunked. Cache aggressively. (2-3 days.)
- Entity + fact extraction with bi-temporal model. Predefine schemas. Backfill across all episodes. (3-5 days.)
- Embeddings on the right subset only. sqlite-vec. (1 day.)
- Retrieval CLI + RRF fusion. (2 days.)
- Nightly Memify/dream job. Reuse the `/dream` skill. (2 days.)
- Memory MCP server surface. Wraps the same query layer. (2 days.)
- Federate to D1/Vectorize for fleet-wide queries. (3-5 days.)
Two-week first cut to a system that beats grep meaningfully. The 12-month corpus pulls in over time once the pipeline is stable.
## 4. Open Questions

- Episode segmentation heuristics. What’s the right rule for cutting a session into episodes? Tool-call density spikes? `/clear` markers (already present in transcripts)? Topic-shift cosine drops between consecutive turns? Needs empirical tuning on real data.
- Cross-machine identity resolution. When Pepper on Clippy and Stark on Clippy both reference “the Shaw site,” do we want one entity or two (variant-scoped)? Mem0’s four-scope model says scope by `app_id`, but we usually want cross-variant queries. Lean toward one entity, with `mentioned_by_variant` as a fact predicate. Verify with real queries.
- Embedding model choice and lock-in. OpenAI `text-embedding-3-small` is cheapest and good enough, but ties us to OpenAI. Local models (`bge-small-en-v1.5`, `nomic-embed-text`) avoid the call but need per-machine compute. Mac M3 can run nomic locally at <50ms/embedding. Recommend local for Mac+Clippy+Grater, OpenAI as fallback. Lock in the dimension (1024) so a model swap doesn’t force a vector-store schema change.
- Drift detection threshold. A rule that “changed” at minute granularity is noise; one that “changed” at month granularity is signal. What’s the debounce window? Probably a function of predicate type — `decided` is spiky; `rule` should be stable for weeks before counting as drift.
- Tool-output handling. Some tool outputs are gold (a Cloudflare `wrangler deploy` log) and some are dross (a directory listing). Should we extract facts from tool output, or only from assistant text? Trying only assistant text first will undercount infra facts. Pragmatic: extract from `assistant.text` and from tool outputs only when the tool name is on an allowlist (wrangler, gh, sqlite3, curl results parsed as JSON, etc).
- Privacy / secrets. Transcripts contain leaked tokens (the `feedback_never_echo_secrets_in_text.md` and `fleet-node-keyed-path` incidents). Extraction can’t put secret values into facts. Need a pre-extraction scrubber regex (matching the existing `pre-tool-use-secret-intercept.js` patterns) before any LLM call — a scrubber sketch follows this list.
- Conflict between fact extraction and Wes’s vault canon. The vault is already human-canon. When the extractor produces a fact that contradicts the vault, who wins? Default rule: vault wins; emit a `drift_alert` for Wes to review. But this needs a real reconciliation policy.
- Replay/idempotency. If we re-extract a transcript later (better model, schema update), do we throw out the prior facts and re-emit, or merge? Likely answer: keep prior with `t_expired`, re-emit fresh. Same bi-temporal trick on the extraction itself.
- Cost ceiling on extraction. At Haiku rates, the 12-month corpus extraction is probably $20-80. Acceptable. But re-extraction every time we change the schema is the recurring cost — plan for that. Sonnet on the insight pass only; Haiku on raw extraction; never Opus inside the pipeline.
- What “the corpus” includes. Just `.jsonl` transcripts, or also handoff files, Telegram logs, the vault itself, peer-messages? Strong bias toward starting with transcripts only and layering others in as distinct “source” tags on episodes. A fact found in the vault is higher trust than one extracted from a transcript turn.
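A scrubber sketch for the privacy item above; the patterns are illustrative common token shapes, not the real intercept list:

```python
import re

# Pre-extraction scrubber, run on every chunk before any LLM call.
# Illustrative patterns only: the real list should mirror whatever
# pre-tool-use-secret-intercept.js already matches.
SECRET_PATTERNS = [
    re.compile(r"sk-ant-[A-Za-z0-9_-]{20,}"),             # Anthropic-style keys
    re.compile(r"AKIA[0-9A-Z]{16}"),                      # AWS access key IDs
    re.compile(r"ghp_[A-Za-z0-9]{36}"),                   # GitHub personal tokens
    re.compile(r"(?i)bearer\s+[A-Za-z0-9._~+/=-]{20,}"),  # Authorization headers
]

def scrub(text: str) -> str:
    for pat in SECRET_PATTERNS:
        text = pat.sub("[REDACTED]", text)
    return text
```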
## 5. Patterns Adopted vs Rejected (Quick Reference)

Adopted:
- Graphiti episode→entity→fact pipeline with bi-temporal validity.
- Mem0 single-pass ADD-only extraction.
- Mem0 four-scope namespacing (mapped: machine/variant/session/project).
- Zep three-way hybrid retrieval (BM25 + dense + graph) with RRF.
- Cognee Memify post-ingest refinement (becomes our `/dream` job).
- Cognee retrieval-mode-per-question-type idea (lighter version).
- LangMem semantic/episodic/procedural taxonomy.
- Letta core-memory-block concept (for agent identity cards).
- Anthropic Memory tool as the surface (so agents already know the API).
- Anthropic Memory MCP server entity/relation/observation schema (as a starting canonical form).
- Sleep-time-compute as the background-job pattern.
- OpenAI “memory sources” UX — surface provenance per answer.
Rejected:
- Neo4j or Postgres dependency (use SQLite/D1).
- Letta as runtime (too heavy to swap for the fleet).
- Mem0 server (we want the pattern, not the service).
- LlamaIndex memory (in-session, not corpus-wide).
- Substring-only search (the official Memory MCP is fine schema, weak retrieval).
- Embedding everything (tool output, dir listings) — embed only semantically rich text.
- Hard-deleting on contradiction — invalidate, never delete.
## 6. Sources

Letta / MemGPT:
- https://github.com/letta-ai/letta
- https://docs.letta.com/concepts/memgpt/
- https://www.letta.com/blog/agent-memory
- https://www.letta.com/blog/sleep-time-compute
- https://arxiv.org/abs/2504.13171 (Sleep-time Compute paper)
- https://github.com/letta-ai/sleep-time-compute
- https://vectorize.io/articles/mem0-vs-letta
Mem0:
- https://github.com/mem0ai/mem0
- https://mem0.ai/blog/state-of-ai-agent-memory-2026
- https://mem0.ai/blog/graph-memory-solutions-ai-agents
- https://docs.mem0.ai/cookbooks/essentials/choosing-memory-architecture-vector-vs-graph
- https://arxiv.org/pdf/2504.19413 (Mem0 paper)
LangMem / LangChain:
- https://langchain-ai.github.io/langmem/concepts/conceptual_guide/
- https://blog.langchain.com/langmem-sdk-launch/
- https://langchain-ai.github.io/langmem/guides/extract_episodic_memories/
LlamaIndex:
- https://developers.llamaindex.ai/python/framework/module_guides/deploying/agents/memory/
- https://developers.llamaindex.ai/python/examples/agent/memory/summary_memory_buffer/
Anthropic:
- https://platform.claude.com/docs/en/agents-and-tools/tool-use/memory-tool
- https://github.com/modelcontextprotocol/servers/tree/main/src/memory
- https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents
- https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents
Zep / Graphiti:
- https://arxiv.org/abs/2501.13956 (Zep paper)
- https://arxiv.org/html/2501.13956v1
- https://github.com/getzep/graphiti
- https://blog.getzep.com/graphiti-knowledge-graphs-for-agents/
- https://www.getzep.com/product/open-source/
Cognee:
- https://github.com/topoteretes/cognee
- https://www.cognee.ai/blog/fundamentals/how-cognee-builds-ai-memory
- https://memgraph.com/blog/from-rag-to-graphs-cognee-ai-memory
OpenAI:
- https://openai.com/index/memory-and-new-controls-for-chatgpt/
- https://help.openai.com/en/articles/8590148-memory-faq
- https://www.arielsoftwares.com/openai-conversations-api/
Hybrid RAG references:
- https://calmops.com/ai/hybrid-search-rag-complete-guide-2026/
- https://blog.premai.io/hybrid-search-for-rag-bm25-splade-and-vector-search-combined/
- https://superlinked.com/vectorhub/articles/optimizing-rag-with-hybrid-search-reranking
- https://arxiv.org/html/2507.03226v3 (GraphRAG construction at scale)
- https://medium.com/@kumaran.isk/building-a-production-rag-pipeline-start-with-hybrid-retrieval-dense-bm25-rrf-e901aba17cae
Dream / AutoDream community impls: