# Memory & RAG Architectures Survey — For a Claude Code Transcript Indexer

Author: research-agent
Date: 2026-05-11
Scope: Survey production memory/RAG systems to inform the design of a transcript
indexer that supports semantic search, fact extraction, drift detection, and
multi-granularity summarization over ~1 month → 12+ months of Claude Code .jsonl
transcripts across 5 machines and ~18 agent variants.
Length cap: ~900 lines.
## 0. TL;DR — Recommendation Up Front

For Claude Code transcripts, the right shape is a three-layer hybrid, not any single framework verbatim:
- Raw layer — original `.jsonl` transcripts on disk, addressable by `session_id + message_index`. Source of truth. Never mutated.
- Structured layer — SQLite (or D1) with three table families: `episodes` (session-level summaries), `entities`/`facts` (Graphiti-style triplets with bi-temporal validity windows), and `chunks` (paragraph- and turn-level units with embeddings + BM25). Append-only; contradictions are invalidated, never deleted.
- Insight layer — periodic “dream”/“memify”-style consolidation jobs that prune, reweight, derive meta-facts, and compute drift signals.
Retrieval is hybrid by default: parallel BM25 + dense vector + 1-hop graph traversal, fused with RRF, then optionally reranked. This is the consensus pattern across Zep/Graphiti, Mem0, and Cognee.
Steal patterns from:
- Graphiti/Zep — bi-temporal model, episode→entity→fact pipeline, RRF retrieval. Best fit for drift detection.
- Mem0 — single-pass ADD-only extraction (cheap), four-scope memory (user/agent/run/app — maps cleanly to machine/variant/session/project).
- Letta — `core_memory` blocks as pinned agent state, sleep-time compute for async consolidation.
- Anthropic Memory tool — file-based `/memories` directory as a human-inspectable working layer.
- Cognee — Memify (post-ingest graph refinement) and multiple retrieval modes selected per question type.
Do not adopt any of these as the backend wholesale. Wes already runs Cloudflare D1, fleet-node KV, and a vault. The indexer should live on those substrates, not on Neo4j, a Letta server, or Mem0 Cloud.
## 1. Comparison Matrix

| Framework | Storage | Tiers | Extraction | Retrieval | Drift / Update | License | Stars | Fit |
|---|---|---|---|---|---|---|---|---|
| Letta (MemGPT) | Postgres + vector + (optional) graph | Core / Recall / Archival | Agent self-edits via tool calls | Agent picks tier; vector or graph search | Sleep-time compute agent refines core blocks | Apache-2.0 | 22.6k | Pattern-steal; runtime is too heavy |
| Mem0 | Vector + optional graph (Neo4j/Kuzu) | Flat with 4 scopes (user/agent/run/app) | Single-pass ADD-only LLM extraction | Hybrid: semantic + BM25 + entity | Async writes; staleness is open problem | Apache-2.0 | 55.4k | Closest fit conceptually; lift the extractor pattern |
| LangMem | LangGraph BaseStore (pluggable) | Semantic / Episodic / Procedural | Hot path or background; Pydantic schemas | Direct + semantic + metadata filter | Prompt-optimizer rewrites procedural memory | MIT | ~3k | Concept taxonomy is gold; SDK is not load-bearing |
| LlamaIndex Memory | Pluggable (vector, kv, etc.) | Short-term buffer + Memory Blocks | FactExtractionMemoryBlock is opt-in | Block-priority eviction; vector for VectorMemoryBlock | Manual via block priority | MIT | n/a (in framework) | Useful for in-session compaction, not corpus indexing |
| Anthropic Memory tool | Client-controlled files in /memories | Single tier (filesystem) | None — Claude writes free-form files | File view/read; agent navigates | Claude edits files with str_replace | Anthropic API | n/a | Use as working-memory surface, not corpus index |
| Anthropic Memory MCP | memory.jsonl (JSONL) | Single tier | Manual via tools | search_nodes (substring) | Manual delete/add observations | MIT | n/a | Reference impl; format is useful starting schema |
| Sleep-time Compute | n/a (technique) | n/a | Pre-computes anticipated answers offline | n/a | This is the update mechanism | Research (Letta) | n/a | Adopt as the background-job concept |
| OpenAI Memory | OpenAI-side (saved memories + chat history retrieval) | Two tiers (saved / referenced) | Auto-extraction from chat | “Memory sources” attribution UI | Auto-update with user control | Closed | n/a | Black box; UX patterns only |
| Cognee | Vector + graph (Neo4j/Kuzu/Memgraph) | Episodic graph + derived facts | 6-stage Cognify pipeline | 14 named retrieval modes | Memify: prune, reweight, derive | Apache-2.0 | 17.2k | Strong patterns for post-ingest refinement |
| Zep / Graphiti | Neo4j / FalkorDB / Kuzu / Neptune | Episode → Entity → Fact | Episode-extracted triplets w/ temporal windows | Semantic + BM25 + graph traversal, RRF/MMR | Bi-temporal invalidation (t_valid, t_invalid) | Apache-2.0 (Graphiti); Zep is SaaS | 25.9k | Highest-fit for drift detection |
Headline numbers to remember:
- Zep beats MemGPT on Deep Memory Retrieval, 94.8% vs 93.4%, and gains +18.5% absolute on LongMemEval (gpt-4o).
- Mem0 hits 91.6 on LoCoMo, 91% lower p95 latency than full-context, 90% fewer tokens.
- Anthropic reports 84% token reduction in extended Memory-tool workflows.
- Sleep-time compute reduces test-time compute 5x at equal accuracy, +13-18% absolute with more compute.
## 2. Per-Framework Deep Dives

### 2.1 Letta (formerly MemGPT)

The OS-inspired hierarchical-memory original. Three tiers:
- Core memory — in-context blocks (label + value + char limit) pinned to every model call. Agent edits via `memory_replace`/`memory_insert`/`memory_rethink`. Like RAM.
- Recall memory — searchable raw conversation history outside the context window. Auto-persisted. Like a disk cache.
- Archival memory — explicitly written, indexed knowledge in a vector or graph DB. Agent uses `archival_memory_insert`/`archival_memory_search`.
The agent itself decides what to promote/demote between tiers via tool calls. This is powerful but model-dependent — extraction quality fluctuates with reasoning quality. License Apache-2.0, 22.6k stars, v0.16.7 March 2026, full runtime with Python/TypeScript/Rust SDKs. Adopt the pattern, not the runtime. The “core memory block” idea is exactly what a Claude Code variant’s persistent identity card should look like (peer summary, current mission, recent decisions). The runtime swap-in cost is too high given Wes’s existing fleet primitives.
### 2.2 Mem0

The most popular open-source memory layer (55.4k stars, Apache-2.0). Sits between an agent framework and a backend store — pluggable, not a runtime.
- Extraction: single-pass ADD-only. One LLM call extracts a structured fact and appends it. No write-time UPDATE/DELETE; conflict resolution is deferred to retrieval. This is what saves 60-70% of write-time LLM cost.
- Retrieval: parallel semantic vectors + BM25 + entity linking, then rerank.
- Scopes: `user_id`, `agent_id`, `run_id`, `app_id` — four-axis namespacing. Maps almost 1:1 onto Wes’s fleet (machine + variant + session + client project).
- Graph: optional via Neo4j or Kuzu (Kuzu added Sept 2025). The `Mem0g` variant adds 1.5pp accuracy at 80% latency cost.
- Failure modes (admitted in their 2026 state-of-the-art post): staleness detection is an unsolved problem; cross-session identity resolution remains open; LoCoMo benchmarks don’t capture domain-specific performance.
This is the closest single framework to what we want, but it’s still a runtime service with its own server. Lift the extraction pipeline and the four-scope model; don’t run their server.
### 2.3 LangMem (LangChain)

A library, not a service. Best contribution is the taxonomy:
- Semantic — facts. Two structures: collections (unbounded) and profiles (single-doc, strict schema).
- Episodic — past interactions stored as few-shot examples (situation + thoughts + action + result).
- Procedural — rules and instructions, stored in the system prompt and updated by a prompt-optimizer.
It distinguishes hot path writes (during conversation, latency cost) vs
background writes (after, deeper analysis). Pydantic-typed schemas drive
extraction. LangGraph BaseStore is the persistence interface — pluggable to
whatever store you want.
Adopt the taxonomy. Procedural memory is exactly where Wes’s standing rules belong (e.g., the verify-deployed-version rule, the no-pasted-secrets rule). Episodic memory matches the per-incident vault notes. Semantic memory is the entities-and-facts layer.
### 2.4 LlamaIndex Memory

Less ambitious than the others. Two layers:

- Short-term: `Memory` class with a token-budgeted message buffer.
- Long-term: “Memory Blocks” with priorities. Three built-in block types: `StaticMemoryBlock`, `FactExtractionMemoryBlock`, `VectorMemoryBlock`. Blocks are inserted into the system message or the latest user message.
Useful for in-session context compaction. Not a corpus indexer. The block priority + token-budget eviction is a clean primitive. Skip.
### 2.5 Anthropic Memory tool (memory_20250818)

Beta-released Aug 2025, GA-eligible Sept 29 2025. Client-side: Anthropic
defines six operations (view, create, str_replace, insert, delete,
rename) and an automatic system-prompt instruction telling Claude to view
/memories before doing work. You implement the backend.
This is not a corpus indexer either — it’s a workspace where Claude writes
free-form markdown/XML files as it works. But the contract is interesting for
our purposes: it’s already the standard surface Claude knows how to use. If
we build the indexer underneath it, every Claude Code agent on the fleet can
read the indexed memory through the familiar view interface without learning
a new MCP.
Anthropic’s own usage pattern (the “Multi-session software development pattern” in their docs):
- Initializer session writes a progress log + feature checklist.
- Subsequent sessions read those files first.
- End-of-session updates the progress log.
This is essentially what Wes’s handoff protocol already does. The indexer should be queryable through a memory-tool-compatible surface so any agent can grep its own corpus.
### 2.6 Anthropic Memory MCP server

Reference impl in `modelcontextprotocol/servers/src/memory`. JSONL persistence (`memory.jsonl`), in-process knowledge graph with three primitives:

- Entities: `{ name, entityType, observations: string[] }`
- Relations: directed edges, active-voice
- Observations: atomic facts attached to entities
Nine tools: create_entities, create_relations, add_observations,
delete_entities, delete_observations, delete_relations, read_graph,
search_nodes, open_nodes. MIT. Search is substring — no embeddings.
This is the schema starting point. The JSONL append-only format is also a nice match for the raw layer. We do not want substring search alone — but the entity/relation/observation triple is a fine canonical schema.
### 2.7 Sleep-time Compute (Letta + Berkeley, arXiv 2504.13171, April 2025)

Not a framework — a technique. While the agent is idle, run a sleep-time agent that pre-computes structured representations of context. At test time, the active agent answers from those pre-computed quantities.
Numbers:
- 5x test-time compute reduction at equal accuracy.
- Up to 13–18% absolute accuracy gains when scaling sleep-time.
- 2.5x cost reduction across related queries on Multi-Query GSM-Symbolic.
The “auto-dream” feature people talk about for Claude Code is a community implementation of this same idea applied to memory consolidation. This is the right model for the consolidation job: nightly, the indexer runs a deeper extraction pass over the day’s transcripts, building summaries, refining entities, and updating drift signals.
### 2.8 OpenAI Memory (ChatGPT + Conversations API)

Two surfaces:
- Saved memories — user-directed (“remember I’m vegetarian”). Auto-managed by the model.
- Chat history references — model pulls relevant context from prior conversations at run time. May 2026 update added a “memory sources” UI showing which memories influenced a reply.
The Conversations API exposes “conversation context” as a first-class object combinable with files, vector stores, and containers. Architecture is opaque (closed-source). UX pattern worth stealing: show which memory rows influenced an answer. That’s exactly what we need for drift detection — provenance per claim.
### 2.9 Cognee (topoteretes/cognee)

Apache-2.0, 17.2k stars, v1.0.9 May 8 2026. Self-described “memory control plane.” Claims 70+ production users (Bayer, U-Wyoming, dltHub, others).
Pipeline is two-phase:
- Cognify (ingest): classify documents → check permissions → extract chunks → LLM extract entities + relationships → generate summaries → embed into vector store + commit edges to graph.
- Memify (refine): prune stale nodes, strengthen frequent connections, reweight edges based on usage signals, derive new facts.
Memify is the unique contribution. Most frameworks treat memory as append-only or hand contradiction-resolution to retrieval. Cognee actively rewrites the graph based on usage patterns. This is what we want for drift detection.
Cognee ships 14 named retrieval modes — GRAPH_COMPLETION, RAG_COMPLETION,
GRAPH_COMPLETION_COT, GRAPH_COMPLETION_CONTEXT_EXTENSION,
GRAPH_SUMMARY_COMPLETION, TRIPLET_COMPLETION, NATURAL_LANGUAGE,
CYPHER, CHUNKS, CHUNKS_LEXICAL, SUMMARIES, TEMPORAL,
CODING_RULES, FEELING_LUCKY. The right mode depends on the question. They
claim +25% to +1618% on multi-hop reasoning vs Mem0 and Graphiti. Their own
benchmark, take with salt — but the idea of selecting a retrieval mode per
question type is sound.
### 2.10 Zep / Graphiti

The most architecturally serious. Graphiti is the open-source temporal-graph engine (Apache-2.0, 25.9k stars); Zep is the hosted product on top.
Schema:
- Episodes — raw ingested data with timestamps. Ground truth. Every derived fact traces back to an episode.
- Entities — nodes with summaries that evolve over time.
- Facts — directed edges (subject, predicate, object) with bi-temporal validity (`t_valid`, `t_invalid`) and ingestion bookkeeping (`t'_created`, `t'_expired`).
Bi-temporal model:
- Event time `T` — when the fact was true in the world.
- Ingestion time `T'` — when we observed it.
- Contradictions don’t delete — they invalidate. The old fact gets `t_invalid = new_fact.t_valid`. You can always query “what was true at time X” or “what is true now.”
Extraction pipeline: for each episode, extract speaker + N preceding messages as context, run entity extraction with dedup via embedding + full-text + LLM resolution, then fact extraction (hyper-edges for multi-entity facts), then temporal-overlap check to invalidate contradicted edges.
Retrieval: three parallel methods (cosine semantic, BM25, breadth-first graph traversal up to n hops) → rerank (RRF / MMR / episode-mention frequency / graph distance / cross-encoder) → context construction. No LLM calls during retrieval. P95 ~300ms.
Benchmarks: 18.5% absolute on LongMemEval (gpt-4o, vs 60.2% baseline), 90% lower latency. On the older DMR benchmark, 94.8% vs MemGPT’s 93.4% (gpt-4-turbo).
Backend dependencies: Neo4j 5.26+, FalkorDB, Kuzu, or Neptune; Neptune deployments additionally need OpenSearch Serverless as the search backend. This is the backend tax that argues against adopting Graphiti directly — adding Neo4j to Wes’s fleet is real ops work. But the schema and the bi-temporal model are the single most important pattern to copy.
## 3. Recommended Hybrid Architecture for Claude Code Transcripts

### 3.1 Substrate (use what’s already on the fleet)

- SQLite (per-machine indexer) + D1 (fleet-wide read model). No Neo4j, no Postgres. SQLite handles 12 months of transcripts on a laptop. D1 holds the federated read model for cross-machine queries.
- Embedding store: local SQLite using the `sqlite-vec` extension (or `sqlite-vss`). For the cross-machine read model, Cloudflare Vectorize.
- BM25: SQLite FTS5. Already built in. Free.
- Graph: materialized as edge tables in SQLite. We don’t need real graph traversal beyond 2 hops; recursive CTEs handle that (see the sketch after this list). If the 12-month corpus reveals 3+ hop needs later, swap to Kuzu (embeddable, no server).
- Raw transcripts: stay on disk where Claude Code already writes them. The indexer just builds pointers (`session_id`, `message_index`, `byte_offset`).
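To make the “recursive CTEs handle 2 hops” claim concrete, a minimal sketch against the `facts` table defined in §3.3 (Python stdlib `sqlite3`; the depth cap and helper name are illustrative):

```python
import sqlite3

# 2-hop neighborhood of a seed entity over the facts table from section 3.3.
# UNION (not UNION ALL) dedups visited rows, so cycles terminate.
TWO_HOP = """
WITH RECURSIVE hops(entity_id, depth) AS (
  SELECT :seed, 0
  UNION
  SELECT CASE WHEN f.subject_id = h.entity_id
              THEN f.object_id ELSE f.subject_id END,
         h.depth + 1
  FROM facts f
  JOIN hops h ON h.entity_id IN (f.subject_id, f.object_id)
  WHERE h.depth < 2
    AND f.t_invalid IS NULL        -- currently-valid edges only
    AND f.object_id IS NOT NULL    -- skip literal-valued facts
)
SELECT DISTINCT entity_id FROM hops WHERE entity_id IS NOT NULL;
"""

def two_hop_neighbors(db: sqlite3.Connection, seed: str) -> list[str]:
    return [row[0] for row in db.execute(TWO_HOP, {"seed": seed})]
```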
### 3.2 Layer 1 — Raw (immutable, source of truth)

- `~/.claude/projects/*/<session_id>.jsonl` already exists. Don’t move it (MEMORY.md note already says don’t — Claude uses the path as cwd lookup key). Index in place.
- Per-message pointer table:

```sql
CREATE TABLE messages (
  id          TEXT PRIMARY KEY,     -- session_id:idx
  session_id  TEXT NOT NULL,
  idx         INTEGER NOT NULL,     -- position in jsonl
  machine     TEXT NOT NULL,        -- mac/clippy/grater/imac/m1
  variant     TEXT,                 -- pepper/nagatha/etc
  cwd         TEXT NOT NULL,
  role        TEXT NOT NULL,        -- user/assistant/tool/system
  ts          INTEGER NOT NULL,     -- unix ms
  tokens_est  INTEGER,
  byte_offset INTEGER NOT NULL,
  byte_len    INTEGER NOT NULL,
  UNIQUE(session_id, idx)
);
```

- Never edit. Re-index by re-walking jsonl files; cheap (a walker sketch follows this list).
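A minimal walker sketch for populating `messages`; the field names inside Claude Code `.jsonl` lines (`cwd`, `timestamp`, `message.role`) are assumptions to verify against real transcripts:

```python
import json
import sqlite3
from pathlib import Path

def index_session(db: sqlite3.Connection, path: Path,
                  machine: str, variant: str | None) -> None:
    """Record a byte-range pointer per .jsonl line. Record-field names
    (cwd, timestamp, message.role) are assumptions; verify against real
    transcripts before trusting the role/ts columns."""
    session_id = path.stem
    offset = 0
    with path.open("rb") as f:
        for idx, raw in enumerate(f):
            try:
                rec = json.loads(raw)
            except json.JSONDecodeError:
                rec = {}  # keep offsets honest even for corrupt lines
            msg = rec.get("message") or {}
            db.execute(
                "INSERT OR IGNORE INTO messages VALUES (?,?,?,?,?,?,?,?,?,?,?)",
                (f"{session_id}:{idx}", session_id, idx, machine, variant,
                 rec.get("cwd", ""), msg.get("role", rec.get("type", "")),
                 rec.get("timestamp", 0), None, offset, len(raw)),
            )
            offset += len(raw)
    db.commit()
```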
### 3.3 Layer 2 — Structured (the working index)

Three table families. All append-only; updates go through new rows with `t_invalid` set on the old.
#### Episodes (Graphiti pattern)

```sql
CREATE TABLE episodes (
  id             TEXT PRIMARY KEY,
  session_id     TEXT NOT NULL,
  machine        TEXT,
  variant        TEXT,
  cwd            TEXT,
  start_ts       INTEGER NOT NULL,
  end_ts         INTEGER NOT NULL,
  msg_start_idx  INTEGER NOT NULL,
  msg_end_idx    INTEGER NOT NULL,
  summary        TEXT,               -- 2-4 sentences
  topics         TEXT,               -- JSON array
  outcome        TEXT,               -- success/blocked/abandoned
  decision_count INTEGER DEFAULT 0,
  created_at     INTEGER NOT NULL
);
```

An episode is a coherent unit of work inside a session. Granularity: ~5–30 messages. We segment by tool-call density spikes, by `/clear` boundaries, and by topic-shift heuristics (a segmenter sketch follows). Episodes are the unit of summarization.
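A sketch of the segmenter covering the two cheap heuristics (`/clear` markers, idle gaps); the 30-minute threshold is an assumption to tune, the `text`/`ts` fields on each row are assumed to be joined in from the raw layer, and the topic-shift cosine cut would slot in once embeddings exist:

```python
def segment_episodes(msgs: list[dict],
                     gap_ms: int = 30 * 60 * 1000) -> list[tuple[int, int]]:
    """Cut one session's messages (ordered by idx) into episode ranges.
    Implements the two cheap cuts: explicit /clear markers and long idle
    gaps. The 30-minute gap is a guess to tune empirically; a topic-shift
    cosine drop between consecutive turns would be the third condition."""
    if not msgs:
        return []
    cuts = [0]
    for i in range(1, len(msgs)):
        if msgs[i].get("text", "").strip() == "/clear" or \
           msgs[i]["ts"] - msgs[i - 1]["ts"] > gap_ms:
            cuts.append(i)
    cuts.append(len(msgs))  # sentinel end
    # pair consecutive cut points into (msg_start_idx, msg_end_idx) ranges
    return [(cuts[j], cuts[j + 1] - 1) for j in range(len(cuts) - 1)]
```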
#### Entities and Facts (Graphiti pattern, simplified)

```sql
CREATE TABLE entities (
  id         TEXT PRIMARY KEY,  -- canonical_name
  type       TEXT NOT NULL,     -- client/project/file/person/tool/decision/rule
  summary    TEXT,
  aliases    TEXT,              -- JSON array
  created_at INTEGER NOT NULL,
  last_seen  INTEGER NOT NULL
);

CREATE TABLE facts (
  id         TEXT PRIMARY KEY,
  subject_id TEXT NOT NULL REFERENCES entities(id),
  predicate  TEXT NOT NULL,
  object_id  TEXT REFERENCES entities(id),
  object_lit TEXT,                          -- literal value when object isn't an entity
  episode_id TEXT REFERENCES episodes(id),  -- provenance
  t_valid    INTEGER NOT NULL,              -- when the fact became true (event time)
  t_invalid  INTEGER,                       -- NULL = still true
  t_created  INTEGER NOT NULL,              -- ingestion time
  t_expired  INTEGER,                       -- when superseded in our store
  confidence REAL DEFAULT 1.0,
  source_msg TEXT REFERENCES messages(id)
);
CREATE INDEX facts_subject ON facts(subject_id, t_invalid);
CREATE INDEX facts_pred ON facts(predicate, t_invalid);
```

Drift detection falls out for free. “When did the no-band-aid-fixes rule change?” = `SELECT * FROM facts WHERE predicate='rule' AND subject_id='wes' ORDER BY t_valid DESC` shows every version with its validity interval.
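Point-in-time reads are the other bi-temporal payoff; a query sketch against the schema above (`:as_of` in unix ms, same clock as `t_valid`/`t_invalid`):

```python
# "What was true about an entity at time X?"
AS_OF = """
SELECT subject_id, predicate, COALESCE(object_id, object_lit) AS object
FROM facts
WHERE subject_id = :entity
  AND t_valid <= :as_of
  AND (t_invalid IS NULL OR t_invalid > :as_of)
ORDER BY t_valid;
"""
```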
#### Chunks (Mem0 pattern, for semantic search)

```sql
CREATE TABLE chunks (
  id         TEXT PRIMARY KEY,
  episode_id TEXT REFERENCES episodes(id),
  text       TEXT NOT NULL,
  text_norm  TEXT NOT NULL,  -- for FTS5
  kind       TEXT NOT NULL,  -- turn/reasoning/decision/tool_io
  created_at INTEGER NOT NULL
);
CREATE VIRTUAL TABLE chunks_fts USING fts5(text_norm, content='chunks', content_rowid='rowid');

CREATE TABLE chunk_vec (
  chunk_id  TEXT PRIMARY KEY REFERENCES chunks(id),
  embedding BLOB NOT NULL  -- via sqlite-vec
);
```

Chunks are the embedding unit. One chunk per assistant turn for reasoning, plus separate chunks for code blocks, decisions, and tool outputs. Don’t embed everything — embedding 12 months of raw tool output is wasteful and poisons retrieval.
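Nearest-neighbor lookup over `chunk_vec`, assuming the `sqlite-vec` Python package; a brute-force cosine scan is fine at per-machine scale, with a `vec0` virtual table as the upgrade path if scans get slow:

```python
import sqlite3
import struct

import sqlite_vec  # pip install sqlite-vec

def nearest_chunks(db: sqlite3.Connection, query_vec: list[float], k: int = 20):
    """Brute-force cosine scan over chunk_vec using sqlite-vec's
    distance function. Sketch, not tuned."""
    db.enable_load_extension(True)
    sqlite_vec.load(db)
    q = struct.pack(f"{len(query_vec)}f", *query_vec)  # float32 blob, matching storage
    return db.execute(
        """SELECT chunk_id, vec_distance_cosine(embedding, ?) AS dist
           FROM chunk_vec ORDER BY dist LIMIT ?""",
        (q, k),
    ).fetchall()
```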
### 3.4 Layer 3 — Insight (the “dream” job)

Run nightly. Reads yesterday’s episodes; writes:
- Updated entity summaries (entities accrue new aliases, type changes).
- Cross-episode patterns (“this is the 4th time Wes asked about GraphQL reverse-engineering this month — promote to procedural memory”).
- Drift signals (predicate-versus-time graphs, contradiction counts).
- Pruning candidates (low-usage entities, fully-superseded facts older than N months — soft-deleted, not hard-removed).
Borrowed wholesale from Cognee’s Memify and the sleep-time-compute pattern.
Schedule under launchd on Mac, systemd on Linux, Task Scheduler on PC. Reuse
the existing /dream skill plumbing.
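A skeleton of the job’s concrete part: the drift signal, computed as invalidation counts per (subject, predicate) over the last day. The LLM steps from the list above (entity re-summaries, pattern promotion, prune candidates) would run in the same pass:

```python
import sqlite3
import time

DAY_MS = 86_400_000

def dream(db: sqlite3.Connection, now_ms: int | None = None) -> list[tuple]:
    """Nightly consolidation sketch. Returns (subject, predicate, flips)
    rows: how many facts were invalidated in the last day, grouped by
    (subject, predicate). High flip counts are drift candidates."""
    now = now_ms or int(time.time() * 1000)
    return db.execute(
        """SELECT subject_id, predicate, COUNT(*) AS flips
           FROM facts
           WHERE t_invalid IS NOT NULL AND t_invalid >= ?
           GROUP BY subject_id, predicate
           ORDER BY flips DESC""",
        (now - DAY_MS,),
    ).fetchall()
```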
### 3.5 Extraction Pipeline (Mem0 + Graphiti, in one)

Per-episode, single LLM call:

```
Input:  episode messages + recent-N entities + recent-N facts (for dedup hints)
Output: { new_entities: [...], new_facts: [...], summary: "...", outcome: "...", topics: [...] }
```

- Single-pass ADD-only. No UPDATE/DELETE at write time (Mem0 lesson).
- Dedup via embedding + name during write. If two candidates score >0.92 cosine and share an alias, merge.
- Contradiction detection happens at write time too (Graphiti lesson): for each new fact `(s, p, o)`, find existing facts with the same `(s, p)` and `t_invalid IS NULL`. If they conflict, set the old fact’s `t_invalid` to the new fact’s `t_valid`. Keep both rows.
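A write-path sketch of that invalidation rule (the helper name is illustrative; real code would first check that the object actually differs, so a re-assertion doesn’t flip its own history):

```python
import sqlite3
import time
import uuid

def add_fact(db: sqlite3.Connection, s: str, p: str,
             o: str | None, lit: str | None,
             episode_id: str, t_valid: int) -> None:
    """Invalidate any open fact with the same (subject, predicate),
    then append the new row. Both rows survive for temporal queries."""
    db.execute(
        """UPDATE facts SET t_invalid = ?
           WHERE subject_id = ? AND predicate = ? AND t_invalid IS NULL""",
        (t_valid, s, p),
    )
    db.execute(
        """INSERT INTO facts (id, subject_id, predicate, object_id,
                              object_lit, episode_id, t_valid, t_created)
           VALUES (?,?,?,?,?,?,?,?)""",
        (str(uuid.uuid4()), s, p, o, lit, episode_id, t_valid,
         int(time.time() * 1000)),
    )
    db.commit()
```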
Use Haiku for the extraction LLM. Sonnet for the nightly insight job. Opus only for question-answering at retrieval time.
### 3.6 Retrieval Pipeline (Zep pattern, fitted to question type)

Three parallel candidate generators → fuse → optionally rerank:
- BM25 over `chunks_fts` (FTS5).
- Dense vector over `chunk_vec` (sqlite-vec cosine).
- Graph traversal — start from entities mentioned in the query (matched by canonical name and aliases), traverse 1-2 hops through `facts` where `t_invalid IS NULL` (or at a specified `as_of` time).
Fuse with RRF (k=60). Then for >50 candidates or precision-sensitive queries, run a cross-encoder rerank (a small local model or one Sonnet call).
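RRF itself is ~10 lines; a sketch with the standard 1/(k + rank) scoring:

```python
def rrf_fuse(result_lists: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Reciprocal Rank Fusion: score(d) = sum of 1/(k + rank_d) over each
    candidate list (BM25, dense, graph) that returned d. Ranks are
    1-based; candidates absent from a list simply contribute nothing."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, chunk_id in enumerate(results, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Typical call: `rrf_fuse([bm25_ids, dense_ids, graph_ids])`, with the optional cross-encoder pass reordering only the top of that list.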
Per-question-type routing (Cognee idea, lighter version):
- “What did Wes decide about X?” → graph-first (facts with predicate=`decided`, subject=`wes`, object related to X).
- “Find the time we discussed Y.” → BM25-first (exact-term recall).
- “Have I seen this error before?” → vector-first (semantic similarity).
- “When did rule Z change?” → temporal scan on the facts table.
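A deliberately dumb router sketch for these four shapes; the regexes are illustrative placeholders, with a Haiku classifier as the upgrade path:

```python
import re

def pick_mode(q: str) -> str:
    """Map a question to a retrieval mode. Order matters: the decision
    check runs before the temporal check so "what did Wes decide" is
    graph-first even if it mentions time."""
    if re.search(r"\b(decided?|decision)\b", q, re.IGNORECASE):
        return "graph_first"
    if re.search(r"\bwhen did\b.*\b(change|start|stop)\b", q, re.IGNORECASE):
        return "temporal_scan"
    if re.search(r"\b(error|exception|traceback|failed)\b", q, re.IGNORECASE):
        return "vector_first"
    return "bm25_first"  # exact-term recall is the safest default
```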
### 3.7 When to embed vs not

- Embed: user prompts, assistant reasoning/decisions, episode summaries, entity summaries. These are semantically rich and short.
- Don’t embed: tool inputs (paths, args), tool outputs (file contents, command stdout), system reminders, hook output, raw JSON dumps. These poison the embedding space with low-signal, high-volume noise. Index them with BM25 only and point to the raw byte range.
Rough math: 5 machines × 18 variants × ~500 sessions/variant over 12 months × ~400 messages/session ≈ 18M messages. If we embed only assistant text + decisions + summaries (~20% of messages), that’s ~3.6M embeddings. At 1024 dims × 4 bytes = 4 KB each → ~14 GB. Manageable on a single SSD; on Vectorize, well under quota.
### 3.8 Schema for Wes’s specific entity types

Predefine these entity types — they cover the corpus:

- `client` (covenant, shaw, summit, hines-creative, wellness-oasis, …)
- `project` (covenant-hcp-sync, shaw-plumbing-site, hcp-prospector, …)
- `machine` (mac, clippy, grater, imac, m1)
- `variant` (pepper, stark, nagatha, bilby, …)
- `person` (wes, lance, scott, anne, christian, …)
- `tool` (mcp server names, cli names)
- `file` (canonical paths in vault and repos)
- `decision` (one-line statements: “use 11ty for shaw v3”)
- `rule` (standing rules — “ALTER TABLE, don’t DROP”)
- `incident` (named events — “Shaw incident 2026-03-23”)
- `deadline` (dated commitments)
Predefined predicates (think Graphiti EntityTypes + EdgeTypes): `decided`, `learned`, `blocked_on`, `worked_on`, `pinged`, `deployed_to`, `assigned_to`, `partners_with`, `superseded_by`, `tagged`, `due_on`, `lives_at`, `mentioned_in_episode`.
Constraining the schema is what makes drift detection tractable. With a fixed predicate set, “rule changes” is a single query.
### 3.9 Surface — how agents query it

Expose two interfaces:

- Memory MCP server matching Anthropic’s `memory_20250818` contract. `/memories/index/by-client/covenant.md` synthesizes recent facts about Covenant on read; `/memories/raw/<session_id>.txt` returns a pointer. Variants already know how to use this surface (sketch after this list).
- CLI / fleet-node endpoint — `transcripts query "..."` returns ranked results with provenance (episode + source message). The same query goes to fleet-node so any machine can hit the federated read model.
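A sketch of the read side of that surface: one route synthesized from the `facts` table on demand. The route shape mirrors this section; the handler and its formatting are illustrative, not part of Anthropic’s contract:

```python
import sqlite3
from pathlib import PurePosixPath

def memories_view(db: sqlite3.Connection, path: str) -> str:
    """Handle `view` on /memories/index/by-client/<name>.md by
    synthesizing a page from currently-valid facts about that client."""
    p = PurePosixPath(path)
    if len(p.parts) == 5 and p.parts[:4] == ("/", "memories", "index", "by-client"):
        rows = db.execute(
            """SELECT predicate, COALESCE(object_id, object_lit), t_valid
               FROM facts WHERE subject_id = ? AND t_invalid IS NULL
               ORDER BY t_valid DESC LIMIT 50""",
            (p.stem,),  # file stem is the canonical client entity id
        ).fetchall()
        return "\n".join(f"- {pred}: {obj} (since {ts})" for pred, obj, ts in rows)
    raise FileNotFoundError(path)
```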
### 3.10 Build order

- Raw layer + messages table. Walk every `.jsonl` on Mac, index it. No LLM yet. Verify counts. (1 day.)
- Episode segmentation + summarization. Haiku call per session, chunked. Cache aggressively. (2-3 days.)
- Entity + fact extraction with bi-temporal model. Predefine schemas. Backfill across all episodes. (3-5 days.)
- Embeddings on the right subset only. sqlite-vec. (1 day.)
- Retrieval CLI + RRF fusion. (2 days.)
- Nightly Memify/dream job. Reuse the `/dream` skill. (2 days.)
- Memory MCP server surface. Wraps the same query layer. (2 days.)
- Federate to D1/Vectorize for fleet-wide queries. (3-5 days.)
Two-week first cut to a system that beats grep meaningfully. The 12-month corpus pulls in over time once the pipeline is stable.
## 4. Open Questions

- Episode segmentation heuristics. What’s the right rule for cutting a session into episodes? Tool-call density spikes? `/clear` markers (already present in transcripts)? Topic-shift cosine drops between consecutive turns? Needs empirical tuning on real data.
- Cross-machine identity resolution. When Pepper on Clippy and Stark on Clippy both reference “the Shaw site,” do we want one entity or two (variant-scoped)? Mem0’s four-scope model says scope by `app_id`, but we usually want cross-variant queries. Lean toward one entity, with `mentioned_by_variant` as a fact predicate. Verify with real queries.
- Embedding model choice and lock-in. OpenAI `text-embedding-3-small` is cheapest and good enough, but ties us to OpenAI. Local models (`bge-small-en-v1.5`, `nomic-embed-text`) avoid the call but need per-machine compute. Mac M3 can run nomic locally at <50ms/embedding. Recommend local for Mac+Clippy+Grater, OpenAI as fallback. Lock in the dimension (1024) so a model swap doesn’t force a vector-store schema change.
- Drift detection threshold. A rule that “changed” at minute granularity is noise; one that “changed” at month granularity is signal. What’s the debounce window? Probably a function of predicate type — `decided` is spiky; `rule` should be stable for weeks before counting as drift.
- Tool-output handling. Some tool outputs are gold (a Cloudflare `wrangler deploy` log) and some are dross (a directory listing). Should we extract facts from tool output, or only from assistant text? Trying only assistant text first will undercount infra facts. Pragmatic: extract from `assistant.text` and from tool outputs only when the tool name is on an allowlist (wrangler, gh, sqlite3, curl results parsed as JSON, etc).
- Privacy / secrets. Transcripts contain leaked tokens (the `feedback_never_echo_secrets_in_text.md` and `fleet-node-keyed-path` incidents). Extraction can’t put secret values into facts. Need a pre-extraction scrubber regex (matching the existing `pre-tool-use-secret-intercept.js` patterns) before any LLM call — a scrubber sketch follows this list.
- Conflict between fact extraction and Wes’s vault canon. The vault is already human-canon. When the extractor produces a fact that contradicts the vault, who wins? Default rule: vault wins; emit a `drift_alert` for Wes to review. But this needs a real reconciliation policy.
- Replay/idempotency. If we re-extract a transcript later (better model, schema update), do we throw out the prior facts and re-emit, or merge? Likely answer: keep prior with `t_expired`, re-emit fresh. Same bi-temporal trick on the extraction itself.
- Cost ceiling on extraction. At Haiku rates, the 12-month corpus extraction is probably $20-80. Acceptable. But re-extraction every time we change the schema is the recurring cost — plan for that. Sonnet on the insight pass only; Haiku on raw extraction; never Opus inside the pipeline.
- What “the corpus” includes. Just `.jsonl` transcripts, or also handoff files, Telegram logs, the vault itself, peer-messages? Strong bias toward starting with transcripts only and layering others in as distinct “source” tags on episodes. A fact found in the vault is higher trust than one extracted from a transcript turn.
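A scrubber sketch for the privacy item above; the patterns are illustrative common token shapes, not the real intercept list:

```python
import re

# Pre-extraction scrubber, run on every chunk before any LLM call.
# Illustrative patterns only: the real list should mirror whatever
# pre-tool-use-secret-intercept.js already matches.
SECRET_PATTERNS = [
    re.compile(r"sk-ant-[A-Za-z0-9_-]{20,}"),             # Anthropic-style keys
    re.compile(r"AKIA[0-9A-Z]{16}"),                      # AWS access key IDs
    re.compile(r"ghp_[A-Za-z0-9]{36}"),                   # GitHub personal tokens
    re.compile(r"(?i)bearer\s+[A-Za-z0-9._~+/=-]{20,}"),  # Authorization headers
]

def scrub(text: str) -> str:
    for pat in SECRET_PATTERNS:
        text = pat.sub("[REDACTED]", text)
    return text
```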
## 5. Patterns Adopted vs Rejected (Quick Reference)

Adopted:
- Graphiti episode→entity→fact pipeline with bi-temporal validity.
- Mem0 single-pass ADD-only extraction.
- Mem0 four-scope namespacing (mapped: machine/variant/session/project).
- Zep three-way hybrid retrieval (BM25 + dense + graph) with RRF.
- Cognee Memify post-ingest refinement (becomes our `/dream` job).
- Cognee retrieval-mode-per-question-type idea (lighter version).
- LangMem semantic/episodic/procedural taxonomy.
- Letta core-memory-block concept (for agent identity cards).
- Anthropic Memory tool as the surface (so agents already know the API).
- Anthropic Memory MCP server entity/relation/observation schema (as a starting canonical form).
- Sleep-time-compute as the background-job pattern.
- OpenAI “memory sources” UX — surface provenance per answer.
Rejected:
- Neo4j or Postgres dependency (use SQLite/D1).
- Letta as runtime (too heavy to swap for the fleet).
- Mem0 server (we want the pattern, not the service).
- LlamaIndex memory (in-session, not corpus-wide).
- Substring-only search (the official Memory MCP is fine schema, weak retrieval).
- Embedding everything (tool output, dir listings) — embed only semantically rich text.
- Hard-deleting on contradiction — invalidate, never delete.
## 6. Sources

Letta / MemGPT:
- https://github.com/letta-ai/letta
- https://docs.letta.com/concepts/memgpt/
- https://www.letta.com/blog/agent-memory
- https://www.letta.com/blog/sleep-time-compute
- https://arxiv.org/abs/2504.13171 (Sleep-time Compute paper)
- https://github.com/letta-ai/sleep-time-compute
- https://vectorize.io/articles/mem0-vs-letta
Mem0:
- https://github.com/mem0ai/mem0
- https://mem0.ai/blog/state-of-ai-agent-memory-2026
- https://mem0.ai/blog/graph-memory-solutions-ai-agents
- https://docs.mem0.ai/cookbooks/essentials/choosing-memory-architecture-vector-vs-graph
- https://arxiv.org/pdf/2504.19413 (Mem0 paper)
LangMem / LangChain:
- https://langchain-ai.github.io/langmem/concepts/conceptual_guide/
- https://blog.langchain.com/langmem-sdk-launch/
- https://langchain-ai.github.io/langmem/guides/extract_episodic_memories/
LlamaIndex:
- https://developers.llamaindex.ai/python/framework/module_guides/deploying/agents/memory/
- https://developers.llamaindex.ai/python/examples/agent/memory/summary_memory_buffer/
Anthropic:
- https://platform.claude.com/docs/en/agents-and-tools/tool-use/memory-tool
- https://github.com/modelcontextprotocol/servers/tree/main/src/memory
- https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents
- https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents
Zep / Graphiti:
- https://arxiv.org/abs/2501.13956 (Zep paper)
- https://arxiv.org/html/2501.13956v1
- https://github.com/getzep/graphiti
- https://blog.getzep.com/graphiti-knowledge-graphs-for-agents/
- https://www.getzep.com/product/open-source/
Cognee:
- https://github.com/topoteretes/cognee
- https://www.cognee.ai/blog/fundamentals/how-cognee-builds-ai-memory
- https://memgraph.com/blog/from-rag-to-graphs-cognee-ai-memory
OpenAI:
- https://openai.com/index/memory-and-new-controls-for-chatgpt/
- https://help.openai.com/en/articles/8590148-memory-faq
- https://www.arielsoftwares.com/openai-conversations-api/
Hybrid RAG references:
- https://calmops.com/ai/hybrid-search-rag-complete-guide-2026/
- https://blog.premai.io/hybrid-search-for-rag-bm25-splade-and-vector-search-combined/
- https://superlinked.com/vectorhub/articles/optimizing-rag-with-hybrid-search-reranking
- https://arxiv.org/html/2507.03226v3 (GraphRAG construction at scale)
- https://medium.com/@kumaran.isk/building-a-production-rag-pipeline-start-with-hybrid-retrieval-dense-bm25-rrf-e901aba17cae
Dream / AutoDream community impls: