05 — Indexing & Search Technology Evaluation
Corpus: ~1–10M events (one event = one Claude Code conversation turn),
parsed from .jsonl transcripts across 5 fleet machines.
Host: Cheesegrater (Ubuntu 24.04, 2009 Mac Pro, vault storage).
Needs: full-text + structured filters (variant, machine, role, tool, date),
eventually semantic search. Constraints: single-machine, low ops, solo operator, <100ms typical queries.
TL;DR — Recommended Stack
Phase 1 (today, ship in a weekend): SQLite FTS5 + sqlite-vec, both as
loadable extensions inside one .sqlite file. Single binary, no daemon, no
backup story to invent (it’s a file — rsync it). BM25 ranking is built in;
sqlite-vec is the actively maintained successor to sqlite-vss and works as a
virtual table you join against. Hybrid search = two queries in the same
transaction with Reciprocal Rank Fusion (RRF) in Python. Scales comfortably
to ~5M events with <100ms p50 for typical queries on this hardware.
Phase 2 (only if Phase 1 hits a wall): DuckDB FTS for the analytics
side (cohort queries across machine × variant × date) while keeping SQLite as
the row store. DuckDB reads SQLite directly via its sqlite_scanner
extension, so no migration cost.
Phase 3 (only if semantic recall is genuinely insufficient): add a BGE-reranker stage (cross-encoder, runs locally on CPU for top-50 candidates). Don’t reach for Qdrant/Tantivy/etc. until you’ve proven the embedded path can’t carry the load.
Why this and not the cool stuff: every other option on the list pulls in either a daemon, a multi-node mental model, or a query-language tax. Wes runs lean, hates daemon sprawl, and the corpus is small enough that a single SQLite file with two extensions wins every category that matters for a solo operator.
Comparison Matrix
Legend: A = great fit, B = workable, C = misfit for this use case. “Daemon” = does this process need to be supervised separately from the app?
Full-Text Engines
| Engine | Daemon? | RAM @ 5M docs | Install | Hybrid story | Backup | License | Fit |
|---|---|---|---|---|---|---|---|
| SQLite FTS5 | No (library) | ~200–500 MB index, query RSS minimal | apt install / built into Python | Pairs with sqlite-vec in same DB | cp file.db | Public domain | A |
| DuckDB FTS | No (library) | Columnar, very cheap for analytics | pip install duckdb | No native vector; can attach | cp file.db | MIT | A- |
| Tantivy / pytantivy | No (library) | ~1–2 GB index typical | Rust toolchain or wheel | Manual; bolt on a vec store | Filesystem copy | MIT/Apache | B |
| Bleve | No (library, Go) | Similar to Tantivy | Requires Go integration | Manual | Filesystem copy | Apache 2.0 | C (Go ecosystem mismatch) |
| Manticore Search | Yes (mysqld-style) | 1–3 GB | apt repo, systemd unit | Native vectors since 6.x | Snapshots, replication | GPL-2 | B (overkill) |
| Meilisearch | Yes | ~1–2 GB | Single binary, systemd | Hybrid search GA in 1.6+ | Snapshot API | MIT | B |
| Typesense | Yes | 1–2 GB | Single binary, systemd | Native hybrid in 0.25+ | Snapshot API | GPL-3 | B |
| ParadeDB | Yes (Postgres) | Heavy — full Postgres footprint | Docker / pg extension | Tantivy + pgvector in one DB | pg_dump | AGPL/PG | C (you’d be running Postgres) |
Vector / Hybrid Engines
| Engine | Daemon? | RAM @ 5M vecs (768-d) | Install | Hybrid (BM25)? | Backup | License | Fit |
|---|---|---|---|---|---|---|---|
| sqlite-vec | No (loadable ext) | ~3–6 GB if you keep all in RAM; disk-backed otherwise | pip install sqlite-vec | Pairs with FTS5 (same DB) | cp file.db | Apache 2.0 | A |
| sqlite-vss | No (loadable ext) | Similar | Deprecated; uses Faiss | Same as above | cp file.db | Apache 2.0 | C (author moved to sqlite-vec) |
| LanceDB | No (embedded, Rust) | Columnar, mmap-friendly, very efficient | pip install lancedb | Built-in BM25 + vector in 0.4+ | Filesystem (Lance format) | Apache 2.0 | B+ (close runner-up) |
| ChromaDB | Optional (embedded or server) | 4–8 GB common | pip install chromadb | BM25 added but immature | Filesystem | Apache 2.0 | C (history of breaking changes, mid-tier perf) |
| Qdrant | Yes | 2–6 GB | Single binary or Docker | Sparse vectors + dense, no native BM25 | Snapshot API | Apache 2.0 | B (best vector DB IMO, but daemon) |
| Weaviate | Yes | 4–10 GB | Docker, GraphQL | Native hybrid (BM25 + vec) | Backup module | BSD-3 | C (operationally heavy, GraphQL tax) |
| pgvector | Yes (Postgres) | Depends on PG | pg extension | Pair with pg_trgm or ParadeDB | pg_dump | PostgreSQL | C (only if PG is already in play) |
| Milvus / Milvus Lite | Lite=embedded; full=daemons | Lite ~1–2 GB; full multi-GB | pip install pymilvus (Lite) | Lite has limited filtering; full has hybrid | Filesystem (Lite) | Apache 2.0 | C (Lite is workable but small ecosystem for hybrid) |
| Marqo | Yes (Docker, OpenSearch-backed) | 4+ GB | Docker compose | Native hybrid | OpenSearch snapshots | Apache 2.0 | C (way too heavy) |
Hybrid Patterns
| Pattern | When to use | Cost | Complexity |
|---|---|---|---|
| BM25 + dense + RRF | Default for hybrid. ~60 lines of Python. | None beyond having both indexes | Low |
| Cross-encoder reranker (BGE-reranker-v2-m3) | When recall is fine but ranking is mushy. Rerank top 50 → top 10. | ~30–80ms/query on CPU for 50 candidates | Low–medium |
| ColBERT / late interaction | When you have research-grade quality needs. Storage cost is ~10× a normal embedding. | Large storage, gpu-friendly, cpu-painful | Medium–high |
| Cohere rerank API | If you want managed quality without local infra. | $$ per query, network roundtrip | Low |
Recommended Architecture (detailed)
events.db (single SQLite file, ~5–10 GB at 5M events):

```
TABLE events
  id          INTEGER PRIMARY KEY
  session_id  TEXT
  machine     TEXT     (mac | clippy | clarvis | imac | cheesegrater)
  variant     TEXT     (canonical name)
  cwd         TEXT
  role        TEXT     (user | assistant | tool_result)
  tool        TEXT     (NULL or tool name)
  ts          INTEGER  (epoch seconds, indexed)
  text        TEXT     (raw content)
  meta_json   TEXT

VIRTUAL TABLE events_fts USING fts5(
  text, content='events', content_rowid='id',
  tokenize='porter unicode61')

VIRTUAL TABLE events_vec USING vec0(
  embedding float[768])   -- sqlite-vec

INDEX idx_machine_variant_ts ON events(machine, variant, ts DESC)
INDEX idx_tool_ts ON events(tool, ts DESC)
```

Query patterns
Pure structured: straight SQL with the composite indexes.
```sql
SELECT * FROM events
WHERE machine='mac' AND variant='clars'
  AND ts > strftime('%s','2026-05-01')
ORDER BY ts DESC LIMIT 50;
```

Pure full-text:
```sql
SELECT e.*
FROM events_fts
JOIN events e ON e.id = events_fts.rowid
WHERE events_fts MATCH 'tantivy OR meilisearch'
ORDER BY bm25(events_fts)
LIMIT 50;
```

Hybrid (RRF in Python):
- Run FTS query → top-100 ids by BM25.
- Run vec0 query → top-100 ids by cosine.
- Fuse: score(id) = Σᵢ 1/(k + rankᵢ(id)), with k = 60.
- Apply structured filters as a final WHERE on the candidate set.
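The fusion step above is small enough to inline. A minimal sketch of the RRF scoring; the ids and rank lists are illustrative, not from a real query:

```python
from collections import defaultdict

def rrf_fuse(ranked_lists, k=60):
    """score(id) = sum over lists of 1/(k + rank), rank starting at 1."""
    scores = defaultdict(float)
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

bm25_top = [3, 1, 7, 2]   # ids from the FTS5 query, best first
vec_top = [7, 3, 9, 1]    # ids from the vec0 query, best first
fused = rrf_fuse([bm25_top, vec_top])
# → [3, 7, 1, 9, 2]
```

Documents that appear high in both lists win; documents unique to one list survive but rank lower. k=60 is the conventional damping constant from the RRF literature.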
With reranker (Phase 3):
- Take the top-50 from the RRF fusion above.
- Score with BGE-reranker-v2-m3 (CPU is fine for 50 candidates).
- Return top-10.
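The rerank stage has a simple shape regardless of the model behind it. A sketch with a stand-in scorer; a real deployment would replace `overlap_score` with cross-encoder scores (e.g. BGE-reranker-v2-m3, typically loaded via the FlagEmbedding package), which is not shown here:

```python
def rerank(query, candidates, score_pair, top_n=10, pool=50):
    """Score the top-`pool` candidates against the query, return top-`top_n`."""
    scored = sorted(candidates[:pool],
                    key=lambda c: score_pair(query, c["text"]),
                    reverse=True)
    return scored[:top_n]

# Toy stand-in scorer: raw term overlap. Swap in the cross-encoder's
# (query, passage) relevance score for real use.
def overlap_score(query, text):
    return len(set(query.lower().split()) & set(text.lower().split()))

docs = [{"id": 0, "text": "sqlite fts5 hybrid search"},
        {"id": 1, "text": "duckdb analytics attach"},
        {"id": 2, "text": "qdrant daemon snapshot"}]
top = rerank("sqlite hybrid search", docs, overlap_score, top_n=2)
# → top[0]["id"] == 0 (highest term overlap)
```

Because only 50 pairs are scored per query, even a CPU-bound cross-encoder stays within the latency budget cited in the pattern table.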
Why one file, not two databases
Vault canon principle: keep state durable and inspectable. One file = `sqlite3 events.db` and you have a REPL. `rsync events.db cheesegrater:/storage/jarvis/backups/` and you have a backup. No schema-migration dance, no replication channel, no port to remember.
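Standing up the schema from Python is a few lines. A minimal sketch assuming the sqlite-vec wheel (`pip install sqlite-vec`); if the extension isn't present, the vector table is skipped and full-text still works:

```python
import sqlite3

def create_db(path=":memory:"):
    db = sqlite3.connect(path)
    db.executescript("""
        CREATE TABLE IF NOT EXISTS events (
            id         INTEGER PRIMARY KEY,
            session_id TEXT, machine TEXT, variant TEXT, cwd TEXT,
            role TEXT, tool TEXT, ts INTEGER, text TEXT, meta_json TEXT
        );
        -- External-content FTS index over events.text (synced at ingest).
        CREATE VIRTUAL TABLE IF NOT EXISTS events_fts USING fts5(
            text, content='events', content_rowid='id',
            tokenize='porter unicode61');
        CREATE INDEX IF NOT EXISTS idx_machine_variant_ts
            ON events(machine, variant, ts DESC);
        CREATE INDEX IF NOT EXISTS idx_tool_ts ON events(tool, ts DESC);
    """)
    try:
        import sqlite_vec  # loadable extension; pip install sqlite-vec
        db.enable_load_extension(True)
        sqlite_vec.load(db)
        db.enable_load_extension(False)
        db.execute("CREATE VIRTUAL TABLE IF NOT EXISTS events_vec "
                   "USING vec0(embedding float[768])")
    except (ImportError, AttributeError, sqlite3.OperationalError):
        pass  # no sqlite-vec available; FTS-only mode
    return db
```

Because events_fts is an external-content table, each ingest batch also inserts (rowid, text) into events_fts to keep the index in sync (or use the standard FTS5 sync triggers).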
Embedding model (when you turn semantic on)
- BGE-small-en-v1.5 (384-d) — cheap, fast on CPU, decent recall. Good default.
- BGE-base-en-v1.5 (768-d) — better recall, still CPU-tractable.
- Encode in batches at ingest, store in events_vec, never re-embed unless the model changes.
- 5M × 768 floats × 4 bytes = ~15 GB. sqlite-vec stores quantized and on disk; fine for this hardware if you don’t need every vector in RAM. Use scalar quantization (int8) in sqlite-vec — ~4 GB and negligible recall loss.
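The storage arithmetic above, spelled out (a sanity check, not a benchmark):

```python
# Raw embedding storage at 5M events, 768-d vectors.
n_vectors, dims = 5_000_000, 768
f32_bytes = n_vectors * dims * 4   # float32: 4 bytes per component
int8_bytes = n_vectors * dims * 1  # int8 scalar quantization
print(f"f32:  {f32_bytes / 1e9:.2f} GB")   # 15.36 GB
print(f"int8: {int8_bytes / 1e9:.2f} GB")  # 3.84 GB
```

Dropping to BGE-small's 384 dims halves both figures again, which is why it is the default until recall measurements say otherwise.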
Honorable Mentions
LanceDB
Strong second place. Genuinely well-designed columnar vector store with
built-in BM25 (added late 2024). Single-process, embedded, MIT/Apache. The
only reason it loses to SQLite+vec here is ecosystem familiarity: Wes has
SQLite muscle memory across the fleet (D1, fleet-node sidecars, local
.sqlite files everywhere) and a .lance directory is one more mental
model. Would win if: the corpus grows past 50M events or you need to do
real columnar analytics on event-level features.
DuckDB FTS
Wins the analytics lane outright. `SELECT machine, COUNT(*) FROM events WHERE fts_main_events.match_bm25(id, 'mcp drift') > 0 GROUP BY machine;` — the kind of query that’s miserable in SQLite. The
recommendation here is to keep SQLite as the row store and attach DuckDB
for ad-hoc analytics (duckdb> ATTACH 'events.db' AS s (TYPE sqlite);).
Best of both worlds, zero ingestion cost.
Qdrant (self-hosted)
The cleanest pure vector DB. If we were building a production retrieval service for an external user base, this would be the answer. Loses on the “daemon sprawl” criterion. Would win if: the corpus grows past 50M, or you need multi-tenant retrieval, or you start running ML retrieval as a shared service across the fleet.
Meilisearch / Typesense
Both excellent products with painless single-binary deploys and good hybrid support. Each costs you one more systemd unit and one more snapshot lane to think about. Would win if: you wanted to expose search as a public API to clients (typo tolerance + ranking knobs are best-in-class).
BGE-reranker
Don’t add it until you’ve measured the baseline. But when you do, this is the cheap quality lever — a cross-encoder reranker on top-50 candidates beats almost every embedding-only tweak. Local, CPU-friendly, non-controversial.
Rejected (with reasoning)
| Tool | Why rejected |
|---|---|
| sqlite-vss | The author (Alex Garcia) explicitly migrated to sqlite-vec; vss is in maintenance mode and uses Faiss internally (heavier dependency). Don’t start a new project on it. |
| ChromaDB | History of breaking API changes through 2024, mid-tier performance, hybrid search is a recent and underbaked addition. Fine for notebooks; not a foundation. |
| Weaviate | GraphQL surface, Docker-first, multi-GB RAM floor. Designed for teams running it as a platform service. Wrong shape for solo Cheesegrater. |
| Milvus (full) | Cluster mental model even in standalone mode; etcd, MinIO, proxy, query node, data node. Lite version is closer to viable but the ecosystem and docs assume the full product. |
| Marqo | OpenSearch backend. You’d be running OpenSearch. Nope. |
| ParadeDB | Means running Postgres. If Postgres were already in the stack this would be a top contender (BM25 via Tantivy + pgvector in one DB is genuinely nice). It isn’t, and adding it for this is a four-figure ops decision. |
| Bleve | Library is fine, but the binding story from Python is messy. No reason to take on a Go dep here. |
| Tantivy / pytantivy | Excellent engine. Loses on two fronts: (1) you’d still bolt on a separate vector store, (2) the index format isn’t as friendly to “just rsync the file.” Worth a second look only if FTS5 quality turns out to be insufficient. |
| Manticore | Capable, but it’s a daemon and the operational story (mysqld protocol, RT indexes vs. plain indexes, .conf file) is too much for a solo workload at this scale. |
| pgvector | Same blocker as ParadeDB — requires Postgres. |
| ColBERT / late interaction | 10× storage and gpu-leaning compute, for a recall gain that BM25 + dense + reranker captures most of. Premature. |
| Cohere rerank API | Adds a network hop, a quota, and a vendor. The local BGE-reranker is good enough for transcript search. |
Real-World Examples (transcript-like corpora)
These are the closest analogs I could surface from training data; spot-check before quoting any of them in a public artifact.
- Simon Willison’s `llm` tool + Datasette logs all LLM conversations to SQLite and exposes FTS5 search. Same shape as this corpus, smaller scale. (https://llm.datasette.io)
- Alex Garcia’s sqlite-vec demos include a “search your ChatGPT history” example with embeddings stored alongside FTS5. Direct precedent.
- Continue.dev and Aider both ship local conversation history in SQLite; neither does semantic search yet, but the storage shape matches.
- txtai (by NeuML) is the canonical “embedded hybrid search” library — wraps SQLite + Faiss/Hnswlib + a transformer. Worth knowing about as a prebuilt option if you’d rather not assemble the parts yourself, though it adds one more abstraction layer to debug.
- The r/LocalLLaMA and r/Rag practitioner pattern as of late 2025 leans heavily on SQLite/LanceDB for solo projects and only escalates to Qdrant/Weaviate when there’s a real team or external consumer.
Operational Footprint Summary
| Concern | This stack |
|---|---|
| Processes to supervise | 0 (it’s a library inside whatever Python service queries it) |
| Ports to open | 0 |
| Backups | rsync events.db cheesegrater:/storage/jarvis/backups/$(date +%Y%m%d)/ |
| Schema migrations | ALTER TABLE ADD COLUMN (per the standing memory rule — never drop) |
| Monitoring | File size, last-modified, row count via sqlite3 events.db 'SELECT COUNT(*), MAX(ts) FROM events;' |
| Disaster recovery | Re-run ingest from raw .jsonl in /storage/jarvis/staging/clean-desk-staging/. Deterministic. |
| Cross-machine visibility | The vault is already mounted on every fleet machine. Drop the db there and every variant can read it. |
Open Questions (worth benchmarking before committing hard)
- FTS5 query latency at 5M rows with a Porter + unicode61 tokenizer on Cheesegrater’s spinning disks. Expectation: <50ms p50 for 1–3 term queries with bm25 ordering. Reality: test it. If the vault sits on the spinning array, consider moving events.db to SSD.
- sqlite-vec performance with int8 quantization vs. f32 at 5M vectors. The library is young (first releases shipped in 2024). Recall delta on this corpus is the only thing that matters; benchmark with a held-out set of “I know this conversation exists” queries.
- Embedding model choice. BGE-small vs. BGE-base vs. nomic-embed-v1.5 vs. mxbai-embed-large. All CPU-tractable; the right pick is whichever has highest recall@50 on actual transcript queries you care about. Embed 50k events, run 20 representative queries, measure.
- Whether RRF alone is enough, or whether the reranker pulls its weight. Measure recall@10 with and without. If RRF is already at ~0.9, the reranker is not worth the latency.
- DuckDB vs. SQLite for the analytics path. If you find yourself writing GROUP BY queries across machine × variant × week, switch the read path for those queries to DuckDB attached to the SQLite file. No migration; literally just a different connection string.
- Ingest path: streaming vs. batch. The corpus is bounded (1-month window per extract.py), so batch ingest at end-of-day is fine. If you later want live indexing of in-progress sessions, FTS5 supports it trivially; sqlite-vec is also fine for incremental insert.
- Confidence calibration. Per the fleet rule, anything in this doc below the ~94% confidence bar warrants a verification pass before building on it. The biggest items to verify against current vendor docs before committing: (a) sqlite-vec status (is it still pre-1.0 as of 2026-05? — check at https://github.com/asg017/sqlite-vec/releases), (b) LanceDB’s BM25 maturity, (c) Meilisearch hybrid pricing/features. The high-level shape of the recommendation doesn’t change if any of these moved a notch — it just affects which honorable mention is closest.
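The held-out “I know this conversation exists” evaluation reduces to a recall@k loop. A sketch with a toy search function standing in for the real FTS5/RRF retrieval path (the corpus and gold queries here are illustrative):

```python
def recall_at_k(queries, search, k=50):
    """Fraction of (query, relevant_id) pairs where the id lands in top-k."""
    hits = 0
    for query, relevant_id in queries:
        if relevant_id in search(query)[:k]:
            hits += 1
    return hits / len(queries)

# Toy retrieval over an in-memory corpus of (id, text) pairs,
# ranked by raw term overlap. Swap in the real hybrid search call.
corpus = [(1, "tantivy vs meilisearch eval"),
          (2, "mcp drift on clippy"),
          (3, "sqlite-vec quantization test")]

def toy_search(query):
    terms = set(query.lower().split())
    ranked = sorted(corpus, key=lambda d: -len(terms & set(d[1].split())))
    return [doc_id for doc_id, _ in ranked]

gold = [("mcp drift", 2), ("sqlite-vec quantization", 3)]
score = recall_at_k(gold, toy_search, k=1)
# → 1.0
```

Run the same loop once with BM25-only, once with RRF, once with the reranker; the deltas between those three numbers are what the open questions above ask you to measure.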
One-Line Recommendation
Single SQLite file. FTS5 for full-text, sqlite-vec for embeddings, RRF in Python to fuse them, DuckDB attached for analytics. Add a BGE-reranker only if measurement shows you need it.