05 — Indexing & Search Technology Evaluation
Corpus: ~1–10M events (one event = one Claude Code conversation turn),
parsed from .jsonl transcripts across 5 fleet machines.
Host: Cheesegrater (Ubuntu 24.04, 2009 Mac Pro, vault storage).
Needs: full-text + structured filters (variant, machine, role, tool, date),
eventually semantic search. Constraints: single-machine, low ops, solo operator, <100ms typical queries.
TL;DR — Recommended Stack
Phase 1 (today, ship in a weekend): SQLite FTS5 + sqlite-vec, both as
loadable extensions inside one .sqlite file. Single binary, no daemon, no
backup story to invent (it’s a file — rsync it). BM25 ranking is built in;
sqlite-vec is the actively maintained successor to sqlite-vss and works as a
virtual table you join against. Hybrid search = two queries in the same
transaction with Reciprocal Rank Fusion (RRF) in Python. Scales comfortably
to ~5M events with <100ms p50 for typical queries on this hardware.
Phase 2 (only if Phase 1 hits a wall): DuckDB FTS for the analytics
side (cohort queries across machine × variant × date) while keeping SQLite as
the row store. DuckDB reads SQLite directly via its sqlite_scanner
extension, so no migration cost.
Phase 3 (only if semantic recall is genuinely insufficient): add a BGE-reranker stage (cross-encoder, runs locally on CPU for top-50 candidates). Don’t reach for Qdrant/Tantivy/etc. until you’ve proven the embedded path can’t carry the load.
Why this and not the cool stuff: every other option on the list pulls in either a daemon, a multi-node mental model, or a query-language tax. Wes runs lean, hates daemon sprawl, and the corpus is small enough that a single SQLite file with two extensions wins every category that matters for a solo operator.
Comparison Matrix
Legend: A = great fit, B = workable, C = misfit for this use case. “Daemon” = does this process need to be supervised separately from the app?
Full-Text Engines
| Engine | Daemon? | RAM @ 5M docs | Install | Hybrid story | Backup | License | Fit |
|---|---|---|---|---|---|---|---|
| SQLite FTS5 | No (library) | ~200–500 MB index, query RSS minimal | apt install / built into Python | Pairs with sqlite-vec in same DB | cp file.db | Public domain | A |
| DuckDB FTS | No (library) | Columnar, very cheap for analytics | pip install duckdb | No native vector; can attach | cp file.db | MIT | A- |
| Tantivy / pytantivy | No (library) | ~1–2 GB index typical | Rust toolchain or wheel | Manual; bolt on a vec store | Filesystem copy | MIT/Apache | B |
| Bleve | No (library, Go) | Similar to Tantivy | Requires Go integration | Manual | Filesystem copy | Apache 2.0 | C (Go ecosystem mismatch) |
| Manticore Search | Yes (mysqld-style) | 1–3 GB | apt repo, systemd unit | Native vectors since 6.x | Snapshots, replication | GPL-2 | B (overkill) |
| Meilisearch | Yes | ~1–2 GB | Single binary, systemd | Hybrid search GA in 1.6+ | Snapshot API | MIT | B |
| Typesense | Yes | 1–2 GB | Single binary, systemd | Native hybrid in 0.25+ | Snapshot API | GPL-3 | B |
| ParadeDB | Yes (Postgres) | Heavy — full Postgres footprint | Docker / pg extension | Tantivy + pgvector in one DB | pg_dump | AGPL/PG | C (you’d be running Postgres) |
Vector / Hybrid Engines
| Engine | Daemon? | RAM @ 5M vecs (768-d) | Install | Hybrid (BM25)? | Backup | License | Fit |
|---|---|---|---|---|---|---|---|
| sqlite-vec | No (loadable ext) | ~3–6 GB if you keep all in RAM; disk-backed otherwise | pip install sqlite-vec | Pairs with FTS5 (same DB) | cp file.db | Apache 2.0 | A |
| sqlite-vss | No (loadable ext) | Similar | Deprecated; uses Faiss | Same as above | cp file.db | Apache 2.0 | C (author moved to sqlite-vec) |
| LanceDB | No (embedded, Rust) | Columnar, mmap-friendly, very efficient | pip install lancedb | Built-in BM25 + vector in 0.4+ | Filesystem (Lance format) | Apache 2.0 | B+ (close runner-up) |
| ChromaDB | Optional (embedded or server) | 4–8 GB common | pip install chromadb | BM25 added but immature | Filesystem | Apache 2.0 | C (history of breaking changes, mid-tier perf) |
| Qdrant | Yes | 2–6 GB | Single binary or Docker | Sparse vectors + dense, no native BM25 | Snapshot API | Apache 2.0 | B (best vector DB IMO, but daemon) |
| Weaviate | Yes | 4–10 GB | Docker, GraphQL | Native hybrid (BM25 + vec) | Backup module | BSD-3 | C (operationally heavy, GraphQL tax) |
| pgvector | Yes (Postgres) | Depends on PG | pg extension | Pair with pg_trgm or ParadeDB | pg_dump | PostgreSQL | C (only if PG is already in play) |
| Milvus / Milvus Lite | Lite=embedded; full=daemons | Lite ~1–2 GB; full multi-GB | pip install pymilvus (Lite) | Lite has limited filtering; full has hybrid | Filesystem (Lite) | Apache 2.0 | C (Lite is workable but small ecosystem for hybrid) |
| Marqo | Yes (Docker, OpenSearch-backed) | 4+ GB | Docker compose | Native hybrid | OpenSearch snapshots | Apache 2.0 | C (way too heavy) |
Hybrid Patterns
| Pattern | When to use | Cost | Complexity |
|---|---|---|---|
| BM25 + dense + RRF | Default for hybrid. ~60 lines of Python. | None beyond having both indexes | Low |
| Cross-encoder reranker (BGE-reranker-v2-m3) | When recall is fine but ranking is mushy. Rerank top 50 → top 10. | ~30–80ms/query on CPU for 50 candidates | Low–medium |
| ColBERT / late interaction | When you have research-grade quality needs. Storage cost is ~10× a normal embedding. | Large storage, gpu-friendly, cpu-painful | Medium–high |
| Cohere rerank API | If you want managed quality without local infra. | $$ per query, network roundtrip | Low |
Recommended Architecture (detailed)
events.db (single SQLite file, ~5–10 GB at 5M events):

```
TABLE events
  id          INTEGER PRIMARY KEY
  session_id  TEXT
  machine     TEXT     (mac | clippy | clarvis | imac | cheesegrater)
  variant     TEXT     (canonical name)
  cwd         TEXT
  role        TEXT     (user | assistant | tool_result)
  tool        TEXT     (NULL or tool name)
  ts          INTEGER  (epoch seconds, indexed)
  text        TEXT     (raw content)
  meta_json   TEXT

VIRTUAL TABLE events_fts USING fts5(
  text, content='events', content_rowid='id',
  tokenize='porter unicode61')

VIRTUAL TABLE events_vec USING vec0(
  embedding float[768])   -- sqlite-vec

INDEX idx_machine_variant_ts ON events(machine, variant, ts DESC)
INDEX idx_tool_ts ON events(tool, ts DESC)
```

Query patterns
Pure structured: straight SQL with the composite indexes.
```sql
SELECT * FROM events
WHERE machine='mac' AND variant='clars'
  AND ts > strftime('%s','2026-05-01')
ORDER BY ts DESC LIMIT 50;
```

Pure full-text:
```sql
SELECT e.*
FROM events_fts
JOIN events e ON e.id = events_fts.rowid
WHERE events_fts MATCH 'tantivy OR meilisearch'
ORDER BY bm25(events_fts)
LIMIT 50;
```

Hybrid (RRF in Python):
- Run FTS query → top-100 ids by BM25.
- Run vec0 query → top-100 ids by cosine.
- Fuse: score(id) = Σᵢ 1/(k + rankᵢ(id)), with k = 60.
- Apply structured filters as a final WHERE on the candidate set.
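The fusion step above is small enough to inline. A minimal sketch of the RRF scoring; the ids and rank lists are illustrative, not from a real query:

```python
from collections import defaultdict

def rrf_fuse(ranked_lists, k=60):
    """score(id) = sum over lists of 1/(k + rank), rank starting at 1."""
    scores = defaultdict(float)
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

bm25_top = [3, 1, 7, 2]   # ids from the FTS5 query, best first
vec_top = [7, 3, 9, 1]    # ids from the vec0 query, best first
fused = rrf_fuse([bm25_top, vec_top])
# → [3, 7, 1, 9, 2]
```

Documents that appear high in both lists win; documents unique to one list survive but rank lower. k=60 is the conventional damping constant from the RRF literature.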
With reranker (Phase 3):
- Take the top-50 from the RRF fusion above.
- Score with BGE-reranker-v2-m3 (CPU is fine for 50 candidates).
- Return top-10.
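The rerank stage has a simple shape regardless of the model behind it. A sketch with a stand-in scorer; a real deployment would replace `overlap_score` with cross-encoder scores (e.g. BGE-reranker-v2-m3, typically loaded via the FlagEmbedding package), which is not shown here:

```python
def rerank(query, candidates, score_pair, top_n=10, pool=50):
    """Score the top-`pool` candidates against the query, return top-`top_n`."""
    scored = sorted(candidates[:pool],
                    key=lambda c: score_pair(query, c["text"]),
                    reverse=True)
    return scored[:top_n]

# Toy stand-in scorer: raw term overlap. Swap in the cross-encoder's
# (query, passage) relevance score for real use.
def overlap_score(query, text):
    return len(set(query.lower().split()) & set(text.lower().split()))

docs = [{"id": 0, "text": "sqlite fts5 hybrid search"},
        {"id": 1, "text": "duckdb analytics attach"},
        {"id": 2, "text": "qdrant daemon snapshot"}]
top = rerank("sqlite hybrid search", docs, overlap_score, top_n=2)
# → top[0]["id"] == 0 (highest term overlap)
```

Because only 50 pairs are scored per query, even a CPU-bound cross-encoder stays within the latency budget cited in the pattern table.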
Why one file, not two databases
Vault canon principle: keep state durable and inspectable. One file = `sqlite3 events.db` and you have a REPL. `rsync events.db cheesegrater:/storage/jarvis/backups/` and you have a backup. No schema-migration dance, no replication channel, no port to remember.
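Standing up the schema from Python is a few lines. A minimal sketch assuming the sqlite-vec wheel (`pip install sqlite-vec`); if the extension isn't present, the vector table is skipped and full-text still works:

```python
import sqlite3

def create_db(path=":memory:"):
    db = sqlite3.connect(path)
    db.executescript("""
        CREATE TABLE IF NOT EXISTS events (
            id         INTEGER PRIMARY KEY,
            session_id TEXT, machine TEXT, variant TEXT, cwd TEXT,
            role TEXT, tool TEXT, ts INTEGER, text TEXT, meta_json TEXT
        );
        -- External-content FTS index over events.text (synced at ingest).
        CREATE VIRTUAL TABLE IF NOT EXISTS events_fts USING fts5(
            text, content='events', content_rowid='id',
            tokenize='porter unicode61');
        CREATE INDEX IF NOT EXISTS idx_machine_variant_ts
            ON events(machine, variant, ts DESC);
        CREATE INDEX IF NOT EXISTS idx_tool_ts ON events(tool, ts DESC);
    """)
    try:
        import sqlite_vec  # loadable extension; pip install sqlite-vec
        db.enable_load_extension(True)
        sqlite_vec.load(db)
        db.enable_load_extension(False)
        db.execute("CREATE VIRTUAL TABLE IF NOT EXISTS events_vec "
                   "USING vec0(embedding float[768])")
    except (ImportError, AttributeError, sqlite3.OperationalError):
        pass  # no sqlite-vec available; FTS-only mode
    return db
```

Because events_fts is an external-content table, each ingest batch also inserts (rowid, text) into events_fts to keep the index in sync (or use the standard FTS5 sync triggers).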
Embedding model (when you turn semantic on)
- BGE-small-en-v1.5 (384-d) — cheap, fast on CPU, decent recall. Good default.
- BGE-base-en-v1.5 (768-d) — better recall, still CPU-tractable.
- Encode in batches at ingest, store in events_vec, never re-embed unless the model changes.
- 5M × 768 floats × 4 bytes = ~15 GB. sqlite-vec stores quantized and on disk; fine for this hardware if you don’t need every vector in RAM. Use scalar quantization (int8) in sqlite-vec — ~4 GB and negligible recall loss.
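The storage arithmetic above, spelled out (a sanity check, not a benchmark):

```python
# Raw embedding storage at 5M events, 768-d vectors.
n_vectors, dims = 5_000_000, 768
f32_bytes = n_vectors * dims * 4   # float32: 4 bytes per component
int8_bytes = n_vectors * dims * 1  # int8 scalar quantization
print(f"f32:  {f32_bytes / 1e9:.2f} GB")   # 15.36 GB
print(f"int8: {int8_bytes / 1e9:.2f} GB")  # 3.84 GB
```

Dropping to BGE-small's 384 dims halves both figures again, which is why it is the default until recall measurements say otherwise.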
Honorable Mentions
LanceDB
Strong second place. Genuinely well-designed columnar vector store with
built-in BM25 (added late 2024). Single-process, embedded, MIT/Apache. The
only reason it loses to SQLite+vec here is ecosystem familiarity: Wes has
SQLite muscle memory across the fleet (D1, fleet-node sidecars, local
.sqlite files everywhere) and a .lance directory is one more mental
model. Would win if: the corpus grows past 50M events or you need to do
real columnar analytics on event-level features.
DuckDB FTS
Wins the analytics lane outright. `SELECT machine, COUNT(*) FROM events WHERE fts_main_events.match_bm25(id, 'mcp drift') > 0 GROUP BY machine;` — the kind of query that’s miserable in SQLite. The
recommendation here is to keep SQLite as the row store and attach DuckDB
for ad-hoc analytics (duckdb> ATTACH 'events.db' AS s (TYPE sqlite);).
Best of both worlds, zero ingestion cost.
Qdrant (self-hosted)
The cleanest pure vector DB. If we were building a production retrieval service for an external user base, this would be the answer. Loses on the “daemon sprawl” criterion. Would win if: the corpus grows past 50M, or you need multi-tenant retrieval, or you start running ML retrieval as a shared service across the fleet.
Meilisearch / Typesense
Both excellent products with painless single-binary deploys and good hybrid support. Each costs you one more systemd unit and one more snapshot lane to think about. Would win if: you wanted to expose search as a public API to clients (typo tolerance + ranking knobs are best-in-class).
BGE-reranker
Don’t add it until you’ve measured the baseline. But when you do, this is the cheap quality lever — a cross-encoder reranker on top-50 candidates beats almost every embedding-only tweak. Local, CPU-friendly, non-controversial.
Rejected (with reasoning)
| Tool | Why rejected |
|---|---|
| sqlite-vss | The author (Alex Garcia) explicitly migrated to sqlite-vec; vss is in maintenance mode and uses Faiss internally (heavier dependency). Don’t start a new project on it. |
| ChromaDB | History of breaking API changes through 2024, mid-tier performance, hybrid search is a recent and underbaked addition. Fine for notebooks; not a foundation. |
| Weaviate | GraphQL surface, Docker-first, multi-GB RAM floor. Designed for teams running it as a platform service. Wrong shape for solo Cheesegrater. |
| Milvus (full) | Cluster mental model even in standalone mode; etcd, MinIO, proxy, query node, data node. Lite version is closer to viable but the ecosystem and docs assume the full product. |
| Marqo | OpenSearch backend. You’d be running OpenSearch. Nope. |
| ParadeDB | Means running Postgres. If Postgres were already in the stack this would be a top contender (BM25 via Tantivy + pgvector in one DB is genuinely nice). It isn’t, and adding it for this is a four-figure ops decision. |
| Bleve | Library is fine, but the binding story from Python is messy. No reason to take on a Go dep here. |
| Tantivy / pytantivy | Excellent engine. Loses on two fronts: (1) you’d still bolt on a separate vector store, (2) the index format isn’t as friendly to “just rsync the file.” Worth a second look only if FTS5 quality turns out to be insufficient. |
| Manticore | Capable, but it’s a daemon and the operational story (mysqld protocol, RT indexes vs. plain indexes, .conf file) is too much for a solo workload at this scale. |
| pgvector | Same blocker as ParadeDB — requires Postgres. |
| ColBERT / late interaction | 10× storage and gpu-leaning compute, for a recall gain that BM25 + dense + reranker captures most of. Premature. |
| Cohere rerank API | Adds a network hop, a quota, and a vendor. The local BGE-reranker is good enough for transcript search. |
Real-World Examples (transcript-like corpora)
These are the closest analogs I could surface from training data; spot-check before quoting any of them in a public artifact.
- Simon Willison’s `llm` tool + Datasette logs all LLM conversations to SQLite and exposes FTS5 search. Same shape as this corpus, smaller scale. (https://llm.datasette.io)
- Alex Garcia’s sqlite-vec demos include a “search your ChatGPT history” example with embeddings stored alongside FTS5. Direct precedent.
- Continue.dev and Aider both ship local conversation history in SQLite; neither does semantic search yet, but the storage shape matches.
- txtai (by NeuML) is the canonical “embedded hybrid search” library — wraps SQLite + Faiss/Hnswlib + a transformer. Worth knowing about as a prebuilt option if you’d rather not assemble the parts yourself, though it adds one more abstraction layer to debug.
- The r/LocalLLaMA and r/Rag practitioner pattern as of late 2025 leans heavily on SQLite/LanceDB for solo projects and only escalates to Qdrant/Weaviate when there’s a real team or external consumer.
Operational Footprint Summary
| Concern | This stack |
|---|---|
| Processes to supervise | 0 (it’s a library inside whatever Python service queries it) |
| Ports to open | 0 |
| Backups | rsync events.db cheesegrater:/storage/jarvis/backups/$(date +%Y%m%d)/ |
| Schema migrations | ALTER TABLE ADD COLUMN (per the standing memory rule — never drop) |
| Monitoring | File size, last-modified, row count via sqlite3 events.db 'SELECT COUNT(*), MAX(ts) FROM events;' |
| Disaster recovery | Re-run ingest from raw .jsonl in /storage/jarvis/staging/clean-desk-staging/. Deterministic. |
| Cross-machine visibility | The vault is already mounted on every fleet machine. Drop the db there and every variant can read it. |
Open Questions (worth benchmarking before committing hard)
- FTS5 query latency at 5M rows with a Porter + unicode61 tokenizer on Cheesegrater’s spinning disks. Expectation: <50ms p50 for 1–3 term queries with bm25 ordering. Reality: test it. If the vault sits on the spinning array, consider moving events.db to SSD.
- sqlite-vec performance with int8 quantization vs. f32 at 5M vectors. The library is young (first releases shipped in 2024). Recall delta on this corpus is the only thing that matters; benchmark with a held-out set of “I know this conversation exists” queries.
- Embedding model choice. BGE-small vs. BGE-base vs. nomic-embed-v1.5 vs. mxbai-embed-large. All CPU-tractable; the right pick is whichever has highest recall@50 on actual transcript queries you care about. Embed 50k events, run 20 representative queries, measure.
- Whether RRF alone is enough, or whether the reranker pulls its weight. Measure recall@10 with and without. If RRF is already at ~0.9, the reranker is not worth the latency.
- DuckDB vs. SQLite for the analytics path. If you find yourself writing GROUP BY queries across machine × variant × week, switch the read path for those queries to DuckDB attached to the SQLite file. No migration; literally just a different connection string.
- Ingest path: streaming vs. batch. The corpus is bounded (1-month window per extract.py), so batch ingest at end-of-day is fine. If you later want live indexing of in-progress sessions, FTS5 supports it trivially; sqlite-vec is also fine for incremental insert.
- Confidence calibration. Per the fleet rule, anything in this doc below the ~94% confidence bar warrants a verification pass before building on it. The biggest items to verify against current vendor docs before committing: (a) sqlite-vec status (is it still pre-1.0 as of 2026-05? — check at https://github.com/asg017/sqlite-vec/releases), (b) LanceDB’s BM25 maturity, (c) Meilisearch hybrid pricing/features. The high-level shape of the recommendation doesn’t change if any of these moved a notch — it just affects which honorable mention is closest.
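The held-out “I know this conversation exists” evaluation reduces to a recall@k loop. A sketch with a toy search function standing in for the real FTS5/RRF retrieval path (the corpus and gold queries here are illustrative):

```python
def recall_at_k(queries, search, k=50):
    """Fraction of (query, relevant_id) pairs where the id lands in top-k."""
    hits = 0
    for query, relevant_id in queries:
        if relevant_id in search(query)[:k]:
            hits += 1
    return hits / len(queries)

# Toy retrieval over an in-memory corpus of (id, text) pairs,
# ranked by raw term overlap. Swap in the real hybrid search call.
corpus = [(1, "tantivy vs meilisearch eval"),
          (2, "mcp drift on clippy"),
          (3, "sqlite-vec quantization test")]

def toy_search(query):
    terms = set(query.lower().split())
    ranked = sorted(corpus, key=lambda d: -len(terms & set(d[1].split())))
    return [doc_id for doc_id, _ in ranked]

gold = [("mcp drift", 2), ("sqlite-vec quantization", 3)]
score = recall_at_k(gold, toy_search, k=1)
# → 1.0
```

Run the same loop once with BM25-only, once with RRF, once with the reranker; the deltas between those three numbers are what the open questions above ask you to measure.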
One-Line Recommendation
Single SQLite file. FTS5 for full-text, sqlite-vec for embeddings, RRF in Python to fuse them, DuckDB attached for analytics. Add a BGE-reranker only if measurement shows you need it.