
05 — Indexing & Search Technology Evaluation


Corpus: ~1–10M events (one event = one Claude Code conversation turn), parsed from .jsonl transcripts across 5 fleet machines. Host: Cheesegrater (Ubuntu 24.04, 2009 Mac Pro, vault storage). Needs: full-text + structured filters (variant, machine, role, tool, date), eventually semantic search. Constraints: single-machine, low ops, solo operator, <100ms typical queries.

Phase 1 (today, ship in a weekend): SQLite FTS5 + sqlite-vec (FTS5 is compiled into stock SQLite; sqlite-vec loads as an extension), both indexing one .sqlite file. Single binary, no daemon, no backup story to invent (it’s a file — rsync it). BM25 ranking is built in; sqlite-vec is the actively maintained successor to sqlite-vss and works as a virtual table you join against. Hybrid search = two queries in the same transaction with Reciprocal Rank Fusion (RRF) in Python. Scales comfortably to ~5M events with <100ms p50 for typical queries on this hardware.
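
The Phase 1 wiring, as a minimal sketch (assumes the sqlite-vec Python package; the file name is illustrative):

import sqlite3
import sqlite_vec  # pip install sqlite-vec

db = sqlite3.connect("events.db")
db.enable_load_extension(True)
sqlite_vec.load(db)              # registers the vec0 virtual-table module
db.enable_load_extension(False)

# FTS5 ships compiled into stock SQLite builds, so only the vector
# side needs an explicit extension load.
print(db.execute("SELECT vec_version()").fetchone())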

Phase 2 (only if Phase 1 hits a wall): DuckDB FTS for the analytics side (cohort queries across machine × variant × date) while keeping SQLite as the row store. DuckDB reads SQLite directly via its sqlite_scanner extension, so no migration cost.

Phase 3 (only if semantic recall is genuinely insufficient): add a BGE-reranker stage (cross-encoder, runs locally on CPU for top-50 candidates). Don’t reach for Qdrant/Tantivy/etc. until you’ve proven the embedded path can’t carry the load.

Why this and not the cool stuff: every other option on the list pulls in either a daemon, a multi-node mental model, or a query-language tax. Wes runs lean, hates daemon sprawl, and the corpus is small enough that a single SQLite file with two extensions wins every category that matters for a solo operator.


Legend: A = great fit, B = workable, C = misfit for this use case. “Daemon” = does this process need to be supervised separately from the app?

Full-text engines:

| Engine | Daemon? | RAM @ 5M docs | Install | Hybrid story | Backup | License | Fit |
|---|---|---|---|---|---|---|---|
| SQLite FTS5 | No (library) | ~200–500 MB index; query RSS minimal | apt install / built into Python | Pairs with sqlite-vec in same DB | `cp file.db` | Public domain | A |
| DuckDB FTS | No (library) | Columnar, very cheap for analytics | `pip install duckdb` | No native vector; can attach | `cp file.db` | MIT | A- |
| Tantivy / pytantivy | No (library) | ~1–2 GB index typical | Rust toolchain or wheel | Manual; bolt on a vec store | Filesystem copy | MIT/Apache | B |
| Bleve | No (library, Go) | Similar to Tantivy | Requires Go integration | Manual | Filesystem copy | Apache 2.0 | C (Go ecosystem mismatch) |
| Manticore Search | Yes (mysqld-style) | 1–3 GB | apt repo, systemd unit | Native vectors since 6.x | Snapshots, replication | GPL-2 | B (overkill) |
| Meilisearch | Yes | ~1–2 GB | Single binary, systemd | Hybrid search GA in 1.6+ | Snapshot API | MIT | B |
| Typesense | Yes | 1–2 GB | Single binary, systemd | Native hybrid in 0.25+ | Snapshot API | GPL-3 | B |
| ParadeDB | Yes (Postgres) | Heavy — full Postgres footprint | Docker / pg extension | Tantivy + pgvector in one DB | `pg_dump` | AGPL/PG | C (you’d be running Postgres) |

Vector stores:

| Engine | Daemon? | RAM @ 5M vecs (768-d) | Install | Hybrid (BM25)? | Backup | License | Fit |
|---|---|---|---|---|---|---|---|
| sqlite-vec | No (loadable ext) | ~3–6 GB if you keep all in RAM; disk-backed otherwise | `pip install sqlite-vec` | Pairs with FTS5 (same DB) | `cp file.db` | Apache 2.0 | A |
| sqlite-vss | No (loadable ext) | Similar | Deprecated; uses Faiss | Same as above | `cp file.db` | Apache 2.0 | C (author moved to sqlite-vec) |
| LanceDB | No (embedded, Rust) | Columnar, mmap-friendly, very efficient | `pip install lancedb` | Built-in BM25 + vector in 0.4+ | Filesystem (Lance format) | Apache 2.0 | B+ (close runner-up) |
| ChromaDB | Optional (embedded or server) | 4–8 GB common | `pip install chromadb` | BM25 added but immature | Filesystem | Apache 2.0 | C (history of breaking changes, mid-tier perf) |
| Qdrant | Yes | 2–6 GB | Single binary or Docker | Sparse + dense vectors, no native BM25 | Snapshot API | Apache 2.0 | B (best vector DB IMO, but daemon) |
| Weaviate | Yes | 4–10 GB | Docker, GraphQL | Native hybrid (BM25 + vec) | Backup module | BSD-3 | C (operationally heavy, GraphQL tax) |
| pgvector | Yes (Postgres) | Depends on PG | pg extension | Pair with pg_trgm or ParadeDB | `pg_dump` | PostgreSQL | C (only if PG is already in play) |
| Milvus / Milvus Lite | Lite=embedded; full=daemons | Lite ~1–2 GB; full multi-GB | `pip install pymilvus` (Lite) | Lite has limited filtering; full has hybrid | Filesystem (Lite) | Apache 2.0 | C (Lite is workable but small ecosystem for hybrid) |
| Marqo | Yes (Docker, OpenSearch-backed) | 4+ GB | Docker compose | Native hybrid | OpenSearch snapshots | Apache 2.0 | C (way too heavy) |

Reranking patterns:

| Pattern | When to use | Cost | Complexity |
|---|---|---|---|
| BM25 + dense + RRF | Default for hybrid. ~60 lines of Python. | None beyond having both indexes | Low |
| Cross-encoder reranker (BGE-reranker-v2-m3) | When recall is fine but ranking is mushy. Rerank top 50 → top 10. | ~30–80 ms/query on CPU for 50 candidates | Low–medium |
| ColBERT / late interaction | When you have research-grade quality needs. Storage cost is ~10× a normal embedding. | Large storage; GPU-friendly, CPU-painful | Medium–high |
| Cohere rerank API | If you want managed quality without local infra. | $$ per query, network roundtrip | Low |

+----------------------------------------------------------+
| events.db (single SQLite file, ~5–10 GB at 5M events)    |
|                                                          |
|   TABLE events                                           |
|     id         INTEGER PRIMARY KEY                       |
|     session_id TEXT                                      |
|     machine    TEXT  (mac | clippy | clarvis | imac |    |
|                       cheesegrater)                      |
|     variant    TEXT  (canonical name)                    |
|     cwd        TEXT                                      |
|     role       TEXT  (user | assistant | tool_result)    |
|     tool       TEXT  (NULL or tool name)                 |
|     ts         INTEGER (epoch seconds, indexed)          |
|     text       TEXT  (raw content)                       |
|     meta_json  TEXT                                      |
|                                                          |
|   VIRTUAL TABLE events_fts USING fts5(                   |
|     text, content='events', content_rowid='id',          |
|     tokenize='porter unicode61')                         |
|                                                          |
|   VIRTUAL TABLE events_vec USING vec0(                   |
|     embedding float[768])    -- sqlite-vec               |
|                                                          |
|   INDEX idx_machine_variant_ts ON events(machine,        |
|     variant, ts DESC)                                    |
|   INDEX idx_tool_ts ON events(tool, ts DESC)             |
+----------------------------------------------------------+
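
The same schema as executable DDL, a sketch reusing the db connection from the Phase 1 snippet:

db.executescript("""
CREATE TABLE IF NOT EXISTS events (
  id         INTEGER PRIMARY KEY,
  session_id TEXT,
  machine    TEXT,     -- mac | clippy | clarvis | imac | cheesegrater
  variant    TEXT,     -- canonical name
  cwd        TEXT,
  role       TEXT,     -- user | assistant | tool_result
  tool       TEXT,     -- NULL or tool name
  ts         INTEGER,  -- epoch seconds
  text       TEXT,     -- raw content
  meta_json  TEXT
);

CREATE VIRTUAL TABLE IF NOT EXISTS events_fts USING fts5(
  text, content='events', content_rowid='id',
  tokenize='porter unicode61');

CREATE VIRTUAL TABLE IF NOT EXISTS events_vec USING vec0(
  embedding float[768]);  -- sqlite-vec

CREATE INDEX IF NOT EXISTS idx_machine_variant_ts
  ON events(machine, variant, ts DESC);
CREATE INDEX IF NOT EXISTS idx_tool_ts ON events(tool, ts DESC);
""")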

Pure structured: straight SQL with the composite indexes.

SELECT * FROM events
WHERE machine='mac' AND variant='clars' AND ts > strftime('%s','2026-05-01')
ORDER BY ts DESC LIMIT 50;

Pure full-text:

SELECT e.* FROM events_fts f
JOIN events e ON e.id = f.rowid
WHERE events_fts MATCH 'tantivy OR meilisearch'
ORDER BY bm25(events_fts) LIMIT 50;

Hybrid (RRF in Python; sketch below):

  1. Run FTS query → top-100 ids by BM25.
  2. Run vec0 query → top-100 ids by cosine.
  3. Fuse: score(id) = Σ 1/(k + rank_i), k=60.
  4. Apply structured filters as a final WHERE on the candidate set.
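
A minimal sketch of steps 1–4, assuming the same connection with sqlite-vec loaded as in Phase 1; the function name and candidate depth are illustrative:

import sqlite_vec

K = 60  # RRF constant from step 3

def hybrid_search(db, query_text, query_vec, limit=10):
    # Step 1: FTS5 candidates, best first (lower bm25() is better).
    fts_ids = [r[0] for r in db.execute(
        "SELECT rowid FROM events_fts WHERE events_fts MATCH ? "
        "ORDER BY bm25(events_fts) LIMIT 100", (query_text,))]

    # Step 2: vector candidates, nearest first (sqlite-vec KNN).
    vec_ids = [r[0] for r in db.execute(
        "SELECT rowid FROM events_vec WHERE embedding MATCH ? "
        "ORDER BY distance LIMIT 100",
        (sqlite_vec.serialize_float32(query_vec),))]

    # Step 3: Reciprocal Rank Fusion: score(id) = sum of 1/(K + rank_i).
    scores = {}
    for ranking in (fts_ids, vec_ids):
        for rank, rowid in enumerate(ranking, start=1):
            scores[rowid] = scores.get(rowid, 0.0) + 1.0 / (K + rank)
    fused = sorted(scores, key=scores.get, reverse=True)

    # Step 4: apply structured filters as a final WHERE over the
    # candidate set, e.g. "... WHERE id IN (...) AND machine = ?".
    return fused[:limit]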

With reranker (Phase 3; sketch below):

  • Take the top-50 from the RRF fusion above.
  • Score with BGE-reranker-v2-m3 (CPU is fine for 50 candidates).
  • Return top-10.
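
A sketch of that stage, assuming the FlagEmbedding package for BGE-reranker-v2-m3 (sentence-transformers' CrossEncoder is an equivalent route); names are illustrative:

from FlagEmbedding import FlagReranker  # pip install FlagEmbedding

reranker = FlagReranker("BAAI/bge-reranker-v2-m3")  # CPU is fine at 50 candidates

def rerank(query, candidates, top_k=10):
    # candidates: [(event_id, text), ...], the top-50 from the RRF fusion.
    scores = reranker.compute_score([[query, text] for _, text in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
    return [cand for cand, _ in ranked[:top_k]]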

Vault canon principle: keep state durable and inspectable. One file = sqlite3 events.db and you have a REPL. rsync events.db cheesegrater:/storage/jarvis/backups/ and you have a backup. No schema-migration dance, no replication channel, no port to remember.

Embedding model (when you turn semantic on)

  • BGE-small-en-v1.5 (384-d) — cheap, fast on CPU, decent recall. Good default.
  • BGE-base-en-v1.5 (768-d) — better recall, still CPU-tractable.
  • Encode in batches at ingest, store in events_vec, never re-embed unless the model changes.
  • 5M × 768 floats × 4 bytes = ~15 GB. sqlite-vec stores quantized and on disk; fine for this hardware if you don’t need every vector in RAM. Use scalar quantization (int8) in sqlite-vec — ~4 GB and negligible recall loss.
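
A sketch of the batch-encode path, assuming sentence-transformers for the BGE model; the helper name and batch size are illustrative (swap float[768] for int8[768] in the vec0 declaration if you take the quantization route):

import sqlite_vec
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-base-en-v1.5")  # 768-d, CPU-tractable

def embed_new_events(db, batch_size=256):
    # Embed only rows without a vector yet; never re-embed (see above).
    rows = db.execute(
        "SELECT id, text FROM events "
        "WHERE id NOT IN (SELECT rowid FROM events_vec)").fetchall()
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        vecs = model.encode([text for _, text in batch],
                            normalize_embeddings=True)
        db.executemany(
            "INSERT INTO events_vec(rowid, embedding) VALUES (?, ?)",
            [(i, sqlite_vec.serialize_float32(v))
             for (i, _), v in zip(batch, vecs)])
        db.commit()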

LanceDB: strong second place. Genuinely well-designed columnar vector store with built-in BM25 (added late 2024). Single-process, embedded, Apache 2.0. The only reason it loses to SQLite+vec here is ecosystem familiarity: Wes has SQLite muscle memory across the fleet (D1, fleet-node sidecars, local .sqlite files everywhere) and a .lance directory is one more mental model. Would win if: the corpus grows past 50M events or you need to do real columnar analytics on event-level features.

DuckDB FTS: wins the analytics lane outright. SELECT machine, COUNT(*) FROM events WHERE fts_main_events.match_bm25(id, 'mcp drift') IS NOT NULL GROUP BY machine; — the kind of query that’s miserable in SQLite. The recommendation here is to keep SQLite as the row store and attach DuckDB for ad-hoc analytics (duckdb> ATTACH 'events.db' AS s (TYPE sqlite);). Best of both worlds, zero ingestion cost.
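
The attach path from Python, as a sketch (table names follow the schema above):

import duckdb

con = duckdb.connect()
con.execute("INSTALL sqlite; LOAD sqlite;")           # sqlite_scanner extension
con.execute("ATTACH 'events.db' AS s (TYPE sqlite);") # reads the row store in place

# Cohort analytics straight off the SQLite file: no export, no ingestion.
print(con.execute("""
    SELECT machine, variant, COUNT(*) AS n
    FROM s.events
    GROUP BY machine, variant
    ORDER BY n DESC
""").fetchall())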

Qdrant: the cleanest pure vector DB. If we were building a production retrieval service for an external user base, this would be the answer. Loses on the “daemon sprawl” criterion. Would win if: the corpus grows past 50M, or you need multi-tenant retrieval, or you start running ML retrieval as a shared service across the fleet.

Meilisearch / Typesense: both excellent products with painless single-binary deploys and good hybrid support. Each costs you one more systemd unit and one more snapshot lane to think about. Would win if: you wanted to expose search as a public API to clients (typo tolerance + ranking knobs are best-in-class).

BGE-reranker: don’t add it until you’ve measured the baseline. But when you do, this is the cheap quality lever — a cross-encoder reranker on top-50 candidates beats almost every embedding-only tweak. Local, CPU-friendly, non-controversial.


| Tool | Why rejected |
|---|---|
| sqlite-vss | The author (Alex Garcia) explicitly migrated to sqlite-vec; vss is in maintenance mode and uses Faiss internally (heavier dependency). Don’t start a new project on it. |
| ChromaDB | History of breaking API changes through 2024, mid-tier performance, hybrid search is a recent and underbaked addition. Fine for notebooks; not a foundation. |
| Weaviate | GraphQL surface, Docker-first, multi-GB RAM floor. Designed for teams running it as a platform service. Wrong shape for solo Cheesegrater. |
| Milvus (full) | Cluster mental model even in standalone mode: etcd, MinIO, proxy, query node, data node. The Lite version is closer to viable but the ecosystem and docs assume the full product. |
| Marqo | OpenSearch backend. You’d be running OpenSearch. Nope. |
| ParadeDB | Means running Postgres. If Postgres were already in the stack this would be a top contender (BM25 via Tantivy + pgvector in one DB is genuinely nice). It isn’t, and adding it for this is a four-figure ops decision. |
| Bleve | The library is fine, but the binding story from Python is messy. No reason to take on a Go dependency here. |
| Tantivy / pytantivy | Excellent engine. Loses on two fronts: (1) you’d still bolt on a separate vector store; (2) the index format isn’t as friendly to “just rsync the file.” Worth a second look only if FTS5 quality turns out to be insufficient. |
| Manticore | Capable, but it’s a daemon, and the operational story (mysqld protocol, RT vs. plain indexes, .conf file) is too much for a solo workload at this scale. |
| pgvector | Same blocker as ParadeDB — requires Postgres. |
| ColBERT / late interaction | 10× storage and GPU-leaning compute, for a recall gain that BM25 + dense + reranker captures most of. Premature. |
| Cohere rerank API | Adds a network hop, a quota, and a vendor. The local BGE-reranker is good enough for transcript search. |

Real-World Examples (transcript-like corpora)


These are the closest analogs I could surface from training data; spot-check before quoting any of them in a public artifact.

  • Simon Willison’s llm tool + Datasette logs all LLM conversations to SQLite and exposes FTS5 search. Same shape as this corpus, smaller scale. (https://llm.datasette.io)
  • Alex Garcia’s sqlite-vec demos include a “search your ChatGPT history” example with embeddings stored alongside FTS5. Direct precedent.
  • Continue.dev and Aider both ship local conversation history in SQLite; neither does semantic search yet, but the storage shape matches.
  • txtai (by NeuML) is the canonical “embedded hybrid search” library — wraps SQLite + Faiss/Hnswlib + a transformer. Worth knowing about as a prebuilt option if you’d rather not assemble the parts yourself, though it adds one more abstraction layer to debug.
  • The r/LocalLLaMA and r/Rag practitioner pattern as of late 2025 leans heavily on SQLite/LanceDB for solo projects and only escalates to Qdrant/Weaviate when there’s a real team or external consumer.

| Concern | This stack |
|---|---|
| Processes to supervise | 0 (it’s a library inside whatever Python service queries it) |
| Ports to open | 0 |
| Backups | `rsync events.db cheesegrater:/storage/jarvis/backups/$(date +%Y%m%d)/` |
| Schema migrations | `ALTER TABLE ... ADD COLUMN` (per the standing memory rule — never drop) |
| Monitoring | File size, last-modified, row count via `sqlite3 events.db 'SELECT COUNT(*), MAX(ts) FROM events;'` |
| Disaster recovery | Re-run ingest from raw .jsonl in /storage/jarvis/staging/clean-desk-staging/. Deterministic. |
| Cross-machine visibility | The vault is already mounted on every fleet machine. Drop the db there and every variant can read it. |

Open Questions (worth benchmarking before committing hard)

  1. FTS5 query latency at 5M rows with a Porter + unicode61 tokenizer on Cheesegrater’s spinning disks. Expectation: <50ms p50 for 1–3 term queries with bm25 ordering. Reality: test it. If the vault sits on the spinning array, consider moving events.db to SSD.

  2. sqlite-vec performance with int8 quantization vs. f32 at 5M vectors. The library is young (first releases shipped in 2024; see item 7 for current version status). Recall delta on this corpus is the only thing that matters; benchmark with a held-out set of “I know this conversation exists” queries.

  3. Embedding model choice. BGE-small vs. BGE-base vs. nomic-embed-text-v1.5 vs. mxbai-embed-large. All CPU-tractable; the right pick is whichever has the highest recall@50 on actual transcript queries you care about. Embed 50k events, run 20 representative queries, measure.

  4. Whether RRF alone is enough, or whether the reranker pulls its weight. Measure recall@10 with and without (a harness sketch follows this list). If RRF is already at ~0.9, the reranker is not worth the latency.

  5. DuckDB vs. SQLite for the analytics path. If you find yourself writing GROUP BY queries across machine × variant × week, switch the read path for those queries to DuckDB attached to the SQLite file. No migration; literally just a different connection string.

  6. Ingest path: streaming vs. batch. The corpus is bounded (1-month window per extract.py), so batch ingest at end-of-day is fine. If you later want live indexing of in-progress sessions, FTS5 supports it trivially; sqlite-vec is also fine for incremental insert.

  7. Confidence calibration. Per the fleet rule, anything in this doc below the ~94% confidence bar warrants a verification pass before building on it. The biggest items to verify against current vendor docs before committing: (a) sqlite-vec status (is it still pre-1.0 as of 2026-05? — check at https://github.com/asg017/sqlite-vec/releases), (b) LanceDB’s BM25 maturity, (c) Meilisearch hybrid pricing/features. The high-level shape of the recommendation doesn’t change if any of these moved a notch — it just affects which honorable mention is closest.
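
For items 2–4 the measurement harness is a few lines; a sketch, where queries is a hand-built list of (query text, id of the conversation you know exists) pairs and search_fn returns ranked event ids:

def recall_at_k(search_fn, queries, k=10):
    # queries: [(query_text, known_relevant_event_id), ...]
    hits = sum(1 for q, want in queries if want in search_fn(q)[:k])
    return hits / len(queries)

Run it once with RRF alone and once with the reranker stage bolted on; if the delta at k=10 is within noise, skip Phase 3.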


Single SQLite file. FTS5 for full-text, sqlite-vec for embeddings, RRF in Python to fuse them, DuckDB attached for analytics. Add a BGE-reranker only if measurement shows you need it.