A pipeline from passive storage to ranked answers — and back again.
StrataFS is a layered Go application. Each layer is a small interface, so every backend, parser, chunker, and ranking signal is replaceable. The whole thing runs as one binary, with SQLite as the only durable store.
The 30-second tour
Storage sources flow into a change monitor, which writes jobs to a
SQLite-backed queue. Workers pull jobs, parse files, chunk text, embed
with a local ONNX model, and persist rows to a per-source SQLite database
with FTS5 + sqlite-vec extensions. A hybrid search engine
reads those databases and serves results over a REST API and a native MCP
server.
Why this shape? Each layer is a small, testable interface. SQLite makes the operational surface tiny. Per-source isolation means adding a new bucket can't break an unrelated index.
Layer by layer
1. Storage (pkg/storage/, pkg/filesystem/)
A factory pattern selects a backend: local, S3, GCS, Azure Blob, SharePoint, Google Drive, or Jira. Every backend implements one interface: list, stat, read. Read-only. StrataFS never writes to your sources.
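As a rough illustration of the list/stat/read contract and the factory described above, here is a minimal sketch. The names Backend, FileInfo, NewBackend, and ErrUnsupported are assumptions made for this example and are not taken from pkg/storage/; only a local backend is shown.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"os"
	"path/filepath"
)

// FileInfo is a hypothetical backend-neutral description of one object.
type FileInfo struct {
	Path string
	Size int64
}

// Backend is an assumed shape for the read-only contract: list, stat, read.
// There is deliberately no write or delete method.
type Backend interface {
	List(prefix string) ([]FileInfo, error)
	Stat(path string) (FileInfo, error)
	Read(path string) (io.ReadCloser, error)
}

// ErrUnsupported is returned by the factory for unknown backend kinds.
var ErrUnsupported = errors.New("unsupported backend")

// localBackend serves files below a root directory on the local disk.
type localBackend struct{ root string }

func (b localBackend) List(prefix string) ([]FileInfo, error) {
	var out []FileInfo
	start := filepath.Join(b.root, prefix)
	err := filepath.WalkDir(start, func(p string, d os.DirEntry, err error) error {
		if err != nil || d.IsDir() {
			return err // propagate walk errors, skip directories
		}
		info, err := d.Info()
		if err != nil {
			return err
		}
		rel, err := filepath.Rel(b.root, p)
		if err != nil {
			return err
		}
		out = append(out, FileInfo{Path: rel, Size: info.Size()})
		return nil
	})
	return out, err
}

func (b localBackend) Stat(path string) (FileInfo, error) {
	info, err := os.Stat(filepath.Join(b.root, path))
	if err != nil {
		return FileInfo{}, err
	}
	return FileInfo{Path: path, Size: info.Size()}, nil
}

func (b localBackend) Read(path string) (io.ReadCloser, error) {
	f, err := os.Open(filepath.Join(b.root, path))
	if err != nil {
		return nil, err
	}
	return f, nil
}

// NewBackend is the factory: it maps a kind string to an implementation.
func NewBackend(kind, location string) (Backend, error) {
	switch kind {
	case "local":
		return localBackend{root: location}, nil
	default:
		return nil, fmt.Errorf("backend %q: %w", kind, ErrUnsupported)
	}
}

func main() {
	dir, err := os.MkdirTemp("", "stratafs-demo")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)
	if err := os.WriteFile(filepath.Join(dir, "note.md"), []byte("# hello"), 0o644); err != nil {
		panic(err)
	}
	b, err := NewBackend("local", dir)
	if err != nil {
		panic(err)
	}
	files, err := b.List("")
	fmt.Println(files, err)
}
```

Because the interface has no mutating method, code written against it cannot write to a source by accident; a remote backend would only need to supply the same three methods.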
2. Monitor (pkg/monitor/)
Local sources use fsnotify for sub-second change detection.
Remote sources are polled through each provider's listing or change API
(S3 ListObjectsV2, Microsoft Graph delta, Drive changes, Jira JQL on the
updated field). Each change becomes a job.
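For a source that only offers a plain listing, one way to turn two polls into jobs is to diff the snapshots. The sketch below assumes each listing is reduced to a map from path to a version tag (an ETag or a modification time); Job, ChangeKind, and Diff are hypothetical names, and true delta APIs would return the changes directly instead.

```go
package main

import (
	"fmt"
	"sort"
)

// ChangeKind labels what happened to a path between two polls.
type ChangeKind string

const (
	Created  ChangeKind = "created"
	Modified ChangeKind = "modified"
	Deleted  ChangeKind = "deleted"
)

// Job is a hypothetical unit of work handed to the queue.
type Job struct {
	Path string
	Kind ChangeKind
}

// Diff compares the previous and current snapshots (path -> version tag)
// and returns one job per changed path, sorted by path for stable output.
func Diff(prev, curr map[string]string) []Job {
	var jobs []Job
	for path, tag := range curr {
		old, seen := prev[path]
		switch {
		case !seen:
			jobs = append(jobs, Job{path, Created})
		case old != tag:
			jobs = append(jobs, Job{path, Modified})
		}
	}
	for path := range prev {
		if _, ok := curr[path]; !ok {
			jobs = append(jobs, Job{path, Deleted})
		}
	}
	sort.Slice(jobs, func(i, j int) bool { return jobs[i].Path < jobs[j].Path })
	return jobs
}

func main() {
	prev := map[string]string{"a.md": "v1", "b.md": "v1", "c.md": "v1"}
	curr := map[string]string{"a.md": "v1", "b.md": "v2", "d.md": "v1"}
	for _, j := range Diff(prev, curr) {
		fmt.Println(j.Kind, j.Path)
	}
}
```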
3. Queue (pkg/queue/)
A SQLite-backed job queue with priority, automatic retry (default 3 attempts with exponential backoff), and crash recovery. The queue lives in the same SQLite database as the index — so there is no separate broker to run.
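The retry policy can be pictured with a small in-memory sketch. The limit of 3 attempts comes from the text; the 2-second base delay, the doubling factor, and the Job fields are assumptions for illustration, and the real queue persists this state in SQLite rather than in a struct.

```go
package main

import (
	"fmt"
	"time"
)

const (
	maxAttempts = 3               // default stated in the text
	baseDelay   = 2 * time.Second // assumed; the real base delay is not given here
)

// Backoff returns the wait before the next try after `failures` failed
// attempts: baseDelay, then 2x, then 4x, and so on.
func Backoff(failures int) time.Duration {
	if failures < 1 {
		failures = 1
	}
	return baseDelay << (failures - 1)
}

// Job is a hypothetical queue row.
type Job struct {
	ID       int
	Attempts int
	State    string        // "pending" or "failed"
	Delay    time.Duration // how long to wait before the job is eligible again
}

// Fail records one failed attempt and either reschedules the job with a
// longer delay or marks it as permanently failed.
func Fail(j *Job) {
	j.Attempts++
	if j.Attempts >= maxAttempts {
		j.State = "failed"
		j.Delay = 0
		return
	}
	j.State = "pending"
	j.Delay = Backoff(j.Attempts)
}

func main() {
	j := &Job{ID: 1, State: "pending"}
	for i := 0; i < maxAttempts; i++ {
		Fail(j)
		fmt.Printf("attempt %d -> state=%s delay=%s\n", j.Attempts, j.State, j.Delay)
	}
}
```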
4. Parsers (pkg/parsers/)
Thirty-five-plus file types: Markdown, plain text, PDF, DOCX, PPTX, XLSX, CSV, HTML, XML, JSON, YAML, TOML, INI, and source code in Go, Python, JavaScript/TypeScript, Java, Kotlin, C/C++, Rust, Swift, Ruby, PHP, Shell, SQL — each registered against a content-type or extension. Adding a new parser is a single Go file.
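A registry keyed by extension is one plausible way to get the "single Go file" property: a new parser registers itself from its own init function. The Parser interface, ParserFunc adapter, Register, and Lookup below are hypothetical names for this sketch, not the actual pkg/parsers/ API, and content-type matching is left out.

```go
package main

import (
	"fmt"
	"io"
	"path/filepath"
	"strings"
)

// Parser turns a raw file stream into plain text (assumed interface).
type Parser interface {
	Parse(r io.Reader) (string, error)
}

// ParserFunc adapts an ordinary function to the Parser interface.
type ParserFunc func(io.Reader) (string, error)

func (f ParserFunc) Parse(r io.Reader) (string, error) { return f(r) }

// registry maps a lower-cased extension (with the dot) to its parser.
var registry = map[string]Parser{}

// Register associates an extension with a parser.
func Register(ext string, p Parser) { registry[strings.ToLower(ext)] = p }

// Lookup picks a parser from the file extension, case-insensitively.
func Lookup(path string) (Parser, bool) {
	p, ok := registry[strings.ToLower(filepath.Ext(path))]
	return p, ok
}

// In a real layout this init would live in the parser's own file.
func init() {
	Register(".txt", ParserFunc(func(r io.Reader) (string, error) {
		b, err := io.ReadAll(r)
		return string(b), err
	}))
}

func main() {
	p, ok := Lookup("notes/README.TXT")
	if !ok {
		panic("no parser registered")
	}
	text, err := p.Parse(strings.NewReader("hello"))
	fmt.Println(text, err)
}
```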
5. Chunking (pkg/chunking/)
Four strategies — simple (fixed window with overlap),
sentence (boundary-aware), separator (markdown
headings, blank lines, commas), and token (budget-aware for
strict LLM contexts). The right strategy is picked per file type:
separator for markdown and code, sentence for PDFs and prose.
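The simple strategy is the easiest to show in code. This is a minimal sketch of a fixed window with overlap, counting runes rather than bytes so multi-byte characters are never split; ChunkSimple and its parameters are illustrative names, not the pkg/chunking/ API.

```go
package main

import (
	"errors"
	"fmt"
)

// ChunkSimple splits text into windows of `size` runes, where each window
// repeats the last `overlap` runes of the previous one.
func ChunkSimple(text string, size, overlap int) ([]string, error) {
	if size <= 0 || overlap < 0 || overlap >= size {
		return nil, errors.New("need size > 0 and 0 <= overlap < size")
	}
	runes := []rune(text)
	step := size - overlap // how far the window advances each time
	var chunks []string
	for start := 0; start < len(runes); start += step {
		end := start + size
		if end > len(runes) {
			end = len(runes)
		}
		chunks = append(chunks, string(runes[start:end]))
		if end == len(runes) {
			break // the last window already reached the end of the text
		}
	}
	return chunks, nil
}

func main() {
	chunks, err := ChunkSimple("abcdefghij", 4, 1)
	fmt.Println(chunks, err) // [abcd defg ghij] <nil>
}
```

The overlap means a sentence that straddles a window boundary still appears intact in at least one chunk, at the cost of storing some text twice.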
6. Embeddings (pkg/embeddings/)
FastEmbed-go calls into ONNX Runtime. Default model is BGE Base EN v1.5 (768 dimensions). Swap to BGE Small (384d) for memory-tight machines, or point at any ONNX file on disk. Nothing leaves the host.
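To make the memory trade-off concrete, the following back-of-the-envelope sketch compares raw vector sizes for the two models. It assumes each dimension is stored as a 4-byte float32 and ignores index and row overhead; neither assumption is confirmed by the text, and VectorBytes is a hypothetical helper.

```go
package main

import "fmt"

// VectorBytes estimates raw storage for `chunks` embeddings of `dims`
// dimensions, assuming 4 bytes (float32) per dimension.
func VectorBytes(dims, chunks int) int64 {
	return int64(dims) * 4 * int64(chunks)
}

func main() {
	const chunks = 1_000_000
	fmt.Println("BGE Base  (768d):", VectorBytes(768, chunks), "bytes")
	fmt.Println("BGE Small (384d):", VectorBytes(384, chunks), "bytes")
}
```

Under these assumptions one million chunks need about 3.07 GB of raw vectors at 768 dimensions and half that at 384.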
7. Database (pkg/database/)
Each source gets its own SQLite database. A
files table tracks every observed file (with soft delete);
file_chunks stores the text. Content larger than 512 bytes
is gzip-compressed in content_compressed; an is_compressed
flag controls transparent decompression. FTS5 provides the BM25 index;
sqlite-vec stores the vector embeddings with HNSW.
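The compression rule can be sketched with the standard compress/gzip package. The 512-byte threshold and the content_compressed and is_compressed names come from the text; the ChunkRow struct, the plain Content field, and the Encode/Decode helpers are assumptions made for this example.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
	"strings"
)

const compressThreshold = 512 // bytes, as stated in the text

// ChunkRow mirrors the relevant columns of a chunk row (assumed layout).
type ChunkRow struct {
	Content           string // used when the text is small (assumed column)
	ContentCompressed []byte // gzip bytes, maps to content_compressed
	IsCompressed      bool   // maps to is_compressed
}

// Encode stores small text as-is and gzips anything over the threshold.
func Encode(text string) (ChunkRow, error) {
	if len(text) <= compressThreshold {
		return ChunkRow{Content: text}, nil
	}
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write([]byte(text)); err != nil {
		return ChunkRow{}, err
	}
	if err := zw.Close(); err != nil { // Close flushes the gzip footer
		return ChunkRow{}, err
	}
	return ChunkRow{ContentCompressed: buf.Bytes(), IsCompressed: true}, nil
}

// Decode returns the original text regardless of how it was stored.
func Decode(row ChunkRow) (string, error) {
	if !row.IsCompressed {
		return row.Content, nil
	}
	zr, err := gzip.NewReader(bytes.NewReader(row.ContentCompressed))
	if err != nil {
		return "", err
	}
	defer zr.Close()
	b, err := io.ReadAll(zr)
	return string(b), err
}

func main() {
	long := strings.Repeat("stratafs ", 100) // 900 bytes, above the threshold
	row, err := Encode(long)
	if err != nil {
		panic(err)
	}
	back, err := Decode(row)
	fmt.Println(row.IsCompressed, len(row.ContentCompressed) < len(long), back == long, err)
}
```

Callers only ever see Decode, which is what makes the compression transparent at query time.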
8. Search (pkg/search/)
A single SQL query with three CTEs — FTS5 BM25, vector cosine, metadata — joined and re-ranked by configurable weights. Filters apply at the SQL level (source, path glob, file type, recency). See the search engine deep dive for the query shape.
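The re-ranking step amounts to a weighted sum of the three signals. The sketch below does that sum in Go purely for illustration, whereas the text says the real engine does it inside SQL. It assumes every signal has already been normalized to the range 0 to 1 with higher meaning better (raw FTS5 bm25() values run the other way: better matches are numerically lower). Weights, Candidate, Score, and Rank are hypothetical names.

```go
package main

import (
	"fmt"
	"sort"
)

// Weights holds the configurable importance of each signal.
type Weights struct {
	BM25, Vector, Metadata float64
}

// Candidate carries per-signal scores for one chunk, each assumed to be
// normalized to [0, 1] with higher meaning a better match.
type Candidate struct {
	ChunkID                int
	BM25, Vector, Metadata float64
}

// Score combines the three signals into a single ranking value.
func Score(c Candidate, w Weights) float64 {
	return w.BM25*c.BM25 + w.Vector*c.Vector + w.Metadata*c.Metadata
}

// Rank returns a copy of the candidates ordered by descending score.
func Rank(cands []Candidate, w Weights) []Candidate {
	out := append([]Candidate(nil), cands...)
	sort.SliceStable(out, func(i, j int) bool {
		return Score(out[i], w) > Score(out[j], w)
	})
	return out
}

func main() {
	cands := []Candidate{
		{ChunkID: 1, BM25: 0.9, Vector: 0.2}, // strong keyword match
		{ChunkID: 2, BM25: 0.3, Vector: 0.9}, // strong semantic match
	}
	fmt.Println(Rank(cands, Weights{BM25: 0.3, Vector: 0.7})[0].ChunkID) // 2
	fmt.Println(Rank(cands, Weights{BM25: 0.7, Vector: 0.3})[0].ChunkID) // 1
}
```

Shifting weight between the keyword and vector signals flips the order of the two candidates, which is the behaviour the configurable weights are meant to expose.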
9. APIs (pkg/api/, pkg/protocol/)
The REST API on port 8080 exposes /search, /chunks/{id},
/sources/stats, and /queue/stats. The MCP server on
port 8081 implements the Model Context Protocol over JSON-RPC 2.0,
pre-shaping results into LLM-friendly text blocks. See the MCP page
for the tool catalog.
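For orientation, this is roughly what a JSON-RPC 2.0 tool invocation looks like on the wire. The tools/call method and the name/arguments parameter shape follow the Model Context Protocol; the tool name "search" and the "query" argument are placeholders, since the actual tool catalog is documented on the MCP page. The sketch only builds the request body and does not assume any endpoint path.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Request is a JSON-RPC 2.0 request envelope.
type Request struct {
	JSONRPC string `json:"jsonrpc"`
	ID      int    `json:"id"`
	Method  string `json:"method"`
	Params  any    `json:"params,omitempty"`
}

// ToolCall is the params object for the MCP "tools/call" method.
type ToolCall struct {
	Name      string         `json:"name"`
	Arguments map[string]any `json:"arguments"`
}

// NewToolCall serializes a tools/call request for the given tool.
func NewToolCall(id int, tool string, args map[string]any) ([]byte, error) {
	return json.Marshal(Request{
		JSONRPC: "2.0",
		ID:      id,
		Method:  "tools/call",
		Params:  ToolCall{Name: tool, Arguments: args},
	})
}

func main() {
	// "search" and "query" are hypothetical; check the tool catalog.
	body, err := NewToolCall(1, "search", map[string]any{"query": "retry policy"})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```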
10. FUSE bridge (pkg/fsbridge/)
On Linux and macOS the index can be mounted as a read-only filesystem via
FUSE; on Windows via WinFsp. Browse search results as paths;
cat, grep, or your editor work as usual, but they operate
on the semantic chunks rather than the raw files.
Design invariants
- Per-source isolation: failures contained, sources individually inspectable.
- Streaming pipeline: constant memory regardless of file size; multi-GB PDFs don't OOM.
- Soft delete: deleted files marked, hard-deleted after a configurable threshold.
- Compression-aware schema: 40–60% disk savings, transparent at query time.
- Read-only sources: no permission escalation; chmod -R a-w would not change behaviour.
- One binary: Go static link, ONNX Runtime bundled in the installer.
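The streaming invariant in the list above can be illustrated with a reader that is consumed through one fixed-size buffer, so memory use depends on the window size and not on the file size. StreamWindows and CountWindows are hypothetical helpers for this sketch; the real pipeline's stages and buffer sizes are not described here.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// StreamWindows reads r in windows of at most `size` bytes and passes each
// one to emit. The buffer is reused, so emit must not retain the slice.
func StreamWindows(r io.Reader, size int, emit func([]byte) error) error {
	buf := make([]byte, size)
	for {
		n, err := io.ReadFull(r, buf)
		if n > 0 {
			if e := emit(buf[:n]); e != nil {
				return e
			}
		}
		// io.EOF means nothing was left; io.ErrUnexpectedEOF means the
		// final window was shorter than the buffer. Both end the stream.
		if err == io.EOF || err == io.ErrUnexpectedEOF {
			return nil
		}
		if err != nil {
			return err
		}
	}
}

// CountWindows is a small helper that counts the windows in a string.
func CountWindows(text string, size int) (int, error) {
	count := 0
	err := StreamWindows(strings.NewReader(text), size, func([]byte) error {
		count++
		return nil
	})
	return count, err
}

func main() {
	n, err := CountWindows("abcdefghij", 4)
	fmt.Println(n, err) // 3 <nil>
}
```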
Where to read the code
The Go module is structured for readability: cmd/stratafs/main.go
is the entry point, every layer lives in pkg/<name>/, and
internal/ holds utilities. The research/ directory
contains the benchmark suite and the reproducible LaTeX-ready experiment
framework.
Want to extend StrataFS?
Every layer is a small interface. Add a parser, a backend, a chunker, or a ranking signal in a single Go file.