Architecture

A pipeline from passive storage to ranked answers — and back again.

StrataFS is a layered Go application. Each layer is a small interface, so every backend, parser, chunker, and ranking signal is replaceable. The whole thing runs as one binary, with SQLite as the only durable store.

The 30-second tour

Storage sources flow into a change monitor, which writes jobs to a SQLite-backed queue. Workers pull jobs, parse files, chunk text, embed with a local ONNX model, and persist rows to a per-source SQLite database with FTS5 + sqlite-vec extensions. A hybrid search engine reads those databases and serves results over a REST API and a native MCP server.

Why this shape? Each layer is a small, testable interface. SQLite makes the operational surface tiny. Per-source isolation means adding a new bucket can't break an unrelated index.

Layer by layer

1. Storage (pkg/storage/, pkg/filesystem/)

A factory pattern selects a backend: local, S3, GCS, Azure Blob, SharePoint, Google Drive, or Jira. Every backend implements one interface: list, stat, read. Read-only. StrataFS never writes to your sources.
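As a rough sketch of what "one interface: list, stat, read" could look like, the following defines a read-only backend and a factory. The names `Backend`, `FileInfo`, `NewBackend`, and the toy `memBackend` are assumptions for illustration, not the actual types in pkg/storage/.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"sort"
	"strings"
	"time"
)

// FileInfo is a hypothetical metadata record; the real struct may differ.
type FileInfo struct {
	Path    string
	Size    int64
	ModTime time.Time
}

// Backend is an assumed shape for the read-only storage interface:
// list, stat, read, and nothing that mutates the source.
type Backend interface {
	List(prefix string) ([]FileInfo, error)
	Stat(path string) (FileInfo, error)
	Read(path string) (io.ReadCloser, error)
}

// ErrNotFound is returned when a path does not exist in the backend.
var ErrNotFound = errors.New("not found")

// memBackend is a toy in-memory backend used only for this example.
type memBackend struct {
	files map[string]string
}

func (m memBackend) List(prefix string) ([]FileInfo, error) {
	var out []FileInfo
	for p, body := range m.files {
		if strings.HasPrefix(p, prefix) {
			out = append(out, FileInfo{Path: p, Size: int64(len(body))})
		}
	}
	// Map iteration order is random, so sort for a stable listing.
	sort.Slice(out, func(i, j int) bool { return out[i].Path < out[j].Path })
	return out, nil
}

func (m memBackend) Stat(path string) (FileInfo, error) {
	body, ok := m.files[path]
	if !ok {
		return FileInfo{}, ErrNotFound
	}
	return FileInfo{Path: path, Size: int64(len(body))}, nil
}

func (m memBackend) Read(path string) (io.ReadCloser, error) {
	body, ok := m.files[path]
	if !ok {
		return nil, ErrNotFound
	}
	return io.NopCloser(strings.NewReader(body)), nil
}

// NewBackend stands in for the factory: it maps a kind string to a Backend.
func NewBackend(kind string) (Backend, error) {
	switch kind {
	case "memory":
		return memBackend{files: map[string]string{
			"docs/a.md":  "# A",
			"docs/b.txt": "hello",
		}}, nil
	default:
		return nil, fmt.Errorf("unknown backend %q", kind)
	}
}

func main() {
	b, err := NewBackend("memory")
	if err != nil {
		panic(err)
	}
	files, _ := b.List("docs/")
	for _, f := range files {
		fmt.Println(f.Path, f.Size)
	}
}
```

Because the interface has no write or delete method, read-only access is enforced by the type system rather than by convention.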

2. Monitor (pkg/monitor/)

Local sources use fsnotify for sub-second change detection. Remote sources are polled through each provider's listing or delta API (S3 ListObjectsV2, Microsoft Graph delta queries, Drive changes, Jira JQL filtered on the updated field). Each change becomes a job.
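For sources that only offer a listing call, one plausible way to turn a poll into jobs is to diff two snapshots. The sketch below assumes a snapshot is a map from path to a version marker such as an ETag; the `Job` type and `diff` function are hypothetical and not taken from pkg/monitor/.

```go
package main

import (
	"fmt"
	"sort"
)

// Job is a hypothetical unit of work emitted by the monitor.
type Job struct {
	Path string
	Op   string // "index" for new or changed files, "delete" for removed ones
}

// diff compares the previous and current snapshots (path -> version marker)
// and returns one job per added, changed, or removed path, sorted by path.
func diff(prev, cur map[string]string) []Job {
	var jobs []Job
	for p, v := range cur {
		if old, ok := prev[p]; !ok || old != v {
			jobs = append(jobs, Job{Path: p, Op: "index"})
		}
	}
	for p := range prev {
		if _, ok := cur[p]; !ok {
			jobs = append(jobs, Job{Path: p, Op: "delete"})
		}
	}
	sort.Slice(jobs, func(i, j int) bool { return jobs[i].Path < jobs[j].Path })
	return jobs
}

func main() {
	prev := map[string]string{"a": "v1", "b": "v1", "c": "v1"}
	cur := map[string]string{"a": "v1", "b": "v2", "d": "v1"}
	for _, j := range diff(prev, cur) {
		fmt.Println(j.Op, j.Path)
	}
}
```

Providers with a true delta API can skip the comparison and map each reported change straight to a job.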

3. Queue (pkg/queue/)

A SQLite-backed job queue with priority, automatic retry (3 attempts by default, with exponential backoff), and crash recovery. The queue lives in SQLite alongside the index, so there is no separate broker to run.
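The retry policy can be illustrated with a small delay calculation. Only the default of 3 attempts comes from the text above; the one-second base, the one-minute cap, and the function names are assumptions for this sketch.

```go
package main

import (
	"fmt"
	"time"
)

// maxAttempts mirrors the documented default of 3 attempts.
const maxAttempts = 3

// backoff returns the wait after a failed attempt: base * 2^(attempt-1),
// never exceeding max. Attempts are counted from 1.
func backoff(attempt int, base, max time.Duration) time.Duration {
	d := base
	for i := 1; i < attempt; i++ {
		d *= 2
		if d >= max {
			return max
		}
	}
	if d > max {
		return max
	}
	return d
}

// shouldRetry reports whether a job that has already run `attempts` times
// may be scheduled again.
func shouldRetry(attempts int) bool {
	return attempts < maxAttempts
}

func main() {
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		fmt.Printf("attempt %d failed: retry=%v wait=%s\n",
			attempt, shouldRetry(attempt), backoff(attempt, time.Second, time.Minute))
	}
}
```

Capping the delay keeps a long run of failures from pushing a job arbitrarily far into the future.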

4. Parsers (pkg/parsers/)

More than 35 file types are supported: Markdown, plain text, PDF, DOCX, PPTX, XLSX, CSV, HTML, XML, JSON, YAML, TOML, INI, and source code in Go, Python, JavaScript/TypeScript, Java, Kotlin, C/C++, Rust, Swift, Ruby, PHP, Shell, and SQL. Each parser is registered against a content type or file extension, so adding a new one takes a single Go file.
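A registry keyed by file extension is one common way to get the "single Go file per parser" property. The sketch below is an assumption about the general shape; `Parser`, `ParserFunc`, `Register`, and `ForPath` are illustrative names, not the real pkg/parsers/ API.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// Parser turns raw file bytes into plain text (hypothetical interface).
type Parser interface {
	Parse(data []byte) (string, error)
}

// ParserFunc adapts a plain function to the Parser interface.
type ParserFunc func(data []byte) (string, error)

// Parse calls the underlying function.
func (f ParserFunc) Parse(data []byte) (string, error) { return f(data) }

// registry maps a lower-cased extension such as ".txt" to its parser.
var registry = map[string]Parser{}

// Register associates an extension with a parser.
func Register(ext string, p Parser) {
	registry[strings.ToLower(ext)] = p
}

// ForPath looks up the parser for a path by its extension.
func ForPath(path string) (Parser, bool) {
	p, ok := registry[strings.ToLower(filepath.Ext(path))]
	return p, ok
}

// A new parser would live in its own file and register itself in init.
func init() {
	Register(".txt", ParserFunc(func(data []byte) (string, error) {
		return string(data), nil
	}))
}

func main() {
	if p, ok := ForPath("notes/README.TXT"); ok {
		text, _ := p.Parse([]byte("hello"))
		fmt.Println(text)
	}
}
```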

5. Chunking (pkg/chunking/)

There are four strategies: simple (fixed window with overlap), sentence (boundary-aware), separator (markdown headings, blank lines, commas), and token (budget-aware for strict LLM contexts). The strategy is chosen per file type: separator for markdown and code, sentence for PDFs and prose.
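The simple strategy is the easiest to picture. Here is a minimal sketch of a fixed window with overlap, measured in runes; the function name and the choice of runes over bytes or tokens are assumptions, not details from pkg/chunking/.

```go
package main

import "fmt"

// chunkSimple splits text into windows of `size` runes. Each window starts
// `size - overlap` runes after the previous one, so consecutive chunks
// share `overlap` runes. Invalid parameters yield nil.
func chunkSimple(text string, size, overlap int) []string {
	if size <= 0 || overlap < 0 || overlap >= size {
		return nil
	}
	runes := []rune(text)
	step := size - overlap
	var chunks []string
	for start := 0; start < len(runes); start += step {
		end := start + size
		if end > len(runes) {
			end = len(runes)
		}
		chunks = append(chunks, string(runes[start:end]))
		if end == len(runes) {
			break // the last window reached the end of the text
		}
	}
	return chunks
}

func main() {
	for i, c := range chunkSimple("abcdefghij", 4, 1) {
		fmt.Printf("%d: %q\n", i, c)
	}
}
```

With a window of 4 and an overlap of 1, the ten-letter input yields "abcd", "defg", and "ghij": each chunk repeats the last rune of the one before it.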

6. Embeddings (pkg/embeddings/)

FastEmbed-go calls into ONNX Runtime. Default model is BGE Base EN v1.5 (768 dimensions). Swap to BGE Small (384d) for memory-tight machines, or point at any ONNX file on disk. Nothing leaves the host.

7. Database (pkg/database/)

Each source gets its own SQLite database. A files table tracks every observed file (with soft delete); file_chunks stores the text. Content larger than 512 bytes is gzip-compressed in content_compressed; an is_compressed flag controls transparent decompression. FTS5 provides the BM25 index; sqlite-vec stores the vector embeddings with HNSW.
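The compression rule can be sketched as an encode/decode pair around the 512-byte threshold. The `storedChunk` struct below folds the content column and the is_compressed flag into one value for brevity; that layout and the function names are assumptions, and only the threshold and the use of gzip come from the description above.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
	"strings"
)

// compressThreshold is the documented cutoff: content larger than this
// many bytes is stored gzip-compressed.
const compressThreshold = 512

// storedChunk is a simplified stand-in for a file_chunks row.
type storedChunk struct {
	Content      []byte // raw text, or gzip data when IsCompressed is true
	IsCompressed bool
}

// encode compresses text only when it exceeds the threshold.
func encode(text string) (storedChunk, error) {
	if len(text) <= compressThreshold {
		return storedChunk{Content: []byte(text)}, nil
	}
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write([]byte(text)); err != nil {
		return storedChunk{}, err
	}
	if err := zw.Close(); err != nil { // Close flushes the gzip trailer
		return storedChunk{}, err
	}
	return storedChunk{Content: buf.Bytes(), IsCompressed: true}, nil
}

// decode returns the original text, decompressing when the flag is set.
func decode(c storedChunk) (string, error) {
	if !c.IsCompressed {
		return string(c.Content), nil
	}
	zr, err := gzip.NewReader(bytes.NewReader(c.Content))
	if err != nil {
		return "", err
	}
	defer zr.Close()
	out, err := io.ReadAll(zr)
	if err != nil {
		return "", err
	}
	return string(out), nil
}

func main() {
	long := strings.Repeat("stratafs ", 100) // 900 bytes, above the threshold
	c, _ := encode(long)
	back, _ := decode(c)
	fmt.Println(c.IsCompressed, back == long)
}
```

Small chunks skip compression because gzip's header and trailer overhead can outweigh any savings at that size.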

8. Search (pkg/search/)

Search runs as a single SQL query with three CTEs (FTS5 BM25, vector cosine similarity, and metadata) whose results are joined and re-ranked by configurable weights. Filters apply at the SQL level (source, path glob, file type, recency). See the search engine deep dive for the query shape.
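The re-ranking step amounts to a weighted sum over the three signals. StrataFS does this inside SQL; the Go sketch below only illustrates the arithmetic, and it assumes each signal has already been normalized to the range 0 to 1 with higher meaning better. The `Weights` and `Candidate` types are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// Weights holds the configurable weight of each ranking signal.
type Weights struct {
	BM25, Vector, Metadata float64
}

// Candidate carries one chunk's per-signal scores, assumed normalized to
// [0, 1] where higher is better.
type Candidate struct {
	ChunkID                int64
	BM25, Vector, Metadata float64
}

// score combines the three signals into one weighted value.
func (c Candidate) score(w Weights) float64 {
	return w.BM25*c.BM25 + w.Vector*c.Vector + w.Metadata*c.Metadata
}

// rank returns a copy of cands ordered by descending combined score.
func rank(cands []Candidate, w Weights) []Candidate {
	out := append([]Candidate(nil), cands...)
	sort.SliceStable(out, func(i, j int) bool {
		return out[i].score(w) > out[j].score(w)
	})
	return out
}

func main() {
	cands := []Candidate{
		{ChunkID: 1, BM25: 0.9, Vector: 0.1},
		{ChunkID: 2, BM25: 0.2, Vector: 0.95},
	}
	keyword := rank(cands, Weights{BM25: 1})
	semantic := rank(cands, Weights{Vector: 1})
	fmt.Println(keyword[0].ChunkID, semantic[0].ChunkID)
}
```

Shifting weight from BM25 to the vector signal moves the same candidate set from exact-term matches toward semantic matches without changing the query itself.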

9. APIs (pkg/api/, pkg/protocol/)

The REST API on port 8080 exposes /search, /chunks/{id}, /sources/stats, and /queue/stats. The MCP server on port 8081 implements the Model Context Protocol over JSON-RPC 2.0, pre-shaping results into LLM-friendly text blocks. See the MCP page for the tool catalog.
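To make the JSON-RPC 2.0 framing concrete, the sketch below builds a request envelope for the Model Context Protocol's tools/call method. The envelope fields and the method name follow the protocol; the tool name "search" and its "query" and "limit" arguments are hypothetical placeholders, since the actual tool catalog is documented on the MCP page.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// rpcRequest is a JSON-RPC 2.0 request envelope.
type rpcRequest struct {
	JSONRPC string     `json:"jsonrpc"`
	ID      int        `json:"id"`
	Method  string     `json:"method"`
	Params  toolParams `json:"params"`
}

// toolParams follows the MCP "tools/call" shape: a tool name plus arguments.
type toolParams struct {
	Name      string         `json:"name"`
	Arguments map[string]any `json:"arguments"`
}

// newToolCall builds a tools/call request for the given tool and arguments.
func newToolCall(id int, tool string, args map[string]any) rpcRequest {
	return rpcRequest{
		JSONRPC: "2.0",
		ID:      id,
		Method:  "tools/call",
		Params:  toolParams{Name: tool, Arguments: args},
	}
}

func main() {
	// "search", "query", and "limit" are illustrative names, not confirmed
	// entries in the StrataFS tool catalog.
	req := newToolCall(1, "search", map[string]any{"query": "retry policy", "limit": 5})
	body, err := json.MarshalIndent(req, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```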

10. FUSE bridge (pkg/fsbridge/)

On Linux and macOS the index can be mounted as a read-only filesystem via FUSE; on Windows, via WinFsp. Browse search results as paths; cat, grep, and your editor work as usual, but against the semantic chunks rather than the raw files.

Design invariants

Where to read the code

The Go module is structured for readability: cmd/stratafs/main.go is the entry point, every layer lives in pkg/<name>/, and internal/ holds utilities. The research/ directory contains the benchmark suite and the reproducible LaTeX-ready experiment framework.

Want to extend StrataFS?

Every layer is a small interface. Add a parser, a backend, a chunker, or a ranking signal in a single Go file.