How are cloud sources kept up to date?

S3, GCS, and Azure Blob are polled at a configurable interval (default 5 minutes for buckets) using their delta/list APIs. Each source's SQLite database stores the cursor/etag so polls are incremental and cheap. Local sources use fsnotify for real-time updates.

What IAM permissions does StrataFS need?

Read-only: s3:ListBucket + s3:GetObject for AWS; storage.objects.get + storage.objects.list for GCS; Storage Blob Data Reader role for Azure. StrataFS will surface a permission denied error if it tries to perform any other action — by design.

Does StrataFS download all my bucket contents?

It reads file contents to extract text, chunk, and embed. The original files are never copied to your local disk — only the extracted chunks and their embeddings end up in the local SQLite database. Compression keeps the resulting database ~1.5–2× the original text size.

Tutorial

Index S3, GCS, and Azure Blob buckets with semantic search

Indexing cloud storage used to mean pipelines, copies, and a separate search service. StrataFS reads buckets in place with read-only credentials and exposes a hybrid search across all of them.

By Dipankar Sarkar · May 30, 2026 · 11 min read

Tags: s3, gcs, azure, cloud-storage, tutorial

If your knowledge base spans cloud storage, indexing it has historically meant: write a pipeline, copy the files to a staging area, run them through your search ETL, push to a central search service, schedule re-runs, monitor lag. Three months of work; ongoing operational cost.

This article is the alternative: how to point StrataFS at S3, GCS, and Azure Blob containers and have a hybrid search ready in under an hour, with read-only credentials and no copying.

The shape of the integration

StrataFS treats every cloud bucket as a source. A source has:

A name (your handle).
A backend type (s3, gcs, azure, local).
Connection settings (bucket, prefix, region).
A credential reference (typically an env var or a standard SDK profile).
Optional include / exclude globs.

Each source gets its own SQLite database under .stratafs/sources/<name>.db. Sources are isolated: adding a bucket can’t break an existing index; revoking a source is deleting a file.
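
To make the shape concrete, here is a minimal Python sketch of a source as a data structure. The Source class, its field names, and the wants() helper are hypothetical illustrations, not StrataFS internals. The glob matching uses Python's fnmatch and assumes patterns are applied to the key relative to the prefix; StrataFS's own glob semantics may differ.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


@dataclass
class Source:
    """Hypothetical in-memory model of one source entry."""
    name: str
    type: str                      # "s3", "gcs", "azure", or "local"
    bucket: str
    prefix: str = ""
    include: list[str] = field(default_factory=list)
    exclude: list[str] = field(default_factory=list)

    def db_path(self) -> PurePosixPath:
        # One SQLite file per source, as described above.
        return PurePosixPath(".stratafs/sources") / f"{self.name}.db"

    def wants(self, key: str) -> bool:
        # Assumption: patterns see the key relative to the prefix, and "*"
        # may cross "/" (that is how fnmatch behaves).
        rel = key[len(self.prefix):] if key.startswith(self.prefix) else key
        if any(fnmatchcase(rel, pat) for pat in self.exclude):
            return False
        return not self.include or any(fnmatchcase(rel, pat) for pat in self.include)
```

One caveat of the fnmatch shortcut: a pattern like "**/*.md" needs at least one "/" in the relative key, so a file sitting directly under the prefix would not match, whereas most glob engines would accept it.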

Step 1 — install and init

npm install -g stratafs       # or pip install / brew install / docker pull
stratafs config init

config init writes ~/.stratafs/config.toml with sensible defaults. We’ll edit it next.

Step 2 — add an S3 source

# ~/.stratafs/config.toml

[[sources]]
name = "engineering-docs"
type = "s3"
bucket = "acme-engineering"
region = "us-east-1"
prefix = "docs/"        # optional; restricts to a sub-tree
include = ["**/*.md", "**/*.pdf"]
exclude = [".git/**", "drafts/**"]
poll_interval = "5m"

Credentials come from the AWS SDK chain — ~/.aws/credentials, the standard env vars (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY), or an IAM role if you’re on EC2. The required IAM permissions are minimal:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "StratafsRead",
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::acme-engineering"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": "arn:aws:s3:::acme-engineering/docs/*"
    }
  ]
}

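Two permissions. The s3:GetObject statement above is already scoped to the docs/ prefix as a belt-and-braces measure; if you index the whole bucket, use arn:aws:s3:::acme-engineering/* as its Resource.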
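
If you want to check a policy before attaching it, a few lines of Python can flag anything beyond the two read-only actions. This is a sketch, not part of StrataFS: the extra_actions function is a hypothetical helper, and it only inspects Allow statements in a policy document shaped like the one above.

```python
import json

READ_ONLY_ACTIONS = {"s3:ListBucket", "s3:GetObject"}


def extra_actions(policy_json: str) -> set[str]:
    """Return every allowed action that goes beyond the two read-only ones."""
    policy = json.loads(policy_json)
    statements = policy["Statement"]
    if isinstance(statements, dict):      # a single statement may be a bare object
        statements = [statements]
    found = set()
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        if isinstance(actions, str):      # "Action" may be a string or a list
            actions = [actions]
        found.update(actions)
    # Wildcards such as "s3:*" are not in the allow-list, so they are reported too.
    return found - READ_ONLY_ACTIONS
```

An empty result means the policy grants nothing beyond listing and reading; anything returned is worth a second look before you hand the credentials to an indexer.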

Step 3 — add a GCS source

[[sources]]
name = "team-drive"
type = "gcs"
bucket = "acme-team-drive"
prefix = "shared/"
credentials_file = "/opt/stratafs/gcs-key.json"
poll_interval = "5m"

Create a service account with the storage.objects.get + storage.objects.list permissions on the bucket, download its JSON key, and point credentials_file at it. Don’t paste the key contents into the config; reference the file.

Step 4 — add an Azure Blob source

[[sources]]
name = "ops-archive"
type = "azure"
account = "acmeops"
container = "archive"
prefix = "2024/"
auth = "env"   # AZURE_STORAGE_ACCOUNT + AZURE_STORAGE_KEY in env
poll_interval = "15m"

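The Storage Blob Data Reader Azure role on the storage account is sufficient for role-based access. Keep in mind that an account key (auth = "env") grants full access to the storage account regardless of role assignments. For SAS-based access, set auth = "sas" and supply AZURE_STORAGE_SAS_TOKEN instead of the key; a SAS token limited to read and list permissions keeps the source read-only.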

Step 5 — bring it up

stratafs serve

That starts the REST API on port 8080 and the MCP server on port 8081. The first time you run it, StrataFS will list the buckets, queue every matching object, and start parsing/embedding through the worker pool. Status check:

stratafs queue stats
# → engineering-docs: 12 484 queued, 7 122 done, 0 failed
# → team-drive:        2 901 queued, 2 901 done, 0 failed
# → ops-archive:      14 002 queued, 8 451 done, 1 failed

The one failed job is loud, not silent — stratafs queue list --status failed will tell you which object and why. Most failures are “unsupported file type” or “object size exceeds 100MB limit” (configurable).

Step 6 — query

The most useful query in the early days is one that’s not scoped to a source — see what surfaces from the entire corpus:

stratafs search "annual security review checklist" --mode hybrid --limit 10

Once you trust the index, scope per query:

stratafs search "JWT refresh sequence" --source engineering-docs --path "**/*.md"

The REST API takes the same options as query parameters:

curl 'http://localhost:8080/search?q=JWT+refresh+sequence&source=engineering-docs&mode=hybrid&limit=10'
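
If you are calling the API from code rather than curl, the same request can be assembled with the Python standard library. This is a sketch under assumptions: build_search_url and search are hypothetical helper names, and the response is assumed to be JSON, since this article does not spell out the response schema.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:8080"   # REST port from "stratafs serve"


def build_search_url(query, source=None, mode="hybrid", limit=10):
    """Assemble the /search URL; omit `source` to search every source."""
    params = {"q": query}
    if source is not None:
        params["source"] = source
    params["mode"] = mode
    params["limit"] = limit
    return f"{BASE_URL}/search?{urlencode(params)}"


def search(query, **kwargs):
    # Assumption: the endpoint answers with a JSON body; the parsed value is
    # returned untouched because its schema is not described here.
    with urlopen(build_search_url(query, **kwargs), timeout=10) as resp:
        return json.load(resp)
```

Calling build_search_url("JWT refresh sequence", source="engineering-docs") produces the same URL as the curl command above.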

Polling vs. real time

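Cloud sources poll. There’s no S3 equivalent of fsnotify, so for S3 we call ListObjectsV2 and compare the result against a saved cursor; GCS and Azure are polled the same way through their own list APIs (the saved cursor is an etag for GCS, a marker for Azure). The default 5-minute interval is appropriate for most cases: polling too aggressively wastes API calls, polling too slowly lets the index drift.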
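
One way to picture an incremental poll is as a diff between two listings. The sketch below is conceptual: it assumes the saved state is a plain mapping from object key to etag, which is a simplification for illustration and not a description of StrataFS's actual cursor format.

```python
def diff_listing(previous, current):
    """Compare two {key: etag} snapshots of a bucket listing.

    Returns (added, changed, removed) as sorted lists of keys.
    """
    added = sorted(k for k in current if k not in previous)
    changed = sorted(
        k for k in current if k in previous and current[k] != previous[k]
    )
    removed = sorted(k for k in previous if k not in current)
    return added, changed, removed
```

Only keys in the added and changed lists would need to be fetched, parsed, and embedded again; removed keys only need their chunks dropped from the index.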

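If you need lower latency, set poll_interval = "30s". The cost is N more ListObjectsV2 calls per hour, where N depends on bucket size, because each call returns at most 1,000 keys. S3 charges $0.005 per 1,000 LIST requests. At 30-second polling that is 86,400 polls in a 30-day month, or about $0.43 per month for every 1,000 objects listed per poll, so a multi-thousand-object bucket costs on the order of a few dollars a month if every poll lists the whole prefix.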
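
The arithmetic is easy to run for your own bucket. The sketch below assumes every poll walks the full listing under the prefix (one request per 1,000 keys) and uses the LIST price quoted above; monthly_list_cost is a hypothetical helper, and prices vary by region and storage class.

```python
import math

LIST_PRICE_PER_1000 = 0.005   # USD, the S3 LIST price quoted above
KEYS_PER_LIST_CALL = 1000     # ListObjectsV2 returns at most 1,000 keys per call


def monthly_list_cost(object_count, poll_seconds, days=30):
    """Estimated USD per month spent on LIST requests for one source."""
    calls_per_poll = max(1, math.ceil(object_count / KEYS_PER_LIST_CALL))
    polls_per_month = days * 24 * 3600 // poll_seconds
    return calls_per_poll * polls_per_month * LIST_PRICE_PER_1000 / 1000
```

For example, monthly_list_cost(5000, 30) comes to roughly $2.16, and the same bucket at the default five-minute interval comes to roughly $0.22.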

For truly real-time updates, configure S3 Event Notifications → SNS → a small webhook receiver that calls stratafs sources sync <name> on each event. That’s a few hours of work; we’ll ship a built-in webhook receiver in a future release.
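
As a starting point, such a receiver could look like the sketch below. It relies on two facts about the AWS side: SNS delivers an HTTP POST whose JSON body carries the S3 event as a string in its Message field, and each S3 event record names the bucket it came from. Everything else is an assumption for illustration: the BUCKET_TO_SOURCE mapping, the port, and the decision to skip SNS signature verification and subscription confirmation, both of which a production receiver would need.

```python
import json
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical mapping from bucket name to the source name in config.toml.
BUCKET_TO_SOURCE = {"acme-engineering": "engineering-docs"}


def sources_to_sync(sns_body, bucket_to_source=BUCKET_TO_SOURCE):
    """Extract the StrataFS sources touched by one SNS-wrapped S3 event."""
    envelope = json.loads(sns_body)
    if envelope.get("Type") != "Notification":
        return set()            # e.g. SubscriptionConfirmation, not handled here
    event = json.loads(envelope["Message"])   # the S3 event is a JSON string
    buckets = {r["s3"]["bucket"]["name"] for r in event.get("Records", [])}
    return {bucket_to_source[b] for b in buckets if b in bucket_to_source}


class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length).decode("utf-8")
        try:
            names = sources_to_sync(body)
        except (ValueError, KeyError, TypeError, AttributeError):
            self.send_response(400)
            self.end_headers()
            return
        for name in names:
            # The sync command named in the paragraph above.
            subprocess.run(["stratafs", "sources", "sync", name], check=False)
        self.send_response(204)
        self.end_headers()


def main(port=9000):
    # Port 9000 is an arbitrary choice for this sketch; call main() to start.
    HTTPServer(("0.0.0.0", port), Handler).serve_forever()
```

Running the sync synchronously inside the request handler keeps the sketch short; with bursty event traffic you would likely want to debounce events per source instead of spawning one sync per notification.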

Disk math

A useful rule of thumb: the SQLite database for a source ends up around 1.5–2× the cumulative original text size of the indexed content, with compression on (40–60% savings from gzip). The multiplier applies to extracted text, not raw object bytes, so for a bucket holding 10 GB of documentation text, expect 15–20 GB of local index.

The wire cost is the original object size only — we read once, chunk once, embed once, store once. Re-runs use the saved cursor; only changed objects are re-processed.
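
To sanity-check a machine before the first run, the rule of thumb can be turned into a quick estimate. Both function names below are hypothetical, and the estimate is only as good as your guess at how much extractable text the bucket holds.

```python
import shutil


def index_size_range_gb(text_gb, low=1.5, high=2.0):
    """Apply the 1.5-2x rule of thumb to an amount of extracted text."""
    return text_gb * low, text_gb * high


def fits_on_disk(text_gb, path="."):
    """True if the pessimistic (2x) estimate fits in the free space at `path`."""
    _, worst_case_gb = index_size_range_gb(text_gb)
    free_gb = shutil.disk_usage(path).free / 1e9
    return worst_case_gb <= free_gb
```

For 10 GB of text, index_size_range_gb(10) gives the 15 to 20 GB range quoted above.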

Cross-source queries

The real payoff is the single query that fans out across sources:

stratafs search "incident response runbook" --mode hybrid
# Top result: ops-archive: 2024/runbooks/sev-1-response.pdf       score=0.84
# Then:       engineering-docs: docs/oncall/runbook.md            score=0.79
# Then:       team-drive:       shared/playbooks/incident.docx    score=0.72

Same engine, same ranking math, three different cloud providers. The agent (over MCP) sees one tool and gets a unified view. No application-side fan-out, no result merging.
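
To see why a shared scoring scale matters, here is a conceptual sketch of what merging looks like when every source's hits carry comparable scores. It is not StrataFS's ranking code, which this article does not describe; merge_by_score is a hypothetical function that assumes each per-source list is already sorted best-first.

```python
import heapq
from itertools import islice


def merge_by_score(per_source, limit=10):
    """Merge per-source hit lists (each sorted best-first) into one ranking.

    per_source maps a source name to [(path, score), ...].
    """
    streams = (
        [(score, source, path) for path, score in hits]
        for source, hits in per_source.items()
    )
    # heapq.merge lazily interleaves the already-sorted lists, highest score first.
    merged = heapq.merge(*streams, key=lambda hit: hit[0], reverse=True)
    return [(source, path, score) for score, source, path in islice(merged, limit)]
```

Fed the three hits from the example output, it returns ops-archive first, then engineering-docs, then team-drive. With incomparable scores from three separate search services, this step would need normalization or re-ranking on the application side.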

Handing it to an agent

Once the cloud sources are wired, the MCP server becomes useful instantly. Point Claude Desktop, ChatGPT, or your custom agent at http://localhost:8081/mcp and the agent’s first call will likely be stratafs.list_sources to discover what’s indexed. From there, every stratafs.search call has the full cross-cloud corpus available.

What this isn’t

To set expectations right:

It’s not a backup tool. StrataFS reads from your buckets; you still need your own backup story.
It’s not a DLP tool. Indexing helps you find what’s in your buckets; it doesn’t classify or quarantine.
It’s not a sync tool. Original files stay in the bucket. Only the index lives locally.

For everything within those bounds — semantic search across cloud storage, agent-accessible knowledge bases, hybrid retrieval — StrataFS replaces the entire “pipeline + central search + sync” architecture with a single binary and a config file.

Try it. Point it at one bucket, get a sense of what surfaces, scale up. The config is text. The state is one SQLite file per source. The complexity is, finally, not yours to maintain.