Architecture deep-dive

Why Turbopuffer + Postgres + Redis + Docling, how requests flow end-to-end, and where the ingestion pipeline actually does its work.

bigRAG uses four distinct stores, each with a specific role, wired together through a FastAPI backend and Dramatiq worker queue.

Four-store design

Store	Role
Turbopuffer	Dense embeddings, chunk text, payload metadata, BM25 full-text search, filtered semantic search, collection namespaces
Postgres	Metadata and control plane: collections, documents, users, API keys, webhooks, query logs, audit trail, embedding cache. Transactions keep ingestion consistent across stores.
Redis	Dramatiq broker for ingestion, connector, webhook, and maintenance actors; processing leases, cancellation epochs, idempotency-key cache, auth/collection/query hot-path caches, batched document-progress lookups, short-lived platform/health caches
Local disk	Temporary ingestion staging shared by the API and worker. Staged originals are deleted after successful processing, final failure, cancellation, or document deletion.

Postgres and Redis run locally via docker compose up. Turbopuffer is the managed vector and full-text search backend.

The backend talks to Turbopuffer through the official Python client for namespace writes, deletes, health listing, vector queries, and BM25 queries.

The admin UI uses TanStack Router's built-in route code splitting — each route transition pays for one route chunk, not a route chunk plus a nested page chunk. Postgres uses composite newest-first indexes for collection document lists and common access-log filters to avoid full sorts as data grows.

Identifier model

bigRAG-owned records use native Postgres uuid columns generated as UUIDv7 values before insert. This covers users, sessions, API keys, collections, documents, upload sessions, connectors, webhooks, query logs, embedding presets, audit logs, and access logs — one canonical identifier shape, time-ordered for healthier newest-first indexes.

Existing UUIDv4 rows remain valid because UUIDv4 and UUIDv7 share the same Postgres uuid type. Foreign keys store raw UUID values internally; the REST API and SDKs expose them as strings.

Not every field ending in id is a bigRAG record ID:

External provider identifiers (S3 object keys, collection names, settings keys, session tokens, API keys, webhook secrets) remain opaque strings.
Vector-store point IDs are deterministic UUIDv5 values derived from the caller-facing vector or chunk ID, so Turbopuffer writes, deletes, and exports use stable point IDs without changing API-visible IDs.

Request flow — `POST /v1/collections/{name}/documents`

Client
  │ multipart POST (file, metadata)
  ▼
FastAPI
  │ 1. auth — session cookie or API key; enforces scoped-key permissions
  │ 2. idempotency — replay cached response if Idempotency-Key hits
  │    for small non-multipart mutating requests
  │ 3. content-hash dedup — sha256 of bytes; if a dupe exists, return it
  │ 4. validate — extension, magic bytes, zip-bomb guard, metadata schema
  │ 5. INSERT documents row → commit
  ▼
Dramatiq Redis broker
  ▼
Dramatiq ingestion worker
  │ a. Docling parse — OCR + layout analysis, export to markdown
  │ b. chunk_document — paragraph or recursive strategy; tracks char offsets
  │ c. embedding_cache.get_many — skip chunks we've embedded before
  │ d. embed the misses — batched, with tiktoken truncation
  │ e. vector_store.insert — backend upsert with dynamic metadata
  │ f. UPDATE documents status = 'ready' and publish progress snapshots
  ▼
Webhook outbox actor
  │ data-operation fan-out through pending delivery rows

Failure at any step marks the document failed with a typed error message. Active workers renew Redis leases while long jobs run. Transient failures are rescheduled as delayed Dramatiq messages; exhausted jobs are retained in the dead-letter list for operator visibility.

Request flow — `POST /v1/collections/{name}/query`

Client → FastAPI
  │ 1. auth → scope check (query:read)
  │ 2. embed query when search mode needs vectors
  │ 3. Turbopuffer semantic search (with metadata filters)
  │ 4. optional hybrid fusion — Turbopuffer BM25 + reciprocal rank fusion
  │ 5. optional rerank — Cohere Rerank v3.5
  │ 6. attach timings (embed/search/rerank/cache/total)
  ▼
Response JSON with results and timings

Keyword mode: skips embeddings; runs BM25 over Turbopuffer full-text chunk data.
Semantic mode: uses the collection's configured embedding model.
Hybrid mode: fuses Turbopuffer ANN results with BM25 results via reciprocal rank fusion.
Cache hits: return the cached chunks with the current cache lookup latency, not the original uncached retrieval timings.

Why Docling

Alternatives considered:

Alternative	Why not
Unstructured.io	Requires network calls and a hosted inference server by default; bigRAG needs to run offline
PyMuPDF + LangChain	OK for PDFs only; loses layout structure
LlamaParse	Hosted only

Docling runs entirely locally, handles DOCX / PPTX / HTML / Markdown / images, and exposes layout provenance for citation metadata (page_no, bbox). PDFs with embedded text use a faster direct extractor; local OCR for scanned PDFs is enabled by default.

Why Turbopuffer

Capability	Why it matters
Managed vector engine	Keeps ANN search and payload filtering outside the transactional database; Postgres stays the metadata and control plane
Full-text plus vectors	Stores chunk text with full-text search enabled, so keyword search runs as BM25 and hybrid search fuses BM25 with semantic results
Namespace isolation	Each collection maps to one namespace; deletion, truncation, export, and raw vector writes are scoped to one backend object

For single-tenant deployments under ~1M vectors, pgvector would work well. bigRAG keeps Postgres focused on durable metadata and uses Turbopuffer for retrieval-specific storage and query execution.

Scaling notes

Horizontal API workers: FastAPI is stateless past the lifespan. Run N replicas behind a load balancer; they share Postgres, Redis, and the configured vector-store clients. API replicas enqueue work but do not run background jobs in-process.
Dramatiq workers: run one or more bigrag-worker processes for ingestion, connector syncs, webhook outbox delivery, and cleanup. Worker heartbeat and queue state are visible through /v1/stats. Each worker initializes runtime dependencies before reserving jobs, aborts boot on failure, refreshes queue-scoped heartbeats while idle, and recovers stale processing leases on startup. Periodic maintenance and webhook outbox ticks are seeded after boot through Redis scheduler guards, so restarts and replicas do not create duplicate delayed messages.
Vector backend: Turbopuffer namespaces are created on first write and store bigRAG vector IDs as payload attributes while using backend-safe IDs internally for writes, deletes, and exports.
Postgres replication: the control plane is read-light; a warm-standby suffices for failover.
Redis persistence: enable appendonly yes (already set in the shipped compose file). Pending Dramatiq messages, delayed retries, processing leases, and dead-letter entries survive process restarts within the Redis durability window.

Connector service layout

Connector routes stay provider-neutral through connector_registry.py. Reusable connector behavior lives in bigrag.services.connectors:

Account/config management
Source and sync-job persistence
Progress payloads and manifest updates
Sync status handling
Document handoff to the ingestion actor

The S3 sync runner keeps job lookup, lifecycle state transitions, page scanning, stale remote deletion, and finalization in separate modules. Queue saturation completes the current sync as deferred work instead of mixing that state transition into the scanning loop.

Worker service boundaries

The ingestion queue keeps Redis lease/recovery state separate from per-document processing so retry and dead-letter behavior can evolve without changing route-level APIs. Per-document processing depends on an explicit queue runtime interface for Redis stats, progress publishing, retry admission, conversion, and chunk embedding; the processing path no longer reaches through private queue internals.

Document conversion is stage-based:

Storage staging downloads the original into a temporary worker-local file.
Plain-text files are decoded directly.
Structured files go through the isolated Docling converter.
PDFs first try embedded text extraction, then fall back to chunked OCR when enabled.
Progress events are emitted by the conversion boundary rather than by route handlers.

Upload-session file requests follow the same route-thin pattern. The route validates auth, collection scope, and embedding readiness, then delegates item reservation, staged-file validation, persistence, deduplication, progress hydration, and failure response shaping to bigrag.services.upload_session_files.

Retrieval is split by concern: cache lookup and timing attribution, search-mode dispatch, optional reranking, outcome logging, and multi-collection fanout live in separate retrieval modules behind the public retrieve and retrieve_multi exports.

Scheduled connector ticks run as Dramatiq maintenance actors and claim due sources with row locks so multiple worker replicas do not create duplicate scheduled sync jobs. The S3 provider keeps credential validation, object listing, source helpers, and sync adapter wiring in separate modules.

Architecture deep-dive

On this page