Documents

How bigRAG ingests, parses, chunks, embeds, and stores documents in Turbopuffer.

Documents are files uploaded to a collection. Once uploaded, they are automatically parsed, chunked, embedded, and written to the collection's Turbopuffer namespace for semantic, keyword, and hybrid search.

Supported Formats

bigRAG extracts embedded text from PDFs directly and uses Docling to parse the other rich document formats:

Format	Extensions	Notes
PDF	`.pdf`	Fast embedded-text extraction; scanned-PDF OCR enabled by default
Microsoft Word	`.docx`	Full layout support
Microsoft PowerPoint	`.pptx`	Slide content extraction
Microsoft Excel	`.xlsx`	Table data extraction
HTML	`.html`, `.htm`	Web page content
Markdown	`.md`	Native support
Plain Text	`.txt`	Direct ingestion
CSV / TSV	`.csv`, `.tsv`	Tabular data
XML	`.xml`	Structured data
JSON	`.json`	Structured data
Images	`.png`, `.jpg`, `.jpeg`, `.tiff`, `.bmp`, `.gif`	OCR text extraction
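A client can screen files against the table above before uploading. The following is a minimal sketch, not part of any bigRAG SDK: the helper name `isSupportedFile` is hypothetical, and case-insensitive extension matching is an assumption, since the server's own validation rules are not described here.

```typescript
// Extensions copied from the Supported Formats table.
const SUPPORTED_EXTENSIONS: ReadonlySet<string> = new Set([
  ".pdf", ".docx", ".pptx", ".xlsx", ".html", ".htm", ".md", ".txt",
  ".csv", ".tsv", ".xml", ".json",
  ".png", ".jpg", ".jpeg", ".tiff", ".bmp", ".gif",
]);

// Hypothetical client-side pre-check; the server remains the authority.
function isSupportedFile(filename: string): boolean {
  const dot = filename.lastIndexOf(".");
  // No extension at all, or a dotfile such as ".env".
  if (dot <= 0) return false;
  // Assumption: extensions are compared case-insensitively.
  return SUPPORTED_EXTENSIONS.has(filename.slice(dot).toLowerCase());
}
```

Filtering locally avoids spending an upload request on a file that would be rejected anyway, which matters most for large folder imports.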

Ingestion Methods

Method	Best for
Single upload	One interactive file
Batch upload	Small API batches of up to 100 files
Upload session	Large local folder/file imports up to the configured session limits
Connector sync	Object-storage files mirrored from an S3-compatible bucket prefix

Upload sessions accept one file per request under a durable session ID. The admin UI uses this path for local files and folders to keep browser memory bounded, retry individual failures, and restore progress after navigation.

Ingestion Pipeline

Every document goes through the same pipeline regardless of ingestion method:

Store — file saved to the local upload directory
Queue — document sent to the Dramatiq Redis broker with status pending
Parse — worker extracts text directly for text PDFs, or parses richer formats with Docling
Elements — multimodal collections store document elements: headings, tables, equations, images, page bounds, captions, and nearby context
Chunk — extracted text split into chunks based on the collection's chunk_size and chunk_overlap
Embed — each chunk embedded using the collection's configured model
Store — embeddings, chunk text, and metadata batch-inserted into Turbopuffer
Ready — document status updated to ready with the chunk count

Workers renew processing leases while a document is active. Transient failures are rescheduled as delayed Dramatiq messages; exhausted jobs are marked failed and retained in the dead-letter list.
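The retry behavior described above can be illustrated with a small simulation. This is only a sketch of the general pattern (retry transient failures with a delay, then mark the job failed and dead-letter it). The attempt limit, the exponential backoff, and every name below are assumptions for illustration; the real workers run on Dramatiq and their actual limits are not documented in this section.

```typescript
type Outcome =
  | { status: "ready"; attempts: number }
  | { status: "failed"; attempts: number; deadLettered: true; error: string };

// Hypothetical values: the real retry limit and delays are not stated here.
const MAX_ATTEMPTS = 3;
const BASE_DELAY_MS = 1000;

// Delay before the retry that follows a failed attempt (assumed exponential backoff).
function retryDelayMs(attempt: number): number {
  return BASE_DELAY_MS * Math.pow(2, attempt - 1);
}

// Runs the job until it succeeds or the attempts are exhausted.
function processWithRetries(job: (attempt: number) => void): Outcome {
  let lastError = "";
  for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
    try {
      job(attempt);
      return { status: "ready", attempts: attempt };
    } catch (err) {
      lastError = err instanceof Error ? err.message : String(err);
      // A real worker would re-enqueue a message delayed by retryDelayMs(attempt)
      // instead of looping in-process.
    }
  }
  // Exhausted: the document is marked failed and kept for inspection.
  return { status: "failed", attempts: MAX_ATTEMPTS, deadLettered: true, error: lastError };
}
```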

When multimodal_enrichment_enabled is set on the collection, a follow-up worker job generates summaries for image, table, and equation elements. This enrichment is asynchronous and does not block the document from becoming searchable.

Processing Status

Status	Description
`pending`	Queued, waiting for a worker
`processing`	Being parsed, chunked, and embedded
`ready`	Successfully processed, searchable
`failed`	Processing failed (see `error_message`)

Filter documents by status:

curl "http://localhost:4000/v1/collections/research/documents?status=failed" \
  -H "Authorization: Bearer $BIGRAG_API_KEY"
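For clients that build this request in code, a small URL helper keeps the status filter type-checked. This is a sketch: the path and the `status` query parameter come from the curl example above, while the function name `documentsUrl` is hypothetical and the shape of the list response is not covered here.

```typescript
type DocumentStatus = "pending" | "processing" | "ready" | "failed";

// Builds the document-list URL, optionally filtered by status.
function documentsUrl(baseUrl: string, collection: string, status?: DocumentStatus): string {
  const root = baseUrl.replace(/\/+$/, ""); // tolerate a trailing slash
  const path = `${root}/v1/collections/${encodeURIComponent(collection)}/documents`;
  return status ? `${path}?status=${encodeURIComponent(status)}` : path;
}
```

The resulting string can be passed to any HTTP client together with the same `Authorization: Bearer` header used in the curl example.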

Chunking Strategy

Chunking splits document text into overlapping segments for embedding and retrieval.

Setting	Default	Range	Description
`chunk_size`	512	64–10,000	Maximum characters per chunk
`chunk_overlap`	50	0–5,000	Overlap characters between adjacent chunks

Smaller chunks (256–512) — better for precise answers and factual retrieval
Larger chunks (1,000–2,000) — more context per result
Overlap — ensures important content at chunk boundaries is not lost
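To make the interaction between `chunk_size` and `chunk_overlap` concrete, here is a minimal sliding-window sketch. It is not bigRAG's chunker, which may also respect sentence or structural boundaries. The function name is hypothetical, the requirement that the overlap be smaller than the chunk size is an assumption needed for the window to advance, and JavaScript string indices count UTF-16 code units rather than characters in the strict sense.

```typescript
// Splits text into fixed-size windows where each window repeats the
// last `chunkOverlap` characters of the previous one.
function chunkText(text: string, chunkSize = 512, chunkOverlap = 50): string[] {
  if (chunkSize < 64 || chunkSize > 10000) {
    throw new RangeError("chunk_size must be between 64 and 10000");
  }
  if (chunkOverlap < 0 || chunkOverlap > 5000) {
    throw new RangeError("chunk_overlap must be between 0 and 5000");
  }
  // Assumption: overlap must be smaller than the chunk, or the window never moves.
  if (chunkOverlap >= chunkSize) {
    throw new RangeError("chunk_overlap must be smaller than chunk_size");
  }

  const step = chunkSize - chunkOverlap;
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    // Stop once the current window has reached the end of the text.
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}
```

With the defaults, each window starts 462 characters after the previous one, so a sentence that straddles a boundary appears whole in at least one chunk as long as it is shorter than the overlap.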

API-Client Status Polling

Poll the document record to monitor processing:

// Assumes `client` is an already-initialized bigRAG API client.
let doc = await client.documents.get("research", "DOC_ID");

// Poll every 2 seconds until the document reaches a terminal status.
while (doc.status === "pending" || doc.status === "processing") {
  await new Promise((resolve) => setTimeout(resolve, 2000));
  doc = await client.documents.get("research", doc.id);
  console.log(doc.progress?.message ?? doc.status);
}

The document response includes:

Field	Type	Description
`status`	string	Current status (`pending`, `processing`, `ready`, `failed`)
`chunk_count`	integer	Number of chunks currently stored for the document
`multimodal_element_count`	integer	Number of stored document elements for multimodal collections
`progress.step`	string	Current ingestion step (`queued`, `ocr`, `embedding`, `complete`, etc.)
`progress.message`	string	Latest human-readable progress update
`progress.progress`	float	Completed fraction from `0` to `1`
`progress.detail`	object	Step-specific counters such as page ranges or batch numbers
`error_message`	string	Failure detail when status is `failed`
`updated_at`	string	Timestamp for the latest document update
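A polling loop usually also needs a timeout and explicit failure handling. The sketch below wraps the fields from the table in a TypeScript type and polls through an injected fetch function, so it does not depend on any particular client. The names `DocumentRecord` and `waitForDocument`, the optionality of each field, and the default interval and timeout are assumptions, not part of a documented SDK.

```typescript
// Subset of the document response fields listed in the table.
interface DocumentRecord {
  id: string;
  status: "pending" | "processing" | "ready" | "failed";
  chunk_count?: number;
  progress?: { step?: string; message?: string; progress?: number } | null;
  error_message?: string | null;
}

// Polls until the document is ready; throws on failure or timeout.
async function waitForDocument(
  fetchDocument: () => Promise<DocumentRecord>,
  options: { intervalMs?: number; timeoutMs?: number } = {},
): Promise<DocumentRecord> {
  const intervalMs = options.intervalMs ?? 2000; // assumed default
  const timeoutMs = options.timeoutMs ?? 10 * 60 * 1000; // assumed default
  const deadline = Date.now() + timeoutMs;

  for (;;) {
    const doc = await fetchDocument();
    if (doc.status === "ready") return doc;
    if (doc.status === "failed") {
      throw new Error(doc.error_message ?? "document processing failed");
    }
    // Give up if the next poll would land after the deadline.
    if (Date.now() + intervalMs > deadline) {
      throw new Error(`timed out after ${timeoutMs} ms (last status: ${doc.status})`);
    }
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

With a client like the one in the polling snippet, the call would look like `waitForDocument(() => client.documents.get("research", "DOC_ID"))`. Injecting the fetch function also makes the loop easy to exercise in tests with canned responses.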

Document Elements

For collections created or updated with multimodal_enabled: true, list a document's stored elements:

curl "http://localhost:4000/v1/collections/research/documents/DOC_ID/elements" \
  -H "Authorization: Bearer $BIGRAG_API_KEY"

Each element includes kind, extracted text, optional summary, caption, page_no, bbox, character offsets, nearby context, enrichment status, and source metadata. asset_path remains null because bigRAG does not retain multimodal binary assets after ingestion staging is cleaned up. Existing text-only retrieval still works; element records add provenance for query and chat clients that want richer context.

Batch Operations

Upload, check status, fetch metadata for, or delete multiple documents in a single request:

# Batch upload (up to 100 files)
curl -X POST http://localhost:4000/v1/collections/docs/documents/batch/upload \
  -H "Authorization: Bearer $BIGRAG_API_KEY" \
  -F "files=@paper1.pdf" \
  -F "files=@paper2.pdf" \
  -F 'metadata={"source": "batch-import"}'

# Batch status check
curl -X POST http://localhost:4000/v1/collections/docs/documents/batch/status \
  -H "Authorization: Bearer $BIGRAG_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"document_ids": ["doc-id-1", "doc-id-2"]}'

# Batch get full document metadata
curl -X POST http://localhost:4000/v1/collections/docs/documents/batch/get \
  -H "Authorization: Bearer $BIGRAG_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"document_ids": ["doc-id-1", "doc-id-2"]}'

# Batch delete
curl -X POST http://localhost:4000/v1/collections/docs/documents/batch/delete \
  -H "Authorization: Bearer $BIGRAG_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"document_ids": ["doc-id-1", "doc-id-2"]}'

Batch upload, status, and get support up to 100 items per request. Batch delete accepts larger document ID lists and processes them in internal chunks. API clients can poll batch status after upload to read each document's latest progress snapshot until every document is ready or failed. Partial success is supported for batch delete — failed items are reported in the errors array.
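When a client tracks more than 100 documents, the ID list has to be split before calling batch status or batch get. A minimal sketch of that splitting follows. The 100-item limit and the `document_ids` body field come from the text and examples above; the helper names are hypothetical.

```typescript
// Limit stated above for batch upload, status, and get.
const MAX_BATCH_ITEMS = 100;

// Splits a list into consecutive groups of at most `size` items.
function toBatches<T>(items: readonly T[], size: number = MAX_BATCH_ITEMS): T[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// One JSON request body per batch, ready to POST to batch/status or batch/get.
function batchRequestBodies(documentIds: readonly string[]): string[] {
  return toBatches(documentIds).map((batch) => JSON.stringify({ document_ids: batch }));
}
```

Batch delete does not need this step, since it accepts larger ID lists and chunks them internally.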

Large Upload Sessions

Use upload sessions for local imports that are too large for one multipart request:

curl -X POST http://localhost:4000/v1/collections/docs/upload-sessions \
  -H "Authorization: Bearer $BIGRAG_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"total_files": 10000, "total_bytes": 2147483648, "metadata": {"source": "folder"}}'

curl -X POST http://localhost:4000/v1/collections/docs/upload-sessions/SESSION_ID/files \
  -H "Authorization: Bearer $BIGRAG_API_KEY" \
  -F "client_item_id=000001" \
  -F "file=@paper.pdf"

curl -X POST http://localhost:4000/v1/collections/docs/upload-sessions/SESSION_ID/complete \
  -H "Authorization: Bearer $BIGRAG_API_KEY"

GET /v1/collections/{collection}/upload-sessions/{session_id} returns aggregate progress and recent failures. The old batch upload endpoint remains useful for small API imports; upload sessions are the scalable path for thousands of browser-selected files.
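Before creating a session, a client needs the totals for the create request and a stable `client_item_id` for each file. The sketch below computes both from a local file list. The field names `total_files`, `total_bytes`, `metadata`, and `client_item_id` come from the curl examples above. Everything else is an assumption: the type and function names are hypothetical, and the zero-padded sequence simply mirrors the `000001` example, since the server's requirements for item IDs are not stated here.

```typescript
interface LocalFile {
  name: string;
  sizeBytes: number;
}

interface SessionPlan {
  createBody: { total_files: number; total_bytes: number; metadata: Record<string, string> };
  items: { client_item_id: string; file: LocalFile }[];
}

// Left-pads a number with zeros to the given width.
function zeroPad(n: number, width: number): string {
  let s = String(n);
  while (s.length < width) s = "0" + s;
  return s;
}

// Computes the session create body and one item ID per file.
function planUploadSession(
  files: readonly LocalFile[],
  metadata: Record<string, string> = {},
): SessionPlan {
  // At least six digits, matching the "000001" example; wider for huge imports.
  const width = Math.max(6, String(files.length).length);
  return {
    createBody: {
      total_files: files.length,
      total_bytes: files.reduce((sum, f) => sum + f.sizeBytes, 0),
      metadata,
    },
    items: files.map((file, i) => ({ client_item_id: zeroPad(i + 1, width), file })),
  };
}
```

Because each file is sent in its own request, deterministic item IDs let a client retry individual failures without re-sending files that already succeeded.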

Re-ingestion

bigRAG deletes staged originals once ingestion reaches a terminal state. To parse or embed the same source again, upload the file again or resync the connector source.
