Documents
How bigRAG ingests, parses, chunks, embeds, and stores documents in Turbopuffer.
Documents are files uploaded to a collection. Once uploaded, they are automatically parsed, chunked, embedded, and written to the collection's Turbopuffer namespace for semantic, keyword, and hybrid search.
Supported Formats
bigRAG extracts embedded text from PDFs directly and uses Docling to parse other rich document formats:
| Format | Extensions | Notes |
|---|---|---|
| PDF | .pdf | Fast embedded-text extraction; scanned-PDF OCR enabled by default |
| Microsoft Word | .docx | Full layout support |
| Microsoft PowerPoint | .pptx | Slide content extraction |
| Microsoft Excel | .xlsx | Table data extraction |
| HTML | .html, .htm | Web page content |
| Markdown | .md | Native support |
| Plain Text | .txt | Direct ingestion |
| CSV / TSV | .csv, .tsv | Tabular data |
| XML | .xml | Structured data |
| JSON | .json | Structured data |
| Images | .png, .jpg, .jpeg, .tiff, .bmp, .gif | OCR text extraction |
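A client can check file names against the table above before uploading, so unsupported files are rejected locally. The following is a minimal sketch: `isSupportedFile` and `SUPPORTED_EXTENSIONS` are hypothetical names, not part of bigRAG, and the case-insensitive comparison is an assumption about how the server treats extensions.

```typescript
// Extensions copied from the Supported Formats table.
const SUPPORTED_EXTENSIONS: ReadonlySet<string> = new Set([
  ".pdf", ".docx", ".pptx", ".xlsx", ".html", ".htm", ".md", ".txt",
  ".csv", ".tsv", ".xml", ".json",
  ".png", ".jpg", ".jpeg", ".tiff", ".bmp", ".gif",
]);

// Hypothetical helper: true when the file name ends in a listed extension.
// Case-insensitive matching is an assumption, not documented behavior.
function isSupportedFile(fileName: string): boolean {
  const dot = fileName.lastIndexOf(".");
  if (dot <= 0) return false; // no extension, or a dotfile such as ".env"
  return SUPPORTED_EXTENSIONS.has(fileName.slice(dot).toLowerCase());
}
```

For example, `isSupportedFile("Report.PDF")` passes the check, while `isSupportedFile("archive.zip")` does not.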
Ingestion Methods
| Method | Best for |
|---|---|
| Single upload | One interactive file |
| Batch upload | Small API batches of up to 100 files |
| Upload session | Large local folder/file imports up to the configured session limits |
| Connector sync | Object-storage files mirrored from an S3-compatible bucket prefix |
Upload sessions accept one file per request under a durable session ID. The admin UI uses this path for local files and folders to keep browser memory bounded, retry individual failures, and restore progress after navigation.
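The choice between these methods can be expressed as a small decision function. This is only a sketch of the guidance in the table: `chooseIngestionMethod` and its return labels are hypothetical, and it looks at file count alone, ignoring the byte-size limits a real deployment configures.

```typescript
type IngestionMethod = "single" | "batch" | "upload-session" | "connector-sync";

// Batch upload accepts up to 100 files per request (from the table above).
const BATCH_UPLOAD_LIMIT = 100;

// Hypothetical helper: picks an ingestion method from the file count.
// Size limits are deliberately ignored; this is an illustration only.
function chooseIngestionMethod(fileCount: number, fromBucket = false): IngestionMethod {
  if (fromBucket) return "connector-sync"; // files mirrored from an S3-compatible prefix
  if (fileCount <= 1) return "single";
  if (fileCount <= BATCH_UPLOAD_LIMIT) return "batch";
  return "upload-session";
}
```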
Ingestion Pipeline
Every document goes through the same pipeline regardless of ingestion method:
- Store: file saved to the local upload directory
- Queue: document sent to the Dramatiq Redis broker with status pending
- Parse: worker extracts text directly for text PDFs, or parses richer formats with Docling
- Elements: multimodal collections store document elements (headings, tables, equations, images, page bounds, captions, and nearby context)
- Chunk: extracted text split into chunks based on the collection's chunk_size and chunk_overlap
- Embed: each chunk embedded using the collection's configured model
- Store: embeddings, chunk text, and metadata batch-inserted into Turbopuffer
- Ready: document status updated to ready with the chunk count
Workers renew processing leases while a document is active. Transient failures are rescheduled as delayed Dramatiq messages; exhausted jobs are marked failed and retained in the dead-letter list.
When multimodal_enrichment_enabled is set on the collection, a follow-up worker job generates summaries for image, table, and equation elements. This enrichment is asynchronous and does not block the document from becoming searchable.
Processing Status
| Status | Description |
|---|---|
| pending | Queued, waiting for a worker |
| processing | Being parsed, chunked, and embedded |
| ready | Successfully processed, searchable |
| failed | Processing failed (see error_message) |
Filter documents by status:
curl "http://localhost:4000/v1/collections/research/documents?status=failed" \
  -H "Authorization: Bearer $BIGRAG_API_KEY"
Chunking Strategy
Chunking splits document text into overlapping segments for embedding and retrieval.
| Setting | Default | Range | Description |
|---|---|---|---|
| chunk_size | 512 | 64–10,000 | Maximum characters per chunk |
| chunk_overlap | 50 | 0–5,000 | Overlap characters between adjacent chunks |
- Smaller chunks (256–512) — better for precise answers and factual retrieval
- Larger chunks (1,000–2,000) — more context per result
- Overlap — ensures important content at chunk boundaries is not lost
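To make the interaction between chunk_size and chunk_overlap concrete, here is a minimal sliding-window sketch. It is not bigRAG's actual splitter, which may respect sentence or layout boundaries; `chunkText` is a hypothetical function, and the requirement that the overlap be smaller than the chunk size is an assumption needed for the window to advance.

```typescript
// Illustrative character-window splitter using the documented defaults and ranges.
function chunkText(text: string, chunkSize = 512, chunkOverlap = 50): string[] {
  if (chunkSize < 64 || chunkSize > 10000) {
    throw new RangeError("chunk_size must be between 64 and 10,000");
  }
  if (chunkOverlap < 0 || chunkOverlap > 5000) {
    throw new RangeError("chunk_overlap must be between 0 and 5,000");
  }
  // Assumption: the overlap must be smaller than the chunk size.
  if (chunkOverlap >= chunkSize) {
    throw new RangeError("chunk_overlap must be smaller than chunk_size");
  }
  const step = chunkSize - chunkOverlap; // distance between chunk start offsets
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // this window reached the end
  }
  return chunks;
}
```

With the defaults, each chunk starts 462 characters after the previous one, so the last 50 characters of one chunk reappear as the first 50 characters of the next.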
API-Client Status Polling
Poll the document record to monitor processing:
// Assumes client is an already-initialized bigRAG API client.
let document = await client.documents.get("research", "DOC_ID");
// Poll every 2 seconds until the document is ready or failed.
while (document.status === "pending" || document.status === "processing") {
  await new Promise((resolve) => setTimeout(resolve, 2000));
  document = await client.documents.get("research", document.id);
  console.log(document.progress?.message ?? document.status);
}
The document response includes:
| Field | Type | Description |
|---|---|---|
| status | string | Current status (pending, processing, ready, failed) |
| chunk_count | integer | Number of chunks currently stored for the document |
| multimodal_element_count | integer | Number of stored document elements for multimodal collections |
| progress.step | string | Current ingestion step (queued, ocr, embedding, complete, etc.) |
| progress.message | string | Latest human-readable progress update |
| progress.progress | float | Completed fraction from 0 to 1 |
| progress.detail | object | Step-specific counters such as page ranges or batch numbers |
| error_message | string | Failure detail when status is failed |
| updated_at | string | Timestamp for the latest document update |
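The fields above map naturally onto a typed record. The sketch below is an assumption-laden illustration: the field names come from the table, but which fields are optional or nullable is a guess, and `DocumentRecord`, `isTerminal`, and `describeDocument` are hypothetical names rather than part of any bigRAG client.

```typescript
type DocumentStatus = "pending" | "processing" | "ready" | "failed";

interface DocumentProgress {
  step: string;
  message: string;
  progress: number; // completed fraction from 0 to 1
  detail?: Record<string, unknown>;
}

// Optionality and nullability here are assumptions, not documented guarantees.
interface DocumentRecord {
  status: DocumentStatus;
  chunk_count: number;
  multimodal_element_count: number;
  progress?: DocumentProgress;
  error_message?: string | null;
  updated_at: string;
}

// A document stops changing once it is ready or failed.
function isTerminal(status: DocumentStatus): boolean {
  return status === "ready" || status === "failed";
}

// Builds a one-line summary suitable for a log or progress display.
function describeDocument(doc: DocumentRecord): string {
  if (doc.status === "failed") return `failed: ${doc.error_message ?? "unknown error"}`;
  if (doc.status === "ready") return `ready (${doc.chunk_count} chunks)`;
  const percent = Math.round((doc.progress?.progress ?? 0) * 100);
  const label = doc.progress?.message ?? doc.progress?.step ?? "waiting";
  return `${doc.status}: ${label} (${percent}%)`;
}
```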
Document Elements
For collections created or updated with multimodal_enabled: true, list a document's stored elements:
curl "http://localhost:4000/v1/collections/research/documents/DOC_ID/elements" \
  -H "Authorization: Bearer $BIGRAG_API_KEY"
Each element includes kind, extracted text, optional summary, caption, page_no, bbox, character offsets, nearby context, enrichment status, and source metadata. asset_path remains null because bigRAG does not retain multimodal binary assets after ingestion staging is cleaned up. Existing text-only retrieval still works; element records add provenance for query and chat clients that want richer context.
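Clients that consume element records often start by summarizing what a document contains. The sketch below counts elements per kind. It models only fields named above (kind, page_no, caption, summary, asset_path); the optionality of each field and the concrete kind strings used in the usage note are assumptions, and `ElementRecord` and `countByKind` are hypothetical names.

```typescript
// Partial view of an element record; field optionality is an assumption.
interface ElementRecord {
  kind: string;
  page_no: number | null;
  caption?: string | null;
  summary?: string | null;
  asset_path: null; // binary assets are not retained after ingestion
}

// Counts how many elements of each kind a document has.
function countByKind(elements: readonly ElementRecord[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const element of elements) {
    counts.set(element.kind, (counts.get(element.kind) ?? 0) + 1);
  }
  return counts;
}
```

Given two records with kind "table" and one with kind "image", `countByKind(...).get("table")` is 2.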
Batch Operations
Upload, check status, or delete multiple documents in a single request:
# Batch upload (up to 100 files)
curl -X POST http://localhost:4000/v1/collections/docs/documents/batch/upload \
  -H "Authorization: Bearer $BIGRAG_API_KEY" \
  -F "files=@paper1.pdf" \
  -F "files=@paper2.pdf" \
  -F 'metadata={"source": "batch-import"}'
# Batch status check
curl -X POST http://localhost:4000/v1/collections/docs/documents/batch/status \
  -H "Authorization: Bearer $BIGRAG_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"document_ids": ["doc-id-1", "doc-id-2"]}'
# Batch get full document metadata
curl -X POST http://localhost:4000/v1/collections/docs/documents/batch/get \
  -H "Authorization: Bearer $BIGRAG_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"document_ids": ["doc-id-1", "doc-id-2"]}'
# Batch delete
curl -X POST http://localhost:4000/v1/collections/docs/documents/batch/delete \
  -H "Authorization: Bearer $BIGRAG_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"document_ids": ["doc-id-1", "doc-id-2"]}'
Batch upload, status, and get support up to 100 items per request. Batch delete accepts larger document ID lists and processes them in internal chunks. API clients can poll batch status after upload to read each document's latest progress snapshot until every document is ready or failed. Batch delete supports partial success: failed items are reported in the errors array.
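Because batch status and batch get accept at most 100 IDs per request, a client tracking more documents has to split its ID list first. A minimal sketch follows; `toBatches` and `allTerminal` are hypothetical helpers, not part of a bigRAG client library.

```typescript
// Per-request limit for batch upload, status, and get.
const BATCH_LIMIT = 100;

// Splits a list into consecutive groups of at most `size` items.
function toBatches<T>(items: readonly T[], size: number = BATCH_LIMIT): T[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// True once every polled status is ready or failed, so polling can stop.
function allTerminal(statuses: readonly string[]): boolean {
  return statuses.every((s) => s === "ready" || s === "failed");
}
```

Each group returned by `toBatches` can be sent as the document_ids array of one batch status request, and polling can stop once `allTerminal` holds for the combined results.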
Large Upload Sessions
Use upload sessions for local imports that are too large for one multipart request:
# Create an upload session
curl -X POST http://localhost:4000/v1/collections/docs/upload-sessions \
  -H "Authorization: Bearer $BIGRAG_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"total_files": 10000, "total_bytes": 2147483648, "metadata": {"source": "folder"}}'
# Upload one file per request under the session ID
curl -X POST http://localhost:4000/v1/collections/docs/upload-sessions/SESSION_ID/files \
  -H "Authorization: Bearer $BIGRAG_API_KEY" \
  -F "client_item_id=000001" \
  -F "file=@paper.pdf"
# Complete the session
curl -X POST http://localhost:4000/v1/collections/docs/upload-sessions/SESSION_ID/complete \
  -H "Authorization: Bearer $BIGRAG_API_KEY"
GET /v1/collections/{collection}/upload-sessions/{session_id} returns aggregate progress and recent failures. The old batch upload endpoint remains useful for small API imports; upload sessions are the scalable path for thousands of browser-selected files.
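The session-creation body needs file and byte totals up front, and each file needs its own client_item_id. The sketch below derives both from a local file listing. `buildSessionRequest`, `clientItemId`, and `LocalFileInfo` are hypothetical names; the zero-padded ID format simply mirrors the 000001 value in the example and is an assumption, since the ID may well be any client-chosen string.

```typescript
interface LocalFileInfo {
  name: string;
  sizeBytes: number;
}

// Builds the JSON body for creating an upload session from a file listing.
function buildSessionRequest(
  files: readonly LocalFileInfo[],
  metadata: Record<string, string> = {},
) {
  return {
    total_files: files.length,
    total_bytes: files.reduce((sum, file) => sum + file.sizeBytes, 0),
    metadata,
  };
}

// Zero-padded, 1-based item ID such as "000001" (format is an assumption).
function clientItemId(index: number, width = 6): string {
  return String(index + 1).padStart(width, "0");
}
```

Stable per-file IDs like these let a client retry an individual failed upload without resending files that already succeeded.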
Re-ingestion
bigRAG deletes staged originals once ingestion reaches a terminal state. To parse or embed the same source again, upload the file again or resync the connector source.