Embeddings
Supported embedding providers and models for Turbopuffer-backed retrieval.
bigRAG supports four provider families. Each collection picks one at creation time and stores its own copy of the config, so different collections on the same instance can use different models.
Providers
| Provider | Description |
|---|---|
| openai | The OpenAI API |
| cohere | The Cohere Embed API |
| voyage | The Voyage AI Embed API (general-purpose, code, finance, and legal models) |
| openai_compatible | Any HTTP endpoint that implements the OpenAI /embeddings shape: Ollama, vLLM, TEI / HuggingFace Text Embedding Inference, Infinity, LiteLLM, Azure OpenAI, Bedrock via LiteLLM, self-hosted models, and more |
Managed models
| Provider | Model | Dimensions | Notes |
|---|---|---|---|
| OpenAI | text-embedding-3-small | 1536 | Default |
| OpenAI | text-embedding-3-large | 3072 | Best OpenAI quality |
| OpenAI | text-embedding-ada-002 | 1536 | Legacy |
| Cohere | embed-english-v3.0 | 1024 | English-optimized |
| Cohere | embed-multilingual-v3.0 | 1024 | 100+ languages |
| Cohere | embed-english-light-v3.0 | 384 | Cheap, lightweight |
| Cohere | embed-multilingual-light-v3.0 | 384 | Cheap, multilingual |
| Voyage | voyage-3-large | 1024 | Voyage flagship general-purpose |
| Voyage | voyage-3.5 | 1024 | Voyage default general-purpose |
| Voyage | voyage-3.5-lite | 1024 | Cheap, general-purpose |
| Voyage | voyage-code-3 | 1024 | Code-tuned |
| Voyage | voyage-finance-2 | 1024 | Finance-domain |
| Voyage | voyage-law-2 | 1024 | Legal-domain |
Query the models endpoint on a running instance:

```shell
curl http://localhost:4000/v1/embeddings/models \
  -H "Authorization: Bearer $BIGRAG_API_KEY"
```

Configuring a collection

OpenAI:

```shell
curl -X POST http://localhost:4000/v1/collections \
  -H "Authorization: Bearer $BIGRAG_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "openai_collection",
    "embedding_provider": "openai",
    "embedding_model": "text-embedding-3-small",
    "embedding_api_key": "sk-...",
    "dimension": 1536
  }'
```

Cohere:

```shell
curl -X POST http://localhost:4000/v1/collections \
  -H "Authorization: Bearer $BIGRAG_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "cohere_collection",
    "embedding_provider": "cohere",
    "embedding_model": "embed-english-v3.0",
    "embedding_api_key": "co-...",
    "dimension": 1024
  }'
```

Voyage:

```shell
curl -X POST http://localhost:4000/v1/collections \
  -H "Authorization: Bearer $BIGRAG_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "voyage_collection",
    "embedding_provider": "voyage",
    "embedding_model": "voyage-3.5",
    "embedding_api_key": "pa-...",
    "dimension": 1024
  }'
```

voyage-3-large, voyage-3.5, voyage-3.5-lite, and voyage-code-3 accept Matryoshka dimensions (256, 512, 1024, 2048); set dimension to one of those values. voyage-finance-2 and voyage-law-2 are fixed at 1024.

Ollama (openai_compatible):

```shell
curl -X POST http://localhost:4000/v1/collections \
  -H "Authorization: Bearer $BIGRAG_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "local_collection",
    "embedding_provider": "openai_compatible",
    "embedding_model": "nomic-embed-text",
    "embedding_base_url": "http://ollama.internal:11434/v1",
    "embedding_api_key": "ollama",
    "dimension": 768
  }'
```

vLLM (openai_compatible):

```shell
curl -X POST http://localhost:4000/v1/collections \
  -H "Authorization: Bearer $BIGRAG_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "vllm_collection",
    "embedding_provider": "openai_compatible",
    "embedding_model": "BAAI/bge-large-en-v1.5",
    "embedding_base_url": "http://vllm.internal:8000/v1",
    "embedding_api_key": "dummy",
    "dimension": 1024
  }'
```

embedding_api_key is always required. For self-hosted gateways that don't authenticate, pass any non-empty string.
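The dimension and API-key rules above can be checked on the client before a collection is created. The following is a minimal sketch, not part of bigRAG: the helper names `check_voyage_dimension` and `build_collection_payload` are hypothetical, and the only facts it encodes are the field names from the curl examples, the Voyage dimension rules, and the non-empty key requirement stated in this section.

```python
from typing import Optional

# Values copied from the Voyage note in this section.
MATRYOSHKA_MODELS = {"voyage-3-large", "voyage-3.5", "voyage-3.5-lite", "voyage-code-3"}
MATRYOSHKA_DIMS = {256, 512, 1024, 2048}
FIXED_1024_MODELS = {"voyage-finance-2", "voyage-law-2"}


def check_voyage_dimension(model: str, dimension: int) -> None:
    """Raise ValueError if `dimension` is not documented for this Voyage model.

    Models not mentioned in this section pass through unchecked.
    """
    if model in MATRYOSHKA_MODELS and dimension not in MATRYOSHKA_DIMS:
        raise ValueError(f"{model} accepts {sorted(MATRYOSHKA_DIMS)}, got {dimension}")
    if model in FIXED_1024_MODELS and dimension != 1024:
        raise ValueError(f"{model} is fixed at 1024 dimensions, got {dimension}")


def build_collection_payload(
    name: str,
    provider: str,
    model: str,
    api_key: str,
    dimension: int,
    base_url: Optional[str] = None,
) -> dict:
    """Build the JSON body used by the curl examples (hypothetical helper)."""
    if not api_key:
        # The docs require a non-empty key even for unauthenticated gateways.
        raise ValueError("embedding_api_key must be a non-empty string")
    if provider == "voyage":
        check_voyage_dimension(model, dimension)
    payload = {
        "name": name,
        "embedding_provider": provider,
        "embedding_model": model,
        "embedding_api_key": api_key,
        "dimension": dimension,
    }
    if base_url is not None:
        payload["embedding_base_url"] = base_url
    return payload
```

Passing the returned dict through `json.dumps` yields a body equivalent to the `-d` argument in the examples above.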
OpenAI-compatible endpoints in practice
Any gateway that speaks the OpenAI /embeddings shape works:
| Tool | Typical embedding_base_url |
|---|---|
| Ollama | http://ollama:11434/v1 |
| vLLM | http://vllm:8000/v1 |
| TEI (HuggingFace) | http://tei:80/v1 |
| Infinity | http://infinity:7997/v1 |
| LiteLLM proxy | http://litellm:4000/v1 |
| Azure OpenAI (via LiteLLM) | http://litellm:4000/v1 |
| Bedrock (via LiteLLM) | http://litellm:4000/v1 |
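For reference, the OpenAI /embeddings shape is a POST to `<base_url>/embeddings` with a JSON body containing `model` and `input`, answered by a `data` array of `{index, embedding}` objects. The sketch below builds such a request and parses such a response using only the standard library. The function names are hypothetical, and this is not bigRAG's client code; it may be useful for checking that a gateway speaks the expected shape before pointing a collection at it.

```python
import json
import urllib.request


def build_embeddings_request(base_url: str, api_key: str, model: str, texts: list) -> urllib.request.Request:
    """Build (but do not send) a POST request in the OpenAI /embeddings shape."""
    body = json.dumps({"model": model, "input": texts}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/embeddings",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def parse_embeddings_response(payload: dict) -> list:
    """Return vectors in input order, using each item's `index` field."""
    items = sorted(payload["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]
```

To send the request, pass it to `urllib.request.urlopen` and decode the response body with `json.load`.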
How embedding works
- Text extraction (via Docling) splits the document into pages and paragraphs.
- The chunker produces chunks according to the collection's chunk_size, chunk_overlap, and chunk_strategy.
- Each chunk's SHA-256 content hash is looked up in the persistent embedding_cache. Hits are reused; misses are sent to the provider.
- New vectors are batched at BIGRAG_INGESTION_BATCH_SIZE and written to the collection's Turbopuffer namespace.
- Queries embed the same way, with short-lived Redis caching for repeated query embeddings.
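The chunking and batching steps above can be illustrated with a simple sliding window. This is a sketch under stated assumptions, not bigRAG's chunker: it measures chunk_size and chunk_overlap in characters (the real unit may differ), ignores chunk_strategy entirely, and the function names are hypothetical.

```python
from typing import Iterator


def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list:
    """Split text into windows of `chunk_size` characters that overlap by `chunk_overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


def batched(items: list, batch_size: int) -> Iterator[list]:
    """Yield consecutive slices of at most `batch_size` items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]
```

In this sketch, `batch_size` plays the role that BIGRAG_INGESTION_BATCH_SIZE plays in the pipeline: it bounds how many new vectors are written at once.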
The cache key is (content_hash, provider, model, dimension). A new collection using a different model starts cold for that model while previous cache entries remain available to collections still using the old model.
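A short sketch of how such a key behaves: the same chunk text produces the same content hash, but the full key differs as soon as the provider, model, or dimension differs. The `CacheKey` type, the helper names, and the in-memory dict standing in for the persistent embedding_cache are all illustrative assumptions.

```python
import hashlib
from typing import NamedTuple


class CacheKey(NamedTuple):
    content_hash: str
    provider: str
    model: str
    dimension: int


def cache_key(chunk: str, provider: str, model: str, dimension: int) -> CacheKey:
    """Build the (content_hash, provider, model, dimension) key for one chunk."""
    content_hash = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
    return CacheKey(content_hash, provider, model, dimension)


def split_hits_and_misses(chunks: list, cache: dict, provider: str, model: str, dimension: int):
    """Return (hits, misses): cached vectors by chunk, and chunks that still need embedding."""
    hits = {}
    misses = []
    for chunk in chunks:
        key = cache_key(chunk, provider, model, dimension)
        if key in cache:
            hits[chunk] = cache[key]
        else:
            misses.append(chunk)
    return hits, misses
```

Only the chunks in `misses` would be sent to the provider, which is why re-ingesting unchanged documents under the same model is cheap.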
Persistent embedding-cache rows are encrypted by default with BIGRAG_MASTER_KEY. Set embedding_cache_mode to disabled in /settings if you would rather pay the provider to re-embed than store reusable vectors. The default retention window is 30 days after last use; admins can purge the cache from the Security settings tab.
Redis query caches are also encrypted when BIGRAG_MASTER_KEY is configured. The default query embedding TTL is 300 seconds; set query_embedding_cache_ttl to 0 to disable it.
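The TTL behaviour can be sketched with an in-memory stand-in for Redis. This is an illustration only: the class name is hypothetical, it omits encryption, and it uses a plain dict where bigRAG uses Redis. It shows the two properties described above: entries expire after the TTL, and a TTL of 0 disables caching.

```python
import time
from typing import Callable, Optional


class QueryEmbeddingCache:
    """In-memory TTL cache for query embeddings (stand-in for the Redis cache)."""

    def __init__(self, ttl_seconds: float = 300, clock: Callable[[], float] = time.monotonic):
        self.ttl_seconds = ttl_seconds
        self._clock = clock
        self._entries = {}  # query text -> (expires_at, vector)

    def get(self, query: str) -> Optional[list]:
        if self.ttl_seconds <= 0:
            return None  # caching disabled
        entry = self._entries.get(query)
        if entry is None:
            return None
        expires_at, vector = entry
        if self._clock() >= expires_at:
            del self._entries[query]  # drop the expired entry
            return None
        return vector

    def put(self, query: str, vector: list) -> None:
        if self.ttl_seconds <= 0:
            return  # caching disabled
        self._entries[query] = (self._clock() + self.ttl_seconds, vector)
```

The injectable `clock` exists only so the expiry logic can be exercised without sleeping.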
Concurrency & throughput
Embedding requests are guarded by a Redis-backed adaptive limiter with embedding_concurrency as its ceiling (BIGRAG_EMBEDDING_CONCURRENCY, default 8):
- Raise it for high-QPS providers.
- Lower it if your embedding provider throttles requests.
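The effect of a concurrency ceiling can be sketched with `asyncio.Semaphore`. This is a deliberately simpler technique than the Redis-backed adaptive limiter described above: it is local to one process and does not adapt. The function `embed_all` and its `embed_batch` callback are hypothetical names.

```python
import asyncio


async def embed_all(batches, embed_batch, concurrency: int = 8):
    """Run `embed_batch` over all batches with at most `concurrency` in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def guarded(batch):
        # Each call waits here until one of the `concurrency` slots is free.
        async with semaphore:
            return await embed_batch(batch)

    # gather preserves input order in its result list.
    return await asyncio.gather(*(guarded(batch) for batch in batches))
```

Raising `concurrency` lets more batches run in parallel; lowering it trades throughput for fewer rate-limit responses.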
When a provider returns a rate limit with Retry-After, retry-after-ms, or a message such as "Please try again in 37ms", ingestion records a short Redis cooldown for that provider/model, waits for the hint, and retries the same batch without consuming the generic transient retry budget. Repeated rate limits are capped per batch, so sustained TPM pressure should be handled by lowering concurrency or batch size.
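One way to extract a wait time from the three hint forms named above is sketched below. The function name, the precedence order (retry-after-ms, then Retry-After, then the message), and the regular expression are assumptions rather than bigRAG's actual parsing. The HTTP-date form of Retry-After is not handled.

```python
import re
from typing import Mapping, Optional

# Matches hints such as "Please try again in 37ms" or "try again in 1.5s".
_MESSAGE_HINT = re.compile(r"try again in\s+(\d+(?:\.\d+)?)\s*(ms|s)\b", re.IGNORECASE)


def retry_delay_seconds(headers: Mapping[str, str], message: str = "") -> Optional[float]:
    """Return the suggested wait in seconds, or None if no hint is found."""
    lowered = {name.lower(): value for name, value in headers.items()}
    if "retry-after-ms" in lowered:
        try:
            return float(lowered["retry-after-ms"]) / 1000.0
        except ValueError:
            pass  # malformed value, fall through to the next hint
    if "retry-after" in lowered:
        try:
            return float(lowered["retry-after"])
        except ValueError:
            pass  # probably an HTTP-date, which this sketch ignores
    match = _MESSAGE_HINT.search(message)
    if match:
        value = float(match.group(1))
        return value / 1000.0 if match.group(2).lower() == "ms" else value
    return None
```

A caller would sleep for the returned duration before retrying the same batch, and fall back to its generic retry policy when the result is `None`.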
Changing embedding_concurrency in /settings resets the local and Redis limiter state so workers use the new ceiling immediately.
Token counting uses tiktoken where the provider ships a tokenizer and falls back to a 4-character-per-token heuristic otherwise.
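The fallback heuristic amounts to dividing the character count by four. In the sketch below the function names are hypothetical, and rounding up is an assumption, since the text does not say how fractions are handled. An optional `encoder` argument stands in for a real tokenizer: any object with an `encode(text)` method that returns a list of tokens.

```python
import math


def heuristic_token_count(text: str, chars_per_token: int = 4) -> int:
    """Estimate tokens as ceil(len(text) / chars_per_token)."""
    return math.ceil(len(text) / chars_per_token)


def count_tokens(text: str, encoder=None) -> int:
    """Use the tokenizer when one is supplied, otherwise fall back to the heuristic."""
    if encoder is not None:
        return len(encoder.encode(text))
    return heuristic_token_count(text)
```

With tiktoken installed, an encoder obtained from `tiktoken.get_encoding("cl100k_base")` can be passed as `encoder`.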
Reranking
Reranking is a separate concern: enable reranking_enabled on a collection to add a Cohere cross-encoder pass after retrieval. It's configured independently (different key, different model) because it often makes sense to embed with a cheap local model and rerank with a strong managed one.
The embedding provider, model, dimension, and embedding_base_url cannot be changed after a collection is created — chunk vectors are only meaningful under the model that produced them. If you need to switch models, create a new collection and re-ingest or migrate the source documents into it.