Database / ChromaDB Interview Questions
ChromaDB is an open-source, AI-native vector database designed to store, index, and query high-dimensional embedding vectors efficiently. It was created specifically to make building LLM-powered applications easy — particularly for retrieval-augmented generation (RAG), semantic search, and recommendation systems.
Traditional databases store and search data using exact matches or SQL-style predicates. ChromaDB instead answers the question: "which stored items are most semantically similar to this query?" It does this by storing numerical vectors (embeddings) that represent the meaning of text, images, or other data, and finding nearest neighbours using vector similarity search.
| Feature | Detail |
|---|---|
| Language | Python-first (also JS/TS client) |
| License | Apache 2.0 open source |
| Storage modes | In-memory (ephemeral) or persistent (disk-backed) |
| Default embedding model | all-MiniLM-L6-v2 (run locally through ONNX Runtime in current releases) |
| Distance metrics | cosine, l2 (Euclidean), ip (inner product) |
| Primary use cases | RAG pipelines, semantic search, duplicate detection, recommendation |
pip install chromadb
import chromadb
# Quickstart — in-memory client
client = chromadb.Client()
collection = client.create_collection("my_docs")
collection.add(
documents=["ChromaDB is a vector database", "Python is great"],
ids=["doc1", "doc2"],
)
results = collection.query(query_texts=["vector store for AI"], n_results=1)
print(results["documents"]) # [["ChromaDB is a vector database"]]
An embedding is a dense numerical vector — a list of floating-point numbers — that represents the semantic meaning of a piece of data. Text, images, audio, and code can all be converted into embeddings by a neural network (embedding model). Items with similar meanings produce vectors that are close together in the high-dimensional vector space.
ChromaDB stores these vectors alongside the original data and metadata. When you query ChromaDB with a new piece of text, the same embedding model converts it to a vector, and ChromaDB uses an approximate nearest-neighbour (ANN) algorithm to find the stored vectors that are geometrically closest — these correspond to the most semantically relevant stored documents.
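To make "geometrically closest" concrete, here is a minimal sketch of exact nearest-neighbour search in plain Python. It is not how ChromaDB searches internally (ChromaDB uses an approximate index); the three-dimensional vectors, the `store` dict, and the helper names `cosine_distance` and `nearest` are made up for illustration.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity, so 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def nearest(query, store, k=1):
    """Exact (brute-force) top-k search over a dict of id -> vector."""
    ranked = sorted(store, key=lambda item_id: cosine_distance(query, store[item_id]))
    return ranked[:k]

# Hypothetical 3-dimensional "embeddings"; real models emit hundreds of dimensions.
store = {
    "chroma": [0.9, 0.1, 0.0],
    "vectors": [0.8, 0.2, 0.1],
    "cat": [0.0, 0.1, 0.9],
}
query = [1.0, 0.1, 0.0]

top_two = nearest(query, store, k=2)
print(top_two)  # ['chroma', 'vectors']
```

An approximate index returns nearly the same ranking without comparing the query against every stored vector, which is what makes search practical on large collections.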
# Conceptual illustration
# "ChromaDB is a vector database" → [0.12, -0.45, 0.89, ..., 0.03] (384 numbers)
# "Vector stores for AI apps" → [0.14, -0.41, 0.91, ..., 0.01] (close!)
# "My cat loves tuna fish" → [-0.55, 0.72, -0.11, ..., 0.88] (far away)
import chromadb
from chromadb.utils import embedding_functions
# You can inspect the raw embedding vectors ChromaDB generates
client = chromadb.Client()
collection = client.create_collection("demo")
collection.add(documents=["Hello world"], ids=["1"])
# Get the stored embedding
result = collection.get(ids=["1"], include=["embeddings"])
print(len(result["embeddings"][0])) # 384 — length of the default model's vector
print(result["embeddings"][0][:5]) # first 5 of 384 floats

| Model | Dimensions | Notes |
|---|---|---|
| all-MiniLM-L6-v2 (default) | 384 | Fast, small, good for English |
| text-embedding-ada-002 (OpenAI) | 1536 | High quality, API call required |
| text-embedding-3-small (OpenAI) | 1536 | Newer, cheaper than ada-002 |
| all-mpnet-base-v2 | 768 | Higher quality than MiniLM, slower |
| CLIP (images) | 512 | Multimodal — text and images same space |
ChromaDB uses a distance metric to measure how similar two vectors are during nearest-neighbour search. The metric is set at collection creation time and cannot be changed afterward. Choosing the wrong metric for your embedding model can significantly degrade search quality.
| Metric | hnsw:space value | Formula | Best for |
|---|---|---|---|
| Squared L2 (Euclidean) | l2 (default) | Σ(aᵢ−bᵢ)² | When vector magnitude matters; general purpose |
| Cosine distance | cosine | 1 − (a·b)/(‖a‖‖b‖) | Text embeddings; compares direction, not magnitude |
| Inner product | ip | 1 − (a·b) | When embeddings are pre-normalised to unit length |
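The three metrics can be written out in a few lines of plain Python. This is a sketch that assumes the hnswlib conventions described in ChromaDB's documentation (squared L2, and inner product reported as 1 minus the dot product); the function names are illustrative and are not part of ChromaDB's API.

```python
import math

def squared_l2(a, b):
    """Squared Euclidean distance: sum of squared coordinate differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def inner_product_distance(a, b):
    """1 - dot product, so a larger dot product gives a smaller distance."""
    return 1.0 - sum(x * y for x, y in zip(a, b))

def cosine_distance(a, b):
    """1 - cosine similarity; ignores vector length."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms

def normalise(v):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

a = normalise([3.0, 4.0])  # approximately [0.6, 0.8]
b = normalise([4.0, 3.0])  # approximately [0.8, 0.6]

print(squared_l2(a, b))              # approximately 0.08
print(inner_product_distance(a, b))  # approximately 0.04
print(cosine_distance(a, b))         # approximately 0.04
```

For unit-length vectors the inner-product distance equals the cosine distance, and the squared L2 distance is exactly twice that value, so all three metrics produce the same ranking. They diverge only when vector magnitudes differ.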
import chromadb
client = chromadb.Client()
# Set the metric at collection creation; it cannot be changed later
collection_cosine = client.create_collection(
name="text_cosine",
metadata={"hnsw:space": "cosine"}, # recommended for text
)
collection_l2 = client.create_collection(
name="general_l2",
metadata={"hnsw:space": "l2"}, # default if not specified
)
collection_ip = client.create_collection(
name="normalised_ip",
metadata={"hnsw:space": "ip"}, # use when vectors are unit-normalised
)
# Query returns a "distances" field; interpretation depends on the metric:
# cosine: 0 = identical, 2 = opposite (lower = more similar)
# l2: 0 = identical, larger = more different (lower = more similar)
# ip: reported as 1 - (a·b), so lower = more similar (0 = identical for unit vectors)

Rule of thumb: most popular text embedding models (OpenAI, Sentence Transformers) are optimised for cosine similarity. Use "hnsw:space": "cosine" for text RAG applications. L2 is the default but is often a weaker fit for text embeddings whose magnitudes vary.
A collection is ChromaDB's primary organisational unit, analogous to a table in SQL or an index in a search engine. Each collection stores documents, their embeddings, IDs, and optional metadata. All items in a collection share the same embedding function and distance metric.
import chromadb
client = chromadb.PersistentClient(path="./chroma_db")
# CREATE a collection
collection = client.create_collection(
name="research_papers",
metadata={"hnsw:space": "cosine"},
# embedding_function defaults to all-MiniLM-L6-v2
)
# GET an existing collection (raises error if not found)
collection = client.get_collection("research_papers")
# GET or CREATE — idempotent, safe to call on every startup
collection = client.get_or_create_collection(
name="research_papers",
metadata={"hnsw:space": "cosine"},
)
# LIST all collections
collections = client.list_collections()
for col in collections:
    # Some releases return Collection objects, others plain names
    print(getattr(col, "name", col))
# COUNT documents in a collection
print(collection.count()) # number of items stored
# MODIFY collection name or metadata (the distance metric cannot be changed here)
collection.modify(
name="arxiv_papers",
metadata={"description": "arXiv CS papers"},
)
# DELETE a collection and all its data (use its current name)
client.delete_collection("arxiv_papers")

| Method | Purpose | Raises if |
|---|---|---|
| create_collection(name) | Creates new collection | Name already exists |
| get_collection(name) | Gets existing collection | Name not found |
| get_or_create_collection(name) | Idempotent get/create | Never raises |
| list_collections() | Returns all collections (objects or names, depending on version) | - |
| delete_collection(name) | Permanently deletes collection + data | Name not found |
The collection.add() method inserts items into a collection. Each item requires a unique id. You can provide raw documents (strings) and let ChromaDB embed them, or supply pre-computed embeddings directly. Optional metadatas store filterable key-value pairs alongside each document.
import chromadb
client = chromadb.Client()
collection = client.create_collection("articles")
# Basic add — ChromaDB embeds documents automatically
collection.add(
documents=[
"ChromaDB is an open-source vector database.",
"Retrieval-augmented generation improves LLM accuracy.",
"Python is a popular language for data science.",
],
ids=["art-001", "art-002", "art-003"],
)
# Add with metadata — enables filtered queries later
collection.add(
documents=[
"FastAPI is a modern Python web framework.",
"React is a JavaScript library for building UIs.",
],
metadatas=[
{"source": "docs", "category": "backend", "year": 2024},
{"source": "docs", "category": "frontend", "year": 2024},
],
ids=["art-004", "art-005"],
)
# Add pre-computed embeddings (skips ChromaDB's embedding step)
collection_custom = client.create_collection(
"custom_embeddings",
metadata={"hnsw:space": "cosine"},
)
collection_custom.add(
embeddings=[
[0.1, 0.5, -0.3, 0.8], # every vector in a collection must have the same dimension
[0.4, 0.2, 0.9, -0.1],
],
documents=["Doc A", "Doc B"], # stored as-is for retrieval
ids=["e-1", "e-2"],
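)

ID rules: IDs must be strings, must be unique within the collection, and must not be empty. Repeating an ID within a single add() call raises chromadb.errors.DuplicateIDError. Adding an ID that already exists in the collection raises chromadb.errors.IDAlreadyExistsError in some releases and is silently ignored in others, so use upsert() when overwriting is intended.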
The primary query method is collection.query(). You pass either query_texts (raw strings that ChromaDB embeds automatically) or query_embeddings (pre-computed vectors). ChromaDB returns the n_results nearest neighbours for each query.
import chromadb
client = chromadb.Client()
collection = client.create_collection("knowledge_base")
collection.add(
documents=[
"Python is great for data science and machine learning.",
"JavaScript is used for web development.",
"ChromaDB stores and retrieves vector embeddings.",
"Docker containers package applications with dependencies.",
],
ids=["d1", "d2", "d3", "d4"],
)
# Basic query — returns top 2 most similar documents
results = collection.query(
query_texts=["vector database for AI"],
n_results=2,
)
print(results["documents"]) # [[most_similar, second_most_similar]]
print(results["ids"]) # e.g. [["d3", "d1"]]
print(results["distances"]) # e.g. [[0.18, 0.74]] (illustrative values; lower = more similar)
# Query multiple texts at once (batch query)
results = collection.query(
query_texts=["machine learning", "web frameworks"],
n_results=2,
)
# results["documents"][0] = top 2 for "machine learning"
# results["documents"][1] = top 2 for "web frameworks"
# Control what is returned with include=
results = collection.query(
query_texts=["Python programming"],
n_results=3,
include=["documents", "metadatas", "distances", "embeddings"],
)
# Default include: ["documents", "metadatas", "distances"]
# "embeddings" must be requested explicitly; it increases response size

| Field | Type | Description |
|---|---|---|
| ids | list[list[str]] | IDs of matching documents, outer list = per query |
| documents | list[list[str]] | Original text of matching documents |
| metadatas | list[list[dict]] | Metadata dicts of matching documents |
| distances | list[list[float]] | Distances to the query vector (lower = more similar) |
| embeddings | list[list[list[float]]] | Raw vectors — only if include=['embeddings'] |
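Because every field is a list of lists (one inner list per query), it is often convenient to zip the fields into one record per hit. The sketch below works on a hand-written dict shaped like a query() result; `flatten_results` is a hypothetical helper, and the sample distances are placeholders rather than real output.

```python
def flatten_results(results):
    """Turn a query()-style result dict into one list of hit records per query."""
    per_query = []
    for q, ids in enumerate(results["ids"]):
        hits = []
        for rank, item_id in enumerate(ids):
            hit = {"id": item_id, "rank": rank}
            for field in ("documents", "metadatas", "distances"):
                values = results.get(field)
                if values is not None:  # None means the field was not requested
                    hit[field[:-1]] = values[q][rank]  # "documents" -> "document"
            hits.append(hit)
        per_query.append(hits)
    return per_query

# Hand-written sample with the same shape as a single-query result.
sample = {
    "ids": [["d3", "d1"]],
    "documents": [[
        "ChromaDB stores and retrieves vector embeddings.",
        "Python is great for data science and machine learning.",
    ]],
    "metadatas": None,          # as if "metadatas" had been left out of include=
    "distances": [[0.2, 0.7]],  # placeholder values
}

hits = flatten_results(sample)
for hit in hits[0]:
    print(hit["rank"], hit["id"], hit["distance"])
```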
Beyond querying by similarity, ChromaDB supports exact lookups by ID with get(), in-place updates with update() or upsert(), and deletion with delete().
import chromadb
client = chromadb.Client()
col = client.create_collection("items")
col.add(
documents=["First document", "Second document", "Third document"],
metadatas=[{"v": 1}, {"v": 2}, {"v": 3}],
ids=["id1", "id2", "id3"],
)
# GET — fetch by specific IDs
result = col.get(ids=["id1", "id3"])
print(result["documents"]) # ["First document", "Third document"]
# GET all documents in the collection
all_docs = col.get() # no ids= returns everything
# GET with include control
result = col.get(
ids=["id1"],
include=["documents", "metadatas", "embeddings"],
)
# UPDATE: changes only the fields you pass; the ID should already exist
col.update(
ids=["id1"],
documents=["Updated first document"],
metadatas=[{"v": 10, "updated": True}],
# ChromaDB re-embeds the new document text automatically
)
# UPSERT — insert if not exists, update if exists (idempotent)
col.upsert(
documents=["Brand new doc", "Updated second doc"],
metadatas=[{"v": 99}, {"v": 20}],
ids=["id-new", "id2"],
)
# id-new is inserted; id2 is updated
# DELETE by IDs
col.delete(ids=["id3"])
print(col.count()) # 3 (id1, id2, id-new remain)
# DELETE by metadata filter (where clause)
col.delete(where={"v": {"$gt": 15}})

| Method | Behaviour when ID exists | Behaviour when ID missing |
|---|---|---|
| add() | Raises IDAlreadyExistsError or ignores the item, depending on version | Inserts new document |
| update() | Updates the document | Logs an error and skips the item in current releases; treat a missing ID as a bug |
| upsert() | Updates the document | Inserts new document |
| delete() | Removes the document | Silently ignores |
ChromaDB supports a MongoDB-style where clause for filtering by metadata fields. Filters can be applied during query() (combining semantic search with filtering) or during get() (exact retrieval with filtering). In a query, the filter narrows the candidate set, so every returned neighbour satisfies it.
import chromadb
client = chromadb.Client()
col = client.create_collection("articles")
col.add(
documents=["Python intro", "Python advanced", "JS basics", "Rust guide", "Go tutorial"],
metadatas=[
{"lang": "python", "level": "beginner", "year": 2022},
{"lang": "python", "level": "advanced", "year": 2023},
{"lang": "js", "level": "beginner", "year": 2023},
{"lang": "rust", "level": "beginner", "year": 2024},
{"lang": "go", "level": "intermediate", "year": 2024},
],
ids=["a1","a2","a3","a4","a5"],
)
# Equality filter
results = col.query(
query_texts=["programming tutorial"],
n_results=3,
where={"lang": "python"}, # shorthand for $eq
)
# Comparison operators
results = col.query(
query_texts=["tutorial"],
n_results=5,
where={"year": {"$gte": 2023}}, # year >= 2023
)
# Logical AND — all conditions must match
results = col.query(
query_texts=["guide"],
n_results=3,
where={"$and": [
{"lang": {"$in": ["python", "go"]}},
{"level": {"$ne": "advanced"}},
]},
)
# Logical OR
results = col.query(
query_texts=["code"],
n_results=3,
where={"$or": [
{"year": {"$eq": 2024}},
{"level": {"$eq": "beginner"}},
]},
)
# Filter on document text content (where_document)
results = col.query(
query_texts=["programming"],
n_results=5,
where_document={"$contains": "Python"}, # document text contains "Python"
)

| Operator | Meaning | Example |
|---|---|---|
| $eq | Equal | {"lang": {"$eq": "python"}} or {"lang": "python"} |
| $ne | Not equal | {"level": {"$ne": "advanced"}} |
| $gt / $gte | Greater than / or equal | {"year": {"$gte": 2023}} |
| $lt / $lte | Less than / or equal | {"year": {"$lt": 2024}} |
| $in | Value in list | {"lang": {"$in": ["python", "go"]}} |
| $nin | Value not in list | {"lang": {"$nin": ["js"]}} |
| $and | All conditions true | {"$and": [{...}, {...}]} |
| $or | Any condition true | {"$or": [{...}, {...}]} |
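The semantics of these operators can be modelled with a small evaluator in plain Python. This is only a sketch for reasoning about filters, not ChromaDB's implementation: the `matches` function is hypothetical, and treating a missing metadata key as "no match" is an assumption.

```python
import operator

# Map each comparison operator to a Python function of (stored_value, expected).
_COMPARATORS = {
    "$eq": operator.eq,
    "$ne": operator.ne,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
    "$in": lambda value, options: value in options,
    "$nin": lambda value, options: value not in options,
}

def matches(metadata, where):
    """Return True if a flat metadata dict satisfies a where-style filter."""
    if "$and" in where:
        return all(matches(metadata, clause) for clause in where["$and"])
    if "$or" in where:
        return any(matches(metadata, clause) for clause in where["$or"])
    for key, condition in where.items():
        if key not in metadata:  # assumption: a missing key never matches
            return False
        if not isinstance(condition, dict):
            condition = {"$eq": condition}  # {"lang": "python"} is shorthand for $eq
        for op, expected in condition.items():
            if not _COMPARATORS[op](metadata[key], expected):
                return False
    return True

articles = {
    "a1": {"lang": "python", "level": "beginner", "year": 2022},
    "a2": {"lang": "python", "level": "advanced", "year": 2023},
    "a3": {"lang": "js", "level": "beginner", "year": 2023},
    "a4": {"lang": "rust", "level": "beginner", "year": 2024},
    "a5": {"lang": "go", "level": "intermediate", "year": 2024},
}

where_and = {"$and": [
    {"lang": {"$in": ["python", "go"]}},
    {"level": {"$ne": "advanced"}},
]}
selected = [doc_id for doc_id, meta in articles.items() if matches(meta, where_and)]
print(selected)  # ['a1', 'a5']
```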
ChromaDB offers three client modes that control where data is stored. Choosing the right mode depends on whether you need data to survive restarts and whether you're running a single process or a shared service.
| Mode | Class | Data survives restart? | Best for |
|---|---|---|---|
| Ephemeral (in-memory) | chromadb.Client() | No — lost when process ends | Testing, prototyping, CI pipelines |
| Persistent (disk) | chromadb.PersistentClient(path=...) | Yes — written to SQLite + disk files | Single-process apps, local dev |
| HTTP Client | chromadb.HttpClient(host=..., port=...) | Yes — managed by server | Multi-process apps, production, shared access |
import chromadb
# 1. Ephemeral — data lives only in memory, lost on exit
client_mem = chromadb.Client()
# 2. Persistent — data saved to disk at ./my_chroma_db/
client_disk = chromadb.PersistentClient(path="./my_chroma_db")
# Creates the directory if it does not exist
# Data persists across Python restarts
# 3. HTTP Client — connects to a running ChromaDB server
client_http = chromadb.HttpClient(
host="localhost",
port=8000,
# ssl=True, headers={"Authorization": "Bearer token"} # if secured
)
# Start the server separately:
# chroma run --path ./chroma_data --port 8000
# Verify connection
client_http.heartbeat() # raises if server is unreachable
# EphemeralClient — explicit alias for chromadb.Client()
client_eph = chromadb.EphemeralClient()
# All three clients share the same collection API
collection = client_disk.get_or_create_collection("my_data")
collection.add(documents=["Persisted text"], ids=["p1"])
# Restart Python, create PersistentClient with same path → data still there

Important: the persistent client uses SQLite under the hood. It is not designed for concurrent writes from multiple processes. For multi-process or multi-container production use, run ChromaDB as an HTTP server and use HttpClient.
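When you create a collection without specifying an embedding function, ChromaDB embeds text with the all-MiniLM-L6-v2 model. Current releases run it through a bundled ONNX Runtime implementation (DefaultEmbeddingFunction) rather than the sentence-transformers library; SentenceTransformerEmbeddingFunction loads the same model through sentence-transformers if you prefer that route. The model is downloaded automatically on first use and cached locally.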
import chromadb
from chromadb.utils import embedding_functions
# Default — uses all-MiniLM-L6-v2 automatically
client = chromadb.Client()
collection_default = client.create_collection("default_embeddings")
# Equivalent explicit usage
ef = embedding_functions.SentenceTransformerEmbeddingFunction(
model_name="all-MiniLM-L6-v2", # 384-dim, fast, good English quality
)
# Using a different Sentence Transformer model
ef_large = embedding_functions.SentenceTransformerEmbeddingFunction(
model_name="all-mpnet-base-v2", # 768-dim, higher quality, slower
)
collection_large = client.create_collection(
name="large_model",
embedding_function=ef_large,
metadata={"hnsw:space": "cosine"},
)
# You can call embedding functions directly to inspect output
embed = embedding_functions.SentenceTransformerEmbeddingFunction()
vectors = embed(["Hello world", "ChromaDB is great"])
print(len(vectors)) # 2 — one vector per input
print(len(vectors[0])) # 384 dimensions

| Property | Value |
|---|---|
| Model name | all-MiniLM-L6-v2 |
| Output dimensions | 384 |
| Download size | ~80 MB (cached after first use) |
| Library required | onnxruntime for the built-in default; sentence-transformers for SentenceTransformerEmbeddingFunction |
| Runs on | CPU (default) or GPU |
| Strength | Fast, good English semantic similarity |
| Limitation | Weaker on non-English, domain-specific text |
ChromaDB has a built-in OpenAIEmbeddingFunction that calls the OpenAI Embeddings API. This gives higher-quality embeddings than the default local model, at the cost of API latency and usage fees. Use text-embedding-3-small for a balance of quality and cost, or text-embedding-3-large for maximum quality.
import chromadb
from chromadb.utils import embedding_functions
import os
client = chromadb.PersistentClient(path="./chroma_openai")
# Built-in OpenAI embedding function
ef_openai = embedding_functions.OpenAIEmbeddingFunction(
api_key=os.environ["OPENAI_API_KEY"],
model_name="text-embedding-3-small", # 1536-dim, fast and cheap
# model_name="text-embedding-3-large", # 3072-dim, highest quality
# model_name="text-embedding-ada-002", # legacy, 1536-dim
)
collection = client.get_or_create_collection(
name="openai_docs",
embedding_function=ef_openai,
metadata={"hnsw:space": "cosine"},
)
# Usage is identical to the default embedding function
collection.add(
documents=[
"FastAPI is a modern Python web framework for building APIs.",
"ChromaDB stores embeddings for semantic search.",
],
ids=["d1", "d2"],
)
# ChromaDB calls OpenAI API automatically on add() and query()
results = collection.query(
query_texts=["vector database for retrieval"],
n_results=1,
)
print(results["documents"]) # [["ChromaDB stores embeddings..."]]

| Model | Dimensions | Cost | Notes |
|---|---|---|---|
| text-embedding-3-small | 1536 | ~$0.02/1M tokens | Best value — recommended default |
| text-embedding-3-large | 3072 | ~$0.13/1M tokens | Highest quality |
| text-embedding-ada-002 | 1536 | ~$0.10/1M tokens | Legacy, superseded by v3 |
Important consistency rule: you must use the exact same embedding model for both storing and querying. If you embed documents with text-embedding-3-small, all queries must also use text-embedding-3-small. Mixing models produces meaningless similarity scores.
ChromaDB provides a HuggingFaceEmbeddingFunction that calls the HuggingFace Inference API (cloud-hosted), and a SentenceTransformerEmbeddingFunction for running any Sentence Transformer model locally. For production use without per-call API costs, local Sentence Transformer models are the more common choice.
import chromadb
from chromadb.utils import embedding_functions
import os
client = chromadb.Client()
# Option 1: HuggingFace Inference API (cloud, requires API key)
ef_hf_api = embedding_functions.HuggingFaceEmbeddingFunction(
api_key=os.environ["HUGGINGFACE_API_KEY"],
model_name="sentence-transformers/all-MiniLM-L6-v2",
)
# Option 2: Local Sentence Transformers (no API key, runs on your machine)
ef_local = embedding_functions.SentenceTransformerEmbeddingFunction(
model_name="all-MiniLM-L6-v2", # 384-dim, fast
# model_name="all-mpnet-base-v2", # 768-dim, higher quality
# model_name="BAAI/bge-large-en-v1.5", # excellent quality
device="cpu", # or "cuda" for GPU acceleration
)
collection = client.create_collection(
name="hf_docs",
embedding_function=ef_local,
metadata={"hnsw:space": "cosine"},
)
collection.add(
documents=[
"Open-source language models are becoming more powerful.",
"LLaMA and Mistral are popular open-source LLMs.",
],
ids=["h1", "h2"],
)
results = collection.query(
query_texts=["free LLM models"],
n_results=2,
)
print(results["documents"])
# Popular local models for RAG
models = {
"BAAI/bge-small-en-v1.5": "384-dim, excellent quality/speed ratio",
"BAAI/bge-large-en-v1.5": "1024-dim, top English quality",
"intfloat/e5-base-v2": "768-dim, English (intfloat/multilingual-e5-base is the multilingual variant)",
"thenlper/gte-large": "1024-dim, great for retrieval",
}

Trade-offs: the HuggingFace Inference API requires no local GPU but costs money and adds latency. Local Sentence Transformers are free, fast (especially on GPU), run offline, and keep data on your machine, which makes them the preferred choice for sensitive data.
ChromaDB defines a simple protocol for embedding functions: a class with a __call__ method that accepts a list of strings and returns a list of embedding vectors. Implementing this interface lets you plug in any model — a local transformer, a third-party API, or even a mock for testing.
import chromadb
from chromadb import Documents, EmbeddingFunction, Embeddings
from typing import List
# Custom embedding function — must implement __call__
class MyCustomEmbeddingFunction(EmbeddingFunction):
    """Wraps any embedding model in ChromaDB's interface."""
    def __init__(self, model_name: str = "my-model"):
        # Load your model here
        self.model_name = model_name
        # self.model = load_model(model_name)
    def __call__(self, input: Documents) -> Embeddings:
        """
        input: list of strings to embed
        return: list of lists of floats (one vector per string)
        """
        embeddings = []
        for text in input:
            # Replace with your actual embedding logic
            vector = self._embed_text(text)
            embeddings.append(vector)
        return embeddings
    def _embed_text(self, text: str) -> List[float]:
        # Example: fixed-dim hash-based mock (not for production)
        import hashlib
        h = hashlib.md5(text.encode()).digest()
        return [b / 255.0 for b in h] # 16-dim mock vector
# Use your custom function exactly like a built-in one
client = chromadb.Client()
custom_ef = MyCustomEmbeddingFunction()
collection = client.create_collection(
name="custom_embed",
embedding_function=custom_ef,
)
collection.add(
documents=["Test document one", "Test document two"],
ids=["c1", "c2"],
)
results = collection.query(
query_texts=["test"],
n_results=1,
)
print(results["ids"]) # [["c1"]] or [["c2"]]

When to write a custom embedding function:
- Your company uses a proprietary or self-hosted embedding model
- You need to embed data from a provider not in ChromaDB's built-in list
- You want to add preprocessing (text cleaning, chunking, domain adaptation) before embedding
- Testing — inject a deterministic mock that returns predictable vectors
The PersistentClient stores data in a directory you specify. Inside, ChromaDB uses SQLite for metadata (IDs, document text, metadata key-value pairs) and binary files for the HNSW vector index. All writes are flushed to disk automatically — there is no explicit save/commit step.
import chromadb
import os
# Create or open a persistent database
client = chromadb.PersistentClient(path="./my_vector_db")
# After this call, ./my_vector_db/ contains:
# - chroma.sqlite3 (metadata, documents, IDs)
# - <uuid>/ (one folder per collection)
# - header.bin (HNSW index configuration)
# - data_level0.bin (HNSW graph layer 0)
# - length.bin (element count)
col = client.get_or_create_collection("notes")
col.add(
documents=["Remember to buy milk", "Meeting at 3pm tomorrow"],
ids=["n1", "n2"],
)
# Data is persisted immediately — no commit needed
# Verify data survives restart:
del client, col # simulate process exit
client2 = chromadb.PersistentClient(path="./my_vector_db")
col2 = client2.get_collection("notes")
print(col2.count()) # 2 — still there!
print(col2.get(ids=["n1"])["documents"]) # ["Remember to buy milk"]
# Check the files on disk
for root, dirs, files in os.walk("./my_vector_db"):
    for f in files:
        print(os.path.join(root, f))

| Limitation | Detail |
|---|---|
| Single writer only | SQLite allows only one writer at a time — concurrent writes from multiple processes cause errors |
| No built-in replication | The SQLite file is a single point of failure; back it up manually |
| No horizontal scaling | Cannot distribute load across multiple machines |
| File locking | Moving or copying the directory while the client is open can corrupt data |
| Migration | Upgrading ChromaDB versions may require running migration scripts on the SQLite DB |
For multi-process or production deployments, prefer running chroma run --path ./data as a server and connecting with HttpClient.
ChromaDB uses HNSW (Hierarchical Navigable Small World) as its approximate nearest-neighbour (ANN) index. HNSW builds a layered graph in which each node links to its closest neighbours. A query walks this graph from the sparse top layer down to the dense bottom layer, which finds approximate nearest neighbours in roughly O(log n) time rather than the O(n) of an exhaustive scan.
import chromadb
client = chromadb.Client()
# HNSW parameters are set as metadata at collection creation
collection = client.create_collection(
name="tuned_collection",
metadata={
"hnsw:space": "cosine", # distance metric
"hnsw:construction_ef": 200, # default 100
# Controls quality of index during construction.
# Higher = better recall, slower inserts.
"hnsw:search_ef": 100, # default 10
# Controls quality of search at query time.
# Higher = better recall, slower queries.
"hnsw:M": 16, # default 16
# Number of bi-directional links per node.
# Higher = better recall + more memory + slower inserts.
# Typical range: 4-64.
},
)
# Note: HNSW parameters cannot be changed after collection creation
# You would need to recreate the collection and re-insert data
collection.add(
documents=[f"Document number {i}" for i in range(10000)],
ids=[str(i) for i in range(10000)],
)

| Parameter | Default | Effect of increasing | Effect of decreasing |
|---|---|---|---|
| hnsw:space | l2 | Changes metric (cosine/ip) | — |
| hnsw:M | 16 | Better recall, more memory, slower inserts | Faster inserts, less memory, lower recall |
| hnsw:construction_ef | 100 | Better index quality, slower inserts | Faster inserts, lower quality graph |
| hnsw:search_ef | 10 | Better recall, slower queries | Faster queries, lower recall |
For most RAG use cases, the defaults work well for collections under ~100K documents. For large collections or when recall matters, increase hnsw:search_ef to 50–200 and set hnsw:construction_ef to at least 200 when building the index.
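Whether a higher hnsw:search_ef is worth its cost can be judged by measuring recall: the fraction of the true nearest neighbours that the approximate search returns. The sketch below assumes you already have two ID lists for the same query, one from an exact brute-force search and one from the index; `recall_at_k` is a hypothetical helper and the IDs are made up.

```python
def recall_at_k(exact_ids, approx_ids):
    """Fraction of the exact top-k neighbours that the approximate search found."""
    exact = set(exact_ids)
    if not exact:
        return 1.0  # nothing to find, so nothing was missed
    return len(exact & set(approx_ids)) / len(exact)

# Made-up example: the index missed one of the four true neighbours.
exact_top4 = ["a", "b", "c", "d"]
approx_top4 = ["a", "c", "d", "x"]

recall = recall_at_k(exact_top4, approx_top4)
print(recall)  # 0.75
```

Averaging this value over a sample of representative queries, before and after changing a parameter, gives a direct view of the recall-versus-latency trade-off.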
Adding tens of thousands of documents one at a time is slow because each call triggers embedding computation and index updates. The right approach is to batch documents into groups of 100–500 and add each batch with a single add() call — this amortises embedding overhead and index writes.
import chromadb
from chromadb.utils import embedding_functions
from typing import List
client = chromadb.PersistentClient(path="./bulk_db")
ef = embedding_functions.SentenceTransformerEmbeddingFunction(
model_name="all-MiniLM-L6-v2"
)
collection = client.get_or_create_collection(
"large_corpus", embedding_function=ef
)
# Simulate a large list of documents
documents = [f"Article about topic {i}" for i in range(10_000)]
ids = [f"doc-{i}" for i in range(10_000)]
metadatas = [{"index": i, "batch": i // 500} for i in range(10_000)]
# Efficient batch insertion
BATCH_SIZE = 500
for start in range(0, len(documents), BATCH_SIZE):
    end = start + BATCH_SIZE
    collection.add(
        documents=documents[start:end],
        ids=ids[start:end],
        metadatas=metadatas[start:end],
    )
    print(f"Added batch {start // BATCH_SIZE + 1}, total: {collection.count()}")
print(f"Final count: {collection.count()}") # 10000
# Alternative: provide pre-computed embeddings to skip re-embedding
# (useful when you already called the embedding API)
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")
extra_docs = [f"Extra article {i}" for i in range(500)]
extra_ids = [f"extra-{i}" for i in range(500)] # fresh IDs; doc-0..doc-9999 are already stored
vectors = model.encode(extra_docs, batch_size=64, show_progress_bar=True)
collection.add(
embeddings=vectors.tolist(),
documents=extra_docs,
ids=extra_ids,
)

| Tip | Reason |
|---|---|
| Batch size 100–500 | Balances memory use and embedding throughput |
| Pre-compute embeddings externally | Avoid re-embedding if you already have vectors from an API call |
| Use GPU for local models | Encoding is typically much faster on GPU than on CPU; the speed-up depends on the model and hardware |
| Upsert instead of add in loops | upsert() is safe to re-run; add() rejects or skips IDs that already exist |
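The slicing logic of a batching loop can be factored into a small generator so that documents, IDs, and metadata stay aligned. This is a plain-Python sketch; `batched` is a hypothetical helper written here, not a ChromaDB function, and the batch size of 500 is just the upper end of the range suggested above.

```python
def batched(items, size):
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]

documents = [f"Article about topic {i}" for i in range(1050)]
ids = [f"doc-{i}" for i in range(1050)]

# zip keeps the document and ID batches in step with each other.
batch_sizes = []
for doc_batch, id_batch in zip(batched(documents, 500), batched(ids, 500)):
    # A real pipeline would call collection.upsert(documents=doc_batch, ids=id_batch) here.
    batch_sizes.append(len(doc_batch))

print(batch_sizes)  # [500, 500, 50]
```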
ChromaDB provides two types of filters that can be used together or separately: where filters on metadata fields (structured key-value pairs), and where_document filters on the raw text of the stored documents.
import chromadb
client = chromadb.Client()
col = client.create_collection("mixed_docs")
col.add(
documents=[
"Python tutorial for beginners with examples",
"Advanced Python decorators and metaclasses",
"JavaScript async/await guide",
"Python data science with pandas and numpy",
"Rust memory safety tutorial",
],
metadatas=[
{"lang": "python", "level": "beginner"},
{"lang": "python", "level": "advanced"},
{"lang": "js", "level": "intermediate"},
{"lang": "python", "level": "intermediate"},
{"lang": "rust", "level": "beginner"},
],
ids=["d1","d2","d3","d4","d5"],
)
# where_document: filter on text content
results = col.query(
query_texts=["programming guide"],
n_results=5,
where_document={"$contains": "tutorial"}, # text must contain "tutorial"
)
print(results["ids"]) # d1 and d5 only; the other documents do not contain "tutorial"
# where_document with NOT
results = col.query(
query_texts=["programming"],
n_results=5,
where_document={"$not_contains": "JavaScript"},
)
# Combine where (metadata) + where_document (text content)
results = col.query(
query_texts=["learning to code"],
n_results=3,
where={"lang": "python"}, # metadata filter
where_document={"$contains": "tutorial"}, # content filter
# Only Python docs whose text contains "tutorial"
)
print(results["documents"])
# Only matches d1: "Python tutorial for beginners with examples"

| Filter | Operates on | Supported operators |
|---|---|---|
| where | Metadata key-value fields | $eq, $ne, $gt, $gte, $lt, $lte, $in, $nin, $and, $or |
| where_document | Raw document text content | $contains, $not_contains |
Both query() and get() accept an include parameter — a list of strings specifying which fields to return. Omitting fields you don't need reduces network payload and memory, which matters for large result sets.
import chromadb
client = chromadb.Client()
col = client.create_collection("demo")
col.add(
documents=["Alpha document", "Beta document", "Gamma document"],
metadatas=[{"tag": "a"}, {"tag": "b"}, {"tag": "c"}],
ids=["id1", "id2", "id3"],
)
# Default include: documents, metadatas, distances (for query)
# Default include for get(): documents, metadatas (no distances)
results = col.query(query_texts=["document"], n_results=2)
print(results.keys())
# Keys include ids, distances, metadatas, embeddings, documents, uris, data (the exact set varies by version)
# embeddings, uris, data are None unless requested
# Only return IDs and distances — smallest possible response
results = col.query(
query_texts=["alpha"],
n_results=2,
include=["distances"], # ids are always returned
)
print(results["documents"]) # None
print(results["distances"]) # e.g. [[0.05, 0.72]] (illustrative values)
# Include raw embedding vectors (large! use only when needed)
results = col.query(
query_texts=["beta"],
n_results=1,
include=["documents", "metadatas", "distances", "embeddings"],
)
print(len(results["embeddings"][0][0])) # 384 floats per vector
# get() include — embeddings must be explicitly requested
all_data = col.get(
include=["documents", "metadatas", "embeddings"],
)
print(len(all_data["embeddings"])) # 3
# get() without include — returns ids, documents, and metadatas by default
default_items = col.get()
print(default_items["ids"]) # ["id1", "id2", "id3"]
print(default_items["documents"]) # ["Alpha...", "Beta...", "Gamma..."]
| Value | Returned in query()? | Returned in get()? |
|---|---|---|
| documents | Yes (default) | Yes (default) |
| metadatas | Yes (default) | Yes (default) |
| distances | Yes (default) | No — not applicable |
| embeddings | No (must request) | No (must request) |
| uris | No (multimodal only) | No (multimodal only) |
| data | No (multimodal only) | No (multimodal only) |
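Because query() returns one inner list per query text and leaves unrequested fields as None, a small helper can make the result easier to consume. The sketch below is a hypothetical convenience function (`rows_for_query` is not part of ChromaDB); it assumes only the nested-list result shape shown above.

```python
def rows_for_query(results: dict, query_index: int = 0) -> list[dict]:
    """Turn the column-oriented result for one query text into a list of row dicts."""
    rows = [{"id": item_id} for item_id in results["ids"][query_index]]
    for field in ("documents", "metadatas", "distances"):
        column = results.get(field)
        if column is None:  # field was not requested via include
            continue
        for row, value in zip(rows, column[query_index]):
            row[field[:-1]] = value  # "documents" -> "document", and so on
    return rows
```

With `include=["distances"]`, each row would carry only `id` and `distance`, which mirrors the smaller response described above.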
Metadata in ChromaDB is stored as flat key-value dictionaries where values must be strings, integers, floats, or (in recent releases) booleans, not nested dicts or lists. Good metadata design makes the difference between fast, precise filtered queries and slow full-collection scans.
import chromadb
from datetime import datetime
client = chromadb.Client()
col = client.create_collection("knowledge_base")
# Good metadata design — flat, filterable fields
col.add(
documents=[
"Introduction to transformer architecture in deep learning.",
"BERT: Pre-training of Deep Bidirectional Transformers.",
"GPT-4 technical report overview.",
],
metadatas=[
{
"source": "textbook",
"author": "Vaswani",
"year": 2017, # int — supports $gt, $lt
"category": "architecture",
"citations": 50000, # int — supports range filters as well
"language": "en",
# timestamp as int for range queries
"added_ts": int(datetime(2024,1,1).timestamp()),
},
{
"source": "paper",
"author": "Devlin",
"year": 2018,
"category": "pretraining",
"citations": 40000,
"language": "en",
"added_ts": int(datetime(2024,1,2).timestamp()),
},
{
"source": "report",
"author": "OpenAI",
"year": 2023,
"category": "LLM",
"citations": 5000,
"language": "en",
"added_ts": int(datetime(2024,1,3).timestamp()),
},
],
ids=["p1","p2","p3"],
)
# Effective filtered queries
results = col.query(
query_texts=["neural network architecture"],
n_results=5,
where={"$and": [
{"year": {"$gte": 2017}},
{"citations":{"$gte": 10000}},
{"language": "en"},
]},
)
# Anti-patterns to avoid in metadata:
# BAD: {"tags": ["python", "nlp"]} — lists not supported
# BAD: {"author": {"name": "Vaswani", "affiliation": "Google"}} — nested not supported
# GOOD: {"tag_python": 1, "tag_nlp": 1} — flatten list membership to bool ints
# GOOD: {"author_name": "Vaswani", "author_org": "Google"} — flatten nested
| Type | Supported? | Supports range filters? |
|---|---|---|
| str | Yes | Only $eq, $ne, $in, $nin |
| int | Yes | Yes — $gt, $gte, $lt, $lte |
| float | Yes | Yes — $gt, $gte, $lt, $lte |
| bool | Yes in recent releases; older ones need int 0/1 | No — equality operators only |
| list | No | — |
| dict (nested) | No | — |
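The flattening advice above can be automated before ingestion. The following is a minimal sketch of one possible convention: `flatten_metadata` is a hypothetical helper, and the key naming scheme (parent key, separator, child key) and the 0/1 encoding of booleans are assumptions chosen to match the patterns in this section.

```python
def flatten_metadata(raw: dict, sep: str = "_") -> dict:
    """Flatten nested dicts and lists into scalar key-value pairs (illustrative)."""
    flat: dict = {}
    for key, value in raw.items():
        if isinstance(value, dict):
            # Nested dict: prefix child keys with the parent key
            for sub_key, sub_value in flatten_metadata(value, sep).items():
                flat[f"{key}{sep}{sub_key}"] = sub_value
        elif isinstance(value, (list, tuple, set)):
            # List membership becomes one 0/1 flag per item
            for item in value:
                flat[f"{key}{sep}{item}"] = 1
        elif isinstance(value, bool):
            # bool is checked before int because bool is a subclass of int
            flat[key] = int(value)
        elif isinstance(value, (str, int, float)):
            flat[key] = value
        elif value is not None:
            flat[key] = str(value)  # fall back to a string for other types
    return flat
```

One trade-off of the flag-per-item approach is that the set of metadata keys grows with the tag vocabulary, so it suits small, stable tag sets best.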
ChromaDB provides several methods to examine what is stored in a collection — useful for debugging, verifying ingestion, and monitoring collection health.
import chromadb
client = chromadb.PersistentClient(path="./inspect_demo")
col = client.get_or_create_collection(
"articles",
metadata={"hnsw:space": "cosine"},
)
col.add(
documents=[f"Article {i} about topic {i%3}" for i in range(20)],
metadatas=[{"topic": i%3, "idx": i} for i in range(20)],
ids=[f"art-{i}" for i in range(20)],
)
# 1. Count documents
print(col.count()) # 20
# 2. Peek — quick look at first n items (default n=10)
peek = col.peek(limit=5)
print(peek["ids"]) # first 5 IDs
print(peek["documents"]) # first 5 documents
# 3. Get all (careful with large collections!)
all_items = col.get()
print(len(all_items["ids"])) # 20
# 4. Get a page of results (offset-based)
page = col.get(
limit=5,
offset=10, # skip first 10
)
print(page["ids"]) # art-10 through art-14
# 5. Inspect collection metadata and config
print(col.name) # "articles"
print(col.id) # UUID
print(col.metadata) # {"hnsw:space": "cosine"}
# 6. List all collections
# (depending on the installed version, items are plain names or Collection objects)
for c in client.list_collections():
    print(c)
# 7. Check if a document exists by ID
result = col.get(ids=["art-5"])
if result["ids"]:
    print("Found:", result["documents"][0])
else:
    print("Not found")
| Method / Property | Purpose |
|---|---|
| collection.count() | Number of documents stored |
| collection.peek(limit=10) | Quick sample of first N items |
| collection.get() | Retrieve all items (paginate large collections) |
| collection.get(limit=N, offset=M) | Paginate through collection |
| collection.name | Collection name string |
| collection.metadata | Dict of collection settings (hnsw:space etc.) |
| client.list_collections() | All collections (names or Collection objects, depending on version) |
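For large collections, the limit/offset pagination described above is usually wrapped in a loop. The sketch below shows one way to do that; `iter_pages` is a hypothetical helper that accepts any callable with a `get(limit=..., offset=...)` style signature, and `fake_get` is a stand-in for `col.get` so the example runs without a database.

```python
from typing import Callable, Iterator

def iter_pages(get_page: Callable[..., dict], page_size: int = 1000) -> Iterator[dict]:
    """Yield successive pages until an empty page signals the end."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    offset = 0
    while True:
        page = get_page(limit=page_size, offset=offset)
        if not page["ids"]:
            return
        yield page
        offset += len(page["ids"])

# Stand-in for col.get (hypothetical): 20 IDs, sliced by limit and offset
def fake_get(limit: int, offset: int) -> dict:
    ids = [f"art-{i}" for i in range(20)]
    return {"ids": ids[offset:offset + limit]}

page_sizes = [len(page["ids"]) for page in iter_pages(fake_get, page_size=8)]
print(page_sizes)  # [8, 8, 4]
```

With a real collection the call would look like `iter_pages(col.get, page_size=1000)`, which keeps only one page in memory at a time.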
RAG combines ChromaDB's semantic retrieval with an LLM's generation ability. The pipeline has two phases: indexing (chunk documents, embed, store in ChromaDB) and retrieval (embed the user query, fetch similar chunks, inject into LLM prompt).
import chromadb
from chromadb.utils import embedding_functions
from openai import OpenAI
import os
# --- INDEXING PHASE (run once) ---
chroma_client = chromadb.PersistentClient(path="./rag_db")
ef = embedding_functions.OpenAIEmbeddingFunction(
api_key=os.environ["OPENAI_API_KEY"],
model_name="text-embedding-3-small",
)
collection = chroma_client.get_or_create_collection(
"company_docs", embedding_function=ef, metadata={"hnsw:space": "cosine"}
)
# Chunk and index your knowledge base
documents = [
"ChromaDB supports cosine, l2, and inner-product distance metrics.",
"Persistent storage in ChromaDB uses SQLite under the hood.",
"The default embedding model is all-MiniLM-L6-v2 with 384 dimensions.",
"ChromaDB collections support metadata filtering with $eq, $gt, $in operators.",
]
collection.add(
documents=documents,
ids=[f"doc-{i}" for i in range(len(documents))],
)
# --- RETRIEVAL + GENERATION PHASE (run per query) ---
def rag_answer(user_question: str, n_results: int = 3) -> str:
    # 1. Retrieve relevant chunks from ChromaDB
    results = collection.query(
        query_texts=[user_question],
        n_results=n_results,
        include=["documents", "distances"],
    )
    context_chunks = results["documents"][0]  # list of retrieved texts
    context = "\n\n".join(
        f"[{i+1}] {chunk}" for i, chunk in enumerate(context_chunks)
    )
    # 2. Build an augmented prompt
    prompt = (
        "Answer the question using ONLY the context below.\n"
        'If the answer is not in the context, say "I don\'t know."\n\n'
        f"Context:\n{context}\n\n"
        f"Question: {user_question}\n"
        "Answer:"
    )
    # 3. Generate answer with LLM
    openai_client = OpenAI()
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
print(rag_answer("What distance metrics does ChromaDB support?"))
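One practical refinement of the retrieval step is to cap how much retrieved text goes into the prompt. The sketch below is a hypothetical helper (`build_context` and the `max_chars` budget are assumptions, not part of the pipeline above); it keeps chunks in rank order and stops before the budget is exceeded.

```python
def build_context(chunks: list[str], max_chars: int = 2000) -> str:
    """Number chunks in rank order and stop before exceeding max_chars."""
    parts: list[str] = []
    used = 0
    for rank, chunk in enumerate(chunks, start=1):
        entry = f"[{rank}] {chunk}"
        cost = len(entry) + (2 if parts else 0)  # 2 accounts for the "\n\n" separator
        if used + cost > max_chars:
            break
        parts.append(entry)
        used += cost
    return "\n\n".join(parts)
```

A character budget is only a rough proxy for a token budget; a tokenizer-based count would be more precise if the model's context window is tight.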
Before adding documents to ChromaDB, long texts must be split into chunks that fit within the embedding model's token limit and contain cohesive information. Chunk size and overlap directly affect retrieval quality.
# pip install langchain-text-splitters
from langchain_text_splitters import (
RecursiveCharacterTextSplitter,
TokenTextSplitter,
)
import chromadb
# RecursiveCharacterTextSplitter — tries to split at natural boundaries
# (paragraphs → sentences → words → characters)
splitter = RecursiveCharacterTextSplitter(
chunk_size=500, # characters per chunk (roughly 100–125 tokens of English text)
chunk_overlap=50, # overlap prevents losing context at chunk boundaries
separators=["\n\n", "\n", ". ", " ", ""],
)
long_document = """ChromaDB is an open-source vector database.
It supports multiple embedding functions including OpenAI and HuggingFace.
ChromaDB uses HNSW for approximate nearest-neighbour search.
You can filter results using metadata fields.
Persistent storage uses SQLite under the hood.
""" * 20 # repeat to make it long
chunks = splitter.split_text(long_document)
print(f"Split into {len(chunks)} chunks")
print(f"First chunk length: {len(chunks[0])} chars")
# Add chunks to ChromaDB with source metadata
client = chromadb.Client()
col = client.create_collection("chunked_docs")
col.add(
documents=chunks,
metadatas=[{"source": "chroma_guide.txt", "chunk_idx": i}
for i in range(len(chunks))],
ids=[f"chunk-{i}" for i in range(len(chunks))],
)
| Strategy | Chunk size | Overlap | Best for |
|---|---|---|---|
| Small chunks | 100–200 tokens | 10–20 tokens | Precise retrieval, FAQ-style docs |
| Medium chunks | 300–500 tokens | 50 tokens | Most RAG use cases — good balance |
| Large chunks | 800–1000 tokens | 100 tokens | Long-form prose where context matters |
| Semantic chunking | Variable | 0 | Academic papers, structured content |
Key rule: chunk overlap prevents the situation where a sentence spanning a chunk boundary gets split, losing its meaning in both halves. Typical overlap is 10–20% of chunk size.
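The effect of overlap is easiest to see in a fixed-size sliding window. The sketch below is a simplified character-based chunker written for illustration (`chunk_text` is a hypothetical helper, and it does not look for natural boundaries the way RecursiveCharacterTextSplitter does).

```python
def chunk_text(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into fixed-size windows that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks

print(chunk_text("abcdefghij", chunk_size=4, overlap=1))
# ['abcd', 'defg', 'ghij']
```

Each chunk starts with the last character of the previous one, so text near a boundary appears in both neighbours.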
LangChain provides a first-class Chroma vector store integration that wraps ChromaDB's API with LangChain's retriever interface. This enables plugging ChromaDB into LangChain RAG chains, agents, and pipelines without writing low-level ChromaDB code.
# pip install langchain langchain-chroma langchain-openai
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_core.documents import Document
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
import os
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
# --- Option 1: Create from documents ---
docs = [
Document(page_content="ChromaDB is a vector database.", metadata={"source": "intro"}),
Document(page_content="HNSW is used for ANN search.", metadata={"source": "tech"}),
Document(page_content="RAG improves LLM accuracy.", metadata={"source": "ai"}),
]
vectorstore = Chroma.from_documents(
documents=docs,
embedding=embeddings,
collection_name="lc_demo",
persist_directory="./lc_chroma", # persistent storage
)
# --- Option 2: Load existing ChromaDB ---
vectorstore = Chroma(
collection_name="lc_demo",
embedding_function=embeddings,
persist_directory="./lc_chroma",
)
# Similarity search
results = vectorstore.similarity_search("vector databases", k=2)
for doc in results:
    print(doc.page_content)
# As retriever (for use in chains)
retriever = vectorstore.as_retriever(
search_type="similarity",
search_kwargs={"k": 3, "filter": {"source": "tech"}},
)
# Build a simple RAG chain with LCEL
llm = ChatOpenAI(model="gpt-4o-mini")
prompt = ChatPromptTemplate.from_template(
"Answer based on context:\n\n{context}\n\nQuestion: {question}"
)
def format_docs(docs):
    return "\n\n".join(d.page_content for d in docs)
rag_chain = (
{"context": retriever | format_docs, "question": RunnablePassthrough()}
| prompt | llm | StrOutputParser()
)
print(rag_chain.invoke("What search algorithm does ChromaDB use?"))
ChromaDB does not have built-in user-level access control, but you can implement logical isolation between tenants using separate collections per tenant (strong isolation) or metadata-based filtering (lighter weight). Choose based on your security and scale requirements.
import chromadb
client = chromadb.PersistentClient(path="./multi_tenant")
# --- Strategy 1: Separate collection per tenant ---
# Strong isolation — one tenant cannot accidentally access another's data
def get_tenant_collection(tenant_id: str):
    collection_name = f"tenant_{tenant_id}"  # e.g. "tenant_acme_corp"
    return client.get_or_create_collection(
        name=collection_name,
        metadata={"hnsw:space": "cosine", "tenant": tenant_id},
    )
col_acme = get_tenant_collection("acme_corp")
col_globex = get_tenant_collection("globex_inc")
col_acme.add(
documents=["ACME internal policy v1"],
ids=["acme-doc-1"],
)
col_globex.add(
documents=["Globex product catalogue"],
ids=["globex-doc-1"],
)
# ACME queries can never return Globex data — total isolation
# --- Strategy 2: Metadata filtering (shared collection) ---
# Lighter weight — all tenants share one collection, filtered at query time
shared_col = client.get_or_create_collection("shared_docs")
shared_col.add(
documents=["ACME policy", "Globex catalogue"],
metadatas=[{"tenant_id": "acme"}, {"tenant_id": "globex"}],
ids=["s1", "s2"],
)
def tenant_query(tenant_id: str, query: str, n: int = 3):
    return shared_col.query(
        query_texts=[query],
        n_results=n,
        where={"tenant_id": tenant_id},  # ALWAYS filter by tenant
    )
results = tenant_query("acme", "company policies")
print(results["documents"])  # Only ACME docs returned
| Strategy | Isolation | Overhead | Best for |
|---|---|---|---|
| Separate collections | Strong — no cross-tenant risk | More collections to manage | High-security, regulated industries |
| Metadata filter | Logical — relies on query discipline | Single collection, simpler ops | Many small tenants, lower risk |
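Since the metadata-filter strategy relies on query discipline, it can help to build every filter through one function that always injects the tenant clause. The sketch below assumes a `tenant_id` metadata field as in the shared-collection example; `tenant_where` is a hypothetical helper name.

```python
from typing import Optional

def tenant_where(tenant_id: str, extra: Optional[dict] = None) -> dict:
    """Combine a mandatory tenant clause with an optional caller-supplied filter."""
    if not tenant_id:
        raise ValueError("tenant_id is required")
    tenant_clause = {"tenant_id": tenant_id}
    if not extra:
        return tenant_clause
    # $and guarantees the tenant clause cannot be bypassed by the extra filter
    return {"$and": [tenant_clause, extra]}
```

The returned dictionary would be passed as the `where` argument, so application code never writes a query that omits the tenant restriction.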
Embedding consistency means using the exact same embedding model and version for both indexing (adding documents) and querying. If you embed documents with model A but query with model B, the resulting vectors live in incompatible geometric spaces — similarity distances become meaningless and retrieval quality collapses.
import chromadb
from chromadb.utils import embedding_functions
client = chromadb.PersistentClient(path="./consistency_demo")
# CORRECT: same embedding function for add and query
ef = embedding_functions.SentenceTransformerEmbeddingFunction(
model_name="all-MiniLM-L6-v2"
)
collection = client.get_or_create_collection(
"correct_usage",
embedding_function=ef, # stored on collection
)
collection.add(
documents=["Hello world"],
ids=["d1"],
)
# query() automatically uses the same ef stored on the collection
results = collection.query(query_texts=["greetings"], n_results=1)
# Works correctly — ef is applied to both document and query
# ---
# PITFALL 1: switching models between sessions
# Session 1: add with all-MiniLM-L6-v2 (384 dims)
# Session 2: accidentally use all-mpnet-base-v2 (768 dims) → dimension mismatch error!
# PITFALL 2: updating embedding model version
# Model v1.0 and v1.1 may produce different vector spaces
# Always re-embed ALL documents when upgrading the embedding model
# BEST PRACTICE: store the model name in collection metadata
collection_safe = client.get_or_create_collection(
"safe_collection",
embedding_function=ef,
metadata={
"hnsw:space": "cosine",
"embedding_model": "all-MiniLM-L6-v2", # document which model was used
"embedding_dim": "384",
},
)
# On load, verify the model matches what is stored:
meta = collection_safe.metadata
print(meta["embedding_model"]) # "all-MiniLM-L6-v2"
print(meta["embedding_dim"]) # "384"
# When you need to upgrade the embedding model:
# 1. Create a NEW collection with the new model
# 2. Re-embed and re-insert all documents
# 3. Run validation queries to confirm quality
# 4. Delete the old collection
| Check | Why |
|---|---|
| Same model name | Different models produce vectors in different spaces |
| Same model version | Even minor version updates can shift the vector space |
| Same preprocessing | Lowercasing, truncation, etc. must be identical |
| Store model name in metadata | Documents which model was used for future reference |
| Re-embed on model upgrade | Old and new vectors cannot coexist in the same collection |
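The "store the model name in metadata" check can be enforced at startup rather than left to convention. The sketch below assumes the `embedding_model` and `embedding_dim` metadata keys used in the example above; `check_embedding_config` is a hypothetical helper, not a ChromaDB API.

```python
def check_embedding_config(collection_metadata, model_name: str, dim: int) -> None:
    """Raise ValueError if the recorded model or dimension differs from the runtime one."""
    meta = collection_metadata or {}
    problems = []
    stored_model = meta.get("embedding_model")
    if stored_model is not None and stored_model != model_name:
        problems.append(f"model mismatch: stored {stored_model!r}, runtime {model_name!r}")
    stored_dim = meta.get("embedding_dim")
    # The dimension may have been stored as a string, so compare as integers
    if stored_dim is not None and int(stored_dim) != dim:
        problems.append(f"dimension mismatch: stored {stored_dim}, runtime {dim}")
    if problems:
        raise ValueError("; ".join(problems))
```

Calling this with `collection.metadata` right after opening a collection turns a silent quality regression into an immediate, explicit failure.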
For production or multi-process environments, run ChromaDB as a standalone HTTP server and connect every client through chromadb.HttpClient(). A single server process then owns the data directory, so many clients, including ones written in other languages, can share one database without each process opening the SQLite file itself.
# --- SERVER SIDE ---
# Install: pip install chromadb
# Start the server from the command line:
# chroma run --path ./chroma_data --port 8000 --host 0.0.0.0
import chromadb
# --- CLIENT SIDE ---
client = chromadb.HttpClient(
host="localhost",
port=8000,
)
# Verify server is reachable
client.heartbeat() # returns a timestamp; raises an exception if the server is unreachable
# Usage is identical to PersistentClient
collection = client.get_or_create_collection(
"shared_docs",
metadata={"hnsw:space": "cosine"},
)
collection.add(
documents=["Shared document from client 1"],
ids=["s1"],
)
results = collection.query(query_texts=["shared content"], n_results=1)
print(results["documents"])
# With authentication (chromadb server configured with auth)
client_auth = chromadb.HttpClient(
host="my-server.example.com",
port=443,
ssl=True,
headers={"Authorization": "Bearer my-token"},
)
# docker-compose.yml — containerised ChromaDB server
# version: "3.9"
# services:
#   chromadb:
#     image: chromadb/chroma:latest
#     ports:
#       - "8000:8000"
#     volumes:
#       - chroma_data:/chroma/chroma
#     environment:
#       - IS_PERSISTENT=TRUE
#       - ANONYMIZED_TELEMETRY=FALSE
# volumes:
#   chroma_data:
| Mode | Concurrency | Network | Use case |
|---|---|---|---|
| EphemeralClient | Single process only | None | Tests, notebooks |
| PersistentClient | Single writer only | None | Local scripts, dev |
| HttpClient | Multiple clients | HTTP/HTTPS | Production, microservices |
upsert() is the idempotent write operation in ChromaDB: it inserts a document if the ID does not exist, or updates it if the ID already exists. This makes it safe to call repeatedly without checking whether a document has been indexed before — a critical property for ETL pipelines, scheduled sync jobs, and incremental indexing.
import chromadb
from datetime import datetime
client = chromadb.PersistentClient(path="./upsert_demo")
col = client.get_or_create_collection("products")
# Pattern 1: Safe initial load
# Can re-run the script without duplicate ID errors
def sync_products(products: list[dict]):
    col.upsert(
        documents=[p["description"] for p in products],
        ids=[str(p["id"]) for p in products],
        metadatas=[
            {"name": p["name"], "price": p["price"], "updated": int(datetime.now().timestamp())}
            for p in products
        ],
    )
products_v1 = [
{"id": 1, "name": "Widget", "description": "A blue widget", "price": 9.99},
{"id": 2, "name": "Gadget", "description": "A red gadget", "price": 14.99},
]
sync_products(products_v1) # inserts both
print(col.count()) # 2
# Product 1 description changed — upsert handles it cleanly
products_v2 = [
{"id": 1, "name": "Widget", "description": "An improved blue widget v2", "price": 11.99},
{"id": 3, "name": "Doohickey", "description": "A green doohickey", "price": 4.99},
]
sync_products(products_v2) # updates id=1, inserts id=3
print(col.count()) # 3
# Verify the update
result = col.get(ids=["1"])
print(result["documents"][0]) # "An improved blue widget v2"
print(result["metadatas"][0]["price"]) # 11.99
# Pattern 2: Incremental indexing — only upsert changed documents
def incremental_sync(items, last_sync_ts: int):
    changed = [i for i in items if i["updated_at"] > last_sync_ts]
    if changed:
        col.upsert(
            documents=[i["body"] for i in changed],
            ids=[i["id"] for i in changed],
            metadatas=[{"updated_at": i["updated_at"]} for i in changed],
        )
| Scenario | Use |
|---|---|
| First-time bulk load with guaranteed unique IDs | add() — faster; never overwrites an existing ID (a clash is rejected or skipped, depending on version) |
| Recurring sync job (daily/hourly) | upsert() — safe to re-run without cleanup |
| User-triggered document update | upsert() — don't need to check if doc exists first |
| Append-only event log | add() — existing records must never be overwritten by a later write |
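Incremental sync does not have to rely on timestamps. An alternative sketch, under the assumption that a content hash per document ID is kept somewhere (for example in document metadata), is to re-embed only the documents whose text actually changed. `content_hash` and `plan_upsert` are hypothetical helper names.

```python
import hashlib

def content_hash(text: str) -> str:
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_upsert(items: dict[str, str], known_hashes: dict[str, str]) -> dict[str, str]:
    """Return {id: text} for items that are new or whose text changed."""
    return {
        doc_id: text
        for doc_id, text in items.items()
        if known_hashes.get(doc_id) != content_hash(text)
    }
```

Only the returned subset would then be passed to `upsert()`, which avoids paying for embeddings of unchanged documents on every sync run.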
Collection-level metadata (set via create_collection(metadata=...)) stores configuration about the collection itself. Document-level metadata (set per document via add(metadatas=[...])) enables filtered retrieval. Both need thoughtful design for maintainable production systems.
import chromadb
from datetime import datetime
client = chromadb.PersistentClient(path="./prod_db")
# Good collection-level metadata: document operational details
collection = client.get_or_create_collection(
name="support_tickets_v2",
metadata={
# HNSW config
"hnsw:space": "cosine",
"hnsw:construction_ef": 200,
"hnsw:search_ef": 100,
# Operational metadata
"embedding_model": "text-embedding-3-small",
"embedding_dims": "1536",
"schema_version": "2",
"created_at": "2024-01-15",
"description": "Customer support ticket embeddings for semantic search",
},
)
# Good document-level metadata: filterable, flat, typed
def add_ticket(ticket: dict):
    collection.upsert(
        documents=[ticket["description"]],
        ids=[f"ticket-{ticket['id']}"],
        metadatas=[{
            # Filterable dimensions
            "status": ticket["status"],      # "open"/"closed"/"pending"
            "priority": ticket["priority"],  # "low"/"medium"/"high"
            "category": ticket["category"],  # "billing"/"technical"/"general"
            "agent_id": ticket["agent_id"],  # str identifier
            # Date as Unix timestamp (int) — enables $gt/$lt range queries
            "created_ts": int(datetime.fromisoformat(ticket["created_at"]).timestamp()),
            "year": int(ticket["created_at"][:4]),
            # Boolean stored as a 0/1 int, which works on every version and filters with $eq
            "is_escalated": int(ticket.get("escalated", False)),
        }],
    )
# Effective compound filter
results = collection.query(
query_texts=["payment failed cannot checkout"],
n_results=10,
where={"$and": [
{"status": "open"},
{"priority": {"$in": ["high", "medium"]}},
{"category": "billing"},
{"created_ts":{"$gte": int(datetime(2024, 1, 1).timestamp())}},
]},
)
Key rules: store dates as Unix timestamps (int) for range filtering. Store booleans as 0/1 integers. Keep metadata keys short and snake_case. Document your schema in collection-level metadata so future developers know what fields exist.
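One detail worth making explicit about timestamp filters: `datetime(...).timestamp()` on a naive datetime depends on the machine's local timezone. The sketch below pins timestamps to UTC and builds a half-open range filter; `utc_ts` and `ts_range_where` are hypothetical helpers, and `created_ts` is the field name from the example above.

```python
from datetime import datetime, timezone

def utc_ts(year: int, month: int, day: int) -> int:
    """Unix timestamp for midnight UTC on the given date."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())

def ts_range_where(field: str, start_ts: int, end_ts: int) -> dict:
    """Half-open range [start_ts, end_ts) expressed with $gte and $lt."""
    return {"$and": [{field: {"$gte": start_ts}}, {field: {"$lt": end_ts}}]}

january_2024 = ts_range_where("created_ts", utc_ts(2024, 1, 1), utc_ts(2024, 2, 1))
print(january_2024)
```

Using a half-open range means consecutive months never overlap, and the same helper works for any integer timestamp field.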
FAISS (Facebook AI Similarity Search) and ChromaDB both store and search embedding vectors, but they are designed for very different use cases. FAISS is a low-level library optimised for raw performance; ChromaDB is a higher-level database designed for developer ergonomics and full-stack AI applications.
| Feature | ChromaDB | FAISS |
|---|---|---|
| Type | Vector database (full-stack) | Vector index library (low-level) |
| Storage | Persistent SQLite + HNSW files | In-memory or flat files (manual) |
| Metadata | Built-in key-value filtering | No metadata — must manage separately |
| Documents | Stores original text alongside vectors | Stores vectors only — text management is manual |
| Persistence | Built-in PersistentClient | Manual save/load with faiss.write_index() |
| CRUD | add, get, update, delete, upsert | add, plus remove_ids on some index types; no in-place update |
| API | High-level Python + REST | Low-level Python/C++ bindings |
| Performance | Good for <10M docs | Excellent for 10M+ docs (GPU-accelerated) |
| Embedding function | Built-in (auto-embed text) | You must manage embeddings yourself |
| Best for | RAG apps, prototyping, small-medium scale | High-throughput ML systems, research, scale |
# FAISS — lower level, manage everything manually
import faiss
import numpy as np
# Build index manually
dim = 384
index = faiss.IndexFlatIP(dim) # inner product
vectors = np.random.rand(1000, dim).astype("float32")
faiss.normalize_L2(vectors)
index.add(vectors) # add vectors
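query_vec = np.random.rand(1, dim).astype("float32") # one query vector, shape (1, dim)
faiss.normalize_L2(query_vec) # normalise so inner product equals cosine similarity
D, I = index.search(query_vec, 5) # D: scores, I: row indices of the top 5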
faiss.write_index(index, "index.faiss") # save manually
# ChromaDB — higher level, text in, results out
import chromadb
client = chromadb.Client()
col = client.create_collection("demo")
col.add(documents=["text one", "text two"], ids=["1","2"])
results = col.query(query_texts=["similar text"], n_results=2)
# Embeddings, persistence, metadata all handled automatically
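To show what a flat inner-product index computes conceptually, here is a dependency-free sketch of exhaustive search over normalised vectors. It is an illustration of the idea only, not how FAISS or ChromaDB implement search; `normalize` and `top_k_inner_product` are hypothetical helper names.

```python
import math

def normalize(vector: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in vector]

def top_k_inner_product(query: list[float], vectors: list[list[float]], k: int):
    """Exhaustive search: score every stored vector, keep the k highest scores."""
    q = normalize(query)
    scored = [
        (sum(a * b for a, b in zip(q, normalize(v))), index)
        for index, v in enumerate(vectors)
    ]
    scored.sort(reverse=True)  # highest inner product first
    return scored[:k]

stored = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hits = top_k_inner_product([1.0, 0.1], stored, k=2)
print([index for _score, index in hits])  # [0, 2]
```

This brute-force scan is exact but linear in collection size, which is why both libraries offer approximate indexes such as HNSW for larger datasets.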
ChromaDB raises specific exception types that should be caught and handled gracefully in production applications. Understanding the error hierarchy helps you write resilient ingestion pipelines and retrieval code.
import chromadb
from chromadb.errors import (
InvalidCollectionException,
IDAlreadyExistsError,
InvalidDimensionException,
)
client = chromadb.PersistentClient(path="./error_demo")
# NOTE: exception class names differ between ChromaDB releases;
# confirm them against chromadb.errors in your installed version.
# --- Error 1: Collection not found ---
try:
    col = client.get_collection("does_not_exist")
except InvalidCollectionException as e:
    print(f"Collection missing: {e}")
    col = client.create_collection("does_not_exist")  # create it
# --- Error 2: Duplicate ID ---
# (some releases skip an existing ID silently instead of raising)
col.add(documents=["Original doc"], ids=["doc-1"])
try:
    col.add(documents=["Duplicate doc"], ids=["doc-1"])
except IDAlreadyExistsError:
    print("ID already exists — use upsert() instead")
    col.upsert(documents=["Updated doc"], ids=["doc-1"])  # safe
# --- Error 3: Dimension mismatch ---
# Occurs when pre-computed embeddings don't match collection's embedding dimensions
col2 = client.create_collection("fixed_dim")
col2.add(embeddings=[[0.1, 0.2, 0.3]], documents=["Doc"], ids=["x"])
try:
    col2.add(embeddings=[[0.1, 0.2]], documents=["Wrong dim"], ids=["y"])  # 2-dim
except InvalidDimensionException as e:
    print(f"Dimension mismatch: {e}")
# --- Error 4: Connection error (HttpClient) ---
try:
    remote = chromadb.HttpClient(host="bad-host", port=9999)
    remote.heartbeat()
except Exception as e:
    print(f"Server unreachable: {e}")
# --- Production pattern: retry wrapper ---
import time
from functools import wraps
def with_retry(max_attempts=3, delay=1.0):
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return fn(*args, **kwargs)
                except Exception as e:
                    if attempt == max_attempts - 1:
                        raise
                    print(f"Attempt {attempt+1} failed: {e}. Retrying...")
                    time.sleep(delay * (attempt + 1))
        return wrapper
    return decorator
@with_retry(max_attempts=3)
def safe_add(collection, documents, ids):
    collection.upsert(documents=documents, ids=ids)
| Exception | Cause | Fix |
|---|---|---|
| InvalidCollectionException | get_collection() on non-existent name | Use get_or_create_collection() |
| IDAlreadyExistsError | add() with duplicate IDs | Use upsert() for idempotent writes |
| InvalidDimensionException | Pre-computed embeddings wrong size | Match dimensions to collection's model |
| ValueError | Empty IDs, bad metadata types | Validate inputs before calling ChromaDB |
| ConnectionError / requests exception | HttpClient cannot reach server | Check server health, retry with backoff |
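Retrying every exception can hide input bugs such as a ValueError from bad metadata, which will fail identically on each attempt. A narrower sketch is shown below: it retries only exception types the caller declares transient and uses exponential backoff. `retry_transient` is a hypothetical helper, and the default exception tuple uses Python built-ins as placeholders; substitute whatever your HTTP client actually raises.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")

def retry_transient(
    fn: Callable[[], T],
    transient: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying only on the given transient exception types."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except transient:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))  # 0.5s, 1s, 2s, ...
```

Any exception outside the `transient` tuple propagates immediately, so validation errors surface on the first call. The `sleep` parameter exists so tests can inject a no-op instead of waiting.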
A PersistentClient database is simply a directory on disk. Backing it up is as straightforward as copying that directory — but you must ensure no writes are occurring during the copy to avoid a corrupted SQLite file.
import chromadb
import shutil
import os
from datetime import datetime
DB_PATH = "./my_chroma_db"
BACKUP_DIR = "./backups"
# --- Backup strategy 1: Simple directory copy ---
# SAFE when: no active PersistentClient writes during the copy
os.makedirs(BACKUP_DIR, exist_ok=True)
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
backup_path = os.path.join(BACKUP_DIR, f"chroma_backup_{timestamp}")
shutil.copytree(DB_PATH, backup_path)
print(f"Backup saved to {backup_path}")
# --- Backup strategy 2: SQLite online backup (safe during reads) ---
import sqlite3
def backup_sqlite(db_path: str, backup_path: str):
    """SQLite online backup — safe even with active readers."""
    os.makedirs(backup_path, exist_ok=True)  # destination must exist before connecting
    src = sqlite3.connect(os.path.join(db_path, "chroma.sqlite3"))
    dst = sqlite3.connect(os.path.join(backup_path, "chroma.sqlite3"))
    with dst:
        # progress is called with (status, remaining, total) after each batch of pages
        src.backup(
            dst,
            pages=100,
            progress=lambda status, remaining, total: print(
                f"Copied {total - remaining} of {total} pages"
            ),
        )
    dst.close()
    src.close()
    # Also copy the HNSW index binary files
    for root, dirs, files in os.walk(db_path):
        for f in files:
            if f != "chroma.sqlite3":
                rel = os.path.relpath(root, db_path)
                dest_dir = os.path.join(backup_path, rel)
                os.makedirs(dest_dir, exist_ok=True)
                shutil.copy2(os.path.join(root, f), os.path.join(dest_dir, f))
# --- Restore ---
def restore_backup(backup_path: str, restore_path: str):
    if os.path.exists(restore_path):
        shutil.rmtree(restore_path)  # remove current
    shutil.copytree(backup_path, restore_path)
    print(f"Restored from {backup_path} to {restore_path}")
    # Verify restored database
    client = chromadb.PersistentClient(path=restore_path)
    for col in client.list_collections():
        # list_collections() yields names or Collection objects depending on version
        name = col if isinstance(col, str) else col.name
        print(f" {name}: {client.get_collection(name).count()} documents")
For the HttpClient / server mode: stop the ChromaDB server before copying the data directory, or use SQLite's online backup API. Never copy a SQLite file while it has active writers — this can produce a corrupted backup.
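After a plain directory copy, it is cheap to confirm that the backup is byte-identical to the source. The sketch below compares SHA-256 digests of every file; `dir_manifest` and `backup_matches` are hypothetical helpers. It assumes no writes happened between the copy and the check, and it does not apply to the SQLite online backup path, which can legitimately produce a byte-different database file.

```python
import hashlib
import os

def dir_manifest(root: str) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            digest = hashlib.sha256()
            with open(full, "rb") as fh:
                # Read in 1 MiB blocks so large index files are not loaded at once
                for block in iter(lambda: fh.read(1 << 20), b""):
                    digest.update(block)
            manifest[os.path.relpath(full, root)] = digest.hexdigest()
    return manifest

def backup_matches(source_dir: str, backup_dir: str) -> bool:
    """True if both directories contain the same files with the same contents."""
    return dir_manifest(source_dir) == dir_manifest(backup_dir)
```

Running this right after `shutil.copytree` gives an early warning if a file was truncated or modified mid-copy.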
ChromaDB stores document text and vectors persistently, but many releases do not persist which embedding function produced them. When you reopen a PersistentClient, re-supply the same embedding function to the collection; otherwise ChromaDB may fall back to its default model and embed queries into a different vector space than the stored documents.
import chromadb
from chromadb.utils import embedding_functions
import os
DB_PATH = "./persistent_ef_demo"
# === SESSION 1: Create and populate collection ===
client1 = chromadb.PersistentClient(path=DB_PATH)
ef_openai = embedding_functions.OpenAIEmbeddingFunction(
api_key=os.environ["OPENAI_API_KEY"],
model_name="text-embedding-3-small",
)
col1 = client1.get_or_create_collection(
name="my_docs",
embedding_function=ef_openai, # set the EF
metadata={"hnsw:space": "cosine",
"embedding_model": "text-embedding-3-small"}, # document it
)
col1.add(documents=["ChromaDB is great"], ids=["d1"])
print("Session 1 done, process exits...")
del client1, col1
# === SESSION 2: Reopen — MUST re-supply the same embedding function ===
client2 = chromadb.PersistentClient(path=DB_PATH)
# WRONG: ChromaDB defaults to all-MiniLM-L6-v2 (384-dim)
# Querying with a different model produces wrong results!
# col_wrong = client2.get_collection("my_docs") # DO NOT DO THIS
# CORRECT: Re-supply the exact same embedding function
ef_openai_v2 = embedding_functions.OpenAIEmbeddingFunction(
api_key=os.environ["OPENAI_API_KEY"],
model_name="text-embedding-3-small", # must match session 1
)
col2 = client2.get_collection(
name="my_docs",
embedding_function=ef_openai_v2, # required!
)
results = col2.query(query_texts=["vector databases"], n_results=1)
print(results["documents"]) # correct result
# TIP: Read model name from collection metadata to avoid hardcoding
saved_model = col2.metadata.get("embedding_model", "all-MiniLM-L6-v2")
print(f"Using model: {saved_model}")
| Scenario | Problem | Solution |
|---|---|---|
| Reopen collection without EF | Defaults to all-MiniLM-L6-v2, mismatches stored vectors | Always pass embedding_function= on get_collection() |
| Upgrade embedding model | Old vectors incompatible with new model | Create new collection, re-embed all docs, migrate |
| Team member uses different EF | Silent quality degradation | Store model name in collection metadata, document in README |
ChromaDB query results include a distances field. The interpretation depends on the distance metric. Raw distances are not directly comparable across metrics, but they can be normalised into a [0, 1] relevance score for display or thresholding.
import chromadb
client = chromadb.Client()
col = client.create_collection("relevance_demo", metadata={"hnsw:space": "cosine"})
col.add(
documents=[
"ChromaDB is an open-source vector database",
"Python is a popular programming language",
"The Eiffel Tower is in Paris France",
],
ids=["d1","d2","d3"],
)
results = col.query(
query_texts=["vector database for AI"],
n_results=3,
include=["documents","distances"],
)
raw_distances = results["distances"][0]
print("Raw cosine distances:", raw_distances)
# e.g. [0.18, 0.72, 1.31]
# cosine distance: 0 = identical, 2 = completely opposite
# Convert cosine distance to similarity score [0, 1]
def cosine_distance_to_score(distance: float) -> float:
    """cosine distance [0,2] → relevance score [0,1]"""
    return 1 - (distance / 2)
for doc, dist in zip(results["documents"][0], raw_distances):
    score = cosine_distance_to_score(dist)
    print(f" Score: {score:.3f} | {doc[:50]}")
# Score: 0.910 | ChromaDB is an open-source vector database
# Score: 0.640 | Python is a popular programming language
# Score: 0.345 | The Eiffel Tower is in Paris France
# Threshold: only return results above minimum relevance
MIN_SCORE = 0.7
filtered = [
    (doc, cosine_distance_to_score(dist))
    for doc, dist in zip(results["documents"][0], raw_distances)
    if cosine_distance_to_score(dist) >= MIN_SCORE
]
print(f"\nResults above {MIN_SCORE} threshold: {len(filtered)}")
for doc, score in filtered:
    print(f" {score:.3f}: {doc}")
| Metric | Range | Most similar | Conversion to [0,1] score |
|---|---|---|---|
| cosine | 0 to 2 | 0 (identical) | score = 1 - distance/2 |
| l2 (squared Euclidean) | 0 to ∞ | 0 (identical) | score = 1 / (1 + distance) |
| ip (inner product, reported as 1 - dot product) | 0 to 2 for unit-length vectors; unbounded otherwise | Smallest value (0 for identical unit vectors) | score = 1 - distance/2 (unit-length vectors) |
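To see where these ranges come from, the three distances can be computed by hand on toy vectors. The sketch below assumes the definitions used by the underlying HNSW library as I understand them: cosine distance is 1 minus cosine similarity, l2 is the squared Euclidean distance, and ip is 1 minus the dot product. The function names are hypothetical.

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity: 0 for same direction, 2 for opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def squared_l2_distance(a: list[float], b: list[float]) -> float:
    """Sum of squared component differences (no square root)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def inner_product_distance(a: list[float], b: list[float]) -> float:
    """1 - dot product; equals cosine distance when both vectors have unit length."""
    return 1.0 - sum(x * y for x, y in zip(a, b))

east, north, west = [1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]
for name, other in [("same", east), ("orthogonal", north), ("opposite", west)]:
    print(name, cosine_distance(east, other),
          squared_l2_distance(east, other),
          inner_product_distance(east, other))
```

For unit-length vectors the cosine and inner-product distances coincide, which is why the same `1 - distance/2` conversion can be applied to both in that case.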
ChromaDB does not impose hard document count limits, but practical performance degrades at different thresholds depending on storage mode, hardware, and HNSW configuration. Understanding these helps you plan capacity and know when to consider alternatives.
| Collection size | Storage mode | Typical behaviour |
|---|---|---|
| < 100K docs | PersistentClient or HttpClient | Excellent — sub-10ms query latency |
| 100K – 1M docs | HttpClient (server mode) | Good — 10–100ms queries with default settings |
| 1M – 10M docs | HttpClient + HNSW tuning | Acceptable — tune hnsw:M and hnsw:search_ef |
| > 10M docs | Consider FAISS or Weaviate | ChromaDB may struggle — these are better at extreme scale |
import chromadb
import time
client = chromadb.Client()
col = client.create_collection(
    "scale_test",
    metadata={
        "hnsw:space": "cosine",
        "hnsw:construction_ef": 200,  # higher quality index
        "hnsw:search_ef": 100,  # higher recall at query time
        "hnsw:M": 32,  # more connections per node
    },
)
# Batch insert 50,000 documents
BATCH = 500
for i in range(0, 50_000, BATCH):
    col.add(
        documents=[f"Document about topic {j % 100}" for j in range(i, i + BATCH)],
        ids=[str(j) for j in range(i, i + BATCH)],
    )
print(f"Collection has {col.count()} documents")
# Measure query latency
start = time.perf_counter()
results = col.query(query_texts=["topic 42"], n_results=10)
elapsed = time.perf_counter() - start
print(f"Query latency: {elapsed*1000:.1f}ms")
# Memory footprint estimate:
# 384-dim float32 vectors: 384 * 4 bytes = 1,536 bytes ≈ 1.5 KB per doc
# 50K docs * 1.5 KB = ~75 MB just for vectors
# HNSW graph adds ~20-30% overhead → ~100 MB total for 50K docs
Memory rule of thumb: each 384-dim vector requires ~1.5 KB. A 1M document collection with 384-dim embeddings needs ~1.5 GB just for vectors, plus HNSW graph overhead (~25%). Plan memory accordingly when deploying the ChromaDB server.
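The rule of thumb can be turned into a small calculator for capacity planning. This is a back-of-the-envelope sketch: the function name is hypothetical, the 25% HNSW overhead is the estimate quoted above rather than a measured value, and the result counts only vectors and graph overhead, not documents or metadata.

```python
def estimate_index_memory_mb(
    n_docs: int,
    dim: int = 384,
    bytes_per_value: int = 4,      # float32
    hnsw_overhead: float = 0.25,   # assumed graph overhead, per the rule of thumb
) -> float:
    """Rough memory estimate in MB (10^6 bytes) for vectors plus HNSW graph."""
    if n_docs < 0 or dim < 1:
        raise ValueError("n_docs must be >= 0 and dim must be >= 1")
    vector_bytes = n_docs * dim * bytes_per_value
    return vector_bytes * (1 + hnsw_overhead) / 1_000_000


for n in (50_000, 1_000_000, 10_000_000):
    print(f"{n:>10,} docs: ~{estimate_index_memory_mb(n):,.0f} MB")
```

Switching to a larger embedding model changes `dim`, and the estimate scales linearly with it.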
ChromaDB's similarity search makes it straightforward to detect semantic duplicates — documents that express the same idea with different wording. Before inserting a new document, query ChromaDB to see if a highly similar document already exists and decide whether to skip or replace it.
import chromadb
from typing import Optional
client = chromadb.Client()
col = client.create_collection(
    "dedup_store",
    metadata={"hnsw:space": "cosine"},
)
# Score threshold: tune based on your use case.
# score = 1 - distance/2, so a score of 0.95 equals a cosine similarity of 0.90
DUPLICATE_THRESHOLD = 0.95  # score >= 0.95 → treat as duplicate
def cosine_dist_to_score(d: float) -> float:
    return 1 - d / 2
def add_if_unique(
    collection,
    document: str,
    doc_id: str,
    metadata: Optional[dict] = None,
    threshold: float = DUPLICATE_THRESHOLD,
) -> bool:
    """Returns True if document was added, False if it was a duplicate."""
    # Pass None rather than {}: some ChromaDB versions reject empty metadata dicts
    metadatas = [metadata] if metadata else None
    if collection.count() == 0:
        collection.add(documents=[document], ids=[doc_id], metadatas=metadatas)
        return True
    # Query for the nearest existing document
    results = collection.query(
        query_texts=[document],
        n_results=1,
        include=["documents", "distances"],
    )
    nearest_dist = results["distances"][0][0]
    nearest_score = cosine_dist_to_score(nearest_dist)
    nearest_doc = results["documents"][0][0]
    if nearest_score >= threshold:
        print(f"DUPLICATE detected (score={nearest_score:.3f}):")
        print(f" New: {document[:60]}")
        print(f" Existing: {nearest_doc[:60]}")
        return False  # skip insertion
    collection.add(documents=[document], ids=[doc_id], metadatas=metadatas)
    return True
# Test deduplication
phrases = [
    ("ChromaDB is a vector database for AI apps.", "p1"),
    ("Chroma DB is a vector store built for AI applications.", "p2"),  # near-dup of p1
    ("Python is great for machine learning.", "p3"),
]
for text, pid in phrases:
    added = add_if_unique(col, text, pid)
    print(f"Added: {added} | {text[:40]}")
print(f"\nFinal collection size: {col.count()}")  # 2 if p2 scores above the threshold against p1
Use cases: deduplication during web scraping, preventing duplicate knowledge base entries in RAG systems, clustering similar customer support tickets, and identifying near-identical product descriptions in e-commerce catalogues.
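When tuning the threshold, note that the [0, 1] score produced by `1 - d / 2` is not the cosine similarity itself: cosine distance is `1 - similarity`, so the score equals `(1 + similarity) / 2`. The sketch below makes the conversions explicit; the function names are hypothetical and it assumes the collection was created with `hnsw:space` set to `cosine`.

```python
def similarity_to_distance(similarity: float) -> float:
    """Cosine similarity in [-1, 1] → cosine distance in [0, 2]."""
    return 1.0 - similarity


def similarity_to_score(similarity: float) -> float:
    """Cosine similarity → the [0, 1] score used by the dedup example."""
    return 1.0 - similarity_to_distance(similarity) / 2.0


def score_to_similarity(score: float) -> float:
    """Inverse mapping: [0, 1] score → cosine similarity."""
    return 2.0 * score - 1.0


# A score threshold of 0.95 is the same cut-off as a cosine similarity of 0.90.
print(round(score_to_similarity(0.95), 3))
# To require a true cosine similarity of 0.95, use a score threshold of 0.975.
print(round(similarity_to_score(0.95), 3))
```

Picking the threshold in whichever unit the team reasons about, then converting once, avoids silently loosening or tightening the duplicate check.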
ChromaDB does not have a direct clear() or truncate() method. The idiomatic way to reset a collection is to delete it and recreate it with the same parameters. For selective deletion, use delete() with ID lists or where filters.
import chromadb
from typing import Optional
client = chromadb.PersistentClient(path="./reset_demo")
# Setup
col = client.get_or_create_collection(
    "my_col",
    metadata={"hnsw:space": "cosine", "version": "1"},
)
col.add(
    documents=[f"Document {i}" for i in range(100)],
    ids=[str(i) for i in range(100)],
    metadatas=[{"batch": i // 10} for i in range(100)],
)
print(col.count())  # 100
# --- Option 1: Reset (delete all + recreate) ---
def reset_collection(client, name: str, metadata: Optional[dict] = None):
    """Delete and recreate a collection, preserving its metadata configuration."""
    saved_meta = None
    try:
        saved_meta = client.get_collection(name).metadata
        client.delete_collection(name)
    except Exception:
        # Collection did not exist yet (the exception type varies by ChromaDB version)
        pass
    return client.create_collection(
        name=name,
        # Fall back to None so an empty metadata dict is never passed
        metadata=metadata or saved_meta or None,
    )
col = reset_collection(client, "my_col")
print(col.count())  # 0
# Re-add fresh data after reset
col.add(documents=["Fresh start"], ids=["new-1"])
# --- Option 2: Selective delete by filter ---
col2 = client.get_or_create_collection("selective")
col2.add(
    documents=[f"Doc {i}" for i in range(20)],
    ids=[str(i) for i in range(20)],
    metadatas=[{"batch": i // 5} for i in range(20)],
)
# Delete only batch 0 (documents 0-4)
col2.delete(where={"batch": 0})
print(col2.count())  # 15 remaining
# Delete specific IDs
col2.delete(ids=["5", "6", "7"])
print(col2.count())  # 12 remaining
# Delete ALL via get + delete (when no useful metadata filter exists)
all_ids = col2.get(include=[])["ids"]  # get all IDs
if all_ids:
    col2.delete(ids=all_ids)
print(col2.count())  # 0
| Method | When to use | Preserves schema? |
|---|---|---|
| delete_collection + create_collection | Full reset — cleanest approach | Yes (manual) |
| delete(where={...}) | Selective clear by metadata condition | Yes |
| delete(ids=[...]) | Remove specific known documents | Yes |
| get all IDs then delete | Clear all without metadata | Yes |
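For large collections, the "get all IDs then delete" approach is easier to control when the IDs are removed in chunks rather than in one call. The following is a minimal sketch: `chunked` and `delete_in_batches` are hypothetical helpers, the batch size of 500 is an assumption, and the only ChromaDB call used is `collection.delete(ids=...)`.

```python
from typing import Iterator, List, Sequence


def chunked(ids: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of ids, each at most batch_size long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(ids), batch_size):
        yield list(ids[start:start + batch_size])


def delete_in_batches(collection, ids: Sequence[str], batch_size: int = 500) -> int:
    """Delete ids from the collection chunk by chunk; return how many were submitted."""
    submitted = 0
    for batch in chunked(ids, batch_size):
        collection.delete(ids=batch)
        submitted += len(batch)
    return submitted
```

A typical call would be `delete_in_batches(col2, col2.get(include=[])["ids"])`.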
By default, ChromaDB sends anonymised usage telemetry to help the development team understand how the product is used. In enterprise or privacy-sensitive environments this should be disabled. ChromaDB also supports several configuration settings via environment variables and the Settings class.
import chromadb
from chromadb.config import Settings
import os
# --- Option 1: Disable telemetry via environment variable ---
os.environ["ANONYMIZED_TELEMETRY"] = "False"
# --- Option 2: Disable via Settings class ---
client = chromadb.PersistentClient(
    path="./my_db",
    settings=Settings(
        anonymized_telemetry=False,
        allow_reset=True,  # enables client.reset(), which wipes all data!
    ),
)
# --- Option 3: Disable telemetry for HttpClient ---
client_http = chromadb.HttpClient(
    host="localhost",
    port=8000,
    settings=Settings(anonymized_telemetry=False),
)
# Settings available via Settings class
all_settings = Settings(
    anonymized_telemetry=False,
    allow_reset=False,  # default: False, prevents accidental wipe
    # chroma_db_impl="duckdb+parquet",  # legacy v0.3 setting (not used in v0.4+)
)
# allow_reset=True enables client.reset(), which DELETES ALL DATA
# Only use in testing environments!
client_test = chromadb.EphemeralClient(
    settings=Settings(allow_reset=True)
)
client_test.create_collection("temp")
client_test.reset()  # wipes everything; use in test fixtures
print(client_test.list_collections())  # []
| Setting | Default | Notes |
|---|---|---|
| anonymized_telemetry | True | Set False in production for privacy |
| allow_reset | False | Set True only in test environments — reset() wipes all data |
| ANONYMIZED_TELEMETRY env var | True | Environment variable alternative to Settings class |
Moving a ChromaDB application from prototype to production involves several architectural decisions around storage, concurrency, reliability, and observability. This checklist covers the key concerns.
| Area | Recommendation |
|---|---|
| Storage mode | Use HttpClient connecting to a ChromaDB server — not PersistentClient in multi-process apps |
| Embedding consistency | Store the embedding model name in collection metadata; always re-supply the embedding function on get_collection() |
| Distance metric | Set hnsw:space='cosine' at collection creation for text; cannot change later |
| Backups | Schedule regular directory snapshots or SQLite online backups; test restore procedure |
| Telemetry | Set ANONYMIZED_TELEMETRY=False for privacy |
| Batching | Insert in batches of 100–500; use upsert() for idempotent pipelines |
| Error handling | Catch IDAlreadyExistsError, InvalidCollectionException; implement retry logic for HttpClient |
| HNSW tuning | Increase hnsw:construction_ef to 200 and hnsw:search_ef to 50–100 for large collections |
| Metadata schema | Use ints for dates/booleans; document schema in collection metadata |
| Security | Run server behind a reverse proxy with TLS; add auth headers for HttpClient |
| Monitoring | Log query latency, collection size, and embedding function errors |
| Scale planning | Plan ~1.5 KB/doc for 384-dim vectors + 25% HNSW overhead; consider alternatives above 10M docs |
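The metadata recommendation in the checklist (ints for dates and booleans) can be applied with a small normaliser before calling `add()` or `upsert()`. This is a sketch under stated assumptions: `to_filterable_metadata` is a hypothetical helper, dates are stored as Unix epoch seconds, and naive datetimes are assumed to be UTC.

```python
from datetime import datetime, timezone
from typing import Any, Dict


def to_filterable_metadata(record: Dict[str, Any]) -> Dict[str, Any]:
    """Convert datetimes to epoch-second ints and booleans to 0/1."""
    out: Dict[str, Any] = {}
    for key, value in record.items():
        if isinstance(value, bool):  # check bool before int: bool is an int subclass
            out[key] = int(value)
        elif isinstance(value, datetime):
            if value.tzinfo is None:
                value = value.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
            out[key] = int(value.timestamp())
        elif isinstance(value, (str, int, float)):
            out[key] = value
        else:
            raise TypeError(f"unsupported metadata type for {key!r}: {type(value).__name__}")
    return out


meta = to_filterable_metadata(
    {"published": datetime(2024, 1, 1, tzinfo=timezone.utc), "reviewed": True, "source": "wiki"}
)
print(meta)
```

Storing dates as integers makes range filters possible, for example `where={"published": {"$gte": 1704067200}}`.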
# Minimal production-ready ChromaDB setup
import chromadb
from chromadb.utils import embedding_functions
from chromadb.config import Settings
import os
import logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
EMBEDDING_MODEL = "text-embedding-3-small"
COLLECTION_NAME = "prod_knowledge_base"
def create_client():
    return chromadb.HttpClient(
        host=os.environ["CHROMA_HOST"],
        port=int(os.environ.get("CHROMA_PORT", 8000)),
        settings=Settings(anonymized_telemetry=False),
    )
def get_collection(client):
    ef = embedding_functions.OpenAIEmbeddingFunction(
        api_key=os.environ["OPENAI_API_KEY"],
        model_name=EMBEDDING_MODEL,
    )
    return client.get_or_create_collection(
        name=COLLECTION_NAME,
        embedding_function=ef,
        metadata={
            "hnsw:space": "cosine",
            "hnsw:construction_ef": 200,
            "hnsw:search_ef": 100,
            "embedding_model": EMBEDDING_MODEL,
        },
    )
client = create_client()
client.heartbeat()  # fail fast if server is unreachable
collection = get_collection(client)
logger.info(f"Connected to collection with {collection.count()} documents")
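The checklist also recommends retry logic for HttpClient calls. One way to sketch this is a generic wrapper with exponential backoff. The helper below is hypothetical and not part of ChromaDB; which exception types a failed request raises depends on the client version, so `retry_on` is left for the caller to set.

```python
import logging
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")
retry_log = logging.getLogger("chroma_retry")


def with_retries(
    operation: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, retrying on the given exceptions with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on as exc:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            delay = base_delay * 2 ** (attempt - 1)  # 0.5s, 1s, 2s, ...
            retry_log.warning(
                "attempt %d/%d failed (%s); retrying in %.1fs",
                attempt, attempts, exc, delay,
            )
            sleep(delay)
    raise AssertionError("unreachable")
```

A query could then be wrapped as `with_retries(lambda: collection.query(query_texts=["..."], n_results=5))`. Narrowing `retry_on` to connection-level errors keeps validation errors from being retried pointlessly.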
