Database / ChromaDB Interview Questions

1. What is ChromaDB and what problem does it solve?
2. What are embeddings and why are they central to how ChromaDB works?
3. What distance metrics does ChromaDB support and how do you choose between them?
4. What is a ChromaDB collection and how do you create, list, get, and delete collections?
5. How do you add documents to a ChromaDB collection?
6. How do you query a ChromaDB collection for similar documents?
7. How do you retrieve, update, and delete specific documents in ChromaDB?
8. How do you filter query results using metadata in ChromaDB?
9. What is the difference between ChromaDB's in-memory and persistent storage modes?
10. What is ChromaDB's default embedding function and how does it work?
11. How do you use the OpenAI embedding function with ChromaDB?
12. How do you use HuggingFace models as embedding functions in ChromaDB?
13. How do you create a custom embedding function for ChromaDB?
14. How does ChromaDB's PersistentClient store data on disk, and what are its limitations?
15. What is the HNSW index in ChromaDB and what parameters can you tune?
16. How do you efficiently add large numbers of documents to ChromaDB using batching?
17. What is the where_document filter in ChromaDB and how does it differ from where?
18. How do you control what data ChromaDB returns in query and get results using include?
19. How do you design metadata schemas for effective filtering in ChromaDB?
20. How do you inspect a ChromaDB collection's contents and configuration?
21. How do you build a basic RAG (Retrieval-Augmented Generation) pipeline with ChromaDB?
22. What are effective document chunking strategies when indexing documents into ChromaDB for RAG?
23. How do you use ChromaDB as a vector store with LangChain?
24. How do you implement multi-tenancy or data isolation in ChromaDB?
25. What is embedding consistency and why is it critical in ChromaDB applications?
26. How do you run ChromaDB as a standalone HTTP server and connect to it from multiple clients?
27. When should you use upsert() instead of add() in ChromaDB, and what are common patterns?
28. What are best practices for structuring ChromaDB collection metadata for production use?
29. How does ChromaDB compare to FAISS, and when should you choose one over the other?
30. What are common ChromaDB errors and how do you handle them in production code?
31. How do you back up and restore a ChromaDB persistent database?
32. How do you ensure the correct embedding function is used when reopening a persistent ChromaDB collection?
33. How do you interpret ChromaDB query distances and convert them into meaningful relevance scores?
34. What are ChromaDB's practical size limits and performance characteristics at scale?
35. How do you use ChromaDB to detect and remove near-duplicate or semantically similar documents?
36. How do you reset or clear a ChromaDB collection without deleting and recreating it?
37. What configuration settings does ChromaDB support and how do you disable telemetry?
38. What is a production readiness checklist for a ChromaDB-based application?


1. What is ChromaDB and what problem does it solve?

ChromaDB is an open-source, AI-native vector database designed to store, index, and query high-dimensional embedding vectors efficiently. It was created specifically to make building LLM-powered applications easy — particularly for retrieval-augmented generation (RAG), semantic search, and recommendation systems.

Traditional databases store and search data using exact matches or SQL-style predicates. ChromaDB instead answers the question: "which stored items are most semantically similar to this query?" It does this by storing numerical vectors (embeddings) that represent the meaning of text, images, or other data, and finding nearest neighbours using vector similarity search.

ChromaDB at a glance
Feature	Detail
Language	Python-first (also JS/TS client)
License	Apache 2.0 open source
Storage modes	In-memory (ephemeral) or persistent (disk-backed)
Default embedding model	all-MiniLM-L6-v2 (run locally; via ONNX Runtime in recent releases)
Distance metrics	cosine, l2 (Euclidean), ip (inner product)
Primary use cases	RAG pipelines, semantic search, duplicate detection, recommendation

pip install chromadb

import chromadb

# Quickstart: in-memory client
client = chromadb.Client()
collection = client.create_collection("my_docs")
collection.add(
    documents=["ChromaDB is a vector database", "Python is great"],
    ids=["doc1", "doc2"],
)
results = collection.query(query_texts=["vector store for AI"], n_results=1)
print(results["documents"])  # [["ChromaDB is a vector database"]]

Take quiz

What type of database is ChromaDB?

A relational SQL database

✗ Try again.

A key-value store

✗ Try again.

A vector database for storing and querying embedding vectors

✓ Correct! Well done.

A document store like MongoDB

✗ Try again.

What is the primary query type that ChromaDB is designed for?

Exact keyword match

✗ Try again.

SQL JOIN queries

✗ Try again.

Nearest-neighbour semantic similarity search

✓ Correct! Well done.

Full-text inverted index search

✗ Try again.

2. What are embeddings and why are they central to how ChromaDB works?

An embedding is a dense numerical vector — a list of floating-point numbers — that represents the semantic meaning of a piece of data. Text, images, audio, and code can all be converted into embeddings by a neural network (embedding model). Items with similar meanings produce vectors that are close together in the high-dimensional vector space.

ChromaDB stores these vectors alongside the original data and metadata. When you query ChromaDB with a new piece of text, the same embedding model converts it to a vector, and ChromaDB uses an approximate nearest-neighbour (ANN) algorithm to find the stored vectors that are geometrically closest — these correspond to the most semantically relevant stored documents.

# Conceptual illustration
# "ChromaDB is a vector database"  -> [0.12, -0.45, 0.89, ..., 0.03]  (384 numbers)
# "Vector stores for AI apps"      -> [0.14, -0.41, 0.91, ..., 0.01]  (close!)
# "My cat loves tuna fish"         -> [-0.55, 0.72, -0.11, ..., 0.88] (far away)

import chromadb
from chromadb.utils import embedding_functions

# You can inspect the raw embedding vectors ChromaDB generates
client = chromadb.Client()
collection = client.create_collection("demo")
collection.add(documents=["Hello world"], ids=["1"])

# Get the stored embedding
result = collection.get(ids=["1"], include=["embeddings"])
print(len(result["embeddings"][0]))   # 384: length of the default model's vector
print(result["embeddings"][0][:5])    # first 5 of 384 floats
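
The idea that "geometrically closest" means "most relevant" can be sketched without ChromaDB at all. The snippet below is a minimal brute-force illustration, not ChromaDB's ANN index: the 4-dimensional vectors are hand-made stand-ins for real 384-dimensional embeddings, and the helper names cosine_distance and nearest are invented for this example.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest(query, stored, n_results=1):
    """Rank stored {id: vector} entries by distance to the query, closest first."""
    ranked = sorted(stored.items(), key=lambda item: cosine_distance(query, item[1]))
    return [doc_id for doc_id, _ in ranked[:n_results]]


# Hand-made toy vectors (assumption: not real model output)
stored = {
    "vector_db": [0.12, -0.45, 0.89, 0.03],
    "cat_tuna": [-0.55, 0.72, -0.11, 0.88],
}
query = [0.14, -0.41, 0.91, 0.01]  # stands in for "Vector stores for AI apps"

top_ids = nearest(query, stored, n_results=2)
print(top_ids)  # ['vector_db', 'cat_tuna']
```

A real vector database avoids comparing the query against every stored vector; the exhaustive loop here only shows what the approximate index is approximating.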

Embedding dimensions by model
Model	Dimensions	Notes
all-MiniLM-L6-v2 (default)	384	Fast, small, good for English
text-embedding-ada-002 (OpenAI)	1536	High quality, API call required
text-embedding-3-small (OpenAI)	1536	Newer, cheaper than ada-002
all-mpnet-base-v2	768	Higher quality than MiniLM, slower
CLIP (images)	512	Multimodal — text and images same space

Take quiz

What is an embedding in the context of ChromaDB?

A compressed version of a document for storage efficiency

✗ Try again.

A dense numerical vector that represents the semantic meaning of data, enabling similarity comparisons

✓ Correct! Well done.

A unique identifier assigned to each document

✗ Try again.

A metadata tag attached to stored documents

✗ Try again.

Why do semantically similar texts produce vectors that are close together?

ChromaDB manually groups similar texts during insertion

✗ Try again.

Embedding models are trained so that texts appearing in similar contexts produce similar output vectors — semantic similarity is encoded as geometric proximity

✓ Correct! Well done.

ChromaDB sorts vectors alphabetically by meaning

✗ Try again.

Similar texts happen to have similar character counts

✗ Try again.

3. What distance metrics does ChromaDB support and how do you choose between them?

ChromaDB uses a distance metric to measure how similar two vectors are during nearest-neighbour search. The metric is set at collection creation time and cannot be changed afterward. Choosing the wrong metric for your embedding model can significantly degrade search quality.

ChromaDB distance metrics
Metric	hnsw:space value	Formula	Best for
L2 (squared Euclidean)	l2 (default)	Σ(aᵢ − bᵢ)²	When vector magnitude matters; general purpose
Cosine distance	cosine	1 − (a·b)/(‖a‖‖b‖)	Text embeddings; compares direction, not magnitude
Inner product	ip	1 − (a·b)	When embeddings are pre-normalised to unit length

import chromadb

client = chromadb.Client()

# Set metric at collection creation; it cannot be changed later
collection_cosine = client.create_collection(
    name="text_cosine",
    metadata={"hnsw:space": "cosine"},  # recommended for text
)

collection_l2 = client.create_collection(
    name="general_l2",
    metadata={"hnsw:space": "l2"},  # default if not specified
)

collection_ip = client.create_collection(
    name="normalised_ip",
    metadata={"hnsw:space": "ip"},  # use when vectors are unit-normalised
)

# Query returns a "distances" field; interpretation depends on the metric:
# cosine: 0 = identical, 2 = opposite (lower = more similar)
# l2:     0 = identical, larger = more different (lower = more similar)
# ip:     computed as 1 - dot product, so lower = more similar

Rule of thumb: most popular text embedding models (OpenAI, Sentence Transformers) are optimised for cosine similarity. Use "hnsw:space": "cosine" for text RAG applications. L2 is the default but is less optimal for text embeddings that vary in magnitude.
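
To make the difference between the metrics concrete, here is a small pure-Python sketch. It assumes the distance conventions of the underlying HNSW library (squared L2, 1 minus cosine similarity, 1 minus dot product); the function names are invented for this example, so check the documentation for your installed version before relying on exact values.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(v):
    return math.sqrt(dot(v, v))


def squared_l2(a, b):
    """Sum of squared component differences (sensitive to magnitude)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_distance(a, b):
    """1 - cosine similarity (ignores magnitude, compares direction)."""
    return 1.0 - dot(a, b) / (norm(a) * norm(b))


def ip_distance(a, b):
    """1 - dot product (matches cosine distance only for unit-length vectors)."""
    return 1.0 - dot(a, b)


def normalise(v):
    length = norm(v)
    return [x / length for x in v]


a = [3.0, 4.0]
b = [6.0, 8.0]  # same direction as a, twice the length
c = [4.0, 3.0]  # different direction, same length as a

print(squared_l2(a, b))       # 25.0: l2 sees a and b as far apart
print(cosine_distance(a, b))  # approximately 0: cosine sees them as identical
print(ip_distance(normalise(a), normalise(c)), cosine_distance(a, c))  # equal up to rounding
```

The last line shows why ip is only a sensible choice for unit-normalised vectors: after normalisation it gives the same value as cosine distance.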

Take quiz

Which distance metric is generally recommended for text embedding models like those from OpenAI or Sentence Transformers?

l2 (Euclidean distance)

✗ Try again.

cosine

✓ Correct! Well done.

ip (inner product)

✗ Try again.

manhattan

✗ Try again.

When can you change the distance metric of an existing ChromaDB collection?

Any time by calling collection.modify(metric='cosine')

✗ Try again.

Only before adding the first document

✗ Try again.

Never: the metric is fixed at collection creation and cannot be changed afterward

✓ Correct! Well done.

Only by the database administrator

✗ Try again.

4. What is a ChromaDB collection and how do you create, list, get, and delete collections?

A collection is ChromaDB's primary organisational unit, analogous to a table in SQL or an index in a search engine. Each collection stores documents, their embeddings, IDs, and optional metadata. All items in a collection share the same embedding function and distance metric.

import chromadb

client = chromadb.PersistentClient(path="./chroma_db")

# CREATE a collection
collection = client.create_collection(
    name="research_papers",
    metadata={"hnsw:space": "cosine"},
    # embedding_function defaults to all-MiniLM-L6-v2
)

# GET an existing collection (raises error if not found)
collection = client.get_collection("research_papers")

# GET or CREATE - idempotent, safe to call on every startup
collection = client.get_or_create_collection(
    name="research_papers",
    metadata={"hnsw:space": "cosine"},
)

# LIST all collections
collections = client.list_collections()
for col in collections:
    print(col.name)  # prints collection names

# COUNT documents in a collection
print(collection.count())  # number of items stored

# MODIFY collection name or metadata
# (the distance metric "hnsw:space" cannot be changed after creation)
collection.modify(
    name="arxiv_papers",
    metadata={"description": "arXiv CS papers"},
)

# DELETE a collection and all its data (use its current name)
client.delete_collection("arxiv_papers")

Collection management methods
Method	Purpose	Raises if
create_collection(name)	Creates new collection	Name already exists
get_collection(name)	Gets existing collection	Name not found
get_or_create_collection(name)	Idempotent get/create	Never raises
list_collections()	Returns all collections (objects or names, depending on version)	-
delete_collection(name)	Permanently deletes collection + data	Name not found

Take quiz

What is the safest method to use when initialising a collection at application startup, to avoid errors whether the collection already exists or not?

create_collection()

✗ Try again.

get_collection()

✗ Try again.

get_or_create_collection()

✓ Correct! Well done.

list_collections()

✗ Try again.

What does collection.count() return?

The number of collections in the database

✗ Try again.

The total number of documents (items) stored in that collection

✓ Correct! Well done.

The number of dimensions in the embedding vectors

✗ Try again.

The number of pending write operations

✗ Try again.

5. How do you add documents to a ChromaDB collection?

The collection.add() method inserts items into a collection. Each item requires a unique id. You can provide raw documents (strings) and let ChromaDB embed them, or supply pre-computed embeddings directly. Optional metadatas store filterable key-value pairs alongside each document.

import chromadb

client = chromadb.Client()
collection = client.create_collection("articles")

# Basic add: ChromaDB embeds documents automatically
collection.add(
    documents=[
        "ChromaDB is an open-source vector database.",
        "Retrieval-augmented generation improves LLM accuracy.",
        "Python is a popular language for data science.",
    ],
    ids=["art-001", "art-002", "art-003"],
)

# Add with metadata: enables filtered queries later
collection.add(
    documents=[
        "FastAPI is a modern Python web framework.",
        "React is a JavaScript library for building UIs.",
    ],
    metadatas=[
        {"source": "docs", "category": "backend",  "year": 2024},
        {"source": "docs", "category": "frontend", "year": 2024},
    ],
    ids=["art-004", "art-005"],
)

# Add pre-computed embeddings (skip ChromaDB's embedding step)
collection_custom = client.create_collection(
    "custom_embeddings",
    metadata={"hnsw:space": "cosine"},
)
collection_custom.add(
    embeddings=[
        [0.1, 0.5, -0.3, 0.8],  # every vector in a collection must have the same dimension
        [0.4, 0.2,  0.9, -0.1],
    ],
    documents=["Doc A", "Doc B"],  # stored as-is for retrieval
    ids=["e-1", "e-2"],
)

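ID rules: IDs must be strings, must be unique within the collection, and must not be empty. What happens when add() receives an ID that already exists depends on the ChromaDB version: some releases raise chromadb.errors.IDAlreadyExistsError, while others skip the existing ID without raising. Verify the behaviour of your installed version, or use upsert() when overwriting is intended.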
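
Because these ID rules are easy to violate in a batch import, it can help to check IDs before calling add(). The following is a minimal sketch in plain Python; validate_ids is a hypothetical helper written for this example, not part of the ChromaDB API.

```python
def validate_ids(new_ids, existing_ids=()):
    """Return a list of problems; an empty list means the IDs look safe to add."""
    problems = []
    seen = set()
    existing = set(existing_ids)
    for doc_id in new_ids:
        if not isinstance(doc_id, str):
            problems.append(f"{doc_id!r}: not a string")
            continue
        if doc_id == "":
            problems.append("empty ID")
        elif doc_id in seen:
            problems.append(f"{doc_id!r}: duplicated within this batch")
        elif doc_id in existing:
            problems.append(f"{doc_id!r}: already in the collection")
        seen.add(doc_id)
    return problems


# One clash with the collection, one in-batch duplicate, one non-string, one empty ID
issues = validate_ids(["art-001", "x", "x", 7, ""], existing_ids=["art-001"])
for issue in issues:
    print(issue)
```

In a real application the existing_ids argument could be filled from collection.get(ids=new_ids)["ids"], which returns only the IDs that are already stored.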

Take quiz

What happens if you call collection.add() with an ID that already exists in the collection?

The existing document is silently overwritten

✗ Try again.

ChromaDB raises an IDAlreadyExistsError

✓ Correct! Well done.

ChromaDB appends a suffix to make the ID unique

✗ Try again.

The add operation is silently ignored

✗ Try again.

When would you pass embeddings= instead of documents= to collection.add()?

When you want faster insertion speed

✗ Try again.

When you have already computed the embedding vectors externally (e.g. from the OpenAI API) and want to avoid re-embedding the same data

✓ Correct! Well done.

When the documents are longer than 512 tokens

✗ Try again.

When using a persistent client

✗ Try again.

6. How do you query a ChromaDB collection for similar documents?

The primary query method is collection.query(). You pass either query_texts (raw strings that ChromaDB embeds automatically) or query_embeddings (pre-computed vectors). ChromaDB returns the n_results nearest neighbours for each query.

import chromadb

client = chromadb.Client()
collection = client.create_collection("knowledge_base")
collection.add(
    documents=[
        "Python is great for data science and machine learning.",
        "JavaScript is used for web development.",
        "ChromaDB stores and retrieves vector embeddings.",
        "Docker containers package applications with dependencies.",
    ],
    ids=["d1", "d2", "d3", "d4"],
)

# Basic query: returns top 2 most similar documents
results = collection.query(
    query_texts=["vector database for AI"],
    n_results=2,
)
print(results["documents"])  # [[most_similar, second_most_similar]]
print(results["ids"])        # e.g. [["d3", "d1"]]
print(results["distances"])  # e.g. [[0.18, 0.74]]; lower = more similar

# Query multiple texts at once (batch query)
results = collection.query(
    query_texts=["machine learning", "web frameworks"],
    n_results=2,
)
# results["documents"][0] = top 2 for "machine learning"
# results["documents"][1] = top 2 for "web frameworks"

# Control what is returned with include=
results = collection.query(
    query_texts=["Python programming"],
    n_results=3,
    include=["documents", "metadatas", "distances", "embeddings"],
)
# Default include: ["documents", "metadatas", "distances"]
# "embeddings" must be explicitly requested; it increases response size

Query result fields
Field	Type	Description
ids	list[list[str]]	IDs of matching documents, outer list = per query
documents	list[list[str]]	Original text of matching documents
metadatas	list[list[dict]]	Metadata dicts of matching documents
distances	list[list[float]]	Similarity distances (lower = more similar for l2/cosine)
embeddings	list[list[list[float]]]	Raw vectors — only if include=['embeddings']
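
The nested "outer list = per query" layout is a common source of indexing mistakes. The sketch below flattens a result into one row per hit. It operates on a hand-written dict with the same shape as the fields in the table above (the values are made up), and flatten_query_result is a hypothetical helper, not a ChromaDB function.

```python
def flatten_query_result(results):
    """Turn nested per-query lists into a flat list of row dicts."""
    field_names = (
        ("documents", "document"),
        ("metadatas", "metadata"),
        ("distances", "distance"),
    )
    rows = []
    for query_index, ids in enumerate(results["ids"]):
        for rank, doc_id in enumerate(ids):
            row = {"query_index": query_index, "rank": rank, "id": doc_id}
            for field, key in field_names:
                values = results.get(field)
                if values is not None:  # field was not requested via include=
                    row[key] = values[query_index][rank]
            rows.append(row)
    return rows


# Same shape as a two-query result with n_results=2 (illustrative values only)
sample = {
    "ids": [["d1", "d3"], ["d2", "d4"]],
    "documents": [["doc one", "doc three"], ["doc two", "doc four"]],
    "metadatas": None,
    "distances": [[0.21, 0.65], [0.30, 0.72]],
}

rows = flatten_query_result(sample)
for row in rows:
    print(row)
```

Each row keeps its query_index and rank, so results for different query texts never get mixed up downstream.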

Take quiz

What does n_results=5 mean in a ChromaDB query?

Only 5 documents can be stored in the collection

✗ Try again.

Return the 5 most similar documents for each query text

✓ Correct! Well done.

Limit the query to 5 milliseconds

✗ Try again.

Sample 5 random documents from the collection

✗ Try again.

In a ChromaDB query result, what do the distance values represent?

The number of shared words between query and document

✗ Try again.

A similarity score where lower values indicate higher similarity (for cosine and l2 metrics)

✓ Correct! Well done.

The position of the document in the collection

✗ Try again.

The confidence percentage of the match

✗ Try again.

7. How do you retrieve, update, and delete specific documents in ChromaDB?

Beyond querying by similarity, ChromaDB supports exact lookups by ID with get(), in-place updates with update() or upsert(), and deletion with delete().

import chromadb

client = chromadb.Client()
col = client.create_collection("items")
col.add(
    documents=["First document", "Second document", "Third document"],
    metadatas=[{"v": 1}, {"v": 2}, {"v": 3}],
    ids=["id1", "id2", "id3"],
)

# GET: fetch by specific IDs
result = col.get(ids=["id1", "id3"])
print(result["documents"])  # ["First document", "Third document"]

# GET all documents in the collection
all_docs = col.get()  # no ids= returns everything

# GET with include control
result = col.get(
    ids=["id1"],
    include=["documents", "metadatas", "embeddings"],
)

# UPDATE: must already exist, updates only specified fields
col.update(
    ids=["id1"],
    documents=["Updated first document"],
    metadatas=[{"v": 10, "updated": True}],
    # ChromaDB re-embeds the new document text automatically
)

# UPSERT: insert if not exists, update if exists (idempotent)
col.upsert(
    documents=["Brand new doc",        "Updated second doc"],
    metadatas=[{"v": 99},              {"v": 20}],
    ids=       ["id-new",              "id2"],
)
# id-new is inserted; id2 is updated

# DELETE by IDs
col.delete(ids=["id3"])
print(col.count())  # 3 (id1, id2, id-new remain)

# DELETE by metadata filter (where clause)
col.delete(where={"v": {"$gt": 15}})

CRUD methods summary
Method	Behaviour when ID exists	Behaviour when ID missing
add()	Raises IDAlreadyExistsError	Inserts new document
update()	Updates the document	Raises error — ID must exist
upsert()	Updates the document	Inserts new document
delete()	Removes the document	Silently ignores

Take quiz

What is the difference between update() and upsert() in ChromaDB?

update() is faster than upsert()

✗ Try again.

update() raises an error for non-existent IDs; upsert() inserts new documents if the ID does not exist and updates if it does

✓ Correct! Well done.

upsert() can only update metadata, not document text

✗ Try again.

They are identical — upsert() is just an alias

✗ Try again.

When you call collection.update(ids=['x'], documents=['new text']), what does ChromaDB do with the stored embedding?

Keeps the old embedding unchanged

✗ Try again.

Re-embeds the new document text using the collection's embedding function and stores the updated vector

✓ Correct! Well done.

Deletes the embedding and requires you to provide a new one

✗ Try again.

Averages the old and new embeddings

✗ Try again.

8. How do you filter query results using metadata in ChromaDB?

ChromaDB supports a MongoDB-style where clause for filtering by metadata fields. Filters can be applied during query() (combines semantic search with filtering) or during get() (exact retrieval with filtering). Filters run before or alongside the ANN search.

import chromadb

client = chromadb.Client()
col = client.create_collection("articles")
col.add(
    documents=["Python intro", "Python advanced", "JS basics", "Rust guide", "Go tutorial"],
    metadatas=[
        {"lang": "python", "level": "beginner", "year": 2022},
        {"lang": "python", "level": "advanced", "year": 2023},
        {"lang": "js",     "level": "beginner", "year": 2023},
        {"lang": "rust",   "level": "beginner", "year": 2024},
        {"lang": "go",     "level": "intermediate", "year": 2024},
    ],
    ids=["a1","a2","a3","a4","a5"],
)

# Equality filter
results = col.query(
    query_texts=["programming tutorial"],
    n_results=3,
    where={"lang": "python"},  # shorthand for $eq
)

# Comparison operators
results = col.query(
    query_texts=["tutorial"],
    n_results=5,
    where={"year": {"$gte": 2023}},  # year >= 2023
)

# Logical AND: all conditions must match
results = col.query(
    query_texts=["guide"],
    n_results=3,
    where={"$and": [
        {"lang":  {"$in":  ["python", "go"]}},
        {"level": {"$ne":  "advanced"}},
    ]},
)

# Logical OR
results = col.query(
    query_texts=["code"],
    n_results=3,
    where={"$or": [
        {"year": {"$eq": 2024}},
        {"level": {"$eq": "beginner"}},
    ]},
)

# Filter on document text content (where_document)
results = col.query(
    query_texts=["programming"],
    n_results=5,
    where_document={"$contains": "Python"},  # document text contains "Python"
)

ChromaDB where clause operators
Operator	Meaning	Example
$eq	Equal	{"lang": {"$eq": "python"}} or {"lang": "python"}
$ne	Not equal	{"level": {"$ne": "advanced"}}
$gt / $gte	Greater than / or equal	{"year": {"$gte": 2023}}
$lt / $lte	Less than / or equal	{"year": {"$lt": 2024}}
$in	Value in list	{"lang": {"$in": ["python", "go"]}}
$nin	Value not in list	{"lang": {"$nin": ["js"]}}
$and	All conditions true	{"$and": [{...}, {...}]}
$or	Any condition true	{"$or": [{...}, {...}]}
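
The semantics of these operators can be illustrated with a toy evaluator that checks one metadata dict against a where clause. This is a simplified sketch for building intuition, not ChromaDB's implementation: matches_where is a hypothetical function, and treating a missing key as "no match" is an assumption of this example.

```python
import operator

_COMPARISONS = {
    "$eq": operator.eq,
    "$ne": operator.ne,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
    "$in": lambda value, options: value in options,
    "$nin": lambda value, options: value not in options,
}


def matches_where(metadata, where):
    """Return True if a metadata dict satisfies a where clause (toy evaluator)."""
    ((key, condition),) = where.items()  # exactly one top-level key is expected
    if key == "$and":
        return all(matches_where(metadata, clause) for clause in condition)
    if key == "$or":
        return any(matches_where(metadata, clause) for clause in condition)
    if key not in metadata:
        return False  # assumption: a missing field never matches
    if not isinstance(condition, dict):
        condition = {"$eq": condition}  # {"lang": "python"} is shorthand for $eq
    ((op, operand),) = condition.items()
    return _COMPARISONS[op](metadata[key], operand)


articles = {
    "a1": {"lang": "python", "level": "beginner", "year": 2022},
    "a2": {"lang": "python", "level": "advanced", "year": 2023},
    "a3": {"lang": "js", "level": "beginner", "year": 2023},
    "a4": {"lang": "rust", "level": "beginner", "year": 2024},
    "a5": {"lang": "go", "level": "intermediate", "year": 2024},
}
where = {"$and": [
    {"lang": {"$in": ["python", "go"]}},
    {"level": {"$ne": "advanced"}},
]}

matched = [doc_id for doc_id, meta in articles.items() if matches_where(meta, where)]
print(matched)  # ['a1', 'a5']
```

In a real query the filter narrows the candidate set, and the similarity ranking is then applied to the documents that pass it.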

Take quiz

What is the where_document parameter used for in a ChromaDB query?

Filtering by metadata fields attached to documents

✗ Try again.

Filtering based on the actual text content of stored documents (e.g. $contains)

✓ Correct! Well done.

Specifying which document fields to return

✗ Try again.

Setting a maximum document length for results

✗ Try again.

How do you filter for documents where lang is 'python' AND year is 2023 or later?

where={"lang": "python", "year": {"$gte": 2023}}

✗ Try again.

where={"$and": [{"lang": "python"}, {"year": {"$gte": 2023}}]}

✓ Correct! Well done.

Both of the options above are valid ChromaDB syntax

✗ Try again.

where={"lang": "python"} and where={"year": 2023} as separate calls

✗ Try again.

9. What is the difference between ChromaDB's in-memory and persistent storage modes?

ChromaDB offers three client modes that control where data is stored. Choosing the right mode depends on whether you need data to survive restarts and whether you're running a single process or a shared service.

ChromaDB client modes
Mode	Class	Data survives restart?	Best for
Ephemeral (in-memory)	chromadb.Client()	No — lost when process ends	Testing, prototyping, CI pipelines
Persistent (disk)	chromadb.PersistentClient(path=...)	Yes — written to SQLite + disk files	Single-process apps, local dev
HTTP Client	chromadb.HttpClient(host=..., port=...)	Yes — managed by server	Multi-process apps, production, shared access

import chromadb

# 1. Ephemeral: data lives only in memory, lost on exit
client_mem = chromadb.Client()

# 2. Persistent: data saved to disk at ./my_chroma_db/
client_disk = chromadb.PersistentClient(path="./my_chroma_db")
# Creates the directory if it does not exist
# Data persists across Python restarts

# 3. HTTP Client: connects to a running ChromaDB server
client_http = chromadb.HttpClient(
    host="localhost",
    port=8000,
    # ssl=True, headers={"Authorization": "Bearer token"}  # if secured
)

# Start the server separately:
# chroma run --path ./chroma_data --port 8000

# Verify connection
client_http.heartbeat()  # raises if server is unreachable

# EphemeralClient: explicit alias for chromadb.Client()
client_eph = chromadb.EphemeralClient()

# All three clients share the same collection API
collection = client_disk.get_or_create_collection("my_data")
collection.add(documents=["Persisted text"], ids=["p1"])
# Restart Python, create PersistentClient with same path: data is still there

Important: the persistent client uses SQLite under the hood. It is not designed for concurrent writes from multiple processes. For multi-process or multi-container production use, run ChromaDB as an HTTP server and use HttpClient.

Take quiz

Which ChromaDB client should you use in production when multiple services need to read and write to the same database?

chromadb.Client() (ephemeral)

✗ Try again.

chromadb.PersistentClient()

✗ Try again.

chromadb.HttpClient() connecting to a ChromaDB server

✓ Correct! Well done.

Any client — they all support concurrent access

✗ Try again.

What happens to data stored with chromadb.Client() (EphemeralClient) when the Python process exits?

It is automatically saved to a default path

✗ Try again.

It is lost permanently — ephemeral storage exists only in RAM for the lifetime of the process

✓ Correct! Well done.

It is written to a temporary file that persists until the OS cleans it

✗ Try again.

It is backed up to ChromaDB's cloud service

✗ Try again.

10. What is ChromaDB's default embedding function and how does it work?

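When you create a collection without specifying an embedding function, ChromaDB uses its DefaultEmbeddingFunction, which runs the all-MiniLM-L6-v2 Sentence Transformers model. In recent releases the default runs through ONNX Runtime rather than the sentence-transformers package; the SentenceTransformerEmbeddingFunction shown here loads the same model through sentence-transformers, which must be installed separately. The model is downloaded automatically on first use and cached locally.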

import chromadb
from chromadb.utils import embedding_functions

# Default: uses all-MiniLM-L6-v2 automatically
client = chromadb.Client()
collection_default = client.create_collection("default_embeddings")

# Same model loaded explicitly through sentence-transformers
ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2",  # 384-dim, fast, good English quality
)

# Using a different Sentence Transformer model
ef_large = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-mpnet-base-v2",  # 768-dim, higher quality, slower
)

collection_large = client.create_collection(
    name="large_model",
    embedding_function=ef_large,
    metadata={"hnsw:space": "cosine"},
)

# You can call embedding functions directly to inspect output
embed = embedding_functions.SentenceTransformerEmbeddingFunction()
vectors = embed(["Hello world", "ChromaDB is great"])
print(len(vectors))      # 2: one vector per input
print(len(vectors[0]))   # 384: dimensions

Default model properties
Property	Value
Model name	all-MiniLM-L6-v2
Output dimensions	384
Download size	~80 MB (cached after first use)
Library required	onnxruntime for the built-in default; sentence-transformers for SentenceTransformerEmbeddingFunction
Runs on	CPU (default) or GPU
Strength	Fast, good English semantic similarity
Limitation	Weaker on non-English, domain-specific text

Take quiz

What model does ChromaDB use as its default embedding function?

text-embedding-ada-002 from OpenAI

✗ Try again.

all-MiniLM-L6-v2 from Sentence Transformers

✓ Correct! Well done.

BERT-base-uncased

✗ Try again.

FastText

✗ Try again.

What happens the first time you create a ChromaDB collection with the default embedding function?

ChromaDB raises an error asking you to specify a model

✗ Try again.

The all-MiniLM-L6-v2 model is downloaded automatically from the internet and cached locally

✓ Correct! Well done.

ChromaDB uses a random vector as a placeholder

✗ Try again.

You must manually install sentence-transformers and call the model first

✗ Try again.

11. How do you use the OpenAI embedding function with ChromaDB?

ChromaDB has a built-in OpenAIEmbeddingFunction that calls the OpenAI Embeddings API. This gives higher-quality embeddings than the default local model, at the cost of API latency and usage fees. Use text-embedding-3-small for a balance of quality and cost, or text-embedding-3-large for maximum quality.

import chromadb
from chromadb.utils import embedding_functions
import os

client = chromadb.PersistentClient(path="./chroma_openai")

# Built-in OpenAI embedding function
ef_openai = embedding_functions.OpenAIEmbeddingFunction(
    api_key=os.environ["OPENAI_API_KEY"],
    model_name="text-embedding-3-small",  # 1536-dim, fast and cheap
    # model_name="text-embedding-3-large",  # 3072-dim, highest quality
    # model_name="text-embedding-ada-002",  # legacy, 1536-dim
)

collection = client.get_or_create_collection(
    name="openai_docs",
    embedding_function=ef_openai,
    metadata={"hnsw:space": "cosine"},
)

# Usage is identical to the default embedding function
collection.add(
    documents=[
        "FastAPI is a modern Python web framework for building APIs.",
        "ChromaDB stores embeddings for semantic search.",
    ],
    ids=["d1", "d2"],
)

# ChromaDB calls OpenAI API automatically on add() and query()
results = collection.query(
    query_texts=["vector database for retrieval"],
    n_results=1,
)
print(results["documents"])  # [["ChromaDB stores embeddings..."]]

OpenAI embedding models comparison
Model	Dimensions	Cost	Notes
text-embedding-3-small	1536	~$0.02/1M tokens	Best value — recommended default
text-embedding-3-large	3072	~$0.13/1M tokens	Highest quality
text-embedding-ada-002	1536	~$0.10/1M tokens	Legacy, superseded by v3
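
To turn the per-token prices in the table into a budget figure, a rough estimate is tokens times price. The sketch below uses the approximate prices listed above (they may have changed since); the corpus size and the average of 500 tokens per document are made-up inputs, and estimate_embedding_cost is a helper invented for this example.

```python
# Approximate prices in USD per 1M tokens, copied from the table above
PRICE_PER_MILLION_TOKENS = {
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
    "text-embedding-ada-002": 0.10,
}


def estimate_embedding_cost(num_documents, avg_tokens_per_document, model_name):
    """Estimate the one-off cost in USD of embedding a corpus once."""
    total_tokens = num_documents * avg_tokens_per_document
    return total_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS[model_name]


# Hypothetical corpus: 100,000 documents averaging 500 tokens (50M tokens in total)
for model in PRICE_PER_MILLION_TOKENS:
    cost = estimate_embedding_cost(100_000, 500, model)
    print(f"{model}: ${cost:.2f}")
```

For this hypothetical corpus the estimate is about $1.00 with text-embedding-3-small, $6.50 with text-embedding-3-large, and $5.00 with text-embedding-ada-002. Query-time embedding calls add a smaller ongoing cost on top.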

Important consistency rule: you must use the exact same embedding model for both storing and querying. If you embed documents with text-embedding-3-small, all queries must also use text-embedding-3-small. Mixing models produces meaningless similarity scores.
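
One lightweight way to guard against mixing models is to record the model name when the collection is created and compare it before querying. The sketch below works on a plain metadata dict; the "embedding_model" key is a convention invented for this example (ChromaDB does not define it), and check_embedding_model is a hypothetical helper.

```python
def check_embedding_model(collection_metadata, model_name):
    """Raise ValueError if the collection was built with a different embedding model."""
    recorded = (collection_metadata or {}).get("embedding_model")
    if recorded is None:
        raise ValueError("collection metadata has no 'embedding_model' entry")
    if recorded != model_name:
        raise ValueError(
            f"collection was embedded with {recorded!r}, "
            f"but queries would use {model_name!r}"
        )


# Metadata as it might be stored at collection creation time (assumed convention)
metadata = {"hnsw:space": "cosine", "embedding_model": "text-embedding-3-small"}

check_embedding_model(metadata, "text-embedding-3-small")  # passes silently

try:
    check_embedding_model(metadata, "text-embedding-3-large")
except ValueError as exc:
    print(f"Refusing to query: {exc}")
```

In an application, the dict would come from the metadata passed to get_or_create_collection and read back from the collection's metadata attribute at startup.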

Take quiz

Why must you use the same embedding model for both adding documents and querying in ChromaDB?

ChromaDB enforces this with a runtime error

✗ Try again.

Different models produce vectors in different semantic spaces — comparing vectors from different models yields meaningless distances

✓ Correct! Well done.

OpenAI's API requires consistent model usage per session

✗ Try again.

The vector dimensions must match, and all OpenAI models have the same dimensions

✗ Try again.

Which OpenAI model is recommended as the best balance of cost and quality for new ChromaDB projects?

text-embedding-ada-002

✗ Try again.

text-embedding-3-large

✗ Try again.

text-embedding-3-small

✓ Correct! Well done.

gpt-4-turbo

✗ Try again.

12. How do you use HuggingFace models as embedding functions in ChromaDB?

ChromaDB provides a HuggingFaceEmbeddingFunction that calls the HuggingFace Inference API (cloud-hosted), and a SentenceTransformerEmbeddingFunction for running any Sentence Transformer model locally. For production use without per-call API costs, local Sentence Transformer models are the more common choice.

import chromadb
from chromadb.utils import embedding_functions
import os

client = chromadb.Client()

# Option 1: HuggingFace Inference API (cloud, requires API key)
ef_hf_api = embedding_functions.HuggingFaceEmbeddingFunction(
    api_key=os.environ["HUGGINGFACE_API_KEY"],
    model_name="sentence-transformers/all-MiniLM-L6-v2",
)

# Option 2: Local Sentence Transformers (no API key, runs on your machine)
ef_local = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2",    # 384-dim, fast
    # model_name="all-mpnet-base-v2", # 768-dim, higher quality
    # model_name="BAAI/bge-large-en-v1.5",  # excellent quality
    device="cpu",  # or "cuda" for GPU acceleration
)

collection = client.create_collection(
    name="hf_docs",
    embedding_function=ef_local,
    metadata={"hnsw:space": "cosine"},
)

collection.add(
    documents=[
        "Open-source language models are becoming more powerful.",
        "LLaMA and Mistral are popular open-source LLMs.",
    ],
    ids=["h1", "h2"],
)

results = collection.query(
    query_texts=["free LLM models"],
    n_results=2,
)
print(results["documents"])

# Popular local models for RAG
models = {
    "BAAI/bge-small-en-v1.5":  "384-dim, excellent quality/speed ratio",
    "BAAI/bge-large-en-v1.5":  "1024-dim, top English quality",
    "intfloat/e5-base-v2":     "768-dim, English-only (intfloat/multilingual-e5-base covers other languages)",
    "thenlper/gte-large":      "1024-dim, great for retrieval",
}

Trade-offs: HuggingFace Inference API requires no local GPU but costs money and adds latency. Local Sentence Transformers are free, fast (especially on GPU), run offline, and are privacy-preserving — preferred for sensitive data.

Take quiz

What is the main advantage of using a local SentenceTransformerEmbeddingFunction over the HuggingFace Inference API?

Local models always produce better embeddings

✗ Try again.

No API key or per-call cost, works offline, data stays private, and runs faster when using a GPU

✓ Correct! Well done.

Local models support more languages

✗ Try again.

The HuggingFace Inference API does not support ChromaDB

✗ Try again.

Which local HuggingFace model family is widely considered top-quality for English retrieval tasks in ChromaDB?

GPT-2

✗ Try again.

BERT-base-uncased

✗ Try again.

BAAI/bge models (e.g. bge-large-en-v1.5)

✓ Correct! Well done.

Word2Vec

✗ Try again.

13. How do you create a custom embedding function for ChromaDB?

ChromaDB defines a simple protocol for embedding functions: a class with a __call__ method that accepts a list of strings and returns a list of embedding vectors. Implementing this interface lets you plug in any model — a local transformer, a third-party API, or even a mock for testing.

import chromadb
from chromadb import Documents, EmbeddingFunction, Embeddings
from typing import List

# Custom embedding function: must implement __call__
class MyCustomEmbeddingFunction(EmbeddingFunction):
    """Wraps any embedding model in ChromaDB's interface."""

    def __init__(self, model_name: str = "my-model"):
        # Load your model here
        self.model_name = model_name
        # self.model = load_model(model_name)

    def __call__(self, input: Documents) -> Embeddings:
        """
        input:  list of strings to embed
        return: list of lists of floats (one vector per string)
        """
        embeddings = []
        for text in input:
            # Replace with your actual embedding logic
            vector = self._embed_text(text)
            embeddings.append(vector)
        return embeddings

    def _embed_text(self, text: str) -> List[float]:
        # Example: fixed-dim hash-based mock (not for production)
        import hashlib
        h = hashlib.md5(text.encode()).digest()
        return [b / 255.0 for b in h]  # 16-dim mock vector


# Use your custom function exactly like a built-in one
client = chromadb.Client()
custom_ef = MyCustomEmbeddingFunction()

collection = client.create_collection(
    name="custom_embed",
    embedding_function=custom_ef,
)
collection.add(
    documents=["Test document one", "Test document two"],
    ids=["c1", "c2"],
)
results = collection.query(
    query_texts=["test"],
    n_results=1,
)
print(results["ids"])  # [["c1"]] or [["c2"]]

When to write a custom embedding function:

Your company uses a proprietary or self-hosted embedding model
You need to embed data from a provider not in ChromaDB's built-in list
You want to add preprocessing (text cleaning, chunking, domain adaptation) before embedding
Testing — inject a deterministic mock that returns predictable vectors
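
For the testing case, a deterministic mock can be a little richer than a raw hash of the whole text. The sketch below hashes each token into a bucket and normalises the result, so repeated calls give identical unit-length vectors. HashingMockEmbedding is a hypothetical name, and the class deliberately does not inherit from chromadb's EmbeddingFunction so that it runs without ChromaDB installed; to plug it into a collection you would subclass EmbeddingFunction as in the example above.

```python
import hashlib
import math
from typing import List


class HashingMockEmbedding:
    """Deterministic test double: hashes each token into one of `dim` buckets."""

    def __init__(self, dim: int = 32):
        self.dim = dim

    def __call__(self, input: List[str]) -> List[List[float]]:
        # Same shape as the ChromaDB protocol: list of strings in, list of vectors out
        return [self._embed(text) for text in input]

    def _embed(self, text: str) -> List[float]:
        vector = [0.0] * self.dim
        for token in text.lower().split():
            digest = hashlib.sha256(token.encode("utf-8")).digest()
            bucket = int.from_bytes(digest[:4], "big") % self.dim
            vector[bucket] += 1.0
        norm = math.sqrt(sum(v * v for v in vector))
        # Normalise to unit length; an empty text stays the zero vector.
        return [v / norm for v in vector] if norm else vector


mock_ef = HashingMockEmbedding(dim=32)
first = mock_ef(["Test document one", "Test document two"])
second = mock_ef(["Test document one", "Test document two"])
print(len(first), len(first[0]))  # 2 32
print(first == second)            # True
```

Because the vectors depend only on the input text, assertions in a test suite stay stable across runs and machines.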

Take quiz

What is the minimum interface a custom ChromaDB embedding function must implement?

__init__ and embed() methods

✗ Try again.

__call__(self, input: Documents) -> Embeddings — a callable that takes a list of strings and returns a list of float lists

✓ Correct! Well done.

encode() method matching the SentenceTransformer API

✗ Try again.

transform() and fit() methods like scikit-learn

✗ Try again.

Why might you create a custom embedding function for testing ChromaDB code?

The built-in functions do not work in test environments

✗ Try again.

A deterministic mock embedding function returns predictable vectors, making tests fast, offline, and independent of external APIs or large model downloads

✓ Correct! Well done.

ChromaDB requires a custom function when using pytest

✗ Try again.

Testing requires lower-dimensional vectors that built-in models cannot produce

✗ Try again.

14. How does ChromaDB's PersistentClient store data on disk, and what are its limitations?

The PersistentClient stores data in a directory you specify. Inside, ChromaDB uses SQLite for metadata (IDs, document text, metadata key-value pairs) and binary files for the HNSW vector index. All writes are flushed to disk automatically — there is no explicit save/commit step.

import chromadb
import os

# Create or open a persistent database
client = chromadb.PersistentClient(path="./my_vector_db")

# As data is written, ./my_vector_db/ comes to contain:
# - chroma.sqlite3         (metadata, documents, IDs)
# - <uuid>/               (one folder per collection)
#   - header.bin          (HNSW index configuration)
#   - data_level0.bin     (HNSW graph layer 0)
#   - link_lists.bin      (HNSW links for the upper graph layers)
#   - length.bin          (element count)

col = client.get_or_create_collection("notes")
col.add(
    documents=["Remember to buy milk", "Meeting at 3pm tomorrow"],
    ids=["n1", "n2"],
)
# Data is persisted immediately; no commit needed

# Verify data survives restart:
del client, col  # simulate process exit
client2 = chromadb.PersistentClient(path="./my_vector_db")
col2 = client2.get_collection("notes")
print(col2.count())   # 2 (still there)
print(col2.get(ids=["n1"])["documents"])  # ["Remember to buy milk"]

# Check the files on disk
for root, dirs, files in os.walk("./my_vector_db"):
    for f in files:
        print(os.path.join(root, f))

PersistentClient limitations
Limitation	Detail
Single writer only	SQLite allows only one writer at a time — concurrent writes from multiple processes cause errors
No built-in replication	The SQLite file is a single point of failure; back it up manually
No horizontal scaling	Cannot distribute load across multiple machines
File locking	Moving or copying the directory while the client is open can corrupt data
Migration	Upgrading ChromaDB versions may require running migration scripts on the SQLite DB

For multi-process or production deployments, prefer running chroma run --path ./data as a server and connecting with HttpClient.
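
The single-writer limitation comes from SQLite itself and can be reproduced with the standard library alone. The sketch below does not touch ChromaDB: it opens two plain sqlite3 connections to a temporary file (the notes table is a made-up stand-in) and shows that the second one cannot start a write transaction while the first holds the write lock.

```python
import os
import sqlite3
import tempfile


def second_writer_is_blocked() -> bool:
    """Return True if a second connection is refused the write lock."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.sqlite3")
        # isolation_level=None lets us issue BEGIN ourselves; timeout=0 means "do not wait"
        first = sqlite3.connect(path, isolation_level=None)
        second = sqlite3.connect(path, timeout=0, isolation_level=None)
        try:
            first.execute("CREATE TABLE notes (id TEXT PRIMARY KEY, body TEXT)")
            first.execute("BEGIN IMMEDIATE")  # first connection takes the write lock
            first.execute("INSERT INTO notes VALUES ('n1', 'Remember to buy milk')")
            try:
                second.execute("BEGIN IMMEDIATE")  # second writer asks for the same lock
            except sqlite3.OperationalError as exc:
                return "locked" in str(exc)
            return False
        finally:
            first.close()
            second.close()


print(second_writer_is_blocked())  # True
```

A server process avoids this situation because all clients send their writes to one process that owns the database files.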

Take quiz

What database engine does ChromaDB's PersistentClient use to store metadata and document text?

PostgreSQL

✗ Try again.

MongoDB

✗ Try again.

SQLite

✓ Correct! Well done.

LevelDB

✗ Try again.

Why is PersistentClient not suitable for concurrent writes from multiple Python processes?

PersistentClient only supports read operations

✗ Try again.

SQLite, which PersistentClient uses internally, allows only one concurrent writer — multiple processes writing simultaneously cause locking errors or data corruption

✓ Correct! Well done.

PersistentClient requires a network connection which blocks concurrent access

✗ Try again.

ChromaDB charges extra for multi-process access

✗ Try again.

15. What is the HNSW index in ChromaDB and what parameters can you tune?

ChromaDB uses HNSW (Hierarchical Navigable Small World) as its Approximate Nearest Neighbour (ANN) index. HNSW builds a layered graph in which each node links to its closest neighbours. A query descends through the layers and follows these links, finding approximate nearest neighbours in roughly O(log n) time instead of an exhaustive O(n) linear scan.

import chromadb

client = chromadb.Client()

# HNSW parameters are set as metadata at collection creation
collection = client.create_collection(
    name="tuned_collection",
    metadata={
        "hnsw:space":           "cosine",   # distance metric
        "hnsw:construction_ef": 200,         # default 100
        # Controls quality of index during construction.
        # Higher = better recall, slower inserts.

        "hnsw:search_ef":       100,         # default 10
        # Controls quality of search at query time.
        # Higher = better recall, slower queries.

        "hnsw:M":               16,          # default 16
        # Number of bi-directional links per node.
        # Higher = better recall + more memory + slower inserts.
        # Typical range: 4-64.
    },
)

# Note: HNSW parameters cannot be changed after collection creation
# You would need to recreate the collection and re-insert data

collection.add(
    documents=[f"Document number {i}" for i in range(10000)],
    ids=[str(i) for i in range(10000)],
)

HNSW tuning guide
Parameter	Default	Effect of increasing	Effect of decreasing
hnsw:space	l2	Changes metric (cosine/ip)	—
hnsw:M	16	Better recall, more memory, slower inserts	Faster inserts, less memory, lower recall
hnsw:construction_ef	100	Better index quality, slower inserts	Faster inserts, lower quality graph
hnsw:search_ef	10	Better recall, slower queries	Faster queries, lower recall

For most RAG use cases, the defaults work well for collections under ~100K documents. For large collections or when recall matters, increase hnsw:search_ef to 50–200 and set hnsw:construction_ef to at least 200 when building the index.
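
To judge whether hnsw:search_ef is high enough, one common approach is to measure recall: compare the IDs returned by the approximate search with the true nearest neighbours from a brute-force scan on a small sample. The sketch below is a pure-Python illustration with toy 2-dimensional vectors; exact_top_k and recall_at_k are hypothetical helper names, and the approximate result is hand-written rather than taken from a real collection.query call.

```python
import math
from typing import Dict, List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def exact_top_k(query: Sequence[float], vectors: Dict[str, Sequence[float]], k: int) -> List[str]:
    """Exhaustive O(n) scan: the ground truth that HNSW approximates."""
    ranked = sorted(vectors, key=lambda doc_id: cosine_distance(query, vectors[doc_id]))
    return ranked[:k]


def recall_at_k(approx_ids: List[str], exact_ids: List[str]) -> float:
    """Fraction of the true nearest neighbours that the ANN search returned."""
    if not exact_ids:
        return 1.0
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


vectors = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],
    "c": [0.0, 1.0],
    "d": [-1.0, 0.0],
}
truth = exact_top_k([1.0, 0.0], vectors, k=2)
print(truth)                           # ['a', 'b']
print(recall_at_k(["a", "c"], truth))  # 0.5
```

If recall on a representative sample of queries is lower than the application can tolerate, raising hnsw:search_ef is the first parameter to try.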

Take quiz

What does the hnsw:search_ef parameter control in ChromaDB?

The number of dimensions in stored embedding vectors

✗ Try again.

The size of the dynamic candidate list during query-time search — higher values give better recall at the cost of slower queries

✓ Correct! Well done.

The number of results returned per query

✗ Try again.

The distance metric used during nearest-neighbour search

✗ Try again.

What type of algorithm is HNSW and why does ChromaDB use it instead of exact search?

A sorting algorithm: it sorts vectors before comparison

✗ Try again.

An Approximate Nearest Neighbour algorithm — it finds very close (but not always the exact closest) neighbours in O(log n) time, making large-scale similarity search practical

✓ Correct! Well done.

A compression algorithm — it reduces vector dimensions before storing

✗ Try again.

A hashing algorithm — it assigns vectors to buckets for O(1) lookup

✗ Try again.

16. How do you efficiently add large numbers of documents to ChromaDB using batching?

Adding tens of thousands of documents one at a time is slow because each call triggers embedding computation and index updates. The right approach is to batch documents into groups of 100–500 and add each batch with a single add() call — this amortises embedding overhead and index writes.

import chromadb
from chromadb.utils import embedding_functions
from typing import List

client = chromadb.PersistentClient(path="./bulk_db")
ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)
collection = client.get_or_create_collection(
    "large_corpus", embedding_function=ef
)

# Simulate a large list of documents
documents = [f"Article about topic {i}" for i in range(10_000)]
ids       = [f"doc-{i}" for i in range(10_000)]
metadatas = [{"index": i, "batch": i // 500} for i in range(10_000)]

# Efficient batch insertion
BATCH_SIZE = 500

for start in range(0, len(documents), BATCH_SIZE):
    end = start + BATCH_SIZE
    collection.add(
        documents=documents[start:end],
        ids=ids[start:end],
        metadatas=metadatas[start:end],
    )
    print(f"Added batch {start // BATCH_SIZE + 1}, total: {collection.count()}")

print(f"Final count: {collection.count()}")  # 10000

# Alternative: provide pre-computed embeddings to skip re-embedding
# (useful when you already called the embedding API)
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")

docs_batch = documents[:500]
vectors = model.encode(docs_batch, batch_size=64, show_progress_bar=True)
# These IDs were already inserted by the loop above, so upsert rather than add
collection.upsert(
    embeddings=vectors.tolist(),
    documents=docs_batch,
    ids=ids[:500],
)

Batching tips
Tip	Reason
Batch size 100–500	Balances memory use and embedding throughput
Pre-compute embeddings externally	Avoid re-embedding if you already have vectors from an API call
Use GPU for local models	SentenceTransformer encoding is typically many times faster on a GPU than on a CPU
Upsert instead of add in loops	upsert() is safe to re-run; add() rejects or skips IDs that already exist, depending on the version
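
When the corpus is too large to hold in lists, the same batching idea works on any iterable. The sketch below shows a small generator, batched (a hypothetical helper name), that pulls at most batch_size items at a time; the actual collection call is left as a comment so the example runs without ChromaDB.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield successive lists of at most batch_size items from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Records could come from a file or database cursor instead of a generator.
records = ((f"doc-{i}", f"Article about topic {i}") for i in range(1_200))

for batch in batched(records, 500):
    ids = [doc_id for doc_id, _ in batch]
    documents = [text for _, text in batch]
    # With a real collection: collection.upsert(ids=ids, documents=documents)
    print(len(ids), ids[0], ids[-1])
# 500 doc-0 doc-499
# 500 doc-500 doc-999
# 200 doc-1000 doc-1199
```

Only one batch is held in memory at a time, so peak memory depends on batch_size rather than on the size of the corpus.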

Take quiz

What is the main performance benefit of batching documents into groups before calling collection.add()?

Batching bypasses the HNSW index for faster insertion

✗ Try again.

Batching amortises the embedding computation and index update overhead — embedding 500 texts in one call is far faster than 500 separate single-text calls

✓ Correct! Well done.

Batching allows ChromaDB to compress vectors more efficiently

✗ Try again.

Batching is required when using more than 100 documents

✗ Try again.

What is a good batch size for bulk insertion into ChromaDB?

Exactly 10: ChromaDB has a hard limit of 10 per call

✗ Try again.

1 document per call for maximum reliability

✗ Try again.

100–500 documents per batch

✓ Correct! Well done.

At least 10,000 for maximum throughput

✗ Try again.

17. What is the where_document filter in ChromaDB and how does it differ from where?

ChromaDB provides two types of filters that can be used together or separately: where filters on metadata fields (structured key-value pairs), while where_document filters on the raw text content of the stored documents. Both can be combined in a single query.

import chromadb

client = chromadb.Client()
col = client.create_collection("mixed_docs")
col.add(
    documents=[
        "Python tutorial for beginners with examples",
        "Advanced Python decorators and metaclasses",
        "JavaScript async/await guide",
        "Python data science with pandas and numpy",
        "Rust memory safety tutorial",
    ],
    metadatas=[
        {"lang": "python", "level": "beginner"},
        {"lang": "python", "level": "advanced"},
        {"lang": "js",     "level": "intermediate"},
        {"lang": "python", "level": "intermediate"},
        {"lang": "rust",   "level": "beginner"},
    ],
    ids=["d1","d2","d3","d4","d5"],
)

# where_document: filter on text content
results = col.query(
    query_texts=["programming guide"],
    n_results=5,
    where_document={"$contains": "tutorial"},  # text must contain "tutorial"
)
print(results["ids"])  # d1 and d5 only: they are the two texts that contain "tutorial"

# where_document with NOT
results = col.query(
    query_texts=["programming"],
    n_results=5,
    where_document={"$not_contains": "JavaScript"},
)

# Combine where (metadata) + where_document (text content)
results = col.query(
    query_texts=["learning to code"],
    n_results=3,
    where={"lang": "python"},                    # metadata filter
    where_document={"$contains": "tutorial"},    # content filter
    # Only Python docs whose text contains "tutorial"
)
print(results["documents"])
# Only matches d1: "Python tutorial for beginners with examples"

where vs where_document
Filter	Operates on	Supported operators
where	Metadata key-value fields	$eq, $ne, $gt, $gte, $lt, $lte, $in, $nin, $and, $or
where_document	Raw document text content	$contains, $not_contains, $and, $or
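
The way the two filters narrow the candidate set can be emulated in a few lines of plain Python. This is only a sketch of the semantics on the five sample documents above, not ChromaDB's implementation: it supports equality-only where clauses, treats $contains and $not_contains as plain case-sensitive substring tests (an assumption made for this sketch), and leaves out similarity ranking entirely. All helper names are hypothetical.

```python
from typing import Any, Dict, List, Optional

DOCS = {
    "d1": ("Python tutorial for beginners with examples", {"lang": "python", "level": "beginner"}),
    "d2": ("Advanced Python decorators and metaclasses", {"lang": "python", "level": "advanced"}),
    "d3": ("JavaScript async/await guide", {"lang": "js", "level": "intermediate"}),
    "d4": ("Python data science with pandas and numpy", {"lang": "python", "level": "intermediate"}),
    "d5": ("Rust memory safety tutorial", {"lang": "rust", "level": "beginner"}),
}


def matches_where(metadata: Dict[str, Any], where: Optional[Dict[str, Any]]) -> bool:
    """Equality-only emulation of `where`: every key must equal the given value."""
    return all(metadata.get(key) == value for key, value in (where or {}).items())


def matches_where_document(text: str, where_document: Optional[Dict[str, str]]) -> bool:
    """Emulates $contains / $not_contains as substring tests on the document text."""
    for operator, needle in (where_document or {}).items():
        if operator == "$contains" and needle not in text:
            return False
        if operator == "$not_contains" and needle in text:
            return False
    return True


def candidate_ids(where=None, where_document=None) -> List[str]:
    """IDs that survive both filters; similarity ranking is left out."""
    return [
        doc_id
        for doc_id, (text, metadata) in DOCS.items()
        if matches_where(metadata, where) and matches_where_document(text, where_document)
    ]


print(candidate_ids(where_document={"$contains": "tutorial"}))  # ['d1', 'd5']
print(candidate_ids(where={"lang": "python"}, where_document={"$contains": "tutorial"}))  # ['d1']
```

The semantic query then ranks only the documents that pass both filters, which is why a combined filter can return fewer than n_results items.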

Take quiz

What does the where_document={'$contains': 'Python'} filter do in a ChromaDB query?

Returns only documents whose metadata has a 'Python' key

✗ Try again.

Restricts results to documents whose stored text content contains the substring 'Python'

✓ Correct! Well done.

Performs a semantic search for topics related to Python

✗ Try again.

Filters documents created by a user named Python

✗ Try again.

Can where and where_document be used together in a single ChromaDB query?

No, only one filter type is allowed per query

✗ Try again.

Yes — both filters are applied simultaneously, narrowing results to documents that match both the metadata condition and the content condition

✓ Correct! Well done.

Only when using the HttpClient

✗ Try again.

Only for get() calls, not query() calls

✗ Try again.

18. How do you control what data ChromaDB returns in query and get results using include?

Both query() and get() accept an include parameter — a list of strings specifying which fields to return. Omitting fields you don't need reduces network payload and memory, which matters for large result sets.

import chromadb

client = chromadb.Client()
col = client.create_collection("demo")
col.add(
    documents=["Alpha document", "Beta document", "Gamma document"],
    metadatas=[{"tag": "a"}, {"tag": "b"}, {"tag": "c"}],
    ids=["id1", "id2", "id3"],
)

# Default include: documents, metadatas, distances (for query)
# Default include for get(): documents, metadatas (no distances)
results = col.query(query_texts=["document"], n_results=2)
print(results.keys())
# dict_keys(["ids", "distances", "metadatas", "embeddings", "documents", "uris", "data"])
# embeddings, uris, data are None by default

# Only return IDs and distances: a compact response
results = col.query(
    query_texts=["alpha"],
    n_results=2,
    include=["distances"],  # ids are always returned
)
print(results["documents"])   # None
print(results["distances"])   # e.g. [[0.05, 0.72]] (illustrative values)

# Include raw embedding vectors (large! use only when needed)
results = col.query(
    query_texts=["beta"],
    n_results=1,
    include=["documents", "metadatas", "distances", "embeddings"],
)
print(len(results["embeddings"][0][0]))  # 384 floats per vector

# get() include: embeddings must be explicitly requested
all_data = col.get(
    include=["documents", "metadatas", "embeddings"],
)
print(len(all_data["embeddings"]))  # 3

# get() without include returns ids, documents and metadatas by default
defaults = col.get()
print(defaults["ids"])         # ["id1", "id2", "id3"]
print(defaults["documents"])   # ["Alpha...", "Beta...", "Gamma..."]

Available include values
Value	Returned in query()?	Returned in get()?
documents	Yes (default)	Yes (default)
metadatas	Yes (default)	Yes (default)
distances	Yes (default)	No — not applicable
embeddings	No (must request)	No (must request)
uris	No (multimodal only)	No (multimodal only)
data	No (multimodal only)	No (multimodal only)
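
Because query() results are column-oriented and nested (one inner list per query text), downstream code often wants one record per hit instead. The sketch below shows one possible way to reshape them while tolerating fields that were left out of include; rows_from_query is a hypothetical helper, and the input dict is a hand-written stand-in with made-up distances rather than real query output.

```python
from typing import Any, Dict, List


def rows_from_query(results: Dict[str, Any], query_index: int = 0) -> List[Dict[str, Any]]:
    """Turn the column-oriented result of query() into one dict per hit."""
    ids = results["ids"][query_index]
    rows: List[Dict[str, Any]] = [{"id": doc_id} for doc_id in ids]
    fields = (("documents", "document"), ("metadatas", "metadata"), ("distances", "distance"))
    for field, key in fields:
        column = results.get(field)
        if column is None:  # field was not requested via include
            continue
        for row, value in zip(rows, column[query_index]):
            row[key] = value
    return rows


# Hand-written stand-in shaped like query(..., include=["distances"]) output.
fake_results = {
    "ids": [["id1", "id2"]],
    "distances": [[0.25, 0.75]],
    "documents": None,
    "metadatas": None,
}
print(rows_from_query(fake_results))
# [{'id': 'id1', 'distance': 0.25}, {'id': 'id2', 'distance': 0.75}]
```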

Take quiz

Why would you exclude 'documents' from the include list in a ChromaDB query?

To make the query faster by skipping semantic search

✗ Try again.

To reduce response payload size when you only need IDs and distances to decide which documents to fetch separately

✓ Correct! Well done.

Documents cannot be included when embeddings are requested

✗ Try again.

To avoid returning duplicate results

✗ Try again.

Are IDs always returned in ChromaDB query and get results?

No, you must add 'ids' to the include list

✗ Try again.

Yes — IDs are always included regardless of the include parameter

✓ Correct! Well done.

Only when n_results=1

✗ Try again.

Only for get(), not query()

✗ Try again.

19. How do you design metadata schemas for effective filtering in ChromaDB?

Metadata in ChromaDB is stored as flat key-value dictionaries whose values must be scalars: strings, integers, floats, or (in current releases) booleans, not nested dicts or lists. Good metadata design makes the difference between fast, precise filtered queries and slow full-collection scans.

import chromadb
from datetime import datetime

client = chromadb.Client()
col = client.create_collection("knowledge_base")

# Good metadata design: flat, filterable fields
col.add(
    documents=[
        "Introduction to transformer architecture in deep learning.",
        "BERT: Pre-training of Deep Bidirectional Transformers.",
        "GPT-4 technical report overview.",
    ],
    metadatas=[
        {
            "source":    "textbook",
            "author":    "Vaswani",
            "year":      2017,           # int: supports $gt, $lt
            "category":  "architecture",
            "citations": 50000,          # int: supports range filters
            "language":  "en",
            # timestamp as int for range queries
            "added_ts":  int(datetime(2024,1,1).timestamp()),
        },
        {
            "source":    "paper",
            "author":    "Devlin",
            "year":      2018,
            "category":  "pretraining",
            "citations": 40000,
            "language":  "en",
            "added_ts":  int(datetime(2024,1,2).timestamp()),
        },
        {
            "source":    "report",
            "author":    "OpenAI",
            "year":      2023,
            "category":  "LLM",
            "citations": 5000,
            "language":  "en",
            "added_ts":  int(datetime(2024,1,3).timestamp()),
        },
    ],
    ids=["p1","p2","p3"],
)

# Effective filtered queries
results = col.query(
    query_texts=["neural network architecture"],
    n_results=5,
    where={"$and": [
        {"year":     {"$gte": 2017}},
        {"citations":{"$gte": 10000}},
        {"language": "en"},
    ]},
)

# Anti-patterns to avoid in metadata:
# BAD:  {"tags": ["python", "nlp"]}  (lists not supported)
# BAD:  {"author": {"name": "Vaswani", "affiliation": "Google"}}  (nested dicts not supported)
# GOOD: {"tag_python": 1, "tag_nlp": 1}  (flatten list membership to 0/1 flags)
# GOOD: {"author_name": "Vaswani", "author_org": "Google"}  (flatten nested fields)

Metadata value types
Type	Supported?	Supports range filters?
str	Yes	Only $eq, $ne, $in, $nin
int	Yes	Yes — $gt, $gte, $lt, $lte
float	Yes	Yes — $gt, $gte, $lt, $lte
bool	Yes in current releases (int 0/1 also works)	No
list	No	—
dict (nested)	No	—
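
The flattening rules above can be automated before calling add(). The sketch below is one possible convention, not a ChromaDB feature: flatten_metadata is a hypothetical helper that joins nested keys with an underscore and turns each list member into a 0/1-style flag, mirroring the tag_python pattern shown earlier.

```python
from typing import Any, Dict

SCALAR_TYPES = (str, int, float, bool)


def flatten_metadata(raw: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts and lists into scalar key-value pairs."""
    flat: Dict[str, Any] = {}
    for key, value in raw.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Nested dict: recurse and prefix child keys, e.g. author.name -> author_name
            flat.update(flatten_metadata(value, prefix=f"{name}_"))
        elif isinstance(value, (list, tuple)):
            # One flag per member, e.g. tags=["python"] -> tags_python=1
            for member in value:
                flat[f"{name}_{member}"] = 1
        elif isinstance(value, SCALAR_TYPES):
            flat[name] = value
        else:
            raise TypeError(f"Unsupported metadata value for {name!r}: {type(value).__name__}")
    return flat


raw = {
    "tags": ["python", "nlp"],
    "author": {"name": "Vaswani", "org": "Google"},
    "year": 2017,
}
print(flatten_metadata(raw))
# {'tags_python': 1, 'tags_nlp': 1, 'author_name': 'Vaswani', 'author_org': 'Google', 'year': 2017}
```

A query can then filter with where={"tags_python": 1} or where={"author_org": "Google"}.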

Take quiz

What metadata value types does ChromaDB support?

Any JSON-serialisable type including lists and nested dicts

✗ Try again.

Flat scalar values: strings, integers, floats, and (in current releases) booleans. Lists and nested dicts are not supported

✓ Correct! Well done.

Strings only

✗ Try again.

Integers and floats only — strings require a special wrapper

✗ Try again.

How would you store a list of tags like ['python', 'nlp'] as ChromaDB metadata?

{'tags': ['python', 'nlp']}, because lists are supported natively

✗ Try again.

Flatten to individual boolean fields: {'tag_python': 1, 'tag_nlp': 1} and filter with $eq: 1

✓ Correct! Well done.

Convert to a JSON string: {'tags': '["python","nlp"]'} and filter with $contains

✗ Try again.

Tags cannot be stored in ChromaDB metadata

✗ Try again.

20. How do you inspect a ChromaDB collection's contents and configuration?

ChromaDB provides several methods to examine what is stored in a collection — useful for debugging, verifying ingestion, and monitoring collection health.

import chromadb

client = chromadb.PersistentClient(path="./inspect_demo")
col = client.get_or_create_collection(
    "articles",
    metadata={"hnsw:space": "cosine"},
)
col.add(
    documents=[f"Article {i} about topic {i%3}" for i in range(20)],
    metadatas=[{"topic": i%3, "idx": i} for i in range(20)],
    ids=[f"art-{i}" for i in range(20)],
)

# 1. Count documents
print(col.count())  # 20

# 2. Peek: quick look at first n items (default n=10)
peek = col.peek(limit=5)
print(peek["ids"])        # first 5 IDs
print(peek["documents"])  # first 5 documents

# 3. Get all (careful with large collections!)
all_items = col.get()
print(len(all_items["ids"]))  # 20

# 4. Get a page of results (offset-based)
page = col.get(
    limit=5,
    offset=10,  # skip first 10
)
print(page["ids"])  # art-10 through art-14

# 5. Inspect collection metadata and config
print(col.name)      # "articles"
print(col.id)        # UUID
print(col.metadata)  # {"hnsw:space": "cosine"}

# 6. List all collections
for c in client.list_collections():
    print(c)  # collection names or Collection objects, depending on the ChromaDB version

# 7. Check if a document exists by ID
result = col.get(ids=["art-5"])
if result["ids"]:
    print("Found:", result["documents"][0])
else:
    print("Not found")

Collection inspection methods
Method / Property	Purpose
collection.count()	Number of documents stored
collection.peek(limit=10)	Quick sample of first N items
collection.get()	Retrieve all items (paginate large collections)
collection.get(limit=N, offset=M)	Paginate through collection
collection.name	Collection name string
collection.metadata	Dict of collection settings (hnsw:space etc.)
client.list_collections()	All collections (names or Collection objects, depending on version)
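
The limit/offset pattern can be wrapped in a generator so callers never hold more than one page in memory. The sketch below is a minimal illustration: iter_pages is a hypothetical helper that only relies on get(limit=..., offset=...) returning a dict with an "ids" list, and FakeCollection is a stand-in so the example runs without ChromaDB.

```python
from typing import Any, Dict, Iterator


def iter_pages(collection: Any, page_size: int = 100) -> Iterator[Dict[str, Any]]:
    """Yield successive get(limit=..., offset=...) pages until one comes back empty."""
    offset = 0
    while True:
        page = collection.get(limit=page_size, offset=offset)
        if not page["ids"]:
            return
        yield page
        offset += len(page["ids"])


class FakeCollection:
    """Stand-in with the same get(limit, offset) shape as a real collection."""

    def __init__(self, n: int):
        self._ids = [f"art-{i}" for i in range(n)]

    def get(self, limit: int, offset: int) -> Dict[str, Any]:
        return {"ids": self._ids[offset:offset + limit]}


for page in iter_pages(FakeCollection(12), page_size=5):
    print(page["ids"][0], len(page["ids"]))
# art-0 5
# art-5 5
# art-10 2
```

With a real collection the call would be iter_pages(col, page_size=1000), and each page would also carry documents and metadatas.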

Take quiz

What does collection.peek() return?

A statistical summary of the stored vectors

✗ Try again.

A sample of the first N documents with their IDs, text, and metadata — useful for a quick sanity check during development

✓ Correct! Well done.

The collection's configuration metadata

✗ Try again.

A random sample of documents from throughout the collection

✗ Try again.

How do you paginate through a large ChromaDB collection to avoid loading everything into memory at once?

Use query() with a large n_results value

✗ Try again.

Use get(limit=N, offset=M) to retrieve N items starting at position M

✓ Correct! Well done.

ChromaDB automatically paginates — no special handling needed

✗ Try again.

Use collection.stream() for pagination

✗ Try again.

21. How do you build a basic RAG (Retrieval-Augmented Generation) pipeline with ChromaDB?

RAG combines ChromaDB's semantic retrieval with an LLM's generation ability. The pipeline has two phases: indexing (chunk documents, embed, store in ChromaDB) and retrieval (embed the user query, fetch similar chunks, inject into LLM prompt).

import chromadb
from chromadb.utils import embedding_functions
from openai import OpenAI
import os

# --- INDEXING PHASE (run once) ---
chroma_client = chromadb.PersistentClient(path="./rag_db")
ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key=os.environ["OPENAI_API_KEY"],
    model_name="text-embedding-3-small",
)
collection = chroma_client.get_or_create_collection(
    "company_docs", embedding_function=ef, metadata={"hnsw:space": "cosine"}
)

# Chunk and index your knowledge base
documents = [
    "ChromaDB supports cosine, l2, and inner-product distance metrics.",
    "Persistent storage in ChromaDB uses SQLite under the hood.",
    "The default embedding model is all-MiniLM-L6-v2 with 384 dimensions.",
    "ChromaDB collections support metadata filtering with $eq, $gt, $in operators.",
]
collection.add(
    documents=documents,
    ids=[f"doc-{i}" for i in range(len(documents))],
)

# --- RETRIEVAL + GENERATION PHASE (run per query) ---
def rag_answer(user_question: str, n_results: int = 3) -> str:
    # 1. Retrieve relevant chunks from ChromaDB
    results = collection.query(
        query_texts=[user_question],
        n_results=n_results,
        include=["documents", "distances"],
    )
    context_chunks = results["documents"][0]  # list of retrieved texts
    context = "\n\n".join(
        f"[{i+1}] {chunk}" for i, chunk in enumerate(context_chunks)
    )

    # 2. Build an augmented prompt
    prompt = f"""Answer the question using ONLY the context below.
    If the answer is not in the context, say "I don't know."

    Context:
    {context}

    Question: {user_question}
    Answer:"""

    # 3. Generate answer with LLM
    openai_client = OpenAI()
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(rag_answer("What distance metrics does ChromaDB support?"))
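
The pipeline above requests distances but never uses them. One possible refinement, sketched below, is to drop weakly related chunks before building the prompt so the LLM is not distracted by them. build_context is a hypothetical helper, the 0.5 cut-off is an arbitrary assumption that would need tuning per embedding model and metric, and the input is a hand-written stand-in for collection.query output with illustrative distances.

```python
from typing import Any, Dict


def build_context(results: Dict[str, Any], max_distance: float) -> str:
    """Number the retrieved chunks, dropping any whose distance exceeds max_distance."""
    documents = results["documents"][0]
    distances = results["distances"][0]
    kept = [doc for doc, dist in zip(documents, distances) if dist <= max_distance]
    return "\n\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(kept))


# Hand-written stand-in for collection.query(...); distances are illustrative.
fake_results = {
    "documents": [[
        "ChromaDB supports cosine, l2, and inner-product distance metrics.",
        "Persistent storage in ChromaDB uses SQLite under the hood.",
    ]],
    "distances": [[0.21, 0.83]],
}
print(build_context(fake_results, max_distance=0.5))
# [1] ChromaDB supports cosine, l2, and inner-product distance metrics.
```

If the returned context is empty, the caller can answer "I don't know" directly and skip the LLM call.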

Take quiz

In a RAG pipeline, what role does ChromaDB play?

It generates the final answer using an LLM

✗ Try again.

It stores and retrieves semantically relevant document chunks that are injected as context into the LLM prompt

✓ Correct! Well done.

It fine-tunes the LLM on your documents

✗ Try again.

It handles the HTTP API layer between the user and the LLM

✗ Try again.

Why is retrieval-augmented generation (RAG) preferred over fine-tuning for adding domain knowledge to an LLM?

RAG produces better answers than fine-tuning in all cases

✗ Try again.

RAG allows knowledge to be updated instantly by changing ChromaDB's documents without retraining the model — fine-tuning is expensive and static

✓ Correct! Well done.

RAG is cheaper because it does not use any LLM API calls

✗ Try again.

Fine-tuning is not supported by current LLMs

✗ Try again.

22. What are effective document chunking strategies when indexing documents into ChromaDB for RAG?

Before adding documents to ChromaDB, long texts must be split into chunks that fit within the embedding model's token limit and contain cohesive information. Chunk size and overlap directly affect retrieval quality.

# pip install langchain-text-splitters
from langchain_text_splitters import (
    RecursiveCharacterTextSplitter,
    TokenTextSplitter,
)
import chromadb

# RecursiveCharacterTextSplitter: tries to split at natural boundaries
# (paragraphs → sentences → words → characters)
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,      # characters per chunk (very roughly 100-150 tokens of English text)
    chunk_overlap=50,    # overlap prevents losing context at chunk boundaries
    separators=["\n\n", "\n", ". ", " ", ""],
)

long_document = """ChromaDB is an open-source vector database.
It supports multiple embedding functions including OpenAI and HuggingFace.
ChromaDB uses HNSW for approximate nearest-neighbour search.
You can filter results using metadata fields.
Persistent storage uses SQLite under the hood.
""" * 20  # repeat to make it long

chunks = splitter.split_text(long_document)
print(f"Split into {len(chunks)} chunks")
print(f"First chunk length: {len(chunks[0])} chars")

# Add chunks to ChromaDB with source metadata
client = chromadb.Client()
col = client.create_collection("chunked_docs")

col.add(
    documents=chunks,
    metadatas=[{"source": "chroma_guide.txt", "chunk_idx": i}
               for i in range(len(chunks))],
    ids=[f"chunk-{i}" for i in range(len(chunks))],
)

Chunking strategy comparison
Strategy	Chunk size	Overlap	Best for
Small chunks	100–200 tokens	10–20 tokens	Precise retrieval, FAQ-style docs
Medium chunks	300–500 tokens	50 tokens	Most RAG use cases — good balance
Large chunks	800–1000 tokens	100 tokens	Long-form prose where context matters
Semantic chunking	Variable	0	Academic papers, structured content

Key rule: overlap repeats the tail of each chunk at the start of the next, so a sentence that straddles a chunk boundary still appears whole in at least one chunk instead of losing its meaning in both halves. Typical overlap is 10–20% of chunk size.
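
The effect of overlap is easiest to see on a tiny input. The sketch below is a simplified word-based sliding window, not the algorithm RecursiveCharacterTextSplitter uses; chunk_words is a hypothetical helper, and the sizes are counted in words purely to keep the output readable.

```python
from typing import List


def chunk_words(text: str, chunk_size: int, overlap: int) -> List[str]:
    """Split text into chunks of chunk_size words that share `overlap` words with the next chunk."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


text = "one two three four five six seven eight nine ten"
for chunk in chunk_words(text, chunk_size=4, overlap=1):
    print(chunk)
# one two three four
# four five six seven
# seven eight nine ten
```

The words "four" and "seven" appear in two chunks each, which is exactly the redundancy that keeps boundary content retrievable.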

Take quiz

Why is chunk overlap important when splitting documents for ChromaDB RAG indexing?

Overlap increases the total number of chunks stored, improving recall

✗ Try again.

Overlap ensures that sentences or ideas at chunk boundaries are not cut off — a key phrase split across two chunks appears fully in at least one

✓ Correct! Well done.

Overlap is required by ChromaDB's HNSW index

✗ Try again.

Overlap allows ChromaDB to detect duplicate content

✗ Try again.

What is a good general-purpose chunk size (in tokens) for RAG document indexing?

10–20 tokens: smaller is always better

✗ Try again.

300–500 tokens — balances containing enough context with staying within embedding model limits

✓ Correct! Well done.

5000 tokens — the more context per chunk the better

✗ Try again.

Exactly 512 tokens — the only size supported by embedding models

✗ Try again.

23. How do you use ChromaDB as a vector store with LangChain?

LangChain provides a first-class Chroma vector store integration that wraps ChromaDB's API with LangChain's retriever interface. This enables plugging ChromaDB into LangChain RAG chains, agents, and pipelines without writing low-level ChromaDB code.

# pip install langchain langchain-chroma langchain-openai
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_core.documents import Document
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
import os

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# --- Option 1: Create from documents ---
docs = [
    Document(page_content="ChromaDB is a vector database.", metadata={"source": "intro"}),
    Document(page_content="HNSW is used for ANN search.",  metadata={"source": "tech"}),
    Document(page_content="RAG improves LLM accuracy.",    metadata={"source": "ai"}),
]
vectorstore = Chroma.from_documents(
    documents=docs,
    embedding=embeddings,
    collection_name="lc_demo",
    persist_directory="./lc_chroma",  # persistent storage
)

# --- Option 2: Load existing ChromaDB ---
vectorstore = Chroma(
    collection_name="lc_demo",
    embedding_function=embeddings,
    persist_directory="./lc_chroma",
)

# Similarity search
results = vectorstore.similarity_search("vector databases", k=2)
for doc in results:
    print(doc.page_content)

# As retriever (for use in chains)
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 3, "filter": {"source": "tech"}},
)

# Build a simple RAG chain with LCEL
llm = ChatOpenAI(model="gpt-4o-mini")
prompt = ChatPromptTemplate.from_template(
    "Answer based on context:\n\n{context}\n\nQuestion: {question}"
)

def format_docs(docs):
    return "\n\n".join(d.page_content for d in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt | llm | StrOutputParser()
)
print(rag_chain.invoke("What search algorithm does ChromaDB use?"))

Quiz

What does vectorstore.as_retriever() return in LangChain?
✗ A raw ChromaDB collection object
✓ A LangChain Retriever interface: a standard object that can be plugged into any LangChain chain or agent to fetch relevant documents for a query
✗ A list of all documents in the vector store
✗ An OpenAI client configured to use ChromaDB as context

What is the main advantage of using LangChain's Chroma integration over using chromadb directly?
✗ LangChain's version of ChromaDB is faster
✓ The LangChain Chroma wrapper implements standard retriever interfaces, making it composable with LangChain chains, agents, memory modules, and other ecosystem tools without custom glue code
✗ LangChain Chroma supports more embedding models
✗ LangChain Chroma automatically chunks documents

24. How do you implement multi-tenancy or data isolation in ChromaDB?

ChromaDB does not have built-in user-level access control, but you can implement logical isolation between tenants using separate collections per tenant (strong isolation) or metadata-based filtering (lighter weight). Choose based on your security and scale requirements.

import chromadb

client = chromadb.PersistentClient(path="./multi_tenant")

# --- Strategy 1: Separate collection per tenant ---
# Strong isolation: one tenant cannot accidentally access another's data
def get_tenant_collection(tenant_id: str):
    collection_name = f"tenant_{tenant_id}"  # e.g. "tenant_acme_corp"
    return client.get_or_create_collection(
        name=collection_name,
        metadata={"hnsw:space": "cosine", "tenant": tenant_id},
    )

col_acme  = get_tenant_collection("acme_corp")
col_globex = get_tenant_collection("globex_inc")

col_acme.add(
    documents=["ACME internal policy v1"],
    ids=["acme-doc-1"],
)
col_globex.add(
    documents=["Globex product catalogue"],
    ids=["globex-doc-1"],
)
# ACME queries can never return Globex data: total isolation

# --- Strategy 2: Metadata filtering (shared collection) ---
# Lighter weight: all tenants share one collection, filtered at query time
shared_col = client.get_or_create_collection("shared_docs")

shared_col.add(
    documents=["ACME policy", "Globex catalogue"],
    metadatas=[{"tenant_id": "acme"}, {"tenant_id": "globex"}],
    ids=["s1", "s2"],
)

def tenant_query(tenant_id: str, query: str, n: int = 3):
    return shared_col.query(
        query_texts=[query],
        n_results=n,
        where={"tenant_id": tenant_id},  # ALWAYS filter by tenant
    )

results = tenant_query("acme", "company policies")
print(results["documents"])  # Only ACME docs returned

Multi-tenancy strategies
Strategy	Isolation	Overhead	Best for
Separate collections	Strong — no cross-tenant risk	More collections to manage	High-security, regulated industries
Metadata filter	Logical — relies on query discipline	Single collection, simpler ops	Many small tenants, lower risk
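Since the metadata-filter strategy relies on query discipline, one way to reduce the risk is to build every `where` clause through a single helper that cannot omit the tenant. The sketch below is an assumption-laden illustration: `tenant_where` is a hypothetical function, and it only constructs the filter dictionary (using the `$and` operator) rather than calling ChromaDB.

```python
from typing import Any, Optional


def tenant_where(tenant_id: str, extra: Optional[dict[str, Any]] = None) -> dict[str, Any]:
    """Build a where-filter that always includes the tenant clause."""
    if not tenant_id:
        raise ValueError("tenant_id is required for every query")
    tenant_clause = {"tenant_id": tenant_id}
    if not extra:
        return tenant_clause
    # Combine the mandatory tenant clause with the caller's own filter
    return {"$and": [tenant_clause, extra]}


print(tenant_where("acme"))
print(tenant_where("acme", {"category": "billing"}))
```

The result would be passed as `where=tenant_where(...)` in a query on the shared collection, so application code never writes a tenant filter by hand.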

Quiz

What is the risk of using metadata filtering for multi-tenancy in ChromaDB instead of separate collections?
✗ Metadata filtering is slower than collection separation
✓ If a query accidentally omits the tenant_id filter, it may return data from other tenants: metadata isolation depends on query discipline, not database-level access control
✗ ChromaDB charges more for metadata filtering
✗ Metadata filtering does not work with cosine distance

How do you name collections for per-tenant isolation in ChromaDB?
✗ Use a single shared collection with a tenant field
✓ Create a separate collection per tenant, named with a consistent pattern like 'tenant_{tenant_id}', and use get_or_create_collection for safe initialisation
✗ ChromaDB has a built-in tenant() method for this purpose
✗ Prefix all document IDs with the tenant ID in a shared collection

25. What is embedding consistency and why is it critical in ChromaDB applications?

Embedding consistency means using the exact same embedding model and version for both indexing (adding documents) and querying. If you embed documents with model A but query with model B, the resulting vectors live in incompatible geometric spaces — similarity distances become meaningless and retrieval quality collapses.

import chromadb
from chromadb.utils import embedding_functions

client = chromadb.PersistentClient(path="./consistency_demo")

# CORRECT: same embedding function for add and query
ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)
collection = client.get_or_create_collection(
    "correct_usage",
    embedding_function=ef,  # stored on collection
)
collection.add(
    documents=["Hello world"],
    ids=["d1"],
)
# query() automatically uses the same ef stored on the collection
results = collection.query(query_texts=["greetings"], n_results=1)
# Works correctly: ef is applied to both document and query

# ---
# PITFALL 1: switching models between sessions
# Session 1: add with all-MiniLM-L6-v2 (384 dims)
# Session 2: accidentally use all-mpnet-base-v2 (768 dims) -> dimension mismatch error!

# PITFALL 2: updating embedding model version
# Model v1.0 and v1.1 may produce different vector spaces
# Always re-embed ALL documents when upgrading the embedding model

# BEST PRACTICE: store the model name in collection metadata
collection_safe = client.get_or_create_collection(
    "safe_collection",
    embedding_function=ef,
    metadata={
        "hnsw:space": "cosine",
        "embedding_model": "all-MiniLM-L6-v2",  # document which model was used
        "embedding_dim":   "384",
    },
)
# On load, verify the model matches what is stored:
meta = collection_safe.metadata
print(meta["embedding_model"])  # "all-MiniLM-L6-v2"
print(meta["embedding_dim"])    # "384"

# When you need to upgrade the embedding model:
# 1. Create a NEW collection with the new model
# 2. Re-embed and re-insert all documents
# 3. Run validation queries to confirm quality
# 4. Delete the old collection

Embedding consistency checklist
Check	Why
Same model name	Different models produce vectors in different spaces
Same model version	Even minor version updates can shift the vector space
Same preprocessing	Lowercasing, truncation, etc. must be identical
Store model name in metadata	Documents which model was used for future reference
Re-embed on model upgrade	Old and new vectors cannot coexist in the same collection
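The checklist can be turned into a startup guard. The following is a minimal sketch that assumes the model name and dimension were stored under the `embedding_model` and `embedding_dim` metadata keys, as in the example above; `check_embedding_metadata` is a hypothetical helper that works on a plain metadata dictionary.

```python
from typing import Optional


def check_embedding_metadata(metadata: Optional[dict], expected_model: str, expected_dim: int) -> list[str]:
    """Return a list of mismatches; an empty list means the runtime model matches the collection."""
    problems = []
    metadata = metadata or {}
    stored_model = metadata.get("embedding_model")
    if stored_model is None:
        problems.append("collection metadata has no 'embedding_model' entry")
    elif stored_model != expected_model:
        problems.append(f"model mismatch: stored {stored_model!r}, runtime {expected_model!r}")
    stored_dim = metadata.get("embedding_dim")
    # The dimension may have been stored as a string, so compare as integers
    if stored_dim is not None and int(stored_dim) != expected_dim:
        problems.append(f"dimension mismatch: stored {stored_dim}, runtime {expected_dim}")
    return problems


stored = {"hnsw:space": "cosine", "embedding_model": "all-MiniLM-L6-v2", "embedding_dim": "384"}
ok = check_embedding_metadata(stored, "all-MiniLM-L6-v2", 384)
bad = check_embedding_metadata(stored, "all-mpnet-base-v2", 768)
print(ok)
print(bad)
```

In an application this would run against `collection.metadata` right after opening the collection, failing fast before any query is embedded with the wrong model.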

Quiz

What happens if you add documents to ChromaDB with one embedding model and query with a different one that produces vectors of the same dimension?
✗ ChromaDB automatically converts between model spaces
✓ Similarity distances become meaningless: vectors from different models exist in incompatible spaces, causing retrieval to return wrong or random results
✗ ChromaDB raises an error and rejects the query
✗ The query returns zero results

What is the best practice for recording which embedding model was used for a ChromaDB collection?
✗ Write it in a separate README file
✓ Store the model name and dimensions in the collection's metadata dictionary at creation time
✗ ChromaDB records this automatically in its logs
✗ It is not necessary because ChromaDB remembers the model internally

26. How do you run ChromaDB as a standalone HTTP server and connect to it from multiple clients?

For production or multi-process environments, run ChromaDB as a persistent HTTP server and connect all clients via chromadb.HttpClient(). A single server process then owns the on-disk storage, so clients no longer contend for the SQLite file, and any number of clients, including ones written in other languages, can share the same database.

# --- SERVER SIDE ---
# Install: pip install chromadb
# Start the server from the command line:
# chroma run --path ./chroma_data --port 8000 --host 0.0.0.0

import chromadb

# --- CLIENT SIDE ---
client = chromadb.HttpClient(
    host="localhost",
    port=8000,
)

# Verify server is reachable
client.heartbeat()  # raises an exception if the server is unreachable

# Usage is identical to PersistentClient
collection = client.get_or_create_collection(
    "shared_docs",
    metadata={"hnsw:space": "cosine"},
)
collection.add(
    documents=["Shared document from client 1"],
    ids=["s1"],
)
results = collection.query(query_texts=["shared content"], n_results=1)
print(results["documents"])

# With authentication (chromadb server configured with auth)
client_auth = chromadb.HttpClient(
    host="my-server.example.com",
    port=443,
    ssl=True,
    headers={"Authorization": "Bearer my-token"},
)

# docker-compose.yml: containerised ChromaDB server
# version: "3.9"
# services:
#   chromadb:
#     image: chromadb/chroma:latest
#     ports:
#       - "8000:8000"
#     volumes:
#       - chroma_data:/chroma/chroma
#     environment:
#       - IS_PERSISTENT=TRUE
#       - ANONYMIZED_TELEMETRY=FALSE
# volumes:
#   chroma_data:

Client modes comparison
Mode	Concurrency	Network	Use case
EphemeralClient	Single process only	None	Tests, notebooks
PersistentClient	Single writer only	None	Local scripts, dev
HttpClient	Multiple clients	HTTP/HTTPS	Production, microservices
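Because HttpClient depends on a separate server process, services often wait for the server to come up before they start indexing or querying. Below is a hedged sketch of such a readiness loop: `wait_until_ready` is a hypothetical helper that accepts any zero-argument callable, so in real code you would pass `client.heartbeat`; the demo uses a fake heartbeat so it runs without a server.

```python
import time
from typing import Callable


def wait_until_ready(heartbeat: Callable[[], object], timeout: float = 10.0, interval: float = 0.5) -> bool:
    """Call `heartbeat` until it succeeds or `timeout` seconds have passed."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            heartbeat()
            return True
        except Exception:
            # Treat any failure as "not ready yet" until the deadline passes
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


# Fake heartbeat for illustration: fails twice, then succeeds
calls = {"n": 0}

def flaky_heartbeat():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("server not up yet")
    return 1

ready = wait_until_ready(flaky_heartbeat, timeout=2.0, interval=0.01)
print(ready, calls["n"])
```

Catching every exception keeps the sketch independent of which HTTP library a given ChromaDB release uses; a production version might narrow this to connection errors only.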

Quiz

What command starts a ChromaDB HTTP server from the terminal?
✗ chromadb start --port 8000
✓ chroma run --path ./data --port 8000
✗ python -m chromadb.server
✗ chromadb serve --dir ./data

Why is HttpClient preferred over PersistentClient in a multi-container deployment?
✗ HttpClient uses less memory
✓ PersistentClient uses SQLite, which allows only one writer at a time; HttpClient connects to a server process that handles concurrent access safely
✗ HttpClient supports more distance metrics
✗ PersistentClient cannot store more than 10,000 documents

27. When should you use upsert() instead of add() in ChromaDB, and what are common patterns?

upsert() is the idempotent write operation in ChromaDB: it inserts a document if the ID does not exist, or updates it if the ID already exists. This makes it safe to call repeatedly without checking whether a document has been indexed before — a critical property for ETL pipelines, scheduled sync jobs, and incremental indexing.

import chromadb
from datetime import datetime

client = chromadb.PersistentClient(path="./upsert_demo")
col = client.get_or_create_collection("products")

# Pattern 1: Safe initial load
# Can re-run the script without duplicate ID errors
def sync_products(products: list[dict]):
    col.upsert(
        documents=[p["description"] for p in products],
        ids=       [str(p["id"])   for p in products],
        metadatas= [{"name": p["name"], "price": p["price"], "updated": int(datetime.now().timestamp())}
                    for p in products],
    )

products_v1 = [
    {"id": 1, "name": "Widget", "description": "A blue widget", "price": 9.99},
    {"id": 2, "name": "Gadget", "description": "A red gadget",  "price": 14.99},
]
sync_products(products_v1)  # inserts both
print(col.count())  # 2

# Product 1 description changed: upsert handles it cleanly
products_v2 = [
    {"id": 1, "name": "Widget", "description": "An improved blue widget v2", "price": 11.99},
    {"id": 3, "name": "Doohickey", "description": "A green doohickey", "price": 4.99},
]
sync_products(products_v2)  # updates id=1, inserts id=3
print(col.count())          # 3

# Verify the update
result = col.get(ids=["1"])
print(result["documents"][0])   # "An improved blue widget v2"
print(result["metadatas"][0]["price"])  # 11.99

# Pattern 2: Incremental indexing (only upsert changed documents)
def incremental_sync(items, last_sync_ts: int):
    changed = [i for i in items if i["updated_at"] > last_sync_ts]
    if changed:
        col.upsert(
            documents=[i["body"] for i in changed],
            ids=       [i["id"]  for i in changed],
            metadatas= [{"updated_at": i["updated_at"]} for i in changed],
        )

add vs upsert decision guide
Scenario	Use
First-time bulk load with guaranteed unique IDs	add() — faster, errors catch duplicate bugs
Recurring sync job (daily/hourly)	upsert() — safe to re-run without cleanup
User-triggered document update	upsert() — don't need to check if doc exists first
Append-only event log	add() — duplicates should be errors, not updates
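For recurring sync jobs where the source has no reliable `updated_at` field, one possible alternative is to detect changes by content hash and upsert only what differs. This is a sketch under stated assumptions: `content_hash` and `select_changed` are hypothetical helpers, and the hash of each indexed document is assumed to be kept somewhere the job can read it back (for example in document metadata).

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document body."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def select_changed(items: list[dict], known_hashes: dict[str, str]) -> list[dict]:
    """Return only items whose body is new or differs from the last indexed version."""
    changed = []
    for item in items:
        digest = content_hash(item["body"])
        if known_hashes.get(item["id"]) != digest:
            changed.append({**item, "content_hash": digest})
    return changed


items = [{"id": "a", "body": "alpha"}, {"id": "b", "body": "beta"}]
known = {"a": content_hash("alpha"), "b": content_hash("old beta")}
changed = select_changed(items, known)
print([c["id"] for c in changed])
```

Only the items in `changed` would be sent to `upsert()`, which avoids re-embedding documents whose text has not changed.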

Quiz

Why is upsert() preferred over add() for a nightly ETL job that syncs a product catalogue into ChromaDB?
✗ upsert() is always faster than add()
✓ upsert() is idempotent: if the job re-runs or a document already exists, it updates rather than raising an error, making the pipeline resilient to restarts
✗ add() does not support metadata
✗ ETL jobs require upsert() by ChromaDB's API contract

What happens to the stored embedding when you upsert() a document with updated text?
✗ The old embedding is kept and only metadata is updated
✓ ChromaDB re-embeds the new document text and stores the updated vector alongside the new content
✗ upsert() cannot change document text, only metadata
✗ The embedding is deleted and must be provided manually

28. What are best practices for structuring ChromaDB collection metadata for production use?

Collection-level metadata (set via create_collection(metadata=...)) stores configuration about the collection itself. Document-level metadata (set per document via add(metadatas=[...])) enables filtered retrieval. Both need thoughtful design for maintainable production systems.

import chromadb
from datetime import datetime

client = chromadb.PersistentClient(path="./prod_db")

# Good collection-level metadata: document operational details
collection = client.get_or_create_collection(
    name="support_tickets_v2",
    metadata={
        # HNSW config
        "hnsw:space":           "cosine",
        "hnsw:construction_ef": 200,
        "hnsw:search_ef":       100,
        # Operational metadata
        "embedding_model":      "text-embedding-3-small",
        "embedding_dims":       "1536",
        "schema_version":       "2",
        "created_at":           "2024-01-15",
        "description":          "Customer support ticket embeddings for semantic search",
    },
)

# Good document-level metadata: filterable, flat, typed
def add_ticket(ticket: dict):
    collection.upsert(
        documents=[ticket["description"]],
        ids=[f"ticket-{ticket['id']}"],
        metadatas=[{
            # Filterable dimensions
            "status":    ticket["status"],         # "open"/"closed"/"pending"
            "priority":  ticket["priority"],       # "low"/"medium"/"high"
            "category":  ticket["category"],       # "billing"/"technical"/"general"
            "agent_id":  ticket["agent_id"],       # str identifier
            # Date as Unix timestamp (int): enables $gt/$lt range queries
            "created_ts": int(datetime.fromisoformat(ticket["created_at"]).timestamp()),
            "year":       int(ticket["created_at"][:4]),
            # Boolean as int: needed on older ChromaDB releases that did not accept bool values
            "is_escalated": int(ticket.get("escalated", False)),
        }],
    )

# Effective compound filter
results = collection.query(
    query_texts=["payment failed cannot checkout"],
    n_results=10,
    where={"$and": [
        {"status":    "open"},
        {"priority":  {"$in": ["high", "medium"]}},
        {"category":  "billing"},
        {"created_ts":{"$gte": int(datetime(2024, 1, 1).timestamp())}},
    ]},
)

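Key rules: store dates as Unix timestamps (int) for range filtering. Store booleans as 0/1 integers if you must support older ChromaDB releases that did not accept bool values (recent releases accept booleans directly). Keep metadata keys short and snake_case. Document your schema in collection-level metadata so future developers know what fields exist.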
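These rules can be applied in one place before documents reach `add()` or `upsert()`. The sketch below is a hypothetical `normalise_metadata` helper, not part of ChromaDB: it flattens nested dictionaries, converts datetimes to integer Unix timestamps, stores booleans as 0/1 per the rule above, drops `None` values, and rejects anything else.

```python
from datetime import datetime, timezone


def normalise_metadata(raw: dict, prefix: str = "") -> dict:
    """Return a flat dict containing only str, int, and float values."""
    flat = {}
    for key, value in raw.items():
        name = f"{prefix}{key}"
        if value is None:
            continue  # skip missing values instead of storing a placeholder
        if isinstance(value, bool):  # check bool before int: True is also an int
            flat[name] = int(value)
        elif isinstance(value, (str, int, float)):
            flat[name] = value
        elif isinstance(value, datetime):
            flat[name] = int(value.timestamp())
        elif isinstance(value, dict):
            flat.update(normalise_metadata(value, prefix=f"{name}_"))
        else:
            raise TypeError(f"unsupported metadata type for {name!r}: {type(value).__name__}")
    return flat


ticket = {
    "status": "open",
    "escalated": True,
    "created": datetime(2024, 1, 15, tzinfo=timezone.utc),
    "agent": {"id": "a-7", "team": None},
}
clean = normalise_metadata(ticket)
print(clean)
```

Raising on unknown types (lists, sets, custom objects) surfaces schema mistakes at ingestion time rather than as failed filters later.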

Quiz

Why should dates be stored as Unix timestamps (integers) rather than ISO date strings in ChromaDB metadata?
✗ ChromaDB cannot store strings in metadata
✓ Integer timestamps work with the range operators ($gt, $gte, $lt, $lte), enabling 'find documents from the last 30 days' queries; these operators compare numeric values, so an ISO date string cannot be range-filtered
✗ Timestamps use less storage space
✗ ISO strings cause encoding errors in ChromaDB

How does this guide recommend storing a boolean value (True/False) in ChromaDB document metadata so that it works across ChromaDB releases?
✗ Use Python True/False directly and assume every release accepts bool
✓ Store as integer 1 or 0: older releases did not accept the bool type, and integers work with $eq filtering
✗ Convert to string 'true' or 'false'
✗ Use a nested dict: {'value': True, 'type': 'bool'}

29. How does ChromaDB compare to FAISS, and when should you choose one over the other?

FAISS (Facebook AI Similarity Search) and ChromaDB both store and search embedding vectors, but they are designed for very different use cases. FAISS is a low-level library optimised for raw performance; ChromaDB is a higher-level database designed for developer ergonomics and full-stack AI applications.

ChromaDB vs FAISS
Feature	ChromaDB	FAISS
Type	Vector database (full-stack)	Vector index library (low-level)
Storage	Persistent SQLite + HNSW files	In-memory or flat files (manual)
Metadata	Built-in key-value filtering	No metadata — must manage separately
Documents	Stores original text alongside vectors	Stores vectors only — text management is manual
Persistence	Built-in PersistentClient	Manual save/load with faiss.write_index()
CRUD	add, get, update, delete, upsert	Mostly append-only: some index types support remove_ids(), but there is no in-place update
API	High-level Python + REST	Low-level Python/C++ bindings
Performance	Good for <10M docs	Excellent for 10M+ docs (GPU-accelerated)
Embedding function	Built-in (auto-embed text)	You must manage embeddings yourself
Best for	RAG apps, prototyping, small-medium scale	High-throughput ML systems, research, scale

# FAISS: lower level, manage everything manually
import faiss
import numpy as np

# Build index manually
dim = 384
index = faiss.IndexFlatIP(dim)           # inner product
vectors = np.random.rand(1000, dim).astype("float32")
faiss.normalize_L2(vectors)
index.add(vectors)                        # add vectors
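query_vec = np.random.rand(1, dim).astype("float32")  # one query, shape (1, dim)
faiss.normalize_L2(query_vec)             # normalise so inner product equals cosine similarity
D, I = index.search(query_vec, 5)         # search: scores D and row ids I of the top 5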
faiss.write_index(index, "index.faiss")  # save manually

# ChromaDB: higher level, text in, results out
import chromadb
client = chromadb.Client()
col = client.create_collection("demo")
col.add(documents=["text one", "text two"], ids=["1","2"])
results = col.query(query_texts=["similar text"], n_results=2)
# Embeddings, persistence, metadata all handled automatically
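To make the "low-level" side of the comparison concrete, the following dependency-free sketch shows what an exact inner-product search over normalised vectors computes: score every stored vector against the query and keep the top k. It illustrates the idea only; `normalise` and `flat_ip_search` are hypothetical names, and a real FAISS index performs the same scan with heavily optimised native code.

```python
import math


def normalise(vec: list[float]) -> list[float]:
    """Scale a vector to unit length so inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in vec]


def flat_ip_search(index_vectors: list[list[float]], query: list[float], k: int):
    """Exact top-k by inner product: scan every stored vector, sort by score."""
    scored = [
        (sum(a * b for a, b in zip(vec, query)), row_id)
        for row_id, vec in enumerate(index_vectors)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    top = scored[:k]
    return [score for score, _ in top], [row_id for _, row_id in top]


vectors = [normalise(v) for v in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0])]
query = normalise([1.0, 0.2])
scores, ids = flat_ip_search(vectors, query, k=2)
print(ids, [round(s, 3) for s in scores])
```

Note that the search returns only row ids and scores. Mapping ids back to text and metadata is the bookkeeping that ChromaDB performs for you.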

Quiz

What is the key limitation of FAISS compared to ChromaDB for RAG application development?
✗ FAISS does not support GPU acceleration
✓ FAISS is a low-level index library: it stores only vectors with no built-in metadata filtering, text storage, persistence, or CRUD operations, requiring you to build all of those yourself
✗ FAISS cannot handle more than 1 million vectors
✗ FAISS does not support cosine similarity

In what scenario would you choose FAISS over ChromaDB?
✗ When you need metadata filtering on query results
✓ When building a high-throughput research system with tens of millions of vectors that needs maximum raw performance, GPU acceleration, and fine-grained control over index type
✗ When building a simple RAG chatbot
✗ When you need persistent storage across restarts

30. What are common ChromaDB errors and how do you handle them in production code?

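ChromaDB raises specific exception types that should be caught and handled gracefully in production applications. Understanding the error hierarchy helps you write resilient ingestion pipelines and retrieval code. Exception class names and the exact conditions that raise them have changed between ChromaDB releases, so confirm the names below against chromadb.errors in the version you deploy.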

import chromadb
from chromadb.errors import (
    InvalidCollectionException,
    IDAlreadyExistsError,
    InvalidDimensionException,
)

client = chromadb.PersistentClient(path="./error_demo")

# --- Error 1: Collection not found ---
try:
    col = client.get_collection("does_not_exist")
except InvalidCollectionException as e:
    print(f"Collection missing: {e}")
    col = client.create_collection("does_not_exist")  # create it

# --- Error 2: Duplicate ID ---
col.add(documents=["Original doc"], ids=["doc-1"])
try:
    col.add(documents=["Duplicate doc"], ids=["doc-1"])
except IDAlreadyExistsError:
    print("ID already exists: use upsert() instead")
    col.upsert(documents=["Updated doc"], ids=["doc-1"])  # safe

# --- Error 3: Dimension mismatch ---
# Occurs when pre-computed embeddings don't match collection's embedding dimensions
col2 = client.create_collection("fixed_dim")
col2.add(embeddings=[[0.1, 0.2, 0.3]], documents=["Doc"], ids=["x"])
try:
    col2.add(embeddings=[[0.1, 0.2]], documents=["Wrong dim"], ids=["y"])  # 2-dim
except InvalidDimensionException as e:
    print(f"Dimension mismatch: {e}")

# --- Error 4: Connection error (HttpClient) ---
try:
    remote = chromadb.HttpClient(host="bad-host", port=9999)
    remote.heartbeat()
except Exception as e:
    print(f"Server unreachable: {e}")

# --- Production pattern: retry wrapper ---
import time
from functools import wraps

def with_retry(max_attempts=3, delay=1.0):
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return fn(*args, **kwargs)
                except Exception as e:
                    if attempt == max_attempts - 1:
                        raise
                    print(f"Attempt {attempt+1} failed: {e}. Retrying...")
                    time.sleep(delay * (attempt + 1))
        return wrapper
    return decorator

@with_retry(max_attempts=3)
def safe_add(collection, documents, ids):
    collection.upsert(documents=documents, ids=ids)

Common ChromaDB exceptions
Exception	Cause	Fix
InvalidCollectionException	get_collection() on non-existent name	Use get_or_create_collection()
IDAlreadyExistsError	add() with duplicate IDs	Use upsert() for idempotent writes
InvalidDimensionException	Pre-computed embeddings wrong size	Match dimensions to collection's model
ValueError	Empty IDs, bad metadata types	Validate inputs before calling ChromaDB
ConnectionError / requests exception	HttpClient cannot reach server	Check server health, retry with backoff
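The "validate inputs before calling ChromaDB" fix from the table can be sketched as a pre-flight check on each batch. `validate_batch` below is a hypothetical helper; the specific rules it enforces (non-empty string IDs, matching list lengths, no duplicates within the batch) are assumptions about what a pipeline would want to catch early, not a statement of ChromaDB's own validation logic.

```python
from collections import Counter
from typing import Optional


def validate_batch(ids: list, documents: list, metadatas: Optional[list] = None) -> list[str]:
    """Return a list of problems with the batch; an empty list means it looks safe to send."""
    errors = []
    if not ids:
        errors.append("ids must not be empty")
    if len(documents) != len(ids):
        errors.append(f"{len(ids)} ids but {len(documents)} documents")
    if metadatas is not None and len(metadatas) != len(ids):
        errors.append(f"{len(ids)} ids but {len(metadatas)} metadatas")
    for doc_id in ids:
        if not isinstance(doc_id, str) or not doc_id:
            errors.append(f"id must be a non-empty string, got {doc_id!r}")
    # Duplicates inside one batch are a bug regardless of what is already stored
    duplicates = sorted(i for i, n in Counter(map(str, ids)).items() if n > 1)
    if duplicates:
        errors.append(f"duplicate ids in batch: {duplicates}")
    return errors


problems = validate_batch(["a", "a", ""], ["x", "y"])
clean = validate_batch(["a", "b"], ["x", "y"], [{"k": 1}, {"k": 2}])
print(problems)
print(clean)
```

Returning all problems at once, rather than raising on the first, makes ingestion logs more useful when a batch has several issues.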

Quiz

What is the correct fix when ChromaDB raises IDAlreadyExistsError during an add() call?
✗ Delete the collection and recreate it
✓ Switch to upsert(), which safely inserts new IDs and updates existing ones without raising errors
✗ Filter out the duplicate IDs before calling add()
✗ Use a transaction to roll back and retry

Which exception does client.get_collection('nonexistent') raise?
✗ FileNotFoundError
✗ KeyError
✓ InvalidCollectionException
✗ chromadb.CollectionMissingError

31. How do you back up and restore a ChromaDB persistent database?

A PersistentClient database is simply a directory on disk. Backing it up is as straightforward as copying that directory — but you must ensure no writes are occurring during the copy to avoid a corrupted SQLite file.

import chromadb
import shutil
import os
from datetime import datetime

DB_PATH    = "./my_chroma_db"
BACKUP_DIR = "./backups"

# --- Backup strategy 1: Simple directory copy ---
# SAFE when: no active PersistentClient writes during the copy
os.makedirs(BACKUP_DIR, exist_ok=True)
timestamp   = datetime.now().strftime("%Y%m%d_%H%M%S")
backup_path = os.path.join(BACKUP_DIR, f"chroma_backup_{timestamp}")
shutil.copytree(DB_PATH, backup_path)
print(f"Backup saved to {backup_path}")

# --- Backup strategy 2: SQLite online backup (safe during reads) ---
import sqlite3

def backup_sqlite(db_path: str, backup_path: str):
    """SQLite online backup: safe even with active readers."""
    os.makedirs(backup_path, exist_ok=True)  # must exist before opening the destination file
    src = sqlite3.connect(os.path.join(db_path, "chroma.sqlite3"))
    dst = sqlite3.connect(os.path.join(backup_path, "chroma.sqlite3"))
    with dst:
        # The progress callback receives (status, remaining, total) page counts
        src.backup(dst, pages=100,
                   progress=lambda status, remaining, total:
                       print(f"Copied {total - remaining} of {total} pages"))
    dst.close()
    src.close()
    # Also copy the HNSW index binary files
    for root, dirs, files in os.walk(db_path):
        for f in files:
            if f != "chroma.sqlite3":
                rel = os.path.relpath(root, db_path)
                dest_dir = os.path.join(backup_path, rel)
                os.makedirs(dest_dir, exist_ok=True)
                shutil.copy2(os.path.join(root, f), os.path.join(dest_dir, f))

# --- Restore ---
def restore_backup(backup_path: str, restore_path: str):
    if os.path.exists(restore_path):
        shutil.rmtree(restore_path)  # remove current
    shutil.copytree(backup_path, restore_path)
    print(f"Restored from {backup_path} to {restore_path}")

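# Verify restored database (restore into a separate example directory, then open it)
restore_path = "./my_chroma_db_restored"  # example target path, chosen for illustration
restore_backup(backup_path, restore_path)
client = chromadb.PersistentClient(path=restore_path)
for col in client.list_collections():
    # list_collections() returns Collection objects in some releases and plain names in others
    name = col if isinstance(col, str) else col.name
    print(f"  {name}: {client.get_collection(name).count()} documents")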

For HttpClient / server mode: stop the ChromaDB server before copying the data directory, or use SQLite's online backup API for chroma.sqlite3 and copy the HNSW index files alongside it while writes are paused. Never copy a SQLite file while it has active writers, because this can produce a corrupted backup.
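The online-backup step can be tried in isolation with the standard library. The sketch below uses a throwaway stand-in database in a temporary directory (not a real chroma.sqlite3) and adds a `PRAGMA integrity_check` on the copy; `backup_and_verify` is a hypothetical helper, and it covers only the SQLite file, not the HNSW index files.

```python
import os
import sqlite3
import tempfile


def backup_and_verify(src_file: str, dst_file: str) -> str:
    """Copy a SQLite database with the online backup API and return the integrity status."""
    os.makedirs(os.path.dirname(dst_file), exist_ok=True)
    src = sqlite3.connect(src_file)
    dst = sqlite3.connect(dst_file)
    try:
        src.backup(dst)  # consistent snapshot even if readers are active
        return dst.execute("PRAGMA integrity_check").fetchone()[0]
    finally:
        dst.close()
        src.close()


with tempfile.TemporaryDirectory() as tmp:
    # Build a small stand-in database to back up
    src_file = os.path.join(tmp, "live", "demo.sqlite3")
    os.makedirs(os.path.dirname(src_file))
    conn = sqlite3.connect(src_file)
    conn.execute("CREATE TABLE docs (id TEXT PRIMARY KEY, body TEXT)")
    conn.execute("INSERT INTO docs VALUES ('d1', 'hello')")
    conn.commit()
    conn.close()

    dst_file = os.path.join(tmp, "backup", "demo.sqlite3")
    status = backup_and_verify(src_file, dst_file)

    # Confirm the copy contains the row written above
    copy = sqlite3.connect(dst_file)
    row_count = copy.execute("SELECT COUNT(*) FROM docs").fetchone()[0]
    copy.close()

print(status, row_count)
```

Running the integrity check on every backup is cheap insurance: a backup that fails it should be discarded rather than kept as a restore candidate.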

Quiz

What files make up a ChromaDB PersistentClient database that must be backed up together?
✗ Only the chroma.sqlite3 file
✓ The chroma.sqlite3 metadata file AND the binary HNSW index files in per-collection subdirectories: both are required to restore a working database
✗ Only the HNSW binary index files
✗ A single backup.bin file that ChromaDB generates automatically

Why is it unsafe to copy a ChromaDB PersistentClient directory while the client has active write operations?
✗ ChromaDB uses file locks that prevent copying
✓ SQLite files copied mid-write may be in an inconsistent state, producing a corrupted database that appears valid but returns wrong results or crashes on open
✗ The copy will always fail with a PermissionError
✗ ChromaDB encrypts the files during writes

32. How do you ensure the correct embedding function is used when reopening a persistent ChromaDB collection?

ChromaDB stores document text and vectors persistently, but it does not store which embedding function was used. When you reopen a PersistentClient, you must re-supply the same embedding function to the collection — otherwise ChromaDB may default to a different model, producing embedding mismatches.

import chromadb
from chromadb.utils import embedding_functions
import os

DB_PATH = "./persistent_ef_demo"

# === SESSION 1: Create and populate collection ===
client1 = chromadb.PersistentClient(path=DB_PATH)
ef_openai = embedding_functions.OpenAIEmbeddingFunction(
    api_key=os.environ["OPENAI_API_KEY"],
    model_name="text-embedding-3-small",
)
col1 = client1.get_or_create_collection(
    name="my_docs",
    embedding_function=ef_openai,                   # set the EF
    metadata={"hnsw:space": "cosine",
              "embedding_model": "text-embedding-3-small"},  # document it
)
col1.add(documents=["ChromaDB is great"], ids=["d1"])
print("Session 1 done, process exits...")
del client1, col1

# === SESSION 2: Reopen (MUST re-supply the same embedding function) ===
client2 = chromadb.PersistentClient(path=DB_PATH)

# WRONG: ChromaDB defaults to all-MiniLM-L6-v2 (384-dim)
# Querying with a different model produces wrong results!
# col_wrong = client2.get_collection("my_docs")  # DO NOT DO THIS

# CORRECT: Re-supply the exact same embedding function
ef_openai_v2 = embedding_functions.OpenAIEmbeddingFunction(
    api_key=os.environ["OPENAI_API_KEY"],
    model_name="text-embedding-3-small",  # must match session 1
)
col2 = client2.get_collection(
    name="my_docs",
    embedding_function=ef_openai_v2,  # required!
)
results = col2.query(query_texts=["vector databases"], n_results=1)
print(results["documents"])  # correct result

# TIP: Read model name from collection metadata to avoid hardcoding
saved_model = col2.metadata.get("embedding_model", "all-MiniLM-L6-v2")
print(f"Using model: {saved_model}")

EF persistence gotchas
Scenario	Problem	Solution
Reopen collection without EF	Defaults to all-MiniLM-L6-v2, mismatches stored vectors	Always pass embedding_function= on get_collection()
Upgrade embedding model	Old vectors incompatible with new model	Create new collection, re-embed all docs, migrate
Team member uses different EF	Silent quality degradation	Store model name in collection metadata, document in README
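One way to avoid hardcoding the embedding function at every call site is a small registry keyed by the model name stored in collection metadata. This is a sketch, not a ChromaDB feature: `register_ef` and `ef_for_collection` are hypothetical helpers, and the factories below return placeholder strings where real code would construct the matching embedding function object.

```python
from typing import Callable, Optional

# Maps a stored model name to a zero-argument factory for its embedding function
EF_FACTORIES: dict[str, Callable[[], object]] = {}


def register_ef(model_name: str):
    """Decorator that registers a factory under the given model name."""
    def decorator(factory: Callable[[], object]):
        EF_FACTORIES[model_name] = factory
        return factory
    return decorator


def ef_for_collection(metadata: Optional[dict]) -> object:
    """Look up the embedding function recorded in collection metadata."""
    model = (metadata or {}).get("embedding_model")
    if model is None:
        raise LookupError("collection metadata does not record 'embedding_model'")
    if model not in EF_FACTORIES:
        raise LookupError(f"no embedding function registered for {model!r}")
    return EF_FACTORIES[model]()


# Placeholder factories; real ones would build the actual embedding function
@register_ef("text-embedding-3-small")
def _openai_small():
    return "openai:text-embedding-3-small"

@register_ef("all-MiniLM-L6-v2")
def _minilm():
    return "sentence-transformers:all-MiniLM-L6-v2"

chosen = ef_for_collection({"hnsw:space": "cosine", "embedding_model": "text-embedding-3-small"})
print(chosen)
```

A possible flow is to open the collection once to read its metadata, resolve the embedding function through the registry, and then reopen the collection with `embedding_function=` set. Failing loudly on an unknown model name is deliberate, since a silent fallback is exactly the mismatch the table warns about.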

Quiz

What does ChromaDB use as the embedding function when you call get_collection() without specifying one?
✗ The last embedding function used for that collection
✓ It falls back to the default all-MiniLM-L6-v2 Sentence Transformer model, which may differ from what was used during indexing
✗ It raises an error requiring you to specify a function
✗ It reads the embedding function from the collection's metadata automatically

What is the safest pattern for ensuring embedding function consistency across application restarts?
✗ Rely on ChromaDB to remember the embedding function automatically
✓ Store the model name in the collection's metadata at creation, read it back on restart, and instantiate the same embedding function before calling get_collection()
✗ Always recreate the collection from scratch on startup
✗ Use only the default embedding function so there is no mismatch risk

33. How do you interpret ChromaDB query distances and convert them into meaningful relevance scores?

ChromaDB query results include a distances field. The interpretation depends on the distance metric. Raw distances are not directly comparable across metrics, but they can be normalised into a [0, 1] relevance score for display or thresholding.

import chromadb

client = chromadb.Client()
col = client.create_collection("relevance_demo", metadata={"hnsw:space": "cosine"})
col.add(
    documents=[
        "ChromaDB is an open-source vector database",
        "Python is a popular programming language",
        "The Eiffel Tower is in Paris France",
    ],
    ids=["d1","d2","d3"],
)

results = col.query(
    query_texts=["vector database for AI"],
    n_results=3,
    include=["documents","distances"],
)

raw_distances = results["distances"][0]
print("Raw cosine distances:", raw_distances)
# e.g. [0.18, 0.72, 1.31]
# cosine distance: 0 = identical, 2 = completely opposite

# Convert cosine distance to similarity score [0, 1]
def cosine_distance_to_score(distance: float) -> float:
    """cosine distance [0,2] -> relevance score [0,1]"""
    return 1 - (distance / 2)

for doc, dist in zip(results["documents"][0], raw_distances):
    score = cosine_distance_to_score(dist)
    print(f"  Score: {score:.3f} | {doc[:50]}")
# Score: 0.910 | ChromaDB is an open-source vector database
# Score: 0.640 | Python is a popular programming language
# Score: 0.345 | The Eiffel Tower is in Paris France

# Threshold: only return results above minimum relevance
MIN_SCORE = 0.7
filtered = [
    (doc, cosine_distance_to_score(dist))
    for doc, dist in zip(results["documents"][0], raw_distances)
    if cosine_distance_to_score(dist) >= MIN_SCORE
]
print(f"\nResults above {MIN_SCORE} threshold: {len(filtered)}")
for doc, score in filtered:
    print(f"  {score:.3f}: {doc}")

Distance metric interpretation
Metric	Range	Most similar	Conversion to [0,1] score
cosine	0 to 2	0 (same direction)	score = 1 - distance/2
l2 (squared Euclidean)	0 to ∞	0 (identical)	score = 1 / (1 + distance)
ip (inner product)	0 to 2 for unit-normalised vectors; unbounded otherwise	Smallest value (distance = 1 - dot product)	score = 1 - distance/2 (unit-normalised vectors only)
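
To see where the 0 to 2 range for cosine distance comes from, it can help to compute it by hand for three extreme cases. This is a minimal pure-Python sketch that does not call ChromaDB; `cosine_distance` and `to_score` are illustrative names, and the two-dimensional vectors are made up for the demonstration.

```python
import math


def cosine_distance(a, b):
    """Cosine distance = 1 - cosine similarity, so it lies in [0, 2]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def to_score(distance):
    """Map a cosine distance in [0, 2] onto a relevance score in [0, 1]."""
    return 1.0 - distance / 2.0


query = [1.0, 0.0]
cases = {
    "same direction": [2.0, 0.0],   # length differs, direction does not
    "orthogonal": [0.0, 3.0],
    "opposite": [-1.0, 0.0],
}
for label, vec in cases.items():
    d = cosine_distance(query, vec)
    print(f"{label:15s} distance={d:.2f} score={to_score(d):.2f}")
# distances 0.00, 1.00, 2.00 map to scores 1.00, 0.50, 0.00
```

Note that the first case has distance 0 even though the vectors have different lengths: cosine distance compares direction only.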

Quiz

For cosine distance in ChromaDB, what does a distance value of 0 indicate?

✗ The vectors are completely unrelated
✓ The query and document vectors are identical (maximum similarity)
✗ The document was not found
✗ The relevance score is 0%

How do you convert a ChromaDB cosine distance value of 0.4 to a relevance score on a 0–1 scale?

✗ score = 1 - 0.4 = 0.6
✓ score = 1 - (0.4 / 2) = 0.8
✗ score = 0.4 / 2 = 0.2
✗ score = cos(0.4) = 0.921

34. What are ChromaDB's practical size limits and performance characteristics at scale?

ChromaDB does not impose hard document count limits, but practical performance degrades at different thresholds depending on storage mode, hardware, and HNSW configuration. Understanding these helps you plan capacity and know when to consider alternatives.

ChromaDB scale guidelines
Collection size	Storage mode	Typical behaviour
< 100K docs	PersistentClient or HttpClient	Excellent — sub-10ms query latency
100K – 1M docs	HttpClient (server mode)	Good — 10–100ms queries with default settings
1M – 10M docs	HttpClient + HNSW tuning	Acceptable — tune hnsw:M and hnsw:search_ef
> 10M docs	Consider FAISS or Weaviate	ChromaDB may struggle — these are better at extreme scale

import chromadb
import time

client = chromadb.Client()
col = client.create_collection(
    "scale_test",
    metadata={
        "hnsw:space":           "cosine",
        "hnsw:construction_ef": 200,   # higher quality index
        "hnsw:search_ef":       100,   # higher recall at query time
        "hnsw:M":               32,    # more connections per node
    },
)

# Batch insert 50,000 documents
BATCH = 500
for i in range(0, 50_000, BATCH):
    col.add(
        documents=[f"Document about topic {j % 100}" for j in range(i, i+BATCH)],
        ids=[str(j) for j in range(i, i+BATCH)],
    )

print(f"Collection has {col.count()} documents")

# Measure query latency
start = time.perf_counter()
results = col.query(query_texts=["topic 42"], n_results=10)
elapsed = time.perf_counter() - start
print(f"Query latency: {elapsed*1000:.1f}ms")

# Memory footprint estimate:
# 384-dim float32 vectors: 384 * 4 bytes = 1.5 KB per doc
# 50K docs * 1.5 KB = ~75 MB just for vectors
# HNSW graph adds ~20-30% overhead -> ~100 MB total for 50K docs

Memory rule of thumb: each 384-dim vector requires ~1.5 KB. A 1M document collection with 384-dim embeddings needs ~1.5 GB just for vectors, plus HNSW graph overhead (~25%). Plan memory accordingly when deploying the ChromaDB server.
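
The rule of thumb above is simple arithmetic, so it can be wrapped in a small estimator for capacity planning. This is a rough sketch: `estimate_memory_bytes` is a hypothetical helper, the 25% HNSW overhead is the approximate figure quoted above rather than a measured value, and the estimate ignores the stored documents and metadata.

```python
def estimate_memory_bytes(n_docs, dim=384, bytes_per_value=4, hnsw_overhead=0.25):
    """Return (vector_bytes, total_bytes) for n_docs float32 embeddings.

    hnsw_overhead is an assumed fraction added on top of the raw vectors.
    """
    vector_bytes = n_docs * dim * bytes_per_value
    total_bytes = int(vector_bytes * (1 + hnsw_overhead))
    return vector_bytes, total_bytes


for n in (50_000, 1_000_000, 10_000_000):
    vec, total = estimate_memory_bytes(n)
    print(f"{n:>10,} docs: vectors {vec / 1e9:.2f} GB, with index {total / 1e9:.2f} GB")
```

For 1M documents this gives about 1.54 GB of raw vectors and about 1.92 GB including the assumed index overhead. Larger embedding models scale the figure linearly: pass `dim=1536` for a 1536-dimensional model.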

Quiz

What is the approximate memory footprint per document for 384-dimensional float32 embeddings in ChromaDB?

✗ 384 bytes
✓ ~1.5 KB (384 dimensions × 4 bytes per float32)
✗ ~15 KB
✗ ~150 bytes

At what approximate collection size does ChromaDB start to show performance degradation without tuning?

✗ 1,000 documents
✗ 10,000 documents
✓ Beyond 1 million documents with default HNSW settings
✗ 50 documents (ChromaDB is optimised only for small collections)

35. How do you use ChromaDB to detect and remove near-duplicate or semantically similar documents?

ChromaDB's similarity search makes it straightforward to detect semantic duplicates — documents that express the same idea with different wording. Before inserting a new document, query ChromaDB to see if a highly similar document already exists and decide whether to skip or replace it.

import chromadb

client = chromadb.Client()
col = client.create_collection(
    "dedup_store",
    metadata={"hnsw:space": "cosine"},
)

# Similarity threshold: tune based on your use case
# A relevance score >= 0.95 means cosine distance <= 0.1 (cosine similarity >= 0.90)
DUPLICATE_THRESHOLD = 0.95  # at or above this score, treat as duplicate

def cosine_dist_to_score(d: float) -> float:
    return 1 - d / 2

def add_if_unique(
    collection,
    document: str,
    doc_id: str,
    metadata: dict = None,
    threshold: float = DUPLICATE_THRESHOLD,
) -> bool:
    """Returns True if document was added, False if it was a duplicate."""
    # Some ChromaDB versions reject an empty metadata dict, so pass None instead
    metadatas = [metadata] if metadata else None

    if collection.count() > 0:
        # Query for the nearest existing document
        results = collection.query(
            query_texts=[document],
            n_results=1,
            include=["documents", "distances"],
        )
        nearest_dist  = results["distances"][0][0]
        nearest_score = cosine_dist_to_score(nearest_dist)
        nearest_doc   = results["documents"][0][0]

        if nearest_score >= threshold:
            print(f"DUPLICATE detected (score={nearest_score:.3f}):")
            print(f"  New:      {document[:60]}")
            print(f"  Existing: {nearest_doc[:60]}")
            return False  # skip insertion

    collection.add(documents=[document], ids=[doc_id], metadatas=metadatas)
    return True

# Test deduplication
phrases = [
    ("ChromaDB is a vector database for AI apps.", "p1"),
    ("Chroma DB is a vector store built for AI applications.", "p2"),  # near-dup of p1
    ("Python is great for machine learning.", "p3"),
]
for text, pid in phrases:
    added = add_if_unique(col, text, pid)
    print(f"Added: {added} | {text[:40]}")

# 2 if p2 is flagged as a duplicate of p1; the outcome depends on the embedding model and threshold
print(f"\nFinal collection size: {col.count()}")

Use cases: deduplication during web scraping, preventing duplicate knowledge base entries in RAG systems, clustering similar customer support tickets, and identifying near-identical product descriptions in e-commerce catalogues.
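
Because the threshold in the example is applied to the rescaled score (1 - distance/2) rather than to raw cosine similarity (1 - distance), it is worth checking what a given threshold means in each unit before tuning it. The helpers below are a small illustrative sketch with hypothetical names; they only invert the formula used by cosine_dist_to_score.

```python
def score_to_cosine_distance(score):
    """Invert score = 1 - distance / 2."""
    return 2.0 * (1.0 - score)


def score_to_cosine_similarity(score):
    """Cosine similarity = 1 - cosine distance."""
    return 1.0 - score_to_cosine_distance(score)


for s in (0.90, 0.95, 0.99):
    print(f"score >= {s:.2f}  <=>  distance <= {score_to_cosine_distance(s):.2f}"
          f"  <=>  cosine similarity >= {score_to_cosine_similarity(s):.2f}")
```

A score threshold of 0.95 therefore corresponds to a cosine similarity of about 0.90, which is noticeably more permissive than a raw cosine similarity cut-off of 0.95.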

Quiz

What cosine similarity score range would you typically use to classify two documents as near-duplicates?

✗ Score > 0.3 (any similarity is a duplicate)
✓ Score >= 0.90–0.95 (very high similarity indicates the same idea expressed differently)
✗ Score == 1.0 only (exact duplicates)
✗ Score > 0.5 (more similar than random)

What ChromaDB operation is at the core of semantic deduplication before inserting a new document?

✗ collection.get() to check if the exact ID exists
✓ collection.query() with n_results=1 to find the nearest existing document and compare its similarity score to a threshold
✗ collection.count() to check the total number of documents
✗ collection.peek() to sample existing documents

36. How do you reset or clear a ChromaDB collection without deleting and recreating it?

ChromaDB does not have a direct clear() or truncate() method. The idiomatic way to reset a collection is to delete it and recreate it with the same parameters. For selective deletion, use delete() with ID lists or where filters.

import chromadb

client = chromadb.PersistentClient(path="./reset_demo")

# Setup
col = client.get_or_create_collection(
    "my_col",
    metadata={"hnsw:space": "cosine", "version": "1"},
)
col.add(
    documents=[f"Document {i}" for i in range(100)],
    ids=[str(i) for i in range(100)],
    metadatas=[{"batch": i // 10} for i in range(100)],
)
print(col.count())  # 100

# --- Option 1: Reset (delete all + recreate) ---
def reset_collection(client, name: str, metadata: dict = None):
    """Delete and recreate a collection, preserving its configuration."""
    saved_meta = {}
    try:
        saved_meta = client.get_collection(name).metadata or {}
    except Exception:
        pass
    client.delete_collection(name)
    return client.create_collection(
        name=name,
        metadata=metadata or saved_meta,
    )

col = reset_collection(client, "my_col")
print(col.count())  # 0

# Re-add fresh data after reset
col.add(documents=["Fresh start"], ids=["new-1"])

# --- Option 2: Selective delete by filter ---
col2 = client.get_or_create_collection("selective")
col2.add(
    documents=[f"Doc {i}" for i in range(20)],
    ids=[str(i) for i in range(20)],
    metadatas=[{"batch": i // 5} for i in range(20)],
)

# Delete only batch 0 (documents 0-4)
col2.delete(where={"batch": 0})
print(col2.count())  # 15 remaining

# Delete specific IDs
col2.delete(ids=["5","6","7"])
print(col2.count())  # 12 remaining

# Delete ALL via get + delete (when no useful metadata filter exists)
all_ids = col2.get(include=[])["ids"]  # get all IDs
if all_ids:
    col2.delete(ids=all_ids)
print(col2.count())  # 0

Collection reset options
Method	When to use	Preserves schema?
delete_collection + create_collection	Full reset — cleanest approach	Yes (manual)
delete(where={...})	Selective clear by metadata condition	Yes
delete(ids=[...])	Remove specific known documents	Yes
get all IDs then delete	Clear all without metadata	Yes
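
For the "get all IDs then delete" option, a very large collection produces a very large ID list, so it can be safer to delete in bounded batches. The sketch below assumes only that the collection object offers `get(include=[])` and `delete(ids=...)` as used above; `delete_all_in_batches`, `chunked`, and the batch size of 1000 are illustrative choices, and `FakeCollection` is a stand-in so the example runs without a database.

```python
from typing import Iterator, List


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, each with at most `size` entries."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def delete_all_in_batches(collection, batch_size: int = 1000) -> int:
    """Delete every record from `collection`, `batch_size` IDs per call."""
    all_ids = collection.get(include=[])["ids"]
    for batch in chunked(all_ids, batch_size):
        collection.delete(ids=batch)
    return len(all_ids)


class FakeCollection:
    """Stand-in exposing only get/delete/count (not a ChromaDB class)."""

    def __init__(self, ids):
        self._ids = list(ids)
        self.delete_calls = 0

    def get(self, include=None):
        return {"ids": list(self._ids)}

    def delete(self, ids):
        gone = set(ids)
        self._ids = [i for i in self._ids if i not in gone]
        self.delete_calls += 1

    def count(self):
        return len(self._ids)


fake = FakeCollection(str(i) for i in range(2500))
removed = delete_all_in_batches(fake, batch_size=1000)
print(removed, fake.count(), fake.delete_calls)  # 2500 0 3
```

With a real collection, the call would be `delete_all_in_batches(col2)`; pick a batch size that fits whatever limit your deployment enforces.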

Quiz

Why does ChromaDB not have a built-in clear() or truncate() method?

✗ It is a planned feature not yet implemented
✓ ChromaDB's HNSW index does not support efficient bulk deletion, so the idiomatic approach is to delete and recreate the collection, which rebuilds a fresh index
✗ Clearing would require a paid enterprise license
✗ It is not possible to delete documents once added

What is the most efficient way to delete all documents matching a metadata condition from a ChromaDB collection?

✗ Iterate through documents and call delete(ids=[id]) one at a time
✓ Use collection.delete(where={'field': 'value'}) to delete all matching documents in a single call
✗ Export to a file, filter, then reimport
✗ Rebuild the entire collection from scratch

37. What configuration settings does ChromaDB support and how do you disable telemetry?

By default, ChromaDB sends anonymised usage telemetry to help the development team understand how the product is used. In enterprise or privacy-sensitive environments this should be disabled. ChromaDB also supports several configuration settings via environment variables and the Settings class.

import chromadb
from chromadb.config import Settings
import os

# --- Option 1: Disable telemetry via environment variable ---
os.environ["ANONYMIZED_TELEMETRY"] = "False"

# --- Option 2: Disable via Settings class ---
client = chromadb.PersistentClient(
    path="./my_db",
    settings=Settings(
        anonymized_telemetry=False,
        allow_reset=True,           # enables client.reset(), which wipes all data!
    ),
)

# --- Option 3: Disable telemetry for HttpClient ---
client_http = chromadb.HttpClient(
    host="localhost",
    port=8000,
    settings=Settings(anonymized_telemetry=False),
)

# Settings available via Settings class
all_settings = Settings(
    anonymized_telemetry=False,
    allow_reset=False,          # default: False, which prevents an accidental wipe
    # chroma_db_impl="duckdb+parquet",  # legacy v0.3 setting (not used in v0.4+)
)

# allow_reset=True enables client.reset(), which DELETES ALL DATA
# Only use in testing environments!
client_test = chromadb.EphemeralClient(
    settings=Settings(allow_reset=True)
)
client_test.create_collection("temp")
client_test.reset()  # wipes everything; use in test fixtures
print(client_test.list_collections())  # []

Key ChromaDB settings
Setting	Default	Notes
anonymized_telemetry	True	Set False in production for privacy
allow_reset	False	Set True only in test environments — reset() wipes all data
ANONYMIZED_TELEMETRY env var	True	Environment variable alternative to Settings class

Quiz

How do you disable ChromaDB's anonymised telemetry in a production application?

✗ Call chromadb.disable_telemetry() at startup
✓ Set the ANONYMIZED_TELEMETRY=False environment variable, or pass settings=Settings(anonymized_telemetry=False) to the client constructor
✗ Telemetry cannot be disabled in the open-source version
✗ Use chromadb.PersistentClient(telemetry=False)

What does allow_reset=True in ChromaDB Settings enable and why is it dangerous?

✗ It enables automatic schema migrations on version upgrade
✓ It enables client.reset(), which permanently deletes all collections and data in the database; it is only safe in test environments where data loss is acceptable
✗ It allows the collection schema to change after documents are inserted
✗ It permits resetting individual document embeddings

38. What is a production readiness checklist for a ChromaDB-based application?

Moving a ChromaDB application from prototype to production involves several architectural decisions around storage, concurrency, reliability, and observability. This checklist covers the key concerns.

ChromaDB production checklist
Area	Recommendation
Storage mode	Use HttpClient connecting to a ChromaDB server — not PersistentClient in multi-process apps
Embedding consistency	Store embedding model name in collection metadata; always re-supply EF on get_collection()
Distance metric	Set hnsw:space='cosine' at collection creation for text; cannot change later
Backups	Schedule regular directory snapshots or SQLite online backups; test restore procedure
Telemetry	Set ANONYMIZED_TELEMETRY=False for privacy
Batching	Insert in batches of 100–500; use upsert() for idempotent pipelines
Error handling	Catch IDAlreadyExistsError, InvalidCollectionException; implement retry logic for HttpClient
HNSW tuning	Increase hnsw:construction_ef to 200 and hnsw:search_ef to 50–100 for large collections
Metadata schema	Use ints for dates/booleans; document schema in collection metadata
Security	Run server behind a reverse proxy with TLS; add auth headers for HttpClient
Monitoring	Log query latency, collection size, and embedding function errors
Scale planning	Plan ~1.5 KB/doc for 384-dim vectors + 25% HNSW overhead; consider alternatives above 10M docs

# Minimal production-ready ChromaDB setup
import chromadb
from chromadb.utils import embedding_functions
from chromadb.config import Settings
import os
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

EMBEDDING_MODEL = "text-embedding-3-small"
COLLECTION_NAME = "prod_knowledge_base"

def create_client():
    return chromadb.HttpClient(
        host=os.environ["CHROMA_HOST"],
        port=int(os.environ.get("CHROMA_PORT", 8000)),
        settings=Settings(anonymized_telemetry=False),
    )

def get_collection(client):
    ef = embedding_functions.OpenAIEmbeddingFunction(
        api_key=os.environ["OPENAI_API_KEY"],
        model_name=EMBEDDING_MODEL,
    )
    return client.get_or_create_collection(
        name=COLLECTION_NAME,
        embedding_function=ef,
        metadata={
            "hnsw:space":           "cosine",
            "hnsw:construction_ef": 200,
            "hnsw:search_ef":       100,
            "embedding_model":      EMBEDDING_MODEL,
        },
    )

client = create_client()
client.heartbeat()   # fail fast if server is unreachable
collection = get_collection(client)
logger.info(f"Connected to collection with {collection.count()} documents")
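
The checklist recommends retry logic for HttpClient calls. One common shape for that is a small retry wrapper with exponential backoff, sketched below in pure Python. `with_retries` is a hypothetical helper, not a ChromaDB API, and the exception types worth retrying depend on the installed client version and its HTTP library, so `retry_on` is an assumption to adjust for your environment.

```python
import time


def with_retries(fn, attempts=3, base_delay=0.5,
                 retry_on=(ConnectionError,), sleep=time.sleep):
    """Call fn(); on a retryable error wait base_delay * 2**n, then try again.

    The last failure is re-raised once `attempts` calls have been made.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))


# Demonstration with a function that fails twice, then succeeds
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("server not reachable yet")
    return "ok"


print(with_retries(flaky, base_delay=0.01))  # ok
```

In the setup above, the startup check could become `with_retries(client.heartbeat)` so that a briefly unavailable server does not abort the process immediately.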

Quiz

Which ChromaDB client should a production multi-service application use?

✗ PersistentClient (it is the most reliable)
✗ EphemeralClient (fastest performance)
✓ HttpClient connecting to a dedicated ChromaDB server process, which supports concurrent access from multiple services
✗ Any client (they are identical in production)

What is the recommended first check after connecting to a production ChromaDB HttpClient?

✗ collection.count() to verify data integrity
✓ client.heartbeat() to confirm the server is reachable before proceeding; fail fast rather than encountering errors mid-request
✗ client.list_collections() to enumerate all data
✗ collection.peek() to sample stored documents

About Javapedia.net: Javapedia.net is for Java and J2EE developers, technologists, and college students preparing for interviews. The site also includes many practical examples. It was developed using J2EE technologies by Steve Antony, a senior developer/lead at a logistics company.
	contact: javatutorials2016[at]gmail[dot]com
Kindly consider donating for maintaining this website. Thanks.
	Copyright © 2026, javapedia.net, all rights reserved. privacy policy.
