Vector database interview questions
1. What is a vector database, and why is it used in modern AI systems?
A vector database stores dense embeddings and indexes them for fast nearest-neighbor search. It is used to retrieve semantically similar items for use cases like semantic search, recommendation, and retrieval-augmented generation (RAG).
2. How does vector similarity search differ from keyword search?
Keyword search matches documents on the literal terms they share with the query, while vector search compares queries and documents by their proximity in embedding space. This lets vector systems find relevant results even when the query wording differs from the document wording.
3. What are embeddings in the context of vector databases?
Embeddings are numerical vectors produced by ML models that encode semantic meaning. Vector databases store these vectors so queries can be matched by distance or similarity instead of exact string matching.
4. Which distance metrics are commonly used in vector databases?
Common metrics include cosine similarity, dot product, and Euclidean distance. The right metric depends on the embedding model and whether vectors are normalized.
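As a rough illustration of how the three metrics can disagree, the sketch below implements them in plain Python on toy 2-D vectors. The vectors and variable names are made up for this example; production systems compute these inside the index, usually on much higher-dimensional data.

```python
import math

def dot(a, b):
    """Dot product: larger means more similar; sensitive to magnitude."""
    return sum(x * y for x, y in zip(a, b))

def euclidean_distance(a, b):
    """Straight-line distance: smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between a and b: ignores magnitude."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy vectors (illustrative assumptions, not real embeddings).
query = [1.0, 0.0]
same_direction = [3.0, 0.0]  # same direction as the query, larger magnitude
nearby = [1.0, 0.5]          # closer in space, slightly different direction

print("cosine:", cosine_similarity(query, same_direction), cosine_similarity(query, nearby))
print("euclidean:", euclidean_distance(query, same_direction), euclidean_distance(query, nearby))
print("dot:", dot(query, same_direction), dot(query, nearby))
```

Here cosine similarity ranks same_direction first while Euclidean distance ranks nearby first, which is why the metric should match how the embedding model was trained.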
5. When should you choose cosine similarity over Euclidean distance?
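Cosine similarity is preferred when vector direction matters more than magnitude, for example when vector norms vary for reasons unrelated to meaning. Euclidean distance can be better when absolute geometric distances carry signal. On L2-normalized embeddings the two produce the same ranking, so the choice matters mainly for unnormalized vectors.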
6. What is Approximate Nearest Neighbor (ANN), and why is it important?
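ANN search uses an index to avoid comparing the query against every stored vector, returning neighbors that are very likely, but not guaranteed, to be the true nearest ones. It trades a small amount of recall for major gains in latency and throughput, which makes large-scale vector retrieval practical for real-time applications.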
7. How does HNSW indexing work at a high level?
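HNSW (Hierarchical Navigable Small World) builds a multi-layer proximity graph in which sparse upper layers provide long-range links and the bottom layer contains every vector. A search enters at the top layer and descends greedily, moving from coarse to fine neighborhoods. It provides strong query performance, with parameters such as graph connectivity (M) and search breadth (ef) trading memory and latency against recall.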
8. What are IVF and PQ in vector indexing?
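IVF (inverted file index) partitions vectors into clusters and scans only the clusters nearest the query, while PQ (product quantization) compresses each vector into a compact code by quantizing its sub-vectors. Together they enable efficient search at very large scale with reduced memory usage.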
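The IVF half of this idea can be sketched in a few lines of plain Python. This is a simplified illustration, not how any particular library implements it: centroids are sampled from the data instead of being trained with k-means, PQ compression is not shown, and the names build_ivf, ivf_search, nlist, and nprobe are chosen for this example.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_ivf(vectors, nlist, seed=0):
    """Assign every vector to the inverted list of its nearest centroid."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, nlist)  # simplification: no k-means training
    lists = [[] for _ in range(nlist)]
    for idx, vec in enumerate(vectors):
        nearest = min(range(nlist), key=lambda c: sq_dist(vec, centroids[c]))
        lists[nearest].append(idx)
    return centroids, lists

def ivf_search(query, vectors, centroids, lists, k, nprobe):
    """Scan only the nprobe lists whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: sq_dist(query, centroids[c]))
    candidates = [i for c in order[:nprobe] for i in lists[c]]
    return sorted(candidates, key=lambda i: sq_dist(query, vectors[i]))[:k]

def exact_search(query, vectors, k):
    """Brute-force baseline that compares the query against every vector."""
    return sorted(range(len(vectors)), key=lambda i: sq_dist(query, vectors[i]))[:k]

# Synthetic data for illustration only.
rng = random.Random(42)
vectors = [[rng.random() for _ in range(8)] for _ in range(500)]
query = [rng.random() for _ in range(8)]

centroids, lists = build_ivf(vectors, nlist=10)
fast = ivf_search(query, vectors, centroids, lists, k=5, nprobe=2)   # scans 2 of 10 lists
full = ivf_search(query, vectors, centroids, lists, k=5, nprobe=10)  # scans every list
exact = exact_search(query, vectors, k=5)
```

Raising nprobe scans more lists, which moves the result toward the exact answer at the cost of more distance computations; with nprobe equal to nlist the search degenerates to a full scan.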
9. How do you evaluate recall and latency in vector search systems?
Recall, often reported as recall@k, measures the fraction of the true nearest neighbors (as found by exact search) that the index returns, and latency measures response time, usually tracked at percentiles such as p95 or p99. You should tune index parameters to meet target recall under production latency budgets.
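A minimal way to compute recall@k is to compare the IDs returned by the approximate index with the IDs from an exact search for the same query. The function names and sample IDs below are assumptions for illustration.

```python
def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the exact top-k that also appears in the approximate top-k."""
    exact_top = set(exact_ids[:k])
    if not exact_top:
        return 0.0
    return len(exact_top & set(approx_ids[:k])) / len(exact_top)

def mean_recall_at_k(approx_results, exact_results, k):
    """Average recall@k over a set of queries."""
    scores = [recall_at_k(a, e, k) for a, e in zip(approx_results, exact_results)]
    return sum(scores) / len(scores)

# Two hypothetical queries: the first misses one true neighbor, the second misses none.
approx_results = [[1, 2, 3, 9], [5, 6, 7, 8]]
exact_results = [[1, 2, 3, 4], [8, 7, 6, 5]]
print(mean_recall_at_k(approx_results, exact_results, k=4))
```

Order within the top-k does not affect this metric; only membership does.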
10. What does top-k mean in vector retrieval?
Top-k is the number of most similar results returned for a query. Choosing k affects downstream quality, context window usage, and cost.
11. How does metadata filtering work with vector search?
Metadata filters constrain candidates by structured attributes such as tenant, language, region, or document type before or during similarity search. This improves relevance and supports access control and multi-tenant isolation.
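The pre-filtering variant can be sketched as follows: restrict the candidate set by metadata first, then rank only the survivors by similarity. The record layout and the filtered_search helper are hypothetical; real engines apply filters inside the index and expose their own filter syntax.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def filtered_search(query, records, k, filters):
    """Keep records whose metadata matches every filter, then rank by dot product."""
    candidates = [
        r for r in records
        if all(r["metadata"].get(field) == value for field, value in filters.items())
    ]
    ranked = sorted(candidates, key=lambda r: dot(query, r["vector"]), reverse=True)
    return [r["id"] for r in ranked[:k]]

# Toy records (illustrative assumptions).
records = [
    {"id": "a", "vector": [1.0, 0.0], "metadata": {"tenant": "acme", "language": "en"}},
    {"id": "b", "vector": [0.9, 0.1], "metadata": {"tenant": "globex", "language": "en"}},
    {"id": "c", "vector": [0.5, 0.5], "metadata": {"tenant": "acme", "language": "en"}},
    {"id": "d", "vector": [0.95, 0.0], "metadata": {"tenant": "acme", "language": "de"}},
]

query = [1.0, 0.0]
print(filtered_search(query, records, k=2, filters={"tenant": "acme", "language": "en"}))
```

Record "b" is similar to the query but never becomes a candidate because it belongs to another tenant, which is the property that access control relies on.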
12. What is hybrid search in vector databases?
Hybrid search combines lexical scoring (like BM25) and vector similarity to balance precision and semantic recall. It is often more robust than pure vector search for mixed-intent queries.
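One common way to merge a lexical ranking and a vector ranking is reciprocal rank fusion (RRF), which combines ranks instead of raw scores. The text above does not prescribe a fusion method, so treat this as one option among several; the document IDs are made up, and k=60 is a commonly used default constant.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked ID lists: each list contributes 1 / (k + rank) per document."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

# Hypothetical outputs of a BM25 query and a vector query.
lexical_ranking = ["d1", "d2", "d3"]
vector_ranking = ["d3", "d1", "d4"]

fused = reciprocal_rank_fusion([lexical_ranking, vector_ranking])
print(fused)
```

Documents that appear in both lists rise to the top, and because only ranks are used there is no need to normalize BM25 scores against similarity scores.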
13. How do rerankers improve vector retrieval pipelines?
Rerankers apply deeper cross-encoder-style relevance scoring to a small retrieved candidate set. They improve final ranking quality, especially when initial ANN retrieval is broad.
14. What is the role of vector databases in RAG architectures?
In RAG, vector databases provide context retrieval from knowledge corpora using semantic similarity. Retrieved passages are passed to the LLM to ground responses and reduce hallucinations.
15. How do chunking strategies affect vector database retrieval quality?
Chunk size and overlap control how much context each vector represents. Poor chunking can hide key facts or add noise, while tuned chunking improves retrieval precision and answerability.
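A minimal sketch of fixed-size chunking with overlap, splitting on whitespace for simplicity. The chunk_words helper and its parameters are assumptions for this example; real pipelines often count model tokens and respect sentence or section boundaries instead.

```python
def chunk_words(text, chunk_size, overlap):
    """Split text into chunks of chunk_size words, each sharing overlap words with the previous one."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks

chunks = chunk_words("a b c d e f g h i j", chunk_size=4, overlap=1)
print(chunks)
```

The overlap repeats boundary words in adjacent chunks, so a fact that straddles a boundary still appears whole in at least one chunk, at the cost of storing more vectors.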
16. Why is embedding model choice critical for vector database performance?
Embedding models define semantic space quality, dimensionality, and domain fit. Better model-task alignment usually improves retrieval relevance more than index-only tuning.
17. How should you handle embedding model upgrades in production?
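Vectors from different embedding models are not comparable, so an upgrade requires re-embedding the corpus instead of mixing old and new vectors in one index. Use dual-write or shadow indexing to re-embed content into a new index while serving from the old one. Validate relevance metrics before cutover and keep rollback paths ready.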
18. What are the trade-offs between managed and self-hosted vector databases?
Managed services reduce operational burden and speed onboarding, while self-hosted deployments can provide deeper control, custom tuning, and stricter data governance.
19. How do you design a schema for documents and vectors?
Define stable document IDs, embedding fields, metadata fields, and version markers for model/chunk revisions. A clear schema supports filtering, reindexing, and auditability.
20. What is upsert behavior in vector databases?
Upsert inserts new vectors or updates existing records with the same ID. Correct ID strategy is essential to avoid duplicates and stale content.
21. How do deletions and tombstones impact vector index maintenance?
Many indexes do not remove a deleted vector immediately; they mark it with a tombstone that is cleaned up during compaction or rebuild operations. Without this lifecycle maintenance, query quality and storage efficiency can degrade as tombstones accumulate.
22. How do you prevent duplicate vectors in ingestion pipelines?
Use deterministic IDs, dedup keys, and idempotent ingestion logic. This prevents multiple representations of the same content from polluting retrieval results.
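A sketch of deterministic IDs combined with upsert semantics, using an in-memory dict as a stand-in for the vector store. The content_id and upsert helpers, the normalization rule, and the sample vectors are assumptions for illustration.

```python
import hashlib

def content_id(source, text):
    """Derive a stable ID from the source and whitespace/case-normalized text."""
    normalized = " ".join(text.split()).lower()
    digest = hashlib.sha256(f"{source}\n{normalized}".encode("utf-8")).hexdigest()
    return digest[:32]

def upsert(store, source, text, vector):
    """Insert a new record or overwrite the existing one with the same ID."""
    record_id = content_id(source, text)
    store[record_id] = {"source": source, "text": text, "vector": vector}
    return record_id

store = {}  # stand-in for a real vector store
first = upsert(store, "faq.md", "Reset your password", [0.1, 0.2])
retry = upsert(store, "faq.md", "reset   your password", [0.1, 0.2])  # replayed ingestion
other = upsert(store, "faq.md", "Delete your account", [0.3, 0.4])
print(len(store))
```

Because the ID depends only on the content, replaying the same ingestion job overwrites records instead of adding duplicates, which keeps the pipeline idempotent.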
23. What are common causes of poor relevance in vector search?
Typical causes include weak embeddings, bad chunking, missing filters, stale content, and overly aggressive ANN settings. Diagnose relevance with query sets and labeled evaluations.
24. How can query rewriting improve vector search outcomes?
Query rewriting can clarify ambiguous intent, add domain context, or expand shorthand terms before embedding. This often improves recall and relevance for short user prompts.
25. What is multi-vector representation for a single document?
Multi-vector approaches store several embeddings per document, such as per section or semantic facet. This can improve match quality for long or heterogeneous content.
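One simple way to query a multi-vector layout is to score every chunk vector and keep the best score per document. This max-score aggregation is an illustrative choice, not the only option, and the document IDs and vectors are made up.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def best_chunk_score_per_document(query, chunks):
    """Score each (doc_id, vector) pair and keep the highest score per document."""
    best = {}
    for doc_id, vector in chunks:
        score = dot(query, vector)
        if doc_id not in best or score > best[doc_id]:
            best[doc_id] = score
    return sorted(best.items(), key=lambda item: item[1], reverse=True)

# doc1 has two sections covering different topics; doc2 has one blended embedding.
chunks = [
    ("doc1", [1.0, 0.0]),
    ("doc1", [0.0, 1.0]),
    ("doc2", [0.6, 0.6]),
]

ranked = best_chunk_score_per_document([0.0, 1.0], chunks)
print(ranked)
```

The second section of doc1 matches the query strongly even though the first does not, which a single averaged embedding for the whole document would tend to blur.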
26. How do sparse and dense vectors complement each other?
Sparse vectors capture exact lexical signals while dense vectors capture semantic meaning. Combining both often yields stronger retrieval performance across diverse query types.
27. What is vector quantization, and when is it used?
Vector quantization compresses embeddings to reduce memory and speed search, often with some accuracy loss. It is useful when serving very large corpora under strict cost constraints.
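The memory-versus-accuracy trade-off can be seen with scalar quantization, one of the simplest compression schemes: each float is mapped to a one-byte integer code plus a shared scale. This is a sketch of scalar quantization only, not product quantization or a codebook-based method, and the function names are assumptions.

```python
from array import array

def quantize(vector):
    """Map each float to a signed 8-bit code using one scale per (non-empty) vector."""
    max_abs = max(abs(x) for x in vector) or 1.0  # avoid dividing by zero for all-zero vectors
    scale = max_abs / 127.0
    codes = array("b", (round(x / scale) for x in vector))
    return codes, scale

def dequantize(codes, scale):
    """Recover an approximation of the original floats."""
    return [c * scale for c in codes]

vector = [0.5, -1.0, 0.25, 0.0]
codes, scale = quantize(vector)
restored = dequantize(codes, scale)
max_error = max(abs(x - y) for x, y in zip(vector, restored))
print(list(codes), max_error)
```

Each code occupies one byte instead of the four bytes of a float32, so storage shrinks roughly fourfold, while the reconstruction error stays within about half a quantization step per component.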
28. How do you choose vector dimensionality for an application?
Dimensionality is usually determined by the embedding model and task. Higher dimensions can capture richer semantics but increase memory, compute, and indexing overhead.
29. How does normalization affect dot-product and cosine search?
L2-normalizing vectors to unit length makes the dot product equal to cosine similarity, so the two metrics rank results identically. Consistent normalization between indexing and querying is critical for predictable relevance.
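The relationship can be checked numerically. The sketch below uses toy vectors chosen for this example and also shows the related identity that, for unit vectors, squared Euclidean distance equals 2 - 2 * cosine.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    """Scale v to unit length."""
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]

def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [3.0, 4.0]
b = [1.0, 2.0]
a_unit, b_unit = l2_normalize(a), l2_normalize(b)

raw_dot = dot(a, b)             # depends on the magnitudes of a and b
unit_dot = dot(a_unit, b_unit)  # matches cosine similarity up to rounding
cosine = cosine_similarity(a, b)
sq_dist = sum((x - y) ** 2 for x, y in zip(a_unit, b_unit))  # 2 - 2 * cosine for unit vectors

print(raw_dot, unit_dot, cosine, sq_dist)
```

Because of this identity, cosine similarity, dot product, and Euclidean distance all produce the same ordering once every vector is unit length.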
30. What operational metrics should you monitor in vector databases?
Monitor query latency, QPS, recall proxies, index build times, memory usage, filter selectivity, and ingestion lag. These metrics help maintain reliability and relevance in production.
31. How do you benchmark vector databases fairly?
Use representative datasets, fixed query sets, explicit recall targets, and consistent hardware settings. Compare both retrieval quality and performance under realistic filters and concurrency.
32. What is multi-tenancy in vector databases, and how is it implemented?
Multi-tenancy lets many tenants share one deployment while keeping their data isolated, typically through namespaces, partitions, or mandatory metadata filters. Strong isolation reduces leakage risk and simplifies governance.
33. How do access control and authorization apply to vector retrieval?
Authorization rules should be enforced at retrieval time using tenant and policy filters. Otherwise, semantically similar but unauthorized content might leak into results.
34. How do you handle fresh content and eventual consistency in vector systems?
Ingestion and indexing pipelines may introduce delay before new vectors become searchable. Design SLAs for freshness and use status checks to avoid serving incomplete updates.
35. What backup and disaster recovery considerations exist for vector databases?
You need backups for raw source documents, metadata, and index snapshots or rebuild pipelines. Recovery plans should define RPO/RTO and validated restore procedures.
36. How do vector databases support recommendation systems?
Recommendations can be generated by nearest-neighbor retrieval over user/item embeddings. Metadata constraints then enforce business rules such as inventory, region, or eligibility.
37. What are common cost drivers in vector database deployments?
Major cost drivers include embedding generation, storage footprint, memory-heavy indexes, and query throughput. Tuning chunking, compression, and caching can materially reduce spend.
38. How do caching layers help vector search workloads?
Result and embedding caches reduce repeated computation for frequent queries. Careful invalidation policies are needed when source content or embeddings change.
39. What is the difference between online and offline indexing strategies?
Online indexing prioritizes freshness with incremental updates, while offline indexing prioritizes throughput with periodic bulk rebuilds. Many systems combine both for balance.
40. How do you test quality regressions after index parameter changes?
Run controlled offline evaluations on labeled query sets and compare recall, NDCG, or task success metrics. Promote changes only when quality and latency remain within accepted thresholds.
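A minimal sketch of such an evaluation gate, using NDCG@k with linear gain. Two simplifications are assumed here: the ideal ranking is computed only from the relevance labels of the returned results, and the max_drop threshold in passes_gate is a hypothetical value to be replaced by a team's own tolerance.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain with linear gain and a log2 rank discount."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the same labels in ideal order."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

def mean_ndcg(runs, k):
    """Average NDCG@k over per-query lists of graded relevance labels."""
    return sum(ndcg_at_k(r, k) for r in runs) / len(runs)

def passes_gate(baseline_score, candidate_score, max_drop=0.01):
    """Allow promotion only if quality drops by no more than max_drop."""
    return baseline_score - candidate_score <= max_drop

# Graded labels for the results of two queries, in returned order (made-up data).
baseline_runs = [[3, 2, 1], [2, 0, 1]]
candidate_runs = [[1, 2, 3], [0, 1, 2]]

baseline = mean_ndcg(baseline_runs, k=3)
candidate = mean_ndcg(candidate_runs, k=3)
print(baseline, candidate, passes_gate(baseline, candidate))
```

In practice the same gate would also check latency against its budget before a parameter change is promoted.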
41. What role do namespaces or collections play in vector databases?
Namespaces and collections organize vectors by domain, tenant, or lifecycle boundary. Proper partitioning simplifies access policies and improves operational control.
42. How can you reduce hallucinations using better vector retrieval?
Improve grounding by tuning chunking, filters, hybrid retrieval, and reranking quality. Returning high-quality context is one of the strongest controls against hallucinated answers.
43. How do you secure sensitive data in vector database pipelines?
Use encryption in transit and at rest, scoped credentials, and redaction/tokenization for sensitive fields before embedding. Security controls should cover ingestion, storage, and query paths.
44. How should teams handle multilingual vector search?
Use multilingual embedding models or language-aware routing and store language metadata for filtering. Evaluate relevance per language to avoid hidden quality gaps.
45. What are best practices for productionizing vector database systems?
Establish ingestion contracts, observability, evaluation gates, security controls, and rollback strategies. Treat retrieval quality as an SLO-backed production concern, not just a prototype feature.
