Database / Vector database interview questions
A vector database stores dense embeddings and indexes them for fast nearest-neighbor search. It is used to retrieve semantically similar items for use cases like semantic search, recommendation, and retrieval-augmented generation (RAG).
Keyword search matches exact terms, while vector search compares semantic meaning in embedding space. This allows vector systems to find relevant results even when query words differ from document wording.
Embeddings are numerical vectors produced by ML models that encode semantic meaning. Vector databases store these vectors so queries can be matched by distance or similarity instead of exact string matching.
Common metrics include cosine similarity, dot product, and Euclidean distance. The right metric depends on the embedding model and whether vectors are normalized.
Cosine similarity is preferred when vector direction matters more than magnitude, especially with normalized embeddings. Euclidean distance can be better when absolute geometric distances carry signal.
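The difference between the three metrics can be seen on a toy pair of vectors; the values below are made up purely for illustration:

```python
# Toy comparison of the three common similarity metrics with NumPy.
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # same direction as a, twice the magnitude

dot = float(np.dot(a, b))
cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))
euclidean = float(np.linalg.norm(a - b))

print(cosine)     # 1.0: identical direction, magnitude ignored
print(dot)        # 28.0: rewards magnitude as well as direction
print(euclidean)  # ~3.74: nonzero because the vectors differ in length
```

Note how cosine treats the two vectors as identical while Euclidean distance does not, which is exactly the distinction the answer above describes.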
ANN techniques trade a small amount of recall for major gains in latency and throughput. This makes large-scale vector retrieval practical for real-time applications.
HNSW builds layered proximity graphs so search can quickly navigate from coarse to fine neighborhoods. It provides strong query performance with tunable memory and recall trade-offs.
IVF partitions vectors into clusters to reduce candidate scans, while PQ compresses vectors into compact codes. Together they enable efficient search at very large scale with reduced memory usage.
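A minimal NumPy sketch of the IVF half of this idea (PQ, which would additionally compress the stored vectors into short codes, is omitted for brevity); corpus size, cluster count, and `nprobe` are illustrative, not production values:

```python
# Minimal IVF sketch: cluster vectors with a tiny k-means, then at query
# time scan only the nprobe clusters nearest the query.
import numpy as np

rng = np.random.default_rng(0)
data = rng.standard_normal((1000, 32)).astype(np.float32)

def kmeans(x, k, iters=10):
    centroids = x[rng.choice(len(x), k, replace=False)]
    for _ in range(iters):
        assign = np.argmin(((x[:, None] - centroids) ** 2).sum(-1), axis=1)
        for c in range(k):
            members = x[assign == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    return centroids

centroids = kmeans(data, k=16)
# Assign each vector to its nearest final centroid (the inverted lists).
assign = np.argmin(((data[:, None] - centroids) ** 2).sum(-1), axis=1)
inverted_lists = {c: np.where(assign == c)[0] for c in range(16)}

def ivf_search(query, nprobe=4, topk=5):
    # Probe only the nprobe clusters whose centroids are nearest the query,
    # then brute-force rank the candidates inside those clusters.
    order = np.argsort(((centroids - query) ** 2).sum(-1))[:nprobe]
    cand = np.concatenate([inverted_lists[c] for c in order])
    dists = ((data[cand] - query) ** 2).sum(-1)
    return cand[np.argsort(dists)[:topk]]

hits = ivf_search(data[0])
print(hits[0])  # 0: the query vector finds itself first
```

With `nprobe=4` of 16 clusters, each query scans roughly a quarter of the corpus, which is the latency/recall trade the answer above describes.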
Recall measures how many true nearest neighbors are returned compared with exact search, and latency measures response time. You should tune index parameters to meet target recall under production latency budgets.
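Recall@k reduces to a set intersection between the ANN result and the exact brute-force result; the ID lists below are illustrative:

```python
# Recall@k for an ANN index: fraction of the exact top-k neighbors
# that the approximate search also returned.
def recall_at_k(approx_ids, exact_ids):
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

exact = [7, 3, 9, 1, 5]            # ground truth from brute-force search
approx = [7, 9, 2, 1, 8]           # what the ANN index returned
print(recall_at_k(approx, exact))  # 0.6: found 3 of the 5 true neighbors
```

Sweeping index parameters (e.g. HNSW's `ef` or IVF's `nprobe`) against this metric under a latency budget is the tuning loop described above.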
Top-k is the number of most similar results returned for a query. Choosing k affects downstream quality, context window usage, and cost.
Metadata filters constrain candidates by structured attributes such as tenant, language, region, or document type before or during similarity search. This improves relevance and supports access control and multi-tenant isolation.
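A pre-filtering sketch of this idea, restricting the candidate set by a tenant attribute before scoring similarity; the vectors and metadata fields are illustrative:

```python
# Pre-filter by metadata, then score similarity only over the survivors.
import numpy as np

vectors = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]])
meta = [{"tenant": "a"}, {"tenant": "b"}, {"tenant": "a"}]

def filtered_search(query, tenant, topk=1):
    mask = np.array([m["tenant"] == tenant for m in meta])
    cand = np.where(mask)[0]                  # indices passing the filter
    sims = vectors[cand] @ query              # dot-product scoring
    return cand[np.argsort(sims)[::-1][:topk]]

hit = filtered_search(np.array([1.0, 0.0]), "a")
print(hit)  # [0]: best match, considering only tenant "a" rows
```

Note that row 1 scores well on similarity but is never considered, which is how filtering doubles as tenant isolation.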
Hybrid search combines lexical scoring (like BM25) and vector similarity to balance precision and semantic recall. It is often more robust than pure vector search for mixed-intent queries.
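One common way to combine the two rankings without calibrating their raw scores is reciprocal rank fusion (RRF); the ranked ID lists and the conventional `k=60` constant below are illustrative:

```python
# Reciprocal rank fusion: each list contributes 1/(k + rank) per document,
# so documents ranked well by both lexical and vector search rise to the top.
def rrf(rankings, k=60):
    scores = {}
    for ranked_ids in rankings:
        for rank, doc_id in enumerate(ranked_ids):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranked = ["d3", "d1", "d7"]    # lexical (BM25) result order
vector_ranked = ["d1", "d3", "d9"]  # semantic result order
fused = rrf([bm25_ranked, vector_ranked])
print(fused)  # d1 and d3, present in both lists, rank above d7 and d9
```

Because RRF uses only ranks, it sidesteps the problem that BM25 scores and cosine similarities live on incomparable scales.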
Rerankers apply deeper cross-encoder-style relevance scoring to a small retrieved candidate set. They improve final ranking quality, especially when initial ANN retrieval is broad.
In RAG, vector databases provide context retrieval from knowledge corpora using semantic similarity. Retrieved passages are passed to the LLM to ground responses and reduce hallucinations.
Chunk size and overlap control how much context each vector represents. Poor chunking can hide key facts or add noise, while tuned chunking improves retrieval precision and answerability.
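A minimal fixed-size chunker with overlap; for simplicity it counts characters, whereas real pipelines usually chunk by model tokens or sentence boundaries:

```python
# Sliding-window chunker: consecutive chunks share `overlap` characters
# so facts near a boundary appear intact in at least one chunk.
def chunk(text, size=200, overlap=50):
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + size]
        if piece:
            chunks.append(piece)
        if start + size >= len(text):
            break
    return chunks

doc = "x" * 450
pieces = chunk(doc)
print([len(p) for p in pieces])  # [200, 200, 150]: neighbors share 50 chars
```

The overlap is the knob that prevents a key sentence straddling a chunk boundary from being lost to retrieval.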
Embedding models define semantic space quality, dimensionality, and domain fit. Better model-task alignment usually improves retrieval relevance more than index-only tuning.
Use dual-write or shadow indexing to re-embed content into a new index while serving from the old one. Validate relevance metrics before cutover and keep rollback paths ready.
Managed services reduce operational burden and speed onboarding, while self-hosted deployments can provide deeper control, custom tuning, and stricter data governance.
Define stable document IDs, embedding fields, metadata fields, and version markers for model/chunk revisions. A clear schema supports filtering, reindexing, and auditability.
Upsert inserts a record when its ID is new and overwrites the existing record when the ID already exists. A correct ID strategy is therefore essential to avoid duplicates and stale content.
Deletes may create tombstones that are cleaned during compaction or rebuild operations. Without lifecycle maintenance, query quality and storage efficiency can degrade.
Use deterministic IDs, dedup keys, and idempotent ingestion logic. This prevents multiple representations of the same content from polluting retrieval results.
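Deterministic content-hash IDs make the upsert path idempotent, as sketched below; the in-memory dict stands in for a vector database's upsert API, and the URI scheme is illustrative:

```python
# Hash source + content into a stable ID so re-ingesting the same chunk
# overwrites one record instead of creating duplicates.
import hashlib

def make_id(source_uri, chunk_text):
    key = f"{source_uri}\x00{chunk_text}".encode("utf-8")
    return hashlib.sha256(key).hexdigest()

index = {}  # id -> record, standing in for the vector store

def upsert(source_uri, chunk_text, embedding):
    doc_id = make_id(source_uri, chunk_text)
    index[doc_id] = {"text": chunk_text, "embedding": embedding}
    return doc_id

upsert("doc://a", "same chunk", [0.1, 0.2])
upsert("doc://a", "same chunk", [0.1, 0.2])  # replays harmlessly
print(len(index))  # 1: the retry did not create a duplicate
```

Including the source URI in the hash keeps identical text from different documents distinct, while retries and replays of the same chunk collapse to one record.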
Typical causes include weak embeddings, bad chunking, missing filters, stale content, and overly aggressive ANN settings. Diagnose relevance with query sets and labeled evaluations.
Query rewriting can clarify ambiguous intent, add domain context, or expand shorthand terms before embedding. This often improves recall and relevance for short user prompts.
Multi-vector approaches store several embeddings per document, such as per section or semantic facet. This can improve match quality for long or heterogeneous content.
Sparse vectors capture exact lexical signals while dense vectors capture semantic meaning. Combining both often yields stronger retrieval performance across diverse query types.
Vector quantization compresses embeddings to reduce memory and speed search, often with some accuracy loss. It is useful when serving very large corpora under strict cost constraints.
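The simplest form is scalar (int8) quantization, sketched below with illustrative values; product quantization is the heavier-duty alternative for larger compression ratios:

```python
# Scalar int8 quantization: map each float32 to one byte, cutting memory
# 4x at the cost of a small, bounded reconstruction error.
import numpy as np

vec = np.array([-0.8, 0.1, 0.55, 0.9], dtype=np.float32)

lo, hi = vec.min(), vec.max()
scale = (hi - lo) / 255.0
codes = np.round((vec - lo) / scale).astype(np.uint8)  # 1 byte per dim
restored = codes.astype(np.float32) * scale + lo       # lossy decode

print(codes.nbytes, vec.nbytes)      # 4 vs 16 bytes
print(np.abs(restored - vec).max())  # error bounded by about scale / 2
```

The per-dimension error is bounded by half the quantization step, which is the accuracy loss the answer above refers to.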
Dimensionality is usually determined by the embedding model and task. Higher dimensions can capture richer semantics but increase memory, compute, and indexing overhead.
For L2-normalized vectors, dot product and cosine similarity produce identical rankings, so an index built for inner product can serve cosine queries. Consistent normalization between indexing and querying is critical for predictable relevance.
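This equivalence is easy to verify numerically; the vectors below are arbitrary examples:

```python
# After L2 normalization, the dot product of the unit vectors equals the
# cosine similarity of the originals.
import numpy as np

a = np.array([3.0, 4.0])
b = np.array([1.0, 2.0])

cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
an, bn = a / np.linalg.norm(a), b / np.linalg.norm(b)

print(np.isclose(np.dot(an, bn), cos))  # True
```

This is why normalizing at index time but forgetting to normalize queries (or vice versa) silently skews relevance.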
Monitor query latency, QPS, recall proxies, index build times, memory usage, filter selectivity, and ingestion lag. These metrics help maintain reliability and relevance in production.
Use representative datasets, fixed query sets, explicit recall targets, and consistent hardware settings. Compare both retrieval quality and performance under realistic filters and concurrency.
Multi-tenancy isolates tenant data through namespaces, partitions, or filtered metadata policies. Strong isolation reduces leakage risk and simplifies governance.
Authorization rules should be enforced at retrieval time using tenant and policy filters. Otherwise, semantically similar but unauthorized content might leak into results.
Ingestion and indexing pipelines may introduce delay before new vectors become searchable. Design SLAs for freshness and use status checks to avoid serving incomplete updates.
You need backups for raw source documents, metadata, and index snapshots or rebuild pipelines. Recovery plans should define RPO/RTO and validated restore procedures.
Recommendations can be generated by nearest-neighbor retrieval over user/item embeddings. Metadata constraints then enforce business rules such as inventory, region, or eligibility.
Major cost drivers include embedding generation, storage footprint, memory-heavy indexes, and query throughput. Tuning chunking, compression, and caching can materially reduce spend.
Result and embedding caches reduce repeated computation for frequent queries. Careful invalidation policies are needed when source content or embeddings change.
Online indexing prioritizes freshness with incremental updates, while offline indexing prioritizes throughput with periodic bulk rebuilds. Many systems combine both for balance.
Run controlled offline evaluations on labeled query sets and compare recall, NDCG, or task success metrics. Promote changes only when quality and latency remain within accepted thresholds.
Namespaces and collections organize vectors by domain, tenant, or lifecycle boundary. Proper partitioning simplifies access policies and improves operational control.
Improve grounding by tuning chunking, filters, hybrid retrieval, and reranking quality. Returning high-quality context is one of the strongest controls against hallucinated answers.
Use encryption in transit and at rest, scoped credentials, and redaction/tokenization for sensitive fields before embedding. Security controls should cover ingestion, storage, and query paths.
Use multilingual embedding models or language-aware routing and store language metadata for filtering. Evaluate relevance per language to avoid hidden quality gaps.
Establish ingestion contracts, observability, evaluation gates, security controls, and rollback strategies. Treat retrieval quality as an SLO-backed production concern, not just a prototype feature.
