Apache Cassandra Interview Questions
Cassandra is a free, open-source, distributed NoSQL database management system designed to handle large amounts of data. It provides high availability with no single point of failure.
Cassandra is written in Java. It was originally developed at Facebook and supports flexible schemas. It is highly scalable, which makes it well suited to big data.
Cassandra has its own Cassandra Query Language (CQL). CQL is a simple interface for accessing Cassandra, as an alternative to the traditional Structured Query Language (SQL).
- Open-source availability.
- Distributed footprint.
- Scalability.
- Cassandra Query Language.
- Fault tolerance.
- Schema free.
- Tunable consistency.
- Fast writes.
- Peer-to-peer architecture.
Cassandra | RDBMS |
Data may be unstructured. | Only structured data. |
Flexible schema. | Fixed schema. |
Data is written to many locations. | Data is mostly written to one location. |
In Cassandra, a table is a list of "nested key-value pairs". (Row x Column Key x Column value) | In RDBMS, a table is an array of arrays. (Row x Column) |
Keyspace is the outermost container which contains data corresponding to an application. | Database is the outermost container which contains data corresponding to an application. |
The write path in Cassandra begins with the commit log, where each write is recorded for durability, and the memtable, where the data is held temporarily in memory. Memtables are periodically flushed to disk and written out as SSTables.
- Logging data in the commit log,
- Writing data to the memtable,
- Flushing data from the memtable,
- Storing data on disk in SSTables.
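The four steps above can be sketched as a toy, single-node model. This is only an illustration, not Cassandra's actual implementation: the `ToyNode` class, its `flush_threshold` setting, and the choice to clear the commit log on flush are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ToyNode:
    """Hypothetical single-node model of the write path (not Cassandra's real code)."""
    flush_threshold: int = 3                        # assumed: flush after this many keys
    commit_log: list = field(default_factory=list)  # append-only record of mutations
    memtable: dict = field(default_factory=dict)    # in-memory write buffer
    sstables: list = field(default_factory=list)    # flushed, read-only snapshots

    def write(self, key, value):
        # 1. Log the mutation in the commit log for durability.
        self.commit_log.append((key, value))
        # 2. Apply the mutation to the memtable.
        self.memtable[key] = value
        # 3. Flush once the memtable holds enough keys.
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # 4. Persist the memtable as a sorted, immutable SSTable (a tuple here).
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
        # Flushed mutations no longer need to be replayed after a restart.
        self.commit_log.clear()


node = ToyNode()
for i in range(4):
    node.write(f"user{i}", f"name{i}")
```

After the four writes, the first three keys sit in one SSTable, while the fourth is still only in the memtable and the commit log.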
SSTables are the immutable data files that Cassandra uses for persisting data on disk. As SSTables are flushed to disk from memtables or are streamed from other nodes, Cassandra triggers compactions which combine multiple SSTables into one. Once the new SSTable has been written, the old SSTables can be removed.
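As a rough illustration of compaction, the sketch below merges several SSTables into one and keeps the newest value for each key. Modelling an SSTable as a dict of key to `(timestamp, value)` is an assumption made for the example, and the `compact` function is hypothetical, not part of any Cassandra API.

```python
def compact(sstables):
    """Merge SSTables into one, keeping the newest value per key.

    Each SSTable is modelled as a dict of key -> (timestamp, value);
    this layout is an assumption made for the sketch.
    """
    merged = {}
    for sstable in sstables:
        for key, (timestamp, value) in sstable.items():
            # Keep the entry with the highest timestamp for each key.
            if key not in merged or timestamp > merged[key][0]:
                merged[key] = (timestamp, value)
    # Return keys in sorted order, since an SSTable is a sorted table.
    return dict(sorted(merged.items()))


old_1 = {"a": (1, "apple"), "b": (2, "banana")}
old_2 = {"b": (5, "blueberry"), "c": (3, "cherry")}
new_sstable = compact([old_1, old_2])
```

Here `new_sstable` holds one entry per key, and the older value for `"b"` is dropped. Once it has been written, `old_1` and `old_2` could be removed.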
The commit log is an append-only log of all mutations local to a Cassandra node. Any data written to Cassandra is first written to the commit log before being written to a memtable. This provides durability in the case of an unexpected shutdown. On startup, any mutations in the commit log are applied to the memtables.
Memtables are in-memory structures where Cassandra buffers writes. In general, there is one active memtable per table. Eventually, memtables are flushed onto disk and become immutable SSTables.
NoSQL, also referred to as "not only SQL" or "non-SQL", is an approach to database design that enables the storage and querying of data outside the traditional structures found in relational databases. A NoSQL database can still store the kind of data found in a relational database management system (RDBMS); it just stores it differently. The decision to use a relational or a non-relational database is largely contextual and depends on the use case.
Instead of the typical tabular structure of a relational database, NoSQL databases house data within one data structure, such as a JSON document.
- Handle large volumes of data at high speed with a scale-out architecture.
- Store unstructured, semi-structured, or structured data.
- Enable easy updates to schemas and fields.
- Be developer-friendly.
- Take full advantage of the cloud to deliver zero downtime.
CQL is a NoSQL interface that is intentionally similar to SQL. It gives users who are comfortable with relational databases a familiar language, which lowers the barrier to entry for Apache Cassandra.
The components of Cassandra are:
- Node
- Datacenter
- Commit log
- Cluster
- Mem-table
- SSTable
- Bloom filter
A node represents a single instance of Cassandra. Nodes communicate with one another through gossip, a peer-to-peer communication protocol. Since Cassandra is a distributed database, it can (and usually does) have multiple nodes.
A node is where the data is stored.
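The idea behind gossip can be sketched as peers merging what they know about each other. The example below is a simplification under stated assumptions: each node's view is just a map of node name to a heartbeat version, and the pairs that talk are fixed rather than chosen at random. The function names `gossip_exchange` and `gossip_round` are hypothetical.

```python
def gossip_exchange(state_a, state_b):
    """Return the merged view: the highest heartbeat version seen for every node."""
    merged = dict(state_a)
    for node, version in state_b.items():
        if version > merged.get(node, -1):
            merged[node] = version
    return merged


def gossip_round(views, pairs):
    """Let each listed pair of nodes exchange state, in order."""
    for a, b in pairs:
        merged = gossip_exchange(views[a], views[b])
        # Both peers leave the exchange with the same merged view.
        views[a] = dict(merged)
        views[b] = dict(merged)


# Each node's view: node name -> latest heartbeat version it knows about.
views = {
    "n1": {"n1": 7},
    "n2": {"n2": 4},
    "n3": {"n3": 9},
}

gossip_round(views, [("n1", "n2"), ("n2", "n3")])
```

After this round, `n3` has learned about `n1` without talking to it directly, because `n2` passed the information along. `n1` still needs one more exchange to hear about `n3`.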
A Cassandra datacenter is a group of related nodes configured within a cluster for replication purposes. A datacenter is a logical set of racks and should contain at least one rack.
A cluster is a component that contains one or more datacenters.
A rack is a collection of servers. A Cassandra rack is a logical grouping of nodes within the ring.
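The relationship between cluster, datacenter, rack, and node can be pictured as a nested grouping. The sketch below is illustrative only; the addresses and the datacenter and rack names are made up, and `build_topology` is a hypothetical helper.

```python
from collections import defaultdict

# Hypothetical node list: (node address, datacenter, rack).
nodes = [
    ("10.0.0.1", "dc1", "rack1"),
    ("10.0.0.2", "dc1", "rack2"),
    ("10.0.1.1", "dc2", "rack1"),
]


def build_topology(nodes):
    """Group nodes as cluster -> datacenter -> rack -> [node addresses]."""
    cluster = defaultdict(lambda: defaultdict(list))
    for address, datacenter, rack in nodes:
        cluster[datacenter][rack].append(address)
    # Convert to plain dicts so the result prints and compares cleanly.
    return {dc: dict(racks) for dc, racks in cluster.items()}


topology = build_topology(nodes)
```

In this example the cluster contains two datacenters, and `dc1` contains two racks with one node each.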
A memtable does not store data permanently; it temporarily accumulates writes in memory. An SSTable persists the data flushed from a memtable to disk. The data stored in an SSTable is permanent and cannot be changed.
A Bloom filter is an off-heap data structure (kept in native memory rather than on the Java heap) associated with each SSTable. It is used to check whether the SSTable is likely to contain the requested data before any disk I/O is performed.
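A minimal Bloom filter can be sketched as follows. This is a generic textbook version, not Cassandra's implementation: the bit-array size, the number of hashes, and the use of salted SHA-256 are all assumptions chosen to keep the example short.

```python
import hashlib


class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = [False] * size_bits

    def _positions(self, key):
        # Derive num_hashes bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))


# One filter per SSTable, filled with the keys that the SSTable holds.
sstable_filter = BloomFilter()
for partition_key in ("user1", "user2", "user3"):
    sstable_filter.add(partition_key)
```

A read would call `sstable_filter.might_contain(key)` first and skip the SSTable when the answer is `False`. A `True` answer can occasionally be a false positive, so the SSTable still has to be checked in that case.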
cqlsh (Cassandra Query Language Shell) is the interactive CQL terminal. It is a Python-based command-line prompt that runs on Linux or Windows and supports commands such as ASSUME, CAPTURE, CONSISTENCY, COPY, DESCRIBE, and many others. With cqlsh, users can define a schema, insert data, and execute queries.
The SOURCE command is used to execute a file consisting of CQL statements.
SOURCE '~/data/insert_data.cql'
Thrift is a legacy RPC protocol combined with a code generation tool. Cassandra used Thrift to make the database accessible from many programming languages; it has since been superseded by CQL.
The replication factor (RF) is the number of nodes in the cluster that receive a copy of the same data. For example, with RF=3, three nodes in the ring hold copies of the same data.
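One simple way to picture replica placement is a ring walk: hash the key to a position on the ring, then take the next RF nodes clockwise. The sketch below assumes a toy 32-bit ring, one token per node, and SHA-256 as the hash; none of these match Cassandra's real partitioner, and `token` and `replicas_for` are hypothetical names.

```python
import hashlib
from bisect import bisect_left


def token(value):
    """Map a string to a position on a toy ring of size 2**32 (assumed hash)."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:4], "big")


def replicas_for(key, nodes, rf):
    """Return the rf distinct nodes that hold the given partition key."""
    ring = sorted((token(n), n) for n in nodes)
    tokens = [t for t, _ in ring]
    # First node whose token is >= the key's token, wrapping around the ring.
    start = bisect_left(tokens, token(key)) % len(ring)
    # A key cannot have more replicas than there are nodes.
    count = min(rf, len(ring))
    return [ring[(start + i) % len(ring)][1] for i in range(count)]


nodes = ["node-a", "node-b", "node-c", "node-d", "node-e"]
owners = replicas_for("user42", nodes, rf=3)
```

With five nodes and `rf=3`, `owners` lists three different nodes, so the data stays available if one or two of them go down.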
The Cassandra data model consists of four main components:
- Cluster: Made up of multiple nodes and keyspaces.
- Keyspace: A namespace to group multiple column families, typically one per application.
- Column: Consisting of a column name, value, and timestamp.
- Column Family: Multiple columns with the row key reference.
A super column is a special kind of column, so it is also a key-value pair. The difference is that the value of a super column is a map of sub-columns.
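To make the nesting concrete, a column and a super column can be modelled as small data classes. This is only an illustrative sketch: the class names, field names, and sample values are assumptions, not a real Cassandra API.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Column:
    name: str
    value: str
    timestamp: int


@dataclass
class SuperColumn:
    name: str
    columns: Dict[str, Column]  # map of sub-column name -> column


address = SuperColumn(
    name="home_address",
    columns={
        "street": Column("street", "1 Main St", 1700000000),
        "city": Column("city", "Springfield", 1700000000),
    },
)

# A column family row: row key -> super column name -> super column.
row = {"user42": {address.name: address}}
city = row["user42"]["home_address"].columns["city"].value
```

Reading `city` walks three levels of keys: the row key, the super column name, and the sub-column name.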
Generally, column families are stored on disk in individual files. Therefore, to optimize performance, it is important to keep columns that you are likely to query together in the same column family, and a super column can be helpful here. Given below is the structure of a super column.