Big data is here, and just like Big Brother in Orwell’s dystopia, it is everywhere in the 21st century. Whether you like it or not, you have to deal with it; the question is which tools to use.
Luckily, the great minds of Google developed a database technology called BigTable, whose design inspired two widely used open-source systems: Apache Cassandra and Apache HBase.
At first glance, a shared ancestry suggests that they should be nearly identical, but in practice they differ a lot.
Yes, both Cassandra and HBase are NoSQL wide-column stores designed to handle large amounts of data. Their tables and column families suggest complete structural identity, but the functionality and typical applications of these technologies are not the same.
So let’s take a closer look at both to clarify their pros, cons, and specific characteristics that make software development teams recommend one of these databases for particular projects.
Data model
HBase
HBase is a column-oriented database and a classic implementation of Google’s BigTable design. You can imagine HBase storage as a table with rows and columns.
Rows are identified by row keys, and columns are grouped into column families. Column qualifiers identify the individual columns within a family, which allows for finer data organization. Each cell, at the intersection of a row and a column, holds a value and a timestamp.
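To make the nesting concrete, here is a minimal sketch of that logical layout using plain Python dictionaries. It is not the HBase API; the function names and the sample family and qualifier names are made up for illustration.

```python
# Toy model of the logical layout described above (NOT the HBase client API):
# row key -> column family -> column qualifier -> {timestamp: value}

def put(table, row_key, family, qualifier, value, timestamp):
    """Store one version of a cell, creating the nested levels on demand."""
    versions = (
        table.setdefault(row_key, {})
        .setdefault(family, {})
        .setdefault(qualifier, {})
    )
    versions[timestamp] = value


def get_latest(table, row_key, family, qualifier):
    """Return the value with the highest timestamp, or None if the cell is absent."""
    versions = table.get(row_key, {}).get(family, {}).get(qualifier, {})
    return versions[max(versions)] if versions else None


table = {}
# Hypothetical sample data: two versions of one cell, plus a cell in another family.
put(table, "org.apache.www", "contents", "html", "<html>v1</html>", timestamp=1)
put(table, "org.apache.www", "contents", "html", "<html>v2</html>", timestamp=2)
put(table, "org.apache.www", "anchor", "example.com", "Apache", timestamp=1)

print(get_latest(table, "org.apache.www", "contents", "html"))
```

Cells that were never written simply do not exist in the structure, which is why a sparse table costs nothing for its empty cells.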
HBase sorts rows lexicographically by row key, so rows with similar keys are stored next to each other. Since domain names are a common row key pattern, consider storing them in reversed form (org.apache.www, org.apache.mail, etc.) so that closely related data is kept together.
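The effect of reversing domain names is easy to see with an ordinary lexicographic sort. The snippet below is only an illustration; the helper name and the sample domains are made up.

```python
def reverse_domain(domain):
    """Turn 'www.apache.org' into 'org.apache.www'."""
    return ".".join(reversed(domain.split(".")))


# Hypothetical sample hosts from two sites.
domains = ["www.apache.org", "www.example.com", "mail.apache.org", "docs.example.com"]

# Sorted as-is, hosts of the same site are scattered across the key space.
print(sorted(domains))

# Sorted in reversed form, hosts of the same site end up next to each other.
print(sorted(reverse_domain(d) for d in domains))
```

With the reversed keys, a scan over the prefix `org.apache.` would touch one contiguous range of rows rather than several distant ones.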
Cassandra
The naming conventions in Cassandra are the same as in HBase: tables with rows and columns, column families, and row keys; however, the meaning of these terms is a bit different. In Cassandra, it is the column family that is organized by row keys. Every column has three defining elements: a name (key), a value, and a timestamp.
The difference from HBase comes with super columns (with two or more subcolumns) that are grouped into super column families.
Cassandra organizes all data into partitions, which consist of those columns and column families. Each partition is stored on a node, and the nodes together form a cluster.
This means that whenever a record is inserted into this database, the system hashes the value of this data’s partition key. Then based on this hash value, Cassandra determines which node is responsible for the data.
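The hashing step can be sketched as a simple token ring. This is a simplified illustration, not Cassandra’s implementation: MD5 is used here only as a stand-in hash, each node gets a single token, and the class, function, and node names are hypothetical.

```python
import bisect
import hashlib


def token_for(text):
    """Map a string to a 64-bit token (MD5 is a stand-in, chosen for illustration)."""
    digest = hashlib.md5(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class Ring:
    """Toy token ring: every node owns the range that ends at its token."""

    def __init__(self, node_names):
        self.entries = sorted((token_for(name), name) for name in node_names)
        self.tokens = [token for token, _ in self.entries]

    def node_for(self, partition_key):
        # Find the first node token >= the key's token, wrapping around the ring.
        idx = bisect.bisect_left(self.tokens, token_for(partition_key))
        return self.entries[idx % len(self.entries)][1]


ring = Ring(["node-a", "node-b", "node-c"])
for key in ["user:1", "user:2", "user:3"]:
    print(key, "->", ring.node_for(key))
```

Because the owner is computed from the key alone, any node can work out where a record belongs without asking a central coordinator.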
HBase vs. Cassandra
Here is a simple comparison of the differences between the two:
- Cassandra’s column is almost like HBase’s cell.
- Cassandra’s column family is close to HBase’s table.
- Cassandra’s super column is close to HBase’s column qualifier (the former has two or more subcolumns, while the latter has just one).
- Cassandra’s primary key can contain multiple columns, while HBase’s row key has only a one-column structure.
These are the main structural differences, and they shape how each system is used. However, even with the differences in terminology and, to a lesser degree, in the models, both systems store related data together. Neither occupies any space for an empty data cell. To run smoothly, both require column families that cannot be changed later on.
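The difference in key structure can be illustrated with a small sketch. A Cassandra-style key is modeled here as a tuple of named columns, while an HBase-style row key is a single string that the application composes itself. All names, and the `#` separator, are assumptions made for the example.

```python
from typing import NamedTuple


class SensorKey(NamedTuple):
    """Cassandra-style primary key built from several columns (hypothetical schema)."""
    sensor_id: str
    day: str
    reading_time: str


def to_row_key(key, sep="#"):
    """HBase-style: pack the parts into one row key string.

    Assumes the separator never occurs inside any of the parts.
    """
    return sep.join(key)


def from_row_key(row_key, sep="#"):
    """Unpack a composed row key back into its parts."""
    return SensorKey(*row_key.split(sep))


key = SensorKey("sensor-17", "2024-05-01", "12:30:00")
row_key = to_row_key(key)
print(row_key)
print(from_row_key(row_key))
```

In the single-key model, the order of the parts matters: it decides which rows sort next to each other.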
Architecture
Architecturally, the two systems are opposites: Cassandra is masterless, while HBase is master-based.
This means that HBase has a potential single point of failure, while Cassandra does not.
An HBase client communicates directly with the region server without needing to contact the master, so the whole cluster can keep operating for some time after the master goes down, but the important word here is “some.”
For Cassandra, this kind of issue does not exist. Since it has no master, the load is distributed evenly across the nodes, and the cluster keeps working when an individual node fails. If downtime and complete system failures are not an option for you, then you already know which system fits your needs better: Cassandra.
There is another important architectural difference between the systems: data replication and consistency.
HBase writes each piece of data in one place, and, as a result, the path to it is always clear and consistent throughout the system. Cassandra, on the other hand, replicates data across several nodes to guarantee availability. This replication can lead to data consistency problems.
So if consistency is more important than stable operation, then HBase is your choice.
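How replication and consistency interact can be shown with a little arithmetic. In replicated stores with tunable consistency, Cassandra among them, a read is guaranteed to overlap with the latest acknowledged write only when the number of replicas acknowledging the write plus the number consulted on the read exceeds the replication factor. The function names below are made up; this sketches the rule and is not any driver’s API.

```python
def quorum(replication_factor):
    """Smallest majority of replicas."""
    return replication_factor // 2 + 1


def read_overlaps_write(replication_factor, write_acks, read_acks):
    """True when every read set must share at least one replica with every write set."""
    return write_acks + read_acks > replication_factor


rf = 3
print(quorum(rf))
print(read_overlaps_write(rf, write_acks=quorum(rf), read_acks=quorum(rf)))
print(read_overlaps_write(rf, write_acks=1, read_acks=1))
```

With three replicas, quorum writes and quorum reads always overlap, while writing to one replica and reading from one replica can return stale data until the replicas converge.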
HBase’s architecture focuses on a single goal, data management, while Cassandra also handles data storage itself.
This is because HBase depends on HDFS for storage, on Apache ZooKeeper for metadata and server status management, and so on. And do not forget that to run queries, HBase needs extra technologies, while Cassandra has its own query language (CQL).
Performance
In terms of writing, Cassandra reaches almost 385,000 operations per second, while for HBase the number is below 58,500 operations per second in a 32-node cluster. Since HBase does not write to the cache and the log simultaneously, as Cassandra does, it works slower.
For example, since HBase is closely connected to HDFS, it needs to wait for the file system to store the data physically. Moreover, before writing, a client first has to ask ZooKeeper which server holds the information about where data is stored. Then it has to ask that server which server is responsible for the required data, and only after that can it write the data to the right place.
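The chain of lookups can be mimicked with a toy model. Everything below is hypothetical: the server names, the dictionaries standing in for ZooKeeper and the metadata table, and the split of the key space into two ranges. It only illustrates why a first write involves several hops.

```python
import bisect

# Hypothetical cluster state (stand-ins for ZooKeeper and the metadata table).
ZOOKEEPER = {"meta-location": "server-1"}
META_TABLE = {
    # Held by server-1: (first row key of the range, server responsible for it).
    "server-1": [("", "server-2"), ("m", "server-3")],
}


def locate_server(row_key):
    """Return the server responsible for row_key and the list of hops taken."""
    hops = ["zookeeper"]
    meta_server = ZOOKEEPER["meta-location"]          # hop 1: where is the metadata?
    hops.append(meta_server)
    ranges = META_TABLE[meta_server]                  # hop 2: who owns this key range?
    start_keys = [start for start, _ in ranges]
    idx = bisect.bisect_right(start_keys, row_key) - 1
    target = ranges[idx][1]
    hops.append(target)                               # hop 3: the write itself
    return target, hops


print(locate_server("org.apache.www"))
print(locate_server("com.example.docs"))
```

A real client would normally cache the answers, so the extra hops matter most for the first request and after data has moved between servers.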
When it comes to reading, the statistics say that HBase manages only 8,000 reads per second compared to 129,000 reads per second for Cassandra within a 32-node cluster. The numbers suggest that Cassandra is better, but don’t jump to conclusions.
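For a sense of scale, the figures quoted in this section can be compared directly. The inputs are the approximate numbers from the text (“almost” 385,000 and “lower than” 58,500), so the ratios are rough.

```python
# Throughput figures as quoted in this section (operations per second, 32-node cluster).
cassandra_writes, hbase_writes = 385_000, 58_500
cassandra_reads, hbase_reads = 129_000, 8_000

write_ratio = cassandra_writes / hbase_writes
read_ratio = cassandra_reads / hbase_reads

print(f"writes: Cassandra is about {write_ratio:.1f}x HBase")
print(f"reads:  Cassandra is about {read_ratio:.1f}x HBase")
```

By these figures, the gap is roughly 6.6x for writes and about 16x for reads.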
Read performance is mostly about consistency, and it is the trump card of HBase. Whenever you need fast reads, HBase is your choice contrary to what the numbers suggest.
The architectural peculiarities of Cassandra should remind you that its reads can be inconsistent, since in a masterless structure different replicas may return different versions of the same data.
HBase, on the other hand, is built for consistency, since all reads of a given piece of data are addressed to the same server (as in HDFS) within the multi-layered system. You might hesitate over the speed of data retrieval with HBase, but don’t worry: the block cache, which holds frequently accessed HDFS data, and Bloom filters, which hold approximate “addresses” for all the data, help to speed up the process.
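The idea behind a Bloom filter can be shown in a few lines. The sketch below is a generic, minimal Bloom filter and not HBase’s implementation; the sizes, the use of SHA-256, and the class name are choices made for the example.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: may report false positives, never false negatives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)  # one byte per bit, for simplicity

    def _positions(self, key):
        # Derive num_hashes positions by salting the key with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))


row_keys_in_file = BloomFilter()
for row_key in ["org.apache.mail", "org.apache.www"]:
    row_keys_in_file.add(row_key)

print(row_keys_in_file.might_contain("org.apache.www"))
```

A “definitely absent” answer lets a read skip a file entirely, which is where the speed-up comes from.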
So whether it is better to have 8,000 precise reads or 129,000 inconsistent reads is up to you. Still, the answer is pretty straightforward: the multi-layered index system of HBase is actually more efficient than Cassandra’s indexes.
Security
NoSQL databases are not famous for their security, and the databases in this article are no exception. Cassandra’s security features are based on inter-node and client-to-node protection in the form of authentication and authorization of all actions.
So whenever someone needs to access some data in Cassandra, they need to have an appropriate user role. Access to data levels and pieces is defined based on user roles.
In HBase, access control can go deeper, down to the cell level. The security here is based on the assignment of visibility labels to data sets rather than identification of user roles as in Cassandra. Since HBase relies on third-party technologies, it can be said that its security level is a bit lower than that of Cassandra. However, this can also be used as an advantage, since with HBase one can turn to more secure and reliable external technologies for data protection.
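The two approaches can be contrasted in a simplified sketch. Neither half reproduces the real systems: the roles, tables, labels, and function names are invented, and the label check below only supports “the user must hold all listed labels.”

```python
# Role-based access (Cassandra-style idea): a role grants actions on whole tables.
# Hypothetical roles and permissions.
ROLE_PERMISSIONS = {
    "analyst": {("visits", "SELECT")},
    "ingest": {("visits", "MODIFY")},
}


def role_allows(role, table, action):
    return (table, action) in ROLE_PERMISSIONS.get(role, set())


# Label-based access (HBase-style idea): each cell carries its own required labels.
# Hypothetical cells as (value, required_labels) pairs.
CELLS = [
    ("public page", set()),
    ("salary", {"hr"}),
    ("audit note", {"hr", "secret"}),
]


def visible_cells(cells, user_labels):
    """Return the values whose required labels are all held by the user."""
    return [value for value, required in cells if required <= user_labels]


print(role_allows("analyst", "visits", "SELECT"))
print(visible_cells(CELLS, {"hr"}))
```

The role check answers “may this user touch this table?”, while the label check filters individual cells within one and the same table.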
Usability
Scalability and operations with time-series data are the most significant advantages and hence the most common applications for Cassandra and HBase.
Both systems are great with customer behavior and website visits, sensor readings in IoT systems, stock exchange data, etc.
As discussed above, reading and writing data are their strengths. Yet, as you remember, reading is more HBase’s strong suit, so scanning large data volumes in search of a particular result, or text analysis of the kind common for social networks, web page search, and dictionaries, are all tasks for HBase to handle.
Besides that, if your goal is basic data analysis, then HBase is also your best fit for summing and counting.
Whenever you need to write (ingest) large volumes of data, Cassandra is the more efficient choice. It offers a higher level of stability and enables the development of synchronized data centers in different countries all over the globe. If writing data is more important than reading it for your company, consider also adding Spark to improve Cassandra’s scan performance.
Whenever you need real-time analytics and always-available data, Cassandra can offer both. If, however, you need more precise and scrupulous analysis and time is not your pressure point, go for HBase.
Summary
Even though the information in this article points out many similarities between the systems, a closer look reveals clear differences.
Cassandra can operate on its own, while HBase depends on third-party technologies and should be considered more of a meta-data storage. Because it is not an independent system, HBase is more complicated, and hence its configuration, maintenance, and security require more workforce resources in total.
Whenever you are going for consistency and searching for some small pieces of information in large databases, HBase can become a perfect and reliable desktop tool. However, Cassandra is good at massive data ingestion and storage.
Anyway, both systems don’t like frequent deletes and updates of the stored data. Whenever you hear that Cassandra and HBase are so much alike that it doesn’t matter which one you choose, run away from such an expert.
Each of the systems has its perks and drawbacks, so before you select one, consider your day-to-day tasks to make a reasonable choice.