Big Industries Academy
Emre Sevinç reviews: Cassandra: The Definitive Guide
Apache Cassandra is one of the most popular open-source distributed NoSQL database management systems nowadays, and "Cassandra: The Definitive Guide - Distributed Data at Web Scale" is the best introduction to many aspects of this powerful distributed database.
The book gives a thorough description of the fundamental concepts of Cassandra, starting with its history and what differentiates a distributed, masterless NoSQL system such as Cassandra from traditional relational database management systems such as Oracle, MS SQL Server, PostgreSQL and MySQL. The authors don't shy away from going into what the famous CAP theorem says about distributed systems, and what kind of trade-offs and decisions underlie the Cassandra architecture, leading to high availability, partition tolerance, write performance and eventual consistency.
I consider the chapters on CQL (Cassandra Query Language) and Data Modelling particularly important for big data architects, as well as data engineers: without fully grasping the fine points and pitfalls of data modelling in Cassandra, you are very likely to fall back on the patterns you learned in the RDBMS world. And without a correct data model as a starting point, it is pointless to discuss other issues you might encounter later, such as performance and complexity. These two chapters teach the reader how to do data modelling correctly and how to use formal methodologies employing Chebotko diagrams (see "A Big Data Modeling Methodology for Apache Cassandra" at http://ieeexplore.ieee.org/document/7... and http://www.slideshare.net/ArtemChebot... for more details).
Once you are well-versed in Big Data Modelling for Cassandra, the book lays out the architecture of Cassandra and introduces you to the main concepts, components and processes that make up Cassandra, such as the gossip protocol, snitches, failure detection, rings, tokens, virtual nodes, partitioners, replication strategies, consistency levels, the commit log, memtables, SSTables, caching, hinted handoff, lightweight transactions, Paxos for consensus, tombstones, compaction, Bloom filters, repair mechanisms and Merkle trees.
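To make a few of these terms concrete, here is a minimal sketch of my own (not taken from the book) showing how a token ring can map a key's token to its replicas, and how a quorum is sized for a given replication factor. It assumes one token per node (no virtual nodes), a single data center, and a simple "walk the ring clockwise" placement. The names `ToyRing`, `replicas` and `quorum` are hypothetical, and the partitioner that hashes a key into a token is left out entirely.

```python
import bisect


class ToyRing:
    """A toy token ring: one token per node, no virtual nodes (an assumption)."""

    def __init__(self, node_tokens):
        # Sort (token, node) pairs so the ring can be walked clockwise.
        self._ring = sorted((token, node) for node, token in node_tokens.items())
        self._tokens = [token for token, _ in self._ring]

    def replicas(self, key_token, replication_factor):
        """Return the node owning key_token plus the next nodes clockwise."""
        # The owner is the first node whose token is >= the key's token;
        # past the highest token we wrap around to the start of the ring.
        start = bisect.bisect_left(self._tokens, key_token) % len(self._ring)
        count = min(replication_factor, len(self._ring))
        return [self._ring[(start + i) % len(self._ring)][1] for i in range(count)]


def quorum(replication_factor):
    """A quorum is a majority of the replicas."""
    return replication_factor // 2 + 1


ring = ToyRing({"node-a": 0, "node-b": 100, "node-c": 200, "node-d": 300})
print(ring.replicas(150, 3))  # ['node-c', 'node-d', 'node-a']
print(ring.replicas(301, 3))  # wraps around: ['node-a', 'node-b', 'node-c']
print(quorum(3))              # 2
```

With a replication factor of 3, reading and writing at quorum touches 2 + 2 = 4 replicas, which is more than 3, so every read overlaps at least one replica that saw the latest write. That overlap is the basic idea behind tuning consistency levels.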
After that, you learn how to configure Cassandra based on your data-center considerations and the various configuration options. This chapter covers the basic options, but you'll probably need more than that in a real-life setting.
The chapter on clients, drivers and how to do basic programming by connecting to Cassandra is brief and not very detailed. Nevertheless, the code examples provide a fine starting point.
The book dedicates almost 30 pages to describing the Read and Write Paths of Cassandra, and they were a delight to read. Seeing the step-by-step journey of a read and a write query, and understanding what phases each goes through, helps fill in the gaps in your understanding of how Cassandra works. It also complements your data modelling skills by answering some of the "why" questions: once you know how the read and write paths work, you realize the reasoning behind the data modelling recommendations.
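As a rough illustration of the phases involved, here is a heavily simplified sketch of my own of a single node's write and read path: a write is appended to the commit log, applied to the memtable, and eventually flushed to an immutable SSTable, while a read merges the memtable with the SSTables and keeps the value with the newest timestamp. The class `ToyNode` and its `memtable_limit` parameter are hypothetical, and caches, Bloom filters, tombstones, compaction and replication are all left out.

```python
class ToyNode:
    """A toy single-node storage engine; a sketch, not Cassandra's actual code."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []   # append-only record of writes not yet flushed
        self.memtable = {}     # in-memory: key -> (timestamp, value)
        self.sstables = []     # immutable flushed tables, oldest first
        self.memtable_limit = memtable_limit  # hypothetical flush threshold

    def write(self, key, value, timestamp):
        # 1. Append to the commit log first, for durability.
        self.commit_log.append((key, value, timestamp))
        # 2. Apply to the memtable; the newest timestamp wins.
        current = self.memtable.get(key)
        if current is None or timestamp >= current[0]:
            self.memtable[key] = (timestamp, value)
        # 3. Flush once the memtable holds enough keys.
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Freeze the memtable into a new SSTable and start a fresh one.
        if self.memtable:
            self.sstables.append(dict(self.memtable))
            self.memtable = {}
            # Simplification: flushed writes no longer need to be replayed.
            self.commit_log.clear()

    def read(self, key):
        # Collect every version of the key from the SSTables and the memtable.
        candidates = [table[key] for table in self.sstables if key in table]
        if key in self.memtable:
            candidates.append(self.memtable[key])
        if not candidates:
            return None
        # Return the value carrying the newest timestamp.
        return max(candidates, key=lambda versioned: versioned[0])[1]


node = ToyNode(memtable_limit=2)
node.write("k1", "v1", timestamp=1)
node.write("k2", "v2", timestamp=2)      # second key triggers a flush
node.write("k1", "v1-new", timestamp=3)  # newer version lives in the memtable
print(node.read("k1"))     # v1-new
print(node.read("k2"))     # v2
print(len(node.sstables))  # 1
```

Even in this toy form you can see why writes are cheap (an append plus an in-memory update) and why a read may have to consult several SSTables. That asymmetry is part of the reasoning behind modelling your tables around the queries you intend to run.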
Among the remaining chapters, "Monitoring", "Maintenance", "Performance Tuning", and "Security" contain adequate information as an introduction, though you will still need to watch out for pitfalls, e.g. "hidden" tombstones caused by writing multi-value data types (sets, lists and maps). After all, the devil is in the details!
I found the final chapter, "Deploying and Integrating", a little lightweight: you'll definitely need more information than the book provides, so you should consider this chapter only a small starting point, and nothing more.
A very nice point that I want to stress is that the authors also provide links to the relevant Cassandra JIRA issues when they describe the fine details of a feature or issue. This is very much aligned with the open source nature of Cassandra as an Apache Software Foundation project, and it lets the curious reader learn many more details first-hand. The authors also provide extra explanations of, and pointers to, interesting aspects of Cassandra such as the "ϕ Accrual Failure Detector" and the "Paxos protocol", including why and how they are used in Cassandra. After all, we are talking about a distributed, masterless database system that's known to scale to 75,000 nodes (e.g. in Apple's case), and these fundamental algorithms play an important role.
One thing that I found missing is a brief discussion about the future of Cassandra: where is it going, and what's the roadmap for 2017 and beyond? To be fair, popular open source projects such as Cassandra are moving targets in a sense, and it is not easy to fit everything in a book. But, for example, when discussing Cassandra's SEDA (Staged Event-Driven Architecture), the authors note that some shortcomings have been discovered in recent years, yet they don't go beyond that. The curious reader will need to consult the Cassandra JIRA issue web site, particularly the following issues: "Move away from SEDA to TPC", "Move away from SEDA to TPC, stage 1", and "Make read and write requests paths fully non-blocking, eliminate related stages".
Let me end this review by stating that this is also a very good reference book for big data engineers and architects who plan to study for Certified Architect on Apache Cassandra exams. You'll find yourself marking many pages of the book, especially the discussion of fundamental concepts, best practices, as well as anti-patterns.
Emre Sevinç
Emre is a senior software engineer and project lead with more than 15 years of experience, and a formal background in mathematics and cognitive science. He gained his experience in varied domains such as the space industry, biomedical informatics, and e-learning. When he is not busy developing software-intensive solutions for Big Data systems, writing technical documentation, and engaging with customers, he follows the latest developments in Data Science.