Cassandra Interview Questions

Q: Why Apache Cassandra?
A: Apache Cassandra is a free, open source, distributed data storage system that differs sharply from relational database management systems.
Cassandra first started as an incubation project at Apache in January of 2009. Shortly thereafter, the committers, led by Apache Cassandra Project Chair Jonathan Ellis, released version 0.3 of Cassandra, and have steadily made minor releases since that time. Though as of this writing it has not yet reached a 1.0 release, Cassandra is being used in production by some of the biggest properties on the Web, including Facebook, Twitter, Cisco, Rackspace, Digg, Cloudkick, Reddit, and more.
Cassandra has become so popular because of its outstanding technical features. It is durable, seamlessly scalable, and tuneably consistent. It performs blazingly fast writes, can store hundreds of terabytes of data, and is decentralized and symmetrical so there's no single point of failure. It is highly available and offers a schema-free data model.

Q: Where Did Cassandra Come From?
A: The Cassandra data store is an open source Apache project available at http://cassandra.apache.org. Cassandra originated at Facebook in 2007 to solve that company's inbox search problem, in which they had to deal with large volumes of data in a way that was difficult to scale with traditional methods. Specifically, the team had requirements to handle huge volumes of data in the form of message copies, reverse indices of messages, and many random reads and many simultaneous random writes.
The team was led by Jeff Hammerbacher, with Avinash Lakshman, Karthik Ranganathan, and Prashant Malik, a Facebook engineer on the Search Team, as key engineers. The code was released as an open source Google Code project in July 2008. During its tenure as a Google Code project in 2008, the code was updateable only by Facebook engineers, and little community was built around it as a result. So in March 2009 it was moved to an Apache Incubator project, and on February 17, 2010 it was voted into a top-level project.
Cassandra today presents a kind of paradox: it feels new and radical, and yet it's solidly rooted in many standard, traditional computer science concepts and maxims that successful predecessors have already institutionalized. Cassandra is a realist's kind of database; it doesn't depart from the relational model to be a fun art project or experiment for smart developers. It was created specifically to solve a real-world problem that existing tools weren't able to solve. It acknowledges the limitations of prior methods and faces our new world of big data head-on.

Q: How Did Cassandra Get Its Name?
A: I'm a little surprised how often people ask me where the database got its name. It's not the first thing I think of when I hear about a project. But it is interesting, and in the case
of this database, it's felicitously meaningful.
In Greek mythology, Cassandra was the daughter of King Priam and Queen Hecuba of Troy. Cassandra was so beautiful that the god Apollo gave her the ability to see the future. But when she refused his amorous advances, he cursed her such that she would still be able to accurately predict everything that would happen—but no one would believe her. Cassandra foresaw the destruction of her city of Troy, but was powerless to stop it. The Cassandra distributed database is named for her. I speculate that it is also named as kind of a joke on the Oracle at Delphi, another seer for whom a database is named.
Use Cases for Cassandra
Large Deployments
You probably don't drive a semi truck to pick up your dry cleaning; semis aren't well suited for that sort of task. Lots of careful engineering has gone into Cassandra's high availability, tuneable consistency, peer-to-peer protocol, and seamless scaling, which are its main selling points. None of these qualities is even meaningful in a single-node deployment, let alone allowed to realize its full potential.
There are, however, a wide variety of situations where a single-node relational database is all we may need. So do some measuring. Consider your expected traffic, throughput needs, and SLAs. There are no hard and fast rules here, but if you expect that you can reliably serve traffic with an acceptable level of performance with just a few relational databases, it might be a better choice to do so, simply because RDBMS are easier to run on a single machine and are more familiar.
If you think you'll need at least several nodes to support your efforts, however, Cassandra might be a good fit. If your application is expected to require dozens of nodes, Cassandra might be a great fit.
Lots of Writes, Statistics, and Analysis
Consider your application from the perspective of the ratio of reads to writes. Cassandra is optimized for excellent throughput on writes.
Many of the early production deployments of Cassandra involve storing user activity updates, social network usage, recommendations/reviews, and application statistics. These are strong use cases for Cassandra because they involve lots of writing with less predictable read operations, and because updates can occur unevenly with sudden spikes. In fact, the ability to handle application workloads that require high performance at significant write volumes with many concurrent client threads is one of the primary features of Cassandra.
According to the project wiki, Cassandra has been used to create a variety of applications, including a windowed time-series store, an inverted index for document searching, and a distributed job priority queue.
Geographical Distribution
Cassandra has out-of-the-box support for geographical distribution of data. You can easily configure Cassandra to replicate data across multiple data centers. If you have a globally deployed application that could see a performance benefit from putting the data near the user, Cassandra could be a great fit.
Evolving Applications
If your application is evolving rapidly and you're in “startup mode,” Cassandra might be a good fit given its schema-free data model. This makes it easy to keep your database in step with application changes as you rapidly deploy.

Q: Who Is Using Cassandra?
A: Cassandra is still in its early stages in many ways, not yet seeing its 1.0 release at the time of this writing. There are few easy, graphical tools to help manage it, and the community has not yet settled on certain key internal and external design questions. But what does it say about the promise, usefulness, and stability of a data store that even in its early stages is being used in production by many large, well-known companies?
The list of companies using Cassandra is growing. These companies include:
                Twitter is using Cassandra for analytics. In a much-publicized blog post (at http://engineering.twitter.com/2010/07/cassandra-at-twitter-today.html), Twitter's primary Cassandra engineer, Ryan King, explained that Twitter had decided against using Cassandra as its primary store for tweets, as originally planned, but would instead use it in production for several different things: for real-time analytics, for geolocation and places of interest data, and for data mining over the entire user store.
                Mahalo uses it for its primary near-time data store.
                Facebook still uses it for inbox search, though they are using a proprietary fork.
                Digg uses it for its primary near-time data store.
                Rackspace uses it for its cloud service, monitoring, and logging.
                Reddit uses it as a persistent cache.
                Cloudkick uses it for monitoring statistics and analytics.
                Ooyala uses it to store and serve near real-time video analytics data.
                SimpleGeo uses it as the main data store for its real-time location infrastructure.
                Onespot uses it for a subset of its main data store.
Cassandra is also being used by Cisco and Platform64, and is starting to see use at Comcast and bee.tv for personalized television streaming to the Web and to mobile devices. There are others. The bottom line is that the uses are real. A wide variety of companies are finding use cases for Cassandra and seeing success with it. As of this writing, the largest known Cassandra installation is at Facebook, where they have more than 150TB of data on more than 100 machines. Many more companies are currently evaluating Cassandra for production use in different projects, and a services company called Riptano, cofounded by Jonathan Ellis, the Apache Project Chair for Cassandra, was started in April of 2010. As more features are added and better tooling and support options are rolled out, anticipate even broader adoption.

Q: What is Cassandra?
A: Apache Cassandra is a high performance, extremely scalable, fault tolerant (i.e. no single point of failure), distributed post-relational database solution. Cassandra combines all the benefits of Google Bigtable and Amazon Dynamo to handle the types of database management needs that traditional RDBMS vendors cannot support. From a commercial software standpoint, DataStax is the leading worldwide commercial provider of Cassandra products, services, support, and training.

Q: What are the benefits of using Cassandra?
A: There are many technical benefits that come from using Cassandra.
Apache Cassandra is a standout among the NoSQL/post-relational database solutions on the market for many reasons. Today, major companies, educational institutions, and government agencies are using Cassandra to power key aspects of their business because of the benefits they derive from the following core features:
Based on the best of Amazon Dynamo and Google BigTable, Cassandra's peer-to-peer architecture overcomes the limitations of master-slave designs and allows for both high availability and massive scalability. Cassandra is the acknowledged NoSQL leader when it comes to comfortably scaling to terabytes or petabytes of data. Nodes added to a Cassandra cluster (all done online) increase the throughput of your database in a predictable, linear fashion for both read and write operations.
No single point of failure – Data is replicated to multiple nodes to protect from loss during node failure, and new machines can be added incrementally while online to increase the capacity and data protection of your Cassandra cluster.
Transparent fault detection and recovery – Cassandra clusters can grow into the hundreds or thousands of nodes. Because Cassandra was designed for commodity servers, machine failure is expected. Cassandra uses gossip protocols to detect machine failure and recover when a machine is brought back into the cluster – all without your application noticing.
Flexible, dynamic schema data modeling – Cassandra offers the organization of a traditional RDBMS table layout combined with the flexibility and power of no stringent structure requirements. This allows you to store your data as you need to, without a performance penalty for changes as your needs evolve. Plus, Cassandra can store structured, semi-structured, and unstructured data.
Guaranteed data safety – Cassandra far exceeds other systems on write performance, while ensuring durability, due to its innovative append-only commit log. Users no longer have to trade off durability to keep up with immense write streams: once a write has been acknowledged, it is safely persisted.
Distributed, read/write anywhere design – Cassandra's peer-to-peer architecture avoids the hotspots and read/write issues found in master-slave designs. This means you can have a highly distributed database (multi-geography, multi-data center, etc.) and read from or write to any node in a cluster without concern over which node is being accessed.
Tunable data consistency – Cassandra is a distributed system that can span multiple machines, multiple racks, and multiple data centers. Because you know your requirements for latency across those boundaries better than anyone, Cassandra allows you to choose strong consistency or varying degrees of more relaxed consistency (incorporating advanced anti-entropy protocols). The full 'CAP' spectrum between consistency and availability is yours, and data consistency can be controlled on a per-operation basis (i.e. per INSERT, per UPDATE, etc.).
Multi-datacenter replication – Whether it's keeping your data in multiple locations for disaster recovery scenarios or near your end users for blazing performance, Cassandra offers support for multiple data centers. Simply configure how many copies of your data you want in each data center, and Cassandra handles the rest, replicating your data for you. Cassandra is also rack-aware and can keep replicas of data on different physical racks, which helps ensure uptime in the case of single-rack failures.
Cloud enabled – Cassandra's architecture maximizes the benefits of running in the Cloud. Plus, Cassandra allows for hybrid data distribution, where some data can be kept on premises and some in the Cloud.
Data compression – Cassandra supplies built-in data compression, with some use cases showing up to an 80% reduction in raw data footprint. Plus, Cassandra's compression results in no performance penalty, with some use cases showing actual read/write speedups due to less physical I/O being managed.
CQL (Cassandra Query Language) – Cassandra provides a SQL-like language called CQL that mirrors SQL's DDL, DML, and SELECT syntax. CQL greatly lessens the learning curve for those coming from RDBMS systems because they can use familiar syntax for all object creation and data access operations.
No caching layer required – Cassandra offers caching on each of its nodes. Coupled with Cassandra's scalability characteristics, you can incrementally add nodes to the cluster to keep as much of your data in memory as you need. The result is that no separate caching layer is required: caching and disk persistence live in one layer, which eases both development and operations.
No special hardware needed – Cassandra runs on commodity machines and requires no expensive or special hardware.
Incremental and elastic expansion – The Cassandra ring allows you to add nodes easily, with no manual migration of data needed from one node to another. The result is that your Cassandra cluster can grow as you need it to, and you can increase your cost incrementally as your data needs demand. Simply add new nodes to the Cassandra cluster as needed.
Simple install and setup – Cassandra can be downloaded and installed in minutes, even for multi-node cluster installs.

Q: How to install Cassandra?
A: Debian installation instructions
step 1:
Upgrade your software
sudo apt-get upgrade

step 2:
Open sources.list
sudo vi /etc/apt/sources.list
step 3:
Add the following lines to your sources.list
deb http://www.apache.org/dist/cassandra/debian 11x main
deb-src http://www.apache.org/dist/cassandra/debian 11x main

step 4:
Run update
sudo apt-get update
Now you will see an error similar to this:
GPG error: http://www.apache.org unstable Release: The following signatures couldn't be
verified because the public key is not available: NO_PUBKEY F758CE318D77295D
This simply means you need to add the public key. You do that like this:
gpg --keyserver pgp.mit.edu --recv-keys F758CE318D77295D
gpg --export --armor F758CE318D77295D | sudo apt-key add -

Starting with the 0.7.5 Debian package, you will also need to add public key 2B5C1B00
using the same commands as above:
gpg --keyserver pgp.mit.edu --recv-keys 2B5C1B00
gpg --export --armor 2B5C1B00 | sudo apt-key add -

step 5:
Run update again and install Cassandra
sudo apt-get update
sudo apt-get install cassandra

step 6:
Start Cassandra
sudo service cassandra start

Starting Cassandra from the binary package without installation
step 1:
Download the latest Cassandra version from the following URL:
http://cassandra.apache.org/download/
step 2:
Start Cassandra using the following command:
Downloaded_apache_version/bin/cassandra -f

Q: How to start/stop Cassandra on a machine?
A: Starting Cassandra involves connecting to the machine where it is installed with the proper security credentials, and invoking the cassandra executable from the installation's binary directory. An example of starting Cassandra on Mac could be:
sudo /Applications/Cassandra/apache-cassandra-1.1.1/bin/cassandra
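Stopping Cassandra amounts to stopping the Cassandra Java process. A minimal sketch, assuming a typical package or tarball install (the service name and process match are assumptions, not taken from this document):
sudo service cassandra stop      # package-based installs that register a service
pkill -f CassandraDaemon         # tarball installs: stop the Cassandra JVM directly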

Q: How to log into Cassandra?
A: The basic command line interface (CLI) for logging into and executing commands against Cassandra is the cassandra-cli utility, which is found in the software installation's bin directory.
An example of logging into a local machine's Cassandra installation using the CLI and the default Cassandra port might be:
Welcome to the Cassandra CLI.
Type 'help;' or  '?'  for help.
Type 'quit;' or 'exit;' to quit.

[default@unknown] connect localhost/9160;
Connected to: "Test Cluster" on localhost/9160
[default@unknown]

Q: When is Cassandra required for an application?
A: Cassandra can be used in many different data management situations. Some of the most common use cases for Cassandra include:
Serving as the operational/real-time/system-of-record datastore for Web or other online applications needing around-the-clock transactional input capabilities
Applications needing "network independence", meaning systems that should not have to worry about where data lives. This oftentimes equates to widely dispersed applications that need to serve numerous geographies with the same fast response times
Applications needing extreme degrees of uptime and no single point of failure
Retailing or other such systems needing easy data elasticity, so that capacity can be added to service peak workloads for various periods of time and then shrink back when a reduction in user traffic allows – all done in an online fashion

Write intensive applications that have to take in continuous large volumes of data (e.g. credit card systems, music download purchases, device/sensor data, Web clickstream data, archiving systems, event logging, etc.)
Real-time analysis of social media or similar data that requires tracking user activity, preferences, etc.
Systems that need to quickly analyze data, and then use the results of that analysis as input back into the real-time system. For example, a travel or retail site may need to analyze patterns on the fly to customize offers to customers in real-time.
Management of large data volumes (terabytes-petabytes) that must be kept online for query access and business intelligence processing
Caching functionality that delivers caching tier performance response times without resorting to separate caching (e.g. memcached) and database tiers
SaaS applications that utilize web services to connect into a distributed, yet centrally managed database, and then display results to SaaS customers
Cloud applications that require elastic data scale, easy deployment, and a need to grow through a data-centric scale-out architecture
Systems that need to store and directly deal with a combination of structured, unstructured, and semi-structured data, with a requirement for a flexible schema/data storage paradigm that allows for easy and online structure modifications

Q: When should I not use Cassandra?
A: Cassandra is typically not the choice for transactional data that needs per-transaction commit/rollback capabilities. Note that Cassandra does have atomic transactional abilities on a per row/insert basis (but with no rollback capabilities).

Q: How does Cassandra differ from Hadoop?
A: The primary difference between Cassandra and Hadoop is that Cassandra targets real-time/operational data, while Hadoop has been designed for batch-based analytic work.
There are many different technical differences between Cassandra and Hadoop, including Cassandra's underlying data structure (based on Google's Bigtable), its fault-tolerant, peer-to-peer architecture, multi-data center capabilities, tunable data consistency, all nodes being the same (no concept of a namenode, etc.) and much more.

Q: How does Cassandra differ from HBase?
A: HBase is an open-source, column-oriented data store modeled after Google Bigtable, and is designed to offer Bigtable-like capabilities on top of data stored in Hadoop. However, while HBase shares the Bigtable design with Cassandra, its foundational architecture is much different.
A Cassandra cluster is much easier to set up and configure than a comparable HBase cluster. HBase's reliance on the Hadoop namenode equates to there being a single point of failure in HBase, whereas with Cassandra, because all nodes are the same, there is no such issue.
In internal performance tests conducted at DataStax (using the Yahoo Cloud Serving Benchmark – YCSB), Cassandra offered 5X better performance on writes and 4X better performance on reads than HBase.

Q: How does Cassandra differ from MongoDB?
A: MongoDB is a document-oriented database that is built upon a master-slave/sharding architecture. MongoDB is designed to store/manage collections of JSON-styled documents.
By contrast, Cassandra uses a peer-to-peer, write/read-anywhere styled architecture that is based on a combination of Google BigTable and Amazon Dynamo. This allows Cassandra to avoid the various complications and pitfalls of master/slave and sharding architectures. Moreover, Cassandra offers linear performance increases as new nodes are added to a cluster, scales to terabyte-petabyte data volumes, and has no single point of failure.

Q: What is DataStax Community Edition?
A: DataStax Community Edition is a free software bundle from DataStax that combines Apache Cassandra with a number of developer and management tools provided by DataStax, which are designed to get someone up and productive with Cassandra in very little time. DataStax Community Edition is provided for open source enthusiasts and is not recommended for production use, as it is not formally supported by DataStax's production support staff.

Q: What is DataStax Enterprise Edition?
A: DataStax Enterprise is the commercial product offering from DataStax that is designed for enterprise-class, production usage that combines Apache Cassandra with real-time analytics capabilities and smart mixed workload management. DataStax Enterprise also provides full 24×7 production support, consultative assistance, certified maintenance updates, and much more.

Q: How does Cassandra protect me against downtime?
A: Cassandra has been built from the ground up to be a fault-tolerant, peer-to-peer database that offers no single point of failure, providing the strongest possible uptime and disaster recovery capabilities. Specifically, Cassandra:
                Automatically replicates data between nodes to offer data redundancy
                Offers built-in intelligence to replicate data between different physical server racks (so that if one rack goes down the data on other racks is safe)
                Easily replicates between geographically dispersed data centers
                Leverages any combination of cloud and on-premise resources
               
Q: Does Cassandra use a master/slave architecture or something else?
A: Cassandra does not use a master/slave architecture, but instead uses a peer-to-peer implementation, which avoids the pitfalls, latency problems, single point of failure issues, and performance headaches associated with master/slave setups.

Q: How do I replicate data across Cassandra nodes?
A: Replication is the process of storing copies of data on multiple nodes to ensure reliability and fault tolerance. When you create a keyspace in Cassandra, you must decide the replica placement strategy: the number of replicas and how those replicas are distributed across nodes in the cluster. The replication strategy relies on the cluster-configured snitch to help it determine the physical location of nodes and their proximity to each other.
The total number of replicas across the cluster is often referred to as the replication factor. A replication factor of 1 means that there is only one copy of each row. A replication factor of 2 means two copies of each row. All replicas are equally important; there is no primary or master replica in terms of how read and write requests are handled.
Replication options are defined when you create a keyspace in Cassandra. The snitch is configured per node.
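As an illustrative sketch (the keyspace name and replication factor are hypothetical), a keyspace and its replica placement strategy might be defined from the cassandra-cli utility on a 1.x cluster like this:
create keyspace Demo
  with placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy'
  and strategy_options = {replication_factor:3};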

Q: How is my data partitioned in Cassandra across nodes in a cluster?
A: Cassandra provides a number of options to partition your data across nodes in a cluster.
The RandomPartitioner is the default partitioning strategy for a Cassandra cluster. It uses a consistent hashing algorithm to determine which node will store a particular row. The end result is an even distribution of data across a cluster.
The ByteOrderedPartitioner ensures that row keys are stored in sorted order. It is not recommended for most use cases and can result in uneven distribution of data across a cluster.
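The partitioner is set cluster-wide in cassandra.yaml. As a sketch, the default setting looks like this (the second line shows the ordered alternative, commented out):
partitioner: org.apache.cassandra.dht.RandomPartitioner
# partitioner: org.apache.cassandra.dht.ByteOrderedPartitioner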

Q: What are 'seed nodes' in Cassandra?
A: A seed node in Cassandra is a node that is contacted by other nodes when they first start up and join the cluster. A cluster can have multiple seed nodes. Cassandra uses a protocol called gossip to discover location and state information about the other nodes participating in a Cassandra cluster. When a node first starts, it contacts a seed node to bootstrap the gossip communication process. The seed node designation has no purpose other than bootstrapping new nodes joining the cluster. Seed nodes are not a single point of failure.
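Seed nodes are listed in each node's cassandra.yaml. A minimal sketch for a 1.x node (the IP addresses are hypothetical):
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "10.10.10.1,10.10.10.2"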

Q: What is a "snitch"?
A: The snitch is a configurable component of a Cassandra cluster used to define how the nodes are grouped together within the overall network topology (such as rack and data center groupings). Cassandra uses this information to route inter-node requests as efficiently as possible within the confines of the replica placement strategy. The snitch does not affect requests between the client application and Cassandra (it does not control which node a client connects to).
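The snitch is chosen per node in cassandra.yaml; with the PropertyFileSnitch, for example, rack and data center assignments are spelled out in cassandra-topology.properties. The values below are a hypothetical sketch, not taken from this document:
# cassandra.yaml
endpoint_snitch: PropertyFileSnitch
# conf/cassandra-topology.properties
10.10.10.1=DC1:RAC1
10.10.10.2=DC1:RAC2
10.10.20.1=DC2:RAC1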

Q: How do I add new nodes to a cluster?
A: Cassandra is capable of offering linear performance benefits when new nodes are added to a cluster.
A new machine can be added to an existing cluster by installing the Cassandra software on the server and configuring the new node so that it knows (1) the name of the Cassandra cluster it is joining; (2) the seed node(s) it should contact to learn about the cluster; (3) the range of data that it is responsible for, which is done by assigning a token to the node.
Please see the online documentation about how to assign a token to a new node and the various use cases that dictate the complexity of token assignment.
Note that OpsCenter is capable of automatically rebalancing the data across all nodes in a cluster when new nodes are added.
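As a rough sketch, the settings a new node needs live in its cassandra.yaml (all values below are hypothetical):
cluster_name: 'Test Cluster'    # must match the existing cluster's name
initial_token: 85070591730234615865843651857942052864    # the token assigned to this node
listen_address: 10.10.10.3    # the new node's own address
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "10.10.10.1"    # existing node(s) to contact when joining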

Q: How do I remove nodes from an existing cluster?
A: Nodes can be removed from a Cassandra cluster by using the nodetool utility and issuing a decommission command. This can be done without affecting the overall operations or uptime of the cluster.
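For example (the host address is hypothetical), decommissioning a live node so that its data is streamed to the remaining replicas:
nodetool -h 10.10.10.2 decommission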

Q: What happens when a node fails in Cassandra?
A: Cassandra uses gossip state information to locally determine if another node in the system is up or down. This failure detection information is used by Cassandra to avoid routing client requests to unreachable nodes whenever possible.
The gossip inter-node communication process tracks heartbeats from other nodes both directly (nodes gossiping directly to it) and indirectly (nodes heard about secondhand, thirdhand, and so on). Rather than have a fixed threshold for marking nodes without a heartbeat as down, Cassandra uses an accrual detection mechanism to calculate a per-node threshold that takes into account network conditions, workload, or other conditions that might affect the perceived heartbeat rate.
Node failures can result from various causes such as hardware failures, network outages, and so on. Node outages are often transient but can last for extended intervals. A node outage rarely signifies a permanent departure from the cluster, and therefore does not automatically result in permanent removal of the failed node from the cluster. Other nodes will still try to periodically initiate gossip contact with failed nodes to see if they are back up.
When a node comes back online after an outage, it may have missed writes for the replica data it maintains. Writes missed due to short, transient outages are saved for a period of time on other replicas and replayed on the failed host once it recovers, using Cassandra's built-in hinted handoff feature. If a node is down for an extended period, an administrator can run the nodetool repair utility after the node is back online to 'catch it up' with its corresponding replicas.
To permanently change a node's membership in a cluster, administrators must explicitly remove a node from a Cassandra cluster using the nodetool management utility.
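A hedged sketch of the corresponding nodetool commands (the host address and token value are hypothetical):
# catch up a node that was down longer than the hinted handoff window
nodetool -h 10.10.10.2 repair
# permanently remove a dead node's token range from the cluster
nodetool -h 10.10.10.1 removetoken 85070591730234615865843651857942052864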

Q: How can I use the same Cassandra cluster across multiple datacenters?
A: Cassandra can easily replicate data between different physical datacenters by creating a keyspace that uses the replication strategy currently termed NetworkTopologyStrategy. This strategy allows you to configure Cassandra to automatically replicate data to different data centers and even different racks within datacenters to protect against specific rack/physical hardware failures causing a cluster to go down. It can also replicate data between public Clouds and on-premises machines.
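As an illustrative sketch (the keyspace and data center names are hypothetical), such a keyspace might be created from cassandra-cli on a 1.x cluster as follows, keeping three replicas in DC1 and two in DC2. The data center names must match those reported by the cluster's snitch:
create keyspace MultiDC
  with placement_strategy = 'org.apache.cassandra.locator.NetworkTopologyStrategy'
  and strategy_options = {DC1:3, DC2:2};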

Q: What configuration files does Cassandra use?
A: The main Cassandra configuration file is the cassandra.yaml file, which houses all the main options that control how Cassandra operates.
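A few representative options from a 1.x cassandra.yaml, shown as a sketch (the addresses are hypothetical; the directories are common defaults):
cluster_name: 'Test Cluster'
listen_address: 10.10.10.1    # inter-node communication address
rpc_address: 10.10.10.1    # client (Thrift) connection address
data_file_directories:
    - /var/lib/cassandra/data
commitlog_directory: /var/lib/cassandra/commitlog
saved_caches_directory: /var/lib/cassandra/saved_caches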

Q: How can I use Cassandra in the Cloud?
A: Cassandra's architecture makes it perfect for full Cloud deployments as well as hybrid implementations that store some data in the Cloud and other data on premises.
DataStax provides an Amazon AMI that allows you to quickly deploy a Cassandra cluster on EC2. See the online documentation for a step-by-step guide to installing a Cassandra cluster on Amazon.

Q: Do I need to use a caching layer (like memcached) with Cassandra?
A: Cassandra negates the need for extra software caching layers like memcached through its distributed architecture, fast write throughput capabilities, and internal memory caching structures.

Q: Why is Cassandra so fast for write activity/data loads?
A: Cassandra has been architected for consuming large amounts of data as fast as possible. To accomplish this, Cassandra first writes new data to a commit log to ensure it is safe. After that, the data is then written to an in-memory structure called a memtable. Cassandra deems the write successful once it is stored on both the commit log and a memtable, which provides the durability required for mission-critical systems.
Once a memtable's memory limit is reached, all writes are then written to disk in the form of an SSTable (sorted strings table). An SSTable is immutable, meaning it is never written to again. If the data contained in the SSTable is modified, the new data is written to Cassandra in an upsert fashion and the out-of-date data is removed automatically during compaction.
Because SSTables are immutable and only written once the corresponding memtable is full, Cassandra avoids random seeks and instead only performs sequential IO in large batches, resulting in high write throughput.
A related factor is that Cassandra doesn't have to do a read as part of a write (i.e. check index to see where current data is). This means that insert performance remains high as data size grows, while with b-tree based engines (e.g. MongoDB) it deteriorates.

Q: How does Cassandra communicate across nodes in a cluster?
A: Cassandra is architected in a peer-to-peer fashion and uses a protocol called gossip to communicate with other nodes in a cluster. The gossip process runs every second to exchange information across the cluster.
Gossip only includes information about the cluster itself (up/down, joining, leaving, version, schema, etc.) and does not manage the data. Data is transferred node-to-node using a message-passing-like protocol on a distinct port from the one client applications connect to. The Cassandra partitioner turns a column family key into a token, the replication strategy picks the set of nodes responsible for that token (using information from the snitch), and Cassandra sends messages to those replicas with the request (read or write).

Q: How does Cassandra detect that a node is down?
A: The gossip protocol is used to determine the state of all nodes in a cluster and if a particular node has gone down.
The gossip process tracks heartbeats from other nodes and uses an accrual detection mechanism to calculate a per-node threshold that takes into account network conditions, workload, or other conditions that might affect perceived heartbeat rate before a node is actually marked as down.
The configuration parameter phi_convict_threshold in the cassandra.yaml file is used to control Cassandra's sensitivity of node failure detection. The default value is appropriate for most situations. However, in Cloud environments such as Amazon EC2, the value should be increased to 12 in order to account for network issues that sometimes occur on such platforms.

Q: Does Cassandra compress data on disk?
A: Yes, data compression is available with Cassandra 1.0 and above. The snappy compression algorithm from Google is used and is able to deliver fairly impressive storage savings, in some cases compressing raw data up to 80+% with no performance penalties for read/write operations. In fact, because of the reduction in physical I/O, compression actually increases performance in some use cases. Compression is enabled/disabled on a per-column family basis and is not enabled by default.
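As a hedged example (the column family name and chunk size are hypothetical), compression can be enabled on an existing column family from cassandra-cli on a 1.0+ cluster roughly like this:
update column family users
  with compression_options = {sstable_compression:SnappyCompressor, chunk_length_kb:64};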

Q: How to backup data in Cassandra?
A: Currently, the most common method for backing up data in Cassandra is using the snapshot function in the nodetool utility. This is an online operation and does not require any downtime or block any operations on the server.
Snapshots are sent by default to a snapshots directory that is located in the Cassandra data directory (controlled via the data_file_directories in the cassandra.yaml file). Once taken, snapshots can be moved off-site to be protected.
Incremental backups (i.e. data backed up since the last full snapshot) can be performed by setting the incremental_backups parameter in the cassandra.yaml file to 'true'. When incremental backup is enabled, Cassandra copies every flushed SSTable for each keyspace to a backup directory located under the Cassandra data directory. Restoring from an incremental backup involves first restoring from the last full snapshot and then copying each incremental file back into the Cassandra data directory. For example:
Create a Cassandra snapshot for a single node
nodetool -h 10.10.10.1 snapshot KEYSPACE_NAME
Create a cluster wide Cassandra snapshot
clustertool -h 10.10.10.1 global_snapshot KEYSPACE_NAME
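Enabling incremental backups, as described above, is a one-line change in each node's cassandra.yaml:
incremental_backups: true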

Q: How to restore data in Cassandra?
A: In general, restoring a Cassandra node is done by following these procedures:
step 1:
Shut down the node that is to be restored
step 2:
Clear the commit log by removing all the files in the commit log directory
e.g.
rm /var/lib/cassandra/commitlog/*
step 3:
Remove the database files for all keyspaces
e.g.
rm /var/lib/cassandra/data/keyspace1/*.db
Take care not to remove the snapshot directory for the keyspace
step 4:
Copy the latest snapshot directory contents for each keyspace to the keyspace's data directory
e.g.
cp -p /var/lib/cassandra/data/keyspace1/snapshots/56046198758643-snapshotkeyspace1/* /var/lib/cassandra/data/keyspace1
step 5:
Copy any incremental backups taken for each keyspace into the keyspace's data directory
step 6:
Repeat steps 3-5 for each keyspace
step 7:
Restart the node

Q: How do I uninstall Cassandra?
A: Currently, no uninstaller exists for Cassandra. Therefore, removing Cassandra from a machine consists of the manual deletion of the Cassandra software, data, and log files.

Q: Is my data safe in Cassandra?
A: Yes. First, data durability is fully supported in Cassandra so that any data written to a database cluster is first written to a commit log in the same fashion as nearly every popular RDBMS does.
Second, Cassandra offers tunable data consistency so that a developer or administrator can choose how strong they wish consistency across nodes to be. The strongest form of consistency is to mandate that any data modifications be made to all nodes, with any unsuccessful attempt on a node resulting in a failed data operation. When configured this way, Cassandra provides consistency in the CAP sense: all readers will see the same values.
Other forms of tunable consistency involve having a quorum of nodes written to or just one node for the loosest form of consistency. Cassandra is very flexible and allows data consistency to be chosen on a per operation basis if needed so that very strong consistency can be used when desired, or very loose consistency can be utilized when the use case permits.

Q: What options do I have to make sure my data is consistent across nodes?
A: In Cassandra, consistency refers to how up-to-date and synchronized a row of data is on all of its replicas. Cassandra offers a number of built-in features to ensure data consistency:

Hinted Handoff Writes – Writes are always sent to all replicas for the specified row regardless of the consistency level specified by the client. If a node happens to be down at the time of a write, its corresponding replicas will save hints about the missed writes, and then hand off the affected rows once the node comes back online again. Hinted handoff ensures data consistency despite short, transient node outages.
Read Repair – Read operations trigger consistency across all replicas for a requested row using a process called read repair. For reads, there are two types of read requests that a coordinator node can send to a replica: a direct read request and a background read repair request. The number of replicas contacted by a direct read request is determined by the read consistency level specified by the client. Background read repair requests are sent to any additional replicas that did not receive a direct request. Read repair requests ensure that the requested row is made consistent on all replicas.

Anti-Entropy Node Repair – For data that is not read frequently, or to update data on a node that has been down for an extended period, the node repair process (also referred to as anti-entropy repair) ensures that all data on a replica is made consistent. Node repair (using the nodetool utility) should be run routinely as part of regular cluster maintenance operations.

Q: What is 'tunable consistency' in Cassandra?
A: Cassandra extends the concept of 'eventual consistency' by offering 'tunable consistency'. For any given read or write operation, the client application decides how consistent the requested data should be.
Consistency levels in Cassandra can be set on any read or write query. This allows application developers to tune consistency on a per-query basis depending on their requirements for response time versus data accuracy. Cassandra offers a number of consistency levels for both reads and writes.
Choosing a consistency level for reads and writes involves determining your requirements for consistent results (always reading the most recently written data) versus read or write latency (the time it takes for the requested data to be returned or for the write to succeed).
If latency is a top priority, consider a consistency level of ONE (only one replica node must successfully respond to the read or write request). There is a higher probability of stale data being read with this consistency level (as the replicas contacted for reads may not always have the most recent write). For some applications, this may be an acceptable trade-off.
If consistency is top priority, you can ensure that a read will always reflect the most recent write by using the following formula:
(nodes_written + nodes_read) > replication_factor
For example, if your application is using the QUORUM consistency level for both write and read operations and you are using a replication factor of 3, then this ensures that 2 nodes are always written and 2 nodes are always read. The combination of nodes written and read (4) being greater than the replication factor (3) ensures strong read consistency.
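As an illustrative sketch (the column family, row key, and column names are hypothetical), CQL allows the consistency level to be stated per statement with USING CONSISTENCY:
INSERT INTO users (KEY, name) VALUES ('jsmith', 'John Smith') USING CONSISTENCY QUORUM;
SELECT name FROM users USING CONSISTENCY QUORUM WHERE KEY = 'jsmith';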

Q: How to load data into Cassandra?
A: With respect to loading external data, Cassandra supplies a load utility called sstableloader. The sstableloader utility is able to load flat files into Cassandra; however, the files must first be converted into SSTable format.
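A rough usage sketch (the path is hypothetical): sstableloader is pointed at a directory of SSTables named after the target keyspace and streams that data to the relevant nodes in the cluster:
bin/sstableloader /path/to/Keyspace1/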

Q: How can I move data from another database to Cassandra?
A: Most RDBMSs have an unload utility that allows data to be unloaded to flat files. Once in flat file format, the sstableloader utility can be used to load the data into Cassandra column families.
Some developers write programs to connect to both an RDBMS and Cassandra and move data in that way.

Q: What is read repair in Cassandra?
A: Read operations trigger consistency checks across all replicas for a requested row using a process called read repair. For reads, there are two types of read requests that a coordinator node can send to a replica: a direct read request and a background read repair request. The number of replicas contacted by a direct read request is determined by the read consistency level specified by the client. Background read repair requests are sent to any additional replicas that did not receive a direct request. Read repair requests ensure that the requested row is made consistent on all replicas. Read repair is an optional feature and can be configured per column family.

Q: What client libraries/drivers can I use with Cassandra?
A: There are a number of CQL (Cassandra Query Language) drivers and native client libraries available for almost all popular development languages (e.g. Java, Ruby, etc.). All drivers and client libraries can be downloaded from: http://www.datastax.com/download/clientdrivers.

Q: What type of data model does Cassandra use?
A: The Cassandra data model is a schema-optional, column-oriented data model. This means that, unlike a relational database, you do not need to model all of the columns required by your application up front, as each row is not required to have the same set of columns. Columns and their metadata can be added by your application as they are needed without incurring downtime to your application.
Although it is natural to want to compare the Cassandra data model to a relational database, they are really quite different. In a relational database, data is stored in tables and the tables comprising an application are typically related to each other. Data is usually normalized to reduce redundant entries, and tables are joined on common keys to satisfy a given query.
In Cassandra, the keyspace is the container for your application data, similar to a database or schema in a relational database. Inside the keyspace are one or more column family objects, which are analogous to tables. Column families contain columns, and a set of related columns is identified by an application-supplied row key. Each row in a column family is not required to have the same set of columns.
Cassandra does not enforce relationships between column families the way that relational databases do between tables: there are no formal foreign keys in Cassandra, and joining column families at query time is not supported. Each column family has a self-contained set of columns that are intended to be accessed together to satisfy specific queries from your application.
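To make the keyspace/column family/column hierarchy concrete, here is a small hypothetical sketch using cassandra-cli (all names and values are illustrative only):
create keyspace Demo
  with placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy'
  and strategy_options = {replication_factor:1};
use Demo;
create column family Users
  with comparator = UTF8Type
  and key_validation_class = UTF8Type
  and default_validation_class = UTF8Type;
set Users['jsmith']['first_name'] = 'John';
set Users['jsmith']['email'] = 'jsmith@example.com';
set Users['rjones']['first_name'] = 'Rachel';
Note that the two rows carry different sets of columns, which is perfectly valid in Cassandra.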

Q: What datatypes does Cassandra support?
A: In a relational database, you must specify a data type for each column when you define a table. The data type constrains the values that can be inserted into that column. For example, if you have a column defined as an integer datatype, you would not be allowed to insert character data into that column.
In Cassandra, you can specify a data type for both the column name (called a comparator) as well as for row key and column values (called a validator).
Column and row key data in Cassandra is always stored internally as hex byte arrays, but the comparators/validators are used to verify data on insert and translate data on retrieval. In the case of comparators (column names), the comparator also determines the sort order in which columns are stored.
Cassandra comes with the following comparators and validators:
BytesType – Bytes (no validation)
AsciiType – US-ASCII bytes
UTF8Type – UTF-8 encoded strings
LongType – 64-bit longs
LexicalUUIDType – 128-bit UUID, compared by byte value
TimeUUIDType – Version 1 128-bit UUID, compared by timestamp
CounterColumnType* – 64-bit signed integer
 * can only be used as a column validator, not valid as a row key validator or column name comparator
A simple example might be:
CREATE COLUMNFAMILY Standard1 WITH comparator_type = 'UTF8Type';

Q: What is a keyspace in Cassandra?
A: In Cassandra, the keyspace is the container for your application data, similar to a schema in a relational database. Keyspaces are used to group column families together. Typically, a cluster has one keyspace per application.
Replication is controlled on a per-keyspace basis, so data that has different replication requirements should reside in different keyspaces. Keyspaces are not designed to be used as a significant map layer within the data model, only as a way to control data replication for a set of column families.

Q: What is a column family in Cassandra?
A: When comparing Cassandra to a relational database, the column family is similar to a table in that it is a container for columns and rows. However, a column family requires a major shift in thinking for those coming from the relational world.
In a relational database, you define tables, which have defined columns. The table defines the column names and their data types, and the client application then supplies rows conforming to that schema: each row contains the same fixed set of columns.
In Cassandra, you define column families. Column families can (and should) define metadata about the columns, but the actual columns that make up a row are determined by the client application. Each row can have a different set of columns.

Q: What is a supercolumn in Cassandra?
A: A Cassandra column family can contain regular columns (key/value pairs) or super columns. Super columns add another level of nesting to the regular column family column structure. A super column consists of a (super) column name and an ordered map of sub-columns, and is a way to group multiple columns based on a common lookup value.
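A hypothetical cassandra-cli sketch of a super column family (the names and values are illustrative only):
create column family Addresses
  with column_type = 'Super'
  and comparator = UTF8Type
  and subcomparator = UTF8Type
  and key_validation_class = UTF8Type
  and default_validation_class = UTF8Type;
set Addresses['user1']['home']['city'] = 'Austin';
set Addresses['user1']['home']['zip'] = '78701';
set Addresses['user1']['work']['city'] = 'Dallas';
Here 'home' and 'work' are the super columns, each grouping its own map of sub-columns under the row key 'user1'.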

Q: When should I use a supercolumn in Cassandra?
A: The primary use case for super columns is to denormalize multiple rows from other column families into a single row, allowing for materialized view data retrieval.
Super columns should not be used when the number of sub-columns is expected to be a large number. During reads, all sub-columns of a super column must be deserialized to read a single sub-column, so performance of super columns is not optimal if there are a large number of sub-columns. Also, you cannot create a secondary index on a sub-column of a super column.

Q: Does Cassandra support transactions?
A: Yes and No, depending on what you mean by “transactions”. Unlike relational databases, Cassandra does not offer fully ACID-compliant transactions. There is no locking or transactional dependencies when concurrently updating multiple rows or column families. But if by “transactions” you mean real-time data entry and retrieval, with durability and tunable consistency, then yes.
Cassandra does not support transactions in the sense of bundling multiple row updates into one all-or-nothing operation. Nor does it roll back when a write succeeds on one replica, but fails on other replicas. It is possible in Cassandra to have a write operation report a failure to the client, but still actually persist the write to a replica.
However, this does not mean that Cassandra cannot be used as an operational or real time data store. Data is very safe in Cassandra because writes in Cassandra are durable. All writes to a replica node are recorded both in memory and in a commit log before they are acknowledged as a success. If a crash or server failure occurs before the memory tables are flushed to disk, the commit log is replayed on restart to recover any lost writes.

Q: What is the CQL language?
A: Cassandra 0.8 is the first release to introduce Cassandra Query Language (CQL), the first standardized query language for Apache Cassandra. CQL pushes all of the implementation details to the server in the form of a CQL parser. Clients built on CQL only need to know how to interpret query result objects. CQL is the start of the first officially supported client API for Apache Cassandra. CQL drivers for the various languages are hosted within the Apache Cassandra project.
CQL syntax is based on SQL (Structured Query Language), the standard for relational database manipulation. Although CQL has many similarities to SQL, it does not change the underlying Cassandra data model. There is no support for JOINs, for example.
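A small hypothetical CQL session (the column family, columns, and values are illustrative only) showing the SQL-like flavor:
CREATE COLUMNFAMILY users (KEY varchar PRIMARY KEY, name varchar, age int);
INSERT INTO users (KEY, name, age) VALUES ('jsmith', 'John Smith', 42);
SELECT name, age FROM users WHERE KEY = 'jsmith';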

Q: What is a compaction in Cassandra?
A: Cassandra is optimized for write throughput. Cassandra writes are first written to a commit log (for durability), and then to an in-memory table structure called a memtable. Writes are batched in memory and periodically written to disk to a persistent table structure called an SSTable (Sorted String table). The “Sorted” part means SSTables are sorted by row token (as determined by the partitioner), which is what makes merges for compaction efficient (don't have to read entire SSTables into memory). Row contents are also sorted by column comparator, so Cassandra can support larger-than-memory rows too.
SSTables are immutable (they are not written to again after they have been flushed). This means that a row is typically stored across multiple SSTable files.

In the background, Cassandra periodically merges SSTables together into larger SSTables using a process called compaction. Compaction merges row fragments together, removes expired tombstones (deleted columns), and rebuilds primary and secondary indexes. Since the SSTable files are sorted by row key, this merge is efficient (no random disk I/O). Once a newly merged SSTable is complete, the smaller input SSTables are marked as obsolete and eventually deleted by the Java Virtual Machine (JVM) garbage collection (GC) process. However, during compaction, there is a temporary spike in disk space usage and disk I/O on the node.

Q: What platforms does Cassandra run on?
A: Cassandra is a Java application, meaning that a compiled binary distribution of Cassandra can run on any platform that has a Java Runtime Environment (JRE), also referred to as a Java Virtual Machine (JVM). DataStax strongly recommends using the Oracle Sun Java Runtime Environment (JRE), version 1.6.0_19 or later, for optimal performance.
Packaged releases are provided for RedHat, CentOS, Debian and Ubuntu Linux platforms.

Q: What management tools exist for Cassandra?
A: DataStax supplies both a free and commercial version of OpsCenter, which is a visual, browser-based management tool for Cassandra. With OpsCenter, a user can visually carry out many administrative tasks, monitor a cluster for performance, and do much more. Downloads of OpsCenter are available on the DataStax Web site.
A number of command line tools also ship with Cassandra for querying/writing to the database, performing administration functions, etc.
Cassandra also exposes a number of statistics and management operations via Java Management Extensions (JMX). JMX is a Java technology that supplies tools for managing and monitoring Java applications and services. Any statistic or operation that a Java application has exposed as an MBean can then be monitored or manipulated using JMX.
During normal operation, Cassandra outputs information and statistics that you can monitor using JMX-compliant tools such as JConsole, the Cassandra nodetool utility, or the DataStax OpsCenter centralized management console. With the same tools, you can perform certain administrative commands and operations such as flushing caches or doing a repair.

Q: Why can't I make Cassandra listen on 0.0.0.0 (all my addresses)?
A: Cassandra is a gossip-based distributed system. ListenAddress is also the 'contact me here' address, i.e., the address it tells other nodes to reach it at. Telling other nodes 'contact me on any of my addresses' is a bad idea; if different nodes in the cluster pick different addresses for you, Bad Things happen.
If you don't want to manually specify an IP as the ListenAddress for each node in your cluster (understandable!), leave it blank and Cassandra will use InetAddress.getLocalHost() to pick an address. Then it's up to you or your ops team to make things resolve correctly (/etc/hosts, DNS, etc.).
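In other words, each node's cassandra.yaml should either name one concrete address or leave the field blank (the address below is hypothetical):
listen_address: 10.10.10.1    # one specific address per node, never 0.0.0.0
# listen_address:             # or leave blank to fall back to InetAddress.getLocalHost()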
One exception to this process is JMX, which by default binds to 0.0.0.0 (Java bug 6425769).

Q: What ports does Cassandra use?
A: By default, Cassandra uses 7000 for cluster communication, 9160 for clients (Thrift), and 8080 for JMX. These are all editable in the configuration file or bin/cassandra.in.sh (for JVM options). All ports are TCP.
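As a sketch, the corresponding cluster and client port settings in a 1.x cassandra.yaml look like this (JMX is configured in the startup/environment script instead); the values shown are the defaults mentioned above:
storage_port: 7000    # inter-node cluster communication
rpc_port: 9160        # Thrift client connections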

Q: Why does Cassandra slow down after doing a lot of inserts?
A: This is a symptom of memory pressure, resulting in a storm of GC operations as the JVM frantically tries to free enough heap to continue to operate. Eventually, the server will crash from OutOfMemory; usually, but not always, it will be able to log this final error before the JVM terminates.
You can increase the amount of memory the JVM uses, or decrease the insert threshold before Cassandra flushes its memtables. See MemtableThresholds for details.
Setting your cache sizes too large can result in memory pressure.

Q: What happens to existing data in my cluster when I add new nodes?
A: Starting a new node with the -b [bootstrap] option will cause it to contact other nodes in the cluster to copy the right data to itself.
In Cassandra 0.5 and above, there is an 'AutoBootStrap' option in the config file. When enabled, using the '-b' option is unnecessary, because new nodes will automatically bootstrap themselves when they start up for the first time. Even with AutoBootstrap, it is recommended that you always specify the InitialToken, because automatic token selection will almost certainly result in an unbalanced ring. If you are building the initial cluster you certainly don't want to leave InitialToken blank. Pick the tokens such that the ring will be balanced afterward and explicitly set them on each node. See token selection in the operations wiki.
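For the RandomPartitioner, balanced initial tokens are typically computed as i * (2**127 / N) for node i of N nodes; assuming Python is available, a throwaway one-liner such as the following prints them for a four-node ring:
python -c "num=4; print('\n'.join(str(i*(2**127//num)) for i in range(num)))"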

Unless you know precisely what you're doing and are aware of how the Cassandra internals work, you should never introduce a new empty node to your cluster with autobootstrap disabled. In version 0.7 under write load, doing so will cause writes to be sent to the new node before the schema arrives from another member of the cluster. It would also indicate to clients that the new node is responsible for servicing reads for data that it definitely doesn't have.
In Cassandra 0.4 and below, it is recommended that you manually specify a value for 'InitialToken' in the config file of a new node.

Q: Can I add/remove/rename Column Families on a working cluster?
A: Yes, but it's important that you do it correctly.
1. Empty the commitlog with 'nodetool drain.'
2. Shut down Cassandra and verify that there is no remaining data in the commitlog.
3. Delete the sstable files (-Data.db, -Index.db, and -Filter.db) for any CFs removed, and rename the files for any CFs that were renamed.
4. Make the necessary changes to your storage-conf.xml.
5. Start Cassandra back up; your edits should take effect.

Q: Does it matter which node a Thrift or higher-level client connects to?
A: No, any node in the cluster will work; Cassandra nodes proxy your request as needed. This leaves the client with a number of options for end point selection:
You can maintain a list of contact nodes (all or a subset of the nodes in the cluster), and configure your clients to choose among them.
Use round-robin DNS and create a record that points to a set of contact nodes (recommended).
Use the get_string_property('token map') RPC to obtain an up-to-date list of the nodes in the cluster and cycle through them.
Deploy a load balancer, proxy, etc.
When using a higher-level client you should investigate which, if any, of these options it implements to help you distribute your requests across nodes in the cluster; a toy round-robin selector is sketched below.
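As a rough illustration of the first option, here is a minimal round-robin selector over a fixed contact list (the class is hypothetical and purely illustrative; most higher-level clients ship something equivalent):

    import java.util.List;
    import java.util.concurrent.atomic.AtomicInteger;

    public class ContactNodes {
        private final List<String> hosts;
        private final AtomicInteger next = new AtomicInteger(0);

        public ContactNodes(List<String> hosts) {
            this.hosts = hosts;
        }

        // Hands out contact nodes in round-robin order; any node will
        // proxy the request to the right replicas as needed.
        public String nextHost() {
            int i = (next.getAndIncrement() & Integer.MAX_VALUE) % hosts.size();
            return hosts.get(i);
        }
    }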

Q: What kind of hardware should I run Cassandra on?
A: Memory
The most recently written data resides in memory tables (aka memtables), but older data that has been flushed to disk can be kept in the OS's file-system cache. In other words, the more memory, the better, with 4GB being the minimum we typically recommend in a virtualized environment (e.g., EC2 Large instances). Obviously there is no benefit to having more RAM than your hot data set, but with dedicated hardware there is no reason to use less than 8GB or 16GB, and you often see clusters with 32GB or more per node. RAM is also useful for the key cache (introduced in 0.5) and row cache (introduced in 0.6).
CPU
Many workloads will actually be CPU-bound in Cassandra before being memory-bound. Cassandra is highly concurrent and will make good use of however many cores you can give it. For raw hardware, 8-core boxes are the current price/performance sweet spot. If you're running on virtualized machines, consider using a provider such as Rackspace Cloud Servers that allows CPU bursting.
Disk
The short answer here is that ideally you will have at least 2 disks, one to keep your CommitLogDirectory on, the other to use in DataFileDirectories. The exact answer though depends a lot on your usage so it's important to understand what is going on here.
Cassandra persists data to disk for two very different purposes. The first is to the commitlog when a new write is made so that it can be replayed after a crash or system shutdown. The second is to the data directory when thresholds are exceeded and memtables are flushed to disk as SSTables.
Commit logs receive every write made to a Cassandra node and have the potential to block client operations, but they are only ever read on node start-up. SSTable (data file) writes on the other hand occur asynchronously, but are read to satisfy client look-ups. SSTables are also periodically merged and rewritten in a process called compaction. Another important difference between commitlog and sstables is that commit logs are purged after the corresponding data has been flushed to disk as an SSTable, so CommitLogDirectory only holds uncommitted data while the directories in DataFileDirectories store all of the data written to a node.
So to summarize, if you use a different device for your CommitLogDirectory it needn't be large, but it should be fast enough to receive all of your writes (as appends, i.e., sequential i/o). Then, use one or more devices for DataFileDirectories and make sure they are both large enough to house all of your data, and fast enough to both satisfy reads that are not cached in memory and to keep up with flushing and compaction.
As covered in MemtableSSTable, compactions can in the worst case temporarily require free space on a single volume (that is, in a data file directory) equal to 100% of the space in use. So if you are going to be approaching 50% or more of your disks' capacity, you should raid0 your data directory volumes. B. Todd Burruss adds on the mailing list, 'With the file sizes we're talking about with cassandra and other database products, the [raid] stripe size doesn't seem to matter. Mine is set to 128k, which produced the same results as 16k and 256k.' In addition to giving you capacity for compactions, raid0 will help smooth out I/O hotspots within a single sstable.
On ext2/ext3 the maximum file size is 2TB, even on a 64-bit kernel. On ext4 that goes up to 16TB. Since Cassandra can use almost half your disk space on a single file, if you are raiding large disks together you may want to use XFS instead, particularly if you are using a 32-bit kernel. XFS file size limits are 16TB max on a 32-bit kernel, and basically unlimited on 64-bit.
Cloud
Several heavy users of Cassandra deploy in the cloud, e.g. CloudKick on Rackspace Cloud Servers and SimpleGeo on Amazon EC2.
On EC2, the best practice is to use L or XL instances with local storage. I/O performance is proportionately much worse on S and M sizes, and EBS is a bad fit for several reasons (see Erik Onnen's excellent explanation). Put the Cassandra commitlog on the root volume, and the data directory on the raid0'd ephemeral disks.

Q: What are SSTables and Memtables?
A: Cassandra writes are first written to the CommitLog, and then to a per-ColumnFamily structure called a Memtable. When a Memtable is full, it is written to disk as an SSTable.
A Memtable is basically a write-back cache of data rows that can be looked up by key -- that is, unlike a write-through cache, writes are batched up in the Memtable until it is full, when it is flushed.
Flushing
The process of turning a Memtable into a SSTable is called flushing. You can manually trigger flush via jmx (e.g. with bin/nodetool), which you may want to do before restarting nodes since it will reduce CommitLog replay time. Memtables are sorted by key and then written out sequentially. Thus, writes are extremely fast, costing only a commitlog append and an amortized sequential write for the flush!
Once flushed, SSTable files are immutable; no further writes may be done. So, on the read path, the server must (potentially, although it uses tricks like bloom filters to avoid doing so unnecessarily) combine row fragments from all the SSTables on disk, as well as any unflushed Memtables, to produce the requested data.
Compaction
To bound the number of SSTable files that must be consulted on reads, and to reclaim space taken by unused data, Cassandra performs compactions: merging multiple old SSTable files into a single new one. Compactions are triggered when at least N SSTables of similar size have been flushed to disk, where N is tunable and defaults to 4: those N similar-sized SSTables are merged into a single one. The merged SSTables start out the same size as your memtable flush size and then form a hierarchy of tiers, each tier roughly N times the size of the previous one: up to N SSTables of the memtable flush size, then up to N of roughly N times that size, and so on.
'Minor' compactions merge only sstables of similar size; 'major' compactions merge all sstables in a given ColumnFamily. Prior to Cassandra 0.6.6/0.7.0, only major compactions could clean out obsolete tombstones.
Since the input SSTables are all sorted by key, merging can be done efficiently, still requiring no random i/o. Once compaction is finished, the old SSTable files may be deleted: note that in the worst case (a workload consisting of no overwrites or deletes) this will temporarily require 2x your existing on-disk space used. In today's world of multi-TB disks this is usually not a problem but it is good to keep in mind when you are setting alert thresholds.
SSTables that are obsoleted by a compaction are deleted asynchronously when the JVM performs a GC. If a GC does not occur before Cassandra is shut down, Cassandra will remove them when it restarts. You can force a GC from jconsole if necessary, but Cassandra will force one itself if it detects that it is low on space. A compaction marker is also added to obsolete sstables so they can be deleted on startup if the server does not perform a GC before being restarted.
ColumnFamilyStoreMBean exposes sstable space used as getLiveDiskSpaceUsed (only includes size of non-obsolete files) and getTotalDiskSpaceUsed (includes everything).
Memtables
Don't Touch that Dial
The settings described here should only be changed in the face of a quantifiable performance problem. They will affect the cluster quite differently for distinct use cases and workloads, and the defaults, though conservative, were well-chosen.
JVM Heap Size
By default, the Cassandra startup scripts specify a maximum JVM heap of -Xmx1G. Consider increasing this -- but gently! If Cassandra and other processes take up too much of the available RAM, they'll force out the operating system's file buffers and caches, which are as important as the internal data structures for ensuring Cassandra performance.
It's much riskier to start tuning with this too high (a difficult-to-pinpoint malaise) than too low (easy to diagnose using JMX). Even for a high-end machine with, say, 48GB of RAM, a 4GB heap size is a reasonable initial guess -- the OS is much smarter than you think. As a rough rule of thumb, Cassandra's internal data structures will require about memtable_throughput_in_mb * 3 * (number of hot CFs) + 1GB + internal caches. A back-of-the-envelope version of that calculation is sketched below.
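Here is that rule of thumb as a tiny, purely illustrative helper (the example numbers are assumptions, not recommendations):

    public class HeapEstimate {
        // memtable_throughput_in_mb * 3 * (hot column families) + 1GB + internal caches
        public static long estimateHeapMb(int memtableThroughputMb, int hotColumnFamilies, long cacheMb) {
            return (long) memtableThroughputMb * 3 * hotColumnFamilies + 1024 + cacheMb;
        }

        public static void main(String[] args) {
            // e.g. 64MB memtables, 5 hot CFs, ~200MB of key/row caches -> 2184 MB
            System.out.println(estimateHeapMb(64, 5, 200) + " MB");
        }
    }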
Also know that if you're running up against the heap limit under load, that's probably a symptom of other problems. Diagnose those first.
Virtual Memory and Swap
On a dedicated cassandra machine, the best value for your swap settings is no swap at all -- it's better to have the OS kill the java process (taking the node down but leaving your monitoring, etc. up) than to have the system go into swap death (and become entirely unreachable).
Linux users should fully understand, and then consider adjusting, the system values for swappiness, overcommit_memory, and overcommit_ratio.
Memtable Thresholds
When performing write operations, Cassandra stores values in column-family-specific, in-memory data structures called Memtables. These Memtables are flushed to disk whenever one of the configurable thresholds is exceeded. The initial settings (64MB/0.3) are purposefully conservative, and proper tuning of these thresholds is important in making the most of available system memory without bringing the node down for lack of memory.
Configuring Thresholds
Larger Memtables take memory away from caches: since Memtables are storing actual column values, they consume at least as much memory as the size of data inserted. However, there is also overhead associated with the structures used to index this data. When the number of columns and rows is high compared to the size of values, this overhead can become quite significant, possibly greater than the data itself. In other words, which threshold(s) to use, and what to set them to, is not just a function of how much memory you have, but of how many column families you have, how many columns per column family, and the size of values being stored.
Larger Memtables don't improve write performance: increasing the memtable capacity will cause less-frequent flushes but doesn't improve write performance directly; writes go directly to memory regardless. (If your commitlog and sstables share a volume they might contend, so if at all possible, put them on separate volumes.)
Larger Memtables do absorb more overwrites: if your write load sees some rows written more often than others (e.g., upvotes of a front-page story), a larger memtable will absorb those overwrites, creating more efficient sstables and thus better read performance. If your write load is batch oriented, or if you have a massive row set, rows are not likely to be rewritten for a long time, and this benefit will pay a smaller dividend.
Larger Memtables do lead to more effective compaction: since compaction is tiered, large sstables are preferable; turning over tons of tiny memtables is bad. Again, this impacts read performance (by improving the overall I/O-contention weather), but not writes.
Listed below are the thresholds found in storage-conf.xml (or cassandra.yaml in 0.7+), along with a description of each.
MemtableThroughputInMB
As the name indicates, this sets the max size in megabytes that the Memtable will store before triggering a threshold violation and causing it to be flushed to disk. It corresponds to the size of the values inserted, (plus the size of the containing column).
If left unconfigured (missing from the config), this defaults to 128MB.
Note: This was referred to as MemtableSizeInMB in versions of Cassandra before 0.6.0. In version 0.7b2+, the value is applied on a per-column-family basis.
MemtableOperationsInMillions
This directive sets a threshold on the number of columns stored.
Left unconfigured (missing from the config), this defaults to 0.1 (i.e., 100,000 objects). The config file's initial setting of 0.3 (300,000 objects) is a conservative starting point.
Note: This was referred to as MemtableObjectCountInMillions in versions of Cassandra before 0.6.0. In version 0.7b2+, the value is applied on a per-column-family basis.
Using JConsole To Optimize Thresholds
Cassandra's column-family mbeans have a number of attributes that can prove invaluable in determining optimal thresholds. One way to access this instrumentation is by using Jconsole, a graphical monitoring and management application that ships with your JDK.
Launching JConsole with no arguments will display the 'New Connection' dialog box. If you are running JConsole on the same machine that Cassandra is running on, you can connect using the PID; otherwise you will need to connect remotely. The default startup scripts for Cassandra cause the JVM to listen on port 8080 (7199 starting in v0.8.0-beta1) using the JVM option:
    -Dcom.sun.management.jmxremote.port=8080
The remote JMX url is then:
service:jmx:rmi:///jndi/rmi://localhost:8080/jmxrmi
This is used internally by bin/nodetool (src/java/org/apache/cassandra/tools/nodetool.java).
Once connected, select the MBeans tab, expand the org.apache.cassandra.db section, and finally one of your column families.
There are three interesting attributes here.
                MemtableColumnsCount, representing the total number of column entries in this table. If you store 100 rows that each have 100 columns, expect to see this value increase by 10,000. This attribute is useful in setting the MemtableOperationsInMillions threshold.
                MemtableDataSize, which is used to determine the total size of stored data. This is the sum of all the values stored and does not account for Memtable overhead, (i.e. it's not indicative of the actual memory used by the Memtable). Use this value when adjusting MemtableThroughputInMB.
                Finally there is MemtableSwitchCount which increases by one each time a column family flushes its Memtable to disk.
Note: You'll need to manually mash the Refresh button to update these values.
It is also possible to schedule an immediate flush using the forceFlush() operation.
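If you would rather poll these attributes programmatically than click around in JConsole, a small JMX client will do. This is a minimal sketch; the keyspace and column family names are placeholders, and the ObjectName pattern is an assumption based on what the MBeans tab displays and may differ between Cassandra versions:

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class MemtableStats {
        public static void main(String[] args) throws Exception {
            // Same URL format nodetool uses; adjust host and port for your node.
            JMXServiceURL url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://localhost:8080/jmxrmi");
            JMXConnector connector = JMXConnectorFactory.connect(url);
            MBeanServerConnection mbs = connector.getMBeanServerConnection();

            // Placeholder keyspace/CF; the ObjectName layout is an assumption.
            ObjectName cf = new ObjectName(
                "org.apache.cassandra.db:type=ColumnFamilyStores,keyspace=Keyspace1,columnfamily=Standard1");

            System.out.println("MemtableColumnsCount = " + mbs.getAttribute(cf, "MemtableColumnsCount"));
            System.out.println("MemtableDataSize     = " + mbs.getAttribute(cf, "MemtableDataSize"));
            System.out.println("MemtableSwitchCount  = " + mbs.getAttribute(cf, "MemtableSwitchCount"));

            // The same connection can trigger the forceFlush() operation mentioned above.
            mbs.invoke(cf, "forceFlush", new Object[0], new String[0]);

            connector.close();
        }
    }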

Q: Why is it so hard to work with TimeUUIDType in Java?
A: TimeUUIDs are difficult to use from Java clients because java.util.UUID does not support generating version 1 (time-based) UUIDs. Here is one way to work with them and Cassandra:
Use the UUID generator from: http://johannburkard.de/software/uuid/. See Time based UUID Notes
Below are three methods that are quite useful in working with the uuids as they come in and out of Cassandra.
Generate a new UUID to use in a TimeUUIDType sorted column family.
                /**
     * Gets a new time uuid.
     *
     * @return the time uuid
     */
    public static java.util.UUID getTimeUUID() {
            return java.util.UUID.fromString(new com.eaio.uuid.UUID().toString());
    }
When you read out of Cassandra you're getting a byte[] that needs to be converted into a TimeUUID, and since java.util.UUID doesn't seem to have a simple way of doing this, pass it through the eaio uuid dealio again.
                /**
     * Returns an instance of uuid.
     *
     * @param uuid the uuid
     * @return the java.util. uuid
     */
    public static java.util.UUID toUUID( byte[] uuid ) {
    long msb = 0;
    long lsb = 0;
    assert uuid.length == 16;
    for (int i = 0; i < 8; i++)
        msb = (msb << 8) | (uuid[i] & 0xff);
    for (int i = 8; i < 16; i++)
        lsb = (lsb << 8) | (uuid[i] & 0xff);
    long mostSigBits = msb;
    long leastSigBits = lsb;

    com.eaio.uuid.UUID u = new com.eaio.uuid.UUID(msb,lsb);
    return java.util.UUID.fromString(u.toString());
    }

                /**
     * As byte array.
     *
     * @param uuid the uuid
     *
     * @return the byte[]
     */
    public static byte[] asByteArray(java.util.UUID uuid) {
        long msb = uuid.getMostSignificantBits();
        long lsb = uuid.getLeastSignificantBits();
        byte[] buffer = new byte[16];

        for (int i = 0; i < 8; i++) {
                buffer[i] = (byte) (msb >>> 8 * (7 - i));
        }
        for (int i = 8; i < 16; i++) {
                buffer[i] = (byte) (lsb >>> 8 * (7 - i));
        }

        return buffer;
    }
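Putting the three methods together, a quick round trip (assuming the helpers above are in scope) looks like:

    java.util.UUID original = getTimeUUID();   // new version 1 UUID for a TimeUUIDType column
    byte[] asBytes = asByteArray(original);    // 16 bytes, ready to send to Cassandra
    java.util.UUID restored = toUUID(asBytes); // back from the bytes read out of Cassandra
    assert original.equals(restored);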

Further, it is often useful to create a TimeUUID object from some time other than the present: for example, to use as the lower bound in a SlicePredicate to retrieve all columns whose TimeUUID comes after time X. Most libraries don't provide this functionality, probably because this breaks the 'Universal' part of UUID: this should give you pause! Never assume such a UUID is unique: use it only as a marker for a specific time.
With those disclaimers out of the way, if you feel a need to create a TimeUUID based on a specific date, here is some code that will work:
public static java.util.UUID uuidForDate(Date d) {
    /*
      Magic number obtained from #cassandra's thobbs, who
      claims to have stolen it from a Python library.
    */
    final long NUM_100NS_INTERVALS_SINCE_UUID_EPOCH = 0x01b21dd213814000L;

    long origTime = d.getTime();
    long time = origTime * 10000 + NUM_100NS_INTERVALS_SINCE_UUID_EPOCH;
    long timeLow = time & 0xffffffffL;
    long timeMid = time & 0xffff00000000L;
    long timeHi = time & 0xfff000000000000L;
    long upperLong = (timeLow << 32) | (timeMid >> 16) | (1 << 12) | (timeHi >> 48);
    return new java.util.UUID(upperLong, 0xC000000000000000L);
}
               
Q: I delete data from Cassandra, but disk usage stays the same. What gives?
A: Data you write to Cassandra gets persisted to SSTables. Since SSTables are immutable, the data can't actually be removed when you perform a delete; instead, a marker (also called a 'tombstone') is written to indicate the value's new status. Never fear, though: on the first compaction that occurs after GCGraceSeconds (hint: storage-conf.xml) has expired, the data will be expunged completely and the corresponding disk space recovered. See DistributedDeletes for more detail.

Q: Why are reads slower than writes?
A: Unlike all major relational databases and some NoSQL systems, Cassandra does not use b-trees and in-place updates on disk. Instead, it uses a sstable/memtable model like Bigtable's: writes to each ColumnFamily are grouped together in an in-memory structure before being flushed (sorted and written to disk). This means that writes cost no random I/O, compared to a b-tree system which not only has to seek to the data location to overwrite, but also may have to seek to read different levels of the index if it outgrows disk cache!
The downside is that on a read, Cassandra has to (potentially) merge row fragments from multiple sstables on disk. We think this is a tradeoff worth making, first because scaling writes has always been harder than scaling reads, and second because as your data corpus grows Cassandra's read disadvantage narrows vs b-tree systems that have to do multiple seeks against a large index. See MemtableSSTable for more details.

Q: Why does nodeprobe ring only show one entry, even though my nodes logged that they see each other joining the ring?
A: This happens when you have the same token assigned to each node. Don't do that.
Most often this bites people who deploy by installing Cassandra on a VM (especially when using the Debian package, which auto-starts Cassandra after installation, thus generating and saving a token), then cloning that VM to other nodes.
The easiest fix is to wipe the data and commitlog directories, thus making sure that each node will generate a random token on the next restart.

Q: Why do deleted keys show up during range scans?
A: Because get_range_slice says, 'apply this predicate to the range of rows given,' meaning, if the predicate result is empty, we have to include an empty result for that row key. It is perfectly valid to perform such a query returning empty column lists for some or all keys, even if no deletions have been performed.
So to special case leaving out result entries for deletions, we would have to check the entire rest of the row to make sure there is no undeleted data anywhere else either (in which case leaving the key out would be an error).
This is what we used to do with the old get_key_range method, but the performance hit turned out to be unacceptable.

Q: Can I change the ReplicationFactor on a live cluster?
A: Yes, but it will require running repair to change the replica count of existing data.
Alter the ReplicationFactor for the desired keyspace(s) using cassandra-cli.
If you're reducing the ReplicationFactor:
Run 'nodetool cleanup' on the cluster to remove surplus replicated data. Cleanup runs on a per-node basis.
If you're increasing the ReplicationFactor:
Run 'nodetool repair' to run an anti-entropy repair on the cluster. Repair runs on a per-replica-set basis. This is an intensive process that may result in adverse cluster performance. It's highly recommended to do rolling repairs, as an attempt to repair the entire cluster at once will most likely swamp it.

Q: Can I Store BLOBs in Cassandra?
A: Currently Cassandra isn't optimized specifically for large file or BLOB storage. However, files of around 64MB and smaller can easily be stored in the database without splitting them into smaller chunks. This is primarily because Cassandra's public API is based on Thrift, which offers no streaming abilities; any value written or fetched has to fit in memory. Other, non-Thrift interfaces may solve this problem in the future, but there are currently no plans to change Thrift's behavior. When planning applications that require storing BLOBs, you should also consider these attributes of Cassandra:

The main limitation on column and supercolumn size is that all the data for a single key and column must fit (on disk) on a single machine (node) in the cluster. Because keys alone are used to determine the nodes responsible for replicating their data, the amount of data associated with a single key has this upper bound. This is an inherent limitation of the distribution model.

When large columns are created and retrieved, that column's data is loaded into RAM, which can get resource intensive quickly. Consider loading 200 rows whose columns each store a 10MB image file: that small result set would consume about 2GB of RAM. Clearly, as more and more large columns are loaded, RAM is consumed quickly. This can be worked around, but it will take some upfront planning and testing to arrive at a workable solution for most applications. You can find more information regarding this behavior here: memtables, and a possible solution in 0.7 here: CASSANDRA-16.
Please refer to the notes in the Cassandra limitations section for more information: Cassandra Limitations

Q: Nodetool says 'Connection refused to host: 127.0.1.1' for any remote host. What gives?
A: Nodetool relies on JMX, which in turn relies on RMI, which in turn sets up its own listeners and connectors as needed on each end of the exchange. Normally all of this happens behind the scenes transparently, but incorrect name resolution for either the host connecting, or the one being connected to, can result in crossed wires and confusing exceptions.
If you are not using DNS, then make sure that your /etc/hosts files are accurate on both ends. If that fails try passing the -Djava.rmi.server.hostname=$IP option to the JVM at startup (where $IP is the address of the interface you can reach from the remote machine).

Q: How can I iterate over all the rows in a ColumnFamily?
A: Simple but slow: use get_range_slices, start with the empty string, and after each call use the last key read as the start key in the next iteration. A rough sketch of this loop is below.
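Here is a rough sketch of that loop against the 0.7 Thrift interface; it assumes an already-connected Cassandra.Client with set_keyspace already called, the column family name and page size are placeholders, and the field accessor names follow the generated Thrift classes and may vary slightly by version:

    import java.nio.ByteBuffer;
    import java.util.List;
    import org.apache.cassandra.thrift.*;

    public class RowIterator {
        public static void iterateAllRows(Cassandra.Client client) throws Exception {
            // Fetch up to 100 columns per row; empty start/finish means 'all columns'.
            SlicePredicate predicate = new SlicePredicate();
            predicate.setSlice_range(new SliceRange(ByteBuffer.allocate(0), ByteBuffer.allocate(0), false, 100));

            int pageSize = 100;
            ByteBuffer startKey = ByteBuffer.allocate(0);   // empty string == start of the ring
            while (true) {
                KeyRange range = new KeyRange();
                range.setStart_key(startKey);
                range.setEnd_key(ByteBuffer.allocate(0));
                range.setCount(pageSize);

                List<KeySlice> page = client.get_range_slices(
                        new ColumnParent("Standard1"), predicate, range, ConsistencyLevel.ONE);
                if (page.isEmpty())
                    break;

                for (KeySlice row : page) {
                    // Note: the first row of every page after the first repeats the
                    // previous page's last key, so skip it when processing.
                    // Process row.getKey() / row.getColumns() here.
                }

                if (page.size() < pageSize)
                    break;                                   // last (partial) page reached
                startKey = ByteBuffer.wrap(page.get(page.size() - 1).getKey());
            }
        }
    }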

Q: Why were none of the keyspaces described in storage-conf.xml loaded?
A: Prior to 0.7, cassandra loaded a set of static keyspaces defined in a storage-conf.xml file. CASSANDRA-44 added the ability to modify schema dynamically on a live cluster. Part of this change required that we ignore the schema defined in storage-conf.xml. Additionally, 0.7 converts to YAML based configuration.
If you have an existing storage-conf.xml file, you will first need to convert it to YAML using the bin/config-converter tool, which can generate a cassandra.yaml file from a storage-conf.xml file. Once you have a cassandra.yaml, it is possible to do a one-time load of the schema it defines. 0.7 adds a loadSchemaFromYAML method to StorageServiceMBean (triggered via JMX: see https://issues.apache.org/jira/browse/CASSANDRA-1001 ) which will load the schema defined in cassandra.yaml, but this is a one-time operation. A node that has had its schema defined via loadSchemaFromYAML will load its schema from the system table on subsequent restarts, which means that any further changes to the schema need to be made using the system_* thrift operations (see API).
It is recommended that you only perform schema updates on one node and let cassandra propagate changes to the rest of the cluster. If you try to perform the same updates simultaneously on multiple nodes, you run the risk of introducing inconsistent migrations, which will lead to a confused cluster.

Q: Is there a GUI admin tool for Cassandra?
A: chiton, a GTK data browser.
cassandra-gui, a Swing data browser.
Cassandra Cluster Admin, a PHP-based web UI.
Q: Insert operation throws InvalidRequestException with message 'A long is exactly 8 bytes'
A: You are probably using the LongType column sorter in your column family. LongType assumes that the numbers stored in column names are exactly 64 bits (8 bytes) long and in big-endian format. Below is example PHP code for packing an integer for storage in Cassandra and unpacking it again:

/**
 * Takes a php integer and packs it to a 64-bit (8 bytes) big-endian binary representation.
 * @param $x integer
 * @return string eight bytes long binary representation of the integer in big-endian order.
 */
public static function pack_longtype($x) {
    return pack('C8', ($x >> 56) & 0xff, ($x >> 48) & 0xff, ($x >> 40) & 0xff, ($x >> 32) & 0xff,
                      ($x >> 24) & 0xff, ($x >> 16) & 0xff, ($x >> 8) & 0xff, $x & 0xff);
}

/**
 * Takes an eight bytes long big-endian binary representation of an integer and unpacks it to a php integer.
 * @param $x string
 * @return php integer
 */
public static function unpack_longtype($x) {
    $a = unpack('C8', $x);
    return ($a[1] << 56) + ($a[2] << 48) + ($a[3] << 40) + ($a[4] << 32)
         + ($a[5] << 24) + ($a[6] << 16) + ($a[7] << 8) + $a[8];
}
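For Java clients, the JDK can produce the same 8-byte big-endian encoding directly; a minimal sketch:

    import java.nio.ByteBuffer;

    public class LongTypeCodec {
        // Packs a long into the 8-byte big-endian form that LongType expects.
        public static byte[] packLong(long x) {
            return ByteBuffer.allocate(8).putLong(x).array(); // ByteBuffer is big-endian by default
        }

        // Unpacks an 8-byte big-endian column name back into a long.
        public static long unpackLong(byte[] bytes) {
            return ByteBuffer.wrap(bytes).getLong();
        }
    }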

Q: Cassandra says 'ClusterName mismatch: oldClusterName != newClusterName' and refuses to start
A: To prevent operator errors, Cassandra stores the name of the cluster in its system table. If you need to rename a cluster for some reason, you can:
Perform these steps on each node:
                Start the cassandra-cli connected locally to this node.
                Run the following:
               
                                                use system;
                                                set LocationInfo[utf8('L')][utf8('ClusterName')]=utf8('<new cluster name>');
                                                exit;
                Run nodetool flush on this node.
Update the cluster_name in cassandra.yaml to match the name you just set in the system table.
                Restart the node.
Once all nodes have had this operation performed and have been restarted, nodetool ring should show all nodes as UP.

Q: Are batch_mutate operations atomic?
A: As a special case, mutations against a single key are atomic but not isolated: reads that occur during such a mutation may see part of the write before they see the whole thing. More generally, batch_mutate operations are not atomic. batch_mutate allows grouping operations on many keys into a single call in order to save on the cost of network round-trips; if it fails in the middle of its list of mutations, no rollback occurs and the mutations that have already been applied stay applied. The client should typically retry the batch_mutate operation.

Q: Is Hadoop (i.e. Map/Reduce, Pig, Hive) supported?
A: Cassandra 0.6+ enables certain Hadoop functionality against Cassandra's data store. Specifically, support has been added for MapReduce, Pig and Hive.
DataStax open-sourced a Cassandra-based Hadoop distribution called Brisk (Documentation) (Code). Brisk is now part of DataStax Enterprise and is no longer maintained as a standalone project.
MapReduce
Input from Cassandra
Cassandra 0.6+ adds support for retrieving data from Cassandra. This is based on implementations of InputSplit, InputFormat, and RecordReader so that Hadoop MapReduce jobs can retrieve data from Cassandra. For an example of how this works, see the contrib/word_count example in 0.6 or later. Cassandra rows or row fragments (that is, pairs of key + SortedMap of columns) are input to Map tasks for processing by your job, as specified by a SlicePredicate that describes which columns to fetch from each row.
Here's how this looks in the word_count example, which selects just one configurable columnName from each row:

    ConfigHelper.setColumnFamily(job.getConfiguration(), KEYSPACE, COLUMN_FAMILY);
    SlicePredicate predicate = new SlicePredicate().setColumn_names(Arrays.asList(columnName.getBytes()));
    ConfigHelper.setSlicePredicate(job.getConfiguration(), predicate);

As of 0.7, configuration for Hadoop no longer resides in your job's specific storage-conf.xml. See the README in the word_count and pig contrib modules for more details.
Output To Cassandra
As of 0.7, there is a basic mechanism included in Cassandra for outputting data to Cassandra. The contrib/word_count example in 0.7 contains two reducers - one for outputting data to the filesystem and one to output data to Cassandra (the default) using this new mechanism. See that example in the latest release for details.
Hadoop Streaming
Hadoop output streaming was introduced in 0.7 but was removed from 0.8 due to lack of interest and the additional complexity it added to the Hadoop integration code. To use output streaming with 0.7.x, see the contrib directory of the source download of Cassandra.
Pig
Cassandra 0.6+ also adds support for Pig with its own implementation of LoadFunc. This allows Pig queries to be run against data stored in Cassandra. For an example of this, see the contrib/pig example in 0.6 and later.
Cassandra 0.7.4+ brings additional support in the form of a StoreFunc implementation. This allows Pig queries to output data to Cassandra. It is handled by the same class as the LoadFunc: CassandraStorage. See the README in contrib/pig for more information.
When running Pig with Cassandra + Hadoop on a cluster, be sure to follow the README notes in the <cassandra_src>/contrib/pig directory, the Cluster Configuration section on this page, and some additional notes here:
                Set the HADOOP_HOME environment variable to <hadoop_dir>, e.g. /opt/hadoop or /etc/hadoop
                Set the PIG_CONF environment variable to <hadoop_dir>/conf
                Set the JAVA_HOME
Pygmalion is a project created to help with using Pig with Cassandra, especially for tabular (static column names) data.
Hive
Hive support is currently a standalone project but will become part of the main Cassandra source tree in the future. See https://github.com/riptano/hive for details.
Oozie
Oozie, the open-source workflow engine originally from Yahoo!, can be used with Cassandra/Hadoop. Cassandra configuration information needs to go into the oozie action configuration like so:

<property>
                <name>cassandra.thrift.address</name>
                <value>${cassandraHost}</value>
</property>
<property>
                <name>cassandra.thrift.port</name>
                <value>${cassandraPort}</value>
</property>
<property>
                <name>cassandra.partitioner.class</name>
                <value>org.apache.cassandra.dht.RandomPartitioner</value>
</property>
<property>
                <name>cassandra.consistencylevel.read</name>
                <value>${cassandraReadConsistencyLevel}</value>
</property>
<property>
                <name>cassandra.consistencylevel.write</name>
                <value>${cassandraWriteConsistencyLevel}</value>
</property>
<property>
                <name>cassandra.range.batch.size</name>
                <value>${cassandraRangeBatchSize}</value>
</property>

Note that with Oozie you can specify values outright, like the partitioner here, or via a variable that is typically found in the properties file. One other item of note is that Oozie assumes it can detect a file marker indicating successful completion of the job. This means that when writing to Cassandra with, for example, Pig, the Pig script will succeed but the Oozie job that called it will fail, because file markers aren't written to Cassandra. So when you write to Cassandra with Hadoop, specify this property to avoid that check. Oozie will still get completion updates from a callback from the job tracker; it just won't look for the file marker.

<property>
                <name>mapreduce.fileoutputcommitter.marksuccessfuljobs</name>
                <value>false</value>
</property>

Cluster Configuration
The simplest way to configure your cluster to run Cassandra with Hadoop is to use Brisk, the open-source packaging of Cassandra with Hadoop. That will start the JobTracker and TaskTracker processes for you. It also uses CFS, an HDFS-compatible distributed filesystem built on Cassandra that removes the need for Hadoop NameNode and DataNode processes. For details, see the Brisk documentation and code.
Otherwise, if you would like to configure a Cassandra cluster yourself so that Hadoop may operate over its data, it's best to overlay a Hadoop cluster over your Cassandra nodes. You'll want a separate server for your Hadoop NameNode/JobTracker. Then install a Hadoop TaskTracker on each of your Cassandra nodes, which will allow the JobTracker to assign tasks to the Cassandra nodes that contain data for those tasks. Also install a Hadoop DataNode on each Cassandra node; Hadoop requires a distributed filesystem in which to store dependency jars, static data, and intermediate results.
The nice thing about having a TaskTracker on every node is that you get data locality and your analytics engine scales with your data. You also never need to shuttle around your data once you've performed analytics on it - you simply output to Cassandra and you are able to access that data with high random-read performance.
A note on speculative execution: you may want to disable speculative execution for your hadoop jobs that either read or write to Cassandra. This isn't required, but may be helpful to reduce unnecessary load.
One configuration note on getting the task trackers to be able to perform queries over Cassandra: you'll want to update your HADOOP_CLASSPATH in your <hadoop>/conf/hadoop-env.sh to include the Cassandra lib libraries. For example you'll want to do something like this in the hadoop-env.sh on each of your task trackers:
export HADOOP_CLASSPATH=/opt/cassandra/lib/*:$HADOOP_CLASSPATH
Virtual Datacenter
One thing that many have asked about is whether Cassandra with Hadoop will be usable from a random access perspective. For example, you may need to use Cassandra for serving web latency requests. You may also need to run analytics over your data. In Cassandra 0.7+ there is the NetworkTopologyStrategy which allows you to customize your cluster's replication strategy by datacenter. What you can do with this is create a 'virtual datacenter' to separate nodes that serve data with high random-read performance from nodes that are meant to be used for analytics. You need to have a snitch configured with your topology and then according to the datacenters defined there (either explicitly or implicitly), you can indicate how many replicas you would like in each datacenter. You would install task trackers on nodes in your analytics section and make sure that a replica is written to that 'datacenter' in your NetworkTopologyStrategy configuration. The practical upshot of this is your analytics nodes always have current data and your high random-read performance nodes always serve data with predictable performance.
               
Troubleshooting
If you are running into timeout exceptions, you might need to tweak one or both of these settings:
cassandra.range.batch.size - the default is 4096, but you may need to lower this depending on your data. This is either specified in your hadoop configuration or using org.apache.cassandra.hadoop.ConfigHelper.setRangeBatchSize.
rpc_timeout_in_ms - this is set in your cassandra.yaml (in 0.6 it's RpcTimeoutInMillis in storage-conf.xml). The rpc timeout is not for timing out from the client but between nodes. This can be increased to reduce chances of timing out.
If you still see timeout exceptions with resultant failed jobs and/or blacklisted tasktrackers, there are settings that can give Cassandra more latitude before failing the jobs. An example of usage (in either the job configuration or tasktracker mapred-site.xml):

<property>
<name>mapred.max.tracker.failures</name>
<value>20</value>
</property>
<property>
<name>mapred.map.max.attempts</name>
<value>20</value>
</property>
<property>
<name>mapred.reduce.max.attempts</name>
<value>20</value>
</property>

The settings normally default to 4 each, but some find that too conservative. If you set them too low, you might see blacklisted tasktrackers and failed jobs because of occasional timeout exceptions. If you set them too high, jobs that would otherwise fail quickly instead take a long time to fail, sacrificing efficiency. Keep in mind that raising these limits can simply mask an underlying problem; it may be that you always want these settings higher when operating against Cassandra, but if you run into these exceptions too frequently, there may be a problem with your Cassandra or Hadoop configuration.
If you are seeing inconsistent data coming back, consider the consistency level that you are reading and writing at.
The two relevant properties are:
                cassandra.consistencylevel.read - defaults to ConsistencyLevel.ONE.
                cassandra.consistencylevel.write - defaults to ConsistencyLevel.ONE.

Also, the Hadoop integration uses range scans underneath, which do not do read repair. However, reading at ConsistencyLevel.QUORUM will reconcile differences among the nodes read. See the ReadRepair section, as well as the ConsistencyLevel section of the API page, for more details.
Sometimes configuration and integration can get tricky. To get support for this functionality, start with the contrib examples in the source download of Cassandra. Make sure you are following instructions in the README file for that example. You can search the Cassandra user mailing list or post on there as it is very active. You can also ask in the #Cassandra irc channel on freenode for help. Other channels that might be of use are #hadoop, #hadoop-pig, and #hive. Those projects' mailing lists are also very active.
There are professional support options for Cassandra that can help you get everything working together. For more information, see ThirdPartySupport. There are also professional support options specifically for Hadoop. For more information on that, see Hadoop's third party support wiki page.

Q: Can a Cassandra cluster be multi-tenant?
A: There is work being done to support more multi-tenant capabilities such as scheduling and auth. For more information, see MultiTenant.

Q: Are there any ODBC drivers for Cassandra?
A: No.

Q: On RHEL, nodes are unable to join the ring
A: Check whether SELinux is enabled; if it is, turn it off.

Q: Is there an authentication/authorization mechanism for Cassandra?
A: Yes. For details, see ExtensibleAuth.

Q: Why aren't range slices/sequential scans giving me the expected results?
A: You're probably using the RandomPartitioner. This is the default because it avoids hotspots, but it means your rows are ordered by the md5 of the row key rather than lexicographically by the raw key bytes.
You can start out with a start key and end key of [empty] and use the row count argument instead, if your goal is paging the rows. To get the next page, start from the last key you got in the previous page. This is what the Cassandra Hadoop RecordReader does, for instance.
You can also use intra-row ordering of column names to get ordered results within a row; with appropriate row 'bucketing,' you often don't need the rows themselves to be ordered.

Q: I compacted, so why did space used not decrease?
A: SSTables that are obsoleted by a compaction are deleted asynchronously when the JVM performs a GC. You can force a GC from jconsole if necessary, but Cassandra will force one itself if it detects that it is low on space. A compaction marker is also added to obsolete sstables so they can be deleted on startup if the server does not perform a GC before being restarted.

Q: Why does top report that Cassandra is using a lot more memory than the Java heap max?
A: Cassandra uses mmap to do zero-copy reads. That is, we use the operating system's virtual memory system to map the sstable data files into the Cassandra process's address space. This will 'use' virtual memory (i.e., address space) and will be reported by tools like top accordingly, but on 64-bit systems virtual address space is effectively unlimited, so you should not worry about that.
What matters from the perspective of 'memory use,' in the sense it is normally meant, is the amount of data allocated via brk() or mmap'd /dev/zero, which represents real memory used. The key issue is that for an mmap'd file, there is never a need to retain the data resident in physical memory. Thus, whatever you do keep resident in physical memory is essentially just there as a cache, in the same way as normal I/O will cause the kernel page cache to retain data that you read/write.
The difference between normal I/O and mmap() is that in the mmap() case the memory is actually mapped to the process, thus affecting the virtual size as reported by top. The main argument for using mmap() instead of standard I/O is the fact that reading entails just touching memory - in the case of the memory being resident, you just read it - you don't even take a page fault (so no overhead in entering the kernel and doing a semi-context switch).

Q: I'm getting java.io.IOException: Cannot run program 'ln' when trying to snapshot or update a keyspace?
A: Updating a keyspace first takes a snapshot. This involves creating hard links to the existing SSTables, but Java has no native way to create hard links, so it must fork 'ln'. When forking, there must be as much memory free as the parent process uses, even though the child isn't going to use it all. Because Java is a large process, this is problematic. The solution is to install Java Native Access (JNA) so Cassandra can create the hard links itself.

Q: How does Cassandra decide which nodes have what data?
A: The set of nodes (a single node, or several) responsible for any given piece of data is determined by:
                The row key (data is partitioned on row key)
                The replication factor (decides how many nodes are in the replica set for a given row)
                The replication strategy (decides which nodes are part of said replica set)

In the case of SimpleStrategy, replicas are placed on successive nodes in the ring: the first node is determined by the partitioner and the row key, and the remainder are placed on the nodes that follow it. In the case of NetworkTopologyStrategy, placement is affected by data-center and rack awareness, and will depend on how nodes in different racks or data centers are placed in the ring.
It is important to understand that Cassandra does not alter the replica set for a given row key based on changing characteristics like current load, which nodes are up or down, or which node your client happens to talk to.
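As a toy illustration of the SimpleStrategy walk described above (this is not Cassandra's actual implementation; tokens and node names are simplified):

    import java.math.BigInteger;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.TreeMap;

    public class SimpleReplicaPlacement {
        // ring maps each node's token to the node's address. The first replica is
        // the node owning the first token >= the row's token (wrapping around the
        // ring), and the remaining replicas are simply the next distinct nodes.
        public static List<String> replicasFor(BigInteger rowToken,
                                               TreeMap<BigInteger, String> ring,
                                               int replicationFactor) {
            List<String> replicas = new ArrayList<String>();
            for (String node : ring.tailMap(rowToken).values()) {
                if (replicas.size() == replicationFactor) break;
                replicas.add(node);
            }
            for (String node : ring.values()) {              // wrap around to the ring's start
                if (replicas.size() == replicationFactor) break;
                if (!replicas.contains(node)) replicas.add(node);
            }
            return replicas;
        }
    }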

Q: I have a row or key cache hit rate of 0.XX123456789 reported by JMX. Is that XX% or 0.XX% ?
A: XX%

Q: Commit Log gets very big. Cassandra does not delete 'old' commit logs. Why?
A: You probably have one or more column families with very low throughput. These will typically not be flushed by crossing the throughput or operations thresholds, causing old commit log segments to be retained until the memtable_flush_after_min threshold has been crossed. The default value for this threshold is 60 minutes and may be decreased via cassandra-cli by doing:
update column family XXX with memtable_flush_after=YY;
where YY is a number of minutes.

Q: What are seeds?
A: Seeds are used during startup to discover the cluster.
If you configure your nodes to refer to some node as a seed, the nodes in your ring tend to send gossip messages to seeds more often than to non-seeds (refer to ArchitectureGossip for details). In other words, seeds work as hubs of the gossip network, and with them each node can detect status changes of other nodes quickly.
Seeds are also referred to by new nodes on bootstrap to learn about the other nodes in the ring. When you add a new node to the ring, you need to specify at least one live seed to contact. Once a node joins the ring, it learns about the other nodes, so it doesn't need a seed on subsequent boots.
Newer versions of Cassandra persist the cluster topology, making seeds less important than they were in the 0.6.x series, where they were used on every startup.
You can make a node a seed at any time; there is nothing special about seed nodes. If you list a node in the seed list, it is a seed.
Seeds do not auto-bootstrap (i.e., if a node has itself in its seed list it will not automatically transfer data to itself). If you want a node to do that, bootstrap it first and then add it to the seeds later. If you have no data (a new install) you do not have to worry about bootstrap or autobootstrap at all.
Recommended usage of seeds:
                pick two (or more) nodes per data center as seed nodes.
                sync the seed list to all your nodes

Q: Does single seed mean single point of failure?
A: If you are using replicated CFs on the ring, a single seed doesn't mean a single point of failure: the ring can operate or boot without the seed. However, it will need more time to spread status changes of a node over the ring. It is recommended to have multiple seeds in a production system.

Q: Why can't I call JMX method X from jconsole? (e.g., getNaturalEndpoints)
A: Some JMX operations can't be called from jconsole because their buttons are inactive: jconsole doesn't support array arguments, so operations that take an array argument can't be invoked from it. You need to write a JMX client to call such operations, or use an array-capable JMX monitoring tool; a bare-bones example is sketched below.
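For example, a minimal JMX client invoking an operation that takes an array argument might look like the sketch below. The getNaturalEndpoints signature shown here is an assumption and varies across Cassandra versions, so check the MBean's operation list for your release:

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class JmxArrayInvoke {
        public static void main(String[] args) throws Exception {
            JMXConnector connector = JMXConnectorFactory.connect(
                    new JMXServiceURL("service:jmx:rmi:///jndi/rmi://localhost:8080/jmxrmi"));
            MBeanServerConnection mbs = connector.getMBeanServerConnection();

            ObjectName storageService = new ObjectName("org.apache.cassandra.db:type=StorageService");

            // Assumed operation signature: getNaturalEndpoints(String keyspace, byte[] key).
            Object endpoints = mbs.invoke(storageService,
                                          "getNaturalEndpoints",
                                          new Object[] { "Keyspace1", "somekey".getBytes("UTF-8") },
                                          new String[] { String.class.getName(), byte[].class.getName() });
            System.out.println(endpoints);

            connector.close();
        }
    }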

Q: What's the maximum key size permitted?
A: The key (and column names) must be under 64KB.
Routing is O(N) in the key size, and querying and updating are O(N log N). In practice these factors are usually dwarfed by other overhead, but some users with very large 'natural' keys use hashes of those keys instead to cut down the size; a sketch of that approach is below.
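If your natural keys are large, one common workaround is to store rows under a fixed-size hash of the key instead; a minimal sketch using only the JDK (the hash algorithm and hex encoding are just illustrative choices):

    import java.security.MessageDigest;

    public class KeyHasher {
        // Reduces an arbitrarily large 'natural' key to a 64-character hex key.
        public static String hashKey(String naturalKey) throws Exception {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(naturalKey.getBytes("UTF-8"));
            StringBuilder hex = new StringBuilder();
            for (byte b : hash) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();   // well under the 64KB limit
        }
    }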

Q: I'm using Ubuntu with JNA, and holy crap weird things keep hanging and stalling and blocking and printing scary tracebacks in dmesg?
A: We have come across several different, but similar, sets of symptoms that might match what you're seeing. They might all have the same root cause; it's not clear. One common piece is messages like this in dmesg:
INFO: task (some_taskname):(some_pid) blocked for more than 120 seconds.
'echo 0 > /proc/sys/kernel/hung_task_timeout_secs' disables this message.
It does not seem that anyone has had the time to track this down to the real root cause, but it does seem that upgrading the linux-image package and rebooting your instances fixes it. There is likely some bug in several of the kernel builds distributed by Ubuntu which is fixed in later versions. Versions of linux-image-* which are known not to have this problem include:

                linux-image-2.6.38-10-virtual (2.6.38-10.46) (Ubuntu 11.04/Natty Narwhal)
                linux-image-2.6.35-24-virtual (2.6.35-24.42) (Ubuntu 10.10/Maverick Meerkat)

Uninstalling libjna-java, or recompiling Cassandra with CLibrary.tryMlockall()'s mlockall() call commented out, also makes at least some variants of this problem go away, but that's a much less desirable fix.
If you have more information on the problem and better ways to avoid it, please do update this space.

Q: What are schema disagreement errors and how do I fix them?
A: Cassandra schema updates assume that schema changes are done one at a time. If you make multiple changes at the same time, you can cause some nodes to end up with a different schema than others. (Before 0.7.6, this could also be caused by the cluster's system clocks being substantially out of sync with each other.)
To fix schema disagreements, you need to force the disagreeing nodes to rebuild their schema. Here's how:
Open the cassandra-cli and run: 'connect localhost/9160;', then 'describe cluster;'. You'll see something like this:
[default@unknown] describe cluster;
Cluster Information:
Snitch: org.apache.cassandra.locator.SimpleSnitch
Partitioner: org.apache.cassandra.dht.RandomPartitioner
Schema versions:
75eece10-bf48-11e0-0000-4d205df954a7: [192.168.1.9, 192.168.1.25]
5a54ebd0-bd90-11e0-0000-9510c23fceff: [192.168.1.27]
Note which schema versions are in the minority and record those IPs -- in the above example, 192.168.1.27. Log in to each of those machines and stop the Cassandra service/process by running 'sudo service cassandra stop' or 'kill <pid>'. Then remove the schema* and migration* sstables inside your system keyspace directory (/var/lib/cassandra/data/system, if you're using the defaults).
After starting Cassandra again, this node will notice the missing information and pull in the correct schema from one of the other nodes.
To confirm everything is on the same schema, verify that describe cluster only returns one schema version.

Q: Why do I see '... messages dropped..' in the logs?
A: Internode messages which are received by a node but cannot be processed within rpc_timeout are dropped rather than processed, since the coordinator node will no longer be waiting for a response. If the coordinator does not receive enough responses to satisfy the consistency level before the rpc_timeout, it returns a TimedOutException to the client; if it does receive enough responses, it returns success to the client.
For MUTATION messages this means that the mutation was not applied to all replicas it was sent to. The inconsistency will be repaired by Read Repair or Anti Entropy Repair.
For READ messages this means a read request may not have completed.
Load shedding is part of the Cassandra architecture, if this is a persistent issue it is generally a sign of an overloaded node or cluster.

Q: Why does the 0.8 cli not assume keys are strings anymore?
A: Prior to 0.8, there was no type metadata available for row keys, and the cli interface treated all keys as strings. This made the cli unusable for the many applications whose keys were numeric, UUIDs, or other non-string data.
0.8 added key_validation_class to the ColumnFamily definition, similarly to the existing comparator for column names, and column_metadata validation_class for column values. This both lets clients know the expected data type, and rejects updates with non-conformant values.
To preserve application compatibility, the default key_validation_class is BytesType, i.e., 'anything goes.' The CLI expects bytes to be provided in hex.
If all your keys are of the same type, you should add that information to the column family metadata.

Q: Cassandra dies with 'java.lang.OutOfMemoryError: Map failed'
A: If Cassandra is dying specifically with the 'Map failed' message, it means the OS is denying Java the ability to lock more memory. In Linux, this typically means memlock is limited. Check /proc/<pid of cassandra>/limits to verify this and raise it (e.g., via ulimit in bash). You may also need to increase vm.max_map_count. Note that the Debian and RedHat packages handle this for you automatically.

