Cassandra is a peer-to-peer, fault-tolerant system. Data is replicated among multiple nodes across multiple data centers. Single or even multi-node failures can be recovered from surviving nodes with the data.
Restores from backups are unnecessary in the event of disk or system hardware failure, even if an entire site goes offline. As long as one node with the replicated data survives in any data center, Cassandra can recover the data without having to restore it from an external source.
However, Cassandra backups are still necessary to recover from any errors made in data edits by client applications. There is a need for a “point-in-time” recovery of Cassandra in the event of data corruption or some other catastrophic situation.
Cassandra provides backup utilities and a restore process. Constant Contact has extended these features with scripts to provide more flexibility with backups and to simplify recovery operations.
Cassandra File Structure
To understand Cassandra backups, a brief overview of Cassandra file structure is necessary.
Cassandra keeps data in SSTable files. They are stored in the keyspace directory within the data directory path specified by the <DataFileDirectory> parameter in the cassandra.yaml file. By default this directory path is /var/lib/cassandra/data/<keyspace_name>.
These are written to when Cassandra fills its memtable. When the memtable contents are written to files, the memtable is cleared and ready to process new data. Cassandra continually makes new SSTable files in keyspace directories as its memtable “bucket” is filled and emptied.
Once the memtable is written to an SSTable file, the SSTable file is said to be immutable – that is, no more writes are made to that file. However, Cassandra executes compaction, meaning multiple old SSTable files are merged into a single new one, and obsolete data (and the SSTable file(s) that contained it) is removed. So while the data is immutable, the files which contain the data are not.
A “point-in-time” recovery requires recovery of all the SSTable files in a keyspace exactly as they were in a given instant.
Native Cassandra Backup Tools
The Cassandra CLI has a snapshot utility. This flushes all in-memory writes to disk, then hard-links each current SSTable file for each keyspace in a “snapshots” subdir in the local disk keyspace area (<DataFileDirectory>/<keyspace_name>/snapshots).
It creates a unique subdirectory for each set of snapshot files, so multiple instances in time can be preserved.
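In script form, the snapshot step amounts to a single nodetool call. A minimal sketch, assuming a nodetool recent enough to accept the -t tag option; the keyspace name is a placeholder, and DRYRUN defaults to printing the commands instead of running them (unset it on a real node):

```shell
#!/bin/sh
# Sketch: scripted snapshot of one keyspace. DRYRUN defaults to 1 so the
# commands are printed rather than executed; unset it on a real Cassandra node.
DRYRUN=${DRYRUN-1}
run() { [ -n "$DRYRUN" ] && echo "$@" || "$@"; }

KEYSPACE="my_keyspace"          # placeholder keyspace name
TAG=$(date +%s)                 # a unique tag lets multiple points in time coexist

# flush memtables, then hard-link current SSTables under .../snapshots/<TAG>
run nodetool snapshot -t "$TAG" "$KEYSPACE"
run ls "/var/lib/cassandra/data/$KEYSPACE/snapshots/$TAG"
```

Because each invocation uses a fresh tag, repeated runs accumulate independent point-in-time directories rather than overwriting one another.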
Cassandra versions 1.x and later also have an “incremental” backup feature.
There is no CLI command to execute incremental backups. It is enabled by changing the value of “incremental_backups” to “true” in the cassandra.yaml file.
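Enabling it is a one-line change in cassandra.yaml (typically /etc/cassandra/conf/cassandra.yaml in packaged installs), followed by a restart of the node:

```yaml
incremental_backups: true
```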
The incremental backup feature creates a hard link to each new SSTable, as it is created, in a backups subdir in the local disk keyspace area (<DataFileDirectory>/<keyspace_name>/backups).
As incremental backups only contain new SSTable files, they are dependent on the last snapshot created. Natively, Cassandra incremental backups only supplement the last snapshot made.
Note that in both the snapshot and incremental backups, Cassandra only creates hard-links to the active data files in the parent keyspace, not actual copies of the files. As hard links are simply different pointer names to the same inode on the disk, the processes are quick and consume little disk space.
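The economics of hard links are easy to demonstrate with plain shell; the file names below are made-up stand-ins for SSTables:

```shell
#!/bin/sh
# Demonstrates why hard-link snapshots are cheap and safe: both names share
# one inode (no extra data blocks), and the data survives after the "active"
# copy is deleted, just as a snapshot survives compaction.
dir=$(mktemp -d)
mkdir "$dir/snapshots"

echo "sstable contents" > "$dir/ks-1-Data.db"          # stand-in for an active SSTable
ln "$dir/ks-1-Data.db" "$dir/snapshots/ks-1-Data.db"   # what the snapshot utility does

# same inode: the link consumed a directory entry, not disk blocks
active_inode=$(ls -i "$dir/ks-1-Data.db" | awk '{print $1}')
snap_inode=$(ls -i "$dir/snapshots/ks-1-Data.db" | awk '{print $1}')

# "compaction" removes the active file; the snapshot link keeps the data alive
rm "$dir/ks-1-Data.db"
cat "$dir/snapshots/ks-1-Data.db"
```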
Because Cassandra doesn’t modify SSTable files after creating them, but simply adds new files and deletes old ones as needed, the hard-link method for backups does provide consistency for a successful restore. When Cassandra removes an old SSTable file from the active keyspace, a pointer to the file still exists in the snapshot or backup sub-directory.
As Cassandra continues to operate, the contents of the snapshot/backup sub-directories begin to diverge in size and content from the active keyspace area. The longer the local snapshots/backups are retained, the greater the local disk usage becomes, because the backup directories continue to retain links to older SSTable files that were removed from the active keyspace area.
The incremental backup feature in Cassandra 1.0 substantially reduces disk space requirements because it only contains links to new SSTable files generated since the last full snapshot. In contrast, all snapshots have links to all the files in the active keyspace area at the time the snapshot was made. The value of incremental backups in reducing disk usage is especially noticeable in larger data sets with minimal active writes.
Native Cassandra Recovery Process
Restoring a Cassandra keyspace means restoring all the keyspace SSTable files as they existed in a point in time.
Cassandra does not provide a native restore utility, but does provide a restore procedure. For each node in the cluster:
- Shut down Cassandra.
- Clear all files in commitlog directory (path defined by the <CommitLogDirectory> parameter in the cassandra.yaml file, by default /var/lib/cassandra/commitlog). Ideally, logs will be flushed before Cassandra is shut down, as the commitlog directory is a shared resource of all keyspaces, not just the one to be restored.
- Remove all current contents of the active keyspace (all *.db files).
- Copy the contents of the desired snapshot into the active keyspace.
- Only if the restored snapshot is the latest one, and you want the most recent data, copy the contents of the backups directory into the active keyspace area on top of the restored snapshot files.
Note that the process must be executed on all nodes in the cluster, otherwise nodes that did not get the restored data will “update” the restored nodes with the newer, bad data.
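The per-node steps above can be sketched as a shell function. The default paths are the stock locations and the arguments are placeholders; this must only run while Cassandra is stopped:

```shell
#!/bin/sh
# Sketch of the per-node restore steps. DATA/COMMITLOG default to the stock
# paths; override them for testing. Run only with Cassandra shut down.
restore_keyspace() {
    keyspace=$1; snap_id=$2
    data=${DATA:-/var/lib/cassandra/data}
    commitlog=${COMMITLOG:-/var/lib/cassandra/commitlog}

    rm -f "$commitlog"/*                                          # clear commit logs
    rm -f "$data/$keyspace"/*.db                                  # clear active keyspace
    cp "$data/$keyspace/snapshots/$snap_id"/* "$data/$keyspace"/  # restore snapshot
    # only when restoring the latest snapshot: layer incrementals on top
    cp "$data/$keyspace/backups"/* "$data/$keyspace"/ 2>/dev/null || true
}
```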
Extensions to Cassandra Native Backup Tools
We had additional goals for our Cassandra backups that were unmet by native Cassandra tools, specifically:
- Automation: While incremental backups occur automatically in Cassandra, snapshots are command-executed via the CLI. We needed to automate this process, scheduling it on a regular basis to establish consistent full backups for each node in the cluster.
- Non-local storage: While Cassandra’s hard-link backup method is fast, it does not account for any potential problems with the local disk. We needed to make real copies of snapshot and backup files, and put them on storage devices separate from the local Cassandra node.
- Multi-day/instance retention: While Cassandra allows for multiple snapshots to be retained on disk, it retains only one set of incremental backups, and those only cover changes made since the last snapshot. We needed to retain incremental backups for previous snapshots as well.
To achieve these objectives, we utilized the common sysadmin tools Bash, Puppet, cron, and NFS. We decided on these tools because they are routinely used in our group and would be the quickest and most straightforward method to achieve our goals.
Puppet is used to distribute the backup script and cron entries to the hosts. The script is set up in Puppet as a template file, with variables to define the appropriate NFS target for each cluster for each type and location.
Cron is already in use on the hosts and managed via Puppet. Puppet variables in the cron statements, defined by cluster type and location, are used to determine the execution times and command-line flags.
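A hypothetical pair of crontab entries of the kind Puppet might render; the schedule and flag placement here are illustrative, not our production values:

```
# weekly full snapshot early Sunday morning; incremental copy to NFS other nights
0 2 * * 0   /usr/local/bin/cc_backup.sh --forcesnap
0 2 * * 1-6 /usr/local/bin/cc_backup.sh
```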
NFS was the most economical and quickest way to provide non-local storage to large numbers of Cassandra nodes. We use a mix of NetApp appliances and Isilon filers for our on-line Cassandra backup repositories.
It is important to note in discussing NFS that all Cassandra nodes need backup, even though data is replicated across nodes. The reason is that the data is replicated to be eventually consistent between all nodes – not that all nodes sharing a file set will have exactly the same data at the same time. Backing up all the nodes is the only way to ensure a consistent, complete backup of a keyspace.
This, however, means that the amount of NFS storage space needed for Cassandra backups may be larger than that for a similarly-scaled relational DB – because you will essentially be backing up X-number of copies of the same data (where X is the number of nodes in your Cassandra cluster that are configured to replicate to one another).
The Bash backup script, called “cc_backup.sh,” does the following:
- If the interval between snapshots (full backups) is greater than 7 days, or if specifically invoked via the “--forcesnap” flag, runs the Cassandra snapshot CLI command.
- If the script takes a snapshot, it copies the contents of that snapshot to an NFS mount point.
- If the script does not take a snapshot, it copies the current contents of the backup directory to NFS.
- The script creates the directory structure as needed in the NFS mount point as follows: /<nfs_mount>/<hostname>/<date>/snapshots
- The script tars and compresses each keyspace snapshot or backup directory as a single file. Snapshot tar files are identified by keyspace name and snapshot id number (e.g., system-1360082572055.tar.gz). Incremental backup tar files are identified by keyspace name, parent snapshot id number, the phrase “bkup,” and the date of the backup copy to NFS (e.g., system-1360082572055_bkup-07FEB13.tar.gz). The script uses the latest snapshot listed on the node to identify the appropriate parent snapshot id number.
- The tar processes are run with reduced CPU and I/O priority so as not to interfere with regular Cassandra operations.
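The tar-to-NFS step might look like the following function. The names, paths, and priority settings are a sketch (ionice is applied only where the utility exists):

```shell
#!/bin/sh
# Sketch of the tar step: bundle one snapshot directory into a single
# compressed file named <keyspace>-<snapshot_id>.tar.gz, at reduced priority.
tar_snapshot() {
    keyspace=$1; snap_id=$2; src=$3; nfs_dir=$4
    mkdir -p "$nfs_dir"
    # nice lowers CPU priority; ionice -c3 (idle class) yields disk I/O
    # to regular Cassandra operations, where the utility is available
    ionice=""
    command -v ionice >/dev/null 2>&1 && ionice="ionice -c3"
    nice -n 19 $ionice tar -czf "$nfs_dir/$keyspace-$snap_id.tar.gz" -C "$src" .
}
```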
- The script prunes older snapshots and incremental backups from both NFS and the local file system.
- On the local system disk, the script removes the oldest snapshots beyond the maximum local retention value specified. Upon creation of a new snapshot, the script clears the local incremental backup directory, as that content is obsolete once a new snapshot exists.
- On the NFS mount point, the script removes older snapshots and backups based on NFS retention values specified.
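The NFS pruning step reduces to a find over the dated per-host directories. A sketch, assuming the /<nfs_mount>/<hostname>/<date>/... layout described above:

```shell
#!/bin/sh
# Sketch: remove dated backup directories older than the retention window.
# Because each run lives in its own <date> directory, pruning by modification
# time drops whole snapshot/incremental sets at once.
prune_backups() {
    nfs_host_dir=$1      # the per-host directory on the NFS mount
    retain_days=$2       # NFS retention value, e.g. 3
    find "$nfs_host_dir" -mindepth 1 -maxdepth 1 -type d \
        -mtime +"$retain_days" -exec rm -rf {} +
}
```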
Unlike the native Cassandra process, which only keeps one set of incremental backup files, the script keeps multiple instances of incremental backup files on NFS to provide for more choices in a point-in-time restore. The script also keeps multiple copies of snapshot files at certain points in time to support recovery of older incremental backups that are dependent on the older snapshot files.
For example, Cassandra clusters may have a 3-day “live” recovery capacity on NFS (older recoveries, if available, would be dependent on NDMP tape restore to NFS first). This 3-day “live” recovery potential would require a mix of both snapshots and incremental backups, depending on the day of the week.
Backup File Retention Examples
For example, let’s assume that a given cluster with 3-day NFS retention performs snapshots on Sundays and incremental backups the rest of the week.
The Monday morning NFS backup tree for a given Cassandra host in that cluster would have the following:
- Previous Sunday snapshot (necessary for recovery of Friday and Saturday incremental backups)
- Friday incremental backup
- Saturday incremental backup
- Yesterday’s Sunday snapshot (necessary for recovery of Monday incremental backup)
- Monday incremental backup
This gives us the 3 days of recovery options not including Monday morning’s backup.
On Wednesday the NFS backup tree would have:
- Latest Sunday snapshot (necessary for recovery of Monday, Tuesday, or Wednesday incremental backups)
- Monday incremental backup
- Tuesday incremental backup
- Wednesday incremental backup
Extensions to the Native Cassandra Recovery Process
As our backup tool is merely a wrapper around the native Cassandra snapshot/backup process, data can be restored manually following the native Cassandra restore steps previously discussed. Local snapshots/backups can be recovered this way, and our NFS backups can be restored the same way once they are untarred and uncompressed.
To simplify restores, however, we have created the script “cc_restore.sh.”
This script allows you to specify just the keyspace and date to restore from, and it will gather the appropriate snapshot and backup files from NFS to restore. The script will also verify that all Cassandra processes are offline before it proceeds to restore any data. This makes it easier to execute restores en masse across an entire cluster via func or ssh.
/usr/local/bin/cc_restore.sh (table) (date)
table = table name to restore (case sensitive), or “ALL” (all caps) to restore every table
date = restore the last instance of the date specified, in two-digit day-month-year format, or “lastlocal” to restore the last local backup
/usr/local/bin/cc_restore.sh PLINK_L1 05JUL11
/usr/local/bin/cc_restore.sh SharedVol1_F1 07JUL11
/usr/local/bin/cc_restore.sh ALL 11JUL11
/usr/local/bin/cc_restore.sh SharedVol1_F1 lastlocal
Our backup/restore process is working, but we are constantly monitoring and tweaking it as the scale and complexity of our Cassandra environment grows.
Some of the things we are keeping an eye on:
System resource utilization and length of time to copy snapshots to NFS.
Although we feel we need the compression capability and file bundling of tar, we are aware of its aggressive use of system resources and the latency of NFS.
Careful, distributed scheduling of snapshots, so that no more than one site cluster goes to any one NFS appliance at the same time, coupled with highly optimized NFS client settings, has substantially reduced this impact.
Still, we are formulating possible mid-term and long-term alternatives should we encounter any performance problems.
Mid-term alternatives include using rsync in place of tar, and using non-Cassandra nodes to compress and tar-up the files on the NFS mount. This, however, will increase the NFS volume usage considerably.
Long-term alternatives include using a full-fledged Cassandra management solution such as Priam, which has its own built-in Cassandra backup/restore function within a Java VM. However, the fast compression used by this tool is said to be considerably less effective than traditional gzip, meaning NFS volume usage would likely grow substantially. Also, implementing a separate Java VM for backups adds a level of complexity to the process and would bring with it its own set of support requirements that would have to be addressed.
Backup space requirements.
As previously stated, Cassandra needs a lot of space for backups, so we have to prepare for much more rapid acquisition and implementation of NFS appliances. As our clusters grow in size, we are finding that the horizontal scaling ability of Isilon filers is a good fit for Cassandra.
Cassandra Repair and Snapshot Conflicts.
In the previous Cassandra release, the Cassandra repair function would modify files that were still hard-linked to the snapshot directory. This would generate an error if it occurred while the backup script was still copying files to the NFS directory, as tar would report that the source file had changed. We have not yet observed this behavior with Cassandra 1.x and above, but we continue to watch for it.
Continue the conversation by sharing your comments here on the blog.