New and Improved Compression Formats for Linux

A few weeks ago we decided to look for compression formats that could replace gzip. We take several backups of our MySQL and MongoDB data daily, and each one is several gigabytes even after compression. Any improvement in compression would benefit us in two ways:

  1. It takes less space on S3, and hence we save on storage and transfer costs.
  2. Smaller files download faster, so we can restore our backups sooner if need be. (Smaller downloads also help when you’re on a slower internet connection at home.)

For this to work, the compression formats would have to be “a lot like gzip”. They should be stable (i.e. production-ready), reasonably fast to encode and decode, and not use system resources excessively.

After a quick Google search I found a couple of new(ish) file compression formats that offer 30% – 70% better compression than gzip. They are:

  • xz – XZ is both a compression format and its associated tools. It is part of the XZ Utils package, a set of free lossless data compressors. It is available in the standard apt-get repos.
  • lrzip – lrzip is a software package that combines several compression formats with rzip. Rzip is a long-range compressor that first finds and encodes chunks of duplicated data, then compresses the result with a standard compression format. From a sysadmin perspective, the interesting part is how it shows up in `top`.

A compression shootout!

Let’s whet our appetites with a quick compression shootout. Here, I’m going to compress a single, large text file with 4 different formats. I’ve noted compression times only where a format was noticeably slower than the rest.

Original file : my.sql, 461M
GZip Compression : 110M (23% of original)
BZip Compression : 83M (18%)
XZ Compression : 70M (15%) (4 mins compression time!)
LRZip Compression : 73M (16%) (1:44 mins)
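
If you want to repeat this on your own dump, here is a rough sketch of the shootout as a shell script. The function names (`pct`, `shootout`) and the temp-dir layout are mine, not part of any tool. It assumes gzip, bzip2, xz and lrzip are on the PATH and skips any that are missing. Timings are in whole seconds, so treat them as ballpark figures.

```shell
#!/usr/bin/env bash
# Sketch of the shootout above. pct and shootout are illustrative names.

# pct PART WHOLE: PART as a rounded integer percentage of WHOLE.
pct() {
  echo $(( ($1 * 100 + $2 / 2) / $2 ))
}

# shootout FILE: compress FILE with each available tool, print size and time.
shootout() {
  local src=$1 tmp orig tool out start secs size
  tmp=$(mktemp -d) || return 1
  orig=$(wc -c < "$src")
  for tool in gzip bzip2 xz lrzip; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool: not installed, skipping"
      continue
    fi
    out="$tmp/out.$tool"
    start=$(date +%s)
    if [ "$tool" = lrzip ]; then
      lrzip -q -o "$out" "$src"      # lrzip writes to a named output file
    else
      "$tool" -c "$src" > "$out"     # the others can stream to stdout
    fi
    secs=$(( $(date +%s) - start ))
    size=$(wc -c < "$out")
    echo "$tool: $size bytes ($(pct "$size" "$orig")% of original), ${secs}s"
  done
  rm -rf "$tmp"
}

# Run only when a file is given, e.g.: ./shootout.sh my.sql
if [ "$#" -ge 1 ]; then
  shootout "$1"
fi
```

The original file is left untouched, and everything the script writes goes into a temp dir that is removed at the end.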

XZ is cool

XZ had 2 things going for it.

  1. It had the best compression of the roundup. It took a SQL file to 15% of its original size, very cool.
  2. Memory usage was extremely stable (at 108M virtual) throughout the compression.

The downside was the compression time. It was more than twice as slow as the next slowest in the roundup! From watching top, it was clear XZ was CPU bound, so having a fast processor is key to getting good performance. Also, there’s nothing else you can do to make it faster. That was a bit disappointing, but far from unusable.

xz can only compress a single file at a time. If you want to combine multiple files, `tar` them up first. But there’s nothing special you need to do here. The tar command integrates with xz if xz-utils is installed. This is how you can compress a dir :

tar --xz -cf files.tar.xz /my/dir

To install xz on Ubuntu, run `sudo apt-get install xz-utils`
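
For reference, here is a minimal sketch of the full round trip: pack a directory, check the archive, and restore it. `pack_dir` and `unpack_archive` are hypothetical helper names I made up for this post. `--xz` is GNU tar's long form of `-J`, and it needs the `xz` binary installed.

```shell
# pack_dir DIR ARCHIVE: tar up DIR and compress it with xz.
pack_dir() {
  # -C keeps the paths inside the archive relative to DIR's parent
  tar --xz -cf "$2" -C "$(dirname "$1")" "$(basename "$1")"
}

# unpack_archive ARCHIVE DEST: verify the xz stream, then extract into DEST.
unpack_archive() {
  xz -t "$1" || return 1          # -t tests integrity, nothing is extracted
  mkdir -p "$2" && tar --xz -xf "$1" -C "$2"
}

# Example:
#   pack_dir /my/dir files.tar.xz
#   unpack_archive files.tar.xz /tmp/restore
```

Testing the archive before extracting is cheap compared to finding out mid-restore that a download was truncated.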

Pros

  • Works on a single core. xz is heavily CPU bound and keeps one core fully busy, leaving the rest available for the system. This is excellent when you’re backing up on a db server and you want the server’s resources dedicated to the db.
  • Very stable memory usage. No spikes.
  • Decompression is very fast.

Cons

  • Slowest compression of the four. 4 mins for a 461M file!

LRZ is cooler!

From the docs (emphasis mine):

Long Range ZIP (or Lzma RZIP) is a compression program optimised for large files. The larger the file and the more memory you have, the better the compression advantage this will provide, especially once the files are larger than 100MB. The advantage can be chosen to be either size (much smaller than bzip2) or speed (much faster than bzip2).

LRZ combines the rzip technique with one of several different compression formats to produce files that are both small and reasonably fast to encode. Just running a plain rzip compressor gets compression benefits, because (emphasis mine):

rzip operates in two stages. The first stage finds and encodes large chunks of duplicated data over potentially very long distances (900 MB) in the input file. The second stage uses a standard compression algorithm (bzip2) to compress the output of the first stage.

lrzip works on all cores. It also uses as much RAM as it can. This can look alarming on a resource graph, but that is how it is optimized for speed. On a server, it shows up as a spike in both memory and CPU usage. This is usually fine if you know what’s happening, but if you have lots of short-lived processes that consume RAM (like running multiple processes under xargs), the graphs can be confusing. Don’t be alarmed. In my testing, it never kept my system above 90% of RAM for long.

To compress files, run it like this :

lrzip my.sql # creates my.sql.lrz
lrztar -q -o file.lrz FILES..

To install lrzip on Ubuntu, run `sudo apt-get install lrzip`
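
Since the whole point is restoring backups, here is a small sketch of the way back. `restore_lrz` is a hypothetical wrapper name; `-t` (test), `-d` (decompress) and `-q` (quiet) are lrzip switches from its man page. Note that lrzip refuses to overwrite an existing output file unless you pass `-f`.

```shell
# restore_lrz FILE.lrz: check an lrzip archive, then decompress it in place.
restore_lrz() {
  lrzip -q -t "$1" || return 1   # -t tests integrity without keeping the output
  lrzip -q -d "$1"               # -d decompresses; my.sql.lrz becomes my.sql
}

# Example:
#   restore_lrz my.sql.lrz
# For archives made by lrztar, the matching wrapper is:
#   lrzuntar file.tar.lrz
```

The `.lrz` file is kept after decompression, so you can delete it yourself once the restore has been verified.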

Pros

  • Makes full use of all cores (See Cons)
  • The default LZMA compressor provides fast compression times and excellent compression ratios.
  • Decompression is much faster, and uses far fewer resources. In my test, it only used one core.
  • You can use other compression formats too. For example, the ZPAQ compressor gets WAY better compression, but with far FAR longer compression times. Use it when you want to put your stuff on ice.
  • Very near xz in compression, but much faster.

Cons

  • Causes huge spikes on ALL CPU cores and memory. That’s just how lrzip works. But its processes run at a nice value of 19, so almost any other task will take precedence.
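
If the spikes worry you on a busy box, lrzip can be told to hold back. This is a sketch, not something I benchmarked: `polite_lrzip` is a hypothetical wrapper, `-p` overrides the number of threads, and `-N` sets the nice value (19 is already the default, it is spelled out here only to make the intent obvious).

```shell
# polite_lrzip THREADS FILE: compress FILE using at most THREADS threads,
# at the lowest scheduling priority.
polite_lrzip() {
  # -p caps the thread count; -N sets the nice value
  lrzip -q -p "$1" -N 19 "$2"
}

# Example: leave the other cores to the database.
#   polite_lrzip 2 my.sql
```

Fewer threads means longer compression times, so this trades away some of the speed advantage over xz.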

In Conclusion

xz and lrzip have been stable for years and are already included in the Debian apt repos. It’s worth your time exploring these tools and integrating them into your devops processes.

Know any more good compression utils? Comment!
