Various Data compression codecs in Hadoop

marcel ethan1655 04-May-2016

Here we list and describe some common codecs supported by the Hadoop framework. Be sure to choose the codec that most closely matches the needs of your particular use case (for example, for workloads where processing speed is important, choose a codec with high decompression speeds):

Gzip:

Gzip (short for GNU zip) is a compression utility that was adopted by the GNU project. It generates compressed files that have a .gz extension. We can use the gunzip command to decompress files that were created by a number of compression utilities, including Gzip.
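As a small illustration of the .gz format outside Hadoop, here is a minimal sketch that uses Python's standard gzip module to write a compressed file and read it back. The helper name gzip_round_trip and the file name sample.txt.gz are made up for this example; a file written this way should also be readable with the gunzip command.

```python
import gzip
import os
import tempfile


def gzip_round_trip(text: str) -> str:
    """Write text to a temporary .gz file, then read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical file name; the .gz extension is the Gzip convention.
        path = os.path.join(tmp, "sample.txt.gz")
        with gzip.open(path, "wt", encoding="utf-8") as f:
            f.write(text)
        with gzip.open(path, "rt", encoding="utf-8") as f:
            return f.read()


if __name__ == "__main__":
    original = "one line of log data\n" * 3
    # The text read back should match what was written.
    print(gzip_round_trip(original) == original)
```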

Bzip2:

From a usability point of view, Bzip2 and Gzip are very similar. Bzip2 achieves a better compression ratio than Gzip, but it is slower. In fact, of all the compression codecs available in Hadoop, Bzip2 is by far the slowest. If we are setting up an archive that we will rarely need to query and space is at a high premium, then Bzip2 may be worth considering. (The B in Bzip2 comes from its use of the Burrows-Wheeler algorithm.)
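To get a rough feel for the ratio-versus-speed trade-off on our own data, we could run a quick measurement like the sketch below. It uses Python's standard gzip and bz2 modules rather than Hadoop's codec classes, and the function name compare_codecs and the sample data are made up for this example. Results depend heavily on the input, so treat the numbers as indicative only.

```python
import bz2
import gzip
import time


def compare_codecs(data: bytes) -> dict:
    """Compress data with gzip and bzip2; report size, ratio, and time."""
    results = {}
    for name, module in (("gzip", gzip), ("bzip2", bz2)):
        start = time.perf_counter()
        compressed = module.compress(data)
        elapsed = time.perf_counter() - start
        # Sanity check: decompressing must give back the original bytes.
        assert module.decompress(compressed) == data
        results[name] = {
            "bytes": len(compressed),
            "ratio": len(data) / len(compressed),
            "seconds": elapsed,
        }
    return results


if __name__ == "__main__":
    # Made-up sample; substitute a slice of real data for a fairer test.
    sample = b"2016-05-04 INFO job_0001 map task finished\n" * 20000
    for codec, stats in compare_codecs(sample).items():
        print(codec, stats)
```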

Snappy:

The Snappy codec from Google provides modest compression ratios but fast compression and decompression speeds. (In fact, it has the fastest decompression speeds of the codecs listed here, which makes it highly desirable for data sets that are likely to be queried often.) The Snappy codec is integrated into Hadoop Common, a set of common utilities that supports other Hadoop subprojects. We can use Snappy as an add-on for versions of Hadoop that do not yet provide Snappy codec support.

LZO:

LZO is very similar to Snappy. LZO (short for Lempel-Ziv-Oberhumer, the trio of computer scientists who devised the algorithm) provides modest compression ratios with fast compression and decompression speeds. LZO is licensed under the GNU General Public License (GPL). This license is incompatible with the Apache license, and as a result, LZO has been removed from some distributions. (Some distributions, such as IBM’s BigInsights, have made an end run around this restriction by releasing GPL-free versions of LZO.)

LZO supports splittable compression, which enables our MapReduce jobs to process the splits of a compressed text file in parallel. LZO needs an index for each compressed file, because with variable-length compression blocks, an index is required to tell the mapper where it can safely split the compressed file. LZO is only really desirable if we need to compress text files. For binary files, which are not affected by a codec being non-splittable, Snappy is our best option.
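The role of the index can be sketched in a few lines. The example below is not LZO and does not reproduce the real LZO index format: it uses zlib from Python's standard library as a stand-in codec, and the function names compress_with_index and read_split are made up for illustration. It only shows why recording block boundaries lets a reader start decompressing in the middle of a file.

```python
import zlib


def compress_with_index(lines, lines_per_block=2):
    """Compress lines in independent blocks.

    Returns (blob, index), where index lists the (offset, length) of
    each compressed block inside blob. Blocks vary in compressed size,
    so without the index a reader cannot tell where a block begins.
    """
    blob = bytearray()
    index = []
    for i in range(0, len(lines), lines_per_block):
        chunk = "".join(lines[i:i + lines_per_block]).encode("utf-8")
        block = zlib.compress(chunk)  # zlib is a stand-in for LZO here
        index.append((len(blob), len(block)))
        blob += block
    return bytes(blob), index


def read_split(blob, index, block_no):
    """Decompress one block on its own, as a single mapper might."""
    offset, length = index[block_no]
    return zlib.decompress(blob[offset:offset + length]).decode("utf-8")


if __name__ == "__main__":
    records = [f"record {n}\n" for n in range(5)]
    blob, index = compress_with_index(records)
    # Each block can be read without touching the ones before it.
    for block_no in range(len(index)):
        print(block_no, repr(read_split(blob, index, block_no)))
```

Because each block is compressed independently, any block can be handed to a different worker, which is the property that splittable compression relies on.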
