Hadoop Takes Compressed Files (gzip, bzip2) as Direct Input

The log rotation mechanism on my servers automatically compresses rotated log files with gzip to save disk space. I discovered that Hadoop is already designed to accept gzip-, bzip2-, and LZO-compressed files as input out of the box.
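Hadoop decides which decompressor to use by matching the input file's extension against its registered codecs (this is what `CompressionCodecFactory` does). Here is a minimal sketch of that lookup, not Hadoop's actual code; the codec class names mirror the real ones, but the matching logic is simplified for illustration:

```python
# Simplified sketch of Hadoop's extension-based codec lookup.
# The class names are the real Hadoop codec classes; the dict and
# function here are an illustration, not the actual implementation.
CODECS = {
    ".gz": "org.apache.hadoop.io.compress.GzipCodec",
    ".bz2": "org.apache.hadoop.io.compress.BZip2Codec",
    ".lzo": "com.hadoop.compression.lzo.LzopCodec",
}

def codec_for(path):
    """Return the codec class name that would handle this file, or None."""
    for ext, codec in CODECS.items():
        if path.endswith(ext):
            return codec
    return None

print(codec_for("catalina.out-20131103.gz"))
# -> org.apache.hadoop.io.compress.GzipCodec
```

Because the lookup keys off the file name, renaming a gzipped file without its `.gz` extension would cause Hadoop to feed the raw compressed bytes to your mapper.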

This means that no additional work is required in the Mapper class to decompress the input. Here's a snippet of the MapReduce job output showing Hadoop obtaining a gzip decompressor for the split:

13/11/15 22:01:46 INFO mapred.MapTask: Processing split: hdfs://localhost/user/watrous/log_myhost.com/catalina.out-20131103.gz:0+3058954
13/11/15 22:01:46 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
13/11/15 22:01:46 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
13/11/15 22:01:46 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
13/11/15 22:01:46 INFO mapred.MapTask: soft limit at 83886080
13/11/15 22:01:46 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
13/11/15 22:01:46 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
13/11/15 22:01:46 INFO compress.CodecPool: Got brand-new decompressor [.gz]
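From the mapper's point of view, the framework wraps the input stream in the decompressor before records are read, so the map function only ever sees plain text lines. The effect can be sketched in a few lines of Python (the file name and log lines below are made up for illustration; this emulates the behavior, it is not Hadoop code):

```python
import gzip
import os
import tempfile

def map_lines(path):
    """Yield text lines from a plain or gzipped file, transparently.

    This mimics how Hadoop's record reader hands already-decompressed
    lines to the mapper when the input is a .gz file.
    """
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as f:
        for line in f:
            yield line.rstrip("\n")

# Create a small gzipped "log" and iterate over it as a mapper would.
tmp = tempfile.NamedTemporaryFile(suffix=".gz", delete=False)
tmp.close()
with gzip.open(tmp.name, "wt") as f:
    f.write("INFO request served\nERROR timeout\n")
print(list(map_lines(tmp.name)))  # plain text lines, no unzip step
os.unlink(tmp.name)
```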

LZO compression is a little more complicated, since an LZO file doesn't inherently support splitting a large compressed file across multiple mappers. You can read more about a solution to LZO splitting here.
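The underlying problem applies to gzip as well: decompression has to start at the beginning of the stream, so a mapper can't begin reading at an arbitrary byte offset in the middle of the file. That is why Hadoop assigns a whole gzip file to a single mapper, and why splittable LZO requires a separately built index of block boundaries. A quick demonstration of the mid-stream problem:

```python
import gzip

# A compressed stream can only be decoded from its start. Slicing
# into the middle, as a second mapper would have to, fails because
# the gzip header and dictionary state live at the beginning.
data = gzip.compress(b"line\n" * 10000)

assert gzip.decompress(data) == b"line\n" * 10000  # from byte 0: fine

try:
    gzip.decompress(data[len(data) // 2:])  # from a mid-file offset
    print("mid-stream decompression succeeded")
except Exception as exc:
    print("mid-stream decompression failed:", type(exc).__name__)
```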


About Daniel Watrous

I'm a Software & Electrical Engineer and online entrepreneur.
