Hadoop Takes Compressed Files (gzip, bzip2) as Direct Input
The log rotation mechanism on my servers automatically compresses (gzip) each rotated log file to save disk space. I discovered that Hadoop is already designed to deal with compressed input files: gzip and bzip2 are supported out of the box, and LZO can be added as well.
This means that no additional work is required in the Mapper class to decompress; the record reader detects the codec from the file extension and hands the mapper plain text lines. Here's a snippet from the MapReduce job output that shows Hadoop fetching a gzip decompressor for the input split:
13/11/15 22:01:46 INFO mapred.MapTask: Processing split: hdfs://localhost/user/watrous/log_myhost.com/catalina.out-20131103.gz:0+3058954
13/11/15 22:01:46 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
13/11/15 22:01:46 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
13/11/15 22:01:46 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
13/11/15 22:01:46 INFO mapred.MapTask: soft limit at 83886080
13/11/15 22:01:46 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
13/11/15 22:01:46 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
13/11/15 22:01:46 INFO compress.CodecPool: Got brand-new decompressor [.gz]
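To make the point concrete, here is a minimal sketch of a job that counts lines in a rotated log. The class names (GzipLogCount, LineMapper, SumReducer) and paths are illustrative, not taken from the job above; the thing to notice is that the mapper receives ordinary Text lines and contains no decompression logic, because the default TextInputFormat picks the gzip codec from the .gz extension.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class GzipLogCount {

    // An ordinary mapper: each value is an already-decompressed line of the log.
    public static class LineMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text lines = new Text("lines");

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(lines, ONE);
        }
    }

    // Sum the per-mapper counts into a single total.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "gzip line count");
        job.setJarByClass(GzipLogCount.class);
        job.setMapperClass(LineMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // A compressed input path needs no special handling; the codec is
        // chosen automatically from the file extension (hypothetical paths).
        FileInputFormat.addInputPath(job, new Path("log_myhost.com/catalina.out-20131103.gz"));
        FileOutputFormat.setOutputPath(job, new Path("line_count_out"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}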
Note that gzip files are not splittable: each .gz file, like the one in the log above (split 0+3058954, the whole file), is processed by a single mapper. LZO compression is a little more complicated, since it also doesn't inherently support splitting up large compressed files and requires an index to be built first. You can read more about a solution to LZO splitting here.
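For those curious how the detection works, here is a short sketch (CodecCheck is a hypothetical name) that asks CompressionCodecFactory, the same class the text input machinery consults, which codec matches a given file name. An .lzo file would only resolve to a codec once the third-party hadoop-lzo library is installed and registered under io.compression.codecs.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class CodecCheck {
    public static void main(String[] args) {
        // The factory maps file extensions to the codecs registered in the configuration.
        CompressionCodecFactory factory = new CompressionCodecFactory(new Configuration());
        for (String name : new String[] {"catalina.out-20131103.gz",
                                         "catalina.out-20131103.bz2",
                                         "catalina.out-20131103"}) {
            CompressionCodec codec = factory.getCodec(new Path(name));
            System.out.println(name + " -> "
                + (codec == null ? "no codec (read as plain text)"
                                 : codec.getClass().getSimpleName()));
        }
    }
}

Run against these names, the first two resolve to GzipCodec and BZip2Codec respectively, while the uncompressed file resolves to no codec and is read as plain text.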