It is generally a good idea to compress intermediate map output. Disk IO and network transfer are two of the biggest bottlenecks in Hadoop, and compression reduces both.

Map output is written to local disk, and then transferred (shuffled) across the network to reducer nodes. At this point in a MapReduce job, we are no longer concerned with data being splittable, so a non-splittable compression type will work fine. One thing to consider is that stronger compression also means more CPU time, so a fast compressor like Snappy or LZO is usually a good choice for compressing intermediate map output. This way you can get increased performance by simply reducing the amount of data written to disk and sent over the network. In fact, Amazon EMR enables intermediate compression with the Snappy codec by default.

In order to enable intermediate data compression you must adjust the parameters you pass to your MapReduce job. Simply set mapreduce.map.output.compress to true, and mapreduce.map.output.compress.codec to the compression codec of your choice.

Here is an example of enabling intermediate (Snappy) compression in our Java MapReduce job class:

//turn on intermediate (map output) compression
conf.set("mapreduce.map.output.compress", "true");
conf.set("mapreduce.map.output.compress.codec", "org.apache.hadoop.io.compress.SnappyCodec");
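If you want map output compression turned on for every job rather than per job, the same two properties can also be set cluster-wide. A minimal sketch of what that could look like in mapred-site.xml (using the Snappy codec, as above):

```xml
<!-- mapred-site.xml: compress intermediate map output for all jobs -->
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
```

Individual jobs can still override these values with their own configuration.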

Here is an example of enabling intermediate (Snappy) compression when running the job from the command line:

hadoop jar bigdatums-hadoop-1.0-SNAPSHOT.jar com.bigdatums.hadoop.mapreduce.ToolMapReduceExample \
-DinputLoc=/dataIn/ -DoutputLoc=/dataOut/ \
-Dmapreduce.map.output.compress=true \
-Dmapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec

Compressing Intermediate Map Output in Hadoop