Hadoop’s default output delimiter (character separating the output key and value) is a tab ("\t"). This post explains how to change the default Hadoop output delimiter.
Output Delimiter Configuration Property
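The output delimiter of a Hadoop job that writes text output can be changed by setting the mapred.textoutputformat.separator configuration property. In Hadoop 2.x and later this name is deprecated in favor of mapreduce.output.textoutputformat.separator, but the old name is still accepted. The property can be set in the job code or on the command line.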
Setting delimiter in job class:
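//get configuration object (getConf() is available when the job class extends Configured)
Configuration conf = getConf();
//set output delimiter to comma; do this before creating the Job from conf
conf.set("mapred.textoutputformat.separator", ",");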
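To make the effect of the property concrete, here is a minimal, Hadoop-free sketch of what a text output writer does with the separator: it looks up the property, falls back to a tab when the property is unset, and joins key and value with the result. The class SeparatorSketch, its method names, and the Map used as a stand-in for Configuration are all hypothetical. This is an illustration of the behavior, not Hadoop's implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Illustration only: a plain-Java stand-in, not Hadoop's TextOutputFormat.
final class SeparatorSketch {
    // Property name used in this post.
    static final String SEPARATOR_KEY = "mapred.textoutputformat.separator";

    // Returns the configured separator, or a tab when the property is unset.
    static String separator(Map<String, String> conf) {
        return conf.getOrDefault(SEPARATOR_KEY, "\t");
    }

    // Formats one output record as key + separator + value.
    static String formatRecord(Map<String, String> conf, String key, String value) {
        return key + separator(conf) + value;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        // Property unset: key and value are joined by a tab.
        System.out.println(formatRecord(conf, "with", "56"));
        // Property set to a comma: key and value are joined by a comma.
        conf.put(SEPARATOR_KEY, ",");
        System.out.println(formatRecord(conf, "with", "56"));
    }
}
```

The only point of the sketch is that the separator is an ordinary string read from the job configuration, so any value set there before the job runs replaces the tab.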
Setting delimiter from command line:
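# adding the following argument to a Hadoop job command changes the output delimiter to a comma
# (the job must parse generic options, e.g. via ToolRunner, and -D must come before the job's own arguments)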
-D mapred.textoutputformat.separator=","
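The command-line form works because the job's driver turns each -D key=value pair into a configuration property before the job starts. The following is a simplified sketch of that idea in plain Java. The class DashDSketch and its parse method are hypothetical, and the logic is a deliberately reduced stand-in, not Hadoop's GenericOptionsParser. Note that the shell removes the quotes, so "," reaches the program as a bare comma.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration only: a simplified stand-in for generic option handling,
// not Hadoop's GenericOptionsParser.
final class DashDSketch {
    // Copies every "-D key=value" pair into conf and returns the other arguments.
    // A pair without '=' (or with an empty key) is ignored in this sketch.
    static List<String> parse(String[] args, Map<String, String> conf) {
        List<String> remaining = new ArrayList<>();
        for (int i = 0; i < args.length; i++) {
            if (args[i].equals("-D") && i + 1 < args.length) {
                String pair = args[++i];
                int eq = pair.indexOf('=');
                if (eq > 0) {
                    conf.put(pair.substring(0, eq), pair.substring(eq + 1));
                }
            } else {
                remaining.add(args[i]);
            }
        }
        return remaining;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new LinkedHashMap<>();
        String[] demo = {"-D", "mapred.textoutputformat.separator=,", "/input-dir", "/output-dir"};
        List<String> rest = parse(demo, conf);
        System.out.println(conf); // the property extracted from the -D pair
        System.out.println(rest); // the arguments left for the job itself
    }
}
```

In this sketch the property ends up in the configuration map, and only the input and output directories are left over as job arguments.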
Example
We will use the word count example that comes packaged with Hadoop to show how to set a custom output delimiter from the command line.
Running word count with default delimiter:
# hadoop command
hadoop jar hadoop-mapreduce-examples-2.8.1.jar wordcount /input-dir /output-dir
# cat output
hadoop fs -cat /output-dir/*
with 56
within 4
without 1
work 12
Running word count with custom delimiter:
# hadoop command
hadoop jar hadoop-mapreduce-examples-2.8.1.jar wordcount -D mapred.textoutputformat.separator="," /input-dir /output-dir
# cat output
hadoop fs -cat /output-dir/*
with,56
within,4
without,1
work,12