Hadoop’s default output delimiter (character separating the output key and value) is a tab ("\t"). This post explains how to change the default Hadoop output delimiter.

Output Delimiter Configuration Property

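For jobs that write their results with TextOutputFormat (the default output format), the output delimiter is controlled by the mapred.textoutputformat.separator configuration property. In Hadoop 2.x and later this name is deprecated in favor of mapreduce.output.textoutputformat.separator, but the old name is still accepted. The property can be set in the job code or on the command line.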

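Setting the delimiter in the job class:

// get the configuration object (getConf() is available when the driver extends Configured)
Configuration conf = getConf();

// set the output delimiter to a comma; do this before creating the Job,
// because Job.getInstance(conf) works on a copy of the configuration
conf.set("mapred.textoutputformat.separator", ",");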
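
For context, here is a minimal sketch of a complete driver built around those two lines. The class name DelimiterDemo and the job name are made up for illustration. The sketch assumes the Hadoop 2.x mapreduce API and relies on the default identity mapper and reducer, so it simply copies each input line to the output as an offset,line pair.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Hypothetical driver class, named DelimiterDemo for illustration only.
public class DelimiterDemo extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // Configuration prepared by ToolRunner (includes any -D options).
        Configuration conf = getConf();

        // Set the delimiter BEFORE the Job is created: the Job copies conf.
        conf.set("mapred.textoutputformat.separator", ",");

        Job job = Job.getInstance(conf, "delimiter demo");
        job.setJarByClass(DelimiterDemo.class);

        // No mapper or reducer is set, so the identity defaults are used.
        // With the default TextInputFormat, keys are byte offsets and values are lines.
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        // TextOutputFormat is already the default; set here to make the dependency explicit.
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new DelimiterDemo(), args));
    }
}
```

Because this sketch hard-codes the delimiter inside run(), it overrides any value passed with -D. Drop the conf.set call if the delimiter should be chosen at launch time instead.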

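Setting the delimiter from the command line:

# adding the following generic option to a Hadoop job command changes the output delimiter to a comma
# (it must come before the job's own arguments, and the driver must parse generic options, e.g. via ToolRunner)
-D mapred.textoutputformat.separator=","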

Example

We will use the word count example that comes packaged with Hadoop to show how to set a custom output delimiter from the command line.

Running word count with default delimiter:

# hadoop command
hadoop jar hadoop-mapreduce-examples-2.8.1.jar wordcount /input-dir /output-dir

# cat output
hadoop fs -cat /output-dir/* 

with	56
within	4
without	1
work	12

Running word count with custom delimiter:

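# remove the output of the previous run (the job fails if the output directory already exists)
hadoop fs -rm -r /output-dir

# hadoop command
hadoop jar hadoop-mapreduce-examples-2.8.1.jar wordcount -D mapred.textoutputformat.separator="," /input-dir /output-dir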

# cat output
hadoop fs -cat /output-dir/* 

with,56
within,4
without,1
work,12
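
Anything that reads this output later has to split on the same delimiter. The following is a rough sketch in plain Java of one way to do that. WordCountLineParser is a hypothetical helper, not part of Hadoop. Since word count tokenizes on whitespace only, a word can itself end in a comma, so the sketch splits at the last delimiter in the line and assumes the count never contains one.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not part of Hadoop): parses one line of word count output.
final class WordCountLineParser {

    // Splits at the LAST occurrence of the delimiter, so a word that itself
    // contains the delimiter (for example "end," with a comma delimiter) stays intact.
    static Map.Entry<String, Long> parse(String line, String delimiter) {
        int split = line.lastIndexOf(delimiter);
        if (split < 0) {
            throw new IllegalArgumentException("delimiter not found in line: " + line);
        }
        String word = line.substring(0, split);
        long count = Long.parseLong(line.substring(split + delimiter.length()));
        return new SimpleEntry<>(word, count);
    }

    public static void main(String[] args) {
        // Sample lines copied from the output shown above.
        List<String> lines = Arrays.asList("with,56", "within,4", "without,1", "work,12");
        for (String line : lines) {
            Map.Entry<String, Long> entry = parse(line, ",");
            System.out.println(entry.getKey() + " -> " + entry.getValue());
        }
    }
}
```

If the keys in your data can contain the chosen delimiter in ways this does not cover, picking a character that never appears in the keys is the simpler fix.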
