Hadoop MapReduce Example – Aggregating Text Fields

Below is a simple Hadoop MapReduce example. It is a little different from the standard “Word Count” example in that it takes tab-delimited text and counts the occurrences of values in a certain field. More details about the implementation are included below as well.

package com.bigdatums.hadoop.mapreduce;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

import java.io.IOException;
import java.util.Iterator;

public class FieldCountsExample {

    // Mapper: emits (first name, 1) for each line of tab-delimited input.
    public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            // Split the line on tabs; this assumes every line has at least two fields.
            String line = value.toString();
            String[] fields = line.split("\t");

            // The second field (index 1) is the first name, used as the map output key.
            String firstName = fields[1];
            word.set(firstName);
            output.collect(word, one);
        }
    }

    // Reducer (also used as the combiner): sums the 1s emitted for each first name.
    public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }


    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(FieldCountsExample.class);
        conf.setJobName("Field Counts");

        // Output key/value types for both the map and reduce phases.
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        // Reduce can double as the combiner because summing counts is associative and commutative.
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        // Input and output paths are hardcoded; the output path must not already exist.
        FileInputFormat.setInputPaths(conf, new Path("/data"));
        FileOutputFormat.setOutputPath(conf, new Path("/tmpout"));

        JobClient.runJob(conf);
    }
}

You can see above in the Map class that each line of text is split using split("\t"). The second field in the resulting array (index 1) is used as the map output key, and 1 is used as its value. This data is then aggregated in the (optional) combiner and in the reducer until the final result, a count by first name, is available in HDFS. Code for this example is available in the Big Datums GitHub repo.
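As a concrete illustration, suppose the input contains tab-delimited lines like these (hypothetical sample records, with the first name in the second field):

1001	John	Smith	NY
1002	Jane	Doe	CA
1003	John	Brown	TX

The mappers would emit (John, 1), (Jane, 1), and (John, 1), which the combiner and reducer sum into the final output:

Jane	1
John	2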

Here is how to set things up to run the above MapReduce job:

1. Create an Executable Jar containing your MapReduce classes

This can be done in a variety of ways. This example assumes Maven is being used.

mvn package # creates the bigdatums-hadoop-1.0-SNAPSHOT.jar used below
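To sanity-check that your class made it into the jar, you can list the jar's contents (the target/ path assumes Maven's default build layout):

jar tf target/bigdatums-hadoop-1.0-SNAPSHOT.jar | grep FieldCountsExample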

2. Create a working Hadoop instance

You must first have a working Hadoop installation to run this job. I personally like to create a Docker container using the sequenceiq/docker-spark image.
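For example, a container can be started with something along these lines (the Docker Hub image for that repo is sequenceiq/spark; the exact tag and flags may differ, so check the image's documentation):

docker run -it -h sandbox sequenceiq/spark:1.6.0 bash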

3. Create an HDFS directory for your input data

If you do not have an HDFS directory containing the data you want to aggregate, create one.

hadoop fs -mkdir /data  

4. Add data to your HDFS directory

Add text file(s) to your newly created HDFS directory. Note that the mapper does not skip header rows, so if a file has a header (as the sample file's name suggests), the header's second field will be counted along with the real first names.

hadoop fs -put bddatagen_people_wHeader_v1_5k.txt /data
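You can confirm the file landed in HDFS with:

hadoop fs -ls /data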

5. Run program from the command line

hadoop jar bigdatums-hadoop-1.0-SNAPSHOT.jar \
  com.bigdatums.hadoop.mapreduce.FieldCountsExample
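Note that the job will fail with a FileAlreadyExistsException if the output directory already exists, so remove any old output before re-running:

hadoop fs -rm -r /tmpout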

6. Print output from HDFS

hadoop fs -cat /tmpout/*
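The output directory will contain one part file per reducer (e.g. part-00000). Since the job uses TextOutputFormat, each line is a key and its count separated by a tab, so the output looks something like this (hypothetical names and counts):

Jane	214
John	187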

Great documentation about MapReduce, as well as the standard “Word Count” example, can be found in this MapReduce Tutorial.
