Apache Hadoop is a core big data technology. Running Hadoop on Docker is a great way to get up and running quickly. Below are the basic steps to create a simple Hadoop Docker image.

Pick an OS

Hadoop runs great on a variety of Linux distros. In this post we use Ubuntu 16.04.
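
In a Dockerfile this simply means starting from the official base image:

FROM ubuntu:16.04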

Install Required Packages

Hadoop requires several software packages, including SSH and Java, which must be installed before Hadoop will run. Note that wget is included below as well; it is used in the next step to download Hadoop and is not part of the Ubuntu base image.

apt-get update && apt-get install -y \
  ssh \
  rsync \
  vim \
  wget \
  openjdk-8-jdk
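
You can confirm the Java install with java -version; the exact version string depends on the package build:

java -version
# openjdk version "1.8.0_..."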

Install Hadoop

To install Hadoop, download and extract the binary release inside your Docker image. The package is available from many mirrors. Here is an example that downloads from a specific mirror and extracts Hadoop into the /opt/hadoop/ directory.

# assumes HADOOP_HOME is already set, e.g. to /opt/hadoop
wget http://apache.mirrors.tds.net/hadoop/common/hadoop-2.8.1/hadoop-2.8.1.tar.gz && \
tar -xzf hadoop-2.8.1.tar.gz && \
mv hadoop-2.8.1 $HADOOP_HOME

Make sure to update this URL with the version of Hadoop you are interested in. In this example we use version 2.8.1. See http://hadoop.apache.org/releases.html for a list of Hadoop releases to download.
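
A quick way to confirm the extraction worked is to ask Hadoop for its version:

$HADOOP_HOME/bin/hadoop version
# Hadoop 2.8.1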

Configure SSH

Running Hadoop in pseudo-distributed mode requires passwordless SSH to localhost. Add the following to ~/.ssh/config to avoid having to manually confirm the host key on first connection.

Host *
  UserKnownHostsFile /dev/null 
  StrictHostKeyChecking no

You will also need to set up SSH keys, which can be done like this:

mkdir -p ~/.ssh && \
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa && \
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys && \
chmod 0600 ~/.ssh/authorized_keys
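
Once the SSH server is running (it is started by the start-hadoop.sh script shown below), you can confirm passwordless login works:

ssh localhost echo ok
# prints "ok" with no password prompt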

Configure Hadoop

Several Hadoop configuration files need to be created or updated for Hadoop to run correctly in pseudo-distributed mode. These files live in $HADOOP_HOME/etc/hadoop/. Note that mapred-site.xml does not exist by default in the 2.x release; only mapred-site.xml.template ships with it. The following are minimal examples of the needed files:

core-site.xml

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>

hdfs-site.xml

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>

mapred-site.xml

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

yarn-site.xml

<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
       <name>yarn.resourcemanager.address</name>
       <value>127.0.0.1:8032</value>
    </property>
</configuration>
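
Once these files are in place you can sanity-check that Hadoop picks them up. For example (assuming $HADOOP_HOME is set as above):

$HADOOP_HOME/bin/hdfs getconf -confKey fs.defaultFS
# hdfs://localhost:9000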

Set Environment Variables

Export the HADOOP_HOME and JAVA_HOME environment variables in both ~/.bashrc and $HADOOP_HOME/etc/hadoop/hadoop-env.sh. The hadoop-env.sh entry matters because the Hadoop daemons are launched over SSH and do not inherit JAVA_HOME from your shell.
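
With the paths used in this post, that amounts to lines like these (the sample Dockerfile below sets them with ENV and echo instead):

# in ~/.bashrc
export HADOOP_HOME=/opt/hadoop
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export PATH=$PATH:$HADOOP_HOME/bin

# in $HADOOP_HOME/etc/hadoop/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64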

Expose Ports

If you want to view the various web interfaces Hadoop provides, expose the related ports in your Dockerfile.
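
In Hadoop 2.x the most useful ones are 8088 (ResourceManager UI), 50070 (NameNode UI), and 50075 (DataNode UI), so a minimal EXPOSE line might look like this (the sample Dockerfile below also exposes 50030 and 50060, which were the MRv1 JobTracker and TaskTracker UIs):

# ResourceManager (8088), NameNode (50070), DataNode (50075)
EXPOSE 8088 50070 50075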

Starting Hadoop

At this point all the pieces should be in place, and Hadoop can be started. The remaining steps are to start the SSH server, format the namenode, run start-dfs.sh, and run start-yarn.sh.

# start ssh server
/etc/init.d/ssh start

# format namenode
$HADOOP_HOME/bin/hdfs namenode -format

# start hadoop
$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh

Sample Dockerfile

This Dockerfile shows an example of installing Hadoop on Ubuntu 16.04 into /opt/hadoop. The start-hadoop.sh script (contents shown below) starts SSH and Hadoop. The Hadoop and SSH configuration files shown above are copied in from the build context using the ADD instruction.

Dockerfile

FROM ubuntu:16.04

# set environment vars
ENV HADOOP_HOME /opt/hadoop
ENV JAVA_HOME /usr/lib/jvm/java-8-openjdk-amd64

# install packages
RUN \
  apt-get update && apt-get install -y \
  ssh \
  rsync \
  vim \
  wget \
  openjdk-8-jdk

# download and extract hadoop, set JAVA_HOME in hadoop-env.sh, update path
RUN \
  wget http://apache.mirrors.tds.net/hadoop/common/hadoop-2.8.1/hadoop-2.8.1.tar.gz && \
  tar -xzf hadoop-2.8.1.tar.gz && \
  mv hadoop-2.8.1 $HADOOP_HOME && \
  echo "export JAVA_HOME=$JAVA_HOME" >> $HADOOP_HOME/etc/hadoop/hadoop-env.sh && \
  echo "PATH=$PATH:$HADOOP_HOME/bin" >> ~/.bashrc

# create ssh keys for passwordless ssh to localhost
RUN \
  mkdir -p ~/.ssh && \
  ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa && \
  cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys && \
  chmod 0600 ~/.ssh/authorized_keys

# copy hadoop configs
ADD configs/*xml $HADOOP_HOME/etc/hadoop/

# copy ssh config
ADD configs/ssh_config /root/.ssh/config

# copy script to start hadoop
ADD start-hadoop.sh start-hadoop.sh

# expose various ports
EXPOSE 8088 50070 50075 50030 50060

# start hadoop
CMD bash start-hadoop.sh

start-hadoop.sh

#!/bin/bash

# start ssh server
/etc/init.d/ssh start

# format namenode (required before the first start)
$HADOOP_HOME/bin/hdfs namenode -format

# start hadoop
$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh
$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh start historyserver

# keep container running
tail -f /dev/null

Building the Hadoop Docker Image

Running docker build -t my-hadoop . from the directory containing your Dockerfile will build the image and tag it my-hadoop.
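
The ADD instructions in the Dockerfile assume a build context laid out like this:

.
├── Dockerfile
├── start-hadoop.sh
└── configs/
    ├── core-site.xml
    ├── hdfs-site.xml
    ├── mapred-site.xml
    ├── yarn-site.xml
    └── ssh_config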

Creating & Running Docker Container

The command docker run -p 8088:8088 --name my-hadoop-container -d my-hadoop can now be used to create a Docker container from this image. The -p option maps port 8088 inside the container to port 8088 on the host machine. The CMD instruction in the Dockerfile runs start-hadoop.sh by default when the container starts.
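
If you also want the HDFS web interfaces reachable from the host, publish those ports as well, for example:

docker run -p 8088:8088 -p 50070:50070 -p 50075:50075 \
  --name my-hadoop-container -d my-hadoop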

Accessing Hadoop in Docker Container

Hadoop should now be running in a Docker container. Below is an example of starting an interactive shell in the Docker container, and running a sample MapReduce job.

# start interactive shell in running container
docker exec -it my-hadoop-container bash

# once shell has started run hadoop "pi" example job
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.1.jar pi 10 100
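
To confirm the daemons are healthy, you can also check the running Java processes and poke at HDFS directly:

# should list NameNode, DataNode, ResourceManager, NodeManager (among others)
jps

# basic HDFS smoke test
hdfs dfs -mkdir -p /user/root
hdfs dfs -ls /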

You can also take a look at the ResourceManager web interface at http://localhost:8088.

The original Docker image used in this example can be found at https://github.com/nsonntag/docker-images/.
