Apache Hadoop is a core big data technology. Running Hadoop on Docker is a great way to get up and running quickly. Below are the basic steps to create a simple Hadoop Docker image.
Pick an OS
Hadoop runs great on a variety of Linux distros. In this post we use Ubuntu 16.04.
Install Required Packages
Various software packages are required for Hadoop, including ssh and Java. These must be installed before using Hadoop. The wget utility is also installed here, since it is needed to download Hadoop in the next step.
apt-get update && apt-get install -y \
ssh \
rsync \
vim \
wget \
openjdk-8-jdk
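If you want a quick sanity check that the installs succeeded, you can print the versions:
# confirm Java and ssh are available
java -version
ssh -V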
Install Hadoop
Installing Hadoop can be done by downloading and extracting the binary package within your Docker container. There are many mirrors from which this package can be downloaded. Here is an example of downloading from a specific mirror, and extracting Hadoop into the /opt/hadoop/ directory.
wget http://apache.mirrors.tds.net/hadoop/common/hadoop-2.8.1/hadoop-2.8.1.tar.gz && \
tar -xzf hadoop-2.8.1.tar.gz && \
mv hadoop-2.8.1 $HADOOP_HOME
Make sure to update this URL with the version of Hadoop you are interested in. In this example we use version 2.8.1. Note that this snippet assumes the HADOOP_HOME environment variable is already set (see "Set Environment Variables" below). See http://hadoop.apache.org/releases.html for a list of Hadoop releases to download.
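To confirm the download and extraction worked, you can print the Hadoop version:
# should print "Hadoop 2.8.1" along with build details
$HADOOP_HOME/bin/hadoop version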
Configure SSH
Running Hadoop in pseudo-distributed mode requires ssh. Add the following to ~/.ssh/config to avoid having to manually confirm the connection.
Host *
  UserKnownHostsFile /dev/null
  StrictHostKeyChecking no
You will also need to set up SSH keys, which can be done like this:
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa && \
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys && \
chmod 0600 ~/.ssh/authorized_keys
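Once the SSH server is running (it is started in a later step), you can verify that passwordless login to localhost works:
# should run without prompting for a password or host confirmation
ssh localhost echo "ssh is working"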
Configure Hadoop
Various Hadoop configuration files need to be created or updated in order for Hadoop to run correctly. These config files can be found in $HADOOP_HOME/etc/hadoop/. The following are examples of the config files needed:
core-site.xml
<configuration>
  <property>
    <!-- use HDFS on localhost:9000 as the default filesystem -->
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
hdfs-site.xml
<configuration>
  <property>
    <!-- single-node setup, so keep just one copy of each block -->
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
mapred-site.xml
<configuration>
  <property>
    <!-- run MapReduce jobs on YARN -->
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
yarn-site.xml
<configuration>
  <property>
    <!-- auxiliary shuffle service required by MapReduce on YARN -->
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <!-- RPC address clients use to submit jobs to the ResourceManager -->
    <name>yarn.resourcemanager.address</name>
    <value>127.0.0.1:8032</value>
  </property>
</configuration>
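Once these files are in place, you can confirm that Hadoop picks them up using hdfs getconf:
# should print hdfs://localhost:9000
$HADOOP_HOME/bin/hdfs getconf -confKey fs.defaultFS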
Set Environment Variables
Export the HADOOP_HOME and JAVA_HOME environment variables in the .bashrc and $HADOOP_HOME/etc/hadoop/hadoop-env.sh files.
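For example, assuming the install locations used in this post:
# in ~/.bashrc; hadoop-env.sh only needs the JAVA_HOME line
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/opt/hadoop
export PATH=$PATH:$HADOOP_HOME/bin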
Expose Ports
If you want to view the various web interfaces available with Hadoop, expose the related ports in your Dockerfile.
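In Hadoop 2.x, the YARN ResourceManager UI listens on port 8088 by default and the HDFS NameNode UI on port 50070, so a minimal example looks like this:
# ResourceManager web UI
EXPOSE 8088
# NameNode web UI
EXPOSE 50070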
Starting Hadoop
At this point all the pieces should be in place, and Hadoop can be started. The remaining steps are to start the SSH server, format the namenode, run start-dfs.sh, and run start-yarn.sh.
# start ssh server
/etc/init.d/ssh start
# format namenode
$HADOOP_HOME/bin/hdfs namenode -format
# start hadoop
$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh
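Once the daemons are up, jps (which ships with the JDK) should list the Hadoop processes:
# expect NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager
jps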
Sample Dockerfile
This Dockerfile shows an example of installing Hadoop on Ubuntu 16.04 into /opt/hadoop. The start-hadoop.sh script is used to start SSH and Hadoop (contents shown below). The Hadoop and SSH configuration files shown above are copied from the local filesystem using the ADD instruction.
Dockerfile
FROM ubuntu:16.04
# set environment vars
ENV HADOOP_HOME /opt/hadoop
ENV JAVA_HOME /usr/lib/jvm/java-8-openjdk-amd64
# install packages
RUN \
apt-get update && apt-get install -y \
ssh \
rsync \
vim \
wget \
openjdk-8-jdk
# download and extract hadoop, set JAVA_HOME in hadoop-env.sh, update path
RUN \
wget http://apache.mirrors.tds.net/hadoop/common/hadoop-2.8.1/hadoop-2.8.1.tar.gz && \
tar -xzf hadoop-2.8.1.tar.gz && \
mv hadoop-2.8.1 $HADOOP_HOME && \
echo "export JAVA_HOME=$JAVA_HOME" >> $HADOOP_HOME/etc/hadoop/hadoop-env.sh && \
echo "PATH=$PATH:$HADOOP_HOME/bin" >> ~/.bashrc
# create ssh keys
RUN \
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa && \
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys && \
chmod 0600 ~/.ssh/authorized_keys
# copy hadoop configs
ADD configs/*xml $HADOOP_HOME/etc/hadoop/
# copy ssh config
ADD configs/ssh_config /root/.ssh/config
# copy script to start hadoop
ADD start-hadoop.sh start-hadoop.sh
# expose various ports
EXPOSE 8088 50070 50075 50030 50060
# start hadoop
CMD bash start-hadoop.sh
start-hadoop.sh
#!/bin/bash
# start ssh server
/etc/init.d/ssh start
# format namenode
$HADOOP_HOME/bin/hdfs namenode -format
# start hadoop
$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh
$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh start historyserver
# keep container running
tail -f /dev/null
Building the Hadoop Docker Image
Running docker build -t my-hadoop . from the directory containing your Dockerfile will create the my-hadoop Docker image.
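You can confirm the build succeeded by listing the image:
# should show the freshly built my-hadoop image
docker images my-hadoop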
Creating & Running Docker Container
The command docker run -p 8088:8088 --name my-hadoop-container -d my-hadoop can now be used to create a Docker container from this image. The -p option maps port 8088 inside the container to port 8088 on the host machine. The CMD instruction used in the Dockerfile will run start-hadoop.sh by default when the container starts.
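You can watch the namenode being formatted and the daemons starting by following the container logs:
# follow the output of start-hadoop.sh
docker logs -f my-hadoop-container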
Accessing Hadoop in Docker Container
Hadoop should now be running in a Docker container. Below is an example of starting an interactive shell in the Docker container, and running a sample MapReduce job.
# start interactive shell in running container
docker exec -it my-hadoop-container bash
# once shell has started run hadoop "pi" example job
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.1.jar pi 10 100
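You can also run a quick HDFS smoke test from the same shell (this assumes the PATH update from .bashrc is in effect):
# create a home directory in HDFS and list the filesystem root
hdfs dfs -mkdir -p /user/root
hdfs dfs -ls /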
Finally, you can view the ResourceManager web interface at http://localhost:8088.
The original Docker image used in this example can be found at https://github.com/nsonntag/docker-images/.