Cloudera’s Quickstart Image is a fantastic way to get started quickly with the big data ecosystem. With software such as Hadoop, Spark, Hive, Pig, Impala, and Hue already set up, this Docker image is a must-have in your big data toolkit.
One thing the Cloudera Quickstart container lacks, however, is an easy way to set up AWS credentials so that software like Hadoop and Spark can use S3 for job inputs and outputs.
Set AWS Credentials in Dockerfile
The cloudera-quickstart-aws Docker image is a good example of how to set AWS credentials in a Cloudera Quickstart container. This image builds on the Cloudera Quickstart image, sets the AWS credentials, and installs the AWS Command Line Interface.
The cloudera-quickstart-aws image has two major parts: the set-aws-creds.sh shell script and the Dockerfile.
set-aws-creds.sh
The set-aws-creds.sh shell script exports the AWS keys as environment variables in the root user’s .bashrc file (root is the default user in the container). The script also uses sed to add the s3a and s3n credential properties to Hadoop’s core-site.xml configuration file (you don’t necessarily need both the s3a and s3n properties set; s3a alone is enough if your jobs only use s3a:// paths).
#!/bin/bash
# ADD ACTUAL AWS KEYS HERE BEFORE RUNNING SCRIPT/BUILDING DOCKER IMAGE
#######################################################################
AWS_ACCESS_KEY_ID=REPLACE-ME
AWS_SECRET_ACCESS_KEY=REPLACE-ME
###################################################################
# add aws creds to .bashrc
echo "export AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID" >> /root/.bashrc
echo "export AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY" >> /root/.bashrc
# make backup of core-site.xml
mv /etc/hadoop/conf/core-site.xml /etc/hadoop/conf/core-site.xml.bak
# add aws credentials for s3a and s3n to core-site.xml
sed "s#<\/configuration>#<property>\n<name>fs.s3a.awsAccessKeyId<\/name>\n<value>${AWS_ACCESS_KEY_ID}<\/value>\n<\/property>\n<property>\n<name>fs.s3a.awsSecretAccessKey<\/name>\n<value>${AWS_SECRET_ACCESS_KEY}<\/value>\n<\/property>\n<property>\n<name>fs.s3n.awsAccessKeyId<\/name>\n<value>${AWS_ACCESS_KEY_ID}<\/value>\n<\/property>\n<property>\n<name>fs.s3n.awsSecretAccessKey<\/name>\n<value>${AWS_SECRET_ACCESS_KEY}<\/value>\n<\/property>\n<\/configuration>#g" \
  /etc/hadoop/conf/core-site.xml.bak \
  > /etc/hadoop/conf/core-site.xml
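To sanity-check the sed substitution before baking it into an image, you can run the same pattern against a scratch copy of core-site.xml. A minimal sketch, trimmed to just the s3a properties; the keys and the temp path here are placeholders, not real credentials:

```shell
#!/bin/bash
# Placeholder keys for the dry run -- not real credentials
AWS_ACCESS_KEY_ID=TESTKEY
AWS_SECRET_ACCESS_KEY=TESTSECRET

# Scratch copy of a minimal core-site.xml to substitute into
TMP=$(mktemp -d)
cat > "$TMP/core-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
</configuration>
EOF

# Same sed pattern as set-aws-creds.sh (s3a properties only):
# it replaces the closing </configuration> tag with the new
# property blocks followed by a fresh closing tag
sed "s#<\/configuration>#<property>\n<name>fs.s3a.awsAccessKeyId<\/name>\n<value>${AWS_ACCESS_KEY_ID}<\/value>\n<\/property>\n<property>\n<name>fs.s3a.awsSecretAccessKey<\/name>\n<value>${AWS_SECRET_ACCESS_KEY}<\/value>\n<\/property>\n<\/configuration>#g" \
  "$TMP/core-site.xml" > "$TMP/core-site.xml.new"

# Inspect the result
grep "fs.s3a" "$TMP/core-site.xml.new"
```

Note that the `\n` escapes in the replacement string rely on GNU sed, which is what the CentOS-based Quickstart image ships with.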
Dockerfile
The Dockerfile is what is actually used to build the Docker image. The cloudera-quickstart-aws Dockerfile copies the set-aws-creds.sh script shown above into the image and executes it. The Dockerfile also installs the AWS Command Line Interface, which is an extremely useful tool for working with AWS services.
# Dockerfile
# use cloudera quickstart
FROM cloudera/quickstart:latest
# use local script to set aws creds in hadoop and environment
ADD ./set-aws-creds.sh /scripts/
RUN /scripts/set-aws-creds.sh
# install aws cli
RUN cd /tmp && \
curl "https://s3.amazonaws.com/aws-cli/awscli-bundle.zip" -o "awscli-bundle.zip" && \
unzip awscli-bundle.zip && \
./awscli-bundle/install -i /usr/local/aws -b /usr/local/bin/aws
# start services
CMD /usr/bin/docker-quickstart
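One caveat of this approach is that the keys end up baked into the image layers, so anyone with the image can read them. If that is a concern, an alternative sketch is to inject the credentials at run time with Docker's `-e` flags instead of editing the script before building (the keys below are placeholders):

```shell
# Pass credentials as environment variables at container start
# instead of committing them into an image layer.
# REPLACE-ME values are placeholders for your actual keys.
docker run --hostname=quickstart.cloudera --privileged=true -t -i \
  -e AWS_ACCESS_KEY_ID=REPLACE-ME \
  -e AWS_SECRET_ACCESS_KEY=REPLACE-ME \
  cloudera-quickstart-aws:latest
```

With this variant you would still need the core-site.xml properties set some other way (for example, running set-aws-creds.sh inside the container at startup), since Hadoop reads the keys from its configuration file rather than from the shell environment.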
Build Image & Run Container
To understand how to build this image you should read the documentation on the cloudera-quickstart-aws GitHub page. But in short, the commands to build this image locally and run a Docker container are:
#!/bin/bash
# build image
docker build -t cloudera-quickstart-aws:latest .
# run container
docker run --hostname=quickstart.cloudera --privileged=true -t -i cloudera-quickstart-aws:latest
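Once the container is up, you can verify that the credentials took effect from a shell inside it. A sketch, assuming you have an S3 bucket you can read; the bucket name below is a placeholder:

```shell
# Run these inside the running container.

# Confirm the keys were exported via /root/.bashrc
env | grep AWS

# Confirm the AWS CLI was installed and can reach S3
# (your-bucket-name is a placeholder)
aws s3 ls s3://your-bucket-name/

# Confirm Hadoop can list the bucket through the s3a connector
hadoop fs -ls s3a://your-bucket-name/
```

If the `hadoop fs` command fails with an authentication error while the AWS CLI works, the likely culprit is the core-site.xml edit rather than the keys themselves.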