Hadoop
Requisites
- java >= 7 (OpenJDK is ok)
- hadoop bin package == 2.7.1
- ssh server (e.g. openssh-server)
- rsync (the Hadoop documentation declares it as a requisite; a quick check of these requisites is shown below)
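- A quick way to verify these requisites on a Debian-like system (package and service names may differ on your distribution):
java -version                  # should report version 1.7 or later
service ssh status             # the ssh server should be installed and running
rsync --version | head -n 1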
Hadoop as Pseudo-Distributed Environment - Install and config
Creating hadoop user
sudo su -
groupadd --gid 11001 hadoop
mkdir -p /srv/hadoop/ -m 755
useradd --uid 11001 --gid hadoop --no-user-group --shell /bin/bash --create-home --home-dir /srv/hadoop/home hadoop
chown hadoop:hadoop /srv/hadoop
exit
Creating common ssh authentication for hadoop user
sudo su - hadoop
mkdir .ssh
ssh-keygen -t rsa -P "" -f .ssh/id_rsa
cat .ssh/id_rsa.pub >> .ssh/authorized_keys
exit
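- Before continuing, you can check that key-based login works for the hadoop user (the first connection will only ask to accept the host key):
sudo su - hadoop
ssh localhost 'echo ssh ok'    # must not ask for a password
exit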
Installing hadoop binary package
sudo su - hadoop
cd /srv/hadoop
wget -c http://www.us.apache.org/dist/hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz
tar -xzf hadoop-2.7.1.tar.gz
ln -s hadoop-2.7.1 hadoop-current
cat >> ~/.profile << 'EOF'
# Set the Hadoop Related Environment variables
export HADOOP_HOME=/srv/hadoop/hadoop-current
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
# Set the JAVA_HOME
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
EOF
exit
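- A quick sanity check of the installation and of the variables added to ~/.profile (a new login shell is needed so that ~/.profile is read):
sudo su - hadoop
hadoop version                 # should print "Hadoop 2.7.1"
echo $JAVA_HOME                # should print the JDK path set above
exit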
Configuring hadoop daemons
sudo su - hadoop
#
cp -a $HADOOP_HOME/etc/hadoop/hadoop-env.sh $HADOOP_HOME/etc/hadoop/hadoop-env.sh.old
cat $HADOOP_HOME/etc/hadoop/hadoop-env.sh.old | sed 's|export JAVA_HOME=${JAVA_HOME}|export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64|g' > $HADOOP_HOME/etc/hadoop/hadoop-env.sh
rm $HADOOP_HOME/etc/hadoop/hadoop-env.sh.old
#
# change log size
nano $HADOOP_HOME/etc/hadoop/log4j.properties
# in section Rolling File Appender set
# hadoop.log.maxfilesize=10MB
# hadoop.log.maxbackupindex=100
#
# from https://support.pivotal.io/hc/en-us/articles/202296718-How-to-change-Hadoop-daemon-log4j-properties
echo >> $HADOOP_HOME/etc/hadoop/hadoop-env.sh
echo "# configure log to console and RFA file" >> $HADOOP_HOME/etc/hadoop/hadoop-env.sh
echo "export HADOOP_ROOT_LOGGER=\"INFO,console,RFA\"" >> $HADOOP_HOME/etc/hadoop/hadoop-env.sh
echo >> $HADOOP_HOME/etc/hadoop/mapred-env.sh
echo "# configure log to console and RFA file" >> $HADOOP_HOME/etc/hadoop/mapred-env.sh
echo "export HADOOP_MAPRED_ROOT_LOGGER=\"INFO,console,RFA\"" >> $HADOOP_HOME/etc/hadoop/mapred-env.sh
echo >> $HADOOP_HOME/etc/hadoop/yarn-env.sh
echo "# configure log to console and RFA file" >> $HADOOP_HOME/etc/hadoop/yarn-env.sh
echo "export YARN_ROOT_LOGGER=\"INFO,console,RFA\"" >> $HADOOP_HOME/etc/hadoop/yarn-env.sh
#
cat > $HADOOP_HOME/etc/hadoop/core-site.xml << EOF
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/srv/hadoop/tmp_data</value>
</property>
</configuration>
EOF
#
cp -a $HADOOP_HOME/etc/hadoop/mapred-site.xml.template $HADOOP_HOME/etc/hadoop/mapred-site.xml
cat > $HADOOP_HOME/etc/hadoop/mapred-site.xml << EOF
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
EOF
#
cat > $HADOOP_HOME/etc/hadoop/hdfs-site.xml << EOF
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
EOF
#
cat > $HADOOP_HOME/etc/hadoop/yarn-site.xml << EOF
<?xml version="1.0"?>
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>localhost:8025</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>localhost:8030</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>localhost:8050</value>
</property>
</configuration>
EOF
#
exit
Manual startup
sudo su - hadoop
hdfs namenode -format   # format HDFS, only the first time
start-dfs.sh
start-yarn.sh
exit
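- After the startup you can check that all daemons are running with the jps tool shipped with the JDK; for this pseudo-distributed setup the expected processes are NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager:
sudo su - hadoop
jps
exit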
Hadoop as Cluster - Install and config
Define hostnames on all hosts, if not already well defined (Master and Slaves)
- change /etc/hostname
- run the following command to apply the new hostname
hostname `cat /etc/hostname`
Define hadoop node IPs and names on all hosts (Master and Workers)
- NOTE: configuring Hadoop with IP addresses instead of hostnames does not appear to work, so we will map the IPs in /etc/hosts and then use the hostnames
- http://stackoverflow.com/questions/28381807/hadoop-slaves-file-regard-ip-as-hostname
- https://raseshmori.wordpress.com/2012/10/14/install-hadoop-nextgen-yarn-multi-node-cluster/ (see item 15, Possible errors)
- http://wiki.apache.org/hadoop/UnknownHost
- However, you may encounter several communication problems. A typical problem is an /etc/hosts file like
127.0.0.1 localhost
127.0.1.1 myhostname
10.1.1.5 myhostname
If you use myhostname in the configuration above, the master server appears to listen only on 127.0.1.1! So we will add specific hadoop hostnames.
- additional warning: do not use the '_' character in hostnames!
- Add something like the following example at the end of /etc/hosts (if not already configured), where MASTER_HOSTNAME and NODE_HOSTNAME_* are the real names used in /etc/hostname:
# Hadoop needs the following specific config
10.1.1.120 MASTER_HOSTNAME hadoopmaster
10.1.1.121 NODE_HOSTNAME_1 hadoopnode1
10.1.1.122 NODE_HOSTNAME_2 hadoopnode2
...
10.1.1.12N NODE_HOSTNAME_N hadoopnodeN
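- Before going on, you can verify the name resolution on every host, for example (hostnames as configured above):
getent hosts hadoopmaster hadoopnode1    # must print the 10.1.1.x addresses, not 127.0.x.x
ping -c 1 hadoopmaster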
Commands executed at Master host
Creating hadoop user
sudo groupadd --gid 11001 hadoop
sudo mkdir -p /srv/hadoop -m 755
sudo useradd --uid 11001 --gid hadoop --no-user-group --shell /bin/bash --create-home --home-dir /srv/hadoop/home hadoop
Creating common ssh authentication for hadoop user
- NOTE: Hadoop uses ssh to communicate, so a key pair without a password is needed
sudo su - hadoop
mkdir .ssh
ssh-keygen -t rsa -P "" -f .ssh/id_rsa        # create the key (the same key will be used on all nodes)
cat .ssh/id_rsa.pub >> .ssh/authorized_keys   # authorize the key on this host (the authorization will be shared with all nodes)
tar -czf ssh.tgz .ssh
exit
- copy the file to all nodes
scp /srv/hadoop/home/ssh.tgz NODE_HOSTNAME_1:/tmp   # copy to node 1
scp /srv/hadoop/home/ssh.tgz NODE_HOSTNAME_2:/tmp   # copy to node 2
...                                                 # ...
scp /srv/hadoop/home/ssh.tgz NODE_HOSTNAME_N:/tmp   # copy to node N
sudo rm /srv/hadoop/home/ssh.tgz                    # remove the file
Commands executed at any Slave host
Creating hadoop user
sudo groupadd --gid 11001 hadoop
sudo mkdir -p /srv/hadoop -m 755
sudo useradd --uid 11001 --gid hadoop --no-user-group --shell /bin/bash --create-home --home-dir /srv/hadoop/home hadoop
Setting ssh and environment configs
sudo chown -R hadoop:hadoop /tmp/ssh.tgz
sudo mv /tmp/ssh.tgz /srv/hadoop/home
sudo su - hadoop
tar -xzf ssh.tgz
rm -f ssh.tgz
exit
Commands executed at Master host
Configuring common hadoop env variables
- as the hadoop user, add the following variables in ~/.profile
sudo su - hadoop
nano .profile
...
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64 # or /usr/lib/jvm/java-7-oracle
export HADOOP_PREFIX=/srv/hadoop
export HADOOP_HOME=$HADOOP_PREFIX
export HADOOP_MAPRED_HOME=$HADOOP_PREFIX
export HADOOP_COMMON_HOME=$HADOOP_PREFIX
export HADOOP_HDFS_HOME=$HADOOP_PREFIX
export HADOOP_CONF_DIR=${HADOOP_PREFIX}"/etc/hadoop"
export HADOOP_YARN_HOME=$HADOOP_PREFIX
export PATH=$PATH:$HADOOP_PREFIX/bin
export PATH=$PATH:$HADOOP_PREFIX/sbin
...
exit
- copy the file to all nodes
sudo su - hadoop
scp .profile NODE_HOSTNAME_1:   # copy to node 1
scp .profile NODE_HOSTNAME_2:   # copy to node 2
...                             # ...
scp .profile NODE_HOSTNAME_N:   # copy to node N
exit
Installing hadoop binary package
sudo su - hadoop
wget -c http://www.us.apache.org/dist/hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz
exit
cd /srv/hadoop/home/
sudo tar -xzf hadoop-2.7.1.tar.gz -C /srv/
sudo mv /srv/hadoop-2.7.1/* /srv/hadoop/
sudo rmdir /srv/hadoop-2.7.1/
sudo chown -R hadoop:hadoop /srv/hadoop/
Copy the binary package to all Worker nodes
sudo su - hadoop
scp hadoop-2.7.1.tar.gz NODE_HOSTNAME_1:   # copy to node 1
scp hadoop-2.7.1.tar.gz NODE_HOSTNAME_2:   # copy to node 2
...                                        # ...
scp hadoop-2.7.1.tar.gz NODE_HOSTNAME_N:   # copy to node N
exit
Configuring hadoop servers
Change default JAVA_HOME and HADOOP_PREFIX in /srv/hadoop/etc/hadoop/hadoop-env.sh
# set to the root of your Java installation
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64 # or /usr/lib/jvm/java-7-oracle
# Assuming your installation directory is /srv/hadoop
export HADOOP_PREFIX=/srv/hadoop
export HADOOP_CONF_DIR=${HADOOP_PREFIX}"/etc/hadoop"
Create Master and Slave directories
sudo su - hadoop
mkdir -pv /srv/hadoop/data/namenode
mkdir -pv /srv/hadoop/data/datanode
mkdir -pv /srv/hadoop/logs
exit
Configure logger
- Change the log size
nano /srv/hadoop/etc/hadoop/log4j.properties
# in section Rolling File Appender set
# hadoop.log.maxfilesize=10MB
# hadoop.log.maxbackupindex=100
- Change the log type
# from https://support.pivotal.io/hc/en-us/articles/202296718-How-to-change-Hadoop-daemon-log4j-properties
echo >> /srv/hadoop/etc/hadoop/hadoop-env.sh
echo "# configure log to console and RFA file" >> /srv/hadoop/etc/hadoop/hadoop-env.sh
echo "export HADOOP_ROOT_LOGGER=\"INFO,console,RFA\"" >> /srv/hadoop/etc/hadoop/hadoop-env.sh
echo >> /srv/hadoop/etc/hadoop/mapred-env.sh
echo "# configure log to console and RFA file" >> /srv/hadoop/etc/hadoop/mapred-env.sh
echo "export HADOOP_MAPRED_ROOT_LOGGER=\"INFO,console,RFA\"" >> /srv/hadoop/etc/hadoop/mapred-env.sh
echo >> /srv/hadoop/etc/hadoop/yarn-env.sh
echo "# configure log to console and RFA file" >> /srv/hadoop/etc/hadoop/yarn-env.sh
echo "export YARN_ROOT_LOGGER=\"INFO,console,RFA\"" >> /srv/hadoop/etc/hadoop/yarn-env.sh
Define Master configuration tags /srv/hadoop/etc/hadoop/core-site.xml
<property>
<name>fs.defaultFS</name>
<value>hdfs://hadoopmaster:9000</value>
<description>NameNode URI</description>
</property>
Define Common hosts configuration tags /srv/hadoop/etc/hadoop/hdfs-site.xml
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///srv/hadoop/data/namenode</value>
<description>NameNode directory for namespace and transaction logs storage.</description>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///srv/hadoop/data/datanode</value>
<description>DataNode directory</description>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
<property>
<name>dfs.datanode.use.datanode.hostname</name>
<value>false</value>
</property>
<property>
<name>dfs.namenode.datanode.registration.ip-hostname-check</name>
<value>false</value>
</property>
Define the Slave host names in /srv/hadoop/etc/hadoop/slaves
NODE_HOSTNAME_1
NODE_HOSTNAME_2
...
NODE_HOSTNAME_N
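- For example, for a two-worker cluster the file could be written like this (using the hostnames defined in /etc/hosts above):
sudo su - hadoop
cat > /srv/hadoop/etc/hadoop/slaves << 'EOF'
NODE_HOSTNAME_1
NODE_HOSTNAME_2
EOF
exit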
Define MapReduce configuration tags in /srv/hadoop/etc/hadoop/mapred-site.xml
sudo su - hadoop
cp /srv/hadoop/etc/hadoop/mapred-site.xml.template /srv/hadoop/etc/hadoop/mapred-site.xml
exit
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapred.task.timeout</name>
<value>0</value>
</property>
<property>
<name>mapreduce.map.memory.mb</name>
<value>1536</value>
</property>
<property>
<name>mapreduce.map.java.opts</name>
<value>-Xmx1024M</value>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>3072</value>
</property>
<property>
<name>mapreduce.reduce.java.opts</name>
<value>-Xmx2560M</value>
</property>
Define YARN scheduler configuration tags /srv/hadoop/etc/hadoop/yarn-site.xml
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.nodemanager.log.retain-seconds</name>
<value>259200</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>hadoopmaster:8030</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>hadoopmaster:8031</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>hadoopmaster:8032</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>4096</value>
</property>
sudo su - hadoop
scp -r ../etc/hadoop NODE_HOSTNAME_1:etc_hadoop   # copy to node 1
scp -r ../etc/hadoop NODE_HOSTNAME_2:etc_hadoop   # copy to node 2
...                                               # ...
scp -r ../etc/hadoop NODE_HOSTNAME_N:etc_hadoop   # copy to node N
exit
Commands executed at any Worker host
Install Hadoop and apply configs
sudo tar -xzf /srv/hadoop/home/hadoop-2.7.1.tar.gz -C /srv/
sudo mv /srv/hadoop-2.7.1/* /srv/hadoop/
sudo rmdir /srv/hadoop-2.7.1/
sudo rm -rf /srv/hadoop/etc/hadoop/
sudo chown -R hadoop:hadoop /srv/hadoop/
sudo su - hadoop
mv etc_hadoop ../etc/hadoop
exit
Starting/Stopping Hadoop
- You can start Hadoop manually or configure it as a service started at boot
Option 1: Manual startup
Start
Setting up hadoop from Master host
sudo su - hadoop
# Format a new distributed filesystem (execute only the first time)
hdfs namenode -format
#
# Start the HDFS with the following command, run on the designated NameNode:
hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs start namenode
#
# Start the YARN with the following command, run on the designated ResourceManager:
yarn-daemon.sh --config $HADOOP_CONF_DIR start resourcemanager
exit
Setting up Slave hosts (this must be executed on all node hosts)
sudo su - hadoop
# Run a script to start DataNodes on all slaves:
hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs start datanode
#
# Run a script to start NodeManagers on all slaves:
yarn-daemon.sh --config $HADOOP_CONF_DIR start nodemanager
exit
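- Once the Master and Worker daemons are started, a quick check from the Master host that the workers registered correctly:
sudo su - hadoop
hdfs dfsadmin -report    # should list one live datanode per worker
yarn node -list          # should list one nodemanager per worker
exit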
Stop
When needed, stop hadoop from Master host
sudo su - hadoop
# Stop the NameNode with the following command, run on the designated NameNode:
hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs stop namenode
#
# Stop the ResourceManager with the following command, run on the designated ResourceManager:
yarn-daemon.sh --config $HADOOP_CONF_DIR stop resourcemanager
#
# this can solve some problems
#rm -rf /srv/hadoop/data/namenode/current/
#rm -rf /srv/hadoop/logs/*
exit
And stop the Hadoop services on all node hosts
sudo su - hadoop
# Run a script to stop DataNodes on all slaves:
hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs stop datanode
#
# Run a script to stop NodeManagers on all slaves:
yarn-daemon.sh --config $HADOOP_CONF_DIR stop nodemanager
#
# this can solve some problems
#rm -rf /srv/hadoop/data/datanode/current/
#rm -rf /srv/hadoop/logs/*
exit
Option 2: Automatic boot startup
On Master host
- Create the script /etc/init.d/hadoop_master.sh at Master host
#!/bin/bash
#
# hadoop_master.sh
# version 1.0
# from http://www.campisano.org/wiki/en/Script_chroot.sh
### BEGIN INIT INFO
# Provides: hadoop_master.sh
# Required-Start: $remote_fs $syslog $time
# Required-Stop: $remote_fs $syslog $time
# Should-Start: $network
# Should-Stop: $network
# Default-Start: 2 3 4 5
# Default-Stop: 0 1 6
# Short-Description: Hadoop Master Node
# Description: Hadoop Master Node
### END INIT INFO
PATH=/sbin:/bin:/usr/sbin:/usr/bin;
OWNER[1]="hadoop";
MSG[1]="Hadoop (Master) namenode daemon ...";
START_CMD[1]="\$HADOOP_HOME/sbin/hadoop-daemon.sh --config \$HADOOP_CONF_DIR --script hdfs start namenode";
STOP_CMD[1]="\$HADOOP_HOME/sbin/hadoop-daemon.sh --config \$HADOOP_CONF_DIR --script hdfs stop namenode";
OWNER[2]="hadoop";
MSG[2]="Hadoop (Master) yarn resourcemanager daemon ...";
START_CMD[2]="\$HADOOP_HOME/sbin/yarn-daemon.sh --config \$HADOOP_CONF_DIR start resourcemanager";
STOP_CMD[2]="\$HADOOP_HOME/sbin/yarn-daemon.sh --config \$HADOOP_CONF_DIR stop resourcemanager";
function start()
{
N=1;
while test -n "${OWNER[$N]}"; do
echo -ne "Starting ${MSG[$N]} ...";
su "${OWNER[$N]}" -l -c "${START_CMD[$N]}";
echo "Done.";
N=$(($N+1));
done;
}
function stop()
{
N=${#OWNER[@]};
while test -n "${OWNER[$N]}"; do
echo -ne "Stopping ${MSG[$N]} ...";
su "${OWNER[$N]}" -l -c "${STOP_CMD[$N]}" || break;
echo "Done.";
N=$(($N-1));
done;
}
function usage()
{
echo "Usage: $0 {start|stop}"
exit 0
}
case "$1" in
start)
start
;;
stop)
stop
;;
*)
usage
esac;
# End
- Configure it as a startup script (Debian example)
# remember to do this the first time: hdfs namenode -format
sudo chmod 755 /etc/init.d/hadoop_master.sh
sudo update-rc.d hadoop_master.sh defaults
# Format a new distributed filesystem (execute only the first time)
sudo su - hadoop
hdfs namenode -format
exit
sudo service hadoop_master.sh start
On all slave hosts
- Create the script /etc/init.d/hadoop_slave.sh at all slave hosts
#!/bin/bash
#
# hadoop_slave.sh
# version 1.0
# from http://www.campisano.org/wiki/en/Script_chroot.sh
### BEGIN INIT INFO
# Provides: hadoop_slave.sh
# Required-Start: $remote_fs $syslog $time
# Required-Stop: $remote_fs $syslog $time
# Should-Start: $network
# Should-Stop: $network
# Default-Start: 2 3 4 5
# Default-Stop: 0 1 6
# Short-Description: Hadoop Slave Node
# Description: Hadoop Slave Node
### END INIT INFO
PATH=/sbin:/bin:/usr/sbin:/usr/bin;
OWNER[1]="hadoop";
MSG[1]="Hadoop (Slave) datanode daemon ...";
START_CMD[1]="\$HADOOP_HOME/sbin/hadoop-daemon.sh --config \$HADOOP_CONF_DIR --script hdfs start datanode";
STOP_CMD[1]="\$HADOOP_HOME/sbin/hadoop-daemon.sh --config \$HADOOP_CONF_DIR --script hdfs stop datanode";
OWNER[2]="hadoop";
MSG[2]="Hadoop (Slave) yarn nodemanager daemon ...";
START_CMD[2]="\$HADOOP_HOME/sbin/yarn-daemon.sh --config \$HADOOP_CONF_DIR start nodemanager";
STOP_CMD[2]="\$HADOOP_HOME/sbin/yarn-daemon.sh --config \$HADOOP_CONF_DIR stop nodemanager";
function start()
{
N=1;
while test -n "${OWNER[$N]}"; do
echo -ne "Starting ${MSG[$N]} ...";
su "${OWNER[$N]}" -l -c "${START_CMD[$N]}";
echo "Done.";
N=$(($N+1));
done;
}
function stop()
{
N=${#OWNER[@]};
while test -n "${OWNER[$N]}"; do
echo -ne "Stopping ${MSG[$N]} ...";
su "${OWNER[$N]}" -l -c "${STOP_CMD[$N]}" || break;
echo "Done.";
N=$(($N-1));
done;
}
function usage()
{
echo "Usage: $0 {start|stop}"
exit 0
}
case "$1" in
start)
start
;;
stop)
stop
;;
*)
usage
esac;
# End
- Configure as startup script
sudo chmod 755 /etc/init.d/hadoop_slave.sh
sudo update-rc.d hadoop_slave.sh defaults
sudo service hadoop_slave.sh start
Hadoop WEB interface
- The web interface usually starts on the Master node, on port 8088: http://MASTER_HOST:8088
- If you have any problem accessing this port (e.g. a firewall block), you can tunnel it to your local computer with the following command:
ssh -C -N -L 8088:localhost:8088 -p MASTER_SSH_PORT root@MASTER_HOST
- now you can access the Hadoop interface by opening http://localhost:8088
NOTE: the HDFS interface usually runs on http://MASTER_HOST:50070 (a quick check of both interfaces is shown below)
- Hadoop ports are described here http://blog.cloudera.com/blog/2009/08/hadoop-default-ports-quick-reference/
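- Assuming curl is installed, a quick check from the command line that both interfaces answer (any 2xx or 3xx HTTP code is fine):
curl -s -o /dev/null -w 'resourcemanager: %{http_code}\n' http://MASTER_HOST:8088/
curl -s -o /dev/null -w 'namenode: %{http_code}\n' http://MASTER_HOST:50070/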
Run
First ready-made example
sudo su - hadoop
hdfs namenode -format   # only if HDFS has not been formatted yet
hdfs dfs -mkdir -p /user/hadoop
hdfs dfs -put /srv/hadoop/etc/hadoop /user/hadoop/input
hadoop jar /srv/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar grep input output 'dfs[a-z.]+'
exit
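- The job writes its result into the output directory under /user/hadoop; you can inspect it and clean up before re-running:
sudo su - hadoop
hdfs dfs -cat output/*             # show the matched 'dfs[a-z.]+' words and their counts
hdfs dfs -rm -r -f input output    # optional cleanup, needed before running the example again
exit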
First Java example
sudo su - hadoop
mkdir java_mapreduce_example
cd java_mapreduce_example
mkdir txt_books
wget http://www.gutenberg.org/ebooks/20417.txt.utf-8 -P txt_books
wget http://www.gutenberg.org/files/5000/5000-8.txt -P txt_books
wget http://www.gutenberg.org/ebooks/4300.txt.utf-8 -P txt_books
hdfs dfs -mkdir -p /user/hadoop/java_mapreduce_example/txt_books
hdfs dfs -put txt_books/* /user/hadoop/java_mapreduce_example/txt_books
# create the file WordCount.java (the WordCount example from the official Apache Hadoop MapReduce tutorial)
export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar
hadoop com.sun.tools.javac.Main WordCount.java
jar cf wc.jar WordCount*.class
hadoop jar wc.jar WordCount /user/hadoop/java_mapreduce_example/txt_books /user/hadoop/java_mapreduce_example/output
exit
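- Assuming WordCount.java is the WordCount class from the official MapReduce tutorial, the job writes one part-r-* file per reducer; to inspect the result:
sudo su - hadoop
hdfs dfs -ls /user/hadoop/java_mapreduce_example/output
hdfs dfs -cat /user/hadoop/java_mapreduce_example/output/part-r-00000 | head -n 20
exit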
Creating a python example
sudo su - hadoop
mkdir python_mapreduce_example
cd python_mapreduce_example
- create file mapper.py, mode 755
#!/usr/bin/env python
#
# from http://www.michael-noll.com/tutorials/\
# writing-an-hadoop-mapreduce-program-in-python/
import sys
# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print '%s\t%s' % (word, 1)
- create file reducer.py, mode 755
#!/usr/bin/env python
#
# from http://www.michael-noll.com/tutorials/\
# writing-an-hadoop-mapreduce-program-in-python/
import sys
current_word = None
current_count = 0
word = None
# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)
    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        continue
    # this IF-switch only works because Hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result to STDOUT
            print '%s\t%s' % (current_word, current_count)
        current_count = count
        current_word = word

# do not forget to output the last word if needed!
if current_word == word:
    print '%s\t%s' % (current_word, current_count)
- getting input data
mkdir txt_books
wget http://www.gutenberg.org/ebooks/20417.txt.utf-8 -P txt_books
wget http://www.gutenberg.org/files/5000/5000-8.txt -P txt_books
wget http://www.gutenberg.org/ebooks/4300.txt.utf-8 -P txt_books
- putting data into hdfs, run and cleanup
hdfs dfs -mkdir -p /user/hadoop/python_mapreduce_example/txt_books
hdfs dfs -put txt_books/* /user/hadoop/python_mapreduce_example/txt_books
hadoop jar ~/../share/hadoop/tools/lib/hadoop-streaming-2.7.1.jar \
-files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py \
-input /user/hadoop/python_mapreduce_example/txt_books \
-output /user/hadoop/python_mapreduce_example/output
hadoop fs -copyToLocal /user/hadoop/python_mapreduce_example/output
hadoop fs -rm -r -f /user/hadoop/python_mapreduce_example
exit
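- The output copied locally contains one part-* file per reducer; for example, to see the most frequent words:
sudo su - hadoop
cd python_mapreduce_example
sort -k2,2 -n -r output/part-* | head -n 20
exit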
Creating the R equivalent example
- from previous example and https://weblogs.java.net/blog/manningpubs/archive/2012/10/10/r-and-streaming-hadoop-practice
sudo su - hadoop
mkdir r_mapreduce_example
cd r_mapreduce_example
- create file mapper.r, mode 755
#!/usr/bin/env Rscript
options(warn=-1) # disables warnings
sink("/dev/null") # redirect all R output to /dev/null
# input comes from STDIN (standard input)
input <- file("stdin", "r")
while (length(line <- readLines(input, n=1, warn=FALSE)) > 0) {
    # remove leading and trailing whitespace
    line <- gsub("^\\s+|\\s+$", "", line)
    # split the line into words
    words <- strsplit(line, "[ \t\n\r\f]")[[1]]
    words <- words[! words == ""]
    # increase counters
    for (word in words) {
        sink()              # reenable normal output
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.r
        #
        # tab-delimited; the trivial word count is 1
        cat(word, 1, "\n", sep="\t")
        sink("/dev/null")   # redirect all R output to /dev/null
    }
}
close(input)
- create file reducer.r, mode 755
#!/usr/bin/env Rscript
options(warn=-1) # disables warnings
sink("/dev/null") # redirect all R output to /dev/null
# input comes from STDIN (standard input)
input <- file("stdin", "r")
current_word <- NULL
current_count <- 0
word <- NULL
while (length(line <- readLines(input, n=1, warn=FALSE)) > 0) {
    # remove leading and trailing whitespace
    line <- gsub("^\\s+|\\s+$", "", line)
    # parse the tab-separated pair produced by mapper.r
    tuples <- strsplit(line, "\t", fixed=TRUE)[[1]]
    if (length(tuples) != 2 || is.na(suppressWarnings(as.integer(tuples[2])))) {
        next
    }
    word <- tuples[1]
    count <- as.integer(tuples[2])
    if ((!is.null(current_word)) && current_word == word) {
        current_count <- current_count + count
    } else {
        if (!is.null(current_word)) {
            # write result to STDOUT
            sink()              # reenable normal output
            cat(current_word, current_count, "\n", sep="\t")
            sink("/dev/null")   # redirect all R output to /dev/null
        }
        current_count <- count
        current_word <- word
    }
}

# do not forget to output the last word if needed!
if ((! is.null(current_word)) && current_word == word) {
    sink()              # reenable normal output
    cat(current_word, current_count, "\n", sep="\t")
    sink("/dev/null")   # redirect all R output to /dev/null
}
- getting input data
mkdir txt_books
wget http://www.gutenberg.org/ebooks/20417.txt.utf-8 -P txt_books
wget http://www.gutenberg.org/files/5000/5000-8.txt -P txt_books
wget http://www.gutenberg.org/ebooks/4300.txt.utf-8 -P txt_books
- putting data into hdfs, run and cleanup
hdfs dfs -mkdir -p /user/hadoop/r_mapreduce_example/txt_books
hdfs dfs -put txt_books/* /user/hadoop/r_mapreduce_example/txt_books
hadoop jar ~/../share/hadoop/tools/lib/hadoop-streaming-2.7.1.jar \
-files mapper.r,reducer.r -mapper mapper.r -reducer reducer.r \
-input /user/hadoop/r_mapreduce_example/txt_books \
-output /user/hadoop/r_mapreduce_example/output
hadoop fs -copyToLocal /user/hadoop/r_mapreduce_example/output
hadoop fs -rm -r -f /user/hadoop/r_mapreduce_example
exit
References
- http://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-common/SingleCluster.html
- http://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-common/ClusterSetup.html
- https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
- https://hadoop.apache.org/docs/r2.7.1/hadoop-yarn/hadoop-yarn-common/yarn-default.xml
- http://disi.unitn.it/~lissandrini/notes/installing-hadoop-on-ubuntu-14.html
- https://districtdatalabs.silvrback.com/creating-a-hadoop-pseudo-distributed-environment
- http://ksat.me/map-reduce-a-really-simple-introduction-kloudo/ (a really nice example to explain the MapReduce concept)
- an updated article: https://dzone.com/articles/install-a-hadoop-cluster-on-ubuntu-18041