Hadoop
Requisites
- java >= 7 (OpenJDK is ok)
- hadoop binary package == 2.7.1
- ssh server (e.g. openssh-server)
- rsync (the Hadoop documentation declares it as a requisite)
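- On a Debian/Ubuntu host the requisites can be installed roughly as follows (a minimal sketch; the package names, e.g. openjdk-7-jdk, are assumptions and may differ on your distribution):

sudo apt-get update
sudo apt-get install -y openjdk-7-jdk openssh-server rsync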
 
Hadoop as Pseudo-Distributed Environment - Install and config
Creating hadoop user
sudo su -
groupadd --gid 11001 hadoop
mkdir -p /srv/hadoop/ -m 755
useradd --uid 11001 --gid hadoop --no-user-group --shell /bin/bash --create-home --home-dir /srv/hadoop/home hadoop
chown hadoop:hadoop /srv/hadoop
exit
Creating common ssh authentication for hadoop user
sudo su - hadoop
mkdir .ssh
ssh-keygen -t rsa -P "" -f .ssh/id_rsa
cat .ssh/id_rsa.pub >> .ssh/authorized_keys
exit
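- Optionally, verify that passwordless SSH to localhost works (a sanity check added here; the first connection will ask you to accept the host key):

sudo su - hadoop
ssh localhost 'echo ssh ok'    # must not ask for a password
exit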
Installing hadoop binary package
sudo su - hadoop
cd /srv/hadoop
wget -c http://www.us.apache.org/dist/hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz
tar -xzf hadoop-2.7.1.tar.gz
ln -s hadoop-2.7.1 hadoop-current
cat >> ~/.profile << 'EOF'
# Set the Hadoop related environment variables
export HADOOP_HOME=/srv/hadoop/hadoop-current
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
# Set the JAVA_HOME
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
EOF
exit
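- To verify that the environment variables are picked up, re-enter the hadoop user's login shell and print the Hadoop version (a small sanity check, not part of the original procedure):

sudo su - hadoop
hadoop version    # should report Hadoop 2.7.1
exit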
Configuring hadoop daemons
sudo su - hadoop
#
cp -a $HADOOP_HOME/etc/hadoop/hadoop-env.sh $HADOOP_HOME/etc/hadoop/hadoop-env.sh.old
cat $HADOOP_HOME/etc/hadoop/hadoop-env.sh.old | sed 's|export JAVA_HOME=${JAVA_HOME}|export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64|g' > $HADOOP_HOME/etc/hadoop/hadoop-env.sh
rm $HADOOP_HOME/etc/hadoop/hadoop-env.sh.old
#
# change log size
nano $HADOOP_HOME/etc/hadoop/log4j.properties
# in section Rolling File Appender set
# hadoop.log.maxfilesize=10MB
# hadoop.log.maxbackupindex=100
#
# from https://support.pivotal.io/hc/en-us/articles/202296718-How-to-change-Hadoop-daemon-log4j-properties
echo >> $HADOOP_HOME/etc/hadoop/hadoop-env.sh
echo "# configure log to console and RFA file" >> $HADOOP_HOME/etc/hadoop/hadoop-env.sh
echo "export HADOOP_ROOT_LOGGER=\"INFO,console,RFA\"" >> $HADOOP_HOME/etc/hadoop/hadoop-env.sh
echo >> $HADOOP_HOME/etc/hadoop/mapred-env.sh
echo "# configure log to console and RFA file" >> $HADOOP_HOME/etc/hadoop/mapred-env.sh
echo "export HADOOP_MAPRED_ROOT_LOGGER=\"INFO,console,RFA\"" >> $HADOOP_HOME/etc/hadoop/mapred-env.sh
echo >> $HADOOP_HOME/etc/hadoop/yarn-env.sh
echo "# configure log to console and RFA file" >> $HADOOP_HOME/etc/hadoop/yarn-env.sh
echo "export YARN_ROOT_LOGGER=\"INFO,console,RFA\"" >> $HADOOP_HOME/etc/hadoop/yarn-env.sh
#
cat > $HADOOP_HOME/etc/hadoop/core-site.xml << EOF
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:9000</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/srv/hadoop/tmp_data</value>
    </property>
</configuration>
EOF
#
cp -a $HADOOP_HOME/etc/hadoop/mapred-site.xml.template $HADOOP_HOME/etc/hadoop/mapred-site.xml
cat > $HADOOP_HOME/etc/hadoop/mapred-site.xml << EOF
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
EOF
#
cat > $HADOOP_HOME/etc/hadoop/hdfs-site.xml << EOF
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
EOF
#
cat > $HADOOP_HOME/etc/hadoop/yarn-site.xml << EOF
<?xml version="1.0"?>
<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
    <property>
        <name>yarn.resourcemanager.resource-tracker.address</name>
        <value>localhost:8025</value>
    </property>
    <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>localhost:8030</value>
    </property>
    <property>
        <name>yarn.resourcemanager.address</name>
        <value>localhost:8050</value>
    </property>
</configuration>
EOF
#
exit
Manual startup
sudo su - hadoop
hadoop namenode -format    # format HDFS, only first time
start-dfs.sh
start-yarn.sh
exit
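- After the start scripts finish, you can list the running Java daemons with jps (shipped with the JDK); in this pseudo-distributed setup you would expect NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager, roughly as follows:

sudo su - hadoop
jps                      # list the Hadoop JVM processes
hdfs dfsadmin -report    # summary of the HDFS datanodes
exit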
Hadoop as Cluster - Install and config
Define hostnames on all hosts if they are not already properly set (Master and slaves)
- change /etc/hostname
 
- run the following command to apply the new hostname
 
hostname `cat /etc/hostname`
Define the Hadoop nodes' IPs and names on all hosts (Master and Workers)
- NOTE: configuring Hadoop with raw IP addresses instead of hostnames does not appear to work, so we will map the IPs in /etc/hosts and then use hostnames
 
- http://stackoverflow.com/questions/28381807/hadoop-slaves-file-regard-ip-as-hostname
- https://raseshmori.wordpress.com/2012/10/14/install-hadoop-nextgen-yarn-multi-node-cluster/ (see section 15, "Possible errors")
- http://wiki.apache.org/hadoop/UnknownHost
- However, you may encounter communication problems. A typical problem is an /etc/hosts configuration like:
 
127.0.0.1 localhost
127.0.1.1 myhostname
10.1.1.5 myhostname
If you use myhostname in the configuration above, the master server appears to listen only on 127.0.1.1!
So we will add specific Hadoop hostnames.
- additional warning: do not use the '_' character in hostnames!
 
- Add something like the following example at the end of /etc/hosts (if not already configured), where MASTER_HOSTNAME and NODE_HOSTNAME_* are the real names used in /etc/hostname:
 
# Hadoop needs the following specific config
10.1.1.120 MASTER_HOSTNAME hadoopmaster
10.1.1.121 NODE_HOSTNAME_1 hadoopnode1
10.1.1.122 NODE_HOSTNAME_2 hadoopnode2
...
10.1.1.12N NODE_HOSTNAME_N hadoopnodeN
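- A quick way to confirm that the aliases resolve to the intended addresses on every host (hadoopmaster and hadoopnode1 are the alias names used in the example above):

getent hosts hadoopmaster hadoopnode1    # should print the 10.1.1.x addresses
ping -c 1 hadoopmaster                   # should answer from 10.1.1.120, not 127.0.1.1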
Commands executed at Master host
Creating hadoop user
sudo groupadd --gid 11001 hadoop
sudo mkdir -p /srv/hadoop -m 755
sudo useradd --uid 11001 --gid hadoop --no-user-group --shell /bin/bash --create-home --home-dir /srv/hadoop/home hadoop
Creating common ssh authentication for hadoop user
- NOTE: Hadoop uses SSH to communicate, so a passwordless key pair is needed
 
sudo su - hadoop
mkdir .ssh
ssh-keygen -t rsa -P "" -f .ssh/id_rsa         # create the key (the same key will be used on all nodes)
cat .ssh/id_rsa.pub >> .ssh/authorized_keys    # authorize the key on this host (the authorization will be shared with all nodes)
tar -czf ssh.tgz .ssh
exit
- copy the file to all nodes
 
scp /srv/hadoop/home/ssh.tgz NODE_HOSTNAME_1:/tmp    # copy to node 1
scp /srv/hadoop/home/ssh.tgz NODE_HOSTNAME_2:/tmp    # copy to node 2
...                                                  # ...
scp /srv/hadoop/home/ssh.tgz NODE_HOSTNAME_N:/tmp    # copy to node N
sudo rm /srv/hadoop/home/ssh.tgz                     # remove the file
Commands executed at any Slave host
Creating hadoop user
sudo groupadd --gid 11001 hadoop
sudo mkdir -p /srv/hadoop -m 755
sudo useradd --uid 11001 --gid hadoop --no-user-group --shell /bin/bash --create-home --home-dir /srv/hadoop/home hadoop
Setting ssh and environment configs
sudo chown -R hadoop:hadoop /tmp/ssh.tgz
sudo mv /tmp/ssh.tgz /srv/hadoop/home
sudo su - hadoop
tar -xzf ssh.tgz
rm -f ssh.tgz
exit
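- Once the keys are in place on the worker, it is worth checking from the Master host that the hadoop user can reach it without a password (NODE_HOSTNAME_1 as in the examples above):

sudo su - hadoop
ssh NODE_HOSTNAME_1 hostname    # should print the worker hostname without asking for a password
exit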
Commands executed at Master host
Configuring common hadoop env variables
- as the hadoop user, add the following variables to ~/.profile
 
sudo su - hadoop
nano .profile
...
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64 # or /usr/lib/jvm/java-7-oracle
export HADOOP_PREFIX=/srv/hadoop
export HADOOP_HOME=$HADOOP_PREFIX
export HADOOP_MAPRED_HOME=$HADOOP_PREFIX
export HADOOP_COMMON_HOME=$HADOOP_PREFIX
export HADOOP_HDFS_HOME=$HADOOP_PREFIX
export HADOOP_CONF_DIR=${HADOOP_PREFIX}"/etc/hadoop"
export HADOOP_YARN_HOME=$HADOOP_PREFIX
export PATH=$PATH:$HADOOP_PREFIX/bin
export PATH=$PATH:$HADOOP_PREFIX/sbin
...
exit
- copy the file to all nodes
 
sudo su - hadoop
scp .profile NODE_HOSTNAME_1:    # copy to node 1
scp .profile NODE_HOSTNAME_2:    # copy to node 2
...                              # ...
scp .profile NODE_HOSTNAME_N:    # copy to node N
exit
Installing hadoop binary package
sudo su - hadoop
wget -c http://www.us.apache.org/dist/hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz
exit
cd /srv/hadoop/home/
sudo tar -xzf hadoop-2.7.1.tar.gz -C /srv/
sudo mv /srv/hadoop-2.7.1/* /srv/hadoop/
sudo rmdir /srv/hadoop-2.7.1/
sudo chown -R hadoop:hadoop /srv/hadoop/
Copy binary package to all Workers nodes
sudo su - hadoop
scp hadoop-2.7.1.tar.gz NODE_HOSTNAME_1:    # copy to node 1
scp hadoop-2.7.1.tar.gz NODE_HOSTNAME_2:    # copy to node 2
...                                         # ...
scp hadoop-2.7.1.tar.gz NODE_HOSTNAME_N:    # copy to node N
exit
Configuring hadoop servers
Change default JAVA_HOME and HADOOP_PREFIX in /srv/hadoop/etc/hadoop/hadoop-env.sh
# set to the root of your Java installation
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64 # or /usr/lib/jvm/java-7-oracle
# Assuming your installation directory is /srv/hadoop
export HADOOP_PREFIX=/srv/hadoop
export HADOOP_CONF_DIR=${HADOOP_PREFIX}"/etc/hadoop"
Create Master and Slave directories
sudo su - hadoop
mkdir -pv /srv/hadoop/data/namenode
mkdir -pv /srv/hadoop/data/datanode
mkdir -pv /srv/hadoop/logs
exit
Configure logger
- Change the log size
 
nano /srv/hadoop/etc/hadoop/log4j.properties
# in section Rolling File Appender set
# hadoop.log.maxfilesize=10MB
# hadoop.log.maxbackupindex=100
- Change the log type
 
# from https://support.pivotal.io/hc/en-us/articles/202296718-How-to-change-Hadoop-daemon-log4j-properties
echo >> /srv/hadoop/etc/hadoop/hadoop-env.sh
echo "# configure log to console and RFA file" >> /srv/hadoop/etc/hadoop/hadoop-env.sh
echo "export HADOOP_ROOT_LOGGER=\"INFO,console,RFA\"" >> /srv/hadoop/etc/hadoop/hadoop-env.sh
echo >> /srv/hadoop/etc/hadoop/mapred-env.sh
echo "# configure log to console and RFA file" >> /srv/hadoop/etc/hadoop/mapred-env.sh
echo "export HADOOP_MAPRED_ROOT_LOGGER=\"INFO,console,RFA\"" >> /srv/hadoop/etc/hadoop/mapred-env.sh
echo >> /srv/hadoop/etc/hadoop/yarn-env.sh
echo "# configure log to console and RFA file" >> /srv/hadoop/etc/hadoop/yarn-env.sh
echo "export YARN_ROOT_LOGGER=\"INFO,console,RFA\"" >> /srv/hadoop/etc/hadoop/yarn-env.sh
Define Master configuration tags /srv/hadoop/etc/hadoop/core-site.xml
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://hadoopmaster:9000</value>
        <description>NameNode URI</description>
    </property>
Define Common hosts configuration tags /srv/hadoop/etc/hadoop/hdfs-site.xml
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:///srv/hadoop/data/namenode</value>
        <description>NameNode directory for namespace and transaction logs storage.</description>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:///srv/hadoop/data/datanode</value>
        <description>DataNode directory</description>
    </property>
    <property>
        <name>dfs.permissions</name>
        <value>false</value>
    </property>
    <property>
        <name>dfs.datanode.use.datanode.hostname</name>
        <value>false</value>
    </property>
    <property>
        <name>dfs.namenode.datanode.registration.ip-hostname-check</name>
        <value>false</value>
    </property>
Define the Slave host names in /srv/hadoop/etc/hadoop/slaves
NODE_HOSTNAME_1
NODE_HOSTNAME_2
...
NODE_HOSTNAME_N
Define the MapReduce configuration tags in /srv/hadoop/etc/hadoop/mapred-site.xml
sudo su - hadoop
cp /srv/hadoop/etc/hadoop/mapred-site.xml.template /srv/hadoop/etc/hadoop/mapred-site.xml
exit
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>mapred.task.timeout</name>
        <value>0</value>
    </property>
    <property>
        <name>mapreduce.map.memory.mb</name>
        <value>1536</value>
    </property>
    <property>
        <name>mapreduce.map.java.opts</name>
        <value>-Xmx1024M</value>
    </property>
    <property>
        <name>mapreduce.reduce.memory.mb</name>
        <value>3072</value>
    </property>
    <property>
        <name>mapreduce.reduce.java.opts</name>
        <value>-Xmx2560M</value>
    </property>
Define the YARN configuration tags in /srv/hadoop/etc/hadoop/yarn-site.xml
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
    <property>
        <name>yarn.nodemanager.log.retain-seconds</name>
        <value>259200</value>
    </property>
    <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>hadoopmaster:8030</value>
    </property>
    <property>
        <name>yarn.resourcemanager.resource-tracker.address</name>
        <value>hadoopmaster:8031</value>
    </property>
    <property>
        <name>yarn.resourcemanager.address</name>
        <value>hadoopmaster:8032</value>
    </property>
    <property>
        <name>yarn.scheduler.maximum-allocation-mb</name>
        <value>4096</value>
    </property>
- Copy the whole configuration directory to all Worker nodes

sudo su - hadoop
scp -r ../etc/hadoop NODE_HOSTNAME_1:etc_hadoop    # copy to node 1
scp -r ../etc/hadoop NODE_HOSTNAME_2:etc_hadoop    # copy to node 2
...                                                # ...
scp -r ../etc/hadoop NODE_HOSTNAME_N:etc_hadoop    # copy to node N
exit
Commands executed at any Worker host
Install Hadoop and apply configs
sudo tar -xzf /srv/hadoop/home/hadoop-2.7.1.tar.gz -C /srv/
sudo mv /srv/hadoop-2.7.1/* /srv/hadoop/
sudo rmdir /srv/hadoop-2.7.1/
sudo rm -rf /srv/hadoop/etc/hadoop/
sudo chown -R hadoop:hadoop /srv/hadoop/
sudo su - hadoop
mv etc_hadoop ../etc/hadoop
exit
Starting/Stopping Hadoop
- You can start Hadoop manually or configure it as a daemon service started at boot
 
Option 1: Manual startup
Start
Setting up hadoop from Master host
sudo su - hadoop
# Format a new distributed filesystem (execute only first time)
hdfs namenode -format
#
# Start the HDFS with the following command, run on the designated NameNode:
hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs start namenode
#
# Start the YARN with the following command, run on the designated ResourceManager:
yarn-daemon.sh --config $HADOOP_CONF_DIR start resourcemanager
exit
Setting up Slave hosts (this must be executed on all worker hosts)
sudo su - hadoop
# Run a script to start DataNodes on all slaves:
hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs start datanode
#
# Run a script to start NodeManagers on all slaves:
yarn-daemon.sh --config $HADOOP_CONF_DIR start nodemanager
exit
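- Back on the Master host, you can check that the workers registered with the NameNode and the ResourceManager (a sanity check added here, not part of the original procedure):

sudo su - hadoop
hdfs dfsadmin -report    # lists the live DataNodes
yarn node -list          # lists the registered NodeManagers
exit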
Stop
When needed, stop hadoop from Master host
sudo su - hadoop
# Stop the NameNode with the following command, run on the designated NameNode:
hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs stop namenode
#
# Stop the ResourceManager with the following command, run on the designated ResourceManager:
yarn-daemon.sh --config $HADOOP_CONF_DIR stop resourcemanager
#
# this can solve some problems
#rm -rf /srv/hadoop/data/namenode/current/
#rm -rf /srv/hadoop/logs/*
exit
And stop hadoop services in all node hosts
sudo su - hadoop
# Run a script to stop DataNodes on all slaves:
hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs stop datanode
#
# Run a script to stop NodeManagers on all slaves:
yarn-daemon.sh --config $HADOOP_CONF_DIR stop nodemanager
#
# this can solve some problems
#rm -rf /srv/hadoop/data/datanode/current/
#rm -rf /srv/hadoop/logs/*
exit
Option 2: Automatic boot startup
On Master host
- Create the script /etc/init.d/hadoop_master.sh at Master host
 
#!/bin/bash
#
# hadoop_master.sh
# version 1.0
# from http://www.campisano.org/wiki/en/Script_chroot.sh
### BEGIN INIT INFO
# Provides:          hadoop_master.sh
# Required-Start:    $remote_fs $syslog $time
# Required-Stop:     $remote_fs $syslog $time
# Should-Start:      $network
# Should-Stop:       $network
# Default-Start:     2 3 4 5
# Default-Stop:      0 1 6
# Short-Description: Hadoop Master Node
# Description:       Hadoop Master Node
### END INIT INFO
PATH=/sbin:/bin:/usr/sbin:/usr/bin;
OWNER[1]="hadoop";
MSG[1]="Hadoop (Master) namenode daemon ...";
START_CMD[1]="\$HADOOP_HOME/sbin/hadoop-daemon.sh --config \$HADOOP_CONF_DIR --script hdfs start namenode";
STOP_CMD[1]="\$HADOOP_HOME/sbin/hadoop-daemon.sh --config \$HADOOP_CONF_DIR --script hdfs stop namenode";
OWNER[2]="hadoop";
MSG[2]="Hadoop (Master) yarn resourcemanager daemon ...";
START_CMD[2]="\$HADOOP_HOME/sbin/yarn-daemon.sh --config \$HADOOP_CONF_DIR start resourcemanager";
STOP_CMD[2]="\$HADOOP_HOME/sbin/yarn-daemon.sh --config \$HADOOP_CONF_DIR stop resourcemanager";
function start()
{
    N=1;
    while test -n "${OWNER[$N]}"; do
        echo -ne "Starting ${MSG[$N]} ...";
        su "${OWNER[$N]}" -l -c "${START_CMD[$N]}";
        echo "Done.";
        N=$(($N+1));
    done;
}
function stop()
{
    N=${#OWNER[@]};
    while test -n "${OWNER[$N]}"; do
        echo -ne "Stopping ${MSG[$N]} ...";
        su "${OWNER[$N]}" -l -c "${STOP_CMD[$N]}" || break;
        echo "Done.";
        N=$(($N-1));
    done;
}
function usage()
{
    echo "Usage: $0 {start|stop}"
    exit 0
}
case "$1" in
    start)
        start
        ;;
    stop)
        stop
        ;;
    *)
        usage
esac;
# End
- Configure it as a startup script (Debian example)
 
# remember to do this the first time: hdfs namenode -format
sudo chmod 755 /etc/init.d/hadoop_master.sh
sudo update-rc.d hadoop_master.sh defaults
# Format a new distributed filesystem (execute only first time)
sudo su - hadoop
hdfs namenode -format
exit
sudo service hadoop_master.sh start
On all slave hosts
- Create the script /etc/init.d/hadoop_slave.sh at all slave hosts
 
#!/bin/bash
#
# hadoop_slave.sh
# version 1.0
# from http://www.campisano.org/wiki/en/Script_chroot.sh
### BEGIN INIT INFO
# Provides:          hadoop_slave.sh
# Required-Start:    $remote_fs $syslog $time
# Required-Stop:     $remote_fs $syslog $time
# Should-Start:      $network
# Should-Stop:       $network
# Default-Start:     2 3 4 5
# Default-Stop:      0 1 6
# Short-Description: Hadoop Slave Node
# Description:       Hadoop Slave Node
### END INIT INFO
PATH=/sbin:/bin:/usr/sbin:/usr/bin;
OWNER[1]="hadoop";
MSG[1]="Hadoop (Slave) datanode daemon ...";
START_CMD[1]="\$HADOOP_HOME/sbin/hadoop-daemon.sh --config \$HADOOP_CONF_DIR --script hdfs start datanode";
STOP_CMD[1]="\$HADOOP_HOME/sbin/hadoop-daemon.sh --config \$HADOOP_CONF_DIR --script hdfs stop datanode";
OWNER[2]="hadoop";
MSG[2]="Hadoop (Slave) yarn nodemanager daemon ...";
START_CMD[2]="\$HADOOP_HOME/sbin/yarn-daemon.sh --config \$HADOOP_CONF_DIR start nodemanager";
STOP_CMD[2]="\$HADOOP_HOME/sbin/yarn-daemon.sh --config \$HADOOP_CONF_DIR stop nodemanager";
function start()
{
    N=1;
    while test -n "${OWNER[$N]}"; do
        echo -ne "Starting ${MSG[$N]} ...";
        su "${OWNER[$N]}" -l -c "${START_CMD[$N]}";
        echo "Done.";
        N=$(($N+1));
    done;
}
function stop()
{
    N=${#OWNER[@]};
    while test -n "${OWNER[$N]}"; do
        echo -ne "Stopping ${MSG[$N]} ...";
        su "${OWNER[$N]}" -l -c "${STOP_CMD[$N]}" || break;
        echo "Done.";
        N=$(($N-1));
    done;
}
function usage()
{
    echo "Usage: $0 {start|stop}"
    exit 0
}
case "$1" in
    start)
        start
        ;;
    stop)
        stop
        ;;
    *)
        usage
esac;
# End
- Configure as startup script
 
sudo chmod 755 /etc/init.d/hadoop_slave.sh
sudo update-rc.d hadoop_slave.sh defaults
sudo service hadoop_slave.sh start
Hadoop WEB interface
- The web interface usually starts on the Master node, on port 8088: http://MASTER_HOST:8088
 
- If you have any problem accessing this port (e.g. a firewall block), you can tunnel it to your local computer with the following command:
 
ssh -C -N -L 8088:localhost:8088 -p MASTER_SSH_PORT root@MASTER_HOST
- now you can access the Hadoop interface by opening http://localhost:8088
 
NOTE: the HDFS interface usually runs on http://MASTER_HOST:50070
- Hadoop ports are described here http://blog.cloudera.com/blog/2009/08/hadoop-default-ports-quick-reference/
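- If you prefer the command line, the same services can be probed with curl; /ws/v1/cluster/info is the ResourceManager REST API of Hadoop 2.x (replace MASTER_HOST as above):

curl http://MASTER_HOST:8088/ws/v1/cluster/info    # ResourceManager REST API
curl -I http://MASTER_HOST:50070/                  # NameNode web UI (headers only)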
 
Run
First ready example
sudo su - hadoop
hdfs namenode -format
hdfs dfs -mkdir -p /user/hadoop
hdfs dfs -put /srv/hadoop/etc/hadoop /user/hadoop/input
hadoop jar /srv/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar grep input output 'dfs[a-z.]+'
exit
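- When the job finishes, its result is written to the output directory under the hadoop user's HDFS home; it can be inspected like this (a small addition to the original example):

sudo su - hadoop
hdfs dfs -cat output/*    # i.e. /user/hadoop/output
exit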
First Java example
sudo su - hadoop
mkdir java_mapreduce_example
cd java_mapreduce_example
mkdir txt_books
wget http://www.gutenberg.org/ebooks/20417.txt.utf-8 -P txt_books
wget http://www.gutenberg.org/files/5000/5000-8.txt -P txt_books
wget http://www.gutenberg.org/ebooks/4300.txt.utf-8 -P txt_books
hdfs dfs -mkdir -p /user/hadoop/java_mapreduce_example/txt_books
hdfs dfs -put txt_books/* /user/hadoop/java_mapreduce_example/txt_books
# create file WordCount.java
export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar
hadoop com.sun.tools.javac.Main WordCount.java
jar cf wc.jar WordCount*.class
hadoop jar wc.jar WordCount /user/hadoop/java_mapreduce_example/txt_books /user/hadoop/java_mapreduce_example/output
exit
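- The WordCount.java source is the canonical example from the Hadoop MapReduce tutorial and is not reproduced here. Once the job has run, the word counts can be inspected with something like:

sudo su - hadoop
hdfs dfs -cat /user/hadoop/java_mapreduce_example/output/part-r-00000 | head    # part-r-00000 is the usual reducer output file
exit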
Creating a python example
sudo su - hadoop
mkdir python_mapreduce_example
cd python_mapreduce_example
- create file mapper.py, mode 755
 
#!/usr/bin/env python
#
# from http://www.michael-noll.com/tutorials/\
#     writing-an-hadoop-mapreduce-program-in-python/
import sys
# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print '%s\t%s' % (word, 1)
- create file reducer.py, mode 755
 
#!/usr/bin/env python
#
# from http://www.michael-noll.com/tutorials/\
#     writing-an-hadoop-mapreduce-program-in-python/
import sys
current_word = None
current_count = 0
word = None
# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)
    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        continue
    # this IF-switch only works because Hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result to STDOUT
            print '%s\t%s' % (current_word, current_count)
        current_count = count
        current_word = word
# do not forget to output the last word if needed!
if current_word == word:
    print '%s\t%s' % (current_word, current_count)
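- Before running on the cluster, the two scripts can be tested locally with a plain shell pipeline, where sort stands in for Hadoop's shuffle phase (a quick sanity check, not part of the original steps):

echo "foo foo quux labs foo bar quux" | ./mapper.py | sort -k1,1 | ./reducer.py
# expected output (tab-delimited):
# bar   1
# foo   3
# labs  1
# quux  2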
- getting input data
 
mkdir txt_books
wget http://www.gutenberg.org/ebooks/20417.txt.utf-8 -P txt_books
wget http://www.gutenberg.org/files/5000/5000-8.txt -P txt_books
wget http://www.gutenberg.org/ebooks/4300.txt.utf-8 -P txt_books
- putting data into hdfs, run and cleanup
 
hdfs dfs -mkdir -p /user/hadoop/python_mapreduce_example/txt_books
hdfs dfs -put txt_books/* /user/hadoop/python_mapreduce_example/txt_books
hadoop jar ~/../share/hadoop/tools/lib/hadoop-streaming-2.7.1.jar \
    -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py \
    -input /user/hadoop/python_mapreduce_example/txt_books \
    -output /user/hadoop/python_mapreduce_example/output
hadoop fs -copyToLocal /user/hadoop/python_mapreduce_example/output
hadoop fs -rm -r -f /user/hadoop/python_mapreduce_example
exit
Creating the R equivalent example
- from previous example and https://weblogs.java.net/blog/manningpubs/archive/2012/10/10/r-and-streaming-hadoop-practice
 
sudo su - hadoop
mkdir r_mapreduce_example
cd r_mapreduce_example
- create file mapper.r, mode 755
 
#!/usr/bin/env Rscript
options(warn=-1)            # disables warnings
sink("/dev/null")           # redirect all R output to /dev/null
# input comes from STDIN (standard input)
input <- file("stdin", "r")
while(length(line <- readLines(input, n=1, warn=FALSE)) > 0) {
    # remove leading and trailing whitespace
    line <- gsub("^\\s+|\\s+$", "", line)
    # split the line into words
    words <- strsplit(line, "[ \t\n\r\f]")[[1]]
    words <- words[! words == ""]
    # increase counters
    for (word in words){
        sink()              # reenable normal output
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        cat(word, 1, "\n", sep="\t")
        sink("/dev/null")   # redirect all R output to /dev/null
    }
}
close(input)
- create file reducer.r, mode 755
 
#!/usr/bin/env Rscript
options(warn=-1)            # disables warnings
sink("/dev/null")           # redirect all R output to /dev/null
# input comes from STDIN (standard input)
input <- file("stdin", "r")
current_word <- NULL
current_count <- 0
word <- NULL
while(length(line <- readLines(input, n=1, warn=FALSE)) > 0) {
    # remove leading and trailing whitespace
    line <- gsub("^\\s+|\\s+$", "", line)
    # split the line into words
    tuples <- strsplit(line, "\t", fixed=TRUE)[[1]]
    if(length(tuples) != 2 || is.na(suppressWarnings(as.integer(tuples[2])))) {
        next
    }
    word <- tuples[1]
    count <- as.integer(tuples[2])
    if((!is.null(current_word)) && current_word == word) {
        current_count <- current_count + count
    } else {
        if(!is.null(current_word)) {
            # write result to STDOUT
            sink()              # reenable normal output
            cat(current_word, current_count, "\n", sep="\t")
            sink("/dev/null")   # redirect all R output to /dev/null
        }
        current_count <- count
        current_word <- word
    }
}
# do not forget to output the last word if needed!
if((! is.null(current_word)) && current_word == word) {
    sink()              # reenable normal output
    cat(current_word, current_count, "\n", sep="\t")
    sink("/dev/null")   # redirect all R output to /dev/null
}
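- As with the Python version, the R scripts can be tested locally with a shell pipeline before submitting the streaming job (assuming Rscript is installed on the host):

echo "foo foo quux labs foo bar quux" | ./mapper.r | sort -k1,1 | ./reducer.r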
- getting input data
 
mkdir txt_books
wget http://www.gutenberg.org/ebooks/20417.txt.utf-8 -P txt_books
wget http://www.gutenberg.org/files/5000/5000-8.txt -P txt_books
wget http://www.gutenberg.org/ebooks/4300.txt.utf-8 -P txt_books
- putting data into hdfs, run and cleanup
 
hdfs dfs -mkdir -p /user/hadoop/r_mapreduce_example/txt_books
hdfs dfs -put txt_books/* /user/hadoop/r_mapreduce_example/txt_books
hadoop jar ~/../share/hadoop/tools/lib/hadoop-streaming-2.7.1.jar \
    -files mapper.r,reducer.r -mapper mapper.r -reducer reducer.r \
    -input /user/hadoop/r_mapreduce_example/txt_books \
    -output /user/hadoop/r_mapreduce_example/output
hadoop fs -copyToLocal /user/hadoop/r_mapreduce_example/output
hadoop fs -rm -r -f /user/hadoop/r_mapreduce_example
exit
References
- http://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-common/SingleCluster.html
- http://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-common/ClusterSetup.html
- https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
- https://hadoop.apache.org/docs/r2.7.1/hadoop-yarn/hadoop-yarn-common/yarn-default.xml
- http://disi.unitn.it/~lissandrini/notes/installing-hadoop-on-ubuntu-14.html
- https://districtdatalabs.silvrback.com/creating-a-hadoop-pseudo-distributed-environment
- http://ksat.me/map-reduce-a-really-simple-introduction-kloudo/ - a really nice example explaining the MapReduce concept
 
- an updated article: https://dzone.com/articles/install-a-hadoop-cluster-on-ubuntu-18041