Setup Hadoop Multi Node Cluster

Install Hadoop on Multi Node Cluster

  1. Install Hadoop (Single Node) on every node

    a. Check the Java location: echo $JAVA_HOME

    b. See Setup Hadoop Single Node
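
    Before moving on, the prerequisites can be verified on each node (a minimal sketch, assuming the commands are run from the Hadoop installation directory):

    # print the Java installation directory and version
    echo $JAVA_HOME
    java -version
    # confirm the Hadoop build installed on this node
    bin/hadoop version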

  2. Configure Network

    a. Configure the hostname on the master and each slave

    vi /etc/sysconfig/network
    
    # set HOSTNAME in the network file to the value for that machine
    HOSTNAME="master.zero"   # on the master
    HOSTNAME="slave1.zero"   # on slave1
    HOSTNAME="slave2.zero"   # on slave2
    
    # reboot the node for the change to take effect
    # or restart rsyslog
    /sbin/service rsyslog restart
    # or restart network
    # /etc/init.d/network restart
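
    If a reboot is not convenient, the new hostname can usually be applied to the running system right away (a small sketch, run as root; shown for the master, each slave uses its own name):

    # apply the new hostname without rebooting (assumes your distribution supports this)
    hostname master.zero
    # verify the fully qualified hostname
    hostname -f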
    

    b. Get the IP addresses of the master and slaves, then edit the hosts file on every machine

    ifconfig
    vi /etc/hosts
    
    # add these entries to the hosts file on every machine
    172.16.35.1  master.zero
    172.16.35.2  slave1.zero
    172.16.35.3  slave2.zero
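
    Name resolution can then be checked from the master, and likewise from each slave (a quick sketch):

    # each hostname should answer from the address listed in /etc/hosts
    ping -c 1 slave1.zero
    ping -c 1 slave2.zero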
    

  3. Configure SSH so the master can connect to the slaves

    a. From hadoop@master.zero, create an SSH connection to hadoop@master.zero (the master itself)

    ssh master.zero
    

    b. From hadoop@master.zero, create an SSH connection to each slave

    To connect without a password, we need to add the public SSH key of hadoop@master.zero to the authorized_keys file of hadoop@slave1.zero and hadoop@slave2.zero.
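
    If hadoop@master.zero does not have an SSH key pair yet, one can be generated first (a minimal sketch; the empty passphrase is an assumption made so that logins need no prompt):

    # generate an RSA key pair with an empty passphrase for the hadoop user
    ssh-keygen -t rsa -P "" -f $HOME/.ssh/id_rsa

    With the key pair in place, copy the public key to each slave: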

    ssh-copy-id -i $HOME/.ssh/id_rsa.pub hadoop@slave1.zero
    ssh-copy-id -i $HOME/.ssh/id_rsa.pub hadoop@slave2.zero
    

    c. Check the SSH connection from the master to the slaves

    ssh slave1.zero
    ssh slave2.zero
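
    The same check can also be done non-interactively, which confirms that each slave reports the expected hostname (a small sketch):

    # log in, print the remote hostname, and return immediately
    ssh slave1.zero "hostname -f"
    ssh slave2.zero "hostname -f"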
    

  4. Configure Hadoop on the master and slaves

    conf/masters (master only)

    # The primary NameNode and JobTracker run on the machine where bin/start-dfs.sh
    # and bin/start-mapred.sh (or bin/start-all.sh) are executed; conf/masters lists
    # the host(s) that will also run the SecondaryNameNode
    master.zero
    

    conf/slaves (master only)

    # The machines on which the Hadoop slave daemons (DataNodes and TaskTrackers) will run
    master.zero
    slave1.zero
    slave2.zero
    

    conf/core-site.xml (all machines)

    <property>
      <name>fs.default.name</name>
      <value>hdfs://master.zero:9000</value>
      <description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description>
    </property>
    

    conf/mapred-site.xml (all machines)

    <property>
      <name>mapred.job.tracker</name>
      <value>master.zero:9001</value>
      <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
    </property>
    

    conf/hdfs-site.xml (all machines)

    <property>
      <name>dfs.replication</name>
      <value>2</value>
      <description>Default block replication.
      The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time.</description>
    </property>
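
    Since core-site.xml, mapred-site.xml and hdfs-site.xml must have the same content on all machines, they can be copied from the master to the slaves (a sketch; /usr/local/hadoop is an assumed installation path, adjust it to your own):

    # copy the shared configuration files from the master to each slave
    # (adjust /usr/local/hadoop to your installation path)
    for host in slave1.zero slave2.zero; do
      scp conf/core-site.xml conf/mapred-site.xml conf/hdfs-site.xml \
        hadoop@$host:/usr/local/hadoop/conf/
    done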
    

  5. Format the HDFS filesystem via the NameNode (in this case, on the master)

    bin/hadoop namenode -format
    # Warning: Do not format a running cluster,
    # because this will erase all existing data in the HDFS filesystem!
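
    Before formatting, it is worth confirming that no HDFS daemon is still running on the master; a quick check with the jps tool from the JDK:

    # list running Java processes; no NameNode or DataNode should appear here
    jps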
    
  6. Start the multi-node cluster

    Step 1. Start the HDFS daemons (on the master)

    bin/start-dfs.sh
    # The NameNode daemon starts on the master,
    # then DataNode daemons are started automatically on all machines listed in conf/slaves
    

    Step 2. Start the MapReduce daemons (on the master)

    bin/start-mapred.sh
    # The JobTracker starts on the master,
    # then TaskTracker daemons are started automatically on all machines listed in conf/slaves
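
    To verify which daemons came up on each machine, the jps tool from the JDK can be used (a sketch, assuming jps is on the PATH of the remote shells; the process names are those registered by Hadoop 1.x daemons):

    # on the master: expect NameNode, SecondaryNameNode, JobTracker, DataNode and TaskTracker
    jps
    # on each slave: expect DataNode and TaskTracker
    ssh slave1.zero jps
    ssh slave2.zero jps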
    

  7. Stop the multi-node cluster

    Step 1. Stop the MapReduce daemons (on the master)

    bin/stop-mapred.sh
    # The JobTracker stops on the master,
    # then TaskTracker daemons are stopped automatically on all machines listed in conf/slaves
    

    Step 2. Stop the HDFS daemons (on the master)

    bin/stop-dfs.sh
    # The NameNode daemon stops on the master,
    # then DataNode daemons are stopped automatically on all machines listed in conf/slaves
    

  8. Run WordCount

    a. Copy the input directory from the local filesystem to HDFS

    bin/hadoop fs -put /path/local/input/ /path/hdfs/input
    

    b. Check the input directory in HDFS

    bin/hadoop dfs -ls /path/hdfs/input
    

    c. Run the MapReduce job

    bin/hadoop jar /path/local/file.jar package.MainClass /path/hdfs/input/ /path/hdfs/output/
    

    d. Check the output directory in HDFS

    bin/hadoop dfs -ls /path/hdfs/output
    

    e. Copy the output from HDFS to the local filesystem

    bin/hadoop fs -get /path/hdfs/output/part-r-* /path/local/result.file
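
    The result can also be inspected directly in HDFS without copying it to the local filesystem (a small sketch):

    # print the contents of all reducer output files
    bin/hadoop fs -cat /path/hdfs/output/part-r-*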
    
