Troubleshooting Hadoop

A collection of common Hadoop and MapReduce errors, their causes, and how to fix them.

  1. [MapReduce][Java] java.lang.VerifyError: (class: org/apache/hadoop/mapreduce/Job, method: submit signature: ()V) Incompatible argument to function

    a. Cause

    A VerifyError usually means that the JVM loaded a class file that is malformed, or that references another class file which has changed in a way that makes the referring bytecode invalid.

    For example, if you compiled a class file that referenced a method in some other class, then independently modified and recompiled the second class after altering that method's signature, you'd get this sort of error.
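
    To make this concrete, here is a minimal, self-contained Java illustration of the recompile-one-class failure mode; all class names are hypothetical, not from Hadoop:

    // Shape.java
    public class Shape { }

    // Circle.java -- originally "public class Circle extends Shape { }",
    // later edited to drop the superclass and recompiled on its own
    public class Circle { }

    // Printer.java -- compiled while Circle still extended Shape
    public class Printer {
        public static void print(Shape s) { System.out.println(s); }
    }

    // Main.java -- also compiled against the old Circle
    public class Main {
        public static void main(String[] args) {
            Printer.print(new Circle()); // bytecode passes a Circle where a Shape is expected
        }
    }

    Running java Main after recompiling only Circle.java fails at class-verification time with a VerifyError along the lines of java.lang.VerifyError: (class: Main, method: main signature: ([Ljava/lang/String;)V) Incompatible argument to function. The Hadoop error above is the same shape of failure: Job.submit()'s bytecode references classes that differ between the Hadoop version you compiled against and the one on the runtime classpath.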

    b. Solution 1

    I'd suggest doing a clean build and seeing if this issue goes away. If not, check whether you're using up-to-date JARs and source files.

    c. Solution 2

    If a full recompile (javac *.java) and a plain java run still reproduce the error, and your own dependencies are trivial (say, a simple generic Pair class plus the standard library), the mismatch is almost certainly not in your code: check that the Hadoop JARs on the compile-time and runtime classpaths are the same version, and remove any duplicate or conflicting hadoop-core JARs.

  2. [MapReduce][Java] Error: GC overhead limit exceeded / Error: Java heap space

    a. Cause

    This error generally means that your MapReduce program requires more JVM heap space than has been configured by default. See more at Hadoop Memory Intensive.

    b. Solution

    Add these parameters when creating the job flow with the Elastic MapReduce Ruby client:

        --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-daemons \
          --args --namenode-heap-size=2048,--namenode-opts=-XX:GCTimeRatio=19

    Or do the same from the Ruby SDK by passing :bootstrap_actions => bootstrap_actions when creating the job flow:

    # Request larger daemon heaps via the configure-daemons bootstrap action.
    bootstrap_actions = Array.new
    bootstrap_actions.push({
      :name => 'Configure Daemons',
      :script_bootstrap_action => {
        :path => 's3://elasticmapreduce/bootstrap-actions/configure-daemons',
        :args => ['--namenode-heap-size=2048', '--namenode-opts=-XX:GCTimeRatio=19'],
      }
    })

    # Create the job flow with the bootstrap action attached.
    emr = AWS::EMR.new
    job_flow = emr.client.run_job_flow(
      :name => 'Job Flow Name',
      :log_uri => log_dir,
      :instances => {
        :instance_count => instance_count.to_i,
        :master_instance_type => master_instance_type,
        :slave_instance_type => slave_instance_type,
        :hadoop_version => "1.0.3",
      },
      :bootstrap_actions => bootstrap_actions,
    )
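
    The configure-daemons action above resizes the daemon JVMs. If the GC overhead / heap space errors come from the map and reduce task JVMs instead, the task heap can also be raised per job. A minimal sketch, assuming Hadoop 1.x property names (the job name is a placeholder):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class HeapSizedJob {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Give each child (task) JVM a 2 GB heap instead of the default.
            conf.set("mapred.child.java.opts", "-Xmx2048m");

            Job job = new Job(conf, "heap-sized-job"); // Hadoop 1.x constructor
            // ... set mapper, reducer, input and output paths as usual ...
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }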
    
  3. [MapReduce][Java] java.io.IOException: File already exists: s3://yourbucket/tmp/part-r-00012.gz

    a. Cause

    A MapReduce job splits the input data-set into independent chunks that map tasks process in parallel, and failed tasks are re-executed. On S3, a re-run reduce task can collide with the output file its earlier attempt already wrote.

    b. Explain

    When a tasktracker with a completed map task fails, the map task is re-executed, and all reduce tasks that haven't yet read the data from that tasktracker are re-executed as well. Reduce tasks that have already read the data from that tasktracker are not re-executed.
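
    c. Solution

    A common mitigation, not spelled out above but consistent with the explanation: disable speculative execution for reducers so two attempts of the same reduce task never race to write the same part-r-*.gz to S3, and clear any partial output from a failed run before resubmitting. A hedged sketch with Hadoop 1.x property names and a placeholder bucket path:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class S3SafeOutputJob {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Two speculative attempts of one reducer can both try to create
            // the same output file on S3; turn speculation off for reducers.
            conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);

            Job job = new Job(conf, "s3-safe-output");
            Path out = new Path("s3://yourbucket/tmp"); // placeholder output path
            // Remove leftovers from a previous failed run so a retry starts clean.
            FileSystem fs = out.getFileSystem(conf);
            if (fs.exists(out)) {
                fs.delete(out, true);
            }
            FileOutputFormat.setOutputPath(job, out);
            // ... set mapper, reducer and input paths, then job.waitForCompletion(true)
        }
    }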

  4. [MapReduce][Java] "Too many fetch-failures" or "Error reading task output"

    a. Cause

    • DNS issues
    • Not enough http threads on the mapper side for the reducers
    • JVM bug

    b. Explain

    • Reducer fetch operations fail to retrieve mapper outputs
    • Too many fetch failures occur on a particular tasktracker (blacklisted tasktracker)

    c. Solution 1

    # mapred-site.xml

    # Allow reducers from other jobs to run while a big job waits on its mappers (default 0.05)
    mapred.reduce.slowstart.completed.maps = 0.80

    # Number of threads the tasktracker uses to serve map output to reducers (default 40)
    tasktracker.http.threads = 80

    # Number of parallel copies reducers use to fetch map output (default 20).
    # Rule of thumb: SQRT(NodeCount) with a floor of 10; lowered to 1 here to reduce fetch pressure.
    mapred.reduce.parallel.copies = 1

    # configure-hadoop flags: -c writes to core-site.xml, -m writes to mapred-site.xml
    bootstrap_actions = Array.new
    bootstrap_actions.push({
      :name => 'Configure Hadoop',
      :script_bootstrap_action => {
        :path => 's3://elasticmapreduce/bootstrap-actions/configure-hadoop',
        # slowstart=1.00 holds reducers back until every map has finished
        :args => ['-m,mapred.reduce.slowstart.completed.maps=1.00',
                  '-m,tasktracker.http.threads=80',
                  '-m,mapred.reduce.parallel.copies=1'],
      }
    })
    

    d. Solution 2

    # mapred-site.xml
    # Terminate the job sooner when too many fetch failures occur
    mapreduce.reduce.shuffle.maxfetchfailures = 2
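
    Of the settings above, mapred.reduce.slowstart.completed.maps and mapred.reduce.parallel.copies are job-level, so they can be tried per job before reconfiguring the cluster; tasktracker.http.threads is a daemon-side setting and must stay in mapred-site.xml. A minimal per-job sketch, assuming Hadoop 1.x property names:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class ShuffleTunedJob {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hold reducers until all maps finish (the 1.00 used in the
            // bootstrap action above).
            conf.setFloat("mapred.reduce.slowstart.completed.maps", 1.00f);
            // Limit each reducer to one parallel fetch of map output.
            conf.setInt("mapred.reduce.parallel.copies", 1);

            Job job = new Job(conf, "shuffle-tuned-job");
            // ... configure mapper, reducer, input and output, then submit ...
        }
    }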

  5. [Hadoop] Hadoop NameNode stuck initializing, or NameNode isn't starting up

    a. Cause

    The Hadoop configuration is incorrect; in particular, the dfs.name.dir and dfs.data.dir settings may be missing from conf/hdfs-site.xml.

    b. Solution

    bin/stop-all.sh
    

    Edit the file conf/hdfs-site.xml and add the configuration below if it's missing:

    <property>
      <name>dfs.data.dir</name>
      <value>/app/hadoop/tmp/dfs/name/data</value>
      <final>true</final>
    </property> 
    <property>
      <name>dfs.name.dir</name>
      <value>/app/hadoop/tmp/dfs/name</value>
      <final>true</final>
    </property>
    
    bin/hadoop namenode -format   # caution: formatting erases existing HDFS metadata and data
    bin/start-all.sh
    
  6. [Hadoop] hadoop namenode -format error "SHUTDOWN_MSG: Shutting down NameNode at java.net.UnknownHostException: hadoop: hadoop"

    a. Cause

    The machine doesn't know how to resolve the hostname hadoop to an IP address. Resolution typically happens in one of two places: a DNS server or the local hosts file (the hosts file takes precedence).

    b. Solution

    Append a loopback mapping for the hostname to /etc/hosts:

    127.0.0.1    hadoop
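
    Since the NameNode dies on the JVM's own local-hostname lookup, you can verify the fix with the same call in plain Java (a small diagnostic sketch):

    import java.net.InetAddress;
    import java.net.UnknownHostException;

    public class HostnameCheck {
        public static void main(String[] args) {
            try {
                // The same lookup the NameNode performs at startup.
                InetAddress self = InetAddress.getLocalHost();
                System.out.println("Resolved " + self.getHostName() + " -> " + self.getHostAddress());
            } catch (UnknownHostException e) {
                System.err.println("Hostname still does not resolve: " + e.getMessage());
            }
        }
    }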
    
  7. [Hadoop] Name node is in safe mode

    a. Cause

    The NameNode enters safe mode at startup, while it loads the filesystem image and waits for datanodes to report enough blocks, and can stay there after a crash or when blocks are missing or under-replicated. While in safe mode, HDFS is read-only.

    b. Solution

    hadoop dfsadmin -safemode leave   # force the NameNode out of safe mode
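
    You can first check the state with hadoop dfsadmin -safemode get. The same probe is available programmatically; a hedged sketch against the Hadoop 1.x HDFS API (FSConstants was renamed in later releases):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.hdfs.DistributedFileSystem;
    import org.apache.hadoop.hdfs.protocol.FSConstants;

    public class SafeModeProbe {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            if (fs instanceof DistributedFileSystem) {
                DistributedFileSystem dfs = (DistributedFileSystem) fs;
                // SAFEMODE_GET only queries the state; it changes nothing.
                boolean inSafeMode = dfs.setSafeMode(FSConstants.SafeModeAction.SAFEMODE_GET);
                System.out.println("NameNode in safe mode: " + inSafeMode);
                // Equivalent of `hadoop dfsadmin -safemode leave`:
                // dfs.setSafeMode(FSConstants.SafeModeAction.SAFEMODE_LEAVE);
            }
        }
    }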
    
