Snippet Hive

With these Hive snippets at hand, you can free your mind for other things

  1. Hive Local Mode

    # add "SET hive.exec.mode.local.auto=true;" to your $HOME/.hiverc filevi $HOME/.hiverc
    SET hive.exec.mode.local.auto=true;
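    # automatic local mode only applies to small jobs; the thresholds are tunable
    # (the values below are, as far as I know, the defaults)
    SET hive.exec.mode.local.auto.inputbytes.max=134217728;
    SET hive.exec.mode.local.auto.input.files.max=4;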
    
  2. Hive in Noninteractive Mode

    # run a script file
    hive -S -f hive-script.hql
    hive -S -f hive-script.hql -d INPUT=s3://bucket/input -d OUTPUT=s3://bucket/output
    # run a single query
    hive -S -e "SHOW TABLES"
    # turn debug logging on and see it on the console
    hive -hiveconf hive.root.logger=INFO,console
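
    For reference, a script run this way can pick up the -d variables through ${...} substitution. A minimal sketch of such a hive-script.hql (table and column names are illustrative):

    -- hive-script.hql
    CREATE EXTERNAL TABLE IF NOT EXISTS raw_input (line STRING)
    LOCATION '${INPUT}';
    INSERT OVERWRITE DIRECTORY '${OUTPUT}'
    SELECT * FROM raw_input;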
    
  3. Hive Useful Configuration

    # print the current database as part of the prompt
    hive> SET hive.cli.print.current.db=true;
    # show column headers in query results
    hive> SET hive.cli.print.header=true;
    # enable compression for MapReduce job outputs
    hive> SET hive.exec.compress.output=true;
    # enable compression for intermediate map outputs
    hive> SET hive.exec.compress.intermediate=true;
    # set the number of reducers for the job
    hive> SET mapred.reduce.tasks=2;
    # improve the performance of aggregation
    hive> SET hive.map.aggr=true;
    
  4. Distributed Cache

    # add a resource to the Distributed Cache
    hive> ADD [ARCHIVE|FILE|JAR] <filepath>;
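
    For example (paths are illustrative):

    hive> ADD FILE /tmp/ip-to-region.txt;
    hive> ADD JAR /tmp/my-udfs.jar;
    hive> LIST FILES;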
    
  5. Table Sample

    hive> CREATE TABLE employees (
            name STRING,
            salary FLOAT,
            subordinates ARRAY<STRING>,
            deductions MAP<STRING, FLOAT>,
            address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
          )
          ROW FORMAT DELIMITED
          FIELDS TERMINATED BY '\001'
          COLLECTION ITEMS TERMINATED BY '\002'
          MAP KEYS TERMINATED BY '\003'
          LINES TERMINATED BY '\n'
          STORED AS TEXTFILE;
    

    The ROW FORMAT DELIMITED sequence of keywords must appear before any of the other clauses (with the exception of STORED AS), because each delimiter setting is part of a single ROW FORMAT DELIMITED clause:

    • ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001'
    • ROW FORMAT DELIMITED COLLECTION ITEMS TERMINATED BY '\002'
    • ROW FORMAT DELIMITED MAP KEYS TERMINATED BY '\003'

  6. Table External Sample

    hive> CREATE EXTERNAL TABLE employees (
            name STRING,
            salary FLOAT
          )
          ROW FORMAT DELIMITED
          FIELDS TERMINATED BY '\t'
          LOCATION 's3://bucket/path';
    

    The EXTERNAL keyword lets you create a table and provide a LOCATION so that Hive does not use a default location for this table. This comes in handy if you already have data generated.

    When dropping an EXTERNAL table, data in the table is NOT deleted from the file system.

    An EXTERNAL table points to any HDFS location for its storage, rather than being stored in a folder specified by the configuration property hive.metastore.warehouse.dir
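
    For example, dropping the table removes only the metastore entry; the files under the LOCATION stay where they are:

    hive> DROP TABLE employees;
    # the data at s3://bucket/path is untouched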

  7. Table Location to a File

    First, we create the table without a LOCATION

    hive> CREATE EXTERNAL TABLE employees (
            name STRING,
            salary FLOAT
          )
          ROW FORMAT DELIMITED
          FIELDS TERMINATED BY '\t';
    

    Then, we alter the table and set its LOCATION to the data file

    ALTER TABLE employees SET LOCATION 's3://bucket/path/to/file.gz';
    
  8. Table Location From Hive Script Parameter

    hive> CREATE EXTERNAL TABLE IF NOT EXISTS product (
            name STRING,
            price FLOAT
          )
          ROW FORMAT DELIMITED
          FIELDS TERMINATED BY ','
          LOCATION '${INPUT}';
    

    Note: the ${INPUT} reference must be enclosed in single quotes
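
    The value for ${INPUT} is then supplied on the command line with -d, e.g. (the script name is illustrative):

    hive -S -f create-product.hql -d INPUT=s3://bucket/products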

  9. Dynamic Partition Inserts

    # weblogs contains all data
    hive> CREATE TABLE weblogs (
            client_ip STRING,
            full_request_date STRING,
            day STRING
          )
          ROW FORMAT DELIMITED
          FIELDS TERMINATED BY '\t';
    # to partition the weblogs data by IP, we create table weblogs_by_ip, partitioned by ip
    hive> CREATE TABLE weblogs_by_ip (
            client_ip STRING,
            full_request_date STRING,
            day STRING
          )
          PARTITIONED BY (ip STRING)
          ROW FORMAT DELIMITED
          FIELDS TERMINATED BY '\t';
    # enable dynamic partitioning (default is true)
    hive> SET hive.exec.dynamic.partition=true;
    # set dynamic partition mode, default is strict
    hive> SET hive.exec.dynamic.partition.mode=nonstrict;
    # set maximum number of dynamic partitions that can be created by each mapper or reducer, default is 100
    hive> SET hive.exec.max.dynamic.partitions.pernode=1000;
    # set total number of dynamic partitions could be created by one DML, default is 1000
    hive> SET hive.exec.max.dynamic.partitions=10000;
    # set maximum total number of files created by all mappers and reducers, default is 100000
    hive> SET hive.exec.max.created.files=1000000;
    # partition the data from weblogs into weblogs_by_ip; client_ip is selected once more, last, to supply the value for the partition column
    hive> INSERT OVERWRITE TABLE weblogs_by_ip PARTITION (ip)
          SELECT client_ip, full_request_date, day, client_ip
          FROM weblogs;
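    # afterwards you can check which partitions were created
    hive> SHOW PARTITIONS weblogs_by_ip;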
    
  10. Convert to ORCFile

    # create weblogs stored as orc file
    hive> CREATE TABLE weblogs_orc (
            client_ip STRING,
            full_request_date STRING,
            day STRING
          )
          STORED AS ORC;
    # insert data to orc table
    hive> INSERT OVERWRITE TABLE weblogs_orc SELECT * FROM weblogs;
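    # verify the storage format in the table metadata
    hive> DESCRIBE FORMATTED weblogs_orc;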
    
  11. SELECT … FROM Clauses

    # specify columns with regular expressions
    hive> SELECT symbol, `price.*` FROM stocks;
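
    Note that from Hive 0.13 onwards, backticked names are treated as quoted identifiers by default, so the regular-expression syntax needs:

    hive> SET hive.support.quoted.identifiers=none;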
    
  12. XPath

    # extract the id attributes; returns ["foo","bar"]
    hive> SELECT xpath('<a><b id="foo">b1</b><b id="bar">b2</b></a>', '//@id')
          FROM src LIMIT 1;
    
  13. SORT BY vs. ORDER BY

    • SORT BY only sorts the rows within each reducer, so if your job runs with multiple reducers you won't get a total ordering (see the example below)
    • ORDER BY, on the other hand, does give you a total ordering across all the reducers
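
    A minimal illustration, reusing the weblogs table from snippet 9:

    hive> SELECT client_ip, day FROM weblogs SORT BY day;  -- sorted within each reducer only
    hive> SELECT client_ip, day FROM weblogs ORDER BY day; -- one total ordering, single reducer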
  14. Join Optimizations

    • When joining three or more tables, if every ON clause uses the same join key, a single MapReduce job will be used
    • Hive also assumes that the last table in the query is the largest. It attempts to buffer the other tables and then stream the last table through, while performing joins on individual records. Therefore, you should structure your join queries so the largest table is last [BEST WAY]
    • You don't have to put the largest table last in the query. Hive also provides a "hint" mechanism to tell the query optimizer which table should be streamed
    • hive> SELECT /*+ STREAMTABLE(s) */ s.ymd, s.symbol, s.price_close, d.dividend
            FROM stocks s JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol
            WHERE s.symbol = 'ZERO';
      
  15. Replicated Join

    A replicated join is a map-side join where a small table is cached in memory and the big table is streamed

    To perform a replicated join, there's a hint you can give to Hive to indicate which table is small and should be cached:

    hive> SELECT /*+ MAPJOIN(movie_categories)*/
            viewed_movies.movie, movie_categories.category
          FROM viewed_movies
          JOIN movie_categories
          ON viewed_movies.movie=movie_categories.title;
    

    The hint triggers the map join and also tells Hive which table (movie_categories) to cache

    Replicated Join Limitations: Only inner joins are currently supported with replicated joins
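
    As an alternative to the hint, you can let Hive convert the join automatically when one side is small enough (the size threshold shown is, I believe, the default):

    hive> SET hive.auto.convert.join=true;
    hive> SET hive.mapjoin.smalltable.filesize=25000000;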

  16. Creating Custom UDFs

    Jar file: hive-exec-0.12.0.jar

    First, you need to create a new class that extends UDF, with one or more methods named evaluate

    package com.example.hive.udf;
    import org.apache.hadoop.hive.ql.exec.UDF;
    import org.apache.hadoop.io.Text;
    public final class Lower extends UDF {
      public Text evaluate(final Text s) {
        if (s == null) { return null; }
        return new Text(s.toString().toLowerCase());
      }
    }
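
    A sketch of compiling and packaging the class (file and jar names are illustrative):

    javac -cp hive-exec-0.12.0.jar -d classes Lower.java
    jar cf lower.jar -C classes .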
    

    After compiling your code to a jar, you need to add this to the Hive classpath

    hive> ADD JAR lower.jar;
    hive> list jars;
    

    Once Hive is started up with your jars in the classpath, the final step is to register your function

    CREATE TEMPORARY FUNCTION my_lower as 'com.example.hive.udf.Lower';
    

    Now you can start using it

    hive> SELECT my_lower(title), SUM(freq) FROM titles GROUP BY my_lower(title);
    
  17. Add Custom UDFs from S3 to the Hadoop lib, then create the function without ADD JAR

    hive -e "dfs -get s3://butket/hive/udf/custom_udf.jar ${HADOOP_HOME}/lib/"
    
