• Apache Hive

  • A data warehouse software project built on top of Apache Hadoop that provides data query and analysis

  • Familiar

    • Query data with a SQL-based language
  • Fast

    • Interactive response times, even over huge datasets
  • Scalable and Extensible

    • As data variety and volume grow, more commodity machines can be added without a corresponding reduction in performance
  • Compatible

    • Works with traditional data integration and data analytics tools

Architecture

  • Hive Clients: Applications written in languages such as C++, Java, and Python can connect to Hive through JDBC, ODBC, and Thrift drivers
  • Hive Services: Hive provides services such as a web interface and the CLI for submitting and running queries
  • Metastore: The repository for metadata, holding information about each table such as its location and schema
  • Processing framework: Spark, Tez, or MapReduce
  • Resource Management: YARN (Yet Another Resource Negotiator)
  • Distributed Storage: The underlying HDFS provides distributed storage

Hive Data Model

HiveQL Data Types

Primitive types

  • Numeric data types: TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE, DECIMAL
  • String data types: STRING, VARCHAR, CHAR
  • Date/time data types: TIMESTAMP, DATE
  • Miscellaneous data types: BOOLEAN, BINARY

Complex types

  • Structs: STRUCT<col_name:data_type[COMMENT col_comment],...>
  • Maps: MAP<primitive_type,data_type>
  • Arrays: ARRAY<data_type>
  • Union Types: UNIONTYPE<data_type,data_type,...>
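
For illustration, a table that uses these complex types could be declared as follows; the employee table and its columns are made up for this sketch.

  -- Hypothetical table combining primitive and complex types
  CREATE TABLE employee (
    name      STRING,
    dept      STRING,
    salary    FLOAT,
    hire_date DATE,
    skills    ARRAY<STRING>,                               -- e.g. ['sql', 'java']
    phone     MAP<STRING, STRING>,                         -- e.g. 'home' -> '555-1234'
    address   STRUCT<street:STRING, city:STRING, zip:INT>
  )
  ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    COLLECTION ITEMS TERMINATED BY '|'
    MAP KEYS TERMINATED BY ':';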

Operators

Built In Operators

  • Relational Operators: A = B, A BETWEEN B AND C, A REGEXP B
  • Arithmetic Operators: A + B, A & B, ~A
  • Logical Operators: A AND B, A IN (value1, value2, …), NOT EXISTS (subquery)
  • Operators on Complex Types: A[i], M[key], S.a
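
As an illustration, the following query combines several of these operators; it assumes the hypothetical employee table sketched above.

  SELECT name,
         skills[0]     AS first_skill,   -- A[i]: array element
         phone['home'] AS home_phone,    -- M[key]: map lookup
         address.city  AS city           -- S.a: struct field
  FROM employee
  WHERE salary BETWEEN 50000 AND 90000
    AND name REGEXP '^A.*'
    AND dept IN ('Sales', 'Engineering');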

Functions

Built In Functions

  • Mathematical functions: rand(...), pi(...)
  • Collection functions: size(...)
  • Type conversion functions: cast(...), binary(...)
  • Date functions: year(...), month(...)
  • Conditional functions: coalesce(...), if(...)
  • String functions: upper(...), trim(...)
  • Aggregate functions: sum(...), count(*)
  • Table-generating functions: explode(...), json_tuple(...)
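
A few of these built-in functions applied to the hypothetical employee table might look as follows.

  SELECT upper(trim(name))                  AS name_clean,   -- string functions
         year(hire_date)                    AS hire_year,    -- date function
         size(skills)                       AS num_skills,   -- collection function
         cast(salary AS DECIMAL(10,2))      AS salary_dec,   -- type conversion
         coalesce(phone['home'], 'unknown') AS home_phone    -- conditional function
  FROM employee;

  -- Aggregate functions produce one row per group
  SELECT dept, count(*) AS headcount, sum(salary) AS total_salary
  FROM employee
  GROUP BY dept;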

User-Defined Functions

  • UDF: User-Defined Function
  • UDAF: User-Defined Aggregating Function
  • UDTF: User-Defined Table-Generating Function

Partitioning and Bucketing

Tables

There are two main types of tables in Hive: managed tables and external tables. Hive owns and manages the data of a managed table (dropping the table deletes its data), while an external table only references data whose location and lifecycle are managed outside Hive.
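
As a sketch, a managed and an external table could be created as follows; the table names and the HDFS path are hypothetical.

  -- Managed table: Hive owns the data; DROP TABLE also deletes the files
  CREATE TABLE managed_logs (msg STRING)
  STORED AS TEXTFILE;

  -- External table: Hive only references the data; DROP TABLE leaves the files in place
  CREATE EXTERNAL TABLE external_logs (msg STRING)
  STORED AS TEXTFILE
  LOCATION '/data/raw/logs';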

Partitioning

Hive organizes tables into partitions, grouping similar data together based on a column or partition key. Each table can have one or more partition keys that identify a particular partition. Partitioning allows faster queries on slices of the data, since only the relevant partitions need to be read.
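
For example, a sales table partitioned by country and year might be declared and queried as in the following sketch (names are hypothetical); filtering on the partition keys lets Hive read only the matching partition directories.

  CREATE TABLE sales (
    order_id BIGINT,
    amount   DOUBLE
  )
  PARTITIONED BY (country STRING, year INT);

  -- Only the partitions for country='DE' and year=2023 are scanned
  SELECT sum(amount)
  FROM sales
  WHERE country = 'DE' AND year = 2023;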

Bucketing

In bucketing, we specify a fixed number of buckets into which the entire data set is decomposed. Bucketing is based on hashing: rows with the same key value are always hashed to the same bucket.
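
A bucketed table could be declared as in the following sketch (the table name and bucket count are hypothetical).

  -- Rows are assigned to one of 32 buckets by hashing user_id
  CREATE TABLE clicks (
    user_id BIGINT,
    url     STRING
  )
  CLUSTERED BY (user_id) INTO 32 BUCKETS;

  -- In older Hive versions, bucketed inserts additionally require:
  -- SET hive.enforce.bucketing = true;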

File Formats

  • TEXTFILE: The default format in Hive. A text file is naturally splittable and can be processed in parallel.
  • SEQUENCEFILE: A binary storage format for key/value pairs. A sequence file is more compact than a text file and fits well with the MapReduce output format.
  • AVRO: Also a binary format; beyond that, it is a serialization and deserialization framework.
  • RCFILE: The Record Columnar File splits data horizontally into row groups and allows Hive to skip irrelevant parts of the data, returning results faster and more cheaply.
  • ORC: Optimized Row Columnar provides a larger default block size of 256 MB (RCFILE has 4 MB and SEQUENCEFILE has 1 MB) and is optimized for large sequential reads on HDFS.
  • PARQUET: Parquet is widely supported across projects in the ecosystem and leverages the design of Google's Dremel to support nested data structures.
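
The file format is chosen per table with the STORED AS clause, as in this sketch (the events tables are hypothetical).

  CREATE TABLE events_orc (
    event_id BIGINT,
    payload  STRING
  )
  STORED AS ORC;

  -- Rewrite data from an existing text-format table into the columnar format
  INSERT OVERWRITE TABLE events_orc
  SELECT event_id, payload FROM events_text;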

User-Defined Functions

User-defined functions provide a way to use the user’s own application/business logic for processing column values during an HQL query. Hive defines the following three types of user-defined functions, which are extensible:

  • UDF: It stands for User-Defined Function, which operates row-wise and outputs one result for one row, such as most built-in mathematics and string functions.
  • UDAF: It stands for User-Defined Aggregating Function, which operates row-wise or group-wise and outputs one row for the whole table or one row for each group as a result, such as the max(...) and count(...) built-in functions.
  • UDTF: It stands for User-Defined Table-Generating Function, which also operates row-wise but produces multiple output rows (a table) as a result, such as the explode(...) function. A UDTF can be used after SELECT or in a LATERAL VIEW clause.
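
For example, the built-in UDTF explode(...) can be used with LATERAL VIEW to turn each element of an array column into its own row; this sketch reuses the hypothetical employee table from above.

  SELECT name, skill
  FROM employee
  LATERAL VIEW explode(skills) skill_view AS skill;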

Although all built-in functions in HQL are implemented in Java, UDFs can also be implemented in any JVM-compatible language, such as Scala.
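
Once a custom UDF is compiled and packaged, it can be registered and called from HQL roughly as follows; the jar path, function name, and class name are hypothetical.

  ADD JAR /tmp/my_udfs.jar;
  CREATE TEMPORARY FUNCTION normalize_name AS 'com.example.hive.NormalizeNameUDF';

  SELECT normalize_name(name) FROM employee;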