Data Analysis with Hadoop and Hive

Tariqul Islam
11 min read · Jan 17, 2025


After working with Hadoop and Hive for a few years, I decided to write something that helps developers and big data learners get started with Hadoop and Hive to analyze the data in their systems.

Installation: Using Predefined Docker Images for Learning Purposes

Topics of the article

  1. Using the pre-configured Hive and Hadoop cluster
  2. Architecture of Hadoop
  3. Architecture of Hive
  4. HDFS
  5. Common operations of Hive (create database and table)
  6. Data types
  7. Partitioning
  8. Bucketing
# Clone the Docker project for docker-hive
git clone https://github.com/big-data-europe/docker-hive
cd docker-hive
# Start the containers and enter the Hive server container
docker-compose up -d
docker-compose exec hive-server bash
# Run Hive with Beeline
/opt/hive/bin/beeline -u jdbc:hive2://localhost:10000

Basic Knowledge of the Hadoop Architecture

HDFS

HDFS stands for Hadoop Distributed File System. It works as a master-slave architecture: Hadoop has a NameNode and DataNodes to control operations. There is one NameNode and multiple DataNodes, and together they store and handle large datasets in a distributed manner.

The main features of HDFS are:

  1. It provides distributed storage and can be implemented on commodity hardware
  2. Data security and high fault tolerance

MapReduce

Hadoop MapReduce is a software framework whose purpose is to process vast amounts of data in parallel on large clusters of commodity hardware in a fault-tolerant manner.

Hadoop MapReduce Processing Stages

Input         --> Read the input data
Map Stage     --> Map jobs process the input data
Shuffle Stage --> Shuffle the mapped data
Sorting Stage --> Sort the data and pass the output to the Reduce stage
Reduce Stage  --> Reduce the mapped data with the reduce algorithm
Output        --> Provide the final result
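To make these stages concrete, here is a minimal HiveQL sketch: when Hive runs on the MapReduce engine, it compiles an aggregation like the one below into exactly this map/shuffle/sort/reduce pipeline (the table comes from the bookstore examples later in this article).

-- Hive turns this aggregation into a MapReduce job:
--   Map:     read each row and emit the pair (publisher_id, 1)
--   Shuffle: group the emitted pairs by publisher_id
--   Sort:    order the pairs within each group
--   Reduce:  sum the counts for each publisher_id
SELECT publisher_id, COUNT(*) AS books_published
FROM bookstore.book_info
GROUP BY publisher_id;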

Apache Hadoop YARN

Hadoop YARN stands for Yet Another Resource Negotiator.

  1. The responsibility of YARN is to split up the following functions:
1.1 Resource management
1.2 Job management for the Hadoop cluster
1.3 Job scheduling
1.4 Monitoring the separate servers and daemons

  2. It has two main components:

2.1 Global ResourceManager (RM)
2.2 Per-application ApplicationMaster (AM)

YARN runs a single application as a job or a DAG of jobs.

It is also responsible for managing cluster resources, distributing tasks across the different machines in the cluster, and keeping jobs and the job scheduler on track.

Hadoop Common

It is a common component which provides the essential libraries, services, and utilities needed for the Hadoop ecosystem to function properly.

  1. It contains the I/O operations, RPC protocol, and serialization modules
  2. It also provides the necessary Java Archive (JAR) files and scripts required to run Hadoop, ensuring seamless interaction and communication between components
  3. It also provides the FileSystem API for integrating data storage systems with HDFS
  4. It also ships source code and documentation, as well as a contribution section that includes different projects from the Hadoop community.

Hive Architecture

HiveServer2

  1. HiveServer2 is an improved implementation of HiveServer1.
  2. It supports concurrent connections and sessions through the Thrift service
  3. It can also authenticate and authorize through LDAP and Kerberos
  4. It supports different drivers such as ODBC and JDBC and also supports optimized execution of Hive queries

Metastore

Hive needs to save metadata to run different operations, so it needs a store for that metadata, called the Metastore.

  1. The Hive Metastore comes with a Derby database by default, which does not support concurrent execution and queries
  2. In production, we need to use a relational database like MySQL, PostgreSQL, or Oracle as the Metastore database; it supports concurrent execution of operations and queries.
  3. The Metastore actually stores the information about the structure of Hive tables and partitions, the columns and column types, and their locations
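A quick way to see what the Metastore holds: the commands below read table and partition metadata without touching the data files (they assume the bookstore tables created later in this article).

-- All of the output below is served from the Metastore, not from HDFS data files
DESC DATABASE bookstore;              -- database location and owner
DESC FORMATTED bookstore.book_info;   -- columns, partition keys, storage format and location
SHOW PARTITIONS bookstore.book_info;  -- partitions registered in the Metastore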

Hive clients and Beeline

Early versions of Hive used the Hive CLI to run queries through the command line, which is now deprecated.

  1. Recent versions of Hive recommend that developers and users use Beeline as the command-line interface (CLI) to execute queries and other Hive and Hadoop operations.
  2. Beeline connects through HiveServer2 and acts as an interface or client for the user to run queries and print the results.
  3. Beeline also supports pretty-printing of Hive query results with column names and fields, plus all Hive CLI features

Hive Driver

  1. The Hive driver receives the HiveQL statements submitted by the user through the command shell.
  2. It creates the session handles for the query and sends the query to the compiler.

Hive Compiler

  1. The Hive compiler parses the query, performing semantic analysis and type checking on the different query blocks and query expressions with the help of the Metastore.
  2. It generates the execution plan; the execution plan created by the compiler is a DAG (Directed Acyclic Graph).
  3. The DAG contains the stages of map/reduce jobs, operations on HDFS, and Metastore operations.

Optimizer

The optimizer performs transformation operations on the execution plan and splits tasks to improve efficiency and scalability.
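You can inspect the plan produced by the compiler and optimizer by prefixing a query with EXPLAIN. A minimal sketch, using the book_info table defined later in this article:

-- Prints the optimized DAG of stages instead of running the query
EXPLAIN
SELECT publisher_id, COUNT(*) AS books_published
FROM bookstore.book_info
WHERE dt = '2024-07-24'
GROUP BY publisher_id;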

Different Hadoop Operations with Hive

Using HDFS tools to create the database warehouse directory in HDFS

# Create the storage directory for the database
/opt/hadoop-2.7.4/bin/hdfs dfs -mkdir -p /user/hive/warehouse/bookstore.db
# Check that the warehouse location was created for bookstore
/opt/hadoop-2.7.4/bin/hdfs dfs -ls /user/hive/warehouse/

Create the database with the Beeline CLI

/opt/hive/bin/beeline -u jdbc:hive2://localhost:10000
0: jdbc:hive2://localhost:10000> create database
. . . . . . . . . . . . . . . .> bookstore
. . . . . . . . . . . . . . . .> location '/user/hive/warehouse/bookstore.db'
. . . . . . . . . . . . . . . .> ;

Check that the database was created in the specific HDFS directory

0: jdbc:hive2://localhost:10000> desc database bookstore;

Second approach: create the database in the default HDFS storage location

0: jdbc:hive2://localhost:10000> create database
. . . . . . . . . . . . . . . .> bookstore_qa
. . . . . . . . . . . . . . . .> ;
No rows affected (0.073 seconds)

Create Table in hive

Table Creation Syntax

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] <table_name> (
column_name1 data_type COMMENT '...',
column_name2 data_type COMMENT '...'
)
PARTITIONED BY (column_name3 data_type, column_name4 data_type)
CLUSTERED BY (col_name1, col_name2, ...)
SORTED BY (col_name1 [ASC | DESC])
INTO num_buckets BUCKETS
ROW FORMAT row_format
-- If the row format is delimited text:
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
-- End if
STORED AS file_format
LOCATION 'path ...'
TBLPROPERTIES (key1=val1, key2=val2, ...)

Create tables with different key properties, file formats, and locations

Sample of creating a table without a text file format

CREATE TABLE authors (
  id INTEGER,
  author_name STRING
)
PARTITIONED BY (dt STRING)
TBLPROPERTIES ('transactional' = 'false');

Creating a table with a text-based format (publishers)

CREATE TABLE publishers (
  publisher_id INTEGER,
  publisher_name STRING,
  name STRING,
  address STRING,
  founding_date STRING
)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
TBLPROPERTIES ('transactional' = 'false');

Another example of creating a table with a text-based format (book_info)

CREATE TABLE bookstore.book_info (
  book_id INTEGER,
  title STRING,
  authors ARRAY<STRING>,
  average_rating DECIMAL(10,2),
  isbn INTEGER,
  isbn12 INTEGER,
  language_code ARRAY<STRING>,
  number_of_page INTEGER,
  rating_count INTEGER,
  text_review INTEGER,
  publisher_id INTEGER,
  publish_date STRING
)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
TBLPROPERTIES ('transactional' = 'false');

Example of an external table in Hive

CREATE EXTERNAL TABLE customer (
  id INTEGER,
  customer_name STRUCT<firstname: STRING, middlename: STRING, lastname: STRING>,
  personal_info MAP<STRING, STRING>,
  email STRUCT<one: STRING, two: STRING, three: STRING>,
  contact STRUCT<one: STRING, two: STRING, three: STRING>
)
PARTITIONED BY (dt STRING, country STRING)
LOCATION '/user/hive/customer_info'
TBLPROPERTIES ('transactional' = 'false');

Variable Declaration

We can declare and access Hive variables in three namespaces:

  1. hiveconf — Hive starts with this; all the Hive configuration is stored as part of this namespace.
  2. hivevar — to store user variables.
  3. system — to store system variables.

We can set variables in different ways. Open Beeline from the command line, then run the statements below:

> SET hivevar:key=value;
> SET hivevar:dt=2013-10-12;
> SELECT * FROM bookstore.publishers WHERE dt='${dt}';

Run a query file with a Hive configuration variable

-- test.hql: dt is passed in from the command line as a hiveconf variable
SELECT *
FROM book_info
WHERE dt='${hiveconf:dt}';

> hive -hiveconf dt='2024-07-24' -f test.hql

# test.ini.hql
SET hivevar:tablename=mytable;

> beeline -u jdbc:hive2://localhost:10000 -i /user/test.ini.hql
hive> select * from ${tablename};
hive> select * from ${hivevar:tablename};

Data Type

Numeric Types: INT/INTEGER, BIGINT, FLOAT, DOUBLE, DECIMAL, NUMERIC

Date/Time Types: TIMESTAMP, DATE, INTERVAL

String Types: STRING, VARCHAR, CHAR

Misc Types: BOOLEAN, BINARY

Complex Types

  • Arrays: ARRAY<data_type>
  • Maps: MAP<primitive_type, data_type>
  • Structs: STRUCT<col_name : data_type [COMMENT col_comment], ...>
  • Unions: UNIONTYPE<data_type, data_type, ...>
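A short sketch of how these complex types are read back in queries, using the customer table defined earlier (the 'gender' map key is a hypothetical example, since the CSV contents are not shown here):

SELECT
customer_name.firstname AS first_name,   -- STRUCT field access with dot notation
personal_info['gender'] AS gender,       -- MAP lookup by key ('gender' is hypothetical)
email.one AS primary_email               -- STRUCT field access
FROM customer
WHERE dt = '2024-07-24';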

Hive Partitioning

Hive table partitioning is used to split a large table and large amounts of data into smaller parts according to specific columns or keys. Internally, Hive splits the records into subdirectories on HDFS named with the partition key and value.

  • Fast access to the data
  • Optimizes queries and reduces query execution time
  • Provides the ability to perform an operation on a smaller dataset

For example, if we have one year of data and partition it by date, we can query a specific date without touching the rest of the data, because of the partitions.
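A small sketch of that behavior with the publishers table from earlier: SHOW PARTITIONS lists the subdirectories Hive registered, and a dt filter lets Hive read only the matching one.

-- List the partitions Hive registered for the table
SHOW PARTITIONS bookstore.publishers;

-- The dt filter prunes the scan to a single subdirectory on HDFS,
-- e.g. /user/hive/warehouse/bookstore.db/publishers/dt=2024-07-24
SELECT publisher_name
FROM bookstore.publishers
WHERE dt = '2024-07-24';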

Hive Bucketing

Hive bucketing is a clustering technique to split the data into more manageable files (by specifying the number of buckets to create). The value of the bucketing column is hashed into a user-defined number of buckets.

Bucketing can be created on just one column; you can also create bucketing on a partitioned table to further split the data and improve the query performance of the partitioned table.

Each bucket is stored as a file within the table’s directory or the partitions directories on HDFS.

Records with the same value in a column will always be stored in the same bucket.

Hive bucketing is commonly created in two scenarios.

  • Create buckets on top of a partitioned table to further divide the table for better query performance.
  • Create buckets on a table where you cannot choose a partition column because the candidate columns have too many distinct values.

The example below shows the first approach, where I create buckets on top of a partitioned table.
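A minimal sketch of that approach (the book_sales table and its columns are illustrative, not part of the bookstore dataset above):

-- Partitioned by date, then bucketed by the hash of author_id into 8 files per partition
CREATE TABLE bookstore.book_sales (
  sale_id INT,
  book_id INT,
  author_id INT,
  amount DECIMAL(10,2)
)
PARTITIONED BY (dt STRING)
CLUSTERED BY (author_id) INTO 8 BUCKETS
STORED AS ORC;

Rows with the same author_id always hash into the same bucket file, so joins and sampling on author_id can skip the other buckets.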

Sample commands to show databases and table structure

Show databases

> show databases;

Show all tables

> use <database_name>;
> show tables;

Table description

> use <database_name>;
> desc <table_name>;

# Show it in a formatted way
> desc formatted <table_name>;

File Format in Hive

The following are the different Apache Hive file formats:

  • Text File
  • Sequence File
  • RC File
  • AVRO File
  • ORC File
  • Parquet File

Text File Format

  1. It's the default storage and plain file format for Hive
  2. Data is stored in rows and columns, with records terminated by a newline (\n).
  3. Use compression (such as GZIP or BZIP2) to reduce the file size needed to store the information.

Sequence File Format

  1. This format is a key-value pair format
  2. It can be split into different files, and it also supports merging two or more files into one file during different operations
  3. To create the sequence file format, you have to add 'STORED AS SEQUENCEFILE' to the CREATE TABLE command.

Hive RC File Format

  1. The RCFile format is a column-based file format that also supports multiple rows at a time
  2. It also supports row-level compression and stores data in a key-value pair format
  3. To create the RCFile format, we have to add the option 'STORED AS RCFILE' to the CREATE TABLE command.

The ORC file

  1. The ORC format is also a row-columnar format
  2. This file format is efficient for saving data and improves Hive's read and write performance on large-scale data
  3. We can create ORC-format data by adding the option 'STORED AS ORC' to the CREATE TABLE command.

Hive Parquet File Format

  1. The Parquet file format is a column-oriented binary file format
  2. We can also split the data into different files and merge different files during different operations
  3. It provides better performance when querying large-scale data
  4. It supports different compression codecs like Snappy and GZIP
  5. You need to specify the 'STORED AS PARQUET' option in the CREATE TABLE command if you want to use this file format.
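As a concrete sketch, here is the publishers table from earlier rewritten as a Parquet table (the table name and the Snappy compression property are illustrative choices):

CREATE TABLE bookstore.publishers_parquet (
  publisher_id INT,
  publisher_name STRING,
  name STRING,
  address STRING,
  founding_date STRING
)
PARTITIONED BY (dt STRING)
STORED AS PARQUET
TBLPROPERTIES ('parquet.compression' = 'SNAPPY');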

Loading data into Hive tables through HDFS, Hive, and Beeline

Sample CSV data for publishers (Publishers.csv)

> mkdir -p doc
> cd doc
> curl -O https://raw.githubusercontent.com/tariqulislam/sample-hive-datasource/main/Publishers.csv
> docker container ls
> docker cp Publishers.csv docker-hive_hive-server_1:/opt/doc/

Process of putting the data into HDFS

> hdfs dfs -ls /
> hdfs dfs -mkdir /user/hive/publisher/
> hdfs dfs -put -f /opt/doc/Publishers.csv /user/hive/publisher/
> hdfs dfs -cat /user/hive/publisher/Publishers.csv
> /opt/hive/bin/beeline -u jdbc:hive2://localhost:10000

Load the data into the publishers table and check it

0: jdbc:hive2://localhost:10000> LOAD 
. . . . . . . . . . . . . . . .> DATA INPATH '/user/hive/publisher/Publishers.csv'
. . . . . . . . . . . . . . . .> OVERWRITE INTO TABLE bookstore.publishers
. . . . . . . . . . . . . . . .> PARTITION (dt='2024-07-24')
. . . . . . . . . . . . . . . .> ;
> SELECT * FROM publishers WHERE dt='2024-07-24';

Insert data by Running Insert Query

Log in to Beeline

> beeline -u jdbc:hive2://localhost:10000/default
beeline> use bookstore;

Insert Command

-- Inserting data for partition dt='2024-07-06'
INSERT INTO authors PARTITION (dt='2024-07-06') VALUES
(1, 'Jane Doe'),
(2, 'John Smith'),
(3, 'Alice Johnson');

-- Inserting data for partition dt='2024-07-05'
INSERT INTO authors PARTITION (dt='2024-07-05') VALUES
(4, 'Michael Brown'),
(5, 'Emily Davis');

-- Inserting more data for partition dt='2024-07-06'
INSERT INTO authors PARTITION (dt='2024-07-06') VALUES
(6, 'David Wilson'),
(7, 'Sarah Lee');

-- Inserting data for partition dt='2024-07-04'
INSERT INTO authors PARTITION (dt='2024-07-04') VALUES
(8, 'Daniel Martin'),
(9, 'Sophia White'),
(10, 'Oliver Garcia');

Process the sample data for the book_info table in HDFS


> curl -O https://raw.githubusercontent.com/tariqulislam/sample-hive-datasource/main/Books.csv
> docker cp Books.csv docker-hive_hive-server_1:/opt/doc/
> docker exec -it docker-hive_hive-server_1 bash
> hdfs dfs -ls /
> hdfs dfs -mkdir /user/hive/books/
> hdfs dfs -put -f /opt/doc/Books.csv /user/hive/books/

Loading the data into the book_info table

> LOAD DATA INPATH '/user/hive/books/Books.csv'
> OVERWRITE INTO TABLE bookstore.book_info
> PARTITION (dt='2024-07-24');
> SELECT * FROM book_info WHERE dt='2024-07-24';

Process the sample data for the customer table in HDFS

> curl -O https://raw.githubusercontent.com/tariqulislam/sample-hive-datasource/main/Customer.csv
> docker cp Customer.csv docker-hive_hive-server_1:/opt/doc/
> docker exec -it docker-hive_hive-server_1 bash
> hdfs dfs -ls /
> hdfs dfs -mkdir /user/hive/customers/
> hdfs dfs -put -f /opt/doc/Customer.csv /user/hive/customers/

Loading data into the customer table and checking it with a SELECT statement

> LOAD DATA INPATH '/user/hive/customers/Customer.csv'
> INTO TABLE bookstore.customer
> PARTITION (dt='2024-07-24', country='US'); -- the table is partitioned by dt and country; 'US' is illustrative
> SELECT * FROM customer WHERE dt='2024-07-24';

Run queries from a sample HQL file via the command line

SET hivevar:tbl_book=bookstore.book_info;
SET hivevar:start_date=2024-07-24;
SET hivevar:end_date=2024-07-24;

SELECT
book_id
, isbn
, publisher_id
, concat_ws(',', language_code) AS language_code
, text_review
FROM
${tbl_book}
WHERE
publish_date BETWEEN '${start_date}' AND '${end_date}';
WITH get_authors_has_book AS (
SELECT
id,
name,
contact
FROM
author_stats -- illustrative source table with a book_count column
WHERE
book_count > 1
)
SELECT
book.book_id,
author.name,
author.id
FROM
book AS book
JOIN get_authors_has_book AS author ON author.id = book.author_id;

The command below is used to run the query file:

> beeline -u jdbc:hive2://localhost:10000/ -f /path/to/queries.hql


Written by Tariqul Islam

Tariqul Islam has 10+ years of software development experience. He knows Python, Java, C#, PHP, and other computer languages.
