Data Analysis with Hadoop and Hive

Tariqul Islam
11 min read · Jan 17, 2025


After working with Hadoop and Hive for a few years, I decided to write something that helps developers and big data learners get started with Hadoop and Hive to analyze the data in their systems.

Installation: Using Predefined Docker Images for Learning Purposes

Topics of the article

  1. Using the pre-configured Hive and Hadoop cluster
  2. Architecture of Hadoop
  3. Architecture of Hive
  4. HDFS
  5. Common operations of Hive (create database and table)
  6. Data types
  7. Partitioning
  8. Bucketing
# Clone the Docker project for docker-hive
git clone https://github.com/big-data-europe/docker-hive
cd docker-hive
# Start the containers and enter the Hive server container
docker-compose up -d
docker-compose exec hive-server bash
# Run Hive with Beeline
/opt/hive/bin/beeline -u jdbc:hive2://localhost:10000

Basic Knowledge of the Hadoop Architecture

HDFS

HDFS stands for Hadoop Distributed File System. It works as a master-slave architecture: Hadoop has a NameNode and DataNodes to control operations. There is one NameNode and multiple DataNodes, and together they store and handle large datasets in a distributed manner.

The main features of HDFS are:

  1. It provides distributed storage and can be implemented on commodity hardware
  2. Data security and high fault tolerance

MapReduce

Hadoop MapReduce is a software framework whose purpose is to process vast amounts of data in parallel on large clusters of commodity hardware in a fault-tolerant manner.

Hadoop MapReduce Processing Stages

Input         --> Read the input data
Map Stage     --> Map jobs process the input data
Shuffle Stage --> Shuffle the mapped data
Sorting Stage --> Sort the data and pass the output to the Reduce stage
Reduce Stage  --> Reduce the mapped data with the reduce algorithm
Output        --> Provide the final result
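To make these stages concrete, here is a minimal HiveQL sketch: when Hive runs on the MapReduce engine, it compiles an aggregation like the one below into exactly this map/shuffle/sort/reduce pipeline (the table comes from the bookstore examples later in this article).

-- Hive turns this aggregation into a MapReduce job:
--   Map:     read each row and emit the pair (publisher_id, 1)
--   Shuffle: group the emitted pairs by publisher_id
--   Sort:    order the pairs within each group
--   Reduce:  sum the counts for each publisher_id
SELECT publisher_id, COUNT(*) AS books_published
FROM bookstore.book_info
GROUP BY publisher_id;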

Apache Hadoop YARN

Hadoop YARN stands for Yet Another Resource Negotiator.

  1. The responsibility of YARN is to split up the following functions:
1.1 Resource management
1.2 Job management for the Hadoop cluster
1.3 Job scheduling
1.4 Monitoring the separate servers and daemons

  2. It has two main components:

2.1 Global ResourceManager (RM)
2.2 Per-application ApplicationMaster (AM)

YARN runs a single application as a job or a DAG of jobs.

It is also responsible for managing cluster resources, distributing tasks across the different machines in the cluster, and keeping jobs and the job scheduler on track.

Hadoop Common

It is a common component which provides the essential libraries, services, and utilities needed for the Hadoop ecosystem to function properly.

  1. It contains the I/O operations, RPC protocol, and serialization modules
  2. It also provides the necessary Java Archive (JAR) files and scripts required to run Hadoop, ensuring seamless interaction and communication between components
  3. It also provides the FileSystem API for integrating data storage systems with HDFS
  4. It also ships source code and documentation, as well as a contribution section that includes different projects from the Hadoop community.

Hive Architecture

HiveServer2

  1. HiveServer2 is an improved implementation of HiveServer1.
  2. It supports concurrent connections and sessions through the Thrift service
  3. It can also authenticate and authorize through LDAP and Kerberos
  4. It supports different drivers such as ODBC and JDBC and also supports optimized execution of Hive queries

Metastore

Hive needs to save metadata to run different operations, so it needs a store for that metadata, called the Metastore.

  1. The Hive Metastore comes with a Derby database by default, which does not support concurrent execution and queries
  2. In production, we need to use a relational database like MySQL, PostgreSQL, or Oracle as the Metastore database; it supports concurrent execution of operations and queries.
  3. The Metastore actually stores the information about the structure of Hive tables and partitions, the columns and column types, and their locations
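A quick way to see what the Metastore holds: the commands below read table and partition metadata without touching the data files (they assume the bookstore tables created later in this article).

-- All of the output below is served from the Metastore, not from HDFS data files
DESC DATABASE bookstore;              -- database location and owner
DESC FORMATTED bookstore.book_info;   -- columns, partition keys, storage format and location
SHOW PARTITIONS bookstore.book_info;  -- partitions registered in the Metastore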

Hive clients and Beeline

Early versions of Hive used the Hive CLI to run queries through the command line, which is now deprecated.

  1. Recent versions of Hive recommend that developers and users use Beeline as the command-line interface (CLI) to execute queries and other Hive and Hadoop operations.
  2. Beeline connects through HiveServer2 and acts as an interface or client for the user to run queries and print the results.
  3. Beeline also supports pretty-printing of Hive query results with column names and fields, plus all Hive CLI features

Hive Driver

  1. The Hive driver receives the HiveQL statements submitted by the user through the command shell.
  2. It creates the session handles for the query and sends the query to the compiler.

Hive Compiler

  1. The Hive compiler parses the query, performing semantic analysis and type checking on the different query blocks and query expressions with the help of the Metastore.
  2. It generates the execution plan; the execution plan created by the compiler is a DAG (Directed Acyclic Graph).
  3. The DAG contains the stages of map/reduce jobs, operations on HDFS, and Metastore operations.

Optimizer

The optimizer performs transformation operations on the execution plan and splits tasks to improve efficiency and scalability.
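You can inspect the plan produced by the compiler and optimizer by prefixing a query with EXPLAIN. A minimal sketch, using the book_info table defined later in this article:

-- Prints the optimized DAG of stages instead of running the query
EXPLAIN
SELECT publisher_id, COUNT(*) AS books_published
FROM bookstore.book_info
WHERE dt = '2024-07-24'
GROUP BY publisher_id;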

Different Hadoop Operations with Hive

Using HDFS tools to create the database warehouse directory in HDFS

# Create the storage directory for the database
/opt/hadoop-2.7.4/bin/hdfs dfs -mkdir -p /user/hive/warehouse/bookstore.db
# Check that the warehouse location was created for bookstore
/opt/hadoop-2.7.4/bin/hdfs dfs -ls /user/hive/warehouse/

Create the database with the Beeline CLI

/opt/hive/bin/beeline -u jdbc:hive2://localhost:10000
0: jdbc:hive2://localhost:10000> create database
. . . . . . . . . . . . . . . .> bookstore
. . . . . . . . . . . . . . . .> location '/user/hive/warehouse/bookstore.db'
. . . . . . . . . . . . . . . .> ;

Check that the database was created in the specific HDFS directory

0: jdbc:hive2://localhost:10000> desc database bookstore;

Second approach: create the database in the default HDFS storage location

0: jdbc:hive2://localhost:10000> create database
. . . . . . . . . . . . . . . .> bookstore_qa
. . . . . . . . . . . . . . . .> ;
No rows affected (0.073 seconds)

Create Table in hive

Table Creation Syntax

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] <table_name> (
column_name1 data_type COMMENT '...',
column_name2 data_type COMMENT '...'
)
PARTITIONED BY (column_name3 data_type, column_name4 data_type)
CLUSTERED BY (col_name1, col_name2, ...)
SORTED BY (col_name1 [ASC | DESC])
INTO num_buckets BUCKETS
ROW FORMAT row_format
-- If the row format is delimited text:
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
-- End if
STORED AS file_format
LOCATION 'path ...'
TBLPROPERTIES (key1=val1, key2=val2, ...)

Create tables with different key properties, file formats, and locations

Sample of creating a table without a text file format

CREATE TABLE authors (
  id INTEGER,
  author_name STRING
)
PARTITIONED BY (dt STRING)
TBLPROPERTIES ('transactional' = 'false');

Creating a table with a text-based format (publishers)

CREATE TABLE publishers (
  publisher_id INTEGER,
  publisher_name STRING,
  name STRING,
  address STRING,
  founding_date STRING
)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
TBLPROPERTIES ('transactional' = 'false');

Another example of creating a table with a text-based format (book_info)

CREATE TABLE bookstore.book_info (
  book_id INTEGER,
  title STRING,
  authors ARRAY<STRING>,
  average_rating DECIMAL(10,2),
  isbn INTEGER,
  isbn12 INTEGER,
  language_code ARRAY<STRING>,
  number_of_page INTEGER,
  rating_count INTEGER,
  text_review INTEGER,
  publisher_id INTEGER,
  publish_date STRING
)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
TBLPROPERTIES ('transactional' = 'false');

Example of an external table in Hive

CREATE EXTERNAL TABLE customer (
  id INTEGER,
  customer_name STRUCT<firstname: STRING, middlename: STRING, lastname: STRING>,
  personal_info MAP<STRING, STRING>,
  email STRUCT<one: STRING, two: STRING, three: STRING>,
  contact STRUCT<one: STRING, two: STRING, three: STRING>
)
PARTITIONED BY (dt STRING, country STRING)
LOCATION '/user/hive/customer_info'
TBLPROPERTIES ('transactional' = 'false');

Variable Declaration

We can declare and access Hive variables in three namespaces:

  1. hiveconf — Hive starts with this; all the Hive configuration is stored as part of this namespace.
  2. hivevar — to store user variables.
  3. system — to store system variables.

We can set variables in different ways. Open Beeline from the command line, then run the statements below:

> SET hivevar:key=value;
> SET hivevar:dt=2013-10-12;
> SELECT * FROM bookstore.publishers WHERE dt='${dt}';

Run a query file with a Hive configuration variable

-- test.hql: dt is passed in from the command line as a hiveconf variable
SELECT *
FROM book_info
WHERE dt='${hiveconf:dt}';

> hive -hiveconf dt='2024-07-24' -f test.hql

# test.ini.hql
SET hivevar:tablename=mytable;

> beeline -u jdbc:hive2://localhost:10000 -i /user/test.ini.hql
hive> select * from ${tablename};
hive> select * from ${hivevar:tablename};

Data Type

Numeric Types: INT/INTEGER, BIGINT, FLOAT, DOUBLE, DECIMAL, NUMERIC

Date/Time Types: TIMESTAMP, DATE, INTERVAL

String Types: STRING, VARCHAR, CHAR

Misc Types: BOOLEAN, BINARY

Complex Types

  • Arrays: ARRAY<data_type>
  • Maps: MAP<primitive_type, data_type>
  • Structs: STRUCT<col_name : data_type [COMMENT col_comment], ...>
  • Unions: UNIONTYPE<data_type, data_type, ...>
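A short sketch of how these complex types are read back in queries, using the customer table defined earlier (the 'gender' map key is a hypothetical example, since the CSV contents are not shown here):

SELECT
customer_name.firstname AS first_name,   -- STRUCT field access with dot notation
personal_info['gender'] AS gender,       -- MAP lookup by key ('gender' is hypothetical)
email.one AS primary_email               -- STRUCT field access
FROM customer
WHERE dt = '2024-07-24';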

Hive Partitioning

Hive table partitioning is used to split a large table and large amounts of data into smaller parts according to specific columns or keys. Internally, Hive splits the records into subdirectories on HDFS named with the partition key and value.

  • Fast access to the data
  • Optimizes queries and reduces query execution time
  • Provides the ability to perform an operation on a smaller dataset

For example, if we have one year of data and partition it by date, we can query a specific date without touching the rest of the data, because of the partitions.
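A small sketch of that behavior with the publishers table from earlier: SHOW PARTITIONS lists the subdirectories Hive registered, and a dt filter lets Hive read only the matching one.

-- List the partitions Hive registered for the table
SHOW PARTITIONS bookstore.publishers;

-- The dt filter prunes the scan to a single subdirectory on HDFS,
-- e.g. /user/hive/warehouse/bookstore.db/publishers/dt=2024-07-24
SELECT publisher_name
FROM bookstore.publishers
WHERE dt = '2024-07-24';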

Hive Bucketing

Hive bucketing is a clustering technique to split the data into more manageable files (by specifying the number of buckets to create). The value of the bucketing column is hashed into a user-defined number of buckets.

Bucketing can be created on just one column; you can also create bucketing on a partitioned table to further split the data and improve the query performance of the partitioned table.

Each bucket is stored as a file within the table’s directory or the partitions directories on HDFS.

Records with the same value in a column will always be stored in the same bucket.

Hive bucketing is commonly created in two scenarios.

  • Create buckets on top of a partitioned table to further divide the table for better query performance.
  • Create buckets on a table where you cannot choose a partition column because the candidate columns have too many distinct values.

The example below shows the first approach, where I create buckets on top of a partitioned table.
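A minimal sketch of that approach (the book_sales table and its columns are illustrative, not part of the bookstore dataset above):

-- Partitioned by date, then bucketed by the hash of author_id into 8 files per partition
CREATE TABLE bookstore.book_sales (
  sale_id INT,
  book_id INT,
  author_id INT,
  amount DECIMAL(10,2)
)
PARTITIONED BY (dt STRING)
CLUSTERED BY (author_id) INTO 8 BUCKETS
STORED AS ORC;

Rows with the same author_id always hash into the same bucket file, so joins and sampling on author_id can skip the other buckets.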

Sample commands to show databases and table structure

Show databases

> show databases;

Show all tables

> use <database_name>;
> show tables;

Table description

> use <database_name>;
> desc <table_name>;

# Show it in a formatted way
> desc formatted <table_name>;

File Format in Hive

The following are the different Apache Hive file formats:

  • Text File
  • Sequence File
  • RC File
  • AVRO File
  • ORC File
  • Parquet File

Text File Format

  1. It's the default storage and plain file format for Hive
  2. Data is stored in rows and columns, with records terminated by a newline (\n).
  3. Use compression (such as GZIP or BZIP2) to reduce the file size needed to store the information.

Sequence File Format

  1. This format is a key-value pair format
  2. It can be split into different files, and it also supports merging two or more files into one file during different operations
  3. To create the sequence file format, you have to add 'STORED AS SEQUENCEFILE' to the CREATE TABLE command.

Hive RC File Format

  1. The RCFile format is a column-based file format that also supports multiple rows at a time
  2. It also supports row-level compression and stores data in a key-value pair format
  3. To create the RCFile format, we have to add the option 'STORED AS RCFILE' to the CREATE TABLE command.

The ORC file

  1. The ORC format is also a row-columnar format
  2. This file format is efficient for saving data and improves Hive's read and write performance on large-scale data
  3. We can create ORC-format data by adding the option 'STORED AS ORC' to the CREATE TABLE command.

Hive Parquet File Format

  1. The Parquet file format is a column-oriented binary file format
  2. We can also split the data into different files and merge different files during different operations
  3. It provides better performance when querying large-scale data
  4. It supports different compression codecs like Snappy and GZIP
  5. You need to specify the 'STORED AS PARQUET' option in the CREATE TABLE command if you want to use this file format.
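As a concrete sketch, here is the publishers table from earlier rewritten as a Parquet table (the table name and the Snappy compression property are illustrative choices):

CREATE TABLE bookstore.publishers_parquet (
  publisher_id INT,
  publisher_name STRING,
  name STRING,
  address STRING,
  founding_date STRING
)
PARTITIONED BY (dt STRING)
STORED AS PARQUET
TBLPROPERTIES ('parquet.compression' = 'SNAPPY');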

Loading data into Hive tables through HDFS, Hive, and Beeline

Sample CSV data for publishers (Publishers.csv)

> mkdir -p doc
> cd doc
> curl -O https://raw.githubusercontent.com/tariqulislam/sample-hive-datasource/main/Publishers.csv
> docker container ls
> docker cp Publishers.csv docker-hive_hive-server_1:/opt/doc/

Process of putting the data into HDFS

> hdfs dfs -ls /
> hdfs dfs -mkdir /user/hive/publisher/
> hdfs dfs -put -f /opt/doc/Publishers.csv /user/hive/publisher/
> hdfs dfs -cat /user/hive/publisher/Publishers.csv
> /opt/hive/bin/beeline -u jdbc:hive2://localhost:10000

Load the data into the publishers table and check it

0: jdbc:hive2://localhost:10000> LOAD 
. . . . . . . . . . . . . . . .> DATA INPATH '/user/hive/publisher/Publishers.csv'
. . . . . . . . . . . . . . . .> OVERWRITE INTO TABLE bookstore.publishers
. . . . . . . . . . . . . . . .> PARTITION (dt='2024-07-24')
. . . . . . . . . . . . . . . .> ;
> SELECT * FROM publishers WHERE dt='2024-07-24';

Insert data by Running Insert Query

Log in to Beeline

> beeline -u jdbc:hive2://localhost:10000/default
beeline> use bookstore;

Insert Command

-- Inserting data for partition dt='2024-07-06'
INSERT INTO authors PARTITION (dt='2024-07-06') VALUES
(1, 'Jane Doe'),
(2, 'John Smith'),
(3, 'Alice Johnson');

-- Inserting data for partition dt='2024-07-05'
INSERT INTO authors PARTITION (dt='2024-07-05') VALUES
(4, 'Michael Brown'),
(5, 'Emily Davis');

-- Inserting more data for partition dt='2024-07-06'
INSERT INTO authors PARTITION (dt='2024-07-06') VALUES
(6, 'David Wilson'),
(7, 'Sarah Lee');

-- Inserting data for partition dt='2024-07-04'
INSERT INTO authors PARTITION (dt='2024-07-04') VALUES
(8, 'Daniel Martin'),
(9, 'Sophia White'),
(10, 'Oliver Garcia');

Process the sample data for the book_info table in HDFS


> curl -O https://raw.githubusercontent.com/tariqulislam/sample-hive-datasource/main/Books.csv
> docker cp Books.csv docker-hive_hive-server_1:/opt/doc/
> docker exec -it docker-hive_hive-server_1 bash
> hdfs dfs -ls /
> hdfs dfs -mkdir /user/hive/books/
> hdfs dfs -put -f /opt/doc/Books.csv /user/hive/books/

Loading the data into the book_info table

> LOAD DATA INPATH '/user/hive/books/Books.csv'
> OVERWRITE INTO TABLE bookstore.book_info
> PARTITION (dt='2024-07-24');
> SELECT * FROM book_info WHERE dt='2024-07-24';

Process the sample data for the customer table in HDFS

> curl -O https://raw.githubusercontent.com/tariqulislam/sample-hive-datasource/main/Customer.csv
> docker cp Customer.csv docker-hive_hive-server_1:/opt/doc/
> docker exec -it docker-hive_hive-server_1 bash
> hdfs dfs -ls /
> hdfs dfs -mkdir /user/hive/customers/
> hdfs dfs -put -f /opt/doc/Customer.csv /user/hive/customers/

Loading data into the customer table and checking it with a SELECT statement

> LOAD DATA INPATH '/user/hive/customers/Customer.csv'
> INTO TABLE bookstore.customer
> PARTITION (dt='2024-07-24', country='US'); -- the table is partitioned by dt and country; 'US' is illustrative
> SELECT * FROM customer WHERE dt='2024-07-24';

Run queries from a sample HQL file via the command line

SET hivevar:tbl_book=bookstore.book_info;
SET hivevar:start_date=2024-07-24;
SET hivevar:end_date=2024-07-24;

SELECT
book_id
, isbn
, publisher_id
, concat_ws(',', language_code) AS language_code
, text_review
FROM
${tbl_book}
WHERE
publish_date BETWEEN '${start_date}' AND '${end_date}';
WITH get_authors_has_book AS (
SELECT
id,
name,
contact
FROM
author_stats -- illustrative source table with a book_count column
WHERE
book_count > 1
)
SELECT
book.book_id,
author.name,
author.id
FROM
book AS book
JOIN get_authors_has_book AS author ON author.id = book.author_id;

The command below is used to run the query file:

> beeline -u jdbc:hive2://localhost:10000/ -f /path/to/queries.hql


Written by Tariqul Islam

Tariqul Islam has 10+ years of software development experience. He knows Python, Java, C#, PHP, and other computer languages.
