Hadoop Hive Full Tutorial: All You Need to Know About Hive


Do you want to know about Hadoop Hive? If yes, then give a few minutes to this blog to learn what Hadoop Hive is, along with all the details related to it.

Hello, & Welcome!

In this blog, I am gonna tell you-

  1. What is Hadoop Hive?
  2. Hadoop Hive Components.
  3. Hive Modes.
  4. Data Storage in Hive.
  5. Hive Table.
  6. What are the Hive Data Types?
  7. Hive Limitations.
  8. Steps to Write Code.
  9. Basic Queries in Hive.
  10. Joins in Hive.
  11. Data Mining on Wikipedia datasets

Firstly, I would like to start with-

What is Hadoop Hive?

Hadoop Hive is a data warehousing tool built on top of Hadoop. It offers an SQL-like interface: Hive provides an SQL-like language (HiveQL) to analyze the data stored on HDFS. It can be used by people who know SQL and who don’t want to write a huge amount of code.

By using Hive, you can perform in a few lines of HiveQL the same task that would take a huge program in Java. So even someone who doesn’t know how to code can become a data analyst by using Apache Hive.

Hive doesn’t support all traditional SQL capabilities. Under the hood, Hive queries are executed as MapReduce jobs; no extra work is required from you to write or manage those jobs.

Hive stores data in TABLE format. Hive supports only structured data, not unstructured data.

Let’s see how Hive works with the help of this flow chart-

[Flow chart: Hadoop Hive]

Hadoop Hive Components-

Now, it’s time to discuss the Hadoop Hive components. Hive has the following components-

  1. MetaStore- A database consisting of table definitions and other metadata. By default it is stored in an embedded Derby database on the local machine, but it can be kept in a shared relational database (such as MySQL) when multiple users are using Hive.
  2. Query Engine- HiveQL provides an SQL-like query language. Internally, Hive queries are run as MapReduce jobs.
  3. Hive Data Models- Hive layers table definitions on top of data residing on HDFS.
  4. Databases- Namespaces that separate tables from other units and prevent naming conflicts.
  5. Table- A homogeneous unit of data sharing the same schema.
  6. Partition- Determines how the data is laid out in storage. Partition columns are virtual columns: not part of the data itself, but supplied when the data is loaded.
  7. Buckets/Clusters- A further subdivision of the data within a partition (or table), based on the hash of a chosen column.

Hive Modes-

Hive works in two modes-

  1. Local Mode- This is a query-by-query mode: you write one query, execute it, then write the next query, and so on.
  2. Batch Mode- Here, you write all the code first, save it in a .hql file, and then run that file as-
hive> source filename.hql;
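As a small sketch of batch mode (the file name, table name, and data path here are made up for illustration), a .hql script might look like this:

```sql
-- analysis.hql: a hypothetical batch script
CREATE TABLE IF NOT EXISTS demo (name STRING, id INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- load sample data from the local filesystem
LOAD DATA LOCAL INPATH 'demo_data.txt' INTO TABLE demo;

-- run a quick check on the loaded data
SELECT * FROM demo LIMIT 10;
```

You can run it from inside the shell with source analysis.hql; or, equivalently, from the command line with hive -f analysis.hql.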

Data Storage in Hive-

In Hive, you can store data in two ways-

  1. Temporary- This is similar to Pig: temporary data lives only as long as your Hive shell session. You can use temporary storage for data generated within Hive.
  2. Permanent- The purpose of permanent storage is to keep data (data + metadata) permanently. You can use permanent storage when you load data from outside Hadoop.

Hive Tables-

Hive has four types of Tables-

  1. Internal Table.
  2. External Table.
  3. Partition Table.
  4. Cluster Table

Internal Table-

By default, a table in Hive is internal (also called a managed table). For an internal table, Hive manages the data itself under its warehouse directory (by default /user/hive/warehouse), and dropping the table deletes the data along with the metadata.

Let’s see how to create Internal Table

CREATE TABLE IF NOT EXISTS student1 ( name string, id int , course string, year int )
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

Here, there is no need to pass an “INTERNAL” keyword, because tables are internal by default.

External Table-

An external table stores data in a permanent directory outside Hive’s warehouse. Here you need to pass the “EXTERNAL” keyword (usually together with a LOCATION clause). When you drop an external table, only the metadata is removed; the data stays in place.

Let’s see how to create an external table-

CREATE EXTERNAL TABLE IF NOT EXISTS student3 ( name string, id int , course string, year int )
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

Here, we pass the “EXTERNAL” keyword.

Partition Table-

A partition table is used when the data is huge. Partitioning splits the data into subdirectories (one per partition value), so queries that filter on a partition column scan far less data.

Let’s see how to create a Partition Table-

CREATE TABLE IF NOT EXISTS student_partition ( name string , id int ) 
PARTITIONED BY ( course string , year int)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
;
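As a sketch (the file path and partition values here are hypothetical), data is then loaded into a specific partition; and since the table-type list above also mentions a cluster table, here is what a bucketed table definition looks like:

```sql
-- Load a file into one partition (static partitioning)
LOAD DATA LOCAL INPATH 'student_2020.txt'
INTO TABLE student_partition
PARTITION (course = 'hadoop', year = 2020);

-- A cluster (bucketed) table: rows are distributed into
-- buckets by the hash of the id column
CREATE TABLE IF NOT EXISTS student_cluster ( name string , id int )
CLUSTERED BY (id) INTO 4 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
```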

What are the Hive Data Types?

Hive has two categories of data types-

  1. Primitive Types.
  2. Complex Types.

Primitive Types- The primitive types include the following-

– TINYINT (1 byte)

– SMALLINT(2 bytes)

– INT(4 bytes)

– BIGINT(8 bytes)

– FLOAT(4 bytes)

– BOOLEAN(true/false)

– DOUBLE(8 bytes)

– STRING(char string)

– BINARY (a byte array; a more recent addition)

Complex Types- The complex types include the following-

– ARRAY (an ordered collection of fields; the fields must all be of the same type) (e.g. array(1,2))

– MAP (an unordered collection of key-value pairs) (e.g. map('a',1,'b',2))

– STRUCT (a collection of named fields of different types)
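As a sketch of how these complex types appear in a table definition (the table and column names are made up for illustration):

```sql
CREATE TABLE IF NOT EXISTS employee_complex (
  name    string,
  skills  ARRAY<string>,                 -- ordered fields of one type
  ratings MAP<string, int>,              -- unordered key-value pairs
  address STRUCT<city:string, zip:int>   -- named fields of different types
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '|'
MAP KEYS TERMINATED BY ':';
```

Individual elements are then accessed as skills[0], ratings['hive'], and address.city.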

Hive Limitations-

Hive has the following limitations (several of these apply mainly to older versions of Hive)-

  1. Not all standard SQL queries are supported.
  2. Subqueries are supported only in limited places (for example, in the FROM clause).
  3. No support for UPDATE and DELETE operations (row-level changes arrived only later, with ACID tables).
  4. Single rows cannot be inserted.
  5. The workflow is mostly limited to LOAD and SELECT (retrieve) commands.

Steps to Write Code

Step 1-

Create the input table based on the flat file (FF) structure.

Step 2-

Load data into the input table.

Step 3-

Create Output Table.

Step 4-

Insert into the output table by selecting from the input table.
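The four steps above can be sketched in HiveQL as follows (the table and file names here are hypothetical):

```sql
-- Step 1: create the input table based on the flat-file structure
CREATE TABLE IF NOT EXISTS input_tbl (name string, score int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- Step 2: load data into the input table
LOAD DATA LOCAL INPATH 'input_data.txt' INTO TABLE input_tbl;

-- Step 3: create the output table
CREATE TABLE IF NOT EXISTS output_tbl (name string, total int);

-- Step 4: insert into the output table, selecting from the input table
INSERT OVERWRITE TABLE output_tbl
SELECT name, SUM(score) FROM input_tbl GROUP BY name;
```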

Basic Queries in Hive-

To Show Tables- (Note: SHOW TABLES queries Hive’s data dictionary, the metastore)

hive> SHOW TABLES;

How to Create Table-

hive> CREATE TABLE sample(firstName STRING, lastName
STRING, id INT)
ROW FORMAT
DELIMITED FIELDS
TERMINATED BY ' '
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

To Show Table Definition-

hive> DESCRIBE sample; 

How To Load Data Into Table From HDFS-

LOAD DATA INPATH “sample_data” INTO TABLE sample;

To Load Data Into Table From Local File System

NOTE- Give the path to the data on your local filesystem. (With LOCAL, Hive copies the file; a plain LOAD DATA INPATH moves the file within HDFS.)

hive> LOAD DATA LOCAL INPATH "sample_local_data" INTO TABLE sample;

Basic Select Queries

Note:- This plain select query does not run a MapReduce job. Hive simply reads the table’s files directly and displays the entire data.

SELECT * FROM sample;

Note:- The second query does execute a MapReduce job. When we apply a filter like WHERE (or sorting and aggregation), MapReduce kicks in.

SELECT * FROM sample WHERE id > 100
ORDER BY id ASC
LIMIT 20;

DROP table in HIVE

hive> drop table sample;

CREATE DATABASE

CREATE DATABASE IF NOT EXISTS kalyan;

TO ALTER DATABASE-

ALTER DATABASE kalyan SET DBPROPERTIES ('key1' = 'value11', 'key3' = 'value3');
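To check that the properties were applied (assuming the kalyan database from above exists):

```sql
DESCRIBE DATABASE EXTENDED kalyan;  -- lists the location and DBPROPERTIES
```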

Joins in Hive-

Hive has four types of Joins-

Inner Join-

It contains only the common records. In set notation: E ⋂ D.

Let’s see how to create Inner Join.

SELECT A.id, B.id,A.firstName FROM
sample A JOIN otherTable B
ON (A.id = B.id);

Left Outer Join-

Here, it returns the full data of the left-side table plus the common records.

Left side table + common records

In set notation: E + (E ⋂ D).

Let’s see how to create a left outer join.

SELECT A.id, B.id,A.firstName FROM
sample A  LEFT OUTER JOIN otherTable B
ON (A.id = B.id);

Right Outer Join-

Here, it returns the full data of the right-side table plus the common records.

Right side table + common records

In set notation: D + (E ⋂ D).

Let’s see how to create a right outer join.

SELECT A.id, B.id,A.firstName FROM
sample A  RIGHT OUTER JOIN otherTable B
ON (A.id = B.id);

Full Outer Join-

In a full outer join, the full data comes from both tables. In set notation: E ∪ D.

Let’s see how to create a full outer join.

SELECT A.id, B.id,A.firstName FROM
sample A  FULL OUTER JOIN otherTable B
ON (A.id = B.id);

Data Mining on Wikipedia datasets.

Here, I am gonna discuss how you can perform mining on the Wikipedia dataset by using Hive. You can do it with the four simple steps we discussed before, plus a final step to view the results.

Let’s see how to do, start with the first step-

1. Create Input Table based on FF Structure.

hive> CREATE EXTERNAL TABLE wiki(projectname STRING,
pagename STRING, pageview INT, pagesize INT)
ROW FORMAT
DELIMITED FIELDS
TERMINATED BY ' '
LINES TERMINATED BY '\n'
STORED AS TEXTFILE LOCATION '/user/dev/wiki';

2. To load data from an HDFS file (only needed if the table was created without the LOCATION clause; the external table above already points at '/user/dev/wiki')-

hive> LOAD DATA INPATH '/user/dev/wiki' INTO TABLE wiki;

3. Create another table to store the output-

hive> CREATE TABLE wiki1(pagename STRING, sum INT);

4. Fire the actual query to calculate the sum of the pageview-

hive> INSERT OVERWRITE TABLE wiki1
SELECT pagename, SUM(pageview)
FROM wiki
WHERE projectname = 'en'
GROUP BY pagename;

5. View the results on the console in descending order-

hive> SELECT * FROM wiki1 ORDER BY sum DESC LIMIT 20;

Congratulations! That’s all for Hadoop Hive.

I hope, now you have a clear idea about What is Hadoop Hive and all its details.

Enjoy Learning!

All the Best!


Thank YOU!

Thought of the Day…

“It’s what you learn after you know it all that counts.”

John Wooden

Written By Aqsa Zafar

Founder of MLTUT, Machine Learning Ph.D. scholar at Dayananda Sagar University. Research on social media depression detection. Create tutorials on ML and data science for diverse applications. Passionate about sharing knowledge through website and social media.
