Hadoop PIG: How to Master with Super Easy Tutorial


Do you want to know about Hadoop Pig? If yes, give this blog a few minutes to learn what Hadoop Pig is and all the details related to it.

Hello, & Welcome!

In this blog, I am gonna tell you-

  1. What is Hadoop Pig?
  2. The Motivation for Using Pig.
  3. Applications of Pig.
  4. Data Types in Pig.
  5. Limitations of Pig.
  6. Modes in Pig.
  7. The syntax for Writing Pig Commands.
  8. Pig Commands.
  9. Joins in Pig.
  10. Data Mining on Cricket Data using Pig.

Firstly, I would like to start with-

What is Hadoop Pig?

Apache Pig is used for data processing on Hadoop. Pig provides a higher-level language, Pig Latin, that increases productivity and opens the system to non-Java programmers. It provides common operations like join, group, filter, and sort.

Pig Latin is-

  • A data flow language rather than procedural or declarative.
  • User code and existing binaries can be included almost anywhere.
  • Metadata not required, but used when available.
  • Support for nested types.
  • Operates on files in HDFS.
  • In one test, 10 lines of Pig Latin ≈ 200 lines of Java. What took 4 hours to write in Java took 15 minutes in Pig Latin.

Pig sits in the processing layer: once data is in HDFS, Pig is used for processing it. Pig is a Yahoo! product. It stores files in object form.

To process data, the first step is to load files into Pig objects, then do the processing on that data, and after that transfer the output into Hadoop. Once the output is transferred, the data is removed from Pig.

Pig can’t store data permanently; after processing, the data is removed from Pig. Pig works on structured as well as unstructured data.

Yahoo! Research developed Pig to address the need for a higher-level language. Roughly 30% of Hadoop jobs run at Yahoo! are Pig jobs.

The Motivation for Using Pig.

Most MapReduce programming is done in Java. So anyone solving big data problems must have a good Java programming background to understand MapReduce and its concepts well. Solving even a simple word count problem needs around 100 lines of Java code.

Solving the data mining problem (on the Wikipedia data set, which we will see in an upcoming section) requires two MapReduce jobs. MapReduce in Java is very low level and forces you to deal with all the intricacies yourself. And it is not only programmers: many non-programmers, like business analysts and data scientists, also want to analyze the data.

So those who don’t know how to code can use Apache Pig. The same task can be performed with a few lines of code, and Pig takes far less development time compared to Java.
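For instance, the classic word count fits in a handful of Pig Latin lines. This is a minimal sketch; the input and output paths are made up for illustration-

lines = LOAD 'input.txt' AS (line:chararray);
-- split each line into words; FLATTEN turns the bag of words into one word per record
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts = FOREACH grouped GENERATE group AS word, COUNT(words) AS cnt;
STORE counts INTO 'wordcount_out';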

Applications of Pig.

Hadoop Pig works on these fields-

  • Weblog processing.
  • Data processing for web search platforms.
  • Ad hoc queries across large data sets.
  • Rapid prototyping of algorithms for processing large data sets.

Data Types in Pig.

Hadoop Pig has two types of Data-

  1. Scalar Type-
    1. Int
    2. Long
    3. Double
    4. Char array
    5. Byte array
  2. Complex Type- (a small schema sketch follows this list)
    1. Map: an associative array, e.g. ENo. [1….10]
    2. Tuple: an ordered list of data; elements may be of any scalar or complex type, e.g. (1,A,10), (2,B,20), (3,C,10)
    3. Bag: an unordered collection of tuples, e.g. { (1,A,10), (2,B,20), (3,C,10) }
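To make the complex types concrete, here is a small sketch of a LOAD schema declaring a map, a tuple, and a bag; the file name and field names are hypothetical-

-- 'students' is a hypothetical input file
students = LOAD 'students'
           AS (info:map[],                        -- map, e.g. [eno#1]
               score:tuple(sub:chararray, m:int), -- tuple, e.g. (math,90)
               marks:bag{t:(m:int)});             -- bag, e.g. {(10),(20)}
dump students;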

Limitations of Pig.

Hadoop Pig has the following limitations-

  • No schema.
  • No delete, insert, or update.
  • There are no subqueries.
  • There is no projection.
  • Pig can’t store data permanently.

Modes in Pig-

There are two modes in Pig for writing script-

  1. Interactive Mode- This is a query-by-query mode, run from the Grunt shell. Developers use this mode.
  2. Batch Mode- In this mode, you write the full script, save it as a .pig file, and then run it (see the sketch after this list). In Pig, script files are saved with the .pig extension.
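As a quick sketch of batch mode (the script name and path are hypothetical), you save the commands in a .pig file-

-- contents of analysis.pig
records = LOAD '/user/demo/pigsample' AS (firstname:chararray, secondname:chararray);
dump records;

Then run it from the shell: "pig analysis.pig" executes it on the cluster (MapReduce mode), while "pig -x local analysis.pig" runs it locally for testing.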

The syntax for Writing Pig Commands.

To write any Pig commands, you need to follow this syntax-

<object_name> = <COMMAND> <object_file> [BY <condition>];

Here,

Object_name- the name of the object in which you want to store the result.

Command- the command you are running.

Object_file- the name of the input file or object.

Condition- the condition you want to apply.
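For example, plugging concrete names into that template (the object and file names are just for illustration)-

filtered = FILTER records BY projectName == 'en';

Here, 'filtered' is the object name, FILTER is the command, 'records' is the object file, and "projectName == 'en'" is the condition.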

Pig Commands.

Here, I will discuss the most used Pig Commands-

Loading Data-

Loading reads data from the file system. In Pig, files are stored in the form of objects. Objects live only for the duration of the session, between logon and logoff.

To load a file from HDFS to PIG

grunt> records = LOAD 'pigsample' AS (firstname:chararray, secondname:chararray);

TAB is the default delimiter. If the input file is not tab-delimited, loading itself will not fail, but subsequent operations will give wrong results.

Other delimiters can be specified with "USING PigStorage(...)". In this case, the pipe character is the delimiter-

grunt> records = LOAD 'pigsample'
                 USING PigStorage('|')
                 AS (firstname:chararray, secondname:chararray);

For eg.-

inputData = LOAD 'wiki/input1' USING PigStorage(' ') AS (projectName:chararray, pageName:chararray, pageCount:int, pageSize:int);

Here, the data "wiki/input1" is stored on HDFS. The fields are separated by spaces, which is specified by "USING PigStorage(' ')".

NOTE- when you define the column names, also specify the data types of the columns; a value that cannot be cast to the declared type is stored as NULL.
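As a small sketch of that behavior (the file name 'counts' is hypothetical), rows whose value cannot be cast to the declared type come through as NULL-

data = LOAD 'counts' AS (page:chararray, hits:int);
-- rows whose second field is not a valid int are loaded with hits = NULL
bad = FILTER data BY hits IS NULL;
dump bad;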

To view objects in PIG

grunt> dump records;

To describe an object in PIG

grunt> describe records;

Filter Records in Pig-

grunt> filtered_records = FILTER records BY projectName == 'en';

Here, ‘records’ is the data set and ‘projectName’ is the column name, which is defined in the ‘records’.

GROUPING and SORTING

Grouping-

grunt> group_records = GROUP filtered_records BY pageName;

Each tuple in ‘group_records’ has two parts: the group element, which is the value of the column you are grouping on, and a bag containing all the tuples from ‘filtered_records’ with the matching ‘pageName‘.
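You can confirm this two-part structure with describe; the output has roughly this shape, using the field names from the earlier wiki example-

grunt> describe group_records;
group_records: {group: chararray, filtered_records: {(projectName: chararray, pageName: chararray, pageCount: int, pageSize: int)}}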

Sorting-

grunt> sorting_records = ORDER records BY $0 DESC;

It will sort by column position, in the following order-

$0 → first column

$1 → second column

$2 → third column

There are a few more commands in Pig (a short usage sketch follows the table)-

Pig Command | What it does
store | Write data to the file system.
foreach | Apply an expression to each record and output one or more records.
join | Join two or more inputs based on a key.
distinct | Remove duplicate records.
union | Merge two data sets.
split | Split data into two or more sets, based on filter conditions.
stream | Send all records through a user-provided binary.
dump | Write output to stdout.
limit | Limit the number of records.
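A few of these commands in action, as a sketch over two hypothetical objects 'a' and 'b' that share the schema (name:chararray, n:int)-

u = UNION a, b;           -- merge the two data sets
d = DISTINCT u;           -- remove duplicate records
top = LIMIT d 10;         -- keep only the first 10 records
SPLIT d INTO small IF n < 100, big IF n >= 100;  -- split on filter conditions
STORE top INTO 'top_out'; -- write the result to the file system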

Joins in Pig.

There are four types of joins in Pig, similar to Hive-

  1. INNER Join
  2. Left Outer Join
  3. Right Outer Join
  4. Full Outer Join

For more details about Joins, you can read in my Hive Tutorial.

1. INNER JOIN

The default join is the inner join, so there is no need to pass the INNER keyword. Let’s see an example-

grunt> result = JOIN persons BY personid, orders BY personid;

Here, ‘persons’ is the object for the first table and ‘orders’ is the object for the second table.

2. Left Outer Join-

grunt> result = JOIN persons BY personid LEFT OUTER, orders BY personid;

3. Right Outer Join-

grunt> result = JOIN persons BY personid RIGHT OUTER, orders BY personid;

4. Full Outer Join-

grunt> result = JOIN persons BY personid FULL OUTER, orders BY personid;
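Putting it together, here is an end-to-end sketch of a join. The file paths, delimiter, and fields are assumptions for illustration-

persons = LOAD '/user/demo/persons' USING PigStorage(',')
          AS (personid:int, name:chararray);
orders = LOAD '/user/demo/orders' USING PigStorage(',')
         AS (orderid:int, personid:int, amount:int);
-- with LEFT OUTER, persons with no matching order appear with NULLs on the order side
result = JOIN persons BY personid LEFT OUTER, orders BY personid;
dump result;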

Data Mining on Cricket Data using Pig.

Problem Statement- In this example, there is a file where cricket info is stored: city name, cricketer name, and likes. We have to analyze which player gets more likes in Bangalore.

Dataset- This is not a huge dataset but a small one, just to help you understand how to perform analysis using Pig.

City | Cricketer_Name | Likes
Chennai | Kohli | 10
Chennai | Dhoni | 50
Hyderabad | Kohli | 40
Bangalore | Dhoni | 30
Bangalore | Kohli | 10
Bangalore | Dhoni | 30
Bangalore | Kohli | 40

So, from here we need to find how many likes Kohli gets in Bangalore and how many likes Dhoni gets-

City: Bangalore | Likes
Kohli | ?
Dhoni | ?

So, this dataset is a file in HDFS; to perform analysis on it, we first have to load it into Pig. That’s why the first step is loading the file into Pig.

Load File into PIG-

grunt> cric_in = LOAD '/user/demo/cricket'
                 USING PigStorage(' ')
                 AS (City:chararray,
                     Cricketer_Name:chararray,
                     Likes:int
                    );

Here, ‘cric_in’ is the object name; it may be anything you choose. ‘/user/demo/cricket’ is the path in HDFS.

To View the File-

Once the file has loaded into Pig, you can view it by writing-

grunt> dump cric_in;

Select a City as Bangalore from File-

grunt> cric_ban = FILTER cric_in BY City == 'Bangalore';

To Group it by Cricketer Name-

grunt> cric_grp = GROUP cric_ban BY Cricketer_Name;

So, when you then view this object by writing-

grunt> dump cric_grp;

you will get something like this-

(Dhoni,{(Bangalore,Dhoni,30),(Bangalore,Dhoni,30)})

(Kohli,{(Bangalore,Kohli,10),(Bangalore,Kohli,40)})

It groups the records by name, as Dhoni and Kohli.

Now you have to find the total likes for each cricketer, so you have to add up the likes for each person. Let’s see the example-

grunt> cric_sum = FOREACH cric_grp GENERATE group, SUM(cric_ban.Likes);

Here, ‘group’ is case-sensitive and must be lowercase; the built-in function SUM must be uppercase.

To view the Results-

grunt> dump cric_sum;

The output will be like this-

(Dhoni,60)
(Kohli,50)

Sort it in Descending Order-

If you want Kohli to come before Dhoni, you can sort the result.

grunt> cric_sort = ORDER cric_sum BY $0 DESC;

Here, $0 is the position of the name column (the first column).

Store result in Hadoop-

The final step is to store the result in Hadoop. For that, write the following line-

grunt> STORE cric_sort INTO '/user/demo/cric_pig.out';
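For reference, the whole analysis also works as one batch script. Save it as, say, cric.pig (the name is up to you) and run it with "pig cric.pig"-

-- cric.pig: end-to-end version of the steps above
cric_in = LOAD '/user/demo/cricket' USING PigStorage(' ')
          AS (City:chararray, Cricketer_Name:chararray, Likes:int);
cric_ban = FILTER cric_in BY City == 'Bangalore';
cric_grp = GROUP cric_ban BY Cricketer_Name;
cric_sum = FOREACH cric_grp GENERATE group, SUM(cric_ban.Likes);
cric_sort = ORDER cric_sum BY $0 DESC;
STORE cric_sort INTO '/user/demo/cric_pig.out';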

Congratulations! That’s all for Hadoop Pig.

I hope you now have a clear idea about what Hadoop Pig is and all its details.

Enjoy Learning!

All the Best!


Thank YOU!

Thought of the Day…

“It’s what you learn after you know it all that counts.”

- John Wooden

Written By Aqsa Zafar

Founder of MLTUT and a Machine Learning Ph.D. scholar at Dayananda Sagar University, researching depression detection on social media. She creates tutorials on ML and data science for diverse applications and is passionate about sharing knowledge through her website and social media.
