Interview Questions
Hadoop and Hive
As soon as the ‘Big Data Revolution’ began to influence the core of the IT industry, demand for Big Data skills and tooling grew rapidly. Big Data refers to large datasets characterized by huge volume, high velocity, and a wide variety of data types, and it continues to grow every day. The Apache Software Foundation (ASF) introduced Hadoop, a framework designed to solve the challenges of storing and processing Big Data at a scale that traditional frameworks cannot match.
Apache Hadoop
Apache Hadoop is an open-source framework that can store and process Big Data effectively in a distributed environment. It has two core modules: MapReduce and the Hadoop Distributed File System (HDFS).
MapReduce
MapReduce is a parallel programming model that is widely used for processing large amounts of structured, semi-structured, and unstructured data on large clusters of commodity hardware.
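The model can be sketched in plain Python (a conceptual illustration only, not the Hadoop API): the map phase emits key/value pairs, a shuffle step groups them by key, and the reduce phase aggregates each group — the classic word-count example.

```python
from collections import defaultdict

# Conceptual sketch of the MapReduce model (word count), not the Hadoop API.

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in the input."""
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key (word)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate the grouped values for each key."""
    return {word: sum(counts) for word, counts in groups.items()}

documents = ["big data big clusters", "big data"]
# In Hadoop, map tasks run in parallel across the cluster; here we just chain the phases.
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
print(counts)  # {'big': 3, 'data': 2, 'clusters': 1}
```

In a real Hadoop job, the framework handles the shuffle and distributes the map and reduce tasks across the cluster; the programmer supplies only the map and reduce functions.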
HDFS (Hadoop Distributed File System)
The Hadoop Distributed File System is the storage layer of the Hadoop framework, used to store the datasets that Hadoop processes. It provides a fault-tolerant, distributed file system designed to run on commodity hardware. The Hadoop ecosystem also contains sub-projects (tools) such as Sqoop, Pig, and Hive that complement the core Hadoop modules.
Sqoop
It is used to import and export data between HDFS and relational database management systems (RDBMS).
Pig
It is a procedural language platform used to develop scripts for MapReduce operations.
Hive
It is a platform used to develop SQL-like scripts to perform MapReduce operations. Note: There are various ways to execute MapReduce operations: the traditional approach, writing a Java MapReduce program, which handles structured, semi-structured, and unstructured data; the scripting approach, using Pig to process structured and semi-structured data; and the Hive Query Language (HiveQL or HQL), which uses Hive to process structured data.
Apache Hive is a component of Hortonworks Data Platform (HDP). Hive provides an SQL-like interface to data stored in HDP. In the previous tutorial, we used Pig, which is a scripting language with a focus on data flows. Hive provides a database query interface to Apache Hadoop.
Which one to choose? Hive or Pig?
People often ask why Pig and Hive both exist when they seem to do much of the same thing. Hive, because of its SQL-like query language, is often used as the interface to an Apache Hadoop-based data warehouse. Hive is considered friendlier and more familiar to users who are accustomed to using SQL for querying data. Pig fits in through its data-flow strengths, taking on the tasks of bringing data into Apache Hadoop and working with it to get it into a form suitable for querying. A good overview of how this works is in Alan Gates' post on the Yahoo Developer blog titled Pig and Hive at Yahoo!. From a technical point of view, both Pig and Hive are feature complete, so you can do tasks in either tool. However, you will find that one tool or the other is preferred by the different groups that use Apache Hadoop. The good part is that they have a choice, and both tools work together.