Interview Questions
MapReduce
To enable parallel processing of large datasets stored in the Hadoop Distributed File System (HDFS), the Apache MapReduce software framework is integrated into the Hadoop architecture. A typical Hadoop environment comprises thousands of nodes built from commodity hardware, so applications written with the MapReduce framework are needed to process that data reliably and in a fault-tolerant way.
The Tasks and Terminology
The name MapReduce comes from the two specific tasks carried out by the software framework: the map task and the reduce task.
Map Task
In this initial task, the data read from the data source is converted into key/value pairs. These pairs form the base data format that is passed on to the next step, the reduce task.
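For instance, a mapper for the classic word-count job might look like the following minimal sketch using Hadoop's Java MapReduce API (the class name WordCountMapper is illustrative, not part of Hadoop): each input line is split into words, and each word is emitted as a (word, 1) key/value pair.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Mapper that turns each input line into (word, 1) key/value pairs.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the line into tokens and emit each token with a count of 1.
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}
```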
Reduce Task
After the map task, the generated key/value pairs are combined into a smaller set of tuples, grouped by key.
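A matching reducer sketch for the same word-count example (again illustrative; the class name WordCountReducer is assumed) sums the counts emitted by the mapper for each word:

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Reducer that sums the counts emitted by the mapper for each word.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        result.set(sum);
        context.write(key, result);   // e.g. ("hadoop", 42)
    }
}
```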
While these tasks run, the framework itself takes care of scheduling, monitoring, executing, and re-executing them.
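This is visible in the driver program: a minimal sketch (assuming Hadoop 2.x or later and the illustrative WordCountMapper and WordCountReducer classes above) only wires the mapper and reducer together and submits the job, leaving scheduling, monitoring, and re-execution of failed tasks to the framework.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver that configures the job; the framework handles scheduling,
// monitoring, and re-execution of the individual tasks.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Submit the job and block until it finishes.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```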
Master and Slave
The MapReduce framework consists of a single master JobTracker and one slave TaskTracker per cluster node. Together they are responsible for task management and execution.
Master JobTracker
As the master, the JobTracker is responsible for creating the tasks of a MapReduce job and assigning them to TaskTrackers so that they run on the DataNodes holding the data. After running a task, the TaskTracker reports its status back to the JobTracker. The JobTracker also keeps track of the resources available and the resources consumed across the cluster. Because the JobTracker is the single point through which MapReduce runs its jobs, its failure brings all MapReduce processing to a halt.
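As an illustration of the JobTracker's resource-tracking role, the classic MRv1 client API can be used to ask it for a snapshot of cluster resources. The sketch below assumes a Hadoop 1.x installation where the JobClient and ClusterStatus classes are available.

```java
import org.apache.hadoop.mapred.ClusterStatus;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

// Queries the JobTracker for a snapshot of cluster-wide resources.
public class ClusterStatusCheck {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf();
        JobClient client = new JobClient(conf);        // connects to the JobTracker
        ClusterStatus status = client.getClusterStatus();

        System.out.println("Active TaskTrackers : " + status.getTaskTrackers());
        System.out.println("Running map tasks   : " + status.getMapTasks());
        System.out.println("Map task capacity   : " + status.getMaxMapTasks());
        System.out.println("Running reduce tasks: " + status.getReduceTasks());
        System.out.println("Reduce task capacity: " + status.getMaxReduceTasks());
    }
}
```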
Slave TaskTracker
The TaskTracker executes the tasks assigned to it by the JobTracker. While running them, it keeps the JobTracker informed of the task status at regular intervals through heartbeat messages.