  • Free Big Data Tutorial
  • Significance of Big Data
  • Different Big Data Platforms
  • Hadoop and Big Data
  • Installation of Hadoop
  • HDFS Tutorial
  • Introduction to MapReduce
  • Working with MapReduce
  • Introduction to Sqoop
  • Introduction to FLUME
  • Hadoop PIG Installation
  • Advanced Big Data Concepts

Hadoop and Big Data

    • It all began in 2005, when Doug Cutting, Mike Cafarella and their team, building on technology that Google had described in its published papers, started an open-source project called Hadoop; Doug named it after his son's toy elephant. Apache Hadoop is now a registered trademark of the Apache Software Foundation.

      Hadoop runs applications using the MapReduce programming model, in which data is processed in parallel on different CPU nodes. In short, the Hadoop framework lets you develop applications that run on clusters of computers and perform complete statistical analysis on huge amounts of data.
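
      To make the model concrete, the sketch below is essentially the classic word-count example written against the standard org.apache.hadoop.mapreduce API: map emits a (word, 1) pair for every token in its input split, the framework groups the pairs by word, and reduce sums the counts for each word. Input and output paths are taken from the command line; everything else is stock API.

        import java.io.IOException;
        import java.util.StringTokenizer;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.Mapper;
        import org.apache.hadoop.mapreduce.Reducer;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

        public class WordCount {

          // Mapper: runs in parallel, one task per input split; emits (word, 1).
          public static class TokenizerMapper
              extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
              StringTokenizer itr = new StringTokenizer(value.toString());
              while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
              }
            }
          }

          // Reducer: receives all the 1s emitted for one word and sums them.
          public static class IntSumReducer
              extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
              int sum = 0;
              for (IntWritable val : values) {
                sum += val.get();
              }
              result.set(sum);
              context.write(key, result);
            }
          }

          public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class); // pre-aggregate on the map side
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
          }
        }

      Packaged into a jar, it runs with hadoop jar wordcount.jar WordCount /input /output (the jar name and paths here are just for illustration); note that the output directory must not already exist.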

      Hadoop is an Apache open-source framework, written in Java, that allows distributed processing of large datasets across clusters of computers using simple programming models. A Hadoop application works in an environment that provides distributed storage and computation across clusters of computers. Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage.

      Hadoop Architecture

      The Hadoop framework includes the following four modules:

      • Hadoop Common: These are Java libraries and utilities required by other Hadoop modules. These libraries provide filesystem and OS level abstractions and contain the necessary Java files and scripts required to start Hadoop.
      • Hadoop YARN: This is a framework for job scheduling and cluster resource management.
      • Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data (see the short sketch after this list).
      • Hadoop MapReduce: This is a YARN-based system for parallel processing of large data sets.
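
      As a sketch of what "high-throughput access to application data" looks like from client code (the path /tmp/hello.txt is purely illustrative), the snippet below writes a small file and reads it back through the org.apache.hadoop.fs.FileSystem API:

        import java.nio.charset.StandardCharsets;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FSDataInputStream;
        import org.apache.hadoop.fs.FSDataOutputStream;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class HdfsRoundTrip {
          public static void main(String[] args) throws Exception {
            // fs.defaultFS in core-site.xml decides which filesystem this
            // resolves to: the local filesystem in standalone mode, an HDFS
            // NameNode in (pseudo-)distributed mode.
            FileSystem fs = FileSystem.get(new Configuration());

            Path file = new Path("/tmp/hello.txt"); // illustrative path

            // Write a small file (overwriting it if it already exists).
            try (FSDataOutputStream out = fs.create(file, true)) {
              out.write("hello from HDFS\n".getBytes(StandardCharsets.UTF_8));
            }

            // Read it back in full.
            byte[] buf = new byte[(int) fs.getFileStatus(file).getLen()];
            try (FSDataInputStream in = fs.open(file)) {
              in.readFully(buf);
            }
            System.out.print(new String(buf, StandardCharsets.UTF_8));
          }
        }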

      Conceptually, these four modules are layered: HDFS provides distributed storage, YARN schedules jobs and manages cluster resources on top of it, MapReduce runs as a YARN-based processing framework, and the Hadoop Common libraries support all of them.

      Since 2012, the term "Hadoop" often refers not just to the base modules mentioned above but also to the collection of additional software packages that can be installed on top of or alongside Hadoop, such as Apache Pig, Apache Hive, Apache HBase and Apache Spark.

      Hadoop Operation Modes

      Once you have downloaded Hadoop, you can operate your Hadoop cluster in one of the three supported modes:

      • Local/Standalone Mode: After you download Hadoop, it is configured in standalone mode by default and runs as a single Java process, using the local filesystem rather than HDFS.
      • Pseudo-Distributed Mode: A distributed simulation on a single machine: each Hadoop daemon (for example HDFS's NameNode and DataNode, and YARN's ResourceManager and NodeManager) runs as a separate Java process. This mode is useful for development; a minimal configuration sketch follows this list.
      • Fully-Distributed Mode: A genuinely distributed cluster of two or more machines. We will cover this mode in detail in the coming chapters.
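
      As a sketch of what the switch to pseudo-distributed mode involves (the host and port shown are the conventional defaults from the Hadoop single-node setup documentation, not requirements), two properties are typically set: the default filesystem in etc/hadoop/core-site.xml, and the replication factor in etc/hadoop/hdfs-site.xml, lowered to 1 because there is only one DataNode:

        <!-- etc/hadoop/core-site.xml -->
        <configuration>
          <property>
            <name>fs.defaultFS</name>
            <value>hdfs://localhost:9000</value>
          </property>
        </configuration>

        <!-- etc/hadoop/hdfs-site.xml -->
        <configuration>
          <property>
            <name>dfs.replication</name>
            <value>1</value>
          </property>
        </configuration>

      With these in place, format the NameNode once with bin/hdfs namenode -format, start the HDFS daemons with sbin/start-dfs.sh, and the single machine behaves like a small cluster.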