HDFS Tutorial

    • The Hadoop Distributed File System (HDFS) is the distributed file system of the Apache Hadoop framework. It is designed to store very large datasets (terabytes or even petabytes) on clusters of machines and to provide high-throughput access to that data. Data is stored redundantly across multiple machines so that it remains durable in the face of failures and available to highly parallel applications. This module introduces the design of the file system and explains how to operate it. HDFS also makes the stored data available to applications for parallel processing.
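
      As a first look at how applications work with HDFS, the sketch below writes a small file and reads it back through Hadoop's Java FileSystem API. It is a minimal sketch, not part of this tutorial's own material: the namenode address hdfs://namenode-host:9000 and the path /user/demo/hello.txt are placeholder values you would replace with your own.

      import java.net.URI;
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FSDataInputStream;
      import org.apache.hadoop.fs.FSDataOutputStream;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      public class HdfsHelloWorld {
          public static void main(String[] args) throws Exception {
              Configuration conf = new Configuration();
              // Placeholder namenode address: point this at your own cluster.
              FileSystem fs = FileSystem.get(URI.create("hdfs://namenode-host:9000"), conf);
              Path file = new Path("/user/demo/hello.txt");

              // Write: HDFS splits the stream into blocks and replicates them
              // across datanodes behind the scenes.
              try (FSDataOutputStream out = fs.create(file, true)) {
                  out.writeUTF("Hello, HDFS!");
              }

              // Read: the blocks are streamed back from the datanodes.
              try (FSDataInputStream in = fs.open(file)) {
                  System.out.println(in.readUTF());
              }
              fs.close();
          }
      }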

      Features of Hadoop Distributed File System

      • It is suitable for distributed storage and processing of very large datasets.
      • Hadoop provides a command interface to interact with HDFS.
      • The built-in servers of the namenode and datanode help users to easily check the status of the cluster.
      • Streaming access to file system data.
      • HDFS provides file permissions and authentication (a permissions sketch follows this list).
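
      To illustrate the command interface and the permission model from the list above, the sketch below lists a directory and then restricts one file to its owner; the paths are placeholders, and the client is assumed to be configured (via fs.defaultFS) to reach the cluster. The equivalent shell commands are hdfs dfs -ls and hdfs dfs -chmod.

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileStatus;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.fs.permission.FsPermission;

      public class HdfsPermissions {
          public static void main(String[] args) throws Exception {
              // Uses fs.defaultFS from the client configuration files.
              FileSystem fs = FileSystem.get(new Configuration());

              // List a directory, much like `hdfs dfs -ls /user/demo`.
              for (FileStatus status : fs.listStatus(new Path("/user/demo"))) {
                  System.out.printf("%s %s %s %d %s%n",
                          status.getPermission(), status.getOwner(),
                          status.getGroup(), status.getLen(), status.getPath());
              }

              // Restrict a file to owner read/write, like `hdfs dfs -chmod 600 ...`.
              fs.setPermission(new Path("/user/demo/hello.txt"),
                      new FsPermission((short) 0600));
          }
      }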

      HDFS Architecture

      HDFS follows a master-slave architecture that comprises the following components:

      Namenode

      The namenode is commodity hardware that runs a GNU/Linux operating system and the namenode software. The system hosting the namenode acts as the master server and performs the following tasks:

      • Manages the file system namespace.
      • Regulates client’s access to files.
      • Executes file system operations such as opening, closing, and renaming files and directories (a namespace sketch follows this list).
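
      A minimal sketch of these namespace operations through the Java FileSystem API; the paths are placeholders and the client is assumed to be configured to reach an HDFS cluster. Creating, renaming, and deleting directories are metadata operations resolved by the namenode.

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      public class NamespaceOperations {
          public static void main(String[] args) throws Exception {
              FileSystem fs = FileSystem.get(new Configuration());

              // Each call is resolved by the namenode, which owns the namespace.
              fs.mkdirs(new Path("/user/demo/reports"));       // create a directory
              fs.rename(new Path("/user/demo/reports"),        // rename it
                        new Path("/user/demo/archive"));
              fs.delete(new Path("/user/demo/archive"), true); // delete it recursively
          }
      }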

      Datanode

      The datanode is commodity hardware running a GNU/Linux operating system and the datanode software. Every node (commodity hardware/system) in a cluster runs a datanode, and these datanodes manage the storage attached to their node.

      • Datanodes perform read-write operations on the file system, as per client requests.
      • They also perform operations such as block creation, deletion, and replication according to the instructions of the namenode (a replication sketch follows this list).
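
      One client-visible way to trigger such replication work is to change a file's replication factor: the namenode then instructs datanodes to create or remove block replicas. A minimal sketch, again with a placeholder path:

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      public class ChangeReplication {
          public static void main(String[] args) throws Exception {
              FileSystem fs = FileSystem.get(new Configuration());
              Path file = new Path("/user/demo/hello.txt");

              // Request two replicas of every block of this file. The namenode
              // schedules datanodes to copy or delete block replicas accordingly.
              boolean accepted = fs.setReplication(file, (short) 2);
              System.out.println("Replication change accepted: " + accepted);
              System.out.println("Target replication: "
                      + fs.getFileStatus(file).getReplication());
          }
      }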

      Block

      Generally, user data is stored in the files of HDFS. A file in HDFS is divided into one or more segments, which are stored on individual datanodes. These file segments are called blocks. In other words, a block is the minimum amount of data that HDFS can read or write. The default block size is 64 MB (128 MB in newer Hadoop releases), and it can be increased as needed by changing the HDFS configuration.
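
      The block size can also be chosen per file when it is created through the Java API. The sketch below is illustrative only: the path and the 128 MB figure are arbitrary choices, and the replication factor of 3 and buffer size of 4096 bytes are simply the usual defaults spelled out explicitly.

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FSDataOutputStream;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      public class CustomBlockSize {
          public static void main(String[] args) throws Exception {
              FileSystem fs = FileSystem.get(new Configuration());
              Path file = new Path("/user/demo/big-file.dat");

              long blockSize = 128L * 1024 * 1024;  // 128 MB blocks for this file
              try (FSDataOutputStream out =
                      fs.create(file, true, 4096, (short) 3, blockSize)) {
                  out.write(new byte[1024]);        // write some data
              }

              // The namenode stores the block size as part of the file's metadata.
              System.out.println("Block size: " + fs.getFileStatus(file).getBlockSize());
          }
      }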

      Design Goals of HDFS

      • HDFS is designed to store a very large amount of information (terabytes or petabytes). This requires spreading the data across a large number of machines. It also supports much larger file sizes than NFS.
      • HDFS should store data reliably. If individual machines in the cluster malfunction, data should still be available.
      • HDFS should provide fast, scalable access to this information. It should be possible to serve a larger number of clients by simply adding more machines to the cluster.
      • HDFS should integrate well with Hadoop MapReduce, allowing data to be read and computed upon locally when possible (a block-location sketch follows this list).
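
      Local reads are possible because a client (and the MapReduce scheduler) can ask the namenode where each block of a file physically lives. A minimal sketch, with a placeholder path:

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.BlockLocation;
      import org.apache.hadoop.fs.FileStatus;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      public class BlockLocations {
          public static void main(String[] args) throws Exception {
              FileSystem fs = FileSystem.get(new Configuration());
              FileStatus status = fs.getFileStatus(new Path("/user/demo/big-file.dat"));

              // Ask the namenode which datanodes hold each block of the file.
              // MapReduce uses the same information to schedule tasks on, or near,
              // the machines that already store the data.
              for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
                  System.out.printf("offset=%d length=%d hosts=%s%n",
                          block.getOffset(), block.getLength(),
                          String.join(",", block.getHosts()));
              }
          }
      }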