HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes.

The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system’s clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.
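The split of responsibilities above can be sketched with a toy model. All the names below (`ToyNameNode`, `create_file`, the round-robin placement) are illustrative assumptions, not the real Hadoop API; the real NameNode uses a rack-aware placement policy.

```python
# Toy model of the NameNode's two core mappings (illustrative only,
# not the real Hadoop API): file name -> blocks, block -> DataNodes.

class ToyNameNode:
    def __init__(self, block_size=128 * 1024 * 1024):
        self.block_size = block_size
        self.file_to_blocks = {}   # namespace: file path -> ordered block ids
        self.block_locations = {}  # block id -> set of DataNode ids

    def create_file(self, path, size_bytes, datanodes, replication=3):
        """Split a file into blocks and assign each block to DataNodes."""
        n_blocks = max(1, -(-size_bytes // self.block_size))  # ceil division
        blocks = [f"{path}#blk_{i}" for i in range(n_blocks)]
        self.file_to_blocks[path] = blocks
        for i, blk in enumerate(blocks):
            # Round-robin placement stands in for HDFS's real rack-aware policy.
            chosen = [datanodes[(i + r) % len(datanodes)] for r in range(replication)]
            self.block_locations[blk] = set(chosen)
        return blocks

nn = ToyNameNode()
nn.create_file("/logs/a.log", 300 * 1024 * 1024, ["dn1", "dn2", "dn3", "dn4"])
print(nn.file_to_blocks["/logs/a.log"])       # a 300 MB file becomes three 128 MB blocks
print(nn.block_locations["/logs/a.log#blk_0"])
```

Note how the NameNode holds only metadata (the two dictionaries); the block contents themselves live on the DataNodes.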
NameNode Functions:
- The NameNode maintains and manages the file system namespace. Any modification to the file system namespace or its properties is tracked by the NameNode.
- It directs the Datanodes (Slave nodes) to execute the low-level I/O operations.
- It keeps a record of how the files in HDFS are divided into blocks and in which nodes these blocks are stored; more broadly, the NameNode manages the cluster configuration.
- It maps a file name to a set of blocks and maps a block to the DataNodes where it is located.
- It records the metadata of all the files stored in the cluster, e.g. the location, the size of the files, permissions, hierarchy, etc.
- With the help of a transactional log, that is, the EditLog, the NameNode records each and every change that takes place to the file system metadata. For example, if a file is deleted in HDFS, the NameNode will immediately record this in the EditLog.
- The NameNode is also responsible for maintaining the replication factor of all the blocks. If the replication factor of any block changes, the NameNode records this in the EditLog.
- NameNode regularly receives a Heartbeat and a Blockreport from all the DataNodes in the cluster to make sure that the datanodes are working properly. A Block Report contains a list of all blocks on a DataNode.
- In case of a DataNode failure, the NameNode chooses new DataNodes for new replicas, balances disk usage, and manages the communication traffic to the DataNodes.
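The failure-handling in the last bullet can be sketched as follows. This is a simplified model under assumed names (`find_under_replicated`, `choose_targets` are made up, not Hadoop classes): once heartbeats stop arriving from a DataNode, its blocks fall below the replication factor and the NameNode picks new targets.

```python
# Simplified sketch of how a NameNode might detect under-replicated blocks
# after a DataNode stops heartbeating, and pick targets for new replicas.

def find_under_replicated(block_locations, live_datanodes, replication=3):
    """Return {block: replicas_needed} for blocks below the target factor."""
    needed = {}
    for blk, nodes in block_locations.items():
        live = nodes & live_datanodes       # replicas on nodes still alive
        if len(live) < replication:
            needed[blk] = replication - len(live)
    return needed

def choose_targets(block, block_locations, live_datanodes, count):
    """Pick live DataNodes that do not already hold the block."""
    candidates = sorted(live_datanodes - block_locations[block])
    return candidates[:count]

block_locations = {
    "blk_1": {"dn1", "dn2", "dn3"},
    "blk_2": {"dn2", "dn3", "dn4"},
}
live = {"dn1", "dn3", "dn4", "dn5"}         # dn2 has missed its heartbeats
todo = find_under_replicated(block_locations, live)
for blk in sorted(todo):
    print(blk, "->", choose_targets(blk, block_locations, live, todo[blk]))
```

Both blocks that lived on dn2 now need one extra replica each, placed on a live node that does not already hold them.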
DataNode Functions:
- DataNodes serve the low-level read and write requests from the file system’s clients.
- They are also responsible for creating blocks, deleting blocks and replicating the same based on the decisions taken by the NameNode.
- They regularly send a report on all the blocks present in the cluster to the NameNode.
- DataNodes also enable pipelining of data.
- They forward data to other specified DataNodes.
- DataNodes send heartbeats to the NameNode once every 3 seconds (by default) to report that they are alive and functioning properly.
- The DataNode stores each block of HDFS data in separate files in its local file system.
- When a DataNode starts up, it scans through its local file system, generates a list of all the HDFS data blocks that correspond to these local files, and sends a Blockreport to the NameNode.
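The startup scan in the last bullet can be sketched as a directory walk. This is a toy version: real HDFS stores block files (plus checksum metadata) under the directories configured by `dfs.datanode.data.dir`, and the report is sent over RPC rather than returned as a list.

```python
import os
import tempfile

# Toy DataNode startup scan (illustrative): walk the local storage
# directory, treat each "blk_*" file as an HDFS block, and build the
# Blockreport that would be sent to the NameNode.

def scan_block_report(storage_dir):
    report = []
    for name in sorted(os.listdir(storage_dir)):
        if name.startswith("blk_"):
            path = os.path.join(storage_dir, name)
            report.append({"block": name, "length": os.path.getsize(path)})
    return report

# Simulate a DataNode's local storage with two block files.
storage = tempfile.mkdtemp()
for blk, data in [("blk_1001", b"x" * 10), ("blk_1002", b"y" * 20)]:
    with open(os.path.join(storage, blk), "wb") as f:
        f.write(data)
report = scan_block_report(storage)
print(report)
```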
Safemode:
On startup, the NameNode enters a special state called Safemode. Replication of data blocks does not occur when the NameNode is in the Safemode state. The NameNode receives Heartbeat and Blockreport messages from the DataNodes. A Blockreport contains the list of data blocks that a DataNode is hosting. Each block has a specified minimum number of replicas. A block is considered safely replicated when the minimum number of replicas of that data block has checked in with the NameNode. After a configurable percentage of safely replicated data blocks checks in with the NameNode (plus an additional 30 seconds), the NameNode exits the Safemode state. It then determines the list of data blocks (if any) that still have fewer than the specified number of replicas. The NameNode then replicates these blocks to other DataNodes.
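The exit condition can be expressed numerically. `dfs.namenode.safemode.threshold-pct` (default 0.999) is the real configuration key; the function itself is a simplification that ignores the extension period mentioned above.

```python
# Simplified safemode check: the NameNode leaves safemode once the
# fraction of blocks with at least the minimum replica count reaches
# the configured threshold (the extra extension period is omitted here).

def can_leave_safemode(blocks_reported_safe, total_blocks, threshold_pct=0.999):
    if total_blocks == 0:
        return True  # an empty namespace has nothing to wait for
    return blocks_reported_safe / total_blocks >= threshold_pct

print(can_leave_safemode(998, 1000))   # 99.8% < 99.9% -> still in safemode
print(can_leave_safemode(999, 1000))   # 99.9% meets the threshold
```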
Secondary NameNode:
In the HDFS architecture, the name “Secondary NameNode” gives the impression that it is a substitute for the NameNode. Alas! It is not! As we know by now, the NameNode stores vital metadata about all the blocks in HDFS, and this data is kept not only in main memory but also on disk.
The two associated files are:
Fsimage: An image of the file system on starting the NameNode.
EditLogs: A series of modifications done to the file system after starting the NameNode.
The Secondary NameNode periodically downloads the EditLogs from the NameNode and applies them to its copy of the fsimage, merging the two into an updated fsimage. The new fsimage is copied back to the NameNode, where it is used the next time the NameNode starts. This keeps the EditLog from growing indefinitely.
However, since the Secondary NameNode cannot serve client requests or take over for a failed NameNode, it is not a substitute for the NameNode. If the NameNode fails, the entire Hadoop HDFS cluster goes down and the metadata held in the NameNode’s RAM is lost. The Secondary NameNode just performs regular checkpoints in HDFS. Just a helper, a checkpoint node!
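The checkpoint cycle above can be modeled as merging an edit log into a namespace snapshot. This is a toy model under assumed names: real fsimage and EditLog files are binary on-disk formats, not Python dictionaries.

```python
# Toy checkpoint: apply EditLog operations to an fsimage snapshot,
# producing the new fsimage the Secondary NameNode ships back.

def apply_edits(fsimage, edit_log):
    image = dict(fsimage)  # work on a copy, as a checkpoint does
    for op, path, *args in edit_log:
        if op == "create":
            image[path] = {"size": args[0]}
        elif op == "rename":
            image[args[0]] = image.pop(path)   # args[0] is the new path
        elif op == "delete":
            image.pop(path, None)
    return image

fsimage = {"/a": {"size": 10}}
edits = [("create", "/b", 20), ("rename", "/a", "/a2"), ("delete", "/b")]
new_fsimage = apply_edits(fsimage, edits)
print(new_fsimage)   # {'/a2': {'size': 10}}
```

After the merge, the three logged operations are folded into the snapshot, so the NameNode can restart from the new fsimage with an empty EditLog.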