Understanding the working of a Hadoop cluster

What is a NameNode?

The NameNode stores the metadata for the cluster: which DataNodes are available and where each block of data is kept. When a client stores data on or reads data from the DataNodes, the NameNode gives the client the location of the data.

What is a Data Node?

A Data Node is also known as a slave node (this article also calls it an edge node, although that term more commonly refers to a gateway machine). Data Nodes store the data in a Hadoop cluster; DataNode is also the name of the daemon that manages that data. File data is replicated on multiple Data Nodes for reliability and so that computation can run close to the data.

What is a Client Node?

The client node is used for uploading files to and reading files from the cluster. It does not contribute its own resources to the cluster; it is an independent system used to access the cluster's resources.

Steps to install Hadoop on an AWS EC2 instance:

  1. Launch an EC2 instance on AWS.
  2. Note the public IP of the instance and convert the instance security key from .pem format to .ppk format using PuTTYgen.
  3. Download the JDK and Hadoop packages from the internet.
  4. Use WinSCP to transfer these packages to the AWS instance. WinSCP creates an SSH connection to the remote system. This requires the public IP of the AWS instance and the instance security key in .ppk format.
  5. To check whether the files were transferred to your AWS instance, list the directory with the ls command.
  6. Installing the packages requires root permission. Install the JDK first:

[root@IPADDRESS ec2-user]# rpm -ivh jdk-8u171-linux-x64.rpm

Now, install Hadoop:

[root@IPADDRESS ec2-user]# rpm -ivh hadoop-1.2.1-1.x86_64.rpm --force

Go to Hadoop directory using command:

Configuring Master Node:

File hdfs-site.xml:

File core-site.xml:
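The contents of the two files are not shown above. As a rough sketch only, a Hadoop 1.x NameNode is commonly configured with the `dfs.name.dir` property in hdfs-site.xml and the `fs.default.name` property in core-site.xml. The metadata directory `/nn`, the port `9001`, and the helper name `write_namenode_conf` are assumptions made for illustration, not values taken from this article.

```shell
# Sketch: write NameNode config files into a target directory.
# Assumptions (not from the article): metadata dir /nn, RPC port 9001.
write_namenode_conf() {
  conf_dir="$1"            # directory that holds the Hadoop config files
  name_dir="${2:-/nn}"     # hypothetical NameNode metadata directory
  port="${3:-9001}"        # hypothetical NameNode port

  # hdfs-site.xml: where the NameNode keeps its metadata
  cat > "$conf_dir/hdfs-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>$name_dir</value>
  </property>
</configuration>
EOF

  # core-site.xml: address the NameNode listens on
  cat > "$conf_dir/core-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://0.0.0.0:$port</value>
  </property>
</configuration>
EOF
}
```

Run from inside the Hadoop directory, `write_namenode_conf "$PWD"` would produce both files with the assumed defaults. Back up the existing files first, since the sketch overwrites them.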

Format the NameNode and stop the firewall:

[root@IPADDRESS hadoop]# hadoop namenode -format
[root@IPADDRESS hadoop]# systemctl stop firewalld

Now, run the NameNode using the following commands:

[root@IPADDRESS hadoop]# hadoop-daemon.sh start namenode
[root@IPADDRESS hadoop]# jps

hadoop-daemon.sh start namenode starts the NameNode. To check whether it has started, run jps, which lists the Java processes running on the system; NameNode should appear in the list.
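The jps check can also be scripted. The sketch below assumes the usual jps output of one process per line in the form "PID Name"; the helper name `has_daemon` is hypothetical.

```shell
# Sketch: succeed if a daemon name appears in jps-style output on stdin.
# Assumes each line looks like "<pid> <ProcessName>".
has_daemon() {
  awk -v name="$1" '$2 == name { found = 1 } END { exit !found }'
}
```

For example, `jps | has_daemon NameNode && echo "NameNode is running"`. The same helper should work on a data node with `DataNode` as the argument.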

Configuring Data/Slave/Edge Node:

File core-site.xml:

File hdfs-site.xml:
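The file contents are again not shown above. As a hedged sketch, a Hadoop 1.x DataNode typically points `fs.default.name` at the master node and sets `dfs.data.dir` to its local storage directory. The directory `/dn`, the port `9001`, and the helper name `write_datanode_conf` are illustrative assumptions; the port must match whatever the NameNode actually uses.

```shell
# Sketch: write DataNode config files into a target directory.
# Assumptions (not from the article): storage dir /dn, NameNode port 9001.
write_datanode_conf() {
  conf_dir="$1"            # directory that holds the Hadoop config files
  master_ip="$2"           # IP address of the NameNode (master)
  data_dir="${3:-/dn}"     # hypothetical local block storage directory
  port="${4:-9001}"        # hypothetical NameNode port

  # core-site.xml: where to find the NameNode
  cat > "$conf_dir/core-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://$master_ip:$port</value>
  </property>
</configuration>
EOF

  # hdfs-site.xml: where this node stores its blocks
  cat > "$conf_dir/hdfs-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.data.dir</name>
    <value>$data_dir</value>
  </property>
</configuration>
EOF
}
```

Called as `write_datanode_conf "$PWD" MASTER_IP`, this would overwrite both files in the current directory, so keep a backup of the originals.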

[root@IPADDRESS hadoop]# hadoop-daemon.sh start datanode
[root@IPADDRESS hadoop]# jps

Configuring Client Node:

To configure the client node, go to the Hadoop directory as in the cases above and edit only the core-site.xml file.

Task: File operations by client node on the Hadoop cluster

The client node is responsible for uploading files. First create a file and add some content to it:

[root@IPADDRESS hadoop]# touch filename.txt
[root@IPADDRESS hadoop]# vim filename.txt

To upload the file to the Hadoop cluster, use the command:

[root@IPADDRESS hadoop]# hadoop fs -put filename.txt /
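To confirm that the upload worked, the file can be listed right after the put. A minimal sketch, in which the function name `upload_and_list` and the `HADOOP_CMD` override variable are hypothetical conveniences and only `hadoop fs -put` and `hadoop fs -ls` are real commands:

```shell
# Sketch: upload a local file to the HDFS root, then list the root.
# HADOOP_CMD is a hypothetical override hook; it defaults to the real CLI.
upload_and_list() {
  hadoop_cmd="${HADOOP_CMD:-hadoop}"
  local_file="$1"
  "$hadoop_cmd" fs -put "$local_file" / || return 1   # copy into the HDFS root
  "$hadoop_cmd" fs -ls /                               # the file should be listed
}
```

Usage on the client node: `upload_and_list filename.txt`.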

Viewing Hadoop cluster in Web UI

To view Hadoop cluster in Web UI use the following URL in web browser:
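The URL itself is not shown here. In Hadoop 1.x the NameNode web interface listens on port 50070 by default (property `dfs.http.address`), so, assuming that default was not changed, the address can be built as in the sketch below. The helper name `namenode_ui_url` is hypothetical.

```shell
# Sketch: build the NameNode web UI URL.
# Assumption: the Hadoop 1.x default HTTP port 50070 is still in use.
namenode_ui_url() {
  host="$1"              # public IP or hostname of the NameNode
  port="${2:-50070}"     # default NameNode web UI port in Hadoop 1.x
  printf 'http://%s:%s/\n' "$host" "$port"
}
```

On EC2, the security group of the master instance would also need to allow inbound traffic on that port for the page to load.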

How are files uploaded by the client stored in the cluster?

A Hadoop cluster stores files as blocks of data. By default the block size is 64 MB, but this can be changed as required. A file uploaded by the client is divided into blocks of 64 MB, which are then replicated onto different nodes to protect the data in case of node failure or connection issues. Replication increases the availability of data on the cluster. The higher the replication factor, the lower the chance of losing data due to unforeseen circumstances, and the more storage is consumed for the file. By default, the replication factor is 3, which can also be changed. The client node controls the replication value of a file that it stores on the Hadoop cluster.
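The trade-off between replication and storage can be worked out with simple arithmetic. The sketch below uses the defaults stated above (64 MB blocks, replication factor 3); the function name `hdfs_footprint` is hypothetical, and sizes are whole megabytes for simplicity.

```shell
# Sketch: blocks and raw storage for a file of a given size.
# Usage: hdfs_footprint FILE_MB [BLOCK_MB] [REPLICATION]
hdfs_footprint() {
  file_mb="$1"
  block_mb="${2:-64}"      # default block size from the text
  replication="${3:-3}"    # default replication factor from the text
  blocks=$(( (file_mb + block_mb - 1) / block_mb ))   # ceiling division
  raw_mb=$(( file_mb * replication ))                  # total stored on all nodes
  echo "blocks=$blocks replicas=$(( blocks * replication )) raw_mb=$raw_mb"
}
```

For a 200 MB file with the defaults, `hdfs_footprint 200` prints `blocks=4 replicas=12 raw_mb=600`: three full 64 MB blocks plus one 8 MB block, each stored three times. The last block only occupies its actual size, which is why the total is 600 MB rather than 768 MB.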

Some questions that arise for a novice working on a Hadoop cluster:

Who stores the files on the data nodes: the master node or the client node itself?

Some people believe that a file uploaded by the client is stored on the data nodes by the master node. We observed that the client node itself uploads the file to the data nodes. The master node is responsible for managing the resources of the Hadoop distributed file system.

Does the client node individually store the file and its replicas on each data node?

No. While implementing the cluster we found that the client stores the file only once, on one of the nodes. The Hadoop cluster itself then creates the replicas and distributes them among other nodes, up to the replication factor.
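One way to see where the replicas ended up is to ask the NameNode for the block locations of a file with `hadoop fsck`. A small sketch follows; the wrapper name `show_block_locations` and the `HADOOP_CMD` override variable are hypothetical, while `fsck` and its `-files -blocks -locations` options are part of the Hadoop CLI.

```shell
# Sketch: ask the NameNode which DataNodes hold each block of a file.
# HADOOP_CMD is a hypothetical override hook; it defaults to the real CLI.
show_block_locations() {
  hadoop_cmd="${HADOOP_CMD:-hadoop}"
  "$hadoop_cmd" fsck "$1" -files -blocks -locations
}
```

Usage: `show_block_locations /filename.txt`. Each block should be reported with as many data node addresses as the replication factor, provided enough data nodes are running.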

How can we find out which system the client node is reading the file from?

The client reaches out to the nearest node and copies the file from there; if that node is not available, it looks for other nodes where a replica is stored. To see which node is serving the data, capture the traffic on the data transfer port:

[root@IPADDRESS hadoop]# tcpdump -i eth0 tcp port 50010 -n -x

This command shows the IP addresses that packets are being sent to or received from.

[Screenshot: output of the above command]
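Reading the raw capture by eye can be tedious, so the data node addresses can be filtered out of it. The sketch below assumes the usual one-line-per-packet format that tcpdump prints with -n for IPv4 ("time IP src.port > dst.port: ...") and ignores the hex dump lines produced by -x; the helper name `datanode_peers` is hypothetical.

```shell
# Sketch: list the unique IPv4 addresses seen on port 50010 in tcpdump output.
# Reads tcpdump text from stdin; assumes "time IP a.b.c.d.port > e.f.g.h.port:".
datanode_peers() {
  awk '$2 == "IP" {
    src = $3; dst = $5; sub(/:$/, "", dst)       # drop the trailing colon
    n = split(src, s, "."); m = split(dst, d, ".")
    # an IPv4 address plus port splits into 5 dot-separated fields
    if (n == 5 && s[5] == "50010") print s[1] "." s[2] "." s[3] "." s[4]
    if (m == 5 && d[5] == "50010") print d[1] "." d[2] "." d[3] "." d[4]
  }' | sort -u
}
```

For example, `tcpdump -i eth0 tcp port 50010 -n -c 200 | datanode_peers` would capture 200 packets during a read and then print the data nodes involved. The -c limit is there because sort only prints once its input ends.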



Jyoti Pawar

Devops || AWS || ML || Deep learning || Python || Flask || Ansible RH294 || OpenShift DO180