Understanding the Working of a Hadoop Cluster
Setting Up a Hadoop Cluster
In this setup, we will be creating a cluster with one master (name) node, four data nodes, and one client node.
What is a NameNode?
The NameNode stores all the metadata about the DataNodes and the data they hold. When a client stores data on a DataNode, the NameNode gives the client the location of that data.
What is a Data Node?
A Data Node, also known as a Slave Node, stores data in a Hadoop cluster; DataNode is also the name of the daemon that manages that data. File data is replicated on multiple Data Nodes for reliability and so that localized computation can be executed near the data.
What is a Client Node?
The client node is used for uploading files to and reading files from the cluster. It does not contribute its resources to the cluster; it is an independent system used to access the resources of the cluster.
So, to create the cluster, we need to configure the settings on each system: the master node, the slave (data) nodes, and the client node.
To implement this cluster, my team and I used AWS EC2 instances.
Steps to install Hadoop on an AWS EC2 instance:
- Launch an EC2 instance on AWS.
- Note the public IP of the instance and convert the instance's security key from .pem format to .ppk format using PuTTYgen.
- Download the JDK and Hadoop packages from the internet.
- Use WinSCP to transfer these packages to the AWS instance. WinSCP creates an SSH connection to the remote system; it needs the public IP of the AWS instance and the security key in .ppk format.
- To check whether the files have been transferred to your AWS instance successfully, run the ls command.
- Installing the packages requires root permission. Follow the commands below for the same.
First, install the JDK:
[root@IPADDRESS ec2-user]# rpm -ivh jdk-8u171-linux-x64.rpm
Now, install Hadoop:
[root@IPADDRESS ec2-user]# rpm -ivh hadoop-1.2.1-1.x86_64.rpm --force
Go to the Hadoop configuration directory using the command:
# cd /etc/hadoop/
Here we have to edit the core-site.xml and hdfs-site.xml files. The configuration is different for the master node, data nodes, and client node, so do this part carefully.
Configuring Master Node:
File hdfs-site.xml:
The dfs.name.dir property tells Hadoop that this node is the name node, while its value /nn1 is the location of the name node directory.
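A minimal sketch of what the master node's hdfs-site.xml would contain, assuming the /nn1 directory mentioned above:
<configuration>
    <property>
        <name>dfs.name.dir</name>
        <value>/nn1</value>
    </property>
</configuration>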
File core-site.xml:
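A sketch of the master node's core-site.xml. The fs.default.name property tells every node where the name node runs; 0.0.0.0 makes the master listen on all interfaces, and the port number 9001 is an assumed choice that must be the same on every node:
<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://0.0.0.0:9001</value>
    </property>
</configuration>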
Before running the name node, we have to format the name node directory and stop the firewall. To do so, execute the following commands:
[root@IPADDRESS hadoop]# hadoop namenode -format
[root@IPADDRESS hadoop]# systemctl stop firewalld
Now, run the name node using the following commands:
[root@IPADDRESS hadoop]# hadoop-daemon.sh start namenode
[root@IPADDRESS hadoop]# jps
hadoop-daemon.sh start namenode starts the name node daemon. To check whether the name node has started, run the jps command, which lists the Java processes running on the system.
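If the daemon started correctly, the jps output should look something like this (the process IDs are only illustrative):
[root@IPADDRESS hadoop]# jps
2061 NameNode
2135 Jps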
Configuring the Data/Slave Node:
File core-site.xml:
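A sketch of core-site.xml on a data node; it must point to the master node's IP address and the same port chosen above (MASTER_IP and 9001 are placeholders here):
<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://MASTER_IP:9001</value>
    </property>
</configuration>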
File hdfs-site.xml:
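And a sketch of hdfs-site.xml on the data node; dfs.data.dir is the local directory where the node stores its blocks, and /dn1 is an assumed directory name:
<configuration>
    <property>
        <name>dfs.data.dir</name>
        <value>/dn1</value>
    </property>
</configuration>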
Now run the data node using the following commands:
[root@IPADDRESS hadoop]# hadoop-daemon.sh start datanode
[root@IPADDRESS hadoop]# jps
Configuring Client Node:
To configure the client node, go to the same Hadoop directory as in the cases above and edit only the core-site.xml file.
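A sketch of the client node's core-site.xml; like the data nodes, it only needs to know where the name node is (MASTER_IP and the port are placeholders):
<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://MASTER_IP:9001</value>
    </property>
</configuration>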
Task: File operations by client node on the Hadoop cluster
The client node is responsible for uploading files. Run the following commands to create a file on the client:
[root@IPADDRESS hadoop]# touch filename.txt
[root@IPADDRESS hadoop]# vim filename.txt
To upload the file to the Hadoop cluster, use the command:
[root@IPADDRESS hadoop]# hadoop fs -put filename.txt /
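To confirm the upload and read the file back from the client, the standard HDFS shell commands can be used:
[root@IPADDRESS hadoop]# hadoop fs -ls /
[root@IPADDRESS hadoop]# hadoop fs -cat /filename.txt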
Viewing the Hadoop Cluster in the Web UI
To view the Hadoop cluster in the web UI, open the following URL in a web browser:
MASTER_NODE_IP:50070
For example: 13.232.51.255:50070
The page shows that we have 4 live nodes, which means 4 data nodes. When we click on the Live Nodes hyperlink, a window listing them shows up.
We can observe that after uploading the file, 3 replicas are created and the file is stored in the form of blocks across the nodes.
How are the files uploaded by the client stored in the cluster?
A Hadoop cluster stores files as blocks of data. By default, the block size is 64 MB, but this can be changed as required. A file uploaded by the client is divided into blocks of 64 MB, which are then replicated onto different nodes so that the data survives node failures or connection issues. Replication increases the availability of data on the cluster: the higher the replication factor, the lower the chance of losing data to unforeseen circumstances, and the more storage is consumed by the file. By default, the replication factor is 3, but it can be changed. The client node controls the replication factor of a file stored on the Hadoop cluster.
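For example, the replication factor can be overridden from the client, either at upload time or afterwards (the value 2 below is only an illustration):
[root@IPADDRESS hadoop]# hadoop fs -D dfs.replication=2 -put filename.txt /
[root@IPADDRESS hadoop]# hadoop fs -setrep 2 /filename.txt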
Some questions that arise for a novice working on a Hadoop cluster:
Who stores the files on the data nodes: the master node, or the client node itself?
Some people believe that the file uploaded by the client is written to the data nodes by the master node. In fact, we observed that the client node itself uploads the file to the data nodes; the master node is responsible for managing the resources of the Hadoop distributed file system.
Does the client node individually store the file and its replicas on each data node?
No. While implementing the cluster, we found that the client stores the file only once, on one of the data nodes; the Hadoop cluster itself creates the replicas and distributes them across different nodes until the replication factor is reached.
From which node is the file read by the client, and how can we find out?
The client reaches the nearest node holding the data and copies the file from there; if that node is not available, it looks for other nodes where a replica is stored.
So how did we find this out?
We used the tcpdump command on the data nodes to see how packets are transferred within the cluster.
[root@IPADDRESS hadoop]# tcpdump -i eth0 -n -x tcp port 50010
This command shows which IP addresses packets are being sent to and received from. While executing the above command, here is how it looks on the system:
Here I am running one client node (on the left) and one data node (on the right) side by side on two separate AWS instances; the IP addresses of these systems are noted for reference.
Here the client node is running the -cat command, i.e. a read operation, so it reads from only one of the nodes. In the tcpdump output on the data node, the request comes from the client's IP address, and the file packets are transferred directly back to the client node. There is a one-to-one interaction between client and data node, which shows that files are not transferred through the master node.
Next, I executed the -put command to upload a file named shalu.txt to the cluster. In the data node's tcpdump window, we can see that the file is received not from the client's IP address but from another data node. This confirms that the client does not store the file individually on each system; it stores the file only once, on one data node, and the block stored on my data node is a replica arriving from another data node.
I hope this will be useful for the people out there who are struggling with Hadoop.
I would like to thank Vimal Daga Sir for his lessons on Hadoop. I would also like to thank the ARTH technical team and Learner Success heads for their support. Last but not least, I would like to thank my group ARTH2020_14_G1, who have been a constant support in the learning presented in this blog; without them this content would not have been possible.