Hadoop DataNodes with Dynamic Storage using LVM

Hadoop is an open-source, Java-based framework used for storing and processing big data. It is used by tech giants such as Facebook, which has run one of the largest HDFS clusters in the world. This short article covers running DataNodes on dynamically allocated storage using Logical Volume Management (LVM).

A Hadoop cluster is a special type of computational cluster designed specifically for storing and analyzing huge amounts of unstructured data in a distributed computing environment. An HDFS cluster consists of a NameNode and a number of DataNodes, arranged in a master-slave topology.

Logical volume management (LVM) is a form of storage virtualization that offers system administrators a more flexible approach to managing disk storage space than traditional partitioning.

This article focuses on the practical steps for creating an HDFS cluster in which the DataNode's shared directory sits on a volume of a specific size. If needed, the size of that volume can be changed on the fly, so this architecture has near-zero downtime whenever the shared volume must be grown or shrunk.

We will be using the Oracle VirtualBox hypervisor. For the purpose of this demonstration, we will create a single-DataNode architecture. The OS used for both the NameNode and the DataNode is Red Hat Enterprise Linux 8.

Attaching a Virtual Hard Drive to the DataNode

To create dynamic storage, let's attach a virtual hard drive to the DataNode. The storage we are attaching to the virtual machine is 100 GB.

Create a VDI and attach it to the DataNode
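If you prefer the command line over the VirtualBox GUI, the same attachment can be sketched with VBoxManage. The VM name ("DataNode"), the controller name ("SATA"), and the disk file name are assumptions here and should match your own setup:

    # Create a 100 GB VDI (the --size argument is given in MB)
    VBoxManage createmedium disk --filename datanode_disk.vdi --size 102400

    # Attach it to the VM named "DataNode" on its SATA controller
    VBoxManage storageattach "DataNode" --storagectl "SATA" \
        --port 1 --device 0 --type hdd --medium datanode_disk.vdi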

Next, we boot up the machine and check the hard disk.
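Assuming the new disk shows up as /dev/sdb (the actual device name may differ on your system), we can verify it is visible like this:

    # List block devices and confirm the new 100 GB disk is present
    lsblk
    fdisk -l /dev/sdb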

Creating the Logical Volume

To create a logical volume, we need to perform a series of steps (a command sketch for all three follows below):

  • Convert the storage device to a physical volume and display it
  • Create a volume group on top of it and display it
  • Create a logical volume inside the volume group and display it

A logical volume is created by carving a partition out of the volume group. We can create any number of logical volumes from a volume group, unlike an MBR-partitioned physical disk, which is limited to at most four primary partitions.
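A minimal command sketch for these three steps, assuming the new disk is /dev/sdb and using hadoop_vg and datanode_lv as hypothetical volume group and logical volume names. The initial 50 GB size is chosen to be consistent with the resize to 80 GB later in the article:

    # Step 1: initialize the disk as an LVM physical volume and display it
    pvcreate /dev/sdb
    pvdisplay /dev/sdb

    # Step 2: create a volume group on the physical volume and display it
    vgcreate hadoop_vg /dev/sdb
    vgdisplay hadoop_vg

    # Step 3: carve a 50 GB logical volume out of the volume group and display it
    lvcreate --size 50G --name datanode_lv hadoop_vg
    lvdisplay /dev/hadoop_vg/datanode_lv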

If we now take a look at the volume group again, we can see that the current LV count (Cur LV) is 1.

  • Format the newly created logical volume

In order to store data on a partition, we first need to format it. We will use the ext4 file system to format the logical volume.
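Keeping the volume group and logical volume names assumed above, the formatting step looks like this:

    # Format the logical volume with the ext4 file system
    mkfs.ext4 /dev/hadoop_vg/datanode_lv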

Our logical volume is now ready to be mounted on the shared directory of the DataNode.

  • Create a directory and mount the logical volume on it

It is assumed that this is the first time the machine is being configured as a DataNode. Let's create a directory to store the data pushed by the client.
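A sketch of the directory creation and mount, again using the assumed LV name:

    # Create the shared directory for DataNode storage
    mkdir /servera_data

    # Mount the logical volume on the directory and verify
    mount /dev/hadoop_vg/datanode_lv /servera_data
    df -h /servera_data

For a longer-lived setup, you would also add a matching entry to /etc/fstab so the mount survives a reboot.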

Configure Hadoop to store data in the /servera_data directory

This is the configuration file that points Hadoop at the directory where the DataNode stores its data. Set the configuration value to the /servera_data directory.
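For reference, the data directory is set in hdfs-site.xml on the DataNode. The property name depends on the Hadoop version: dfs.data.dir in Hadoop 1.x, or dfs.datanode.data.dir in Hadoop 2.x and later. A sketch assuming Hadoop 1.x:

    <!-- hdfs-site.xml on the DataNode -->
    <configuration>
        <property>
            <name>dfs.data.dir</name>
            <value>/servera_data</value>
        </property>
    </configuration>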

And that's it! Let's connect the DataNode to the cluster.
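Assuming a Hadoop 1.x-style installation, the DataNode daemon can be started and checked like this:

    # Start the DataNode daemon so it registers with the NameNode
    hadoop-daemon.sh start datanode

    # Confirm the DataNode process is running
    jps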

Checking the admin-report of the cluster
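The report is generated on the NameNode (on Hadoop 2.x and later the equivalent is hdfs dfsadmin -report):

    # Show cluster capacity and the storage contributed by each DataNode
    hadoop dfsadmin -report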

Let's now try increasing the size of the volume to 80 GB.

  • Increase the size of the logical volume
  • Resize the file system
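A sketch of both steps, keeping the names assumed earlier; note that the file system stays mounted throughout:

    # Step 1: grow the logical volume to 80 GB
    lvextend --size 80G /dev/hadoop_vg/datanode_lv

    # Step 2: grow the ext4 file system online to fill the enlarged volume
    resize2fs /dev/hadoop_vg/datanode_lv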

Let's check the admin report again:

Note that the size of the storage has increased by almost 30 GB without even restarting the service. The resize2fs command formats and appends only the newly added, not-yet-written region of the device, so none of the existing data is lost. We can also reduce the size of the volume in a similar fashion, although shrinking an ext4 file system requires unmounting it first. This is how we create Hadoop DataNodes with dynamic storage. The method is very useful when we are not sure of the scale of the data that clients will push.

Thank You!
