Hadoop DataNodes with Dynamic Storage using LVM

Prathamesh Mistry
5 min read · Jan 1, 2021

Hadoop is an open-source, Java-based framework used for storing and processing big data. It is used by tech giants such as Facebook, which runs one of the largest HDFS clusters in the world. This short article covers running DataNodes on dynamically allocated storage using Logical Volume Management (LVM).

A Hadoop cluster is a special type of computational cluster designed specifically for storing and analyzing huge amounts of unstructured data in a distributed computing environment. An HDFS cluster consists of a number of DataNodes and a NameNode, arranged in a master-slave topology.

Logical volume management (LVM) is a form of storage virtualization that offers system administrators a more flexible approach to managing disk storage space than traditional partitioning.

This article focuses on the practical steps for creating an HDFS cluster in which the DataNode's shared directory sits on a volume of a specific size. If needed, the size of that volume can be changed on the fly. This architecture gives near-zero downtime when the shared volume has to be grown or shrunk.

We will be using the Oracle VirtualBox hypervisor. For the purpose of the demonstration, we will create a single-DataNode architecture. The OS used for both the NameNode and the DataNode is Red Hat Enterprise Linux 8.

Attaching a Virtual Hard Drive to the DataNode

To create dynamic storage, let's attach a virtual hard drive to the DataNode. The disk we are attaching to the virtual machine is 100 GB.

Create a VDI and attach it to the DataNode

Next, we boot up the machine and check that the hard disk is visible.
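A quick way to confirm the new disk from inside the guest is lsblk (or fdisk -l); the /dev/sdb device name is an assumption, carried through the rest of this walkthrough.

$ lsblk /dev/sdb
$ sudo fdisk -l /dev/sdb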

Creating the Logical Volume

To create a logical volume we need to perform a series of steps:

  • Convert the storage device to a physical volume and display the created physical volume
$ sudo pvcreate /dev/sdb 
Physical volume "/dev/sdb" successfully created.
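To display the physical volume we just created, we can use pvdisplay (or the shorter pvs):

$ sudo pvdisplay /dev/sdb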
  • Creating and Displaying Volume Group
$ sudo vgcreate datanode_lv_vol /dev/sdb
Volume group "datanode_lv_vol" successfully created
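Similarly, the new volume group can be displayed with vgdisplay (or vgs):

$ sudo vgdisplay datanode_lv_vol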
  • Creating and Displaying Logical Volume

A logical volume is created by carving out a partition of the volume group. We can create any number of logical volumes from a volume group, unlike MBR-style physical partitioning, which allows at most four primary partitions.

$ sudo lvcreate --size 50G --name vol_01 datanode_lv_vol
Logical volume "vol_01" created.

If we now take a look at the Volume Group again, we can see that the Current LV attached is 1.
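The check itself is just another vgdisplay; the logical volume can also be inspected with lvdisplay:

$ sudo vgdisplay datanode_lv_vol | grep "Cur LV"
$ sudo lvdisplay /dev/datanode_lv_vol/vol_01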

  • Format the newly created Logical Volume.

In order to write data to any partition, we need to format it first. We will be using the ext4 file system to format the logical volume.

$ sudo mkfs.ext4 /dev/datanode_lv_vol/vol_01

Our Logical Volume is now ready to be mounted on the shared directory of the data node.

  • Create a directory and mount the hard-disk on the directory

It is assumed that the machine is being configured as a DataNode for the first time. Let's create a directory to store the data pushed by the client.

$ sudo mkdir -pv /servera_data
mkdir: created directory '/servera_data'
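The mount step itself is a single command, using the logical volume and directory created so far; adding a matching entry to /etc/fstab would make the mount persist across reboots.

$ sudo mount /dev/datanode_lv_vol/vol_01 /servera_data
$ df -h /servera_data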

Configure Hadoop to store data in the /servera_data directory

This is the configuration file that tells the DataNode where to store its data. Set the configuration value to the /servera_data directory.
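The property lives in hdfs-site.xml; on Hadoop 2.x and later it is called dfs.datanode.data.dir (dfs.data.dir on the older 1.x line). A minimal sketch of the snippet, assuming the directory created above:

<configuration>
  <property>
    <!-- local directory on the DataNode where HDFS blocks are stored -->
    <name>dfs.datanode.data.dir</name>
    <value>/servera_data</value>
  </property>
</configuration>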

And that’s it! Let’s connect the Datanode to the cluster.

$ sudo hadoop-daemon.sh start datanode
starting datanode, logging to /var/log/hadoop/servera/hadoop-servera-datanode-datanode.out

Checking the admin report of the cluster
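The report is produced by the dfsadmin tool (hadoop dfsadmin -report on Hadoop 1.x, hdfs dfsadmin -report on newer releases); it lists each live DataNode along with its configured and remaining capacity.

$ hadoop dfsadmin -report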

Let’s now try increasing the size of the volume to 80 GB.

  • Increase the size of the volume
$ sudo lvextend --size +30G /dev/datanode_lv_vol/vol_01
Size of logical volume datanode_lv_vol/vol_01 changed from 50.00 GiB (12800 extents) to 80.00 GiB (20480 extents).
Logical volume datanode_lv_vol/vol_01 successfully resized.
  • Resize the filesystem
$ sudo resize2fs /dev/datanode_lv_vol/vol_01 
resize2fs 1.44.3 (10-July-2018)
Filesystem at /dev/datanode_lv_vol/vol_01 is mounted on /servera_data; on-line resizing required
old_desc_blocks = 9, new_desc_blocks = 10
The filesystem on /dev/datanode_lv_vol/vol_01 is now 20971520 (4k) blocks long.

Let’s check the admin report again.

Notice that the size of the storage has increased by almost 30 GB without even restarting the service. The resize2fs command extends the filesystem onto the newly added, not-yet-written portion of the device, so none of the existing data is lost. We can also reduce the size of the volume in a similar fashion. This is how we create Hadoop DataNodes with dynamic storage. This method is very useful if we are not sure of the scale of the data that would be pushed by the client.
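For completeness, here is a sketch of shrinking the volume back down, assuming a hypothetical target size of 60 GB. Note that unlike growing, shrinking an ext4 filesystem requires unmounting it first, so this particular operation does involve a brief downtime for the DataNode:

$ sudo umount /servera_data
$ sudo e2fsck -f /dev/datanode_lv_vol/vol_01
$ sudo resize2fs /dev/datanode_lv_vol/vol_01 60G
$ sudo lvreduce --size 60G /dev/datanode_lv_vol/vol_01
$ sudo mount /dev/datanode_lv_vol/vol_01 /servera_data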

Thank You!
