Automating Hadoop Cluster Configuration using Ansible

Prathamesh Mistry
7 min read · Mar 21, 2021

Hadoop is an open-source framework that allows you to store and process big data in a distributed environment across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.

Hadoop is used at many high-end companies, including the tech giant Facebook. This article covers configuring an HDFS cluster using Ansible. Ansible is an open-source software provisioning, configuration management, and application-deployment tool that enables infrastructure as code. It is widely used in the industry and is a leading IT automation tool.

# tree
.
├── ansible.cfg ---> configuration file
├── inventory ---> setup inventory
├── client-playbook.yml ---> setup client
├── pre-req-playbook.yml ---> installing prerequisites on all the nodes
├── master-playbook.yml ---> setting up the master/name node
├── slave-playbook.yml ---> setting up worker/data nodes
├── hadoop_system_files
│   ├── client_files
│   │   └── core-site.xml
│   ├── master_files
│   │   ├── core-site.xml
│   │   └── hdfs-site.xml
│   └── slave_files
│       ├── core-site.xml
│       └── hdfs-site.xml
├── master_slave_client_vars
│   ├── directory_vars.yml
│   └── master_ip_file.yml
└── packages
    ├── hadoop-1.2.1-1.x86_64.rpm
    └── jdk-8u171-linux-x64.rpm

The tree diagram above shows the contents of the directory we are about to create.
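If you want to follow along, the skeleton can be created up front with a couple of commands (a quick sketch; the two RPM packages are assumed to be downloaded separately into the packages directory):

# mkdir -p hadoop_system_files/{client_files,master_files,slave_files} master_slave_client_vars packages
# touch ansible.cfg inventory pre-req-playbook.yml master-playbook.yml slave-playbook.yml client-playbook.yml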

Let’s start with the practical

Creating an Ansible Configuration File

# vim ansible.cfg
[defaults]
inventory = inventory
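The minimal configuration above only tells Ansible where the inventory lives. Depending on how your lab nodes are set up, you may also want to set the login user and disable host key checking; the values below are assumptions for a throwaway lab environment, not part of the original setup:

[defaults]
inventory = inventory
# assumed login user for the lab nodes
remote_user = root
# skips SSH host-key prompts; convenient for lab VMs, keep it enabled in production
host_key_checking = False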

Make sure Ansible is using the correct configuration file by running the following command.

# ansible --version

Create an Inventory

This file will contain all the systems and nodes you want to manage from your workstation.

# vim inventory
[master_node]
192.168.3.101
[slave_nodes]
192.168.3.102

The master_node group contains the IP address of the master node, whereas the slave_nodes group contains the IPs of all the slaves you want to connect to the cluster.
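If your nodes use password-based SSH, the connection details can also be declared in the inventory itself; the user and password below are placeholders for illustration, not values from the original setup:

[master_node]
192.168.3.101

[slave_nodes]
192.168.3.102

[all:vars]
ansible_user=root
ansible_ssh_pass=your_password_here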

Note: For this practical, we will have 1 master node and 1 slave node.

Also, check that you have an active connection to all your nodes:

# ansible all -a 'id'

Installing Hadoop and all the required packages on all the nodes in the cluster

# vim pre-req-playbook.yml
---
- name: Prerequisites for hadoop
  hosts: all
  tasks:
    - name: copy the packages at the target
      copy:
        dest: '/'
        src: './packages'

    - name: Find all rpm files
      find:
        paths: "/packages"
        patterns: "*.rpm"
      register: rpm_result

    - debug:
        msg: "{{ item }}"
      loop:
        - "{{ rpm_result }}"

    - name: Install hadoop packages
      yum:
        name: '{{ rpm_result.files[0].path }}'
        state: present

    - name: Install jdk package
      yum:
        name: '{{ rpm_result.files[1].path }}'
        state: present

    # open the port used by the Hadoop NameNode (9001) in the firewall
    - firewalld:
        port: "9001/tcp"
        permanent: true
        state: enabled
        immediate: yes

The above playbook copies the required RPM packages to the managed nodes and installs them. It also opens port 9001, which Hadoop needs, in the firewall.

Running the playbook on all the nodes in the cluster

# ansible-playbook pre-req-playbook.yml
PLAY [Prerequisites for hadoop] ************************************************
TASK [Gathering Facts] *********************************************************
ok: [192.168.3.101]
ok: [192.168.3.102]
TASK [copy the packages at the target] *****************************************
ok: [192.168.3.101]
ok: [192.168.3.102]
TASK [Find all rpm files] ******************************************************
ok: [192.168.3.101]
ok: [192.168.3.102]
TASK [debug] *******************************************************************
ok: [192.168.3.101] => (item={'files': [{'path': '/packages/hadoop-1.2.1-1.x86_64.rpm', 'mode': '0644', 'isdir': False, 'ischr': False, 'isblk': False, 'isreg': True, 'isfifo': False, 'islnk': False, 'issock': False, 'uid': 0, 'gid': 0, 'size': 36823910, 'inode': 69106274, 'dev': 64768, 'nlink': 1, 'atime': 1616335769.4599726, 'mtime': 1606578241.8902164, 'ctime': 1606578242.7012124, 'gr_name': 'root', 'pw_name': 'root', 'wusr': True, 'rusr': True, 'xusr': False, 'wgrp': False, 'rgrp': True, 'xgrp': False, 'woth': False, 'roth': True, 'xoth': False, 'isuid': False, 'isgid': False}, {'path': '/packages/jdk-8u171-linux-x64.rpm', 'mode': '0644', 'isdir': False, 'ischr': False, 'isblk': False, 'isreg': True, 'isfifo': False, 'islnk': False, 'issock': False, 'uid': 0, 'gid': 0, 'size': 175262413, 'inode': 69106357, 'dev': 64768, 'nlink': 1, 'atime': 1616335772.4909647, 'mtime': 1606578251.9751682, 'ctime': 1606578253.877159, 'gr_name': 'root', 'pw_name': 'root', 'wusr': True, 'rusr': True, 'xusr': False, 'wgrp': False, 'rgrp': True, 'xgrp': False, 'woth': False, 'roth': True, 'xoth': False, 'isuid': False, 'isgid': False}], 'changed': False, 'msg': '', 'matched': 2, 'examined': 2, 'failed': False}) => {
"msg": {
"changed": false,
"examined": 2,
"failed": false,
"files": [
{
"atime": 1616335769.4599726,
"ctime": 1606578242.7012124,
"dev": 64768,
"gid": 0,
"gr_name": "root",
"inode": 69106274,
"isblk": false,
"ischr": false,
"isdir": false,
"isfifo": false,
"isgid": false,
"islnk": false,
"isreg": true,
"issock": false,
"isuid": false,
"mode": "0644",
"mtime": 1606578241.8902164,
"nlink": 1,
"path": "/packages/hadoop-1.2.1-1.x86_64.rpm",
"pw_name": "root",
"rgrp": true,
"roth": true,
"rusr": true,
"size": 36823910,
"uid": 0,
"wgrp": false,
"woth": false,
"wusr": true,
"xgrp": false,
"xoth": false,
"xusr": false
},
{
"atime": 1616335772.4909647,
"ctime": 1606578253.877159,
"dev": 64768,
"gid": 0,
"gr_name": "root",
"inode": 69106357,
"isblk": false,
"ischr": false,
"isdir": false,
"isfifo": false,
"isgid": false,
"islnk": false,
"isreg": true,
"issock": false,
"isuid": false,
"mode": "0644",
"mtime": 1606578251.9751682,
"nlink": 1,
"path": "/packages/jdk-8u171-linux-x64.rpm",
"pw_name": "root",
"rgrp": true,
"roth": true,
"rusr": true,
"size": 175262413,
"uid": 0,
"wgrp": false,
"woth": false,
"wusr": true,
"xgrp": false,
"xoth": false,
"xusr": false
}
],
"matched": 2,
"msg": ""
}
}
ok: [192.168.3.102] => (item={'files': [{'path': '/packages/hadoop-1.2.1-1.x86_64.rpm', 'mode': '0644', 'isdir': False, 'ischr': False, 'isblk': False, 'isreg': True, 'isfifo': False, 'islnk': False, 'issock': False, 'uid': 0, 'gid': 0, 'size': 36823910, 'inode': 3084254, 'dev': 64768, 'nlink': 1, 'atime': 1616335769.4396415, 'mtime': 1607243810.630506, 'ctime': 1607243811.571501, 'gr_name': 'root', 'pw_name': 'root', 'wusr': True, 'rusr': True, 'xusr': False, 'wgrp': False, 'rgrp': True, 'xgrp': False, 'woth': False, 'roth': True, 'xoth': False, 'isuid': False, 'isgid': False}, {'path': '/packages/jdk-8u171-linux-x64.rpm', 'mode': '0644', 'isdir': False, 'ischr': False, 'isblk': False, 'isreg': True, 'isfifo': False, 'islnk': False, 'issock': False, 'uid': 0, 'gid': 0, 'size': 175262413, 'inode': 3084256, 'dev': 64768, 'nlink': 1, 'atime': 1616335772.3826203, 'mtime': 1607243821.1384513, 'ctime': 1607243822.2174456, 'gr_name': 'root', 'pw_name': 'root', 'wusr': True, 'rusr': True, 'xusr': False, 'wgrp': False, 'rgrp': True, 'xgrp': False, 'woth': False, 'roth': True, 'xoth': False, 'isuid': False, 'isgid': False}], 'changed': False, 'msg': '', 'matched': 2, 'examined': 2, 'failed': False}) => {
"msg": {
"changed": false,
"examined": 2,
"failed": false,
"files": [
{
"atime": 1616335769.4396415,
"ctime": 1607243811.571501,
"dev": 64768,
"gid": 0,
"gr_name": "root",
"inode": 3084254,
"isblk": false,
"ischr": false,
"isdir": false,
"isfifo": false,
"isgid": false,
"islnk": false,
"isreg": true,
"issock": false,
"isuid": false,
"mode": "0644",
"mtime": 1607243810.630506,
"nlink": 1,
"path": "/packages/hadoop-1.2.1-1.x86_64.rpm",
"pw_name": "root",
"rgrp": true,
"roth": true,
"rusr": true,
"size": 36823910,
"uid": 0,
"wgrp": false,
"woth": false,
"wusr": true,
"xgrp": false,
"xoth": false,
"xusr": false
},
{
"atime": 1616335772.3826203,
"ctime": 1607243822.2174456,
"dev": 64768,
"gid": 0,
"gr_name": "root",
"inode": 3084256,
"isblk": false,
"ischr": false,
"isdir": false,
"isfifo": false,
"isgid": false,
"islnk": false,
"isreg": true,
"issock": false,
"isuid": false,
"mode": "0644",
"mtime": 1607243821.1384513,
"nlink": 1,
"path": "/packages/jdk-8u171-linux-x64.rpm",
"pw_name": "root",
"rgrp": true,
"roth": true,
"rusr": true,
"size": 175262413,
"uid": 0,
"wgrp": false,
"woth": false,
"wusr": true,
"xgrp": false,
"xoth": false,
"xusr": false
}
],
"matched": 2,
"msg": ""
}
}
TASK [Install hadoop packages] *************************************************
ok: [192.168.3.102]
ok: [192.168.3.101]
TASK [Install jdk package] *****************************************************
ok: [192.168.3.102]
ok: [192.168.3.101]
TASK [firewalld] ***************************************************************
ok: [192.168.3.102]
ok: [192.168.3.101]
PLAY RECAP *********************************************************************
192.168.3.101 : ok=7 changed=0 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
192.168.3.102 : ok=7 changed=0 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
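Before configuring the nodes, a quick ad-hoc sanity check (not part of the original walkthrough) confirms that the packages actually landed on every node:

# ansible all -a 'hadoop version'
# ansible all -a 'java -version'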

Configuring the master node

Note: We have set the path of the shared directory used by Hadoop in the master_slave_client_vars/directory_vars.yml file.
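A minimal sketch of what directory_vars.yml could contain is shown below; slave_data_dir is the variable referenced later in the slave playbook, while master_name_dir and both paths are assumptions for illustration:

# cat master_slave_client_vars/directory_vars.yml
# assumed NameNode metadata directory on the master
master_name_dir: /name_node_dir
# directory created and shared by each data node
slave_data_dir: /data_node_dir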

The following playbook configures the master node

# vim master-playbook.yml
---
- name: Master node configured
  hosts: master_node
  tasks:
    - include_vars:
        file: './master_slave_client_vars/directory_vars.yml'
        name: dir
    - name: Copy the core site file to target
      template:
        dest: '/etc/hadoop/core-site.xml'
        src: './hadoop_system_files/master_files/core-site.xml'
    - name: Copy the hdfs-site file to target
      template:
        dest: '/etc/hadoop/hdfs-site.xml'
        src: './hadoop_system_files/master_files/hdfs-site.xml'
    - name: Format the name node
      command: hadoop namenode -format
    - name: get the master ip
      debug:
        msg: "{{ ansible_enp0s8.ipv4.address }}"
      register: master_ip

- name: Store the ip of master to a file
  hosts: localhost
  tasks:
    - debug:
        msg: "{{ hostvars[groups['master_node'][0]]['ansible_enp0s8']['ipv4']['address'] }}"
    - copy:
        dest: './master_slave_client_vars/master_ip_file.yml'
        content: "master_ip: '{{ hostvars[groups['master_node'][0]]['ansible_enp0s8']['ipv4']['address'] }}'"

The above playbook configures the Hadoop master (NameNode) on the managed node and also stores the IP of the master node in a file, which will later be used by the data nodes to connect to the master.
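The two template files copied above are what tie the variables into Hadoop's configuration. A plausible sketch for Hadoop 1.x is shown below; the property values and the dir.master_name_dir variable are assumptions, apart from port 9001, which matches the port opened in the firewall earlier:

<!-- hadoop_system_files/master_files/core-site.xml (sketch) -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://0.0.0.0:9001</value>
  </property>
</configuration>

<!-- hadoop_system_files/master_files/hdfs-site.xml (sketch) -->
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>{{ dir.master_name_dir }}</value>
  </property>
</configuration>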

Running the playbook on the master_node group

# ansible-playbook master-playbook.yml
PLAY [Master node configured] **************************************************
TASK [Gathering Facts] *********************************************************
ok: [192.168.3.101]
TASK [include_vars] ************************************************************
ok: [192.168.3.101]
TASK [Copy the core site file to target] **************************************
changed: [192.168.3.101]
TASK [Copy the hdfs-site file to target] ***************************************
changed: [192.168.3.101]
TASK [Format the name node] ****************************************************
changed: [192.168.3.101]
TASK [get the master ip] *******************************************************
ok: [192.168.3.101] => {
"msg": "192.168.3.101"
}
PLAY [Store the ip of master to a file] ****************************************
TASK [Gathering Facts] *********************************************************
ok: [localhost]
TASK [debug] *******************************************************************
ok: [localhost] => {
"msg": "192.168.3.101"
}
TASK [copy] ********************************************************************
ok: [localhost]
PLAY RECAP *********************************************************************
192.168.3.101 : ok=6 changed=3 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
localhost : ok=3 changed=0 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0

Updated variable file:

# cat master_slave_client_vars/master_ip_file.yml 
master_ip: '192.168.3.101'

Configuring the worker nodes

The following playbook configures all the worker nodes in the cluster

# vim slave-playbook.yml
---
- name: Configure Slave
  hosts: slave_nodes
  tasks:
    - include_vars:
        dir: ./master_slave_client_vars
        name: my_vars
    - name: create the to-share directory
      file:
        state: directory
        path: '{{ my_vars.slave_data_dir }}'
    - name: Copy the core file to the slave-target
      template:
        dest: '/etc/hadoop/core-site.xml'
        src: './hadoop_system_files/slave_files/core-site.xml'
    - name: Copy the hdfs file to the slave-target
      template:
        dest: '/etc/hadoop/hdfs-site.xml'
        src: './hadoop_system_files/slave_files/hdfs-site.xml'
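The slave-side templates might look like the following sketch, where my_vars.master_ip comes from the master_ip_file.yml written by the master playbook and dfs.data.dir points at the directory created above (the property values are assumptions):

<!-- hadoop_system_files/slave_files/core-site.xml (sketch) -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://{{ my_vars.master_ip }}:9001</value>
  </property>
</configuration>

<!-- hadoop_system_files/slave_files/hdfs-site.xml (sketch) -->
<configuration>
  <property>
    <name>dfs.data.dir</name>
    <value>{{ my_vars.slave_data_dir }}</value>
  </property>
</configuration>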

Running the playbook on the slave_nodes group

# ansible-playbook slave-playbook.yml
PLAY [Configure Slave] *********************************************************
TASK [Gathering Facts] *********************************************************
ok: [192.168.3.102]
TASK [include_vars] ************************************************************
ok: [192.168.3.102]
TASK [create the to-share directory] *******************************************
changed: [192.168.3.102]
TASK [Copy the core file to the slave-target] **********************************
changed: [192.168.3.102]
TASK [Copy the hdfs file to the slave-target] **********************************
changed: [192.168.3.102]
PLAY RECAP *********************************************************************
192.168.3.102 : ok=5 changed=3 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0

Checking the Cluster Report

Let’s check the admin report on the master node.

We have one data node, with its capacity shown in GB, ready to store data!

Note: The jps command can be used to check whether the NameNode/DataNode processes are running on the master/data nodes.
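For reference, these are the standard Hadoop 1.x / JDK commands for those checks (run on the nodes themselves, not via Ansible in this walkthrough):

# hadoop dfsadmin -report ---> run on the master; lists connected data nodes and their capacity
# jps ---> lists the running NameNode/DataNode Java processes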

Thank You!
