Bulid a multi-server Hadoop cluster in OpenStack in minutes

October 22, 2015

Software Engineering

2 Comments

Daniel Watrous

In a previous post I demonstrated a method to deploy a multi-node Hadoop cluster using Vagrant and Ansible. This post builds on that and shows how to deploy a Hadoop cluster with an arbitrary number of slave nodes in minutes on OpenStack. This process makes use of the OpenStack orchestration layer HEAT to provision the resources, after which Ansible use used to configure those resources.

All the scripts to do this yourself is available on github to clone and fork:
https://github.com/dwatrous/hadoop-multi-server-ansible

I have recorded a video demonstrating the entire process, including scaling the cluster after initial deployment:

Scope

The scope of this article is to create a Hadoop cluster with an arbitrary number of slave nodes, which can be automatically scaled up or down to accommodate changes in capacity as workloads change. The following diagram illustrates this:
hadoop-design-openstack

Build the servers

For convenience, this process still uses Vagrant to create a server that will function as the heat and ansible controller. It’s also possible create a server in OpenStack to fill this role. In this case you could simply use the bootstrap-master.sh script to configure that server. The steps to create the servers in OpenStack using heat are:

Install openstack clients (we do this in a python virtual environment)
Download and source the openrc file from your OpenStack environment
Use the openstack clients to get details about keypairs, images, networks, etc.
Update the heat template for your environment
Use heat to build your servers

Install and Run Hadoop

Once the servers are provisioned, it’s time to install Hadoop. This is done using Ansible and can be run from the same host where heat was used (the vagrant created server in this case). Ansible requires an inventory file to run. Since heat is aware of the server resources it created, I added a python script to request information about provisioned servers from heat and write an inventory file. Any time you update your stack using heat, be sure to run the heat-inventory.py script so Ansible is working against the current state. Keep in mind that if you have a proxied environment, you may need to update group_vars/all.

When the Ansible scripts complete, remember to connect to the master node in openstack. From the Hadoop master node, the same process as before can be followed to start Hadoop and run a job.

Security and Configuration

In this example, a floating IP is attached to every server so the local vagrant server can connect via SSH and configure them. If a server was manually prepared in the same openstack environment, the SSH connectivity could leverage IP addresses on the private network. This would eliminate all but one floating IP address, which is still required for the master node.

Future work might include additional automation to tie together the steps I’ve demonstrated. These can also be executed as part of a CI/CD tool chain for fully automated deployments.

PRV POST

NXT POST

Comments

malouke : December 22, 2015 at 3:09 am

thank you so much,
i have computer with fllow config :
core i7 8 cores
and
32 Go ram

i want install spark in multi node for a home cluster can i ask you how ?
thank you .

Reply
- Daniel Watrous : January 14, 2016 at 3:06 pm
  
  You can probably just develop some modified Ansible scripts based on what I’ve done here.
  
  Reply