Install and configure a Multi-node Hadoop cluster using Ansible

I’ve recently been involved with several groups interested in using Hadoop to process large sets of data, including higher-level abstractions on top of Hadoop like Pig and Hive. What has surprised me most is that no one is automating their installation of Hadoop. In each case I’ve observed, they start by manually provisioning some servers and then follow a series of tutorials to manually install and configure a cluster. It typically seems to take about a week to set up a cluster this way, and a lot of that time is often wasted on networking and connectivity problems between hosts.

After telling several groups that they should automate the installation of Hadoop using something like Ansible, I decided to create an example. All the scripts to install a new Hadoop cluster in minutes are on GitHub for you to fork: https://github.com/dwatrous/hadoop-multi-server-ansible

I have also recorded a video demonstration of the following process.

Scope

The scope of this article is to create a three-node cluster on a single computer (Windows in my case) using VirtualBox and Vagrant. The cluster includes HDFS and MapReduce running on all three nodes. The following diagram will help to visualize the cluster.

Build the servers

The first step is to install VirtualBox and Vagrant. Clone hadoop-multi-server-ansible and open a console window in the directory where you cloned it. The Vagrantfile defines three Ubuntu 14.04 servers. Each server needs 3GB of RAM, about 9GB in total for the three, so make sure the host has enough RAM available. Now run vagrant up and wait a few minutes for the new servers to come up.

C:\Users\watrous\Documents\hadoop>vagrant up
Bringing machine 'master' up with 'virtualbox' provider...
Bringing machine 'data1' up with 'virtualbox' provider...
Bringing machine 'data2' up with 'virtualbox' provider...
==> master: Importing base box 'ubuntu/trusty64'...
==> master: Matching MAC address for NAT networking...
…
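To give a sense of what a three-server definition looks like, here is a minimal Vagrantfile sketch. It is not the file from the repository. The machine names, the ubuntu/trusty64 box and the 3GB of RAM per server come from the description and output above; the private-network IP addresses and the NODES constant are assumptions made purely for illustration.

```ruby
# Hypothetical sketch of a three-node Vagrantfile (not the repository's file).
# Assumption: the private-network IP addresses below are made up for illustration.
NODES = {
  "master" => "192.168.51.4",
  "data1"  => "192.168.51.5",
  "data2"  => "192.168.51.6",
}

Vagrant.configure("2") do |config|
  # Ubuntu 14.04 base box, as shown in the vagrant up output
  config.vm.box = "ubuntu/trusty64"

  # Give every VirtualBox machine 3GB of RAM (value is in MB)
  config.vm.provider "virtualbox" do |vb|
    vb.memory = 3072
  end

  # Define one machine per entry: master first, then the two data nodes
  NODES.each do |name, ip|
    config.vm.define name do |node|
      node.vm.hostname = name
      # A fixed private address lets the nodes reach each other by known IPs
      node.vm.network "private_network", ip: ip
    end
  end
end
```

With a file along these lines in the current directory, vagrant up brings the machines up in the order they are defined, and vagrant status lists the state of each one.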