Daniel Watrous on Software Engineering

A Collection of Software Problems and Solutions

Posts tagged hdfs

Software Engineering

Bulid a multi-server Hadoop cluster in OpenStack in minutes

In a previous post I demonstrated a method to deploy a multi-node Hadoop cluster using Vagrant and Ansible. This post builds on that and shows how to deploy a Hadoop cluster with an arbitrary number of slave nodes in minutes on OpenStack. This process makes use of the OpenStack orchestration layer HEAT to provision the resources, after which Ansible use used to configure those resources.

All the scripts to do this yourself is available on github to clone and fork:
https://github.com/dwatrous/hadoop-multi-server-ansible

I have recorded a video demonstrating the entire process, including scaling the cluster after initial deployment:

Scope

The scope of this article is to create a Hadoop cluster with an arbitrary number of slave nodes, which can be automatically scaled up or down to accommodate changes in capacity as workloads change. The following diagram illustrates this:
hadoop-design-openstack

Build the servers

For convenience, this process still uses Vagrant to create a server that will function as the heat and ansible controller. It’s also possible create a server in OpenStack to fill this role. In this case you could simply use the bootstrap-master.sh script to configure that server. The steps to create the servers in OpenStack using heat are:

  1. Install openstack clients (we do this in a python virtual environment)
  2. Download and source the openrc file from your OpenStack environment
  3. Use the openstack clients to get details about keypairs, images, networks, etc.
  4. Update the heat template for your environment
  5. Use heat to build your servers

Install and Run Hadoop

Once the servers are provisioned, it’s time to install Hadoop. This is done using Ansible and can be run from the same host where heat was used (the vagrant created server in this case). Ansible requires an inventory file to run. Since heat is aware of the server resources it created, I added a python script to request information about provisioned servers from heat and write an inventory file. Any time you update your stack using heat, be sure to run the heat-inventory.py script so Ansible is working against the current state. Keep in mind that if you have a proxied environment, you may need to update group_vars/all.

When the Ansible scripts complete, remember to connect to the master node in openstack. From the Hadoop master node, the same process as before can be followed to start Hadoop and run a job.

Security and Configuration

In this example, a floating IP is attached to every server so the local vagrant server can connect via SSH and configure them. If a server was manually prepared in the same openstack environment, the SSH connectivity could leverage IP addresses on the private network. This would eliminate all but one floating IP address, which is still required for the master node.

Future work might include additional automation to tie together the steps I’ve demonstrated. These can also be executed as part of a CI/CD tool chain for fully automated deployments.

Software Engineering

Install and configure a Multi-node Hadoop cluster using Ansible

I’ve recently been involved with several groups interested in using Hadoop to process large sets of data, including use of higher level abstractions on top of Hadoop like Pig and Hive. What has surprised me most is that no one is automating their installation of Hadoop. In each case that I’ve observed they start by manually provisioning some servers and then follow a series of tutorials to manually install and configure a cluster. The typical experience seems to take about a week to setup a cluster. There is often a lot of wasted time to deal with networking and connectivity between hosts.

After telling several groups that they should automate the installation of Hadoop using something like Ansible, I decided to create an example. All the scripts to install a new Hadoop cluster in minutes are on github for you to fork: https://github.com/dwatrous/hadoop-multi-server-ansible

I have also recorded a video demonstration of the following process:

Scope

The scope of this article is to create a three node cluster on a single computer (Windows in my case) using VirtualBox and Vagrant. The cluster includes HDFS and mapreduce running on all three nodes. The following diagram will help to visualize the cluster.

hadoop-design

Build the servers

The first step is to install VirtualBox and Vagrant.

Clone hadoop-multi-server-ansible and open a console window to the directory where you cloned. The Vagrantfile defines three Ubuntu 14.04 servers. Each server needs 3GB RAM, so you’ll need to make sure you have enough RAM available. Now run vagrant up and wait a few minutes for the new servers to come up.

C:\Users\watrous\Documents\hadoop>vagrant up
Bringing machine 'master' up with 'virtualbox' provider...
Bringing machine 'data1' up with 'virtualbox' provider...
Bringing machine 'data2' up with 'virtualbox' provider...
==> master: Importing base box 'ubuntu/trusty64'...
==> master: Matching MAC address for NAT networking...
==> master: Checking if box 'ubuntu/trusty64' is up to date...
==> master: A newer version of the box 'ubuntu/trusty64' is available! You currently
==> master: have version '20150916.0.0'. The latest is version '20150924.0.0'. Run
==> master: `vagrant box update` to update.
==> master: Setting the name of the VM: master
==> master: Clearing any previously set forwarded ports...
==> master: Clearing any previously set network interfaces...
==> master: Preparing network interfaces based on configuration...
    master: Adapter 1: nat
    master: Adapter 2: hostonly
==> master: Forwarding ports...
    master: 22 => 2222 (adapter 1)
==> master: Running 'pre-boot' VM customizations...
==> master: Booting VM...
==> master: Waiting for machine to boot. This may take a few minutes...
    master: SSH address: 127.0.0.1:2222
    master: SSH username: vagrant
    master: SSH auth method: private key
    master: Warning: Connection timeout. Retrying...
==> master: Machine booted and ready!
==> master: Checking for guest additions in VM...
==> master: Setting hostname...
==> master: Configuring and enabling network interfaces...
==> master: Mounting shared folders...
    master: /home/vagrant/src => C:/Users/watrous/Documents/hadoop
==> master: Running provisioner: file...
==> master: Running provisioner: shell...
    master: Running: C:/Users/watrous/AppData/Local/Temp/vagrant-shell20150930-12444-1lgl5bq.sh
==> master: stdin: is not a tty
==> master: Ign http://archive.ubuntu.com trusty InRelease
==> master: Ign http://archive.ubuntu.com trusty-updates InRelease
==> master: Ign http://security.ubuntu.com trusty-security InRelease
==> master: Hit http://archive.ubuntu.com trusty Release.gpg
==> master: Get:1 http://security.ubuntu.com trusty-security Release.gpg [933 B]
==> master: Get:2 http://archive.ubuntu.com trusty-updates Release.gpg [933 B]
==> master: Hit http://archive.ubuntu.com trusty Release
==> master: Get:3 http://security.ubuntu.com trusty-security Release [63.5 kB]
==> master: Get:4 http://archive.ubuntu.com trusty-updates Release [63.5 kB]
==> master: Get:5 http://archive.ubuntu.com trusty/main Sources [1,064 kB]
==> master: Get:6 http://security.ubuntu.com trusty-security/main Sources [96.2 kB]
==> master: Get:7 http://security.ubuntu.com trusty-security/universe Sources [31.1 kB]
==> master: Get:8 http://security.ubuntu.com trusty-security/main amd64 Packages [350 kB]
==> master: Get:9 http://archive.ubuntu.com trusty/universe Sources [6,399 kB]
==> master: Get:10 http://security.ubuntu.com trusty-security/universe amd64 Packages [117 kB]
==> master: Get:11 http://security.ubuntu.com trusty-security/main Translation-en [191 kB]
==> master: Get:12 http://security.ubuntu.com trusty-security/universe Translation-en [68.2 kB]
==> master: Hit http://archive.ubuntu.com trusty/main amd64 Packages
==> master: Hit http://archive.ubuntu.com trusty/universe amd64 Packages
==> master: Hit http://archive.ubuntu.com trusty/main Translation-en
==> master: Hit http://archive.ubuntu.com trusty/universe Translation-en
==> master: Get:13 http://archive.ubuntu.com trusty-updates/main Sources [236 kB]
==> master: Get:14 http://archive.ubuntu.com trusty-updates/universe Sources [139 kB]
==> master: Get:15 http://archive.ubuntu.com trusty-updates/main amd64 Packages [626 kB]
==> master: Get:16 http://archive.ubuntu.com trusty-updates/universe amd64 Packages [320 kB]
==> master: Get:17 http://archive.ubuntu.com trusty-updates/main Translation-en [304 kB]
==> master: Get:18 http://archive.ubuntu.com trusty-updates/universe Translation-en [168 kB]
==> master: Ign http://archive.ubuntu.com trusty/main Translation-en_US
==> master: Ign http://archive.ubuntu.com trusty/universe Translation-en_US
==> master: Fetched 10.2 MB in 4s (2,098 kB/s)
==> master: Reading package lists...
==> master: Reading package lists...
==> master: Building dependency tree...
==> master:
==> master: Reading state information...
==> master: The following extra packages will be installed:
==> master:   build-essential dpkg-dev g++ g++-4.8 libalgorithm-diff-perl
==> master:   libalgorithm-diff-xs-perl libalgorithm-merge-perl libdpkg-perl libexpat1-dev
==> master:   libfile-fcntllock-perl libpython-dev libpython2.7-dev libstdc++-4.8-dev
==> master:   python-chardet-whl python-colorama python-colorama-whl python-distlib
==> master:   python-distlib-whl python-html5lib python-html5lib-whl python-pip-whl
==> master:   python-requests-whl python-setuptools python-setuptools-whl python-six-whl
==> master:   python-urllib3-whl python-wheel python2.7-dev python3-pkg-resources
==> master: Suggested packages:
==> master:   debian-keyring g++-multilib g++-4.8-multilib gcc-4.8-doc libstdc++6-4.8-dbg
==> master:   libstdc++-4.8-doc python-genshi python-lxml python3-setuptools zip
==> master: Recommended packages:
==> master:   python-dev-all
==> master: The following NEW packages will be installed:
==> master:   build-essential dpkg-dev g++ g++-4.8 libalgorithm-diff-perl
==> master:   libalgorithm-diff-xs-perl libalgorithm-merge-perl libdpkg-perl libexpat1-dev
==> master:   libfile-fcntllock-perl libpython-dev libpython2.7-dev libstdc++-4.8-dev
==> master:   python-chardet-whl python-colorama python-colorama-whl python-dev
==> master:   python-distlib python-distlib-whl python-html5lib python-html5lib-whl
==> master:   python-pip python-pip-whl python-requests-whl python-setuptools
==> master:   python-setuptools-whl python-six-whl python-urllib3-whl python-wheel
==> master:   python2.7-dev python3-pkg-resources unzip
==> master: 0 upgraded, 32 newly installed, 0 to remove and 29 not upgraded.
==> master: Need to get 41.3 MB of archives.
==> master: After this operation, 80.4 MB of additional disk space will be used.
==> master: Get:1 http://archive.ubuntu.com/ubuntu/ trusty-updates/main libexpat1-dev amd64 2.1.0-4ubuntu1.1 [115 kB]
==> master: Get:2 http://archive.ubuntu.com/ubuntu/ trusty-updates/main libpython2.7-dev amd64 2.7.6-8ubuntu0.2 [22.0 MB]
==> master: Get:3 http://archive.ubuntu.com/ubuntu/ trusty-updates/main libstdc++-4.8-dev amd64 4.8.4-2ubuntu1~14.04 [1,052 kB]
==> master: Get:4 http://archive.ubuntu.com/ubuntu/ trusty-updates/main g++-4.8 amd64 4.8.4-2ubuntu1~14.04 [15.0 MB]
==> master: Get:5 http://archive.ubuntu.com/ubuntu/ trusty/main g++ amd64 4:4.8.2-1ubuntu6 [1,490 B]
==> master: Get:6 http://archive.ubuntu.com/ubuntu/ trusty-updates/main libdpkg-perl all 1.17.5ubuntu5.4 [179 kB]
==> master: Get:7 http://archive.ubuntu.com/ubuntu/ trusty-updates/main dpkg-dev all 1.17.5ubuntu5.4 [726 kB]
==> master: Get:8 http://archive.ubuntu.com/ubuntu/ trusty/main build-essential amd64 11.6ubuntu6 [4,838 B]
==> master: Get:9 http://archive.ubuntu.com/ubuntu/ trusty/main libalgorithm-diff-perl all 1.19.02-3 [50.0 kB]
==> master: Get:10 http://archive.ubuntu.com/ubuntu/ trusty/main libalgorithm-diff-xs-perl amd64 0.04-2build4 [12.6 kB]
==> master: Get:11 http://archive.ubuntu.com/ubuntu/ trusty/main libalgorithm-merge-perl all 0.08-2 [12.7 kB]
==> master: Get:12 http://archive.ubuntu.com/ubuntu/ trusty/main libfile-fcntllock-perl amd64 0.14-2build1 [15.9 kB]
==> master: Get:13 http://archive.ubuntu.com/ubuntu/ trusty/main libpython-dev amd64 2.7.5-5ubuntu3 [7,078 B]
==> master: Get:14 http://archive.ubuntu.com/ubuntu/ trusty-updates/main python3-pkg-resources all 3.3-1ubuntu2 [31.7 kB]
==> master: Get:15 http://archive.ubuntu.com/ubuntu/ trusty-updates/universe python-chardet-whl all 2.2.1-2~ubuntu1 [170 kB]
==> master: Get:16 http://archive.ubuntu.com/ubuntu/ trusty-updates/universe python-colorama all 0.2.5-0.1ubuntu2 [18.4 kB]
==> master: Get:17 http://archive.ubuntu.com/ubuntu/ trusty-updates/universe python-colorama-whl all 0.2.5-0.1ubuntu2 [18.2 kB]
==> master: Get:18 http://archive.ubuntu.com/ubuntu/ trusty-updates/main python2.7-dev amd64 2.7.6-8ubuntu0.2 [269 kB]
==> master: Get:19 http://archive.ubuntu.com/ubuntu/ trusty/main python-dev amd64 2.7.5-5ubuntu3 [1,166 B]
==> master: Get:20 http://archive.ubuntu.com/ubuntu/ trusty-updates/universe python-distlib all 0.1.8-1ubuntu1 [113 kB]
==> master: Get:21 http://archive.ubuntu.com/ubuntu/ trusty-updates/universe python-distlib-whl all 0.1.8-1ubuntu1 [140 kB]
==> master: Get:22 http://archive.ubuntu.com/ubuntu/ trusty-updates/main python-html5lib all 0.999-3~ubuntu1 [83.5 kB]
==> master: Get:23 http://archive.ubuntu.com/ubuntu/ trusty-updates/main python-html5lib-whl all 0.999-3~ubuntu1 [109 kB]
==> master: Get:24 http://archive.ubuntu.com/ubuntu/ trusty-updates/main python-six-whl all 1.5.2-1ubuntu1 [10.5 kB]
==> master: Get:25 http://archive.ubuntu.com/ubuntu/ trusty-updates/main python-urllib3-whl all 1.7.1-1ubuntu3 [64.0 kB]
==> master: Get:26 http://archive.ubuntu.com/ubuntu/ trusty-updates/main python-requests-whl all 2.2.1-1ubuntu0.3 [227
kB]
==> master: Get:27 http://archive.ubuntu.com/ubuntu/ trusty-updates/main python-setuptools-whl all 3.3-1ubuntu2 [244 kB]
==> master: Get:28 http://archive.ubuntu.com/ubuntu/ trusty-updates/universe python-pip-whl all 1.5.4-1ubuntu3 [111 kB]
==> master: Get:29 http://archive.ubuntu.com/ubuntu/ trusty-updates/main python-setuptools all 3.3-1ubuntu2 [230 kB]
==> master: Get:30 http://archive.ubuntu.com/ubuntu/ trusty-updates/universe python-pip all 1.5.4-1ubuntu3 [97.2 kB]
==> master: Get:31 http://archive.ubuntu.com/ubuntu/ trusty-updates/main python-wheel all 0.24.0-1~ubuntu1 [44.7 kB]
==> master: Get:32 http://archive.ubuntu.com/ubuntu/ trusty-updates/main unzip amd64 6.0-9ubuntu1.3 [157 kB]
==> master: dpkg-preconfigure: unable to re-open stdin: No such file or directory
==> master: Fetched 41.3 MB in 20s (2,027 kB/s)
==> master: Selecting previously unselected package libexpat1-dev:amd64.
==> master: (Reading database ... 61002 files and directories currently installed.)
==> master: Preparing to unpack .../libexpat1-dev_2.1.0-4ubuntu1.1_amd64.deb ...
==> master: Unpacking libexpat1-dev:amd64 (2.1.0-4ubuntu1.1) ...
==> master: Selecting previously unselected package libpython2.7-dev:amd64.
==> master: Preparing to unpack .../libpython2.7-dev_2.7.6-8ubuntu0.2_amd64.deb ...
==> master: Unpacking libpython2.7-dev:amd64 (2.7.6-8ubuntu0.2) ...
==> master: Selecting previously unselected package libstdc++-4.8-dev:amd64.
==> master: Preparing to unpack .../libstdc++-4.8-dev_4.8.4-2ubuntu1~14.04_amd64.deb ...
==> master: Unpacking libstdc++-4.8-dev:amd64 (4.8.4-2ubuntu1~14.04) ...
==> master: Selecting previously unselected package g++-4.8.
==> master: Preparing to unpack .../g++-4.8_4.8.4-2ubuntu1~14.04_amd64.deb ...
==> master: Unpacking g++-4.8 (4.8.4-2ubuntu1~14.04) ...
==> master: Selecting previously unselected package g++.
==> master: Preparing to unpack .../g++_4%3a4.8.2-1ubuntu6_amd64.deb ...
==> master: Unpacking g++ (4:4.8.2-1ubuntu6) ...
==> master: Selecting previously unselected package libdpkg-perl.
==> master: Preparing to unpack .../libdpkg-perl_1.17.5ubuntu5.4_all.deb ...
==> master: Unpacking libdpkg-perl (1.17.5ubuntu5.4) ...
==> master: Selecting previously unselected package dpkg-dev.
==> master: Preparing to unpack .../dpkg-dev_1.17.5ubuntu5.4_all.deb ...
==> master: Unpacking dpkg-dev (1.17.5ubuntu5.4) ...
==> master: Selecting previously unselected package build-essential.
==> master: Preparing to unpack .../build-essential_11.6ubuntu6_amd64.deb ...
==> master: Unpacking build-essential (11.6ubuntu6) ...
==> master: Selecting previously unselected package libalgorithm-diff-perl.
==> master: Preparing to unpack .../libalgorithm-diff-perl_1.19.02-3_all.deb ...
==> master: Unpacking libalgorithm-diff-perl (1.19.02-3) ...
==> master: Selecting previously unselected package libalgorithm-diff-xs-perl.
==> master: Preparing to unpack .../libalgorithm-diff-xs-perl_0.04-2build4_amd64.deb ...
==> master: Unpacking libalgorithm-diff-xs-perl (0.04-2build4) ...
==> master: Selecting previously unselected package libalgorithm-merge-perl.
==> master: Preparing to unpack .../libalgorithm-merge-perl_0.08-2_all.deb ...
==> master: Unpacking libalgorithm-merge-perl (0.08-2) ...
==> master: Selecting previously unselected package libfile-fcntllock-perl.
==> master: Preparing to unpack .../libfile-fcntllock-perl_0.14-2build1_amd64.deb ...
==> master: Unpacking libfile-fcntllock-perl (0.14-2build1) ...
==> master: Selecting previously unselected package libpython-dev:amd64.
==> master: Preparing to unpack .../libpython-dev_2.7.5-5ubuntu3_amd64.deb ...
==> master: Unpacking libpython-dev:amd64 (2.7.5-5ubuntu3) ...
==> master: Selecting previously unselected package python3-pkg-resources.
==> master: Preparing to unpack .../python3-pkg-resources_3.3-1ubuntu2_all.deb ...
==> master: Unpacking python3-pkg-resources (3.3-1ubuntu2) ...
==> master: Selecting previously unselected package python-chardet-whl.
==> master: Preparing to unpack .../python-chardet-whl_2.2.1-2~ubuntu1_all.deb ...
==> master: Unpacking python-chardet-whl (2.2.1-2~ubuntu1) ...
==> master: Selecting previously unselected package python-colorama.
==> master: Preparing to unpack .../python-colorama_0.2.5-0.1ubuntu2_all.deb ...
==> master: Unpacking python-colorama (0.2.5-0.1ubuntu2) ...
==> master: Selecting previously unselected package python-colorama-whl.
==> master: Preparing to unpack .../python-colorama-whl_0.2.5-0.1ubuntu2_all.deb ...
==> master: Unpacking python-colorama-whl (0.2.5-0.1ubuntu2) ...
==> master: Selecting previously unselected package python2.7-dev.
==> master: Preparing to unpack .../python2.7-dev_2.7.6-8ubuntu0.2_amd64.deb ...
==> master: Unpacking python2.7-dev (2.7.6-8ubuntu0.2) ...
==> master: Selecting previously unselected package python-dev.
==> master: Preparing to unpack .../python-dev_2.7.5-5ubuntu3_amd64.deb ...
==> master: Unpacking python-dev (2.7.5-5ubuntu3) ...
==> master: Selecting previously unselected package python-distlib.
==> master: Preparing to unpack .../python-distlib_0.1.8-1ubuntu1_all.deb ...
==> master: Unpacking python-distlib (0.1.8-1ubuntu1) ...
==> master: Selecting previously unselected package python-distlib-whl.
==> master: Preparing to unpack .../python-distlib-whl_0.1.8-1ubuntu1_all.deb ...
==> master: Unpacking python-distlib-whl (0.1.8-1ubuntu1) ...
==> master: Selecting previously unselected package python-html5lib.
==> master: Preparing to unpack .../python-html5lib_0.999-3~ubuntu1_all.deb ...
==> master: Unpacking python-html5lib (0.999-3~ubuntu1) ...
==> master: Selecting previously unselected package python-html5lib-whl.
==> master: Preparing to unpack .../python-html5lib-whl_0.999-3~ubuntu1_all.deb ...
==> master: Unpacking python-html5lib-whl (0.999-3~ubuntu1) ...
==> master: Selecting previously unselected package python-six-whl.
==> master: Preparing to unpack .../python-six-whl_1.5.2-1ubuntu1_all.deb ...
==> master: Unpacking python-six-whl (1.5.2-1ubuntu1) ...
==> master: Selecting previously unselected package python-urllib3-whl.
==> master: Preparing to unpack .../python-urllib3-whl_1.7.1-1ubuntu3_all.deb ...
==> master: Unpacking python-urllib3-whl (1.7.1-1ubuntu3) ...
==> master: Selecting previously unselected package python-requests-whl.
==> master: Preparing to unpack .../python-requests-whl_2.2.1-1ubuntu0.3_all.deb ...
==> master: Unpacking python-requests-whl (2.2.1-1ubuntu0.3) ...
==> master: Selecting previously unselected package python-setuptools-whl.
==> master: Preparing to unpack .../python-setuptools-whl_3.3-1ubuntu2_all.deb ...
==> master: Unpacking python-setuptools-whl (3.3-1ubuntu2) ...
==> master: Selecting previously unselected package python-pip-whl.
==> master: Preparing to unpack .../python-pip-whl_1.5.4-1ubuntu3_all.deb ...
==> master: Unpacking python-pip-whl (1.5.4-1ubuntu3) ...
==> master: Selecting previously unselected package python-setuptools.
==> master: Preparing to unpack .../python-setuptools_3.3-1ubuntu2_all.deb ...
==> master: Unpacking python-setuptools (3.3-1ubuntu2) ...
==> master: Selecting previously unselected package python-pip.
==> master: Preparing to unpack .../python-pip_1.5.4-1ubuntu3_all.deb ...
==> master: Unpacking python-pip (1.5.4-1ubuntu3) ...
==> master: Selecting previously unselected package python-wheel.
==> master: Preparing to unpack .../python-wheel_0.24.0-1~ubuntu1_all.deb ...
==> master: Unpacking python-wheel (0.24.0-1~ubuntu1) ...
==> master: Selecting previously unselected package unzip.
==> master: Preparing to unpack .../unzip_6.0-9ubuntu1.3_amd64.deb ...
==> master: Unpacking unzip (6.0-9ubuntu1.3) ...
==> master: Processing triggers for man-db (2.6.7.1-1ubuntu1) ...
==> master: Processing triggers for mime-support (3.54ubuntu1.1) ...
==> master: Setting up libexpat1-dev:amd64 (2.1.0-4ubuntu1.1) ...
==> master: Setting up libpython2.7-dev:amd64 (2.7.6-8ubuntu0.2) ...
==> master: Setting up libstdc++-4.8-dev:amd64 (4.8.4-2ubuntu1~14.04) ...
==> master: Setting up g++-4.8 (4.8.4-2ubuntu1~14.04) ...
==> master: Setting up g++ (4:4.8.2-1ubuntu6) ...
==> master: update-alternatives: using /usr/bin/g++ to provide /usr/bin/c++ (c++) in auto mode
==> master: Setting up libdpkg-perl (1.17.5ubuntu5.4) ...
==> master: Setting up dpkg-dev (1.17.5ubuntu5.4) ...
==> master: Setting up build-essential (11.6ubuntu6) ...
==> master: Setting up libalgorithm-diff-perl (1.19.02-3) ...
==> master: Setting up libalgorithm-diff-xs-perl (0.04-2build4) ...
==> master: Setting up libalgorithm-merge-perl (0.08-2) ...
==> master: Setting up libfile-fcntllock-perl (0.14-2build1) ...
==> master: Setting up libpython-dev:amd64 (2.7.5-5ubuntu3) ...
==> master: Setting up python3-pkg-resources (3.3-1ubuntu2) ...
==> master: Setting up python-chardet-whl (2.2.1-2~ubuntu1) ...
==> master: Setting up python-colorama (0.2.5-0.1ubuntu2) ...
==> master: Setting up python-colorama-whl (0.2.5-0.1ubuntu2) ...
==> master: Setting up python2.7-dev (2.7.6-8ubuntu0.2) ...
==> master: Setting up python-dev (2.7.5-5ubuntu3) ...
==> master: Setting up python-distlib (0.1.8-1ubuntu1) ...
==> master: Setting up python-distlib-whl (0.1.8-1ubuntu1) ...
==> master: Setting up python-html5lib (0.999-3~ubuntu1) ...
==> master: Setting up python-html5lib-whl (0.999-3~ubuntu1) ...
==> master: Setting up python-six-whl (1.5.2-1ubuntu1) ...
==> master: Setting up python-urllib3-whl (1.7.1-1ubuntu3) ...
==> master: Setting up python-requests-whl (2.2.1-1ubuntu0.3) ...
==> master: Setting up python-setuptools-whl (3.3-1ubuntu2) ...
==> master: Setting up python-pip-whl (1.5.4-1ubuntu3) ...
==> master: Setting up python-setuptools (3.3-1ubuntu2) ...
==> master: Setting up python-pip (1.5.4-1ubuntu3) ...
==> master: Setting up python-wheel (0.24.0-1~ubuntu1) ...
==> master: Setting up unzip (6.0-9ubuntu1.3) ...
==> master: Downloading/unpacking ansible
==> master:   Running setup.py (path:/tmp/pip_build_root/ansible/setup.py) egg_info for package ansible
==> master:
==> master:     no previously-included directories found matching 'v2'
==> master:     no previously-included directories found matching 'docsite'
==> master:     no previously-included directories found matching 'ticket_stubs'
==> master:     no previously-included directories found matching 'packaging'
==> master:     no previously-included directories found matching 'test'
==> master:     no previously-included directories found matching 'hacking'
==> master:     no previously-included directories found matching 'lib/ansible/modules/core/.git'
==> master:     no previously-included directories found matching 'lib/ansible/modules/extras/.git'
==> master: Downloading/unpacking paramiko (from ansible)
==> master: Downloading/unpacking jinja2 (from ansible)
==> master: Requirement already satisfied (use --upgrade to upgrade): PyYAML in /usr/lib/python2.7/dist-packages (from
ansible)
==> master: Requirement already satisfied (use --upgrade to upgrade): setuptools in /usr/lib/python2.7/dist-packages (from ansible)
==> master: Requirement already satisfied (use --upgrade to upgrade): pycrypto>=2.6 in /usr/lib/python2.7/dist-packages (from ansible)
==> master: Downloading/unpacking ecdsa>=0.11 (from paramiko->ansible)
==> master: Downloading/unpacking MarkupSafe (from jinja2->ansible)
==> master:   Downloading MarkupSafe-0.23.tar.gz
==> master:   Running setup.py (path:/tmp/pip_build_root/MarkupSafe/setup.py) egg_info for package MarkupSafe
==> master:
==> master: Installing collected packages: ansible, paramiko, jinja2, ecdsa, MarkupSafe
==> master:   Running setup.py install for ansible
==> master:     changing mode of build/scripts-2.7/ansible from 644 to 755
==> master:     changing mode of build/scripts-2.7/ansible-playbook from 644 to 755
==> master:     changing mode of build/scripts-2.7/ansible-pull from 644 to 755
==> master:     changing mode of build/scripts-2.7/ansible-doc from 644 to 755
==> master:     changing mode of build/scripts-2.7/ansible-galaxy from 644 to 755
==> master:     changing mode of build/scripts-2.7/ansible-vault from 644 to 755
==> master:
==> master:     no previously-included directories found matching 'v2'
==> master:     no previously-included directories found matching 'docsite'
==> master:     no previously-included directories found matching 'ticket_stubs'
==> master:     no previously-included directories found matching 'test'
==> master:     no previously-included directories found matching 'hacking'
==> master:     no previously-included directories found matching 'lib/ansible/modules/core/.git'
==> master:     no previously-included directories found matching 'lib/ansible/modules/extras/.git'
==> master:     changing mode of /usr/local/bin/ansible-galaxy to 755
==> master:     changing mode of /usr/local/bin/ansible-playbook to 755
==> master:     changing mode of /usr/local/bin/ansible-doc to 755
==> master:     changing mode of /usr/local/bin/ansible-pull to 755
==> master:     changing mode of /usr/local/bin/ansible-vault to 755
==> master:     changing mode of /usr/local/bin/ansible to 755
==> master:   Running setup.py install for MarkupSafe
==> master:
==> master:     building 'markupsafe._speedups' extension
==> master:     x86_64-linux-gnu-gcc -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fPIC -I/usr/include/python2.7 -c markupsafe/_speedups.c -o build/temp.linux-x86_64-2.7/markupsafe/_speedups.o
==> master:     x86_64-linux-gnu-gcc -pthread -shared -Wl,-O1 -Wl,-Bsymbolic-functions -Wl,-Bsymbolic-functions -Wl,-z,relro -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -D_FORTIFY_SOURCE=2 -g -fstack-protector --param=ssp-buffer-size=4 -Wformat -Werror=format-security build/temp.linux-x86_64-2.7/markupsafe/_speedups.o -o build/lib.linux-x86_64-2.7/markupsafe/_speedups.so
==> master: Successfully installed ansible paramiko jinja2 ecdsa MarkupSafe
==> master: Cleaning up...
==> master: # 192.168.51.4 SSH-2.0-OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3
==> master: # 192.168.51.4 SSH-2.0-OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3
==> master: read (192.168.51.6): No route to host
==> master: read (192.168.51.6): No route to host
==> data1: Importing base box 'ubuntu/trusty64'...
==> data1: Matching MAC address for NAT networking...
==> data1: Checking if box 'ubuntu/trusty64' is up to date...
==> data1: A newer version of the box 'ubuntu/trusty64' is available! You currently
==> data1: have version '20150916.0.0'. The latest is version '20150924.0.0'. Run
==> data1: `vagrant box update` to update.
==> data1: Setting the name of the VM: data1
==> data1: Clearing any previously set forwarded ports...
==> data1: Fixed port collision for 22 => 2222. Now on port 2200.
==> data1: Clearing any previously set network interfaces...
==> data1: Preparing network interfaces based on configuration...
    data1: Adapter 1: nat
    data1: Adapter 2: hostonly
==> data1: Forwarding ports...
    data1: 22 => 2200 (adapter 1)
==> data1: Running 'pre-boot' VM customizations...
==> data1: Booting VM...
==> data1: Waiting for machine to boot. This may take a few minutes...
    data1: SSH address: 127.0.0.1:2200
    data1: SSH username: vagrant
    data1: SSH auth method: private key
    data1: Warning: Connection timeout. Retrying...
==> data1: Machine booted and ready!
==> data1: Checking for guest additions in VM...
==> data1: Setting hostname...
==> data1: Configuring and enabling network interfaces...
==> data1: Mounting shared folders...
    data1: /vagrant => C:/Users/watrous/Documents/hadoop
==> data1: Running provisioner: file...
==> data2: Importing base box 'ubuntu/trusty64'...
==> data2: Matching MAC address for NAT networking...
==> data2: Checking if box 'ubuntu/trusty64' is up to date...
==> data2: A newer version of the box 'ubuntu/trusty64' is available! You currently
==> data2: have version '20150916.0.0'. The latest is version '20150924.0.0'. Run
==> data2: `vagrant box update` to update.
==> data2: Setting the name of the VM: data2
==> data2: Clearing any previously set forwarded ports...
==> data2: Fixed port collision for 22 => 2222. Now on port 2201.
==> data2: Clearing any previously set network interfaces...
==> data2: Preparing network interfaces based on configuration...
    data2: Adapter 1: nat
    data2: Adapter 2: hostonly
==> data2: Forwarding ports...
    data2: 22 => 2201 (adapter 1)
==> data2: Running 'pre-boot' VM customizations...
==> data2: Booting VM...
==> data2: Waiting for machine to boot. This may take a few minutes...
    data2: SSH address: 127.0.0.1:2201
    data2: SSH username: vagrant
    data2: SSH auth method: private key
    data2: Warning: Connection timeout. Retrying...
==> data2: Machine booted and ready!
==> data2: Checking for guest additions in VM...
==> data2: Setting hostname...
==> data2: Configuring and enabling network interfaces...
==> data2: Mounting shared folders...
    data2: /vagrant => C:/Users/watrous/Documents/hadoop
==> data2: Running provisioner: file...

Shown in the output above is the bootstrap-master.sh script installing ansible and other required libraries. At this point all three servers are ready for Hadoop to be installed and your VirtualBox console would look something like this:

virtualbox-hadoop-hosts

Limit to a single datanode

If you are low on RAM, you can make a couple of small changes to install only two servers with the same effect. To do this change the following files.

  • Vagrantfile: Remove or comment the definition of the unwanted datanode
  • group_vars/all: Remove or comment the unused host
  • hosts-dev: Remove or comment the unused host

Conversely it is possible to add as many datanodes as you like by modifying the same files above. Those changes will trickle through to as many hosts as you define. I’ll discuss that more in a future post when we use this same Ansible scripts to deploy to a cloud provider.

Install Hadoop

It’s now time to install Hadoop. There are several commented lines in the bootstrap-master.sh script that you can copy and paste to perform the next few steps. The easiest is to login to the hadoop-master server and run the ansible playbook.

Proxy management

If you happen to be behind a proxy then you’ll need to make sure that you update the proxy settings in bootstrap-master.sh and group_vars/all. For the group_vars, if you don’t have a proxy, just leave the none: false setting in place, otherwise the ansible playbook will fail since it’s expecting that to be a dictionary.

Run the Ansible playbook

Below you can see the Ansible output from configuring and installing Hadoop and all its dependencies on all three servers in your new cluster.

vagrant@hadoop-master:~$ cd src/
vagrant@hadoop-master:~/src$ ansible-playbook -i hosts-dev playbook.yml
 
PLAY [Install hadoop master node] *********************************************
 
GATHERING FACTS ***************************************************************
ok: [192.168.51.4]
 
TASK: [common | group name=hadoop state=present] ******************************
changed: [192.168.51.4]
 
TASK: [common | user name=hadoop comment="Hadoop" group=hadoop shell=/bin/bash] ***
changed: [192.168.51.4]
 
TASK: [common | authorized_key user=hadoop key="ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDWeJfgWx7hDeZUJOeaIVzcbmYxzMcWfxhgC2975tvGL5BV6unzLz8ZVak6ju++AvnM5mcQp6Ydv73uWyaoQaFZigAzfuenruQkwc7D5YYuba+FgZdQ8VHon29oQA3iaZWG7xTspagrfq3fcqaz2ZIjzqN+E/MtcW08PwfibN2QRWchBCuZ1Q8AmrW7gClzMcgd/uj3TstabspGaaZMCs8aC9JWzZlMMegXKYHvVQs6xH2AmifpKpLoMTdO8jP4jczmGebPzvaXmvVylgwo6bRJ3tyYAmGwx8PHj2EVVQ0XX9ipgixLyAa2c7+/crPpGmKFRrYibCCT6x65px7nWnn3"] ***
changed: [192.168.51.4]
 
TASK: [common | unpack hadoop] ************************************************
changed: [192.168.51.4]
 
TASK: [common | command mv /usr/local/hadoop-2.7.1 /usr/local/hadoop creates=/usr/local/hadoop removes=/usr/local/hadoop-2.7.1] ***
changed: [192.168.51.4]
 
TASK: [common | lineinfile dest=/home/hadoop/.bashrc regexp="HADOOP_HOME=" line="export HADOOP_HOME=/usr/local/hadoop"] ***
changed: [192.168.51.4]
 
TASK: [common | lineinfile dest=/home/hadoop/.bashrc regexp="PATH=" line="export PATH=$PATH:$HADOOP_HOME/bin"] ***
changed: [192.168.51.4]
 
TASK: [common | lineinfile dest=/home/hadoop/.bashrc regexp="HADOOP_SSH_OPTS=" line="export HADOOP_SSH_OPTS=\"-i /home/hadoop/.ssh/hadoop_rsa\""] ***
changed: [192.168.51.4]
 
TASK: [common | Build hosts file] *********************************************
changed: [192.168.51.4] => (item={'ip': '192.168.51.4', 'hostname': 'hadoop-master'})
changed: [192.168.51.4] => (item={'ip': '192.168.51.5', 'hostname': 'hadoop-data1'})
changed: [192.168.51.4] => (item={'ip': '192.168.51.6', 'hostname': 'hadoop-data2'})
 
TASK: [common | lineinfile dest=/etc/hosts regexp='127.0.1.1' state=absent] ***
changed: [192.168.51.4]
 
TASK: [common | file path=/home/hadoop/tmp state=directory owner=hadoop group=hadoop mode=750] ***
changed: [192.168.51.4]
 
TASK: [common | file path=/home/hadoop/hadoop-data/hdfs/namenode state=directory owner=hadoop group=hadoop mode=750] ***
changed: [192.168.51.4]
 
TASK: [common | file path=/home/hadoop/hadoop-data/hdfs/datanode state=directory owner=hadoop group=hadoop mode=750] ***
changed: [192.168.51.4]
 
TASK: [common | Add the service scripts] **************************************
changed: [192.168.51.4] => (item={'dest': '/usr/local/hadoop/etc/hadoop/core-site.xml', 'src': 'core-site.xml'})
changed: [192.168.51.4] => (item={'dest': '/usr/local/hadoop/etc/hadoop/hdfs-site.xml', 'src': 'hdfs-site.xml'})
changed: [192.168.51.4] => (item={'dest': '/usr/local/hadoop/etc/hadoop/yarn-site.xml', 'src': 'yarn-site.xml'})
changed: [192.168.51.4] => (item={'dest': '/usr/local/hadoop/etc/hadoop/mapred-site.xml', 'src': 'mapred-site.xml'})
 
TASK: [common | lineinfile dest=/usr/local/hadoop/etc/hadoop/hadoop-env.sh regexp="^export JAVA_HOME" line="export JAVA_HOME=/usr/lib/jvm/java-8-oracle"] ***
changed: [192.168.51.4]
 
TASK: [common | ensure hostkeys is a known host] ******************************
# hadoop-master SSH-2.0-OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3
changed: [192.168.51.4] => (item={'ip': '192.168.51.4', 'hostname': 'hadoop-master'})
# hadoop-data1 SSH-2.0-OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3
changed: [192.168.51.4] => (item={'ip': '192.168.51.5', 'hostname': 'hadoop-data1'})
# hadoop-data2 SSH-2.0-OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3
changed: [192.168.51.4] => (item={'ip': '192.168.51.6', 'hostname': 'hadoop-data2'})
 
TASK: [oraclejava8 | apt_repository repo='deb http://ppa.launchpad.net/webupd8team/java/ubuntu trusty main' state=present] ***
changed: [192.168.51.4]
 
TASK: [oraclejava8 | apt_repository repo='deb-src http://ppa.launchpad.net/webupd8team/java/ubuntu trusty main' state=present] ***
changed: [192.168.51.4]
 
TASK: [oraclejava8 | debconf name='oracle-java8-installer' question='shared/accepted-oracle-license-v1-1' value='true' vtype='select' unseen=false] ***
changed: [192.168.51.4]
 
TASK: [oraclejava8 | apt_key keyserver=keyserver.ubuntu.com id=EEA14886] ******
changed: [192.168.51.4]
 
TASK: [oraclejava8 | Install Java] ********************************************
changed: [192.168.51.4]
 
TASK: [oraclejava8 | lineinfile dest=/home/hadoop/.bashrc regexp="^export JAVA_HOME" line="export JAVA_HOME=/usr/lib/jvm/java-8-oracle"] ***
changed: [192.168.51.4]
 
TASK: [master | Copy private key into place] **********************************
changed: [192.168.51.4]
 
TASK: [master | Copy slaves into place] ***************************************
changed: [192.168.51.4]
 
TASK: [master | prepare known_hosts] ******************************************
# 192.168.51.4 SSH-2.0-OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3
changed: [192.168.51.4] => (item={'ip': '192.168.51.4', 'hostname': 'hadoop-master'})
# 192.168.51.5 SSH-2.0-OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3
changed: [192.168.51.4] => (item={'ip': '192.168.51.5', 'hostname': 'hadoop-data1'})
# 192.168.51.6 SSH-2.0-OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3
changed: [192.168.51.4] => (item={'ip': '192.168.51.6', 'hostname': 'hadoop-data2'})
 
TASK: [master | add 0.0.0.0 to known_hosts for secondary namenode] ************
# 0.0.0.0 SSH-2.0-OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3
changed: [192.168.51.4]
 
PLAY [Install hadoop data nodes] **********************************************
 
GATHERING FACTS ***************************************************************
ok: [192.168.51.5]
ok: [192.168.51.6]
 
TASK: [common | group name=hadoop state=present] ******************************
changed: [192.168.51.5]
changed: [192.168.51.6]
 
TASK: [common | user name=hadoop comment="Hadoop" group=hadoop shell=/bin/bash] ***
changed: [192.168.51.5]
changed: [192.168.51.6]
 
TASK: [common | authorized_key user=hadoop key="ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDWeJfgWx7hDeZUJOeaIVzcbmYxzMcWfxhgC2975tvGL5BV6unzLz8ZVak6ju++AvnM5mcQp6Ydv73uWyaoQaFZigAzfuenruQkwc7D5YYuba+FgZdQ8VHon29oQA3iaZWG7xTspagrfq3fcqaz2ZIjzqN+E/MtcW08PwfibN2QRWchBCuZ1Q8AmrW7gClzMcgd/uj3TstabspGaaZMCs8aC9JWzZlMMegXKYHvVQs6xH2AmifpKpLoMTdO8jP4jczmGebPzvaXmvVylgwo6bRJ3tyYAmGwx8PHj2EVVQ0XX9ipgixLyAa2c7+/crPpGmKFRrYibCCT6x65px7nWnn3"] ***
changed: [192.168.51.5]
changed: [192.168.51.6]
 
TASK: [common | unpack hadoop] ************************************************
changed: [192.168.51.5]
changed: [192.168.51.6]
 
TASK: [common | command mv /usr/local/hadoop-2.7.1 /usr/local/hadoop creates=/usr/local/hadoop removes=/usr/local/hadoop-2.7.1] ***
changed: [192.168.51.6]
changed: [192.168.51.5]
 
TASK: [common | lineinfile dest=/home/hadoop/.bashrc regexp="HADOOP_HOME=" line="export HADOOP_HOME=/usr/local/hadoop"] ***
changed: [192.168.51.6]
changed: [192.168.51.5]
 
TASK: [common | lineinfile dest=/home/hadoop/.bashrc regexp="PATH=" line="export PATH=$PATH:$HADOOP_HOME/bin"] ***
changed: [192.168.51.5]
changed: [192.168.51.6]
 
TASK: [common | lineinfile dest=/home/hadoop/.bashrc regexp="HADOOP_SSH_OPTS=" line="export HADOOP_SSH_OPTS=\"-i /home/hadoop/.ssh/hadoop_rsa\""] ***
changed: [192.168.51.5]
changed: [192.168.51.6]
 
TASK: [common | Build hosts file] *********************************************
changed: [192.168.51.5] => (item={'ip': '192.168.51.4', 'hostname': 'hadoop-master'})
changed: [192.168.51.6] => (item={'ip': '192.168.51.4', 'hostname': 'hadoop-master'})
changed: [192.168.51.5] => (item={'ip': '192.168.51.5', 'hostname': 'hadoop-data1'})
changed: [192.168.51.6] => (item={'ip': '192.168.51.5', 'hostname': 'hadoop-data1'})
changed: [192.168.51.5] => (item={'ip': '192.168.51.6', 'hostname': 'hadoop-data2'})
changed: [192.168.51.6] => (item={'ip': '192.168.51.6', 'hostname': 'hadoop-data2'})
 
TASK: [common | lineinfile dest=/etc/hosts regexp='127.0.1.1' state=absent] ***
changed: [192.168.51.6]
changed: [192.168.51.5]
 
TASK: [common | file path=/home/hadoop/tmp state=directory owner=hadoop group=hadoop mode=750] ***
changed: [192.168.51.6]
changed: [192.168.51.5]
 
TASK: [common | file path=/home/hadoop/hadoop-data/hdfs/namenode state=directory owner=hadoop group=hadoop mode=750] ***
changed: [192.168.51.5]
changed: [192.168.51.6]
 
TASK: [common | file path=/home/hadoop/hadoop-data/hdfs/datanode state=directory owner=hadoop group=hadoop mode=750] ***
changed: [192.168.51.5]
changed: [192.168.51.6]
 
TASK: [common | Add the service scripts] **************************************
changed: [192.168.51.5] => (item={'dest': '/usr/local/hadoop/etc/hadoop/core-site.xml', 'src': 'core-site.xml'})
changed: [192.168.51.6] => (item={'dest': '/usr/local/hadoop/etc/hadoop/core-site.xml', 'src': 'core-site.xml'})
changed: [192.168.51.5] => (item={'dest': '/usr/local/hadoop/etc/hadoop/hdfs-site.xml', 'src': 'hdfs-site.xml'})
changed: [192.168.51.6] => (item={'dest': '/usr/local/hadoop/etc/hadoop/hdfs-site.xml', 'src': 'hdfs-site.xml'})
changed: [192.168.51.6] => (item={'dest': '/usr/local/hadoop/etc/hadoop/yarn-site.xml', 'src': 'yarn-site.xml'})
changed: [192.168.51.5] => (item={'dest': '/usr/local/hadoop/etc/hadoop/yarn-site.xml', 'src': 'yarn-site.xml'})
changed: [192.168.51.6] => (item={'dest': '/usr/local/hadoop/etc/hadoop/mapred-site.xml', 'src': 'mapred-site.xml'})
changed: [192.168.51.5] => (item={'dest': '/usr/local/hadoop/etc/hadoop/mapred-site.xml', 'src': 'mapred-site.xml'})
 
TASK: [common | lineinfile dest=/usr/local/hadoop/etc/hadoop/hadoop-env.sh regexp="^export JAVA_HOME" line="export JAVA_HOME=/usr/lib/jvm/java-8-oracle"] ***
changed: [192.168.51.5]
changed: [192.168.51.6]
 
TASK: [common | ensure hostkeys is a known host] ******************************
# hadoop-master SSH-2.0-OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3
# hadoop-master SSH-2.0-OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3
changed: [192.168.51.5] => (item={'ip': '192.168.51.4', 'hostname': 'hadoop-master'})
changed: [192.168.51.6] => (item={'ip': '192.168.51.4', 'hostname': 'hadoop-master'})
# hadoop-data1 SSH-2.0-OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3
# hadoop-data1 SSH-2.0-OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3
changed: [192.168.51.5] => (item={'ip': '192.168.51.5', 'hostname': 'hadoop-data1'})
# hadoop-data2 SSH-2.0-OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3
changed: [192.168.51.6] => (item={'ip': '192.168.51.5', 'hostname': 'hadoop-data1'})
# hadoop-data2 SSH-2.0-OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3
changed: [192.168.51.5] => (item={'ip': '192.168.51.6', 'hostname': 'hadoop-data2'})
changed: [192.168.51.6] => (item={'ip': '192.168.51.6', 'hostname': 'hadoop-data2'})
 
TASK: [oraclejava8 | apt_repository repo='deb http://ppa.launchpad.net/webupd8team/java/ubuntu trusty main' state=present] ***
changed: [192.168.51.5]
changed: [192.168.51.6]
 
TASK: [oraclejava8 | apt_repository repo='deb-src http://ppa.launchpad.net/webupd8team/java/ubuntu trusty main' state=present] ***
changed: [192.168.51.5]
changed: [192.168.51.6]
 
TASK: [oraclejava8 | debconf name='oracle-java8-installer' question='shared/accepted-oracle-license-v1-1' value='true' vtype='select' unseen=false] ***
changed: [192.168.51.5]
changed: [192.168.51.6]
 
TASK: [oraclejava8 | apt_key keyserver=keyserver.ubuntu.com id=EEA14886] ******
changed: [192.168.51.5]
changed: [192.168.51.6]
 
TASK: [oraclejava8 | Install Java] ********************************************
changed: [192.168.51.6]
changed: [192.168.51.5]
 
TASK: [oraclejava8 | lineinfile dest=/home/hadoop/.bashrc regexp="^export JAVA_HOME" line="export JAVA_HOME=/usr/lib/jvm/java-8-oracle"] ***
changed: [192.168.51.6]
changed: [192.168.51.5]
 
PLAY RECAP ********************************************************************
192.168.51.4               : ok=27   changed=26   unreachable=0    failed=0
192.168.51.5               : ok=23   changed=22   unreachable=0    failed=0
192.168.51.6               : ok=23   changed=22   unreachable=0    failed=0

Start Hadoop and run a job

Now that you have Hadoop installed, it’s time to format HDFS and start up all the services. All the commands to do this are available as comments in the bootstrap-master.sh file. The first step is to format the hdfs namenode. All of the commands that follow are executed as the hadoop user.

vagrant@hadoop-master:~/src$ sudo su - hadoop
hadoop@hadoop-master:~$ hdfs namenode -format
15/09/30 16:06:36 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = hadoop-master/192.168.51.4
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 2.7.1
STARTUP_MSG:   classpath = [truncated]
STARTUP_MSG:   build = https://git-wip-us.apache.org/repos/asf/hadoop.git -r 15ecc87ccf4a0228f35af08fc56de536e6ce657a; compiled by 'jenkins' on 2015-06-29T06:04Z
STARTUP_MSG:   java = 1.8.0_60
************************************************************/
15/09/30 16:06:36 INFO namenode.NameNode: registered UNIX signal handlers for [TERM, HUP, INT]
15/09/30 16:06:36 INFO namenode.NameNode: createNameNode [-format]
15/09/30 16:06:36 WARN common.Util: Path /home/hadoop/hadoop-data/hdfs/namenode should be specified as a URI in configuration files. Please update hdfs configuration.
15/09/30 16:06:36 WARN common.Util: Path /home/hadoop/hadoop-data/hdfs/namenode should be specified as a URI in configuration files. Please update hdfs configuration.
Formatting using clusterid: CID-1c37e2f0-ba4b-4ad7-84d7-223dec53d34a
15/09/30 16:06:36 INFO namenode.FSNamesystem: No KeyProvider found.
15/09/30 16:06:36 INFO namenode.FSNamesystem: fsLock is fair:true
15/09/30 16:06:36 INFO blockmanagement.DatanodeManager: dfs.block.invalidate.limit=1000
15/09/30 16:06:36 INFO blockmanagement.DatanodeManager: dfs.namenode.datanode.registration.ip-hostname-check=true
15/09/30 16:06:36 INFO blockmanagement.BlockManager: dfs.namenode.startup.delay.block.deletion.sec is set to 000:00:00:00.000
15/09/30 16:06:36 INFO blockmanagement.BlockManager: The block deletion will start around 2015 Sep 30 16:06:36
15/09/30 16:06:36 INFO util.GSet: Computing capacity for map BlocksMap
15/09/30 16:06:36 INFO util.GSet: VM type       = 64-bit
15/09/30 16:06:36 INFO util.GSet: 2.0% max memory 889 MB = 17.8 MB
15/09/30 16:06:36 INFO util.GSet: capacity      = 2^21 = 2097152 entries
15/09/30 16:06:36 INFO blockmanagement.BlockManager: dfs.block.access.token.enable=false
15/09/30 16:06:36 INFO blockmanagement.BlockManager: defaultReplication         = 2
15/09/30 16:06:36 INFO blockmanagement.BlockManager: maxReplication             = 512
15/09/30 16:06:36 INFO blockmanagement.BlockManager: minReplication             = 1
15/09/30 16:06:36 INFO blockmanagement.BlockManager: maxReplicationStreams      = 2
15/09/30 16:06:36 INFO blockmanagement.BlockManager: shouldCheckForEnoughRacks  = false
15/09/30 16:06:36 INFO blockmanagement.BlockManager: replicationRecheckInterval = 3000
15/09/30 16:06:36 INFO blockmanagement.BlockManager: encryptDataTransfer        = false
15/09/30 16:06:36 INFO blockmanagement.BlockManager: maxNumBlocksToLog          = 1000
15/09/30 16:06:36 INFO namenode.FSNamesystem: fsOwner             = hadoop (auth:SIMPLE)
15/09/30 16:06:36 INFO namenode.FSNamesystem: supergroup          = supergroup
15/09/30 16:06:36 INFO namenode.FSNamesystem: isPermissionEnabled = true
15/09/30 16:06:36 INFO namenode.FSNamesystem: HA Enabled: false
15/09/30 16:06:36 INFO namenode.FSNamesystem: Append Enabled: true
15/09/30 16:06:37 INFO util.GSet: Computing capacity for map INodeMap
15/09/30 16:06:37 INFO util.GSet: VM type       = 64-bit
15/09/30 16:06:37 INFO util.GSet: 1.0% max memory 889 MB = 8.9 MB
15/09/30 16:06:37 INFO util.GSet: capacity      = 2^20 = 1048576 entries
15/09/30 16:06:37 INFO namenode.FSDirectory: ACLs enabled? false
15/09/30 16:06:37 INFO namenode.FSDirectory: XAttrs enabled? true
15/09/30 16:06:37 INFO namenode.FSDirectory: Maximum size of an xattr: 16384
15/09/30 16:06:37 INFO namenode.NameNode: Caching file names occuring more than 10 times
15/09/30 16:06:37 INFO util.GSet: Computing capacity for map cachedBlocks
15/09/30 16:06:37 INFO util.GSet: VM type       = 64-bit
15/09/30 16:06:37 INFO util.GSet: 0.25% max memory 889 MB = 2.2 MB
15/09/30 16:06:37 INFO util.GSet: capacity      = 2^18 = 262144 entries
15/09/30 16:06:37 INFO namenode.FSNamesystem: dfs.namenode.safemode.threshold-pct = 0.9990000128746033
15/09/30 16:06:37 INFO namenode.FSNamesystem: dfs.namenode.safemode.min.datanodes = 0
15/09/30 16:06:37 INFO namenode.FSNamesystem: dfs.namenode.safemode.extension     = 30000
15/09/30 16:06:37 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.window.num.buckets = 10
15/09/30 16:06:37 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.num.users = 10
15/09/30 16:06:37 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.windows.minutes = 1,5,25
15/09/30 16:06:37 INFO namenode.FSNamesystem: Retry cache on namenode is enabled
15/09/30 16:06:37 INFO namenode.FSNamesystem: Retry cache will use 0.03 of total heap and retry cache entry expiry time is 600000 millis
15/09/30 16:06:37 INFO util.GSet: Computing capacity for map NameNodeRetryCache
15/09/30 16:06:37 INFO util.GSet: VM type       = 64-bit
15/09/30 16:06:37 INFO util.GSet: 0.029999999329447746% max memory 889 MB = 273.1 KB
15/09/30 16:06:37 INFO util.GSet: capacity      = 2^15 = 32768 entries
15/09/30 16:06:37 INFO namenode.FSImage: Allocated new BlockPoolId: BP-992546781-192.168.51.4-1443629197156
15/09/30 16:06:37 INFO common.Storage: Storage directory /home/hadoop/hadoop-data/hdfs/namenode has been successfully formatted.
15/09/30 16:06:37 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
15/09/30 16:06:37 INFO util.ExitUtil: Exiting with status 0
15/09/30 16:06:37 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at hadoop-master/192.168.51.4
************************************************************/

Start DFS

Next start the dfs services, as shown.

hadoop@hadoop-master:~$ /usr/local/hadoop/sbin/start-dfs.sh
Starting namenodes on [hadoop-master]
hadoop-master: Warning: Permanently added the RSA host key for IP address '192.168.51.4' to the list of known hosts.
hadoop-master: starting namenode, logging to /usr/local/hadoop/logs/hadoop-hadoop-namenode-hadoop-master.out
hadoop-data2: Warning: Permanently added the RSA host key for IP address '192.168.51.6' to the list of known hosts.
hadoop-data1: Warning: Permanently added the RSA host key for IP address '192.168.51.5' to the list of known hosts.
hadoop-master: starting datanode, logging to /usr/local/hadoop/logs/hadoop-hadoop-datanode-hadoop-master.out
hadoop-data2: starting datanode, logging to /usr/local/hadoop/logs/hadoop-hadoop-datanode-hadoop-data2.out
hadoop-data1: starting datanode, logging to /usr/local/hadoop/logs/hadoop-hadoop-datanode-hadoop-data1.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /usr/local/hadoop/logs/hadoop-hadoop-secondarynamenode-hadoop-master.out

At this point you can access the HDFS status and see all three datanodes attached wtih this URL: http://192.168.51.4:50070/dfshealth.html#tab-datanode.

Start yarn

Next start the yarn service as shown.

hadoop@hadoop-master:~$ /usr/local/hadoop/sbin/start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to /usr/local/hadoop/logs/yarn-hadoop-resourcemanager-hadoop-master.out
hadoop-data2: starting nodemanager, logging to /usr/local/hadoop/logs/yarn-hadoop-nodemanager-hadoop-data2.out
hadoop-data1: starting nodemanager, logging to /usr/local/hadoop/logs/yarn-hadoop-nodemanager-hadoop-data1.out
hadoop-master: starting nodemanager, logging to /usr/local/hadoop/logs/yarn-hadoop-nodemanager-hadoop-master.out

At this point you can access information about the compute nodes in the cluster and currently running jobs at this URL: http://192.168.51.4:8088/cluster/nodes

Verify that Java processes are running

Hadoop provides a useful script to run a command on all nodes listed in slaves. For example, you can confirm that all expected Java processes are running as expected with the following command.

hadoop@hadoop-master:~$ $HADOOP_HOME/sbin/slaves.sh jps
hadoop-data2: 3872 DataNode
hadoop-data2: 4180 Jps
hadoop-data2: 4021 NodeManager
hadoop-master: 7617 NameNode
hadoop-data1: 3872 DataNode
hadoop-data1: 4180 Jps
hadoop-master: 8675 Jps
hadoop-data1: 4021 NodeManager
hadoop-master: 8309 NodeManager
hadoop-master: 8150 ResourceManager
hadoop-master: 7993 SecondaryNameNode
hadoop-master: 7788 DataNode

Run an example job

Finally, it’s possible to confirm that everything is working as expected by running one of the example jobs. Let’s find the number pi.

hadoop@hadoop-master:~$ hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar pi 10 30
Number of Maps  = 10
Samples per Map = 30
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Wrote input for Map #4
Wrote input for Map #5
Wrote input for Map #6
Wrote input for Map #7
Wrote input for Map #8
Wrote input for Map #9
Starting Job
15/09/30 19:54:28 INFO client.RMProxy: Connecting to ResourceManager at hadoop-master/192.168.51.4:8032
15/09/30 19:54:29 INFO input.FileInputFormat: Total input paths to process : 10
15/09/30 19:54:29 INFO mapreduce.JobSubmitter: number of splits:10
15/09/30 19:54:29 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1443642855962_0001
15/09/30 19:54:29 INFO impl.YarnClientImpl: Submitted application application_1443642855962_0001
15/09/30 19:54:29 INFO mapreduce.Job: The url to track the job: http://hadoop-master:8088/proxy/application_1443642855962_0001/
15/09/30 19:54:29 INFO mapreduce.Job: Running job: job_1443642855962_0001
15/09/30 19:54:38 INFO mapreduce.Job: Job job_1443642855962_0001 running in uber mode : false
15/09/30 19:54:38 INFO mapreduce.Job:  map 0% reduce 0%
15/09/30 19:54:52 INFO mapreduce.Job:  map 40% reduce 0%
15/09/30 19:54:56 INFO mapreduce.Job:  map 100% reduce 0%
15/09/30 19:54:59 INFO mapreduce.Job:  map 100% reduce 100%
15/09/30 19:54:59 INFO mapreduce.Job: Job job_1443642855962_0001 completed successfully
15/09/30 19:54:59 INFO mapreduce.Job: Counters: 49
        File System Counters
                FILE: Number of bytes read=226
                FILE: Number of bytes written=1272744
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=2710
                HDFS: Number of bytes written=215
                HDFS: Number of read operations=43
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=3
        Job Counters
                Launched map tasks=10
                Launched reduce tasks=1
                Data-local map tasks=10
                Total time spent by all maps in occupied slots (ms)=140318
                Total time spent by all reduces in occupied slots (ms)=4742
                Total time spent by all map tasks (ms)=140318
                Total time spent by all reduce tasks (ms)=4742
                Total vcore-seconds taken by all map tasks=140318
                Total vcore-seconds taken by all reduce tasks=4742
                Total megabyte-seconds taken by all map tasks=143685632
                Total megabyte-seconds taken by all reduce tasks=4855808
        Map-Reduce Framework
                Map input records=10
                Map output records=20
                Map output bytes=180
                Map output materialized bytes=280
                Input split bytes=1530
                Combine input records=0
                Combine output records=0
                Reduce input groups=2
                Reduce shuffle bytes=280
                Reduce input records=20
                Reduce output records=0
                Spilled Records=40
                Shuffled Maps =10
                Failed Shuffles=0
                Merged Map outputs=10
                GC time elapsed (ms)=3509
                CPU time spent (ms)=5620
                Physical memory (bytes) snapshot=2688745472
                Virtual memory (bytes) snapshot=20847497216
                Total committed heap usage (bytes)=2040528896
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=1180
        File Output Format Counters
                Bytes Written=97
Job Finished in 31.245 seconds
Estimated value of Pi is 3.16000000000000000000

Security and Configuration

This example is not production hardened. It does nothing to address firewall management. The key management is permissive and intended to make it easy to communicate between nodes. If this is to be used for a production deployment, it should be easy to add a role to setup the firewall. You may also want be more cautious about accepting keys between hosts.

Default Ports

Lots of people ask about what the default ports are for Hadoop services. The following four links provide all the properties that can be set for any of the main components, including the defaults if they are absent from the configuration file. If it isn’t overridden in the Ansible playbook role templates in the git repository, then the property is the default as shown in the links below.

https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-common/yarn-default.xml
http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/core-default.xml

Problems spanning subnets

While developing this automation, I originally had the datanodes running on a separate subnet. There’s a problem/bug with Hadoop that prevented nodes from communicating across subnets. The following thread covers some of the discussion.

http://mail-archives.apache.org/mod_mbox/hadoop-user/201509.mbox/%3CCAKFXasEROCe%2BfL%2B8T7A3L0j4Qrm%3D4HHuzGfJhNuZ5MqUvQ%3DwjA%40mail.gmail.com%3E

Resources

While developing my Ansible scripts I leaned heavily on this tutorial
https://chawlasumit.wordpress.com/2015/03/09/install-a-multi-node-hadoop-cluster-on-ubuntu-14-04/

Software Engineering

Hadoop HDFS in Standalone Mode

My previous hadoop example operated against the local filesystem, in spite of the fact that I formatted a local HDFS partition. In order to operate against the local HDFS partition it’s necessary to first start the namenode and datanode. I mostly followed these instructions to start those processes. Here’s the most relevant part that I hadn’t done yet.

# Format the namenode
hdfs namenode -format
# Start the namenode
hdfs namenode
# Start a datanode
hdfs datanode

I was then ready to add some directories and data to the local HDFS partition. I got the idea of using http://www.gutenberg.org/ for sample data from this article.

[watrous@myhost ~]$ hdfs dfs -mkdir -p /user/watrous/gutenberg
13/11/15 16:27:30 INFO namenode.FSEditLog: Number of transactions: 4 Total time for transactions(ms): 1 Number of transactions batched in Syncs: 0 Number of syncs: 2 SyncTimes(ms): 14
[watrous@myhost ~]$ hdfs dfs -ls -R /
drwxr-xr-x   - watrous supergroup          0 2013-11-14 23:16 /user
drwxr-xr-x   - watrous supergroup          0 2013-11-14 23:16 /user/watrous
drwxr-xr-x   - watrous supergroup          0 2013-11-14 23:16 /user/watrous/gutenberg
[watrous@myhost ~]$ hdfs dfs -copyFromLocal /home/watrous/pg20417.txt /user/watrous/gutenberg
13/11/14 23:16:35 INFO hdfs.StateChange: BLOCK* allocateBlock: /user/watrous/gutenberg/pg20417.txt._COPYING_. BP-1860796918-15.50.2.93-1384470255181 blk_1073741825_1001{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[127.0.0.1:50010|RBW]]}
13/11/14 23:16:35 INFO datanode.DataNode: Receiving BP-1860796918-15.50.2.93-1384470255181:blk_1073741825_1001 src: /127.0.0.1:47033 dest: /127.0.0.1:50010
13/11/14 23:16:36 INFO DataNode.clienttrace: src: /127.0.0.1:47033, dest: /127.0.0.1:50010, bytes: 674570, op: HDFS_WRITE, cliID: DFSClient_NONMAPREDUCE_1914205347_1, offset: 0, srvID: DS-819252937-15.50.2.93-50010-1384470785167, blockid: BP-1860796918-15.50.2.93-1384470255181:blk_1073741825_1001, duration: 26478978
13/11/14 23:16:36 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 127.0.0.1:50010 is added to blk_1073741825_1001{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[127.0.0.1:50010|RBW]]} size 0
13/11/14 23:16:36 INFO datanode.DataNode: PacketResponder: BP-1860796918-15.50.2.93-1384470255181:blk_1073741825_1001, type=LAST_IN_PIPELINE, downstreams=0:[] terminating
13/11/14 23:16:36 INFO hdfs.StateChange: DIR* completeFile: /user/watrous/gutenberg/pg20417.txt._COPYING_ is closed by DFSClient_NONMAPREDUCE_1914205347_1
[watrous@myhost ~]$ hdfs dfs -ls /user/watrous/gutenberg
Found 1 items
-rw-r--r--   3 watrous supergroup     674570 2013-11-14 23:16 /user/watrous/gutenberg/pg20417.txt

Accessing HDFS from MapReduce classes

It turns out that no modifications are required in the Mapper and Reducer classes to work with files in HDFS. When a new Path object is created with a single String, the constructor performs some analysis to determine if there is a scheme and authority. The global configuration is then used to make the path qualified, which includes applying the hdfs scheme if there isn’t one already provided. Here are two examples where I call the same MapReduce script with different paths, first a local path, then an HDFS path.

[watrous@myhost ~]$ hadoop jar HadoopExample.jar /opt/mount/input/sample/ ~/output.multiple
...
13/11/14 22:44:48 INFO mapred.MapTask: Processing split: file:/opt/mount/input/sample/catalina.out-20131027.gz:0+4416223
...
[watrous@myhost ~]$ hadoop jar HadoopExample.jar /user/watrous/log_myhost.com /user/watrous/output_myhost.com
...
13/11/15 17:07:21 INFO mapred.MapTask: Processing split: hdfs://localhost/user/watrous/log_myhost.com/catalina.out-20131029.gz:0+8924461
...

It is also possible to eliminate ambiguity and explicitly define the hdfs paths as shown here.

[watrous@myhost ~]$ hadoop jar HadoopExample.jar hdfs://localhost/user/watrous/log_myhost.com hdfs://localhost/user/watrous/output_myhost.com-withscheme
...
13/11/15 21:49:21 INFO mapred.MapTask: Processing split: hdfs://localhost/user/watrous/log_myhost.com/catalina.out-20131031.gz:0+8118929
...

View results in HDFS

To view the results from HDFS, first get the path to the results, then use something like -cat to output results to stdout.

[watrous@myhost ~]$ hdfs dfs -ls /user/watrous/output_myhost.com
Found 2 items
-rw-r--r--   3 watrous supergroup          0 2013-11-15 17:08 /user/watrous/output_myhost.com/_SUCCESS
-rw-r--r--   3 watrous supergroup      17893 2013-11-15 17:08 /user/watrous/output_myhost.com/part-r-00000

With the path it’s now easy to print our results.

[watrous@myhost ~]$ hdfs dfs -cat /user/watrous/output_myhost.com/part-r-00000
2013-10-26 19:18:05,669 1
2013-10-26 19:31:00,452 1
2013-10-27 06:09:33,748 11
2013-10-27 06:09:33,749 25
2013-10-27 06:09:33,750 26
...