
Install and configure a Multi-node Hadoop cluster using Ansible

I’ve recently been involved with several groups interested in using Hadoop to process large data sets, including higher-level abstractions on top of Hadoop like Pig and Hive. What has surprised me most is that no one is automating their installation of Hadoop. In every case I’ve observed, they start by manually provisioning some servers and then follow a series of tutorials to manually install and configure a cluster. The typical experience seems to be about a week to set up a cluster, with a lot of time wasted on networking and connectivity between hosts.

After telling several groups that they should automate the installation of Hadoop using something like Ansible, I decided to create an example. All the scripts needed to stand up a new Hadoop cluster in minutes are on GitHub for you to fork: https://github.com/dwatrous/hadoop-multi-server-ansible

I have also recorded a video demonstration of this process.

Scope

The scope of this article is to create a three-node cluster on a single computer (Windows in my case) using VirtualBox and Vagrant. The cluster runs HDFS and MapReduce on all three nodes. The following diagram helps to visualize the cluster.

hadoop-design

Build the servers

The first step is to install VirtualBox and Vagrant.

Clone hadoop-multi-server-ansible and open a console window in the directory where you cloned it. The Vagrantfile defines three Ubuntu 14.04 servers. Each server needs 3GB of RAM, so make sure you have enough RAM available. Now run `vagrant up` and wait a few minutes for the new servers to come up.
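The Vagrantfile in the repository follows this general shape. This is a simplified sketch, not the repository's exact file; the hostnames and private IPs come from the console output below, but check the actual Vagrantfile for the authoritative definition.

```ruby
# Simplified sketch of a three-node Vagrantfile.
# Hostnames and private IPs are taken from the provisioning output in this
# article; the real file in the repository is authoritative.
Vagrant.configure("2") do |config|
  config.vm.box = "ubuntu/trusty64"

  nodes = { "master" => "192.168.51.4",
            "data1"  => "192.168.51.5",
            "data2"  => "192.168.51.6" }

  nodes.each do |name, ip|
    config.vm.define name do |node|
      node.vm.hostname = "hadoop-#{name}"
      node.vm.network "private_network", ip: ip
      node.vm.provider "virtualbox" do |vb|
        vb.memory = 3072   # each server needs 3GB RAM
      end
    end
  end
end
```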

C:\Users\watrous\Documents\hadoop>vagrant up
Bringing machine 'master' up with 'virtualbox' provider...
Bringing machine 'data1' up with 'virtualbox' provider...
Bringing machine 'data2' up with 'virtualbox' provider...
==> master: Importing base box 'ubuntu/trusty64'...
==> master: Matching MAC address for NAT networking...
==> master: Checking if box 'ubuntu/trusty64' is up to date...
==> master: A newer version of the box 'ubuntu/trusty64' is available! You currently
==> master: have version '20150916.0.0'. The latest is version '20150924.0.0'. Run
==> master: `vagrant box update` to update.
==> master: Setting the name of the VM: master
==> master: Clearing any previously set forwarded ports...
==> master: Clearing any previously set network interfaces...
==> master: Preparing network interfaces based on configuration...
    master: Adapter 1: nat
    master: Adapter 2: hostonly
==> master: Forwarding ports...
    master: 22 => 2222 (adapter 1)
==> master: Running 'pre-boot' VM customizations...
==> master: Booting VM...
==> master: Waiting for machine to boot. This may take a few minutes...
    master: SSH address: 127.0.0.1:2222
    master: SSH username: vagrant
    master: SSH auth method: private key
    master: Warning: Connection timeout. Retrying...
==> master: Machine booted and ready!
==> master: Checking for guest additions in VM...
==> master: Setting hostname...
==> master: Configuring and enabling network interfaces...
==> master: Mounting shared folders...
    master: /home/vagrant/src => C:/Users/watrous/Documents/hadoop
==> master: Running provisioner: file...
==> master: Running provisioner: shell...
    master: Running: C:/Users/watrous/AppData/Local/Temp/vagrant-shell20150930-12444-1lgl5bq.sh
==> master: stdin: is not a tty
==> master: Ign http://archive.ubuntu.com trusty InRelease
==> master: Ign http://archive.ubuntu.com trusty-updates InRelease
==> master: Ign http://security.ubuntu.com trusty-security InRelease
==> master: Hit http://archive.ubuntu.com trusty Release.gpg
==> master: Get:1 http://security.ubuntu.com trusty-security Release.gpg [933 B]
==> master: Get:2 http://archive.ubuntu.com trusty-updates Release.gpg [933 B]
==> master: Hit http://archive.ubuntu.com trusty Release
==> master: Get:3 http://security.ubuntu.com trusty-security Release [63.5 kB]
==> master: Get:4 http://archive.ubuntu.com trusty-updates Release [63.5 kB]
==> master: Get:5 http://archive.ubuntu.com trusty/main Sources [1,064 kB]
==> master: Get:6 http://security.ubuntu.com trusty-security/main Sources [96.2 kB]
==> master: Get:7 http://security.ubuntu.com trusty-security/universe Sources [31.1 kB]
==> master: Get:8 http://security.ubuntu.com trusty-security/main amd64 Packages [350 kB]
==> master: Get:9 http://archive.ubuntu.com trusty/universe Sources [6,399 kB]
==> master: Get:10 http://security.ubuntu.com trusty-security/universe amd64 Packages [117 kB]
==> master: Get:11 http://security.ubuntu.com trusty-security/main Translation-en [191 kB]
==> master: Get:12 http://security.ubuntu.com trusty-security/universe Translation-en [68.2 kB]
==> master: Hit http://archive.ubuntu.com trusty/main amd64 Packages
==> master: Hit http://archive.ubuntu.com trusty/universe amd64 Packages
==> master: Hit http://archive.ubuntu.com trusty/main Translation-en
==> master: Hit http://archive.ubuntu.com trusty/universe Translation-en
==> master: Get:13 http://archive.ubuntu.com trusty-updates/main Sources [236 kB]
==> master: Get:14 http://archive.ubuntu.com trusty-updates/universe Sources [139 kB]
==> master: Get:15 http://archive.ubuntu.com trusty-updates/main amd64 Packages [626 kB]
==> master: Get:16 http://archive.ubuntu.com trusty-updates/universe amd64 Packages [320 kB]
==> master: Get:17 http://archive.ubuntu.com trusty-updates/main Translation-en [304 kB]
==> master: Get:18 http://archive.ubuntu.com trusty-updates/universe Translation-en [168 kB]
==> master: Ign http://archive.ubuntu.com trusty/main Translation-en_US
==> master: Ign http://archive.ubuntu.com trusty/universe Translation-en_US
==> master: Fetched 10.2 MB in 4s (2,098 kB/s)
==> master: Reading package lists...
==> master: Reading package lists...
==> master: Building dependency tree...
==> master:
==> master: Reading state information...
==> master: The following extra packages will be installed:
==> master:   build-essential dpkg-dev g++ g++-4.8 libalgorithm-diff-perl
==> master:   libalgorithm-diff-xs-perl libalgorithm-merge-perl libdpkg-perl libexpat1-dev
==> master:   libfile-fcntllock-perl libpython-dev libpython2.7-dev libstdc++-4.8-dev
==> master:   python-chardet-whl python-colorama python-colorama-whl python-distlib
==> master:   python-distlib-whl python-html5lib python-html5lib-whl python-pip-whl
==> master:   python-requests-whl python-setuptools python-setuptools-whl python-six-whl
==> master:   python-urllib3-whl python-wheel python2.7-dev python3-pkg-resources
==> master: Suggested packages:
==> master:   debian-keyring g++-multilib g++-4.8-multilib gcc-4.8-doc libstdc++6-4.8-dbg
==> master:   libstdc++-4.8-doc python-genshi python-lxml python3-setuptools zip
==> master: Recommended packages:
==> master:   python-dev-all
==> master: The following NEW packages will be installed:
==> master:   build-essential dpkg-dev g++ g++-4.8 libalgorithm-diff-perl
==> master:   libalgorithm-diff-xs-perl libalgorithm-merge-perl libdpkg-perl libexpat1-dev
==> master:   libfile-fcntllock-perl libpython-dev libpython2.7-dev libstdc++-4.8-dev
==> master:   python-chardet-whl python-colorama python-colorama-whl python-dev
==> master:   python-distlib python-distlib-whl python-html5lib python-html5lib-whl
==> master:   python-pip python-pip-whl python-requests-whl python-setuptools
==> master:   python-setuptools-whl python-six-whl python-urllib3-whl python-wheel
==> master:   python2.7-dev python3-pkg-resources unzip
==> master: 0 upgraded, 32 newly installed, 0 to remove and 29 not upgraded.
==> master: Need to get 41.3 MB of archives.
==> master: After this operation, 80.4 MB of additional disk space will be used.
==> master: Get:1 http://archive.ubuntu.com/ubuntu/ trusty-updates/main libexpat1-dev amd64 2.1.0-4ubuntu1.1 [115 kB]
==> master: Get:2 http://archive.ubuntu.com/ubuntu/ trusty-updates/main libpython2.7-dev amd64 2.7.6-8ubuntu0.2 [22.0 MB]
==> master: Get:3 http://archive.ubuntu.com/ubuntu/ trusty-updates/main libstdc++-4.8-dev amd64 4.8.4-2ubuntu1~14.04 [1,052 kB]
==> master: Get:4 http://archive.ubuntu.com/ubuntu/ trusty-updates/main g++-4.8 amd64 4.8.4-2ubuntu1~14.04 [15.0 MB]
==> master: Get:5 http://archive.ubuntu.com/ubuntu/ trusty/main g++ amd64 4:4.8.2-1ubuntu6 [1,490 B]
==> master: Get:6 http://archive.ubuntu.com/ubuntu/ trusty-updates/main libdpkg-perl all 1.17.5ubuntu5.4 [179 kB]
==> master: Get:7 http://archive.ubuntu.com/ubuntu/ trusty-updates/main dpkg-dev all 1.17.5ubuntu5.4 [726 kB]
==> master: Get:8 http://archive.ubuntu.com/ubuntu/ trusty/main build-essential amd64 11.6ubuntu6 [4,838 B]
==> master: Get:9 http://archive.ubuntu.com/ubuntu/ trusty/main libalgorithm-diff-perl all 1.19.02-3 [50.0 kB]
==> master: Get:10 http://archive.ubuntu.com/ubuntu/ trusty/main libalgorithm-diff-xs-perl amd64 0.04-2build4 [12.6 kB]
==> master: Get:11 http://archive.ubuntu.com/ubuntu/ trusty/main libalgorithm-merge-perl all 0.08-2 [12.7 kB]
==> master: Get:12 http://archive.ubuntu.com/ubuntu/ trusty/main libfile-fcntllock-perl amd64 0.14-2build1 [15.9 kB]
==> master: Get:13 http://archive.ubuntu.com/ubuntu/ trusty/main libpython-dev amd64 2.7.5-5ubuntu3 [7,078 B]
==> master: Get:14 http://archive.ubuntu.com/ubuntu/ trusty-updates/main python3-pkg-resources all 3.3-1ubuntu2 [31.7 kB]
==> master: Get:15 http://archive.ubuntu.com/ubuntu/ trusty-updates/universe python-chardet-whl all 2.2.1-2~ubuntu1 [170 kB]
==> master: Get:16 http://archive.ubuntu.com/ubuntu/ trusty-updates/universe python-colorama all 0.2.5-0.1ubuntu2 [18.4 kB]
==> master: Get:17 http://archive.ubuntu.com/ubuntu/ trusty-updates/universe python-colorama-whl all 0.2.5-0.1ubuntu2 [18.2 kB]
==> master: Get:18 http://archive.ubuntu.com/ubuntu/ trusty-updates/main python2.7-dev amd64 2.7.6-8ubuntu0.2 [269 kB]
==> master: Get:19 http://archive.ubuntu.com/ubuntu/ trusty/main python-dev amd64 2.7.5-5ubuntu3 [1,166 B]
==> master: Get:20 http://archive.ubuntu.com/ubuntu/ trusty-updates/universe python-distlib all 0.1.8-1ubuntu1 [113 kB]
==> master: Get:21 http://archive.ubuntu.com/ubuntu/ trusty-updates/universe python-distlib-whl all 0.1.8-1ubuntu1 [140 kB]
==> master: Get:22 http://archive.ubuntu.com/ubuntu/ trusty-updates/main python-html5lib all 0.999-3~ubuntu1 [83.5 kB]
==> master: Get:23 http://archive.ubuntu.com/ubuntu/ trusty-updates/main python-html5lib-whl all 0.999-3~ubuntu1 [109 kB]
==> master: Get:24 http://archive.ubuntu.com/ubuntu/ trusty-updates/main python-six-whl all 1.5.2-1ubuntu1 [10.5 kB]
==> master: Get:25 http://archive.ubuntu.com/ubuntu/ trusty-updates/main python-urllib3-whl all 1.7.1-1ubuntu3 [64.0 kB]
==> master: Get:26 http://archive.ubuntu.com/ubuntu/ trusty-updates/main python-requests-whl all 2.2.1-1ubuntu0.3 [227 kB]
==> master: Get:27 http://archive.ubuntu.com/ubuntu/ trusty-updates/main python-setuptools-whl all 3.3-1ubuntu2 [244 kB]
==> master: Get:28 http://archive.ubuntu.com/ubuntu/ trusty-updates/universe python-pip-whl all 1.5.4-1ubuntu3 [111 kB]
==> master: Get:29 http://archive.ubuntu.com/ubuntu/ trusty-updates/main python-setuptools all 3.3-1ubuntu2 [230 kB]
==> master: Get:30 http://archive.ubuntu.com/ubuntu/ trusty-updates/universe python-pip all 1.5.4-1ubuntu3 [97.2 kB]
==> master: Get:31 http://archive.ubuntu.com/ubuntu/ trusty-updates/main python-wheel all 0.24.0-1~ubuntu1 [44.7 kB]
==> master: Get:32 http://archive.ubuntu.com/ubuntu/ trusty-updates/main unzip amd64 6.0-9ubuntu1.3 [157 kB]
==> master: dpkg-preconfigure: unable to re-open stdin: No such file or directory
==> master: Fetched 41.3 MB in 20s (2,027 kB/s)
==> master: Selecting previously unselected package libexpat1-dev:amd64.
==> master: (Reading database ... 61002 files and directories currently installed.)
==> master: Preparing to unpack .../libexpat1-dev_2.1.0-4ubuntu1.1_amd64.deb ...
==> master: Unpacking libexpat1-dev:amd64 (2.1.0-4ubuntu1.1) ...
==> master: Selecting previously unselected package libpython2.7-dev:amd64.
==> master: Preparing to unpack .../libpython2.7-dev_2.7.6-8ubuntu0.2_amd64.deb ...
==> master: Unpacking libpython2.7-dev:amd64 (2.7.6-8ubuntu0.2) ...
==> master: Selecting previously unselected package libstdc++-4.8-dev:amd64.
==> master: Preparing to unpack .../libstdc++-4.8-dev_4.8.4-2ubuntu1~14.04_amd64.deb ...
==> master: Unpacking libstdc++-4.8-dev:amd64 (4.8.4-2ubuntu1~14.04) ...
==> master: Selecting previously unselected package g++-4.8.
==> master: Preparing to unpack .../g++-4.8_4.8.4-2ubuntu1~14.04_amd64.deb ...
==> master: Unpacking g++-4.8 (4.8.4-2ubuntu1~14.04) ...
==> master: Selecting previously unselected package g++.
==> master: Preparing to unpack .../g++_4%3a4.8.2-1ubuntu6_amd64.deb ...
==> master: Unpacking g++ (4:4.8.2-1ubuntu6) ...
==> master: Selecting previously unselected package libdpkg-perl.
==> master: Preparing to unpack .../libdpkg-perl_1.17.5ubuntu5.4_all.deb ...
==> master: Unpacking libdpkg-perl (1.17.5ubuntu5.4) ...
==> master: Selecting previously unselected package dpkg-dev.
==> master: Preparing to unpack .../dpkg-dev_1.17.5ubuntu5.4_all.deb ...
==> master: Unpacking dpkg-dev (1.17.5ubuntu5.4) ...
==> master: Selecting previously unselected package build-essential.
==> master: Preparing to unpack .../build-essential_11.6ubuntu6_amd64.deb ...
==> master: Unpacking build-essential (11.6ubuntu6) ...
==> master: Selecting previously unselected package libalgorithm-diff-perl.
==> master: Preparing to unpack .../libalgorithm-diff-perl_1.19.02-3_all.deb ...
==> master: Unpacking libalgorithm-diff-perl (1.19.02-3) ...
==> master: Selecting previously unselected package libalgorithm-diff-xs-perl.
==> master: Preparing to unpack .../libalgorithm-diff-xs-perl_0.04-2build4_amd64.deb ...
==> master: Unpacking libalgorithm-diff-xs-perl (0.04-2build4) ...
==> master: Selecting previously unselected package libalgorithm-merge-perl.
==> master: Preparing to unpack .../libalgorithm-merge-perl_0.08-2_all.deb ...
==> master: Unpacking libalgorithm-merge-perl (0.08-2) ...
==> master: Selecting previously unselected package libfile-fcntllock-perl.
==> master: Preparing to unpack .../libfile-fcntllock-perl_0.14-2build1_amd64.deb ...
==> master: Unpacking libfile-fcntllock-perl (0.14-2build1) ...
==> master: Selecting previously unselected package libpython-dev:amd64.
==> master: Preparing to unpack .../libpython-dev_2.7.5-5ubuntu3_amd64.deb ...
==> master: Unpacking libpython-dev:amd64 (2.7.5-5ubuntu3) ...
==> master: Selecting previously unselected package python3-pkg-resources.
==> master: Preparing to unpack .../python3-pkg-resources_3.3-1ubuntu2_all.deb ...
==> master: Unpacking python3-pkg-resources (3.3-1ubuntu2) ...
==> master: Selecting previously unselected package python-chardet-whl.
==> master: Preparing to unpack .../python-chardet-whl_2.2.1-2~ubuntu1_all.deb ...
==> master: Unpacking python-chardet-whl (2.2.1-2~ubuntu1) ...
==> master: Selecting previously unselected package python-colorama.
==> master: Preparing to unpack .../python-colorama_0.2.5-0.1ubuntu2_all.deb ...
==> master: Unpacking python-colorama (0.2.5-0.1ubuntu2) ...
==> master: Selecting previously unselected package python-colorama-whl.
==> master: Preparing to unpack .../python-colorama-whl_0.2.5-0.1ubuntu2_all.deb ...
==> master: Unpacking python-colorama-whl (0.2.5-0.1ubuntu2) ...
==> master: Selecting previously unselected package python2.7-dev.
==> master: Preparing to unpack .../python2.7-dev_2.7.6-8ubuntu0.2_amd64.deb ...
==> master: Unpacking python2.7-dev (2.7.6-8ubuntu0.2) ...
==> master: Selecting previously unselected package python-dev.
==> master: Preparing to unpack .../python-dev_2.7.5-5ubuntu3_amd64.deb ...
==> master: Unpacking python-dev (2.7.5-5ubuntu3) ...
==> master: Selecting previously unselected package python-distlib.
==> master: Preparing to unpack .../python-distlib_0.1.8-1ubuntu1_all.deb ...
==> master: Unpacking python-distlib (0.1.8-1ubuntu1) ...
==> master: Selecting previously unselected package python-distlib-whl.
==> master: Preparing to unpack .../python-distlib-whl_0.1.8-1ubuntu1_all.deb ...
==> master: Unpacking python-distlib-whl (0.1.8-1ubuntu1) ...
==> master: Selecting previously unselected package python-html5lib.
==> master: Preparing to unpack .../python-html5lib_0.999-3~ubuntu1_all.deb ...
==> master: Unpacking python-html5lib (0.999-3~ubuntu1) ...
==> master: Selecting previously unselected package python-html5lib-whl.
==> master: Preparing to unpack .../python-html5lib-whl_0.999-3~ubuntu1_all.deb ...
==> master: Unpacking python-html5lib-whl (0.999-3~ubuntu1) ...
==> master: Selecting previously unselected package python-six-whl.
==> master: Preparing to unpack .../python-six-whl_1.5.2-1ubuntu1_all.deb ...
==> master: Unpacking python-six-whl (1.5.2-1ubuntu1) ...
==> master: Selecting previously unselected package python-urllib3-whl.
==> master: Preparing to unpack .../python-urllib3-whl_1.7.1-1ubuntu3_all.deb ...
==> master: Unpacking python-urllib3-whl (1.7.1-1ubuntu3) ...
==> master: Selecting previously unselected package python-requests-whl.
==> master: Preparing to unpack .../python-requests-whl_2.2.1-1ubuntu0.3_all.deb ...
==> master: Unpacking python-requests-whl (2.2.1-1ubuntu0.3) ...
==> master: Selecting previously unselected package python-setuptools-whl.
==> master: Preparing to unpack .../python-setuptools-whl_3.3-1ubuntu2_all.deb ...
==> master: Unpacking python-setuptools-whl (3.3-1ubuntu2) ...
==> master: Selecting previously unselected package python-pip-whl.
==> master: Preparing to unpack .../python-pip-whl_1.5.4-1ubuntu3_all.deb ...
==> master: Unpacking python-pip-whl (1.5.4-1ubuntu3) ...
==> master: Selecting previously unselected package python-setuptools.
==> master: Preparing to unpack .../python-setuptools_3.3-1ubuntu2_all.deb ...
==> master: Unpacking python-setuptools (3.3-1ubuntu2) ...
==> master: Selecting previously unselected package python-pip.
==> master: Preparing to unpack .../python-pip_1.5.4-1ubuntu3_all.deb ...
==> master: Unpacking python-pip (1.5.4-1ubuntu3) ...
==> master: Selecting previously unselected package python-wheel.
==> master: Preparing to unpack .../python-wheel_0.24.0-1~ubuntu1_all.deb ...
==> master: Unpacking python-wheel (0.24.0-1~ubuntu1) ...
==> master: Selecting previously unselected package unzip.
==> master: Preparing to unpack .../unzip_6.0-9ubuntu1.3_amd64.deb ...
==> master: Unpacking unzip (6.0-9ubuntu1.3) ...
==> master: Processing triggers for man-db (2.6.7.1-1ubuntu1) ...
==> master: Processing triggers for mime-support (3.54ubuntu1.1) ...
==> master: Setting up libexpat1-dev:amd64 (2.1.0-4ubuntu1.1) ...
==> master: Setting up libpython2.7-dev:amd64 (2.7.6-8ubuntu0.2) ...
==> master: Setting up libstdc++-4.8-dev:amd64 (4.8.4-2ubuntu1~14.04) ...
==> master: Setting up g++-4.8 (4.8.4-2ubuntu1~14.04) ...
==> master: Setting up g++ (4:4.8.2-1ubuntu6) ...
==> master: update-alternatives: using /usr/bin/g++ to provide /usr/bin/c++ (c++) in auto mode
==> master: Setting up libdpkg-perl (1.17.5ubuntu5.4) ...
==> master: Setting up dpkg-dev (1.17.5ubuntu5.4) ...
==> master: Setting up build-essential (11.6ubuntu6) ...
==> master: Setting up libalgorithm-diff-perl (1.19.02-3) ...
==> master: Setting up libalgorithm-diff-xs-perl (0.04-2build4) ...
==> master: Setting up libalgorithm-merge-perl (0.08-2) ...
==> master: Setting up libfile-fcntllock-perl (0.14-2build1) ...
==> master: Setting up libpython-dev:amd64 (2.7.5-5ubuntu3) ...
==> master: Setting up python3-pkg-resources (3.3-1ubuntu2) ...
==> master: Setting up python-chardet-whl (2.2.1-2~ubuntu1) ...
==> master: Setting up python-colorama (0.2.5-0.1ubuntu2) ...
==> master: Setting up python-colorama-whl (0.2.5-0.1ubuntu2) ...
==> master: Setting up python2.7-dev (2.7.6-8ubuntu0.2) ...
==> master: Setting up python-dev (2.7.5-5ubuntu3) ...
==> master: Setting up python-distlib (0.1.8-1ubuntu1) ...
==> master: Setting up python-distlib-whl (0.1.8-1ubuntu1) ...
==> master: Setting up python-html5lib (0.999-3~ubuntu1) ...
==> master: Setting up python-html5lib-whl (0.999-3~ubuntu1) ...
==> master: Setting up python-six-whl (1.5.2-1ubuntu1) ...
==> master: Setting up python-urllib3-whl (1.7.1-1ubuntu3) ...
==> master: Setting up python-requests-whl (2.2.1-1ubuntu0.3) ...
==> master: Setting up python-setuptools-whl (3.3-1ubuntu2) ...
==> master: Setting up python-pip-whl (1.5.4-1ubuntu3) ...
==> master: Setting up python-setuptools (3.3-1ubuntu2) ...
==> master: Setting up python-pip (1.5.4-1ubuntu3) ...
==> master: Setting up python-wheel (0.24.0-1~ubuntu1) ...
==> master: Setting up unzip (6.0-9ubuntu1.3) ...
==> master: Downloading/unpacking ansible
==> master:   Running setup.py (path:/tmp/pip_build_root/ansible/setup.py) egg_info for package ansible
==> master:
==> master:     no previously-included directories found matching 'v2'
==> master:     no previously-included directories found matching 'docsite'
==> master:     no previously-included directories found matching 'ticket_stubs'
==> master:     no previously-included directories found matching 'packaging'
==> master:     no previously-included directories found matching 'test'
==> master:     no previously-included directories found matching 'hacking'
==> master:     no previously-included directories found matching 'lib/ansible/modules/core/.git'
==> master:     no previously-included directories found matching 'lib/ansible/modules/extras/.git'
==> master: Downloading/unpacking paramiko (from ansible)
==> master: Downloading/unpacking jinja2 (from ansible)
==> master: Requirement already satisfied (use --upgrade to upgrade): PyYAML in /usr/lib/python2.7/dist-packages (from ansible)
==> master: Requirement already satisfied (use --upgrade to upgrade): setuptools in /usr/lib/python2.7/dist-packages (from ansible)
==> master: Requirement already satisfied (use --upgrade to upgrade): pycrypto>=2.6 in /usr/lib/python2.7/dist-packages (from ansible)
==> master: Downloading/unpacking ecdsa>=0.11 (from paramiko->ansible)
==> master: Downloading/unpacking MarkupSafe (from jinja2->ansible)
==> master:   Downloading MarkupSafe-0.23.tar.gz
==> master:   Running setup.py (path:/tmp/pip_build_root/MarkupSafe/setup.py) egg_info for package MarkupSafe
==> master:
==> master: Installing collected packages: ansible, paramiko, jinja2, ecdsa, MarkupSafe
==> master:   Running setup.py install for ansible
==> master:     changing mode of build/scripts-2.7/ansible from 644 to 755
==> master:     changing mode of build/scripts-2.7/ansible-playbook from 644 to 755
==> master:     changing mode of build/scripts-2.7/ansible-pull from 644 to 755
==> master:     changing mode of build/scripts-2.7/ansible-doc from 644 to 755
==> master:     changing mode of build/scripts-2.7/ansible-galaxy from 644 to 755
==> master:     changing mode of build/scripts-2.7/ansible-vault from 644 to 755
==> master:
==> master:     no previously-included directories found matching 'v2'
==> master:     no previously-included directories found matching 'docsite'
==> master:     no previously-included directories found matching 'ticket_stubs'
==> master:     no previously-included directories found matching 'test'
==> master:     no previously-included directories found matching 'hacking'
==> master:     no previously-included directories found matching 'lib/ansible/modules/core/.git'
==> master:     no previously-included directories found matching 'lib/ansible/modules/extras/.git'
==> master:     changing mode of /usr/local/bin/ansible-galaxy to 755
==> master:     changing mode of /usr/local/bin/ansible-playbook to 755
==> master:     changing mode of /usr/local/bin/ansible-doc to 755
==> master:     changing mode of /usr/local/bin/ansible-pull to 755
==> master:     changing mode of /usr/local/bin/ansible-vault to 755
==> master:     changing mode of /usr/local/bin/ansible to 755
==> master:   Running setup.py install for MarkupSafe
==> master:
==> master:     building 'markupsafe._speedups' extension
==> master:     x86_64-linux-gnu-gcc -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fPIC -I/usr/include/python2.7 -c markupsafe/_speedups.c -o build/temp.linux-x86_64-2.7/markupsafe/_speedups.o
==> master:     x86_64-linux-gnu-gcc -pthread -shared -Wl,-O1 -Wl,-Bsymbolic-functions -Wl,-Bsymbolic-functions -Wl,-z,relro -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -D_FORTIFY_SOURCE=2 -g -fstack-protector --param=ssp-buffer-size=4 -Wformat -Werror=format-security build/temp.linux-x86_64-2.7/markupsafe/_speedups.o -o build/lib.linux-x86_64-2.7/markupsafe/_speedups.so
==> master: Successfully installed ansible paramiko jinja2 ecdsa MarkupSafe
==> master: Cleaning up...
==> master: # 192.168.51.4 SSH-2.0-OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3
==> master: # 192.168.51.4 SSH-2.0-OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3
==> master: read (192.168.51.6): No route to host
==> master: read (192.168.51.6): No route to host
==> data1: Importing base box 'ubuntu/trusty64'...
==> data1: Matching MAC address for NAT networking...
==> data1: Checking if box 'ubuntu/trusty64' is up to date...
==> data1: A newer version of the box 'ubuntu/trusty64' is available! You currently
==> data1: have version '20150916.0.0'. The latest is version '20150924.0.0'. Run
==> data1: `vagrant box update` to update.
==> data1: Setting the name of the VM: data1
==> data1: Clearing any previously set forwarded ports...
==> data1: Fixed port collision for 22 => 2222. Now on port 2200.
==> data1: Clearing any previously set network interfaces...
==> data1: Preparing network interfaces based on configuration...
    data1: Adapter 1: nat
    data1: Adapter 2: hostonly
==> data1: Forwarding ports...
    data1: 22 => 2200 (adapter 1)
==> data1: Running 'pre-boot' VM customizations...
==> data1: Booting VM...
==> data1: Waiting for machine to boot. This may take a few minutes...
    data1: SSH address: 127.0.0.1:2200
    data1: SSH username: vagrant
    data1: SSH auth method: private key
    data1: Warning: Connection timeout. Retrying...
==> data1: Machine booted and ready!
==> data1: Checking for guest additions in VM...
==> data1: Setting hostname...
==> data1: Configuring and enabling network interfaces...
==> data1: Mounting shared folders...
    data1: /vagrant => C:/Users/watrous/Documents/hadoop
==> data1: Running provisioner: file...
==> data2: Importing base box 'ubuntu/trusty64'...
==> data2: Matching MAC address for NAT networking...
==> data2: Checking if box 'ubuntu/trusty64' is up to date...
==> data2: A newer version of the box 'ubuntu/trusty64' is available! You currently
==> data2: have version '20150916.0.0'. The latest is version '20150924.0.0'. Run
==> data2: `vagrant box update` to update.
==> data2: Setting the name of the VM: data2
==> data2: Clearing any previously set forwarded ports...
==> data2: Fixed port collision for 22 => 2222. Now on port 2201.
==> data2: Clearing any previously set network interfaces...
==> data2: Preparing network interfaces based on configuration...
    data2: Adapter 1: nat
    data2: Adapter 2: hostonly
==> data2: Forwarding ports...
    data2: 22 => 2201 (adapter 1)
==> data2: Running 'pre-boot' VM customizations...
==> data2: Booting VM...
==> data2: Waiting for machine to boot. This may take a few minutes...
    data2: SSH address: 127.0.0.1:2201
    data2: SSH username: vagrant
    data2: SSH auth method: private key
    data2: Warning: Connection timeout. Retrying...
==> data2: Machine booted and ready!
==> data2: Checking for guest additions in VM...
==> data2: Setting hostname...
==> data2: Configuring and enabling network interfaces...
==> data2: Mounting shared folders...
    data2: /vagrant => C:/Users/watrous/Documents/hadoop
==> data2: Running provisioner: file...

Shown in the output above is the bootstrap-master.sh script installing Ansible and the other required libraries. At this point all three servers are ready for Hadoop to be installed, and your VirtualBox console should look something like this:
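Judging from the provisioning output above, the core of bootstrap-master.sh does roughly the following. This is a hedged reconstruction inferred from the console transcript, not the exact script from the repository:

```
#!/usr/bin/env bash
# Sketch of what bootstrap-master.sh appears to do, inferred from the
# provisioning output above; see the repository for the real script.
apt-get update
apt-get install -y build-essential python-dev python-pip unzip
pip install ansible     # pulls in paramiko, jinja2, ecdsa, MarkupSafe

# The "# 192.168.51.x SSH-2.0-OpenSSH..." lines in the output suggest
# ssh-keyscan is used to pre-populate known_hosts for the other nodes:
ssh-keyscan 192.168.51.4 192.168.51.5 192.168.51.6 >> ~/.ssh/known_hosts
```

The "No route to host" messages for 192.168.51.6 appear because that key scan runs while the data nodes are still being built; the playbook repeats the known-hosts step later once all nodes are up.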

virtualbox-hadoop-hosts

Limit to a single datanode

If you are low on RAM, a couple of small changes will install only two servers with the same effect. To do this, change the following files:

  • Vagrantfile: Remove or comment the definition of the unwanted datanode
  • group_vars/all: Remove or comment the unused host
  • hosts-dev: Remove or comment the unused host

Conversely, you can add as many datanodes as you like by modifying the same files; those changes will trickle through to as many hosts as you define. I’ll discuss that more in a future post, when we use these same Ansible scripts to deploy to a cloud provider.
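For example, dropping the second datanode amounts to commenting out its entry in each of the three files. In hosts-dev (an Ansible inventory file) that would look something like this; the group names here are assumptions for illustration, so check the actual file in the repository:

```
# Sketch of hosts-dev with the second datanode commented out.
# Group names are hypothetical; IPs match the cluster in this article.
[master]
192.168.51.4

[datanodes]
192.168.51.5
#192.168.51.6
```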

Install Hadoop

It’s now time to install Hadoop. There are several commented lines in the bootstrap-master.sh script that you can copy and paste to perform the next few steps. The easiest approach is to log in to the hadoop-master server and run the Ansible playbook.

Proxy management

If you happen to be behind a proxy, make sure you update the proxy settings in both bootstrap-master.sh and group_vars/all. For the group_vars, if you don’t have a proxy, just leave the none: false setting in place; otherwise the Ansible playbook will fail, since it expects that value to be a dictionary.
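The proxy section of group_vars/all looks something like the following sketch. The variable name proxy_env is an assumption based on the description above; the file in the repository is authoritative.

```
# With no proxy: keep the dictionary with only the placeholder key,
# since the playbook expects a dictionary here (variable name assumed).
proxy_env:
  none: false

# Behind a proxy: replace the placeholder with real proxy settings, e.g.
# proxy_env:
#   http_proxy: http://proxy.example.com:8080
#   https_proxy: http://proxy.example.com:8080
```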

Run the Ansible playbook

Below is the Ansible output from configuring and installing Hadoop and all of its dependencies on all three servers in your new cluster.

vagrant@hadoop-master:~$ cd src/
vagrant@hadoop-master:~/src$ ansible-playbook -i hosts-dev playbook.yml
 
PLAY [Install hadoop master node] *********************************************
 
GATHERING FACTS ***************************************************************
ok: [192.168.51.4]
 
TASK: [common | group name=hadoop state=present] ******************************
changed: [192.168.51.4]
 
TASK: [common | user name=hadoop comment="Hadoop" group=hadoop shell=/bin/bash] ***
changed: [192.168.51.4]
 
TASK: [common | authorized_key user=hadoop key="ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDWeJfgWx7hDeZUJOeaIVzcbmYxzMcWfxhgC2975tvGL5BV6unzLz8ZVak6ju++AvnM5mcQp6Ydv73uWyaoQaFZigAzfuenruQkwc7D5YYuba+FgZdQ8VHon29oQA3iaZWG7xTspagrfq3fcqaz2ZIjzqN+E/MtcW08PwfibN2QRWchBCuZ1Q8AmrW7gClzMcgd/uj3TstabspGaaZMCs8aC9JWzZlMMegXKYHvVQs6xH2AmifpKpLoMTdO8jP4jczmGebPzvaXmvVylgwo6bRJ3tyYAmGwx8PHj2EVVQ0XX9ipgixLyAa2c7+/crPpGmKFRrYibCCT6x65px7nWnn3"] ***
changed: [192.168.51.4]
 
TASK: [common | unpack hadoop] ************************************************
changed: [192.168.51.4]
 
TASK: [common | command mv /usr/local/hadoop-2.7.1 /usr/local/hadoop creates=/usr/local/hadoop removes=/usr/local/hadoop-2.7.1] ***
changed: [192.168.51.4]
 
TASK: [common | lineinfile dest=/home/hadoop/.bashrc regexp="HADOOP_HOME=" line="export HADOOP_HOME=/usr/local/hadoop"] ***
changed: [192.168.51.4]
 
TASK: [common | lineinfile dest=/home/hadoop/.bashrc regexp="PATH=" line="export PATH=$PATH:$HADOOP_HOME/bin"] ***
changed: [192.168.51.4]
 
TASK: [common | lineinfile dest=/home/hadoop/.bashrc regexp="HADOOP_SSH_OPTS=" line="export HADOOP_SSH_OPTS=\"-i /home/hadoop/.ssh/hadoop_rsa\""] ***
changed: [192.168.51.4]
 
TASK: [common | Build hosts file] *********************************************
changed: [192.168.51.4] => (item={'ip': '192.168.51.4', 'hostname': 'hadoop-master'})
changed: [192.168.51.4] => (item={'ip': '192.168.51.5', 'hostname': 'hadoop-data1'})
changed: [192.168.51.4] => (item={'ip': '192.168.51.6', 'hostname': 'hadoop-data2'})
 
TASK: [common | lineinfile dest=/etc/hosts regexp='127.0.1.1' state=absent] ***
changed: [192.168.51.4]
 
TASK: [common | file path=/home/hadoop/tmp state=directory owner=hadoop group=hadoop mode=750] ***
changed: [192.168.51.4]
 
TASK: [common | file path=/home/hadoop/hadoop-data/hdfs/namenode state=directory owner=hadoop group=hadoop mode=750] ***
changed: [192.168.51.4]
 
TASK: [common | file path=/home/hadoop/hadoop-data/hdfs/datanode state=directory owner=hadoop group=hadoop mode=750] ***
changed: [192.168.51.4]
 
TASK: [common | Add the service scripts] **************************************
changed: [192.168.51.4] => (item={'dest': '/usr/local/hadoop/etc/hadoop/core-site.xml', 'src': 'core-site.xml'})
changed: [192.168.51.4] => (item={'dest': '/usr/local/hadoop/etc/hadoop/hdfs-site.xml', 'src': 'hdfs-site.xml'})
changed: [192.168.51.4] => (item={'dest': '/usr/local/hadoop/etc/hadoop/yarn-site.xml', 'src': 'yarn-site.xml'})
changed: [192.168.51.4] => (item={'dest': '/usr/local/hadoop/etc/hadoop/mapred-site.xml', 'src': 'mapred-site.xml'})
 
TASK: [common | lineinfile dest=/usr/local/hadoop/etc/hadoop/hadoop-env.sh regexp="^export JAVA_HOME" line="export JAVA_HOME=/usr/lib/jvm/java-8-oracle"] ***
changed: [192.168.51.4]
 
TASK: [common | ensure hostkeys is a known host] ******************************
# hadoop-master SSH-2.0-OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3
changed: [192.168.51.4] => (item={'ip': '192.168.51.4', 'hostname': 'hadoop-master'})
# hadoop-data1 SSH-2.0-OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3
changed: [192.168.51.4] => (item={'ip': '192.168.51.5', 'hostname': 'hadoop-data1'})
# hadoop-data2 SSH-2.0-OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3
changed: [192.168.51.4] => (item={'ip': '192.168.51.6', 'hostname': 'hadoop-data2'})
 
TASK: [oraclejava8 | apt_repository repo='deb http://ppa.launchpad.net/webupd8team/java/ubuntu trusty main' state=present] ***
changed: [192.168.51.4]
 
TASK: [oraclejava8 | apt_repository repo='deb-src http://ppa.launchpad.net/webupd8team/java/ubuntu trusty main' state=present] ***
changed: [192.168.51.4]
 
TASK: [oraclejava8 | debconf name='oracle-java8-installer' question='shared/accepted-oracle-license-v1-1' value='true' vtype='select' unseen=false] ***
changed: [192.168.51.4]
 
TASK: [oraclejava8 | apt_key keyserver=keyserver.ubuntu.com id=EEA14886] ******
changed: [192.168.51.4]
 
TASK: [oraclejava8 | Install Java] ********************************************
changed: [192.168.51.4]
 
TASK: [oraclejava8 | lineinfile dest=/home/hadoop/.bashrc regexp="^export JAVA_HOME" line="export JAVA_HOME=/usr/lib/jvm/java-8-oracle"] ***
changed: [192.168.51.4]
 
TASK: [master | Copy private key into place] **********************************
changed: [192.168.51.4]
 
TASK: [master | Copy slaves into place] ***************************************
changed: [192.168.51.4]
 
TASK: [master | prepare known_hosts] ******************************************
# 192.168.51.4 SSH-2.0-OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3
changed: [192.168.51.4] => (item={'ip': '192.168.51.4', 'hostname': 'hadoop-master'})
# 192.168.51.5 SSH-2.0-OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3
changed: [192.168.51.4] => (item={'ip': '192.168.51.5', 'hostname': 'hadoop-data1'})
# 192.168.51.6 SSH-2.0-OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3
changed: [192.168.51.4] => (item={'ip': '192.168.51.6', 'hostname': 'hadoop-data2'})
 
TASK: [master | add 0.0.0.0 to known_hosts for secondary namenode] ************
# 0.0.0.0 SSH-2.0-OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3
changed: [192.168.51.4]
 
PLAY [Install hadoop data nodes] **********************************************
 
GATHERING FACTS ***************************************************************
ok: [192.168.51.5]
ok: [192.168.51.6]
 
TASK: [common | group name=hadoop state=present] ******************************
changed: [192.168.51.5]
changed: [192.168.51.6]
 
TASK: [common | user name=hadoop comment="Hadoop" group=hadoop shell=/bin/bash] ***
changed: [192.168.51.5]
changed: [192.168.51.6]
 
TASK: [common | authorized_key user=hadoop key="ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDWeJfgWx7hDeZUJOeaIVzcbmYxzMcWfxhgC2975tvGL5BV6unzLz8ZVak6ju++AvnM5mcQp6Ydv73uWyaoQaFZigAzfuenruQkwc7D5YYuba+FgZdQ8VHon29oQA3iaZWG7xTspagrfq3fcqaz2ZIjzqN+E/MtcW08PwfibN2QRWchBCuZ1Q8AmrW7gClzMcgd/uj3TstabspGaaZMCs8aC9JWzZlMMegXKYHvVQs6xH2AmifpKpLoMTdO8jP4jczmGebPzvaXmvVylgwo6bRJ3tyYAmGwx8PHj2EVVQ0XX9ipgixLyAa2c7+/crPpGmKFRrYibCCT6x65px7nWnn3"] ***
changed: [192.168.51.5]
changed: [192.168.51.6]
 
TASK: [common | unpack hadoop] ************************************************
changed: [192.168.51.5]
changed: [192.168.51.6]
 
TASK: [common | command mv /usr/local/hadoop-2.7.1 /usr/local/hadoop creates=/usr/local/hadoop removes=/usr/local/hadoop-2.7.1] ***
changed: [192.168.51.6]
changed: [192.168.51.5]
 
TASK: [common | lineinfile dest=/home/hadoop/.bashrc regexp="HADOOP_HOME=" line="export HADOOP_HOME=/usr/local/hadoop"] ***
changed: [192.168.51.6]
changed: [192.168.51.5]
 
TASK: [common | lineinfile dest=/home/hadoop/.bashrc regexp="PATH=" line="export PATH=$PATH:$HADOOP_HOME/bin"] ***
changed: [192.168.51.5]
changed: [192.168.51.6]
 
TASK: [common | lineinfile dest=/home/hadoop/.bashrc regexp="HADOOP_SSH_OPTS=" line="export HADOOP_SSH_OPTS=\"-i /home/hadoop/.ssh/hadoop_rsa\""] ***
changed: [192.168.51.5]
changed: [192.168.51.6]
 
TASK: [common | Build hosts file] *********************************************
changed: [192.168.51.5] => (item={'ip': '192.168.51.4', 'hostname': 'hadoop-master'})
changed: [192.168.51.6] => (item={'ip': '192.168.51.4', 'hostname': 'hadoop-master'})
changed: [192.168.51.5] => (item={'ip': '192.168.51.5', 'hostname': 'hadoop-data1'})
changed: [192.168.51.6] => (item={'ip': '192.168.51.5', 'hostname': 'hadoop-data1'})
changed: [192.168.51.5] => (item={'ip': '192.168.51.6', 'hostname': 'hadoop-data2'})
changed: [192.168.51.6] => (item={'ip': '192.168.51.6', 'hostname': 'hadoop-data2'})
 
TASK: [common | lineinfile dest=/etc/hosts regexp='127.0.1.1' state=absent] ***
changed: [192.168.51.6]
changed: [192.168.51.5]
 
TASK: [common | file path=/home/hadoop/tmp state=directory owner=hadoop group=hadoop mode=750] ***
changed: [192.168.51.6]
changed: [192.168.51.5]
 
TASK: [common | file path=/home/hadoop/hadoop-data/hdfs/namenode state=directory owner=hadoop group=hadoop mode=750] ***
changed: [192.168.51.5]
changed: [192.168.51.6]
 
TASK: [common | file path=/home/hadoop/hadoop-data/hdfs/datanode state=directory owner=hadoop group=hadoop mode=750] ***
changed: [192.168.51.5]
changed: [192.168.51.6]
 
TASK: [common | Add the service scripts] **************************************
changed: [192.168.51.5] => (item={'dest': '/usr/local/hadoop/etc/hadoop/core-site.xml', 'src': 'core-site.xml'})
changed: [192.168.51.6] => (item={'dest': '/usr/local/hadoop/etc/hadoop/core-site.xml', 'src': 'core-site.xml'})
changed: [192.168.51.5] => (item={'dest': '/usr/local/hadoop/etc/hadoop/hdfs-site.xml', 'src': 'hdfs-site.xml'})
changed: [192.168.51.6] => (item={'dest': '/usr/local/hadoop/etc/hadoop/hdfs-site.xml', 'src': 'hdfs-site.xml'})
changed: [192.168.51.6] => (item={'dest': '/usr/local/hadoop/etc/hadoop/yarn-site.xml', 'src': 'yarn-site.xml'})
changed: [192.168.51.5] => (item={'dest': '/usr/local/hadoop/etc/hadoop/yarn-site.xml', 'src': 'yarn-site.xml'})
changed: [192.168.51.6] => (item={'dest': '/usr/local/hadoop/etc/hadoop/mapred-site.xml', 'src': 'mapred-site.xml'})
changed: [192.168.51.5] => (item={'dest': '/usr/local/hadoop/etc/hadoop/mapred-site.xml', 'src': 'mapred-site.xml'})
 
TASK: [common | lineinfile dest=/usr/local/hadoop/etc/hadoop/hadoop-env.sh regexp="^export JAVA_HOME" line="export JAVA_HOME=/usr/lib/jvm/java-8-oracle"] ***
changed: [192.168.51.5]
changed: [192.168.51.6]
 
TASK: [common | ensure hostkeys is a known host] ******************************
# hadoop-master SSH-2.0-OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3
# hadoop-master SSH-2.0-OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3
changed: [192.168.51.5] => (item={'ip': '192.168.51.4', 'hostname': 'hadoop-master'})
changed: [192.168.51.6] => (item={'ip': '192.168.51.4', 'hostname': 'hadoop-master'})
# hadoop-data1 SSH-2.0-OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3
# hadoop-data1 SSH-2.0-OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3
changed: [192.168.51.5] => (item={'ip': '192.168.51.5', 'hostname': 'hadoop-data1'})
# hadoop-data2 SSH-2.0-OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3
changed: [192.168.51.6] => (item={'ip': '192.168.51.5', 'hostname': 'hadoop-data1'})
# hadoop-data2 SSH-2.0-OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3
changed: [192.168.51.5] => (item={'ip': '192.168.51.6', 'hostname': 'hadoop-data2'})
changed: [192.168.51.6] => (item={'ip': '192.168.51.6', 'hostname': 'hadoop-data2'})
 
TASK: [oraclejava8 | apt_repository repo='deb http://ppa.launchpad.net/webupd8team/java/ubuntu trusty main' state=present] ***
changed: [192.168.51.5]
changed: [192.168.51.6]
 
TASK: [oraclejava8 | apt_repository repo='deb-src http://ppa.launchpad.net/webupd8team/java/ubuntu trusty main' state=present] ***
changed: [192.168.51.5]
changed: [192.168.51.6]
 
TASK: [oraclejava8 | debconf name='oracle-java8-installer' question='shared/accepted-oracle-license-v1-1' value='true' vtype='select' unseen=false] ***
changed: [192.168.51.5]
changed: [192.168.51.6]
 
TASK: [oraclejava8 | apt_key keyserver=keyserver.ubuntu.com id=EEA14886] ******
changed: [192.168.51.5]
changed: [192.168.51.6]
 
TASK: [oraclejava8 | Install Java] ********************************************
changed: [192.168.51.6]
changed: [192.168.51.5]
 
TASK: [oraclejava8 | lineinfile dest=/home/hadoop/.bashrc regexp="^export JAVA_HOME" line="export JAVA_HOME=/usr/lib/jvm/java-8-oracle"] ***
changed: [192.168.51.6]
changed: [192.168.51.5]
 
PLAY RECAP ********************************************************************
192.168.51.4               : ok=27   changed=26   unreachable=0    failed=0
192.168.51.5               : ok=23   changed=22   unreachable=0    failed=0
192.168.51.6               : ok=23   changed=22   unreachable=0    failed=0

Start Hadoop and run a job

Now that you have Hadoop installed, it's time to format HDFS and start up all the services. All the commands to do this are available as comments in the bootstrap-master.sh file. The first step is to format the HDFS NameNode. All of the commands that follow are executed as the hadoop user.

vagrant@hadoop-master:~/src$ sudo su - hadoop
hadoop@hadoop-master:~$ hdfs namenode -format
15/09/30 16:06:36 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = hadoop-master/192.168.51.4
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 2.7.1
STARTUP_MSG:   classpath = [truncated]
STARTUP_MSG:   build = https://git-wip-us.apache.org/repos/asf/hadoop.git -r 15ecc87ccf4a0228f35af08fc56de536e6ce657a; compiled by 'jenkins' on 2015-06-29T06:04Z
STARTUP_MSG:   java = 1.8.0_60
************************************************************/
15/09/30 16:06:36 INFO namenode.NameNode: registered UNIX signal handlers for [TERM, HUP, INT]
15/09/30 16:06:36 INFO namenode.NameNode: createNameNode [-format]
15/09/30 16:06:36 WARN common.Util: Path /home/hadoop/hadoop-data/hdfs/namenode should be specified as a URI in configuration files. Please update hdfs configuration.
15/09/30 16:06:36 WARN common.Util: Path /home/hadoop/hadoop-data/hdfs/namenode should be specified as a URI in configuration files. Please update hdfs configuration.
Formatting using clusterid: CID-1c37e2f0-ba4b-4ad7-84d7-223dec53d34a
15/09/30 16:06:36 INFO namenode.FSNamesystem: No KeyProvider found.
15/09/30 16:06:36 INFO namenode.FSNamesystem: fsLock is fair:true
15/09/30 16:06:36 INFO blockmanagement.DatanodeManager: dfs.block.invalidate.limit=1000
15/09/30 16:06:36 INFO blockmanagement.DatanodeManager: dfs.namenode.datanode.registration.ip-hostname-check=true
15/09/30 16:06:36 INFO blockmanagement.BlockManager: dfs.namenode.startup.delay.block.deletion.sec is set to 000:00:00:00.000
15/09/30 16:06:36 INFO blockmanagement.BlockManager: The block deletion will start around 2015 Sep 30 16:06:36
15/09/30 16:06:36 INFO util.GSet: Computing capacity for map BlocksMap
15/09/30 16:06:36 INFO util.GSet: VM type       = 64-bit
15/09/30 16:06:36 INFO util.GSet: 2.0% max memory 889 MB = 17.8 MB
15/09/30 16:06:36 INFO util.GSet: capacity      = 2^21 = 2097152 entries
15/09/30 16:06:36 INFO blockmanagement.BlockManager: dfs.block.access.token.enable=false
15/09/30 16:06:36 INFO blockmanagement.BlockManager: defaultReplication         = 2
15/09/30 16:06:36 INFO blockmanagement.BlockManager: maxReplication             = 512
15/09/30 16:06:36 INFO blockmanagement.BlockManager: minReplication             = 1
15/09/30 16:06:36 INFO blockmanagement.BlockManager: maxReplicationStreams      = 2
15/09/30 16:06:36 INFO blockmanagement.BlockManager: shouldCheckForEnoughRacks  = false
15/09/30 16:06:36 INFO blockmanagement.BlockManager: replicationRecheckInterval = 3000
15/09/30 16:06:36 INFO blockmanagement.BlockManager: encryptDataTransfer        = false
15/09/30 16:06:36 INFO blockmanagement.BlockManager: maxNumBlocksToLog          = 1000
15/09/30 16:06:36 INFO namenode.FSNamesystem: fsOwner             = hadoop (auth:SIMPLE)
15/09/30 16:06:36 INFO namenode.FSNamesystem: supergroup          = supergroup
15/09/30 16:06:36 INFO namenode.FSNamesystem: isPermissionEnabled = true
15/09/30 16:06:36 INFO namenode.FSNamesystem: HA Enabled: false
15/09/30 16:06:36 INFO namenode.FSNamesystem: Append Enabled: true
15/09/30 16:06:37 INFO util.GSet: Computing capacity for map INodeMap
15/09/30 16:06:37 INFO util.GSet: VM type       = 64-bit
15/09/30 16:06:37 INFO util.GSet: 1.0% max memory 889 MB = 8.9 MB
15/09/30 16:06:37 INFO util.GSet: capacity      = 2^20 = 1048576 entries
15/09/30 16:06:37 INFO namenode.FSDirectory: ACLs enabled? false
15/09/30 16:06:37 INFO namenode.FSDirectory: XAttrs enabled? true
15/09/30 16:06:37 INFO namenode.FSDirectory: Maximum size of an xattr: 16384
15/09/30 16:06:37 INFO namenode.NameNode: Caching file names occuring more than 10 times
15/09/30 16:06:37 INFO util.GSet: Computing capacity for map cachedBlocks
15/09/30 16:06:37 INFO util.GSet: VM type       = 64-bit
15/09/30 16:06:37 INFO util.GSet: 0.25% max memory 889 MB = 2.2 MB
15/09/30 16:06:37 INFO util.GSet: capacity      = 2^18 = 262144 entries
15/09/30 16:06:37 INFO namenode.FSNamesystem: dfs.namenode.safemode.threshold-pct = 0.9990000128746033
15/09/30 16:06:37 INFO namenode.FSNamesystem: dfs.namenode.safemode.min.datanodes = 0
15/09/30 16:06:37 INFO namenode.FSNamesystem: dfs.namenode.safemode.extension     = 30000
15/09/30 16:06:37 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.window.num.buckets = 10
15/09/30 16:06:37 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.num.users = 10
15/09/30 16:06:37 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.windows.minutes = 1,5,25
15/09/30 16:06:37 INFO namenode.FSNamesystem: Retry cache on namenode is enabled
15/09/30 16:06:37 INFO namenode.FSNamesystem: Retry cache will use 0.03 of total heap and retry cache entry expiry time is 600000 millis
15/09/30 16:06:37 INFO util.GSet: Computing capacity for map NameNodeRetryCache
15/09/30 16:06:37 INFO util.GSet: VM type       = 64-bit
15/09/30 16:06:37 INFO util.GSet: 0.029999999329447746% max memory 889 MB = 273.1 KB
15/09/30 16:06:37 INFO util.GSet: capacity      = 2^15 = 32768 entries
15/09/30 16:06:37 INFO namenode.FSImage: Allocated new BlockPoolId: BP-992546781-192.168.51.4-1443629197156
15/09/30 16:06:37 INFO common.Storage: Storage directory /home/hadoop/hadoop-data/hdfs/namenode has been successfully formatted.
15/09/30 16:06:37 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
15/09/30 16:06:37 INFO util.ExitUtil: Exiting with status 0
15/09/30 16:06:37 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at hadoop-master/192.168.51.4
************************************************************/

Start DFS

Next, start the DFS services, as shown.

hadoop@hadoop-master:~$ /usr/local/hadoop/sbin/start-dfs.sh
Starting namenodes on [hadoop-master]
hadoop-master: Warning: Permanently added the RSA host key for IP address '192.168.51.4' to the list of known hosts.
hadoop-master: starting namenode, logging to /usr/local/hadoop/logs/hadoop-hadoop-namenode-hadoop-master.out
hadoop-data2: Warning: Permanently added the RSA host key for IP address '192.168.51.6' to the list of known hosts.
hadoop-data1: Warning: Permanently added the RSA host key for IP address '192.168.51.5' to the list of known hosts.
hadoop-master: starting datanode, logging to /usr/local/hadoop/logs/hadoop-hadoop-datanode-hadoop-master.out
hadoop-data2: starting datanode, logging to /usr/local/hadoop/logs/hadoop-hadoop-datanode-hadoop-data2.out
hadoop-data1: starting datanode, logging to /usr/local/hadoop/logs/hadoop-hadoop-datanode-hadoop-data1.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /usr/local/hadoop/logs/hadoop-hadoop-secondarynamenode-hadoop-master.out

At this point you can access the HDFS status and see all three datanodes attached at this URL: http://192.168.51.4:50070/dfshealth.html#tab-datanode.
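The replication factor of 2 that appeared in the NameNode format log (defaultReplication = 2) and the data directories created by the playbook are driven by the hdfs-site.xml template. A minimal sketch of such a file follows; the values here are inferred from the log output above rather than copied from the repository, so treat them as assumptions.

```xml
<!-- hdfs-site.xml sketch; values inferred from the format log, not copied from the repo -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/home/hadoop/hadoop-data/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/home/hadoop/hadoop-data/hdfs/datanode</value>
  </property>
</configuration>
```

Note that the WARN lines during formatting suggest specifying these paths as file:// URIs rather than bare paths.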

Start YARN

Next, start the YARN services, as shown.

hadoop@hadoop-master:~$ /usr/local/hadoop/sbin/start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to /usr/local/hadoop/logs/yarn-hadoop-resourcemanager-hadoop-master.out
hadoop-data2: starting nodemanager, logging to /usr/local/hadoop/logs/yarn-hadoop-nodemanager-hadoop-data2.out
hadoop-data1: starting nodemanager, logging to /usr/local/hadoop/logs/yarn-hadoop-nodemanager-hadoop-data1.out
hadoop-master: starting nodemanager, logging to /usr/local/hadoop/logs/yarn-hadoop-nodemanager-hadoop-master.out

At this point you can access information about the compute nodes in the cluster and currently running jobs at this URL: http://192.168.51.4:8088/cluster/nodes
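The reason the example MapReduce job later runs under YARN is that the mapred-site.xml template sets the framework name. A minimal sketch of that setting follows; this is the standard Hadoop property, and I am assuming the repository's template uses it rather than quoting the file directly.

```xml
<!-- mapred-site.xml sketch (standard property; assumed, not quoted from the repo) -->
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
```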

Verify that Java processes are running

Hadoop provides a useful script to run a command on every node listed in the slaves file. For example, you can confirm that all expected Java processes are running with the following command.

hadoop@hadoop-master:~$ $HADOOP_HOME/sbin/slaves.sh jps
hadoop-data2: 3872 DataNode
hadoop-data2: 4180 Jps
hadoop-data2: 4021 NodeManager
hadoop-master: 7617 NameNode
hadoop-data1: 3872 DataNode
hadoop-data1: 4180 Jps
hadoop-master: 8675 Jps
hadoop-data1: 4021 NodeManager
hadoop-master: 8309 NodeManager
hadoop-master: 8150 ResourceManager
hadoop-master: 7993 SecondaryNameNode
hadoop-master: 7788 DataNode

Run an example job

Finally, you can confirm that everything is working by running one of the example jobs. Let's estimate the value of pi.

hadoop@hadoop-master:~$ hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar pi 10 30
Number of Maps  = 10
Samples per Map = 30
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Wrote input for Map #4
Wrote input for Map #5
Wrote input for Map #6
Wrote input for Map #7
Wrote input for Map #8
Wrote input for Map #9
Starting Job
15/09/30 19:54:28 INFO client.RMProxy: Connecting to ResourceManager at hadoop-master/192.168.51.4:8032
15/09/30 19:54:29 INFO input.FileInputFormat: Total input paths to process : 10
15/09/30 19:54:29 INFO mapreduce.JobSubmitter: number of splits:10
15/09/30 19:54:29 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1443642855962_0001
15/09/30 19:54:29 INFO impl.YarnClientImpl: Submitted application application_1443642855962_0001
15/09/30 19:54:29 INFO mapreduce.Job: The url to track the job: http://hadoop-master:8088/proxy/application_1443642855962_0001/
15/09/30 19:54:29 INFO mapreduce.Job: Running job: job_1443642855962_0001
15/09/30 19:54:38 INFO mapreduce.Job: Job job_1443642855962_0001 running in uber mode : false
15/09/30 19:54:38 INFO mapreduce.Job:  map 0% reduce 0%
15/09/30 19:54:52 INFO mapreduce.Job:  map 40% reduce 0%
15/09/30 19:54:56 INFO mapreduce.Job:  map 100% reduce 0%
15/09/30 19:54:59 INFO mapreduce.Job:  map 100% reduce 100%
15/09/30 19:54:59 INFO mapreduce.Job: Job job_1443642855962_0001 completed successfully
15/09/30 19:54:59 INFO mapreduce.Job: Counters: 49
        File System Counters
                FILE: Number of bytes read=226
                FILE: Number of bytes written=1272744
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=2710
                HDFS: Number of bytes written=215
                HDFS: Number of read operations=43
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=3
        Job Counters
                Launched map tasks=10
                Launched reduce tasks=1
                Data-local map tasks=10
                Total time spent by all maps in occupied slots (ms)=140318
                Total time spent by all reduces in occupied slots (ms)=4742
                Total time spent by all map tasks (ms)=140318
                Total time spent by all reduce tasks (ms)=4742
                Total vcore-seconds taken by all map tasks=140318
                Total vcore-seconds taken by all reduce tasks=4742
                Total megabyte-seconds taken by all map tasks=143685632
                Total megabyte-seconds taken by all reduce tasks=4855808
        Map-Reduce Framework
                Map input records=10
                Map output records=20
                Map output bytes=180
                Map output materialized bytes=280
                Input split bytes=1530
                Combine input records=0
                Combine output records=0
                Reduce input groups=2
                Reduce shuffle bytes=280
                Reduce input records=20
                Reduce output records=0
                Spilled Records=40
                Shuffled Maps =10
                Failed Shuffles=0
                Merged Map outputs=10
                GC time elapsed (ms)=3509
                CPU time spent (ms)=5620
                Physical memory (bytes) snapshot=2688745472
                Virtual memory (bytes) snapshot=20847497216
                Total committed heap usage (bytes)=2040528896
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=1180
        File Output Format Counters
                Bytes Written=97
Job Finished in 31.245 seconds
Estimated value of Pi is 3.16000000000000000000

Security and Configuration

This example is not production hardened. It does nothing to address firewall management, and the key management is permissive, intended to make it easy for nodes to communicate with each other. For a production deployment it would be straightforward to add a role that sets up a firewall, and you may also want to be more cautious about accepting host keys between nodes.
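As a sketch of what such a role could look like, the following Ansible tasks use the ufw module to restrict traffic to the cluster hosts. This role is hypothetical and not part of the repository; the variable name hadoop_hosts and the port list are assumptions you would adapt to your environment.

```yaml
# roles/firewall/tasks/main.yml (hypothetical role, not in the repo)
- name: allow SSH and core Hadoop ports from cluster hosts only
  ufw:
    rule: allow
    src: "{{ item[0].ip }}"
    port: "{{ item[1] }}"
    proto: tcp
  with_nested:
    - "{{ hadoop_hosts }}"  # same ip/hostname list used to build /etc/hosts
    - [22, 8020, 50010, 50070, 8030, 8031, 8032, 8088]

- name: deny all other incoming traffic and enable the firewall
  ufw:
    state: enabled
    policy: deny
```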

Default Ports

Many people ask what the default ports are for Hadoop services. The following four links list every property that can be set for each of the main components, along with the default value used when a property is absent from the configuration file. Any property that isn't overridden in the Ansible role templates in the git repository takes the default shown in these links.

https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-common/yarn-default.xml
http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/core-default.xml
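As an example of how one of these defaults gets overridden, the NameNode RPC endpoint that clients connect to is controlled by fs.defaultFS in core-site.xml. The snippet below is a sketch consistent with the cluster layout above, not a copy of the repository's template; the hostname and port are assumptions.

```xml
<!-- core-site.xml sketch (hostname and port are assumptions) -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://hadoop-master:9000</value>
  </property>
</configuration>
```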

Problems spanning subnets

While developing this automation, I originally had the datanodes running on a separate subnet. There’s a problem/bug with Hadoop that prevented nodes from communicating across subnets. The following thread covers some of the discussion.

http://mail-archives.apache.org/mod_mbox/hadoop-user/201509.mbox/%3CCAKFXasEROCe%2BfL%2B8T7A3L0j4Qrm%3D4HHuzGfJhNuZ5MqUvQ%3DwjA%40mail.gmail.com%3E

Resources

While developing my Ansible scripts I leaned heavily on this tutorial:
https://chawlasumit.wordpress.com/2015/03/09/install-a-multi-node-hadoop-cluster-on-ubuntu-14-04/

Comments

  1. Hi Daniel,

    Just modify one statement as follows:
    PRIVATE_KEY_SOURCE = '/home/watrous/.vagrant.d/insecure_private_key', and people can use it on an Ubuntu system as well.

    BTW, the "watrous" should be adjusted accordingly.

    Regards,
    Hawk

  2. Zeev Lazarev : July 30, 2016 at 11:59 am

    Hi Daniel,

    Thanks for your article.
    I started with:
    vagrant up
    and have all three nodes up and running.
    I reached the point of running the command:
    ansible-playbook -i hosts-dev playbook.yml

    after which I get the error:
    TASK [setup] *******************************************************************
    The authenticity of host '192.168.51.5 (192.168.51.5)' can't be established.
    ECDSA key fingerprint is 30:75:9b:3b:5c:5b:84:b3:db:85:fe:4f:50:b6:da:fd.
    fatal: [192.168.51.5]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh.", "unreachable": true}

    Will be glad if you response to this message.

    Thanks.
    Zeev

    • Zeev, whenever you SSH into a new host, you have to accept the fingerprint and have it added to known hosts. There are ways to bypass this, which may be appropriate for dev environments. For now you can simply try to ssh into each node and manually accept the fingerprint. After that you can run the Ansible playbook.

  3. Hi Daniel,

    tried on vagrant 14.4.3 box. I am getting an error when I try to execute the Ansible playbook.

    vagrant@hadoop-master:~/src$ ansible-playbook -i hosts-dev playbook.yml
    ERROR! Unexpected Exception: No module named markupsafe
    the full traceback was:

    Traceback (most recent call last):
    File "/usr/local/bin/ansible-playbook", line 79, in
    mycli = getattr(__import__("ansible.cli.%s" % sub, fromlist=[myclass]), myclass)
    File "/usr/local/lib/python2.7/dist-packages/ansible/cli/playbook.py", line 30, in
    from ansible.executor.playbook_executor import PlaybookExecutor
    File "/usr/local/lib/python2.7/dist-packages/ansible/executor/playbook_executor.py", line 27, in
    from ansible.executor.task_queue_manager import TaskQueueManager
    File "/usr/local/lib/python2.7/dist-packages/ansible/executor/task_queue_manager.py", line 28, in
    from ansible.executor.play_iterator import PlayIterator
    File "/usr/local/lib/python2.7/dist-packages/ansible/executor/play_iterator.py", line 29, in
    from ansible.playbook.block import Block
    File "/usr/local/lib/python2.7/dist-packages/ansible/playbook/__init__.py", line 25, in
    from ansible.playbook.play import Play
    File "/usr/local/lib/python2.7/dist-packages/ansible/playbook/play.py", line 27, in
    from ansible.playbook.base import Base
    File "/usr/local/lib/python2.7/dist-packages/ansible/playbook/base.py", line 32, in
    from jinja2.exceptions import UndefinedError
    File "/usr/local/lib/python2.7/dist-packages/jinja2/__init__.py", line 33, in
    from jinja2.environment import Environment, Template
    File "/usr/local/lib/python2.7/dist-packages/jinja2/environment.py", line 13, in
    from jinja2 import nodes
    File "/usr/local/lib/python2.7/dist-packages/jinja2/nodes.py", line 19, in
    from jinja2.utils import Markup
    File "/usr/local/lib/python2.7/dist-packages/jinja2/utils.py", line 531, in
    from markupsafe import Markup, escape, soft_unicode
    ImportError: No module named markupsafe

    and executed vagrant reload --provision, which ran successfully.

    but I am still unable to execute the Ansible playbook.

  4. HI Daniel,

    just found one Python module was missing on the master node. I installed it with:

    sudo pip install markupsafe

    now the ansible command is working fine.

  5. Hi Daniel,

    Thanks for your good work. I modified the OS user in the playbook per Hawk's comment and installed the markupsafe Python module.

    Now all work fine.

  6. Hi Daniel,

    Except for the same couple of issues mentioned here, I was able to see the cluster setup run and work perfectly.

    Thank you for a no fuss, no mess but a great job.
    david l

  7. Awesome job Daniel, a lot of thanks. This was really useful for me 🙂

  8. Hi Daniel,

    i am getting below error

    vagrant@hadoop-master:~/src$ ansible-playbook -i hosts-dev playbook.yml
    [DEPRECATION WARNING]: Instead of sudo/sudo_user, use become/become_user and make sure become_method is 'sudo' (default).
    This feature will be
    removed in a future release. Deprecation warnings can be disabled by setting deprecation_warnings=False in ansible.cfg.

    PLAY [Install hadoop master node] **********************************************

    TASK [setup] *******************************************************************
    ok: [192.168.51.4]

    TASK [common : include_vars] ***************************************************
    fatal: [192.168.51.4]: FAILED! => {"ansible_facts": {}, "changed": false, "failed": true, "message": "/home/vagrant/src/nodes-dev does not have a valid extension: yaml, yml, json"}
    to retry, use: --limit @/home/vagrant/src/playbook.retry

    PLAY RECAP *********************************************************************
    192.168.51.4 : ok=1 changed=0 unreachable=0 failed=1

    Could you please help me to fix this issue.

    thanks,
    Vinod

    • Hi,
      Did you fix the above issue?
      I got the same error. :(

      • Which error specifically?

          The error Vinod came across and explained in his comment.

          Best regards.

          Mouhsin.

        • Hi,

          …..
          ==> master: Command /home/vagrant/venv/bin/python -c "import setuptools, tokenize;__file__='/home/vagrant/venv/build/cffi/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /tmp/pip-OL12dQ-record/install-record.txt --single-version-externally-managed --compile --install-headers /home/vagrant/venv/include/site/python2.7 failed with error code 1 in /home/vagrant/venv/build/cffi
          ==> master: Traceback (most recent call last):
          ==> master: File "/home/vagrant/venv/bin/pip", line 11, in
          ==> master:
          ==> master: sys.exit(main())
          ==> master: File "/home/vagrant/venv/local/lib/python2.7/site-packages/pip/__init__.py", line 185, in main
          ==> master:
          ==> master: return command.main(cmd_args)
          ==> master: File "/home/vagrant/venv/local/lib/python2.7/site-packages/pip/basecommand.py", line 161, in main
          ==> master:
          ==> master: text = '\n'.join(complete_log)
          ==> master: UnicodeDecodeError
          ==> master: :
          ==> master: 'ascii' codec can't decode byte 0xe2 in position 42: ordinal not in range(128)
          The SSH command responded with a non-zero exit status. Vagrant
          assumes that this means the command failed. The output for this command
          should be in the log above. Please read the output to determine what
          went wrong.

          as i understood this error cause the above error.

          Mouhsin

  9. Nice tutorial! Thanks!

  10. Hi Daniel,
    This is an awesome tutorial, but I am having a problem in VirtualBox that I wanted to see if you can help me with. When I run the ansible-playbook -i hosts-dev playbook.yml command it says
    Warning could not parse environment value
    Fatal Failed request: "HTTP ERROR 404 NOT FOUND", false "dest": "/home/hadoop/hadoop-27.1.tar.gz"

  11. in this configuration, what is the sudo password for hadoop?

  12. Hi Daniel
    I'm getting this error in ansible-playbook:
    fatal: [192.168.51.6]: FAILED! => {"changed": false, "msg": "groupadd: Permission denied.\ngroupadd: cannot lock /etc/group; try again later.\n", "name": "hadoop"}
    I think I can't execute sudo commands from master to nodeData1
