Daniel Watrous on Software Engineering

A Collection of Software Problems and Solutions

Posts tagged ansible


Infrastructure as Code

One of the most significant enablers of IT and software automation has been the shift away from fixed infrastructure to flexible infrastructure. Virtualization, process isolation, resource sharing and other forms of flexible infrastructure have been in use for many decades in IT systems. It can be seen in early Unix systems, Java application servers and even in common tools such as Apache and IIS in the form of virtual hosts. If flexible infrastructure has been a part of technology practice for so long, why is it getting so much buzz now?

Infrastructure as Code

In the last decade, virtualization has become more accessible and transparent, in part due to text-based abstractions that describe infrastructure systems. There are many such abstractions that span IaaS, PaaS, CaaS (containers) and other platforms, but I see four major categories of tools that have emerged.

  • Infrastructure Definition. This is closest to defining actual server, network and storage resources.
  • Runtime or system configuration. This operates on compute resources to overlay system libraries, policies, access control, etc.
  • Image definition. This produces an image or template of a system or application that can then be instantiated.
  • Application description. This is often a composite representation of infrastructure resources and relationships that together deliver a functional system.

Right tool for the right job

I have observed a trend among these toolsets to expand their scope beyond one of these categories to encompass all of them. For example, rather than use a chain of tools, such as Packer to define an image, HEAT to define the infrastructure and Ansible to configure the resources and deploy the application, someone will try to use Ansible to do all three. Why is that bad?

A tool like HEAT is directly tied to the OpenStack charter. It endeavors to adhere to the native APIs as they evolve. The tool is accessible, reportable and integrated into the OpenStack environment where the managed resources are also visible. This can simplify troubleshooting and decrease development time. In my experience, a tool like Ansible generally lags behind in features and API support, and lacks native interface integration. Some argue that using a tool like Ansible makes the automation more portable between cloud providers. Given the different interfaces and underlying APIs, I haven’t seen this actually work. There is always a frustrating translation when changing providers, and in many cases there is additional frustration due to idiosyncrasies of the tool, which could have been avoided by using more native interfaces.

The point I’m driving at is that when a native, supported and integrated tool exists for a given stage of automation, it’s worth exploring, even if it represents another skill set for those who develop the automation. The insight gained can often lead to a more robust and appropriate implementation. In the end, a tool can call a combination of HEAT and Ansible as easily as just Ansible.
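For illustration, a thin wrapper script might chain the native tools; the file and stack names here are placeholders, not taken from any real project:

packer build app-image.json                       # image definition
heat stack-create -f infrastructure.yaml mystack  # infrastructure definition (HEAT)
ansible-playbook -i inventory configure.yml       # configuration and deployment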

Containers vs. Platforms

Another lively discussion over the past few years revolves around where automation efforts should focus. AWS made popular the idea that automation at the IaaS layer was the way to go. A lot of companies have benefitted from that, but many more have found the learning curve too steep and the cost of fixed resources too high. Along came Heroku, which promised to abstract away all the complexity of IaaS while still delivering the benefits. The cost of that benefit came in either reduced flexibility or a steep learning curve to create new deployment contexts (called buildpacks). When Docker came along and provided a very easy way to produce a single-function image that could be quickly instantiated, it spawned discussion about how the container lifecycle should be orchestrated.

Containers moved the concept of image creation away from general purpose compute, which had been the focus of IaaS, and toward specialized compute, such as a single application executable. Start time and resource efficiency made containers more appealing than virtual servers, but questions about how to handle networking and storage remained. The Docker best practice of single-function containers drove up the number of instances compared to more complex virtual servers that filled multiple roles and had longer life cycles. Orchestration became the key to reliable container-based deployments.

The descriptive approaches that evolved to accommodate containers, such as Kubernetes, provide more ease and speed than IaaS, while providing more transparency and control than PaaS. Containers make it possible to define an application deployment scenario, including images, networking, storage, configuration, routing, etc., in plain text and trust the Container as a Service (CaaS) platform to orchestrate it all.
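For example, a minimal application description for Kubernetes fits in a few lines of plain text. This is only a sketch; the names and image are placeholders:

apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  selector:
    app: web            # route traffic to pods carrying this label
  ports:
    - port: 80
---
apiVersion: v1
kind: ReplicationController
metadata:
  name: web
spec:
  replicas: 3           # the CaaS keeps three instances running
  selector:
    app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: nginx:1.9
          ports:
            - containerPort: 80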

Evolution

Up to this point, infrastructure as code has evolved from shell scripts, to infrastructure definitions for IaaS tools, to configuration and image creation tools that define what those environments look like, to full application deployment descriptions. What remains to mature are configuration and secret management, along with the regional distribution of compute for performance and edge data processing.


HEAT or Ansible in OpenStack? Both!

Someone asked me today whether he should use HEAT or Ansible to automate his OpenStack deployment. My answer is that he should use both! It’s helpful to understand the original design decisions for each tool in order to use each effectively. OpenStack HEAT and Ansible were designed to do different things, although in the open source tradition, they have been extended to accommodate some overlapping functionality.

Cloud Native

In my post on What is Cloud Native, I show the five elements of application life cycle that can be automated in the cloud (image shown below). The two life cycle elements in blue, provision and configure, correspond to the most effective use of HEAT and Ansible.

[Image: application-lifecycle-elements]

OpenStack HEAT for Provisioning

HEAT is designed to capture details related to infrastructure and accommodate provisioning of that infrastructure on OpenStack. CloudFormation does the same thing in AWS, and Terraform is an abstraction that has providers for both OpenStack and AWS (and many others).

HEAT provides vocabulary to define compute, storage, network and other infrastructure-related resources. This includes the interrelationships between infrastructure resources, such as associating floating IPs with compute resources or binding a compute resource to a specific network. It also includes some bookkeeping items, like assigning key pairs for authentication and naming resources.
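A minimal HOT template sketch illustrates that vocabulary; the image, flavor, key pair and network names are placeholders you would adapt to your environment:

heat_template_version: 2013-05-23

resources:
  web_server:
    type: OS::Nova::Server
    properties:
      name: web01                    # resource naming
      image: ubuntu-14.04            # assumed image name
      flavor: m1.small               # assumed flavor
      key_name: my_keypair           # key pair for authentication
      networks:
        - network: private-net       # bind to a specific network

  web_floating_ip:
    type: OS::Nova::FloatingIP
    properties:
      pool: public-net               # assumed floating IP pool

  associate_ip:
    type: OS::Nova::FloatingIPAssociation
    properties:
      floating_ip: { get_resource: web_floating_ip }
      server_id: { get_resource: web_server }

outputs:
  web_ip:
    description: floating IP assigned to the server
    value: { get_attr: [web_floating_ip, ip] }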

The end result of executing a HEAT template is a collection of one or more infrastructure resources based on existing images (VM or volume).

Ansible for Configuration

Ansible, on the other hand, is designed to configure infrastructure after it has been provisioned. This includes activities like installing libraries and setting up a specific runtime environment. System details like firewalls and log management, as well as the application stack, databases, etc., are easily managed from Ansible.

Ansible can also easily accommodate application deployment. Activities such as moving application artifacts into specific places, managing users/groups and file permissions, tweaking configuration files, etc. are all easily done in Ansible.
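A small playbook sketch covers both kinds of activity; the package, user and path names are placeholders:

- hosts: webservers
  sudo: yes
  tasks:
    - name: install runtime libraries
      apt: name=nginx state=present update_cache=yes

    - name: open the firewall for http
      ufw: rule=allow port=80 proto=tcp

    - name: place the application artifact
      copy: src=app.tar.gz dest=/opt/app/app.tar.gz owner=www-data mode=0644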

The end result of executing an Ansible playbook is ready-to-use infrastructure.

Where is the Overlap?

Ansible can provision resources in OpenStack. HEAT can send a cloud-init script to a new server to perform configuration of the server. For provisioning, Ansible is not nearly as expressive or granular as HEAT when defining infrastructure. For configuration through cloud-init, you still need to find some way to dynamically manage the cloud-init scripts so that each compute resource is configured to fit into your larger system. I do use cloud-init with HEAT, but I generally find more value in leaving the bulk of configuration to Ansible.
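When I do pass cloud-init through HEAT, it is usually just enough to make the server reachable by Ansible. A HOT fragment sketching that idea (image and flavor values are placeholders):

  bootstrap_server:
    type: OS::Nova::Server
    properties:
      image: ubuntu-14.04
      flavor: m1.small
      user_data_format: RAW
      user_data: |
        #!/bin/bash
        # minimal bootstrap; leave the bulk of configuration to Ansible
        apt-get update
        apt-get install -y python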

Ansible inventory from HEAT

When using HEAT and Ansible together, it is necessary to generate the Ansible inventory file from HEAT outputs. To accomplish this, make sure the HEAT template outputs the necessary information, like IP addresses. You can then use your favorite scripting language to query HEAT and write the inventory file.
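Here is a Python sketch of that idea; the stack name, the output names (master_ip, slave_ips) and the inventory layout are assumptions that would need to match your template:

#!/usr/bin/env python
# Sketch: write an Ansible inventory from HEAT stack outputs.
# Stack name and output names are assumptions for illustration.
import json
import subprocess

STACK = "hadoop-cluster"

def stack_output(name):
    # heatclient prints a stack output value JSON-encoded
    raw = subprocess.check_output(["heat", "output-show", STACK, name])
    return json.loads(raw)

master_ip = stack_output("master_ip")
slave_ips = stack_output("slave_ips")  # assumed to be a list of addresses

with open("inventory", "w") as f:
    f.write("[master]\n%s\n\n[slaves]\n" % master_ip)
    for ip in slave_ips:
        f.write("%s\n" % ip)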

Example using both HEAT and Ansible

A while ago I published two articles that showed how I develop the Ansible configuration, and then extend that to work with HEAT for deploying complex, multi-server environments.

Install and configure a Multi-node Hadoop cluster using Ansible

Build a multi-server Hadoop cluster in OpenStack in minutes

The first article lays the foundation for deploying a complex system with Ansible. The second article builds on this by introducing HEAT to provision the infrastructure. The Ansible inventory file is dynamically generated using a python script and the OpenStack CLI.

Conclusion

While there is some ambiguity around the term provision in cloud parlance, I consider provision to be the process of creating infrastructure resources that are not generally configured. I refer to configuration as the process of operating against those provisioned resources to prepare them for a specific use case, such as running an application or a database. HEAT is a powerful tool for provisioning resources in OpenStack and Ansible is a great fit for configuring existing infrastructure resources.


Deploy MongoDB using Ansible

I’ve recently had some people ask how I deploy MongoDB. For a while I used MongoDB’s excellent online tool to deploy and monitor my clusters. Unfortunately, they changed direction and I couldn’t afford their new tools, so I turned to Ansible.

In order to more easily share the process, I posted a simple example that you can run locally using Vagrant to deploy MongoDB with Ansible.

https://github.com/dwatrous/ansible-mongodb
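With Vagrant and VirtualBox installed, the quick start is just a clone and a vagrant up (see the repository README for specifics):

git clone https://github.com/dwatrous/ansible-mongodb.git
cd ansible-mongodb
vagrant up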

As soon as you finish running the Ansible script, you can immediately connect to MongoDB and start working with data.

[Image: ansible-mongodb]

If you’re looking to learn more about MongoDB, check out the videos I published with Packt Publishing on end-to-end MongoDB.


What is Cloud Native?

I hear a lot of people talking about cloud native applications these days. This includes technologists and business managers. I have found that there really is a spectrum of meaning for the term cloud native and that two people rarely mean the same thing when they say cloud native.

At one end of the spectrum would be running a traditional workload on a virtual machine. In this scenario the virtual host may have been manually provisioned, manually configured, manually deployed, etc. Its cloudiness comes from the fact that it’s a virtual machine running in the cloud.

I tend to think of cloud native at the other end and propose the following definition:

The ability to provision and configure infrastructure, stage and deploy an application and address the scale and health needs of the application in an automated and deterministic way without human interaction

The activities necessary to accomplish the above are:

  • Provision
  • Configure
  • Build and Test
  • Deploy
  • Scale and Heal

[Image: application-lifecycle-elements]

Provision and Configure

The following diagram illustrates some of the workflow involved in provisioning and configuring resources for a cloud native application.

You’ll notice that there are some abstractions listed, including HEAT for OpenStack, CloudFormation for AWS and even Terraform, which can provision against both OpenStack and AWS. You’ll also notice that I include a provision flow that produces an image rather than an actual running resource. This can be helpful when using IaaS directly, but becomes essential when using containers. The management of that image creation process should include a CI/CD pipeline and a versioned image registry (more about that another time).

Build, Test, Deploy

With provisioning defined, it’s time to look at the application Build, Test and Deploy steps. These are depicted in the following figure:

The color of the “Prepare Infrastructure” activity should hint that it represents the workflow shown above under Provision and Configure. For clarity, various steps have been grouped under the heading “Application Staging Process”. While these can occur independently (and unfortunately testing sometimes never happens), it’s helpful to think of those four steps as necessary to validate any potential release. It should be possible to fully automate the staging of an application.

Discovery

The discovery step is often still done manually, using configuration files or even manual edits after deploy. Discovery could include making sure application components know how to reach a database, or making sure a load balancer knows which application servers should receive traffic. In a cloud native application, this discovery should be fully automated. When using containers it becomes essential and very fluid. Mechanisms that accommodate discovery include system-level tools like etcd and DNS-based tools like Consul.
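As an illustration of those primitives (the service names and addresses are placeholders):

# register a database endpoint in etcd (v2 API)
etcdctl set /services/db/primary 192.168.51.10:27017

# resolve an application service through Consul's DNS interface (default port 8600)
dig @127.0.0.1 -p 8600 web.service.consul +short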

Monitor and Heal or Scale

There are loads of monitoring tools available today. A cloud native application requires monitoring to be close to real time and needs to be able to act on monitoring outputs. This may involve creating new resources, destroying unhealthy resources and even shifting workloads around based on latency or other metrics.

Tools and Patterns

There are many tools to establish the workflows shown above. The provision step will almost always be provider specific and based on the provider’s API. Some tools, such as Terraform, attempt to abstract this away from the provider, with mixed results. The configure step might include Ansible or a similar tool. The build, test and deploy process will likely use a tool like Jenkins to accomplish automation. In some cases the above process may include multiple providers, all integrated by your application.

Regardless of the tools you choose, the most important characteristic of a cloud native application is that all of the activities listed are automated and deterministic.


Build a multi-server Hadoop cluster in OpenStack in minutes

In a previous post I demonstrated a method to deploy a multi-node Hadoop cluster using Vagrant and Ansible. This post builds on that and shows how to deploy a Hadoop cluster with an arbitrary number of slave nodes in minutes on OpenStack. This process makes use of the OpenStack orchestration layer HEAT to provision the resources, after which Ansible is used to configure those resources.

All the scripts to do this yourself are available on github to clone and fork:
https://github.com/dwatrous/hadoop-multi-server-ansible

I have recorded a video demonstrating the entire process, including scaling the cluster after initial deployment:

Scope

The scope of this article is to create a Hadoop cluster with an arbitrary number of slave nodes, which can be automatically scaled up or down as workloads change. The following diagram illustrates this:
[Image: hadoop-design-openstack]

Build the servers

For convenience, this process still uses Vagrant to create a server that functions as the HEAT and Ansible controller. It’s also possible to create a server in OpenStack to fill this role; in that case you could simply use the bootstrap-master.sh script to configure that server. The steps to create the servers in OpenStack using HEAT are listed below (a sketch of the corresponding commands follows the list):

  1. Install the OpenStack clients (we do this in a Python virtual environment)
  2. Download and source the openrc file from your OpenStack environment
  3. Use the OpenStack clients to get details about keypairs, images, networks, etc.
  4. Update the HEAT template for your environment
  5. Use HEAT to build your servers
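From a shell, those five steps look something like this; the file and stack names are placeholders:

virtualenv venv && source venv/bin/activate           # step 1: isolated Python environment
pip install python-openstackclient python-heatclient
source openrc.sh                                      # step 2: credentials for your cloud
openstack keypair list                                # step 3: gather environment details
openstack image list
openstack network list
# step 4: edit the heat template to reference your keypair, image and network
heat stack-create -f hadoop-cluster.yaml hadoop       # step 5: provision the servers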

Install and Run Hadoop

Once the servers are provisioned, it’s time to install Hadoop. This is done using Ansible and can be run from the same host where HEAT was used (the Vagrant-created server in this case). Ansible requires an inventory file to run. Since HEAT is aware of the server resources it created, I added a python script to request information about provisioned servers from HEAT and write an inventory file. Any time you update your stack using HEAT, be sure to run the heat-inventory.py script so Ansible is working against the current state. Keep in mind that if you have a proxied environment, you may need to update group_vars/all.
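In practice the update cycle looks something like this; the inventory file name is a placeholder:

heat stack-update -f hadoop-cluster.yaml hadoop     # apply changes to the stack
python heat-inventory.py                            # regenerate the Ansible inventory
ansible-playbook -i hosts-openstack playbook.yml    # reconfigure against current state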

When the Ansible scripts complete, remember to connect to the master node in OpenStack. From the Hadoop master node, the same process as before can be followed to start Hadoop and run a job.

Security and Configuration

In this example, a floating IP is attached to every server so the local Vagrant server can connect via SSH and configure them. If a server were manually prepared in the same OpenStack environment, the SSH connectivity could leverage IP addresses on the private network. This would eliminate all but one floating IP address, which is still required for the master node.

Future work might include additional automation to tie together the steps I’ve demonstrated. These can also be executed as part of a CI/CD tool chain for fully automated deployments.


Install and configure a Multi-node Hadoop cluster using Ansible

I’ve recently been involved with several groups interested in using Hadoop to process large sets of data, including use of higher level abstractions on top of Hadoop like Pig and Hive. What has surprised me most is that no one is automating their installation of Hadoop. In each case that I’ve observed, they start by manually provisioning some servers and then follow a series of tutorials to manually install and configure a cluster. The typical experience seems to take about a week to set up a cluster, with a lot of time wasted dealing with networking and connectivity between hosts.

After telling several groups that they should automate the installation of Hadoop using something like Ansible, I decided to create an example. All the scripts to install a new Hadoop cluster in minutes are on github for you to fork: https://github.com/dwatrous/hadoop-multi-server-ansible

I have also recorded a video demonstration of the following process:

Scope

The scope of this article is to create a three-node cluster on a single computer (Windows in my case) using VirtualBox and Vagrant. The cluster includes HDFS and MapReduce running on all three nodes. The following diagram will help to visualize the cluster.

[Image: hadoop-design]

Build the servers

The first step is to install VirtualBox and Vagrant.

Clone hadoop-multi-server-ansible and open a console window in the directory where you cloned it. The Vagrantfile defines three Ubuntu 14.04 servers. Each server needs 3GB RAM, so make sure you have enough RAM available. Now run vagrant up and wait a few minutes for the new servers to come up.

C:\Users\watrous\Documents\hadoop>vagrant up
Bringing machine 'master' up with 'virtualbox' provider...
Bringing machine 'data1' up with 'virtualbox' provider...
Bringing machine 'data2' up with 'virtualbox' provider...
==> master: Importing base box 'ubuntu/trusty64'...
==> master: Matching MAC address for NAT networking...
==> master: Checking if box 'ubuntu/trusty64' is up to date...
==> master: A newer version of the box 'ubuntu/trusty64' is available! You currently
==> master: have version '20150916.0.0'. The latest is version '20150924.0.0'. Run
==> master: `vagrant box update` to update.
==> master: Setting the name of the VM: master
==> master: Clearing any previously set forwarded ports...
==> master: Clearing any previously set network interfaces...
==> master: Preparing network interfaces based on configuration...
    master: Adapter 1: nat
    master: Adapter 2: hostonly
==> master: Forwarding ports...
    master: 22 => 2222 (adapter 1)
==> master: Running 'pre-boot' VM customizations...
==> master: Booting VM...
==> master: Waiting for machine to boot. This may take a few minutes...
    master: SSH address: 127.0.0.1:2222
    master: SSH username: vagrant
    master: SSH auth method: private key
    master: Warning: Connection timeout. Retrying...
==> master: Machine booted and ready!
==> master: Checking for guest additions in VM...
==> master: Setting hostname...
==> master: Configuring and enabling network interfaces...
==> master: Mounting shared folders...
    master: /home/vagrant/src => C:/Users/watrous/Documents/hadoop
==> master: Running provisioner: file...
==> master: Running provisioner: shell...
    master: Running: C:/Users/watrous/AppData/Local/Temp/vagrant-shell20150930-12444-1lgl5bq.sh
==> master: stdin: is not a tty
==> master: Ign http://archive.ubuntu.com trusty InRelease
==> master: Ign http://archive.ubuntu.com trusty-updates InRelease
==> master: Ign http://security.ubuntu.com trusty-security InRelease
==> master: Hit http://archive.ubuntu.com trusty Release.gpg
==> master: Get:1 http://security.ubuntu.com trusty-security Release.gpg [933 B]
==> master: Get:2 http://archive.ubuntu.com trusty-updates Release.gpg [933 B]
==> master: Hit http://archive.ubuntu.com trusty Release
==> master: Get:3 http://security.ubuntu.com trusty-security Release [63.5 kB]
==> master: Get:4 http://archive.ubuntu.com trusty-updates Release [63.5 kB]
==> master: Get:5 http://archive.ubuntu.com trusty/main Sources [1,064 kB]
==> master: Get:6 http://security.ubuntu.com trusty-security/main Sources [96.2 kB]
==> master: Get:7 http://security.ubuntu.com trusty-security/universe Sources [31.1 kB]
==> master: Get:8 http://security.ubuntu.com trusty-security/main amd64 Packages [350 kB]
==> master: Get:9 http://archive.ubuntu.com trusty/universe Sources [6,399 kB]
==> master: Get:10 http://security.ubuntu.com trusty-security/universe amd64 Packages [117 kB]
==> master: Get:11 http://security.ubuntu.com trusty-security/main Translation-en [191 kB]
==> master: Get:12 http://security.ubuntu.com trusty-security/universe Translation-en [68.2 kB]
==> master: Hit http://archive.ubuntu.com trusty/main amd64 Packages
==> master: Hit http://archive.ubuntu.com trusty/universe amd64 Packages
==> master: Hit http://archive.ubuntu.com trusty/main Translation-en
==> master: Hit http://archive.ubuntu.com trusty/universe Translation-en
==> master: Get:13 http://archive.ubuntu.com trusty-updates/main Sources [236 kB]
==> master: Get:14 http://archive.ubuntu.com trusty-updates/universe Sources [139 kB]
==> master: Get:15 http://archive.ubuntu.com trusty-updates/main amd64 Packages [626 kB]
==> master: Get:16 http://archive.ubuntu.com trusty-updates/universe amd64 Packages [320 kB]
==> master: Get:17 http://archive.ubuntu.com trusty-updates/main Translation-en [304 kB]
==> master: Get:18 http://archive.ubuntu.com trusty-updates/universe Translation-en [168 kB]
==> master: Ign http://archive.ubuntu.com trusty/main Translation-en_US
==> master: Ign http://archive.ubuntu.com trusty/universe Translation-en_US
==> master: Fetched 10.2 MB in 4s (2,098 kB/s)
==> master: Reading package lists...
==> master: Reading package lists...
==> master: Building dependency tree...
==> master:
==> master: Reading state information...
==> master: The following extra packages will be installed:
==> master:   build-essential dpkg-dev g++ g++-4.8 libalgorithm-diff-perl
==> master:   libalgorithm-diff-xs-perl libalgorithm-merge-perl libdpkg-perl libexpat1-dev
==> master:   libfile-fcntllock-perl libpython-dev libpython2.7-dev libstdc++-4.8-dev
==> master:   python-chardet-whl python-colorama python-colorama-whl python-distlib
==> master:   python-distlib-whl python-html5lib python-html5lib-whl python-pip-whl
==> master:   python-requests-whl python-setuptools python-setuptools-whl python-six-whl
==> master:   python-urllib3-whl python-wheel python2.7-dev python3-pkg-resources
==> master: Suggested packages:
==> master:   debian-keyring g++-multilib g++-4.8-multilib gcc-4.8-doc libstdc++6-4.8-dbg
==> master:   libstdc++-4.8-doc python-genshi python-lxml python3-setuptools zip
==> master: Recommended packages:
==> master:   python-dev-all
==> master: The following NEW packages will be installed:
==> master:   build-essential dpkg-dev g++ g++-4.8 libalgorithm-diff-perl
==> master:   libalgorithm-diff-xs-perl libalgorithm-merge-perl libdpkg-perl libexpat1-dev
==> master:   libfile-fcntllock-perl libpython-dev libpython2.7-dev libstdc++-4.8-dev
==> master:   python-chardet-whl python-colorama python-colorama-whl python-dev
==> master:   python-distlib python-distlib-whl python-html5lib python-html5lib-whl
==> master:   python-pip python-pip-whl python-requests-whl python-setuptools
==> master:   python-setuptools-whl python-six-whl python-urllib3-whl python-wheel
==> master:   python2.7-dev python3-pkg-resources unzip
==> master: 0 upgraded, 32 newly installed, 0 to remove and 29 not upgraded.
==> master: Need to get 41.3 MB of archives.
==> master: After this operation, 80.4 MB of additional disk space will be used.
==> master: Get:1 http://archive.ubuntu.com/ubuntu/ trusty-updates/main libexpat1-dev amd64 2.1.0-4ubuntu1.1 [115 kB]
==> master: Get:2 http://archive.ubuntu.com/ubuntu/ trusty-updates/main libpython2.7-dev amd64 2.7.6-8ubuntu0.2 [22.0 MB]
==> master: Get:3 http://archive.ubuntu.com/ubuntu/ trusty-updates/main libstdc++-4.8-dev amd64 4.8.4-2ubuntu1~14.04 [1,052 kB]
==> master: Get:4 http://archive.ubuntu.com/ubuntu/ trusty-updates/main g++-4.8 amd64 4.8.4-2ubuntu1~14.04 [15.0 MB]
==> master: Get:5 http://archive.ubuntu.com/ubuntu/ trusty/main g++ amd64 4:4.8.2-1ubuntu6 [1,490 B]
==> master: Get:6 http://archive.ubuntu.com/ubuntu/ trusty-updates/main libdpkg-perl all 1.17.5ubuntu5.4 [179 kB]
==> master: Get:7 http://archive.ubuntu.com/ubuntu/ trusty-updates/main dpkg-dev all 1.17.5ubuntu5.4 [726 kB]
==> master: Get:8 http://archive.ubuntu.com/ubuntu/ trusty/main build-essential amd64 11.6ubuntu6 [4,838 B]
==> master: Get:9 http://archive.ubuntu.com/ubuntu/ trusty/main libalgorithm-diff-perl all 1.19.02-3 [50.0 kB]
==> master: Get:10 http://archive.ubuntu.com/ubuntu/ trusty/main libalgorithm-diff-xs-perl amd64 0.04-2build4 [12.6 kB]
==> master: Get:11 http://archive.ubuntu.com/ubuntu/ trusty/main libalgorithm-merge-perl all 0.08-2 [12.7 kB]
==> master: Get:12 http://archive.ubuntu.com/ubuntu/ trusty/main libfile-fcntllock-perl amd64 0.14-2build1 [15.9 kB]
==> master: Get:13 http://archive.ubuntu.com/ubuntu/ trusty/main libpython-dev amd64 2.7.5-5ubuntu3 [7,078 B]
==> master: Get:14 http://archive.ubuntu.com/ubuntu/ trusty-updates/main python3-pkg-resources all 3.3-1ubuntu2 [31.7 kB]
==> master: Get:15 http://archive.ubuntu.com/ubuntu/ trusty-updates/universe python-chardet-whl all 2.2.1-2~ubuntu1 [170 kB]
==> master: Get:16 http://archive.ubuntu.com/ubuntu/ trusty-updates/universe python-colorama all 0.2.5-0.1ubuntu2 [18.4 kB]
==> master: Get:17 http://archive.ubuntu.com/ubuntu/ trusty-updates/universe python-colorama-whl all 0.2.5-0.1ubuntu2 [18.2 kB]
==> master: Get:18 http://archive.ubuntu.com/ubuntu/ trusty-updates/main python2.7-dev amd64 2.7.6-8ubuntu0.2 [269 kB]
==> master: Get:19 http://archive.ubuntu.com/ubuntu/ trusty/main python-dev amd64 2.7.5-5ubuntu3 [1,166 B]
==> master: Get:20 http://archive.ubuntu.com/ubuntu/ trusty-updates/universe python-distlib all 0.1.8-1ubuntu1 [113 kB]
==> master: Get:21 http://archive.ubuntu.com/ubuntu/ trusty-updates/universe python-distlib-whl all 0.1.8-1ubuntu1 [140 kB]
==> master: Get:22 http://archive.ubuntu.com/ubuntu/ trusty-updates/main python-html5lib all 0.999-3~ubuntu1 [83.5 kB]
==> master: Get:23 http://archive.ubuntu.com/ubuntu/ trusty-updates/main python-html5lib-whl all 0.999-3~ubuntu1 [109 kB]
==> master: Get:24 http://archive.ubuntu.com/ubuntu/ trusty-updates/main python-six-whl all 1.5.2-1ubuntu1 [10.5 kB]
==> master: Get:25 http://archive.ubuntu.com/ubuntu/ trusty-updates/main python-urllib3-whl all 1.7.1-1ubuntu3 [64.0 kB]
==> master: Get:26 http://archive.ubuntu.com/ubuntu/ trusty-updates/main python-requests-whl all 2.2.1-1ubuntu0.3 [227 kB]
==> master: Get:27 http://archive.ubuntu.com/ubuntu/ trusty-updates/main python-setuptools-whl all 3.3-1ubuntu2 [244 kB]
==> master: Get:28 http://archive.ubuntu.com/ubuntu/ trusty-updates/universe python-pip-whl all 1.5.4-1ubuntu3 [111 kB]
==> master: Get:29 http://archive.ubuntu.com/ubuntu/ trusty-updates/main python-setuptools all 3.3-1ubuntu2 [230 kB]
==> master: Get:30 http://archive.ubuntu.com/ubuntu/ trusty-updates/universe python-pip all 1.5.4-1ubuntu3 [97.2 kB]
==> master: Get:31 http://archive.ubuntu.com/ubuntu/ trusty-updates/main python-wheel all 0.24.0-1~ubuntu1 [44.7 kB]
==> master: Get:32 http://archive.ubuntu.com/ubuntu/ trusty-updates/main unzip amd64 6.0-9ubuntu1.3 [157 kB]
==> master: dpkg-preconfigure: unable to re-open stdin: No such file or directory
==> master: Fetched 41.3 MB in 20s (2,027 kB/s)
==> master: Selecting previously unselected package libexpat1-dev:amd64.
==> master: (Reading database ... 61002 files and directories currently installed.)
==> master: Preparing to unpack .../libexpat1-dev_2.1.0-4ubuntu1.1_amd64.deb ...
==> master: Unpacking libexpat1-dev:amd64 (2.1.0-4ubuntu1.1) ...
==> master: Selecting previously unselected package libpython2.7-dev:amd64.
==> master: Preparing to unpack .../libpython2.7-dev_2.7.6-8ubuntu0.2_amd64.deb ...
==> master: Unpacking libpython2.7-dev:amd64 (2.7.6-8ubuntu0.2) ...
==> master: Selecting previously unselected package libstdc++-4.8-dev:amd64.
==> master: Preparing to unpack .../libstdc++-4.8-dev_4.8.4-2ubuntu1~14.04_amd64.deb ...
==> master: Unpacking libstdc++-4.8-dev:amd64 (4.8.4-2ubuntu1~14.04) ...
==> master: Selecting previously unselected package g++-4.8.
==> master: Preparing to unpack .../g++-4.8_4.8.4-2ubuntu1~14.04_amd64.deb ...
==> master: Unpacking g++-4.8 (4.8.4-2ubuntu1~14.04) ...
==> master: Selecting previously unselected package g++.
==> master: Preparing to unpack .../g++_4%3a4.8.2-1ubuntu6_amd64.deb ...
==> master: Unpacking g++ (4:4.8.2-1ubuntu6) ...
==> master: Selecting previously unselected package libdpkg-perl.
==> master: Preparing to unpack .../libdpkg-perl_1.17.5ubuntu5.4_all.deb ...
==> master: Unpacking libdpkg-perl (1.17.5ubuntu5.4) ...
==> master: Selecting previously unselected package dpkg-dev.
==> master: Preparing to unpack .../dpkg-dev_1.17.5ubuntu5.4_all.deb ...
==> master: Unpacking dpkg-dev (1.17.5ubuntu5.4) ...
==> master: Selecting previously unselected package build-essential.
==> master: Preparing to unpack .../build-essential_11.6ubuntu6_amd64.deb ...
==> master: Unpacking build-essential (11.6ubuntu6) ...
==> master: Selecting previously unselected package libalgorithm-diff-perl.
==> master: Preparing to unpack .../libalgorithm-diff-perl_1.19.02-3_all.deb ...
==> master: Unpacking libalgorithm-diff-perl (1.19.02-3) ...
==> master: Selecting previously unselected package libalgorithm-diff-xs-perl.
==> master: Preparing to unpack .../libalgorithm-diff-xs-perl_0.04-2build4_amd64.deb ...
==> master: Unpacking libalgorithm-diff-xs-perl (0.04-2build4) ...
==> master: Selecting previously unselected package libalgorithm-merge-perl.
==> master: Preparing to unpack .../libalgorithm-merge-perl_0.08-2_all.deb ...
==> master: Unpacking libalgorithm-merge-perl (0.08-2) ...
==> master: Selecting previously unselected package libfile-fcntllock-perl.
==> master: Preparing to unpack .../libfile-fcntllock-perl_0.14-2build1_amd64.deb ...
==> master: Unpacking libfile-fcntllock-perl (0.14-2build1) ...
==> master: Selecting previously unselected package libpython-dev:amd64.
==> master: Preparing to unpack .../libpython-dev_2.7.5-5ubuntu3_amd64.deb ...
==> master: Unpacking libpython-dev:amd64 (2.7.5-5ubuntu3) ...
==> master: Selecting previously unselected package python3-pkg-resources.
==> master: Preparing to unpack .../python3-pkg-resources_3.3-1ubuntu2_all.deb ...
==> master: Unpacking python3-pkg-resources (3.3-1ubuntu2) ...
==> master: Selecting previously unselected package python-chardet-whl.
==> master: Preparing to unpack .../python-chardet-whl_2.2.1-2~ubuntu1_all.deb ...
==> master: Unpacking python-chardet-whl (2.2.1-2~ubuntu1) ...
==> master: Selecting previously unselected package python-colorama.
==> master: Preparing to unpack .../python-colorama_0.2.5-0.1ubuntu2_all.deb ...
==> master: Unpacking python-colorama (0.2.5-0.1ubuntu2) ...
==> master: Selecting previously unselected package python-colorama-whl.
==> master: Preparing to unpack .../python-colorama-whl_0.2.5-0.1ubuntu2_all.deb ...
==> master: Unpacking python-colorama-whl (0.2.5-0.1ubuntu2) ...
==> master: Selecting previously unselected package python2.7-dev.
==> master: Preparing to unpack .../python2.7-dev_2.7.6-8ubuntu0.2_amd64.deb ...
==> master: Unpacking python2.7-dev (2.7.6-8ubuntu0.2) ...
==> master: Selecting previously unselected package python-dev.
==> master: Preparing to unpack .../python-dev_2.7.5-5ubuntu3_amd64.deb ...
==> master: Unpacking python-dev (2.7.5-5ubuntu3) ...
==> master: Selecting previously unselected package python-distlib.
==> master: Preparing to unpack .../python-distlib_0.1.8-1ubuntu1_all.deb ...
==> master: Unpacking python-distlib (0.1.8-1ubuntu1) ...
==> master: Selecting previously unselected package python-distlib-whl.
==> master: Preparing to unpack .../python-distlib-whl_0.1.8-1ubuntu1_all.deb ...
==> master: Unpacking python-distlib-whl (0.1.8-1ubuntu1) ...
==> master: Selecting previously unselected package python-html5lib.
==> master: Preparing to unpack .../python-html5lib_0.999-3~ubuntu1_all.deb ...
==> master: Unpacking python-html5lib (0.999-3~ubuntu1) ...
==> master: Selecting previously unselected package python-html5lib-whl.
==> master: Preparing to unpack .../python-html5lib-whl_0.999-3~ubuntu1_all.deb ...
==> master: Unpacking python-html5lib-whl (0.999-3~ubuntu1) ...
==> master: Selecting previously unselected package python-six-whl.
==> master: Preparing to unpack .../python-six-whl_1.5.2-1ubuntu1_all.deb ...
==> master: Unpacking python-six-whl (1.5.2-1ubuntu1) ...
==> master: Selecting previously unselected package python-urllib3-whl.
==> master: Preparing to unpack .../python-urllib3-whl_1.7.1-1ubuntu3_all.deb ...
==> master: Unpacking python-urllib3-whl (1.7.1-1ubuntu3) ...
==> master: Selecting previously unselected package python-requests-whl.
==> master: Preparing to unpack .../python-requests-whl_2.2.1-1ubuntu0.3_all.deb ...
==> master: Unpacking python-requests-whl (2.2.1-1ubuntu0.3) ...
==> master: Selecting previously unselected package python-setuptools-whl.
==> master: Preparing to unpack .../python-setuptools-whl_3.3-1ubuntu2_all.deb ...
==> master: Unpacking python-setuptools-whl (3.3-1ubuntu2) ...
==> master: Selecting previously unselected package python-pip-whl.
==> master: Preparing to unpack .../python-pip-whl_1.5.4-1ubuntu3_all.deb ...
==> master: Unpacking python-pip-whl (1.5.4-1ubuntu3) ...
==> master: Selecting previously unselected package python-setuptools.
==> master: Preparing to unpack .../python-setuptools_3.3-1ubuntu2_all.deb ...
==> master: Unpacking python-setuptools (3.3-1ubuntu2) ...
==> master: Selecting previously unselected package python-pip.
==> master: Preparing to unpack .../python-pip_1.5.4-1ubuntu3_all.deb ...
==> master: Unpacking python-pip (1.5.4-1ubuntu3) ...
==> master: Selecting previously unselected package python-wheel.
==> master: Preparing to unpack .../python-wheel_0.24.0-1~ubuntu1_all.deb ...
==> master: Unpacking python-wheel (0.24.0-1~ubuntu1) ...
==> master: Selecting previously unselected package unzip.
==> master: Preparing to unpack .../unzip_6.0-9ubuntu1.3_amd64.deb ...
==> master: Unpacking unzip (6.0-9ubuntu1.3) ...
==> master: Processing triggers for man-db (2.6.7.1-1ubuntu1) ...
==> master: Processing triggers for mime-support (3.54ubuntu1.1) ...
==> master: Setting up libexpat1-dev:amd64 (2.1.0-4ubuntu1.1) ...
==> master: Setting up libpython2.7-dev:amd64 (2.7.6-8ubuntu0.2) ...
==> master: Setting up libstdc++-4.8-dev:amd64 (4.8.4-2ubuntu1~14.04) ...
==> master: Setting up g++-4.8 (4.8.4-2ubuntu1~14.04) ...
==> master: Setting up g++ (4:4.8.2-1ubuntu6) ...
==> master: update-alternatives: using /usr/bin/g++ to provide /usr/bin/c++ (c++) in auto mode
==> master: Setting up libdpkg-perl (1.17.5ubuntu5.4) ...
==> master: Setting up dpkg-dev (1.17.5ubuntu5.4) ...
==> master: Setting up build-essential (11.6ubuntu6) ...
==> master: Setting up libalgorithm-diff-perl (1.19.02-3) ...
==> master: Setting up libalgorithm-diff-xs-perl (0.04-2build4) ...
==> master: Setting up libalgorithm-merge-perl (0.08-2) ...
==> master: Setting up libfile-fcntllock-perl (0.14-2build1) ...
==> master: Setting up libpython-dev:amd64 (2.7.5-5ubuntu3) ...
==> master: Setting up python3-pkg-resources (3.3-1ubuntu2) ...
==> master: Setting up python-chardet-whl (2.2.1-2~ubuntu1) ...
==> master: Setting up python-colorama (0.2.5-0.1ubuntu2) ...
==> master: Setting up python-colorama-whl (0.2.5-0.1ubuntu2) ...
==> master: Setting up python2.7-dev (2.7.6-8ubuntu0.2) ...
==> master: Setting up python-dev (2.7.5-5ubuntu3) ...
==> master: Setting up python-distlib (0.1.8-1ubuntu1) ...
==> master: Setting up python-distlib-whl (0.1.8-1ubuntu1) ...
==> master: Setting up python-html5lib (0.999-3~ubuntu1) ...
==> master: Setting up python-html5lib-whl (0.999-3~ubuntu1) ...
==> master: Setting up python-six-whl (1.5.2-1ubuntu1) ...
==> master: Setting up python-urllib3-whl (1.7.1-1ubuntu3) ...
==> master: Setting up python-requests-whl (2.2.1-1ubuntu0.3) ...
==> master: Setting up python-setuptools-whl (3.3-1ubuntu2) ...
==> master: Setting up python-pip-whl (1.5.4-1ubuntu3) ...
==> master: Setting up python-setuptools (3.3-1ubuntu2) ...
==> master: Setting up python-pip (1.5.4-1ubuntu3) ...
==> master: Setting up python-wheel (0.24.0-1~ubuntu1) ...
==> master: Setting up unzip (6.0-9ubuntu1.3) ...
==> master: Downloading/unpacking ansible
==> master:   Running setup.py (path:/tmp/pip_build_root/ansible/setup.py) egg_info for package ansible
==> master:
==> master:     no previously-included directories found matching 'v2'
==> master:     no previously-included directories found matching 'docsite'
==> master:     no previously-included directories found matching 'ticket_stubs'
==> master:     no previously-included directories found matching 'packaging'
==> master:     no previously-included directories found matching 'test'
==> master:     no previously-included directories found matching 'hacking'
==> master:     no previously-included directories found matching 'lib/ansible/modules/core/.git'
==> master:     no previously-included directories found matching 'lib/ansible/modules/extras/.git'
==> master: Downloading/unpacking paramiko (from ansible)
==> master: Downloading/unpacking jinja2 (from ansible)
==> master: Requirement already satisfied (use --upgrade to upgrade): PyYAML in /usr/lib/python2.7/dist-packages (from ansible)
==> master: Requirement already satisfied (use --upgrade to upgrade): setuptools in /usr/lib/python2.7/dist-packages (from ansible)
==> master: Requirement already satisfied (use --upgrade to upgrade): pycrypto>=2.6 in /usr/lib/python2.7/dist-packages (from ansible)
==> master: Downloading/unpacking ecdsa>=0.11 (from paramiko->ansible)
==> master: Downloading/unpacking MarkupSafe (from jinja2->ansible)
==> master:   Downloading MarkupSafe-0.23.tar.gz
==> master:   Running setup.py (path:/tmp/pip_build_root/MarkupSafe/setup.py) egg_info for package MarkupSafe
==> master:
==> master: Installing collected packages: ansible, paramiko, jinja2, ecdsa, MarkupSafe
==> master:   Running setup.py install for ansible
==> master:     changing mode of build/scripts-2.7/ansible from 644 to 755
==> master:     changing mode of build/scripts-2.7/ansible-playbook from 644 to 755
==> master:     changing mode of build/scripts-2.7/ansible-pull from 644 to 755
==> master:     changing mode of build/scripts-2.7/ansible-doc from 644 to 755
==> master:     changing mode of build/scripts-2.7/ansible-galaxy from 644 to 755
==> master:     changing mode of build/scripts-2.7/ansible-vault from 644 to 755
==> master:
==> master:     no previously-included directories found matching 'v2'
==> master:     no previously-included directories found matching 'docsite'
==> master:     no previously-included directories found matching 'ticket_stubs'
==> master:     no previously-included directories found matching 'test'
==> master:     no previously-included directories found matching 'hacking'
==> master:     no previously-included directories found matching 'lib/ansible/modules/core/.git'
==> master:     no previously-included directories found matching 'lib/ansible/modules/extras/.git'
==> master:     changing mode of /usr/local/bin/ansible-galaxy to 755
==> master:     changing mode of /usr/local/bin/ansible-playbook to 755
==> master:     changing mode of /usr/local/bin/ansible-doc to 755
==> master:     changing mode of /usr/local/bin/ansible-pull to 755
==> master:     changing mode of /usr/local/bin/ansible-vault to 755
==> master:     changing mode of /usr/local/bin/ansible to 755
==> master:   Running setup.py install for MarkupSafe
==> master:
==> master:     building 'markupsafe._speedups' extension
==> master:     x86_64-linux-gnu-gcc -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fPIC -I/usr/include/python2.7 -c markupsafe/_speedups.c -o build/temp.linux-x86_64-2.7/markupsafe/_speedups.o
==> master:     x86_64-linux-gnu-gcc -pthread -shared -Wl,-O1 -Wl,-Bsymbolic-functions -Wl,-Bsymbolic-functions -Wl,-z,relro -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -D_FORTIFY_SOURCE=2 -g -fstack-protector --param=ssp-buffer-size=4 -Wformat -Werror=format-security build/temp.linux-x86_64-2.7/markupsafe/_speedups.o -o build/lib.linux-x86_64-2.7/markupsafe/_speedups.so
==> master: Successfully installed ansible paramiko jinja2 ecdsa MarkupSafe
==> master: Cleaning up...
==> master: # 192.168.51.4 SSH-2.0-OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3
==> master: # 192.168.51.4 SSH-2.0-OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3
==> master: read (192.168.51.6): No route to host
==> master: read (192.168.51.6): No route to host
==> data1: Importing base box 'ubuntu/trusty64'...
==> data1: Matching MAC address for NAT networking...
==> data1: Checking if box 'ubuntu/trusty64' is up to date...
==> data1: A newer version of the box 'ubuntu/trusty64' is available! You currently
==> data1: have version '20150916.0.0'. The latest is version '20150924.0.0'. Run
==> data1: `vagrant box update` to update.
==> data1: Setting the name of the VM: data1
==> data1: Clearing any previously set forwarded ports...
==> data1: Fixed port collision for 22 => 2222. Now on port 2200.
==> data1: Clearing any previously set network interfaces...
==> data1: Preparing network interfaces based on configuration...
    data1: Adapter 1: nat
    data1: Adapter 2: hostonly
==> data1: Forwarding ports...
    data1: 22 => 2200 (adapter 1)
==> data1: Running 'pre-boot' VM customizations...
==> data1: Booting VM...
==> data1: Waiting for machine to boot. This may take a few minutes...
    data1: SSH address: 127.0.0.1:2200
    data1: SSH username: vagrant
    data1: SSH auth method: private key
    data1: Warning: Connection timeout. Retrying...
==> data1: Machine booted and ready!
==> data1: Checking for guest additions in VM...
==> data1: Setting hostname...
==> data1: Configuring and enabling network interfaces...
==> data1: Mounting shared folders...
    data1: /vagrant => C:/Users/watrous/Documents/hadoop
==> data1: Running provisioner: file...
==> data2: Importing base box 'ubuntu/trusty64'...
==> data2: Matching MAC address for NAT networking...
==> data2: Checking if box 'ubuntu/trusty64' is up to date...
==> data2: A newer version of the box 'ubuntu/trusty64' is available! You currently
==> data2: have version '20150916.0.0'. The latest is version '20150924.0.0'. Run
==> data2: `vagrant box update` to update.
==> data2: Setting the name of the VM: data2
==> data2: Clearing any previously set forwarded ports...
==> data2: Fixed port collision for 22 => 2222. Now on port 2201.
==> data2: Clearing any previously set network interfaces...
==> data2: Preparing network interfaces based on configuration...
    data2: Adapter 1: nat
    data2: Adapter 2: hostonly
==> data2: Forwarding ports...
    data2: 22 => 2201 (adapter 1)
==> data2: Running 'pre-boot' VM customizations...
==> data2: Booting VM...
==> data2: Waiting for machine to boot. This may take a few minutes...
    data2: SSH address: 127.0.0.1:2201
    data2: SSH username: vagrant
    data2: SSH auth method: private key
    data2: Warning: Connection timeout. Retrying...
==> data2: Machine booted and ready!
==> data2: Checking for guest additions in VM...
==> data2: Setting hostname...
==> data2: Configuring and enabling network interfaces...
==> data2: Mounting shared folders...
    data2: /vagrant => C:/Users/watrous/Documents/hadoop
==> data2: Running provisioner: file...

Shown in the output above is the bootstrap-master.sh script installing Ansible and other required libraries. At this point all three servers are ready for Hadoop to be installed, and your VirtualBox console would look something like this:

[Image: virtualbox-hadoop-hosts]

Limit to a single datanode

If you are low on RAM, you can make a couple of small changes to install only two servers with the same effect. To do this, change the following files:

  • Vagrantfile: Remove or comment the definition of the unwanted datanode
  • group_vars/all: Remove or comment the unused host
  • hosts-dev: Remove or comment the unused host

Conversely, it is possible to add as many datanodes as you like by modifying the same files above. Those changes will trickle through to as many hosts as you define. I’ll discuss that more in a future post when we use these same Ansible scripts to deploy to a cloud provider.

Install Hadoop

It’s now time to install Hadoop. There are several commented lines in the bootstrap-master.sh script that you can copy and paste to perform the next few steps. The easiest approach is to log in to the hadoop-master server and run the Ansible playbook.

Proxy management

If you happen to be behind a proxy, you’ll need to make sure that you update the proxy settings in bootstrap-master.sh and group_vars/all. For the group_vars, if you don’t have a proxy, just leave the none: false setting in place; otherwise the Ansible playbook will fail, since it expects that value to be a dictionary.
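A sketch of the relevant fragment of group_vars/all; apart from the none: false default, the key names here are assumptions:

proxy_env:
  none: false                                  # leave in place when no proxy is needed
#  http_proxy: http://proxy.example.com:8080   # set these when behind a proxy
#  https_proxy: http://proxy.example.com:8080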

Run the Ansible playbook

Below you can see the Ansible output from configuring and installing Hadoop and all its dependencies on all three servers in your new cluster.

vagrant@hadoop-master:~$ cd src/
vagrant@hadoop-master:~/src$ ansible-playbook -i hosts-dev playbook.yml
 
PLAY [Install hadoop master node] *********************************************
 
GATHERING FACTS ***************************************************************
ok: [192.168.51.4]
 
TASK: [common | group name=hadoop state=present] ******************************
changed: [192.168.51.4]
 
TASK: [common | user name=hadoop comment="Hadoop" group=hadoop shell=/bin/bash] ***
changed: [192.168.51.4]
 
TASK: [common | authorized_key user=hadoop key="ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDWeJfgWx7hDeZUJOeaIVzcbmYxzMcWfxhgC2975tvGL5BV6unzLz8ZVak6ju++AvnM5mcQp6Ydv73uWyaoQaFZigAzfuenruQkwc7D5YYuba+FgZdQ8VHon29oQA3iaZWG7xTspagrfq3fcqaz2ZIjzqN+E/MtcW08PwfibN2QRWchBCuZ1Q8AmrW7gClzMcgd/uj3TstabspGaaZMCs8aC9JWzZlMMegXKYHvVQs6xH2AmifpKpLoMTdO8jP4jczmGebPzvaXmvVylgwo6bRJ3tyYAmGwx8PHj2EVVQ0XX9ipgixLyAa2c7+/crPpGmKFRrYibCCT6x65px7nWnn3"] ***
changed: [192.168.51.4]
 
TASK: [common | unpack hadoop] ************************************************
changed: [192.168.51.4]
 
TASK: [common | command mv /usr/local/hadoop-2.7.1 /usr/local/hadoop creates=/usr/local/hadoop removes=/usr/local/hadoop-2.7.1] ***
changed: [192.168.51.4]
 
TASK: [common | lineinfile dest=/home/hadoop/.bashrc regexp="HADOOP_HOME=" line="export HADOOP_HOME=/usr/local/hadoop"] ***
changed: [192.168.51.4]
 
TASK: [common | lineinfile dest=/home/hadoop/.bashrc regexp="PATH=" line="export PATH=$PATH:$HADOOP_HOME/bin"] ***
changed: [192.168.51.4]
 
TASK: [common | lineinfile dest=/home/hadoop/.bashrc regexp="HADOOP_SSH_OPTS=" line="export HADOOP_SSH_OPTS=\"-i /home/hadoop/.ssh/hadoop_rsa\""] ***
changed: [192.168.51.4]
 
TASK: [common | Build hosts file] *********************************************
changed: [192.168.51.4] => (item={'ip': '192.168.51.4', 'hostname': 'hadoop-master'})
changed: [192.168.51.4] => (item={'ip': '192.168.51.5', 'hostname': 'hadoop-data1'})
changed: [192.168.51.4] => (item={'ip': '192.168.51.6', 'hostname': 'hadoop-data2'})
 
TASK: [common | lineinfile dest=/etc/hosts regexp='127.0.1.1' state=absent] ***
changed: [192.168.51.4]
 
TASK: [common | file path=/home/hadoop/tmp state=directory owner=hadoop group=hadoop mode=750] ***
changed: [192.168.51.4]
 
TASK: [common | file path=/home/hadoop/hadoop-data/hdfs/namenode state=directory owner=hadoop group=hadoop mode=750] ***
changed: [192.168.51.4]
 
TASK: [common | file path=/home/hadoop/hadoop-data/hdfs/datanode state=directory owner=hadoop group=hadoop mode=750] ***
changed: [192.168.51.4]
 
TASK: [common | Add the service scripts] **************************************
changed: [192.168.51.4] => (item={'dest': '/usr/local/hadoop/etc/hadoop/core-site.xml', 'src': 'core-site.xml'})
changed: [192.168.51.4] => (item={'dest': '/usr/local/hadoop/etc/hadoop/hdfs-site.xml', 'src': 'hdfs-site.xml'})
changed: [192.168.51.4] => (item={'dest': '/usr/local/hadoop/etc/hadoop/yarn-site.xml', 'src': 'yarn-site.xml'})
changed: [192.168.51.4] => (item={'dest': '/usr/local/hadoop/etc/hadoop/mapred-site.xml', 'src': 'mapred-site.xml'})
 
TASK: [common | lineinfile dest=/usr/local/hadoop/etc/hadoop/hadoop-env.sh regexp="^export JAVA_HOME" line="export JAVA_HOME=/usr/lib/jvm/java-8-oracle"] ***
changed: [192.168.51.4]
 
TASK: [common | ensure hostkeys is a known host] ******************************
# hadoop-master SSH-2.0-OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3
changed: [192.168.51.4] => (item={'ip': '192.168.51.4', 'hostname': 'hadoop-master'})
# hadoop-data1 SSH-2.0-OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3
changed: [192.168.51.4] => (item={'ip': '192.168.51.5', 'hostname': 'hadoop-data1'})
# hadoop-data2 SSH-2.0-OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3
changed: [192.168.51.4] => (item={'ip': '192.168.51.6', 'hostname': 'hadoop-data2'})
 
TASK: [oraclejava8 | apt_repository repo='deb http://ppa.launchpad.net/webupd8team/java/ubuntu trusty main' state=present] ***
changed: [192.168.51.4]
 
TASK: [oraclejava8 | apt_repository repo='deb-src http://ppa.launchpad.net/webupd8team/java/ubuntu trusty main' state=present] ***
changed: [192.168.51.4]
 
TASK: [oraclejava8 | debconf name='oracle-java8-installer' question='shared/accepted-oracle-license-v1-1' value='true' vtype='select' unseen=false] ***
changed: [192.168.51.4]
 
TASK: [oraclejava8 | apt_key keyserver=keyserver.ubuntu.com id=EEA14886] ******
changed: [192.168.51.4]
 
TASK: [oraclejava8 | Install Java] ********************************************
changed: [192.168.51.4]
 
TASK: [oraclejava8 | lineinfile dest=/home/hadoop/.bashrc regexp="^export JAVA_HOME" line="export JAVA_HOME=/usr/lib/jvm/java-8-oracle"] ***
changed: [192.168.51.4]
 
TASK: [master | Copy private key into place] **********************************
changed: [192.168.51.4]
 
TASK: [master | Copy slaves into place] ***************************************
changed: [192.168.51.4]
 
TASK: [master | prepare known_hosts] ******************************************
# 192.168.51.4 SSH-2.0-OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3
changed: [192.168.51.4] => (item={'ip': '192.168.51.4', 'hostname': 'hadoop-master'})
# 192.168.51.5 SSH-2.0-OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3
changed: [192.168.51.4] => (item={'ip': '192.168.51.5', 'hostname': 'hadoop-data1'})
# 192.168.51.6 SSH-2.0-OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3
changed: [192.168.51.4] => (item={'ip': '192.168.51.6', 'hostname': 'hadoop-data2'})
 
TASK: [master | add 0.0.0.0 to known_hosts for secondary namenode] ************
# 0.0.0.0 SSH-2.0-OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3
changed: [192.168.51.4]
 
PLAY [Install hadoop data nodes] **********************************************
 
GATHERING FACTS ***************************************************************
ok: [192.168.51.5]
ok: [192.168.51.6]
 
TASK: [common | group name=hadoop state=present] ******************************
changed: [192.168.51.5]
changed: [192.168.51.6]
 
TASK: [common | user name=hadoop comment="Hadoop" group=hadoop shell=/bin/bash] ***
changed: [192.168.51.5]
changed: [192.168.51.6]
 
TASK: [common | authorized_key user=hadoop key="ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDWeJfgWx7hDeZUJOeaIVzcbmYxzMcWfxhgC2975tvGL5BV6unzLz8ZVak6ju++AvnM5mcQp6Ydv73uWyaoQaFZigAzfuenruQkwc7D5YYuba+FgZdQ8VHon29oQA3iaZWG7xTspagrfq3fcqaz2ZIjzqN+E/MtcW08PwfibN2QRWchBCuZ1Q8AmrW7gClzMcgd/uj3TstabspGaaZMCs8aC9JWzZlMMegXKYHvVQs6xH2AmifpKpLoMTdO8jP4jczmGebPzvaXmvVylgwo6bRJ3tyYAmGwx8PHj2EVVQ0XX9ipgixLyAa2c7+/crPpGmKFRrYibCCT6x65px7nWnn3"] ***
changed: [192.168.51.5]
changed: [192.168.51.6]
 
TASK: [common | unpack hadoop] ************************************************
changed: [192.168.51.5]
changed: [192.168.51.6]
 
TASK: [common | command mv /usr/local/hadoop-2.7.1 /usr/local/hadoop creates=/usr/local/hadoop removes=/usr/local/hadoop-2.7.1] ***
changed: [192.168.51.6]
changed: [192.168.51.5]
 
TASK: [common | lineinfile dest=/home/hadoop/.bashrc regexp="HADOOP_HOME=" line="export HADOOP_HOME=/usr/local/hadoop"] ***
changed: [192.168.51.6]
changed: [192.168.51.5]
 
TASK: [common | lineinfile dest=/home/hadoop/.bashrc regexp="PATH=" line="export PATH=$PATH:$HADOOP_HOME/bin"] ***
changed: [192.168.51.5]
changed: [192.168.51.6]
 
TASK: [common | lineinfile dest=/home/hadoop/.bashrc regexp="HADOOP_SSH_OPTS=" line="export HADOOP_SSH_OPTS=\"-i /home/hadoop/.ssh/hadoop_rsa\""] ***
changed: [192.168.51.5]
changed: [192.168.51.6]
 
TASK: [common | Build hosts file] *********************************************
changed: [192.168.51.5] => (item={'ip': '192.168.51.4', 'hostname': 'hadoop-master'})
changed: [192.168.51.6] => (item={'ip': '192.168.51.4', 'hostname': 'hadoop-master'})
changed: [192.168.51.5] => (item={'ip': '192.168.51.5', 'hostname': 'hadoop-data1'})
changed: [192.168.51.6] => (item={'ip': '192.168.51.5', 'hostname': 'hadoop-data1'})
changed: [192.168.51.5] => (item={'ip': '192.168.51.6', 'hostname': 'hadoop-data2'})
changed: [192.168.51.6] => (item={'ip': '192.168.51.6', 'hostname': 'hadoop-data2'})
 
TASK: [common | lineinfile dest=/etc/hosts regexp='127.0.1.1' state=absent] ***
changed: [192.168.51.6]
changed: [192.168.51.5]
 
TASK: [common | file path=/home/hadoop/tmp state=directory owner=hadoop group=hadoop mode=750] ***
changed: [192.168.51.6]
changed: [192.168.51.5]
 
TASK: [common | file path=/home/hadoop/hadoop-data/hdfs/namenode state=directory owner=hadoop group=hadoop mode=750] ***
changed: [192.168.51.5]
changed: [192.168.51.6]
 
TASK: [common | file path=/home/hadoop/hadoop-data/hdfs/datanode state=directory owner=hadoop group=hadoop mode=750] ***
changed: [192.168.51.5]
changed: [192.168.51.6]
 
TASK: [common | Add the service scripts] **************************************
changed: [192.168.51.5] => (item={'dest': '/usr/local/hadoop/etc/hadoop/core-site.xml', 'src': 'core-site.xml'})
changed: [192.168.51.6] => (item={'dest': '/usr/local/hadoop/etc/hadoop/core-site.xml', 'src': 'core-site.xml'})
changed: [192.168.51.5] => (item={'dest': '/usr/local/hadoop/etc/hadoop/hdfs-site.xml', 'src': 'hdfs-site.xml'})
changed: [192.168.51.6] => (item={'dest': '/usr/local/hadoop/etc/hadoop/hdfs-site.xml', 'src': 'hdfs-site.xml'})
changed: [192.168.51.6] => (item={'dest': '/usr/local/hadoop/etc/hadoop/yarn-site.xml', 'src': 'yarn-site.xml'})
changed: [192.168.51.5] => (item={'dest': '/usr/local/hadoop/etc/hadoop/yarn-site.xml', 'src': 'yarn-site.xml'})
changed: [192.168.51.6] => (item={'dest': '/usr/local/hadoop/etc/hadoop/mapred-site.xml', 'src': 'mapred-site.xml'})
changed: [192.168.51.5] => (item={'dest': '/usr/local/hadoop/etc/hadoop/mapred-site.xml', 'src': 'mapred-site.xml'})
 
TASK: [common | lineinfile dest=/usr/local/hadoop/etc/hadoop/hadoop-env.sh regexp="^export JAVA_HOME" line="export JAVA_HOME=/usr/lib/jvm/java-8-oracle"] ***
changed: [192.168.51.5]
changed: [192.168.51.6]
 
TASK: [common | ensure hostkeys is a known host] ******************************
# hadoop-master SSH-2.0-OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3
# hadoop-master SSH-2.0-OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3
changed: [192.168.51.5] => (item={'ip': '192.168.51.4', 'hostname': 'hadoop-master'})
changed: [192.168.51.6] => (item={'ip': '192.168.51.4', 'hostname': 'hadoop-master'})
# hadoop-data1 SSH-2.0-OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3
# hadoop-data1 SSH-2.0-OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3
changed: [192.168.51.5] => (item={'ip': '192.168.51.5', 'hostname': 'hadoop-data1'})
# hadoop-data2 SSH-2.0-OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3
changed: [192.168.51.6] => (item={'ip': '192.168.51.5', 'hostname': 'hadoop-data1'})
# hadoop-data2 SSH-2.0-OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3
changed: [192.168.51.5] => (item={'ip': '192.168.51.6', 'hostname': 'hadoop-data2'})
changed: [192.168.51.6] => (item={'ip': '192.168.51.6', 'hostname': 'hadoop-data2'})
 
TASK: [oraclejava8 | apt_repository repo='deb http://ppa.launchpad.net/webupd8team/java/ubuntu trusty main' state=present] ***
changed: [192.168.51.5]
changed: [192.168.51.6]
 
TASK: [oraclejava8 | apt_repository repo='deb-src http://ppa.launchpad.net/webupd8team/java/ubuntu trusty main' state=present] ***
changed: [192.168.51.5]
changed: [192.168.51.6]
 
TASK: [oraclejava8 | debconf name='oracle-java8-installer' question='shared/accepted-oracle-license-v1-1' value='true' vtype='select' unseen=false] ***
changed: [192.168.51.5]
changed: [192.168.51.6]
 
TASK: [oraclejava8 | apt_key keyserver=keyserver.ubuntu.com id=EEA14886] ******
changed: [192.168.51.5]
changed: [192.168.51.6]
 
TASK: [oraclejava8 | Install Java] ********************************************
changed: [192.168.51.6]
changed: [192.168.51.5]
 
TASK: [oraclejava8 | lineinfile dest=/home/hadoop/.bashrc regexp="^export JAVA_HOME" line="export JAVA_HOME=/usr/lib/jvm/java-8-oracle"] ***
changed: [192.168.51.6]
changed: [192.168.51.5]
 
PLAY RECAP ********************************************************************
192.168.51.4               : ok=27   changed=26   unreachable=0    failed=0
192.168.51.5               : ok=23   changed=22   unreachable=0    failed=0
192.168.51.6               : ok=23   changed=22   unreachable=0    failed=0

Start Hadoop and run a job

Now that you have Hadoop installed, it’s time to format HDFS and start up all the services. All the commands to do this are available as comments in the bootstrap-master.sh file. The first step is to format the HDFS namenode. All of the commands that follow are executed as the hadoop user.

vagrant@hadoop-master:~/src$ sudo su - hadoop
hadoop@hadoop-master:~$ hdfs namenode -format
15/09/30 16:06:36 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = hadoop-master/192.168.51.4
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 2.7.1
STARTUP_MSG:   classpath = [truncated]
STARTUP_MSG:   build = https://git-wip-us.apache.org/repos/asf/hadoop.git -r 15ecc87ccf4a0228f35af08fc56de536e6ce657a; compiled by 'jenkins' on 2015-06-29T06:04Z
STARTUP_MSG:   java = 1.8.0_60
************************************************************/
15/09/30 16:06:36 INFO namenode.NameNode: registered UNIX signal handlers for [TERM, HUP, INT]
15/09/30 16:06:36 INFO namenode.NameNode: createNameNode [-format]
15/09/30 16:06:36 WARN common.Util: Path /home/hadoop/hadoop-data/hdfs/namenode should be specified as a URI in configuration files. Please update hdfs configuration.
15/09/30 16:06:36 WARN common.Util: Path /home/hadoop/hadoop-data/hdfs/namenode should be specified as a URI in configuration files. Please update hdfs configuration.
Formatting using clusterid: CID-1c37e2f0-ba4b-4ad7-84d7-223dec53d34a
15/09/30 16:06:36 INFO namenode.FSNamesystem: No KeyProvider found.
15/09/30 16:06:36 INFO namenode.FSNamesystem: fsLock is fair:true
15/09/30 16:06:36 INFO blockmanagement.DatanodeManager: dfs.block.invalidate.limit=1000
15/09/30 16:06:36 INFO blockmanagement.DatanodeManager: dfs.namenode.datanode.registration.ip-hostname-check=true
15/09/30 16:06:36 INFO blockmanagement.BlockManager: dfs.namenode.startup.delay.block.deletion.sec is set to 000:00:00:00.000
15/09/30 16:06:36 INFO blockmanagement.BlockManager: The block deletion will start around 2015 Sep 30 16:06:36
15/09/30 16:06:36 INFO util.GSet: Computing capacity for map BlocksMap
15/09/30 16:06:36 INFO util.GSet: VM type       = 64-bit
15/09/30 16:06:36 INFO util.GSet: 2.0% max memory 889 MB = 17.8 MB
15/09/30 16:06:36 INFO util.GSet: capacity      = 2^21 = 2097152 entries
15/09/30 16:06:36 INFO blockmanagement.BlockManager: dfs.block.access.token.enable=false
15/09/30 16:06:36 INFO blockmanagement.BlockManager: defaultReplication         = 2
15/09/30 16:06:36 INFO blockmanagement.BlockManager: maxReplication             = 512
15/09/30 16:06:36 INFO blockmanagement.BlockManager: minReplication             = 1
15/09/30 16:06:36 INFO blockmanagement.BlockManager: maxReplicationStreams      = 2
15/09/30 16:06:36 INFO blockmanagement.BlockManager: shouldCheckForEnoughRacks  = false
15/09/30 16:06:36 INFO blockmanagement.BlockManager: replicationRecheckInterval = 3000
15/09/30 16:06:36 INFO blockmanagement.BlockManager: encryptDataTransfer        = false
15/09/30 16:06:36 INFO blockmanagement.BlockManager: maxNumBlocksToLog          = 1000
15/09/30 16:06:36 INFO namenode.FSNamesystem: fsOwner             = hadoop (auth:SIMPLE)
15/09/30 16:06:36 INFO namenode.FSNamesystem: supergroup          = supergroup
15/09/30 16:06:36 INFO namenode.FSNamesystem: isPermissionEnabled = true
15/09/30 16:06:36 INFO namenode.FSNamesystem: HA Enabled: false
15/09/30 16:06:36 INFO namenode.FSNamesystem: Append Enabled: true
15/09/30 16:06:37 INFO util.GSet: Computing capacity for map INodeMap
15/09/30 16:06:37 INFO util.GSet: VM type       = 64-bit
15/09/30 16:06:37 INFO util.GSet: 1.0% max memory 889 MB = 8.9 MB
15/09/30 16:06:37 INFO util.GSet: capacity      = 2^20 = 1048576 entries
15/09/30 16:06:37 INFO namenode.FSDirectory: ACLs enabled? false
15/09/30 16:06:37 INFO namenode.FSDirectory: XAttrs enabled? true
15/09/30 16:06:37 INFO namenode.FSDirectory: Maximum size of an xattr: 16384
15/09/30 16:06:37 INFO namenode.NameNode: Caching file names occuring more than 10 times
15/09/30 16:06:37 INFO util.GSet: Computing capacity for map cachedBlocks
15/09/30 16:06:37 INFO util.GSet: VM type       = 64-bit
15/09/30 16:06:37 INFO util.GSet: 0.25% max memory 889 MB = 2.2 MB
15/09/30 16:06:37 INFO util.GSet: capacity      = 2^18 = 262144 entries
15/09/30 16:06:37 INFO namenode.FSNamesystem: dfs.namenode.safemode.threshold-pct = 0.9990000128746033
15/09/30 16:06:37 INFO namenode.FSNamesystem: dfs.namenode.safemode.min.datanodes = 0
15/09/30 16:06:37 INFO namenode.FSNamesystem: dfs.namenode.safemode.extension     = 30000
15/09/30 16:06:37 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.window.num.buckets = 10
15/09/30 16:06:37 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.num.users = 10
15/09/30 16:06:37 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.windows.minutes = 1,5,25
15/09/30 16:06:37 INFO namenode.FSNamesystem: Retry cache on namenode is enabled
15/09/30 16:06:37 INFO namenode.FSNamesystem: Retry cache will use 0.03 of total heap and retry cache entry expiry time is 600000 millis
15/09/30 16:06:37 INFO util.GSet: Computing capacity for map NameNodeRetryCache
15/09/30 16:06:37 INFO util.GSet: VM type       = 64-bit
15/09/30 16:06:37 INFO util.GSet: 0.029999999329447746% max memory 889 MB = 273.1 KB
15/09/30 16:06:37 INFO util.GSet: capacity      = 2^15 = 32768 entries
15/09/30 16:06:37 INFO namenode.FSImage: Allocated new BlockPoolId: BP-992546781-192.168.51.4-1443629197156
15/09/30 16:06:37 INFO common.Storage: Storage directory /home/hadoop/hadoop-data/hdfs/namenode has been successfully formatted.
15/09/30 16:06:37 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
15/09/30 16:06:37 INFO util.ExitUtil: Exiting with status 0
15/09/30 16:06:37 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at hadoop-master/192.168.51.4
************************************************************/

Start DFS

Next, start the DFS services, as shown.

hadoop@hadoop-master:~$ /usr/local/hadoop/sbin/start-dfs.sh
Starting namenodes on [hadoop-master]
hadoop-master: Warning: Permanently added the RSA host key for IP address '192.168.51.4' to the list of known hosts.
hadoop-master: starting namenode, logging to /usr/local/hadoop/logs/hadoop-hadoop-namenode-hadoop-master.out
hadoop-data2: Warning: Permanently added the RSA host key for IP address '192.168.51.6' to the list of known hosts.
hadoop-data1: Warning: Permanently added the RSA host key for IP address '192.168.51.5' to the list of known hosts.
hadoop-master: starting datanode, logging to /usr/local/hadoop/logs/hadoop-hadoop-datanode-hadoop-master.out
hadoop-data2: starting datanode, logging to /usr/local/hadoop/logs/hadoop-hadoop-datanode-hadoop-data2.out
hadoop-data1: starting datanode, logging to /usr/local/hadoop/logs/hadoop-hadoop-datanode-hadoop-data1.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /usr/local/hadoop/logs/hadoop-hadoop-secondarynamenode-hadoop-master.out

At this point you can access the HDFS status and see all three datanodes attached with this URL: http://192.168.51.4:50070/dfshealth.html#tab-datanode.

Start YARN

Next, start the YARN services, as shown.

hadoop@hadoop-master:~$ /usr/local/hadoop/sbin/start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to /usr/local/hadoop/logs/yarn-hadoop-resourcemanager-hadoop-master.out
hadoop-data2: starting nodemanager, logging to /usr/local/hadoop/logs/yarn-hadoop-nodemanager-hadoop-data2.out
hadoop-data1: starting nodemanager, logging to /usr/local/hadoop/logs/yarn-hadoop-nodemanager-hadoop-data1.out
hadoop-master: starting nodemanager, logging to /usr/local/hadoop/logs/yarn-hadoop-nodemanager-hadoop-master.out

At this point you can access information about the compute nodes in the cluster and currently running jobs at this URL: http://192.168.51.4:8088/cluster/nodes

Verify that Java processes are running

Hadoop provides a useful script to run a command on all nodes listed in the slaves file. For example, you can confirm that all of the expected Java processes are running with the following command.

hadoop@hadoop-master:~$ $HADOOP_HOME/sbin/slaves.sh jps
hadoop-data2: 3872 DataNode
hadoop-data2: 4180 Jps
hadoop-data2: 4021 NodeManager
hadoop-master: 7617 NameNode
hadoop-data1: 3872 DataNode
hadoop-data1: 4180 Jps
hadoop-master: 8675 Jps
hadoop-data1: 4021 NodeManager
hadoop-master: 8309 NodeManager
hadoop-master: 8150 ResourceManager
hadoop-master: 7993 SecondaryNameNode
hadoop-master: 7788 DataNode

Run an example job

Finally, it’s possible to confirm that everything is working by running one of the example jobs. Let’s estimate the value of pi.

hadoop@hadoop-master:~$ hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar pi 10 30
Number of Maps  = 10
Samples per Map = 30
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Wrote input for Map #4
Wrote input for Map #5
Wrote input for Map #6
Wrote input for Map #7
Wrote input for Map #8
Wrote input for Map #9
Starting Job
15/09/30 19:54:28 INFO client.RMProxy: Connecting to ResourceManager at hadoop-master/192.168.51.4:8032
15/09/30 19:54:29 INFO input.FileInputFormat: Total input paths to process : 10
15/09/30 19:54:29 INFO mapreduce.JobSubmitter: number of splits:10
15/09/30 19:54:29 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1443642855962_0001
15/09/30 19:54:29 INFO impl.YarnClientImpl: Submitted application application_1443642855962_0001
15/09/30 19:54:29 INFO mapreduce.Job: The url to track the job: http://hadoop-master:8088/proxy/application_1443642855962_0001/
15/09/30 19:54:29 INFO mapreduce.Job: Running job: job_1443642855962_0001
15/09/30 19:54:38 INFO mapreduce.Job: Job job_1443642855962_0001 running in uber mode : false
15/09/30 19:54:38 INFO mapreduce.Job:  map 0% reduce 0%
15/09/30 19:54:52 INFO mapreduce.Job:  map 40% reduce 0%
15/09/30 19:54:56 INFO mapreduce.Job:  map 100% reduce 0%
15/09/30 19:54:59 INFO mapreduce.Job:  map 100% reduce 100%
15/09/30 19:54:59 INFO mapreduce.Job: Job job_1443642855962_0001 completed successfully
15/09/30 19:54:59 INFO mapreduce.Job: Counters: 49
        File System Counters
                FILE: Number of bytes read=226
                FILE: Number of bytes written=1272744
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=2710
                HDFS: Number of bytes written=215
                HDFS: Number of read operations=43
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=3
        Job Counters
                Launched map tasks=10
                Launched reduce tasks=1
                Data-local map tasks=10
                Total time spent by all maps in occupied slots (ms)=140318
                Total time spent by all reduces in occupied slots (ms)=4742
                Total time spent by all map tasks (ms)=140318
                Total time spent by all reduce tasks (ms)=4742
                Total vcore-seconds taken by all map tasks=140318
                Total vcore-seconds taken by all reduce tasks=4742
                Total megabyte-seconds taken by all map tasks=143685632
                Total megabyte-seconds taken by all reduce tasks=4855808
        Map-Reduce Framework
                Map input records=10
                Map output records=20
                Map output bytes=180
                Map output materialized bytes=280
                Input split bytes=1530
                Combine input records=0
                Combine output records=0
                Reduce input groups=2
                Reduce shuffle bytes=280
                Reduce input records=20
                Reduce output records=0
                Spilled Records=40
                Shuffled Maps =10
                Failed Shuffles=0
                Merged Map outputs=10
                GC time elapsed (ms)=3509
                CPU time spent (ms)=5620
                Physical memory (bytes) snapshot=2688745472
                Virtual memory (bytes) snapshot=20847497216
                Total committed heap usage (bytes)=2040528896
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=1180
        File Output Format Counters
                Bytes Written=97
Job Finished in 31.245 seconds
Estimated value of Pi is 3.16000000000000000000

Security and Configuration

This example is not production hardened. It does nothing to address firewall management, and the key management is permissive, intended to make it easy for nodes to communicate with each other. For a production deployment, it should be easy to add a role to set up the firewall (sketched below), and you may also want to be more cautious about accepting host keys between hosts.
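As a sketch of what such a role might contain (everything below is illustrative, not part of the repository), a few tasks using Ansible’s ufw module could restrict access to the web interfaces used in this post. Note that a real cluster would also need the Hadoop RPC and data-transfer ports opened between nodes before enabling a default-deny policy.

# illustrative firewall tasks using the ufw module (not in the repository)
- name: allow ssh
  ufw: rule=allow port=22 proto=tcp

- name: allow the Hadoop web UIs from the private network
  ufw: rule=allow from_ip=192.168.51.0/24 port={{ item }} proto=tcp
  with_items:
    - 50070   # HDFS NameNode UI
    - 8088    # YARN ResourceManager UI

- name: enable ufw with a default deny policy
  ufw: state=enabled policy=deny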

Default Ports

Many people ask what the default ports are for Hadoop services. The following four links list every property that can be set for each of the main components, including the default used when a property is absent from the configuration file. Any property that isn’t overridden in the Ansible playbook role templates in the git repository takes the default shown in these links (a variables sketch follows the links).

https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-common/yarn-default.xml
http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/core-default.xml
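One way to avoid relying silently on those defaults is to surface the ports you care about as playbook variables that the role templates render. The sketch below is illustrative only; the variable names are invented and the values are the documented defaults.

# hypothetical group_vars sketch; values are the documented defaults
hadoop_namenode_http_port: 50070        # dfs.namenode.http-address
yarn_resourcemanager_webapp_port: 8088  # yarn.resourcemanager.webapp.address
mapreduce_jobhistory_port: 10020        # mapreduce.jobhistory.address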

Problems spanning subnets

While developing this automation, I originally had the datanodes running on a separate subnet. There’s a bug in Hadoop that prevents nodes from communicating across subnets. The following thread covers some of the discussion.

http://mail-archives.apache.org/mod_mbox/hadoop-user/201509.mbox/%3CCAKFXasEROCe%2BfL%2B8T7A3L0j4Qrm%3D4HHuzGfJhNuZ5MqUvQ%3DwjA%40mail.gmail.com%3E

Resources

While developing my Ansible scripts I leaned heavily on this tutorial:
https://chawlasumit.wordpress.com/2015/03/09/install-a-multi-node-hadoop-cluster-on-ubuntu-14-04/

Software Engineering

Encryption of secrets in source code (AESCrypt + Ansible)

The more I automate, the more I have to answer the question of how to manage my secrets. Secrets that frequently come up include:

  • SSH key pairs
  • SSL private keys
  • Credentials for external resources, such as databases and SaaS integrations

Before cloud, when server resources were not ephemeral, these secrets could be managed manually when the server was created. In cloud environments, servers are created and destroyed automatically, sometimes minute to minute, which raises the question of how to manage secrets.

The OpenStack community is working on one solution called Barbican. I’ve been looking at more local solutions to secret management that accommodate storage in a code repository alongside the application code that will be deployed. Some benefits to a local solution include:

  • No additional systems to maintain
  • Secrets can be versioned
  • Infrastructure as code

Most of my automation efforts recently are centered on Ansible. One Ansible-specific solution is Vault. The drawback to building secrets into Ansible Vault is lock-in to Ansible. In the future I may want to leverage other orchestration tools, such as Puppet, Chef or Salt.

openssl

One option that works well for *nix-only workloads is openssl. OpenSSL is widely used and available by default on virtually every Linux system.

Using openssl is straightforward:

# encrypt server.key with AES-256-CBC; writes server.key.aes
openssl enc -aes-256-cbc -salt -in server.key -out server.key.aes -pass pass:secret
# decrypt server.key.aes back to server.key with the same password
openssl enc -d -aes-256-cbc -in server.key.aes -out server.key -pass pass:secret

AESCrypt

I ended up choosing AESCrypt as my solution. Key reasons for this choice include native binaries for Windows, Linux and Mac, and very strong AES encryption. The decryption key can be provided to Ansible as a variable or handled manually on the resulting servers.

Using AESCrypt is simple. Below is a short session on Windows.

C:\Users\Daniel Watrous\Documents\work\aescrypt>ls -la
total 204
drw-rw-rw-   2 Daniel Watrous 2 0      0 2015-06-30 16:06 .
drw-rw-rw-  73 Daniel Watrous 2 0  49152 2015-06-30 16:04 ..
-rwxrwxrwx   1 Daniel Watrous 2 0 155136 2015-06-30 16:05 aescrypt.exe
-rw-rw-rw-   1 Daniel Watrous 2 0    896 2015-06-30 16:06 mykey.txt
 
C:\Users\Daniel Watrous\Documents\work\aescrypt>cat mykey.txt
-----BEGIN RSA PRIVATE KEY-----
MIICXAIBAAKBgQCqGKukO1De7zhZj6+H0qtjTkVxwTCpvKe4eCZ0FPqri0cb2JZfXJ/DgYSF6vUp
wmJG8wVQZKjeGcjDOL5UlsuusFncCzWBQ7RKNUSesmQRMSGkVb1/3j+skZ6UtW+5u09lHNsj6tQ5
1s1SPrCBkedbNf0Tp0GbMJDyR4e9T04ZZwIDAQABAoGAFijko56+qGyN8M0RVyaRAXz++xTqHBLh
3tx4VgMtrQ+WEgCjhoTwo23KMBAuJGSYnRmoBZM3lMfTKevIkAidPExvYCdm5dYq3XToLkkLv5L2
pIIVOFMDG+KESnAFV7l2c+cnzRMW0+b6f8mR1CJzZuxVLL6Q02fvLi55/mbSYxECQQDeAw6fiIQX
GukBI4eMZZt4nscy2o12KyYner3VpoeE+Np2q+Z3pvAMd/aNzQ/W9WaI+NRfcxUJrmfPwIGm63il
AkEAxCL5HQb2bQr4ByorcMWm/hEP2MZzROV73yF41hPsRC9m66KrheO9HPTJuo3/9s5p+sqGxOlF
L0NDt4SkosjgGwJAFklyR1uZ/wPJjj611cdBcztlPdqoxssQGnh85BzCj/u3WqBpE2vjvyyvyI5k
X6zk7S0ljKtt2jny2+00VsBerQJBAJGC1Mg5Oydo5NwD6BiROrPxGo2bpTbu/fhrT8ebHkTz2epl
U9VQQSQzY1oZMVX8i1m5WUTLPz2yLJIBQVdXqhMCQBGoiuSoSjafUhV7i1cEGpb88h5NBYZzWXGZ
37sJ5QsW+sJyoNde3xH8vdXhzU7eT82D6X/scw9RZz+/6rCJ4p0=
-----END RSA PRIVATE KEY-----
C:\Users\Daniel Watrous\Documents\work\aescrypt>aescrypt -e -p secret mykey.txt
 
C:\Users\Daniel Watrous\Documents\work\aescrypt>ls -la
total 208
drw-rw-rw-   2 Daniel Watrous 2 0      0 2015-06-30 16:07 .
drw-rw-rw-  73 Daniel Watrous 2 0  49152 2015-06-30 16:04 ..
-rwxrwxrwx   1 Daniel Watrous 2 0 155136 2015-06-30 16:05 aescrypt.exe
-rw-rw-rw-   1 Daniel Watrous 2 0    896 2015-06-30 16:06 mykey.txt
-rw-rw-rw-   1 Daniel Watrous 2 0   1188 2015-06-30 16:07 mykey.txt.aes
 
C:\Users\Daniel Watrous\Documents\work\aescrypt>cat mykey.txt.aes
AES☻  ↑CREATED_BY aescrypt 3.10 ?
                                           <%8±d>áFo♥p♠xÿ~
C:\Users\Daniel Watrous\Documents\work\aescrypt>rm mykey.txt
 
C:\Users\Daniel Watrous\Documents\work\aescrypt>aescrypt.exe -d -p secret mykey.txt.aes
 
C:\Users\Daniel Watrous\Documents\work\aescrypt>cat mykey.txt
-----BEGIN RSA PRIVATE KEY-----
MIICXAIBAAKBgQCqGKukO1De7zhZj6+H0qtjTkVxwTCpvKe4eCZ0FPqri0cb2JZfXJ/DgYSF6vUp
wmJG8wVQZKjeGcjDOL5UlsuusFncCzWBQ7RKNUSesmQRMSGkVb1/3j+skZ6UtW+5u09lHNsj6tQ5
1s1SPrCBkedbNf0Tp0GbMJDyR4e9T04ZZwIDAQABAoGAFijko56+qGyN8M0RVyaRAXz++xTqHBLh
3tx4VgMtrQ+WEgCjhoTwo23KMBAuJGSYnRmoBZM3lMfTKevIkAidPExvYCdm5dYq3XToLkkLv5L2
pIIVOFMDG+KESnAFV7l2c+cnzRMW0+b6f8mR1CJzZuxVLL6Q02fvLi55/mbSYxECQQDeAw6fiIQX
GukBI4eMZZt4nscy2o12KyYner3VpoeE+Np2q+Z3pvAMd/aNzQ/W9WaI+NRfcxUJrmfPwIGm63il
AkEAxCL5HQb2bQr4ByorcMWm/hEP2MZzROV73yF41hPsRC9m66KrheO9HPTJuo3/9s5p+sqGxOlF
L0NDt4SkosjgGwJAFklyR1uZ/wPJjj611cdBcztlPdqoxssQGnh85BzCj/u3WqBpE2vjvyyvyI5k
X6zk7S0ljKtt2jny2+00VsBerQJBAJGC1Mg5Oydo5NwD6BiROrPxGo2bpTbu/fhrT8ebHkTz2epl
U9VQQSQzY1oZMVX8i1m5WUTLPz2yLJIBQVdXqhMCQBGoiuSoSjafUhV7i1cEGpb88h5NBYZzWXGZ
37sJ5QsW+sJyoNde3xH8vdXhzU7eT82D6X/scw9RZz+/6rCJ4p0=
-----END RSA PRIVATE KEY-----

Automation

It should be obvious from the above example that the decryption still requires a password at some point, and that password should NOT be stored in the code repository. The decrypted files should also be excluded from revision control (e.g. via .gitignore). In other words, commit only the encrypted files.

The encryption password needs to be stored somewhere. One option is to keep it in your head. Another might be to keep it in LastPass or some other password manager, but be sure to keep it out of the repository where you keep the encrypted secrets.

The password can then be provided when calling the Ansible playbook, for example as an extra variable.
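As a sketch of what that could look like (the paths and variable name here are hypothetical, not from a real playbook), a task can run aescrypt against the committed file, with the password passed at invocation time:

# hypothetical decryption task; run with something like:
#   ansible-playbook site.yml --extra-vars "aescrypt_password=secret"
# note: -p exposes the password to the process list, which is fine for
# a lab but questionable for production
- name: decrypt server key
  command: aescrypt -d -p {{ aescrypt_password }} /etc/ssl/private/server.key.aes creates=/etc/ssl/private/server.key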

Software Engineering

The Road to PaaS

I have observed that discussions about CloudFoundry often lack accurate context. Some questions I get that indicate context is missing include:

  • What Java version does CloudFoundry support?
  • What database products/versions are available?
  • How can I access the server directly?

There are a few reasons that the questions above are not relevant for CloudFoundry (or any modern PaaS environment). To understand why, it’s important to understand how we got to PaaS and where we came from.

[Image: cloudfoundry-compared-traditional]

Landscape

When computers were first becoming a common requirement for the enterprise, most applications were monolithic. All application components would run on the same general purpose server. This included the interface, application technology (e.g. Java, .NET or PHP) and data and file storage. Over time, these functions were distributed across different servers. The servers also began to take on characteristic differences to accommodate the technology being run.

Today, compute has been commoditized and virtualized. Rather than thinking of compute as a physical server built to suit a specific purpose, compute is instead viewed in discrete chunks that can be scaled horizontally. PaaS today marries an application with those chunks of compute capacity as needed and abstracts application access to services, which may or may not run on the same PaaS platform.

Contributor and Organization Dynamic

The roles of contributors and organizations have changed throughout the evolution of the landscape. Early monolithic systems required technology experts who were familiar with a broad range of technologies, including system administration, programming, networking, etc. As the functions were distributed, the roles became more defined by their specializations. Webmasters, DBAs and programmers became siloed. This more distributed architecture introduced unintended conflicts, in part because efficiencies in one silo did not always align with the best interests of the others.

DevOps

As the evolution pushed toward compute as a commodity, the newfound flexibility drove many frustrated technologists to reach beyond their respective silos to accomplish their design and delivery objectives. Programmers began to look at how different operating system environments and database technologies could enable them to produce results faster and more reliably. System administrators began to rethink system management in ways that abstracted hardware dependencies and decreased the complexity involved in augmenting compute capacity available to individual functions. Datastore, network, storage and other experts began a similar process of abstracting their offerings. This blending of roles and new dynamic of collaboration and contribution has come to be known as DevOps.

Interoperability

Interoperability between systems and applications in the days of monolithic application development made use of many protocols, in part because each monolithic system exposed its services in different ways. As the above progression took place, the field of available protocols normalized. RESTful interfaces over HTTP have emerged as an accepted standard, and the serialization structures most common to REST are XML and JSON. This makes integration straightforward and provides for a high degree of reuse of existing services. It also makes services available to a greater diversity of devices.

Security and Isolation

One key development that made this evolution from compute as hardware to compute as a utility possible was effective isolation of compute resources on shared hardware. The first big step in this direction came in the form of virtualization. Virtualized hardware made it possible to run many distinct operating systems simultaneously on the same hardware. It also significantly reduced the time to provision new server resources, since the underlying hardware was already wired and ready.

Compute as a ________

The next step in the evolution came in the form of containers. Unlike virtualization, containers made it possible to provide an isolated, configurable compute instance in much less time and with fewer system resources to create and manage (i.e. lightweight). This progression from compute as hardware, to compute as virtual, and finally to compute as a container made it realistic to view compute as discrete chunks that can be created and destroyed in seconds as capacity requirements change.

Infrastructure as Code

Another important observation regarding the evolution of compute is that as the compute environment became easier to create (time to provision decreased), the process to provision changed. When a physical server required ordering, shipping, mounting, wiring, etc., it was reasonable to take a day or two to install and configure the operating system, network and related components. When that hardware was virtualized and could be provisioned in hours (or less), system administrators began to pursue more automation to accommodate the setup of these systems (e.g. Ansible, Puppet, Chef and even Vagrant). This made it possible to think of systems as more transient. With the advent of Linux containers, the idea of infrastructure as code became even more prevalent. Time to provision is approaching zero.

A related byproduct of infrastructure defined by scripts or code is reproducibility. Whereas it was historically difficult to ensure that two systems were configured identically, the method for provisioning containers made it trivial to ensure that compute resources were identically configured. This in turn improved debugging and collaboration, and accommodated versioning of operating environments.

Contextual Answers

Given that the landscape has changed so drastically, let’s look at some possible answers to the questions from the beginning of this post.

  • Q. What Java (or any language) version does CloudFoundry support?
    A. It supports any language that is defined in the scripts used to provision the container that will run the application. While it is true that some such scripts may be available by default, this doesn’t imply that the PaaS provides only that. If it’s a fit, use it. If not, create new provisioning scripts.
  • Q. What database products/versions are available?
    A. Any database product or version can be used. If the datastore services associated with the PaaS by default are not sufficient, bring your own or create another application component to accommodate your needs.
  • Q. How can I access the server directly?
    A. There is no “the server.” If you want to know more about the server environment, look at the script/code that is responsible for provisioning it. Even better, create a new container and play around with it. Once you get things just right, update your code so that every new container incorporates the desired changes. Every “the server” will look exactly how you define it.
Software Engineering

Build a Multi-server LEMP stack using Ansible

My objective in this post is to explore the use of Ansible to configure a multi-server LEMP stack. This builds on the preliminary work I did demonstrating how to use Vagrant to create an environment to run Ansible. You can follow this entire example on any Windows (or Linux) host.

Ansible only runs on Linux hosts, not Windows. As a result, I needed to provision one Linux host to act as the Ansible controller. One aspect of Ansible that I wanted to explore is the ability to manage multiple hosts with different configurations. For this experiment, I provision two more Linux hosts, one to act as a database host and the other to function as an Nginx/PHP server for a complete LEMP stack. I created the diagram below to illustrate my setup.

[Image: vagrant-ansible-lemp]

There are two primary artifact categories for this experiment:

  • Vagrantfile to provision each host
  • Ansible playbook related files

Since there were more than a few Ansible playbook files, I chose to create a GitHub repository rather than provide all the code here. You can clone/fork the files to run this experiment here:

https://github.com/dwatrous/vagrant-ansible-lemp

Explanation

Here is a list of the files you’ll find in that repository.

  • Vagrantfile
  • control.sh
  • lemp/group_vars/all
  • lemp/hosts
  • lemp/roles/common/handlers/main.yml
  • lemp/roles/common/tasks/main.yml
  • lemp/roles/database/handlers/main.yml
  • lemp/roles/database/tasks/main.yml
  • lemp/roles/web/handlers/main.yml
  • lemp/roles/web/tasks/main.yml
  • lemp/roles/web/templates/default
  • lemp/roles/web/templates/wall.php
  • lemp/site.yml

I use a bootstrap shell script, control.sh, with Vagrant for the Ansible control server. It is necessary to install Ansible on the control server, but since Ansible doesn’t require an agent, there’s no need to bootstrap the other servers.

Playbook files

For each Ansible-defined role there are three artifact categories.

  • handlers
  • tasks
  • templates

Handlers are named tasks that run only when notified by other tasks, most commonly to trigger a service restart when a configuration file changes (see the sketch below).

Tasks are the meat of the playbook. They list out the steps to put a system into a desired state, including installing software, copying templates, registering and calling handlers, etc.

Configuration files, such as the nginx ‘default’ configuration in this case, can be stored in the templates folder and copied to the host using a task. Templates are helpful when a desired configuration differs significantly from a system default; copying a complete file can be easier than updating individual lines one at a time with lineinfile. The Ansible playbook files are in the following directory.

/vagrant/lemp
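As a condensed sketch of how these pieces fit together (representative of the web role, not copied from it), a task renders the nginx template and notifies a handler, which restarts the service only if the file actually changed.

# tasks/main.yml sketch: render the template, notify on change
- name: configure nginx default site
  template: src=default dest=/etc/nginx/sites-available/default
  notify: restart nginx

# handlers/main.yml sketch: runs at most once per play, when notified
- name: restart nginx
  service: name=nginx state=restarted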

The site.yml file ties it all together by associating host groups with roles. You run the playbook like this.

ansible-playbook -i hosts site.yml
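For reference, a minimal site.yml following this pattern looks something like the sketch below; the group names are illustrative, and the actual file lives in the repository.

# site.yml sketch: common role everywhere, specialized roles per group
- hosts: all
  sudo: yes
  roles:
    - common

- hosts: webservers
  sudo: yes
  roles:
    - web

- hosts: dbservers
  sudo: yes
  roles:
    - database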

The example wall.php script should be accessible locally using the port 80->8080 mapping as http://127.0.0.1:8080/wall.php or over port 80 on the external IP assigned to the web host. Here’s what you can expect to see.

[Image: ansible-wall-example]

Resources

I used the Ansible examples repository on GitHub while putting this together. You may find it useful. For the specifics of installing LEMP on Ubuntu, I followed my Vagrant tutorial.

Software Engineering

Using Vagrant to Explore Ansible

Last week I wrote about Vagrant, a fantastic tool to spin up virtual development environments. Today I’m exploring Ansible. Ansible is an open source tool that streamlines certain system administration activities. Unlike Vagrant, which provisions new machines, Ansible takes an already provisioned machine and configures it. This can include installing and configuring software, managing services, and even running simple commands. Ansible doesn’t require any agent software on the system being managed; everything is executed over SSH.

Ansible only runs on Linux (though I’ve heard of people running it in Cygwin with some difficulty). In order to play with Ansible, I used Vagrant to spin up a control box and a subject box that are connected in a way that lets me easily run Ansible commands. Here’s my Vagrantfile:

# -*- mode: ruby -*-
# vi: set ft=ruby :
 
# Vagrantfile API/syntax version. Don't touch unless you know what you're doing!
VAGRANTFILE_API_VERSION = "2"
 
Vagrant.configure(VAGRANTFILE_API_VERSION) do |config|

  # define ansible subject (web in this case) box
  config.vm.define "subject" do |subject|
    subject.vm.box = "ubuntu/trusty64"
    subject.vm.network "public_network"
    subject.vm.network "private_network", ip: "192.168.51.4"
    subject.vm.provider "virtualbox" do |v|
      v.name = "Ansible subject"
      v.cpus = 2
      v.memory = 768
    end
    # copy private key so hosts can ssh using key authentication (the script below sets permissions to 600)
    subject.vm.provision :file do |file|
      file.source      = 'C:\Users\watrous\.vagrant.d\insecure_private_key'
      file.destination = '/home/vagrant/.ssh/id_rsa'
    end
    subject.vm.provision :shell, path: "subject.sh"
    subject.vm.network "forwarded_port", guest: 80, host: 8080
  end
 
  # define ansible control box (provision this last so it can add other hosts to known_hosts for ssh authentication)
  config.vm.define "control" do |control|
    control.vm.box = "ubuntu/trusty64"
    control.vm.network "public_network"
    control.vm.network "private_network", ip: "192.168.50.4"
    control.vm.provider "virtualbox" do |v|
      v.name = "Ansible control"
      v.cpus = 1
      v.memory = 512
    end
    # copy private key so hosts can ssh using key authentication (the script below sets permissions to 600)
    control.vm.provision :file do |file|
      file.source      = 'C:\Users\watrous\.vagrant.d\insecure_private_key'
      file.destination = '/home/vagrant/.ssh/id_rsa'
    end
    control.vm.provision :shell, path: "control.sh"
  end
 
  # consider using agent forwarding instead of manually copying the private key as I did above
  # config.ssh.forward_agent = true
 
end

Notice that I created a public network to get a DHCP-assigned external address. I also created a private network with statically assigned addresses in the private address space. These fixed addresses are what the Ansible hosts file uses to locate the inventory.

I had trouble getting SSH agent forwarding to work on Windows through PuTTY, so for now I’m manually placing the private key and updating known_hosts with the ‘ssh-keyscan’ command. You can see part of this in the Vagrantfile above. The remaining work is done in two scripts, one for the control and one for the subject.

control.sh

#!/usr/bin/env bash
 
# set proxy variables
#export http_proxy=http://myproxy.com:8080
#export https_proxy=https://myproxy.com:8080
 
# install pip, then use pip to install ansible
apt-get -y install python-dev python-pip
pip install ansible
 
# fix permissions on private key file
chmod 600 /home/vagrant/.ssh/id_rsa
 
# add subject host to known_hosts (IP is defined in Vagrantfile)
ssh-keyscan -H 192.168.51.4 >> /home/vagrant/.ssh/known_hosts
chown vagrant:vagrant /home/vagrant/.ssh/known_hosts
 
# create ansible hosts (inventory) file
mkdir -p /etc/ansible/
cat /vagrant/hosts >> /etc/ansible/hosts

subject.sh

#!/usr/bin/env bash
 
# fix permissions on private key file
chmod 600 /home/vagrant/.ssh/id_rsa

I also copy this hosts file into place on the control system so Ansible knows which inventory to operate against.

hosts

[targets]
localhost   ansible_connection=local
192.168.51.4    ansible_connection=ssh

After running ‘vagrant up’, I can verify that the control box is able to access the subject box using the ping module in Ansible.

[Image: vagrant-ansible]
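For anyone who can’t see the screenshot, a minimal playbook exercising the same ping module would look like the sketch below; the targets group name comes from the hosts file above, and the inventory is already in the default location at /etc/ansible/hosts.

# ping.yml sketch: confirm SSH connectivity to every host in [targets]
# run with: ansible-playbook ping.yml
- hosts: targets
  tasks:
    - name: verify connectivity
      ping: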

Conclusion

This post doesn’t demonstrate much use of Ansible, aside from the ping module. What it does do is provide an environment where I can build and run Ansible playbooks, which is exactly what I plan to do next.