Tag Archives: hadoop

Bulid a multi-server Hadoop cluster in OpenStack in minutes

In a previous post I demonstrated a method to deploy a multi-node Hadoop cluster using Vagrant and Ansible. This post builds on that and shows how to deploy a Hadoop cluster with an arbitrary number of slave nodes in minutes on OpenStack. This process makes use of the OpenStack orchestration layer HEAT to provision the […]

Read more

Install and configure a Multi-node Hadoop cluster using Ansible

I’ve recently been involved with several groups interested in using Hadoop to process large sets of data, including use of higher level abstractions on top of Hadoop like Pig and Hive. What has surprised me most is that no one is automating their installation of Hadoop. In each case that I’ve observed they start by […]

Read more

Analyze Tomcat Logs using PIG (hadoop)

In a previous post I illustrated the use of Hadoop to analyze Apache Tomcat log files (catalina.out). Below I perform the same Tomcat log analysis using PIG. The motivation behind PIG is the ability us a descriptive language to analyze large sets of data rather than writing code to process it, using Java or Python […]

Read more

Hadoop Scripts in Python

I read that Hadoop supports scripts written in various languages other than Java, such as Python. Since I’m a fan of python, I wanted to prove this out. It was my good fortune to find an excellent post by Michael Noll that walked me through the entire process of scripting in Python for Hadoop. It’s […]

Read more

Hadoop Takes Compressed Files (gzip, bzip2) as Direct Input

The log rotation mechanism on my servers automatically compresses (gzip) the rotated log file to save on disk space. I discovered that Hadoop is already designed to deal with compressed files using gzip, bzip2 and LZO out of the box. This means that no additional work is required in the Mapper class to decompress. Here’s […]

Read more

Hadoop HDFS in Standalone Mode

My previous hadoop example operated against the local filesystem, in spite of the fact that I formatted a local HDFS partition. In order to operate against the local HDFS partition it’s necessary to first start the namenode and datanode. I mostly followed these instructions to start those processes. Here’s the most relevant part that I […]

Read more

Use Hadoop to Analyze Java Logs (Tomcat catalina.out)

One of the Java applications I develop deploys in Tomcat and is load-balanced across a couple dozen servers. Each server can produce gigabytes of log output daily due to the high volume. This post demonstrates simple use of hadoop to quickly extract useful and relevant information from catalina.out files using Map Reduce. I followed Hadoop: […]

Read more