In a previous post I illustrated the use of Hadoop to analyze Apache Tomcat log files (catalina.out). Below I perform the same Tomcat log analysis using PIG. The motivation behind PIG is the ability to use a descriptive language to analyze large sets of data rather than writing code, in Java or Python for example, to process it. PIG Latin is the descriptive query language and has some similarities with SQL, including grouping and filtering.
Load in the data
First I launch the interactive local PIG command line, grunt. Commands are not......
Continue Reading
I read that Hadoop supports scripts written in languages other than Java, such as Python. Since I’m a fan of Python, I wanted to prove this out. It was my good fortune to find an excellent post by Michael Noll that walked me through the entire process of scripting in Python for Hadoop. The post worked as written for me in Hadoop 2.2.0.
How Hadoop processes scripts from other languages (stdin/stdout)
In order to accommodate scripts from other languages, Hadoop focuses on standard input (stdin) and standard output (stdout).......
Continue Reading
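The stdin/stdout contract described in the excerpt above can be sketched as a word-count mapper and reducer, in the spirit of the usual streaming tutorial example. This is a minimal sketch, not the code from the post: the names `mapper` and `reducer` are illustrative, and both functions accept any iterable of lines so that `sys.stdin` could be passed in when run under Hadoop. The `sorted` call stands in for the sort that Hadoop performs between the map and reduce steps.

```python
from itertools import groupby


def mapper(lines):
    """Yield one tab-separated 'word<TAB>1' record per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(lines):
    """Sum the counts per word; the input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


if __name__ == "__main__":
    # Local simulation: in a real job each function would read sys.stdin
    # and every yielded record would be printed to stdout.
    sample = ["the quick brown fox", "the lazy dog"]
    shuffled = sorted(mapper(sample))  # stands in for Hadoop's sort phase
    for record in reducer(shuffled):
        print(record)
```

Because each step only consumes lines and produces lines, the same functions can be tested locally with plain lists before being wired to stdin and stdout.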
One of the Java applications I develop deploys in Tomcat and is load-balanced across a couple dozen servers. Each server can produce gigabytes of log output daily due to the high volume. This post demonstrates a simple use of Hadoop to quickly extract useful and relevant information from catalina.out files using Map Reduce. I followed Hadoop: The Definitive Guide for setup and example code.
Installing Hadoop
Hadoop in standalone mode was the most convenient for initial development of the Map Reduce classes. The following commands were executed on a virtual server running RedHat Enterprise......
Continue Reading
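To give a rough idea of what a map and reduce over catalina.out might look like, here is a small in-process sketch in Python. It is not the Java Map Reduce classes from the post and does not use the Hadoop API. It assumes the default java.util.logging layout, where a message line starts with the level followed by a colon; what the post actually extracts is not shown in the excerpt, so counting lines per level is an assumption, and `map_line`, `reduce_counts`, and `count_levels` are hypothetical names.

```python
import re
from collections import Counter

# Assumption: message lines look like "SEVERE: something went wrong".
LEVEL_PATTERN = re.compile(r"^(SEVERE|WARNING|INFO|CONFIG|FINE|FINER|FINEST): ")


def map_line(line):
    """Map step: emit (level, 1) for a message line, nothing otherwise."""
    match = LEVEL_PATTERN.match(line)
    return [(match.group(1), 1)] if match else []


def reduce_counts(pairs):
    """Reduce step: sum the emitted counts per level."""
    totals = Counter()
    for level, count in pairs:
        totals[level] += count
    return dict(totals)


def count_levels(lines):
    """Run the map step over every line, then reduce the emitted pairs."""
    pairs = (pair for line in lines for pair in map_line(line))
    return reduce_counts(pairs)


if __name__ == "__main__":
    # Hypothetical sample lines, not taken from a real log.
    sample = [
        "Apr 05, 2013 6:00:24 AM org.apache.catalina.startup.Catalina start",
        "INFO: Server startup in 1234 ms",
        "SEVERE: Servlet.service() threw exception",
        "INFO: Reloading context",
    ]
    print(count_levels(sample))
```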
I’ve been working on generating analytics based on a collection containing statistical data. My previous attempt involved using Map Reduce in MongoDB. Recall that the data in the statistics collection has this form:
{
    "_id" : ObjectId("5e6877a516832a9c8fe89ca9"),
    "apikey" : "7e78ed1525b7568c2316576f2b265f55e6848b5830db4e6586283",
    "request_date" : ISODate("2013-04-05T06:00:24.006Z"),
    "request_method" : "POST",
    "document" : {
        "domain" : "",
        "validationMethod" : "LICENSE_EXISTS_NOT_EXPIRED",
        "deleted" : null,
        "ipAddress" : "",
        "disposition" : "",
        "owner" : ObjectId("af1459ed793eca35754090a0"),
        "_id" : ObjectId("6fec518787a52a9c988ea683"),
        "issueDate" : ISODate("2013-04-05T06:00:24.005Z")
    },
    "request_uri" : {
        "path" : "/v1/sitelicenses",
        "netloc" : "api.easysoftwarelicensing.com"
    }
}
{ "_id" : ObjectId("5e6877a516832a9c8fe89ca9"), "apikey" : "7e78ed1525b7568c2316576f2b265f55e6848b5830db4e6586283", "request_date" :......
Continue Reading
I created a RESTful SaaS service that uses MongoDB. Each REST call creates a new record in a statistics collection. In order to implement quotas and provide user analytics, I need to process the statistics collection periodically and generate meaningful analytics specific to each user. This is just the type of problem map reduce was meant to solve. In order to accomplish this I’ll need to do the following:
- Map all statistics records over a time range
- Reduce the number of calls, both authenticated and anonymous
- Finalize to get the sum......
Continue Reading
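The map, reduce, and finalize steps listed in the excerpt above can be sketched in plain Python to show how the three stages fit together. This is a sketch under stated assumptions, not the MongoDB functions from the post: it runs in process rather than on the server, it groups by `apikey`, it treats `authenticated` as a hypothetical boolean field because the excerpt does not show how the two kinds of call are told apart, and it assumes the finalize step adds the two counts into a total.

```python
from collections import defaultdict
from datetime import datetime


def map_record(record):
    """Map step: emit (apikey, counts) for one statistics record."""
    # Assumption: "authenticated" is a hypothetical boolean field.
    is_auth = bool(record.get("authenticated"))
    counts = {"authenticated": int(is_auth), "anonymous": int(not is_auth)}
    return record["apikey"], counts


def reduce_values(values):
    """Reduce step: add up the counts emitted for one key."""
    return {
        "authenticated": sum(v["authenticated"] for v in values),
        "anonymous": sum(v["anonymous"] for v in values),
    }


def finalize(value):
    """Finalize step: assumed here to add a total of both counts."""
    return {**value, "total": value["authenticated"] + value["anonymous"]}


def run(records, start, end):
    """Apply map, reduce, and finalize to records with start <= date < end."""
    grouped = defaultdict(list)
    for record in records:
        if start <= record["request_date"] < end:
            key, value = map_record(record)
            grouped[key].append(value)
    return {key: finalize(reduce_values(vals)) for key, vals in grouped.items()}


if __name__ == "__main__":
    # Hypothetical records; only the fields the sketch needs are present.
    records = [
        {"apikey": "k1", "request_date": datetime(2013, 4, 5), "authenticated": True},
        {"apikey": "k1", "request_date": datetime(2013, 4, 6)},
        {"apikey": "k2", "request_date": datetime(2013, 5, 1), "authenticated": True},
    ]
    print(run(records, datetime(2013, 4, 1), datetime(2013, 5, 1)))
```

Keeping the reduce output in the same shape as the map output mirrors the usual map reduce requirement that reduce can be reapplied to its own results.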