Use Hadoop to Analyze Java Logs (Tomcat catalina.out)

One of the Java applications I develop deploys in Tomcat and is load-balanced across a couple dozen servers. Due to the high volume of traffic, each server can produce gigabytes of log output daily. This post demonstrates a simple use of Hadoop to quickly extract useful and relevant information from catalina.out files using MapReduce. I followed Hadoop: The Definitive Guide for setup and example code.

Installing Hadoop

Hadoop in standalone mode was the most convenient for initial development of the MapReduce classes. The following commands were executed on a virtual server running RedHat Enterprise Linux 6.3. First, verify that Java 6 is installed:

[watrous@myhost ~]$ java -version
java version "1.6.0_24"
OpenJDK Runtime Environment (IcedTea6 1.11.5) (rhel-1.50.1.11.5.el6_3-x86_64)
OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode)

Next, download and extract Hadoop. Hadoop can be downloaded using a mirror. It can be set up and run locally and does not require any special privileges. Always verify that you have a good download.

[watrous@myhost ~]$ wget http://download.nextag.com/apache/hadoop/common/stable/hadoop-2.2.0.tar.gz
[watrous@myhost ~]$ md5sum hadoop-2.2.0.tar.gz
25f27eb0b5617e47c032319c0bfd9962  hadoop-2.2.0.tar.gz
[watrous@myhost ~]$ tar xzf hadoop-2.2.0.tar.gz
[watrous@myhost ~]$ hdfs namenode -format

That last command creates an HDFS file system in the tmp folder. In my case it was created here: /tmp/hadoop-watrous/dfs/. (Note that the hdfs command lives in hadoop-2.2.0/bin, so either run it from there or set the PATH first, as shown below.)

Environment variables were added to .bash_profile for JAVA_HOME and HADOOP_INSTALL, as shown. These exports can also be run manually each time you log in.

export JAVA_HOME=/usr/lib/jvm/jre
export HADOOP_INSTALL=/home/watrous/hadoop-2.2.0
export PATH=$PATH:$HADOOP_INSTALL/bin

I can now verify that Hadoop is installed and ready to run.

[watrous@myhost ~]$ hadoop version
Hadoop 2.2.0
Subversion https://svn.apache.org/repos/asf/hadoop/common -r 1529768
Compiled by hortonmu on 2013-10-07T06:28Z
Compiled with protoc 2.5.0
From source with checksum 79e53ce7994d1628b240f09af91e1af4
This command was run using /home/watrous/hadoop-2.2.0/share/hadoop/common/hadoop-common-2.2.0.jar

Get some seed data

Now that Hadoop is all set up, I need some seed data to operate on. For this I just reached out and grabbed a log file from one of my production servers.

[watrous@myhost ~]$ mkdir input
[watrous@myhost ~]$ scp watrous@mywebhost.com:/var/lib/tomcat/logs/catalina.out ./input/

Creating MapReduce Classes

The simplest Hadoop job requires a Mapper class, a Reducer class, and a driver class that identifies the Mapper and Reducer, including the datatypes that connect them. The examples below require two jars from the release downloaded above:

  • hadoop-2.2.0/share/hadoop/common/hadoop-common-2.2.0.jar
  • hadoop-2.2.0/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.2.0.jar

I also use regular expressions in Java to analyze each line in the log. Regular expressions are more resilient to variations in line format than fixed-position parsing, and grouping gives easy access to specific data elements. As always, I used Kodos to develop the regular expression.

In the example below, I don’t actually use the log value, but instead I just count up how many occurrences there are by key.
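
As a quick sanity check, the pattern can be exercised against a sample log line outside of Hadoop. This is a minimal sketch (the class name PatternCheck is mine, not part of the job), using the same pattern and the example line shown in the Mapper below.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// hypothetical standalone check of the regex, not part of the Hadoop job
public class PatternCheck {
    public static void main(String[] args) {
        String pattern = "([0-9]{4}-[0-9]{2}-[0-9]{2}\\s[0-9]{2}:[0-9]{2}:[0-9]{2}.[0-9]{3})\\s*([a-zA-Z]+)\\s*([a-zA-Z.]+)\\s*-\\s*(.+)$";
        String line = "2013-11-08 04:06:56,586 DEBUG component.helpers.GenericSOAPConnector  - Attempting to connect to: https://remotehost.com/app/rfc/entry/msg_status";
        Matcher m = Pattern.compile(pattern).matcher(line);
        if (m.find()) {
            System.out.println("date:      " + m.group(1));
            System.out.println("log level: " + m.group(2));
            System.out.println("class:     " + m.group(3));
            System.out.println("message:   " + m.group(4));
        }
    }
}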

Mapper class

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
 
public class TomcatLogErrorMapper extends Mapper<LongWritable, Text, Text, Text> {
 
    // regex with groups for date, log level, class and message; the unescaped '.'
    // before the milliseconds matches the comma separator (e.g. 04:06:56,586)
    String pattern = "([0-9]{4}-[0-9]{2}-[0-9]{2}\\s[0-9]{2}:[0-9]{2}:[0-9]{2}.[0-9]{3})\\s*([a-zA-Z]+)\\s*([a-zA-Z.]+)\\s*-\\s*(.+)$";
    // Create the Pattern object once so it is reused for every input line
    Pattern r = Pattern.compile(pattern);
 
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
 
        Matcher m = r.matcher(line);
        if (m.find()) {
            // only consider ERRORs for this example
            if (m.group(2).contains("ERROR")) {
                // example log line:
                // 2013-11-08 04:06:56,586 DEBUG component.helpers.GenericSOAPConnector  - Attempting to connect to: https://remotehost.com/app/rfc/entry/msg_status
                // group(1) date, group(2) log level, group(3) class, group(4) message
                context.write(new Text(m.group(1)), new Text(m.group(2) + m.group(3) + m.group(4)));
            }
        }
    }
}

Reducer class

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
 
public class TomcatLogErrorReducer extends Reducer<Text, Text, Text, IntWritable> {
 
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        int countValue = 0;
        for (Text value : values) {
            countValue++;
        }
        context.write(key, new IntWritable(countValue));
    }
}
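
One possible refinement, a sketch rather than part of the original example: since only counts are needed, the Mapper could emit an IntWritable(1) as its value and the Reducer could sum. Because such a reducer's input and output types match, the same class can also be registered as a combiner with job.setCombinerClass(...) to shrink the data shuffled between phases.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
 
// hypothetical variation: requires the Mapper to emit new IntWritable(1) as its value
public class TomcatLogErrorSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
 
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}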

Job class with main

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
 
public class TomcatLogError {
    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: TomcatLogError <input path> <output path>");
            System.exit(-1);
        }
        Job job = new Job();
        job.setJarByClass(TomcatLogError.class);
        job.setJobName("Tomcat Log Error");
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setMapperClass(TomcatLogErrorMapper.class);
        job.setReducerClass(TomcatLogErrorReducer.class);
        // the Mapper emits Text values while the Reducer emits IntWritable,
        // so the map output value type must be declared separately
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
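
As a side note, the idiomatic Hadoop driver wraps this logic in a Tool so that generic command line options (such as -D properties) are parsed automatically by ToolRunner. A minimal sketch of that variation, reusing the same Mapper and Reducer (the class name TomcatLogErrorDriver is hypothetical):

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
 
public class TomcatLogErrorDriver extends Configured implements Tool {
 
    @Override
    public int run(String[] args) throws Exception {
        // getConf() carries any -D options parsed by ToolRunner
        Job job = Job.getInstance(getConf(), "Tomcat Log Error");
        job.setJarByClass(TomcatLogErrorDriver.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setMapperClass(TomcatLogErrorMapper.class);
        job.setReducerClass(TomcatLogErrorReducer.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        return job.waitForCompletion(true) ? 0 : 1;
    }
 
    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new TomcatLogErrorDriver(), args));
    }
}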

Running Hadoop

In NetBeans I made sure that the Main Class was TomcatLogError in the compiled jar. I then ran Clean and Build to get a jar, which I transferred up to the server where I installed Hadoop.
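
For anyone building without an IDE, roughly equivalent commands (assuming the two jars listed earlier have been copied into the working directory) would be:

[watrous@myhost ~]$ javac -classpath hadoop-common-2.2.0.jar:hadoop-mapreduce-client-core-2.2.0.jar TomcatLogError*.java
[watrous@myhost ~]$ jar cfe HadoopExample.jar TomcatLogError TomcatLogError*.class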

[watrous@myhost ~]$ hadoop jar HadoopExample.jar input/catalina.out ~/output
13/11/11 19:20:52 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
13/11/11 19:20:52 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
...
13/11/11 18:36:57 INFO mapreduce.Job: Job job_local1725513594_0001 completed successfully
13/11/11 18:36:57 INFO mapreduce.Job: Counters: 27
        File System Counters
                FILE: Number of bytes read=430339145
                FILE: Number of bytes written=1057396
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
        Map-Reduce Framework
                Map input records=1101516
                Map output records=105
                Map output bytes=20648
                Map output materialized bytes=20968
                Input split bytes=396
                Combine input records=0
                Combine output records=0
                Reduce input groups=23
                Reduce shuffle bytes=0
                Reduce input records=105
                Reduce output records=23
                Spilled Records=210
                Shuffled Maps =0
                Failed Shuffles=0
                Merged Map outputs=0
                GC time elapsed (ms)=234
                CPU time spent (ms)=0
                Physical memory (bytes) snapshot=0
                Virtual memory (bytes) snapshot=0
                Total committed heap usage (bytes)=1827143680
        File Input Format Counters
                Bytes Read=114455257
        File Output Format Counters
                Bytes Written=844

The output folder now contains a file named part-r-00000 with the results of the processing.

[watrous@c0003913 ~]$ more output/part-r-00000
2013-11-08 04:04:51,894 2
2013-11-08 05:04:52,711 2
2013-11-08 05:33:23,073 3
2013-11-08 06:04:53,689 2
2013-11-08 07:04:54,366 3
2013-11-08 08:04:55,096 2
2013-11-08 13:34:28,936 2
2013-11-08 17:32:31,629 3
2013-11-08 18:51:17,357 1
2013-11-08 18:51:17,423 1
2013-11-08 18:51:17,491 1
2013-11-08 18:51:17,499 1
2013-11-08 18:51:17,500 1
2013-11-08 18:51:17,502 1
2013-11-08 18:51:17,503 1
2013-11-08 18:51:17,504 1
2013-11-08 18:51:17,506 1
2013-11-08 18:51:17,651 6
2013-11-08 18:51:17,652 23
2013-11-08 18:51:17,653 25
2013-11-08 18:51:17,654 19
2013-11-08 19:01:13,771 2
2013-11-08 21:32:34,522 2

Because the Mapper emits the full timestamp (down to the millisecond) as its key, each output line counts the ERROR entries logged at that instant. Based on this analysis, a burst of errors was produced around 18:51:17. Now that it's known when the errors happened, it is easy to change the Mapper class to emit a different key, such as the class or message, to identify more precisely what the error is.
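
For example, re-keying by class is a small change in the Mapper (a sketch of the idea, not code from the run above):

            // in TomcatLogErrorMapper.map: key on the logging class (group 3) instead of the timestamp
            if (m.group(2).contains("ERROR")) {
                context.write(new Text(m.group(3)), new Text(m.group(1) + " " + m.group(4)));
            }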

Increasing scale

The Mapper and Reducer classes can be enhanced to give more relevant details. The process of transferring the files can also be automated, and the input can be adapted to walk a directory rather than a single file, as sketched below. Reports can also be aggregated and placed in a web directory or emailed.
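
On the last point, FileInputFormat already accepts a directory, in which case every file inside becomes input to the job, so pointing the job at a whole log directory is a small change. A sketch, assuming logs are collected under input/ (the paths are placeholders):

        // passing a directory instead of a file: every file in input/ becomes job input
        FileInputFormat.addInputPath(job, new Path("input"));
 
        // multiple paths can also be added, comma separated
        FileInputFormat.addInputPaths(job, "input/app1,input/app2");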

