
Use Hadoop to Analyze Java Logs (Tomcat catalina.out)

One of the Java applications I develop deploys in Tomcat and is load-balanced across a couple dozen servers. Due to the high volume, each server can produce gigabytes of log output daily. This post demonstrates a simple use of Hadoop to quickly extract useful and relevant information from catalina.out files using MapReduce. I followed Hadoop: The Definitive Guide for setup and example code.

Installing Hadoop

Hadoop in standalone mode was the most convenient option for initial development of the MapReduce classes. The following commands were executed on a virtual server running RedHat Enterprise Linux 6.3. First verify Java 6 is installed:

[watrous@myhost ~]$ java -version
java version "1.6.0_24"
OpenJDK Runtime Environment (IcedTea6 1.11.5) (rhel-1.50.1.11.5.el6_3-x86_64)
OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode)

Next, download and extract Hadoop. Hadoop can be downloaded from a mirror; it can be set up and run locally and does not require any special privileges. Always verify that you have a good download.

[watrous@myhost ~]$ wget http://download.nextag.com/apache/hadoop/common/stable/hadoop-2.2.0.tar.gz
[watrous@myhost ~]$ md5sum hadoop-2.2.0.tar.gz
25f27eb0b5617e47c032319c0bfd9962  hadoop-2.2.0.tar.gz
[watrous@myhost ~]$ tar xzf hadoop-2.2.0.tar.gz
[watrous@myhost ~]$ hdfs namenode -format

That last command creates an HDFS file system in the tmp folder. In my case it was created here: /tmp/hadoop-watrous/dfs/.

Environment variables for JAVA_HOME and HADOOP_INSTALL were added to .bash_profile, as shown below. These exports can also be run manually each time you log in.

export JAVA_HOME=/usr/lib/jvm/jre
export HADOOP_INSTALL=/home/watrous/hadoop-2.2.0
export PATH=$PATH:$HADOOP_INSTALL/bin

I can now verify that Hadoop is installed and ready to run.

[watrous@myhost ~]$ hadoop version
Hadoop 2.2.0
Subversion https://svn.apache.org/repos/asf/hadoop/common -r 1529768
Compiled by hortonmu on 2013-10-07T06:28Z
Compiled with protoc 2.5.0
From source with checksum 79e53ce7994d1628b240f09af91e1af4
This command was run using /home/watrous/hadoop-2.2.0/share/hadoop/common/hadoop-common-2.2.0.jar

Get some seed data

Now that Hadoop is set up, I need some seed data to operate on. For this I grabbed a log file from one of my production servers.

[watrous@myhost ~]$ mkdir input
[watrous@myhost ~]$ scp watrous@mywebhost.com:/var/lib/tomcat/logs/catalina.out ./input/

Creating MapReduce Classes

The simplest operation in Hadoop requires a Mapper class, a Reducer class and a third class that identifies the Mapper and Reducer, including the datatypes that connect them. The examples below require two jars from the release downloaded above (a sketch of compiling and packaging against them follows the list):

  • hadoop-2.2.0/share/hadoop/common/hadoop-common-2.2.0.jar
  • hadoop-2.2.0/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.2.0.jar
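
The classes can also be compiled and packaged from the command line rather than an IDE. This is a minimal sketch, assuming the three source files sit in the current directory and HADOOP_INSTALL is set as above; the jar name HadoopExample.jar matches the run command used later.

[watrous@myhost ~]$ javac -cp $HADOOP_INSTALL/share/hadoop/common/hadoop-common-2.2.0.jar:$HADOOP_INSTALL/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.2.0.jar TomcatLogErrorMapper.java TomcatLogErrorReducer.java TomcatLogError.java
[watrous@myhost ~]$ jar cfe HadoopExample.jar TomcatLogError *.class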

I also use regular expressions in Java to analyze each line in the log. Regular expressions can be more resilient to variations and allow for grouping, which gives easy access to specific data elements. As always, I used Kodos to develop the regular expression.
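
To double-check the pattern before wiring it into the Mapper, it can be exercised against the example log line quoted in the Mapper comments below. This is a standalone sketch; the class name LogPatternCheck is only for illustration.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogPatternCheck {
    public static void main(String[] args) {
        String pattern = "([0-9]{4}-[0-9]{2}-[0-9]{2}\\s[0-9]{2}:[0-9]{2}:[0-9]{2}.[0-9]{3})\\s*([a-zA-Z]+)\\s*([a-zA-Z.]+)\\s*-\\s*(.+)$";
        String line = "2013-11-08 04:06:56,586 DEBUG component.helpers.GenericSOAPConnector  - Attempting to connect to: https://remotehost.com/app/rfc/entry/msg_status";
        Matcher m = Pattern.compile(pattern).matcher(line);
        if (m.find()) {
            System.out.println("date:    " + m.group(1)); // 2013-11-08 04:06:56,586
            System.out.println("level:   " + m.group(2)); // DEBUG
            System.out.println("class:   " + m.group(3)); // component.helpers.GenericSOAPConnector
            System.out.println("message: " + m.group(4)); // Attempting to connect to: https://...
        }
    }
}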

In the example below, I don’t actually use the log value, but instead I just count up how many occurrences there are by key.

Mapper class

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
 
public class TomcatLogErrorMapper extends Mapper<LongWritable, Text, Text, Text> {
 
    String pattern = "([0-9]{4}-[0-9]{2}-[0-9]{2}\\s[0-9]{2}:[0-9]{2}:[0-9]{2}.[0-9]{3})\\s*([a-zA-Z]+)\\s*([a-zA-Z.]+)\\s*-\\s*(.+)$";
    // Create a Pattern object
    Pattern r = Pattern.compile(pattern);
 
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
 
        Matcher m = r.matcher(line);
        if (m.find()) {
            // only consider ERRORs for this example
            if (m.group(2).contains("ERROR")) {
                // example log line:
                // 2013-11-08 04:06:56,586 DEBUG component.helpers.GenericSOAPConnector  - Attempting to connect to: https://remotehost.com/app/rfc/entry/msg_status
                // group(0) = complete line, group(1) = date, group(2) = log level,
                // group(3) = class, group(4) = message
                context.write(new Text(m.group(1)), new Text(m.group(2) + m.group(3) + m.group(4)));
            }
        }
    }
}

Reducer class

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
 
public class TomcatLogErrorReducer extends Reducer<Text, Text, Text, IntWritable> {
 
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        int countValue = 0;
        for (Text value : values) {
            countValue++;
        }
        context.write(key, new IntWritable(countValue));
    }
}

Job class with main

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
 
public class TomcatLogError {
    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: TomcatLogError <input path> <output path>");
            System.exit(-1);
        }
        Job job = new Job();
        job.setJarByClass(TomcatLogError.class);
        job.setJobName("Tomcat Log Error");
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setMapperClass(TomcatLogErrorMapper.class);
        job.setReducerClass(TomcatLogErrorReducer.class);
        // the map output types (Text, Text) differ from the final output types (Text, IntWritable),
        // so declare both explicitly
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Running Hadoop

In NetBeans I made sure that the Main Class was TomcatLogError in the compiled jar. I then ran Clean and Build to get a jar, which I transferred up to the server where I installed Hadoop.
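
(If the jar's manifest did not name a main class, the driver class could instead be given explicitly on the command line: hadoop jar HadoopExample.jar TomcatLogError input/catalina.out ~/output.)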

[watrous@myhost ~]$ hadoop jar HadoopExample.jar input/catalina.out ~/output
13/11/11 19:20:52 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
13/11/11 19:20:52 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
...
13/11/11 18:36:57 INFO mapreduce.Job: Job job_local1725513594_0001 completed successfully
13/11/11 18:36:57 INFO mapreduce.Job: Counters: 27
        File System Counters
                FILE: Number of bytes read=430339145
                FILE: Number of bytes written=1057396
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
        Map-Reduce Framework
                Map input records=1101516
                Map output records=105
                Map output bytes=20648
                Map output materialized bytes=20968
                Input split bytes=396
                Combine input records=0
                Combine output records=0
                Reduce input groups=23
                Reduce shuffle bytes=0
                Reduce input records=105
                Reduce output records=23
                Spilled Records=210
                Shuffled Maps =0
                Failed Shuffles=0
                Merged Map outputs=0
                GC time elapsed (ms)=234
                CPU time spent (ms)=0
                Physical memory (bytes) snapshot=0
                Virtual memory (bytes) snapshot=0
                Total committed heap usage (bytes)=1827143680
        File Input Format Counters
                Bytes Read=114455257
        File Output Format Counters
                Bytes Written=844

The output folder now contains a file named part-r-00000 with the results of the processing.

[watrous@c0003913 ~]$ more output/part-r-00000
2013-11-08 04:04:51,894 2
2013-11-08 05:04:52,711 2
2013-11-08 05:33:23,073 3
2013-11-08 06:04:53,689 2
2013-11-08 07:04:54,366 3
2013-11-08 08:04:55,096 2
2013-11-08 13:34:28,936 2
2013-11-08 17:32:31,629 3
2013-11-08 18:51:17,357 1
2013-11-08 18:51:17,423 1
2013-11-08 18:51:17,491 1
2013-11-08 18:51:17,499 1
2013-11-08 18:51:17,500 1
2013-11-08 18:51:17,502 1
2013-11-08 18:51:17,503 1
2013-11-08 18:51:17,504 1
2013-11-08 18:51:17,506 1
2013-11-08 18:51:17,651 6
2013-11-08 18:51:17,652 23
2013-11-08 18:51:17,653 25
2013-11-08 18:51:17,654 19
2013-11-08 19:01:13,771 2
2013-11-08 21:32:34,522 2

Based on this analysis, a burst of errors occurred around 18:51:17. Now that I know when the errors happened, it is easy to change the Mapper class to emit a different key, such as the class or message, to identify more precisely what the errors are; a sketch of that change follows.
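
Only the write inside the Mapper's map method changes; everything else stays the same. A minimal sketch:

        // inside TomcatLogErrorMapper.map(), keyed by logging class instead of timestamp
        if (m.find() && m.group(2).contains("ERROR")) {
            // group(3) is the class that logged the error; keep the timestamp and message as the value
            context.write(new Text(m.group(3)), new Text(m.group(1) + " " + m.group(4)));
        }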

Increasing scale

The Mapper and Reducer classes can be enhanced to produce more relevant details. The process of transferring the files can also be automated, and the input method can be adapted to walk a directory rather than a single file (see the sketch below). Reports can also be aggregated and placed in a web directory or emailed.
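
For the directory case, FileInputFormat already accepts a directory as an input path and includes every file directly inside it, so the driver only needs to be pointed at the log directory. A minimal sketch, with illustrative directory names:

        // point the job at a whole directory of catalina.out files instead of a single file
        FileInputFormat.addInputPath(job, new Path("input/logs"));
        // additional locations can be added to the same job
        FileInputFormat.addInputPath(job, new Path("input/server2-logs"));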

Comments

  1. […] try to add value by revisiting my original example of analyzing Apache Tomcat (Java) logs in Hadoop using python this […]

  2. Can you provide some detailed steps on setting up the jar file? If I have Eclipse, how do I go about doing this?

    Thanks

    srikrishna

  3. I am not able to locate the output directory, can you please update where it can be?

    • If you followed my example precisely, you would have specified the output directory as “~/output”, which is a directory named “output” off your home directory. In the last code snippet, I type “more output/part-r-00000” from my home directory. I could also type “more ~/output/part-r-00000”, in which case it wouldn’t matter in which directory I was.

      • Very nice tutorial. To run your example I had to import the log file to HDFS using the command:

        hadoop dfs -copyFromLocal /tmp/catalina.out /user/hduser/keyword

        And then read the output as:
        ./hadoop dfs -cat /tmp/output/part-r-00000

        Can you please clarify how you are able to view the output without DFS?

        Thanks

        • DFS is not required to use Hadoop. Since this was a simple demo, I provided my hadoop call with local files and a local output directory. Hadoop doesn’t have any problem with local directories.

          “hadoop jar HadoopExample.jar input/catalina.out ~/output”

  4. Thanks for the tutorial. It worked for me. But I want to extend it by displaying associated text records. How can I go about this?

    • I’m not sure what you mean by “associated text records”. If you have some key value that’s common between the output from this and the other associated text records, map both sources and reduce it down based on that common key.

  5. […] recently been involved with several groups interested in using Hadoop to process large sets of data, including use of higher level abstractions on top of Hadoop like Pig […]

  6. Hi, I have tried this sample but the reducer output was empty. Can someone share the input file format so we can understand the logic of the map-reduce? Please help us.
