Use Hadoop to Analyze Java Logs (Tomcat catalina.out)
One of the Java applications I develop deploys in Tomcat and is load-balanced across a couple dozen servers. Due to the high volume, each server can produce gigabytes of log output daily. This post demonstrates a simple use of Hadoop to quickly extract useful and relevant information from catalina.out files using MapReduce. I followed Hadoop: The Definitive Guide for setup and example code.
Installing Hadoop
Hadoop in standalone mode was the most convenient for initial development of the MapReduce classes. The following commands were executed on a virtual server running Red Hat Enterprise Linux 6.3. First, verify that Java 6 is installed:
[watrous@myhost ~]$ java -version
java version "1.6.0_24"
OpenJDK Runtime Environment (IcedTea6 1.11.5) (rhel-1.50.1.11.5.el6_3-x86_64)
OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode)
Next, download and extract Hadoop from one of the mirrors. Hadoop can be set up and run locally and does not require any special privileges. Always verify that you have a good download.
[watrous@myhost ~]$ wget http://download.nextag.com/apache/hadoop/common/stable/hadoop-2.2.0.tar.gz
[watrous@myhost ~]$ md5sum hadoop-2.2.0.tar.gz
25f27eb0b5617e47c032319c0bfd9962  hadoop-2.2.0.tar.gz
[watrous@myhost ~]$ tar xzf hadoop-2.2.0.tar.gz
[watrous@myhost ~]$ hdfs namenode -format
That last command creates an HDFS file system in the tmp folder. In my case it was created here: /tmp/hadoop-watrous/dfs/.
Environment variables were added to .bash_profile for JAVA_HOME and HADOOP_INSTALL, as shown. These can also be exported manually each time you log in.
export JAVA_HOME=/usr/lib/jvm/jre
export HADOOP_INSTALL=/home/watrous/hadoop-2.2.0
export PATH=$PATH:$HADOOP_INSTALL/bin
I can now verify that Hadoop is installed and ready to run.
[watrous@myhost ~]$ hadoop version
Hadoop 2.2.0
Subversion https://svn.apache.org/repos/asf/hadoop/common -r 1529768
Compiled by hortonmu on 2013-10-07T06:28Z
Compiled with protoc 2.5.0
From source with checksum 79e53ce7994d1628b240f09af91e1af4
This command was run using /home/watrous/hadoop-2.2.0/share/hadoop/common/hadoop-common-2.2.0.jar
Get some seed data
Now that Hadoop is set up, I need some seed data to operate on. For this I grabbed a log file from one of my production servers.
[watrous@myhost ~]$ mkdir input
[watrous@myhost ~]$ scp watrous@mywebhost.com:/var/lib/tomcat/logs/catalina.out ./input/
Creating Map Reduce Classes
The simplest job in Hadoop requires a Mapper class, a Reducer class, and a third class that identifies the Mapper and Reducer, including the data types that connect them. The examples below require two jars from the release downloaded above:
- hadoop-2.2.0/share/hadoop/common/hadoop-common-2.2.0.jar
- hadoop-2.2.0/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.2.0.jar
I also use regular expressions in Java to analyze each line in the log. Regular expressions can be more resilient to variations and allow for grouping, which gives easy access to specific data elements. As always, I used Kodos to develop the regular expression.
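To illustrate what the grouping provides, here is a small standalone sketch (the class name RegexGroupDemo is just for illustration and is not part of the job) that runs the same pattern against the sample log line referenced in the Mapper below and prints each captured group.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexGroupDemo {

    public static void main(String[] args) {
        String pattern = "([0-9]{4}-[0-9]{2}-[0-9]{2}\\s[0-9]{2}:[0-9]{2}:[0-9]{2}.[0-9]{3})\\s*([a-zA-Z]+)\\s*([a-zA-Z.]+)\\s*-\\s*(.+)$";
        String line = "2013-11-08 04:06:56,586 DEBUG component.helpers.GenericSOAPConnector"
                + " - Attempting to connect to: https://remotehost.com/app/rfc/entry/msg_status";
        Matcher m = Pattern.compile(pattern).matcher(line);
        if (m.find()) {
            System.out.println("date:    " + m.group(1)); // 2013-11-08 04:06:56,586
            System.out.println("level:   " + m.group(2)); // DEBUG
            System.out.println("class:   " + m.group(3)); // component.helpers.GenericSOAPConnector
            System.out.println("message: " + m.group(4)); // Attempting to connect to: ...
        }
    }
}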
In the example below, I don’t actually use the log value, but instead I just count up how many occurrences there are by key.
Mapper class
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TomcatLogErrorMapper extends Mapper<LongWritable, Text, Text, Text> {

    String pattern = "([0-9]{4}-[0-9]{2}-[0-9]{2}\\s[0-9]{2}:[0-9]{2}:[0-9]{2}.[0-9]{3})\\s*([a-zA-Z]+)\\s*([a-zA-Z.]+)\\s*-\\s*(.+)$";
    // Create a Pattern object
    Pattern r = Pattern.compile(pattern);

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        Matcher m = r.matcher(line);
        if (m.find()) {
            // only consider ERRORs for this example
            if (m.group(2).contains("ERROR")) {
                // example log line:
                // 2013-11-08 04:06:56,586 DEBUG component.helpers.GenericSOAPConnector - Attempting to connect to: https://remotehost.com/app/rfc/entry/msg_status
                // m.group(0) = complete line, m.group(1) = date, m.group(2) = log level,
                // m.group(3) = class, m.group(4) = message
                context.write(new Text(m.group(1)), new Text(m.group(2) + m.group(3) + m.group(4)));
            }
        }
    }
}
Reducer class
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class TomcatLogErrorReducer extends Reducer<Text, Text, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        int countValue = 0;
        for (Text value : values) {
            countValue++;
        }
        context.write(key, new IntWritable(countValue));
    }
}
Job class with main
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TomcatLogError {

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: TomcatLogError <input path> <output path>");
            System.exit(-1);
        }
        Job job = new Job();
        job.setJarByClass(TomcatLogError.class);
        job.setJobName("Tomcat Log Error");

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapperClass(TomcatLogErrorMapper.class);
        job.setReducerClass(TomcatLogErrorReducer.class);

        // the mapper emits <Text, Text> and the reducer emits <Text, IntWritable>
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Running Hadoop
In NetBeans I made sure that the Main Class of the compiled jar was TomcatLogError. I then ran Clean and Build to get a jar, which I transferred to the server where I installed Hadoop.
[watrous@myhost ~]$ hadoop jar HadoopExample.jar input/catalina.out ~/output
13/11/11 19:20:52 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
13/11/11 19:20:52 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
...
13/11/11 18:36:57 INFO mapreduce.Job: Job job_local1725513594_0001 completed successfully
13/11/11 18:36:57 INFO mapreduce.Job: Counters: 27
        File System Counters
                FILE: Number of bytes read=430339145
                FILE: Number of bytes written=1057396
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
        Map-Reduce Framework
                Map input records=1101516
                Map output records=105
                Map output bytes=20648
                Map output materialized bytes=20968
                Input split bytes=396
                Combine input records=0
                Combine output records=0
                Reduce input groups=23
                Reduce shuffle bytes=0
                Reduce input records=105
                Reduce output records=23
                Spilled Records=210
                Shuffled Maps =0
                Failed Shuffles=0
                Merged Map outputs=0
                GC time elapsed (ms)=234
                CPU time spent (ms)=0
                Physical memory (bytes) snapshot=0
                Virtual memory (bytes) snapshot=0
                Total committed heap usage (bytes)=1827143680
        File Input Format Counters
                Bytes Read=114455257
        File Output Format Counters
                Bytes Written=844
The output folder now contains a file named part-r-00000 with the results of the processing.
[watrous@c0003913 ~]$ more output/part-r-00000
2013-11-08 04:04:51,894 2
2013-11-08 05:04:52,711 2
2013-11-08 05:33:23,073 3
2013-11-08 06:04:53,689 2
2013-11-08 07:04:54,366 3
2013-11-08 08:04:55,096 2
2013-11-08 13:34:28,936 2
2013-11-08 17:32:31,629 3
2013-11-08 18:51:17,357 1
2013-11-08 18:51:17,423 1
2013-11-08 18:51:17,491 1
2013-11-08 18:51:17,499 1
2013-11-08 18:51:17,500 1
2013-11-08 18:51:17,502 1
2013-11-08 18:51:17,503 1
2013-11-08 18:51:17,504 1
2013-11-08 18:51:17,506 1
2013-11-08 18:51:17,651 6
2013-11-08 18:51:17,652 23
2013-11-08 18:51:17,653 25
2013-11-08 18:51:17,654 19
2013-11-08 19:01:13,771 2
2013-11-08 21:32:34,522 2
Based on this analysis, a number of errors were produced around 18:51:17. Now that I know when the errors happened, it is easy to change the Mapper class to emit a different key, such as the class or message, to identify more precisely what the errors are. A sketch of that change follows.
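Only the map method needs to change so that the key is the logging class (capture group 3) instead of the timestamp. This is a sketch of that idea, not code from the original project.

// Sketch: same pattern and fields as TomcatLogErrorMapper, but keyed on the
// logging class so the reducer counts how many errors each class produced.
@Override
public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    Matcher m = r.matcher(value.toString());
    if (m.find() && m.group(2).contains("ERROR")) {
        // key = class that logged the error, value = timestamp and message
        context.write(new Text(m.group(3)), new Text(m.group(1) + " " + m.group(4)));
    }
}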
Increasing scale
The Mapper and Reducer classes can be enhanced to emit more relevant details. The process of transferring the files can also be automated, and the input can be adapted to read a whole directory rather than a single file (see the sketch below). Reports can also be aggregated and placed in a web directory or emailed.
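As a sketch of the directory idea, FileInputFormat accepts a directory as an input path and feeds every file in it to the mappers; calling addInputPath more than once pulls in logs collected from several servers. The paths below are placeholders, not paths from my setup.

// Sketch: point the job at directories of logs rather than a single file.
// Every file in each directory becomes input to the mappers.
FileInputFormat.addInputPath(job, new Path("input/server-a"));
FileInputFormat.addInputPath(job, new Path("input/server-b"));
FileOutputFormat.setOutputPath(job, new Path("output/errors-by-time"));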
Can you provide some detailed steps on setting up the jar file? If I have Eclipse, how do I go about doing this?
Thanks
srikrishna
That’s a great idea. For this article, I think I used Eclipse to create the code. I’ll try to track it down and update the article when I have a minute.
I am not able to locate the output directory. Can you please clarify where it should be?
If you followed my example precisely, you would have specified the output directory as “~/output”, which is a directory named “output” off your home directory. In the last code snippet, I type “more output/part-r-00000” from my home directory. I could also type “more ~/output/part-r-00000”, in which case it wouldn’t matter in which directory I was.
Very nice tutorial. To run your example I had to import the log file to HDFS using the command:
hadoop dfs -copyFromLocal /tmp/catalina.out /user/hduser/keyword
And then read the output as:
./hadoop dfs -cat /tmp/output/part-r-00000
Can you please clarify how you are able to view the output without DFS?
Thanks
DFS is not required to use Hadoop. Since this was a simple demo, I passed local files and a local output directory to my hadoop call. Hadoop doesn’t have any problem with local directories.
“hadoop jar HadoopExample.jar input/catalina.out ~/output”
Thanks for the tutorial. It worked for me. But I want to extend it by displaying associated text records. How can I go about this?
I’m not sure what you mean by “associated text records”. If you have some key value that’s common between the output from this and the other associated text records, map both sources and reduce them down based on that common key. A rough sketch of that idea follows.
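This is only a sketch, assuming a shared key such as a request ID (not something from the original logs): each source’s mapper emits records under the common key, and a reducer like this concatenates everything it receives for that key.

// Sketch: the reducer sees every record that shares a key, regardless of
// which mapper/source emitted it, and can combine them into one output line.
@Override
public void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
    StringBuilder combined = new StringBuilder();
    for (Text value : values) {
        if (combined.length() > 0) {
            combined.append(" | ");
        }
        combined.append(value.toString());
    }
    context.write(key, new Text(combined.toString()));
}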
Hi, I have tried this sample but the reducer output was empty. Can someone share the input file format? Then we will be able to understand the logic of the map-reduce job. Please help.