Analyze Tomcat Logs using PIG (hadoop)

In a previous post I illustrated the use of Hadoop to analyze Apache Tomcat log files (catalina.out). Below I perform the same Tomcat log analysis using PIG.

The motivation behind PIG is the ability to use a descriptive language to analyze large sets of data, rather than writing code in Java or Python, for example, to process it. Pig Latin is that descriptive query language, and it shares some features with SQL, including grouping and filtering.

Load in the data

First I launch the interactive PIG command line, grunt, in local mode. Pig Latin keywords such as LOAD and FOREACH are not case sensitive, but relation names, field names and function names are. I show keywords in CAPS to distinguish them from relation and field names (apart from the diagnostic commands illustrate and dump). Since the catalina.out data is not in a structured format (CSV, tab-delimited, etc.), I load each line as a chararray (string).

[watrous@myhost ~]$ pig -x local
grunt> raw_log_entries = LOAD '/opt/mount/input/sample/catalina.out' USING TextLoader AS (line:chararray);
grunt> illustrate raw_log_entries;
--------------------------------------------------------------------------------------------------------------------------------------------
| raw_log_entries     | line:chararray                                                                                                     |
--------------------------------------------------------------------------------------------------------------------------------------------
|                     | 2013-10-30 04:20:18,897 DEBUG component.JVFunctions  - JVList: got docc03931336Instant Ink 2 - Alert # q14373261 |
--------------------------------------------------------------------------------------------------------------------------------------------

Note that it is also possible to provide a directory and PIG will load all files in the given directory.
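
As a minimal sketch of loading more than one file, assuming the log files live directly under /opt/mount/input/sample (the directory used in the example above), the same LOAD statement can point at the directory or at a glob. The catalina*.out pattern and the relation names dir_log_entries and glob_log_entries are hypothetical.

```pig
-- Load every file in the directory (assumes it contains only log files).
dir_log_entries = LOAD '/opt/mount/input/sample' USING TextLoader() AS (line:chararray);

-- Hadoop-style globs are also accepted in the path.
-- 'catalina*.out' is a hypothetical pattern for rotated log files.
glob_log_entries = LOAD '/opt/mount/input/sample/catalina*.out' USING TextLoader() AS (line:chararray);
```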

Use regular expressions to parse each line

Now that I have the data in, I want to split each line into fields. To do this in PIG I use regular expressions with the REGEX_EXTRACT_ALL function. Notice that I double escape regex symbols, such as \\s for whitespace, because the backslash must itself be escaped inside a PIG string literal. In the command below, REGEX_EXTRACT_ALL returns the matched groups as a tuple, and FLATTEN un-nests that tuple into separate fields that line up with the names in the AS clause. I’m treating all fields as chararray.

grunt> logs_base = FOREACH raw_log_entries GENERATE
>> FLATTEN(
>> REGEX_EXTRACT_ALL(line, '^([0-9]{4}-[0-9]{2}-[0-9]{2}\\s[0-9:,]+)\\s([a-zA-Z]+)\\s+([a-zA-Z0-9.]+)\\s+(.*)$')
>> ) AS (
>> logDate:      chararray,
>> logLevel:     chararray,
>> logClass:     chararray,
>> logMessage:   chararray
>> );
grunt> illustrate logs_base;
-----------------------------------------------------------------------------------------------------------
| raw_log_entries     | line:chararray                                                                    |
-----------------------------------------------------------------------------------------------------------
|                     | 2013-11-08 04:26:27,966 DEBUG component.JVFunctions  - Visible Level Added :LEV1 |
-----------------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------------------------------
| logs_base     | logDate:chararray       | logLevel:chararray      | logClass:chararray      | logMessage:chararray        |
-----------------------------------------------------------------------------------------------------------------------------
|               | 2013-11-08 04:26:27,966 | DEBUG                   | component.JVFunctions  | - Visible Level Added :LEV1 |
-----------------------------------------------------------------------------------------------------------------------------
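
Tomcat logs often contain lines that do not start with a timestamp, such as stack trace continuation lines, and as far as I know REGEX_EXTRACT_ALL returns null for any line the pattern does not match. One way to deal with this, sketched here on the assumption that such lines can simply be dropped, is to filter the raw lines with MATCHES before parsing. The relation name timestamped_entries is hypothetical.

```pig
-- Keep only lines that begin with a yyyy-MM-dd date followed by whitespace.
-- MATCHES has to match the whole line, hence the trailing .*
timestamped_entries = FILTER raw_log_entries
    BY line MATCHES '^[0-9]{4}-[0-9]{2}-[0-9]{2}\\s.*';

-- Parse the surviving lines with the same pattern as above.
-- A line that starts with a date but has a different layout can still yield nulls.
logs_base = FOREACH timestamped_entries GENERATE
    FLATTEN(
        REGEX_EXTRACT_ALL(line, '^([0-9]{4}-[0-9]{2}-[0-9]{2}\\s[0-9:,]+)\\s([a-zA-Z]+)\\s+([a-zA-Z0-9.]+)\\s+(.*)$')
    ) AS (
        logDate:    chararray,
        logLevel:   chararray,
        logClass:   chararray,
        logMessage: chararray
    );
```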

Filter, group and generate the desired output

I want to report on the ERROR logs by timestamp. I first filter logs_base by the logLevel field, then group the filtered records by logDate. Next I use FOREACH ... GENERATE to produce a result set containing each timestamp and a count of the errors logged at that time. Finally I dump the results.

grunt> filtered_records = FILTER logs_base BY logLevel == 'ERROR';
grunt> grouped_records = GROUP filtered_records BY logDate;
grunt> log_count = FOREACH grouped_records GENERATE group, COUNT(filtered_records);
grunt> dump log_count;
 
HadoopVersion   PigVersion      UserId  StartedAt       FinishedAt      Features
1.0.0   0.12.0  watrous 2013-12-05 21:38:54     2013-12-05 21:39:15     GROUP_BY,FILTER
 
Success!
 
Job Stats (time in seconds):
JobId   Alias   Feature Outputs
job_local_0002  filtered_records,grouped_records,log_count,logs_base,raw_log_entries    GROUP_BY,COMBINER       file:/tmp/temp1196141656/tmp-135873072,
 
Input(s):
Successfully read records from: "/opt/mount/input/sample/catalina.out"
 
Output(s):
Successfully stored records in: "file:/tmp/temp1196141656/tmp-135873072"
 
Job DAG:
job_local_0002
 
 
2013-12-05 21:39:15,813 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
2013-12-05 21:39:15,814 [main] WARN  org.apache.pig.data.SchemaTupleBackend - SchemaTupleBackend has already been initialized
2013-12-05 21:39:15,815 [main] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2013-12-05 21:39:15,815 [main] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
(2013-11-08 04:04:51,894,2)
(2013-11-08 05:04:52,711,2)
(2013-11-08 05:33:23,073,3)
(2013-11-08 06:04:53,689,2)
(2013-11-08 07:04:54,366,3)
(2013-11-08 08:04:55,096,2)
(2013-11-08 13:34:28,936,2)
(2013-11-08 17:32:31,629,3)
(2013-11-08 18:50:56,971,1)
(2013-11-08 18:50:56,980,1)
(2013-11-08 18:50:56,986,1)
(2013-11-08 18:50:57,008,1)
(2013-11-08 18:50:57,017,1)
(2013-11-08 18:50:57,024,1)
(2013-11-08 18:51:17,357,1)
(2013-11-08 18:51:17,423,1)
(2013-11-08 18:51:17,491,1)
(2013-11-08 18:51:17,499,1)
(2013-11-08 18:51:17,500,1)
(2013-11-08 18:51:17,502,1)
(2013-11-08 18:51:17,503,1)
(2013-11-08 18:51:17,504,1)
(2013-11-08 18:51:17,506,1)
(2013-11-08 18:51:17,651,6)
(2013-11-08 18:51:17,652,23)
(2013-11-08 18:51:17,653,25)
(2013-11-08 18:51:17,654,19)
(2013-11-08 19:01:13,771,2)
(2013-11-08 21:32:34,522,2)
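
Because logDate includes milliseconds, many of the groups above contain only one or two entries. A sketch of a coarser report, assuming the first 13 characters of logDate (yyyy-MM-dd HH) identify the hour, counts the errors per hour instead. It reuses filtered_records from the session above. The other relation and field names (error_hours, grouped_by_hour, hourly_count, sorted_hourly, logHour, errorCount) are hypothetical.

```pig
-- Reduce each timestamp to its hour, e.g. '2013-11-08 18'.
-- SUBSTRING takes a start index (inclusive) and a stop index (exclusive).
error_hours = FOREACH filtered_records GENERATE SUBSTRING(logDate, 0, 13) AS logHour;

-- Group by hour and count the ERROR entries in each group.
grouped_by_hour = GROUP error_hours BY logHour;
hourly_count = FOREACH grouped_by_hour GENERATE group AS logHour, COUNT(error_hours) AS errorCount;

-- Sort chronologically before printing.
sorted_hourly = ORDER hourly_count BY logHour;
DUMP sorted_hourly;
```

Changing the stop index to 16 would group by minute rather than by hour.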

Performance in PIG

Performance can suffer, since Pig Latin has to be translated into one or more MapReduce jobs, and that translation doesn’t always produce the most efficient jobs. However, for smaller datasets, the lower performance may be offset by eliminating the build phase required when writing your own MapReduce jobs.
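
To see how a script is translated, the EXPLAIN operator prints the logical, physical and MapReduce execution plans for a relation. A short sketch, assuming the grunt session above is still open so that log_count is defined:

```pig
-- Print the logical, physical and MapReduce execution plans for log_count.
EXPLAIN log_count;
```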

Troubleshooting

I spent way more time trying to get PIG working than I felt I should have. The PIG mailing list was very helpful and quick. Here are some pointers.

Agreement of Hadoop version

PIG is compiled against a specific version of Hadoop, so any local Hadoop installation must match that version. If the versions don’t agree, it’s possible to remove all references to the local Hadoop installation, and PIG will then use its bundled version of Hadoop. In my case I had to remove the Hadoop binaries from my PATH.

Documentation and examples

There are very few examples showing the use of PIG, and of those that I found, none worked as written. This seems to indicate either that PIG is moving very fast or that the developers are unhappy with the APIs, which change frequently.
