
Analyze Tomcat Logs using PIG (Hadoop)

In a previous post I illustrated the use of Hadoop to analyze Apache Tomcat log files (catalina.out). Below I perform the same Tomcat log analysis using PIG.

The motivation behind PIG is the ability to use a descriptive language to analyze large sets of data, rather than writing code (in Java or Python, for example) to process it. Pig Latin is the descriptive query language, and it has some similarities with SQL, including grouping and filtering.
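As a quick illustration of that SQL-like feel, a minimal sketch of filtering and grouping in Pig Latin might look like the following (the file, relation and field names here are made up purely for illustration):

grunt> users = LOAD 'users.csv' USING PigStorage(',') AS (name:chararray, country:chararray, age:int);
grunt> adults = FILTER users BY age >= 18;
grunt> by_country = GROUP adults BY country;
grunt> counts = FOREACH by_country GENERATE group, COUNT(adults);
grunt> dump counts;

The same idea in SQL would be a SELECT with a WHERE clause and a GROUP BY; in Pig Latin each step is a named relation that can be inspected on its own.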

Load in the data

First I launch the interactive local PIG shell, grunt. Keywords are not case sensitive, but it can be helpful to distinguish them from aliases, so I show all keywords in CAPS. Since the catalina.out data is not in a structured format (CSV, tab-delimited, etc.), I load each line as a single chararray (string).

[watrous@myhost ~]$ pig -x local
grunt> raw_log_entries = LOAD '/opt/mount/input/sample/catalina.out' USING TextLoader AS (line:chararray);
grunt> illustrate raw_log_entries;
--------------------------------------------------------------------------------------------------------------------------------------------
| raw_log_entries     | line:chararray                                                                                                     |
--------------------------------------------------------------------------------------------------------------------------------------------
|                     | 2013-10-30 04:20:18,897 DEBUG component.JVFunctions  - JVList: got docc03931336Instant Ink 2 - Alert # q14373261 |
--------------------------------------------------------------------------------------------------------------------------------------------

Note that it is also possible to provide a directory and PIG will load all files in the given directory.
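For example, pointing LOAD at the sample directory rather than a single file would pull in every log file it contains (a sketch, assuming the directory only holds log files you want processed):

grunt> raw_log_entries = LOAD '/opt/mount/input/sample' USING TextLoader AS (line:chararray);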

Use regular expressions to parse each line

Now that I have the data in, I want to split each line into fields. To do this in PIG I use regular expressions with the REGEX_EXTRACT_ALL function. Notice that I double escape regex symbols, such as \\s for space. In the command below, the FLATTEN turns the matched values into a tuple that can be matched up with the AS fields. I’m treating all fields as chararray.

grunt> logs_base = FOREACH raw_log_entries GENERATE
>> FLATTEN(
>> REGEX_EXTRACT_ALL(line, '^([0-9]{4}-[0-9]{2}-[0-9]{2}\\s[0-9:,]+)\\s([a-zA-Z]+)\\s+([a-zA-Z0-9.]+)\\s+(.*)$')
>> ) AS (
>> logDate:      chararray,
>> logLevel:     chararray,
>> logClass:     chararray,
>> logMessage:   chararray
>> );
grunt> illustrate logs_base;
-----------------------------------------------------------------------------------------------------------
| raw_log_entries     | line:chararray                                                                    |
-----------------------------------------------------------------------------------------------------------
|                     | 2013-11-08 04:26:27,966 DEBUG component.JVFunctions  - Visible Level Added :LEV1 |
-----------------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------------------------------
| logs_base     | logDate:chararray       | logLevel:chararray      | logClass:chararray      | logMessage:chararray        |
-----------------------------------------------------------------------------------------------------------------------------
|               | 2013-11-08 04:26:27,966 | DEBUG                   | component.JVFunctions  | - Visible Level Added :LEV1 |
-----------------------------------------------------------------------------------------------------------------------------

Filter, Group, and Generate the desired output

I want to report on the ERROR logs by timestamp. I first filter the log base by the logLevel field, then group the filtered records by logDate. I then use FOREACH to GENERATE a result set containing each timestamp and a count of errors at that time, and finally I dump the results.

grunt> filtered_records = FILTER logs_base BY logLevel == 'ERROR';
grunt> grouped_records = GROUP filtered_records BY logDate;
grunt> log_count = FOREACH grouped_records GENERATE group, COUNT(filtered_records);
grunt> dump log_count;
 
HadoopVersion   PigVersion      UserId  StartedAt       FinishedAt      Features
1.0.0   0.12.0  watrous 2013-12-05 21:38:54     2013-12-05 21:39:15     GROUP_BY,FILTER
 
Success!
 
Job Stats (time in seconds):
JobId   Alias   Feature Outputs
job_local_0002  filtered_records,grouped_records,log_count,logs_base,raw_log_entries    GROUP_BY,COMBINER       file:/tmp/temp1196141656/tmp-135873072,
 
Input(s):
Successfully read records from: "/opt/mount/input/sample/catalina.out"
 
Output(s):
Successfully stored records in: "file:/tmp/temp1196141656/tmp-135873072"
 
Job DAG:
job_local_0002
 
 
2013-12-05 21:39:15,813 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
2013-12-05 21:39:15,814 [main] WARN  org.apache.pig.data.SchemaTupleBackend - SchemaTupleBackend has already been initialized
2013-12-05 21:39:15,815 [main] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2013-12-05 21:39:15,815 [main] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
(2013-11-08 04:04:51,894,2)
(2013-11-08 05:04:52,711,2)
(2013-11-08 05:33:23,073,3)
(2013-11-08 06:04:53,689,2)
(2013-11-08 07:04:54,366,3)
(2013-11-08 08:04:55,096,2)
(2013-11-08 13:34:28,936,2)
(2013-11-08 17:32:31,629,3)
(2013-11-08 18:50:56,971,1)
(2013-11-08 18:50:56,980,1)
(2013-11-08 18:50:56,986,1)
(2013-11-08 18:50:57,008,1)
(2013-11-08 18:50:57,017,1)
(2013-11-08 18:50:57,024,1)
(2013-11-08 18:51:17,357,1)
(2013-11-08 18:51:17,423,1)
(2013-11-08 18:51:17,491,1)
(2013-11-08 18:51:17,499,1)
(2013-11-08 18:51:17,500,1)
(2013-11-08 18:51:17,502,1)
(2013-11-08 18:51:17,503,1)
(2013-11-08 18:51:17,504,1)
(2013-11-08 18:51:17,506,1)
(2013-11-08 18:51:17,651,6)
(2013-11-08 18:51:17,652,23)
(2013-11-08 18:51:17,653,25)
(2013-11-08 18:51:17,654,19)
(2013-11-08 19:01:13,771,2)
(2013-11-08 21:32:34,522,2)

Performance in PIG

Performance can suffer, since Pig Latin must be translated into one or more MapReduce jobs, and that translation doesn’t always produce the most efficient plan. For smaller datasets, however, the lower runtime performance may be offset by eliminating the build and deploy cycle required when writing your own MapReduce jobs.
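To see what PIG actually plans to run, the EXPLAIN operator prints the logical, physical and MapReduce plans for a relation. Against the log_count relation from above, that would be:

grunt> explain log_count;

This is a quick way to sanity-check how many MapReduce jobs a script will be translated into before running it against a large dataset.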

Troubleshooting

I spent way more time trying to get PIG working than I felt I should have. The PIG mailing list was very helpful and quick. Here are some pointers.

Agreement of Hadoop version

PIG is compiled against a specific version of Hadoop, so any Hadoop installation on the local machine must match the version PIG was built with. If the versions don’t agree, it’s possible to remove all references to the local Hadoop installation, and PIG will fall back to its bundled Hadoop libraries. In my case I had to remove the Hadoop binaries from my PATH.
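One way to confirm which versions are in play, and to drop the local Hadoop from the PATH, is sketched below (the Hadoop install path is hypothetical; adjust to your environment):

[watrous@myhost ~]$ hadoop version
[watrous@myhost ~]$ pig -version
[watrous@myhost ~]$ export PATH=$(echo "$PATH" | sed -e 's|:/usr/local/hadoop/bin||')

After the PATH change, launching pig again should show it using its own bundled Hadoop rather than the local install.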

Documentation and examples

There are very few examples showing the use of PIG, and of those that I found, none worked as written. This seems to indicate either that PIG is moving very fast or that the developers are unhappy with the APIs, which change frequently.

