Daniel Watrous on Software Engineering

A Collection of Software Problems and Solutions

Analyze Tomcat Logs using PIG (hadoop)

In a previous post I illustrated the use of Hadoop to analyze Apache Tomcat log files (catalina.out). Below I perform the same Tomcat log analysis using PIG.

The motivation behind PIG is the ability to use a descriptive language to analyze large sets of data rather than writing code (in Java or Python, for example) to process it. PIG Latin is the descriptive query language; it has some similarities to SQL, including grouping and filtering.

Load in the data

First I launch into the interactive local PIG command line, grunt. Commands are not case sensitive, but it can be helpful to distinguish function names from variables. I show all commands in CAPS. Since the catalina.out data is not in a structured format (csv, tab, etc.), I load each line as a chararray (string).

[watrous@myhost ~]$ pig -x local
grunt> raw_log_entries = LOAD '/opt/mount/input/sample/catalina.out' USING TextLoader AS (line:chararray);
grunt> illustrate raw_log_entries;
--------------------------------------------------------------------------------------------------------------------------------------------
| raw_log_entries     | line:chararray                                                                                                     |
--------------------------------------------------------------------------------------------------------------------------------------------
|                     | 2013-10-30 04:20:18,897 DEBUG component.JVFunctions  - JVList: got docc03931336Instant Ink 2 - Alert # q14373261 |
--------------------------------------------------------------------------------------------------------------------------------------------

Note that it is also possible to provide a directory and PIG will load all files in the given directory.

Use regular expressions to parse each line

Now that the data is loaded, I want to split each line into fields. To do this in PIG I use regular expressions with the REGEX_EXTRACT_ALL function. Notice that I double escape regex symbols, such as \\s for space. In the command below, FLATTEN turns the matched values into a tuple that can be matched up with the AS fields. I’m treating all fields as chararray.

grunt> logs_base = FOREACH raw_log_entries GENERATE
>> FLATTEN(
>> REGEX_EXTRACT_ALL(line, '^([0-9]{4}-[0-9]{2}-[0-9]{2}\\s[0-9:,]+)\\s([a-zA-Z]+)\\s+([a-zA-Z0-9.]+)\\s+(.*)$')
>> ) AS (
>> logDate:      chararray,
>> logLevel:     chararray,
>> logClass:     chararray,
>> logMessage:   chararray
>> );
grunt> illustrate logs_base;
-----------------------------------------------------------------------------------------------------------
| raw_log_entries     | line:chararray                                                                    |
-----------------------------------------------------------------------------------------------------------
|                     | 2013-11-08 04:26:27,966 DEBUG component.JVFunctions  - Visible Level Added :LEV1 |
-----------------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------------------------------
| logs_base     | logDate:chararray       | logLevel:chararray      | logClass:chararray      | logMessage:chararray        |
-----------------------------------------------------------------------------------------------------------------------------
|               | 2013-11-08 04:26:27,966 | DEBUG                   | component.JVFunctions  | - Visible Level Added :LEV1 |
-----------------------------------------------------------------------------------------------------------------------------

Filter, Group, and Generate the desired output

I want to report on the ERROR logs by timestamp. I first filter logs_base by the logLevel field and then group the filtered records by logDate. Next I use FOREACH to GENERATE a result set that includes each timestamp and a count of errors at that time. Finally I dump the results.

grunt> filtered_records = FILTER logs_base BY logLevel == 'ERROR';
grunt> grouped_records = GROUP filtered_records BY logDate;
grunt> log_count = FOREACH grouped_records GENERATE group, COUNT(filtered_records);
grunt> dump log_count;
 
HadoopVersion   PigVersion      UserId  StartedAt       FinishedAt      Features
1.0.0   0.12.0  watrous 2013-12-05 21:38:54     2013-12-05 21:39:15     GROUP_BY,FILTER
 
Success!
 
Job Stats (time in seconds):
JobId   Alias   Feature Outputs
job_local_0002  filtered_records,grouped_records,log_count,logs_base,raw_log_entries    GROUP_BY,COMBINER       file:/tmp/temp1196141656/tmp-135873072,
 
Input(s):
Successfully read records from: "/opt/mount/input/sample/catalina.out"
 
Output(s):
Successfully stored records in: "file:/tmp/temp1196141656/tmp-135873072"
 
Job DAG:
job_local_0002
 
 
2013-12-05 21:39:15,813 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
2013-12-05 21:39:15,814 [main] WARN  org.apache.pig.data.SchemaTupleBackend - SchemaTupleBackend has already been initialized
2013-12-05 21:39:15,815 [main] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2013-12-05 21:39:15,815 [main] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
(2013-11-08 04:04:51,894,2)
(2013-11-08 05:04:52,711,2)
(2013-11-08 05:33:23,073,3)
(2013-11-08 06:04:53,689,2)
(2013-11-08 07:04:54,366,3)
(2013-11-08 08:04:55,096,2)
(2013-11-08 13:34:28,936,2)
(2013-11-08 17:32:31,629,3)
(2013-11-08 18:50:56,971,1)
(2013-11-08 18:50:56,980,1)
(2013-11-08 18:50:56,986,1)
(2013-11-08 18:50:57,008,1)
(2013-11-08 18:50:57,017,1)
(2013-11-08 18:50:57,024,1)
(2013-11-08 18:51:17,357,1)
(2013-11-08 18:51:17,423,1)
(2013-11-08 18:51:17,491,1)
(2013-11-08 18:51:17,499,1)
(2013-11-08 18:51:17,500,1)
(2013-11-08 18:51:17,502,1)
(2013-11-08 18:51:17,503,1)
(2013-11-08 18:51:17,504,1)
(2013-11-08 18:51:17,506,1)
(2013-11-08 18:51:17,651,6)
(2013-11-08 18:51:17,652,23)
(2013-11-08 18:51:17,653,25)
(2013-11-08 18:51:17,654,19)
(2013-11-08 19:01:13,771,2)
(2013-11-08 21:32:34,522,2)

Performance in PIG

Performance can suffer, since PIG Latin must be translated into one or more MapReduce jobs, and that translation doesn’t always produce the most efficient plan. However, for smaller datasets, the lower performance may be offset by eliminating the build phase required when writing your own MapReduce jobs.

Troubleshooting

I spent way more time trying to get PIG working than I felt I should have. The PIG mailing list was very helpful and quick. Here are some pointers.

Agreement of Hadoop version

PIG is compiled against a specific version of Hadoop. As a result, any local Hadoop installation must match the version PIG was compiled against. If the local Hadoop version doesn’t agree with the version used when building PIG, it’s possible to remove all references to the local Hadoop installation, in which case PIG will fall back to its bundled version of Hadoop. In my case I had to remove the Hadoop binaries from my PATH.

Documentation and examples

There are very few examples showing the use of PIG, and of those that I found, none worked as written. This seems to indicate either that PIG is moving very fast or that the developers are unhappy with the APIs, which change frequently.

References

http://aws.amazon.com/articles/2729
Hadoop: The Definitive Guide

Big Data Cache Approaches

I’ve had several conversations recently about caching as it relates to big data. As a result of those discussions, I wanted to review some details to consider when deciding whether a cache is necessary and, when it is, how to cache big data.

What is a Cache?

The purpose of a cache is to duplicate frequently accessed or important data in such a way that it can be accessed very fast and close to where it is needed. Caching generally moves data from a low cost, high density location (e.g. disk) to a high cost, very fast location (e.g. RAM). In some cases a cache moves data from a high latency context (e.g. network) to a low latency context (e.g. local disk). A cache increases hardware requirements due to duplication of data.

Big Data Alternatives

A review of big data alternatives shows that some are designed to operate entirely from RAM while others operate primarily from disk. Each option has made a trade-off that can be viewed along two competing priorities: Scalability & Performance vs. Depth of Functionality.

While relational databases are heavy on functionality and lag on performance, other alternatives like Redis are heavy on performance but lag with respect to features.

It’s also important to recognize that some tools, such as Lucene and Memcached, are not typically considered big data tools, yet they effectively occupy the same feature space.

Robust functionality requirements may necessitate using a lower performance big data alternative. If that lower performance impacts SLAs (Service Level Agreements), it may be necessary to introduce a more performant alternative, either as a replacement for or a complement to the first choice.

In other words, some of the big data alternatives occupy the cache tool space, while others are more feature rich datastores.

Cache Scope and Integration

Cache scope and cache positioning have a significant impact on the effectiveness of a solution. In the illustration below I outline a typical tiered web application architecture.

Big data integration with a cache can be introduced at any tier within a web application

As indicated, each layer can be made to interact with a cache. I’ve also highlighted a few example technologies that may be a good fit for each feature space. A cache mechanism can be introduced in any tier, or even in multiple tiers.

Some technologies, like MongoDB, build some level of caching directly into their engine. This can be convenient, but it may not address latency or algorithm complexity. It may also not be feasible to cache a large result set when a query affects a large number of records or documents.

Algorithm Complexity and Load

To discuss algorithm complexity, let’s consider a couple of examples. Imagine a sample of web analytics data comprising 5,000,000 interactions. Each interaction document includes the following information:

  • timestamp
  • browser details
  • page details
  • tagging details
  • username if available
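To make the examples below concrete, a single interaction document might look like the following sketch, shown as a Python dict. The field names and values are illustrative only, since the original data model is only described in outline above.

from datetime import datetime

interaction = {
    "timestamp": datetime(2013, 11, 8, 4, 26, 27),           # hypothetical value
    "browser": {"name": "Chrome", "version": "30.0", "os": "Windows 7"},
    "page": {"url": "/products/widget-x", "title": "Widget X"},
    "tags": ["campaign:fall2013", "section:products"],
    "username": "jdoe"                                        # omitted for anonymous visitors
}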

Algorithm #1: User Behavior

An algorithm that queries interactions by username and calculates the total number of interactions and time on site will require access to a very small subset of the total data, since not all records contain a username and not all users are logged in when viewing the site (some may not even have an account). Also, the number of interactions a single user can produce is limited.

It’s likely that no caching beyond the datastore would be required in this case. A database-level cache may be helpful, since the records involved are few and can remain in memory throughout the request cycle. Because the number of requests for a particular user is likely to be small, caching at the web tier would yield little value while still consuming valuable memory.
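A minimal sketch of this algorithm in Python follows, assuming the interactions are stored in MongoDB (mentioned earlier) and accessed through the pymongo driver. The database, collection, and field names match the hypothetical document above and are not from the original post; timestamps are assumed to be stored as dates, which pymongo returns as datetime objects.

from pymongo import MongoClient

client = MongoClient()                      # assumes MongoDB running locally
interactions = client.analytics.interactions

def user_summary(username):
    # Touches only the small subset of documents belonging to this user.
    docs = list(interactions.find({"username": username}, {"timestamp": 1}))
    if not docs:
        return {"interactions": 0, "time_on_site_seconds": 0}
    timestamps = sorted(d["timestamp"] for d in docs)
    # Rough "time on site": span between first and last interaction.
    span = (timestamps[-1] - timestamps[0]).total_seconds()
    return {"interactions": len(docs), "time_on_site_seconds": span}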

Algorithm #2: Browser Choice by Demographic

Unlike the algorithm above, which assumes many usernames (fewer records per username), there are likely to be a dozen or fewer browser types across the entire collection of data (many records per browser type). As a result, an algorithm that calculates the number of pages visited using a specific browser may need to touch millions of records.

In this case a native datastore cache may fall short, and queries would hit disk every time. A carefully designed application-level cache could reduce load on the datastore while increasing responsiveness.

If an application-level cache is selected, a decision still needs to be made about scope: should a subset of the data be stored in the cache as individual records or record fragments, or should the data be transformed into a more concise format before being placed in the cache?
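As an illustration of the second approach, the sketch below caches the transformed summary (a small dict of counts per browser) rather than the millions of raw documents that produced it. The cache key, TTL, and field names are assumptions for illustration, not part of the original post.

import time

_cache = {}        # application-level cache: key -> (expires_at, value)
CACHE_TTL = 300    # seconds; how stale a report is allowed to be (an assumption)

def browser_report(interactions):
    # Returns {browser_name: page_count}, caching the concise summary
    # rather than the raw documents.
    entry = _cache.get("browser_report")
    if entry and entry[0] > time.time():
        return entry[1]                      # cache hit: no datastore or CPU cost

    counts = {}
    for doc in interactions:                 # expensive pass over millions of documents
        name = doc["browser"]["name"]
        counts[name] = counts.get(name, 0) + 1

    _cache["browser_report"] = (time.time() + CACHE_TTL, counts)
    return counts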

Notice that the cache is not only a function of memory resources but also of CPU. There may be sufficient memory to store all the data, but if the algorithm consumes a great deal of CPU, the application may still lag despite fast access to the data.

If the exact same report is of interest to many people, a web tier cache may be justified, since there would be no value in generating the exact same report on each request. However, this needs to be balanced against authentication and authorization requirements that may protect that data.

Algorithm complexity and server load can influence cache design as much as datastore performance.

Balance

We’ve discussed some trade-offs between available memory and CPU. Additional factors include clarity of code, latency between tiers, scale, and performance. In some cases an in-memory solution may fall short due to latency between a cache server and the application servers. In those cases, proximity to the data may require distributing the cache so that latency is limited by system bus speeds rather than network connectivity.

CAUTION: As you consider introducing a cache at any level, make sure you’re not prematurely optimizing. Code clarity and overall architecture can suffer considerably when optimization is pursued without justification.

If you don’t need a cache, don’t introduce one.

Existing Cache Solutions

Caching can be implemented inside or outside your application. SaaS solutions, such as Akamai, prevent requests from ever reaching your servers. Build-your-own solutions that leverage cloud CDN offerings are also available. When building caching into your own stack, there are options like Varnish Cache, and for Java applications EHCache or cache4guice may be a good fit. As mentioned above, there are many technologies available to create a custom cache, among which Memcached and Redis are very popular. These work with a variety of programming languages and offer flexibility with respect to what is stored and how it is stored.
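As one example, a small Python sketch of an application-level cache backed by Redis using the redis-py client might look like the following. The key name, TTL, and the JSON serialization are illustrative assumptions; the expensive report function is whatever computation you want to avoid repeating.

import json
import redis

r = redis.Redis(host="localhost", port=6379)   # assumes a local Redis instance

def cached_report(key, compute_report, ttl=300):
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)            # served from RAM, close to the application
    report = compute_report()                # the expensive datastore/CPU work
    r.setex(key, ttl, json.dumps(report))    # expire so stale results age out
    return report

# e.g. cached_report("reports:browser", some_expensive_report_function)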

Best Solution

Hopefully it’s obvious that there’s no single winner when it comes to cache technology. However, if you find that the solution you’ve chosen doesn’t perform as well as required, there are many options to get that data closer to the CPU and even reduce the CPU load for repetitive processing.