About 13 years ago I created my first integration with Authorize.net for a client who wanted to accept credit card payments directly on his website. The internet has changed a lot since then, and the frequency of fraud attempts has increased. While reviewing the server logs for one of my e-commerce websites, I identified a consistent credit card fraud signature. I refer to this as a shotgun attack, since the hacker sends through hundreds of credit card attempts. Here’s how it works and what to look for. All requests from a single......
Continue Reading
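The signature described above, many card attempts arriving from one source, can be spotted by counting requests per IP in the access logs. A minimal sketch, assuming a hypothetical log format and payment endpoint path (neither comes from the post):

```python
from collections import Counter

# Hypothetical access-log lines: "IP TIMESTAMP PATH". The real log format
# and the /checkout/charge endpoint are assumptions for illustration.
log_lines = [
    "203.0.113.7 2013-11-15T22:01:46 /checkout/charge",
    "203.0.113.7 2013-11-15T22:01:47 /checkout/charge",
    "203.0.113.7 2013-11-15T22:01:48 /checkout/charge",
    "198.51.100.2 2013-11-15T22:05:00 /checkout/charge",
]

def suspicious_ips(lines, threshold=3):
    """Return IPs that hit the payment endpoint at least `threshold` times."""
    attempts = Counter(
        line.split()[0] for line in lines if "/checkout/charge" in line
    )
    return [ip for ip, count in attempts.items() if count >= threshold]

print(suspicious_ips(log_lines))
```

A real detector would also window the counts by time, since legitimate customers retry occasionally; the shotgun pattern is distinguished by volume in a short interval.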
In a previous post I illustrated the use of Hadoop to analyze Apache Tomcat log files (catalina.out). Below I perform the same Tomcat log analysis using PIG. The motivation behind PIG is the ability to use a descriptive language to analyze large sets of data rather than writing code to process it, in Java or Python for example. PIG Latin is the descriptive query language and has some similarities with SQL, including grouping and filtering. Load in the data First I launch into the interactive local PIG command line, grunt. Commands are not......
Continue Reading
I read that Hadoop supports scripts written in various languages other than Java, such as Python. Since I’m a fan of Python, I wanted to prove this out. It was my good fortune to find a post by Michael Noll that walked me through the entire process of scripting in Python for Hadoop. It’s an excellent post and worked as written for me in Hadoop 2.2.0. How Hadoop processes scripts from other languages (stdin/stdout) In order to accommodate scripts from other languages, Hadoop focuses on standard in (stdin) and standard out (stdout).......
Continue Reading
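The stdin/stdout contract mentioned above is what Hadoop Streaming uses: the mapper reads raw lines on stdin and writes tab-separated key/value pairs to stdout. A minimal word-count mapper in that shape (the classic example used in Michael Noll's tutorial; the per-line logic is factored into a function here for clarity):

```python
import sys

def map_line(line):
    """Emit (word, 1) pairs for one input line -- the classic
    word-count mapper."""
    return [(word, 1) for word in line.strip().split()]

def run_mapper(stream):
    # Hadoop Streaming feeds input split lines on stdin and expects
    # tab-separated key/value pairs on stdout; the framework then
    # sorts by key before the reducer sees them.
    for line in stream:
        for key, value in map_line(line):
            print(f"{key}\t{value}")

if __name__ == "__main__":
    run_mapper(sys.stdin)
```

The reducer is symmetrical: it reads the sorted `key\tvalue` lines on stdin and sums counts per key.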
The log rotation mechanism on my servers automatically compresses (gzip) the rotated log file to save on disk space. I discovered that Hadoop is already designed to deal with compressed files using gzip, bzip2 and LZO out of the box. This means that no additional work is required in the Mapper class to decompress. Here’s a snippet from the MapReduce output that shows
13/11/15 22:01:46 INFO mapred.MapTask: Processing split: hdfs://localhost/user/watrous/log_myhost.com/catalina.out-20131103.gz:0+3058954
13/11/15 22:01:46 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
13/11/15 22:01:46 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
13/11/15 22:01:46 INFO mapred.MapTask: mapreduce.task.io.sort.mb:......
Continue Reading
My previous Hadoop example operated against the local filesystem, in spite of the fact that I formatted a local HDFS partition. In order to operate against the local HDFS partition it’s necessary to first start the namenode and datanode. I mostly followed these instructions to start those processes. Here’s the most relevant part that I hadn’t done yet. # Format the namenode hdfs namenode -format # Start the namenode hdfs namenode # Start a datanode hdfs datanode......
Continue Reading
One of the Java applications I develop deploys in Tomcat and is load-balanced across a couple dozen servers. Each server can produce gigabytes of log output daily due to the high volume. This post demonstrates simple use of Hadoop to quickly extract useful and relevant information from catalina.out files using MapReduce. I followed Hadoop: The Definitive Guide for setup and example code. Installing Hadoop Hadoop in standalone mode was the most convenient for initial development of the MapReduce classes. The following commands were executed on a virtual server running RedHat Enterprise......
Continue Reading
Most Java programmers are very familiar with the mechanism to extend a class. To do this, you simply create a new class and specify that it extends another class. You can then add functionality not available in the original class, AND you can also override any functionality that existed already. Imagine a simple class public class Message { private String message; public Message(String message) { this.message = message; } public void showMessage() { System.out.println(message); } }......
Continue Reading
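The post's example is in Java; the same extend-and-override mechanic can be sketched in Python as an analogue. The subclass below both adds new state and overrides existing behavior (the `TimestampedMessage` name and the return-instead-of-print choice are illustration-only, not from the post):

```python
class Message:
    def __init__(self, message):
        self.message = message

    def show_message(self):
        # Returns rather than prints, so the behavior is easy to inspect.
        return self.message

class TimestampedMessage(Message):
    """Extends Message: adds functionality (a timestamp) and
    overrides existing functionality (show_message)."""

    def __init__(self, message, timestamp):
        super().__init__(message)   # reuse the parent's setup
        self.timestamp = timestamp  # added state not in Message

    def show_message(self):         # overridden behavior
        return f"[{self.timestamp}] {self.message}"

print(TimestampedMessage("hello", "2013-11-15").show_message())
```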
As the scale of web applications increases, performance optimization considerations are more frequently included in initial design. One optimization technique used extensively is caching. A cache contains pre-processed data that is ready to use without redoing the processing, which may include extensive computation and accessing data over a network or on disk. Keep the details out of your code One design consideration when introducing a caching mechanism is to keep the implementation details out of your application code. Most caches are just simple key-value stores, so it’s tempting to introduce......
Continue Reading
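One way to keep cache details out of application code, as the excerpt above advocates, is to wrap the raw key-value store in a small domain-specific class so that callers never construct cache keys themselves. A sketch under assumed names (the `UserProfileCache` class and key scheme are hypothetical):

```python
class UserProfileCache:
    """Hide key construction behind a domain-specific interface;
    the backing key-value store is swappable."""

    def __init__(self, store=None):
        # `store` can be any dict-like key/value cache (e.g. a wrapper
        # around a memcached or Redis client); a plain dict stands in here.
        self._store = store if store is not None else {}

    def _key(self, user_id):
        # The key scheme lives in exactly one place.
        return f"user:profile:{user_id}"

    def get_profile(self, user_id):
        return self._store.get(self._key(user_id))

    def save_profile(self, user_id, profile):
        self._store[self._key(user_id)] = profile

cache = UserProfileCache()
cache.save_profile(42, {"name": "Daniel"})
print(cache.get_profile(42))
```

Application code now calls `get_profile(42)` and never sees the string `"user:profile:42"`, so changing the cache backend or key scheme touches one class instead of every call site.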
A few days ago I wrote about how to structure version details in MongoDB. In this and subsequent articles I’m going to present a Java-based approach to working with that revision data. I have published all this work as an open source repository on GitHub. Feel free to fork it: https://github.com/dwatrous/mongodb-revision-objects Design Decisions To begin with, here are a few design rules that should direct the implementation: Program to interfaces. Choice of datastore or other technologies should not be visible in application code. Application code should never deal with versioned objects. It......
Continue Reading
The need to track changes to web content and provide draft or preview functionality is common to many web applications today. In relational databases it has long been common to accomplish this using a recursive relationship within a single table or by splitting the table out and storing version details in a secondary table. I’ve recently been exploring the best way to accomplish this using MongoDB. A few design considerations Data will be represented in three primary states: published, draft and history. These might also be called current and preview. From a......
Continue Reading
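One possible document layout for the published/draft/history states described above keeps all three in a single document, so promoting a draft is a local mutation. This is a sketch of one option, not necessarily the structure the post ultimately recommends, and the field names are assumptions:

```python
# Hypothetical shape for a versioned document: the published content
# lives alongside a pending draft and a list of prior versions.
page = {
    "_id": "about-us",
    "published": {"version": 3, "title": "About Us", "body": "..."},
    "draft": {"version": 4, "title": "About Us (new)", "body": "..."},
    "history": [
        {"version": 1, "title": "About", "body": "..."},
        {"version": 2, "title": "About Us", "body": "..."},
    ],
}

def publish_draft(doc):
    """Promote the draft to published, archiving the old published version."""
    doc["history"].append(doc["published"])
    doc["published"] = doc["draft"]
    doc["draft"] = None
    return doc

publish_draft(page)
print(page["published"]["version"])  # 4
```

A trade-off of this single-document layout is unbounded growth of `history`; splitting history into a secondary collection mirrors the two-table relational approach the excerpt mentions.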