Daniel Watrous on Software Engineering

A Collection of Software Problems and Solutions

Posts tagged saas

Software Engineering

MongoDB Map Reduce for Analytics

I have a RESTful SaaS service I created which uses MongoDB. Each REST call creates a new record in a statistics collection. In order to implement quotas and provide user analytics, I need to process the statistics collection periodically and generate meaningful analytics specific to each user.

This is just the type of problem map reduce was meant to solve. In order to accomplish this I’ll need to do the following:

  • Map all statistics records over a time range
  • Reduce the number of calls, both authenticated and anonymous
  • Finalize to get the sum of authenticated and anonymous calls as total
  • Run over a time range

The data in the statistics collection has this form:

{
        "_id" : ObjectId("5e6877a516832a9c8fe89ca9"),
        "apikey" : "7e78ed1525b7568c2316576f2b265f55e6848b5830db4e6586283",
        "request_date" : ISODate("2013-04-05T06:00:24.006Z"),
        "request_method" : "POST",
        "document" : {
                "domain" : "",
                "validationMethod" : "LICENSE_EXISTS_NOT_EXPIRED",
                "deleted" : null,
                "ipAddress" : "",
                "disposition" : "",
                "owner" : ObjectId("af1459ed793eca35754090a0"),
                "_id" : ObjectId("6fec518787a52a9c988ea683"),
                "issueDate" : ISODate("2013-04-05T06:00:24.005Z"),
        },
        "request_uri" : {
                "path" : "/v1/sitelicenses",
                "netloc" : "api.easysoftwarelicensing.com"
        }
}

Here is what I came up with:

Map function

var map_analytics = function() {
    var key = this.owner;
    if (this.apikey == null) {
        var value = {api_call_with_key: 0, api_call_without_key: 1};
    } else {
        var value = {api_call_with_key: 1, api_call_without_key: 0};
    }
    emit(key, value);
};

Reduce function

var reduce_analytics  = function(key_owner, api_calls) {
    reduced_val = {api_call_with_key: 0, api_call_without_key: 0};
    api_calls.forEach(function(value) {
        reduced_val.api_call_with_key += value.api_call_with_key;
        reduced_val.api_call_without_key += value.api_call_without_key;
    });
    return reduced_val;
};

Finalize function

var finalize_analytics = function (key, reduced_val) {
    reduced_val.total_api_calls = reduced_val.api_call_with_key + reduced_val.api_call_without_key;
    return reduced_val;
};

Run Map Reduce

db.statistics.mapReduce(map_analytics, reduce_analytics, {out: { reduce: "analytics" }, query: { request_date: { $gt: new Date('01/01/2012')}}, finalize: finalize_analytics })

That produces an analytics collection with ObjectIDs that match the users _id in the administrators collection. It looks like this.

> db.statistics.mapReduce(map_analytics, reduce_analytics, {out: { reduce: "analytics" }, query: { request_date: { $gt: new Date('01/01/2012')}}, finalize: finalize_analytics })
{
        "result" : "analytics",
        "timeMillis" : 79,
        "counts" : {
                "input" : 14,
                "emit" : 14,
                "reduce" : 2,
                "output" : 2
        },
        "ok" : 1,
}
> db.analytics.find().pretty()
{
        "_id" : ObjectId("5136d880136b961c98c9a62f"),
        "value" : {
                "api_call_with_key" : 8,
                "api_call_without_key" : 4,
                "total_api_calls" : 12
        }
}
{
        "_id" : ObjectId("5143b2c8136b9616343dacec"),
        "value" : {
                "api_call_with_key" : 0,
                "api_call_without_key" : 2,
                "total_api_calls" : 2
        }
}

I had originally hoped to write the analytics to the administrator document, but I don’t think that’s possible, since it overwrites the document with the result of the reduce/finalize functions.

I got my inspiration from this example.

Storing and Scheduling

The question remains how best to store and then schedule the periodic running of this map reduce functionality. It seems that storing it is best done on the server, as shown here: http://docs.mongodb.org/manual/tutorial/store-javascript-function-on-server/

Scheduling will most likely involve a crontab. I’m not sure if I’ll call it directly or through a python script.