MongoDB Map Reduce for Analytics
I built a RESTful SaaS application that uses MongoDB. Each REST call creates a new record in a statistics collection. To implement quotas and provide user analytics, I need to process the statistics collection periodically and generate per-user analytics.
This is just the type of problem map reduce was meant to solve. In order to accomplish this I’ll need to do the following:
- Map each statistics record to its owner, flagging the call as authenticated or anonymous
- Reduce to a per-user count of authenticated and anonymous calls
- Finalize to get the sum of authenticated and anonymous calls as the total
- Run over a time range
The data in the statistics collection has this form:
{
    "_id" : ObjectId("5e6877a516832a9c8fe89ca9"),
    "apikey" : "7e78ed1525b7568c2316576f2b265f55e6848b5830db4e6586283",
    "request_date" : ISODate("2013-04-05T06:00:24.006Z"),
    "request_method" : "POST",
    "document" : {
        "domain" : "",
        "validationMethod" : "LICENSE_EXISTS_NOT_EXPIRED",
        "deleted" : null,
        "ipAddress" : "",
        "disposition" : "",
        "owner" : ObjectId("af1459ed793eca35754090a0"),
        "_id" : ObjectId("6fec518787a52a9c988ea683"),
        "issueDate" : ISODate("2013-04-05T06:00:24.005Z")
    },
    "request_uri" : {
        "path" : "/v1/sitelicenses",
        "netloc" : "api.easysoftwarelicensing.com"
    }
}
Here is what I came up with:
Map function
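var map_analytics = function() {
    // Group by owner. This assumes `owner` is a top-level field; if it is
    // nested as in the sample record, the key would be this.document.owner.
    var key = this.owner;
    var value;
    // Calls without an API key count as anonymous.
    if (this.apikey == null) {
        value = {api_call_with_key: 0, api_call_without_key: 1};
    } else {
        value = {api_call_with_key: 1, api_call_without_key: 0};
    }
    emit(key, value);
};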
Reduce function
var reduce_analytics = function(key_owner, api_calls) {
    // Declared with var so it does not leak into the global scope.
    var reduced_val = {api_call_with_key: 0, api_call_without_key: 0};
    api_calls.forEach(function(value) {
        reduced_val.api_call_with_key += value.api_call_with_key;
        reduced_val.api_call_without_key += value.api_call_without_key;
    });
    return reduced_val;
};
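MongoDB may call the reduce function more than once for the same key and feed an earlier result back in as one of the values, so the returned object needs the same shape as the emitted values. The following is a minimal sketch of that check in plain JavaScript, run outside the database; the emitted values are made up for illustration.

```javascript
// Same logic as the reducer above.
var reduce_analytics = function (key_owner, api_calls) {
    var reduced_val = { api_call_with_key: 0, api_call_without_key: 0 };
    api_calls.forEach(function (value) {
        reduced_val.api_call_with_key += value.api_call_with_key;
        reduced_val.api_call_without_key += value.api_call_without_key;
    });
    return reduced_val;
};

// Hypothetical emitted values for a single owner (not real data).
var emitted = [
    { api_call_with_key: 1, api_call_without_key: 0 },
    { api_call_with_key: 0, api_call_without_key: 1 },
    { api_call_with_key: 1, api_call_without_key: 0 }
];

// Reduce everything in one pass.
var once = reduce_analytics("owner", emitted);

// Reduce a partial batch first, then feed that result back in with the rest.
var partial = reduce_analytics("owner", emitted.slice(0, 2));
var twice = reduce_analytics("owner", [partial, emitted[2]]);

console.log(JSON.stringify(once) === JSON.stringify(twice)); // prints true
```

Both paths give the same counts, which is what allows the reducer to be used with the reduce output mode.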
Finalize function
var finalize_analytics = function (key, reduced_val) {
    reduced_val.total_api_calls = reduced_val.api_call_with_key + reduced_val.api_call_without_key;
    return reduced_val;
};
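Before running against the database, the three functions can be exercised in memory. This is only a rough sketch: the `emit` stand-in, the sample records, and the owner names are all hypothetical, and unlike MongoDB it calls reduce even for keys with a single value (which is harmless for this reducer).

```javascript
// Stand-in for MongoDB's emit(): collects values per key.
var emitted = {};
function emit(key, value) {
    var k = String(key);
    (emitted[k] = emitted[k] || []).push(value);
}

var map_analytics = function () {
    var value = (this.apikey == null)
        ? { api_call_with_key: 0, api_call_without_key: 1 }
        : { api_call_with_key: 1, api_call_without_key: 0 };
    emit(this.owner, value);
};

var reduce_analytics = function (key_owner, api_calls) {
    var reduced_val = { api_call_with_key: 0, api_call_without_key: 0 };
    api_calls.forEach(function (value) {
        reduced_val.api_call_with_key += value.api_call_with_key;
        reduced_val.api_call_without_key += value.api_call_without_key;
    });
    return reduced_val;
};

var finalize_analytics = function (key, reduced_val) {
    reduced_val.total_api_calls =
        reduced_val.api_call_with_key + reduced_val.api_call_without_key;
    return reduced_val;
};

// Hypothetical statistics records; only the fields the map function reads.
var docs = [
    { owner: "alice", apikey: "k1" },
    { owner: "alice", apikey: null },
    { owner: "alice", apikey: "k1" },
    { owner: "bob" } // no apikey field at all: counted as anonymous
];

// Map phase: MongoDB binds each document to `this`.
docs.forEach(function (doc) { map_analytics.call(doc); });

// Reduce and finalize phase, one key at a time.
var results = {};
Object.keys(emitted).forEach(function (key) {
    results[key] = finalize_analytics(key, reduce_analytics(key, emitted[key]));
});

console.log(JSON.stringify(results));
```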
Run Map Reduce
db.statistics.mapReduce(map_analytics, reduce_analytics, {
    out: { reduce: "analytics" },
    query: { request_date: { $gt: new Date('01/01/2012') } },
    finalize: finalize_analytics
})
That produces an analytics collection whose ObjectIDs match each user’s _id in the administrators collection. It looks like this:
> db.statistics.mapReduce(map_analytics, reduce_analytics, {out: { reduce: "analytics" }, query: { request_date: { $gt: new Date('01/01/2012')}}, finalize: finalize_analytics })
{
    "result" : "analytics",
    "timeMillis" : 79,
    "counts" : {
        "input" : 14,
        "emit" : 14,
        "reduce" : 2,
        "output" : 2
    },
    "ok" : 1,
}
> db.analytics.find().pretty()
{
    "_id" : ObjectId("5136d880136b961c98c9a62f"),
    "value" : {
        "api_call_with_key" : 8,
        "api_call_without_key" : 4,
        "total_api_calls" : 12
    }
}
{
    "_id" : ObjectId("5143b2c8136b9616343dacec"),
    "value" : {
        "api_call_with_key" : 0,
        "api_call_without_key" : 2,
        "total_api_calls" : 2
    }
}
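The call above only sets a lower bound on request_date. Because the reduce output mode merges new results into whatever is already in the analytics collection, two runs over overlapping ranges would count the same records twice. One way around that, sketched here under the assumption that each scheduled run covers its own window, is to build the options for a half-open range with explicit start and end dates. The helper name and the dates are hypothetical, and ISO strings are used because parsing of other date formats can vary between JavaScript engines.

```javascript
// Hypothetical helper: mapReduce options for the half-open range [start, end).
function analyticsOptions(start, end, finalizeFn) {
    return {
        out: { reduce: "analytics" },
        query: { request_date: { $gte: start, $lt: end } },
        finalize: finalizeFn
    };
}

// Example window: April 2013 (UTC). The finalize here is only a placeholder.
var options = analyticsOptions(
    new Date("2013-04-01T00:00:00Z"),
    new Date("2013-05-01T00:00:00Z"),
    function (key, reduced_val) { return reduced_val; }
);

// In the mongo shell this would be passed as the third argument:
// db.statistics.mapReduce(map_analytics, reduce_analytics, options)
console.log(JSON.stringify(options.query));
```

Using $gte for the start and $lt for the end means consecutive windows share a boundary date without either one counting a record twice.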
I had originally hoped to write the analytics directly into each user’s document in the administrators collection, but I don’t think that’s possible, since mapReduce overwrites the output document with the result of the reduce/finalize functions.
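A possible workaround, which I have not tried against real data, is a second pass that copies each result into the matching administrator document with $set, leaving that document’s other fields untouched. The sketch below only builds the filter and update objects; the helper name and the `analytics` field name are assumptions, and the _id is a plain string here so the snippet runs outside the shell.

```javascript
// Hypothetical helper: turns one analytics result into a filter/update pair.
function toAdministratorUpdate(result) {
    return {
        filter: { _id: result._id },
        // $set only touches the named field, so the rest of the document survives.
        update: { $set: { analytics: result.value } }
    };
}

// One result shaped like the analytics output (string _id for illustration).
var sample = {
    _id: "5136d880136b961c98c9a62f",
    value: { api_call_with_key: 8, api_call_without_key: 4, total_api_calls: 12 }
};

var op = toAdministratorUpdate(sample);

// In the mongo shell, assuming the collection is named "administrators":
// db.analytics.find().forEach(function (r) {
//     var op = toAdministratorUpdate(r);
//     db.administrators.update(op.filter, op.update);
// });
console.log(JSON.stringify(op));
```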
I got my inspiration from this example.
Storing and Scheduling
The question remains how best to store and then schedule the periodic running of this map reduce functionality. It seems that storing it is best done on the server, as shown here: http://docs.mongodb.org/manual/tutorial/store-javascript-function-on-server/
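As a rough sketch of what that tutorial describes, each stored function is a document in the system.js collection with the function name as _id and the function itself as value. The helper below is hypothetical and only builds those documents; saving them is left as a commented shell call.

```javascript
// Hypothetical helper: wraps a function in the { _id, value } shape used by system.js.
function toStoredFunction(name, fn) {
    return { _id: name, value: fn };
}

var stored = [
    toStoredFunction("finalize_analytics", function (key, reduced_val) {
        reduced_val.total_api_calls =
            reduced_val.api_call_with_key + reduced_val.api_call_without_key;
        return reduced_val;
    })
];

// In the legacy mongo shell, following the linked tutorial:
// stored.forEach(function (doc) { db.system.js.save(doc); });
console.log(stored.map(function (doc) { return doc._id; }).join(", "));
```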
Scheduling will most likely involve a crontab entry. I’m not sure if I’ll call it directly or through a Python script.