MongoDB Aggregation for Analytics
I’ve been working on generating analytics based on a collection containing statistical data. My previous attempt involved using Map Reduce in MongoDB. Recall that the data in the statistics collection has this form.
{
    "_id" : ObjectId("5e6877a516832a9c8fe89ca9"),
    "apikey" : "7e78ed1525b7568c2316576f2b265f55e6848b5830db4e6586283",
    "request_date" : ISODate("2013-04-05T06:00:24.006Z"),
    "request_method" : "POST",
    "document" : {
        "domain" : "",
        "validationMethod" : "LICENSE_EXISTS_NOT_EXPIRED",
        "deleted" : null,
        "ipAddress" : "",
        "disposition" : "",
        "owner" : ObjectId("af1459ed793eca35754090a0"),
        "_id" : ObjectId("6fec518787a52a9c988ea683"),
        "issueDate" : ISODate("2013-04-05T06:00:24.005Z")
    },
    "request_uri" : {
        "path" : "/v1/sitelicenses",
        "netloc" : "api.easysoftwarelicensing.com"
    }
}
A few things kept getting in the way of the map reduce implementation. In particular, the complexity of the objects I was trying to emit and then reduce was causing me some headaches. Someone suggested using the Aggregation Framework in MongoDB instead. Here’s what I came up with.
db.statistics.aggregate([
    // keep only the documents for one owner
    { $match: { owner: ObjectId("5143b2c8136b9616343da222") } },
    // turn each document into a pair of 0/1 flags based on apikey
    { $project: {
        owner: "$owner",
        action_1: { $cond: [{ $eq: ["$apikey", null] }, 0, 1] },
        action_2: { $cond: [{ $ne: ["$apikey", null] }, 0, 1] }
    } },
    // sum the flags per owner
    { $group: {
        _id: "$owner",
        action_1: { $sum: "$action_1" },
        action_2: { $sum: "$action_2" }
    } },
    // derive the total and the average per day of the current month
    { $project: {
        action_1: "$action_1",
        action_2: "$action_2",
        actions_total: { $add: ["$action_1", "$action_2"] },
        actions_per_day: {
            $divide: [
                { $add: ["$action_1", "$action_2"] },
                { $dayOfMonth: new Date() }
            ]
        }
    } }
])
At first all the discussion of the aggregation pipeline felt awkward. After a while it became clearer. For example, the above does this:
- $match limits the pipeline to the set of documents associated with a particular owner
- $project creates a new document for each match, carrying the owner plus two 0/1 flags: action_1 is 0 when apikey is null and 1 otherwise, and action_2 is the reverse
- $group then sums those flags for each owner
- the final $project performs additional calculations with the grouped (summed) data: the total of both counts, and that total divided by the current day of the month
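To convince myself of what each stage does, it helps to mimic the pipeline in plain JavaScript over an in-memory array. This is only a rough sketch, not how MongoDB executes the pipeline: the sample documents, the string owner ids, and the summarize function are all made up for illustration.

```javascript
// Hypothetical sample data; real documents carry ObjectId owners and more fields.
const docs = [
  { owner: "owner-a", apikey: "key-1" },
  { owner: "owner-a", apikey: null },
  { owner: "owner-a", apikey: "key-2" },
  { owner: "owner-b", apikey: "key-3" },
];

// Hypothetical helper that mirrors the four stages of the aggregation.
function summarize(documents, owner, dayOfMonth) {
  // $match: keep only this owner's documents
  const matched = documents.filter((d) => d.owner === owner);

  // $project: reduce each document to 0/1 flags, as the $cond expressions do
  const projected = matched.map((d) => ({
    owner: d.owner,
    action_1: d.apikey === null ? 0 : 1,
    action_2: d.apikey !== null ? 0 : 1,
  }));

  // $group: sum the flags per owner
  const groups = new Map();
  for (const p of projected) {
    const g = groups.get(p.owner) || { _id: p.owner, action_1: 0, action_2: 0 };
    g.action_1 += p.action_1;
    g.action_2 += p.action_2;
    groups.set(p.owner, g);
  }

  // final $project: derive the total and the per-day average
  return [...groups.values()].map((g) => ({
    ...g,
    actions_total: g.action_1 + g.action_2,
    actions_per_day: (g.action_1 + g.action_2) / dayOfMonth,
  }));
}

// $dayOfMonth works in UTC, so the sketch uses getUTCDate() to match.
const summary = summarize(docs, "owner-a", new Date().getUTCDate());
console.log(summary);
```

Passing the day of the month in as a parameter keeps the function deterministic, which makes it easy to check the arithmetic against a known date.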
The output of the above aggregation is a document that looks like this:
{
    "result" : [
        {
            "_id" : ObjectId("5136d880136b961c98c9a62f"),
            "action_1" : 10,
            "action_2" : 4,
            "actions_total" : 14,
            "actions_per_day" : 1.4
        }
    ],
    "ok" : 1
}
So far I’ve only run this on small sets of data, so I can’t comment on performance for large data sets. As of this writing, it’s still not possible to have the aggregation cache its results in a separate collection, so performance may become an issue as my statistical data set grows.
Reference
http://stackoverflow.com/questions/16455528/mongodb-aggregation-framework-date-now