Daniel Watrous on Software Engineering

A Collection of Software Problems and Solutions

MongoDB Aggregation for Analytics

I’ve been working on generating analytics from a collection of statistical data. My previous attempt used Map Reduce in MongoDB. Recall that the documents in the statistics collection have this form:

{
        "_id" : ObjectId("5e6877a516832a9c8fe89ca9"),
        "apikey" : "7e78ed1525b7568c2316576f2b265f55e6848b5830db4e6586283",
        "request_date" : ISODate("2013-04-05T06:00:24.006Z"),
        "request_method" : "POST",
        "document" : {
                "domain" : "",
                "validationMethod" : "LICENSE_EXISTS_NOT_EXPIRED",
                "deleted" : null,
                "ipAddress" : "",
                "disposition" : "",
                "owner" : ObjectId("af1459ed793eca35754090a0"),
                "_id" : ObjectId("6fec518787a52a9c988ea683"),
                "issueDate" : ISODate("2013-04-05T06:00:24.005Z")
        },
        "request_uri" : {
                "path" : "/v1/sitelicenses",
                "netloc" : "api.easysoftwarelicensing.com"
        }
}

A few things kept getting in the way of the map reduce implementation. In particular, the complexity of the objects I was trying to emit and then reduce was causing me some headaches. Someone suggested using the Aggregation Framework in MongoDB. Here’s what I came up with.

{
    $match: {
        owner: ObjectId("5143b2c8136b9616343da222")
    }
}, {
    $project: {
        owner: "$owner",
        action_1: {
            $cond: [{$eq: ["$apikey", null]}, 0, 1]
        },
        action_2: {
            $cond: [{$ne: ["$apikey", null]}, 0, 1]
        }
    }
}, {
    $group: {
        _id: "$owner",
        action_1: {$sum: "$action_1"},
        action_2: {$sum: "$action_2"}
    }
}, {
    $project: {
        action_1: "$action_1",
        action_2: "$action_2",
        actions_total: {
            $add: ["$action_1", "$action_2"]
        },
        actions_per_day: {
            $divide: [
                {$add: ["$action_1", "$action_2"]},
                {$dayOfMonth: new Date()}
            ]
        }
    }
}

At first, all the discussion of the aggregation pipeline felt awkward, but after a while it became clearer. For example, the pipeline above does this:

  • $match limits the input to the documents associated with a particular owner
  • $project creates a new document for each match, setting action_1 and action_2 to 0 or 1 depending on whether apikey is null
  • $group then sums those flags across the projected documents, grouped by owner
  • the final $project performs additional calculations on the grouped (summed) data: a total and a per-day average
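
To make the four stages concrete, here is a rough sketch that mimics them in plain JavaScript over an in-memory array. It is an illustration only, not how MongoDB executes the pipeline. The names summarize and sampleDocs, the string owner ids and the sample date are all assumptions made for the example, and it only handles an apikey that is explicitly null.

```javascript
// Illustration only: emulates the four stages in plain JavaScript.
// summarize, sampleDocs and the "owner-a"/"owner-b" ids are hypothetical.
function summarize(docs, ownerId, today) {
    // $match: keep only the documents for one owner
    var matched = docs.filter(function (d) { return d.owner === ownerId; });

    // first $project: reduce each document to two 0/1 flags
    var projected = matched.map(function (d) {
        var isNull = d.apikey === null;
        return {
            owner: d.owner,
            action_1: isNull ? 0 : 1,  // $cond: [{$eq: ["$apikey", null]}, 0, 1]
            action_2: isNull ? 1 : 0   // $cond: [{$ne: ["$apikey", null]}, 0, 1]
        };
    });

    // $group: sum the flags per owner
    var groups = {};
    projected.forEach(function (p) {
        if (!groups[p.owner]) {
            groups[p.owner] = { _id: p.owner, action_1: 0, action_2: 0 };
        }
        groups[p.owner].action_1 += p.action_1;
        groups[p.owner].action_2 += p.action_2;
    });

    // final $project: derive the total and the per-day average
    var dayOfMonth = today.getUTCDate();  // mirrors $dayOfMonth, read in UTC
    return Object.keys(groups).map(function (key) {
        var g = groups[key];
        var total = g.action_1 + g.action_2;
        return { _id: g._id, action_1: g.action_1, action_2: g.action_2,
                 actions_total: total, actions_per_day: total / dayOfMonth };
    });
}

// Hypothetical sample: 10 requests with an apikey and 4 without for owner-a,
// plus one request for another owner that the $match step should drop.
var sampleDocs = [];
for (var i = 0; i < 10; i++) sampleDocs.push({ owner: "owner-a", apikey: "key-" + i });
for (var j = 0; j < 4; j++) sampleDocs.push({ owner: "owner-a", apikey: null });
sampleDocs.push({ owner: "owner-b", apikey: "other" });

var result = summarize(sampleDocs, "owner-a", new Date("2013-04-10T12:00:00Z"));
console.log(JSON.stringify(result));
```

With this made-up sample the sketch yields one summary object for owner-a, with 10 and 4 as the two counts, 14 as the total and 1.4 actions per day.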

Running the aggregation returns a document that looks like this:

{
        "result" : [
                {
                        "_id" : ObjectId("5136d880136b961c98c9a62f"),
                        "action_1" : 10,
                        "action_2" : 4,
                        "actions_total" : 14,
                        "actions_per_day" : 1.4
                }
        ],
        "ok" : 1
}
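
Note that actions_per_day depends on when the query runs, because the divisor is the day of the month of new Date(). A value of 1.4 for 14 actions would correspond to running the query on the 10th. The following is a small sketch of that arithmetic in plain JavaScript; actionsPerDay is a hypothetical helper and both dates are assumed for illustration.

```javascript
// Hypothetical helper mirroring the actions_per_day expression in the pipeline.
function actionsPerDay(action1, action2, queryDate) {
    // $dayOfMonth yields 1 to 31; getUTCDate is the plain-JavaScript equivalent in UTC
    return (action1 + action2) / queryDate.getUTCDate();
}

// The same 14 actions give a different average later in the month.
var onTheTenth = actionsPerDay(10, 4, new Date("2013-04-10T00:00:00Z"));
var onThe28th = actionsPerDay(10, 4, new Date("2013-04-28T00:00:00Z"));
console.log(onTheTenth, onThe28th);
```

The figure is therefore an average over the elapsed days of the current month, and it shrinks as the month goes on if no new actions arrive.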

So far I’ve only run this on small data sets, so I can’t comment on performance for large ones. Right now the aggregation framework still can’t cache its results in a separate collection, so performance may become an issue as my statistical data set grows.