MongoDB Aggregation for Analytics

May 9, 2013

Software Engineering


Daniel Watrous

I’ve been working on generating analytics from a collection of statistical data. My previous attempt used Map Reduce in MongoDB. Recall that documents in the statistics collection have this form.

{
        "_id" : ObjectId("5e6877a516832a9c8fe89ca9"),
        "apikey" : "7e78ed1525b7568c2316576f2b265f55e6848b5830db4e6586283",
        "request_date" : ISODate("2013-04-05T06:00:24.006Z"),
        "request_method" : "POST",
        "document" : {
                "domain" : "",
                "validationMethod" : "LICENSE_EXISTS_NOT_EXPIRED",
                "deleted" : null,
                "ipAddress" : "",
                "disposition" : "",
                "owner" : ObjectId("af1459ed793eca35754090a0"),
                "_id" : ObjectId("6fec518787a52a9c988ea683"),
                "issueDate" : ISODate("2013-04-05T06:00:24.005Z")
        },
        "request_uri" : {
                "path" : "/v1/sitelicenses",
                "netloc" : "api.easysoftwarelicensing.com"
        }
}

A few things kept getting in the way of the map reduce implementation. In particular, the complexity of the objects I was trying to emit and then reduce was causing me some headaches. Someone suggested using the Aggregation Framework in MongoDB. Here’s what I came up with.

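db.statistics.aggregate({
    // Keep only the documents for one owner. In the sample document above,
    // owner is nested under "document"; if your data is stored that way,
    // the path here may need to be "document.owner".
    $match: {
        owner: ObjectId("5143b2c8136b9616343da222")
    }
}, {
    // Turn each matched document into two 0/1 flags.
    $project: {
        owner: "$owner",
        // 1 when apikey is not null, otherwise 0
        action_1: {
            $cond: [{$eq: ["$apikey", null]}, 0, 1]
        },
        // 1 when apikey is null, otherwise 0
        action_2: {
            $cond: [{$ne: ["$apikey", null]}, 0, 1]
        }
    }
}, {
    // Sum the flags per owner.
    $group: {
        _id: "$owner",
        action_1: {$sum: "$action_1"},
        action_2: {$sum: "$action_2"}
    }
}, {
    // Derive the total and a per-day average from the sums.
    $project: {
        action_1: "$action_1",
        action_2: "$action_2",
        actions_total: {
            $add: ["$action_1", "$action_2"]
        },
        actions_per_day: {
            $divide: [
                {$add: ["$action_1", "$action_2"]},
                // new Date() is evaluated by the shell before the pipeline is sent
                {$dayOfMonth: new Date()}
            ]
        }
    }
})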

At first all the discussion of the aggregation pipeline felt awkward. After a while it became clearer. For example, the pipeline above does this:

$match limits me to the set of documents associated with a particular owner
$project creates a new document for each matched document, using $cond on apikey to set action_1 and action_2 to 0 or 1
$group then sums those flags across the projected documents, grouped by owner
The final $project performs additional calculations with the grouped (summed) data: the total and the average per day of the current month
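To make the four stages concrete, here is a rough sketch that emulates them in plain JavaScript on an in-memory array, with no MongoDB involved. The function name summarize, the sample documents, the owner strings and the dayOfMonth parameter are all made up for illustration. It also assumes apikey is always present and either null or a string.

```javascript
// Hypothetical sample data: owner ids are plain strings here, not ObjectIds.
const sampleDocs = [
    { owner: "owner-a", apikey: "abc" },
    { owner: "owner-a", apikey: "def" },
    { owner: "owner-a", apikey: null },
    { owner: "owner-b", apikey: "xyz" }
];

// Emulates $match -> $project -> $group -> $project from the pipeline above.
function summarize(docs, ownerId, dayOfMonth) {
    // $match: keep only documents for the requested owner
    const matched = docs.filter(d => d.owner === ownerId);

    // $project: replace each document with 0/1 flags, mirroring the $cond expressions
    const projected = matched.map(d => ({
        owner: d.owner,
        action_1: d.apikey === null ? 0 : 1, // 1 when apikey is not null
        action_2: d.apikey !== null ? 0 : 1  // 1 when apikey is null
    }));

    // $group: sum the flags per owner
    const groups = new Map();
    for (const p of projected) {
        const g = groups.get(p.owner) || { _id: p.owner, action_1: 0, action_2: 0 };
        g.action_1 += p.action_1;
        g.action_2 += p.action_2;
        groups.set(p.owner, g);
    }

    // final $project: add the total and the per-day average
    return Array.from(groups.values()).map(g => ({
        _id: g._id,
        action_1: g.action_1,
        action_2: g.action_2,
        actions_total: g.action_1 + g.action_2,
        actions_per_day: (g.action_1 + g.action_2) / dayOfMonth
    }));
}

// Usage: pretend today is the 2nd day of the month.
const summary = summarize(sampleDocs, "owner-a", 2);
console.log(JSON.stringify(summary));
```

Since every document gets exactly one of the two flags, actions_total is simply the number of matched documents. The split into action_1 and action_2 is where the useful information lives.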

The output of the above aggregation is a document that looks like this:

{
        "result" : [
                {
                        "_id" : ObjectId("5136d880136b961c98c9a62f"),
                        "action_1" : 10,
                        "action_2" : 4,
                        "actions_total" : 14,
                        "actions_per_day" : 1.4
                }
        ],
        "ok" : 1
}
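The derived fields in that result can be checked by hand. Here is a small sketch in plain JavaScript, assuming the pipeline ran on the 10th day of the month. That day is my inference from 14 / 1.4, not something recorded in the output, and the _id is kept as a plain string rather than an ObjectId.

```javascript
// Result document copied from the output above.
const response = {
    result: [
        {
            _id: "5136d880136b961c98c9a62f",
            action_1: 10,
            action_2: 4,
            actions_total: 14,
            actions_per_day: 1.4
        }
    ],
    ok: 1
};

const row = response.result[0];
const assumedDayOfMonth = 10; // assumption: inferred from 14 / 1.4

// actions_total should be the sum of the two counters
const totalMatches = row.action_1 + row.action_2 === row.actions_total;
// actions_per_day should be the total divided by the day of the month
const perDayMatches = row.actions_total / assumedDayOfMonth === row.actions_per_day;

console.log(totalMatches, perDayMatches);
```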

So far I’ve only run this on small data sets, so I can’t comment on performance for large ones. Since it’s currently not possible to cache the results in a separate collection, performance may become an issue as my statistical data set grows.

Reference

http://stackoverflow.com/questions/16455528/mongodb-aggregation-framework-date-now
