Daniel Watrous on Software Engineering

A Collection of Software Problems and Solutions

Posts tagged nosql

Software Engineering

Representing Revision Data in MongoDB

The need to track changes to web content and provide for draft or preview functionality is common to many web applications today. In relational databases it has long been common to accomplish this using a recursive relationship within a single table or by splitting the table out and storing version details in a secondary table. I’ve recently been exploring the best way to accomplish this using MongoDB.

A few design considerations

Data will be represented in three primary states: published, draft and history. These might also be called current and preview. From a workflow perspective it might be tempting to include states like in_review, pending_publish, rejected, etc., but that’s not necessary from a versioning perspective. Data is in draft until it is published. Workflow specifics should be handled outside the version control mechanism.

In code, it’s important to avoid making the revision mechanism a primary feature. In other words, application code should work with the document stored in published, not with the versioning wrapper itself.

Historical documents need to have a unique identifier, just like the top level entity. These will be accessed less frequently and so performance is less of a consideration.

From a concurrency perspective, it’s important to make sure that updates operate against fresh data.

Basic structure

The basic structure is a top level document that contains sub-documents accommodating the three primary states mentioned above.

{
  published: {},
  draft: {},
  history: {
    "1" : {
      metadata: <value>,
      document: {}
    },
    ...
  }
}

In history, each retired document requires a few things: a unique identifier, the date it was retired, and possibly the user who retired it. The metadata element above should capture all of the details you would like to know about that document.

Let’s imagine a person object that looks like this:

{
  "name" : {
    "firstName" : "Daniel",
    "lastName" : "Watrous"
  },
  "age" : 32,
  "email" : "daniel@current.com",
  "happy" : true
}

Our versioned document may look something like this.

{
  "published" : {
    "name" : {
      "firstName" : "Daniel",
      "lastName" : "Watrous"
    },
    "age" : 32,
    "email" : "daniel@current.com",
    "happy" : true
  },
  "draft" : {
    "name" : {
      "firstName" : "Daniel",
      "lastName" : "Watrous"
    },
    "age" : 33,
    "email" : "daniel@future.com",
    "happy" : true
  },
  "history" : {
    "1" : {
      "person" : {
        "name" : {
          "firstName" : "Danny",
          "lastName" : "Watrous"
        },
        "age" : 15,
        "email" : "daniel@beforeinternet.com",
        "happy" : true
      },
      "dateRetired" : "2003-02-19"
    },
    "2" : {
      "person" : {
        "name" : {
          "firstName" : "Dan",
          "lastName" : "Watrous"
        },
        "age" : 23,
        "email" : "daniel@oldschool.com",
        "happy" : true
      },
      "dateRetired" : "2010-06-27"
    }
  }
}

There are a few options for uniquely identifying historical data. One is to calculate a unique value at the time an object is placed in history, such as a combination of the top level object ID and a sequential version number. Another is to generate a hash when the object is loaded. The problem with the second approach is that queries for specific historical objects become more complex.
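A publish operation under this structure can be sketched in plain JavaScript: retire the current published document into history under the next sequential key (the first identification approach above) and promote the draft. The function name and in-memory approach are illustrative; a real implementation would perform this as a MongoDB update with a freshness check.

```javascript
// Sketch: promote a draft to published, retiring the old published
// version into history under a sequential key. Operates on a plain
// in-memory object shaped like the basic structure above.
function publishDraft(doc, dateRetired) {
  // next sequential history key: one past the highest existing key
  var keys = Object.keys(doc.history).map(Number);
  var next = keys.length ? Math.max.apply(null, keys) + 1 : 1;
  // retire the current published version along with its metadata
  doc.history[String(next)] = {
    document: doc.published,
    dateRetired: dateRetired
  };
  // the draft becomes the new published version
  doc.published = doc.draft;
  doc.draft = {};
  return doc;
}
```

Applied to the person document above, this would add a history entry "3" holding the previously published version.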

Queries

As a rule of thumb, it’s probably best to always execute queries against published. Queries against history are unlikely at an application level. One reason for this is that any interest in historical revisions will almost universally be in the context of the published version. In other words, the top level object will already be in scope.

Draft data should be considered transient. There should be little need to protect or back up this data. Until it is accepted and becomes the new published data, changes to it should have little impact. Querying for draft data would be unusual, and is discouraged at the application level.

Historical Limits

The size of documents and the frequency with which they are changed must factor in to the retention of historical data. There may be other regulations and internal policies that affect this retention. In some cases it may be sufficient to retain only the last revision. In other cases it may be a time period determination. In some cases it may be desirable to save as much history as is physically possible.

While MongoDB does provide a capped collection, that won’t help us with the structure above. All of the historical data is in a sub document, not a separate collection. For that reason, any retention policy must be implemented at the application level.

It might be tempting to implement the revision history in a separate collection in order to manage retention with a capped collection, but problems arise. The biggest is that there is no way to cap the collection per versioned document: if one document changes very frequently and another rarely or never, historical data for the rarely changing document would eventually be pushed off the end of the collection as updates for the frequently changed document are added.

Retention model

As a baseline, it’s probably reasonable to define retention based on the following two metrics.

  • Minimum time retention
  • Maximum revisions retained

In other words, hang on to all revisions for at least the minimum time, up to the maximum number of revisions. This decision would be made at the time the document is modified. If a document is modified infrequently, it’s possible the retained revisions would be much older than the minimum retention time.
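Reading the rule as “drop a revision only when it is both older than the minimum retention time and outside the newest maximum count,” a pruning pass run at modification time might be sketched like this (the function name and in-memory approach are illustrative):

```javascript
// Sketch of the two-metric retention rule: a retired revision is
// dropped only when it is BOTH older than minRetentionMs AND outside
// the newest maxRevisions. Operates on the history sub-document.
function pruneHistory(history, now, minRetentionMs, maxRevisions) {
  // order history keys newest-first by dateRetired
  var keys = Object.keys(history).sort(function (a, b) {
    return new Date(history[b].dateRetired) - new Date(history[a].dateRetired);
  });
  keys.forEach(function (key, index) {
    var age = now - new Date(history[key].dateRetired);
    // index >= maxRevisions: beyond the newest maxRevisions entries
    if (index >= maxRevisions && age > minRetentionMs) {
      delete history[key];
    }
  });
  return history;
}
```

Under this reading, an infrequently modified document keeps revisions far older than the minimum time, because they never exceed the maximum count.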

Performance considerations

Each document in MongoDB has a specifically allocated size when it is created. Updates that increase the size of the document must allocate a new document large enough to accommodate the updated document on disk and move the document. This can be an expensive operation to perform, especially at high volume.

To mitigate this it’s possible to define a paddingFactor for a collection. A padding factor is a multiplier used when creating a new document that provides additional space. For example, for paddingFactor=2, the document would be allocated twice the space needed to accommodate its size.

Since version 2.2, the collMod command offers a usePowerOf2Sizes option that rounds record allocations up to powers of 2. This may be more efficient than a fixed paddingFactor.

Note that operations like compact and repairDatabase will remove any unused padding added by paddingFactor.

Changes to document structure

It’s possible that the structure of documents will change throughout the life of an application. If information is added to or removed from the core document structure, it’s important to recognize that any application code will need to deal with those changes.

Application aspects that might be affected include JSON to object mapping and diffing algorithms.
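For example, a mapping function can supply defaults for fields added after older revisions were written, so that history entries predating a schema change still map cleanly. The nickname field below is hypothetical, standing in for any later addition:

```javascript
// Sketch: map a stored person document to an application object,
// defaulting fields that older revisions may not contain.
// "nickname" is a hypothetical field added after some revisions
// were already written.
function toPerson(doc) {
  return {
    firstName: doc.name.firstName,
    lastName: doc.name.lastName,
    // default for documents written before the field existed
    nickname: doc.nickname !== undefined ? doc.nickname : null
  };
}
```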

MongoDB Map Reduce for Analytics

I have a RESTful SaaS service I created which uses MongoDB. Each REST call creates a new record in a statistics collection. In order to implement quotas and provide user analytics, I need to process the statistics collection periodically and generate meaningful analytics specific to each user.

This is just the type of problem map reduce was meant to solve. In order to accomplish this I’ll need to do the following:

  • Map all statistics records over a time range
  • Reduce the number of calls, both authenticated and anonymous
  • Finalize to get the sum of authenticated and anonymous calls as total
  • Run over a time range

The data in the statistics collection has this form:

{
        "_id" : ObjectId("5e6877a516832a9c8fe89ca9"),
        "apikey" : "7e78ed1525b7568c2316576f2b265f55e6848b5830db4e6586283",
        "request_date" : ISODate("2013-04-05T06:00:24.006Z"),
        "request_method" : "POST",
        "document" : {
                "domain" : "",
                "validationMethod" : "LICENSE_EXISTS_NOT_EXPIRED",
                "deleted" : null,
                "ipAddress" : "",
                "disposition" : "",
                "owner" : ObjectId("af1459ed793eca35754090a0"),
                "_id" : ObjectId("6fec518787a52a9c988ea683"),
                "issueDate" : ISODate("2013-04-05T06:00:24.005Z"),
        },
        "request_uri" : {
                "path" : "/v1/sitelicenses",
                "netloc" : "api.easysoftwarelicensing.com"
        }
}

Here is what I came up with:

Map function

var map_analytics = function() {
    var key = this.owner;
    if (this.apikey == null) {
        var value = {api_call_with_key: 0, api_call_without_key: 1};
    } else {
        var value = {api_call_with_key: 1, api_call_without_key: 0};
    }
    emit(key, value);
};

Reduce function

var reduce_analytics = function(key_owner, api_calls) {
    var reduced_val = {api_call_with_key: 0, api_call_without_key: 0};
    api_calls.forEach(function(value) {
        reduced_val.api_call_with_key += value.api_call_with_key;
        reduced_val.api_call_without_key += value.api_call_without_key;
    });
    return reduced_val;
};

Finalize function

var finalize_analytics = function (key, reduced_val) {
    reduced_val.total_api_calls = reduced_val.api_call_with_key + reduced_val.api_call_without_key;
    return reduced_val;
};
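Before handing these to the server, the three functions can be sanity-checked in plain JavaScript with an emit shim. The shim, the sample records and the owner keys below are illustrative, not part of the service:

```javascript
// Sanity-check sketch: exercise the map/reduce/finalize functions
// against in-memory records using an emit shim, before running them
// with db.statistics.mapReduce.
var emitted = {};
function emit(key, value) {
  (emitted[key] = emitted[key] || []).push(value);
}

var map_analytics = function () {
  var key = this.owner;
  var value = this.apikey == null
    ? {api_call_with_key: 0, api_call_without_key: 1}
    : {api_call_with_key: 1, api_call_without_key: 0};
  emit(key, value);
};

var reduce_analytics = function (key_owner, api_calls) {
  var reduced_val = {api_call_with_key: 0, api_call_without_key: 0};
  api_calls.forEach(function (value) {
    reduced_val.api_call_with_key += value.api_call_with_key;
    reduced_val.api_call_without_key += value.api_call_without_key;
  });
  return reduced_val;
};

var finalize_analytics = function (key, reduced_val) {
  reduced_val.total_api_calls =
    reduced_val.api_call_with_key + reduced_val.api_call_without_key;
  return reduced_val;
};

// made-up statistics records: two owners, one anonymous call
var records = [
  {owner: "user1", apikey: "abc"},
  {owner: "user1", apikey: null},
  {owner: "user2", apikey: "def"}
];
// in the shell, "this" inside map is the document; call() mimics that
records.forEach(function (rec) { map_analytics.call(rec); });

var results = {};
Object.keys(emitted).forEach(function (key) {
  results[key] = finalize_analytics(key, reduce_analytics(key, emitted[key]));
});
```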

Run Map Reduce

db.statistics.mapReduce(map_analytics, reduce_analytics, {out: { reduce: "analytics" }, query: { request_date: { $gt: new Date('01/01/2012')}}, finalize: finalize_analytics })

That produces an analytics collection with ObjectIds that match each user’s _id in the administrators collection. It looks like this.

> db.statistics.mapReduce(map_analytics, reduce_analytics, {out: { reduce: "analytics" }, query: { request_date: { $gt: new Date('01/01/2012')}}, finalize: finalize_analytics })
{
        "result" : "analytics",
        "timeMillis" : 79,
        "counts" : {
                "input" : 14,
                "emit" : 14,
                "reduce" : 2,
                "output" : 2
        },
        "ok" : 1,
}
> db.analytics.find().pretty()
{
        "_id" : ObjectId("5136d880136b961c98c9a62f"),
        "value" : {
                "api_call_with_key" : 8,
                "api_call_without_key" : 4,
                "total_api_calls" : 12
        }
}
{
        "_id" : ObjectId("5143b2c8136b9616343dacec"),
        "value" : {
                "api_call_with_key" : 0,
                "api_call_without_key" : 2,
                "total_api_calls" : 2
        }
}

I had originally hoped to write the analytics directly to each administrator document, but I don’t think that’s possible, since map reduce output replaces the target document with the result of the reduce/finalize functions.

I got my inspiration from this example.

Storing and Scheduling

The question remains how best to store and then schedule the periodic running of this map reduce functionality. It seems that storing it is best done on the server, as shown here: http://docs.mongodb.org/manual/tutorial/store-javascript-function-on-server/

Scheduling will most likely involve a crontab. I’m not sure if I’ll call it directly or through a python script.
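If crontab is the route taken, an entry along these lines could invoke the mongo shell directly. The schedule, database name, mongo path and stored function name here are all assumptions:

```
# illustrative crontab entry: run the stored analytics map reduce
# nightly at 2:00 am; the path, database and function name are made up
0 2 * * * /usr/bin/mongo licensedb --eval "db.loadServerScripts(); run_analytics();"
```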

MongoDB Secure Mode

Security in MongoDB is relatively young in terms of features and granularity. Interestingly, they indicate that a typical use case would be to use Mongo on a trusted network, “much like how one would use, say, memcached.”

MongoDB does NOT run in secure mode by default.

As it is, the features that are available are standard, proven and probably sufficient for most use cases. Here’s a quick summary of pros and cons.

  • Pros
    • Nonce-based digest for authentication
    • Security applies across replica set nodes and shard members
  • Cons
    • Few recent replies on security wiki page
    • Coarse-grained access control

User access levels

Coarse-grained access control allows users to be defined per database and given either read-only or read/write access. Since there is no rigid schema in MongoDB, it’s not possible to limit access to a subset of collections or documents.

Limit to expected IPs

Along the lines of the ‘trusted network’ mentioned above, it’s recommended to configure each mongo instance to accept connections only from specific IP addresses. For example, you could limit access to the loopback address, or to an IP on a local private network.
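For example, a mongod.conf fragment along these lines binds the server to specific interfaces (the private address here is made up):

```
# mongod.conf fragment (illustrative)
# listen only on loopback and one private network interface
bind_ip = 127.0.0.1,10.0.0.12
port = 27017
```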

Disable http interface

By default, a useful HTTP based interface provides information about the MongoDB instance on a machine and links to similar interfaces on related machines in the replica set. This can be disabled by passing --nohttpinterface when starting mongod.

SSL ready

In cases where SSL security is required, Mongo can be compiled to include support for it. The standard downloads do not include this feature. A standard SSL key can be produced in the usual way, using openssl for example.

MongoDB – Boise Code Camp 2012

Welcome to Boise Code Camp, 2012. This year I decided to present on MongoDB. In a previous presentation on MongoDB and some video training, I cover installation, configuration and using replica sets.

For the code camp this year I explore some of the differences between schema design for a relational database and schema design for MongoDB. You can copy and paste the commands below directly onto the MongoDB command line to follow along.

To begin with, it’s important to know how to get help and see what databases and collections are available. The default MongoDB ID generation is also useful to know about.

help
show dbs
 
// discuss default IDs and how they work
use exploreid
db.dropDatabase()
 
use exploreid
// db.collection.insert(object)
db.things.insert({'name': 'Daniel Watrous'})
db.things.find()

Schema Design

Now let’s build the course tracker database. The MongoDB website provides some great instruction and a few videos on its schema design for MongoDB page. First have a look at the relational schema and then the adjusted MongoDB schema.

// build coursetracker database
use coursetracker
db.dropDatabase()
 
use coursetracker
// create student records
db.students.insert({"_id": "john.shore", "name": {"first": "John", "last": "Shore"}, "email": "john.shore@gmail.com", "major": "Electrical Engineering"})
db.students.find()
 
db.students.insert({"_id": "jeff", "name": {"first": "Jeff", "last": "Holland"}, "email": "jeff@yahoo.com", "major": "Business"})
db.students.insert({"_id": "scott", "name": {"first": "Scott", "last": "Mills"}, "email": "scott@hotmail.com", "major": "Humanities/Art"})
db.students.insert({"_id": "rcotter", "name": {"first": "Ray", "last": "Cotter"}, "email": "rcotter@msn.com", "major": "Computer Science"})
db.students.insert({"_id": "lee2331", "name": {"first": "Lee", "last": "Aldwell"}, "email": "lee2331@aol.com", "major": "Graphic Design"})
db.students.find()
// have a look to see that the students collection now exists
show collections
 
// create course records
db.courses.insert({"_id": "HIST-1010", "name": "History of the World 1010", "description": "A bunch of really interesting things that actually happened", "students": [], "ratings": []})
db.courses.find()
 
db.courses.insert({"_id": "ENGCOMP-1010", "name": "English Composition 1010", "description": "If you can't write well, you've got nothing!", "students": [], "ratings": []})
db.courses.insert({"_id": "ART-1050", "name": "Artistic Interpretation 1050", "description": "Discover your inner beholder", "students": [], "ratings": []})
db.courses.find()
 
// create instructor records
db.instructors.insert({"_id": "wally.r.binns", "name": {"first": "Wally", "middle": "r", "last": "Binns"}, "email": "wally.r.binns@ssu.edu", "bio": "I was born in the middle of my mother's doctoral dissertation on Faraday Cage isolation. I've been an academic ever since...", "publications": [], "courses": []})
db.instructors.find()
 
db.instructors.insert({"_id": "gerald.waterford.iii", "name": {"first": "Gerald", "last": "Waterford", "suffix": "III"}, "email": "gerald.waterford.iii@ssu.edu", "bio": "My father's father was a great man. My father, not so much. I am restoring the family honor.", "publications": [], "courses": []})
db.instructors.insert({"_id": "kim.b", "name": {"prefix": "Mrs.", "first": "Kim", "last": "Binnley"}, "email": "kim.b@ssu.edu", "bio": "My mother told me 'Don't let those dopes push you around'. My life has been a constant struggle against dopeness ever since. Sigh...", "publications": [], "courses": []})
db.instructors.find().pretty()
// now we can see all three of our collections
show collections

At this point we have records for primary entities, but all the arrays that would hold course assignments, publications, etc. are empty. Let’s start adding some additional details to those primary documents.

// add some publications
var pub = {"title": "Inverted Celestial Poetry", "source": "http://www.pubcentral.com/poetry/inverted-celestial-poetry"}
pub
// db.collection.update( criteria, objNew, upsert, multi )
db.instructors.update({"_id": "wally.r.binns"}, {$push: {"publications": pub1}}, false, false)
db.instructors.update({"_id": "gerald.waterford.iii"}, {$push: {"publications": {"title": "Grow, grow, little Dandelion", "source": "http://www.hopefulstories.com/my-dandelion"}}}, false, false)
db.instructors.update({"_id": "gerald.waterford.iii"}, {$push: {"publications": {"title": "The teapot and the spoon", "source": "http://www.dishsoap.com/teapot-spoon"}}}, false, false)
 
// now lets assign some courses to instructors
db.courses.find({}, {'_id': 1, 'name': 1})
db.instructors.update({"_id": "kim.b"}, {$push: {"courses": "ART-1050"}}, false, false)
db.instructors.update({"_id": "gerald.waterford.iii"}, {$push: {"courses": "ENGCOMP-1010"}}, false, false)
db.instructors.update({"_id": "gerald.waterford.iii"}, {$push: {"courses": "HIST-1010"}}, false, false)
db.instructors.find().pretty()
 
// now lets add some students to courses
db.courses.update({"_id": "ART-1050"}, {$push: {"students": "scott"}}, false, false)
db.courses.update({"_id": "ART-1050"}, {$push: {"students": "jeff"}}, false, false)
db.courses.update({"_id": "ART-1050"}, {$push: {"students": "rcotter"}}, false, false)
db.courses.update({"_id": "HIST-1010"}, {$push: {"students": "scott"}}, false, false)
db.courses.update({"_id": "HIST-1010"}, {$push: {"students": "john.shore"}}, false, false)
db.courses.update({"_id": "ENGCOMP-1010"}, {$push: {"students": "john.shore"}}, false, false)
db.courses.update({"_id": "ENGCOMP-1010"}, {$push: {"students": "jeff"}}, false, false)
db.courses.update({"_id": "ENGCOMP-1010"}, {$push: {"students": "rcotter"}}, false, false)
db.courses.update({"_id": "ENGCOMP-1010"}, {$push: {"students": "lee2331"}}, false, false)
db.courses.update({"_id": "ENGCOMP-1010"}, {$push: {"students": "scott"}}, false, false)
db.courses.find().pretty()

Map Reduce

Now we have details about which instructors teach which courses. We know which students are registered for each course. Let’s explore how we can get a clean aggregate view of these details. In SQL we would construct a query and include the grouping and calculations we wanted to get these details. In MongoDB we need to use Map Reduce.

// find out how many courses each student has
map_number_courses = function() { 
  this.students.forEach(function(value) {
    emit(value, 1);
  });
}
reduce_number_courses = function(key, values) {
    var result = 0;
 
    values.forEach(function(value) {
      result += value;
    });
 
    return result;
}
result = db.courses.mapReduce(map_number_courses, reduce_number_courses, {out: {replace: "numbercourses"}});
show collections
db.numbercourses.find()

It’s nice to see by _id how many courses each student has, but what if we want to combine details from two different collections into a single result? It turns out we can accomplish this with MongoDB too. However, we need to run two separate map reduce executions to get all the data into one place.

// produce a listing of student full name and how many courses he/she is taking
map_course_report_names = function() { 
  emit(this._id, {name: this.name.first+' '+this.name.last, numbercourses: 0});
}
map_course_report_number = function() { 
  this.students.forEach(function(value) {
    emit(value, {name: '', numbercourses: 1});
  });
}
reduce_course_report = function(key, values) {
    var result = {name: '', numbercourses: 0};
 
    values.forEach(function(value) {
      result.numbercourses += value.numbercourses;
      if (result.name === '') {
        result.name = value.name;
      }
    });
 
    return result;
}
result = db.courses.mapReduce(map_course_report_number, reduce_course_report, {out: {replace: "coursereport"}});
result = db.students.mapReduce(map_course_report_names, reduce_course_report, {out: {reduce: "coursereport"}});
show collections
db.coursereport.find()

You might notice that neither of these map reduce examples lends itself to the type of ad hoc queries that are possible in SQL. One potential downside is that extracting data in ad hoc ways may be more tedious. A potential upside is that once you have defined the map/reduce structure that produces the results you want, the output is cached in a very fast, indexable collection.
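The effect of the second pass with {out: {reduce: "coursereport"}} can be sanity-checked in plain JavaScript: each incoming value is re-reduced with whatever is already stored under the same key, which is how the names and counts end up merged. The in-memory store and sample data below are illustrative:

```javascript
// Sketch of the two-pass merge: re-reduce incoming values with the
// value already stored under the same key, mimicking what
// {out: {reduce: "coursereport"}} does on the server.
var reduce_course_report = function (key, values) {
  var result = {name: '', numbercourses: 0};
  values.forEach(function (value) {
    result.numbercourses += value.numbercourses;
    if (result.name === '') { result.name = value.name; }
  });
  return result;
};

var store = {};  // stands in for the coursereport collection
function mergeReduce(key, value) {
  store[key] = store[key] === undefined
    ? value
    : reduce_course_report(key, [store[key], value]);
}

// pass 1: counts from courses (names unknown at this point)
mergeReduce("scott", {name: '', numbercourses: 1});
mergeReduce("scott", {name: '', numbercourses: 1});
// pass 2: name from students (count contribution is zero)
mergeReduce("scott", {name: 'Scott Mills', numbercourses: 0});
```

After both passes, the entry under "scott" carries the full name from the students collection and the course count from the courses collection.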

Inspiration

Some of the mapping techniques above were found here:
http://tebros.com/2011/07/using-mongodb-mapreduce-to-join-2-collections/

Introduction to MongoDB

I put this presentation together for a local web developer meetup. The topic was MongoDB.

MongoDB is a database, but unlike traditional relational databases which store data in well defined tables (schema), MongoDB stores data in JSON like documents (BSON). This provides a great deal of flexibility.

After you get through this presentation, go have a look at my hands on introduction to MongoDB and my walk through of MongoDB backups using Replica Sets.
