Daniel Watrous on Software Engineering

A Collection of Software Problems and Solutions

Software Engineering

Using Java to work with Versioned Data

A few days ago I wrote about how to structure version details in MongoDB. In this and subsequent articles I’m going to present a Java based approach to working with that revision data.

I have published all this work as an open source repository on github. Feel free to fork it:
https://github.com/dwatrous/mongodb-revision-objects

Design Decisions

To begin with, here are a few design rules that should direct the implementation:

  1. Program to interfaces. Choice of datastore or other technologies should not be visible in application code
  2. Application code should never deal with versioned objects. It should only deal with domain objects

Starting with rule 1 above, I came up with a design involving only five interfaces. A Person is managed using VersionedPerson, Person, HistoricalPerson and PersonDAO. A fifth interface, DisplayMode, is used to select the correct versioned data for display in the application. Here’s what the Person interface looks like:

public interface Person {
    PersonName getName();
    void setName(PersonName name);
    Integer getAge();
    void setAge(Integer age);
    String getEmail();
    void setEmail(String email);
    boolean isHappy();
    void setHappy(boolean happy);
    public interface PersonName {
        String getFirstName();
        void setFirstName(String firstName);
        String getLastName();
        void setLastName(String lastName);
    }
}

Note that there is no indication of any datastore-related artifacts, such as an ID attribute. Nor does it include any specifics about versioning, like historical metadata. This is a clean interface that should be used throughout the application code anywhere a Person is needed.

During implementation you’ll see that using a dependency injection framework makes it easy to write application code against this interface and provide any implementation at run time.
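
As a minimal sketch of that idea, here is hand-rolled constructor injection; a framework like Guice or Spring would do this wiring automatically. PersonGreeter and the stub implementation below are illustrative names, not part of the repository.

```java
// Application code depends only on the Person interface; the concrete
// implementation is supplied from outside at construction time.
interface Person {
    String getEmail();
}

class PersonGreeter {
    private final Person person;

    PersonGreeter(Person person) {  // implementation injected here
        this.person = person;
    }

    String greeting() {
        return "Hello, " + person.getEmail();
    }
}

class InjectionDemo {
    public static void main(String[] args) {
        // A stub stands in for whatever implementation is bound at runtime.
        Person stub = new Person() {
            public String getEmail() { return "daniel@current.com"; }
        };
        System.out.println(new PersonGreeter(stub).greeting());
    }
}
```

Swapping in a Morphia-backed implementation later requires no change to PersonGreeter, which is the whole point of programming to the interface.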

Versioning

Obviously it’s necessary to deal with the versioning somewhere in the code. The question is where and how. According to rule 2 above, I want to conceal any hint of the versioned structure from application code. To illustrate, let’s imagine a bit of code that would retrieve and display a person’s name and email.

First I show you what you want to avoid (i.e. DO NOT DO THIS).

Person personToDisplay;
VersionedPerson versionedPerson = personDao.getPersonByName(personName);
if (displayMode.isPreviewModeActive()) {
    personToDisplay = versionedPerson.getDraft();
} else {
    personToDisplay = versionedPerson.getPublished();
}
System.out.println(personToDisplay.getName().getFirstName());
System.out.println(personToDisplay.getEmail());

There are a few problems with this approach that might not be obvious from this simple example. One is that by allowing the PersonDAO to return a VersionedPerson, it becomes necessary to include conditional code everywhere in your application that you want to access a Person object. Imagine how costly a simple change to DisplayMode could be over time, not to mention the chance of bugs creeping in.

Another problem is that your application, which deals with Person objects, now has code throughout that introduces concepts of VersionedPerson, HistoricalPerson, etc.

In the end, all of those details relate to data access. In other words, your Data Access Object needs to be aware of these details, but the rest of your application does not. By moving all these details into your DAO, you can rewrite the above example to look like this.

Person personToDisplay = personDao.getPersonByName(personName);
System.out.println(personToDisplay.getName().getFirstName());
System.out.println(personToDisplay.getEmail());

As you can see, this keeps your application code much cleaner. The DAO has the responsibility to determine which Person object to return.
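
The selection logic from the “do not do this” example can then live in exactly one place, behind the DAO boundary. Here is an assumed sketch of what that might look like; the real logic lives in the repository’s MorphiaPersonDAO, and VersionSelector is an illustrative name.

```java
// Trimmed interfaces, just enough for the sketch.
interface Person {
    String getEmail();
}

interface VersionedPerson {
    Person getPublished();
    Person getDraft();
}

interface DisplayMode {
    boolean isPreviewModeActive();
}

class VersionSelector {
    private final DisplayMode displayMode;

    VersionSelector(DisplayMode displayMode) {
        this.displayMode = displayMode;
    }

    // The preview/published conditional appears once, inside the data
    // access layer, instead of everywhere a Person is used.
    Person select(VersionedPerson versioned) {
        return displayMode.isPreviewModeActive()
                ? versioned.getDraft()
                : versioned.getPublished();
    }
}
```

A change to how DisplayMode works now touches one class rather than every call site.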

DAO design

Let’s have a closer look at the DAO. Here’s the PersonDAO interface:

public interface PersonDAO {
    void save(Person person);
    void saveDraft(Person person);
    void publish(Person person);
    Person getPersonByName(PersonName name);
    Person getPersonByName(PersonName name, Integer historyMarker);
    List<Person> getPersonsByLastName(String lastName);
}

Notice that the DAO only ever receives or returns Person objects and search parameters. At the interface level, there is no indication of an underlying datastore or other technology. There is also no indication of any versioning. This encourages application developers to keep application code clean.

Despite this clean interface, there are some complexities. Based on the structure of the MongoDB document, which stores published, draft and history as nested documents in a single document, there is only one ObjectID that identifies all versions of the Person. That means that the ObjectID exists at the VersionedPerson level, not the Person level. That makes it necessary to pass some information around with the Person that will identify the VersionedPerson for write operations. This comes through in the implementation of the MorphiaPersonDAO.
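
One way to carry that identity is for the datastore-backed implementation class to hold the parent document’s id privately, so the DAO can route writes without exposing it through the Person interface. This is an assumed sketch; the class and field names are illustrative, not taken from the repository.

```java
interface Person {
    String getEmail();
    void setEmail(String email);
}

// The implementation carries the id of the enclosing versioned
// document, but the Person interface never exposes it.
class DatastorePerson implements Person {
    private final String versionedPersonId; // id of the parent document
    private String email;

    DatastorePerson(String versionedPersonId, String email) {
        this.versionedPersonId = versionedPersonId;
        this.email = email;
    }

    // Package-private: the DAO can read it, application code cannot.
    String getVersionedPersonId() {
        return versionedPersonId;
    }

    public String getEmail() { return email; }
    public void setEmail(String email) { this.email = email; }
}
```

Application code only ever sees the Person interface; the DAO downcasts (or keeps its own mapping) when it needs the parent id for a write.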

Download

You can clone or download the mongodb-revision-objects code and dig in to the details yourself on github.

Representing Revision Data in MongoDB

The need to track changes to web content and provide for draft or preview functionality is common to many web applications today. In relational databases it has long been common to accomplish this using a recursive relationship within a single table or by splitting the table out and storing version details in a secondary table. I’ve recently been exploring the best way to accomplish this using MongoDB.

A few design considerations

Data will be represented in three primary states: published, draft and history. These might also be called current and preview. From a workflow perspective it might be tempting to include states like in_review, pending_publish, rejected, etc., but that’s not necessary from a versioning perspective. Data is in draft until it is published. Workflow specifics should be handled outside the version control mechanism.

In code, it’s important to avoid making the revision mechanism a primary feature. In other words, you want to deal with the document stored in published, not published itself.

Historical documents need to have a unique identifier, just like the top level entity. These will be accessed less frequently and so performance is less of a consideration.

From a concurrency perspective, it’s important to make sure that updates operate against fresh data.

Basic structure

The basic structure is a top level document that contains sub-documents accommodating the three primary states mentioned above.

{
  published: {},
  draft: {},
  history: {
    "1" : {
      metadata: <value>,
      document: {}
    },
    ...
  }
}

In history, each retired document requires a few things: a unique identifier, the date it was retired, and possibly which user caused it to be retired. The metadata element above should represent all those details that you would like to know about that document.

Let’s imagine a person object that looks like this:

{
  "name" : {
    "firstName" : "Daniel",
    "lastName" : "Watrous"
  },
  "age" : 32,
  "email" : "daniel@current.com",
  "happy" : true
}

Our versioned document may look something like this.

{
  "published" : {
    "name" : {
      "firstName" : "Daniel",
      "lastName" : "Watrous"
    },
    "age" : 32,
    "email" : "daniel@current.com",
    "happy" : true
  },
  "draft" : {
    "name" : {
      "firstName" : "Daniel",
      "lastName" : "Watrous"
    },
    "age" : 33,
    "email" : "daniel@future.com",
    "happy" : true
  },
  "history" : {
    "1" : {
      "person" : {
        "name" : {
          "firstName" : "Danny",
          "lastName" : "Watrous"
        },
        "age" : 15,
        "email" : "daniel@beforeinternet.com",
        "happy" : true
      },
      "dateRetired" : "2003-02-19"
    },
    "2" : {
      "person" : {
        "name" : {
          "firstName" : "Dan",
          "lastName" : "Watrous"
        },
        "age" : 23,
        "email" : "daniel@oldschool.com",
        "happy" : true
      },
      "dateRetired" : "2010-06-27"
    }
  }
}

There are a few options when it comes to uniquely identifying historical data. One is to calculate a unique value at the time an object is placed in history. This could be a combination of the top level object ID and a sequential version number. Another is to generate a hash when the object is loaded. The problem with the second approach is that queries for specific historical documents become more complex.
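
The first option is straightforward to sketch. The separator and method name below are illustrative choices, not part of any library or the repository.

```java
// Derive each history entry's identifier from the top-level ObjectID
// plus a sequential version number, computed at the moment the object
// is retired into history. The ":" separator is arbitrary.
class HistoryIds {
    static String historyId(String objectId, int versionNumber) {
        return objectId + ":" + versionNumber;
    }

    public static void main(String[] args) {
        System.out.println(historyId("507f1f77bcf86cd799439011", 2));
    }
}
```

Because the id is assigned once and never recomputed, querying for a specific historical revision is a simple equality match.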

Queries

As a rule of thumb, it’s probably best to always execute queries against published. Queries against history are unlikely at an application level. One reason for this is that any interest in historical revisions will almost universally be in the context of the published version. In other words, the top level object will already be in scope.

Draft data should be considered transient. There should be little need to protect this data or save it. Until it is accepted and becomes the new published data, changes should have little impact. Querying for draft data would be unlikely, and is discouraged at the application level.

Historical Limits

The size of documents and the frequency with which they are changed must factor into the retention of historical data. There may also be regulations and internal policies that affect this retention. In some cases it may be sufficient to retain only the last revision. In other cases it may be a time period determination. In some cases it may be desirable to save as much history as is physically possible.

While MongoDB does provide a capped collection, that won’t help us with the structure above. All of the historical data is in a sub document, not a separate collection. For that reason, any retention policy must be implemented at the application level.

It might be tempting to implement the revision history in a separate collection in order to manage retention with a capped collection, but problems arise. The biggest is that there is no way to cap the collection per versioned document: if you have one document that changes very frequently and another that changes rarely or never, historical data for the rarely changing document will eventually be pushed off the end of the collection as updates for the frequently changed document are added.

Retention model

As a baseline, it’s probably reasonable to define retention based on the following two metrics.

  • Minimum time retention
  • Maximum revisions retained

In other words, hang on to all revisions for at least the minimum time, up to the maximum number of revisions. This decision would be made at the time the document is modified. If a document is modified infrequently, it’s possible that retained revisions will be much older than the minimum retention time.
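
A sketch of that decision as code, assuming a revision is kept if it is still younger than the minimum retention time, or if it falls within the maximum revision count. The class name and thresholds are illustrative.

```java
import java.time.Duration;
import java.time.Instant;

// Two-metric retention rule, evaluated each time the document is
// modified and a revision is pushed into history.
class RetentionPolicy {
    private final Duration minRetention;
    private final int maxRevisions;

    RetentionPolicy(Duration minRetention, int maxRevisions) {
        this.minRetention = minRetention;
        this.maxRevisions = maxRevisions;
    }

    // indexFromNewest: 0 for the most recently retired revision.
    boolean shouldRetain(Instant retiredAt, int indexFromNewest, Instant now) {
        boolean withinMinTime =
                Duration.between(retiredAt, now).compareTo(minRetention) < 0;
        boolean withinMaxCount = indexFromNewest < maxRevisions;
        return withinMinTime || withinMaxCount;
    }
}
```

With, say, a 30-day minimum and a cap of 5 revisions, a revision retired yesterday is kept even if it is the tenth oldest, while a year-old revision survives only if it is among the five most recent.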

Performance considerations

Each document in MongoDB has a specifically allocated size when it is created. Updates that increase the size of the document must allocate a new document large enough to accommodate the updated document on disk and move the document. This can be an expensive operation to perform, especially at high volume.

To mitigate this it’s possible to define a paddingFactor for a collection. A padding factor is a multiplier used when creating a new document that provides additional space. For example, for paddingFactor=2, the document would be allocated twice the space needed to accommodate its size.

Since version 2.2 there’s a collection option, usePowerOf2Sizes (set with the collMod command), that rounds record allocations up to powers of 2. This may be more efficient than a fixed paddingFactor.

Note that operations like compact and repairDatabase will remove any unused padding added by paddingFactor.

Changes to document structure

It’s possible that the structure of documents change throughout the life of an application. If information is added to or removed from the core document structure, it’s important to recognize that any application code will need to be able to deal with those changes.

Application aspects that might be affected include JSON to object mapping and diffing algorithms.