Daniel Watrous on Software Engineering

A Collection of Software Problems and Solutions

Posts tagged mongodb

Software Engineering

MongoDB Authentication Setup

Authentication in MongoDB provides ‘normal’ access, which is full read and write, or ‘readonly’ access, at a database level. There are two scenarios where authentication comes into play: single server and multi-server. When using a single server, authentication can be enabled by adding --auth to the startup parameters.

When using a replica set, a sharded setup or a combination of the two, a key file must be provided and the --keyFile parameter used at startup. This enables each node to communicate with the other nodes using a nonce scheme based on the keyFile. In this configuration --auth is implied, and all MongoDB access then requires authentication.

The details below assume a replica set configuration.
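For example, each member of the replica set might be started with something like this (the set name rs0 is a placeholder, and the key path matches the location used below):

mongod --replSet rs0 --keyFile /home/mongod/mongokey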

Creating a keyFile

The keyFile must be 1kB or less. It can contain any text; for platform independence, any whitespace is ignored. On Linux, OpenSSL can be used to create a solid keyFile like this:

openssl rand -base64 258 > mongokey

At this point, the local file mongokey will contain something like this:

rMvhlWEIzktbhXN+rcTV43z2YKPGsd8YHNNuOVpZLW9bIPx1MaAeGTVullFVY4A5
B0zRpKLXcB347T/m278LK3BNBynB3mVpoe1pPmSYVjpBmo3LhsDKXywb8dU7UrBl
9bgh4NZfNaBcYykuoQsiloWNP5QtMquBymF2bh+1s+aJpvkq1FzAhsJvwcGeILBc
gnBOwZAsDXlE0M1hr0zvsulkyvFDgE2UcS+2tm4yZPKNDygA2HCcXqJypa9L2f1J
dC83SLNxbN4MkeE+NeY3ZE+LFUqTyvb827VhXfCX+S+TpD5h/otiS1GiQnTcBiSB
fYrMhLsOFPU9UYc705XDw48m

It’s important to choose a multiple of 3 (e.g. 258) so that no equals signs are added to the keyFile.

Installing keyFile

It’s important to protect this file from unauthorized access. One way to do this is to store it in a location that only the limited mongod user can access. In my case I moved the file into a directory owned by the mongod user and set permissions restrictively.

mkdir /home/mongod
mv /home/watrous/mongokey /home/mongod/
chown -R mongod.mongod /home/mongod/
chmod -R 700 /home/mongod/

Update MongoDB configuration

With the keyFile in place and secure, I then updated the configuration file, /etc/mongo.conf, to include a reference to the keyFile by adding this line:

keyFile = /home/mongod/mongokey

MongoDB must then be restarted to pick up the new configuration. After restarting you may notice chatter in the logs about failed authentication. These errors will go away as the same procedure is completed on the remaining nodes and they have the keyFile available.
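On a system using init scripts, the restart might look like this (the service name can vary by distribution):

sudo service mongod restart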

Establishing users

Once a keyFile has been installed as described above, MongoDB then requires authentication. Keep in mind that it is not necessary to add --auth as a startup parameter when using a keyFile.

Complete instructions for adding users can be found here: http://docs.mongodb.org/manual/tutorial/control-access-to-mongodb-with-authentication/.

I began by establishing an admin user. I did this by connecting to mongo locally on the primary node:

[watrous@system ~]$ mongo
MongoDB shell version: 2.0.6
connecting to: test
PRIMARY> use admin
switched to db admin
PRIMARY> db.addUser("admin", "bn%c@4fE0ns$!w4TFao$innIjOBKoPS*")
{
        "n" : 0,
        "lastOp" : NumberLong("5828987137181089793"),
        "connectionId" : 60,
        "err" : null,
        "ok" : 1
}
{
        "user" : "admin",
        "readOnly" : false,
        "pwd" : "f9d7f021d49ccc82b5186d16c664c652",
        "_id" : ObjectId("50e4b8eae0bdfc9063b69c32")
}
> db.auth("admin", "bn%c@4fE0ns$!w4TFao$innIjOBKoPS*")
1
PRIMARY> db.system.users.find()
{ "_id" : ObjectId("50e4b8eae0bdfc9063b69c32"), "user" : "admin", "readOnly" : false, "pwd" : "f9d7f021d49ccc82b5186d16c664c652" }

Next I established an account for a specific database. I created two accounts, one with normal access and the other with readonly access.

PRIMARY> use documents
switched to db documents
PRIMARY> db.addUser("documents_full", "*XE@D2x@nc8pfp9iKnA!!Fu!3mTd*HYY")
{
        "n" : 0,
        "lastOp" : NumberLong("5828988434261213185"),
        "connectionId" : 60,
        "err" : null,
        "ok" : 1
}
{
        "user" : "documents_full",
        "readOnly" : false,
        "pwd" : "3cd1cbaec406081b310d7f49b4284c2f",
        "_id" : ObjectId("50e4ba19e0bdfc9063b69c33")
}
PRIMARY> db.addUser("documents_readonly", "91h#Tv5prInoU%GZQDNF9AoAWN5HTEag", true)
{
        "n" : 0,
        "lastOp" : NumberLong("5828988696254218241"),
        "connectionId" : 60,
        "err" : null,
        "ok" : 1
}
{
        "user" : "documents_readonly",
        "readOnly" : true,
        "pwd" : "87cab9e7ce7a5c731b34b1a0737c2ae9",
        "_id" : ObjectId("50e4ba56e0bdfc9063b69c34")
}
PRIMARY> db.system.users.find()
{ "_id" : ObjectId("50e4ba19e0bdfc9063b69c33"), "user" : "documents_full", "readOnly" : false, "pwd" : "3cd1cbaec406081b310d7f49b4284c2f" }
{ "_id" : ObjectId("50e4ba56e0bdfc9063b69c34"), "user" : "documents_readonly", "readOnly" : true, "pwd" : "87cab9e7ce7a5c731b34b1a0737c2ae9" }

At this point I was able to verify authentication and access levels.
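For example, the readonly account can authenticate and read, but an attempted write should be rejected. A sketch of that check (the docs collection is hypothetical):

PRIMARY> use documents
switched to db documents
PRIMARY> db.auth("documents_readonly", "91h#Tv5prInoU%GZQDNF9AoAWN5HTEag")
1
PRIMARY> db.docs.insert({"check": true})
// rejected with an unauthorized error for the readonly account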

Software Engineering

MongoDB Secure Mode

Security in MongoDB is relatively young in terms of features and granularity. Interestingly, they indicate that a typical use case would be to use Mongo on a trusted network, “much like how one would use, say, memcached.”

MongoDB does NOT run in secure mode by default.

As it is, the features that are available are standard, proven and probably sufficient for most use cases. Here’s a quick summary of pros and cons.

  • Pros
    • Nonce-based digest for authentication
    • Security applies across replica set nodes and shard members
  • Cons
    • Few recent replies on security wiki page
    • Coarse-grained access control

User access levels

Coarse-grained access control allows users to be defined per database and given either read-only or read/write access. Since there is no rigid schema in MongoDB, it’s not possible to limit access to a subset of collections or documents.

Limit to expected IPs

Along the lines of the ‘trusted network’ mentioned above, it’s recommended to configure each mongo instance to accept connections only on expected interfaces. For example, you could limit access to the loopback address, or to an IP on a local private network.
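For example, to limit mongod to the loopback interface, add this to the configuration file (the --bind_ip startup parameter is the command line equivalent):

bind_ip = 127.0.0.1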

Disable http interface

By default, a useful HTTP-based interface provides information about the mongodb instance on a machine and links to similar interfaces on related machines in the replica set. This can be disabled by providing --nohttpinterface when starting mongod.
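The configuration file equivalent looks like this:

nohttpinterface = true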

SSL ready

In cases where SSL security is required, Mongo can be compiled to include support for it. The standard downloads do not include this feature. A standard SSL key can be produced in the usual way, using openssl for example.
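For example, a self-signed certificate and key can be generated and concatenated into the single PEM file that an SSL-enabled mongod expects. This is only a sketch; a CA-signed certificate would be preferable outside of development:

openssl req -new -x509 -days 365 -nodes -out mongodb-cert.crt -keyout mongodb-cert.key
cat mongodb-cert.key mongodb-cert.crt > mongodb.pem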

Software Engineering

Redis as a cache for Java servlets

I’ve been refactoring an application recently to move away from a proprietary and inflexible in-memory datastore. Its chief drawback was that the content was static: the only way to update data involved a build and replication process that took much longer than the stakeholders were willing to wait. The main selling point in its favor was that it was blazing fast. And I mean blazing fast.

My choice for a replacement datastore technology was MongoDB. MongoDB worked great, but profiling and performance comparison naturally showed that the in-memory solution outperformed the MongoDB solution. Communicating with MongoDB on every request was obviously much slower than the previous in-memory datastore, and the response time was less consistent from one request to another.

Caching for performance

When the data being used to generate a response changes infrequently, it’s generally bad design to serve dynamic content on every page load. Enter caching. There are a host of caching approaches covering everything from reverse proxies, like varnish, to platform specific solutions, like EHCache.

As a first stab, I chose a golden oldie, memcached, and an up-and-coming alternative, Redis. There’s some lively discussion online about the performance differences between these two technologies. Ultimately I chose Redis due to the active development on the platform and its feature set.

Basic cache

In Java there are a handful of available Redis drivers. I started with the Jedis client. In order to use Jedis, I added this to my pom.xml.

<dependency>
    <groupId>redis.clients</groupId>
    <artifactId>jedis</artifactId>
    <version>2.0.0</version>
    <type>jar</type>
    <scope>compile</scope>
</dependency>

I then modified my basic Servlet to init a JedisPool and use jedis to cache the values I was retrieving from MongoDB. Here’s what my class ended up looking like.

package com.danielwatrous.cachetest;
 
import com.google.inject.Guice;
import com.google.inject.Injector;
import com.danielwatrous.linker.domain.WebLink;
import com.danielwatrous.linker.modules.MongoLinkerModule;
import java.io.IOException;
import java.io.PrintWriter;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPool;
import redis.clients.jedis.JedisPoolConfig;
 
public class BuildCnavLink extends HttpServlet {
 
    private static Injector hbinjector = null;
    private static JedisPool pool = null;
 
    @Override
    public void init() {
        hbinjector = Guice.createInjector(new MongoLinkerModule());
        pool = new JedisPool(new JedisPoolConfig(), "localhost", 6379);
    }
 
    @Override
    public void destroy() {
        pool.destroy();
    }
 
    protected void processRequest(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        response.setContentType("text/xml;charset=UTF-8");
        PrintWriter out = response.getWriter();
        String value = "";
        Jedis jedis = null;
 
        try {
            jedis = pool.getResource();
            String cacheKey = getCacheKey (request.getParameter("country"), request.getParameter("lang"), request.getParameter("company"));
            value = jedis.get(cacheKey);
            if (value == null) {
                WebLink webLink = hbinjector.getInstance(WebLink.class);
                webLink.setLanguage(request.getParameter("lang"));
                webLink.setCountry(request.getParameter("country"));
                webLink.setCompany(request.getParameter("company"));
                value = webLink.buildWebLink();
                jedis.set(cacheKey, value);
            }
        } finally {
            // guard against a null reference if getResource fails
            if (jedis != null) {
                pool.returnResource(jedis);
            }
        }
 
        try {
            out.println("<link>");
            out.println("<url>" + value + "</url>");
            out.println("</link>");
        } finally {            
            out.close();
        }
    }
 
    @Override
    protected void doGet(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        processRequest(request, response);
    }
 
    protected String getCacheKey (String country, String lang, String company) {
        String cacheKey = country + lang + company;
        return cacheKey;
    }
}

Observations

It’s assumed that a combination of country, lang and company will produce a unique value when buildWebLink is called. That must be the case if you’re using those values to generate a cache key. Note also that the values are concatenated without a delimiter, so distinct combinations could theoretically collide; adding a separator between values would remove that ambiguity.

There’s also nothing built in above to invalidate the cache. A time/age check could be added, and there are more sophisticated optimistic and pessimistic algorithms for managing cached content.
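One simple refinement is time-based expiration. Jedis exposes Redis’s expiring write, so the jedis.set call above could be swapped for something like this (the one-hour TTL is an arbitrary assumption, not a recommendation):

// cache the value with a TTL so stale entries age out automatically
// 3600 seconds is an assumed, application-specific lifetime
jedis.setex(cacheKey, 3600, value);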

In the case above, I’m using redis to store a simple String value. I’m also still generating a dynamic response, but I’ve effectively moved the majority of my data calls to redis.

Conclusion

As a first stab, this performs on par with the proprietary in memory solution that we’re replacing and the consistency from one request to another is very tight. Here I’m connecting to a local redis instance. If redis were on a remote box, network latency may erase these gains. Object storage or serialization may also affect performance if it’s determined that simple String caching isn’t sufficient or desirable.

Resources

http://www.ibm.com/developerworks/java/library/j-javadev2-22/index.html
https://github.com/xetorthio/jedis/wiki/Getting-started

Software Engineering

Lightweight Replication Monitoring with MongoDB

One of my applications runs on a large assortment of hosts split between various data centers. Some of these are redundant pairs and others are in load balanced clusters. They all require a set of identical files which represent static content and other data.

rsync was chosen to facilitate replication of data from a source to many targets. What rsync lacked out of the box was a reporting mechanism to verify that the collection of files across target systems was consistent with the source.

Existing solutions

Before designing my solution, I searched for an existing solution to the same problem. I found backup monitor, but the last release was three years ago and there have only ever been 25 downloads, so it was less than compelling. It was also a heavier solution than I was interested in.

In this case it seems that developing a new solution is appropriate.

Monitoring requirements

The goal was to have each target system run a lightweight process at scheduled intervals and send a report to an aggregator service. A report could then be generated based on the aggregated data.

My solution includes a few components. One component analyzes the current state of files on disk and writes that state to a state file. Another component needs to read that file and transmit it to the aggregator. The aggregator needs to store the state identified by the host to which it corresponds. Finally there needs to be a reporting mechanism to display the data for all hosts.

Due to the distributed nature of the replication targets, the solution should be centralized so that changes in reporting structure are picked up by target hosts quickly and with minimal effort.

Current disk state analysis

This component potentially analyzes many hundreds of thousands of files. That means the solution for this component must run very fast and be reliable. The speed requirement eliminated some of the scripting languages that might otherwise be appealing (e.g. Perl, Python, etc.).

Instead I chose to write this as a bash script and make use of existing system utilities. The utilities I use include du, find and wc. Here’s what I came up with:

#!/bin/bash
# Generate a report showing the sizes
# and file counts of replicated folders
 
# create a path reference to the report file
BASEDIR="$( cd "$( dirname "$0" )" && pwd )"
reportfile="$BASEDIR/spacereport"
 
# create/overwrite the report file; write date
date '+%Y-%m-%d %H:%M:%S' > $reportfile
 
# append details to report file
du -sh /path/to/replicated/files/* | while read size dir;
do
    echo -n "$size ";
    # augment du output with count of files in the directory
    echo -n `find "$dir" -type f|wc -l`;
    echo " $dir ";
done >> $reportfile

These commands run very fast and produce an output that looks like this:

2012-08-06 21:45:10
4.5M 101 /path/to/replicated/files/style
24M 2002 /path/to/replicated/files/html
6.7G 477505 /path/to/replicated/files/images
761M 1 /path/to/replicated/files/transfer.tgz
30G 216 /path/to/replicated/files/data

Notice that the output is space and newline delimited. It’s not great for human readability, but you’ll see in a minute that with regular expressions it’s super easy to build a report to send to the aggregator.

Read state and transmit to aggregator

Now that we have a report cached describing our current disk state, we need to format that properly and send it to the aggregator. To do this, Python seemed a good fit.

But first, I needed a way to quickly and reliably extract information from the plain text report. Regular expressions seemed like a great fit, so I used my favorite regex tool, Kodos. I needed two expressions: one to extract the date and one to match each line of report data.

You can see what I came up with in the Python script below.

#!/usr/bin/env python
#-------------------------------------------------------------------------------
# Name:        spacereport
# Purpose:     run spacereport.sh and report the results to a central service
#-------------------------------------------------------------------------------
 
import os
import re
import urllib, urllib2
from datetime import datetime
from socket import gethostname
 
def main():
 
    # where to send the report
    url = 'http://example.com/spacereport.php'
 
    # define regular expression(s)
    regexp_size_directory = re.compile(r"""([0-9.KGM]*)\s*([0-9]*)\s*[a-zA-Z/]*/(.+)""",  re.MULTILINE)
    regexp_report_time = re.compile(r"""^([0-9]{4}-[0-9]{2}-[0-9]{2}\s+[0-9]{2}:[0-9]{2}:[0-9]{2})\n""")
 
    # run the spacereport.sh script to generate plain text report
    base_dir = os.path.dirname(os.path.realpath(__file__))
    os.system(os.path.join(base_dir, 'spacereport.sh'))
 
    # parse space data from file
    spacedata = open(os.path.join(base_dir, 'spacereport')).read()
    space_report_time = regexp_report_time.search(spacedata).group(1)
    space_data_directories = regexp_size_directory.findall(spacedata)
 
    # create space data transmission
    report_time = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    hostname = gethostname()
    space_data = {'host': hostname, 'report_time': space_report_time, 'report_time_sent': report_time, 'space_data_by_directory': []}
    for space_data_directory in space_data_directories:
        space_data['space_data_by_directory'].append({'size_on_disk': space_data_directory[0], 'files_on_disk': space_data_directory[1], 'directory': space_data_directory[2]})
 
    # prepare the report
    # it might be better to use the json library for this :)
    report = {'report': str(space_data).replace("'", "\"")}
 
    # encode and send the report
    data = urllib.urlencode(report)
    req = urllib2.Request(url, data)
    response = urllib2.urlopen(req)
 
    # You can optionally output the response to verify that it worked
    the_page = response.read()
    print(the_page)
 
if __name__ == '__main__':
    main()

The variable report contains a value similar to what’s shown below:

{'report': '{"report_time": "2012-08-06 21:45:10", "host": "WATROUS1", "space_data_by_directory": [{"files_on_disk": "101", "directory": "style", "size_on_disk": "4.5M"}, {"files_on_disk": "2002", "directory": "html", "size_on_disk": "24M"}, {"files_on_disk": "477505", "directory": "images", "size_on_disk": "6.7G"}, {"files_on_disk": "1", "directory": "transfer.tgz", "size_on_disk": "761M"}, {"files_on_disk": "216", "directory": "data", "size_on_disk": "30G"}], "report_time_sent": "2012-08-06 16:20:53"}'}

The difference between report_time and report_time_sent, if there is a difference, is important. It can alert you to an error on the system preventing a new, valid report from being created by your shell script. It can also signal load issues if the gap is too wide.

Capture and store state data

Now we need to create spacereport.php, the aggregation script. Its sole job is to receive the report and store it in MongoDB. This is made easy using PHP’s built-in json_decode and MongoDB support. After the call to json_decode, the date strings still need to be converted to MongoDates.

<?php
if ($_SERVER['REQUEST_METHOD'] == "POST") {
  // decode JSON report and convert dates to MongoDate
  $spacereport = json_decode($_POST['report'], true);
  $spacereport['report_time_sent'] = new MongoDate(strtotime($spacereport['report_time_sent']));
  $spacereport['report_time'] = new MongoDate(strtotime($spacereport['report_time']));
 
  // connect to MongoDB
  $mongoConnection = new Mongo("m1.example.com,m2.example.com", array("replicaSet" => "replicasetname"));
 
  // select a database
  $db = $mongoConnection->reports;
 
  // select a collection
  $collection = $db->spacereport;
 
  // add a record
  $collection->insert($spacereport);
 
  print_r($spacereport);
} else {
  // this should probably set the STATUS to 405 Method Not Allowed
  echo 'not POST';
}
?>

At this point the data is available in MongoDB, and Mongo’s query mechanism can be used to explore it.

PRIMARY> db.spacereport.find({host: 't1.example.com'}).limit(1).pretty()
{
        "_id" : ObjectId("501fb1d47540c4df76000073"),
        "report_time" : ISODate("2012-08-06T12:00:11Z"),
        "host" : "t1.example.com",
        "space_data_by_directory" : [
                {
                        "files_on_disk" : "101",
                        "directory" : "style ",
                        "size_on_disk" : "4.5M"
                },
                {
                        "files_on_disk" : "2001",
                        "directory" : "html ",
                        "size_on_disk" : "24M"
                },
                {
                        "files_on_disk" : "477505",
                        "directory" : "directory ",
                        "size_on_disk" : "6.7G"
                },
                {
                        "files_on_disk" : "1",
                        "directory" : "transfer.tgz ",
                        "size_on_disk" : "761M"
                },
                {
                        "files_on_disk" : "215",
                        "directory" : "data ",
                        "size_on_disk" : "30G"
                }
        ],
        "report_time_sent" : ISODate("2012-08-06T12:00:20Z")
}

NOTE: For this system, historical data quickly diminishes in importance since what I’m interested in is the current state of replication. For that reason I made the spacereport collection capped to 1MB or 100 records.

PRIMARY> db.runCommand({"convertToCapped": "spacereport", size: 1045876, max: 100});
{ "ok" : 1 }

Display state data report

It’s not very useful to look at the data one record at a time, so we need some way of viewing the data as a whole. PHP is convenient, so we’ll use that to create a web based report.

<html>
<head>
<style>
body {
    font-family: Arial, Helvetica, Sans serif;
}
 
.directoryname {
    float: left;
    width: 350px;
}
 
.sizeondisk {
    float: left;
    text-align: right;
    width: 150px;
}
 
.numberoffiles {
    float: left;
    text-align: right;
    width: 150px;
}
</style>
</head>
 
<body>
 
<?php
 
$host_display_template = "<hr />\n<strong>%s</strong> showing report at <em>%s</em> (of %d total reports)<br />\n";
$spacedata_row_template = "<div class='directoryname'>%s</div> <div class='sizeondisk'>%s</div> <div class='numberoffiles'>%d total files</div><div style='clear: both;'></div>\n";
 
$mongoConnection = new Mongo("m1.example.com,m2.example.com", array("replicaSet" => "replicasetname"));
 
// select a database
$db = $mongoConnection->reports;
 
// select a collection
$collection = $db->spacereport;
 
// group the collection to get a unique list of all hosts reporting space data
$key = array("host" => 1);
$initial = array("reports" => 0);
$reduce = "function(obj, prev) {prev.reports++;}";
$reports_by_host = $collection->group($key, $initial, $reduce);
 
// cycle through all hosts found above
foreach ($reports_by_host['retval'] as $report_by_host) {
    // grab the reports for this host and sort to find most recent report
    $cursor = $collection->find(array("host" => $report_by_host['host']));
    $cursor->sort(array("report_time_sent" => -1))->limit(1);
    foreach ($cursor as $obj) {
        // output details about this host and the report timing
        printf ($host_display_template, $report_by_host['host'], date('M-d-Y H:i:s', $obj['report_time']->sec), $report_by_host['reports']);
        foreach ($obj["space_data_by_directory"] as $directory) {
            // output details about this directory
            printf ($spacedata_row_template, $directory["directory"], $directory["size_on_disk"], $directory["files_on_disk"]);
        }
    }
}
?>
 
</body>
</html>

Centralizing the service

At this point the entire reporting structure is in place, but it requires manual installation or updates on each host where it runs. Even if you only have a handful of hosts, it can quickly become a pain to update them by hand each time you want to change the structure.

To get around this, host the two scripts responsible for creating and sending the report in a location that’s accessible to each target host. Then run the report from a third script that grabs the latest copies of those scripts and runs them.

# download the latest spacereport scripts 
# and run to update central aggregation point
 
BASEDIR=$(dirname $0)
# use wget to grab the latest scripts
wget -q http://c.example.com/spacereport.py -O $BASEDIR/spacereport.py
wget -q http://c.example.com/spacereport.sh -O $BASEDIR/spacereport.sh
# make sure that spacereport.sh is executable
chmod +x $BASEDIR/spacereport.sh
# create and send a spacereport
python $BASEDIR/spacereport.py

Since MongoDB is schemaless, the structure of the reports can be changed at will. Provided legacy values are left in place, no changes are required at any other point in the reporting process.

Future Enhancements

Possible enhancements could include active monitoring, such as allowing an administrator to define rules that would trigger notifications based on the data being aggregated. This monitoring could be implemented as a hook in the spacereport.php aggregator script or as a scheduled cron job. Rules could include comparisons between hosts, self comparison with historical data for the same host, or comparison to baseline data external to the hosts being monitored.

Some refactoring to generalize the shell script that produces the initial plain text report may improve efficiency and/or flexibility, though it’s difficult to imagine that writing a custom program to replace existing system utilities would be worthwhile.

The ubiquity of support for MongoDB and JSON would make it easy to reimplement any of the above components in other languages if there’s a better fit for a certain project.

Software Engineering

Revisit Webfaction Hosting

Several years ago I hosted with webfaction for about a year. I was drawn to them at the time because they allowed SSH access and let me run Java applications on my account. Those were not common features in shared hosting at the time. I didn’t end up deploying any Java applications, and the PHP sites I did deploy routed through webfaction’s nginx-to-PHP configuration, which frequently failed. That meant many visitors to my site saw nginx errors rather than my web page. When they couldn’t resolve the issue I moved my hosting to hostgator and have been extremely happy ever since.

I recently decided to explore Python Flask with MongoDB and was looking at hosting options. Google App Engine is a little too restrictive and I’ve had a mixed experience while working on the Software Licensing system that I’ve developed for that platform. I considered several VPS options including Amazon EC2 and Linode. As I was looking around I thought about webfaction again.

Ready to run Flask and MongoDB

I did a little searching and found that I could deploy Flask applications easily on webfaction. I could also deploy a MongoDB instance on webfaction.

I decided to give them another try and paid my first month’s fee. With about an hour of setup I had successfully deployed a Python Flask “Hello World!” application and had a running MongoDB instance. It was surprisingly easy, especially considering that with any VPS solution I would have needed to setup and worry about a lot more than just my Flask application. It saved me time and money.

Caveat Emptor

What I don’t know is whether they have addressed nginx errors (I recall that they were 502 Bad Gateway errors). Apparently it’s related to the nginx server not getting a suitable reply from the application. If I find myself fighting with that again this time around I may end up on a VPS anyway, but for development, it’s hard to beat their pricing and flexibility.

I really hope it works out that I can run in production on webfaction. I’ll keep you posted.

Software Engineering

MongoDB – Boise Code Camp 2012

Welcome to Boise Code Camp, 2012. This year I decided to present on MongoDB. In a previous presentation on MongoDB and some video training, I covered installation, configuration and using replica sets.

For the code camp this year I explore some of the differences between schema design for a relational database and schema design for MongoDB. You can copy and paste the commands below directly into the MongoDB command line to follow along.

To begin with, it’s important to know how to get help and see what databases and collections are available. The default MongoDB ID generation is also useful to know about.

help
show dbs
 
// discuss default IDs and how they work
use exploreid
db.dropDatabase()
 
use exploreid
// db.collection.insert(object)
db.things.insert({'name': 'Daniel Watrous'})
db.things.find()

Schema Design

Now let’s build the course tracker database. The MongoDB website provides some great instruction and a few videos on its schema design for MongoDB page. First have a look at the relational schema and then the adjusted MongoDB schema.

// build coursetracker database
use coursetracker
db.dropDatabase()
 
use coursetracker
// create student records
db.students.insert({"_id": "john.shore", "name": {"first": "John", "last": "Shore"}, "email": "john.shore@gmail.com", "major": "Electrical Engineering"})
db.students.find()
 
db.students.insert({"_id": "jeff", "name": {"first": "Jeff", "last": "Holland"}, "email": "jeff@yahoo.com", "major": "Business"})
db.students.insert({"_id": "scott", "name": {"first": "Scott", "last": "Mills"}, "email": "scott@hotmail.com", "major": "Humanities/Art"})
db.students.insert({"_id": "rcotter", "name": {"first": "Ray", "last": "Cotter"}, "email": "rcotter@msn.com", "major": "Computer Science"})
db.students.insert({"_id": "lee2331", "name": {"first": "Lee", "last": "Aldwell"}, "email": "lee2331@aol.com", "major": "Graphic Design"})
db.students.find()
// have a look to see that the students collection now exists
show collections
 
// create course records
db.courses.insert({"_id": "HIST-1010", "name": "History of the World 1010", "description": "A bunch of really interesting things that actually happened", "students": [], "ratings": []})
db.courses.find()
 
db.courses.insert({"_id": "ENGCOMP-1010", "name": "English Composition 1010", "description": "If you can't write well, you've got nothing!", "students": [], "ratings": []})
db.courses.insert({"_id": "ART-1050", "name": "Artistic Interpretation 1050", "description": "Discover your inner beholder", "students": [], "ratings": []})
db.courses.find()
 
// create instructor records
db.instructors.insert({"_id": "wally.r.binns", "name": {"first": "Wally", "middle": "r", "last": "Binns"}, "email": "wally.r.binns@ssu.edu", "bio": "I was born in the middle of my mother\"s doctoral dissertation on Faraday Cage isolation. I\"ve been an academic ever since...", "publications": [], "courses": []})
db.instructors.find()
 
db.instructors.insert({"_id": "gerald.waterford.iii", "name": {"first": "Gerald", "last": "Waterford", "suffix": "III"}, "email": "gerald.waterford.iii@ssu.edu", "bio": "My father's father was a great man. My father, not so much. I am restoring the family honor.", "publications": [], "courses": []})
db.instructors.insert({"_id": "kim.b", "name": {"prefix": "Mrs.", "first": "Kim", "last": "Binnley"}, "email": "kim.b@ssu.edu", "bio": "My mother told me 'Don't let those dopes push you around'. My life has been a constant struggle against dopeness ever since. Sigh...", "publications": [], "courses": []})
db.instructors.find().pretty()
// now we can see all three of our collections
show collections

At this point we have records for the primary entities, but all the arrays that would hold course assignments, publications, etc. are empty. Let’s start adding some details to those primary documents.

// add some publications
var pub = {"title": "Inverted Celestial Poetry", "source": "http://www.pubcentral.com/poetry/inverted-celestial-poetry"}
pub
// db.collection.update( criteria, objNew, upsert, multi )
db.instructors.update({"_id": "wally.r.binns"}, {$push: {"publications": pub1}}, false, false)
db.instructors.update({"_id": "gerald.waterford.iii"}, {$push: {"publications": {"title": "Grow, grow, little Dandelion", "source": "http://www.hopefulstories.com/my-dandelion"}}}, false, false)
db.instructors.update({"_id": "gerald.waterford.iii"}, {$push: {"publications": {"title": "The teapot and the spoon", "source": "http://www.dishsoap.com/teapot-spoon"}}}, false, false)
 
// now lets assign some courses to instructors
db.courses.find({}, {'_id': 1, 'name': 1})
db.instructors.update({"_id": "kim.b"}, {$push: {"courses": "ART-1050"}}, false, false)
db.instructors.update({"_id": "gerald.waterford.iii"}, {$push: {"courses": "ENGCOMP-1010"}}, false, false)
db.instructors.update({"_id": "gerald.waterford.iii"}, {$push: {"courses": "HIST-1010"}}, false, false)
db.instructors.find().pretty()
 
// now lets add some students to courses
db.courses.update({"_id": "ART-1050"}, {$push: {"students": "scott"}}, false, false)
db.courses.update({"_id": "ART-1050"}, {$push: {"students": "jeff"}}, false, false)
db.courses.update({"_id": "ART-1050"}, {$push: {"students": "rcotter"}}, false, false)
db.courses.update({"_id": "HIST-1010"}, {$push: {"students": "scott"}}, false, false)
db.courses.update({"_id": "HIST-1010"}, {$push: {"students": "john.shore"}}, false, false)
db.courses.update({"_id": "ENGCOMP-1010"}, {$push: {"students": "john.shore"}}, false, false)
db.courses.update({"_id": "ENGCOMP-1010"}, {$push: {"students": "jeff"}}, false, false)
db.courses.update({"_id": "ENGCOMP-1010"}, {$push: {"students": "rcotter"}}, false, false)
db.courses.update({"_id": "ENGCOMP-1010"}, {$push: {"students": "lee2331"}}, false, false)
db.courses.update({"_id": "ENGCOMP-1010"}, {$push: {"students": "scott"}}, false, false)
db.courses.find().pretty()

Map Reduce

Now we have details about which instructors teach which courses. We know which students are registered for each course. Let’s explore how we can get a clean aggregate view of these details. In SQL we would construct a query and include the grouping and calculations we wanted to get these details. In MongoDB we need to use Map Reduce.

// find out how many courses each student has
map_number_courses = function() { 
  this.students.forEach(function(value) {
    emit(value, 1);
  });
}
reduce_number_courses = function(key, values) {
    var result = 0;
 
    values.forEach(function(value) {
      result += value;
    });
 
    return result;
}
result = db.courses.mapReduce(map_number_courses, reduce_number_courses, {out: {replace: "numbercourses"}});
show collections
db.numbercourses.find()

It’s nice to see, by _id, how many courses each student has, but what if we want to combine details from two different collections into a single result? It turns out we can accomplish this with MongoDB too. However, we need to run two separate map reduce executions to get all the data we need in one place.

// produce a listing of student full name and how many courses he/she is taking
map_course_report_names = function() { 
  emit(this._id, {name: this.name.first+' '+this.name.last, numbercourses: 0});
}
map_course_report_number = function() { 
  this.students.forEach(function(value) {
    emit(value, {name: '', numbercourses: 1});
  });
}
reduce_course_report = function(key, values) {
    var result = {name: '', numbercourses: 0};
 
    values.forEach(function(value) {
      result.numbercourses += value.numbercourses;
      if (result.name === '') {
        result.name = value.name;
      }
    });
 
    return result;
}
result = db.courses.mapReduce(map_course_report_number, reduce_course_report, {out: {replace: "coursereport"}});
result = db.students.mapReduce(map_course_report_names, reduce_course_report, {out: {reduce: "coursereport"}});
show collections
db.coursereport.find()

You might notice that neither of these map reduce examples lends itself to the kind of ad hoc queries that are possible in SQL. One potential downside is that extracting data in informal, varied ways may be more tedious. A potential upside is that once you have defined the map and reduce functions that produce the results you want, the output is cached in a very fast, indexable collection.
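For example, since the output of map reduce is an ordinary collection of {_id, value} documents, it can be indexed and queried like any other collection. A sketch:

PRIMARY> db.coursereport.ensureIndex({"value.numbercourses": 1})
PRIMARY> db.coursereport.find({"value.numbercourses": {$gte: 3}})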

Inspiration

Some of the mapping techniques above were found here:
http://tebros.com/2011/07/using-mongodb-mapreduce-to-join-2-collections/

Software Engineering

MongoDB Using Replica Sets as a Backup

MongoDB implements a form of replication they call replica sets. Referring to a replica set, instead of just calling it replication, is a helpful distinction that becomes more obviously useful when you are introduced to sharding in MongoDB, since each shard should be comprised of a set of replicas unique to that shard; but we’ll get to that later.

For now, I want to show you how easy it is to set up and use replica sets. The setup and initial ‘recovery’ of data to all replicas in the set is quite simple. In this video I walk you through the entire process. The video is HD, so be sure to watch it full screen to get all the details.
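As a taste of what the video covers, initiating a three-member replica set from the mongo shell looks something like this (the set name and hostnames are placeholders; each mongod must have been started with --replSet myset):

rs.initiate({
  _id: "myset",
  members: [
    {_id: 0, host: "m1.example.com:27017"},
    {_id: 1, host: "m2.example.com:27017"},
    {_id: 2, host: "m3.example.com:27017"}
  ]
})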

 

To install MongoDB see my hands on introduction.

Software Engineering

Hands on MongoDB introduction and installation

MongoDB is a database. However, unlike conventional relational databases that are based on well defined schema and use SQL as the primary interface to manage the data, MongoDB instead uses document based storage.

The storage uses a format known as BSON, which is a modified form of JSON. This makes the stored documents very flexible and lightweight. It also makes it easy to adjust what is contained in any document without any significant impact to the other documents in a collection (a collection in MongoDB is like a table in a relational database).
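A quick sketch of that flexibility (a hypothetical collection; the two documents share no required structure):

db.people.insert({"name": "Ada", "languages": ["python", "c"]})
db.people.insert({"name": "Grace", "rank": "Rear Admiral"})
db.people.find()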

In this short video I show you how to install and begin using MongoDB. The video is HD, so be sure to watch it full screen to get all the details.

 

When you’re done with this, go have a look at how to improve MongoDB’s reliability using replica sets.

Software Engineering

Introduction to MongoDB

I put this presentation together for a local web developer meetup. The topic was MongoDB.

MongoDB is a database, but unlike traditional relational databases which store data in well defined tables (schema), MongoDB stores data in JSON like documents (BSON). This provides a great deal of flexibility.

After you get through this presentation, go have a look at my hands on introduction to MongoDB and my walk through of MongoDB backups using Replica Sets.
