Daniel Watrous on Software Engineering

A Collection of Software Problems and Solutions

Posts tagged mongo

Software Engineering

Lightweight Replication Monitoring with MongoDB

One of my applications runs on a large assortment of hosts split between various data centers. Some of these are redundant pairs and others are in load balanced clusters. They all require a set of identical files which represent static content and other data.

rsync was chosen to facilitate replication of data from a source to many targets. What rsync lacked out of the box was a reporting mechanism to verify that the collection of files across target systems was consistent with the source.

Existing solutions

Before designing my solution, I searched for an existing solution to the same problem. I found backup monitor, but the last release was three years ago and there have only ever been 25 downloads, so it was less than compelling. It was also a heavier solution than I was interested in.

In this case it seems that developing a new solution is appropriate.

Monitoring requirements

The goal was to have each target system run a lightweight process at scheduled intervals and send a report to an aggregator service. A report could then be generated based on the aggregated data.

My solution includes a few components. One component analyzes the current state of files on disk and writes that state to a state file. Another component needs to read that file and transmit it to the aggregator. The aggregator needs to store the state identified by the host to which it corresponds. Finally there needs to be a reporting mechanism to display the data for all hosts.

Due to the distributed nature of the replication targets, the solution should be centralized so that changes in reporting structure are picked up by target hosts quickly and with minimal effort.

Current disk state analysis

This component potentially analyses many hundreds of thousands of files. That means the solution for this component must run very fast and be reliable. The speed requirements for this component eliminated some of the scripting languages that might otherwise be appealing (e.g. Perl, Python, etc.).

Instead I chose to write this as a bash script and make use of existing system utilities. The utilities I use include du, find and wc. Here’s what I came up with:

# Generate a report showing the sizes 
# and file counts of replicated folders
 
# create a path reference to the report file
BASEDIR="$( cd "$( dirname "$0" )" && pwd )"
reportfile="$BASEDIR/spacereport"
 
# create/overwrite report the file; write date
date '+%Y-%m-%d %H:%M:%S' > $reportfile
 
# append details to report file
du -sh /path/to/replicated/files/* | while read size dir;
do
    echo -n "$size ";
    # augment du output with count of files in the directory
    echo -n `find "$dir" -type f|wc -l`;
    echo " $dir ";
done >> $reportfile

These commands run very fast and produce an output that looks like this:

2012-08-06 21:45:10
4.5M 101 /path/to/replicated/files/style
24M 2002 /path/to/replicated/files/html
6.7G 477505 /path/to/replicated/files/images
761M 1 /path/to/replicated/files/transfer.tgz
30G 216 /path/to/replicated/files/data

Notice that the file output is space and newline delimited. It’s not great for human readability, but you’ll see in a minute that with regular expressions it’s super easy to build a report to send to the aggregator.

Read state and transmit to aggregator

Now that we have a report cached describing our current disk state, we need to format that properly and send it to the aggregator. To do this, Python seemed a good fit.

But first, I needed to be able to quickly and reliably extract information from this plain text report. Regular expressions seemed like a great fit for this, so I used my favorite regex tool, Kodos. The two expressions I need are to extract the date and then each line of report data.

You can see what I came up with in the Python script below.

#-------------------------------------------------------------------------------
# Name:        spacereport
# Purpose:     run spacereport.sh and report the results to a central service
#-------------------------------------------------------------------------------
#!/usr/bin/env python
 
import os
import re
import urllib, urllib2
from datetime import datetime
from socket import gethostname
 
def main():
 
    # where to send the report
    url = 'http://example.com/spacereport.php'
 
    # define regular expression(s)
    regexp_size_directory = re.compile(r"""([0-9.KGM]*)\s*([0-9]*)\s*[a-zA-Z/]*/(.+)""",  re.MULTILINE)
    regexp_report_time = re.compile(r"""^([0-9]{4}-[0-9]{2}-[0-9]{2}\s+[0-9]{2}:[0-9]{2}:[0-9]{2})\n""")
 
    # run the spacereport.sh script to generate plain text report
    base_dir = os.path.dirname(os.path.realpath(__file__))
    os.system(os.path.join(base_dir, 'spacereport.sh'))
 
    # parse space data from file
    spacedata = open(os.path.join(base_dir, 'spacereport')).read()
    space_report_time = regexp_report_time.search(spacedata).group(1)
    space_data_directories = regexp_size_directory.findall(spacedata)
 
    # create space data transmission
    report_time = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    hostname = gethostname()
    space_data = {'host': hostname, 'report_time': space_report_time, 'report_time_sent': report_time, 'space_data_by_directory': []}
    for space_data_directory in space_data_directories:
        space_data['space_data_by_directory'].append({'size_on_disk': space_data_directory[0], 'files_on_disk': space_data_directory[1], 'directory': space_data_directory[2]})
 
    # prepare the report
    # it might be better to use the json library for this :)
    report = {'report': str(space_data).replace("'", "\"")}
 
    # encode and send the report
    data = urllib.urlencode(report)
    req = urllib2.Request(url, data)
    response = urllib2.urlopen(req)
 
    # You can optionally output the response to verify that it worked
    the_page = response.read()
    print(the_page)
 
if __name__ == '__main__':
    main()

The variable report contains a value similar to what’s shown below:

{'report': '{"report_time": "2012-08-06 21:45:10", "host": "WATROUS1", "space_data_by_directory": [{"files_on_disk": "101", "directory": "style", "size_on_disk": "4.5M"}, {"files_on_disk": "2002", "directory": "html", "size_on_disk": "24M"}, {"files_on_disk": "477505", "directory": "images", "size_on_disk": "6.7G"}, {"files_on_disk": "1", "directory": "transfer.tgz", "size_on_disk": "761M"}, {"files_on_disk": "216", "directory": "data", "size_on_disk": "30G"}], "report_time_sent": "2012-08-06 16:20:53"}'}

The difference between report_time and report_time_sent, if there is a difference, is important. It can alert you to an error on the system preventing a new, valid report from being created by your shell script. It can also signal load issues if the gap is too wide.

Capture and store state data

Now we need to create spacereport.php, the aggregation script. It’s sole job is to receive the report and store it in MongoDB. This is made easy using PHP’s built in json_decode and MongoDB support. After the call to json_decode, the dates still need to be converted to MongoDates.

<?php
if ($_SERVER['REQUEST_METHOD'] == "POST") {
  // decode JSON report and convert dates to MongoDate
  $spacereport = json_decode($_POST['report'], true);
  $spacereport['report_time_sent'] = new MongoDate(strtotime($spacereport['report_time_sent']));
  $spacereport['report_time'] = new MongoDate(strtotime($spacereport['report_time']));
 
  // connect to MongoDB
  $mongoConnection = new Mongo("m1.example.com,m2.example.com", array("replicaSet" => "replicasetname"));
 
  // select a database
  $db = $mongoConnection->reports;
 
  // select a collection
  $collection = $db->spacereport;
 
  // add a record
  $collection->insert($spacereport);
 
  print_r($spacereport);
} else {
  // this should probably set the STATUS to 405 Method Not Allowed
  echo 'not POST';
}
?>

At this point, the data is now available in MongoDB. It’s possible to use Mongo’s query mechanism to query the data.

PRIMARY> db.spacereport.find({host: 't1.example.com'}).limit(1).pretty()
{
        "_id" : ObjectId("501fb1d47540c4df76000073"),
        "report_time" : ISODate("2012-08-06T12:00:11Z"),
        "host" : "t1.example.com",
        "space_data_by_directory" : [
                {
                        "files_on_disk" : "101",
                        "directory" : "style ",
                        "size_on_disk" : "4.5M"
                },
                {
                        "files_on_disk" : "2001",
                        "directory" : "html ",
                        "size_on_disk" : "24M"
                },
                {
                        "files_on_disk" : "477505",
                        "directory" : "directory ",
                        "size_on_disk" : "6.7G"
                },
                {
                        "files_on_disk" : "1",
                        "directory" : "transfer.tgz ",
                        "size_on_disk" : "761M"
                },
                {
                        "files_on_disk" : "215",
                        "directory" : "data ",
                        "size_on_disk" : "30G"
                }
        ],
        "report_time_sent" : ISODate("2012-08-06T12:00:20Z")
}

NOTE: For this system, historical data quickly diminishes in importance since what I’m interested in is the current state of replication. For that reason I made the spacereport collection capped to 1MB or 100 records.

PRIMARY> db.runCommand({"convertToCapped": "spacereport", size: 1045876, max: 100});
{ "ok" : 1 }

Display state data report

It’s not very useful to look at the data one record at a time, so we need some way of viewing the data as a whole. PHP is convenient, so we’ll use that to create a web based report.

<html>
<head>
<style>
body {
    font-family: Arial, Helvetica, Sans serif;
}
 
.directoryname {
    float: left;
    width: 350px;
}
 
.sizeondisk {
    float: left;
    text-align: right;
    width: 150px;
}
 
.numberoffiles {
    float: left;
    text-align: right;
    width: 150px;
}
</style>
</head>
 
<body>
 
<?php
 
$host_display_template = "<hr />\n<strong>%s</strong> showing report at <em>%s</em> (of %d total reports)<br />\n";
$spacedata_row_template = "<div class='directoryname'>%s</div> <div class='sizeondisk'>%s</div> <div class='numberoffiles'>%d total files</div><div style='clear: both;'></div>\n";
 
$mongoConnection = new Mongo("m1.example.com,m2.example.com", array("replicaSet" => "replicasetname"));
 
// select a database
$db = $mongoConnection->reports;
 
// select a collection
$collection = $db->spacereport;
 
// group the collection to get a unique list of all hosts reporting space data
$key = array("host" => 1);
$initial = array("reports" => 0);
$reduce = "function(obj, prev) {prev.reports++;}";
$reports_by_host = $collection->group($key, $initial, $reduce);
 
// cycle through all hosts found above
foreach ($reports_by_host['retval'] as $report_by_host) {
    // grab the reports for this host and sort to find most recent report
    $cursor = $collection->find(array("host" => $report_by_host['host']));
    $cursor->sort(array("report_time_sent" => -1))->limit(1);
    foreach ($cursor as $obj) {
        // output details about this host and the report timing
        printf ($host_display_template, $report_by_host['host'], date('M-d-Y H:i:s', $obj['report_time']->sec), $report_by_host['reports']);
        foreach ($obj["space_data_by_directory"] as $directory) {
            // output details about this directory
            printf ($spacedata_row_template, $directory["directory"], $directory["size_on_disk"], $directory["files_on_disk"]);
        }
    }
}
?>
 
</body>
</html>

Centralizing the service

At this point the entire reporting structure is in place, but it requires manual installation or updates on each host where it runs. Even if you only have a handful of hosts, it can quickly become a pain to have to update them by hand each time you can to change the structure.

To get around this, host the two scripts responsible for creating and sending the report in some location that’s accessible to the target host. Then run the report from a third script that grabs the latest copies of those scripts and run the reports.

# download the latest spacereport scripts 
# and run to update central aggregation point
 
BASEDIR=$(dirname $0)
# use wget to grab the latest scripts
wget -q http://c.example.com/spacereport.py -O $BASEDIR/spacereport.py
wget -q http://c.example.com/spacereport.sh -O $BASEDIR/spacereport.sh
# make sure that spacereport.sh is executable
chmod +x $BASEDIR/spacereport.sh
# create and send a spacereport
python $BASEDIR/spacereport.py

Since MongoDB is schemaless, the structure of the reports can be changed at will. Provided legacy values are left in place, no changes are required at any other point in the reporting process.

Future Enhancements

Possible enhancements could include active monitoring, such as allowing an administrator to define rules that would trigger notifications based on the data being aggregated. This monitoring could be implemented as a hook in the spacereport.php aggregator script or based on a cron. Rules could include comparisons between hosts, self comparison with historical data for the same host, or comparison to baseline data external to the hosts being monitored.

Some refactoring to generalize the shell script that produces the initial plain text report may improve efficiency and/or flexibility, though it’s difficult to imagine that writing a custom program to replace existing system utilities would be worthwhile.

The ubiquity of support for MongoDB and JSON would make it easy to reimplement any of the above components in other languages if there’s a better fit for a certain project.

Software Engineering

MongoDB – Boise Code Camp 2012

Welcome to Boise Code Camp, 2012. This year I decided to present on MongoDB. In a previous presentation on MongoDB and some video training I cover installation and configuration and using replica sets.

For the code camp this year I explore some of the differences between schema design for a relational database and schema design for MongoDB. You can copy and paste the commands below directly onto the MongoDB command line to follow along.

To begin with, it’s important to know how to get help and see what databases and collections are available. The default MongoDB ID generation is also useful to know about.

1
2
3
4
5
6
7
8
9
10
11
help
show dbs
 
// discuss default IDs and how they work
use exploreid
db.dropDatabase()
 
use exploreid
// db.collection.insert(object)
db.things.insert({'name': 'Daniel Watrous'})
db.things.find()

Schema Design

Now let’s build the course tracker database. The MongoDB website provides some great instruction and a few videos on its schema design for MongoDB page. First have a look at the relational schema and then the adjusted MongoDB schema.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
// build coursetracker database
use coursetracker
db.dropDatabase()
 
use coursetracker
// create student records
db.students.insert({"_id": "john.shore", "name": {"first": "John", "last": "Shore"}, "email": "john.shore@gmail.com", "major": "Electrical Engineering"})
db.students.find()
 
db.students.insert({"_id": "jeff", "name": {"first": "Jeff", "last": "Holland"}, "email": "jeff@yahoo.com", "major": "Business"})
db.students.insert({"_id": "scott", "name": {"first": "Scott", "last": "Mills"}, "email": "scott@hotmail.com", "major": "Humanities/Art"})
db.students.insert({"_id": "rcotter", "name": {"first": "Ray", "last": "Cotter"}, "email": "rcotter@msn.com", "major": "Computer Science"})
db.students.insert({"_id": "lee2331", "name": {"first": "Lee", "last": "Aldwell"}, "email": "lee2331@aol.com", "major": "Graphic Design"})
db.students.find()
// have a look to see that the students collection now exists
show collections
 
// create course records
db.courses.insert({"_id": "HIST-1010", "name": "History of the World 1010", "description": "A bunch of really interesting things that actually happened", "students": [], "ratings": []})
db.courses.find()
 
db.courses.insert({"_id": "ENGCOMP-1010", "name": "English Composition 1010", "description": "If you can't write well, you've got nothing!", "students": [], "ratings": []})
db.courses.insert({"_id": "ART-1050", "name": "Artistic Interpretation 1050", "description": "Discover your inner beholder", "students": [], "ratings": []})
db.courses.find()
 
// create instructor records
db.instructors.insert({"_id": "wally.r.binns", "name": {"first": "Wally", "middle": "r", "last": "Binns"}, "email": "wally.r.binns@ssu.edu", "bio": "I was born in the middle of my mother\"s doctoral dissertation on Faraday Cage isolation. I\"ve been an academic ever since...", "publications": [], "courses": []})
db.instructors.find()
 
db.instructors.insert({"_id": "gerald.waterford.iii", "name": {"first": "Gerald", "last": "Waterford", "suffix": "III"}, "email": "gerald.waterford.iii@ssu.edu", "bio": "My father's father was a great man. My father, not so much. I am restoring the family honor.", "publications": [], "courses": []})
db.instructors.insert({"_id": "kim.b", "name": {"prefix": "Mrs.", "first": "Kim", "last": "Binnley"}, "email": "kim.b@ssu.edu", "bio": "My mother told me 'Don't let those dopes push you around'. My life has been a constant struggle against dopeness ever since. Sigh...", "publications": [], "courses": []})
db.instructors.find().pretty()
// now we can see all three of our collections
show collections

At this point we have records for primary entities, but all the arrays that would hold course assignments, publications, etc. are empty. Let’s start adding some additional details to those. primary documents.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
// add some publications
var pub = {"title": "Inverted Celestial Poetry", "source": "http://www.pubcentral.com/poetry/inverted-celestial-poetry"}
pub
// db.collection.update( criteria, objNew, upsert, multi )
db.instructors.update({"_id": "wally.r.binns"}, {$push: {"publications": pub1}}, false, false)
db.instructors.update({"_id": "gerald.waterford.iii"}, {$push: {"publications": {"title": "Grow, grow, little Dandelion", "source": "http://www.hopefulstories.com/my-dandelion"}}}, false, false)
db.instructors.update({"_id": "gerald.waterford.iii"}, {$push: {"publications": {"title": "The teapot and the spoon", "source": "http://www.dishsoap.com/teapot-spoon"}}}, false, false)
 
// now lets assign some courses to instructors
db.courses.find({}, {'_id': 1, 'name': 1})
db.instructors.update({"_id": "kim.b"}, {$push: {"courses": "ART-1050"}}, false, false)
db.instructors.update({"_id": "gerald.waterford.iii"}, {$push: {"courses": "ENGCOMP-1010"}}, false, false)
db.instructors.update({"_id": "gerald.waterford.iii"}, {$push: {"courses": "HIST-1010"}}, false, false)
db.instructors.find().pretty()
 
// now lets add some students to courses
db.courses.update({"_id": "ART-1050"}, {$push: {"students": "scott"}}, false, false)
db.courses.update({"_id": "ART-1050"}, {$push: {"students": "jeff"}}, false, false)
db.courses.update({"_id": "ART-1050"}, {$push: {"students": "rcotter"}}, false, false)
db.courses.update({"_id": "HIST-1010"}, {$push: {"students": "scott"}}, false, false)
db.courses.update({"_id": "HIST-1010"}, {$push: {"students": "john.shore"}}, false, false)
db.courses.update({"_id": "ENGCOMP-1010"}, {$push: {"students": "john.shore"}}, false, false)
db.courses.update({"_id": "ENGCOMP-1010"}, {$push: {"students": "jeff"}}, false, false)
db.courses.update({"_id": "ENGCOMP-1010"}, {$push: {"students": "rcotter"}}, false, false)
db.courses.update({"_id": "ENGCOMP-1010"}, {$push: {"students": "lee2331"}}, false, false)
db.courses.update({"_id": "ENGCOMP-1010"}, {$push: {"students": "scott"}}, false, false)
db.courses.find().pretty()

Map Reduce

Now we have details about which instructors teach which courses. We know which students are registered for each course. Let’s explore how we can get a clean aggregate view of these details. In SQL we would construct a query and include the grouping and calculations we wanted to get these details. In MongoDB we need to use Map Reduce.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
// find out how many courses each student has
map_number_courses = function() { 
  this.students.forEach(function(value) {
    emit(value, 1);
  });
}
reduce_number_courses = function(key, values) {
    var result = 0;
 
    values.forEach(function(value) {
      result += value;
    });
 
    return result;
}
result = db.courses.mapReduce(map_number_courses, reduce_number_courses, {out: {replace: "numbercourses"}});
show collections
db.numbercourses.find()

It’s nice to see by _id how many courses each student has, but what if we want to combine details from two different collections into a single result. It turns out we can accomplish this with MongoDB too. However, we need to run two separate map reduce executions to get all the data we need in one place.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
// produce a listing of student full name and how many courses he/she is taking
map_course_report_names = function() { 
  emit(this._id, {name: this.name.first+' '+this.name.last, numbercourses: 0});
}
map_course_report_number = function() { 
  this.students.forEach(function(value) {
    emit(value, {name: '', numbercourses: 1});
  });
}
reduce_course_report = function(key, values) {
    var result = {name: '', numbercourses: 0};
 
    values.forEach(function(value) {
      result.numbercourses += value.numbercourses;
      if (result.name === '') {
        result.name = value.name;
      }
    });
 
    return result;
}
result = db.courses.mapReduce(map_course_report_number, reduce_course_report, {out: {replace: "coursereport"}});
result = db.students.mapReduce(map_course_report_names, reduce_course_report, {out: {reduce: "coursereport"}});
show collections
db.coursereport.find()

You might notice that neither of these map reduce examples lend themselves to the type of ad hoc queries that are possible in SQL. One potential downside to this is that extracting informal data in various ways may be more tedious. A potential upside is that once you have defined the map, reduce structure that will produce the results you want, the output is cached in a very fast, indexable collection.

Inspiration

Some of the mapping techniques above were found here:
http://tebros.com/2011/07/using-mongodb-mapreduce-to-join-2-collections/

Software Engineering

MongoDB Using Replica Sets as a Backup

MongoDB implements a form of replication they call replica sets. Referring to a repica set instead of just calling it replication is a helpful distinction that becomes more obviously useful when you are introduced to sharding in MongoDB since each shard should be comprised of a set of replicas unique to that shard, but we’ll get to that later.

For now, I want to show you how easy it is to setup and use replica sets. The setup and initial ‘recovery’ of data to all replicas in the set is quite simple. In this video I walk you through the entire process. The video is HD, so be sure to watch it full screen to get all the details.

 

To install MongoDB see my hands on introduction.

Software Engineering

Hands on MongoDB introduction and installation

MongoDB is a database. However, unlike conventional relational databases that are based on well defined schema and use SQL as the primary interface to manage the data, MongoDB instead uses document based storage.

The storage uses a format known as BSON, which is a modified form of JSON. This makes the stored documents very flexible and lightweight. It also makes it easy to adjust what is contained in any document without any significant impact to the other documents in a collection (a collection in MongoDB is like a table in a relational database).

In this short video I show you how to install and begin using MongoDB. The video is HD, so be sure to watch it full screen to get all the details.

 

When you’re done with this, go have a look at how to improve MongoDB’s reliability using replica sets.

Software Engineering

Introduction to MongoDB

I put this presentation together for a local web developer meetup. The topic was MongoDB.

MongoDB is a database, but unlike traditional relational databases which store data in well defined tables (schema), MongoDB stores data in JSON like documents (BSON). This provides a great deal of flexibility.

After you get through this presentation, go have a look at my hands on introduction to MongoDB and my walk through of MongoDB backups using Replica Sets.

Mongodb

View more presentations from dwatrous.