Daniel Watrous on Software Engineering

A Collection of Software Problems and Solutions

Configuration of python web applications

Hopefully it’s obvious that separating configuration from application code is always a good idea. One simple and effective way I’ve found to do this in my Python (think Bottle, Flask, etc.) apps is with a simple JSON configuration file. Choosing JSON makes sense for a few reasons:

  • Easy to read (for humans)
  • Easy to consume (by your application)
  • Can be versioned alongside application code
  • Can be turned into a configuration REST service

Here’s a short example of how to do this for a simple python application that uses MongoDB. First the configuration file.

{
    "use_ssl": false,
    "authentication": {
        "enabled": false,
        "username": "myuser",
        "password": "secret"},
    "mongodb_port": 27017,
    "mongodb_host": "localhost",
    "mongodb_dbname": "appname"
}

Now in the Python application you would load the simple JSON file above in a single line:

import json

# load in configuration
configuration = json.load(open('softwarelicense.conf.json', 'r'))

Now in your connection code, you would reference the configuration object.

from pymongo import MongoClient

# establish connection pool and a handle to the database
mongo_client = MongoClient(configuration['mongodb_host'], configuration['mongodb_port'], ssl=configuration['use_ssl'])
db = mongo_client[configuration['mongodb_dbname']]
if configuration['authentication']['enabled']:
    db.authenticate(configuration['authentication']['username'], configuration['authentication']['password'])
 
# get handle to collections in MongoDB
mycollection = db.mycollection

This keeps the magic out of your application and makes it easy to change your configuration without direct changes to your application code.
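
One possible refinement (not part of the original example) is to let an environment variable select the configuration file, so each deployment can point at its own settings without touching code. APP_CONFIG below is an assumed variable name.

import json
import os

# APP_CONFIG is a hypothetical environment variable naming the config file;
# fall back to the default file when it isn't set
config_path = os.environ.get('APP_CONFIG', 'softwarelicense.conf.json')
configuration = json.load(open(config_path, 'r'))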

Configuration as a web service

Building a web service to return configuration data is more complicated for a few reasons. One is that you want to make sure it’s secure, since most configuration data involves credentials, secrets and details about the deployment (e.g. paths). If an attacker managed to get your configuration data, that could significantly increase your attack surface. Another reason is that the service is another potential point of failure: if it were unavailable for any reason, your app would still need to know how to start up and run.

If you did create a secure, highly available configuration service, the only change to your application code would be to replace the open call with a call to urllib.urlopen or something similar.
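
A minimal sketch of that change might look like the following, assuming a hypothetical configuration service URL and keeping the local file as a fallback so the app can still start if the service is unavailable.

import json
import urllib2

# hypothetical endpoint for a secure configuration service
CONFIG_URL = 'https://config.example.com/myapp/production'

try:
    # fetch configuration from the service instead of a local file
    configuration = json.load(urllib2.urlopen(CONFIG_URL, timeout=5))
except (urllib2.URLError, ValueError):
    # fall back to the local file so the app can still start up
    configuration = json.load(open('softwarelicense.conf.json', 'r'))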

Update, October 31, 2016

The implementation below extends the ideas above.

{
   "common": {
      "authentication": {
         "jwtsecret": "badsecret",
         "jwtexpireoffset": 1800,
         "jwtalgorithm": "HS256",
         "authorize": {
            "role1": "query or group to confirm role1",
            "role2": "query or group to confirm role2"
         },
         "message":{
            "unauthorized":"Access to this system is granted by request. Contact PersonName to get access."
         }
      }
   }
}

The following class encapsulates the management of the configuration file.

import os
import json
 
class authentication_config:
    common = {}
 
    def __init__(self, configuration):
        self.common = configuration['common']['authentication']
 
    def get_config(self):
        return self.common
 
class configuration:
    authentication = None
 
    def __init__(self, config_directory, config_file):
        # load in configuration (directory is assumed relative to this file)
        config_full_path = os.path.join(os.path.dirname(os.path.abspath(__file__)), config_directory, config_file)
        with open(config_full_path, 'r') as myfile:
            configuration_raw = json.load(myfile)
        # prepare authentication
        self.authentication = authentication_config(configuration_raw)
 
 
def main():
    # get path to configuration
    config_directory = 'conf'
    config_file = 'myapp.conf.json'
 
    config = configuration(config_directory, config_file)
    print(config.authentication.get_config()['jwtsecret'])
 
 
if __name__ == '__main__':
    main()

OpenStack REST API

There are some high quality resources that already cover the OpenStack API, so this is a YEA (yet another example) post. See the resources section below for some helpful links.

OpenStack APIs provide access to all OpenStack components, such as nova (compute), glance (VM images), swift (object storage), cinder (block storage), keystone (authentication) and neutron (networking). Authentication tokens are valid for a fixed duration, after which they expire and must be replaced. Each service requires its own token. Services that are hosted on the same logical server are typically accessible over different ports.

OpenStack APIs are RESTful, which means there are many ways to use them. In this post I’ll demonstrate three approaches that should provide clarity into their structure.

  • Command Line Interface (CLI)
  • cURL
  • REST Client

In this post I don’t cover programming against the REST APIs, but instead focus just on how they work. This work builds on my OpenStack development post.

Command Line Interface (CLI)

Command Line Interfaces used to manage OpenStack components make use of the REST APIs behind the scenes: a rather smart design choice on the part of the OpenStack community. This brings consistency to OpenStack management efforts and discourages disparity between standard tooling (CLI) and custom tooling (direct API access).

Credentials are required to access the REST APIs. For the command line client, these credentials are stored as environment variables. If you’re using DevStack, you can use the openrc script to automatically set up your environment.

source openrc admin admin

Some clients support a debug option that will output full details about the request and response cycle. Raw request and response details can be helpful when learning the APIs or creating programmatic access libraries that wrap the APIs. Here’s an example that will list the flavors available.

$ nova --debug flavor-list
REQ: curl -i 'http://openstack.danielwatrous.com:5000/v2.0/tokens' -X POST -H "Accept: application/json" -H "Content-Type: application/json" -H "User-Agent: python-novaclient" -d '{"auth": {"tenantName": "admin", "passwordCredentials": {"username": "admin", "password": "{SHA1}95397c42a173838417806ce19d78f133ae6baa24"}}}'
INFO (connectionpool:258) Starting new HTTP connection (1): proxy.company.com
DEBUG (connectionpool:375) Setting read timeout to 600.0
DEBUG (connectionpool:415) "POST http://openstack.danielwatrous.com:5000/v2.0/tokens HTTP/1.1" 200 6823
RESP: [200] CaseInsensitiveDict({'content-length': '6823', 'proxy-connection': 'Keep-Alive', 'vary': 'X-Auth-Token', 'server': 'Apache/2.4.7 (Ubuntu)', 'connection': 'Keep-Alive', 'date': 'Thu, 21 Aug 2014 19:09:21 GMT', 'content-type': 'application/json'})
RESP BODY: {"access": {"token": {"issued_at": "2014-08-21T19:09:21.692110", "expires": "2014-08-21T20:09:21Z", "id": "{SHA1}99ff604f28f5706bfd82a00c21e099cba7fafab2", "tenant": {"enabled": true, "description": null, "name": "admin", "id": "32c13e88d51e49179c28520f688fa74d"}}, "serviceCatalog": [{"endpoints_links": [], "endpoints": [{"adminURL": "http://openstack.danielwatrous.com:8774/v2/32c13e88d51e49179c28520f688fa74d", "region": "RegionOne", "publicURL": "http://openstack.danielwatrous.com:8774/v2/32c13e88d51e49179c28520f688fa74d", "internalURL": "http://openstack.danielwatrous.com:8774/v2/32c13e88d51e49179c28520f688fa74d", "id": "03d570ce41c04daeb7ffa274c20435f0"}], "type": "compute", "name": "nova"}, {"endpoints_links": [], "endpoints": [{"adminURL": "http://openstack.danielwatrous.com:8776/v2/32c13e88d51e49179c28520f688fa74d", "region": "RegionOne", "publicURL": "http://openstack.danielwatrous.com:8776/v2/32c13e88d51e49179c28520f688fa74d", "internalURL": "http://openstack.danielwatrous.com:8776/v2/32c13e88d51e49179c28520f688fa74d", "id": "20d2caebf4814e1bb2c05f30a4802a2c"}], "type": "volumev2", "name": "cinderv2"}, {"endpoints_links": [], "endpoints": [{"adminURL": "http://openstack.danielwatrous.com:8774/v3", "region": "RegionOne", "publicURL": "http://openstack.danielwatrous.com:8774/v3", "internalURL": "http://openstack.danielwatrous.com:8774/v3", "id": "47f43a622264422f8980f3b0fbac5f00"}], "type": "computev3", "name": "novav3"}, {"endpoints_links": [], "endpoints": [{"adminURL": "http://openstack.danielwatrous.com:3333", "region": "RegionOne", "publicURL": "http://openstack.danielwatrous.com:3333", "internalURL": "http://openstack.danielwatrous.com:3333", "id": "149e00e61cc543cf94ae6162f79d9f00"}], "type": "s3", "name": "s3"}, {"endpoints_links": [], "endpoints": [{"adminURL": "http://openstack.danielwatrous.com:9292", "region": "RegionOne", "publicURL": "http://openstack.danielwatrous.com:9292", "internalURL": "http://openstack.danielwatrous.com:9292", "id": "1b7a45b6d1c840978491250fd1a67204"}], "type": "image", "name": "glance"}, {"endpoints_links": [], "endpoints": [{"adminURL": "http://openstack.danielwatrous.com:8000/v1", "region": "RegionOne", "publicURL": "http://openstack.danielwatrous.com:8000/v1", "internalURL": "http://openstack.danielwatrous.com:8000/v1", "id": "0b8abc323d884a0aa657bcb2f0274ee5"}], "type": "cloudformation", "name": "heat-cfn"}, {"endpoints_links": [], "endpoints": [{"adminURL": "http://openstack.danielwatrous.com:8776/v1/32c13e88d51e49179c28520f688fa74d", "region": "RegionOne", "publicURL": "http://openstack.danielwatrous.com:8776/v1/32c13e88d51e49179c28520f688fa74d", "internalURL": "http://openstack.danielwatrous.com:8776/v1/32c13e88d51e49179c28520f688fa74d", "id": "63675bf8e9a04d199cffafb7b8354b05"}], "type": "volume", "name": "cinder"}, {"endpoints_links": [], "endpoints": [{"adminURL": "http://openstack.danielwatrous.com:8773/services/Admin", "region": "RegionOne", "publicURL": "http://openstack.danielwatrous.com:8773/services/Cloud", "internalURL": "http://openstack.danielwatrous.com:8773/services/Cloud", "id": "520de40a96ea47c4a08c3ae5e0a8243c"}], "type": "ec2", "name": "ec2"}, {"endpoints_links": [], "endpoints": [{"adminURL": "http://openstack.danielwatrous.com:8004/v1/32c13e88d51e49179c28520f688fa74d", "region": "RegionOne", "publicURL": "http://openstack.danielwatrous.com:8004/v1/32c13e88d51e49179c28520f688fa74d", "internalURL": "http://openstack.danielwatrous.com:8004/v1/32c13e88d51e49179c28520f688fa74d", "id": 
"0dcff3f004ff4c9b9b96d012e47d2edb"}], "type": "orchestration", "name": "heat"}, {"endpoints_links": [], "endpoints": [{"adminURL": "http://openstack.danielwatrous.com:35357/v2.0", "region": "RegionOne", "publicURL": "http://openstack.danielwatrous.com:5000/v2.0", "internalURL": "http://openstack.danielwatrous.com:5000/v2.0", "id": "6f43e35702844e149dde900124c352bf"}], "type": "identity", "name": "keystone"}], "user": {"username": "admin", "roles_links": [], "id": "b9936b16c5d343588f5a19d31a55c1ea", "roles": [{"name": "_member_"}, {"name": "heat_stack_owner"}, {"name": "admin"}], "name": "admin"}, "metadata": {"is_admin": 0, "roles": ["9fe2ff9ee4384b1894a90878d3e92bab", "c28444beb7e64b4ea2ea223a6efcba6a", "3ea10423929b47779f977e11015fe480"]}}}
 
REQ: curl -i 'http://openstack.danielwatrous.com:8774/v2/32c13e88d51e49179c28520f688fa74d/flavors/detail' -X GET -H "Accept: application/json" -H "User-Agent: python-novaclient" -H "X-Auth-Project-Id: admin" -H "X-Auth-Token: {SHA1}99ff604f28f5706bfd82a00c21e099cba7fafab2"
INFO (connectionpool:258) Starting new HTTP connection (1): proxy.company.com
DEBUG (connectionpool:375) Setting read timeout to 600.0
DEBUG (connectionpool:415) "GET http://openstack.danielwatrous.com:8774/v2/32c13e88d51e49179c28520f688fa74d/flavors/detail HTTP/1.1" 200 3337
RESP: [200] CaseInsensitiveDict({'content-length': '3337', 'proxy-connection': 'Keep-Alive', 'x-compute-request-id': 'req-802ef8c9-d4a3-41e5-a93d-7ab2120089db', 'connection': 'Keep-Alive', 'date': 'Thu, 21 Aug 2014 19:09:24 GMT', 'content-type': 'application/json', 'age': '0'})
RESP BODY: {"flavors": [{"name": "m1.tiny", "links": [{"href": "http://openstack.danielwatrous.com:8774/v2/32c13e88d51e49179c28520f688fa74d/flavors/1", "rel": "self"}, {"href": "http://openstack.danielwatrous.com:8774/32c13e88d51e49179c28520f688fa74d/flavors/1", "rel": "bookmark"}], "ram": 512, "OS-FLV-DISABLED:disabled": false, "vcpus": 1, "swap": "", "os-flavor-access:is_public": true, "rxtx_factor": 1.0, "OS-FLV-EXT-DATA:ephemeral": 0, "disk": 1, "id": "1"}, {"name": "m1.small", "links": [{"href": "http://openstack.danielwatrous.com:8774/v2/32c13e88d51e49179c28520f688fa74d/flavors/2", "rel": "self"}, {"href": "http://openstack.danielwatrous.com:8774/32c13e88d51e49179c28520f688fa74d/flavors/2", "rel": "bookmark"}], "ram": 2048, "OS-FLV-DISABLED:disabled": false, "vcpus": 1, "swap": "", "os-flavor-access:is_public": true, "rxtx_factor": 1.0, "OS-FLV-EXT-DATA:ephemeral": 0, "disk": 20, "id": "2"}, {"name": "m1.medium", "links": [{"href": "http://openstack.danielwatrous.com:8774/v2/32c13e88d51e49179c28520f688fa74d/flavors/3", "rel": "self"}, {"href": "http://openstack.danielwatrous.com:8774/32c13e88d51e49179c28520f688fa74d/flavors/3", "rel": "bookmark"}], "ram": 4096, "OS-FLV-DISABLED:disabled": false, "vcpus": 2, "swap": "", "os-flavor-access:is_public": true, "rxtx_factor": 1.0, "OS-FLV-EXT-DATA:ephemeral": 0, "disk": 40, "id": "3"}, {"name": "m1.large", "links": [{"href": "http://openstack.danielwatrous.com:8774/v2/32c13e88d51e49179c28520f688fa74d/flavors/4", "rel": "self"}, {"href": "http://openstack.danielwatrous.com:8774/32c13e88d51e49179c28520f688fa74d/flavors/4", "rel": "bookmark"}], "ram": 8192, "OS-FLV-DISABLED:disabled": false, "vcpus": 4, "swap": "", "os-flavor-access:is_public": true, "rxtx_factor": 1.0, "OS-FLV-EXT-DATA:ephemeral": 0, "disk": 80, "id": "4"}, {"name": "m1.nano", "links": [{"href": "http://openstack.danielwatrous.com:8774/v2/32c13e88d51e49179c28520f688fa74d/flavors/42", "rel": "self"}, {"href": "http://openstack.danielwatrous.com:8774/32c13e88d51e49179c28520f688fa74d/flavors/42", "rel": "bookmark"}], "ram": 64, "OS-FLV-DISABLED:disabled": false, "vcpus": 1, "swap": "", "os-flavor-access:is_public": true, "rxtx_factor": 1.0, "OS-FLV-EXT-DATA:ephemeral": 0, "disk": 0, "id": "42"}, {"name": "m1.heat", "links": [{"href": "http://openstack.danielwatrous.com:8774/v2/32c13e88d51e49179c28520f688fa74d/flavors/451", "rel": "self"}, {"href": "http://openstack.danielwatrous.com:8774/32c13e88d51e49179c28520f688fa74d/flavors/451", "rel": "bookmark"}], "ram": 512, "OS-FLV-DISABLED:disabled": false, "vcpus": 1, "swap": "", "os-flavor-access:is_public": true, "rxtx_factor": 1.0, "OS-FLV-EXT-DATA:ephemeral": 0, "disk": 0, "id": "451"}, {"name": "m1.xlarge", "links": [{"href": "http://openstack.danielwatrous.com:8774/v2/32c13e88d51e49179c28520f688fa74d/flavors/5", "rel": "self"}, {"href": "http://openstack.danielwatrous.com:8774/32c13e88d51e49179c28520f688fa74d/flavors/5", "rel": "bookmark"}], "ram": 16384, "OS-FLV-DISABLED:disabled": false, "vcpus": 8, "swap": "", "os-flavor-access:is_public": true, "rxtx_factor": 1.0, "OS-FLV-EXT-DATA:ephemeral": 0, "disk": 160, "id": "5"}, {"name": "m1.micro", "links": [{"href": "http://openstack.danielwatrous.com:8774/v2/32c13e88d51e49179c28520f688fa74d/flavors/84", "rel": "self"}, {"href": "http://openstack.danielwatrous.com:8774/32c13e88d51e49179c28520f688fa74d/flavors/84", "rel": "bookmark"}], "ram": 128, "OS-FLV-DISABLED:disabled": false, "vcpus": 1, "swap": "", "os-flavor-access:is_public": true, "rxtx_factor": 1.0, 
"OS-FLV-EXT-DATA:ephemeral": 0, "disk": 0, "id": "84"}]}
 
+-----+-----------+-----------+------+-----------+---------+-------+-------------+-----------+
| ID  | Name      | Memory_MB | Disk | Ephemeral | Swap_MB | VCPUs | RXTX_Factor | Is_Public |
+-----+-----------+-----------+------+-----------+---------+-------+-------------+-----------+
| 1   | m1.tiny   | 512       | 1    | 0         |         | 1     | 1.0         | True      |
| 2   | m1.small  | 2048      | 20   | 0         |         | 1     | 1.0         | True      |
| 3   | m1.medium | 4096      | 40   | 0         |         | 2     | 1.0         | True      |
| 4   | m1.large  | 8192      | 80   | 0         |         | 4     | 1.0         | True      |
| 42  | m1.nano   | 64        | 0    | 0         |         | 1     | 1.0         | True      |
| 451 | m1.heat   | 512       | 0    | 0         |         | 1     | 1.0         | True      |
| 5   | m1.xlarge | 16384     | 160  | 0         |         | 8     | 1.0         | True      |
| 84  | m1.micro  | 128       | 0    | 0         |         | 1     | 1.0         | True      |
+-----+-----------+-----------+------+-----------+---------+-------+-------------+-----------+

The first two sections are calls to the REST APIs: first to the keystone service to authenticate and receive a token. Responses come back as JSON due to the Accept header of application/json. If you look closely, you’ll see that the response actually includes an access token and entry point URLs for each of the services that are integrated with keystone. These make up the ServiceCatalog, and in this case there are ten.

The second section is the actual call to the nova API. In this case it returns a list of eight flavors. The final section is a tabular view of the JSON response created by the nova command line client.
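
Although this post doesn’t cover programming against the APIs, a short sketch of the same two-step flow may make the structure more concrete. This is Python 2 (to match urllib2), the credentials are placeholders, and the ServiceCatalog handling assumes the response structure shown above.

import json
import urllib2

# step 1: authenticate against keystone and parse the access section
auth_body = json.dumps({'auth': {'tenantName': 'admin',
                                 'passwordCredentials': {'username': 'admin',
                                                         'password': 'secret'}}})
req = urllib2.Request('http://openstack.danielwatrous.com:5000/v2.0/tokens', auth_body,
                      {'Content-Type': 'application/json', 'Accept': 'application/json'})
access = json.load(urllib2.urlopen(req))['access']

# the token and the nova (compute) endpoint both come from the response
token = access['token']['id']
nova_url = [service['endpoints'][0]['publicURL']
            for service in access['serviceCatalog'] if service['type'] == 'compute'][0]

# step 2: call the nova API with the token in the X-Auth-Token header
flavors_req = urllib2.Request(nova_url + '/flavors/detail',
                              headers={'Accept': 'application/json', 'X-Auth-Token': token})
flavors = json.load(urllib2.urlopen(flavors_req))
for flavor in flavors['flavors']:
    print(flavor['name'])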

cURL

If you look closely at the debug output of the examples above, you’ll see that the command line clients use cURL to make HTTP requests, and we’ve already seen what the authentication call looks like. Direct calls with cURL look similar. For example, here I call the keystone service to get a list of tenants.

$ curl -i -X GET http://openstack.danielwatrous.com:35357/v2.0/tenants -H "User-Agent: linux-command-line" -H "X-Auth-Token: TOKEN"
HTTP/1.1 200 OK
Date: Thu, 21 Aug 2014 20:05:39 GMT
Server: Apache/2.4.7 (Ubuntu)
Vary: X-Auth-Token
Content-Length: 546
Content-Type: application/json
Proxy-Connection: Keep-Alive
Connection: Keep-Alive
 
{"tenants_links": [], "tenants": [{"description": null, "enabled": true, "id": "1b7f733fa1394b9fb96838d3d7c6feea", "name": "service"}, {"description": null, "enabled": true, "id": "298cfcec9e9e49858e9b8e83d6b7d14e", "name": "demo"}, {"description": null, "enabled": true, "id": "32c13e88d51e49179c28520f688fa74d", "name": "admin"}, {"description": null, "enabled": true, "id": "8536da0aee8149d48e1fe6078dade4bf", "name": "alt_demo"}, {"description": null, "enabled": true, "id": "e66e6a80a6014dd28c7b4c1fcad19448", "name": "invisible_to_admin"}]}

REST Client

On Windows, the tool Fiddler can be used to create REST calls. When Fiddler is first started, you may need to turn off capturing of traffic. You can do this from the File menu or by pressing F12. On the right side of the window, choose the Composer tab. There you can provide the URL, headers and other HTTP request details. Below you can see a call to Keystone for tokens.

[Image: openstack-rest-api-fiddler-1]

The response can be viewed by selecting the resulting request in the left pane and choosing the Inspectors tab in the right pane. The results can be viewed raw, as shown here.

[Image: openstack-rest-api-fiddler-2]

Fiddler also provides various parsers, including JSON, to make the content easier to visualize.

[Image: openstack-rest-api-fiddler-3]

Resources

The quality of the documentation available for OpenStack APIs is really amazing. Here are a couple of starting points for you.

http://developer.openstack.org/
http://docs.openstack.org/api/quick-start/content/

Lightweight Replication Monitoring with MongoDB

One of my applications runs on a large assortment of hosts split between various data centers. Some of these are redundant pairs and others are in load balanced clusters. They all require a set of identical files which represent static content and other data.

rsync was chosen to facilitate replication of data from a source to many targets. What rsync lacked out of the box was a reporting mechanism to verify that the collection of files across target systems was consistent with the source.

Existing solutions

Before designing my solution, I searched for an existing solution to the same problem. I found backup monitor, but the last release was three years ago and there have only ever been 25 downloads, so it was less than compelling. It was also a heavier solution than I was interested in.

In this case, developing a new solution seemed appropriate.

Monitoring requirements

The goal was to have each target system run a lightweight process at scheduled intervals and send a report to an aggregator service. A report could then be generated based on the aggregated data.

My solution includes a few components. One component analyzes the current state of files on disk and writes that state to a state file. Another component needs to read that file and transmit it to the aggregator. The aggregator needs to store the state identified by the host to which it corresponds. Finally there needs to be a reporting mechanism to display the data for all hosts.

Due to the distributed nature of the replication targets, the solution should be centralized so that changes in reporting structure are picked up by target hosts quickly and with minimal effort.

Current disk state analysis

This component potentially analyzes many hundreds of thousands of files, which means it must run very fast and be reliable. The speed requirement eliminated some of the scripting languages that might otherwise be appealing (e.g. Perl, Python, etc.).

Instead I chose to write this as a bash script and make use of existing system utilities. The utilities I use include du, find and wc. Here’s what I came up with:

#!/bin/bash
# Generate a report showing the sizes
# and file counts of replicated folders
 
# create a path reference to the report file
BASEDIR="$( cd "$( dirname "$0" )" && pwd )"
reportfile="$BASEDIR/spacereport"
 
# create/overwrite report the file; write date
date '+%Y-%m-%d %H:%M:%S' > $reportfile
 
# append details to report file
du -sh /path/to/replicated/files/* | while read size dir;
do
    echo -n "$size ";
    # augment du output with count of files in the directory
    echo -n `find "$dir" -type f|wc -l`;
    echo " $dir ";
done >> $reportfile

These commands run very fast and produce an output that looks like this:

2012-08-06 21:45:10
4.5M 101 /path/to/replicated/files/style
24M 2002 /path/to/replicated/files/html
6.7G 477505 /path/to/replicated/files/images
761M 1 /path/to/replicated/files/transfer.tgz
30G 216 /path/to/replicated/files/data

Notice that the file output is space and newline delimited. It’s not great for human readability, but you’ll see in a minute that with regular expressions it’s super easy to build a report to send to the aggregator.

Read state and transmit to aggregator

Now that we have a report cached describing our current disk state, we need to format that properly and send it to the aggregator. To do this, Python seemed a good fit.

But first, I needed to be able to quickly and reliably extract information from this plain text report. Regular expressions seemed like a great fit, so I used my favorite regex tool, Kodos. I need two expressions: one to extract the date and another to match each line of report data.

You can see what I came up with in the Python script below.

#!/usr/bin/env python
#-------------------------------------------------------------------------------
# Name:        spacereport
# Purpose:     run spacereport.sh and report the results to a central service
#-------------------------------------------------------------------------------
 
import os
import re
import urllib, urllib2
from datetime import datetime
from socket import gethostname
 
def main():
 
    # where to send the report
    url = 'http://example.com/spacereport.php'
 
    # define regular expression(s)
    regexp_size_directory = re.compile(r"""([0-9.KGM]*)\s*([0-9]*)\s*[a-zA-Z/]*/(.+)""",  re.MULTILINE)
    regexp_report_time = re.compile(r"""^([0-9]{4}-[0-9]{2}-[0-9]{2}\s+[0-9]{2}:[0-9]{2}:[0-9]{2})\n""")
 
    # run the spacereport.sh script to generate plain text report
    base_dir = os.path.dirname(os.path.realpath(__file__))
    os.system(os.path.join(base_dir, 'spacereport.sh'))
 
    # parse space data from file
    spacedata = open(os.path.join(base_dir, 'spacereport')).read()
    space_report_time = regexp_report_time.search(spacedata).group(1)
    space_data_directories = regexp_size_directory.findall(spacedata)
 
    # create space data transmission
    report_time = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    hostname = gethostname()
    space_data = {'host': hostname, 'report_time': space_report_time, 'report_time_sent': report_time, 'space_data_by_directory': []}
    for space_data_directory in space_data_directories:
        space_data['space_data_by_directory'].append({'size_on_disk': space_data_directory[0], 'files_on_disk': space_data_directory[1], 'directory': space_data_directory[2]})
 
    # prepare the report
    # it might be better to use the json library for this :)
    report = {'report': str(space_data).replace("'", "\"")}
 
    # encode and send the report
    data = urllib.urlencode(report)
    req = urllib2.Request(url, data)
    response = urllib2.urlopen(req)
 
    # You can optionally output the response to verify that it worked
    the_page = response.read()
    print(the_page)
 
if __name__ == '__main__':
    main()

The variable report contains a value similar to what’s shown below:

{'report': '{"report_time": "2012-08-06 21:45:10", "host": "WATROUS1", "space_data_by_directory": [{"files_on_disk": "101", "directory": "style", "size_on_disk": "4.5M"}, {"files_on_disk": "2002", "directory": "html", "size_on_disk": "24M"}, {"files_on_disk": "477505", "directory": "images", "size_on_disk": "6.7G"}, {"files_on_disk": "1", "directory": "transfer.tgz", "size_on_disk": "761M"}, {"files_on_disk": "216", "directory": "data", "size_on_disk": "30G"}], "report_time_sent": "2012-08-06 16:20:53"}'}

The difference between report_time and report_time_sent, if there is a difference, is important. It can alert you to an error on the system preventing a new, valid report from being created by your shell script. It can also signal load issues if the gap is too wide.
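
As a simple illustration, a hypothetical check could compare the two timestamps and flag reports whose gap exceeds some threshold (the five minutes used here is arbitrary).

from datetime import datetime

# parse the two timestamps from a report (values here are examples)
fmt = '%Y-%m-%d %H:%M:%S'
report_time = datetime.strptime('2012-08-06 21:45:10', fmt)
report_time_sent = datetime.strptime('2012-08-06 21:45:12', fmt)

# flag a wide gap between report creation and transmission
lag = (report_time_sent - report_time).total_seconds()
if lag > 300:
    print('report lag of %d seconds may indicate load or a stale report' % lag)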

Capture and store state data

Now we need to create spacereport.php, the aggregation script. Its sole job is to receive the report and store it in MongoDB. This is made easy using PHP’s built-in json_decode and MongoDB support. After the call to json_decode, the dates still need to be converted to MongoDates.

<?php
if ($_SERVER['REQUEST_METHOD'] == "POST") {
  // decode JSON report and convert dates to MongoDate
  $spacereport = json_decode($_POST['report'], true);
  $spacereport['report_time_sent'] = new MongoDate(strtotime($spacereport['report_time_sent']));
  $spacereport['report_time'] = new MongoDate(strtotime($spacereport['report_time']));
 
  // connect to MongoDB
  $mongoConnection = new Mongo("m1.example.com,m2.example.com", array("replicaSet" => "replicasetname"));
 
  // select a database
  $db = $mongoConnection->reports;
 
  // select a collection
  $collection = $db->spacereport;
 
  // add a record
  $collection->insert($spacereport);
 
  print_r($spacereport);
} else {
  // reject anything other than POST with a 405 Method Not Allowed status
  header('HTTP/1.1 405 Method Not Allowed');
  echo 'not POST';
}
?>

At this point, the data is now available in MongoDB. It’s possible to use Mongo’s query mechanism to query the data.

PRIMARY> db.spacereport.find({host: 't1.example.com'}).limit(1).pretty()
{
        "_id" : ObjectId("501fb1d47540c4df76000073"),
        "report_time" : ISODate("2012-08-06T12:00:11Z"),
        "host" : "t1.example.com",
        "space_data_by_directory" : [
                {
                        "files_on_disk" : "101",
                        "directory" : "style ",
                        "size_on_disk" : "4.5M"
                },
                {
                        "files_on_disk" : "2001",
                        "directory" : "html ",
                        "size_on_disk" : "24M"
                },
                {
                        "files_on_disk" : "477505",
                        "directory" : "directory ",
                        "size_on_disk" : "6.7G"
                },
                {
                        "files_on_disk" : "1",
                        "directory" : "transfer.tgz ",
                        "size_on_disk" : "761M"
                },
                {
                        "files_on_disk" : "215",
                        "directory" : "data ",
                        "size_on_disk" : "30G"
                }
        ],
        "report_time_sent" : ISODate("2012-08-06T12:00:20Z")
}

NOTE: For this system, historical data quickly diminishes in importance since what I’m interested in is the current state of replication. For that reason I made the spacereport collection capped to 1MB or 100 records.

PRIMARY> db.runCommand({"convertToCapped": "spacereport", size: 1048576, max: 100});
{ "ok" : 1 }

Display state data report

It’s not very useful to look at the data one record at a time, so we need some way of viewing the data as a whole. PHP is convenient, so we’ll use that to create a web based report.

<html>
<head>
<style>
body {
    font-family: Arial, Helvetica, Sans serif;
}
 
.directoryname {
    float: left;
    width: 350px;
}
 
.sizeondisk {
    float: left;
    text-align: right;
    width: 150px;
}
 
.numberoffiles {
    float: left;
    text-align: right;
    width: 150px;
}
</style>
</head>
 
<body>
 
<?php
 
$host_display_template = "<hr />\n<strong>%s</strong> showing report at <em>%s</em> (of %d total reports)<br />\n";
$spacedata_row_template = "<div class='directoryname'>%s</div> <div class='sizeondisk'>%s</div> <div class='numberoffiles'>%d total files</div><div style='clear: both;'></div>\n";
 
$mongoConnection = new Mongo("m1.example.com,m2.example.com", array("replicaSet" => "replicasetname"));
 
// select a database
$db = $mongoConnection->reports;
 
// select a collection
$collection = $db->spacereport;
 
// group the collection to get a unique list of all hosts reporting space data
$key = array("host" => 1);
$initial = array("reports" => 0);
$reduce = "function(obj, prev) {prev.reports++;}";
$reports_by_host = $collection->group($key, $initial, $reduce);
 
// cycle through all hosts found above
foreach ($reports_by_host['retval'] as $report_by_host) {
    // grab the reports for this host and sort to find most recent report
    $cursor = $collection->find(array("host" => $report_by_host['host']));
    $cursor->sort(array("report_time_sent" => -1))->limit(1);
    foreach ($cursor as $obj) {
        // output details about this host and the report timing
        printf ($host_display_template, $report_by_host['host'], date('M-d-Y H:i:s', $obj['report_time']->sec), $report_by_host['reports']);
        foreach ($obj["space_data_by_directory"] as $directory) {
            // output details about this directory
            printf ($spacedata_row_template, $directory["directory"], $directory["size_on_disk"], $directory["files_on_disk"]);
        }
    }
}
?>
 
</body>
</html>

Centralizing the service

At this point the entire reporting structure is in place, but it requires manual installation or updates on each host where it runs. Even if you only have a handful of hosts, it can quickly become a pain to update them by hand each time you want to change the structure.

To get around this, host the two scripts responsible for creating and sending the report in a location that’s accessible to each target host. Then run the report from a third script that grabs the latest copies of those scripts and runs them.

#!/bin/bash
# download the latest spacereport scripts
# and run to update central aggregation point

BASEDIR=$(dirname $0)
# use wget to grab the latest scripts
wget -q http://c.example.com/spacereport.py -O $BASEDIR/spacereport.py
wget -q http://c.example.com/spacereport.sh -O $BASEDIR/spacereport.sh
# make sure that spacereport.sh is executable
chmod +x $BASEDIR/spacereport.sh
# create and send a spacereport
python $BASEDIR/spacereport.py

Since MongoDB is schemaless, the structure of the reports can be changed at will. Provided legacy values are left in place, no changes are required at any other point in the reporting process.
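
For example, adding a field to the report is a one-line change in spacereport.py, and the aggregator and MongoDB will store it without any migration. The disk_free field below is purely illustrative.

# hypothetical new field; the aggregator stores whatever the report contains
space_data['disk_free'] = '12G'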

Future Enhancements

Possible enhancements could include active monitoring, such as allowing an administrator to define rules that would trigger notifications based on the data being aggregated. This monitoring could be implemented as a hook in the spacereport.php aggregator script or run from a cron job. Rules could include comparisons between hosts, self comparison with historical data for the same host, or comparison to baseline data external to the hosts being monitored.

Some refactoring to generalize the shell script that produces the initial plain text report may improve efficiency and/or flexibility, though it’s difficult to imagine that writing a custom program to replace existing system utilities would be worthwhile.

The ubiquity of support for MongoDB and JSON would make it easy to reimplement any of the above components in other languages if there’s a better fit for a certain project.

RESTful Java Servlet: Serializing to/from JSON with Jackson

In a previous article I demonstrated one way to create a RESTful interface using a plain Java Servlet. In this article I wanted to extend that to include JSON serialization using Jackson.

I found a very simple article showing a basic case mapping a POJO to JSON and back again. However, when I copied this straight over I got the following error:

org.codehaus.jackson.map.JsonMappingException: No serializer found for class DataClass and no properties discovered to create BeanSerializer (to avoid exception, disable SerializationConfig.Feature.FAIL_ON_EMPTY_BEANS)

I found there were two ways to get past that error. The first was to use the Jackson annotations to define properties more directly. The other was to add getters and setters to the class. Here is my DataClass with both configurations.

import java.util.ArrayList;
import java.util.List;
 
import org.codehaus.jackson.annotate.JsonAutoDetect;
import org.codehaus.jackson.annotate.JsonIgnoreProperties;
import org.codehaus.jackson.annotate.JsonProperty;
 
@JsonAutoDetect   // use this annotation if you don't have getters and setters for each JsonProperty
//@JsonIgnoreProperties(ignoreUnknown = true)    // use this if there isn't an exact correlation between JSON and class properties
public class DataClass {
 
  //@JsonProperty("theid") // use this if the property in JSON has a different identifier than your class
  @JsonProperty
  private int id = 1;
 
  @JsonProperty
  private String name = "Test Class";
 
  @JsonProperty
  private List<String> messages = new ArrayList<String>() {
    {
      add("msg 1");
      add("msg 2");
      add("msg 3");
    }
  };
}

And here is the version with getters and setters instead of annotations:

import java.util.ArrayList;
import java.util.List;
 
public class DataClass {
 
  private int id = 1;
 
  private String name = "Test Class";
 
  private List<String> messages = new ArrayList<String>() {
    {
      add("msg 1");
      add("msg 2");
      add("msg 3");
    }
  };
 
  public String toString() {
    return "User [id=" + id + ", name=" + name + ", " + "messages=" + messages + "]";
  }
  public int getId() {
    return id;
  }
  public void setId(int id) {
    this.id = id;
  }
  public String getName() {
    return name;
  }
  public void setName(String name) {
    this.name = name;
  }
  public List<String> getMessages() {
    return messages;
  }
  public void setMessages(List<String> messages) {
    this.messages = messages;
  }
}

We are now free to modify the doGet and doPost methods in our original servlet to serialize and deserialize the DataClass to and from JSON. Here’s the code for that:

...
  protected void doGet(HttpServletRequest request, HttpServletResponse response)
      throws ServletException, IOException {
    PrintWriter out = response.getWriter();
 
    DataClass mydata = new DataClass();
    ObjectMapper mapper = new ObjectMapper();
 
    try {
      // serialize the object and write the JSON to the response
      out.println(mapper.writeValueAsString(mydata));
    } catch (JsonGenerationException e) {
      e.printStackTrace();
    } catch (JsonMappingException e) {
      e.printStackTrace();
    } catch (IOException e) {
      e.printStackTrace();
    }
    out.close();
  }
 
  protected void doPost(HttpServletRequest request, HttpServletResponse response)
      throws ServletException, IOException {
    PrintWriter out = response.getWriter();
 
    ObjectMapper mapper = new ObjectMapper();
 
    try {
      // read JSON from the request body and convert it to a DataClass
      DataClass user = mapper.readValue(request.getReader(), DataClass.class);
      // write the object's string representation to the response
      out.println(user);
    } catch (JsonGenerationException e) {
      e.printStackTrace();
    } catch (JsonMappingException e) {
      e.printStackTrace();
    } catch (IOException e) {
      e.printStackTrace();
    }
    out.close();
  }
...

Now you can easily serialize data to and from JSON using Jackson and POJOs without the need for a mapping file. There are even convenient annotations available that allow you to accommodate differences between the JSON and POJO properties.
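
For reference, a GET against this servlet should return JSON similar to the following, given the default field values in DataClass above.

{"id":1,"name":"Test Class","messages":["msg 1","msg 2","msg 3"]}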

Hands on MongoDB introduction and installation

MongoDB is a database. However, unlike conventional relational databases, which are based on well-defined schemas and use SQL as the primary interface to manage the data, MongoDB uses document-based storage.

The storage uses a format known as BSON, a binary form of JSON. This makes the stored documents very flexible and lightweight. It also makes it easy to adjust what any one document contains without any significant impact on the other documents in a collection (a collection in MongoDB is like a table in a relational database).
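
A small sketch with pymongo (the names here are illustrative) shows that flexibility: two documents in the same collection don’t need to share the same fields.

from pymongo import MongoClient

# connect to a local MongoDB instance and pick a collection
client = MongoClient('localhost', 27017)
collection = client['appname']['people']

# two documents with different fields in one collection; no schema change needed
collection.insert({'name': 'Ada', 'email': 'ada@example.com'})
collection.insert({'name': 'Alan', 'languages': ['python', 'java']})

for person in collection.find():
    print(person)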

In this short video I show you how to install and begin using MongoDB. The video is HD, so be sure to watch it full screen to get all the details.

 

When you’re done with this, go have a look at how to improve MongoDB’s reliability using replica sets.