Daniel Watrous on Software Engineering

A Collection of Software Problems and Solutions

Posts tagged python

Software Engineering

JWT based authentication in Python bottle

Many applications require authentication to secure protected resources. While standards like oAuth accommodate sharing resources between applications, there is more variance in how applications secure themselves in the first place. A recent standard, JWT, provides a mechanism for creating tokens with embedded data, signing these tokens and even encrypting them when warranted.

This post explores how individual resource functions can be protected using JWT. The solution involves first creating a function decorator to perform the authentication step. Each protected resource call is then decorated with the authentication function and subsequent authorization can be performed against the data in the JWT. Let’s first look at the decorator.

jwtsecret = config.authentication.get('jwtsecret')
 
class AuthorizationError(Exception):
    """ A base class for exceptions used by bottle. """
    pass
 
def jwt_token_from_header():
    auth = bottle.request.headers.get('Authorization', None)
    if not auth:
        raise AuthorizationError({'code': 'authorization_header_missing', 'description': 'Authorization header is expected'})
 
    parts = auth.split()
 
    if parts[0].lower() != 'bearer':
        raise AuthorizationError({'code': 'invalid_header', 'description': 'Authorization header must start with Bearer'})
    elif len(parts) == 1:
        raise AuthorizationError({'code': 'invalid_header', 'description': 'Token not found'})
    elif len(parts) > 2:
        raise AuthorizationError({'code': 'invalid_header', 'description': 'Authorization header must be Bearer + \s + token'})
 
    return parts[1]
 
def requires_auth(f):
    """Provides JWT based authentication for any decorated function assuming credentials available in an "Authorization" header"""
    def decorated(*args, **kwargs):
        try:
            token = jwt_token_from_header()
        except AuthorizationError, reason:
            bottle.abort(400, reason.message)
 
        try:
            token_decoded = jwt.decode(token, jwtsecret)    # throw away value
        except jwt.ExpiredSignature:
            bottle.abort(401, {'code': 'token_expired', 'description': 'token is expired'})
        except jwt.DecodeError, message:
            bottle.abort(401, {'code': 'token_invalid', 'description': message.message})
 
        return f(*args, **kwargs)
 
    return decorated

In the above code the requires_auth(f) function makes use of a helper function to verify that there is an Authorization header and that it appears to contain the expected token. A custom exception is used to indicate a failure to identify a token in the header.

The requires_auth function then uses the python JWT library to decode the token using the secret value jwtsecret. The secret is obtained from a config object. Assuming the JWT decodes and is not expired, the decorated function will then be called.

Authenticate

The following function can be used to generate a new JWT.

jwtexpireoffset = config.authentication.get('jwtexpireoffset')
jwtalgorithm = config.authentication.get('jwtalgorithm')
 
def build_profile(credentials):
    return {'user': credentials['user'],
            'role1': credentials['role1'],
            'role2': credentials['role2'],
            'exp': time.time()+jwtexpireoffset}
 
@bottle.post('/authenticate')
def authenticate():
    # extract credentials from the request
    credentials = bottle.request.json
    if not credentials or 'user' not in credentials or 'password' not in credentials:
        bottle.abort(400, 'Missing or bad credentials')
 
    # authenticate against some identity source, such as LDAP or a database
    try:
        # query database for username and confirm password
        # or send a query to LDAP or oAuth
        pass    # placeholder: replace with an actual credential check
    except Exception, error_message:
        logging.exception("Authentication failure")
        bottle.abort(403, 'Authentication failed for %s: %s' % (credentials['user'], error_message))
 
    credentials['role1'] = is_authorized_role1(credentials['user'])
    credentials['role2'] = is_authorized_role2(credentials['user'])
    token = jwt.encode(build_profile(credentials), jwtsecret, algorithm=jwtalgorithm)
 
    logging.info('Authentication successful for %s' % (credentials['user']))
    return {'token': token}

Notice that two additional values are stored in the global configuration, jwtalgorithm and jwtexpireoffset. These are used along with jwtsecret to encode the JWT token. The actual verification of user credentials can happen in many ways, including direct access to a datastore, LDAP, oAuth, etc. After authenticating credentials, it’s easy to authorize a user based on roles. These could be implemented as separate functions and could confirm role based access based on LDAP group membership, database records, oAuth scopes, etc. While the role level access shown above looks binary, it could easily be more granular. Since a JWT is based on JSON, the JWT payload is represented as a JSON serializable python dictionary. Finally the token is returned.
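
The role check functions referenced above (is_authorized_role1, is_authorized_role2) are not shown in this post. Here is a minimal sketch of what one might look like, with a hard-coded membership set standing in for whatever identity source (LDAP group, database table, oAuth scopes) actually grants the role.

# a sketch only; ROLE1_MEMBERS stands in for an LDAP group, database table or oAuth scope check
ROLE1_MEMBERS = set(['alice', 'bob'])
 
def is_authorized_role1(username):
    # True only if the authenticated user has been granted role1
    return username in ROLE1_MEMBERS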

Protected resources

At this point, any protected resource can be decorated, as shown below.

def get_jwt_credentials():
    # get and decode the current token
    token = jwt_token_from_header()
    credentials = jwt.decode(token, jwtsecret)
    return credentials
 
@appv1.get('/protected/resource')
@requires_auth
def get_protected_resource():
    # get user details from JWT
    authenticated_user = get_jwt_credentials()
 
    # get protected resource
    try:
        return {'resource': somedao.find_protected_resource_by_username(authenticated_user['user'])}
    except Exception, e:
        logging.exception("Resource not found")
        bottle.abort(404, 'No resource for user %s was found.' % authenticated_user['user'])

The function get_protected_resource will only be executed if requires_auth successfully validates a JWT in the header of the request. The function get_jwt_credentials will actually retrieve the JWT payload to be used in the function. While I don’t show an implementation of somedao, it is simply an encapsulation point to facilitate access to resources.
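
For illustration, here is a minimal sketch of what somedao might look like if it were backed by MongoDB. The collection and field names are hypothetical and not part of the implementation described above.

# a sketch of a DAO backed by MongoDB; collection and field names are hypothetical
from pymongo import MongoClient
 
class SomeDAO(object):
    def __init__(self, host, port, dbname):
        self.db = MongoClient(host, port)[dbname]
 
    def find_protected_resource_by_username(self, username):
        # return the first matching document, excluding the internal _id field
        return self.db.resources.find_one({'username': username}, {'_id': 0})
 
somedao = SomeDAO('localhost', 27017, 'appname')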

Since the JWT expires (optionally, but a good idea), it’s necessary to build in some way to extend the ‘session’. For this a refresh endpoint can be provided as follows.

@bottle.post('/authenticate/refresh')
@requires_auth
def refresh_token():
    """refresh the current JWT"""
    # get and decode the current token
    token = jwt_token_from_header()
    payload = jwt.decode(token, jwtsecret)
    # create a new token with a new exp time
    token = jwt.encode(build_profile(payload), jwtsecret, algorithm=jwtalgorithm)
 
    return {'token': token}

This simply repackages the same payload with a new expiration time.

Improvements

The need to explicitly refresh the JWT increases (possibly doubles) the number of requests made to an API solely for the purpose of extending session life. This is inefficient and can lead to awkward UI design. If possible, it would be convenient to refactor requires_auth to perform the JWT refresh and add the new JWT to the headers of the response for the request being processed. The UI could then grab the updated JWT produced with each response and use it for the subsequent request.

The design above will actually decode the JWT twice for any resource function that requires access to the JWT payload. If possible, it would be better to find some way to inject the JWT payload into the decorated function. Ideally this would be done in a way that functions which don’t need the JWT payload aren’t required to add it to their contract.
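
One possible refactoring that addresses both points is sketched below: the decorator decodes the token once, stashes the payload in the WSGI environ so that functions which need it can read it (and functions which don’t can ignore it), and attaches a refreshed token to the response headers. This is a sketch, not the implementation used above.

def requires_auth(f):
    """Sketch: authenticate, expose the JWT payload and refresh the token"""
    def decorated(*args, **kwargs):
        try:
            token = jwt_token_from_header()
            payload = jwt.decode(token, jwtsecret)
        except AuthorizationError, reason:
            bottle.abort(400, reason.message)
        except jwt.ExpiredSignature:
            bottle.abort(401, {'code': 'token_expired', 'description': 'token is expired'})
        except jwt.DecodeError, message:
            bottle.abort(401, {'code': 'token_invalid', 'description': message.message})
 
        # make the decoded payload available without changing function signatures
        bottle.request.environ['jwt.payload'] = payload
        # attach a refreshed token so the UI can pick it up from each response
        refreshed = jwt.encode(build_profile(payload), jwtsecret, algorithm=jwtalgorithm)
        bottle.response.set_header('Authorization', 'Bearer %s' % refreshed)
 
        return f(*args, **kwargs)
 
    return decorated

A resource function can then read bottle.request.environ['jwt.payload'] instead of decoding the token a second time, and the UI can take the refreshed token from the Authorization response header.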

The authenticate function could be modified to return the JWT as a header, rather than in the body of the response. This may decrease the chances of a JWT being cached or logged. It would also simplify the UI if the authenticate and refresh functions both return the JWT in the same manner.
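
A minimal sketch of that change at the end of authenticate, assuming the client is updated to read the Authorization response header rather than the JSON body:

    # at the end of authenticate(), instead of return {'token': token}
    bottle.response.set_header('Authorization', 'Bearer %s' % token)
    logging.info('Authentication successful for %s' % (credentials['user']))
    return {}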

Extending

This same implementation could be reproduced in any language or framework. The basic steps, illustrated in the client sketch after this list, are

  1. Perform (role based) authentication and authorization against some identity resource
  2. Generate a token (like JWT) indicating success and optionally containing information about the authenticated user
  3. Transmit the token and refreshed tokens in HTTP Authorization headers, both for authenticate and resource requests
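
To make the flow concrete, here is a minimal client sketch (Python 2, to match the other examples in this post) that performs the three steps against a hypothetical local deployment of the endpoints above. The host, port and credentials are placeholders.

# a client sketch only; host, port and credentials are placeholders
import json
import urllib2
 
base_url = 'http://localhost:8080'
 
# 1. authenticate (and authorize) to obtain a token
auth_request = urllib2.Request(base_url + '/authenticate',
                               json.dumps({'user': 'alice', 'password': 'secret'}),
                               {'Content-Type': 'application/json'})
token = json.load(urllib2.urlopen(auth_request))['token']
 
# 2-3. transmit the token in the Authorization header on each resource request
resource_request = urllib2.Request(base_url + '/protected/resource')
resource_request.add_header('Authorization', 'Bearer %s' % token)
print urllib2.urlopen(resource_request).read()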

Security and Risks

At the heart of JWT security is the secret used to sign the JWT. If this secret is too simple, or if it is leaked, it would be possible for a third party to craft a JWT with any desired payload and trick an application into delivering protected resources to an attacker. It is important to choose strong secrets and to rotate them frequently. It would also be wise to perform additional validity checks. These might include tracking how many sessions a user has, where those sessions have originated and the nature and frequency/speed of requests. These additional measures could prevent attacks in the event that a JWT secret was discovered and may indicate a need to rotate a secret.

Microservices

In microservice environments, it is appealing to authenticate once and access multiple microservice endpoints using the same JWT. Since a JWT is stateless, each microservice only needs the JWT secret in order to validate the signature. This potentially increases the attack surface for a hacker who wants the JWT secret.

Software Engineering

Configuration of python web applications

Hopefully it’s obvious that separating configuration from application code is always a good idea. One simple and effective way I’ve found to do this in my python (think bottle, flask, etc.) apps is with a simple JSON configuration file. Choosing JSON makes sense for a few reasons:

  • Easy to read (for humans)
  • Easy to consume (by your application)
  • Can be versioned alongside application code
  • Can be turned into a configuration REST service

Here’s a short example of how to do this for a simple python application that uses MongoDB. First the configuration file.

{
    "use_ssl": false,
    "authentication": {
        "enabled": false,
        "username": "myuser",
        "password": "secret"},
    "mongodb_port": 27017,
    "mongodb_host": "localhost",
    "mongodb_dbname": "appname"
}

Now in the python application you would load the simple JSON file above in a single line:

# load in configuration
configuration = json.load(open('softwarelicense.conf.json', 'r'))

Now in your connection code, you would reference the configuration object.

# establish connection pool and a handle to the database
mongo_client = MongoClient(configuration['mongodb_host'], configuration['mongodb_port'], ssl=configuration['use_ssl'])
db = mongo_client[configuration['mongodb_dbname']]
if configuration['authentication']['enabled']:
    db.authenticate(configuration['authentication']['username'], configuration['authentication']['password'])
 
# get handle to collections in MongoDB
mycollection = db.mycollection

This keeps the magic out of your application and makes it easy to change your configuration without direct changes to your application code.

Configuration as a web service

Building a web service to return configuration data is more complicated for a few reasons. One is that you want to make sure it’s secure, since most configuration data involves credentials, secrets and deployment details (e.g. paths). If a hacker managed to get your configuration data, that could significantly increase his attack surface. Another reason is that it’s another potential point of failure. If the service was unavailable for any reason, your app would still need to know how to start up and run.

If you did create a secure, highly available configuration service, the only change to your application code would be to replace the open call with a call to urllib.urlopen or something similar.
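
For example, assuming a hypothetical configuration service that serves the same JSON document over HTTPS, the change really is a one-liner:

# load configuration from a (hypothetical) configuration service instead of a local file
import json
import urllib2
 
configuration = json.load(urllib2.urlopen('https://config.example.com/softwarelicense.conf.json'))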

Update, October 31, 2016

The below implementation extends the above ideas.

{
   "common": {
      "authentication": {
         "jwtsecret": "badsecret",
         "jwtexpireoffset": 1800,
         "jwtalgorithm": "HS256",
         "authorize": {
            "role1": "query or group to confirm role1",
            "role2": "query or group to confirm role2"
         },
         "message":{
            "unauthorized":"Access to this system is granted by request. Contact PersonName to get access."
         }
      }
   }
}

The following class encapsulates the management of the configuration file.

import os
import json
 
class authentication_config:
    common = {}
 
    def __init__(self, configuration):
        self.common = configuration['common']['authentication']
 
    def get_config(self):
        return self.common
 
class configuration:
    authentication = None
 
    def __init__(self, config_directory, config_file):
        # load in configuration (directory is assumed relative to this file)
        config_full_path = os.path.join(os.path.dirname(os.path.abspath(__file__)), config_directory, config_file)
        with open(config_full_path, 'r') as myfile:
            configuration_raw = json.load(myfile)
        # prepare authentication
        self.authentication = authentication_config(configuration_raw)
 
 
def main():
    # get path to configuration
    config_directory = 'conf'
    config_file = 'myapp.conf.json'
 
    config = configuration(config_directory, config_file)
    print config.authentication.get_config()['jwtsecret']
 
 
if __name__ == '__main__':
    main()
Software Engineering

Explore CloudFoundry using Stackato and VirtualBox

Stackato, which is released by ActiveState, extends out-of-the-box CloudFoundry. It adds a web interface and a command line client (‘stackato’), although the existing ‘cf’ command line client still works (as long as versions match up). Stackato includes some autoscale features and a very well done set of documentation.

ActiveState publishes various VM images that can be used to quickly spin up a development environment. These include images for VMWare, KVM and VirtualBox, among others. In this post I’ll walk through getting a Stackato environment running on Windows using VirtualBox.

Install and Configure Stackato

The obvious first step is to download and install VirtualBox. The steps shown below should work on any system that runs VirtualBox.

I follow these steps to install and configure the Stackato VM.
http://docs.stackato.com/admin/setup/microcloud.html

After downloading the VirtualBox image for Stackato, I click “File->Import Appliance” in VirtualBox and navigate to the unzipped contents of the VM download. I select the OVF file and click Open.

import-stackato-vm-virtualbox

This process can take several minutes depending on your system.

import-stackato-vm-virtualbox-progress

After clicking next, you can configure various settings. Click the checkbox to Reinitialize the MAC address on all network cards.

import-stackato-vm-virtualbox-reinitialize-mac

Once the import completes, right click on the VM in VirtualBox and choose settings.

stackato-vm-virtualbox-settings

Navigate to the network settings and make sure it’s set to Bridged. The default is NAT. Bridged will allow the VM to obtain its own IP address on the network. This will facilitate access later on.

stackato-vm-virtualbox-settings-network

Depending on your system resources, you may also want to go into the System settings and increase the Base Memory and number of Processors. You can also ensure that all virtualization accelerators are enabled.

Launch the Stackato VM

After you click OK, the VM is ready to launch. When you launch the VM, there is an initial message asking if you want to boot into recovery mode. Choose regular boot and you will then see a message about the initial setup.

launch-stackato-virtualbox-initial-setup

Eventually the screen will show system status. You’ll notice that it bounces around until it finally settles on a green READY status.

launch-stackato-virtualbox-ready

At this point you have a running instance of Stackato.

Configure Stackato

You can see in the above screenshot that Stackato displays a URL for the management console and the IP address of the system. In order to complete the configuration of the system (and before you can SSH into the server), you need to access the web console using that URL. This may require that you edit your local hosts file (/etc/hosts on Linux or C:\Windows\System32\drivers\etc\hosts on Windows). The entries in your hosts file should look something like this:

16.85.146.131	stackato-7kgb.local
16.85.146.131	api.stackato-7kgb.local

You can now access the console using https://api.stackato-7kgb.local. Don’t forget to put “https://” in front so the browser knows to request that URL rather than search for it. The server is also using a self-signed certificate, so you can expect your browser to complain. It’s OK to tell your browser to load the page despite the self-signed certificate.

On the page that loads, you need to provide a username, email address, and some other details to configure this Stackato installation. Provide the requested details and click to set up the initial admin user.

stackato-cloudfoundry-configuration

A couple of things just happened. First off, a user was created with the username you provided. The password you chose will also become the password for the system user ‘stackato’. This is important because it allows you to SSH into your instance.

Wildcard DNS

Wildcard DNS, using a service like xip.io, will make it easier to access Stackato and any applications that you deploy there. First we log in to the VM over SSH and use the kato node rename command to enable wildcard DNS.

kato node rename 16.85.146.131.xip.io

Stopping and starting related roles takes a few minutes after running the command above. Once the server returns to READY state, the console is available using the xip.io address, https://api.16.85.146.131.xip.io. This also applies to any applications that are deployed.

The entries in the local hosts file are no longer needed and can be removed.

Proxy Configuration

Many enterprise environments route internet access through a proxy. If this is the case for you, it’s possible to identify the upstream proxy for all Stackato related services. Run the following commands on the Stackato VM to enable proxy access.

kato op upstream_proxy set proxy.domain.com:8080
sudo /etc/init.d/polipo restart

It may also be necessary to set the http_proxy and https_proxy environment variables by way of the .bashrc for the stackato system user.

At this point you should be able to easily deploy a new app from the app store using the Stackato console. Let’s turn our attention now to using a client to deploy a simple app.

Use the stackato CLI to Deploy an App

The same virtual host that is running Stackato also includes the command line client. That means it can be used to deploy a simple application and verify that Stackato is working properly. To do this, first connect to the VM using SSH. Once connected, the following steps will prepare the command line client to deploy a simple application.

  1. set the target
  2. login/authenticate
  3. push the app

To set the target and login, we use these commands

stackato target api.16.85.146.131.xip.io
stackato login watrous

The output of the commands can be seen in the image below:

stackato-cli-target-login

Example Python App

There is a simple python bottle application we can use to confirm our Stackato deployment. To deploy we first clone, then use the stackato client to push, as shown below.

stackato-cli-deploy-push-python

Here are those commands in plain text:

git clone https://github.com/Stackato-Apps/bottle-py3.git
cd bottle-py3/
stackato push -n

Using the URL provided in the output from ‘stackato push’ we can view the new app.

stackato-python-bottle-app

You can now scale up instances and manage other aspects of the app using the web console or the stackato client.

Software Engineering

Detecting Credit Card Fraud – Frequency Algorithm

About 13 years ago I created my first integration with Authorize.net for a client who wanted to accept credit card payments directly on his website. The internet has changed a lot since then and the frequency of fraud attempts has increased.

One credit card fraud signature I identified while reviewing my server logs for one of my e-commerce websites was consistent. I refer to this as a shotgun attack, since the hacker sends through hundreds of credit card attempts. Here’s how it works and what to look for.

  1. All requests from a single throw away IP address
  2. Each request uses a different card
  3. Throwaway details are often used, including a generic email with some numbers in it

Frequency Algorithm

On the other hand, the overwhelming majority of other transactions were performed using a single card. Even if there were multiple attempts, they generally used one or two cards, but not more. I realized I could use an algorithm that works as follows for each transaction attempt.

  1. Create a hash based on the last four digits of the card and the expiration date. This could use MD5, SHA or any other hash algorithm.
  2. Create a counter for the IP address that submitted that combination of values (as represented by the hash) and initialize it to one.
  3. For each subsequent transaction attempt, repeat step 1. If the hash matches the one stored in step 2, don’t increment the counter. If it doesn’t match, store the new hash and increment the counter.

This process is repeated for every transaction attempt. Notice that a customer is free to continue submitting different addresses or CCV values for a single card without incrementing the counter. If the counter reaches a threshold, all transactions submitted from that IP address can be dropped. In my implementation I provided for an hour retention of data on a given IP address. The hour retention is reset every time a transaction is attempted from the IP address, which could keep it blocked indefinitely.

This credit card fraud prevention algorithm was implemented as a RESTful service using python bottle and memcached and provides less than 100ms response times under heavy load and high concurrency.
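
To illustrate the approach, here is a minimal sketch of the frequency check as a bottle endpoint. It is illustrative only: a plain dict stands in for memcached, and the threshold and retention values are placeholders rather than the values used in my implementation.

# a sketch only; a dict stands in for memcached and the values are placeholders
import hashlib
import time
import bottle
 
THRESHOLD = 3       # distinct cards from one IP before blocking
RETENTION = 3600    # seconds to retain data on a given IP address
tracker = {}        # ip -> {'hash': ..., 'count': ..., 'expires': ...}
 
@bottle.post('/check')
def check():
    data = bottle.request.json
    card_hash = hashlib.md5(str(data['last_four']) + str(data['expiration'])).hexdigest()
    record = tracker.get(data['ip'])
    if record is None or record['expires'] < time.time():
        # first attempt from this IP, or the retention window expired
        record = {'hash': card_hash, 'count': 1}
    elif record['hash'] != card_hash:
        # a different card from the same IP address; increment the counter
        record['hash'] = card_hash
        record['count'] += 1
    # every attempt resets the retention window
    record['expires'] = time.time() + RETENTION
    tracker[data['ip']] = record
    return {'blocked': record['count'] >= THRESHOLD}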

Software Engineering

Hadoop Scripts in Python

I read that Hadoop supports scripts written in various languages other than Java, such as Python. Since I’m a fan of python, I wanted to prove this out.

It was my good fortune to find an excellent post by Michael Noll that walked me through the entire process of scripting in Python for Hadoop. It’s an excellent post and worked as written for me in Hadoop 2.2.0.

How hadoop processes scripts from other languages (stdin/stdout)

To accommodate scripts written in other languages, Hadoop Streaming communicates over standard in (stdin) and standard out (stdout). As Hadoop reads lines from the input, it writes them to the mapper’s stdin; the mapper processes each line and writes its results to stdout. Hadoop sorts the mapper output and feeds it to the reducer’s stdin, and the reducer writes to stdout, which Hadoop then consumes to write the final results. Hopefully this is easy to see in the scripts below.

It should be possible to write similar scripts in any language that can read stdin and write to stdout. There may be some performance implications when choosing another language for processing, but from a developer productivity perspective, a language like python can allow for more rapid development. If you have a process that is CPU bound and would benefit from multiple processors, it may be worth the extra effort to implement it in Java.

Revisit Tomcat Logs in Python

I’ll try to add value by revisiting my original example of analyzing Apache Tomcat (Java) logs in Hadoop using python this time.

mapper.py

#!/usr/bin/env python
 
import sys
import re
 
log_pattern = r"""([0-9]{4}-[0-9]{2}-[0-9]{2}\s[0-9]{2}:[0-9]{2}:[0-9]{2}.[0-9]{3})\s*([a-zA-Z]+)\s*([a-zA-Z0-9.]+)\s*-\s*(.+)$"""
log_pattern_re = re.compile(log_pattern)
 
# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into segments
    try:
        segments = log_pattern_re.search(line).groups()
    except AttributeError:
        # the line did not match the log pattern
        segments = None
 
    if segments is not None and segments[1] == "ERROR":
        # tab delimited, keyed by the log timestamp (segments[0])
        print '%s\t%s' % (segments[0], 1)
    else:
        pass

reducer.py

#!/usr/bin/env python
 
from operator import itemgetter
import sys
 
current_word = None
current_count = 0
key = None
 
# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
 
    # parse the input we got from mapper.py
    key, value = line.split('\t', 1)
 
    # convert value (currently a string) to int
    try:
        value = int(value)
    except ValueError:
        # value was not a number, so silently
        # ignore/discard this line
        continue
 
    # this IF-switch only works because Hadoop sorts map output
    # by key (here: key) before it is passed to the reducer
    if current_word == key:
        current_count += value
    else:
        if current_word:
            # write result to STDOUT
            print '%s\t%s' % (current_word, current_count)
        current_count = value
        current_word = key
 
# do not forget to output the last key if needed!
if current_word == key:
    print '%s\t%s' % (current_word, current_count)
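
Before submitting the job to Hadoop, the mapper and reducer can be exercised locally by reproducing the stream-and-sort pipeline. Here is a minimal sketch; sample.log is a hypothetical file of Tomcat log lines and the sort utility stands in for Hadoop’s shuffle phase.

#!/usr/bin/env python
# locally simulate map -> sort -> reduce; sample.log is a hypothetical log file
import subprocess
 
with open('sample.log') as logfile:
    mapper = subprocess.Popen(['python', 'mapper.py'], stdin=logfile, stdout=subprocess.PIPE)
    sorter = subprocess.Popen(['sort'], stdin=mapper.stdout, stdout=subprocess.PIPE)
    reducer = subprocess.Popen(['python', 'reducer.py'], stdin=sorter.stdout, stdout=subprocess.PIPE)
    print reducer.communicate()[0]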

Running a streaming hadoop job using python

[watrous@myhost ~]$ hadoop jar ./hadoop-2.2.0/share/hadoop/tools/lib/hadoop-streaming-2.2.0.jar -file /home/watrous/mapper.py -mapper /home/watrous/mapper.py -file /home/watrous/reducer.py -reducer /home/watrous/reducer.py -input /user/watrous/log_myhost.houston.hp.com/* -output /user/watrous/python_output_myhost.houston.hp.com
packageJobJar: [/home/watrous/mapper.py, /home/watrous/reducer.py] [] /tmp/streamjob6730226310886003770.jar tmpDir=null
13/11/18 19:41:09 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
...
13/11/18 19:41:10 INFO mapred.LocalDistributedCacheManager: Localized file:/home/watrous/mapper.py as file:/tmp/hadoop-watrous/mapred/local/1384803670534/mapper.py
13/11/18 19:41:10 INFO mapred.LocalDistributedCacheManager: Localized file:/home/watrous/reducer.py as file:/tmp/hadoop-watrous/mapred/local/1384803670535/reducer.py
...
13/11/18 19:41:11 INFO streaming.PipeMapRed: PipeMapRed exec [/home/watrous/./mapper.py]
...
13/11/18 19:43:35 INFO streaming.PipeMapRed: PipeMapRed exec [/home/watrous/./reducer.py]
...
13/11/18 19:43:36 INFO mapreduce.Job:  map 100% reduce 100%
13/11/18 19:43:36 INFO mapreduce.Job: Job job_local848014294_0001 completed successfully
13/11/18 19:43:36 INFO mapreduce.Job: Counters: 32
        File System Counters
                FILE: Number of bytes read=291759
                FILE: Number of bytes written=3394292
...

After successfully running the map reduce job, confirm the results are present and view the results.

[watrous@myhost ~]$ hdfs dfs -ls -R /user/watrous/python_output_myhost.houston.hp.com/
13/11/18 19:45:14 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
-rw-r--r--   3 watrous supergroup          0 2013-11-18 19:43 /user/watrous/python_output_myhost.houston.hp.com/_SUCCESS
-rw-r--r--   3 watrous supergroup      10813 2013-11-18 19:43 /user/watrous/python_output_myhost.houston.hp.com/part-00000
[watrous@myhost ~]$ hdfs dfs -cat /user/watrous/python_output_myhost.houston.hp.com/part-00000 | more
13/11/18 19:45:39 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2013-10-25 06:07:37,108 1
2013-10-25 06:07:37,366 1
2013-10-27 06:08:35,770 1
2013-10-27 06:08:35,777 1
2013-10-27 06:08:35,783 1
2013-10-27 06:09:33,748 11
2013-10-27 06:09:33,749 25
2013-10-27 06:09:33,750 26
2013-10-27 06:09:33,751 14
2013-10-27 07:07:58,383 5
Software Engineering

Lightweight Replication Monitoring with MongoDB

One of my applications runs on a large assortment of hosts split between various data centers. Some of these are redundant pairs and others are in load balanced clusters. They all require a set of identical files which represent static content and other data.

rsync was chosen to facilitate replication of data from a source to many targets. What rsync lacked out of the box was a reporting mechanism to verify that the collection of files across target systems was consistent with the source.

Existing solutions

Before designing my solution, I searched for an existing solution to the same problem. I found backup monitor, but the last release was three years ago and there have only ever been 25 downloads, so it was less than compelling. It was also a heavier solution than I was interested in.

In this case it seems that developing a new solution is appropriate.

Monitoring requirements

The goal was to have each target system run a lightweight process at scheduled intervals and send a report to an aggregator service. A report could then be generated based on the aggregated data.

My solution includes a few components. One component analyzes the current state of files on disk and writes that state to a state file. Another component needs to read that file and transmit it to the aggregator. The aggregator needs to store the state identified by the host to which it corresponds. Finally there needs to be a reporting mechanism to display the data for all hosts.

Due to the distributed nature of the replication targets, the solution should be centralized so that changes in reporting structure are picked up by target hosts quickly and with minimal effort.

Current disk state analysis

This component potentially analyzes many hundreds of thousands of files. That means the solution for this component must run very fast and be reliable. The speed requirements for this component eliminated some of the scripting languages that might otherwise be appealing (e.g. Perl, Python, etc.).

Instead I chose to write this as a bash script and make use of existing system utilities. The utilities I use include du, find and wc. Here’s what I came up with:

# Generate a report showing the sizes 
# and file counts of replicated folders
 
# create a path reference to the report file
BASEDIR="$( cd "$( dirname "$0" )" && pwd )"
reportfile="$BASEDIR/spacereport"
 
# create/overwrite the report file; write the date
date '+%Y-%m-%d %H:%M:%S' > $reportfile
 
# append details to report file
du -sh /path/to/replicated/files/* | while read size dir;
do
    echo -n "$size ";
    # augment du output with count of files in the directory
    echo -n `find "$dir" -type f|wc -l`;
    echo " $dir ";
done >> $reportfile

These commands run very fast and produce an output that looks like this:

2012-08-06 21:45:10
4.5M 101 /path/to/replicated/files/style
24M 2002 /path/to/replicated/files/html
6.7G 477505 /path/to/replicated/files/images
761M 1 /path/to/replicated/files/transfer.tgz
30G 216 /path/to/replicated/files/data

Notice that the file output is space and newline delimited. It’s not great for human readability, but you’ll see in a minute that with regular expressions it’s super easy to build a report to send to the aggregator.

Read state and transmit to aggregator

Now that we have a report cached describing our current disk state, we need to format that properly and send it to the aggregator. To do this, Python seemed a good fit.

But first, I needed to be able to quickly and reliably extract information from this plain text report. Regular expressions seemed like a great fit for this, so I used my favorite regex tool, Kodos. The two expressions I need are to extract the date and then each line of report data.

You can see what I came up with in the Python script below.

#-------------------------------------------------------------------------------
# Name:        spacereport
# Purpose:     run spacereport.sh and report the results to a central service
#-------------------------------------------------------------------------------
#!/usr/bin/env python
 
import os
import re
import urllib, urllib2
from datetime import datetime
from socket import gethostname
 
def main():
 
    # where to send the report
    url = 'http://example.com/spacereport.php'
 
    # define regular expression(s)
    regexp_size_directory = re.compile(r"""([0-9.KGM]*)\s*([0-9]*)\s*[a-zA-Z/]*/(.+)""",  re.MULTILINE)
    regexp_report_time = re.compile(r"""^([0-9]{4}-[0-9]{2}-[0-9]{2}\s+[0-9]{2}:[0-9]{2}:[0-9]{2})\n""")
 
    # run the spacereport.sh script to generate plain text report
    base_dir = os.path.dirname(os.path.realpath(__file__))
    os.system(os.path.join(base_dir, 'spacereport.sh'))
 
    # parse space data from file
    spacedata = open(os.path.join(base_dir, 'spacereport')).read()
    space_report_time = regexp_report_time.search(spacedata).group(1)
    space_data_directories = regexp_size_directory.findall(spacedata)
 
    # create space data transmission
    report_time = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    hostname = gethostname()
    space_data = {'host': hostname, 'report_time': space_report_time, 'report_time_sent': report_time, 'space_data_by_directory': []}
    for space_data_directory in space_data_directories:
        space_data['space_data_by_directory'].append({'size_on_disk': space_data_directory[0], 'files_on_disk': space_data_directory[1], 'directory': space_data_directory[2]})
 
    # prepare the report
    # it might be better to use the json library for this :)
    report = {'report': str(space_data).replace("'", "\"")}
 
    # encode and send the report
    data = urllib.urlencode(report)
    req = urllib2.Request(url, data)
    response = urllib2.urlopen(req)
 
    # You can optionally output the response to verify that it worked
    the_page = response.read()
    print(the_page)
 
if __name__ == '__main__':
    main()

The variable report contains a value similar to what’s shown below:

{'report': '{"report_time": "2012-08-06 21:45:10", "host": "WATROUS1", "space_data_by_directory": [{"files_on_disk": "101", "directory": "style", "size_on_disk": "4.5M"}, {"files_on_disk": "2002", "directory": "html", "size_on_disk": "24M"}, {"files_on_disk": "477505", "directory": "images", "size_on_disk": "6.7G"}, {"files_on_disk": "1", "directory": "transfer.tgz", "size_on_disk": "761M"}, {"files_on_disk": "216", "directory": "data", "size_on_disk": "30G"}], "report_time_sent": "2012-08-06 16:20:53"}'}

The difference between report_time and report_time_sent, if there is a difference, is important. It can alert you to an error on the system preventing a new, valid report from being created by your shell script. It can also signal load issues if the gap is too wide.

Capture and store state data

Now we need to create spacereport.php, the aggregation script. Its sole job is to receive the report and store it in MongoDB. This is made easy using PHP’s built in json_decode and MongoDB support. After the call to json_decode, the dates still need to be converted to MongoDates.

<?php
if ($_SERVER['REQUEST_METHOD'] == "POST") {
  // decode JSON report and convert dates to MongoDate
  $spacereport = json_decode($_POST['report'], true);
  $spacereport['report_time_sent'] = new MongoDate(strtotime($spacereport['report_time_sent']));
  $spacereport['report_time'] = new MongoDate(strtotime($spacereport['report_time']));
 
  // connect to MongoDB
  $mongoConnection = new Mongo("m1.example.com,m2.example.com", array("replicaSet" => "replicasetname"));
 
  // select a database
  $db = $mongoConnection->reports;
 
  // select a collection
  $collection = $db->spacereport;
 
  // add a record
  $collection->insert($spacereport);
 
  print_r($spacereport);
} else {
  // this should probably set the STATUS to 405 Method Not Allowed
  echo 'not POST';
}
?>

At this point, the data is now available in MongoDB. It’s possible to use Mongo’s query mechanism to query the data.

PRIMARY> db.spacereport.find({host: 't1.example.com'}).limit(1).pretty()
{
        "_id" : ObjectId("501fb1d47540c4df76000073"),
        "report_time" : ISODate("2012-08-06T12:00:11Z"),
        "host" : "t1.example.com",
        "space_data_by_directory" : [
                {
                        "files_on_disk" : "101",
                        "directory" : "style ",
                        "size_on_disk" : "4.5M"
                },
                {
                        "files_on_disk" : "2001",
                        "directory" : "html ",
                        "size_on_disk" : "24M"
                },
                {
                        "files_on_disk" : "477505",
                        "directory" : "directory ",
                        "size_on_disk" : "6.7G"
                },
                {
                        "files_on_disk" : "1",
                        "directory" : "transfer.tgz ",
                        "size_on_disk" : "761M"
                },
                {
                        "files_on_disk" : "215",
                        "directory" : "data ",
                        "size_on_disk" : "30G"
                }
        ],
        "report_time_sent" : ISODate("2012-08-06T12:00:20Z")
}

NOTE: For this system, historical data quickly diminishes in importance since what I’m interested in is the current state of replication. For that reason I made the spacereport collection capped to 1MB or 100 records.

PRIMARY> db.runCommand({"convertToCapped": "spacereport", size: 1045876, max: 100});
{ "ok" : 1 }

Display state data report

It’s not very useful to look at the data one record at a time, so we need some way of viewing the data as a whole. PHP is convenient, so we’ll use that to create a web based report.

<html>
<head>
<style>
body {
    font-family: Arial, Helvetica, Sans serif;
}
 
.directoryname {
    float: left;
    width: 350px;
}
 
.sizeondisk {
    float: left;
    text-align: right;
    width: 150px;
}
 
.numberoffiles {
    float: left;
    text-align: right;
    width: 150px;
}
</style>
</head>
 
<body>
 
<?php
 
$host_display_template = "<hr />\n<strong>%s</strong> showing report at <em>%s</em> (of %d total reports)<br />\n";
$spacedata_row_template = "<div class='directoryname'>%s</div> <div class='sizeondisk'>%s</div> <div class='numberoffiles'>%d total files</div><div style='clear: both;'></div>\n";
 
$mongoConnection = new Mongo("m1.example.com,m2.example.com", array("replicaSet" => "replicasetname"));
 
// select a database
$db = $mongoConnection->reports;
 
// select a collection
$collection = $db->spacereport;
 
// group the collection to get a unique list of all hosts reporting space data
$key = array("host" => 1);
$initial = array("reports" => 0);
$reduce = "function(obj, prev) {prev.reports++;}";
$reports_by_host = $collection->group($key, $initial, $reduce);
 
// cycle through all hosts found above
foreach ($reports_by_host['retval'] as $report_by_host) {
    // grab the reports for this host and sort to find most recent report
    $cursor = $collection->find(array("host" => $report_by_host['host']));
    $cursor->sort(array("report_time_sent" => -1))->limit(1);
    foreach ($cursor as $obj) {
        // output details about this host and the report timing
        printf ($host_display_template, $report_by_host['host'], date('M-d-Y H:i:s', $obj['report_time']->sec), $report_by_host['reports']);
        foreach ($obj["space_data_by_directory"] as $directory) {
            // output details about this directory
            printf ($spacedata_row_template, $directory["directory"], $directory["size_on_disk"], $directory["files_on_disk"]);
        }
    }
}
?>
 
</body>
</html>

Centralizing the service

At this point the entire reporting structure is in place, but it requires manual installation or updates on each host where it runs. Even if you only have a handful of hosts, it can quickly become a pain to update them by hand each time you want to change the structure.

To get around this, host the two scripts responsible for creating and sending the report in some location that’s accessible to the target host. Then run the report from a third script that grabs the latest copies of those scripts and runs the report.

# download the latest spacereport scripts 
# and run to update central aggregation point
 
BASEDIR=$(dirname $0)
# use wget to grab the latest scripts
wget -q http://c.example.com/spacereport.py -O $BASEDIR/spacereport.py
wget -q http://c.example.com/spacereport.sh -O $BASEDIR/spacereport.sh
# make sure that spacereport.sh is executable
chmod +x $BASEDIR/spacereport.sh
# create and send a spacereport
python $BASEDIR/spacereport.py

Since MongoDB is schemaless, the structure of the reports can be changed at will. Provided legacy values are left in place, no changes are required at any other point in the reporting process.

Future Enhancements

Possible enhancements could include active monitoring, such as allowing an administrator to define rules that would trigger notifications based on the data being aggregated. This monitoring could be implemented as a hook in the spacereport.php aggregator script or based on a cron. Rules could include comparisons between hosts, self comparison with historical data for the same host, or comparison to baseline data external to the hosts being monitored.
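
As one example of what an active rule might look like, here is a sketch (python with pymongo, matching the other examples) that could run from cron and flag any host whose most recent report is older than a threshold. The hostname and threshold are placeholders.

# a sketch of one possible monitoring rule; hostname and threshold are placeholders
from datetime import datetime, timedelta
from pymongo import MongoClient
 
STALE_AFTER = timedelta(hours=2)
 
collection = MongoClient('m1.example.com')['reports']['spacereport']
for host in collection.distinct('host'):
    latest = collection.find_one({'host': host}, sort=[('report_time_sent', -1)])
    if datetime.utcnow() - latest['report_time_sent'] > STALE_AFTER:
        print 'ALERT: no recent report from %s' % host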

Some refactoring to generalize the shell script that produces the initial plain text report may improve efficiency and/or flexibility, though it’s difficult to imagine that writing a custom program to replace existing system utilities would be worthwhile.

The ubiquity of support for MongoDB and JSON would make it easy to reimplement any of the above components in other languages if there’s a better fit for a certain project.

Software Engineering

Revisit Webfaction Hosting

Several years ago I hosted with webfaction for about a year. I was drawn to them at the time because they allowed SSH access and I could run Java applications on my account. Those were not common features under shared hosting at the time. I didn’t end up deploying any Java applications, and the PHP sites I did deploy routed through webfaction’s nginx-to-PHP configuration, which frequently failed. That meant that many visitors to my site saw nginx errors rather than my web page. When they couldn’t resolve the issue I moved my hosting to hostgator and have been extremely happy ever since.

I recently decided to explore Python Flask with MongoDB and was looking at hosting options. Google App Engine is a little too restrictive and I’ve had a mixed experience while working on the Software Licensing system that I’ve developed for that platform. I considered several VPS options including Amazon EC2 and Linode. As I was looking around I thought about webfaction again.

Ready to run Flask and MongoDB

I did a little searching and found that I could deploy Flask applications easily on webfaction. I could also deploy a MongoDB instance on webfaction.

I decided to give them another try and paid my first month’s fee. With about an hour of setup I had successfully deployed a Python Flask “Hello World!” application and had a running MongoDB instance. It was surprisingly easy, especially considering that with any VPS solution I would have needed to setup and worry about a lot more than just my Flask application. It saved me time and money.

Caveat Emptor

What I don’t know is whether they have addressed the nginx errors (I recall that they were 502 Bad Gateway errors). Apparently those are caused by the nginx server not getting a suitable reply from the application. If I find myself fighting with that again this time around I may end up on a VPS anyway, but for development, it’s hard to beat their pricing and flexibility.

I really hope it works out that I can run in production on webfaction. I’ll keep you posted.