Daniel Watrous on Software Engineering

A Collection of Software Problems and Solutions

Posts tagged datastore

Software Engineering

External Services in CloudFoundry

CloudFoundry, Stackato and Helion Development Platform accommodate (and encourage) external services for persistent application needs. The types of services include relational databases, like MySQL or PostgreSQL, NoSQL datastores, like MongoDB, messaging services like RabbitMQ and even cache technologies like Redis and Memcached. In each case, connection details, such as a URL, PORT and credentials, are maintained by the cloud controller and injected into the environment of new application instances.

cloudfoundry-service-injection

Injection

It’s important to understand that regardless of how the cloud controller receives details about the service, the process of getting those details to application instances is the same. Much like applications use dependency injection, the cloud controller injects environment variables into each application instance. The application is written to use these environment variables to establish a connection to the external resource. From an application perspective, this looks the same whether a warden or docker container is used.

Connecting to the Service

Connecting to the service from the application instance is the responsibility of the application. There is nothing in CloudFoundry or its derivatives that facilitates this connection beyond injecting the connection parameters into the application instance container. The fact that there is no intermediary between the application instance and the service means that there is no additional latency or potential disconnect. However, the fact that CloudFoundry can scale to an indefinite number of application instances does mean the external service must be able to accommodate all the connections that will result.

Connection pooling is a popular method to reduce the overhead of creating new connections. Since CloudFoundry scales out to many instances, it may be less valuable to manage connection pooling in the application. This may increase memory usage on the application instance while consuming available connections that should be distributed among all instances.

Managed vs. Unmanaged

The Service Broker API may be implemented to facilitate provisioning, binding, unbinding and deprovisioning of resources. This is referred to as a managed service, since the life-cycle of the resource is managed by the PaaS. In the case of managed services, the user interacts with the service only by way of the CloudFoundry command line client.

In an unmanaged scenario, the service resource is provisioned outside of the PaaS. The user then provides connection details to the PaaS in one of two ways.

  • The first is to register it as a service that can then be bound to application instances.
  • The second is to add connection details manually as individual environment variable key/name pairs.

The three methods of incorporating services discussed in this post range from high to low touch and make it possible to incorporate any type of service, even existing services.

Use caution when designing services to prevent them from getting overwhelmed. The scalable character of CloudFoundry means that the number of instances making connections to a service can grow very quickly to an indeterminate number.

Software Engineering

MongoDB Secure Mode

Security in MongoDB is relatively young in terms of features and granularity. Interestingly, they indicate that a typical use case would be to use Mongo on a trusted network “much like how one would use, say, memcached.

MongoDB does NOT run in secure mode by default.

As it is, the features that are available are standard, proven and probably sufficient for most use cases. Here’s a quick summary of pros and cons.

  • Pros
    • Nonce-based digest for authentication
    • Security applies across replica set nodes and shard members
  • Cons
    • Few recent replies on security wiki page
    • Course grained access control

User access levels

Course grained access control allows for users to be defined per database and given either read only or read/write access. Since there is no rigid schema in MongoDB, it’s not possible to limit access to a subset of collections or documents.

Limit to expected IPs

Along the lines of the ‘trusted network’ mentioned above, it’s recommended to configure each mongo instance to accept connections from specific ports. For example, you could limit access to the loopback address, or to an IP for a local private network.

Disable http interface

By default, a useful HTTP based interface provides information about the mongodb instance on a machine and links to similar interfaces on related machines in the replica set. This can be disabled by providing –nohttpinterface when starting mongod.

SSL ready

In cases where SSL security is required, Mongo can be compiled to include support for it. The standard downloads do not include this feature. A standard SSL key can be produced in the usual way, using openssl for example.

Software Engineering

Redis as a cache for Java servlets

I’ve been refactoring an application recently to move away from a proprietary and inflexible in memory datastore. The drawbacks of the proprietary datastore included the fact that the content was static. The only way to update data involved a build and replication process that took much longer than the stakeholders were willing to wait. The main selling point in favor of the in memory datastore was that it is blazing fast. And I mean blazing fast.

My choice for a replacement datastore technology is MongoDB. MongoDB worked great, but the profiling and performance comparison naturally showed that the in memory solution out performed the MongoDB solution. Communication with MongoDB for every request was obviously much slower than the previous in memory datastore solution, and the response time was less consistent from one request to another.

Caching for performance

When the data being used to generate a response changes infrequently, it’s generally bad design to serve dynamic content on every page load. Enter caching. There are a host of caching approaches covering everything from reverse proxies, like varnish, to platform specific solutions, like EHCache.

As a first stab, I chose a golden oldie, memcached, and an up and coming alternative, redis. There’s some lively discussion online about the performance differences between these two technologies. Ultimately I chose Redis due to the active development on the platform and the feature set.

Basic cache

In Java there are a handful of available Redis drivers. I started with the Jedis client. In order to use Jedis, I added this to my pom.xml.

<dependency>
    <groupId>redis.clients</groupId>
    <artifactId>jedis</artifactId>
    <version>2.0.0</version>
    <type>jar</type>
    <scope>compile</scope>
</dependency>

I then modified my basic Servlet to init a JedisPool and use jedis to cache the values I was retrieving from MongoDB. Here’s what my class ended up looking like.

package com.danielwatrous.cachetest;
 
import com.google.inject.Guice;
import com.google.inject.Injector;
import com.danielwatrous.linker.domain.WebLink;
import com.danielwatrous.linker.modules.MongoLinkerModule;
import java.io.IOException;
import java.io.PrintWriter;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPool;
import redis.clients.jedis.JedisPoolConfig;
 
public class BuildCnavLink extends HttpServlet {
 
    private static Injector hbinjector = null;
    private static JedisPool pool = null;
 
    @Override
    public void init() {
        hbinjector = Guice.createInjector(new MongoLinkerModule());
        pool = new JedisPool(new JedisPoolConfig(), "localhost", 6379);
    }
 
    @Override
    public void destroy() {
        pool.destroy();
    }
 
    protected void processRequest(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        response.setContentType("text/xml;charset=UTF-8");
        PrintWriter out = response.getWriter();
        String value = "";
        Jedis jedis = null;
 
        try {
            jedis = pool.getResource();
            String cacheKey = getCacheKey (request.getParameter("country"), request.getParameter("lang"), request.getParameter("company"));
            value = jedis.get(cacheKey);
            if (value == null) {
                WebLink webLink = hbinjector.getInstance(WebLink.class);
                webLink.setLanguage(request.getParameter("lang"));
                webLink.setCountry(request.getParameter("country"));
                webLink.setCompany(request.getParameter("company"));
                value = webLink.buildWebLink();
                jedis.set(cacheKey, value);
            }
        } finally {
            pool.returnResource(jedis);            
        }
 
        try {
            out.println("<link>");
            out.println("<url>" + value + "</url>");
            out.println("</link>");
        } finally {            
            out.close();
        }
    }
 
    @Override
    protected void doGet(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        processRequest(request, response);
    }
 
    protected String getCacheKey (String country, String lang, String company) {
        String cacheKey = country + lang + company;
        return cacheKey;
    }
}

Observations

It’s assumed that a combination of country, lang and company will produce a unique value when buildWebLink is called. That must be the case if you’re using those to generate a cache key.

There’s also nothing built in above to invalidate the cache. In order to validate the cache it may work to build a time/age check. There may be other more sophisticated optimistic or pessimistic algorithms to manage cached content.

In the case above, I’m using redis to store a simple String value. I’m also still generating a dynamic response, but I’ve effectively moved the majority of my data calls to redis.

Conclusion

As a first stab, this performs on par with the proprietary in memory solution that we’re replacing and the consistency from one request to another is very tight. Here I’m connecting to a local redis instance. If redis were on a remote box, network latency may erase these gains. Object storage or serialization may also affect performance if it’s determined that simple String caching isn’t sufficient or desirable.

Resources

http://www.ibm.com/developerworks/java/library/j-javadev2-22/index.html
https://github.com/xetorthio/jedis/wiki/Getting-started

Software Engineering

WordPress plugin licensing: Google App Engine vs. Amazon EC2

In the introduction to this series, I outlined some of the requirements for the WordPress plugin licensing platform: Speed, reliability and scalability. These are critical. Just imagine what would happen if any of those were missing.

Requirements Justification

A slow platform might result in significantly fewer sales. One of our use cases is to provide a free, limited time trial, and poor performance when installing or using a plugin would almost certainly decrease sales conversions. Reliability issues would, at a minimum, reduce developer confidence when coupling a new plugin to the licensing platform. Finally, if the speed and reliability don’t scale then the market of potential consumers is limited to smaller plugins.

Possible solutions

Fortunately the problem of speed, reliability and scalability have already been solved. I know that it’s possible to build out servers, load balance them and otherwise build out systems to achieve these three aims, but I have something much simpler in mind. The two most compelling options available today both allow a developer to leverage the infrastructure of very large companies that exist solely on the internet: Amazon and Google.

The business model of both Amazon and Google require them to build out their own internal infrastructure to accommodate peak volume. The big downside to this is that the majority of the time, some or most of that infrastructure is sitting idle. A somewhat interesting upside to the scale of their infrastructure is that they have had to develop internal processes that enable them to expand supply in step with demand. In other words, they have to be able to add additional resources on the fly in the event of a new record peak. That may not sound as impressive as it is 🙂

At some point, each of these companies realized that they could leverage their unused infrastructure to increase their revenue. They more or less sub-lease existing resources to third parties. As this product model developed they may have isolated the resources they sell from the resources that power their main websites, but there is still a great deal of play between them. The two offerings are Google App Engine (GAE) and Amazon Web Services (AWS).

App Engine vs. Amazon Web Services

.

There are many more differences between these two platforms than I have time to get into here. However, one distinction between the two is helpful. Amazon offers a wide range of services (they add new services often) that provide the developer with a great deal of flexibility. However, the burden of choosing the right platform components and interconnecting them is also on the developer.

Google on the other hand has a more narrowly defined and inclusive platform. Rather than separating content distribution, processing, messaging, etc., Google keeps it all under the same hood. This reduces complexity for the developer at the cost of some flexibility.

This distinction is rather natural when you consider the diversity of products and engagement channels employed by Amazon and compare that to the more narrow range of services and engagement channels employed by Google.

The winner?

For the licensing project that I’m developing in this series, the scope is well defined and not overly complex. Google App Engine is the most appealing due to the ease of working in a local development environment and the ability to deploy and test on the live platform under the free quota limits (no initial cost or setup). It’s important to note that choosing Google’s platform instead of Amazon’s doesn’t make Amazon the loser and it doesn’t have to mean that I need to exclusively run on Google’s platform forever.

GAE provides both Python and Java enviornments. If I choose Java and approach the design carefully (e.g. good datastore abstraction…), it may not require too much effort to deploy on an Amazon EC2 instance if that becomes more appealing down the road.

Conclusion

The WordPress plugin licensing system will target the Google App Engine platform initially. Special attention will be given to abstracting the datastore so that I can take advantage of Google’s fast and scalable datastore and leave myself flexibility to move to an alternate if I deploy on Amazon’s platform in the future. Java is a first class citizen on both platforms. This provides some assurance that mainstream, mature frameworks will run smoothly. It also typically means that there will be plenty of documentation and support to accelerate development and deployment along.