Daniel Watrous on Software Engineering

A Collection of Software Problems and Solutions

Caching in Java using Aspect Oriented Programming (AOP)

As the scale of web applications increases, performance optimization considerations are more frequently included in initial design. One optimization technique used extensively is caching. A cache contains pre-processed data that is ready to use without redoing the processing. Processing may include extensive computation and accessing data over a network or on disk.

Keep the details out of your code

One design consideration when introducing a caching mechanism is to keep the implementation details out of your application code. Most caches are just simple key-value stores, so it’s tempting to drop them in wherever you have a performance issue. Imagine the following Java method.

    public Shop getShopsNear(Shop shop) {
        Shop returnShop;
        // query google maps api with shop data to get region
        // query database for shops in region
        // calculate center of region
        // sort shops by distance from center
        returnShop = closestShopToCenter; // placeholder for the result of the steps above
        return returnShop;
    }

Obviously that has the potential to take a long time to process. It can also be expected that the number of shops is reasonably small; even if there are several thousand, caching several thousand results in memory is very manageable.

Wrong way

It might be tempting to jump right in and introduce caching like this:

    public Shop getShopsNear(Shop shop) {
        Shop returnShop;
        // create a cache key
        String cacheKey = "getShopsNear: " + shop.toString();
        // check for cached result
        Cache cache = CacheProvider.getCache();
        returnShop = (Shop) cache.get(cacheKey); // assumes an untyped cache that returns Object
        if (returnShop == null) {
            // query google maps api with shop data to get region
            // query database for shops in region
            // calculate center of region
            // sort shops by distance from center
            returnShop = closestShopToCenter; // placeholder, as above
            // cache result
            cache.put(cacheKey, returnShop);
        }
        return returnShop;
    }

At first glance, that’s great. It follows most quick-start tutorials for caching solutions, and it will improve the performance of your code. However, there are several problems with this approach. I highlight some of them here.

  • Higher risk of key conflicts, and harder debugging, since keys are managed at the method level
  • The code is tied to a specific cache implementation
  • The method is less clear due to the clutter of caching details

AOP – a better way

One way to get around the problems above, and end up with a caching mechanism that grows with your application and provides flexibility, is to use Aspect Oriented Programming. Google Guice also refers to this as method interception. The idea is that you identify processing that should take place around a call to the actual method: here, checking the cache before the call and storing the result after it.

In this case, we want to put a cache interceptor at the method level and keep the mechanics of caching (key generation, cache provider selection, and so on) centralized. Here’s what that might look like.

    @Cached(timeToLiveSeconds = 3600)
    public Shop getShopsNear(Shop shop) {
        Shop returnShop;
        // query google maps api with shop data to get region
        // query database for region
        // calculate center of region
        // sort by distance from center
        returnShop = closestShopToCenter;
        return returnShop;
    }

The only change to your original method is the @Cached annotation. You also have the option of defining how long the data lives in the cache, or leaving the attribute out to use the default duration.

The method interceptor code deals with the selection of a cache provider, key generation, and so on. This makes evaluating a caching setup fast and effortless, and it keeps your application code clear. Future changes become a matter of configuration.
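To make this concrete, here’s a minimal sketch of how the interceptor side might look using Guice method interception. The Cache and CacheProvider types are the same hypothetical stand-ins used above; my project linked below handles key generation and provider selection more completely.

    import java.lang.annotation.ElementType;
    import java.lang.annotation.Retention;
    import java.lang.annotation.RetentionPolicy;
    import java.lang.annotation.Target;
    import java.util.Arrays;
    import org.aopalliance.intercept.MethodInterceptor;
    import org.aopalliance.intercept.MethodInvocation;

    // the annotation the interceptor looks for
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.METHOD)
    @interface Cached {
        int timeToLiveSeconds() default 300;
    }

    public class CacheInterceptor implements MethodInterceptor {
        @Override
        public Object invoke(MethodInvocation invocation) throws Throwable {
            // centralized key generation: class + method + arguments
            String cacheKey = invocation.getMethod().getDeclaringClass().getName()
                    + "." + invocation.getMethod().getName()
                    + Arrays.deepToString(invocation.getArguments());
            Cache cache = CacheProvider.getCache(); // hypothetical provider, as above
            Object result = cache.get(cacheKey);
            if (result == null) {
                result = invocation.proceed(); // invoke the real method
                int ttl = invocation.getMethod().getAnnotation(Cached.class).timeToLiveSeconds();
                cache.put(cacheKey, result, ttl); // assumes a put that accepts a TTL
            }
            return result;
        }
    }

The interceptor is attached in a Guice module (using com.google.inject.matcher.Matchers), so no caching code appears in the business logic itself. Keep in mind that Guice can only intercept methods on instances it constructs.

    // inside an AbstractModule's configure() method
    bindInterceptor(Matchers.any(), Matchers.annotatedWith(Cached.class), new CacheInterceptor());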

Getting Started

I’ve created an AOP caching project for Guice. Fork the code or use it as is.

References

My library is based on the initial work by bpoulson.


Big Data Cache Approaches

I’ve had several conversations recently about caching as it relates to big data. As a result of those discussions, I wanted to review some details that should be considered when deciding whether a cache is necessary, and how to cache big data when it is.

What is a Cache?

The purpose of a cache is to duplicate frequently accessed or important data in such a way that it can be accessed very fast and close to where it is needed. Caching generally moves data from a low cost, high density location (e.g. disk) to a high cost, very fast location (e.g. RAM). In some cases a cache moves data from a high latency context (e.g. network) to a low latency context (e.g. local disk). A cache increases hardware requirements due to duplication of data.

Big Data Alternatives

A review of big data alternatives shows that some are designed to operate entirely from RAM while others operate primarily from disk. Each option has made a trade-off that can be viewed along two competing priorities: Scalability & Performance vs. Depth of Functionality.

Figure: big data alternatives plotted along Scalability & Performance vs. Depth of Functionality

While relational databases are heavy on functionality and lag on performance, other alternatives like Redis are heavy on performance but lag with respect to features.

It’s also important to recognize that some tools, such as Lucene and Memcached, are not typically considered part of the collection of big data tools, but they effectively occupy the same feature space.

Robust functionality requirements may necessitate using a lower performance big data alternative. If that lower performance impacts SLAs (Service Level Agreements), it may be necessary to introduce a more performant alternative, either as a replacement for or complement to the first choice.

In other words, some of the big data alternatives occupy the cache tool space, while others are more feature-rich datastores.

Cache Scope and Integration

Cache scope and cache positioning have a significant impact on the effectiveness of a solution. In the illustration below I outline a typical tiered web application architecture.

Big data integration with a cache can be introduced at any tier within a web application

As indicated, each layer can be made to interact with a cache. I’ve also highlighted a few example technologies that may be a good fit for each feature space. A cache mechanism can be introduced in any tier, or even in multiple tiers.

Some technologies, like MongoDB, build some level of caching directly into the engine. This can be convenient, but it may not address latency or algorithm complexity. It may also not be feasible to cache a large result set when a query affects a large number of records or documents.

Algorithm Complexity and Load

To discuss algorithm complexity, let’s consider a couple of examples. Imagine a sample of web analytics data comprising 5,000,000 interactions. Each interaction document includes the following information:

  • timestamp
  • browser details
  • page details
  • tagging details
  • username if available

Algorithm #1: User Behavior

An algorithm that queries interactions by username and calculates the total number of interactions and time on site will require access to a very small subset of the total data, since not all records contain a username and not all users are logged in when viewing the site (some may not even have an account). Also, the number of interactions a single user can produce is limited.

It’s likely that no dedicated cache would be required in this case. A database-level cache may be sufficient, since the records are few and can remain in memory throughout the request cycle. And since the number of requests for a particular user is likely to be small, caching at the web tier would yield little value while still consuming valuable memory.
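To make the scale concrete, here’s a minimal sketch of Algorithm #1 using the 2.x MongoDB Java driver; the collection and field names ("interactions", "username", "timeOnSiteSeconds") are assumptions for illustration.

import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.DBCursor;
import com.mongodb.DBObject;

public class UserBehaviorReport {

    // totals for a single user; the result set is small because one
    // user can only produce a limited number of interactions
    public long[] interactionTotals(DBCollection interactions, String username) {
        long count = 0;
        long timeOnSiteSeconds = 0;
        DBCursor cursor = interactions.find(new BasicDBObject("username", username));
        try {
            while (cursor.hasNext()) {
                DBObject interaction = cursor.next();
                count++;
                timeOnSiteSeconds += ((Number) interaction.get("timeOnSiteSeconds")).longValue();
            }
        } finally {
            cursor.close();
        }
        return new long[] { count, timeOnSiteSeconds };
    }
}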

Algorithm #2: Browser Choice by Demographic

Unlike the algorithm above, which assumes many usernames (fewer records per username), there are likely to be a dozen or fewer browser types across the entire collection of data (many records per browser type). As a result, an algorithm that calculates the number of pages visited using a specific browser may touch millions of records.

In this case a native datastore cache may fall short, with queries resorting to disk every time. A carefully designed application-level cache could reduce load on the datastore while increasing responsiveness.

If an application-level cache is selected, a decision still needs to be made about scope. Should a subset of the data be stored in the cache as individual records or record fragments, or should the data be transformed into a more concise format before being placed in the cache?
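As a sketch of the “more concise format” option, assuming Redis (via Jedis) as the application-level cache: rather than caching millions of raw interaction documents, store only the per-browser totals once they have been computed. The key name and TTL are arbitrary choices for illustration.

import java.util.Map;

import redis.clients.jedis.Jedis;

public class BrowserReportCache {

    private static final int TTL_SECONDS = 3600; // assumption: an hourly refresh is acceptable

    // store the computed aggregate, not the raw records; millions of
    // documents collapse into a dozen or so hash fields
    public void cacheBrowserCounts(Jedis jedis, Map<String, Long> countsByBrowser) {
        String key = "report:browser-counts"; // hypothetical key name
        for (Map.Entry<String, Long> entry : countsByBrowser.entrySet()) {
            jedis.hset(key, entry.getKey(), String.valueOf(entry.getValue()));
        }
        jedis.expire(key, TTL_SECONDS);
    }
}

The concise form trades freshness for CPU: repeated requests for the report read a handful of hash fields instead of re-scanning millions of documents.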

Notice here that the cache is not only a function of memory resources, but also CPU. There may be sufficient memory to store all the data, but if the algorithm consumes a great deal of CPU, then despite fast access to the data the application may still lag.

If the exact same report is of interest to many people, a web tier cache may be justified, since there would be no value in generating the exact same report on each request. However, this needs to be balanced against authentication and authorization requirements that may protect that data.

Algorithm complexity and server load can influence cache design as much as datastore performance.

Balance

We’ve discussed some trade-offs between available memory and CPU. Additional factors include clarity of code, latency between tiers, scale and performance. In some cases an in-memory solution may fall short due to latency between a cache server and application servers. In those cases, proximity to the data may require distributing the cache so that latency is limited by system bus speeds rather than network connectivity.

CAUTION: As you consider introducing a cache at any level, make sure you’re not prematurely optimizing. Code clarity and overall architecture can suffer considerably when optimization is pursued without justification.

If you don’t need a cache, don’t introduce one.

Existing Cache Solutions

Caching can be implemented inside or outside your application. SaaS solutions, such as Akamai, prevent requests from ever reaching your servers, and build-your-own solutions that leverage cloud CDN offerings are also available. In front of your application, a reverse proxy like Varnish Cache can serve repeated responses without invoking your application code. For Java applications, EHCache or cache4guice may be a good fit. As mentioned above, there are many technologies available to create a custom cache, among which Memcached and Redis are very popular. These work with a variety of programming languages and offer flexibility with respect to what is stored and how it is stored.

Best Solution

Hopefully it’s obvious that there’s no single winner when it comes to cache technology. However, if you find that the solution you’ve chosen doesn’t perform as well as required, there are many options to get that data closer to the CPU and even reduce the CPU load for repetitive processing.


Redis as a cache for Java servlets

I’ve been refactoring an application recently to move away from a proprietary and inflexible in-memory datastore. The drawbacks of the proprietary datastore included the fact that its content was static: the only way to update data involved a build and replication process that took much longer than the stakeholders were willing to wait. The main selling point in favor of the in-memory datastore was that it was blazing fast. And I mean blazing fast.

My choice for a replacement datastore technology was MongoDB. MongoDB worked great, but profiling and performance comparisons naturally showed that the in-memory solution outperformed the MongoDB solution. Communicating with MongoDB on every request was obviously much slower than the previous in-memory datastore, and the response time was less consistent from one request to another.

Caching for performance

When the data being used to generate a response changes infrequently, it’s generally wasteful to regenerate that response on every page load. Enter caching. There’s a host of caching approaches covering everything from reverse proxies, like Varnish, to platform-specific solutions, like EHCache.

As a first stab, I chose a golden oldie, Memcached, and an up-and-coming alternative, Redis. There’s some lively discussion online about the performance differences between these two technologies. Ultimately I chose Redis due to the active development on the platform and its feature set.

Basic cache

In Java there are a handful of available Redis drivers. I started with the Jedis client. In order to use Jedis, I added this to my pom.xml.

<dependency>
    <groupId>redis.clients</groupId>
    <artifactId>jedis</artifactId>
    <version>2.0.0</version>
    <type>jar</type>
    <scope>compile</scope>
</dependency>

I then modified my basic servlet to initialize a JedisPool and use Jedis to cache the values I was retrieving from MongoDB. Here’s what my class ended up looking like.

package com.danielwatrous.cachetest;
 
import com.google.inject.Guice;
import com.google.inject.Injector;
import com.danielwatrous.linker.domain.WebLink;
import com.danielwatrous.linker.modules.MongoLinkerModule;
import java.io.IOException;
import java.io.PrintWriter;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPool;
import redis.clients.jedis.JedisPoolConfig;
 
public class BuildCnavLink extends HttpServlet {
 
    private static Injector hbinjector = null;
    private static JedisPool pool = null;
 
    @Override
    public void init() {
        hbinjector = Guice.createInjector(new MongoLinkerModule());
        pool = new JedisPool(new JedisPoolConfig(), "localhost", 6379);
    }
 
    @Override
    public void destroy() {
        pool.destroy();
    }
 
    protected void processRequest(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        response.setContentType("text/xml;charset=UTF-8");
        PrintWriter out = response.getWriter();
        String value = "";
        Jedis jedis = null;
 
        try {
            jedis = pool.getResource();
            String cacheKey = getCacheKey(request.getParameter("country"), request.getParameter("lang"), request.getParameter("company"));
            value = jedis.get(cacheKey);
            if (value == null) {
                // cache miss: build the value from MongoDB and store it
                WebLink webLink = hbinjector.getInstance(WebLink.class);
                webLink.setLanguage(request.getParameter("lang"));
                webLink.setCountry(request.getParameter("country"));
                webLink.setCompany(request.getParameter("company"));
                value = webLink.buildWebLink();
                jedis.set(cacheKey, value);
            }
        } finally {
            // guard against a failed getResource, which would leave jedis null
            if (jedis != null) {
                pool.returnResource(jedis);
            }
        }
 
        try {
            out.println("<link>");
            out.println("<url>" + value + "</url>");
            out.println("</link>");
        } finally {            
            out.close();
        }
    }
 
    @Override
    protected void doGet(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        processRequest(request, response);
    }
 
    protected String getCacheKey(String country, String lang, String company) {
        // delimit the parts so adjacent values can't combine into ambiguous keys
        return country + ":" + lang + ":" + company;
    }
}

Observations

It’s assumed that a combination of country, lang and company will produce a unique value when buildWebLink is called. That must be the case if you’re using those parameters to generate a cache key (note the delimiter in getCacheKey, which keeps adjacent values from running together into ambiguous keys).

There’s also nothing built in above to invalidate the cache. A simple time/age check could work as a starting point; there may be other, more sophisticated optimistic or pessimistic algorithms to manage cached content.
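For example, Redis can handle an age-based expiry natively. A minimal change to the servlet above would be to replace the plain set with a set-with-expiry (the one-hour TTL is an arbitrary choice for illustration):

// Redis evicts the key after 3600 seconds, forcing a rebuild on the next request
jedis.setex(cacheKey, 3600, value);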

In the case above, I’m using redis to store a simple String value. I’m also still generating a dynamic response, but I’ve effectively moved the majority of my data calls to redis.

Conclusion

As a first stab, this performs on par with the proprietary in-memory solution that we’re replacing, and the consistency from one request to another is very tight. Here I’m connecting to a local Redis instance; if Redis were on a remote box, network latency might erase these gains. Object storage or serialization may also affect performance if it’s determined that simple String caching isn’t sufficient or desirable.
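If simple String caching does turn out to be insufficient, one sketch of an approach, assuming Gson for serialization (any JSON or binary serializer would do), is to flatten objects to a String before handing them to Jedis:

import com.google.gson.Gson;

public class JsonCacheHelper {

    private static final Gson GSON = new Gson();

    // flatten an object to JSON before storing it in Redis
    public static String toCacheValue(Object value) {
        return GSON.toJson(value);
    }

    // rebuild the object on retrieval
    public static <T> T fromCacheValue(String json, Class<T> type) {
        return GSON.fromJson(json, type);
    }
}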

Resources

http://www.ibm.com/developerworks/java/library/j-javadev2-22/index.html
https://github.com/xetorthio/jedis/wiki/Getting-started