Daniel Watrous on Software Engineering

A Collection of Software Problems and Solutions


Increase Munin Resolution to sub-minute

I previously explained how to get one-minute resolution in Munin. Getting sub-minute resolution is trickier: cron only runs once per minute, so data must be generated and cached in between cron runs, then collected when cron fires.

In the case where a single datapoint is collected each time cron runs, the time at which cron runs is sufficient to store the data in RRD. With multiple datapoints collected in a single cron run, it’s necessary to embed a timestamp with each datapoint so the datapoints can be stored correctly in RRD.

For example, the load plugin, which produces output like this for a one-minute or greater collection interval:

load.value 0.00

Would need to produce output like this for a five-second collection interval:

load.value 1426889886:0.00
load.value 1426889891:0.00
load.value 1426889896:0.00
load.value 1426889901:0.00
load.value 1426889906:0.00
load.value 1426889911:0.00
load.value 1426889916:0.00
load.value 1426889921:0.00
load.value 1426889926:0.00
load.value 1426889931:0.00
load.value 1426889936:0.00
load.value 1426889941:0.00

Caching mechanism

In one example implementation with a one-second collection rate, a datafile is either appended to or replaced using standard Linux file management mechanisms.

This looks a lot like a message queue problem. Each sample is published to the queue, and the call to fetch acts as the subscriber, pulling all available messages and emptying the queue for the next cache window.
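
In Redis terms (anticipating the plugin below), publishing is an RPUSH onto a list, and fetch LPOPs entries until the list is empty. The key name demo.cache here is purely illustrative.

redis-cli RPUSH demo.cache "load.value 1426889886:0.00"
redis-cli LPOP demo.cache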

Acquire must be long-running

In the case of one-minute resolution, the data can be generated at the moment it is collected. This means the process run by cron is sufficient to generate the desired data and can die after the data is output. For sub-minute resolution, a separate long-running process is required to generate and cache the data. There are a couple of ways to accomplish this:

  1. Start a process that runs only until the next cron run. This would be started each time cron fetches the data.
  2. Create a daemon process that produces a continuous stream of data.

A possible pitfall with #2 above is that it would continue producing data even if the collection cron were failing. Option #1 results in more total processes being started. The example plugin below uses option #1, relying on timeout to stop the acquire process shortly after the next expected cron run.

Example using Redis

Redis is a very fast key/value datastore that runs natively on Linux. The following example shows a bash-based plugin that uses Redis as the cache between cron runs. On Ubuntu, I can install Redis using apt-get.

sudo apt-get install -y redis-server
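
To confirm the server is running (assuming the default port), redis-cli ping should answer PONG.

ubuntu@munin-dup:~$ redis-cli ping
PONG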

And here is the plugin.

#!/bin/bash
# (c) 2015  - Daniel Watrous
 
update_rate=5                   # sampling interval in seconds
cron_cycle=1                    # time between cron runs in minutes
pluginfull="$0"                 # full name of plugin
plugin="${0##*/}"               # name of plugin
redis_cache="$plugin.cache"
graph="$plugin"
section="system:load"
style="LINE"
 
run_acquire() {
   # sample the one-minute load average every $update_rate seconds
   # and push it, timestamped, onto a Redis list
   while true
   do
      sleep $update_rate
      datapoint="$(awk '{print "load.value " systime() ":" $1}' /proc/loadavg)"
      redis-cli RPUSH $redis_cache "$datapoint"
   done
}
 
# --------------------------------------------------------------------------
 
run_autoconf() {
   echo "yes"
}
 
run_config() {
cat << EOF
graph_title $graph
graph_category $section
graph_vlabel System Load
graph_scale no
update_rate $update_rate
graph_data_size custom 1d, 10s for 1w, 1m for 1t, 5m for 1y
load.label load
load.draw $style
EOF
}
 
run_fetch() {
   # (re)start the acquire loop in the background; timeout kills it
   # shortly after the next cron run is due, so it stops if collection stops
   timeout_calc=$(expr $cron_cycle \* 60 + 5)
   timeout $timeout_calc $pluginfull acquire >/dev/null &
   # drain all cached datapoints from the Redis list
   while true
   do
     datapoint="$(redis-cli LPOP $redis_cache)"
     if [ "$datapoint" = "" ]; then
       break
     fi
     echo "$datapoint"
   done
}
 
 
run_${1:-fetch}
exit 0

Restart munin-node to find the plugin

Before the new plugin will be found and executed, it’s necessary to restart munin-node. If autoconf returns yes, data collection will start automatically.

ubuntu@munin-dup:~$ sudo service munin-node restart
munin-node stop/waiting
munin-node start/running, process 4684

It’s possible to view the cached values using the LRANGE command without disturbing their collection. Recall that calling fetch removes values from the queue, so leave that to Munin.

ubuntu@munin-dup:~$ redis-cli LRANGE load_dynamic.cache 0 -1
1) "load.value 1426910287:0.13"
2) "load.value 1426910292:0.12"
3) "load.value 1426910297:0.11"
4) "load.value 1426910302:0.10"

That’s it. Now you have a Munin plugin with sub-minute resolution, sampling every five seconds in this example.

Big Data Cache Approaches

I’ve had several conversations recently about caching as it relates to big data. As a result, I wanted to review some details that should be considered when deciding whether a cache is necessary, and how to cache big data when it is.

What is a Cache?

The purpose of a cache is to duplicate frequently accessed or important data in such a way that it can be accessed very fast and close to where it is needed. Caching generally moves data from a low cost, high density location (e.g. disk) to a high cost, very fast location (e.g. RAM). In some cases a cache moves data from a high latency context (e.g. network) to a low latency context (e.g. local disk). A cache increases hardware requirements due to duplication of data.

Big Data Alternatives

A review of big data alternatives shows that some are designed to operate entirely from RAM while others operate primarily from disk. Each option makes a trade-off along two competing priorities: scalability and performance vs. depth of functionality.

While relational databases are heavy on functionality and lag on performance, other alternatives like Redis are heavy on performance but lag with respect to features.

It’s also important to recognize that some tools, such as Lucene and Memcached, are not typically considered big data tools, yet they effectively occupy the same feature space.

Robust functionality requirements may necessitate using a lower performance big data alternative. If that lower performance impacts SLAs (Service Level Agreements), it may be necessary to introduce a more performant alternative, either as a replacement for or a complement to the first choice.

In other words, some of the big data alternatives occupy the cache tool space, while others are more feature-rich datastores.

Cache Scope and Integration

Cache scope and cache positioning have a significant impact on the effectiveness of a solution. In the illustration below I outline a typical tiered web application architecture.

Big data integration with a cache can be introduced at any tier within a web application

As indicated, each layer can be made to interact with a cache. I’ve also highlighted a few example technologies that may be a good fit for each feature space. A cache mechanism can be introduced in any tier, or even in multiple tiers.

Some technologies, like MongoDB, build some level of caching directly into their engine. This can be convenient, but it may not address latency or algorithm complexity. It may also not be feasible to cache a large result set when a query affects a large number of records or documents.

Algorithm Complexity and Load

To discuss algorithm complexity, let’s consider a couple of examples. Imagine a sample of web analytics data comprising 5,000,000 interactions. Each interaction document includes the following information:

  • timestamp
  • browser details
  • page details
  • tagging details
  • username if available

Algorithm #1: User Behavior

An algorithm that queries interactions by username and calculates the total number of interactions and time on site requires access to a very small subset of the total data, since not all records contain a username and not all users are logged in when viewing the site (some may not even have an account). The number of interactions a single user can produce is also limited.

It’s likely that no caching would be required in this case. A database-level cache may be helpful, since the records are few and can remain in memory throughout the request cycle. Because the number of requests for a particular user is likely to be small, caching at the web tier would yield little value while still consuming valuable memory.

Algorithm #2: Browser Choice by Demographic

Unlike the algorithm above, which assumes many usernames (few records per username), there are likely to be a dozen or fewer browser types across the entire collection of data (many records per browser type). As a result, an algorithm that calculates the number of pages visited using a specific browser may touch millions of records.

In this case a native datastore cache may fall short, and queries would resort to disk every time. A carefully designed application-level cache could reduce load on the datastore while increasing responsiveness.

If an application-level cache is selected, a decision still needs to be made about scope. Should a subset of the data be stored in the cache as individual records or record fragments, or should the data be transformed into a more concise format before it is cached?
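
As a minimal sketch of the second approach, assuming the Jedis client, an illustrative key name, and a hypothetical computeBrowserReport() that performs the expensive aggregation, it is the transformed result rather than the raw records that gets cached:

import redis.clients.jedis.Jedis;

public class BrowserReportCache {

    private static final String CACHE_KEY = "report:browser-by-demographic"; // illustrative key
    private static final int TTL_SECONDS = 600;                              // assumed freshness window

    // Serve the precomputed aggregate when present; otherwise compute and cache it.
    public String getBrowserReport(Jedis jedis) {
        String cached = jedis.get(CACHE_KEY);
        if (cached != null) {
            return cached;
        }
        String report = computeBrowserReport();
        jedis.setex(CACHE_KEY, TTL_SECONDS, report); // expire stale aggregates automatically
        return report;
    }

    // Placeholder for the expensive scan over millions of interaction records.
    private String computeBrowserReport() {
        return "...";
    }
}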

Notice here that the cache is not only a function of memory resources, but also CPU. There may be sufficient memory to store all the data, but if the algorithm consumes a great deal of CPU then despite fast access to the data, the application may lag.

If the exact same report is of interest to many people, a web tier cache may be justified, since there would be no value in generating the exact same report on each request. However, this needs to be balanced against authentication and authorization requirements that may protect that data.

Algorithm complexity and server load can influence cache design as much as datastore performance.

Balance

We’ve discussed some trade-offs between available memory and CPU. Additional factors include clarity of code, latency between tiers, scale, and performance. In some cases an in-memory solution may fall short due to latency between a cache server and application servers. In those cases, proximity to the data may require distributing the cache so that latency is limited by system bus speeds rather than network connectivity.

CAUTION: As you consider introducing a cache at any level, make sure you’re not prematurely optimizing. Code clarity and overall architecture can suffer considerably when optimization is pursued without justification.

If you don’t need a cache, don’t introduce one.

Existing Cache Solutions

Caching can be implemented inside or outside your application. SaaS solutions, such as Akamai, prevent requests from ever reaching your servers. Build-your-own solutions that leverage cloud CDN offerings are also available. When building a cache into your application, there are options like Varnish Cache. For Java applications, EHCache or cache4guice may be a good fit. As mentioned above, there are many technologies available to create a custom cache, among which Memcached and Redis are very popular. These also work with a variety of programming languages and offer flexibility with respect to what is stored and how it is stored.

Best Solution

Hopefully it’s obvious that there’s no single winner when it comes to cache technology. However, if you find that the solution you’ve chosen doesn’t perform as well as required, there are many options to get that data closer to the CPU and even reduce the CPU load for repetitive processing.

Redis as a cache for Java servlets

I’ve been refactoring an application recently to move away from a proprietary and inflexible in-memory datastore. Its main drawback was that the content was static: the only way to update data involved a build and replication process that took much longer than the stakeholders were willing to wait. The main selling point in its favor was that it is blazing fast. And I mean blazing fast.

My choice for a replacement datastore technology is MongoDB. MongoDB worked great, but profiling and performance comparison naturally showed that the in-memory solution outperformed the MongoDB solution. Communication with MongoDB on every request was obviously much slower than the previous in-memory datastore, and the response time was less consistent from one request to another.

Caching for performance

When the data used to generate a response changes infrequently, it’s generally bad design to regenerate dynamic content on every page load. Enter caching. There are a host of caching approaches covering everything from reverse proxies, like Varnish, to platform-specific solutions, like EHCache.

As a first stab, I chose a golden oldie, Memcached, and an up-and-coming alternative, Redis. There’s some lively discussion online about the performance differences between these two technologies. Ultimately I chose Redis due to the active development on the platform and its feature set.

Basic cache

In Java there are a handful of available Redis drivers. I started with the Jedis client. In order to use Jedis, I added this to my pom.xml.

<dependency>
    <groupId>redis.clients</groupId>
    <artifactId>jedis</artifactId>
    <version>2.0.0</version>
    <type>jar</type>
    <scope>compile</scope>
</dependency>

I then modified my basic servlet to initialize a JedisPool and use Jedis to cache the values I was retrieving from MongoDB. Here’s what my class ended up looking like.

package com.danielwatrous.cachetest;
 
import com.google.inject.Guice;
import com.google.inject.Injector;
import com.danielwatrous.linker.domain.WebLink;
import com.danielwatrous.linker.modules.MongoLinkerModule;
import java.io.IOException;
import java.io.PrintWriter;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPool;
import redis.clients.jedis.JedisPoolConfig;
 
public class BuildCnavLink extends HttpServlet {
 
    private static Injector hbinjector = null;
    private static JedisPool pool = null;
 
    @Override
    public void init() {
        hbinjector = Guice.createInjector(new MongoLinkerModule());
        pool = new JedisPool(new JedisPoolConfig(), "localhost", 6379);
    }
 
    @Override
    public void destroy() {
        pool.destroy();
    }
 
    protected void processRequest(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        response.setContentType("text/xml;charset=UTF-8");
        PrintWriter out = response.getWriter();
        String value = "";
        Jedis jedis = null;
 
        try {
            jedis = pool.getResource();
            String cacheKey = getCacheKey(request.getParameter("country"), request.getParameter("lang"), request.getParameter("company"));
            value = jedis.get(cacheKey);
            if (value == null) {
                WebLink webLink = hbinjector.getInstance(WebLink.class);
                webLink.setLanguage(request.getParameter("lang"));
                webLink.setCountry(request.getParameter("country"));
                webLink.setCompany(request.getParameter("company"));
                value = webLink.buildWebLink();
                jedis.set(cacheKey, value);
            }
        } finally {
            // guard against a null resource if getResource() throws
            if (jedis != null) {
                pool.returnResource(jedis);
            }
        }
 
        try {
            out.println("<link>");
            out.println("<url>" + value + "</url>");
            out.println("</link>");
        } finally {            
            out.close();
        }
    }
 
    @Override
    protected void doGet(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        processRequest(request, response);
    }
 
    protected String getCacheKey(String country, String lang, String company) {
        // delimit the parts so different combinations cannot collide when concatenated
        return country + ":" + lang + ":" + company;
    }
}

Observations

It’s assumed that a combination of country, lang and company will produce a unique value when buildWebLink is called. That must be the case if those parameters generate the cache key; the delimiter in getCacheKey keeps different combinations from colliding after concatenation.

There’s also nothing built in above to invalidate the cache. A simple approach is a time/age check, such as storing each value with a TTL; more sophisticated optimistic or pessimistic strategies for managing cached content are also possible.
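
For example, a minimal TTL-based variation (the 300-second window is an assumption to tune) would replace the jedis.set call in the servlet above:

                // expire the cached value automatically after 300 seconds
                jedis.setex(cacheKey, 300, value);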

In the case above, I’m using Redis to store a simple String value. I’m also still generating a dynamic response, but I’ve effectively moved the majority of my data calls to Redis.

Conclusion

As a first stab, this performs on par with the proprietary in-memory solution that we’re replacing, and the consistency from one request to another is very tight. Here I’m connecting to a local Redis instance; if Redis were on a remote box, network latency might erase these gains. Object storage or serialization may also affect performance if it’s determined that simple String caching isn’t sufficient or desirable.
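
If simple String values do stop being sufficient, one option is serializing objects to JSON before caching them. Here is a minimal sketch, assuming the Gson library is on the classpath and using a hypothetical CachedLink POJO; the serialization cost partly offsets the cache’s latency win.

import com.google.gson.Gson;
import redis.clients.jedis.Jedis;

public class JsonCacheSketch {

    // hypothetical POJO standing in for whatever object needs caching
    static class CachedLink {
        String url;
        CachedLink() {}
        CachedLink(String url) { this.url = url; }
    }

    public static void main(String[] args) {
        Gson gson = new Gson();
        Jedis jedis = new Jedis("localhost", 6379);
        // serialize before set, deserialize after get
        jedis.set("link:us:en:acme", gson.toJson(new CachedLink("http://example.com")));
        CachedLink cached = gson.fromJson(jedis.get("link:us:en:acme"), CachedLink.class);
        System.out.println(cached.url);
        jedis.disconnect();
    }
}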

Resources

http://www.ibm.com/developerworks/java/library/j-javadev2-22/index.html
https://github.com/xetorthio/jedis/wiki/Getting-started