I’ve had several conversations recently about caching as it relates to big data. As a result of these discussions I wanted to review some details that should be considered when deciding if a cache is necessary and how to cache big data when it is necessary.
What is a Cache?
The purpose of a cache is to duplicate frequently accessed or important data in such a way that it can be accessed very fast and close to where it is needed. Caching generally moves data from a low cost, high density location (e.g. disk) to a high cost, very fast location (e.g. RAM). In some cases a cache moves data from a high latency context (e.g. network) to a low latency context (e.g. local disk). A cache increases hardware requirements due to duplication of data.
Big Data Alternatives
A review of big data alternatives shows that some are designed to operate entirely from RAM while others operate primarily from disk. Each option has made a trade off that can be viewed along two competing priorities: Scalability & Performance vs. Depth of Functionality.
While relational databases are heavy on functionality and lag on performance, other alternatives like Redis are heavy on performance but lag with respect to features.
It’s also important to recognize that there are some tools, such as Lucene and Memcached that are not typically considered as belonging to the collection of big data tools, but they effectively occupy the same feature space.
Robust functionality requirements may necessitate using a lower performance big data alternative. If that lower performance impacts SLAs (Service Level Agreement), it may be necessary to introduce a more performant alternative, either as a replacement or compliment to the first choice.
In other words, some of the big data alternatives occupy the cache tool space, while others are more feature rich datastores.
Cache Scope and Integration
Cache scope and cache positioning have a significant impact on the effectiveness of a solution. In the illustration below I outline a typical tiered web application architecture.
As indicated, each layer can be made to interact with a cache. I’ve also highlighted a few example technologies that may be a good fit for each feature space. A cache mechanism can be introduced in any tier, or even in multiple tiers.
Some technologies, like MongoDB build some level of caching into their engine directly. This can be convenient, but it may not address latency or algorithm complexity. It may also not be feasible to cache a large result set when a query affects a large number of records or documents.
Algorithm Complexity and Load
To discuss algorithm complexity, let’s consider a couple of examples. Imagine a sample of web analytics data comprising 5,000,000 interactions. Each interaction document includes the following information:
- browser details
- page details
- tagging details
- username if available
Algorithm #1: User Behavior
An algorithm that queries interactions by username and calculates the total number of interactions and time on site will require access to a very small subset of the total data since not all records contain a username, and not all users are logged in when viewing the site (some may not even have an account). Also, the number of interactions a single user can produce is limited.
It’s likely that in this case no caching would be required. A database level cache may be helpful since the number of records are few and can remain in memory throughout the request cycle. Since the number of requests for a particular user is likely to be small, caching at the web tier would yield little value, while still consuming valuable memory.
Algorithm #2: Browser Choice by Demographic
Unlike the algorithm above that assumes many usernames (fewer records per username), there are likely to be a dozen or fewer browser types across the entire collection of data (many records per browser type). As a result, an algorithm that calculates the number of pages visited using a specific browser may reach into the millions.
In this case a native datastore cache may fall short and queries would resort to disk every time. A carefully designed application level cache could reduce load on the datastore while increasing responsiveness.
If an application level cache is selected, a decision still needs to be made about scope. Should a subset of the data as individual records or record fragments be stored in the cache or should that data be transformed into a more concise format before placing it in the cache.
Notice here that the cache is not only a function of memory resources, but also CPU. There may be sufficient memory to store all the data, but if the algorithm consumes a great deal of CPU then despite fast access to the data, the application may lag.
If the exact same report is of interest to many people, a web tier cache may be justified, since there would be no value in generating the exact same report on each request. However, this needs to be balanced against authentication and authorization requirements that may protect that data.
Algorithm complexity and server load can influence cache design as much as datastore performance.
We’ve discussed some trade offs between available memory and CPU. Additional factors include clarity of code, latency between tiers, scale and performance. In some cases an in memory solution may fall short due to latency between a cache server and application servers. In those cases, proximity to the data may require distributing a cache so that latency is limited by system bus speeds and not network connectivity.
CAUTION: As you consider introducing a cache at any level, make sure you’re not prematurely optimizing. Code clarity and overall architecture can suffer considerably when optimization is pursued without justification.
If you don’t need a cache, don’t introduce one.
Existing Cache Solutions
Caching can be implemented inside or outside your application. SaaS solutions, such as Akamai, prevent the request from ever reaching your servers. Build your own solutions which leverage cloud CDN offerings are also available. When building a cache into your application, there are options like Varnish Cache. For Java applications EHCache or cache4guice may be a good fit. As mentioned above, there are many technologies available to create a custom cache, among which Memcached and Redis are very popular. These also work with a variety of programming languages and offer flexibility with respect to what is stored and how it is stored.
Hopefully it’s obvious that there’s no single winner when it comes to cache technology. However, if you find that the solution you’ve chosen doesn’t perform as well as required, there are many options to get that data closer to the CPU and even reduce the CPU load for repetitive processing.