
Optimised actor caching

Recently at work we needed to solve a problem; for anonymity's sake I will not go into much detail. In any case, we faced a common issue: our db was under heavy query pressure. In this journal entry I will walk through some of the techniques we used to alleviate it.

Goals

| Aim | Ideas |
| --- | --- |
| Reduce query pressure | Reducing the number of queries to the db would translate to a reduction in CPU load and overall improved performance |
| Accurate data after cache invalidations | Designing a generic cache key would maintain a high cache hit ratio. Alongside this we need a cache invalidation mechanism which does not result in sudden surges of db load |
| Improve overall latency | Serving frequently accessed data from a low-latency cache will help reduce db load and improve overall latency |

Nature of data served

On our platform we allow users to sell a wide variety of items, each composed of many attributes. This results in a large number of product permutations which, combined with availability factors, makes the data highly dynamic.

From the perspective of users buying these items, the data is both frequently accessed and frequently changed. This is a caching headache: data that is frequently accessed but infrequently changed would be far easier to cache. Nonetheless, we had this interesting problem to solve.

Because of the nature of the data served on the platform, and because accuracy of data is of paramount importance, we rely on our db as the source of truth.

Characteristics of the actors in the system

The architecture of a system hints at the characteristics of actors within the system. In a distributed system, depending on the scaling strategies, there can be one or many actors trying to fetch the same underlying data.

Combined with traffic and load patterns that peak during the day, this means there are often surges of db queries. These can set into motion a spiral of increasing latencies as actors fight over db resources to fetch the data they need.

Cache invalidations & mechanics

Finally, we get to the crescendo of this journal entry, given the two important parameters of the problem: (1) the nature of the data served and (2) the characteristics of the actors in the system. A picture is worth a thousand words, so the sequence below illustrates the technique we used:

Participants: client, actorA, actorB, cache, db.

1. client → actorA: fetch items with filter A
2. actorA: cacheKey = calculateKey(filter A)
3. actorA → cache: Retrieve(cacheKey) → cache miss
4. actorA → cache: Set(cacheKey, 'Pending')
5. actorA → db: fetch from DB
6. client → actorB: fetch items with filter A; actorB: cacheKey = calculateKey(filter A)
7. actorB → cache: Retrieve(cacheKey) → cache hit, result = 'Pending'
8. actorB: backoff retry... wait...
9. db → actorA: result from DB query
10. actorA → cache: Set(cacheKey, DB result); actorA → client: return results
11. actorB → cache: Retrieve(cacheKey) → cache hit, result = DB results
12. actorB → client: return results
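As a rough sketch of how calculateKey might work (the function name, the filter shape, and the `items:` prefix are all hypothetical), one option is to canonicalise the filter before hashing it, so that logically equal filters always map to the same cache key:

```python
import hashlib
import json

def calculate_key(filters: dict) -> str:
    """Derive a deterministic cache key from a filter dict.

    Canonicalising (sorted keys, fixed separators) ensures that two
    filters with the same contents produce the same key regardless of
    the order in which their fields were supplied.
    """
    canonical = json.dumps(filters, sort_keys=True, separators=(",", ":"))
    return "items:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Keeping the key generic like this is what sustains a high hit ratio: every actor asking for the same filter lands on the same cache entry.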

This mechanism, where the first actor sets a Pending value for the calculated cache key so that subsequent actors wait for the Pending value to be replaced with the results matching the filter, significantly alleviates db surges once cache items are invalidated on mutations.
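A minimal sketch of this mechanism, assuming a cache that supports an atomic set-if-absent operation (here faked with an in-memory dict and a lock; the names `Cache`, `fetch_items`, `query_db`, and the `PENDING` sentinel are illustrative, not our actual API):

```python
import threading
import time

PENDING = "PENDING"  # sentinel; in production this would be a reserved cache value

class Cache:
    """Minimal stand-in for a shared cache with atomic set-if-absent."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def set_if_absent(self, key, value):
        """Atomically claim the key; returns True only for the first caller."""
        with self._lock:
            if key in self._data:
                return False
            self._data[key] = value
            return True

    def get(self, key):
        with self._lock:
            return self._data.get(key)

    def set(self, key, value):
        with self._lock:
            self._data[key] = value

def fetch_items(cache, key, query_db, retry_delay=0.02, max_retries=50):
    """First actor claims the key with PENDING and queries the db;
    followers back off until the real result appears."""
    if cache.set_if_absent(key, PENDING):
        result = query_db()
        cache.set(key, result)
        return result
    for _ in range(max_retries):
        value = cache.get(key)
        if value is not None and value != PENDING:
            return value
        time.sleep(retry_delay)
    raise TimeoutError("cache value still pending")
```

One design note: in a real deployment the Pending value would also need a TTL, so that a crashed first actor does not wedge the key in the Pending state forever.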

Another benefit of this approach is that we maintain a high cache hit ratio. Subsequent actors retrieve either a Pending result, in which case a backoff retry mechanism kicks in, or the actual result from the db.

In theory, we could have returned an error whenever an actor retrieved a Pending result while another actor was busy fetching. However, the first actor may finish its db fetch at any moment, so if we retry after a short wait (10-50ms) we can still return the expected results. This significantly improves the overall UX: the user does not need to refresh or retry a fetch; the waiting actor does that for us.
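The waiting actor's retry loop can be sketched as a small helper. The 10-50ms window comes from the discussion above; the function name and the jittered (randomised) sleep are assumptions, the latter to spread retries out rather than have all waiters poll in lockstep:

```python
import random
import time

def wait_for_result(get_value, pending, min_ms=10, max_ms=50, max_attempts=20):
    """Poll a cached value until the Pending sentinel is replaced.

    Sleeps a random 10-50ms between attempts so that many waiting
    actors do not hammer the cache at the same instant.
    """
    for _ in range(max_attempts):
        value = get_value()
        if value is not None and value != pending:
            return value
        time.sleep(random.uniform(min_ms, max_ms) / 1000.0)
    # Give up rather than wait forever; the caller can surface an error
    # or fall back to querying the db directly.
    raise TimeoutError("value still pending after retries")
```

In the sequence above, actorB would call this with `get_value` bound to `Retrieve(cacheKey)` and return the db results to the client once they appear.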

Wrapping up

| Aim | Outcomes |
| --- | --- |
| Reduce query pressure | CPU load reduced by 50% |
| Accurate data after cache invalidations | Cache hit ratio maintained at 90% |
| Improve overall latency | 95th percentile query latency improved by 65% |