

Before You Add Another Service for Caching, Read This.


October 15, 2025


Photo by Logan Voss on Unsplash

I'll start this article with a real-world problem I faced in one of my projects. Say we have a service whose primary job is to build a configuration on demand. Building that configuration requires querying the database, and those queries involve expensive joins across several tables.

To put it in numbers, the whole query took around 900 ms, which is far too long: the caller has to wait 900 ms before it gets its configuration. That might seem acceptable if only one service were asking for it. But what if hundreds of these applications are asking? Are we going to query our database hundreds of times?

Of course not, right? You might immediately think: why aren't you caching it? And you'd be right to ask. But cache at which layer?

If I cache at the application layer, each application might end up with a different view of my config, right?

Note: I use application and instance interchangeably in this article.

It is possible that the latest configuration seen by Instance 1 differs from the latest configuration seen by Instance 2. Why would they differ? Because between the two calls, someone might have made a config change, and the configuration we build would reflect it. So asking different instances for the latest configuration can return different results, because these instances' caches know nothing about each other's state.

Makes sense. So instead of an instance-level (application-level) cache, should we use a distributed cache? Something like Redis?

Why not use Redis instead?

Yes, that would solve our problem, but it comes with its own set of challenges: we would have to manage a separate piece of infrastructure. And usually that infrastructure is not directly under our control but with the SREs. It means setup, configuration, monitoring, patching, and handling failures for the Redis cluster itself, in addition to our application.

Can we do something else? I don't want to add another piece of infrastructure just to get consistency across these instances.

Why not use Hazelcast instead?

Wait, what? Is it similar to Redis? Will I need to deploy my own server? Is it cheap?

Now, let's define Hazelcast in easier-to-understand terms: it is a distributed cache that lives within your application instances, keeping them in sync without the need for an external caching service like Redis.

At its core, it is an in-memory data grid (IMDG): it pools the memory of the application instances (nodes) to store and process data, and that pooled memory can also serve as a distributed cache.

Earlier, our cache (a plain new HashMap<String, ConfigurationObject>()) knew nothing about the state of the other caches; with Hazelcast, each node knows about the state of the cache on the other nodes (or members). But we might be getting ahead of ourselves. Let's slow down and understand how Hazelcast is able to do this, and how we integrate with it.

We add the following dependency to our pom.xml file to include Hazelcast.

<dependency>
    <groupId>com.hazelcast</groupId>
    <artifactId>hazelcast-spring</artifactId>
</dependency>
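
Once the dependency is on the classpath, here is a minimal sketch of how an embedded member could be bootstrapped, assuming Spring Boot's Hazelcast auto-configuration picks up a Config bean (the cluster name below is illustrative, not from this project):

import com.hazelcast.config.Config;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class HazelcastConfiguration {

    // Spring Boot detects this Config bean and starts an embedded Hazelcast
    // member, exposing it as a HazelcastInstance bean we can inject anywhere.
    @Bean
    public Config hazelcastConfig() {
        Config config = new Config();
        config.setClusterName("config-service-cluster"); // illustrative name
        return config;
    }
}

With that in place, injecting the HazelcastInstance and calling getMap(...) gives us a distributed map instead of a plain HashMap.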

Alright, that's how we include Hazelcast in our application. But shouldn't we understand how it talks to the other instances of our application? Say we have deployed 5 instances in production; how does one instance talk to another? How does it keep the cache consistent?

What happens if I put a value in the cache of, say, I1 (Instance 1)? Will it be reflected in I3 (Instance 3)?

What if Instance 1 goes down? Will we lose the data?

We'll answer all of these questions in this article.

Let's start with this: how does Instance 1 know that it needs to talk to Instances 2, 3, 4, and 5? How do they even know about each other's existence? In other words, how do they discover each other?

Let’s call this module:

Forming a Cluster

When we start our application, we also start a Hazelcast "member" within it. This member then needs to find the others to form a cluster, a process called discovery. Discovery can happen in two ways:

  1. Automatic Discovery: this is the default setting, where Hazelcast uses multicast on the network to broadcast its presence. Other members on the same network "hear" this broadcast and automatically form a cluster.
  2. Explicit Discovery: if your application runs in the cloud, multicast may not be available. In that case, we can configure our Hazelcast members to find each other using a list of known IP addresses (TCP/IP discovery), as sketched below.
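
For illustration, here is a sketch of what explicit TCP/IP discovery could look like with Hazelcast's programmatic configuration; the member addresses are placeholders, not real ones from this setup:

import com.hazelcast.config.Config;
import com.hazelcast.config.JoinConfig;
import java.util.List;

public class DiscoveryConfig {

    static Config tcpIpDiscovery() {
        Config config = new Config();
        JoinConfig join = config.getNetworkConfig().getJoin();
        join.getMulticastConfig().setEnabled(false);   // turn off automatic (multicast) discovery
        join.getTcpIpConfig()
            .setEnabled(true)                          // use a fixed list of member addresses instead
            .setMembers(List.of("10.0.0.11", "10.0.0.12", "10.0.0.13"));
        return config;
    }
}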

Now comes the interesting part: once the members find each other, they form a peer-to-peer network. There is no central or master instance here; everyone is equal, no master, no slave.

I hope we're on the same page up to this point, because now comes the most interesting part: the distribution of the data.

Distribution of the Data

Earlier, when we put something into our application's map, we controlled where the data lived, i.e., which map held what. For example, say I expose an API that lets me put data into the map (application-level, the old flow), say /put. If I send the key animal with the value Giraffe to Instance 1, Instance 1's map stores it. If I then explicitly query Instance 2 for that key, what do you think we will get?

We will get null, right? Yes, because we stored the value in Instance 1, not Instance 2.

What if I told you that with Hazelcast it doesn't matter whether you save the value through Instance 1 or Instance 3, and it doesn't matter whether you query for the key from Instance 2 or Instance 4: you will get the same answer, "Giraffe". Earlier, we decided which application instance stores the data; with Hazelcast, it decides for us, and it uses the data's key to make that decision consistently.
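
Here is a small, self-contained sketch of that behaviour. Both members run in one JVM only for brevity (in production each instance runs in its own JVM), and the map name "animals" is just for illustration:

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.map.IMap;

public class SharedMapDemo {
    public static void main(String[] args) {
        // Two embedded members discover each other and form a cluster.
        HazelcastInstance instance1 = Hazelcast.newHazelcastInstance();
        HazelcastInstance instance2 = Hazelcast.newHazelcastInstance();

        // Put the value through Instance 1...
        IMap<String, String> mapOnInstance1 = instance1.getMap("animals");
        mapOnInstance1.put("animal", "Giraffe");

        // ...and read it back through Instance 2.
        IMap<String, String> mapOnInstance2 = instance2.getMap("animals");
        System.out.println(mapOnInstance2.get("animal")); // prints Giraffe

        Hazelcast.shutdownAll();
    }
}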

Sounds interesting, right?

Let's understand how it works under the hood. Think of our distributed cache not as separate per-instance caches, but as one giant, shared filing cabinet.

  1. This cabinet has a fixed number of drawers, let's say 271 (the default in Hazelcast), numbered 0 to 270.
  2. When we start the application instances, they agree to share the work of managing this cabinet:
     - Instance 1 says, "I'll be responsible for drawers 0 through 135".
     - Instance 2 says, "I'll be responsible for drawers 136 through 270".

Each instance is now the primary owner for its set of drawers. We will come back to what we mean by primary.

Say we want to store a product with key = "abc" and value = Product("abc", "iPhone", "1,35,000", "INR").

Say the request to store this information lands on Instance 1 (by round-robin or some other form of load balancing). Now here is what the Hazelcast library does.

It takes the key, which is "abc", and performs a mathematical calculation on it:

1 hash("abc") % 271 = 150

The logic is simple: take the key, run it through a hash function, and then take the modulo with the total number of drawers. In our case, the result comes out to 150. This means the product with the key "abc" belongs in drawer number 150. This calculation is deterministic: no matter which instance performs it, the key "abc" will always map to drawer 150.
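
As a toy illustration of that determinism, the snippet below mimics the drawer calculation. Hazelcast actually hashes the serialized form of the key rather than calling String.hashCode(), so the real partition id will differ from this number; the point is only that every instance computes the same drawer for the same key:

public class PartitionMathDemo {
    private static final int PARTITION_COUNT = 271; // Hazelcast's default number of partitions

    static int drawerFor(String key) {
        // Deterministic: the same key always lands in the same drawer.
        return Math.abs(key.hashCode()) % PARTITION_COUNT;
    }

    public static void main(String[] args) {
        System.out.println(drawerFor("abc")); // identical output on every instance
    }
}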

Now, the Hazelcast library inside Instance 1 (since the request landed on Instance 1) looks at its "Responsibility Chart" (or partition table), which clearly says that drawers 136 through 270 are managed by Instance 2. But the request to store came to Instance 1. What to do?

It knows that this data doesn't belong to Instance 1 but to Instance 2, so it automatically does two things:

  1. It packages up the product object.
  2. Sends it over the network to the Hazelcast library running inside Instance 2.

When Instance 2 receives the request, it opens its virtual drawer number 150 and stores the product object into its own memory (its JVM heap). But, for fault tolerance, it immediately sends a copy of the data to another member (in this case, Instance 1) to hold as a backup. Now, if Instance 2 crashes, the data for drawer 150 is not lost. Instance 1 has a copy, and the cluster can promote it to be the new primary.

Note: The put operation on Instance 1 doesn’t officially complete until Instance 2 confirms it has the data and Instance 1 confirms it has the backup copy.

The primary is the member that holds the most up-to-date value for a particular key. The other members might only be holding a backup copy, which might not be fully up to date.
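
How many backup copies each partition keeps is configurable per map. Below is a sketch using an illustrative map name; one synchronous backup is also the default:

import com.hazelcast.config.Config;
import com.hazelcast.config.MapConfig;

public class BackupConfigDemo {

    static Config withOneSyncBackup() {
        Config config = new Config();
        MapConfig mapConfig = new MapConfig("configurations")
                .setBackupCount(1)        // one synchronous backup: the put waits for it, as noted above
                .setAsyncBackupCount(0);  // no additional asynchronous backups
        config.addMapConfig(mapConfig);
        return config;
    }
}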

So when a get request comes in for the key "abc", the same process happens again. The instance that receives the request, let's say Instance 3, first calculates the partition: hash("abc") % 271 = 150. It then consults the partition table, sees that Instance 2 is the primary owner of partition 150, and transparently forwards the request to Instance 2 to retrieve the data.
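
If you want to see this partition table for yourself, Hazelcast exposes it through its PartitionService. Here is a small sketch, assuming a recent Hazelcast version and a running member:

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.partition.Partition;

public class PartitionOwnerDemo {
    public static void main(String[] args) {
        HazelcastInstance member = Hazelcast.newHazelcastInstance();

        // Which drawer does "abc" fall into, and which member currently owns it?
        Partition partition = member.getPartitionService().getPartition("abc");
        System.out.println("Partition id: " + partition.getPartitionId());
        System.out.println("Primary owner: " + partition.getOwner());

        Hazelcast.shutdownAll();
    }
}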

I hope this clarifies how data is stored and retrieved.

.   .   .

In the next article, we’ll get our hands dirty with code. I’ll walk you through building a complete Spring Boot application that uses Hazelcast to solve our 900ms configuration problem. Follow me to get notified when it’s published!

.   .   .


© 2025 Kishan Kumar. All rights reserved.