Thanks for the shoutout, Saurabh.
Interesting use of CAS to ensure the client can retry from the very beginning.
It's definitely a nice approach. Glad you found it useful 👍
Amazing post. Thanks, Saurabh!
For minimizing data divergence during concurrent updates, I really liked the idea of CAS (compare-and-swap). I had read about it in the context of operating systems. It was nice to learn how it's used in distributed systems.
Thanks for the great feedback Ashwani!
Hi Saurabh, great read. One question!
You wrote "Concurrent modifications can occur when routers and cache updaters try to update a cache entry.", but the Espresso router never directly updates Couchbase as per the architecture, right? It's either via the cache updater or cache bootstrapping. Please correct me if I have missed something.
Thanks Sounak!
Actually, the router also tries to update the Couchbase cache when there's a cache miss. Here is the exact point from the Read Path section:
"In case of a cache miss, the request is served by the storage node. The router returns the profile information to the backend.
Lastly, the router upserts the data asynchronously into the cache."
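To make the quoted read path concrete, here is a rough sketch of it in Python. Everything in it is an assumption for illustration: `Router`, `read`, and `pending` are hypothetical names, and plain dicts stand in for the Couchbase cache and the storage node. It only shows the order of steps (check the cache, fall back to storage on a miss, upsert into the cache in the background), not how Espresso actually implements them.

```python
from concurrent.futures import ThreadPoolExecutor


class Router:
    """Hypothetical router: dicts stand in for the cache and the storage node."""

    def __init__(self, cache, storage):
        self.cache = cache
        self.storage = storage
        self._pool = ThreadPoolExecutor(max_workers=2)
        self.pending = []  # futures for background upserts, kept so callers can wait

    def read(self, key):
        cached = self.cache.get(key)
        if cached is not None:
            return cached  # cache hit: no trip to storage

        # Cache miss: the request is served by the storage node.
        value = self.storage[key]

        # Upsert into the cache asynchronously so the caller is not blocked.
        future = self._pool.submit(self.cache.__setitem__, key, value)
        self.pending.append(future)
        return value
```

Because this background upsert can race with the cache updater writing the same key, the router is one of the writers that the concurrent-modification handling has to account for.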
Thanks for this, Saurabh. I have just a couple of basic questions:
1. Since we are using Couchbase, which won't be in memory if I am correct, it would be using the disk. Wouldn't that increase the latency?
2. Is the cache miss rate lower here because Couchbase is used with a large amount of memory?
Hi Anubhav
1 - Yes, latency will be higher as compared to the in-memory cache (like the OHC). But it will still be much less when compared to the DB.
2 - Cache misses are fewer because Couchbase is a distributed cache, whereas OHC was confined to a particular router instance. Plus, they also make sure to keep the cache updated using CDC and the bootstrap process.
Great stuff, Saurabh!
I love how they made sure that Couchbase is healthy because the alternative is just to fall back to DB reads/writes.
As for Minimizing Data Divergence, is that something you have to implement in these systems, or are they shipped as simple config options as part of these solutions?
This post is also a good reminder to introduce such tech when you actually hit a limit with your current solution, which is – for most SaaS – never.
Thanks Akos!
And yes, it was important to avoid fallback as it would make the cache useless when it's needed the most.
With regard to minimizing the divergence, a lot of it has to be done by the team. For example, they implemented periodic bootstrapping of the cache using Brooklin. They also used a Couchbase versioning feature to implement compare-and-swap. It all depends on their service level objective concerning divergence.
Of course, as you said, very few companies reach the scale of LinkedIn to implement such solutions. In fact, LinkedIn also implemented these solutions only when they really needed them.
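For anyone curious what compare-and-swap on a versioned cache entry looks like in practice, here is a minimal sketch. It does not use the Couchbase SDK: `InMemoryCache`, `Entry`, `CasMismatch`, and `update_with_retry` are hypothetical names, and the version counter is an assumption that only mimics the idea of a per-entry version. The point is the pattern: read the entry with its version, write back only if the version is unchanged, and retry from the read if another writer got there first.

```python
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    value: dict
    cas: int  # version token; changes on every successful write


class CasMismatch(Exception):
    """Raised when the entry changed since the caller last read it."""


class InMemoryCache:
    """Hypothetical stand-in for a cache that supports versioned writes."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()
        self._next_cas = 1

    def get(self, key):
        with self._lock:
            return self._data.get(key)

    def replace(self, key, value, cas):
        # Write only if the stored version matches the one the caller saw.
        # A cas of 0 means "I expect the key to be absent".
        with self._lock:
            current = self._data.get(key)
            current_cas = current.cas if current else 0
            if current_cas != cas:
                raise CasMismatch(key)
            entry = Entry(value, self._next_cas)
            self._next_cas += 1
            self._data[key] = entry
            return entry


def update_with_retry(cache, key, mutate, max_attempts=5):
    """Read, modify, and write back; start over if another writer won the race."""
    for _ in range(max_attempts):
        current = cache.get(key)
        old_value = current.value if current else {}
        old_cas = current.cas if current else 0
        try:
            return cache.replace(key, mutate(dict(old_value)), old_cas)
        except CasMismatch:
            continue  # someone else updated the entry; re-read and retry
    raise RuntimeError(f"gave up updating {key!r} after {max_attempts} attempts")
```

A writer holding a stale version gets a `CasMismatch` instead of silently overwriting newer data, which is what keeps concurrent writers from making the cache diverge.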