SDC#26 - Intro to Change Data Capture

Consistent Hashing and Tricky Caching Issues...

Jan 30, 2024

Hello, this is Saurabh…👋

Welcome to the 578 new subscribers who have joined us since last week.

If you aren’t subscribed yet, join 5300+ curious Software Developers looking to expand their system design knowledge by subscribing to this newsletter.

This week I’m trying a slightly different format where you get a chance to vote for what you’d like to read more about.

In this edition, I cover the following topics:

🖥 Change Data Capture

🎯 Consistent Hashing

⏰ Caching Issues

So, let’s dive in.

🖥 Change Data Capture

Have you ever received notifications for your bank account transactions?

I’m sure you have.

Each notification contains the details of a specific transaction: the credit or debit amount, the parties involved in the transaction and a reference number. The bank doesn’t send you the entire transaction history of your account and ask you to figure out what’s new.

In other words, the bank communicates only what has changed.

The same approach is followed in Change Data Capture.

How does CDC work?

Change Data Capture is a data replication method that identifies and tracks changes to data in a table so that the data can be moved to other systems in real time.

At a high level, CDC works by tracking changes in a source dataset and automatically transferring those changes to a target dataset.

The below diagram shows a typical CDC setup for reference:

You can play around with the diagram on Eraser.io

As you can see, the CDC process monitors any insert, update or delete transactions on the source database.

The captured changes can then be delivered to multiple consumers (such as another database, data warehouse, cache and so on) via some sort of messaging system.

Here are the 5 main steps:

1. Create an initial snapshot and load it into the target system.
2. Enable CDC on the source database.
3. Identify the tables to be replicated.
4. Start the CDC replication process.
5. Monitor the replication process.
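Here is a minimal sketch of the idea behind these steps, assuming the source and target are plain in-memory dictionaries. The `ChangeEvent` class and the `apply_change` function are hypothetical names made up for this illustration and don't belong to any real CDC tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChangeEvent:
    op: str                      # "insert", "update" or "delete"
    key: int                     # primary key of the affected row
    row: Optional[dict] = None   # new row contents; None for deletes


def apply_change(target: dict, event: ChangeEvent) -> None:
    """Apply one captured change to the target dataset."""
    if event.op in ("insert", "update"):
        target[event.key] = event.row
    elif event.op == "delete":
        target.pop(event.key, None)
    else:
        raise ValueError(f"unknown operation: {event.op}")


# Initial snapshot of the source, loaded into the target.
source = {
    1: {"name": "Alice", "balance": 100},
    2: {"name": "Bob", "balance": 50},
}
target = {key: dict(row) for key, row in source.items()}

# Changes captured on the source after the snapshot, in commit order.
changes = [
    ChangeEvent("update", 1, {"name": "Alice", "balance": 80}),
    ChangeEvent("insert", 3, {"name": "Carol", "balance": 20}),
    ChangeEvent("delete", 2),
]

# Only the changes travel to the target, never the full dataset again.
for event in changes:
    apply_change(target, event)

print(target)
```

In a real setup, the list of changes would arrive through a messaging system and each consumer would apply them in its own way.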

Types of CDC

While the overall concept of CDC is the same, there are multiple approaches to implement it.

The most important ones are:

Log-Based: Transactional databases log all changes into special files known as transaction logs. These logs are used to publish the changes to target systems using message queues.
Trigger-Based: Triggers are stored procedures that are automatically executed when a specific event occurs on a table. The triggers capture data changes in a shadow table or publish them to a message queue.
Timestamp-Based: A special column (such as last_modified) is added to the table to reflect the most recent change. The CDC process can query this column to get the records updated since its last execution time.
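The trigger-based and timestamp-based approaches can be sketched with Python's built-in `sqlite3` module. The `accounts` table, the `accounts_changes` shadow table and the integer "logical" timestamps are all assumptions made for this example. A real system would more likely use actual timestamps and its own database's trigger syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE accounts (
    id INTEGER PRIMARY KEY,
    balance INTEGER NOT NULL,
    last_modified INTEGER NOT NULL   -- column used by timestamp-based CDC
);

-- Shadow table filled by the triggers (trigger-based CDC).
CREATE TABLE accounts_changes (
    change_id INTEGER PRIMARY KEY AUTOINCREMENT,
    op TEXT NOT NULL,
    account_id INTEGER NOT NULL,
    balance INTEGER
);

CREATE TRIGGER accounts_ins AFTER INSERT ON accounts BEGIN
    INSERT INTO accounts_changes (op, account_id, balance)
    VALUES ('insert', NEW.id, NEW.balance);
END;

CREATE TRIGGER accounts_upd AFTER UPDATE ON accounts BEGIN
    INSERT INTO accounts_changes (op, account_id, balance)
    VALUES ('update', NEW.id, NEW.balance);
END;

CREATE TRIGGER accounts_del AFTER DELETE ON accounts BEGIN
    INSERT INTO accounts_changes (op, account_id, balance)
    VALUES ('delete', OLD.id, NULL);
END;
""")

# Some activity on the source table (the last value is a logical timestamp).
conn.execute("INSERT INTO accounts VALUES (1, 100, 10)")
conn.execute("INSERT INTO accounts VALUES (2, 50, 11)")
conn.execute("UPDATE accounts SET balance = 80, last_modified = 20 WHERE id = 1")
conn.execute("DELETE FROM accounts WHERE id = 2")

# Trigger-based: read the captured changes from the shadow table.
trigger_changes = conn.execute(
    "SELECT op, account_id, balance FROM accounts_changes ORDER BY change_id"
).fetchall()

# Timestamp-based: fetch rows modified since the last CDC run.
last_run = 15
timestamp_changes = conn.execute(
    "SELECT id, balance FROM accounts WHERE last_modified > ?", (last_run,)
).fetchall()

print(trigger_changes)
print(timestamp_changes)
```

In this sketch, the timestamp query never sees the deleted row, because the row is no longer there to query. The shadow table, on the other hand, records the delete.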

👉 This was a brief intro to CDC.

Please answer the anonymous poll to show your interest in this topic and help improve System Design Codex.

🎯 Consistent Hashing

The concept of Consistent Hashing appears simple.

But I’ve also seen people struggle while trying to understand it in the beginning.

Consistent Hashing is a technique used for distributing keys uniformly across a cluster of nodes.

The goal of Consistent Hashing is to minimize the number of keys that need to be moved around when we add or remove a node from the cluster.
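To get a feel for why key movement matters, here is a small sketch of a naive baseline that is not part of Consistent Hashing: placing each key on node `hash(key) % N`. The `stable_hash` and `modulo_node` helpers, as well as the key names, are made up for this illustration.

```python
import hashlib


def stable_hash(value: str) -> int:
    """Hash a string to an integer that is the same on every run."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16)


def modulo_node(key: str, node_count: int) -> int:
    """Naive placement: node index is the hash modulo the node count."""
    return stable_hash(key) % node_count


keys = [f"key-{i}" for i in range(1000)]

# Count the keys that land on a different node when growing from 4 to 5 nodes.
moved = sum(1 for key in keys if modulo_node(key, 4) != modulo_node(key, 5))
moved_fraction = moved / len(keys)

print(f"{moved_fraction:.0%} of keys changed nodes")
```

With this naive scheme, a key stays put only if its hash gives the same remainder for both node counts, so most keys end up moving. That is the cost Consistent Hashing tries to avoid.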

Below are the steps to demonstrate how Consistent Hashing works:

STEP 1

The keys are hashed using a hash function.

The output range of the hash function is treated as a fixed circular space, or ring. For example, in the below diagram, K1, K2, K3 and K15 are the positions of the keys on the hash ring.

STEP 2

Next, the servers or nodes are also hashed using the IP address or the domain name as input.

We use the same hashing function to determine their respective positions on the ring.

See the below diagram:

STEP 3

Lastly, for every key, we traverse the ring in a clockwise direction starting from the position of the key.

Once a node is found, we store the key on that node.

See the below diagram:

And that’s basically it.
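The three steps can be sketched in a few lines of Python. This is a minimal illustration and not a production implementation: the `HashRing` class, the use of MD5 as the hash function and the IP addresses are assumptions made for the example.

```python
import bisect
import hashlib


def ring_position(value: str) -> int:
    """STEP 1 and 2: the same hash function places keys and nodes on the ring."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes=()):
        self._positions = []   # sorted positions of the nodes on the ring
        self._nodes = {}       # position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        pos = ring_position(node)
        if pos in self._nodes:
            return             # node is already on the ring
        bisect.insort(self._positions, pos)
        self._nodes[pos] = node

    def remove_node(self, node: str) -> None:
        pos = ring_position(node)
        self._positions.remove(pos)
        del self._nodes[pos]

    def get_node(self, key: str) -> str:
        """STEP 3: walk clockwise from the key to the first node."""
        if not self._positions:
            raise ValueError("ring has no nodes")
        idx = bisect.bisect_left(self._positions, ring_position(key))
        # Going past the last node wraps around to the first one.
        pos = self._positions[idx % len(self._positions)]
        return self._nodes[pos]


ring = HashRing(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
keys = [f"key-{i}" for i in range(1000)]

before = {key: ring.get_node(key) for key in keys}
ring.add_node("10.0.0.4")
after = {key: ring.get_node(key) for key in keys}

moved = [key for key in keys if before[key] != after[key]]
print(f"{len(moved)} of {len(keys)} keys moved, all of them to the new node")
```

When a node is added, the only keys that move are the ones sitting between the new node and the node before it on the ring. Every other key stays where it was.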

In case you are still a little confused, I’ve created a small video that attempts to demonstrate this concept with animation.

👉 This was a brief intro to Consistent Hashing.

Please answer the anonymous poll to show your interest in this topic and help improve System Design Codex.

⏰ Caching Issues

Caching is a pretty effective approach for boosting the performance of your application.

I discussed various database caching strategies in detail in an earlier post.

SDC#17 - Database Caching Strategies (Saurabh Dashora, December 5, 2023)

However, caching can also have issues.

A couple of important ones are as follows:

Stale Sets

A stale set happens when the data written to the cache is outdated compared to the source-of-truth database.

This scenario can happen quite easily with the Cache-Aside strategy, where an incoming request first checks the cache and, on a miss, fetches the data from the database and inserts it into the cache.

But why do stale sets happen?

See the below sequence diagram that shows the scenario:

You can play around with this sequence diagram on Eraser. It’s generated via code.

The steps leading to this are as follows (from the perspective of a single key):

1. User A tries to read from the cache but gets a cache miss.
2. The query goes to the database, which returns the data.
3. Meanwhile, User B updates the database successfully. It also deletes (invalidates) the record in the cache.
4. Now, User A writes to the cache because of the earlier cache miss. But it writes the data it read earlier, which is now stale because User B’s update has since succeeded.
5. Finally, User C comes along and reads the stale data from the cache.
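The same interleaving can be replayed in a few lines of Python. This is only a single-threaded sketch in which two dictionaries stand in for the database and the cache. The key name and the values "v1" and "v2" are made up for the illustration.

```python
database = {"user:1": "v1"}   # source of truth
cache = {}                    # starts empty
key = "user:1"

# User A reads from the cache and gets a miss.
a_cached = cache.get(key)     # None means cache miss

# User A falls back to the database and reads the current value.
a_value = database[key]

# Meanwhile, User B updates the database and invalidates the cache entry.
database[key] = "v2"
cache.pop(key, None)

# User A now fills the cache with the value it read earlier.
if a_cached is None:
    cache[key] = a_value

# User C reads from the cache and gets the stale value.
c_value = cache.get(key)

print(f"cache has {c_value!r}, database has {database[key]!r}")
```

The cache and the database now disagree, and they will keep disagreeing until the cache entry is invalidated again or expires.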

Thundering Herd

The thundering herd problem occurs in a highly concurrent environment with many users.

The thundering herd problem is triggered when many users request a particular record simultaneously and there’s a cache miss. All of these requests then go on to read the data from the source database.

In other words, despite the use of caching, the database is hit by a thundering herd of users.
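Here is a small simulation of the problem using threads. It is a sketch under a few assumptions: the `threading.Barrier` is there only to make sure every request sees the cache miss before anyone fills the cache, and the names `handle_request`, `read_from_database` and `stats` are hypothetical.

```python
import threading

USERS = 20
KEY = "product:42"

cache = {}
database = {KEY: "product details"}
stats = {"db_reads": 0}
stats_lock = threading.Lock()

# Releases the threads only once all of them have seen the cache miss.
all_missed = threading.Barrier(USERS)


def read_from_database(key: str) -> str:
    with stats_lock:
        stats["db_reads"] += 1
    return database[key]


def handle_request(key: str) -> str:
    value = cache.get(key)
    if value is None:
        all_missed.wait()                 # simulate truly simultaneous requests
        value = read_from_database(key)   # every request hits the database
        cache[key] = value
    return value


threads = [
    threading.Thread(target=handle_request, args=(KEY,)) for _ in range(USERS)
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print(f"{USERS} requests caused {stats['db_reads']} database reads")
```

Even though the record ends up in the cache, every one of the simultaneous requests has already gone to the database by then.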

See the below diagram that shows the concept of Thundering Herd.

You can play around with this diagram on Eraser.io

👉 This was a brief intro to important caching issues such as Stale sets and Thundering herd.