SDC#13 - The Secret Trick to High-Availability
How Uber Managed High-Availability Write Operations and More...
Hello, this is Saurabh…👋
Welcome to the 119 new subscribers who have joined us since last week.
If you aren’t subscribed yet, join 1300+ curious Software Engineers looking to expand their knowledge by subscribing to this newsletter.
In this edition, I cover the following topics:
🖥 System Design Concept → The Secret Trick to High-Availability
🧰 Case Study → How Uber Managed High-Availability Write Operations
🍔 Food For Thought → Diversification for Developers
So, let’s dive in.
🖥 The Secret Trick to High Availability
The secret trick to building highly-available systems is hidden in plain sight.
It’s a costly approach, but big companies use it all the time.
That’s because, at a certain scale, you can’t afford to take chances. One wrong move can collapse your entire business model.
Turns out, we can use the same approach to build our own applications.
But what’s the secret?
It’s called Static Stability.
I know it sounds cool!
But what does it really mean?
Typically, when we run a service in one availability zone (AZ) and that zone is disrupted, we respond by scaling up the service in another availability zone.
This is a Reactive approach.
Static stability encourages us to be Proactive.
In the Proactive approach, we consciously over-provision the infrastructure.
That way, even if an AZ goes down, the remaining capacity is already enough and the system continues to operate without having to launch anything new.
Let’s look at a couple of examples:
Active-Active High Availability
Let’s say you have a public-facing load balancer.
It manages traffic to instances spread across 3 availability zones.
So, how do you make things statically stable?
If you need 2 instances to handle your traffic, you actually create 3, one in each AZ.
Basically, you over-provision by 50%.
This means that even if an entire AZ goes down for some reason, you can still work at full capacity.
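The arithmetic behind this can be sketched in a few lines of Python. This is only an illustration: the function names `instances_per_az` and `total_provisioned` are made up for this example, and it assumes instances are spread evenly across zones.

```python
import math


def instances_per_az(required: int, az_count: int, az_failures: int = 1) -> int:
    """Instances to run in each AZ so that `required` instances are still
    available after losing `az_failures` zones, with no scaling action."""
    surviving = az_count - az_failures
    if surviving <= 0:
        raise ValueError("need at least one surviving AZ")
    # The surviving zones alone must carry the full required capacity.
    return math.ceil(required / surviving)


def total_provisioned(required: int, az_count: int, az_failures: int = 1) -> int:
    """Total instances across all zones under the even-spread assumption."""
    return instances_per_az(required, az_count, az_failures) * az_count


# The example from the text: 2 instances needed, 3 availability zones.
required = 2
total = total_provisioned(required, az_count=3)
overhead = (total - required) / required
print(total, f"{overhead:.0%}")  # 3 50%
```

Notice that with only 2 zones the same calculation gives 100% over-provisioning, so spreading across more zones makes static stability cheaper.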
Here’s what it looks like:
Active-Passive High Availability
Let’s say you are dealing with a stateful service like a database and want to make sure it’s highly available.
In this case, you can create a primary instance in one availability zone. This instance handles all reads and writes.
But to achieve static stability, you create a standby instance in a separate availability zone.
If the availability zone with the primary goes down, the standby instance becomes the new primary.
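Here is a minimal sketch of that promotion step in Python. It is not how any particular database does it. The `DbInstance` class, the `fail_over` function, and the instance and zone names are all hypothetical, and real systems also have to deal with replication lag and failure detection, which are ignored here.

```python
from dataclasses import dataclass


@dataclass
class DbInstance:
    name: str
    az: str
    role: str  # "primary", "standby", or "failed"


def fail_over(instances, failed_az):
    """Return the instance that should serve reads and writes after
    `failed_az` goes down, promoting the standby if necessary."""
    primary = next(i for i in instances if i.role == "primary")
    if primary.az != failed_az:
        return primary  # the primary is unaffected, nothing to do
    standby = next(
        (i for i in instances if i.role == "standby" and i.az != failed_az),
        None,
    )
    if standby is None:
        raise RuntimeError("no healthy standby available")
    primary.role = "failed"
    standby.role = "primary"  # the standby already exists, so nothing new is launched
    return standby


cluster = [
    DbInstance("db-a", "az-1", "primary"),
    DbInstance("db-b", "az-2", "standby"),
]
new_primary = fail_over(cluster, failed_az="az-1")
print(new_primary.name)  # db-b
```

The key point is that the standby was provisioned ahead of time. Failover only changes roles.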
Here’s what it looks like:
The Criticism
Does all of this sound like a waste of resources?
Yes, it might. In fact, this is a common criticism for the proactive approach.
But in truth, a lot of world-famous services rely on static stability to ensure high availability. For example, AWS EC2, S3, and RDS use it, and so do services from other cloud providers.
They can’t operate without this approach.
Imagine the uproar if the AWS EC2 service went down. So many workloads across the world depend on it.
As I mentioned earlier, the entire business model of AWS could collapse.
If your system is statically stable, it continues to work even when a dependency becomes impaired.
So what’s the takeaway?
Outages and disruptions are part of the game on any platform.
It’s your job as a developer to figure out how much availability is needed in the context of the system you are developing.
For mission-critical applications, static stability is a must.
You’d rather over-provision resources than risk downtime.
🧰 How Uber Managed High-Availability Write Operations
In an earlier post, we explored the architecture of Uber’s in-house Schemaless Database.
At the end of that discussion, I touched upon the topic of Buffered Writes.
Buffered Writes allow Uber to accept write requests even when a primary node is down. They minimize the chance of losing data by writing it to multiple clusters.
Here’s how it works:
When the client makes a request, it goes to the request handler.
The request handler sends the write requests to the secondary leader. The data is stored in a special buffer table.
Then, it also sends the write request to the primary leader. The client receives a successful write confirmation only if both writes succeed.
The primary leader’s job is to replicate the data to its followers. However, if the primary leader goes down before the asynchronous replication completes, the secondary leader serves as a temporary backup of the data.
A background worker monitors the primary follower and waits for the record to appear after replication.
Once the record appears on the primary follower, the background worker goes ahead and deletes the record from the buffer table in the secondary leader.
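The steps above can be sketched as a small in-memory simulation. This is not Uber's code: the `Cluster` class, `handle_write`, `background_worker`, and the sample key are all invented for illustration, and plain dictionaries stand in for the leader, follower, and buffer table.

```python
import random


class Cluster:
    """In-memory stand-in for a cluster: a leader, one follower, a buffer table."""

    def __init__(self, name):
        self.name = name
        self.leader = {}
        self.follower = {}
        self.buffer_table = {}

    def replicate(self):
        """Simulate asynchronous replication from leader to follower."""
        self.follower.update(self.leader)


def handle_write(key, value, primary, secondaries):
    """Request handler: buffer the write on a randomly chosen secondary
    leader, then write it to the primary leader."""
    secondary = random.choice(secondaries)
    secondary.buffer_table[key] = value  # temporary backup copy
    primary.leader[key] = value
    return True  # both writes succeeded, so the client gets a confirmation


def background_worker(primary, secondaries):
    """Delete buffered records that have reached the primary follower.
    Returns the number of records removed."""
    removed = 0
    for secondary in secondaries:
        for key in list(secondary.buffer_table):
            if key in primary.follower:
                del secondary.buffer_table[key]
                removed += 1
    return removed


primary = Cluster("primary")
secondaries = [Cluster("secondary-1"), Cluster("secondary-2")]

handle_write("trip:42", {"status": "completed"}, primary, secondaries)
print(background_worker(primary, secondaries))  # 0, not replicated yet
primary.replicate()
print(background_worker(primary, secondaries))  # 1, buffer entry cleaned up
```

Until the record reaches the primary follower, the buffered copy on the secondary leader is what protects it if the primary leader goes down.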
A few important points to note over here:
The number of secondary leaders is configurable.
The secondary leader is chosen at random.
Buffered writes rely on idempotency. If multiple writes arrive with the same identifying fields, the result is the same no matter how many times the request is made.
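To make the idempotency point concrete, here is a rough sketch of a store that ignores repeated writes. The `IdempotentStore` class and the identifying fields used below (a row key and a column name) are assumptions for this example and are not taken from the Uber article.

```python
class IdempotentStore:
    """Toy store where a write is identified by a tuple of fields.
    Repeating a write with the same identifying fields has no further effect."""

    def __init__(self):
        self.records = {}

    def put(self, identifying_fields: tuple, body: dict) -> bool:
        """Return True if the record was stored, False if it was a duplicate."""
        if identifying_fields in self.records:
            return False  # already written, so the retry is a no-op
        self.records[identifying_fields] = body
        return True


store = IdempotentStore()
write_id = ("trip-42", "status")  # hypothetical identifying fields

print(store.put(write_id, {"state": "completed"}))  # True
print(store.put(write_id, {"state": "completed"}))  # False
print(len(store.records))  # 1
```

This is what makes it safe to replay a write from the buffer table: if the primary already has the record, the replay changes nothing.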
P.S. This post is inspired by the explanation provided on the Uber Engineering Blog. However, the diagrams have been drawn or re-drawn based on the information shared to make things clearer. You can find the original article over here.
🍔 Food For Thought
👉 Diversification for Developers
Deep technical skills in a particular area are extremely valuable in the job market.
But some amount of diversification can be a tremendous value addition to your core technical skills.
However, it’s NOT always a matter of taking big drastic steps to diversify your knowledge.
Here’s a little nugget I posted on X that suggests a path you can take:
The link to the post:
https://x.com/ProgressiveCod2/status/1717083116291182903?s=20
👉 Do Real Developers read Docs?
We’ve all been there at some point.
Why read the docs when you can spend dozens of hours trying to find something by going through thousands of lines of code? 😅
However, over the years, I’ve started enjoying the act of reading the official documentation of programming languages or frameworks (if they are well-written, of course).
And I must say that it has helped me save time on multiple occasions.
What about you? Do you like reading docs?
That’s it for today! ☀️
Enjoyed this issue of the newsletter?
Share with your friends and colleagues
See you later with another value-packed edition — Saurabh.