Hello, this is Saurabh…👋
Welcome to the 90 new subscribers who have joined us since last week.
If you aren’t subscribed yet, join 1100+ curious developers looking to expand their knowledge by subscribing to this newsletter.
In this edition, I cover the following topics:
🖥 System Design Concept → How Does Request Coalescing Work?
🧰 Case Study → Scaling Cron Scripts at Slack
🍔 Food For Thought → How to Bring Change to Your System?
So, let’s dive in.
🖥 How Does Request Coalescing Work?
This brilliant technique for handling database queries literally saved Discord.
It helped them store trillions of messages and fetch them without bringing their database cluster to its knees.
The technique is called Request Coalescing.
And it’s a technique that you simply can’t ignore.
But what’s so special about it?
If multiple users are requesting the same row at the same time, why not query the database only once?
And that’s exactly what Request Coalescing helps us achieve.
In a typical Request Coalescing setup, you build special data services.
These data services are basically intermediary services that sit between the API layer and the database cluster.
Check the below illustration:
For reference, the above architecture was used by Discord to solve their issue with hot partitions. We spoke about it in an earlier edition.
These data services implement Request Coalescing.
Here’s what happens under the hood:
The first user to make a request causes a worker task to spin up in the data service.
Subsequent requests for the same data check whether that task already exists and, if it does, subscribe to it.
Once the initial worker task queries the database and gets the result, it will return the row to all subscribers at the same time.
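To make these three steps concrete, here’s a minimal sketch in Python using asyncio. It’s my own illustration and not Discord’s code: the `Coalescer` class, the `fake_db_fetch` function, and the `msg-1` key are all made-up names, and error handling and cancellation are left out.

```python
import asyncio


class Coalescer:
    """Coalesces concurrent lookups for the same key into one fetch."""

    def __init__(self, fetch):
        self._fetch = fetch      # async function: key -> row (hypothetical)
        self._in_flight = {}     # key -> asyncio.Task that is currently running

    async def get(self, key):
        task = self._in_flight.get(key)
        if task is None:
            # First requester for this key: spin up the worker task.
            task = asyncio.create_task(self._run(key))
            self._in_flight[key] = task
        # Every requester, first or later, waits on the same task.
        return await task

    async def _run(self, key):
        try:
            return await self._fetch(key)
        finally:
            # Forget the task so that later requests trigger a fresh query.
            self._in_flight.pop(key, None)


async def demo():
    calls = 0

    async def fake_db_fetch(key):
        # Stand-in for a real database query (assumption for this sketch).
        nonlocal calls
        calls += 1
        await asyncio.sleep(0.05)   # simulate query latency
        return {"id": key, "body": "hello"}

    coalescer = Coalescer(fake_db_fetch)
    # 100 concurrent requests for the same row.
    rows = await asyncio.gather(*(coalescer.get("msg-1") for _ in range(100)))
    return calls, rows
```

Calling `asyncio.run(demo())` fires 100 concurrent requests for the same key, yet `fake_db_fetch` should run only once. Notice that nothing is kept after the task finishes, which is exactly what separates this from a cache.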
The animated illustration below should make the whole concept clearer to understand:
Now, Request Coalescing is pretty neat.
But it also generates a lot of debate.
When I posted about it on social media, there were all sorts of questions regarding the very existence of a technique like this.
Here’s a summary of the most important questions and their answers:
1 - How is Request Coalescing different from Caching?
This was the most prominent question that was brought up.
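The simple answer is that with request coalescing, only one requester triggers the actual query to the database. The rest just subscribe to its result. With caching, every request still performs its own lookup against the cache, and if the entry is missing, concurrent requests can all fall through to the database at once.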
2 - Why not use Caching instead of Request Coalescing?
The second most common objection!
The answer is that request coalescing isn’t a competitor to caching at all. You can coalesce requests on top of a cache just as well as on top of a database.
The whole point of request coalescing is to reduce the number of requests hitting a data source.
3 - How does it work internally in the case of Discord?
A far more interesting question, I must say.
Though Discord hasn’t officially revealed its implementation details, I found a few more juicy tidbits by scouring some forums.
Each worker task has its own local state, which is primarily just a Hashmap storing requests and a list of senders waiting for the response.
Whenever a response comes in, they remove the request from the Hashmap and propagate the result to all the requesters waiting for the response.
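Going by that description, the bookkeeping might look something like the sketch below. To be clear, this is my guess at the shape and not Discord’s implementation: `PendingRequests`, `get`, and the blocking `query` callable are names I made up, and I’m using Python threads with `Future` objects as stand-ins for the “senders”.

```python
import threading
from concurrent.futures import Future


class PendingRequests:
    """A hashmap of in-flight requests, each with a list of waiting senders."""

    def __init__(self, query):
        self._query = query      # blocking function: key -> row (hypothetical)
        self._lock = threading.Lock()
        self._pending = {}       # key -> list of Futures waiting for that key

    def get(self, key):
        waiter = Future()
        with self._lock:
            waiters = self._pending.get(key)
            is_first = waiters is None
            if is_first:
                self._pending[key] = [waiter]
            else:
                waiters.append(waiter)
        if is_first:
            # Only the first requester actually runs the query.
            self._resolve(key)
        # Everyone, including the first requester, waits on their own Future.
        return waiter.result()

    def _resolve(self, key):
        try:
            outcome, error = self._query(key), None
        except Exception as exc:
            outcome, error = None, exc
        # Remove the request from the hashmap, then fan the result out.
        with self._lock:
            waiters = self._pending.pop(key)
        for w in waiters:
            if error is None:
                w.set_result(outcome)
            else:
                w.set_exception(error)
```

A request that arrives after the entry has been removed simply starts a new query, so nothing stale is ever served.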
4 - Why should we even bother about Request Coalescing since it’s not applicable to us?
I must confess that this was the toughest question to answer.
Of course, I get the sentiment behind the question. Most applications never reach the scale at which a technique like Request Coalescing comes in handy.
The concurrency problems described by Discord were extreme enough to warrant such a solution.
Should we bother to read about it? Well - I guess it’s a personal choice.
🧰 Scaling Cron Scripts at Slack
Slack is a messaging platform for efficient team collaboration.
Their success depends on the right message reaching the right person on time. And for that, notifications are extremely important.
A lot of their functionality relies on cron scripts. These scripts ensure:
Timely reminders
Email notifications
Message notifications
Database clean-up
For those of you who don’t know, cron jobs are used to schedule and automate repetitive tasks. These jobs ensure that specific scripts or commands run at predefined intervals without any manual intervention.
As the platform expanded, both the number of cron scripts and the amount of data they processed grew enormously. This led to a dip in the reliability of the overall execution environment.
The Issues
Here’s a summary of the issues Slack was facing with their cron jobs:
A single node executed all the scripts locally. It kept a copy of the scripts and one crontab file with the schedules. At scale, this solution wasn’t easy to maintain.
Vertically scaling the node by adding more CPU and RAM to support more cron scripts stopped being cost-effective.
The individual node was a single point of failure. Any configuration issues could bring down critical Slack functionality.
To solve these issues, it was decided to build a more reliable and scalable cron execution service.
The System Components
There were 3 main components of the new cron service.
The below illustration shows the high-level details:
Let’s look at each component one by one.
1 - Scheduled Job Conductor
This is a new service written in Go and deployed on Bedrock (Slack’s in-house wrapper around Kubernetes).
It basically mimics the cron behavior by using a Go-based cron library. Deploying it on Bedrock allows them to scale up multiple pods easily.
Interestingly, they don’t spread the scheduling across multiple pods. Only one pod takes care of the scheduling while the others remain in standby mode.
While this may feel like intentionally having a single point of failure, the Slack team felt that synchronizing the nodes would be a bigger headache. This was supported by two additional points:
Firstly, pods can switch leaders very quickly in case the leader pod goes down. This made downtime quite unlikely.
Secondly, they offload all memory and CPU-intensive work of running the scripts to Slack’s Job Queue. The pod is just for scheduling.
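If the leader-and-standby idea feels abstract, here’s a toy sketch of it in Python. I’m assuming a lease with an expiry time, which is a common way to do this, but the source doesn’t describe the exact mechanism Slack uses. `Lease`, `tick`, and the pod names are all hypothetical, and the lease lives in memory here instead of in a shared store.

```python
import threading


class Lease:
    """In-memory stand-in for a shared lock with an expiry (hypothetical)."""

    def __init__(self, ttl_seconds):
        self._ttl = ttl_seconds
        self._holder = None
        self._expires_at = 0.0
        self._lock = threading.Lock()

    def try_acquire(self, pod_name, now):
        """Acquire or renew the lease. Returns True if pod_name is the leader."""
        with self._lock:
            expired = now >= self._expires_at
            if self._holder is None or self._holder == pod_name or expired:
                self._holder = pod_name
                self._expires_at = now + self._ttl
                return True
            return False


def tick(pod_name, lease, now, schedule_due_jobs):
    """One loop iteration for a pod: only the leader schedules anything."""
    if lease.try_acquire(pod_name, now):
        # The leader only decides what is due; the heavy lifting is
        # handed off elsewhere (the Job Queue, in Slack's case).
        schedule_due_jobs(pod_name)
        return "leader"
    return "standby"
```

As long as the leader keeps renewing the lease, the standby pods do nothing. If the leader stops renewing, the lease expires and the next pod to call `tick` takes over.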
Here’s what it looks like in practice:
2 - Slack Job Queue
The Job Queue is an existing component that serves a bunch of requirements at Slack.
Basically, it’s an asynchronous compute platform that runs about 9 billion jobs per day and consists of multiple “queues”.
These queues are like logical pathways to move jobs through Kafka into Redis where the job metadata is stored.
From Redis, the job is finally handed over to a job worker. The worker is a node that actually executes the cron job.
See the below illustration:
Since this system already existed and could handle the compute and memory load, it was easy for the team to adapt it to handle the cron jobs as well.
3 - Vitess Database Table
Lastly, they employed a Vitess table to handle the job data, particularly for two purposes:
Handling deduplication
Reporting job tracking to internal users
For those of you who may not be aware, Vitess is a scalable, MySQL-compatible, cloud-native database.
In the new system, each job execution is recorded as a new row in a table. Also, the job’s status is updated as it moves through various stages (enqueued, in progress, done).
See the below illustration:
Before starting a new run of a job, the system checks that another instance of the same job isn’t already running.
This table also serves as the backend for a simple web page that displays cron script execution information. It allows the users to look up the state of their script runs and any errors they encounter.
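Here’s a rough sketch of how such a table could support both deduplication and status tracking. I’m using SQLite in place of Vitess so that it runs anywhere, and the `job_runs` table, its columns, and the function names are my own assumptions, not Slack’s schema. With several writers you’d also need a stronger guard around the check-then-insert step, which I’ve left out.

```python
import sqlite3


def open_db():
    """Create an in-memory database with a hypothetical job_runs table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        """CREATE TABLE job_runs (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               job_name TEXT NOT NULL,
               status TEXT NOT NULL      -- enqueued, in progress, done
           )"""
    )
    return db


def try_enqueue(db, job_name):
    """Record a new run unless one is still active. Returns run id or None."""
    with db:  # commits on success, rolls back on error
        active = db.execute(
            "SELECT 1 FROM job_runs "
            "WHERE job_name = ? AND status != 'done' LIMIT 1",
            (job_name,),
        ).fetchone()
        if active:
            return None  # deduplicated: a run is already enqueued or running
        cur = db.execute(
            "INSERT INTO job_runs (job_name, status) VALUES (?, 'enqueued')",
            (job_name,),
        )
        return cur.lastrowid


def set_status(db, run_id, status):
    """Move a run to its next stage."""
    with db:
        db.execute(
            "UPDATE job_runs SET status = ? WHERE id = ?", (status, run_id)
        )
```

Each run is a new row, and a second `try_enqueue` for the same job returns `None` until the earlier run is marked `done`. The same rows could then back a simple status page.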
P.S. This post is inspired by the explanation provided on the Slack Engineering Blog. However, the diagrams have been drawn or re-drawn based on the information shared to make things clearer. You can find the original article over here.
🍔 Food For Thought
👉 How to bring change to your current system?
Have you ever been in a situation where you had a wonderful idea for your system or application, but no one was ready to listen to it?
And if that’s the case, how can you even think of bringing change?
Here’s a simple trick that has worked for me time and again.
Link to the post:
https://x.com/ProgressiveCod2/status/1712424160024997909?s=20
👉 Is there anything worse than a prod issue on Monday? 😅
No one wants to deal with a production issue in the best of times.
But you know what’s worse?
A production issue on Monday where the “last modified by” field shows your name.
That’s it for today! ☀️
Enjoyed this issue of the newsletter?
Share with your friends and colleagues
See you later with another value-packed edition — Saurabh.