Database Sharding

Introduction and Guiding Principles

Aug 27, 2024

Sharding is a strategy to horizontally scale the database.

It evolves out of horizontal partitioning in which you separate the rows of one table into multiple different tables, known as partitions.

See the example below:

You can play around with the diagram on Eraser.io

But what’s the point of partitioning data?

The goal is to reduce the effort of querying data. Lesser the number of records a query has to run over, the better will be the performance.

Logical vs Physical Sharding

When you start with sharding, you generally start with logical shards (also known as partitions). Most databases have out-of-the-box support for partitioning.

Logical shards are horizontal partitions of your data based on some strategy. They sit on the same database instance or server. However, when a single database instance is not able to handle the workload even after partitioning, you need to go for physical sharding.

In physical sharding, each shard is hosted on a different node or server instance. However, a physical shard can contain one or more logical shards.

Here’s what it may appear like in practice.

While this may look complicated, there are scenarios where systems need to embrace this complexity.

For example, consider a large e-commerce platform operating at a global scale.

The platform must store and manage a large amount of data related to product listings, customer orders, and inventory across multiple geographic regions.
The database must be highly available, scalable, and performant enough to handle the high request and transaction volumes. Remember - these requests occur across different regions.

Sharding can help the platform achieve its goals. For example:

You can use physical sharding to divide the data into multiple shards. Different shards handle different geographic regions making the data distributed.
Within each physical shard, you can use logical sharding to group related data together. Think of data such as orders for specific product categories.
This can help improve query performance by minimizing the amount of data that must be accessed for each request. Also, the write or update operations are distributed across multiple servers.

Remember that the goal of sharding is to help your application deal with fewer data for a given request.

Sharding Strategies

There are multiple sharding strategies that you can adopt depending on your requirements.

Key-Based Sharding

In this strategy, you partition the data based on a hash function applied to one or more columns in the table. That’s why this type of sharding is also known as hash-based sharding.

Usually, you apply the function to the primary key column but you can also use a combination of columns. The result of the hash function determines which shard the data will be stored in.

A very simple hash function might look as below:

function shardNumber(num, numShards) {
  const shardSize = Math.ceil(num / numShards);
  return Math.floor(num / shardSize);
}

Here’s what key-based sharding looks like in practice.

Key-based sharding is great for tables where the data is not naturally partitioned and you want to evenly distribute the data across multiple shards.

Range-Based Sharding

In this strategy, you partition the data based on a specific range of values in a particular column.

The below diagram makes it easier to understand.

Here, you have a product table that contains a big list of products with their prices. You can create different shards based on the price range they fall into.

For example, shard 1 contains all products within the price band of $0-$75. Shard 2 contains all products within the price band of $76-$150.

The main benefit of range-based sharding is the ease of implementation. However, it’s far from perfect and we will look at the downsides in the last section of this post.

Directory-Based Sharding

In this strategy, you perform sharding based on a lookup table.

Think of this table as a directory (similar to an address book) that holds the relation between the data and the specific shard where you can find it. Hence, the name directory-based sharding.

The diagram below explains it better.

In the above example, the Location field acts like a shard key.

Data from the shard key is written to a lookup table that maps the key to a particular shard. For example, data for the USA location is stored in shard 1, and so on.

The main benefit of directory-based sharding is higher flexibility when compared to the other strategies.

Benefits and Drawbacks of Sharding

Sharding is usually not the first choice to achieve database scalability.

It has its share of benefits and drawbacks you must look into before betting your money on sharding.

Here are the main benefits of sharding:

Sharding helps you achieve horizontal scaling or scaling out of the database.
With sharding, you can reduce query response times since the queries run on a smaller set of data.
Sharding also helps make an application more reliable. Even if a shard goes down, the entire application does not stop functioning.

There are also several drawbacks of sharding:

Sharding makes your application quite complex. Instead of one place, the data is scattered across multiple servers.
Even after you have implemented a great sharding strategy, the shards can get unbalanced over time resulting in hotspots.
With sharding, you can practically say goodbye to joining data across multiple shards.

Before Adopting Sharding

Some key considerations you should have before you decide to use sharding:

When to go for sharding?

At the bare minimum, you should not go for premature sharding.

Sharding is a complex business and it should never be the first option on the table when your application faces problems with database scalability.

You should look at sharding only when you have exhausted other options like proper indexing, replication, and partitioning.

Designing Hash Function

Key-based sharding is typically the most popular sharding strategy. However, the success of this strategy depends on the hash function.

Few things to keep in mind while designing a hash function for sharding:

Try to make the hash function fast and efficient.
The hash function should result in a uniform distribution of keys.
Make the hash function deterministic. In other words, it should always produce the same output for a given input.

Choosing a Strategy Based on Type of Requests

Yes, sharding strategies can differ depending on whether reads or writes are more important.

If reads are more important, you should go for a sharding strategy that prioritizes query performance. One approach to support this is to use a range-based sharding strategy that results in more efficient range queries and results in faster query execution.
If writes are more important, your sharding strategy should prioritize write throughput and data consistency. In other words, you can choose a shard key that distributes writes evenly across shards and minimizes the likelihood of hotspots. A hash-based sharding strategy can help with this.
Another approach can be to use a time-based sharding strategy where data is partitioned based on time intervals such as hours or days. This can help distribute write operations evenly. Discord uses this time-based sharding strategy to handle trillions of messages.

When NOT to use a particular sharding strategy?

Sometimes, elimination is a great way to arrive at an ideal solution.

Since all strategies have potential problems, you can arrive at an ideal strategy by rejecting a particular strategy.

Here’s a quick reference you can use:

Avoid key-based sharding if you need to add more shards frequently to rebalance the data. When you add a shard, your hashing function will start to give a different result if it depends on the number of shards.
Avoid range-based sharding if data within each range can be drastically unbalanced. If this happens, you will have an uneven distribution of data resulting in database hotspots.
Avoid directory-based sharding if you don’t want a single point of failure like a lookup table. The entire sharding logic will go for a toss if something goes wrong with the lookup table.

So - have you used Sharding in your application?

Shoutout

Here are some interesting articles I’ve read recently:

That’s it for today! ☀️

Enjoyed this issue of the newsletter?

Share with your friends and colleagues.

See you later with another edition — Saurabh

Akos Komuves

Aug 31, 2024

Did logical sharding once but I think we reached for it too soon. It wasn’t particularly complicated to implement but it gave the system a new property we had to consider constantly. Great overview and thanks for the mention!

Expand full comment

1 reply by Saurabh Dashora

1 more comment...

System Design Codex

Discussion about this post