Like many startups, Airbnb began life as a monolithic application, and that worked well for a while.
The monolithic approach allowed them to get off the ground, test out their product with real users, and achieve a product-market fit.
And the architecture also looked so simple…
However, they soon warped through the wormhole of growth and the monolithic architecture just couldn’t keep up with the scaling demands.
At this point, Airbnb embraced SOA (service-oriented architecture) and migrated most of its backend functionality to dedicated services. Payments, being one of the oldest parts of the system, went through the same transformation, which brought a couple of crucial benefits:
Now, there were clear boundaries between different services. No more fights between teams over who owns which piece of data, because everyone knew their responsibilities and limits. There was less red tape when making changes, which meant faster iterations.
Data separation into domains helped with normalization, resulting in improved correctness and consistency. After all, who doesn’t like consistency?
But there’s no free lunch on the other side of the wormhole.
Normalization helps reduce redundancy and data duplication. You break down a database into multiple tables and define relations between them.
Denormalization is the opposite. You introduce redundancy into a database by combining data from multiple tables into a single table or adding redundant data into an existing table. The result is less need for joins between tables at read time.
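To make the difference concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The `guests`, `payments`, and `payments_denormalized` tables and their columns are made up for illustration and are not Airbnb's schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized: each fact lives in exactly one table.
    CREATE TABLE guests (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE payments (
        id INTEGER PRIMARY KEY,
        guest_id INTEGER REFERENCES guests(id),
        amount_cents INTEGER
    );
    INSERT INTO guests VALUES (1, 'Ada');
    INSERT INTO payments VALUES (10, 1, 12000), (11, 1, 4500);

    -- Denormalized: the guest name is copied onto every payment row.
    CREATE TABLE payments_denormalized (
        payment_id INTEGER PRIMARY KEY,
        guest_name TEXT,
        amount_cents INTEGER
    );
    INSERT INTO payments_denormalized
        SELECT p.id, g.name, p.amount_cents
        FROM payments p JOIN guests g ON g.id = p.guest_id;
""")

# Reading the normalized tables needs a join.
normalized_rows = conn.execute(
    "SELECT p.id, g.name, p.amount_cents "
    "FROM payments p JOIN guests g ON g.id = p.guest_id ORDER BY p.id"
).fetchall()

# Reading the denormalized table is a single-table lookup.
denormalized_rows = conn.execute(
    "SELECT payment_id, guest_name, amount_cents "
    "FROM payments_denormalized ORDER BY payment_id"
).fetchall()
```

Both queries return the same rows. The trade-off is that the denormalized copy of the guest name has to be kept in sync whenever the original changes.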
Challenges of SOA
The payment data was normalized and scattered across multiple payment sub-domains. Each sub-domain became the responsibility of a separate team and it looked like life was sorted.
However, this change had a side effect.
Suddenly, the presentation services had to integrate with multiple services in the payment domain to fetch all the required data.
In other words, what used to be a single API call turned into multiple fetch requests to collect all the data and aggregate it based on the client's requirements.
Needless to say, this created a few problems:
Clients now had to build a fair amount of understanding about the payments domain so that they could call the correct API. It’s like going to a restaurant and, rather than simply ordering the pizza you love, having to explain to the chef which ingredients go into that pizza and how they should be combined. Not what you expect when going to a restaurant!
Since the payments domain was split up, there were many instances where a change spanned multiple teams. Prioritizing a change request where multiple teams are involved is never straightforward and it negatively impacts time to market.
Due to multiple integrations, the performance, reliability, and scalability of the overall system weren’t at the expected levels.
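The fan-out described above can be sketched roughly as follows. The three service functions, their names, and the returned fields are hypothetical stand-ins, and each function call represents one network round trip that the client has to know about.

```python
# Hypothetical stand-ins for three payment sub-domain services.
# Every call is logged to show how many round trips one view needs.
calls = []

def get_payin(reservation_id):
    calls.append("payin-service")
    return {"reservation_id": reservation_id, "amount_cents": 12000,
            "instrument_id": "card_1"}

def get_instrument(instrument_id):
    calls.append("instrument-service")
    return {"instrument_id": instrument_id, "label": "Visa ending 4242"}

def get_refunds(reservation_id):
    calls.append("refund-service")
    return [{"reservation_id": reservation_id, "amount_cents": 2000}]

def build_transaction_view(reservation_id):
    """Client-side aggregation: the caller must know which services to
    call, in which order, and how to combine the results."""
    payin = get_payin(reservation_id)
    # This call depends on the previous response, so it cannot run in parallel.
    instrument = get_instrument(payin["instrument_id"])
    refunds = get_refunds(reservation_id)
    refunded = sum(r["amount_cents"] for r in refunds)
    return {
        "reservation_id": reservation_id,
        "paid_with": instrument["label"],
        "net_amount_cents": payin["amount_cents"] - refunded,
    }

view = build_transaction_view("res_1")
```

One logical read costs three service calls here, and a failure or slowdown in any one of them affects the whole view.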
Creating a Unified Payments Data Read Layer
Naturally, this situation couldn’t last and Airbnb was keen to find a long-term solution to the problem.
They came up with two changes:
1 - Unified Entry Point
The first task was to unify the entry points to the payment system.
The Airbnb team built a data-oriented service mesh where clients can query for the “entity” instead of being forced to identify dozens of APIs. It’s like a typical restaurant where you go and order a specific pizza without worrying about how it’s actually made.
Here’s what the approach looked like at a high level:
In these entry points, they provided a bunch of filtering options to hide the complexity from the client and reduce the number of APIs they had to expose.
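As a rough illustration of "query for the entity with filters", here is a toy sketch. `EntityQuery`, `PaymentsReadLayer`, and the sample fields are invented for this example and say nothing about how Airbnb's actual service mesh is implemented.

```python
from dataclasses import dataclass, field

@dataclass
class EntityQuery:
    """What the client asks for: an entity name plus optional filters."""
    entity: str
    filters: dict = field(default_factory=dict)

class PaymentsReadLayer:
    """Single entry point that hides which backend serves each entity."""

    def __init__(self):
        self._resolvers = {}

    def register(self, entity, resolver):
        # A resolver is any zero-argument callable returning a list of dicts.
        self._resolvers[entity] = resolver

    def query(self, q):
        rows = self._resolvers[q.entity]()
        # Keep only rows that match every filter the client supplied.
        return [r for r in rows
                if all(r.get(k) == v for k, v in q.filters.items())]

layer = PaymentsReadLayer()
layer.register("transaction", lambda: [
    {"id": "t1", "guest_id": "g1", "status": "paid"},
    {"id": "t2", "guest_id": "g1", "status": "refunded"},
    {"id": "t3", "guest_id": "g2", "status": "paid"},
])

paid_for_g1 = layer.query(
    EntityQuery("transaction", {"guest_id": "g1", "status": "paid"})
)
```

The client names an entity and a few filters, and never learns which services sit behind the resolver.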
2 - Unified Higher-Level Data Entities
While the unified entry point was a good start, it wasn’t good enough to resolve all the complexity.
The payment system had 100+ data models and it still required a lot of domain knowledge to interact with all these models.
To make things easier for the client, they came up with higher-level domain entities to represent the payment domain. With this, the entire core payments data was confined to fewer than ten high-level entities.
See the example below:
Three main principles were followed while designing the high-level entities:
Keep the terminology simple for non-payment engineers
Maintain loose coupling with the storage schema to allow backend changes without requiring client changes
Hide the complexity behind a rich data model.
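The three principles above could look something like this in code. This is only a sketch: the `Transaction` entity, the storage row shapes, and the `state_code` value are assumptions made up for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Transaction:
    """High-level entity with plain names a non-payments engineer can read."""
    id: str
    amount: str
    status: str

def to_transaction(ledger_row: dict, processor_row: dict) -> Transaction:
    """Map hypothetical storage rows to the entity.

    Clients only ever see Transaction, so the storage columns below can be
    renamed or moved without any client change.
    """
    cents = ledger_row["amt_minor_units"]
    return Transaction(
        id=ledger_row["txn_uuid"],
        amount=f"{cents // 100}.{cents % 100:02d} {ledger_row['ccy']}",
        # Assumption for this sketch: state_code 3 means "settled".
        status="paid" if processor_row["state_code"] == 3 else "pending",
    )

example = to_transaction(
    {"txn_uuid": "t1", "amt_minor_units": 12050, "ccy": "USD"},
    {"state_code": 3},
)
```

All the storage-specific knowledge lives in the mapping function, which is the only place that has to change when the backend schema does.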
Materialize Denormalized Data
Unifying entry points and entities was a good way to remove complexity for the client.
But it didn’t mean that the platform itself became less complex. It was more like picking up the mess from one place and putting it elsewhere.
There were still expensive and complex application layer aggregations that were giving nightmares to the developers. The problem was that each client query was dependent on many services for its fulfillment. Sure, the client wasn’t doing the aggregation but it was still happening within the control flow.
To handle this, Airbnb decided to denormalize the payment data and materialize it with less than 10 seconds of replication lag.
Think of it like creating and packaging ready-to-eat meals before the customer even places an order.
Airbnb built a special framework for this, known as the Read-Optimized Store Framework, which takes an event-driven lambda approach (combining streaming and batch inputs) to materialize secondary indices. The framework ingests data through change data capture mechanisms and daily database dumps, and builds a near real-time secondary store.
The diagram below shows how the Read-Optimized Store Framework works:
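Here is a heavily simplified sketch of the idea, assuming a batch path fed by full dumps and a streaming path fed by change events. The `ReadOptimizedStore` class, the event shape, and the field names are hypothetical and are not the framework's real API.

```python
class ReadOptimizedStore:
    """Toy in-memory store holding one denormalized document per key."""

    def __init__(self):
        self.documents = {}

    def load_snapshot(self, rows):
        """Batch path: rebuild everything from a full database dump."""
        self.documents = {row["id"]: dict(row) for row in rows}

    def apply_change(self, event):
        """Streaming path: apply one change event as it arrives."""
        if event["op"] == "delete":
            self.documents.pop(event["id"], None)
        else:  # treat anything else as an upsert
            doc = self.documents.setdefault(event["id"], {"id": event["id"]})
            doc.update(event["fields"])

    def get(self, key):
        """Read path: one lookup, no joins or cross-service calls."""
        return self.documents.get(key)

store = ReadOptimizedStore()
store.load_snapshot([{"id": "t1", "status": "pending", "amount_cents": 12000}])
store.apply_change({"op": "upsert", "id": "t1", "fields": {"status": "paid"}})
store.apply_change({"op": "upsert", "id": "t2",
                    "fields": {"status": "pending", "amount_cents": 4500}})
store.apply_change({"op": "delete", "id": "t2"})
```

The expensive aggregation happens when data is written into the store, so a read is just a key lookup. The price is that the store can briefly lag behind the source of truth.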
Resulting System
After all the changes were made, the overall architecture for payment read flow looked something like this:
As you can see, the clients had no reason to know about any payment services or database internals. At the same time, complex aggregation queries were removed using the read-optimized store indexing service.
Sure, the index was eventually consistent but it was acceptable for their use case.
Airbnb first implemented this architecture for the transaction history page that showed a list of all the transactions for a customer. After 100% of the traffic was migrated to the new setup, they achieved a 150X latency improvement while improving the reliability from 96% to 99.9%.
These results have allowed Airbnb to expand this setup to other areas within the payment domain.
So - what do you think about Airbnb’s solution? Would you have done things differently?
The reference for this article comes from the Airbnb Engineering Blog.
That’s it for today! ☀️
See you later with another value-packed edition — Saurabh.