LinkedIn runs hundreds of microservices, which communicate at an average rate of tens of millions of calls per second.
Wherever there’s communication, there’s also a chance of security issues creeping in, like those pesky neighbors peeking into your house. Proper authorization controls are critical for limiting the damage if a service is compromised.
Access Control Lists (ACLs) are the most common approach to enforce such authorization controls.
With ACLs, you can define which users, groups, or processes have access to specific objects such as files, directories, applications, or network resources. It’s like a table or list specifying a particular object's permissions.
Consider an example ACL for a particular service. In this example ACL:
The "client-service" is allowed to perform GET requests on the "greeting" resource, but is denied PUT requests.
The "admin-service" can perform GET and PUT requests on the "greeting" resource.
For every request, the ACL is checked and access is granted or denied based on the defined permission levels.
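The rules described above can be sketched in a few lines of Python. This is only an illustration: the dictionary layout and the `is_allowed` helper are assumptions made for this sketch, not LinkedIn's actual ACL format. It also assumes a default-deny policy, so the PUT denial for "client-service" comes from the method not being listed.

```python
from typing import Dict, Set

# Hypothetical ACL layout: resource -> principal -> allowed HTTP methods.
# Anything that is not listed is denied (default-deny is an assumption).
ACLS: Dict[str, Dict[str, Set[str]]] = {
    "greeting": {
        "client-service": {"GET"},
        "admin-service": {"GET", "PUT"},
    }
}


def is_allowed(principal: str, resource: str, method: str) -> bool:
    """Return True only if the ACL explicitly grants the method."""
    allowed_methods = ACLS.get(resource, {}).get(principal, set())
    return method in allowed_methods


if __name__ == "__main__":
    for who, verb in [("client-service", "GET"), ("client-service", "PUT"),
                      ("admin-service", "PUT")]:
        print(who, verb, is_allowed(who, "greeting", verb))
```

Every request would go through a check like `is_allowed` before the service handles it.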
The Challenge at LinkedIn’s Scale
While the process sounds simple, scale changes everything.
There are four main challenges for LinkedIn:
They need to check authorization quickly
They need to deliver ACL changes promptly across the service stack
They need to manage a large number of ACLs
They need to monitor the ACL checks
The diagram below shows how they handled each of these challenges.
Let’s look at how LinkedIn solves each of these issues.
Fast Authorization Checks
To handle fast authorization checks, an authorization client module runs on every service at LinkedIn.
It keeps relevant ACL data in memory to avoid network calls during checks.
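To make the idea concrete, here is a minimal sketch of an in-process authorization client. The class name `AuthorizationClient` and the `fetch_acls` callback are hypothetical; the source does not describe the real module's interface. The point is only that the remote load happens once, and each check is a local dictionary lookup.

```python
from typing import Callable, Dict, Set

# resource -> principal -> allowed methods (layout assumed for this sketch)
AclSnapshot = Dict[str, Dict[str, Set[str]]]


class AuthorizationClient:
    """Hypothetical in-process client: checks read only local memory."""

    def __init__(self, fetch_acls: Callable[[], AclSnapshot]) -> None:
        # fetch_acls stands in for whatever remote call loads the ACL data.
        self._acls: AclSnapshot = fetch_acls()  # single load at startup

    def check(self, principal: str, resource: str, method: str) -> bool:
        # No network call here: just two dictionary lookups.
        return method in self._acls.get(resource, {}).get(principal, set())
```

Since the check never leaves the process, its cost stays roughly constant no matter how many calls per second the service receives.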
Deliver ACL Changes Quickly
ACL data is periodically refreshed in the background.
The refresh interval is chosen to balance timely propagation of ACL changes against the load that refreshing puts on the system.
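A background refresh like this can be sketched with a timer thread that swaps in a fresh snapshot. Everything here is an assumption for illustration: the `RefreshingAclStore` name, the `fetch_acls` callback, and the 30-second default are not taken from LinkedIn's implementation.

```python
import threading
from typing import Callable, Dict, Set

AclSnapshot = Dict[str, Dict[str, Set[str]]]  # resource -> principal -> methods


class RefreshingAclStore:
    """Hypothetical store that reloads ACL data on a fixed interval."""

    def __init__(self, fetch_acls: Callable[[], AclSnapshot],
                 interval_seconds: float = 30.0) -> None:
        self._fetch_acls = fetch_acls
        self._interval = interval_seconds  # shorter = fresher data, more load
        self._snapshot = fetch_acls()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self) -> None:
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()
        if self._thread.is_alive():
            self._thread.join()

    def refresh_once(self) -> None:
        try:
            # Replace the whole snapshot by rebinding one reference.
            self._snapshot = self._fetch_acls()
        except Exception:
            # On failure, keep serving the last good snapshot.
            pass

    def _run(self) -> None:
        # Event.wait returns False on timeout, True once stop() is called.
        while not self._stop.wait(self._interval):
            self.refresh_once()

    def check(self, principal: str, resource: str, method: str) -> bool:
        return method in self._snapshot.get(resource, {}).get(principal, set())
```

The interval is the tuning knob: a change can take up to one interval to reach a given service instance, and every instance issues one fetch per interval.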
Manage ACL Data
ACLs are stored in LinkedIn’s Espresso database, with a look-aside Couchbase cache for improved latency and scalability.
But how is the cache kept consistent with the database?
A change data capture system based on Brooklin notifies the service whenever an ACL changes, so that the stale cache entry can be cleared.
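The look-aside pattern with invalidation can be sketched as follows. Plain dictionaries stand in for Espresso and Couchbase, and `on_acl_changed` stands in for a handler fed by a Brooklin change event. All names are hypothetical, and none of the real client APIs are used.

```python
from typing import Dict, Optional, Set

Acl = Dict[str, Set[str]]  # principal -> allowed methods


class AclRepository:
    """Hypothetical look-aside cache in front of a database of ACLs."""

    def __init__(self, database: Dict[str, Acl]) -> None:
        self._database = database          # stand-in for the database
        self._cache: Dict[str, Acl] = {}   # stand-in for the cache
        self.database_reads = 0            # counter for illustration only

    def get(self, resource: str) -> Optional[Acl]:
        cached = self._cache.get(resource)
        if cached is not None:
            return cached                  # cache hit: no database read
        self.database_reads += 1
        acl = self._database.get(resource)
        if acl is not None:
            self._cache[resource] = acl    # populate the cache on a miss
        return acl

    def on_acl_changed(self, resource: str) -> None:
        # Called for each change event: drop the stale entry so the
        # next read goes back to the database.
        self._cache.pop(resource, None)
```

Without the change event, a cached entry would keep serving the old ACL until it was evicted for some other reason.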
Lastly, developers manage the ACL data through a REST API, which is exposed via a management interface and a command-line tool.
Monitoring ACL Data
Every authorization check is logged asynchronously using LinkedIn’s Kafka message queue.
This is used for debugging, traffic analysis, auditing, and investigations. Engineers can access insights through the inGraphs monitoring system.
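Asynchronous logging keeps the audit trail off the request path. Below is a rough sketch that uses an in-memory queue and a worker thread. The `publish` callback is a placeholder for a real Kafka producer, and the class name, event fields, and drop-when-full policy are assumptions made for this example.

```python
import queue
import threading
import time
from typing import Callable, Dict


class AsyncAuditLogger:
    """Hypothetical logger: checks enqueue events, a worker publishes them."""

    def __init__(self, publish: Callable[[Dict[str, object]], None],
                 max_pending: int = 10_000) -> None:
        self._publish = publish  # placeholder for sending to a message queue
        self._events = queue.Queue(maxsize=max_pending)
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def log_check(self, principal: str, resource: str, method: str,
                  allowed: bool) -> bool:
        event = {"timestamp": time.time(), "principal": principal,
                 "resource": resource, "method": method, "allowed": allowed}
        try:
            self._events.put_nowait(event)  # never blocks the request
            return True
        except queue.Full:
            return False  # assumption: drop the event instead of waiting

    def _drain(self) -> None:
        while True:
            event = self._events.get()
            if event is None:  # sentinel placed by close()
                return
            try:
                self._publish(event)
            except Exception:
                pass  # a failed publish must not kill the worker

    def close(self) -> None:
        self._events.put(None)
        self._worker.join()
```

The request thread only pays for an enqueue. The worker handles the slower publish step in the background.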
So - what do you think about this architectural approach?
And would you have done things differently?
Reference: Authorization at LinkedIn’s Scale
Eraser Professional Plan Free Trial (Affiliate)
As you all know, I use Eraser to draw all the diagrams in this newsletter.
Eraser is a fantastic tool that you can use as an all-in-one markdown editor, collaborative canvas, and diagram-as-code builder.
And now you can get one month free on their Professional Plan or a $12 discount if you go for the annual plan. The Professional Plan contains some amazing features like unlimited AI diagrams, unlimited files, PDF exports, and many more.
Head over to Eraser and at the time of checkout, use the promo code “CODEX” to get this offer now.
Shoutout
Here are a few interesting articles I read this week:
That’s it for today! ☀️
Enjoyed this issue of the newsletter?
Share with your friends and colleagues.
See you later with another value-packed edition — Saurabh
Great article! At which level are these ACLs implemented? For example, if I add an admin to my company account and this user tries to access my company account dashboard, would that be a call to the ACL service?
Thanks for the shout out!
LinkedIn uses a relatively complex solution, which is totally reasonable for authorizing that many API calls per second.
Awesome write up, and thanks for the shoutout, brother Saurabh!