SDC#16 - Cookies and Sessions

MySQL High Availability at Flipkart and More...

Dec 05, 2023

Hello, this is Saurabh…👋

Welcome to the 185 new subscribers who have joined us since last week.

If you aren’t subscribed yet, join 1700+ curious Software Engineers looking to expand their system design knowledge by subscribing to this newsletter.

In this edition, I cover the following topics:

🖥 System Design Concept → Cookies and Sessions

🧰 Case Study → MySQL High Availability at Flipkart

🍔 Food For Thought → How to Safeguard your Career?

So, let’s dive in.

🖥 Cookies and Sessions

HTTP is a stateless protocol.

This means every request is independent.

The web application server can’t tell if 2 requests came from the same browser or user.

But the users aren’t stateless.

No one wants to log in to your application every time they make a request.

So - how do you help them?

One solution is to use cookies.

Yes, cookies! But not the one you eat when you’re hungry.

A cookie is basically a key-value pair that’s stored in the browser.

How do they work?

The user logs in to your frontend application.
The frontend sends the request to the backend server.
The backend server generates a cookie
It sets the cookie on the browser via the Set-Cookie response header.
The user makes a new request to view a different page.
The frontend sends the request to the backend and includes the cookie in the Cookie request header.
The server checks the cookie for the user and responds with the required data.
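
To make the two headers concrete, here’s a minimal sketch of the server side of that exchange using Python’s standard http.cookies module. The cookie name "user" and the helper function names are made up for illustration.

```python
from http.cookies import SimpleCookie


def build_set_cookie(name: str, value: str) -> str:
    """Build the value of the Set-Cookie response header."""
    cookie = SimpleCookie()
    cookie[name] = value
    cookie[name]["path"] = "/"
    return cookie[name].OutputString()


def parse_cookie_header(header: str) -> dict[str, str]:
    """Read the Cookie request header the browser sends back."""
    cookie = SimpleCookie()
    cookie.load(header)
    return {key: morsel.value for key, morsel in cookie.items()}


# Login response: the server asks the browser to store the pair.
set_cookie = build_set_cookie("user", "saurabh")

# Next request: the browser returns the pair in the Cookie header.
received = parse_cookie_header("user=saurabh")
```

Notice that the value travels in plain text in both directions. That’s exactly the weakness discussed next.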

Here’s what the process looks like:

You can play around with the image on Eraser.io

Sounds good, doesn’t it?

But there’s a major issue with using cookies.

Cookies are stored in the browser, so the user (or any script running on the page) can read and modify them.

That’s why it’s not a good idea to use cookies for storing sensitive data about the users.

This is where sessions come into the picture.

A session is identified by a unique, hard-to-guess string of characters that the server maps to the user.

It works as follows:

The user makes a login request.
The frontend sends the request to the backend server.
The backend creates a session using a secret key and stores it in some sort of session storage (database or cache).
Next, the server sends a cookie back to the client.
This time, however, the cookie contains only the unique identifier for the session.
The user makes a new request to view another page.
The browser sends the session ID as part of the cookie. No other user information is stored in the cookie.

This time, only the server can check whether the session is valid, because the session data never leaves the server.

A few important points to mention here:

Cookies can have a “Secure” flag indicating that they should only be sent over HTTPS. This is good for security reasons.
Also, “HttpOnly” cookies can’t be read by JavaScript, which reduces the risk of cookie theft through XSS attacks.
Cookies (especially 3rd party cookies) raise a bunch of privacy concerns because they can be used to track user behavior.
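While cookies can be made secure, server-side sessions keep sensitive information off the browser altogether. Keep in mind that the browser still sends the session cookie automatically with every request, so CSRF needs its own defenses, such as the SameSite cookie attribute or anti-CSRF tokens.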
Also, server-side sessions can be centrally managed. This means you can invalidate sessions, expire or revoke them if needed.
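
As a small sketch of the points above, this is how the flags and server-side expiry might look in Python. The cookie name "session_id", the function names, and the dictionary-based store are all assumptions for illustration.

```python
from http.cookies import SimpleCookie


def session_cookie_header(session_id: str) -> str:
    """Build a Set-Cookie value with the Secure and HttpOnly flags."""
    cookie = SimpleCookie()
    cookie["session_id"] = session_id
    cookie["session_id"]["secure"] = True     # only sent over HTTPS
    cookie["session_id"]["httponly"] = True   # hidden from JavaScript
    cookie["session_id"]["path"] = "/"
    return cookie["session_id"].OutputString()


# Hypothetical server-side store: session ID -> expiry timestamp (seconds).
EXPIRES_AT: dict[str, float] = {}


def start_session(session_id: str, ttl_seconds: float, now: float) -> None:
    EXPIRES_AT[session_id] = now + ttl_seconds


def is_valid(session_id: str, now: float) -> bool:
    """A session is valid only if it exists and has not expired."""
    expires = EXPIRES_AT.get(session_id)
    return expires is not None and now < expires


def revoke(session_id: str) -> None:
    """Central invalidation: the cookie becomes useless immediately."""
    EXPIRES_AT.pop(session_id, None)


header = session_cookie_header("abc123")
```

Since validity lives on the server, revoking a session doesn’t depend on the browser deleting anything.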

🧰 MySQL High Availability at Flipkart

Flipkart is India’s e-commerce giant.

With almost $7 billion in revenue, it’s one of the largest competitors of Amazon when it comes to the Indian market.

Just like Amazon, Flipkart also runs massive sale days known as Big Billion Days and needs crazy levels of availability to handle the load during these critical days.

Availability is equal to revenue for them. In fact, what they need is High Availability.

But what exactly is High Availability?

High Availability or HA is a system’s ability to operate continuously without failures.

Many organizations aim for five nines, or 99.999% availability, when they talk about HA.

That comes to 864 milliseconds of allowed downtime per day!
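
If you want to check the arithmetic, here’s a tiny Python sketch. The function name is made up; the formula is simply the period multiplied by the fraction of time you’re allowed to be down.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def allowed_downtime_seconds(availability_percent: float,
                             period_seconds: float = SECONDS_PER_DAY) -> float:
    """Downtime budget for a given availability target over a period."""
    return period_seconds * (1 - availability_percent / 100)


# Five nines per day: roughly 864 milliseconds.
per_day_ms = allowed_downtime_seconds(99.999) * 1000

# Five nines per year: roughly 5.26 minutes.
per_year_minutes = allowed_downtime_seconds(99.999, 365 * SECONDS_PER_DAY) / 60
```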

To get close to these anxiety-inducing numbers, Flipkart uses a microservice architecture with thousands of services spread across multiple sub-systems such as:

Order Management
Supply Chain
Logistics
Seller management

But if there’s one component that can seriously hurt those HA aspirations, it’s the database.

Initially, every team at Flipkart managed its own MySQL clusters.

This meant that each team was sort of reinventing the wheel in maintaining their MySQL clusters to operate at 99.999% availability.

In other words, each team had to worry about hiring MySQL experts with the right level of skills so that they could keep the whole ship running.

As you might know, microservices architecture can suffer from cascading failures. Hypothetically, a DB cluster going down in one part of the system can impact other parts bringing the whole thing down.

Needless to say, this is not an efficient setup if a majority of teams are already using MySQL.

This led to the birth of ALTAIR.

No - not Altair the Assassin.

ALTAIR is Flipkart’s in-house managed MySQL service that helps achieve High Availability of MySQL clusters.

With ALTAIR, developers can spend less time worrying about whether their database is up and running and focus more on product development.

How does ALTAIR provide High Availability?

The biggest hurdle to High Availability is failure detection.

The faster you can detect a failure and take appropriate action, the lower the downtime.

And this is where ALTAIR plays a key role.

ALTAIR kickstarts the fail-over process by enabling failure detection, weeding out false positives and seamlessly triggering the recovery process.

At Flipkart, the MySQL clusters are set up in a primary-replica configuration.

The primary accepts the write requests and can also handle the reads.
The replicas in the cluster replicate the data asynchronously and also serve the read traffic.

How does the routing of requests (reads and writes) take place?

Flipkart uses DNS for service discovery.

Clients discover the primary node using DNS, which resolves to the IP address of the primary node.

Though they haven’t explicitly talked about it, the decision to send the requests to primary or secondary seems to rest with the client application.
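
Since Flipkart hasn’t shared the details, the following is purely a guess at what client-side routing could look like. The class name, the hostnames, and the "SELECT means read" rule are all assumptions, not ALTAIR’s actual behavior.

```python
import itertools


class ClusterRouter:
    """Hypothetical client-side router: writes to primary, reads to replicas."""

    def __init__(self, primary_host: str, replica_hosts: list[str]):
        self.primary_host = primary_host
        # Round-robin over replicas; fall back to the primary if there are none.
        self._replicas = itertools.cycle(replica_hosts) if replica_hosts else None

    def host_for(self, sql: str) -> str:
        """Pick the DNS name to connect to for this statement."""
        is_read = sql.lstrip().upper().startswith("SELECT")
        if is_read and self._replicas is not None:
            return next(self._replicas)
        return self.primary_host


router = ClusterRouter(
    "primary.orders-db.example",                              # made-up DNS names
    ["replica-1.orders-db.example", "replica-2.orders-db.example"],
)
write_host = router.host_for("INSERT INTO orders VALUES (1)")
read_host = router.host_for("SELECT * FROM orders")
```

One caveat with any setup like this: replication is asynchronous, so a read served by a replica can be slightly behind the primary.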

Here’s what the arrangement looks like on a high level.

As you can notice, the High Availability of the entire cluster depends heavily on the availability of the primary node so that it can continue accepting writes.

But hardware failures can happen in data centers and primary nodes can fail.

And they need to be detected fast so that the fail-over process can be triggered.

Here’s what the failure detection setup in ALTAIR looks like:

The entire setup has multiple components:

Agent

On every MySQL node, the agent runs as a daemon along with the MySQL process.

The job of this agent is to collect health metrics such as the state of the MySQL instance, disk usage, and replication lag.

It sends the health updates to the Monitor every 10 seconds.
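
Here’s a simplified sketch of what such an agent loop might look like. Only the 10-second interval comes from the description above. The field names, the run_agent function, and the injected collect and send callables are assumptions for illustration.

```python
import time
from dataclasses import asdict, dataclass
from typing import Callable

REPORT_INTERVAL_SECONDS = 10  # interval mentioned above


@dataclass
class HealthReport:
    node: str
    mysql_up: bool
    disk_used_percent: float
    replication_lag_seconds: float


def run_agent(collect: Callable[[], HealthReport],
              send: Callable[[dict], None],
              iterations: int,
              sleep: Callable[[float], None] = time.sleep) -> None:
    """Collect a health report and send it to the Monitor on a fixed interval.

    A real daemon would loop forever; `iterations` keeps the sketch bounded.
    """
    for _ in range(iterations):
        send(asdict(collect()))
        sleep(REPORT_INTERVAL_SECONDS)
```

Passing collect, send, and sleep in as arguments keeps the loop easy to test without a real MySQL node or a real Monitor.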

Monitor

As the name suggests, the Monitor keeps track of the health of MySQL nodes.

Each Monitor node is allocated a subset of MySQL nodes to oversee. Internally, it’s just a service written in Go that is scaled out to multiple instances.

It performs a few important activities such as:

Writing the health events received from the Agent to ZooKeeper every 10 seconds
Comparing the previous health with the latest health update received from the Agent
In case of MySQL failure or any other issue (such as disk usage or replication lag), it notifies the Orchestrator

Essentially, you can think of the Monitor as the central piece that coordinates the failure detection system and also acts as the gateway to Zookeeper.
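
The compare-and-notify part can be sketched roughly like this. The real Monitor is a Go service backed by ZooKeeper; this Python version keeps the last health in a dictionary instead, and the thresholds, class names, and notify callback are made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical thresholds; the original article does not give numbers.
MAX_DISK_PERCENT = 90.0
MAX_LAG_SECONDS = 30.0


@dataclass(frozen=True)
class Health:
    mysql_up: bool
    disk_used_percent: float
    replication_lag_seconds: float


def issues(health: Health) -> list[str]:
    """List the problems visible in a single health update."""
    found = []
    if not health.mysql_up:
        found.append("mysql_down")
    if health.disk_used_percent > MAX_DISK_PERCENT:
        found.append("disk_usage")
    if health.replication_lag_seconds > MAX_LAG_SECONDS:
        found.append("replication_lag")
    return found


class Monitor:
    def __init__(self, notify: Callable[[str, list[str]], None]):
        self.notify = notify              # stands in for the Orchestrator call
        self.last: dict[str, Health] = {}  # stands in for the ZooKeeper state

    def on_report(self, node: str, latest: Health) -> None:
        """Store the latest health and notify only about newly seen issues."""
        previous = self.last.get(node)
        self.last[node] = latest
        old = issues(previous) if previous is not None else []
        new = [issue for issue in issues(latest) if issue not in old]
        if new:
            self.notify(node, new)
```

Comparing against the previous state means the Orchestrator hears about a problem once, not every 10 seconds while it lasts.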

Orchestrator

The Orchestrator receives notifications of MySQL failure from the Monitor.

Its job is to check for false positives and trigger the recovery workflow only when the failure is confirmed.

Checking for false positives is important because the fail-over process can result in data loss and downtime.

It’s a costly process and you don’t want to be doing it in case of false alarms.
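
How exactly the Orchestrator rules out false alarms isn’t covered here. One common approach, shown below purely as an assumption, is to re-check the node through several independent probes and only fail over when enough of them agree. The function names and the probe interface are hypothetical.

```python
from typing import Callable, Iterable

# A probe returns True if it can reach a healthy MySQL node.
Probe = Callable[[], bool]


def failure_confirmed(probes: Iterable[Probe], min_failures: int) -> bool:
    """Treat the failure as real only if enough independent probes fail."""
    failures = sum(1 for probe in probes if not probe())
    return failures >= min_failures


def handle_failure_notification(probes: Iterable[Probe],
                                min_failures: int,
                                start_failover: Callable[[], None]) -> bool:
    """Run the false-positive check; trigger fail-over only when confirmed."""
    if failure_confirmed(probes, min_failures):
        start_failover()
        return True
    return False
```

For example, with three probes and min_failures set to 2, a single flaky network path won’t be enough to trigger a fail-over.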

P.S. This post is inspired by the explanation provided on the Flipkart Engineering Blog. However, the diagrams have been drawn or re-drawn based on the information shared to make things clearer. You can find the original article over here.