System Design

Understanding and Scaling Webhook Notifications in Razorpay

In the world of online transactions, real-time notifications play a crucial role in keeping customers informed. At Razorpay, we’ve built a robust Notification Service to handle this task. However, as our transaction volume grew exponentially, we faced a few challenges, especially with webhooks, the most widely used form of notifications.

What are Webhooks?

Webhooks are user-defined HTTP callbacks triggered by specific events. They are used to send data from one application to another whenever a given event occurs. Webhooks provide real-time data and are more efficient than traditional polling mechanisms. Instead of constantly checking for updates, your application can sit back and let the data come to it.

The Existing Architecture

Our existing architecture for handling webhook notifications involves several components:

  1. API Pods: These receive the notification request from another system or service within Razorpay. After validating the request, the API Pods send it to the SQS queue.
  2. SQS Queue: The validated notification request is sent to an Amazon Simple Queue Service (SQS) queue.
  3. Workers: Worker processes consume messages from the SQS queue and deliver the webhooks. Each message corresponds to a notification that needs to be sent (a minimal worker sketch follows this list).
  4. Executors: The result of the notification execution (success or failure) is recorded by the Executors.
  5. Databases: The Executors persist the transaction details related to the notification execution in two different data stores – a MySQL database for transactional data and a data lake for analytical purposes.
  6. Scheduler: A Scheduler runs periodically, identifies notifications that need to be retried (due to failures or timeouts), and pushes them back to the SQS queue for reprocessing by the workers.
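To make the Workers step concrete, here is a minimal, illustrative sketch — not our production code — of a worker that long-polls an SQS queue and delivers each message as a webhook POST. It assumes the AWS SDK for Java v2 and the JDK HttpClient; the queue URL, customer endpoint, and message format are placeholders.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import software.amazon.awssdk.services.sqs.SqsClient;
import software.amazon.awssdk.services.sqs.model.DeleteMessageRequest;
import software.amazon.awssdk.services.sqs.model.Message;
import software.amazon.awssdk.services.sqs.model.ReceiveMessageRequest;

public class WebhookWorker {

    // Hypothetical queue and customer endpoint, used only for illustration.
    private static final String QUEUE_URL = "https://sqs.example.amazonaws.com/123456789012/notifications-live";
    private static final String CUSTOMER_WEBHOOK_URL = "https://customer.example.com/webhook";

    private final SqsClient sqs = SqsClient.create();
    private final HttpClient http = HttpClient.newBuilder().connectTimeout(Duration.ofSeconds(2)).build();

    public void pollOnce() throws Exception {
        // Long-poll the live queue for up to 10 messages at a time.
        ReceiveMessageRequest receive = ReceiveMessageRequest.builder()
                .queueUrl(QUEUE_URL)
                .maxNumberOfMessages(10)
                .waitTimeSeconds(20)
                .build();

        for (Message msg : sqs.receiveMessage(receive).messages()) {
            // Assume the message body is the JSON payload to deliver.
            HttpRequest request = HttpRequest.newBuilder(URI.create(CUSTOMER_WEBHOOK_URL))
                    .timeout(Duration.ofSeconds(5))
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString(msg.body()))
                    .build();

            HttpResponse<String> response = http.send(request, HttpResponse.BodyHandlers.ofString());

            if (response.statusCode() / 100 == 2) {
                // Delivered: remove the message so it is not processed again.
                sqs.deleteMessage(DeleteMessageRequest.builder()
                        .queueUrl(QUEUE_URL)
                        .receiptHandle(msg.receiptHandle())
                        .build());
            }
            // On failure the message becomes visible again after the visibility timeout,
            // and the Scheduler/retry path can pick it up.
        }
    }
}

In practice the worker also records each attempt via the Executors rather than simply leaving failures on the queue.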

Challenges Faced

The existing system was rated for a load of up to 2K transactions per second (TPS), but at our typical peak loads of around 1K TPS we started experiencing performance degradation: the 99th percentile latency increased from ~2 seconds to ~4 seconds, and could degrade further depending on the load, customers’ response times, and other factors.

Potential Solutions

To address these challenges and ensure the long-term evolution of the service, we’re considering several initiatives:

  1. Prioritizing Incoming Load: One approach could be to prioritize the incoming load based on certain criteria. This could help manage the system resources more effectively and ensure that the most critical operations are handled first.
  2. Reducing Database Bottleneck: To reduce the database bottleneck, we could consider optimizing our database queries, implementing caching, or using a more scalable database solution.
  3. Managing SLAs with Unexpected Response Times: To manage SLAs with unexpected response times from customer’s servers, we could consider implementing a more robust error handling and retry mechanism.
  4. Reducing Detection and Resolution Times: To reduce detection and resolution times, we could consider implementing more effective monitoring and alerting systems.

By addressing these challenges, we aim to build a more robust and scalable Notification Service that can handle a large number of transactions without compromising on performance or reliability.



You 

What is the Notification Service? Notification service is a platform service that handles all the customer notification requirements, such as SMSs, Emails, webhooks, etc. This system was designed a while ago and has been performing reliably. But, as the number of transactions grew exponentially, a few challenges popped up, especially with webhooks, the most widely used form of notifications (we will focus on webhooks only for this blog). To ensure we meet our SLAs and for better long-term evolution of the service, we had to go back to the drawing board and re-think the design. Please explain.

Copilot 

Sure, I’d be happy to explain. 

A Notification Service is a system that manages and sends notifications to customers through various channels like SMS, Email, and Webhooks. It’s a crucial part of many businesses as it allows them to keep their customers informed about updates, transactions, and other important information. 

In your context, it seems the Notification Service has been facing challenges due to the exponential growth in the number of transactions. This is particularly true for webhooks, which are user-defined HTTP callbacks. Webhooks are triggered by specific events and are used to send data from one application to another whenever a given event occurs. 

As the usage of webhooks increased, the system started to struggle to meet its Service Level Agreements (SLAs). SLAs are the agreed-upon levels of service between a service provider and its customers, and they’re crucial in maintaining trust and satisfaction among users. 

To address these challenges and ensure the long-term evolution of the service, the team decided to rethink the design of the Notification Service. This could involve optimizing the current architecture, introducing new technologies, or even redesigning the entire system to better handle the increased load and complexity. 

The goal of this redesign would be to create a more robust and scalable Notification Service that can handle a large number of transactions without compromising on performance or reliability. This way, the service can continue to effectively meet its SLAs and serve its users well into the future. 

I hope this helps! If you have any more questions, feel free to ask. 😊 

You 

The image depicts the existing architecture for the notification service, specifically focused on handling webhooks. Here’s an explanation of the components and the flow: 

  1. API (API pods): This is the entry point where notification requests are received. 
  2. SQS Live: The API pods send the validated notification requests to an Amazon Simple Queue Service (SQS) queue called “SQS Live”. 
  3. Workers: The Workers component consumes messages from the SQS Live queue and handles the actual execution of sending the notifications (webhooks in this case). 
  4. Executors: The Executors are responsible for persisting the transaction details related to the notification execution. In the image, the Executors persist data to two different data stores: 
  • Datalake: Likely a data lake or data warehouse solution for storing and analyzing notification data. 
  • MySQL Database: A MySQL database for storing transactional data related to notification executions. 
  5. Scheduler: The Scheduler component periodically reads requests that need to be retried (possibly failed or timed-out notifications) and pushes them back to the SQS Live queue for reprocessing by the Workers. 

The existing flow can handle up to 2K transactions per second (TPS). However, when the load reaches around 1K TPS, the system starts to experience performance degradation. Specifically: 

  • The 99th percentile (p99) latency increases from ~2 seconds to ~4 seconds. 
  • The p99 latency may further degrade based on additional factors like the load, customers’ response times, etc. 

The performance degradation can be attributed to several factors: 

  1. Contention and Bottlenecks: At higher loads, various components like the API pods, SQS queue, Workers, and databases may experience contention, leading to increased latencies and potential bottlenecks. 
  2. Resource Limitations: The existing resources (CPU, memory, network, etc.) allocated to the different components may become insufficient to handle the increased load, causing performance issues. 
  3. Database Scalability: The MySQL database used for persisting transactional data may struggle to scale horizontally, leading to performance bottlenecks at higher loads. 
  4. Queue Limitations: The SQS queue may have limitations in terms of throughput or concurrency, causing delays or backlogs in message processing. 
  5. Network Latency: As the load increases, network latency between different components (API pods, Workers, databases, etc.) may become a factor, contributing to higher overall latencies. 
  6. Customer Response Times: The degradation in performance may also be influenced by the response times of customers receiving the webhooks, as the Workers may need to wait for these responses before completing the execution. 

To address these performance issues and ensure the notification service can handle higher loads while meeting the desired SLAs, a re-evaluation and potential redesign of the architecture may be necessary. This could involve optimizing existing components, introducing additional scaling mechanisms, or adopting a more scalable and distributed architecture tailored for high-throughput workloads.

How does the performance decrease here when we are using an SQS message queue?

Copilot 

Great question! While Amazon Simple Queue Service (SQS) is a highly scalable and reliable service, there can still be performance issues in certain scenarios. Here are some potential reasons why performance might decrease even when using SQS: 

  1. Throughput Limitations: Standard SQS queues support a nearly unlimited number of transactions per second (TPS), but FIFO queues are limited to 300 TPS (3,000 TPS with batching). If the rate of messages sent or received exceeds the available throughput, throttling and increased latencies can result. 
  2. Large Message Sizes: SQS supports message sizes up to 256 KB. Larger payloads must be split across messages or stored elsewhere (for example, in S3 with a reference in the message), which increases the time it takes to enqueue and dequeue them. 
  3. Network Latency: The time it takes for a message to travel from the sender to SQS and then to the receiver can add to the overall latency. This can be particularly noticeable if the sender, SQS, and receiver are all in different geographical locations. 
  4. Processing Time: The time it takes to process a message can also contribute to increased latencies. This includes the time it takes for the API pods to validate notification requests, for the Workers to execute notifications, and for the Executors to persist transaction details. 
  5. Visibility Timeout: The visibility timeout is the period of time that a message is invisible to the rest of the system after it has been read from the queue. If this timeout is set too high, failed messages take longer to become visible again for reprocessing, which increases retry latency (a small configuration sketch follows this list). 
  6. Retry Policies: If a message fails to be processed successfully, it can be returned to the queue to be retried. However, if there are a large number of failing messages, this can result in a backlog of messages in the queue, leading to increased latencies. 
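As a rough illustration of points 5 and 6 — not a recommendation of specific values — here is how you might tune the visibility timeout and enable long polling with the AWS SDK for Java v2. The queue URL and numbers are assumptions.

import java.util.Map;
import software.amazon.awssdk.services.sqs.SqsClient;
import software.amazon.awssdk.services.sqs.model.QueueAttributeName;
import software.amazon.awssdk.services.sqs.model.SetQueueAttributesRequest;

public class QueueTuning {
    public static void main(String[] args) {
        SqsClient sqs = SqsClient.create();
        sqs.setQueueAttributes(SetQueueAttributesRequest.builder()
                .queueUrl("https://sqs.example.amazonaws.com/123456789012/notifications-live") // hypothetical
                .attributes(Map.of(
                        // Keep failed messages invisible for only 60 seconds so retries are not delayed too long.
                        QueueAttributeName.VISIBILITY_TIMEOUT, "60",
                        // Enable long polling so consumers wait up to 20 seconds instead of busy-polling.
                        QueueAttributeName.RECEIVE_MESSAGE_WAIT_TIME_SECONDS, "20"))
                .build());
    }
}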

To improve the performance of the system, you might consider strategies such as increasing the number of API pods or Workers, optimizing the processing time of messages, adjusting the visibility timeout, or fine-tuning the retry policies. Additionally, you could consider using other AWS services that are designed to handle higher loads, such as Amazon Kinesis or DynamoDB Streams. I hope this helps! 😊 


You 

Challenges We Found While Scaling Up While we were onboarding new use cases and adopting services, we encountered a few scalability issues: 

  • The Read query performance degraded with increase in data in the table. Although there is a replica DB, it also peaks as the number of write operations increases. 
  • The worker pods can be scaled to any limit. However, due to the limitation in IOPS on the DB, the pods couldn’t be scaled as an increase in Pods results in DB degradation. 
  • Each time the DB became a bottleneck due to high IOPS, our short-term fix was to increase the size of the database from 2x.large to 4x.large to 8x.large. But this was not a long-term solution. 
  • In the Payments system, all the operations must be performed within a specified SLA. But various variable factors like customers’ response time for webhooks, load TPS, etc., posed a challenge to complete the operations within the SLAs as we could not scale up due to DB limitations. 
  • The load also increases unexpectedly for specific events or days like Quarter Closing, Special market trading occasions like Festivals, Annual Flagship sales by e-com, World Cup, IPL etc. During these days, the notifications were getting impacted. 

The solution was tricky: we were not looking at developing a system to handle a certain load as this would further increase year-on-year. What could be a possible solution? We wanted to tame the problem and picked up a few initiatives in that direction. 

  • Prioritizing incoming load 
  • Reducing DB bottleneck 
  • Managing SLAs with unexpected response times from customer’s servers 
  • Reducing detection and resolution times 

Please explain now 

Copilot 

Absolutely, let’s break down the challenges and potential solutions you’ve mentioned: 

Challenges: 

  1. Read Query Performance Degradation: As the amount of data in the table increases, the performance of read queries degrades. Even with a replica database, the performance peaks as the number of write operations increases. 
  2. Database Limitations: The worker pods can be scaled up to any limit, but due to limitations in the database’s Input/Output Operations Per Second (IOPS), scaling up the pods results in database degradation. Increasing the size of the database was a short-term fix, but not a long-term solution. 
  3. Service Level Agreement (SLA) Challenges: In the Payments system, all operations must be performed within a specified SLA. However, variable factors like customers’ response time for webhooks and load transactions per second (TPS) posed a challenge to complete the operations within the SLAs. 
  4. Unexpected Load Increases: The load also increases unexpectedly for specific events or days like Quarter Closing, Special market trading occasions like Festivals, Annual Flagship sales by e-commerce, World Cup, IPL, etc. During these days, the notifications were getting impacted. 

Potential Solutions: 

  1. Prioritizing Incoming Load: One approach could be to prioritize the incoming load based on certain criteria. This could help manage the system resources more effectively and ensure that the most critical operations are handled first. 
  2. Reducing Database Bottleneck: To reduce the database bottleneck, you could consider optimizing your database queries, implementing caching, or using a more scalable database solution. 
  3. Managing SLAs with Unexpected Response Times: To manage SLAs with unexpected response times from customer’s servers, you could consider implementing a more robust error handling and retry mechanism. This could help ensure that failed or delayed operations are retried in a timely manner. 
  4. Reducing Detection and Resolution Times: To reduce detection and resolution times, you could consider implementing more effective monitoring and alerting systems. This could help you detect and resolve issues more quickly, thereby minimizing the impact on your SLAs. 

Remember, these are just potential solutions and the actual implementation would depend on the specific requirements and constraints of your system. I hope this helps! If you have any more questions, feel free to ask. 😊 

You 

The number of transactions grew exponentially, and a few challenges popped up, especially with webhooks, the most widely used form of notifications (we will focus on webhooks only for this blog). To ensure we meet our SLAs and for better long-term evolution of the service, we had to go back to the drawing board and re-think the design. Please explain why webhooks are used for notifications, how they can be used for notifications, and what webhooks are. Can you show it using code? I have a Java and JavaScript background.

Copilot 

Sure, I’d be happy to explain! 

Webhooks are a method of augmenting or altering the behavior of a web page or web application with custom callbacks. These callbacks may be maintained, modified, and managed by third-party users and developers who may not necessarily be affiliated with the originating website or application. 

Webhooks are used for real-time notifications between applications and services. When a specific event occurs in the source site, it triggers a webhook that sends an HTTP request to the URL specified by you. This request contains information about the event which can be used by the receiving application. 

The reason webhooks are widely used for notifications is because they provide real-time data and are more efficient than traditional polling mechanisms. Instead of constantly checking for updates, your application can sit back and let the data come to it. 

Here’s a simple example of how you might set up a webhook in Node.js (JavaScript): 

const express = require('express');
const bodyParser = require('body-parser');
const app = express();

app.use(bodyParser.json());

app.post('/webhook', (req, res) => {
    console.log('Received a POST request');
    console.log(req.body); // This contains the data sent by the webhook
    res.status(200).end(); // Responding is important
});

app.listen(3000, () => console.log('Server started on port 3000'));

In this example, we’re creating a simple Express server that listens for POST requests on the /webhook endpoint. When a request is received, it logs the request body (which contains the data sent by the webhook) and then sends a 200 status code back to acknowledge receipt of the data. 

And here’s a simple example of how you might send a webhook notification in Java using the Unirest library: 

import com.mashape.unirest.http.HttpResponse;
import com.mashape.unirest.http.JsonNode;
import com.mashape.unirest.http.Unirest;
import com.mashape.unirest.http.exceptions.UnirestException;

public class Main {
    public static void main(String[] args) {
        try {
            HttpResponse<JsonNode> response = Unirest.post("http://localhost:3000/webhook")
                .header("Content-Type", "application/json")
                .body("{\"key\":\"value\"}")
                .asJson();
        } catch (UnirestException e) {
            e.printStackTrace();
        }
    }
}

In this example, we’re sending a POST request to the /webhook endpoint on our local server with a JSON body. This simulates an event occurring that triggers the webhook. 

Remember to replace "http://localhost:3000/webhook" with the actual URL you want to send the webhook to, and replace "{\"key\":\"value\"}" with the actual data you want to send. 

I hope this helps! If you have any more questions, feel free to ask. 😊 

Source: Conversation with Copilot, 13/6/2024 

  1. github.com 
  1. github.com 

You 

API pods receive the notification request, which after validation, is sent to the SQS queue to be read by the worker, and a notification is sent. Workers write the result of this execution in the database and push it to the data lake. A Scheduler runs periodically, reads the requests that have to be retried, and then pushes them to the SQS for processing. The existing system could handle a load of up to 2K TPS. Generally, we experience a peak load of up to 1k TPS. But, at these levels, the system performance starts degrading like p99 increases from ~2 secs to ~4 secs, which may further degrade based on the load, customers’ response time, etc. This is Razorpay’s Notification Service. Here I am not able to understand who is sending the notification and who is receiving it — can you explain it to me using some example?

Copilot 

Sure, I’d be happy to explain! 

In the context of Razorpay’s Notification Service, the “sender” and “receiver” of the notifications can be understood as follows: 

Sender (Razorpay’s Notification Service): This is the system that generates and sends the notifications. It includes several components: 

  1. API Pods: These receive the notification request from another system or service within Razorpay (for example, a payment gateway or a transaction processing system). After validating the request, the API Pods send it to the SQS queue. 
  2. Workers: These consume messages from the SQS queue and handle the actual execution of sending the notifications. The result of this execution is written in the database and pushed to the data lake. 
  3. Scheduler: This component runs periodically, reads the requests that have to be retried (possibly failed or timed-out notifications), and pushes them back to the SQS queue for reprocessing by the Workers. 

Receiver (Customer’s Server): This is the system that receives the notifications sent by Razorpay’s Notification Service. The receiver could be a server owned by a customer of Razorpay. When a specific event occurs (like a transaction being processed), Razorpay’s Notification Service sends a notification to the customer’s server via a webhook. The customer’s server then processes this notification accordingly. 

Here’s a simple example to illustrate this: 

Let’s say you run an online store and use Razorpay as your payment gateway. When a customer makes a purchase on your store, the payment is processed by Razorpay. Once the payment is successful, Razorpay’s Notification Service needs to inform your store’s system about this successful transaction. 

In this case, Razorpay’s Notification Service (the sender) sends a notification containing the transaction details to your store’s server (the receiver). Your store’s server then updates the order status to ‘Paid’ and proceeds with the order fulfillment process. 

I hope this clarifies your question! If you have any more questions, feel free to ask. 😊 

You 

So how did Razorpay add the webhook for notifications? Please explain it to me through the same example.

Copilot 

Sure, I’d be happy to explain! 

In the context of Razorpay and your online store, here’s how the webhook for notifications might be set up: 

  1. Webhook Setup: As the owner of the online store, you would first need to set up a webhook on your server. This involves creating an endpoint (for example, https://yourstore.com/razorpay-webhook) that is capable of receiving POST requests. 
  2. Register Webhook with Razorpay: Once your webhook is set up, you would then need to register this webhook URL with Razorpay. This is typically done through the Razorpay dashboard or API. When registering, you can specify which events you want to be notified about (for example, successful payments, refunds, etc.). 
  3. Event Occurs: Now, when a customer makes a purchase on your store and the payment is processed by Razorpay, this triggers the event you specified when registering the webhook. 
  4. Notification Sent: Razorpay’s Notification Service sends a POST request to the webhook URL (https://yourstore.com/razorpay-webhook) with a payload containing the details of the event (for example, the transaction details for a successful payment). 
  5. Notification Received: Your server receives this POST request and can then process the data accordingly (for example, updating the order status to ‘Paid’). 

Here’s a simple example of how the webhook endpoint might look in Node.js: 

const express = require('express');
const bodyParser = require('body-parser');
const app = express();

app.use(bodyParser.json());

app.post('/razorpay-webhook', (req, res) => {
    console.log('Received a POST request from Razorpay');
    console.log(req.body); // This contains the data sent by Razorpay
    // Here you can add your logic to handle the notification
    res.status(200).end(); // Responding is important
});

app.listen(3000, () => console.log('Server started on port 3000'));

In this example, we’re creating a simple Express server that listens for POST requests on the /razorpay-webhook endpoint. When a request is received, it logs the request body (which contains the data sent by Razorpay) and then sends a 200 status code back to acknowledge receipt of the data. 

Remember, this is a simplified example and in a real-world scenario, you would need to add your own logic to handle the notification and secure your webhook to verify that the requests are indeed coming from Razorpay. 
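For the “verify that the requests are indeed coming from Razorpay” part, webhook providers commonly sign the raw request body with a shared secret using HMAC-SHA256 and send the digest in a header (Razorpay documents an X-Razorpay-Signature header and a dashboard-configured webhook secret, but please confirm the exact header name and scheme against the current docs). Here is a minimal Java sketch of such a check — the class and method names are illustrative:

import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class WebhookVerifier {

    /** Returns true if the hex-encoded HMAC-SHA256 of the raw body matches the signature header. */
    public static boolean isValid(String rawBody, String signatureHeader, String webhookSecret) throws Exception {
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(new SecretKeySpec(webhookSecret.getBytes(StandardCharsets.UTF_8), "HmacSHA256"));
        String expected = toHex(mac.doFinal(rawBody.getBytes(StandardCharsets.UTF_8)));
        // Constant-time comparison to avoid timing attacks.
        return MessageDigest.isEqual(expected.getBytes(StandardCharsets.UTF_8),
                                     signatureHeader.getBytes(StandardCharsets.UTF_8));
    }

    private static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) sb.append(String.format("%02x", b));
        return sb.toString();
    }
}

You would run this check against the raw, unparsed request body before acting on the notification.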

I hope this helps! If you have any more questions, feel free to ask. 😊 

You 

Please design the notification flow for Razorpay: 

Copilot 

Sure, here’s a high-level design of the notification flow for Razorpay: 

1. Transaction Event: A transaction event occurs in the Razorpay system (e.g., a payment is processed). 

2. API Pods: The event triggers a notification request within the Razorpay system. This request is received and validated by the API pods. 

3. SQS Queue: The validated notification request is sent to an Amazon Simple Queue Service (SQS) queue. 

4. Workers: Worker processes consume messages from the SQS queue. Each message corresponds to a notification that needs to be sent. 

5. Webhook Notification: The worker processes send the notifications to the registered webhook URLs. These could be URLs of Razorpay’s customers who need to be notified about the transaction events. 

6. Executors: The result of the notification execution (success or failure) is recorded by the Executors. 

7. Databases: The Executors persist the transaction details related to the notification execution in two different data stores – a MySQL database for transactional data and a data lake for analytical purposes. 

8. Scheduler: A Scheduler runs periodically, identifies notifications that need to be retried (due to failures or timeouts), and pushes them back to the SQS queue for reprocessing by the workers. 

This flow ensures that notifications about transaction events are reliably delivered to Razorpay’s customers. The use of an SQS queue and worker processes allows the system to handle a high volume of notifications. The Executors and Scheduler ensure that failures are properly handled and retried. 
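To make steps 2 and 3 concrete, here is a minimal sketch, assuming the AWS SDK for Java v2, of an API-side handler that validates a request and enqueues it. The queue URL, validation rule, and message envelope are illustrative assumptions rather than Razorpay’s real code.

import software.amazon.awssdk.services.sqs.SqsClient;
import software.amazon.awssdk.services.sqs.model.SendMessageRequest;

public class NotificationApi {

    // Hypothetical queue URL, for illustration only.
    private static final String QUEUE_URL = "https://sqs.example.amazonaws.com/123456789012/notifications-live";
    private final SqsClient sqs = SqsClient.create();

    /** Validates the request and, if it looks sane, enqueues it for the workers. */
    public void accept(String targetUrl, String payloadJson) {
        if (targetUrl == null || !targetUrl.startsWith("https://")) {
            throw new IllegalArgumentException("webhook URL must be HTTPS"); // illustrative validation
        }
        // Naive JSON envelope carrying the destination and payload for the worker.
        String body = "{\"url\":\"" + targetUrl + "\",\"payload\":" + payloadJson + "}";
        sqs.sendMessage(SendMessageRequest.builder()
                .queueUrl(QUEUE_URL)
                .messageBody(body)
                .build());
    }
}

The workers then consume and deliver these messages as described in steps 4 and 5.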

Please note that this is a simplified view of the notification flow. The actual implementation may involve additional steps and components based on the specific requirements and constraints of the Razorpay system. 

I hope this helps! If you have any more questions, feel free to ask. 😊 

You 

Can you create a blog on this? Please include everything we have discussed.

Copilot 

Sure, here’s a draft for a blog post based on our discussion: 

In the world of online transactions, real-time notifications play a crucial role in keeping customers informed. At Razorpay, we’ve built a robust Notification Service to handle this task. However, as our transaction volume grew exponentially, we faced a few challenges, especially with webhooks, the most widely used form of notifications. 

Webhooks are user-defined HTTP callbacks triggered by specific events. They are used to send data from one application to another whenever a given event occurs. Webhooks provide real-time data and are more efficient than traditional polling mechanisms. Instead of constantly checking for updates, your application can sit back and let the data come to it. 

Our existing architecture for handling webhook notifications involves several components: 

  1. API Pods: These receive the notification request from another system or service within Razorpay. After validating the request, the API Pods send it to the SQS queue. 
  2. SQS Queue: The validated notification request is sent to an Amazon Simple Queue Service (SQS) queue. 
  3. Workers: Worker processes consume messages from the SQS queue. Each message corresponds to a notification that needs to be sent. 
  4. Executors: The result of the notification execution (success or failure) is recorded by the Executors. 
  5. Databases: The Executors persist the transaction details related to the notification execution in two different data stores – a MySQL database for transactional data and a data lake for analytical purposes. 
  6. Scheduler: A Scheduler runs periodically, identifies notifications that need to be retried (due to failures or timeouts), and pushes them back to the SQS queue for reprocessing by the workers. 

The existing system was rated for a load of up to 2K transactions per second (TPS), but at our typical peak loads of around 1K TPS we started experiencing performance degradation: the 99th percentile latency increased from ~2 seconds to ~4 seconds, and could degrade further depending on the load, customers’ response times, and other factors.

To address these challenges and ensure the long-term evolution of the service, we’re considering several initiatives: 

  1. Prioritizing Incoming Load: One approach could be to prioritize the incoming load based on certain criteria. This could help manage the system resources more effectively and ensure that the most critical operations are handled first. 
  2. Reducing Database Bottleneck: To reduce the database bottleneck, we could consider optimizing our database queries, implementing caching, or using a more scalable database solution. 
  3. Managing SLAs with Unexpected Response Times: To manage SLAs with unexpected response times from customer’s servers, we could consider implementing a more robust error handling and retry mechanism. 
  4. Reducing Detection and Resolution Times: To reduce detection and resolution times, we could consider implementing more effective monitoring and alerting systems. 

By addressing these challenges, we aim to build a more robust and scalable Notification Service that can handle a large number of transactions without compromising on performance or reliability. 

I hope this helps! If you have any more questions, feel free to ask. 😊 

You 

You did not include the examples?

Copilot 

I apologize for the oversight. Let’s add the examples to the blog post: 

… 

… 

Here’s a simple example of how you might set up a webhook in Node.js (JavaScript): 

const express = require('express');
const bodyParser = require('body-parser');
const app = express();

app.use(bodyParser.json());

app.post('/webhook', (req, res) => {
    console.log('Received a POST request');
    console.log(req.body); // This contains the data sent by the webhook
    res.status(200).end(); // Responding is important
});

app.listen(3000, () => console.log('Server started on port 3000'));

And here’s a simple example of how you might send a webhook notification in Java using the Unirest library: 

import com.mashape.unirest.http.HttpResponse;
import com.mashape.unirest.http.JsonNode;
import com.mashape.unirest.http.Unirest;
import com.mashape.unirest.http.exceptions.UnirestException;

public class Main {
    public static void main(String[] args) {
        try {
            HttpResponse<JsonNode> response = Unirest.post("http://localhost:3000/webhook")
                .header("Content-Type", "application/json")
                .body("{\"key\":\"value\"}")
                .asJson();
        } catch (UnirestException e) {
            e.printStackTrace();
        }
    }
}

… 

… 

… 

… 

By addressing these challenges, we aim to build a more robust and scalable Notification Service that can handle a large number of transactions without compromising on performance or reliability. 

I hope this helps! If you have any more questions, feel free to ask. 😊 


You 

Challenges We Found While Scaling Up While we were onboarding new use cases and adopting services, we encountered a few scalability issues: 

  • The Read query performance degraded with increase in data in the table. Although there is a replica DB, it also peaks as the number of write operations increases. 
  • The worker pods can be scaled to any limit. However, due to the limitation in IOPS on the DB, the pods couldn’t be scaled as an increase in Pods results in DB degradation. 
  • Each time the DB became a bottleneck due to high IOPS, our short-term fix was to increase the size of the database from 2x.large to 4x.large to 8x.large. But this was not a long-term solution. 
  • In the Payments system, all the operations must be performed within a specified SLA. But various variable factors like customers’ response time for webhooks, load TPS, etc., posed a challenge to complete the operations within the SLAs as we could not scale up due to DB limitations. 
  • The load also increases unexpectedly for specific events or days like Quarter Closing, Special market trading occasions like Festivals, Annual Flagship sales by e-com, World Cup, IPL etc. During these days, the notifications were getting impacted. 

The solution was tricky: we were not looking at developing a system to handle a certain load as this would further increase year-on-year. What could be a possible solution? We wanted to tame the problem and picked up a few initiatives in that direction. 

  • Prioritizing incoming load 
  • Reducing DB bottleneck 
  • Managing SLAs with unexpected response times from customer’s servers 
  • Reducing detection and resolution times 

Please explain it with example 

Copilot 

Absolutely, let’s break down the challenges and potential solutions you’ve mentioned: 

Challenges: 

  1. Read Query Performance Degradation: As the amount of data in the table increases, the performance of read queries degrades. Even with a replica database, the performance peaks as the number of write operations increases. 
  • Example: Imagine a database table that stores all the transactions. As more transactions occur, more records are added to the table. When the table becomes large, it takes longer to read data from it. 
  2. Database Limitations: The worker pods can be scaled up to any limit, but due to limitations in the database’s Input/Output Operations Per Second (IOPS), scaling up the pods results in database degradation. Increasing the size of the database was a short-term fix, but not a long-term solution. 
  • Example: Let’s say you have 10 worker pods and a database that can handle 1000 IOPS. If you increase the worker pods to 20, the database might not be able to handle the increased load, leading to slower response times. 
  3. Service Level Agreement (SLA) Challenges: In the Payments system, all operations must be performed within a specified SLA. However, variable factors like customers’ response time for webhooks and load transactions per second (TPS) posed a challenge to complete the operations within the SLAs. 
  • Example: If a customer’s server takes a long time to respond to a webhook, it could delay the completion of the operation, causing it to exceed the specified SLA. 
  4. Unexpected Load Increases: The load also increases unexpectedly for specific events or days like Quarter Closing, Special market trading occasions like Festivals, Annual Flagship sales by e-commerce, World Cup, IPL, etc. During these days, the notifications were getting impacted. 
  • Example: During a big sale event, there might be a sudden surge in transactions, leading to a higher load on the system. This could cause delays in sending notifications. 

Potential Solutions: 

  1. Prioritizing Incoming Load: One approach could be to prioritize the incoming load based on certain criteria. This could help manage the system resources more effectively and ensure that the most critical operations are handled first. 
  • Example: You could prioritize notifications related to payment success over other types of notifications, as these might be more critical for your business. 
  2. Reducing Database Bottleneck: To reduce the database bottleneck, you could consider optimizing your database queries, implementing caching, or using a more scalable database solution. 
  • Example: You could use a NoSQL database like DynamoDB that can handle higher IOPS, or implement caching using Redis to reduce the load on the database. 
  3. Managing SLAs with Unexpected Response Times: To manage SLAs with unexpected response times from customer’s servers, you could consider implementing a more robust error handling and retry mechanism. This could help ensure that failed or delayed operations are retried in a timely manner. 
  • Example: If a webhook fails to deliver, you could put it in a retry queue and attempt to send it again after a certain period of time (a small backoff sketch follows this list). 
  4. Reducing Detection and Resolution Times: To reduce detection and resolution times, you could consider implementing more effective monitoring and alerting systems. This could help you detect and resolve issues more quickly, thereby minimizing the impact on your SLAs. 
  • Example: You could use a monitoring tool like Prometheus to monitor your system’s performance in real-time, and an alerting tool like Alertmanager to send alerts when certain performance thresholds are breached. 
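As a sketch of the retry idea in point 3 — an assumption about how it could be done, not the actual implementation — SQS lets a producer delay delivery of an individual message by up to 900 seconds, which is enough to build a capped exponential backoff:

import software.amazon.awssdk.services.sqs.SqsClient;
import software.amazon.awssdk.services.sqs.model.SendMessageRequest;

public class RetryScheduler {

    // Hypothetical retry queue URL, for illustration only.
    private static final String RETRY_QUEUE_URL = "https://sqs.example.amazonaws.com/123456789012/notifications-retry";
    private final SqsClient sqs = SqsClient.create();

    /** Re-enqueues a failed webhook with an exponentially growing delay, capped at SQS's 900-second maximum. */
    public void scheduleRetry(String messageBody, int attempt) {
        int delaySeconds = (int) Math.min(900, Math.pow(2, attempt) * 30); // 30s, 60s, 120s, ...
        sqs.sendMessage(SendMessageRequest.builder()
                .queueUrl(RETRY_QUEUE_URL)
                .messageBody(messageBody)
                .delaySeconds(delaySeconds)
                .build());
    }
}

Retries that need to wait longer than the 900-second cap would fall back to a scheduler-style component like the one described earlier.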

Remember, these are just potential solutions and the actual implementation would depend on the specific requirements and constraints of your system. I hope this helps! If you have any more questions, feel free to ask. 😊 

You 

What are MySQL, PostgreSQL, and IOPS?

Copilot 

Sure, let’s break down these terms: 

  1. MySQL [5] [6] [7]: MySQL is an open-source Relational Database Management System (RDBMS) that enables users to store, manage, and retrieve data using Structured Query Language (SQL). 

Source: Conversation with Copilot, 13/6/2024 

  1. MySQL – Wikipedia 
  2. What Is MySQL? | Oracle 
  3. What is MySQL: MySQL Explained for Beginners – Hostinger 

You 

What is the IOPS of MySQL and PostgreSQL?

Copilot 

IOPS, or Input/Output Operations Per Second, is a common performance measurement used to benchmark computer storage devices like hard disk drives (HDD), solid state drives (SSD), and storage area networks (SAN). In the context of databases like MySQL and PostgreSQL, IOPS can often be a critical factor in database performance [5] [1]. 

For MySQL, the innodb_io_capacity configuration parameter defines the overall I/O capacity available to InnoDB, which should be set to approximately the number of I/O operations that the system can perform per second [6]. The default value is 200, but this can be increased on systems that use multiple disks or SSDs [6]. 

For PostgreSQL, disk IOPS is a key metric for measuring the performance of any disk system. It represents the number of read and write operations that can be performed per second [1]. Understanding and optimizing disk IOPS is crucial for achieving optimal performance, especially for write-intensive workloads [4]. 

The actual IOPS capacity can vary greatly depending on the specific hardware, configuration, and workload. It’s also important to note that higher IOPS values typically indicate better storage performance, allowing your database to handle more simultaneous read and write operations, resulting in faster data retrieval and improved overall efficiency [7] [3]. 

I hope this helps! If you have any more questions, feel free to ask. 😊 

Source: Conversation with Copilot, 13/6/2024 

  1. mysql – I want to know about the IOPS (I/O Per Second) and How it … 
  2. Understanding Postgres IOPS: Why They Matter… | Crunchy Data Blog 
  3. 14.8.8 Configuring InnoDB I/O Capacity – MySQL 
  4. Gain Unparalleled Insights into Your Postgres: IOPS Monitoring Made … 
  5. Storage IOPS in Azure Database for MySQL – Flexible Server 
  6. PostgreSQL High Availability Cookbook – Second Edition – Packt Subscription 
  7. Tuning Input/Output (I/O) Operations for PostgreSQL 
  8. Storage iops – Azure Database for MySQL – Flexible Server 
  9. How To Calculate IOPS and Throughput of Your … – Database Tutorials 

You 

Example: If a customer’s server takes a long time to respond to a webhook — please suggest how a customer’s server could end up taking a long time to respond to a webhook. Can you give some examples?

Copilot 

Sure, there can be several reasons why a customer’s server might take a long time to respond to a webhook. Here are a few examples: 

  1. Server Overload: If the customer’s server is handling a large number of requests simultaneously, it might be overloaded, causing it to respond slowly to incoming requests. This is especially common during peak usage times. 
  2. Network Latency: The time it takes for data to travel from the webhook server to the customer’s server and back can add to the response time. This can be affected by factors such as the physical distance between the servers, the quality of the network connections, and network congestion. 
  3. Processing Time: If the customer’s server needs to perform complex processing or fetch data from a database in response to the webhook, this can add to the response time (the sender typically bounds this with a delivery timeout — see the sketch after this list). 
  4. Server Downtime: If the customer’s server is down or undergoing maintenance, it might not respond to the webhook at all, or it might take longer than usual to respond. 
  5. Rate Limiting: Some servers implement rate limiting to protect against abuse or to manage resources. If the rate limit has been reached, the server might delay processing of additional requests until the rate limit resets. 
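Whatever the cause, the sender usually protects its own SLA by putting a hard bound on each delivery attempt. A minimal sketch with the JDK HttpClient — the timeout values are illustrative assumptions, not prescribed settings:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.net.http.HttpTimeoutException;
import java.time.Duration;

public class DeliveryWithTimeout {

    enum Outcome { SUCCESS, RETRY }

    private final HttpClient http = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(2))   // bound the time spent establishing the connection
            .build();

    /** Bounds how long one slow customer endpoint can hold a worker. */
    Outcome deliver(String url, String payload) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(5))      // bound the overall request, regardless of the customer's speed
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(payload))
                .build();
        try {
            HttpResponse<String> response = http.send(request, HttpResponse.BodyHandlers.ofString());
            return response.statusCode() / 100 == 2 ? Outcome.SUCCESS : Outcome.RETRY;
        } catch (HttpTimeoutException slowCustomer) {
            return Outcome.RETRY; // customer too slow: give up now and let the retry path handle it
        }
    }
}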

Remember, these are just potential reasons and the actual cause could be a combination of multiple factors. It’s always a good idea to monitor the performance of webhooks and investigate any delays to ensure a smooth and efficient operation. I hope this helps! If you have any more questions, feel free to ask. 😊 

You 

Database Limitations: The worker pods can be scaled up to any limit, but due to limitations in the database’s Input/Output Operations Per Second (IOPS), scaling up the pods results in database degradation. Increasing the size of the database was a short-term fix, but not a long-term solution. Example: Let’s say you have 10 worker pods and a database that can handle 1000 IOPS. If you increase the worker… — you said this. Can we not increase the read databases?

Copilot 

Yes, you’re correct. Scaling up the worker pods increases the load on the database, and if the database’s Input/Output Operations Per Second (IOPS) limit is reached, it can result in database degradation. 

Increasing the size of the database (also known as vertical scaling) can temporarily alleviate this issue by providing more IOPS capacity. However, this is often considered a short-term solution because there’s a limit to how much you can scale a database vertically. 

As for your question about increasing read databases, yes, it’s possible and is a common strategy for scaling databases. This is known as horizontal scaling or scaling out. In this approach, you add more databases to distribute the load, rather than increasing the capacity of a single database. 

For example, you could implement a strategy known as read-replica scaling. In this setup, you have one primary (or master) database that handles all write operations, and multiple read replicas that handle read operations. When data is written to the primary database, it’s asynchronously replicated to the read replicas. This allows you to distribute the read load across multiple databases, thereby increasing your overall read capacity. 

However, it’s important to note that while read-replica scaling can significantly increase read capacity, it doesn’t increase write capacity. If your application performs a lot of write operations, you might need to consider other strategies, such as sharding, to distribute the write load across multiple databases. 
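Here is a minimal sketch of the read-replica idea, assuming plain JDBC and hypothetical primary/replica hostnames — writes go to the primary, lag-tolerant reads go to the replica:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class ReadWriteRouter {

    // Hypothetical endpoints: one writer (primary) and one reader (replica).
    private static final String PRIMARY_URL = "jdbc:mysql://primary.db.internal:3306/notifications";
    private static final String REPLICA_URL = "jdbc:mysql://replica.db.internal:3306/notifications";

    /** Writes always go to the primary. */
    Connection writeConnection() throws SQLException {
        return DriverManager.getConnection(PRIMARY_URL, "app", "secret");
    }

    /** Reads that can tolerate slight replication lag go to the replica. */
    Connection readConnection() throws SQLException {
        return DriverManager.getConnection(REPLICA_URL, "app", "secret");
    }
}

In practice a connection pool, a proxy layer, or a managed reader endpoint can do this routing for you, but the principle is the same.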

Remember, scaling databases is a complex task that requires careful planning and consideration of your specific workload and performance requirements. It’s always a good idea to monitor your database performance and consult with a database expert or a solutions architect when planning your scaling strategy. 

I hope this helps! If you have any more questions, feel free to ask. 😊 

You 

I mean, should we follow the master-slave architecture? Like, can we increase the read databases?

Copilot 

Yes, you’re correct. The master-slave architecture is one way to increase the read capacity of a database system. In this setup: 

  • The master database handles all write operations and replicates the data to one or more slave databases. 
  • The slave databases handle read operations. This allows you to distribute the read load across multiple databases, thereby increasing your overall read capacity. 

This architecture is also known as replication, and the slave databases are often referred to as read replicas. 

However, it’s important to note that while this can significantly increase read capacity, it doesn’t increase write capacity. All write operations still go to the master database. If your application performs a lot of write operations, you might need to consider other strategies, such as sharding or partitioning, to distribute the write load across multiple databases. 

Also, keep in mind that data replication from the master to the slaves is not instantaneous. There can be a lag, which means the data on the slaves might be slightly out-of-date compared to the master. This is typically not a problem for many applications, but it’s something to be aware of. 

I hope this helps! If you have any more questions, feel free to ask. 😊 

You 

Challenges We Found While Scaling Up While we were onboarding new use cases and adopting services, we encountered a few scalability issues: 

  • The Read query performance degraded with increase in data in the table. Although there is a replica DB, it also peaks as the number of write operations increases. 
  • The worker pods can be scaled to any limit. However, due to the limitation in IOPS on the DB, the pods couldn’t be scaled as an increase in Pods results in DB degradation. 
  • Each time the DB became a bottleneck due to high IOPS, our short-term fix was to increase the size of the database from 2x.large to 4x.large to 8x.large. But this was not a long-term solution. 
  • In the Payments system, all the operations must be performed within a specified SLA. But various variable factors like customers’ response time for webhooks, load TPS, etc., posed a challenge to complete the operations within the SLAs as we could not scale up due to DB limitations. 
  • The load also increases unexpectedly for specific events or days like Quarter Closing, Special market trading occasions like Festivals, Annual Flagship sales by e-com, World Cup, IPL etc. During these days, the notifications were getting impacted. 

The solution was tricky: we were not looking at developing a system to handle a certain load as this would further increase year-on-year. What could be a possible solution? We wanted to tame the problem and picked up a few initiatives in that direction. 

  • Prioritizing incoming load 
  • Reducing DB bottleneck 
  • Managing SLAs with unexpected response times from customer’s servers 
  • Reducing detection and resolution times 

Prioritizing Incoming Load 

There are two significant findings with the incoming requests: 

  • Not all notification requests are equally critical. 
  • One type of notification should not impact the delivery of other notifications. 

To find the best fit for the above, we tried to understand the expectations of the consumers’ SLAs and the impact of each notification. Once we had a fair understanding of each notification and its impact and SLAs, we created priority queues and pushed the events in them based on their criticality. 

P0: All the critical events with the highest impact on business metrics are pushed to this queue. This is an exclusive queue and a very few notification types go into this queue. 

P1: This is a default queue and all notifications other than P0 go into this queue. 

P2: All burst events (very high TPS in a short span) go to this queue. 

Thus, by segregating the priorities, the impact of any rogue event on other events was reduced. But what if one of the events in one of the queues goes rogue? In such a case, this leads to SLA breaches. To reduce this impact, we introduced a Rate Limiter for these queues and events. 

Rate Limiter Workflow 

For the P0 queue, all the events have an X limit. If this limit is breached for an event, that event is sent to the Rate Limit queue. Thus, it ensures that other events are not impacted. For the P1 queue, all the events have a Y limit. The number of events in this queue is much larger. As the events are not of high priority, their maximum limit is less than the P0. Further, the Rate Limits can be customised based on the consumers’ requests and the load pattern received for an event. 

Event Prioritisation with Rate Limiting secured the system from different DOS attacks and helped us build a multi-tenant system. 

Copilot 

Sure, let’s break down the information: 

Challenges: 

  1. Read Query Performance Degradation: As the data in the table increases, the performance of read queries degrades. Even with a replica database, the performance peaks as the number of write operations increases. 
  • Example: Imagine a database table that stores all the transactions. As more transactions occur, more records are added to the table. When the table becomes large, it takes longer to read data from it. 
  2. Database Limitations: The worker pods can be scaled up to any limit, but due to limitations in the database’s Input/Output Operations Per Second (IOPS), scaling up the pods results in database degradation. Increasing the size of the database was a short-term fix, but not a long-term solution. 
  • Example: Let’s say you have 10 worker pods and a database that can handle 1000 IOPS. If you increase the worker pods to 20, the database might not be able to handle the increased load, leading to slower response times. 
  3. Service Level Agreement (SLA) Challenges: In the Payments system, all operations must be performed within a specified SLA. However, variable factors like customers’ response time for webhooks and load transactions per second (TPS) posed a challenge to complete the operations within the SLAs. 
  • Example: If a customer’s server takes a long time to respond to a webhook, it could delay the completion of the operation, causing it to exceed the specified SLA. 
  4. Unexpected Load Increases: The load also increases unexpectedly for specific events or days like Quarter Closing, Special market trading occasions like Festivals, Annual Flagship sales by e-commerce, World Cup, IPL, etc. During these days, the notifications were getting impacted. 
  • Example: During a big sale event, there might be a sudden surge in transactions, leading to a higher load on the system. This could cause delays in sending notifications. 

Potential Solutions: 

  1. Prioritizing Incoming Load: One approach could be to prioritize the incoming load based on certain criteria. This could help manage the system resources more effectively and ensure that the most critical operations are handled first. 
  • Example: You could prioritize notifications related to payment success over other types of notifications, as these might be more critical for your business. 
  2. Reducing Database Bottleneck: To reduce the database bottleneck, you could consider optimizing your database queries, implementing caching, or using a more scalable database solution. 
  • Example: You could use a NoSQL database like DynamoDB that can handle higher IOPS, or implement caching using Redis to reduce the load on the database (a small caching sketch follows this list). 
  3. Managing SLAs with Unexpected Response Times: To manage SLAs with unexpected response times from customer’s servers, you could consider implementing a more robust error handling and retry mechanism. This could help ensure that failed or delayed operations are retried in a timely manner. 
  • Example: If a webhook fails to deliver, you could put it in a retry queue and attempt to send it again after a certain period of time. 
  4. Reducing Detection and Resolution Times: To reduce detection and resolution times, you could consider implementing more effective monitoring and alerting systems. This could help you detect and resolve issues more quickly, thereby minimizing the impact on your SLAs. 
  • Example: You could use a monitoring tool like Prometheus to monitor your system’s performance in real-time, and an alerting tool like Alertmanager to send alerts when certain performance thresholds are breached. 
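As a sketch of the caching idea above — an illustration, not the actual design — a small read-through cache keeps hot lookups, such as a merchant’s webhook configuration, out of the database. The same pattern applies if the cache is Redis shared across pods instead of in-process memory.

import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

public class ReadThroughCache<K, V> {

    private record Entry<T>(T value, Instant expiresAt) {}

    private final Map<K, Entry<V>> cache = new ConcurrentHashMap<>();
    private final Duration ttl;
    private final Function<K, V> loader; // falls through to the database on a miss

    public ReadThroughCache(Duration ttl, Function<K, V> loader) {
        this.ttl = ttl;
        this.loader = loader;
    }

    public V get(K key) {
        Entry<V> hit = cache.get(key);
        if (hit != null && Instant.now().isBefore(hit.expiresAt())) {
            return hit.value();           // served from memory: no DB read, no IOPS spent
        }
        V fresh = loader.apply(key);      // cache miss: one read against the (replica) DB
        cache.put(key, new Entry<>(fresh, Instant.now().plus(ttl)));
        return fresh;
    }
}

For example, new ReadThroughCache<String, String>(Duration.ofMinutes(5), merchantId -> loadWebhookUrlFromDb(merchantId)) — where loadWebhookUrlFromDb is whatever query you already run — would spend at most one DB read per merchant every five minutes.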

Rate Limiter Workflow: 

The image you provided shows a flowchart of how rate limiting is applied to different priority queues (P0, P1, and P2). Each queue has a different rate limit (X for P0, Y for P1), and if an event exceeds this limit, it is sent to a separate Rate Limit queue. This ensures that a rogue event in one queue does not impact the events in other queues. 
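Here is a hedged sketch of that routing decision in Java — the queue names, limits, and one-second window are illustrative assumptions, not the actual values or implementation:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class PriorityRateLimiter {

    // Hypothetical per-event-type limits per second: X for P0 events, Y for P1 events.
    private static final int P0_LIMIT = 500;
    private static final int P1_LIMIT = 100;

    private final Map<String, AtomicInteger> countsThisSecond = new ConcurrentHashMap<>();
    private long currentSecond = System.currentTimeMillis() / 1000;

    /** Returns the queue an event should go to: its priority queue, or the rate-limit queue if it exceeded its limit. */
    public synchronized String route(String eventType, String priority) {
        long now = System.currentTimeMillis() / 1000;
        if (now != currentSecond) {          // a new second: reset all per-event counters
            currentSecond = now;
            countsThisSecond.clear();
        }
        int limit = "P0".equals(priority) ? P0_LIMIT : P1_LIMIT;
        int seen = countsThisSecond.computeIfAbsent(eventType, k -> new AtomicInteger()).incrementAndGet();
        if (seen > limit) {
            return "rate-limit-queue";       // rogue or bursty event: isolate it so others keep their SLA
        }
        return "P0".equals(priority) ? "p0-queue" : "p1-queue";
    }
}

Burst (P2) traffic would be routed to its own queue before this check, so a flash sale never competes with the critical P0 events.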

Remember, these are just potential solutions and the actual implementation would depend on the specific requirements and constraints of your system. I hope this helps! If you have any more questions, feel free to ask. 😊 

You 

postgresql iops 

Copilot 

Input/Output Operations Per Second (IOPS) is a key metric for measuring the performance of any disk system. It represents the number of read and write operations that can be performed per second. 

You 

We introduced a stream between DB writes and workers in this flow. All the messages are pushed to the stream, and the DB writer writes the messages on the DB (all the messages pushed on kinesis are idempotent). This gave us control over the rate at which we write on the DB. 

Copilot 

The image you provided shows a revised architecture that introduces a stream (in this case, a Kinesis stream) between the database writes and the workers. Here’s how it works: 

  1. API and Live Traffic Queue: The process begins with the API, which sends data to the Live Traffic Queue. 
  2. Workers: The Workers pick up the data from the queue and send an “Attempt event” to both the Executors and the Kinesis stream. 
  3. Kinesis Stream: The Kinesis stream queues the messages, providing a buffer between the Workers and the database. This allows you to control the rate at which writes are made to the database. 
  4. DB Writer: The DB Writer consumes messages from the Kinesis stream and writes them to the MySQL database. Because all the messages pushed to Kinesis are idempotent (meaning they can be applied multiple times without changing the outcome), this setup ensures that no duplicate writes occur in the database. 
  5. Executors and Datalake Publisher: The Executors persist the transaction details related to the notification execution in two different data stores – a MySQL database for transactional data and a data lake for analytical purposes. The Datalake Publisher sends data to the data lake storage. 
  6. Scheduler: A Scheduler runs periodically, identifies notifications that need to be retried (due to failures or timeouts), and pushes them back to the Live Traffic Queue for reprocessing by the Workers. 

By introducing the Kinesis stream, you’re able to manage the rate of database writes more effectively, which can help alleviate some of the database bottlenecks you were experiencing. This is a great example of how stream processing can be used to improve the scalability and performance of a system. 
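A rough sketch of that buffering step is shown below, using `boto3` with a hypothetical stream name. The partition key, batch size, and the idempotent-write comment are illustrative assumptions; obtaining the shard iterator and the actual DB Writer logic are omitted.

```python
import json
import boto3

kinesis = boto3.client("kinesis")
STREAM_NAME = "notification-attempts"  # assumed stream name

def publish_attempt(attempt: dict) -> None:
    """Worker side: push an idempotent attempt event to the stream."""
    kinesis.put_record(
        StreamName=STREAM_NAME,
        Data=json.dumps(attempt).encode("utf-8"),
        PartitionKey=attempt["notification_id"],  # keeps one notification on one shard
    )

def db_writer_loop(shard_iterator: str, write_to_db) -> None:
    """DB Writer side: drain the stream in small batches at a controlled rate."""
    while shard_iterator:
        resp = kinesis.get_records(ShardIterator=shard_iterator, Limit=100)
        for record in resp["Records"]:
            attempt = json.loads(record["Data"])
            # An idempotent write (e.g. INSERT ... ON DUPLICATE KEY UPDATE keyed
            # on notification_id + attempt number) makes replays harmless.
            write_to_db(attempt)
        shard_iterator = resp.get("NextShardIterator")
```

Because the DB Writer decides how fast it drains the stream, a spike in worker throughput no longer translates directly into a spike in database writes.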

I hope this helps! If you have any more questions, feel free to ask. 😊 

You 

Managing SLAs with unexpected response times from customers’ servers: The SLAs for each event are defined based on the criticality of the event. The SLAs are defined in such a way that the system can handle the load at peak times. But there are a few variable factors that are not in our control, like the response time from the customers’ servers. To handle such scenarios, we introduced a DLQ (Dead Letter Queue), where all the events that could not be processed within the SLA are pushed and retried later. 

Copilot 

The image you provided shows a strategy for managing Service Level Agreements (SLAs) with unexpected response times from customers’ servers. Here’s how it works: 

  1. SLAs for Each Event: The SLAs for each event are defined based on the criticality of the event. These SLAs are designed to ensure that the system can handle the load even at peak times. 
  2. Variable Factors: However, there are some variable factors that are not within your control, such as the response time from the customers’ servers. These unpredictable factors can sometimes cause events to not be processed within the defined SLA. 
  3. Dead Letter Queue (DLQ): To handle such scenarios, you’ve introduced a Dead Letter Queue (DLQ). The DLQ is a queue where all the events that could not be processed within the SLA are sent. 
  4. Retry Mechanism: Events in the DLQ are retried later, once the load on the system has decreased or the customer’s server is able to respond more quickly. 

By using a DLQ and a retry mechanism, you’re able to ensure that all events are eventually processed, even if they can’t be handled within the originally defined SLA due to factors outside of your control. This is a great example of how to design a system for resilience and reliability, even in the face of unpredictable external factors. 
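As a minimal illustration of that pattern, the sketch below shows a worker parking an undeliverable webhook in a DLQ and a separate job draining the DLQ back into the live queue later. The queue URLs, the timeout value, and the use of `requests`/`boto3` are assumptions for illustration only.

```python
import json
import boto3
import requests

sqs = boto3.client("sqs")
DLQ_URL = "https://sqs.ap-south-1.amazonaws.com/123456789012/webhooks-dlq"        # assumed
LIVE_QUEUE_URL = "https://sqs.ap-south-1.amazonaws.com/123456789012/webhooks-live"  # assumed
TIMEOUT_SECONDS = 5  # per-attempt budget for the customer's server to respond

def deliver_webhook(event: dict) -> None:
    """Attempt delivery; park the event in the DLQ if it fails or times out."""
    try:
        resp = requests.post(event["callback_url"], json=event["payload"],
                             timeout=TIMEOUT_SECONDS)
        resp.raise_for_status()
    except requests.RequestException:
        sqs.send_message(QueueUrl=DLQ_URL, MessageBody=json.dumps(event))

def retry_from_dlq() -> None:
    """Periodic job: move DLQ events back to the live queue for another attempt."""
    resp = sqs.receive_message(QueueUrl=DLQ_URL, MaxNumberOfMessages=10)
    for msg in resp.get("Messages", []):
        sqs.send_message(QueueUrl=LIVE_QUEUE_URL, MessageBody=msg["Body"])
        sqs.delete_message(QueueUrl=DLQ_URL, ReceiptHandle=msg["ReceiptHandle"])
```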

I hope this helps! If you have any more questions, feel free to ask. 😊 

You 

Reducing detection and resolution times: We introduced a few metrics and alarms on these metrics to detect any anomaly in the system. We also built a few dashboards to monitor these metrics and take corrective actions immediately. This helped us reduce the detection and resolution times drastically. Please explain this with an example. 

Copilot 

Sure, let’s break this down: 

Reducing Detection and Resolution Times 

In any system, being able to quickly detect and resolve issues is crucial for maintaining high availability and a good user experience. Here’s how you might do this: 

  1. Introduce Metrics: Metrics are measurements of specific aspects of your system’s behavior. For example, you might measure the number of requests per second your API is handling, the average response time of your API, the number of active users on your site, or the rate of errors occurring in your application. 
     - Example: Let’s say you introduce a metric for the average response time of your API. This metric measures the time it takes for your API to respond to a request. 
  2. Set Alarms on Metrics: Once you have your metrics, you can set alarms on them. An alarm triggers a notification when a metric reaches a certain threshold. This allows you to be immediately alerted when something is going wrong. 
     - Example: You could set an alarm on the average response time metric. If the average response time goes above 200 ms, the alarm triggers a notification. 
  3. Build Dashboards: Dashboards provide a visual representation of your metrics. They allow you to monitor your system’s behavior in real time and spot any anomalies quickly. 
     - Example: You could create a dashboard that displays the average response time metric over time. If you see a sudden spike in response time, you know something is wrong and can investigate further. 
  4. Take Corrective Actions Immediately: Once an issue is detected, you can take immediate action to resolve it. This might involve debugging the issue, scaling up your resources, or rolling back a recent deployment. 
     - Example: If your alarm notifies you that the average response time has spiked, you could immediately check your servers to see if they’re overloaded and need to be scaled up. 

By introducing metrics, alarms, and dashboards, you can drastically reduce the time it takes to detect and resolve issues in your system. This helps ensure your system remains highly available and provides a good user experience. 
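For instance, here is a minimal sketch of instrumenting webhook delivery with the Python `prometheus_client` library. The metric names, the port, and the wrapper function are illustrative assumptions; the alert thresholds themselves would live in a separate Prometheus/Alertmanager rule, which is not shown.

```python
import time
from prometheus_client import Counter, Histogram, start_http_server

# Assumed metric names; Prometheus scrapes them from http://host:8000/metrics.
WEBHOOK_LATENCY = Histogram("webhook_delivery_seconds",
                            "Time taken to deliver a webhook to the customer")
WEBHOOK_FAILURES = Counter("webhook_delivery_failures_total",
                           "Webhook deliveries that failed or timed out")

start_http_server(8000)  # expose the /metrics endpoint

def deliver_with_metrics(event: dict, send) -> None:
    """Wrap an arbitrary `send(event)` callable with latency and failure metrics."""
    start = time.time()
    try:
        send(event)
    except Exception:
        WEBHOOK_FAILURES.inc()
        raise
    finally:
        WEBHOOK_LATENCY.observe(time.time() - start)
```

An alerting rule could then fire when, say, the high percentiles of `webhook_delivery_seconds` or the failure rate breach the SLA threshold, feeding both the dashboards and the on-call alerts described above.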

I hope this helps! If you have any more questions, feel free to ask. 😊 

You 

What is the Kinesis stream? 

Copilot 

Amazon Kinesis is a set of services offered by Amazon Web Services (AWS) to handle and process large streams of data in real time. It’s designed to process large, distributed data streams, making it ideal for real-time data analytics, log and event data collection, and more. 

There are several types of services within Amazon Kinesis: 

  1. Kinesis Data Streams: This service allows you to build custom applications that process or analyze streaming data. You can continuously add various types of data such as clickstreams, application logs, and social media to a Kinesis data stream from hundreds of thousands of sources. 
  2. Kinesis Data Firehose: This service is used to load streams of data into data lakes, data stores, and analytics tools. It can capture, transform, and load streaming data into Amazon S3, Amazon Redshift, Amazon Elasticsearch Service, and more. 
  3. Kinesis Data Analytics: This service allows you to analyze streaming data with standard SQL queries. You can create real-time dashboards, generate alerts, drive real-time metrics, and more. 
  4. Kinesis Video Streams: This service allows you to securely stream video from connected devices to AWS for analytics, machine learning, and other processing. 

In the context of your previous message, a Kinesis stream likely refers to a Kinesis Data Stream, which is being used as a buffer between the Workers and the DB Writer so that writes to the database happen at a controlled rate. 

Ref: https://sl.bing.net/brOGI2rB4P6

About Author

Neelabh Singh

I am a Senior Software Engineer with 6.6 years of experience, specializing in Java technologies, microservices, AWS, algorithms, and data structures. I am also a technology blogger and an active participant in several online coding communities.
