High-Load Project Example

Our goal is to present an example project that addresses the process and challenges of high-load system architecture for one of our clients. This case study showcases the steps we took to meet their specific needs, identifying risks, challenges, and successful development strategies.
- Processing 100,000 requests per second
- Responding within 100 ms
- Having each request perform complex searches and several logical operations

Main Challenges and Risks

The project involved money and trading, where any glitch could cost tens of thousands of dollars within mere minutes of incorrect operation.
The main risk was that our chosen technology stack would not be able to handle the large volume of requests within the required short response time.

High-Load System Development Process

01
Choosing The Tech Stack:
The chosen technological stack needed to fulfill the following requirements:
- Non-blocking asynchronous I/O: synchronous I/O would make it impossible to stay within the 100 ms response budget under such a load.
- An in-memory search index: keeping the index in process memory saves the time otherwise spent on network round-trips to external storage.
- No internal state in the process that handles incoming requests: statelessness allows long-term horizontal scaling.
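To illustrate the first requirement, here is a minimal Python `asyncio` sketch (not the project's actual code) showing why non-blocking I/O is essential: while one request awaits a downstream operation, the event loop serves others, so a thousand 50 ms waits overlap instead of adding up.

```python
import asyncio
import time

async def handle_request(i: int) -> int:
    # Simulate a 50 ms downstream I/O wait (e.g., a lookup);
    # the event loop serves other requests during the await.
    await asyncio.sleep(0.05)
    return i

async def main() -> float:
    start = time.perf_counter()
    # 1,000 concurrent requests: with blocking I/O these waits would
    # run sequentially (~50 s); non-blocking I/O overlaps them.
    await asyncio.gather(*(handle_request(i) for i in range(1000)))
    return time.perf_counter() - start

elapsed = asyncio.run(main())
print(f"1000 requests in {elapsed:.2f}s")
```

The same principle applies regardless of language: the key is that a waiting request does not occupy a thread.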

02
Making a Prototype
We developed a prototype incorporating our chosen technology stack.
03
Creating a Test Environment
We created a test environment for the high-load system based on wrk2 with Lua scripts, which loaded the system with requests and measured response times.
The most crucial point here was that we did not load the process with any random data. Instead, we used queries that would actually find something in the index, simulating real work.
Therefore, we developed a test data generator to populate the index itself and generate queries that matched the test data, resulting in approximately the same percentage of successful searches as expected in reality.
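A generator of this kind can be sketched as follows. This is an illustrative Python example, not the project's actual generator: the "index" is modeled as a set of keys, and the hit rate and key lengths are invented parameters.

```python
import random
import string

def random_key(rng: random.Random, length: int = 8) -> str:
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))

def build_index(rng: random.Random, size: int) -> set:
    # Hypothetical index: a set of searchable keys.
    return {random_key(rng) for _ in range(size)}

def generate_queries(rng, index, n, hit_rate=0.3):
    """Yield n queries; roughly hit_rate of them match an indexed key,
    so the load test exercises realistic search work, not just misses."""
    keys = list(index)
    queries = []
    for _ in range(n):
        if rng.random() < hit_rate:
            queries.append(rng.choice(keys))     # will be found
        else:
            queries.append(random_key(rng, 12))  # longer key: guaranteed miss
    return queries

rng = random.Random(42)
index = build_index(rng, 10_000)
queries = generate_queries(rng, index, 1_000, hit_rate=0.3)
hits = sum(q in index for q in queries)
print(f"{hits / len(queries):.0%} of queries hit the index")
```

The same generated query list can then be fed to the load tool (in our setup, via wrk2's Lua scripting) so that measured latencies reflect real search work.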
04
Running Tests and Optimizing
Once both components were ready, we began to run the actual tests. After analyzing the test results, we made several important improvements and achieved approximately 5,000 requests per second per process. Additionally, we ensured the ability to scale our processes horizontally.
05
Requirements for Building and Deployment
Once the risk was mitigated, we turned to actual deployment. For such systems, it is preferable to deploy in the cloud first. In this case, we used AWS with Kubernetes and a declarative infrastructure description in Terraform.
The deployment process looked like this: we built Docker images of all components in our GitLab CI pipeline, uploaded them to a private Docker registry, and wrote a Terraform configuration that told the cloud what to deploy, where, and in what quantity.
This method made it very convenient to roll out new versions. We simply built new images, updated their versions in the Terraform configuration, and applied the changes.
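The shape of such a configuration can be sketched roughly as below. This is an illustrative fragment only: the resource names, image tag, and registry URL are invented, and required Kubernetes fields (selectors, pod labels) are omitted for brevity.

```hcl
# Illustrative fragment — names and the image tag are hypothetical.
resource "kubernetes_deployment" "request_handler" {
  metadata {
    name = "request-handler"
  }
  spec {
    replicas = 20  # the horizontal scale lives in version control
    # (selector and pod template labels omitted for brevity)
    template {
      spec {
        container {
          name  = "request-handler"
          image = "registry.example.com/request-handler:1.4.2"
        }
      }
    }
  }
}
```

Rolling out a new version then reduces to bumping the image tag and running `terraform apply`.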
06
Additional Requirements for the System
In high-load systems, observability and a “safeguard” system are also critically important.
Observability involves collecting metrics from all our components (such as resource loading, execution time of certain requests, number of errors or failures) and visualizing them in a convenient format for the system operator.
We used Prometheus for collecting metrics and Grafana for visualization. Alerts were also set up to send messages via instant messengers when a metric approached dangerous values, allowing for proactive responses.
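An alert of this kind is typically defined as a Prometheus alerting rule. The fragment below is a hedged example: the metric name, threshold, and labels are assumptions, not the project's actual rules.

```yaml
# Illustrative Prometheus alerting rule — metric name and labels are invented.
groups:
  - name: latency
    rules:
      - alert: HighRequestLatency
        expr: histogram_quantile(0.99, rate(request_duration_seconds_bucket[5m])) > 0.1
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "p99 latency is above the 100 ms budget"
```

The `for: 2m` clause avoids paging on momentary spikes; the alert fires only after the condition persists.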
As for the safeguard system, it prevents damage in the event of a failure. For example, if a component fails or degrades (say, ClickHouse runs out of disk space and can no longer write data), other components detect this and stop performing potentially dangerous operations involving monetary resources, preventing excessive spending. The system had nearly a dozen such safeguards, reflecting its overall complexity and scale.
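The core idea of such a safeguard can be sketched as a fail-closed health gate. This is a minimal Python illustration under assumed names (`Safeguard`, `allow_spending`), not the project's implementation:

```python
from dataclasses import dataclass, field

@dataclass
class Safeguard:
    """Hypothetical safeguard: blocks money-affecting operations
    while any monitored dependency reports unhealthy."""
    unhealthy: set = field(default_factory=set)

    def report(self, component: str, healthy: bool) -> None:
        # Health checks (e.g., "is the store accepting writes?") call this.
        if healthy:
            self.unhealthy.discard(component)
        else:
            self.unhealthy.add(component)

    def allow_spending(self) -> bool:
        # Fail closed: any unhealthy dependency halts dangerous operations.
        return not self.unhealthy

guard = Safeguard()
guard.report("clickhouse", healthy=True)
assert guard.allow_spending()
guard.report("clickhouse", healthy=False)  # e.g., disk full
assert not guard.allow_spending()          # spending is paused
guard.report("clickhouse", healthy=True)   # recovered
assert guard.allow_spending()
```

Failing closed is the important design choice: when in doubt about a dependency, the system pauses risky operations rather than continuing to spend.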
07
Data Stores
The system contained three different types of data stores, because for such load and response-time requirements it is essential to use specialized storage for each specific task.
For quickly saving millions of new records, the best choice was ClickHouse. ClickHouse is optimized for fast data insertion, making it suitable as intermediate storage for large volumes of data or as a warehouse for analytics. Its downside is that updating or deleting data is slow.
The other task was to retrieve values by key from centralized storage quickly (in under 10 ms). Aerospike is well suited to this, as each instance can handle up to 100,000 operations per second; we ran four instances in our system.
As for the third storage solution, we used standard PostgreSQL RDBMS to store various configurations.
08
Achieving a System Capable of Effectively Handling High Loads
Once we confirmed that everything was operational, our next step was to migrate the system from the cloud to our own infrastructure to reduce costs significantly. Considering our use of Terraform and Kubernetes, this transition was quite straightforward.

To learn more about our expertise, review our custom high-load system development page


Transform Your Business Today –
Get Your Free Consultation

Connect with us, and your first consultation will be provided free of charge.
