How request processing has changed over the years
While browsing Twitter (it will never be X.com for me), I came across a tweet discussing how an engineer from a big tech company was unable to explain how async/await works despite having worked with a particular language for 3+ years.
Though I agree with the OP that this is an important topic, most engineers working at larger companies seldom get to touch the request processing layer, which can make it an opaque topic for many engineers out there.
Back in 2016, I was working quite a lot with Ruby and Python services at an early-stage startup, trying to improve our requests-per-second metric. Somebody mentioned how NodeJS is orders of magnitude faster than Ruby, and I was both hurt (being a Ruby fanboy) and intrigued as to why NodeJS was so much faster.
I came across Ryan Dahl's (the NodeJS creator) talk on how the event loop was architected to handle a far higher RPS compared to other interpreted languages.
Anatomy of a request#
What are the tasks that are carried out during processing of a request?
- Serialization/Deserialization
- Validation
- Mathematical calculations
- High iteration count
- File operations
- Database read/write queries
- Other API calls
Before we go deeper into the different processing architectures, we need to understand the different kinds of tasks and their implications. The above tasks can be categorised into Blocking or Non Blocking calls.

Non blocking calls are those which don't need to wait for, or get blocked on, other tasks. They run to the end of their computation and don't release the CPU unless the kernel forcefully evicts them. Matrix multiplication, loop iteration and serialisation/deserialisation are examples of this kind of CPU-bound work.

Blocking calls are tasks which need to wait for an acknowledgement or a result from another task. Database calls, network calls, sleep calls and file operations are the usual examples: the process is put into a waiting state until the result is returned by the other task. Since the process is waiting, it gives up the CPU on its own and gets rerun only when the result comes back.
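To make the distinction concrete, here is a minimal Python sketch (the function names are made up for illustration): the first function keeps the CPU busy until it finishes, while the second spends almost all of its time waiting on something else.

```python
import time

# CPU-bound ("non blocking") work: keeps the CPU busy until it finishes and
# never voluntarily gives it up.
def serialize_payload(n: int) -> int:
    total = 0
    for i in range(n):          # high iteration count, pure computation
        total += i * i
    return total

# IO-bound ("blocking") work: the process asks another system for a result
# and sits in a waiting state until the answer comes back.
def fetch_user_from_db() -> str:
    time.sleep(0.1)             # stand-in for a database/network round trip
    return "user#42"

if __name__ == "__main__":
    start = time.perf_counter()
    serialize_payload(1_000_000)    # CPU never idles here
    fetch_user_from_db()            # CPU idles for ~100ms here
    print(f"took {time.perf_counter() - start:.2f}s")
```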
Cooperative vs Preemptive scheduling#
This is one of the more important topics in Computer Science whenever concurrency is discussed. From ChatGPT:
Cooperative scheduling relies on tasks voluntarily yielding control back to the system, meaning a
running task must explicitly pause (e.g., using yield or await) for others to run. This approach is
simpler but can lead to issues if a task doesn’t yield, causing the system to become unresponsive.
Preemptive scheduling, on the other hand, allows the system to forcibly interrupt a task and switch
execution to another, ensuring fair CPU usage and preventing any single task from monopolizing resources.
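Here is a rough sketch of the two models in Python, chosen only because it has both side by side: asyncio tasks are scheduled cooperatively and must `await` to give up the thread, while OS threads are preempted without asking.

```python
import asyncio
import threading

# Cooperative: an asyncio task keeps the (single) thread until it reaches an
# `await`; if it never awaits, every other task on the loop is starved.
async def cooperative_worker(name: str) -> None:
    for i in range(3):
        print(f"{name}: step {i}")
        await asyncio.sleep(0)          # explicit yield back to the event loop

# Preemptive: the scheduler interrupts OS threads on its own, so neither loop
# has to yield for the other to make progress.
def preemptive_worker(name: str) -> None:
    for i in range(3):
        print(f"{name}: step {i}")
        sum(range(100_000))             # pure CPU work, no voluntary yield

async def main() -> None:
    await asyncio.gather(cooperative_worker("A"), cooperative_worker("B"))

if __name__ == "__main__":
    asyncio.run(main())
    threads = [threading.Thread(target=preemptive_worker, args=(n,)) for n in ("T1", "T2")]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```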
We will circle back to this concept as we discuss the different request processing architectures.
Timeline of request processing#
Single process server#
I deployed my first Ruby on Rails app by just running `rails s` on an Ubuntu server. Rails came bundled with the WEBrick server, which should ideally be used only for development purposes. Running `rails s` would spin up a WEBrick server unless configured otherwise, and WEBrick, by default, starts a single process to handle the requests it receives.
Deploying it to production using the same approach showed how slow the application was when multiple people used the app simultaneously. Of course, there was no asset pipeline or productionisation of other parts, but I had just started out with webdev and was unaware of the pitfalls.
To go into more detail, why did the server slow down when there were multiple users? Having multiple users meant that there were multiple concurrent requests to the server. Since WEBrick is a single process server, anytime a request comes in, it reads the request, processes it and sends back the response. Only once the response is sent back can it handle the next request. So it can only process one request at a time.
This means that if the server takes 100ms on average to process a request, then the total time taken by the server to handle 100 requests is 10 seconds. Pretty bad performance.
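WEBrick is Ruby, but the same behaviour is easy to reproduce with Python's built-in single threaded HTTPServer; treat this as an analogy for the behaviour, not as WEBrick's implementation.

```python
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

class SlowHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        time.sleep(0.1)                  # pretend each request takes ~100ms
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"done\n")

# HTTPServer accepts and handles requests one at a time in a single process,
# so 100 concurrent clients simply queue up behind each other (~10s in total).
if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8000), SlowHandler).serve_forever()
```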
I remember facing a nasty bug where a particular HTTP API was calling another API on the same port. The server was already blocked answering the first request, so the internal request could never be served because the single process never got freed up.
Multi process server#
Then came Unicorn, which pre-forks multiple processes on boot, depending on the configuration and the number of cores. This meant there were multiple processes which could serve concurrent requests at any time. Let's take up the previous example and check how Unicorn improves upon it. If you have 4 cores on your machine, Unicorn would be able to start at least 4 processes to handle requests, which means serving 100 requests of 100ms each would take around 2.5 seconds. This brings us to an important question: how many processes can you have on a system? The number of processes that can run concurrently puts an upper bound on the throughput the system can achieve.
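Unicorn is Ruby, but the pre-fork idea can be sketched in a few lines of Python (Unix only, with an illustrative worker count); this is a sketch of the pattern, not how Unicorn itself is built.

```python
import os
from http.server import BaseHTTPRequestHandler, HTTPServer

WORKERS = 4   # roughly one per core, similar in spirit to a Unicorn worker count

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(f"handled by pid {os.getpid()}\n".encode())

if __name__ == "__main__":
    # The parent binds the listening socket once...
    server = HTTPServer(("127.0.0.1", 8000), Handler)
    # ...then pre-forks workers that all accept() on the inherited socket,
    # so up to WORKERS requests can be processed at the same time.
    for _ in range(WORKERS):
        if os.fork() == 0:              # child process
            server.serve_forever()
            os._exit(0)
    os.wait()
```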
A quick segue to Context Switching#
Most kernels use preemptive scheduling to decide which processes to run. Usually, each CPU core can run only one process at a time. This is where context switching comes into the picture. The kernel, depending on the number of processes queued up, forces certain processes to sleep for some time and runs another waiting process. This ensures that no single process can hog the resources for a prolonged duration.
Back to multi processing. If you have twice as many processes as cores, there is a decent chance that the requests processed per second will go up. This improvement comes because a request may be waiting on IO operations, which means context switching another process into its place helps serve more requests. However, if the ratio of processes to cores goes beyond 4-5x, you would see no further improvement, or even deterioration, since the time spent on context switching becomes higher than the time spent waiting, and there is more contention for a finite number of cores.
Let's assume the best improvement is seen when the ratio of processes to cores is 2. This brings our calculation down to 1.25 seconds to process 100 requests. Still not great.
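The numbers in this section come from a simple back-of-the-envelope model; spelling it out makes the comparison easy to reproduce (it ignores context switching overhead and queuing effects).

```python
def total_time(requests: int, per_request_s: float, concurrency: int) -> float:
    # Naive model: total work divided evenly across `concurrency` workers.
    return (requests / concurrency) * per_request_s

print(total_time(100, 0.1, 1))   # 10.0 seconds -> single process (WEBrick)
print(total_time(100, 0.1, 4))   # 2.5  seconds -> 4 pre-forked workers
print(total_time(100, 0.1, 8))   # 1.25 seconds -> 2 processes per core
```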
Multi process multi thread server#
What would be the next level of optimisation possible? Adding more processes clearly has its limits, as the load of context switching on the kernel increases. This is where Puma comes into the picture. Puma supports multi process, multi threaded request processing. We already saw how multiprocessing was able to increase the throughput. Multi threading is the ability to spawn multiple threads within a process itself. Like in the previous architecture, every thread can process a request, so threads running within a process allow more concurrency for request processing. This means that for a 4 core machine, we can have 4 processes with 4 threads each, which gives a concurrency of 16 requests at a time. 100 requests of 100ms each can then be finished within 625ms.
Why can we increase the number of threads but not the number of processes? Threads are more lightweight compared to processes. They consume fewer resources since threads share certain sections of memory with the parent process itself. Since threads are lighter, the cost of context switching is also significantly lower. And since threads share some of the same memory, the process also has more visibility and control over how its threads are scheduled. This means a process can cooperatively schedule its threads so that one runs while another is blocked or waiting on a resource.
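Puma is Ruby, but Python's standard library has an equivalent of the thread-per-request half of the model; treat the sketch below as an illustration of the idea rather than how Puma is built.

```python
import threading
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        time.sleep(0.1)                  # simulated IO wait
        self.send_response(200)
        self.end_headers()
        self.wfile.write(f"handled by {threading.current_thread().name}\n".encode())

# ThreadingHTTPServer spawns a thread per request, so IO waits overlap within
# one process. Combine this with the pre-fork loop from the earlier sketch and
# you get the "N processes x M threads" model described above.
if __name__ == "__main__":
    ThreadingHTTPServer(("127.0.0.1", 8000), Handler).serve_forever()
```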
Event loop server#
Now we get to the fun part: NodeJS. Whenever you asked the reason behind NodeJS's speed, everyone would say it uses a single threaded asynchronous event loop to process requests.
What does an asynchronous event loop actually mean?
NodeJS runs a single threaded server whose main responsibility is to handle the input and output of the requests
and schedule IO operations.
Why is it encouraged to use Promises and callbacks in Javascript for `setInterval`, `setTimeout` and `http` methods?
Why is the concept of closures more widely seen in Javascript? (bias detected here)
Instead of executing the callback on the same thread where it received the request, NodeJS delegates the task to other threads. Once it delegates the task, it moves on to the next request without waiting for the answer from the delegated task. When the delegated task is complete, its result is put back on the event queue for the NodeJS main thread. When the main thread picks up the returned result, it executes the callback defined in the original call. The main NodeJS thread only handles the IO orchestration.
To delegate tasks, NodeJS uses `libuv`, an asynchronous IO abstraction layer available on all major OSes.
Since the NodeJS runtime is single threaded and mainly meant for orchestrating IO, if you don't use callbacks, you block the main thread, which prevents it from processing further requests till the task has finished.
- When an I/O operation (like reading a file or querying a database) starts, NodeJS does not block the main thread.
- `libuv` registers the operation with `epoll`, which waits for the OS to signal when the operation is ready.
- `epoll` notifies `libuv` when the operation is complete, and the NodeJS event loop picks up the result.
- A callback is executed in the JavaScript runtime, allowing NodeJS to continue handling requests efficiently.
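NodeJS is JavaScript, but Python's asyncio follows the same single threaded event loop pattern and makes the behaviour easy to demonstrate; the sketch below is an analogy, not NodeJS internals.

```python
import asyncio

# The event loop runs on a single thread. When a handler hits an IO await, the
# loop parks it, asks the OS to watch the underlying IO (epoll/kqueue under the
# hood of the selector event loop), and switches to the next ready handler.
async def handle_request(request_id: int) -> None:
    print(f"request {request_id}: parsing")
    await asyncio.sleep(0.1)            # stand-in for a database/network call
    print(f"request {request_id}: callback runs with the result")

async def main() -> None:
    # 100 "concurrent" requests finish in roughly 100ms of wall time, because
    # the loop never sits idle while any one of them is waiting on IO.
    await asyncio.gather(*(handle_request(i) for i in range(100)))

if __name__ == "__main__":
    asyncio.run(main())
```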
Goroutine based processing#
When Golang was released, lots of benchmarks were run against NodeJS, and Go beat NodeJS in most of them. Go was designed with concurrency as a first class concern, which led to the inclusion of goroutines, channels, waitgroups, etc. from the very beginning. Goroutines are extremely lightweight threads with a default stack size of 2 KB that can grow as the need arises. OS threads, comparatively, are far heavier, with a default stack size of around 1 MB. This lightweight nature of goroutines means the Go runtime can easily support thousands of concurrent requests, since it doesn't have to funnel its IO orchestration through a single thread the way the NodeJS event loop does.
Fun fact: goroutines were cooperatively scheduled in the first few versions of Go. A goroutine would be scheduled off the CPU only when it hit IO operations, sleeps, channel operations, etc. Go's fully preemptive scheduler was introduced in Go 1.14.
Bonus reading#
As discussed in this colorful and wonderful blog, most languages with `async` and `await` keywords define the need for the result to be returned, i.e. the stack needs to unwind completely for the function to return. The runtime usually doesn't support running another task till the ongoing task yields or completes. This is always going to become a bottleneck sooner or later, because anytime a task overshoots its usage, the other tasks are affected.
C# implements `await` by having the compiler split the function into separate pieces at every `await` point, effectively rewriting it into a resumable state machine.
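Python's async functions go through the same kind of transformation; a plain generator makes the "split at the suspension point" idea visible without any event loop machinery (the names below are purely illustrative).

```python
# The function body is effectively cut at the suspension point: the first half
# runs until the yield, and the second half is a continuation that only runs
# once the caller hands back the awaited result.
def handler():
    print("first half: runs until the suspension point")
    result = yield "waiting on IO"      # <- the split point
    print(f"second half: resumed with {result!r}")

coro = handler()
print(next(coro))                       # runs the first half, returns "waiting on IO"
try:
    coro.send("db row")                 # resumes the second half with the result
except StopIteration:
    pass
```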
Languages like Python are more cooperative and run async tasks on a single threaded event loop. There are also GIL (Global Interpreter Lock) limitations in languages like Python, which prevent threads from achieving true parallelism for CPU-bound work.
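A quick way to see the GIL in action (the timings are machine dependent and the workload below is arbitrary): CPU-bound work spread across threads takes roughly as long as running it serially, while spreading it across processes actually uses multiple cores.

```python
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def burn_cpu(n: int) -> int:
    return sum(i * i for i in range(n))     # pure CPU work, holds the GIL

def timed(executor_cls) -> float:
    start = time.perf_counter()
    with executor_cls(max_workers=4) as pool:
        list(pool.map(burn_cpu, [5_000_000] * 4))
    return time.perf_counter() - start

if __name__ == "__main__":
    print("threads:  ", timed(ThreadPoolExecutor))    # barely faster than serial
    print("processes:", timed(ProcessPoolExecutor))   # close to a 4x speedup
```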
Hope you liked reading the article.
Please reach out to me here for more ideas or improvements.