Snapshotting in a high throughput shared nothing database

Snapshotting in a high throughput shared-nothing database#

While working on a Golang based in-memory database, I recently had to implement point in time snapshots for the datastore. The in-memory database has a shared nothing architecture allowing it to run multiple goroutines, usually based on the number of available cores and the keys are allocated accordingly to the shard goroutines.

Here are the requirements and considerations:

Copy data from memory to disk periodically
Ensure minimal disruption to the actual write operation
Take into consideration the availability of multiple shards and the change in the existing number of shards as well
The snapshot doesn’t have to keep up with the incoming writes after the snapshot process has started

Right off the top of my mind, the most simplest approach seems to be having another process/goroutine reading from the existing memory where the data is being stored.

Few problems with this approach is:

Data has to be copied entirely to another process
Memory can get to a corrupted state since the process is copying data while writes are still coming in.
High bursts of CPU and memory on both the writer and the snapshotter process

How does Redis do it?#

Redis, being a single threaded server, uses Copy on Write (COW) mechanisms between parent and child processes to perform snapshots using BGSAVE or SAVE.

Copy on write is a mechanism of sharing data across different actors where the data has to be copied only when there is a change in the original memory and only in the memory sections where the change has happened. This means that there is minimal change in the memory as write operations come in while snapshotting is taking place. This also prevents the need to copy data from one process to another at one go, which prevents high resource consumption spikes in both the processes.

Whenever a snapshot process has to be started, the Redis server calls a fork() and exec() which creates a child process which can then write it to the disk without affecting the parent process’s resources.

Testing CoW on a Golang process#

Since the in-memory database I am working on is in Golang, I tested the above hypotheses to check if this would work for snapshotting the data for a single thread/shard. We will come to the point where this may need to be modified to support multiple shards.

The test will contain the following steps:

Initiate a Go process which allocates a large chunk of memory, let’s say 1GB. Note the memory spike using free -m or htop or any other memory monitoring tool.
Call sycall.ForkExec() and operate on the same large chunk of memory, just print it or have some calculation on top of it. The memory should not spike apart from the newly allocated memory.
Start modifying the existing data in the parent/child process. As the size of the modified data starts increasing, the size of memory consumption should also see a similar spike.

Problems with Copy on Write operations#

When an object changes, there are 2 writes instead of 1 write. One write is copied to the child process’s memory and the other is written to the parent. This has the ability to be signficant in cases of high throughput systems
For every operation, an extra read operation has to be done to find the right block to read the updated data from
Varying implementations on different platforms like Windows, Unix and Linux.

Redirect on write (RoW)#

In RoW systems, instead of copying data, there is a layer of pointers which can be replaced to point to the snapshot or the underlying data block. The snapshot system keeps track of the locations of all blocks that make up a snapshot. In other words, it maintains a list of pointers and knows where each pointer’s corresponding block is stored. When a process requests access to a snapshot, it uses these pointers to retrieve the blocks from their original locations. Changes to blocks, which result in them being replaced and referenced by new pointers, have no impact on the snapshot process. In a redirect-on-write system, reading a snapshot incurs no computational overhead.

When modifying a protected block, the redirect-on-write approach requires only one-third of the I/O operations compared to other methods, and it does not add any extra computational cost when reading a snapshot. As a result, copy-on-write systems can significantly affect the performance of the protected entity. The more snapshots that are created and the longer they are retained, the greater the impact on performance. This is why copy-on-write snapshots are typically used as temporary backups—they are created, backed up, and promptly deleted.

In contrast, redirect-on-write snapshots are often generated every hour or even every few minutes and can be retained for days or even months, only being deleted when storage capacity becomes a concern. The longer a snapshot is kept, the more storage is needed to maintain previous versions of modified blocks.

References#

Portworx Redirect on write

I hope you liked the article. Please let me know if you have any queries regarding the article. Happy reading!!