~/posts/2025-02-20_how-safe-is-your-fsync.md
$

cat 2025-02-20_how-safe-is-your-fsync.md


Have you ever wondered how durable your writes are? Do you expect that calling fd.write will persist the data across crashes and reboots? Oh, you use fsync after your writes, so there's no chance of losing your data now? This post is going to break your trust in fsync, just like it broke mine, and it's going to be fun!

The journey of a write operation on Linux

Before getting into the details, let's understand the components involved in a write operation. When you issue a write on a file descriptor, the data is copied from user space into the kernel's buffers, the page cache. The kernel doesn't write the data to storage on receiving the write operation. It just marks the pages as dirty and returns success to the user. The kernel periodically detects that there is dirty data in the page cache and writes it out lazily in batches, trying to optimise write throughput.

While flushing data to storage, the kernel goes through the virtual file system (VFS) layer and the concrete filesystem (ext4, xfs) to write the data to the actual storage device (HDD, SSD). Once the filesystem flushes the data to storage, it returns a success response to the kernel, which then marks the dirty pages as clean. Note that the storage device also maintains a cache of its own, so data acknowledged by the device may still be sitting in volatile memory. This is the journey of a successful write request through the kernel.
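To make this concrete, here is a minimal sketch in C (the file name and record contents are made up for illustration): write returns as soon as the bytes land in the page cache, and it is fsync that asks the kernel to push them down to the device.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    int fd = open("journal.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) { perror("open"); return 1; }

    const char *record = "hello, page cache\n";
    /* Returns as soon as the bytes are copied into the page cache;
     * the pages are merely marked dirty at this point. */
    if (write(fd, record, strlen(record)) < 0) { perror("write"); return 1; }

    /* Ask the kernel to flush the dirty pages (and metadata) to the
     * storage device before returning. */
    if (fsync(fd) < 0) { perror("fsync"); return 1; }

    close(fd);
    return 0;
}
```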

Different types of write configurations

There are cases when the application wants to ensure that the data has actually reached the storage device. Depending on which layer of the write journey you want the data persisted to, there are different options, and different filesystems and operating systems may behave differently as well.

The following table is taken from this blog, which has far more information on the different configurations.

| Operation | Application Memory | Page Cache | Disk Cache | Disk Storage |
|-----------------------------------|-------------------|------------|------------|--------------|
| File integrity | | | | |
| write() | ● | → | | |
| write() + O_DIRECT | ● | (bypassed) | → | |
| write() + O_DIRECT + O_SYNC | ● | (bypassed) | → | → |
| fsync() | ● | → | → | → |
| Data integrity | | | | |
| write() + O_DIRECT + O_DSYNC | ● | (bypassed) | → | → |
| fdatasync() | ● | → | → | → |
| sync_file_range (btrfs, zfs) | ● | → | → | → |
| sync_file_range (ext4, xfs) | ● | → | → | → |

Here ● marks where the data starts, → marks each further layer the data reaches before the operation completes, and (bypassed) marks a layer that O_DIRECT skips.

Interpreting the above table: when you perform a plain write, the data only reaches the page cache layer. If you pass the O_DIRECT flag, it skips the page cache and writes directly to the disk cache layer. Opening with O_SYNC ensures that all metadata and data contents of a file are synced to disk, including the disk storage layer. O_DSYNC ensures only the data contents of a file are synced to disk; the file's metadata is not synced immediately.
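Here is a rough sketch of what opening a file with these flags looks like in C (the file name and the 4096-byte alignment are illustrative assumptions; the exact alignment O_DIRECT requires depends on the device and filesystem).

```c
#define _GNU_SOURCE            /* for O_DIRECT on Linux */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    /* O_SYNC: write() returns only after data *and* metadata are on
     * stable storage. Swap in O_DSYNC to sync the data only. */
    int fd = open("direct.dat", O_WRONLY | O_CREAT | O_DIRECT | O_SYNC, 0644);
    if (fd < 0) return 1;

    /* O_DIRECT requires an aligned buffer, offset, and length. */
    void *buf;
    size_t len = 4096;
    if (posix_memalign(&buf, 4096, len)) return 1;
    memset(buf, 'x', len);

    ssize_t n = write(fd, buf, len);   /* bypasses the page cache */
    free(buf);
    close(fd);
    return n == (ssize_t)len ? 0 : 1;
}
```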

When data is appended to a file, the size of the file increases and the page blocks representing the file increase as well. sync_file_range ensures that a given range of the file, including such newly appended blocks, is written out to disk.
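A small sketch of how sync_file_range might be used on an append-only log (the file name and record are made up; note that the Linux man page warns this call does not sync the file's metadata, so by itself it is not a full durability guarantee).

```c
#define _GNU_SOURCE            /* for sync_file_range */
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    int fd = open("append.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) return 1;

    const char *rec = "appended record\n";
    off_t start = lseek(fd, 0, SEEK_END);   /* offset of the new data */
    if (write(fd, rec, strlen(rec)) < 0) return 1;

    /* Write out and wait on just the appended byte range, instead of
     * flushing the whole file. */
    if (sync_file_range(fd, start, strlen(rec),
                        SYNC_FILE_RANGE_WAIT_BEFORE |
                        SYNC_FILE_RANGE_WRITE |
                        SYNC_FILE_RANGE_WAIT_AFTER) < 0)
        return 1;

    close(fd);
    return 0;
}
```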

When fsync fails

We have seen above what happens when fsync completes successfully. But fsync can also fail:

> The fsync man pages report that fsync may fail for many reasons: the underlying storage medium has insufficient space (ENOSPC or EDQUOT), the file descriptor is not valid (EBADF), or the file descriptor is bound to a file that does not support synchronization (EINVAL).

Let's see what happens when fsync doesn't execute successfully.

Enter fsyncgate on PostgreSQL

In 2018, the PostgreSQL developers discovered a critical fsync bug, rooted in a widespread and vague understanding of what fsync actually guarantees. You can read the entire thread here.

One PostgreSQL user reported that a storage error had resulted in data corruption on XFS. The investigation uncovered the following sequence: PostgreSQL wrote some data into the kernel, dirtying pages in the page cache, and those pages were later written back to the storage device. The storage device returned an error, so the writeback was marked as failed (EIO) by the XFS layer. When PostgreSQL then called fsync, it received an EIO error indicating that the earlier writeback had failed. But once that error was reported, the kernel cleared the AS_EIO error flag. This means that when PostgreSQL retried the checkpointing process, the next fsync returned success. The checkpoint therefore completed "successfully" without the data ever being written to disk, leading to data loss.
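To see why this pattern is so dangerous, here is a sketch of the kind of retry loop that gets burned by it (a hypothetical checkpoint helper, not PostgreSQL's actual code).

```c
#include <stdio.h>
#include <unistd.h>

int checkpoint(int fd) {
    if (fsync(fd) == 0)
        return 0;                 /* looks durable */
    perror("fsync");              /* e.g. EIO from a failed writeback */

    /* DANGEROUS: on older Linux kernels the dirty pages were already
     * marked clean and the error flag cleared, so this second fsync()
     * can return 0 without anything having been rewritten. */
    if (fsync(fd) == 0)
        return 0;                 /* "success" that silently lost data */
    return -1;
}
```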

The above problem would have played out differently on ext4 mounted with errors=remount-ro, as the filesystem would have forced a read-only remount when the storage device error was encountered.

PostgreSQL subsequently solved this issue by doing at the application level what ext4's errors=remount-ro does at the filesystem level: any fsync error now crashes the process, forcing it to re-read from the checkpointed file with fresh memory pages, without having to worry about whether the failed pages are still in memory.
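In code, the fix boils down to a pattern like this (again a hypothetical sketch, not PostgreSQL's actual implementation).

```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

void checkpoint_or_die(int fd) {
    if (fsync(fd) != 0) {
        perror("fsync");
        /* Never retry fsync: abort and let crash recovery replay the
         * durable log with fresh, trustworthy pages. */
        abort();
    }
}
```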

PostgreSQL was not the only victim. Redis periodically appends updates to its AOF (append-only file). When flushing data to the AOF, it did not check the fsync status code, so an update could appear successful in memory and in the page cache even though the AOF was never durably written. This results in data loss or corruption when the server restarts and replays the AOF.

fsync Failure Analysis on different filesystems

This paper experiments with different types of workloads on different filesystems to uncover how fsync failures are handled.

We will cover the experiments on three filesystems:

  • ext4
  • xfs
  • btrfs

When a write operation is performed, the data is put in the page cache and ext4 marks the pages as dirty. On calling fsync, the data is written to the storage blocks and the metadata, i.e. the inode with the new updated-at time, is updated. The pages are then marked as clean, assuming no errors are encountered.

How different filesystems behave with a failed fsync

For ext4, when the fsync call fails, the metadata is not updated, but the dirty pages are still marked as clean. Since the pages are marked clean, a subsequent fsync is able to update the inode entry with the new updated-at time as well. If the application reads the data before a reboot, while the pages are still in the cache, it will see the newly updated information even though the fsync operation failed. If it reads the same data after a reboot, once the pages are gone from the cache, it will see the older data, since the write was never persisted to disk.

The xfs filesystem behaves similarly to ext4, except that when an fsync failure happens it shuts down the filesystem entirely, thereby blocking all read and write operations. It also retries metadata updates when it encounters a checkpointing fault.

btrfs, which is a copy-on-write filesystem, writes fsync changes to a log tree instead of updating a journal in place. Rather than overwriting the same block, btrfs creates a new block and then updates the block links in the root. Because it keeps separate copies of the old and new data, btrfs is able to revert to the old state when an fsync failure is encountered, unlike xfs and ext4. btrfs does not persist metadata after a data-block failure. However, because the file descriptor's offset is still incremented, subsequent writes and fsyncs leave a hole in the middle of the file.

The FreeBSD VFS layer chooses to re-dirty pages when there is a failure (except when the device is removed) while Linux hands over the failure handling responsibility to the individual file systems below the VFS layer.

All the filesystems mentioned above were affected by fsync failures, whether through wrong data being read, incorrect state, or filesystem unavailability. Below is a tabulation of how the different filesystems were impacted.

The questions asked of each filesystem:

  • Q1: Which block failure causes fsync failure?
  • Q2: Is metadata persisted on data block failure?
  • Q3: Which block failures are retried?
  • Q4: Is the page dirty or clean after failure?
  • Q5: Does the in-memory content match disk?
  • Q6: Which fsync reports the failure?
  • Q7: Is the failure logged to syslog?
  • Q8: Which block failure causes unavailability?
  • Q9: What type of unavailability?
  • Q10: Holes or block overwrite failures? If yes, where?
  • Q11: Can fsck help detect holes or block overwrite failures?

| Filesystem | Mode | Q1 | Q2 | Q3 | Q4 | Q5 | Q6 | Q7 | Q8 | Q9 | Q10 | Q11 |
|-----------|---------|------------|---------|------|-----------|--------|-----------|-----|------------|------------|--------------------|-----|
| ext4 | ordered | data, jrnl | yes (A) | - | clean (B) | no (B) | immediate | yes | jrnl | remount-ro | NOB, anywhere (A) | no |
| ext4 | data | data, jrnl | yes (A) | - | clean (B) | no (B) | next (C) | yes | jrnl | remount-ro | NOB, anywhere (A) | no |
| XFS | - | data, jrnl | yes (A) | meta | clean (B) | no (B) | immediate | yes | jrnl, meta | shutdown | NOB, within (A) | no |
| Btrfs | - | data, jrnl | no | - | clean | yes | immediate | yes | jrnl, meta | remount-ro | HOLE, within (D) | yes |

Notes

  • (A) Non-overwritten blocks (Q10) occur because metadata is persisted despite data-block failure (Q2).
  • (B) Marking a dirty page clean (Q4) even though the content does not match the disk (Q5) is problematic.
  • (C) Delayed reporting (Q6) of fsync failures may confuse application error-handling logic.
  • (D) Continuing to write to a file after an fsync failure is similar to writing to an offset greater than file size, causing a hole in the skipped portion (Q10).

Conclusion

While fsync is commonly trusted to ensure data durability, real-world cases like fsyncgate and studies on different filesystems show that its behavior is far from foolproof. The handling of fsync failures varies significantly across filesystems: some may silently lose data, others may shut down entirely, and a few, like btrfs, attempt to mitigate failures through copy-on-write mechanisms. This complexity underscores the need for applications to be aware of how their underlying storage behaves and to implement additional safeguards where necessary. Understanding these intricacies can help prevent unexpected data loss and improve system resilience in the face of storage failures.

References

  • Can Applications Recover from fsync Failures? (Rebello et al., USENIX ATC '20): https://www.usenix.org/system/files/atc20-rebello.pdf
  • Disk I/O, transactional.blog: https://transactional.blog/how-to-learn/disk-io
  • Files are hard, Dan Luu: https://danluu.com/file-consistency/
  • fsyncgate, Dan Luu: https://danluu.com/fsyncgate/

Hope you liked reading the article.

Please reach out to me here for more ideas or improvements.