Have you ever wondered how durable your writes are? Do you expect that calling `fd.write` will persist the data across crashes and reboots? Oh, you call `fsync` after your writes, so there's no chance of losing your data now? This post is going to break your trust in `fsync`, just like it broke mine, and it's going to be fun!
The journey of a write operation on Linux
Before getting into the details, let's understand the components involved in a write operation.

When you issue a `write` on your file descriptor, the data is first copied from user space into kernel space, into the operating system's buffers. The kernel doesn't write the data directly to storage on receiving the `write`; it just marks the pages as dirty and returns a success to the user. The kernel periodically detects that there is dirty data in its page buffers and writes it out lazily, in batches, to optimise write throughput.
While flushing data to storage, the kernel goes through the filesystem (`ext4`, `xfs`, etc.) below the virtual file system (VFS) layer to write the data to the actual storage device (HDD, SSD). Once the VFS flushes the data to the storage, it returns a success response to the kernel. When the kernel receives the success response, it marks the dirty buffers as clean.

The storage device also maintains a cache of its own, so data acknowledged by the device may still sit in that volatile disk cache before it reaches persistent storage.
This is the journey of a successful write request in the kernel.
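To make the journey concrete, here is a minimal C sketch (the file name is illustrative): `write` only lands the bytes in the page cache, and `fsync` is what asks the kernel to push the dirty pages down to the device.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    /* Hypothetical file, opened for appending. */
    int fd = open("journal.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) { perror("open"); return 1; }

    const char *record = "hello, durable world\n";

    /* Success here only means the bytes reached the kernel's page cache. */
    if (write(fd, record, strlen(record)) < 0) { perror("write"); return 1; }

    /* fsync blocks until the kernel has flushed the dirty pages (and the
     * file's metadata) towards the storage device. */
    if (fsync(fd) < 0) { perror("fsync"); return 1; }

    close(fd);
    return 0;
}
```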
Different types of write configurations
There are cases when the application wants to ensure that the data is written to the storage device. There are different options for persisting the data up to a certain layer of the write journey, and different filesystems and operating systems may behave differently as well.
The following table is adapted from this blog, which has far more information regarding the different configurations.
| Operation | Application Memory | Page Cache | Disk Cache | Disk Storage |
|-----------|--------------------|------------|------------|--------------|
| **File Integrity** | | | | |
| `write()` | ✓ | ✓ | | |
| `O_DIRECT` | ✓ | (bypassed) | ✓ | |
| `O_DIRECT + O_SYNC` | ✓ | (bypassed) | ✓ | ✓ |
| `fsync()` | ✓ | | ✓ | ✓ |
| **Data Integrity** | | | | |
| `O_DIRECT + O_DSYNC` | ✓ | (bypassed) | ✓ | ✓ |
| `fdatasync()` | ✓ | | ✓ | ✓ |
| `sync_file_range()` (btrfs, zfs) | ✓ | | ✓ | ✓ |
| `sync_file_range()` (ext4, xfs) | ✓ | | ✓ | ✓ |
Interpreting the above table: when you perform a plain `write`, the data is written only to the page cache layer. If you pass the `O_DIRECT` flag, the write skips the page cache and goes directly to the disk cache layer.

`O_SYNC` ensures that both the data and the metadata of a file are synced to the disk; with `O_SYNC`, the disk storage layer is also updated. `O_DSYNC` ensures only the data content of a file is synced to the disk; the file's metadata is not synced immediately.

When data is appended to a file, the file grows and the number of page blocks backing it increases. `sync_file_range` ensures that these additional blocks are also synced to disk.
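As a rough illustration of these options, here is a hedged C sketch (file names are made up and error handling is trimmed); note that `O_DIRECT` on Linux typically requires the `_GNU_SOURCE` feature macro and an aligned buffer:

```c
#define _GNU_SOURCE           /* needed for O_DIRECT on Linux */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    /* Data and metadata forced down to disk storage on every write. */
    int fd_sync = open("sync.dat", O_WRONLY | O_CREAT | O_SYNC, 0644);

    /* Bypass the page cache; with O_DSYNC the data (but not all metadata)
     * is synced on every write. */
    int fd_direct = open("direct.dat", O_WRONLY | O_CREAT | O_DIRECT | O_DSYNC, 0644);

    /* O_DIRECT needs an aligned buffer and an aligned I/O size. */
    void *buf = NULL;
    posix_memalign(&buf, 4096, 4096);
    memset(buf, 'x', 4096);
    write(fd_direct, buf, 4096);

    /* Or write through the page cache and sync only the data blocks later. */
    int fd_plain = open("plain.dat", O_WRONLY | O_CREAT, 0644);
    write(fd_plain, "some data\n", 10);
    fdatasync(fd_plain);      /* data and size, but not e.g. the mtime */

    free(buf);
    close(fd_plain);
    close(fd_direct);
    close(fd_sync);
    return 0;
}
```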
When fsync fails
We have seen above what happens when `fsync` completes successfully. The fsync man page reports that fsync may fail for many reasons: the underlying storage medium has insufficient space (`ENOSPC` or `EDQUOT`), the file descriptor is not valid (`EBADF`), the file descriptor is bound to a file that does not support synchronization (`EINVAL`), or an I/O error occurred during writeback (`EIO`).
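In application code, these failures show up as a non-zero return from `fsync` with `errno` set. A minimal sketch of inspecting the cause (the helper name is made up):

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: report why an fsync failed instead of ignoring it. */
int sync_or_report(int fd) {
    if (fsync(fd) == 0)
        return 0;

    switch (errno) {
    case EIO:      /* the device reported an I/O error during writeback */
    case ENOSPC:   /* the underlying storage medium ran out of space    */
    case EDQUOT:   /* a disk quota was exceeded                         */
        fprintf(stderr, "fsync failed, data may not be durable: %s\n",
                strerror(errno));
        break;
    case EBADF:    /* fd is not a valid open file descriptor            */
    case EINVAL:   /* fd refers to an object that cannot be synced      */
        fprintf(stderr, "fsync failed due to a usage error: %s\n",
                strerror(errno));
        break;
    default:
        fprintf(stderr, "fsync failed: %s\n", strerror(errno));
    }
    return -1;
}
```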
Let's see what happens when `fsync` doesn't execute successfully.
Enter fsyncgate on PostgreSQL
In 2018, a critical `fsync` bug was discovered by the PostgreSQL developers, rooted in mishandling and a vague understanding of what the `fsync` call actually guarantees. You can read the entire thread here.

One PostgreSQL user reported that a storage error had resulted in data corruption on XFS.

What was observed during the investigation was that PostgreSQL wrote some data into the kernel, dirtying the pages, and the kernel later wrote those pages back to the storage device. The storage device returned an error, which resulted in the writeback pages being marked as failed (`EIO`) by the XFS layer. When PostgreSQL then called `fsync`, it received an `EIO` error indicating that the earlier writeback had failed.
Once the error was delivered, the kernel cleared the `AS_EIO` flag on the pages. This means that when PostgreSQL retried the checkpointing process, the `fsync` operation returned a success response. The checkpoint therefore reported success without the data ever being written to the disk, leading to data loss.
The above problem would have played out differently on `ext4` mounted with `errors=remount-ro`, as the filesystem would have been remounted read-only when a storage device error was encountered.
PostgreSQL subsequently solved the issue by adopting a fail-stop approach similar in spirit to `ext4`'s `errors=remount-ro`: any `fsync` error now crashes the process, forcing it to re-read from the checkpointed file with fresh memory pages, without having to worry about whether the failed pages are still lingering in memory.
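A hedged sketch of that fail-stop pattern (this is not PostgreSQL's actual code, and the function name is made up):

```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Illustrative checkpoint helper: treat any fsync failure as fatal. */
void checkpoint_sync(int fd) {
    if (fsync(fd) != 0) {
        /* Do NOT retry: by the time fsync reports the error, the kernel may
         * already have marked the pages clean, so a second fsync would
         * falsely report success. Crash and recover from the log instead. */
        perror("fsync failed during checkpoint; aborting to force recovery");
        abort();
    }
}
```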
Redis periodically appends updates to its AOF (append-only file). When flushing data to the AOF, it doesn't check the `fsync` status code itself, thus allowing keys to remain in memory and in the page cache even though the AOF file was never successfully written. This results in data corruption when the server has to restart the process and read the AOF file.
fsync failure analysis on different filesystems
This paper runs different types of workloads on different filesystems to uncover how `fsync` failures are handled.
We will cover the experiments on three filesystems:
- ext4
- xfs
- btrfs
When a write operation is performed, the data is put in the page cache and `ext4` marks the pages as dirty. On calling `fsync`, the data is written to the storage blocks and the metadata (the inode, with its new modification time) is updated. The pages are then marked as clean, and no errors are encountered.
How different filesystems behave with a failed fsync
For `ext4`, when the `fsync` call fails, the metadata is not updated, but the dirty pages are still marked as clean. Since the pages are marked clean, a subsequent `fsync` is able to update the inode entry with the new modification time. If the application reads the data before a reboot, while the pages are still in the cache, it will see the newly updated information even though the `fsync` operation failed. If the application reads the same data after a reboot, once the pages have been evicted from the cache, it will see the older data, since the actual write was never persisted to the disk.
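The practical consequence is that simply calling `fsync` again is not enough; if the application wants to retry at all, it has to rewrite the data from its own copy so the pages become dirty again. A hedged sketch of that idea (the helper is hypothetical and does not cover every failure mode discussed in the paper):

```c
#include <unistd.h>

/* Hypothetical retry helper: re-issue the write from the application's own
 * buffer on every attempt, because after a failed fsync the kernel may have
 * already marked the (unwritten) pages clean. */
int durable_pwrite(int fd, const void *buf, size_t len, off_t offset) {
    for (int attempt = 0; attempt < 3; attempt++) {
        if (pwrite(fd, buf, len, offset) != (ssize_t)len)
            continue;                /* the write itself failed; try again */
        if (fsync(fd) == 0)
            return 0;                /* data (and metadata) reached the device */
    }
    return -1;                       /* still not durable; report the error */
}
```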
The `xfs` filesystem behaves similarly to `ext4`, except that when an `fsync` failure happens it shuts down the filesystem entirely, thereby blocking all read and write operations. It also retries metadata updates when it encounters a checkpointing fault.
`btrfs`, being a copy-on-write filesystem, writes the `fsync` changes to a log tree instead of updating a journal in place. Rather than overwriting the same block, `btrfs` creates a new block and then updates the block links in the root.

Given that it keeps separate copies of the old and new data, `btrfs` is able to revert to the old state when an `fsync` failure is encountered, unlike `xfs` and `ext4`.

`btrfs` does not persist metadata after a data-block failure. However, because the process's file descriptor offset is still advanced, subsequent writes and fsyncs leave a hole in the middle of the file.
The FreeBSD VFS layer chooses to re-dirty pages when there is a failure (except when the device is removed), while Linux hands the failure-handling responsibility to the individual filesystems below the VFS layer.
All the filesystems mentioned above are affected by `fsync` failures in some way, whether through wrong data being read, incorrect state, or filesystem unavailability. Below is a tabulation of how the different filesystems are impacted by `fsync` failures.
| Filesystem | Mode | Q1 (Which block failure causes fsync failure?) | Q2 (Is metadata persisted on data block failure?) | Q3 (Which block failures are retried?) | Q4 (Is the page dirty or clean after failure?) | Q5 (Does the in-memory content match disk?) | Q6 (Which fsync reports the failure?) | Q7 (Is the failure logged to syslog?) | Q8 (Which block failure causes unavailability?) | Q9 (What type of unavailability?) | Q10 (Holes or block overwrite failures? If yes, where?) | Q11 (Can fsck help detect holes or block overwrite failures?) |
|-----------|------|------|------|------|------|------|------|------|------|------|------|------|
| ext4 | ordered | data, jrnl | yes (A) | - | clean (B) | no (B) | immediate | yes | jrnl | remount-ro | NOB, anywhere (A) | no |
| ext4 | data | data, jrnl | yes (A) | - | clean (B) | no (B) | next (C) | yes | jrnl | remount-ro | NOB, anywhere (A) | no |
| XFS | - | data, jrnl | yes (A) | meta | clean (B) | no (B) | immediate | yes | jrnl, meta | shutdown | NOB, within (A) | no |
| Btrfs | - | data, jrnl | no | - | clean | yes | immediate | yes | jrnl, meta | remount-ro | HOLE, within (D) | yes |
Notes
- (A) Non-overwritten blocks (Q10) occur because metadata is persisted despite data-block failure (Q2).
- (B) Marking a dirty page clean (Q4) even though the content does not match the disk (Q5) is problematic.
- (C) Delayed reporting (Q6) of fsync failures may confuse application error-handling logic.
- (D) Continuing to write to a file after an fsync failure is similar to writing to an offset greater than file size, causing a hole in the skipped portion (Q10).
Conclusion
While fsync is commonly trusted to ensure data durability, real-world cases like fsyncgate and studies on different filesystems show that its behavior is far from foolproof. The handling of fsync failures varies significantly across filesystems: some may silently lose data, others may shut down entirely, and a few, like btrfs, attempt to mitigate failures through copy-on-write mechanisms. This complexity underscores the need for applications to be aware of how their underlying storage behaves and to implement additional safeguards where necessary. Understanding these intricacies can help prevent unexpected data loss and improve system resilience in the face of storage failures.
References
- https://www.usenix.org/system/files/atc20-rebello.pdf
- https://transactional.blog/how-to-learn/disk-io
- https://danluu.com/file-consistency/
- https://danluu.com/fsyncgate/
Hope you liked reading the article.
Please reach out to me here for more ideas or improvements.