After a recent SD Card failure on a Raspberry Pi, I decided to research storage devices and configurations to improve performance and device lifetime. This post contains the results of that research.
SD Card Types and Reliability
An enlightening comment chain on Hacker News about SD Card reliability prompted me to research the common NAND flash technologies for representing bits in flash cells. In decreasing order of cost and reliability:
- Single-Level Cell (SLC)
- Stores one bit per cell.
- Multi-Level Cell (MLC)
- Stores two bits per cell.
- Triple-Level Cell (TLC)
- Stores three bits per cell.
Due to the high cost of SLC, there are intermediate technologies which use MLC flash cells with firmware that stores only one bit per cell instead of two. This yields better reliability and longevity than traditional MLC at a lower cost than SLC:
- advancedMLC (aMLC)
- ATP Electronics' name for MLC with one bit per cell.
- SLC Lite (pSLC)
- Panasonic's name for MLC with one bit per cell.
For my current project I decided to use an 8GB ATP aMLC card (AF8GSD3A or AF8GUD3A with an adapter - both are available from Digi-Key, Arrow, and other suppliers).
Logical Volumes and Filesystems
For my current project, power failures and hard resets are not uncommon. I need a storage configuration which performs well on an SD Card and is reasonably resistant to corruption after power failure. “eMMC/SSD File System Tuning Methodology” (2013) by Cogent Embedded, Inc. is a wonderful source of information for this purpose.
F2FS
The most performant configuration appears to be a single partition with F2FS, a filesystem which is optimized for flash storage. Unfortunately, as noted in the “Power-Fail Tolerance” section, F2FS is unsuitable in the presence of power failure. Although it now includes an fsck utility, “[the] initial version of the tool does not fix any inconsistency”.
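For anyone who wants to benchmark it regardless, a minimal sketch of creating and mounting an F2FS filesystem (assuming f2fs-tools is installed; the device name is illustrative):

# Create an F2FS filesystem with the label "rootfs" and mount it
mkfs.f2fs -l rootfs /dev/mmcblk0p2
mount -t f2fs /dev/mmcblk0p2 /mnt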
BTRFS
lockheed on Unix SE provided a corruption-resistant configuration using BTRFS RAID. This approach looks promising, with the adjustment noted in the comments to use the BTRFS DUP profile instead of RAID1. As I understand it, the primary differences are that the DUP profile reads only one copy when it is not corrupted and that the placement of the copies on disk may differ. However, if the SD Card deduplicates data internally, this approach will not actually provide any redundancy (as noted in the “DUP Profiles on a Single Device” section of the mkfs.btrfs man page). I do not think SD cards currently deduplicate data internally, but this is a significant concern.
Note that BTRFS DUP/RAID is useful because the filesystem checksums identify which copy is corrupted. Generic software RAID1 across partitions would not reduce corruption because it has no way to determine which copy is bad, so it was not considered.
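For reference, a minimal sketch of creating such a filesystem (the device name is illustrative):

# Keep two copies of both metadata (-m) and data (-d)
mkfs.btrfs -m dup -d dup /dev/mmcblk0p2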
ext4
ext4 is a very widely deployed filesystem and the default of most Raspberry Pi distributions. “eMMC/SSD File System Tuning Methodology” notes that ext4 tolerated power failures quite well, while BTRFS did not. This result may have changed due to BTRFS improvements since 2013 and with the use of DUP (or RAID1 across partitions) as described above. The result may also differ when using the ext4 metadata_csum feature for metadata checksums. However, I have not conducted a comparison.
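As a sketch, metadata_csum can be enabled when the filesystem is created, or later on an unmounted filesystem, given a sufficiently recent kernel and e2fsprogs (1.43 or later); the device name is illustrative:

# Create a new ext4 filesystem with metadata checksums
mkfs.ext4 -O metadata_csum /dev/mmcblk0p2
# Or enable it on an existing, unmounted filesystem (run e2fsck -f first)
tune2fs -O metadata_csum /dev/mmcblk0p2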
There are also other application-specific features to consider between ext4 and BTRFS. For example, BTRFS supports filesystem snapshots, subvolumes, and compression. Also, ext4 is built in to the Raspberry Pi Foundation-provided kernel builds while BTRFS is not, necessitating an initramfs to boot from a BTRFS root filesystem (see raspberrypi/linux#1550, raspberrypi/linux#1761). Keeping such an initramfs updated to match the kernel is also complicated on the Pi and requires custom scripting or manual filename changes on update (see raspberrypi/firmware#608 and RPi-Distro/firmware#1; note that the referenced rpi-initramfs-tools package has not yet been created).
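For illustration, loading an initramfs on the Pi is configured with a line like the following in /boot/config.txt, where the filename must be changed by hand (or by a script) whenever the kernel is updated (the filename here is an example):

initramfs initrd.img-4.9.35-v7+ followkernel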
Conclusion: Use ext4 with metadata_csum or BTRFS with the DUP profile for metadata (and data, if warranted), based on application-specific considerations and willingness to deal with initramfs issues.
Read-Only Filesystems
Another option for reducing or mitigating corruption is to use a read-only
filesystem (or a writable filesystem mounted read-only). This can be done on
a per-directory basis (e.g. read-only root with read-write /var
) or using an
overlay filesystem such as unionfs with
either read-write partitions or tmpfs for ephemeral information. However,
this adds configuration complexity in addition to more complicated failure
scenarios.
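A minimal sketch of the per-directory approach in /etc/fstab (the partition layout and tmpfs size are illustrative):

# Read-only root with a separate read-write /var
/dev/mmcblk0p2  /         ext4   ro,noatime         0  1
/dev/mmcblk0p3  /var      ext4   defaults,noatime   0  2
# Ephemeral data kept in RAM
tmpfs           /tmp      tmpfs  defaults,size=64m  0  0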
Partition and Filesystem Alignment
For optimal performance and lifetime, partitions and filesystem structures
should be aligned to the erase
block
size. This size is occasionally listed on the spec sheet for the SD card.
More commonly the
preferred_erase_size
(or
discard_granularity
)
reported for the device in sysfs could be used. It is also often possible to
use flashbench
to empirically determine the erase block
size by measuring the device performance.
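For example, the reported sizes can be checked in sysfs and a first partition started on a matching boundary (the device name and the 4 MiB erase block size are illustrative):

cat /sys/block/mmcblk0/device/preferred_erase_size
cat /sys/block/mmcblk0/queue/discard_granularity
# Probe for the erase block size empirically (see the flashbench README)
flashbench -a /dev/mmcblk0 --blocksize=1024
# Start the first partition on a 4 MiB boundary (assumes an existing partition table)
parted /dev/mmcblk0 mkpart primary ext4 4MiB 100%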
For the ext4 filesystem, there may be benefits to configuring the stride and/or stripe width to match the erase block size. Various methods exist for determining the ext4 stride and stripe size based on the flash media. I have insufficient understanding of the implications of these settings to know whether this is a good idea, and I haven't seen any benchmarks comparing performance.
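As an illustration of the arithmetic, with a 4 MiB erase block and 4 KiB filesystem blocks, both values would be 4 MiB / 4 KiB = 1024 blocks (illustrative values, not a recommendation):

# stride and stripe-width are specified in filesystem blocks
mkfs.ext4 -b 4096 -E stride=1024,stripe-width=1024 /dev/mmcblk0p2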
I/O Schedulers
Complete Fairness Queueing
(CFQ)
has been the Linux default I/O scheduler since
2.6.18.
It is a good default, and it provides some behavior optimizations on
non-rotational
media.
However, both “eMMC/SSD File System Tuning Methodology” and Phoronix Linux
3.16: Deadline I/O Scheduler Generally Leads With A
SSD
found that both noop
and deadline
outperformed cfq
. A caveat is that
neither deadline
nor noop
support I/O prioritization (e.g.
ionice
).
If prioritization is not required, some performance can be gained by changing the I/O scheduler. This change can be applied to all non-rotational media by placing the following content in a udev rule file (e.g. /etc/udev/rules.d/60-nonrotational-iosched.rules):
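# Prefer the deadline scheduler for SD/MMC cards and other non-rotational disks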
ACTION=="add|change", KERNEL=="mmcblk[0-9]", ATTR{queue/scheduler}="deadline"
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="deadline"