Choosing Linux RAID5 chunk size (part 2) -- the internet's wisdom

by Ciprian Dorin Craciun (https://volution.ro/ciprian) on 

About choosing the "right" RAID5 chunk size on Linux systems, and related benchmarks. This part summarizes previous work published by others on the internet.






Followup

This is the second part on this topic, preceded by:

Summary

In the previous part I presented the reason why I am interested in the low-level task of identifying the "right" RAID5 chunk size.

However, after writing that part and starting to work on the next one, which also takes the file-system into account, I felt the need to review more thoroughly what others have previously written about the subject. Thus the current part focuses on previously published work.

Unfortunately, although I found a few good sources on the subject, all of them are outdated (the newest being from 2010), and most are single-faceted, focusing on a single workload or employing synthetic workloads.

The Linux RAID wiki

The "setup" page

The Linux "official" RAID wiki states that:

[...] Thus, for large writes, you may see lower overhead by having fairly large chunks, whereas arrays that are primarily holding small files may benefit more from a smaller chunk size.

[...] For optimal performance, you should experiment with the chunk-size, as well as with the block-size of the filesystem you put on the array.

[...] On RAID-5, the chunk size has the same meaning for reads as for RAID-0. Writing on RAID-5 is a little more complicated: when a chunk is written on a RAID-5 array, the corresponding parity chunk must be updated as well.

[...] If the writes are small and scattered all over the array, the RAID layer will almost always need to read in all the untouched chunks from each stripe that is written to, in order to calculate the parity chunk. This will impose extra bus-overhead and latency due to extra reads.

[...] A reasonable chunk-size for RAID-5 is 128 kB. A study showed that with 4 drives (even-number-of-drives might make a difference) that large chunk sizes of 512-2048 kB gave superior results.

[...] RAID-{4,5,10} performance is severely influenced by the stride and stripe-width options.

As a side-note, almost the same text can be found in the TLDP -- Software-RAID HOWTO -- Chunk sizes section. (In fact this how-to is marked as deprecated in favor of the wiki.)
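To make the stride and stripe-width remark from the wiki excerpt above more concrete, here is a minimal sketch (my own, not taken from the wiki) that derives the values one would pass to mkfs.ext4 from the RAID5 geometry; the chunk size, block size, drive count, and the /dev/md0 path below are just illustrative assumptions, not recommendations.

```python
# Minimal sketch: derive the ext4 "stride" and "stripe-width" values from the
# RAID5 geometry.  The example numbers (512 KiB chunks, 4 KiB blocks, 4 drives)
# and the device path are assumptions, for illustration only.

def ext4_stride_and_stripe_width(chunk_kib, block_kib, total_drives, parity_drives=1):
    data_drives = total_drives - parity_drives   # RAID5 dedicates one chunk per stripe to parity
    stride = chunk_kib // block_kib              # filesystem blocks per chunk
    stripe_width = stride * data_drives          # filesystem blocks per full data stripe
    return stride, stripe_width

stride, stripe_width = ext4_stride_and_stripe_width(chunk_kib=512, block_kib=4, total_drives=4)
print(f"mkfs.ext4 -E stride={stride},stripe-width={stripe_width} /dev/md0")
# with these assumptions: stride=128, stripe-width=384
```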

Moreover, an earlier variant (from 1998) of the wiki text quoted above, the TLDP -- Software-RAID HOWTO -- Performance, Tools & General Bone-headed Questions section, provides good technical insight (and is therefore a highly recommended read):

Q: How does the chunk size (stripe size) influence the speed of my RAID-0, RAID-4 or RAID-5 device?

A: The chunk size is the amount of data contiguous on the virtual device that is also contiguous on the physical device. [...] The stripe size affects both read and write latency (delay), throughput (bandwidth), and contention between independent operations (ability to simultaneously service overlapping I/O requests).

Assuming the use of the ext2fs file system, and the current kernel policies about read-ahead, large stripe sizes are almost always better than small stripe sizes, and stripe sizes from about a fourth to a full disk cylinder in size may be best. [...] The stripe size does not affect the read performance of small files. [...] Conversely, if very small stripes are used, and a large file is read sequentially, then a read will be issued to all of the disks in the array. [...] Note, however, the trade-off: the bandwidth could improve almost N-fold for reading a single, large file, as N drives can be reading simultaneously. [...] But there is another, counter-acting trade-off: if all of the drives are already busy reading one file, then attempting to read a second or third file at the same time will cause significant contention. [...] Thus, large stripes will almost always lead to the best performance.
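To put some numbers behind the read-modify-write behaviour described above, the following back-of-the-envelope sketch (my own, and deliberately simplified: the md stripe cache and the read-modify-write versus reconstruct-write heuristics are ignored) computes the full-stripe size, i.e. the smallest aligned write whose parity can be computed without reading anything back, for a hypothetical 4-drive RAID5:

```python
# Illustrative calculation: the full-stripe size of a RAID5 array, i.e. the
# smallest aligned write that avoids any extra reads for parity computation.
# Smaller (or misaligned) writes trigger the read-modify-write described above.

KIB = 1024

def full_stripe_bytes(chunk_bytes, total_drives, parity_drives=1):
    return chunk_bytes * (total_drives - parity_drives)

for chunk_kib in (64, 128, 512, 2048):
    stripe = full_stripe_bytes(chunk_kib * KIB, total_drives=4)
    print(f"{chunk_kib:5} KiB chunks, 4 drives -> full stripe = {stripe // KIB} KiB")
# 64 KiB -> 192 KiB, 128 KiB -> 384 KiB, 512 KiB -> 1536 KiB, 2048 KiB -> 6144 KiB
```

The larger the chunk, the larger this threshold, which is exactly why small scattered writes suffer more with large chunks.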

The "performance" page

Getting back to the Linux RAID wiki, there is also a page dedicated to performance, which gives some useful insights:

The Google suggested articles...

Linux RAID Level and Chunk Size: The Benchmarks (from 2010)

The first article recommended by Google, Linux RAID Level and Chunk Size: The Benchmarks (from 2010), states that for RAID5 the best choice is 64 KiB chunks, which are more than twice as "good" as 128 KiB, and almost 30% "better" than 1 MiB. However, the benchmark uses XFS, and seems to employ a single dd with a block size of 1 MiB.

Moreover, even the author states that:

Furthermore, the theoretical transfer rates that should be achieved based on the performance of a single drive, are not met. The cause is unknown to me, but overhead and the relatively weak CPU may have a part in this. Also, the XFS file system may play a role in this. Overall, it seems that on this system, software RAID does not seem to scale well. Since my big storage monster (as seen on the left) is able to perform way better, I suspect that it is a hardware issue because the M2A-VM consumer-grade motherboard can't go any faster.

(Given that this article is the least thorough of the bunch, I have a hard time figuring out why Google suggests it as a featured snippet in search results... Anyway...)
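For reference, the single-stream dd methodology is trivial to approximate; the sketch below (mine, not the article's actual commands, with the file path and sizes as placeholders) sequentially writes a test file with a fixed block size and reports the throughput, which also shows how narrow such a workload is:

```python
# Rough equivalent of a single-stream "dd bs=1M" style test: sequentially
# write a test file with a fixed block size and report the throughput.
# The path and sizes are placeholders.

import os, time

PATH = "/mnt/raid/dd-test.bin"       # placeholder: some file on the array under test
BLOCK = 1 * 1024 * 1024              # 1 MiB blocks, as the article seems to use
TOTAL = 1 * 1024 * 1024 * 1024       # 1 GiB of data in total

buf = os.urandom(BLOCK)
start = time.monotonic()
fd = os.open(PATH, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
try:
    for _ in range(TOTAL // BLOCK):
        os.write(fd, buf)
    os.fsync(fd)                      # make sure the data actually reached the array
finally:
    os.close(fd)
elapsed = time.monotonic() - start
print(f"{TOTAL / elapsed / 1024 / 1024:.1f} MiB/s sequential write, {BLOCK // 1024} KiB blocks")
```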

A Comparison of Chunk Size for Software RAID-5 (from 2009)

A more thorough benchmark, A Comparison of Chunk Size for Software RAID-5 (presumably from 2009), tests only RAID5, at block-device level (i.e. without a file-system), with various chunk sizes, concurrency levels, and IO buffer sizes.

It concludes that:

Different chuck sizes have different performance characteristics for different I/O sizes and workload types. In general 128KB provides the most consistent and highest throughput, except when doing random reads using an I/O that is a multiple of the stripe sizes (e.g., 256KB I/Os on a 4+1 RAID-5 with 64KB chucks). For 3+1 and 4+1 RAID-5s, I recommend a chuck size of 128KB for the best overall throughput characteristics.

However, the data points are "all over the place"; thus the "best choice" depends a lot on the workload type (i.e. sequential vs. random, single- vs. multi-threaded).

On the flip side, I think this is one of the better reads on the subject.
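For anyone wanting to reproduce this kind of measurement, here is a minimal sketch (not the author's actual tool; the device path, I/O size, thread count, and duration are placeholders) of a multi-threaded random-read test issued directly against the block device:

```python
# Minimal multi-threaded random-read test against a block device: each thread
# issues reads of a fixed size at random aligned offsets for a fixed duration,
# and the aggregate throughput is reported.  All parameters are placeholders.

import os, random, threading, time

DEVICE = "/dev/md0"            # placeholder: the RAID5 block device under test
IO_SIZE = 256 * 1024           # e.g. a multiple of the stripe size, as in the article
THREADS = 4                    # concurrency level
DURATION = 10.0                # seconds per run

def worker(results, index):
    fd = os.open(DEVICE, os.O_RDONLY)
    try:
        size = os.lseek(fd, 0, os.SEEK_END)            # block devices report their size this way
        rng = random.Random(index)
        done = 0
        deadline = time.monotonic() + DURATION
        while time.monotonic() < deadline:
            offset = rng.randrange(0, size - IO_SIZE)
            offset -= offset % IO_SIZE                  # keep the reads aligned to the I/O size
            done += len(os.pread(fd, IO_SIZE, offset))
        results[index] = done
    finally:
        os.close(fd)

results = [0] * THREADS
threads = [threading.Thread(target=worker, args=(results, i)) for i in range(THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"{sum(results) / DURATION / 1024 / 1024:.1f} MiB/s random reads "
      f"({THREADS} threads, {IO_SIZE // 1024} KiB I/Os)")
```

Note that this goes through the page cache; a serious benchmark would bypass it (e.g. via O_DIRECT, or by dropping the caches between runs), otherwise the numbers can be wildly optimistic.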

RAID5,6 and 10 Benchmarks on 2.6.25.5 (from 2008)

Another benchmark, RAID5,6 and 10 Benchmarks on 2.6.25.5 (from 2008), tests various RAID levels (including 5), again at the block-device level, with various chunk sizes and various IO schedulers (the last parameter being a unique trait amongst the reviewed benchmarks).

At least for RAID5, it seems to suggest that the "left-symmetric" layout together with the CFQ scheduler are the best choices.

However, regarding throughput, it seems that the larger the chunk, the better for reads, but the worse for writes. A good compromise seems to be 1 MiB or 2 MiB.

As with the previous benchmark, it too is one of the better reads on the subject.

The best of the rest...

There are also many other articles on the internet, most of them highly outdated, especially given how much storage capacity, and consequently typical file sizes, have grown in recent years. (As a side-note, please take into account that the mdadm manual page, as of the writing of this part, states that the default chunk size is 512 KiB.)

For example:

And finally the democratic "overflow"...

And, to complicate things even further, there are countless random "blogs", "forums" and "stacks" / "faults" / "exchanges" that just "overflow" us with inaccurate and sometimes even incorrect information, thus replacing "good old logic" with "democratic up-votes"...

For example: