Choosing Linux RAID5 chunk size (part 1) -- block-device level benchmarks

by Ciprian Dorin Craciun (https://volution.ro/ciprian) on 

About choosing the "right" RAID5 chunk size on Linux systems, and related benchmarks. This first part presents a few benchmarks at the "block-device" level (i.e. without an actual file-system).


Followup

This is the first part on this topic, followed by:

The context...

Long story short, I want/need to replace my old (magnetic) drives with new (magnetic) ones. The main reason is that my current drives are quite old, on average 10 years of 24x7 operation, and on top of that one of the smaller (and older) ones just "died". (Moreover it's a good opportunity to consolidate and upgrade the available storage, both in terms of performance and capacity.)

So I decided to buy 3x 4 TB drives -- Western Digital Gold (WD WD4002FYYZ), a model which has since been retired, and instead of which I would now recommend the Western Digital Red Pro (WD WD4003FFBX) -- and migrate onto them the file-systems from my current 2x 2 TB and 1x 1 TB drives (plus a few other smaller ones).

However, I do want to have some redundancy, thus I thought of using RAID5 through the Linux md module -- especially since the "dead" drive incurred no data loss, as its partitions were part of RAID1 and RAID5 arrays. (Please note I said redundancy and not backup, because RAID is meant mainly for "operational reliability" and not for "disaster recovery". There are countless ways in which the whole array can get corrupted...)

The quest...

Unfortunately creating a RAID5 array requires choosing a very important parameter, namely the "chunk" size. (For a thorough explanation of RAID technology please consult Linux RAID Wiki and Wikipedia.)

Thus I tasked myself to benchmark RAID5 arrays using various chunk sizes. (And since I'm doing this, I also wanted to benchmark RAID0.)

Moreover the benchmark should be as simple as possible, and close to my intended use-cases which range from:

Thus, given that I don't intend to use these disks in a database scenario, nor in a NAS with lots of frequently accessed small files -- and given that random I/O (especially in highly concurrent workloads, as would be incurred by the previously mentioned scenarios) leads to disk thrashing, dropping performance to almost 1 MiB/s due to the high seek latency -- my benchmark should cover mainly sequential access to large files, or low-concurrency scenarios of similar sequential access.

The methodology...

First of all, please note that this is not a "scientific" benchmark, nor a thorough one (especially since it covers only read operations). Instead it provides just a "baseline" performance indicator, one that suits no particular real-world workload.

Each benchmark has two variants -- a "single reader" and "multiple concurrent readers" (the latter marked with (m) in the tables below) -- each with 5 phases: reading at offsets of (roughly) 0%, 25%, 50%, 75% and 100% into the device (the s0 through s100 columns in the tables below).

Also, taking into account that for magnetic disks reads and writes behave similarly in terms of throughput and latency (writes perhaps with a small overhead), I decided to execute only read benchmarks. (Writes in RAID arrays, especially RAID5, do incur a significant overhead due to the redundancy, so such a benchmark should also be performed... However, that is a story for another time...)

On the software side, given that I installed the disks in a machine without any other disks (and thus without an OS), I decided to use SystemRescueCD, which features Linux kernel v4.14.15 and mdadm v3.3.

The md snippets

The following are the commands used to create the various RAID arrays:

mdadm \
        --build \
        /dev/md/tests-raid0-{size} \
        \
        --level raid0 \
        --raid-devices 3 \
        --chunk {size} \
        \
        --auto md \
        --symlinks yes \
        --verbose \
        \
        /dev/disk/by-id/ata-WDC_WD4002FYYZ-aaa \
        /dev/disk/by-id/ata-WDC_WD4002FYYZ-bbb \
        /dev/disk/by-id/ata-WDC_WD4002FYYZ-ccc \
    #
mdadm \
        --create \
        /dev/md/tests-raid5-{size} \
        \
        --assume-clean \
        \
        --metadata 1.2 \
        --level raid5 \
        --layout left-symmetric \
        --chunk {size} \
        --bitmap internal \
        --bitmap-chunk 262144 \
        --data-offset 262144 \
        --raid-devices 3 \
        --spare-devices 0 \
        \
        --homehost=tests \
        --name tests-raid5-{size} \
        \
        --auto md \
        --config none \
        --symlinks yes \
        --verbose \
        \
        /dev/disk/by-id/ata-WDC_WD4002FYYZ-aaa \
        /dev/disk/by-id/ata-WDC_WD4002FYYZ-bbb \
        /dev/disk/by-id/ata-WDC_WD4002FYYZ-ccc \
    #

Note that:
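
Regardless of the notes above, a quick way to double-check that the resulting array has the intended chunk size and layout before benchmarking it (just a sanity check, not part of the benchmark itself):

cat /proc/mdstat

mdadm \
        --detail \
        /dev/md/tests-raid5-{size} \
    #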

The dd snippets

The following is the command used to execute the reads for each phase (in the "single reader" variant):

dd \
        if=/dev/zzz \
        of=/dev/null \
        iflag=fullblock,direct \
        bs=16M \
        count=1024 \
        skip={offset} \
        status=progress \
    #

And the following is the command used to execute the reads for the "multiple concurrent readers" variant:

(
    dd if=/dev/zzz of=/dev/null iflag=fullblock,direct bs=16M count=1024 skip=0 status=progress 2>&1 | sed -r -e 's/^/  0% -- /' &
    dd if=/dev/zzz of=/dev/null iflag=fullblock,direct bs=16M count=1024 skip={offset-25} status=progress 2>&1 | sed -r -e 's/^/ 25% -- /' &
    dd if=/dev/zzz of=/dev/null iflag=fullblock,direct bs=16M count=1024 skip={offset-50} status=progress 2>&1 | sed -r -e 's/^/ 50% -- /' &
    dd if=/dev/zzz of=/dev/null iflag=fullblock,direct bs=16M count=1024 skip={offset-75} status=progress 2>&1 | sed -r -e 's/^/ 75% -- /' &
    dd if=/dev/zzz of=/dev/null iflag=fullblock,direct bs=16M count=1024 skip={offset-100} status=progress 2>&1 | sed -r -e 's/^/100% -- /' &
    wait
)
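
For reference, the {offset} placeholders above are expressed in bs=16M blocks (that is how dd interprets skip without a suffix); what follows is only a minimal sketch of how such offsets could be derived from the device size -- the helper variables are purely illustrative:

size_bytes="$( blockdev --getsize64 /dev/zzz )"
bs=$(( 16 * 1024 * 1024 ))        # matches bs=16M above
count=1024                        # matches count=1024 above
blocks=$(( size_bytes / bs ))     # device size in 16 MiB blocks

for percent in 0 25 50 75 100 ; do
    offset=$(( blocks * percent / 100 ))
    # clamp the last offset so that the 16 GiB read still fits inside the device
    if (( offset + count > blocks )) ; then offset=$(( blocks - count )) ; fi
    echo "${percent}% -> skip=${offset}"
done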

The results...

On how I "massaged" the numbers...

Before jumping to the "numbers", please note:
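
Whatever other caveats apply, the gist of the "massaging" is that the tables below hold relative values; a minimal sketch of that normalization, assuming (as the tables suggest) that the raw HDD throughput at the start of the disk is taken as the 1.00 baseline -- the numbers below are hypothetical:

baseline_mibs=180    # hypothetical: raw HDD throughput (in MiB/s) at offset 0
measured_mibs=151    # hypothetical: throughput reported by dd for some phase

awk -v b="$baseline_mibs" -v m="$measured_mibs" \
    'BEGIN { printf "%.2f\n", m / b }'    # => 0.84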

RAID0 benchmarks

RAID0 benchmarks (with a single reader)

As can be observed from the numbers, and as expected, in the case of RAID0 the chunk size has virtually no influence on the resulting performance.

Moreover, we can observe that the overhead introduced by the RAID implementation is small enough (~4%).

|                  |   s0 |  s25 |  s50 |  s75 | s100 |
|------------------|------|------|------|------|------|
| HDD              | 1.00 | 0.93 | 0.84 | 0.70 | 0.47 |
|------------------|------|------|------|------|------|
| RAID0 / 1024     | 0.97 | 0.88 | 0.80 | 0.67 | 0.44 |
| RAID0 / 128      | 0.96 | 0.89 | 0.81 | 0.67 | 0.44 |
| RAID0 / 4        | 0.96 | 0.88 | 0.81 | 0.67 | 0.44 |
RAID0 sequential
RAID0 benchmarks (with multiple concurrent readers)

As in the case of the "single reader" benchmark, the chunk size had no influence on the results (when comparing the RAID variants among themselves). There is, however, a slightly larger overhead when compared with the "raw" disk performance.

However, looking at the actual values -- and not at the graph bar heights, which are relative -- we can see that performance drops to roughly 15% of the single-reader baseline due to the concurrent access.

The only other noteworthy observation is that reading at the middle of the array seems to perform better in RAID0 than expected (as compared to the "edges" of the disk). (I can't explain this, nor do I care, because the actual performance is ~20 MiB/s, which compared with an SSD is "snail-slow" in such concurrent scenarios...)

|                  |   s0 |  s25 |  s50 |  s75 | s100 |
|------------------|------|------|------|------|------|
| HDD          (m) | 0.15 | 0.15 | 0.14 | 0.13 | 0.13 |
|------------------|------|------|------|------|------|
| RAID0 / 1024 (m) | 0.11 | 0.12 | 0.12 | 0.12 | 0.11 |
| RAID0 / 128  (m) | 0.11 | 0.12 | 0.12 | 0.12 | 0.11 |
| RAID0 / 4    (m) | 0.11 | 0.12 | 0.12 | 0.12 | 0.11 |
RAID0 parallel

RAID5 benchmarks

RAID5 benchmarks (with a single reader)

By looking at the numbers the pattern seems to be "larger chunks are better, however not too large", to which I would add "and all depends on the workload".

In more detail, given that dd reads one 16 MiB buffer at a time, an 8 MiB chunk size seems to be "too much" (perhaps because md is not able to leverage parallel reads in this case). Meanwhile, starting with 1 MiB chunks and going lower, we see a drop in performance. (The worst case is the 4 KiB chunk, which yields only about 20% of the baseline performance, and was thus dropped from the graphs.)

The only "outlier" is in the 512 KiB chunk case, and only reading at the beginning of the array, which yields better performance than in the 1 MiB chunk case. (Perhaps the 1 MiB at the beginning of the array benchmark was influenced by external factors?)

|                  |   s0 |  s25 |  s50 |  s75 | s100 |
|------------------|------|------|------|------|------|
| HDD              | 1.00 | 0.93 | 0.84 | 0.70 | 0.47 |
|------------------|------|------|------|------|------|
| RAID5 / 8192     | 0.86 | 0.80 | 0.73 | 0.62 | 0.43 |
| RAID5 / 4096     | 0.95 | 0.89 | 0.82 | 0.68 | 0.45 |
| RAID5 / 2048     | 0.95 | 0.92 | 0.85 | 0.72 | 0.52 |
| RAID5 / 1024     | 0.83 | 0.82 | 0.76 | 0.67 | 0.51 |
| RAID5 / 512      | 0.95 | 0.83 | 0.76 | 0.64 | 0.43 |
| RAID5 / 256      | 0.77 | 0.72 | 0.68 | 0.58 | 0.40 |
| RAID5 / 128      | 0.74 | 0.65 | 0.65 | 0.56 | 0.39 |
RAID5 sequential

RAID5 benchmarks (with multiple concurrent readers)

However, when it comes to concurrent workloads, the previous pattern seems to hold without the caveat: "larger chunks are always better".

My assumption is that, given larger chunks, the md layer is better able to "schedule" the requests onto the underlying devices. Moreover, the reads happen 16 MiB at a time, thus favoring larger chunks.

|                  |   s0 |  s25 |  s50 |  s75 | s100 |
|------------------|------|------|------|------|------|
| HDD          (m) | 0.15 | 0.15 | 0.14 | 0.13 | 0.13 |
|------------------|------|------|------|------|------|
| RAID5 / 8192 (m) | 0.18 | 0.19 | 0.21 | 0.20 | 0.18 |
| RAID5 / 4096 (m) | 0.17 | 0.19 | 0.21 | 0.20 | 0.16 |
| RAID5 / 2048 (m) | 0.17 | 0.20 | 0.24 | 0.21 | 0.16 |
| RAID5 / 1024 (m) | 0.14 | 0.16 | 0.18 | 0.16 | 0.14 |
| RAID5 / 512  (m) | 0.12 | 0.12 | 0.13 | 0.12 | 0.12 |
| RAID5 / 256  (m) | 0.12 | 0.13 | 0.13 | 0.14 | 0.11 |
| RAID5 / 128  (m) | 0.12 | 0.13 | 0.13 | 0.11 | 0.10 |
RAID5 parallel

All benchmarks side-by-side

All benchmarks (with a single reader)

|                  |   s0 |  s25 |  s50 |  s75 | s100 |
|------------------|------|------|------|------|------|
| HDD              | 1.00 | 0.93 | 0.84 | 0.70 | 0.47 |
|------------------|------|------|------|------|------|
| RAID0 / 1024     | 0.97 | 0.88 | 0.80 | 0.67 | 0.44 |
| RAID0 / 128      | 0.96 | 0.89 | 0.81 | 0.67 | 0.44 |
| RAID0 / 4        | 0.96 | 0.88 | 0.81 | 0.67 | 0.44 |
|------------------|------|------|------|------|------|
| RAID5 / 8192     | 0.86 | 0.80 | 0.73 | 0.62 | 0.43 |
| RAID5 / 4096     | 0.95 | 0.89 | 0.82 | 0.68 | 0.45 |
| RAID5 / 2048     | 0.95 | 0.92 | 0.85 | 0.72 | 0.52 |
| RAID5 / 1024     | 0.83 | 0.82 | 0.76 | 0.67 | 0.51 |
| RAID5 / 512      | 0.95 | 0.83 | 0.76 | 0.64 | 0.43 |
| RAID5 / 256      | 0.77 | 0.72 | 0.68 | 0.58 | 0.40 |
| RAID5 / 128      | 0.74 | 0.65 | 0.65 | 0.56 | 0.39 |
| RAID5 / 4        | 0.20 | 0.20 | 0.21 | 0.19 | 0.18 |
All sequential

All benchmarks (with multiple concurrent readers)

|                  |   s0 |  s25 |  s50 |  s75 | s100 |
|------------------|------|------|------|------|------|
| HDD          (m) | 0.15 | 0.15 | 0.14 | 0.13 | 0.13 |
|------------------|------|------|------|------|------|
| RAID0 / 1024 (m) | 0.11 | 0.12 | 0.12 | 0.12 | 0.11 |
| RAID0 / 128  (m) | 0.11 | 0.12 | 0.12 | 0.12 | 0.11 |
| RAID0 / 4    (m) | 0.11 | 0.12 | 0.12 | 0.12 | 0.11 |
|------------------|------|------|------|------|------|
| RAID5 / 8192 (m) | 0.18 | 0.19 | 0.21 | 0.20 | 0.18 |
| RAID5 / 4096 (m) | 0.17 | 0.19 | 0.21 | 0.20 | 0.16 |
| RAID5 / 2048 (m) | 0.17 | 0.20 | 0.24 | 0.21 | 0.16 |
| RAID5 / 1024 (m) | 0.14 | 0.16 | 0.18 | 0.16 | 0.14 |
| RAID5 / 512  (m) | 0.12 | 0.12 | 0.13 | 0.12 | 0.12 |
| RAID5 / 256  (m) | 0.12 | 0.13 | 0.13 | 0.14 | 0.11 |
| RAID5 / 128  (m) | 0.12 | 0.13 | 0.13 | 0.11 | 0.10 |
| RAID5 / 4    (m) | 0.03 | 0.03 | 0.03 | 0.03 | 0.03 |
All parallel

And the "winner" is...

In order to make an informed decision, I've gathered only the most promising candidates -- namely the 8 MiB, 4 MiB, 2 MiB, 1 MiB and 512 KiB chunk sizes -- and put them side-by-side, including both the "single reader" and "multiple concurrent readers" variants.

Thus, by looking at the graphs, one would say that 2 MiB is the "best" choice given these benchmarks. Moreover, taking into account that with 3 drives the data stripe of a RAID5 using 2 MiB chunks is 4 MiB (two data chunks plus one parity chunk per stripe), it also fits nicely into any alignment scheme.

That said, in real life I think that both the 1 MiB and 512 KiB chunk sizes should work just as well, and again they yield data stripes of 2 MiB and 1 MiB respectively (all powers of 2).
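
For clarity, the data-stripe arithmetic above is simply the chunk size multiplied by the number of data-bearing devices (for RAID5, the device count minus one); a trivial sketch:

raid_devices=3
chunk_kib=2048                                        # the 2 MiB "winner"
data_stripe_kib=$(( (raid_devices - 1) * chunk_kib ))
echo "${data_stripe_kib} KiB"                         # => 4096 KiB, i.e. 4 MiB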

RAID5 overall (bars)
|                  |   s0 |  s25 |  s50 |  s75 | s100 |
|------------------|------|------|------|------|------|
| HDD              | 1.00 | 0.93 | 0.84 | 0.70 | 0.47 |
|------------------|------|------|------|------|------|
| RAID5 / 8192     | 0.86 | 0.80 | 0.73 | 0.62 | 0.43 |
| RAID5 / 4096     | 0.95 | 0.89 | 0.82 | 0.68 | 0.45 |
| RAID5 / 2048     | 0.95 | 0.92 | 0.85 | 0.72 | 0.52 |
| RAID5 / 1024     | 0.83 | 0.82 | 0.76 | 0.67 | 0.51 |
| RAID5 / 512      | 0.95 | 0.83 | 0.76 | 0.64 | 0.43 |
|                  |   s0 |  s25 |  s50 |  s75 | s100 |
|------------------|------|------|------|------|------|
| HDD          (m) | 0.15 | 0.15 | 0.14 | 0.13 | 0.13 |
|------------------|------|------|------|------|------|
| RAID5 / 8192 (m) | 0.18 | 0.19 | 0.21 | 0.20 | 0.18 |
| RAID5 / 4096 (m) | 0.17 | 0.19 | 0.21 | 0.20 | 0.16 |
| RAID5 / 2048 (m) | 0.17 | 0.20 | 0.24 | 0.21 | 0.16 |
| RAID5 / 1024 (m) | 0.14 | 0.16 | 0.18 | 0.16 | 0.14 |
| RAID5 / 512  (m) | 0.12 | 0.12 | 0.13 | 0.12 | 0.12 |

The fine-print...

However, below is another look at the figures above, in which one can clearly see the performance drop in magnetic disk drives when reading from various "offsets" of the drive. (I assume this has to do with the physical layout of sectors on the spinning platters: given the constant angular speed, the heads pass over more sectors per unit of time on the outer tracks -- which map to the beginning of the drive -- than on the inner ones.)

For example, reading at the far end of the disk one gets only about 50% of the performance at the beginning of the disk. Moreover, looking at the graph, one sees that for 50% of the drive's storage one would get a performance level between roughly 90% and 75% of the "peak" performance (i.e. that at the beginning of the disk).

RAID5 overall (lines)

How does this impact our decisions:

Thus the final touch...

Given that a 4 TB drive (TB as in 1000^4 bytes) translates to only about ~3.6 TiB (TiB as in 1024^4 bytes), I'll opt for creating 7x 512 GiB partitions on each drive -- each set of three such partitions (one per drive) yielding 1 TiB if joined into RAID5, or 1.5 TiB if using RAID0 -- which should be plentiful enough to place LVM's on top of them.

For example:
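
(What follows is only a minimal sketch of one such 512 GiB "slice"; the partitioning tool, the partition type-code, the 2 MiB chunk and the array name are illustrative choices, and the full set of mdadm options from the earlier snippet would still apply:)

# create 7x 512 GiB "Linux RAID" partitions on each drive (shown for one drive only)
for p in 1 2 3 4 5 6 7 ; do
    sgdisk \
            --new=${p}:0:+512G \
            --typecode=${p}:fd00 \
            /dev/disk/by-id/ata-WDC_WD4002FYYZ-aaa \
        #
done

# assemble the first "slice" from all three drives into a RAID5 with 2 MiB chunks
mdadm \
        --create \
        /dev/md/slice-1 \
        --level raid5 \
        --chunk 2048 \
        --raid-devices 3 \
        /dev/disk/by-id/ata-WDC_WD4002FYYZ-aaa-part1 \
        /dev/disk/by-id/ata-WDC_WD4002FYYZ-bbb-part1 \
        /dev/disk/by-id/ata-WDC_WD4002FYYZ-ccc-part1 \
    #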

Closing thoughts...

Further investigation is required, taking into account the following:

Moreover, the choice of RAID chunk size is only one of many important choices -- like, for example, partition alignment -- which have to be done right starting from the disk partitioning scheme, then carried through the RAID layer, the LVM layer, the block-device encryption layer (i.e. dm-crypt), the file-system layer (i.e. ext4), and finally taken into account by the application...
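
(As a hint only, and assuming the 2 MiB chunk with 3 drives from above plus 4 KiB file-system blocks, these are roughly the alignment-related knobs in each layer -- the device names are hypothetical, and the exact values must be recomputed for one's own geometry:)

# LVM: align physical extents / data start to the 4 MiB RAID5 data stripe
pvcreate --dataalignment 4m /dev/md/slice-1

# dm-crypt (LUKS): align the payload to the stripe (--align-payload is in 512-byte sectors)
cryptsetup luksFormat --align-payload $(( 4 * 1024 * 2 )) /dev/some-volume-group/some-lv

# ext4: stride = chunk / block = 2 MiB / 4 KiB = 512;  stripe_width = stride * 2 data disks = 1024
mkfs.ext4 -b 4096 -E stride=512,stripe_width=1024 /dev/mapper/some-crypt-device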

I really can't stress enough the fact that this benchmark is as unprofessional as possible! Do your own research, tailored to your particular use-case! And please don't blindly trust what you find on a random site over the internet!

I would also strongly suggest reading the following: