Followup
This is the first part on this topic, followed by:
- Choosing Linux RAID5 chunk size (part 2) -- the internet's wisdom;
- Choosing Linux RAID5 chunk size (part 3) -- Ext4 / CDNjs raw data;
The context...
Long story short, I want/need to replace my old (magnetic) drives with new (magnetic) ones. The main reason is that my current drives are quite old, on average 10 years of 24x7 operation, and on top of that one of the smaller (and older) ones just "died". (Moreover it's a good opportunity to consolidate and upgrade the available storage, both in terms of performance and capacity.)
So I decided to buy 3x 4 TB drives -- Western Digital Gold (WD WD4002FYYZ), which have since been retired, and in their place I would now recommend the Western Digital Red Pro (WD WD4003FFBX) -- and migrate onto them the file-systems from my current 2x 2 TiB and 1x 1 TiB drives (plus a few other smaller ones).
However, I do want to have some redundancy, thus I thought of using RAID5 through the Linux md module -- especially since the "dead" drive incurred no data loss, as its partitions were part of RAID1 and RAID5 arrays.
(Please note I said redundancy and not backup, because RAID is meant mainly for "operational reliability" and not for "disaster recovery". There are countless ways in which the whole array can get corrupted...)
The quest...
Unfortunately creating a RAID5 array requires choosing a very important parameter, namely the "chunk" size. (For a thorough explanation of RAID technology please consult Linux RAID Wiki and Wikipedia.)
Thus I tasked myself to benchmark RAID5 arrays using various chunk sizes. (And since I'm doing this, I also wanted to benchmark RAID0.)
Moreover the benchmark should be as simple as possible, and close to my intended use-cases, which include:
- VM images (that are not important enough to be stored on an SSD);
- secondary-level swap and scratch disks (especially for large files);
- general purpose, read-mostly, file-storage (especially OpenAFS partitions);
- on-line backups (or staging area for backups), especially exported over network via iSCSI;
Thus, given that I don't intend to use these disks in a database scenario, nor as a NAS with lots of frequently accessed small files -- and given that random I/O (especially in highly concurrent workloads, as would be incurred by the previously mentioned scenarios) leads to disk thrashing, with performance dropping to almost 1 MiB/s (due to the high seek latency) -- my benchmark should mainly cover sequential access to large files, or similar sequential access with low concurrency.
The methodology...
First of all, please note that this is not a "scientific" benchmark, nor a thorough one (especially since it covers only read operations). Instead it provides just a "baseline" performance indicator, one that suits no particular real-world workload.
Each benchmark has two variants, each with 5 phases:
- in each of the 5 phases, 16 GiB of data was read sequentially in blocks of 16 MiB;
- in each of the phases, the reading started at a different offset of 0%, 25%, 50%, 75% and 100% respectively of the whole size of the RAID array; (see the last section for why these offsets were chosen; a sketch of how the corresponding skip values can be derived follows this list;)
- in the "single reader" variant, each phase was run independently, one after another (i.e. sequentially);
- in the "multiple concurrent readers" variant, each phase was run at the same time (i.e. parallel);
- each operation is executed directly over the block device (i.e. no "file-system");
- each benchmark variant was run only once; however given that the data amount was quite large, and the quickest execution took about 30 seconds (in the case of RAID0), I would say it is "representative" enough;
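For reference, the following is a minimal sketch (not part of the original runs) of how the skip values for the five phases can be derived from the device size, assuming that skip is expressed in bs-sized blocks of 16 MiB, and that the "100%" phase actually starts 16 GiB before the end of the device:
device=/dev/md/tests-raid5-2048
size=$( blockdev --getsize64 "${device}" )
bs=$(( 16 * 1024 * 1024 ))
for percent in 0 25 50 75 ; do
    echo "${percent}% -> skip=$(( size * percent / 100 / bs ))"
done
echo "100% -> skip=$(( size / bs - 1024 ))"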
Also, taking into account that in the case of magnetic disks reads and writes behave similarly in terms of throughput and latency (perhaps with a small write overhead), I decided to execute only read benchmarks. (Although writes in RAID arrays, especially in RAID5, would be significantly impacted by the redundancy handling, thus such a benchmark should also be performed... However this is a story for another time...)
On the software side, given that I installed the disks in a machine without any other disks (and thus without an OS), I decided to use SystemRescueCD, which features a Linux v4.14.15 kernel and mdadm v3.3.
The md snippets
The following are the commands used to create the various RAID arrays:
mdadm \
--build \
/dev/md/tests-raid0-{size} \
\
--level raid0 \
--raid-devices 3 \
--chunk {size} \
\
--auto md \
--symlinks yes \
--verbose \
\
/dev/disk/by-id/ata-WDC_WD4002FYYZ-aaa \
/dev/disk/by-id/ata-WDC_WD4002FYYZ-bbb \
/dev/disk/by-id/ata-WDC_WD4002FYYZ-ccc \
#
mdadm \
--create \
/dev/md/tests-raid5-{size} \
\
--assume-clean \
\
--metadata 1.2 \
--level raid5 \
--layout left-symmetric \
--chunk {size} \
--bitmap internal \
--bitmap-chunk 262144 \
--data-offset 262144 \
--raid-devices 3 \
--spare-devices 0 \
\
--homehost=tests \
--name tests-raid5-{size} \
\
--auto md \
--config none \
--symlinks yes \
--verbose \
\
/dev/disk/by-id/ata-WDC_WD4002FYYZ-aaa \
/dev/disk/by-id/ata-WDC_WD4002FYYZ-bbb \
/dev/disk/by-id/ata-WDC_WD4002FYYZ-ccc \
#
Note that:
- RAID0 was created without the md super-block, meanwhile RAID5 was created with such a super-block and with a data offset of 256 MiB; however this should have no impact on the benchmark results;
- both RAID0 and RAID5 were considered "clean" by md, thus no "rebuilding" was happening behind the scenes;
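Between runs, each test array presumably has to be stopped (and, for RAID5, its super-blocks wiped) before the next one is built over the same disks. The following is a minimal sketch of such a cleanup, not part of the original commands:
mdadm \
    --stop \
    /dev/md/tests-raid5-{size} \
#
mdadm \
    --zero-superblock \
    /dev/disk/by-id/ata-WDC_WD4002FYYZ-aaa \
    /dev/disk/by-id/ata-WDC_WD4002FYYZ-bbb \
    /dev/disk/by-id/ata-WDC_WD4002FYYZ-ccc \
#
(The --zero-superblock step applies only after the RAID5 runs, since the RAID0 arrays were built without a super-block.)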
The dd snippets
The following is the command used to execute the reads for each phase (note that skip is expressed in bs-sized blocks, i.e. multiples of 16 MiB):
dd \
if=/dev/zzz \
of=/dev/null \
iflag=fullblock,direct \
bs=16M \
count=1024 \
skip={offset} \
status=progress \
#
And the following is the command used to execute the reads for the "multiple concurrent readers" variant:
(
dd if=/dev/zzz of=/dev/null iflag=fullblock,direct bs=16M count=1024 skip=0 status=progress 2>&1 | sed -r -e 's/^/ 0% -- /' &
dd if=/dev/zzz of=/dev/null iflag=fullblock,direct bs=16M count=1024 skip={offset-25} status=progress 2>&1 | sed -r -e 's/^/ 25% -- /' &
dd if=/dev/zzz of=/dev/null iflag=fullblock,direct bs=16M count=1024 skip={offset-50} status=progress 2>&1 | sed -r -e 's/^/ 50% -- /' &
dd if=/dev/zzz of=/dev/null iflag=fullblock,direct bs=16M count=1024 skip={offset-75} status=progress 2>&1 | sed -r -e 's/^/ 75% -- /' &
dd if=/dev/zzz of=/dev/null iflag=fullblock,direct bs=16M count=1024 skip={offset-100} status=progress 2>&1 | sed -r -e 's/^/100% -- /' &
wait
)
The results...
On how I "massaged" the numbers...
Before jumping to the "numbers", please note:
- I considered the "baseline" performance to be that of the raw disk (i.e. /dev/disk/by-id/ata-WDC_WD4002FYYZ-aaa), in MiB/s, at the beginning of the disk (i.e. skip=0); (the actual value in my case is ~200 MiB/s;)
- thus all the values were obtained by scaling them (i.e. dividing) by the chosen "baseline";
- moreover, given that in RAID0 the reads are spread over 3x disks and in RAID5 over 2x disks, I further divided the RAID0 values by 3 and the RAID5 ones by 2; (I assumed that, given the exact same model for all 3x drives, and the symmetry of RAID0 and RAID5, the actual load spreads evenly;)
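To make the scaling concrete, here is a small worked example; the raw MiB/s figures below are not quoted measurements, but approximate values back-computed from the normalized numbers in the tables that follow:
baseline=200      # raw disk at skip=0, ~200 MiB/s;
raid0_raw=582     # hypothetical RAID0 (1024 KiB chunk) raw throughput at s0;
raid5_raw=380     # hypothetical RAID5 (2048 KiB chunk) raw throughput at s0;
awk \
    -v baseline=${baseline} \
    -v raid0=${raid0_raw} \
    -v raid5=${raid5_raw} \
    'BEGIN { printf ("RAID0 s0 -> %.2f\nRAID5 s0 -> %.2f\n", raid0 / baseline / 3, raid5 / baseline / 2) ; }' \
#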
RAID0 benchmarks
RAID0 benchmarks (with a single reader)
As can be observed from the numbers, and as expected, in the case of RAID0 the chunk size has absolutely no influence on the resulting performance.
Moreover, we can observe that the overhead introduced by the RAID implementation is small enough (~4%).
| | s0 | s25 | s50 | s75 | s100 |
|------------------|------|------|------|------|------|
| HDD | 1.00 | 0.93 | 0.84 | 0.70 | 0.47 |
|------------------|------|------|------|------|------|
| RAID0 / 1024 | 0.97 | 0.88 | 0.80 | 0.67 | 0.44 |
| RAID0 / 128 | 0.96 | 0.89 | 0.81 | 0.67 | 0.44 |
| RAID0 / 4 | 0.96 | 0.88 | 0.81 | 0.67 | 0.44 |
RAID0 benchmarks (with multiple concurrent readers)
As in the case of the "single reader" benchmark, the chunk size had no influence over the results (when compared between RAID variants). There is however a slightly larger overhead when compared with the "raw" disk performance.
However, looking at the actual values -- thus not at the graph bar heights, which are relative -- we can see that the performance drops to roughly 15% of the baseline due to the concurrent access.
The only other noteworthy observation is that reading at the middle of the array seems to perform better in RAID0 than expected (as compared to the "edges" of the disk). (I can't explain this, nor do I care, because the actual performance is ~20 MiB/s, which compared with an SSD is "snail-slow" in such concurrent scenarios...)
| | s0 | s25 | s50 | s75 | s100 |
|------------------|------|------|------|------|------|
| HDD (m) | 0.15 | 0.15 | 0.14 | 0.13 | 0.13 |
|------------------|------|------|------|------|------|
| RAID0 / 1024 (m) | 0.11 | 0.12 | 0.12 | 0.12 | 0.11 |
| RAID0 / 128 (m) | 0.11 | 0.12 | 0.12 | 0.12 | 0.11 |
| RAID0 / 4 (m) | 0.11 | 0.12 | 0.12 | 0.12 | 0.11 |
RAID5 benchmarks
RAID5 benchmarks (with a single reader)
By looking at the numbers the pattern seems to be "larger chunks are better, however not too large", to which I would add "and all depends on the workload".
To elaborate, given that dd reads one 16 MiB buffer at a time, an 8 MiB chunk size seems to be "too much" (perhaps because md is not able to leverage parallel reads in this case).
Meanwhile, starting with 1 MiB chunks and going lower, we see a drop in performance. (The worst case being 4 KiB chunks, which yield a performance of only 20%, thus they were dropped from the graphs.)
The only "outlier" is in the 512 KiB chunk case, and only reading at the beginning of the array, which yields better performance than in the 1 MiB chunk case. (Perhaps the 1 MiB at the beginning of the array benchmark was influenced by external factors?)
| | s0 | s25 | s50 | s75 | s100 |
|------------------|------|------|------|------|------|
| HDD | 1.00 | 0.93 | 0.84 | 0.70 | 0.47 |
|------------------|------|------|------|------|------|
| RAID5 / 8192 | 0.86 | 0.80 | 0.73 | 0.62 | 0.43 |
| RAID5 / 4096 | 0.95 | 0.89 | 0.82 | 0.68 | 0.45 |
| RAID5 / 2048 | 0.95 | 0.92 | 0.85 | 0.72 | 0.52 |
| RAID5 / 1024 | 0.83 | 0.82 | 0.76 | 0.67 | 0.51 |
| RAID5 / 512 | 0.95 | 0.83 | 0.76 | 0.64 | 0.43 |
| RAID5 / 256 | 0.77 | 0.72 | 0.68 | 0.58 | 0.40 |
| RAID5 / 128 | 0.74 | 0.65 | 0.65 | 0.56 | 0.39 |
RAID5 benchmarks (with multiple concurrent readers)
However when it comes to concurrent workloads, it seems that the reverse of the previous pattern is true: "larger chunks are always better".
My assumption is that, given larger chunks, the md layer better "schedules" them onto the underlying layer.
Moreover the reads happen 16 MiB at a time, thus favoring larger chunks.
| | s0 | s25 | s50 | s75 | s100 |
|------------------|------|------|------|------|------|
| HDD (m) | 0.15 | 0.15 | 0.14 | 0.13 | 0.13 |
|------------------|------|------|------|------|------|
| RAID5 / 8192 (m) | 0.18 | 0.19 | 0.21 | 0.20 | 0.18 |
| RAID5 / 4096 (m) | 0.17 | 0.19 | 0.21 | 0.20 | 0.16 |
| RAID5 / 2048 (m) | 0.17 | 0.20 | 0.24 | 0.21 | 0.16 |
| RAID5 / 1024 (m) | 0.14 | 0.16 | 0.18 | 0.16 | 0.14 |
| RAID5 / 512 (m) | 0.12 | 0.12 | 0.13 | 0.12 | 0.12 |
| RAID5 / 256 (m) | 0.12 | 0.13 | 0.13 | 0.14 | 0.11 |
| RAID5 / 128 (m) | 0.12 | 0.13 | 0.13 | 0.11 | 0.10 |
All benchmarks side-by-side
All benchmarks (with a single reader)
| | s0 | s25 | s50 | s75 | s100 |
|------------------|------|------|------|------|------|
| HDD | 1.00 | 0.93 | 0.84 | 0.70 | 0.47 |
|------------------|------|------|------|------|------|
| RAID0 / 1024 | 0.97 | 0.88 | 0.80 | 0.67 | 0.44 |
| RAID0 / 128 | 0.96 | 0.89 | 0.81 | 0.67 | 0.44 |
| RAID0 / 4 | 0.96 | 0.88 | 0.81 | 0.67 | 0.44 |
|------------------|------|------|------|------|------|
| RAID5 / 8192 | 0.86 | 0.80 | 0.73 | 0.62 | 0.43 |
| RAID5 / 4096 | 0.95 | 0.89 | 0.82 | 0.68 | 0.45 |
| RAID5 / 2048 | 0.95 | 0.92 | 0.85 | 0.72 | 0.52 |
| RAID5 / 1024 | 0.83 | 0.82 | 0.76 | 0.67 | 0.51 |
| RAID5 / 512 | 0.95 | 0.83 | 0.76 | 0.64 | 0.43 |
| RAID5 / 256 | 0.77 | 0.72 | 0.68 | 0.58 | 0.40 |
| RAID5 / 128 | 0.74 | 0.65 | 0.65 | 0.56 | 0.39 |
| RAID5 / 4 | 0.20 | 0.20 | 0.21 | 0.19 | 0.18 |
All benchmarks (with multiple concurrent readers)
| | s0 | s25 | s50 | s75 | s100 |
|------------------|------|------|------|------|------|
| HDD (m) | 0.15 | 0.15 | 0.14 | 0.13 | 0.13 |
|------------------|------|------|------|------|------|
| RAID0 / 1024 (m) | 0.11 | 0.12 | 0.12 | 0.12 | 0.11 |
| RAID0 / 128 (m) | 0.11 | 0.12 | 0.12 | 0.12 | 0.11 |
| RAID0 / 4 (m) | 0.11 | 0.12 | 0.12 | 0.12 | 0.11 |
|------------------|------|------|------|------|------|
| RAID5 / 8192 (m) | 0.18 | 0.19 | 0.21 | 0.20 | 0.18 |
| RAID5 / 4096 (m) | 0.17 | 0.19 | 0.21 | 0.20 | 0.16 |
| RAID5 / 2048 (m) | 0.17 | 0.20 | 0.24 | 0.21 | 0.16 |
| RAID5 / 1024 (m) | 0.14 | 0.16 | 0.18 | 0.16 | 0.14 |
| RAID5 / 512 (m) | 0.12 | 0.12 | 0.13 | 0.12 | 0.12 |
| RAID5 / 256 (m) | 0.12 | 0.13 | 0.13 | 0.14 | 0.11 |
| RAID5 / 128 (m) | 0.12 | 0.13 | 0.13 | 0.11 | 0.10 |
| RAID5 / 4 (m) | 0.03 | 0.03 | 0.03 | 0.03 | 0.03 |
And the "winner" is...
In order to make an informed decision, I've gathered only the most promising candidates, namely the 8 MiB, 4 MiB, 2 MiB, 1 MiB and 512 KiB chunk sizes, and put them side-by-side, including both the "single reader" and "multiple concurrent readers" variants.
Thus, by looking at the graphs, one would say that 2 MiB is the "best" choice given these benchmarks. Moreover, with 3x drives, the stripe of a RAID5 using 2 MiB chunks would be 4 MiB, which fits nicely into any alignment scheme.
That said, in real life I think that both the 1 MiB and 512 KiB chunk sizes should work just as well, and again they yield stripes of 2 MiB and 1 MiB respectively (all powers of 2).
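(As a quick sanity check, with 3 drives a RAID5 stripe holds 2 data chunks, thus the data stripe is simply twice the chunk size:)
for chunk in 512 1024 2048 ; do
    echo "chunk ${chunk} KiB -> stripe $(( chunk * 2 )) KiB"
done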
| | s0 | s25 | s50 | s75 | s100 |
|------------------|------|------|------|------|------|
| HDD | 1.00 | 0.93 | 0.84 | 0.70 | 0.47 |
|------------------|------|------|------|------|------|
| RAID5 / 8192 | 0.86 | 0.80 | 0.73 | 0.62 | 0.43 |
| RAID5 / 4096 | 0.95 | 0.89 | 0.82 | 0.68 | 0.45 |
| RAID5 / 2048 | 0.95 | 0.92 | 0.85 | 0.72 | 0.52 |
| RAID5 / 1024 | 0.83 | 0.82 | 0.76 | 0.67 | 0.51 |
| RAID5 / 512 | 0.95 | 0.83 | 0.76 | 0.64 | 0.43 |
| | s0 | s25 | s50 | s75 | s100 |
|------------------|------|------|------|------|------|
| HDD (m) | 0.15 | 0.15 | 0.14 | 0.13 | 0.13 |
|------------------|------|------|------|------|------|
| RAID5 / 8192 (m) | 0.18 | 0.19 | 0.21 | 0.20 | 0.18 |
| RAID5 / 4096 (m) | 0.17 | 0.19 | 0.21 | 0.20 | 0.16 |
| RAID5 / 2048 (m) | 0.17 | 0.20 | 0.24 | 0.21 | 0.16 |
| RAID5 / 1024 (m) | 0.14 | 0.16 | 0.18 | 0.16 | 0.14 |
| RAID5 / 512 (m) | 0.12 | 0.12 | 0.13 | 0.12 | 0.12 |
The fine-print...
However, below is another look at the above figures, in which one can clearly see the performance drop in magnetic disk drives when reading at various "offsets" into the drive. (I assume this has to do with the physical layout of sectors on the spinning platters, which, given the constant angular speed, cover more sectors in the same amount of time the closer they are to the outer edge of the disk.)
For example, reading at the far end of the disk one gets only ~50% of the performance available at the beginning of the disk. Moreover, looking at the graph, one sees that for 50% of the drive's storage one would get a performance level of between 90% and 75% of the "peak" performance (i.e. that at the beginning of the disk).
How does this impact our decisions?
- given that one doesn't usually need/want an 8 TB partition holding a single file-system, the best solution is partitioning each drive and building RAID arrays over matching partitions as required;
- one should place the more frequently accessed file-systems on the first few partitions (or RAID arrays made of these partitions), meanwhile leaving the last partitions for "cold storage" or "backup" purposes;
Thus the final touch...
Given that a 4 TB (i.e. 1000^4) drive translates to only about 3.6 TiB (i.e. 1024^4), I'll opt for creating 7x 512 GiB partitions on each drive -- which, if joined into RAID5, would yield 1 TiB each, or 1.5 TiB by using RAID0 -- which should be plentiful enough to place LVM on top of them. (A sketch of the partitioning commands follows the list below.)
For example:
- the first partition on each drive (where one gets above 90% of the performance) would be joined in RAID0 (in total 1.5 TiB, but providing no redundancy) for (second-level) swap, "scratch" disks and other "ephemeral" storage requirements; (obviously for such use-cases an SSD would work "better", however SSDs are small and expensive, thus not everything would fit onto them;)
- the second partition on each drive, would be joined in RAID5 (in total 1 TiB) for VM images, and other "operational" storage requirements;
- the third partition on each drive would be left unused; (if need be it can join the previous group, or the next one, as required;)
- the fourth partition (in the middle of the disks, where one gets around 80% of the performance), would be joined in RAID5 for general purpose storage (in my case an OpenAFS partition);
- the fifth and sixth partitions would be left unused at the moment; (most likely the sixth would join the previous one, and the seventh the next one;)
- the seventh and last partition (where one gets around 50% of the performance), would be joined in RAID5 for "cold-storage" or as a staging area for backups;
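For reference, the following is a hypothetical sketch of how such a 7x 512 GiB layout could be created with sgdisk on one of the drives (the partition type fd00 is "Linux RAID", and the partition names are only illustrative); the same would be repeated for the other two drives:
for index in 1 2 3 4 5 6 7 ; do
    sgdisk \
        --new="${index}:0:+512G" \
        --typecode="${index}:fd00" \
        --change-name="${index}:tests-p${index}" \
        /dev/disk/by-id/ata-WDC_WD4002FYYZ-aaa \
    #
done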
Closing thoughts...
Further investigation is required by taking into account the following:
- the impact of using an actual file-system like ext4;
- the impact of writes, especially in RAID5 arrays;
Moreover, the choice of RAID chunk size is only one of many important choices, like for example partition alignment, which has to be done right starting from the disk partitioning scheme, through the RAID layer, through the LVM layer, through the block-device encryption layer (i.e. dm-crypt), through the file-system layer (i.e. ext4), and finally taken into account by the application...
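As a starting point, below are a couple of sanity checks one could run (the device names are only examples); they merely confirm that a partition is aligned to the disk's reported topology, and show what the md layer exports to the upper layers, thus they are far from a complete alignment audit:
parted /dev/disk/by-id/ata-WDC_WD4002FYYZ-aaa align-check optimal 1
cat /sys/block/md127/md/chunk_size           # the chunk size, in bytes;
cat /sys/block/md127/queue/optimal_io_size   # typically the data stripe size, in bytes;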
I really can't stress enough the fact that this benchmark is as unprofessional as possible! Do your own research, tailored to your particular use-case! And please don't blindly trust what you find on a random site over the internet!
I would also strongly suggest reading the following:
- the Linux md accompanying manual pages: md(4), mdadm(8) and mdadm.conf(5);
- the ArchLinux wiki page on RAID;