[snippet] Benchmarking "textual" file compression methods

by Ciprian Dorin Craciun (ciprian.craciun@gmail.com) on

Trying to identify which compression tool and level yield the "best" outcome for "textual" archival purposes. (Spoiler: "it depends...")


Snippets

For temporary storage:

zstd -z -3 -q --rsyncable

For archival storage:

lzip -6 -b 1048576

For compatibility with "good enough" compression while saving on CPU:

gzip -3 --rsyncable

For compatibility with "almost the best" compression (that gzip is capable of):

gzip -6 --rsyncable

All the snippets below have the following properties:

bzip2 -z -9
zstd -z -9 -q
lzip -9 -b 1048576
xz -z -3 -q -F xz -C sha256
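Before deleting an original after archival, it can be worth checking that the compressed stream decompresses back to the same bytes. The following is a minimal sketch of such a check, not part of the benchmark; the function name roundtrip_ok is made up for illustration, and it assumes the chosen tool accepts -c and -d, which gzip, bzip2, zstd, lzip and xz all do.

```shell
# Return success only if compressing and then decompressing
# reproduces the input byte for byte.
roundtrip_ok() {
  # $1 = input file; remaining arguments = compressor and its options
  input="$1"; shift
  # After the shift, "$1" is the tool name, reused for decompression.
  "$@" -c < "$input" | "$1" -d -c | cmp -s - "$input"
}

# Hypothetical usage (dump.sql is a placeholder file name):
#   roundtrip_ok dump.sql gzip -3 && echo "verified"
#   roundtrip_ok dump.sql lzip -6 -b 1048576 && echo "verified"
```

The check costs one extra compression and decompression pass, so it suits archival jobs more than temporary storage.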

Context

Throughout my career, whenever I was in an operations role, I have many times faced the following simple task:

Given a large file, comprised of "textual" data, compress it with the "best" tool prior to archival.

By "textual" data I mean something like the following:

And unfortunately, by "best" I mean contradictory requirements like:

So far I have used the following tools (usually at their -9 level):

However, I have never done a thorough benchmark based on different use-cases, until today... :)

Benchmarking

Benchmarking scenarios

I have applied the above snippets (with various compression levels) on the following real-world scenarios:

Benchmarking conclusions

After a quick glance at the numbers, I would say that:

As such, my new choices, depending on the use-case, are:

Benchmarking results

About the data

Below are the tables with the actual benchmarking data, whose columns mean:

Note that:

mysql-01.sql

 method | size | time |  memory | comp% | comp/s | comp |  diff%
   none |   37 |   ~~ |      ~~ |    ~~ |     ~~ |   35 |     ~~
 gzip:1 |   11 |  0.5 |   1,628 | 70.3% |   49.1 |   26 | -25.7%
 gzip:3 |   10 |  0.7 |   1,748 | 73.0% |   39.1 |   27 | -22.9%
 gzip:6 |    9 |  1.4 |   1,632 | 75.7% |   20.7 |   28 | -20.0%
 gzip:9 |    9 |  1.8 |   1,620 | 75.7% |   15.9 |   28 | -20.0%
bzip2:1 |    8 |  2.6 |   2,488 | 78.4% |   11.0 |   29 | -17.1%
bzip2:3 |    7 |  2.7 |   3,808 | 81.1% |   11.0 |   30 | -14.3%
bzip2:6 |    7 |  2.9 |   5,492 | 81.1% |   10.4 |   30 | -14.3%
bzip2:9 |    6 |  3.0 |   7,900 | 83.8% |   10.4 |   31 | -11.4%
 zstd:1 |    9 |  0.2 |  11,668 | 75.7% |  140.0 |   28 | -20.0%
 zstd:3 |    8 |  0.2 |  37,976 | 78.4% |  120.8 |   29 | -17.1%
 zstd:6 |    7 |  0.5 |  41,160 | 81.1% |   63.8 |   30 | -14.3%
 zstd:9 |    7 |  1.0 |  43,192 | 81.1% |   30.0 |   30 | -14.3%
 lzip:0 |    9 |  0.9 |   3,996 | 75.7% |   31.1 |   28 | -20.0%
 lzip:1 |    8 |  2.2 |  14,488 | 78.4% |   13.1 |   29 | -17.1%
 lzip:3 |    7 |  4.6 |  25,752 | 81.1% |    6.5 |   30 | -14.3%
 lzip:6 |    5 | 14.8 |  93,244 | 86.5% |    2.2 |   32 |  -8.6%
 lzip:9 |    5 | 22.7 | 150,836 | 86.5% |    1.4 |   32 |  -8.6%
   xz:0 |    9 |  1.6 |   4,532 | 75.7% |   17.7 |   28 | -20.0%
   xz:1 |    7 |  1.8 |  10,624 | 81.1% |   17.1 |   30 | -14.3%
   xz:3 |    6 |  4.3 |  33,480 | 83.8% |    7.2 |   31 | -11.4%
   xz:6 |    4 | 14.2 |  97,332 | 89.2% |    2.3 |   33 |  -5.7%
   xz:9 |    2 | 13.6 | 398,956 | 94.6% |    2.6 |   35 |   0.0%
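The derived columns can be cross-checked from size and time alone. The following is a minimal sketch, not the script used to produce these tables. It assumes that comp is the size saved relative to the uncompressed input, comp% is that saving as a share of the input, comp/s is the saving divided by time, and diff% is the distance from the best saving in the table. The function name derive_columns is made up for illustration.

```shell
# Recompute the derived columns from "method size time" rows.
# Assumption: "comp" is the size saved versus the uncompressed input,
# and "diff%" compares that saving with the best saving among the rows.
derive_columns() {
  # $1 = uncompressed size; rows "method size time" arrive on stdin
  awk -v orig="$1" '
    { method[NR] = $1; saved[NR] = orig - $2; secs[NR] = $3
      if (NR == 1 || saved[NR] > best) best = saved[NR] }
    END {
      # Output columns: method comp% comp/s comp diff%
      for (i = 1; i <= NR; i++)
        printf "%s %.1f%% %.1f %d %.1f%%\n", method[i],
          100 * saved[i] / orig, saved[i] / secs[i], saved[i],
          100 * (saved[i] - best) / best
    }'
}

# Two rows taken from the mysql-01.sql table (uncompressed size 37).
printf '%s\n' 'gzip:1 11 0.5' 'xz:9 2 13.6' | derive_columns 37
```

For the gzip:1 and xz:9 rows above this reproduces comp%, comp and diff%. The comp/s value for gzip:1 comes out slightly different (52.0 versus 49.1), presumably because the published sizes and times are rounded.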

mysql-02.sql

 method | size |  time |  memory | comp% | comp/s | comp |  diff%
   none |  663 |    ~~ |      ~~ |    ~~ |     ~~ |  617 |     ~~
 gzip:1 |  101 |   6.9 |   1,644 | 84.8% |   81.4 |  562 |  -8.9%
 gzip:3 |   93 |   7.1 |   1,736 | 86.0% |   79.9 |  570 |  -7.6%
 gzip:6 |   76 |  10.5 |   1,660 | 88.5% |   56.1 |  587 |  -4.9%
 gzip:9 |   74 |  18.7 |   1,628 | 88.8% |   31.5 |  589 |  -4.5%
bzip2:1 |   67 |  54.4 |   2,488 | 89.9% |   11.0 |  596 |  -3.4%
bzip2:3 |   56 |  62.9 |   3,724 | 91.6% |    9.7 |  607 |  -1.6%
bzip2:6 |   51 |  71.4 |   5,492 | 92.3% |    8.6 |  612 |  -0.8%
bzip2:9 |   49 |  77.1 |   7,852 | 92.6% |    8.0 |  614 |  -0.5%
 zstd:1 |   75 |   2.7 |  10,920 | 88.7% |  221.9 |  588 |  -4.7%
 zstd:3 |   72 |   3.2 |  36,832 | 89.1% |  182.4 |  591 |  -4.2%
 zstd:6 |   64 |   5.9 |  41,152 | 90.3% |  101.2 |  599 |  -2.9%
 zstd:9 |   59 |  10.2 |  43,144 | 91.1% |   59.5 |  604 |  -2.1%
 lzip:0 |   73 |  10.4 |   3,936 | 89.0% |   56.8 |  590 |  -4.4%
 lzip:1 |   74 |  30.5 |  14,484 | 88.8% |   19.3 |  589 |  -4.5%
 lzip:3 |   67 |  46.4 |  25,720 | 89.9% |   12.9 |  596 |  -3.4%
 lzip:6 |   52 | 122.6 |  93,356 | 92.2% |    5.0 |  611 |  -1.0%
 lzip:9 |   46 | 528.8 | 248,024 | 93.1% |    1.2 |  617 |   0.0%
   xz:0 |   70 |  16.2 |   4,600 | 89.4% |   36.7 |  593 |  -3.9%
   xz:1 |   63 |  19.3 |  10,620 | 90.5% |   31.1 |  600 |  -2.8%
   xz:3 |   58 |  36.8 |  33,612 | 91.3% |   16.5 |  605 |  -1.9%
   xz:6 |   49 | 135.1 |  97,324 | 92.6% |    4.5 |  614 |  -0.5%
   xz:9 |   48 | 194.5 | 691,240 | 92.8% |    3.2 |  615 |  -0.3%

mysql-03.sql

 method |  size |    time |  memory | comp% | comp/s |  comp |  diff%
   none | 5,228 |      ~~ |      ~~ |    ~~ |     ~~ | 5,007 |     ~~
 gzip:1 | 1,308 |    68.1 |   1,576 | 75.0% |   57.6 | 3,920 | -21.7%
 gzip:3 | 1,168 |    81.7 |   1,568 | 77.7% |   49.7 | 4,060 | -18.9%
 gzip:6 |   934 |   152.0 |   1,564 | 82.1% |   28.3 | 4,294 | -14.2%
 gzip:9 |   924 |   212.3 |   1,632 | 82.3% |   20.3 | 4,304 | -14.0%
bzip2:1 |   922 |   401.0 |   2,392 | 82.4% |   10.7 | 4,306 | -14.0%
bzip2:3 |   653 |   434.3 |   4,008 | 87.5% |   10.5 | 4,575 |  -8.6%
bzip2:6 |   524 |   484.7 |   5,496 | 90.0% |    9.7 | 4,704 |  -6.1%
bzip2:9 |   462 |   521.4 |   7,608 | 91.2% |    9.1 | 4,766 |  -4.8%
 zstd:1 |   528 |    20.4 |  11,748 | 89.9% |  230.8 | 4,700 |  -6.1%
 zstd:3 |   364 |    22.0 |  37,628 | 93.0% |  221.3 | 4,864 |  -2.9%
 zstd:6 |   332 |    46.5 |  42,584 | 93.6% |  105.2 | 4,896 |  -2.2%
 zstd:9 |   300 |    72.0 |  44,584 | 94.3% |   68.5 | 4,928 |  -1.6%
 lzip:0 |   762 |    96.1 |   3,992 | 85.4% |   46.5 | 4,466 | -10.8%
 lzip:1 |   439 |   261.1 |  14,484 | 91.6% |   18.3 | 4,789 |  -4.4%
 lzip:3 |   323 |   412.4 |  25,764 | 93.8% |   11.9 | 4,905 |  -2.0%
 lzip:6 |   248 |   829.6 |  93,244 | 95.3% |    6.0 | 4,980 |  -0.5%
 lzip:9 |   224 | 2,443.4 | 363,656 | 95.7% |    2.0 | 5,004 |  -0.1%
   xz:0 |   554 |   125.8 |   4,652 | 89.4% |   37.2 | 4,674 |  -6.7%
   xz:1 |   365 |   130.0 |  10,672 | 93.0% |   37.4 | 4,863 |  -2.9%
   xz:3 |   282 |   260.6 |  33,480 | 94.6% |   19.0 | 4,946 |  -1.2%
   xz:6 |   233 |   894.6 |  97,204 | 95.5% |    5.6 | 4,995 |  -0.2%
   xz:9 |   221 | 1,156.9 | 691,244 | 95.8% |    4.3 | 5,007 |   0.0%

json-01.json

 method | size |  time |  memory | comp% | comp/s | comp |  diff%
   none |  562 |    ~~ |      ~~ |    ~~ |     ~~ |  524 |     ~~
 gzip:1 |  163 |   8.6 |   1,636 | 71.0% |   46.3 |  399 | -23.9%
 gzip:3 |  157 |   8.8 |   1,616 | 72.1% |   46.0 |  405 | -22.7%
 gzip:6 |  131 |  12.4 |   1,592 | 76.7% |   34.8 |  431 | -17.7%
 gzip:9 |  131 |  14.2 |   1,744 | 76.7% |   30.4 |  431 | -17.7%
bzip2:1 |  139 |  40.7 |   2,432 | 75.3% |   10.4 |  423 | -19.3%
bzip2:3 |  103 |  43.0 |   3,636 | 81.7% |   10.7 |  459 | -12.4%
bzip2:6 |   87 |  46.7 |   5,492 | 84.5% |   10.2 |  475 |  -9.4%
bzip2:9 |   80 |  49.3 |   7,608 | 85.8% |    9.8 |  482 |  -8.0%
 zstd:1 |   99 |   2.4 |  10,860 | 82.4% |  195.4 |  463 | -11.6%
 zstd:3 |   83 |   3.1 |  36,972 | 85.2% |  156.5 |  479 |  -8.6%
 zstd:6 |   76 |   6.2 |  40,240 | 86.5% |   78.8 |  486 |  -7.3%
 zstd:9 |   70 |  11.1 |  41,168 | 87.5% |   44.3 |  492 |  -6.1%
 lzip:0 |  110 |  13.4 |   4,060 | 80.4% |   33.9 |  452 | -13.7%
 lzip:1 |   84 |  31.4 |  14,448 | 85.1% |   15.2 |  478 |  -8.8%
 lzip:3 |   74 |  48.0 |  25,712 | 86.8% |   10.2 |  488 |  -6.9%
 lzip:6 |   53 | 112.7 |  93,304 | 90.6% |    4.5 |  509 |  -2.9%
 lzip:9 |   47 | 344.5 | 206,340 | 91.6% |    1.5 |  515 |  -1.7%
   xz:0 |   93 |  19.1 |   4,604 | 83.5% |   24.5 |  469 | -10.5%
   xz:1 |   76 |  24.0 |  10,484 | 86.5% |   20.3 |  486 |  -7.3%
   xz:3 |   65 |  53.9 |  33,540 | 88.4% |    9.2 |  497 |  -5.2%
   xz:6 |   40 | 131.0 |  97,016 | 92.9% |    4.0 |  522 |  -0.4%
   xz:9 |   38 | 179.3 | 690,756 | 93.2% |    2.9 |  524 |   0.0%

json-02.json

 method |  size |    time |  memory | comp% | comp/s |  comp |  diff%
   none | 5,006 |      ~~ |      ~~ |    ~~ |     ~~ | 4,860 |     ~~
 gzip:1 |   597 |    44.8 |   1,640 | 88.1% |   98.5 | 4,409 |  -9.3%
 gzip:3 |   496 |    45.2 |   1,556 | 90.1% |   99.8 | 4,510 |  -7.2%
 gzip:6 |   371 |    59.7 |   1,620 | 92.6% |   77.7 | 4,635 |  -4.6%
 gzip:9 |   352 |    83.8 |   1,548 | 93.0% |   55.5 | 4,654 |  -4.2%
bzip2:1 |   333 |   455.5 |   2,136 | 93.3% |   10.3 | 4,673 |  -3.8%
bzip2:3 |   256 |   551.3 |   3,644 | 94.9% |    8.6 | 4,750 |  -2.3%
bzip2:6 |   227 |   639.2 |   5,572 | 95.5% |    7.5 | 4,779 |  -1.7%
bzip2:9 |   214 |   701.6 |   7,660 | 95.7% |    6.8 | 4,792 |  -1.4%
 zstd:1 |   306 |    17.8 |  10,584 | 93.9% |  263.9 | 4,700 |  -3.3%
 zstd:3 |   300 |    20.5 |  36,152 | 94.0% |  229.8 | 4,706 |  -3.2%
 zstd:6 |   273 |    35.5 |  40,092 | 94.5% |  133.4 | 4,733 |  -2.6%
 zstd:9 |   237 |    57.4 |  41,592 | 95.3% |   83.1 | 4,769 |  -1.9%
 lzip:0 |   339 |    62.2 |   3,912 | 93.2% |   75.0 | 4,667 |  -4.0%
 lzip:1 |   349 |   186.6 |  14,548 | 93.0% |   25.0 | 4,657 |  -4.2%
 lzip:3 |   304 |   262.1 |  25,764 | 93.9% |   17.9 | 4,702 |  -3.3%
 lzip:6 |   216 |   730.4 |  93,492 | 95.7% |    6.6 | 4,790 |  -1.4%
 lzip:9 |   157 | 2,802.4 | 363,680 | 96.9% |    1.7 | 4,849 |  -0.2%
   xz:0 |   331 |    91.2 |   4,536 | 93.4% |   51.3 | 4,675 |  -3.8%
   xz:1 |   283 |   100.8 |  10,604 | 94.3% |   46.9 | 4,723 |  -2.8%
   xz:3 |   251 |   139.8 |  33,476 | 95.0% |   34.0 | 4,755 |  -2.2%
   xz:6 |   186 |   674.5 |  97,132 | 96.3% |    7.1 | 4,820 |  -0.8%
   xz:9 |   146 |   814.6 | 691,032 | 97.1% |    6.0 | 4,860 |   0.0%

bgp-mrt.tsv

 method | size | time |  memory | comp% | comp/s | comp |  diff%
   none |   37 |   ~~ |      ~~ |    ~~ |     ~~ |   35 |     ~~
 gzip:1 |   11 |  0.5 |   1,628 | 70.3% |   49.1 |   26 | -25.7%
 gzip:3 |   10 |  0.7 |   1,748 | 73.0% |   39.1 |   27 | -22.9%
 gzip:6 |    9 |  1.4 |   1,632 | 75.7% |   20.7 |   28 | -20.0%
 gzip:9 |    9 |  1.8 |   1,620 | 75.7% |   15.9 |   28 | -20.0%
bzip2:1 |    8 |  2.6 |   2,488 | 78.4% |   11.0 |   29 | -17.1%
bzip2:3 |    7 |  2.7 |   3,808 | 81.1% |   11.0 |   30 | -14.3%
bzip2:6 |    7 |  2.9 |   5,492 | 81.1% |   10.4 |   30 | -14.3%
bzip2:9 |    6 |  3.0 |   7,900 | 83.8% |   10.4 |   31 | -11.4%
 zstd:1 |    9 |  0.2 |  11,668 | 75.7% |  140.0 |   28 | -20.0%
 zstd:3 |    8 |  0.2 |  37,976 | 78.4% |  120.8 |   29 | -17.1%
 zstd:6 |    7 |  0.5 |  41,160 | 81.1% |   63.8 |   30 | -14.3%
 zstd:9 |    7 |  1.0 |  43,192 | 81.1% |   30.0 |   30 | -14.3%
 lzip:0 |    9 |  0.9 |   3,996 | 75.7% |   31.1 |   28 | -20.0%
 lzip:1 |    8 |  2.2 |  14,488 | 78.4% |   13.1 |   29 | -17.1%
 lzip:3 |    7 |  4.6 |  25,752 | 81.1% |    6.5 |   30 | -14.3%
 lzip:6 |    5 | 14.8 |  93,244 | 86.5% |    2.2 |   32 |  -8.6%
 lzip:9 |    5 | 22.7 | 150,836 | 86.5% |    1.4 |   32 |  -8.6%
   xz:0 |    9 |  1.6 |   4,532 | 75.7% |   17.7 |   28 | -20.0%
   xz:1 |    7 |  1.8 |  10,624 | 81.1% |   17.1 |   30 | -14.3%
   xz:3 |    6 |  4.3 |  33,480 | 83.8% |    7.2 |   31 | -11.4%
   xz:6 |    4 | 14.2 |  97,332 | 89.2% |    2.3 |   33 |  -5.7%
   xz:9 |    2 | 13.6 | 398,956 | 94.6% |    2.6 |   35 |   0.0%

Benchmarking snippets

For the commands used in running the benchmark, see commands.txt.

For the outcome files, see either stats.tsv or stats.ods.
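For readers who want to try something similar on their own files, here is a minimal sketch of the kind of loop such a benchmark needs. It is not the content of commands.txt: the function name bench_gzip is made up for illustration, it only covers gzip, it times in whole seconds with date, and it does not measure memory at all.

```shell
# Print "gzip:LEVEL compressed-bytes whole-seconds" for each level.
# Assumption: whole-second timing is good enough for a first look;
# the tables above clearly used a finer-grained timer.
bench_gzip() {
  # $1 = input file; remaining arguments = compression levels to try
  input="$1"; shift
  for level in "$@"; do
    start=$(date +%s)
    # Count the compressed bytes without writing an output file.
    bytes=$(gzip "-$level" -c < "$input" | wc -c | tr -d ' ')
    end=$(date +%s)
    echo "gzip:$level $bytes $((end - start))"
  done
}

# Hypothetical usage (dump.sql is a placeholder file name):
#   bench_gzip dump.sql 1 3 6 9
```

Swapping gzip for another tool only requires that it accepts a numeric level flag and -c, but short runs like the 37-unit inputs above would need a sub-second timer to give meaningful numbers.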