Binary to text encoding -- state of the art and missed opportunities

by Ciprian Dorin Craciun (https://volution.ro/ciprian) on 

Although many software engineers know about the topic, especially through their exposure to Base64, there are many issues and missed opportunities that the broader community has not tackled.






What is binary to text encoding?

What better place to start than Wikipedia's article on binary-to-text encoding:

A binary-to-text encoding is encoding of data in plain text.
[A bit recursive perhaps?]

More precisely, it is an encoding of binary data in a sequence of printable characters.

These encodings are necessary for transmission of data when the channel does not allow binary data (such as email or NNTP) or is not 8-bit clean.

PGP documentation (RFC 4880) uses the term "ASCII armor" for binary-to-text encoding when referring to Base64.

OK, I've split their introduction into separate phrases. Setting aside the first phrase, which is somewhat recursive, the second one is the technically correct definition, emphasizing the need to transform arbitrary data (nowadays arbitrary byte sequences) into printable data (nowadays most likely ASCII character sequences).

The other two sentences (mentioning "8-bit clean" or "PGP") are perhaps better suited for a section dedicated to early computing history (in the case of "8-bit cleanness") or to the "good ideas that failed to meet the market" history (in the case of PGP)...
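
The mechanics behind that definition are simple enough to sketch in a few lines. Here is a minimal illustration (in Python, using only the standard library) of Base64 as a binary-to-text encoding: arbitrary bytes in, printable ASCII out, and the original bytes recoverable exactly:

```python
import base64

# Arbitrary binary data (here, all 256 possible byte values)...
data = bytes(range(256))

# ...encoded into a sequence of printable ASCII characters:
text = base64.b64encode(data).decode("ascii")
assert text.isascii() and text.isprintable()

# Each 3-byte group becomes 4 output characters (with `=` padding
# at the end), and decoding recovers the original bytes exactly:
assert len(text) == 4 * ((len(data) + 2) // 3)
assert base64.b64decode(text) == data
```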

As a small example, below are the first 440 bytes of an MBR-partitioned disk, encoded with Base64; this is actually binary machine-executable code, part of the early (legacy) boot stages, as provided by the SYSLINUX project.

base64 < /usr/share/syslinux/mbr.bin
M8D6jtiO0LwAfInmBleOwPv8vwAGuQAB86XqHwYAAFJStEG7qlUxyTD2+c0TchOB+1WqdQ3R6XMJ
ZscGjQa0QusVWrQIzROD4T9RD7bGQPfhUlBmMcBmmehmAOg1AU1pc3Npbmcgb3BlcmF0aW5nIHN5
c3RlbS4NCmZgZjHSuwB8ZlJmUAZTagFqEInmZvc29HvA5AaI4YjFkvY2+HuIxgjhQbgBAooW+nvN
E41kEGZhw+jE/76+fb++B7kgAPOlw2ZgieW7vge5BAAxwFNR9geAdANAid6DwxDi80h0W3k5WVuK
RwQ8D3QGJH88BXUiZotHCGaLVhRmAdBmIdJ1A2aJwuis/3ID6Lb/ZotGHOig/4PDEOLMZmHD6HYA
TXVsdGlwbGUgYWN0aXZlIHBhcnRpdGlvbnMuDQpmi0QIZgNGHGaJRAjoMP9yJ2aBPgB8WEZTQnUJ
ZoPABOgc/3ITgT7+fVWqD4Xy/rz6e1pfB/r/5OgeAE9wZXJhdGluZyBzeXN0ZW0gbG9hZCBlcnJv
ci4NCl6stA6KPmIEswfNEDwKdfHNGPTr/QAAAAAAAAAAAAAAAAAAAAA=

Encoding schemes relevance

Setting aside computing history, some of the dinosaurs that still require them (like SMTP), and various esotericisms (like data: URIs), not many people use such encoding schemes.
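
Take the data: URI case: it boils down to embedding Base64 output directly inside a URL, because raw bytes are not allowed there. A minimal sketch in Python:

```python
import base64

# Embed a small payload directly into a `data:` URI;
# Base64 is used because raw bytes are not allowed in URLs.
payload = b"Hello, world!"
uri = "data:text/plain;base64," + base64.b64encode(payload).decode("ascii")
print(uri)
# → data:text/plain;base64,SGVsbG8sIHdvcmxkIQ==
```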

There are, however, still a few cases where they are the proper tool, for example:

Noteworthy encoding schemes

If one is curious about what human ingenuity came up with, Wikipedia's article on binary to text encoding (the one mentioned in the introduction) has a long list of well-established encoding schemes.

Here are two of them that every software engineer should know about:

Then, especially if one does web-development, there are a few variants that are best suited for this domain:

Of special note are also the following:

Efficiency -- losing the wrong battle

If one reads the encoding standards section of the mentioned Wikipedia article, one can find a large table of various encodings with a column dedicated to their efficiency: the ratio (as a percentage) between the size of the arbitrary original binary data and the size of the resulting textual encoded data. For example:
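
These percentages follow directly from the alphabet size: each output character carries log2(radix) bits of payload, compared to the 8 bits per byte of the raw input. A quick sketch:

```python
import math

# Efficiency = bits carried per output character / 8 bits per input byte.
def efficiency(radix: int) -> float:
    return math.log2(radix) / 8

print(f"Base16: {efficiency(16):.1%}")  # → Base16: 50.0%
print(f"Base32: {efficiency(32):.1%}")  # → Base32: 62.5%
print(f"Base64: {efficiency(64):.1%}")  # → Base64: 75.0%
print(f"Base85: {efficiency(85):.1%}")  # → Base85: 80.1%
```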

However, all of these numbers are meaningless because:

basenc --base64 < /etc/hosts | wc -c
#>> 2088

gzip -1 < /etc/hosts | basenc --base16 | wc -c
#>> 1160

wc -c < /etc/hosts
#>> 1543

Thus, although Base16 is only 50% efficient, and Base64 is at ~74%, by just applying compression at the weakest level one can obtain ~133% efficiency (1543 / 1160), i.e. the encoded output is actually smaller than the original input.
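
The same point can be reproduced without `basenc`, assuming a redundant input (as most text files, /etc/hosts included, tend to be); a sketch in Python:

```python
import base64, binascii, gzip

# A redundant input, similar in spirit to /etc/hosts:
data = b"127.0.0.1 localhost\n" * 50

plain_b64 = base64.b64encode(data)
packed_hex = binascii.hexlify(gzip.compress(data, compresslevel=1))

# Base16 over gzip'd data beats plain Base64 over the original,
# despite Base16's "worse" 50% efficiency:
assert len(packed_hex) < len(plain_b64) < len(data) * 2
print(len(data), len(plain_b64), len(packed_hex))
```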

In conclusion, looking at efficiency is perhaps meaningless, when a bit of pre-processing can be applied, and especially when other features are more important.

Missing features -- the right battles never fought

While most encoding schemes focus on efficiency (there is even one that uses emojis; by counting Unicode code-points instead of ASCII characters it's a clear winner), others (like PGP words, Proquint, or Bech32) focus on other, more useful features.
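
Proquint, for example, trades density for pronounceability: each 16-bit group becomes a five-letter "word" of alternating consonants and vowels. Here is a sketch of just the encoder, following the published Proquint proposal (consonants `bdfghjklmnprstvz` carrying 4 bits each, vowels `aiou` carrying 2 bits each):

```python
# Proquint sketch: each 16-bit word maps to a CVCVC "pronounceable" group,
# alternating 4-bit consonants and 2-bit vowels (per the Proquint proposal).
CONSONANTS = "bdfghjklmnprstvz"
VOWELS = "aiou"

def proquint_word(value: int) -> str:
    assert 0 <= value <= 0xFFFF
    return (CONSONANTS[(value >> 12) & 0xF]
            + VOWELS[(value >> 10) & 0x3]
            + CONSONANTS[(value >> 6) & 0xF]
            + VOWELS[(value >> 4) & 0x3]
            + CONSONANTS[value & 0xF])

def proquint(data: bytes) -> str:
    assert len(data) % 2 == 0
    words = (int.from_bytes(data[i:i+2], "big") for i in range(0, len(data), 2))
    return "-".join(proquint_word(w) for w in words)

# The canonical example from the proposal: the IP address 127.0.0.1.
print(proquint(bytes([127, 0, 0, 1])))
# → lusab-babad
```

Reading `lusab-babad` over the phone, or spotting that it differs from `lusab-babid`, is far easier than doing the same with `fwAAAQ==`.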

In what follows, I'll try to list a few features that I think are many times more useful than efficiency, because in the end any binary to text encoding will output (much) more data than the initial input. And, needless to say, nobody would use any of these encodings for anything larger than a few handfuls of terminal screens...

Namely:

The future -- that may never be

As if there aren't enough competing alternatives already, here is another one:

z-tokens exchange armor < /usr/share/syslinux/mbr.bin
#01,  2ymHZ9BtGEtTd 2gUAFKhkdGG6G bpmVaU3c5LJo  3U9aUZP9CToy  42umy5e4T9XME
#02,  3TPuiaMuQdHab 3Je8iXRUn2aAD 4zg1wNnj3Mmn  2fwxaqQFaALZF 32cf2bmqKr8Ez
#03,  AFk2KBm56ujk  3DsRPK3Vd8UuU 2HDGG1rFap63P 3VB96ygabKH68 XUbLGpH1ddRR
#04,  34iM3hC9S1X7o 2iDftFE4SKcJp 3641UPkGoQRtf wRuC6S1WpJR   BtJJZ7KoJTZi
#05,  2NQuVKc3ikJgM 3L5MRHQqRH2WH 2YVDeALy63kUx pghKDbvw7rqE  3QZK3MSPsC4Gf
#06,  3hEWV6sm1gVXJ 2SMEeaZn3SiLz 3211KnwHHX42K GqXM56S1nLrN  2GVrZYS6z8y72
#07,  DBQ6gYpd6Vu5  3mTV76GnUwLbb Mm6P1gRs1cJx  2rG2v3Shu3VNJ 2UynV4L2FxHiK
#08,  THXSoaBVgKAu  jFPa1tzhSYqE  3t32YALoQNC1r 2hGAP4HKJ2m2p fwqsXuAJgUfj
#09,  2wD6VybecDXwo 2Fy4aq75Q52kF 2WdmPf5UK7Gz5 22SfKnQjbAdbH 3PyUawhEQg4Br
#10,  2qwmWNGkCkD5V 3DbLXJKqfEikW 2rbAtvoFwThfR 2U9P2QgCKyQHr g5R5fvRYGiHz
#11,  2CUChvS6D2Es  2gBBynWN3Andd 1gkYuuLNJCtU  wwmEVTML5Zgu  3pJS9q8jYZ1xG
#12.  YY

Obviously I'm selling my own snake-oil in the form of the z-tokens open-source project, my own take on an "all-things-tokens Swiss army knife" tool, which, besides generating passwords / passphrases and other tokens, has this nice exchange armor / exchange dearmor pair of sub-commands that tries to put together some of the above features in a unique package.

Please note that at the moment the format is experimental, and most likely prone to backward incompatible changes!

But this is a topic for another article...