Binary to text encoding -- state of the art and missed opportunities

by Ciprian Dorin Craciun (https://volution.ro/ciprian) on 

Although many software engineers know about the topic, especially through their exposure to Base64, there are many issues and missed opportunities that the broader community has not tackled.






What is binary to text encoding?

What better place to start than Wikipedia's article on binary-to-text encoding:

A binary-to-text encoding is encoding of data in plain text.
[A bit recursive perhaps?]

More precisely, it is an encoding of binary data in a sequence of printable characters.

These encodings are necessary for transmission of data when the channel does not allow binary data (such as email or NNTP) or is not 8-bit clean.

PGP documentation (RFC 4880) uses the term "ASCII armor" for binary-to-text encoding when referring to Base64.

OK, I've split their introduction into separate phrases. Setting aside the first phrase, which is somewhat recursive, the second one is the technically correct definition, emphasizing the need to transform arbitrary data (nowadays arbitrary byte sequences) into printable data (nowadays most likely ASCII character sequences).

The other two sentences (mentioning "8-bit clean" or "PGP") are perhaps better suited for a section dedicated to early computing history (in the case of "8-bit cleanness") or to the "good ideas that failed to meet the market" history (in the case of PGP)...
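
The mechanics behind that definition are simple enough to sketch in a few lines. Here is a minimal illustration (in Python, using only the standard library) of Base64 as a binary-to-text encoding: arbitrary bytes in, printable ASCII out, and the original bytes recoverable exactly:

```python
import base64

# Arbitrary binary data (here, all 256 possible byte values)...
data = bytes(range(256))

# ...encoded into a sequence of printable ASCII characters:
text = base64.b64encode(data).decode("ascii")
assert text.isascii() and text.isprintable()

# Each 3-byte group becomes 4 output characters (with `=` padding
# at the end), and decoding recovers the original bytes exactly:
assert len(text) == 4 * ((len(data) + 2) // 3)
assert base64.b64decode(text) == data
```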

As a small example, below are the first 440 bytes of an MBR-partitioned disk, encoded with Base64; this is actually binary machine-executable code, part of the early (legacy) boot stages, as provided by the SYSLINUX project.

base64 < /usr/share/syslinux/mbr.bin
M8D6jtiO0LwAfInmBleOwPv8vwAGuQAB86XqHwYAAFJStEG7qlUxyTD2+c0TchOB+1WqdQ3R6XMJ
ZscGjQa0QusVWrQIzROD4T9RD7bGQPfhUlBmMcBmmehmAOg1AU1pc3Npbmcgb3BlcmF0aW5nIHN5
c3RlbS4NCmZgZjHSuwB8ZlJmUAZTagFqEInmZvc29HvA5AaI4YjFkvY2+HuIxgjhQbgBAooW+nvN
E41kEGZhw+jE/76+fb++B7kgAPOlw2ZgieW7vge5BAAxwFNR9geAdANAid6DwxDi80h0W3k5WVuK
RwQ8D3QGJH88BXUiZotHCGaLVhRmAdBmIdJ1A2aJwuis/3ID6Lb/ZotGHOig/4PDEOLMZmHD6HYA
TXVsdGlwbGUgYWN0aXZlIHBhcnRpdGlvbnMuDQpmi0QIZgNGHGaJRAjoMP9yJ2aBPgB8WEZTQnUJ
ZoPABOgc/3ITgT7+fVWqD4Xy/rz6e1pfB/r/5OgeAE9wZXJhdGluZyBzeXN0ZW0gbG9hZCBlcnJv
ci4NCl6stA6KPmIEswfNEDwKdfHNGPTr/QAAAAAAAAAAAAAAAAAAAAA=

Encoding schemes relevance

Setting aside computing history, some of the dinosaurs that still require them (like SMTP), and various esotericisms (like data: URIs), not many people use such encoding schemes.
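
Take the data: URI case: it boils down to embedding Base64 output directly inside a URL, because raw bytes are not allowed there. A minimal sketch in Python:

```python
import base64

# Embed a small payload directly into a `data:` URI;
# Base64 is used because raw bytes are not allowed in URLs.
payload = b"Hello, world!"
uri = "data:text/plain;base64," + base64.b64encode(payload).decode("ascii")
print(uri)
# → data:text/plain;base64,SGVsbG8sIHdvcmxkIQ==
```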

There are, however, still a few cases where they are the proper tool, for example:

Noteworthy encoding schemes

If one is curious about what human ingenuity came up with, Wikipedia's article on binary to text encoding (the one mentioned in the introduction) has a long list of well-established encoding schemes.

Here are two of them that every software engineer should know about:

Then, especially if one does web-development, there are a few variants that are best suited for this domain:

Of special note are also the following:

Efficiency -- losing the wrong battle

If one reads the encoding standards section of the mentioned Wikipedia article, one can find a large table of various encodings with a column dedicated to their efficiency: the ratio (as a percentage) between the size of the arbitrary original binary data and the size of the resulting textual encoded data. For example:
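
These percentages follow directly from the alphabet size: each output character carries log2(radix) bits of payload, compared to the 8 bits per byte of the raw input. A quick sketch:

```python
import math

# Efficiency = bits carried per output character / 8 bits per input byte.
def efficiency(radix: int) -> float:
    return math.log2(radix) / 8

print(f"Base16: {efficiency(16):.1%}")  # → Base16: 50.0%
print(f"Base32: {efficiency(32):.1%}")  # → Base32: 62.5%
print(f"Base64: {efficiency(64):.1%}")  # → Base64: 75.0%
print(f"Base85: {efficiency(85):.1%}")  # → Base85: 80.1%
```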

However, all of these numbers are meaningless because:

basenc --base64 < /etc/hosts | wc -c
#>> 2088

gzip -1 < /etc/hosts | basenc --base16 | wc -c
#>> 1160

wc -c < /etc/hosts
#>> 1543

Thus, although Base16 is only 50% efficient, and Base64 is at ~74%, by just applying compression at the weakest level one can obtain ~133% efficiency (1543 / 1160), i.e. the encoded output is actually smaller than the original input.
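
The same point can be reproduced without `basenc`, assuming a redundant input (as most text files, /etc/hosts included, tend to be); a sketch in Python:

```python
import base64, binascii, gzip

# A redundant input, similar in spirit to /etc/hosts:
data = b"127.0.0.1 localhost\n" * 50

plain_b64 = base64.b64encode(data)
packed_hex = binascii.hexlify(gzip.compress(data, compresslevel=1))

# Base16 over gzip'd data beats plain Base64 over the original,
# despite Base16's "worse" 50% efficiency:
assert len(packed_hex) < len(plain_b64) < len(data) * 2
print(len(data), len(plain_b64), len(packed_hex))
```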

In conclusion, looking at efficiency is perhaps meaningless, when a bit of pre-processing can be applied, and especially when other features are more important.

Missing features -- the right battles never fought

While most encoding schemes focus on efficiency (there is even one that uses emojis; by counting Unicode code-points instead of ASCII characters it's a clear winner), others (like PGP words, Proquint, or Bech32) focus on other, more useful features.
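
Proquint, for example, trades density for pronounceability: each 16-bit group becomes a five-letter "word" of alternating consonants and vowels. Here is a sketch of just the encoder, following the published Proquint proposal (consonants `bdfghjklmnprstvz` carrying 4 bits each, vowels `aiou` carrying 2 bits each):

```python
# Proquint sketch: each 16-bit word maps to a CVCVC "pronounceable" group,
# alternating 4-bit consonants and 2-bit vowels (per the Proquint proposal).
CONSONANTS = "bdfghjklmnprstvz"
VOWELS = "aiou"

def proquint_word(value: int) -> str:
    assert 0 <= value <= 0xFFFF
    return (CONSONANTS[(value >> 12) & 0xF]
            + VOWELS[(value >> 10) & 0x3]
            + CONSONANTS[(value >> 6) & 0xF]
            + VOWELS[(value >> 4) & 0x3]
            + CONSONANTS[value & 0xF])

def proquint(data: bytes) -> str:
    assert len(data) % 2 == 0
    words = (int.from_bytes(data[i:i+2], "big") for i in range(0, len(data), 2))
    return "-".join(proquint_word(w) for w in words)

# The canonical example from the proposal: the IP address 127.0.0.1.
print(proquint(bytes([127, 0, 0, 1])))
# → lusab-babad
```

Reading `lusab-babad` over the phone, or spotting that it differs from `lusab-babid`, is far easier than doing the same with `fwAAAQ==`.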

In what follows, I'll try to list a few features that I think are many times more useful than efficiency, because in the end any binary to text encoding will output (much) more data than the initial input. And, needless to say, nobody would use any of these encodings for anything larger than a few handfuls of terminal screens...

Namely:

The future -- that may never be

As if there aren't enough competing alternatives already, here is another one:

z-tokens exchange armor < /usr/share/syslinux/mbr.bin
#01,  2ymHZ9BtGEtTd 2gUAFKhkdGG6G bpmVaU3c5LJo  3U9aUZP9CToy  42umy5e4T9XME
#02,  3TPuiaMuQdHab 3Je8iXRUn2aAD 4zg1wNnj3Mmn  2fwxaqQFaALZF 32cf2bmqKr8Ez
#03,  AFk2KBm56ujk  3DsRPK3Vd8UuU 2HDGG1rFap63P 3VB96ygabKH68 XUbLGpH1ddRR
#04,  34iM3hC9S1X7o 2iDftFE4SKcJp 3641UPkGoQRtf wRuC6S1WpJR   BtJJZ7KoJTZi
#05,  2NQuVKc3ikJgM 3L5MRHQqRH2WH 2YVDeALy63kUx pghKDbvw7rqE  3QZK3MSPsC4Gf
#06,  3hEWV6sm1gVXJ 2SMEeaZn3SiLz 3211KnwHHX42K GqXM56S1nLrN  2GVrZYS6z8y72
#07,  DBQ6gYpd6Vu5  3mTV76GnUwLbb Mm6P1gRs1cJx  2rG2v3Shu3VNJ 2UynV4L2FxHiK
#08,  THXSoaBVgKAu  jFPa1tzhSYqE  3t32YALoQNC1r 2hGAP4HKJ2m2p fwqsXuAJgUfj
#09,  2wD6VybecDXwo 2Fy4aq75Q52kF 2WdmPf5UK7Gz5 22SfKnQjbAdbH 3PyUawhEQg4Br
#10,  2qwmWNGkCkD5V 3DbLXJKqfEikW 2rbAtvoFwThfR 2U9P2QgCKyQHr g5R5fvRYGiHz
#11,  2CUChvS6D2Es  2gBBynWN3Andd 1gkYuuLNJCtU  wwmEVTML5Zgu  3pJS9q8jYZ1xG
#12.  YY

Obviously I'm selling my own snake-oil in the form of the z-tokens open-source project, my own take on an "all-things-tokens Swiss army knife" tool, which, besides generating passwords / passphrases and other tokens, has this nice exchange armor / exchange dearmor pair of sub-commands that tries to put together some of the above features in a unique package.

Please note that at the moment the format is experimental, and most likely prone to backward incompatible changes!

But this is a topic for another article...