The many flavors of hashing

by Ciprian Dorin Craciun (https://volution.ro/ciprian) on 

About the many types of hash functions, their use-cases, dos and don'ts, with suggestions for currently accepted algorithms.

// permanent-link // Lobsters // HackerNews // index // RSS





Prelude

In practical computer science hashing is a very important concept. It is used from simple data structures (like hash maps), highly complex data structures (like bloom filters or hyperloglog counters), database indices and sharding, storage and communication integrity, distributed storage, most password authentication and storage mechanisms, digital signatures, other cryptographic constructs based on Merkle trees (including Git or digital ledgers), and possibly many other use-cases I'm not even aware of right now.

However, not every hash algorithm is appropriate in all of these scenarios, and in fact, very few algorithms are usable in more than a couple of situations. Even worse, using the wrong algorithm will lead in the best case scenario to performance problems, but in the worst case scenario to security issues and even financial loss. Thus, knowing which algorithm to pick for which application is crucial.

Therefore I'll try to summarize how I approach the topic of hashing, including use-cases, recommended algorithms, and links to other articles.

And finally, if you want a second opinion -- you should always want one! -- you could also read the Programmers Don't Understand Hash Functions article, written in a similar fashion to (but one year before) my own, but focusing more on the cryptographic aspects.

A note of caution!

This is my personal view built upon practical experience while developing software, less focused on theoretical accuracy and details.

I use the word hash or hashing in the broadest sense, to mean from actual hash functions to cryptographic hash functions, including checksums and other fingerprinting.

I'm also not a computer scientist that has specialized in algorithms, nor am I a cryptographer or security expert. Although I do dabble in both subjects, mostly out of curiosity, and thus I'm not quite divining these ideas out of thin air.

If you think that any of what I describe is wrong (outdated or incomplete), please kindly send me an email (see comments section). (Pointers to existing articles that support your statements are highly appreciated; just opinions without foundation less so.)

Also, consider this to be a live document, one that I'll update as time passes and I find more examples or use-cases. I'll try to apply updates only for corrections or appending new information. If I find out that the article needs a more thorough restructuring I'll publish a new one, leaving this for reference.

What is hashing?

Wikipedia defines a hash function as:

A hash function is any function that can be used to map data of arbitrary size to fixed size values. The values returned by a hash function are called hash values, hash codes, digests, or simply hashes.

For my purpose, I see a hash function as any piece of code that:

Thus, in Python a hash function might look like this:

def hash (_input, _seed = None) :
    _hash = ... # code that churns the input
    return _hash

Related to the "looks random enough" aspect, because the input size is almost always larger than the output (which is fixed size), there are usually an infinite number of inputs for which the output is the same, thus these functions are not injective. When two inputs yield the same output it is called a collision. Few hash functions are collision-free, but most are designed so that collisions are minimized.

Hashing classification

As hinted in the beginning, hashing can be used for lots of things, and some algorithms can be used in multiple unrelated scenarios. Therefore I'll try to classify the hash algorithms by intended usage purpose (from a practical point of view) and for each category hint at some good choices and use-cases.

In summary, here are the classes of hash algorithms:

Shuffling hashes

The key property is performance for small inputs.

This is the most generic and simple form of a hash function: given an input, return a small number (16, 32, or 64 bits).

These are used mainly when implementing data structures (like hash tables, etc.) or for classifying inputs into a small number of buckets (like sharding). (Although when applied to storage, most implementations tend to use cryptographically strong hashes, like the ones described in the signature hashes section.)

Before listing some examples, here are some dos and don'ts:

One can't speak of this category without also mentioning the following project, github.com/rurban/smhasher, which is the state-of-the-art benchmark for all generic hash algorithms (including cryptographically strong hash functions).

Do you want to know which is the best algorithm in this category? There isn't one! Best can mean any of collision resistance, performance (on which CPU? on what kind of inputs? one-time or in a pipeline?), security (i.e. predictability or reversibility), etc.

For my own use-cases, at the moment I use the following:

The smhasher benchmark author has an interesting summary:

When used in a hash table the instruction cache will usually beat the CPU and throughput measured here. In my tests the smallest FNV1A beats the fastest crc32_hw1 with Perl 5 hash tables. Even if those worse hash functions will lead to more collisions, the overall speed advantage and inline-ability beat the slightly worse quality.

Integrity hashes

The key property is detecting small changes in the input.

These are technically called checksums, but for my own classification (a single word to uniquely describe each category) I used the word integrity because their main purpose is identifying data corruption either during transmission or storage. The best-known example here is CRC32.

Their most important quality is that they guarantee to detect changes up to a certain number of flipped bits (the most common error in hardware); above that one gets only a probabilistic best effort detection.

If you are tempted to use these for use-cases described in the previous section, then don't! They are not optimized for collision avoidance or randomness.

I haven't used any of these myself, and I've always resorted to larger signature hashes (i.e. cryptographically strong hashes), and many have chosen to use the (more) secure variants of the shuffling hashes. I would also recommend reading the xxHash issue #229 about this topic.

Two families of algorithms stand out here:

Also for my personal usage, when it comes to file-system integrity checks I tend to use MD5. This is mostly due to it being generally available anywhere (through the md5sum tool), and because I already have a collection of fingerprints that are already MD5.

Permutation hashes

The key properties are small inputs, small outputs, but without collisions.

Granted, I came up with this category myself but I do find it useful sometimes, so here it goes the description:

Where would I use such a function? For example in logging. Imagine you are looking through a large and verbose log of a multi-threaded application; each line is prefixed with a thread identifier, which just happens to be almost sequential; ask yourself how easy it is to visually find the logs pertaining to one thread. Now imagine that instead of that sequential number we get a random-looking hex number; I bet it gets a lot easier now... (If you think I'm crazy, then know I'm not alone...)

Unfortunately, I know of only a single algorithm that solves this problem: Jeff Preshing's sequence of unique random integers (with this C implementation).

These might seem related to perfect hash functions, however, in the latter case the possible inputs are fewer than, for example, all the possible values for a 32 bit integer.

As a side note, if one has inputs and outputs that are multiple of 128 bits, then one could just use the AES block encryption with a fixed key (although this is not that efficient for such a purpose).

Signature hashes

The key property is irreversibility and collision resistance.

These are technically called cryptographic hash functions, but because they are mainly used in digital signature or HMAC schemes, I name them "signature hashes".

The dos and don'ts:

The obvious choices in this category are:

The obvious non-starters (due to weaknesses) are MD5 (and MD4), SHA1, and other "trust me this is the best crypto ever" snake-oil algorithms.

Password hashes

The key property is irreversibility and brute-force resistance.

I think there have been countless articles written about how to choose, input, update, exchange, and store passwords, therefore I won't repeat them here. Use your preferred search engine to do your own research, or even better, use an already existing framework for identity management.

But if you must, the obvious choices in this category are (in any order):

An honorable mention is PBKDF2, although by today's standards it is considered unsuitable.

See also the TOTP and HOTP algorithms mentioned in the miscellaneous section, as slightly related topics.

Token hashes

The key property is irreversibility and collision resistance.

These are technically called key derivation hash functions, whose main purpose is taking a known (and secret) key to generate any number of unguessable tokens. They should never be used with passwords as the known secret key.

When to use such an algorithm? The best answer is: you'll know when you'll need one.

However, the best explanation (with in-depth details, 99% of which I've already forgotten) I've seen is the Understanding HKDF article; thus if you do need one, you must read this article.

In essence, any keyed cryptographic hash function that accepts a key (or any HMAC function), could be used for such a purpose, however, there are a few standard algorithms:

See also the TOTP and HOTP algorithms mentioned in the miscellaneous section, as slightly related topics.

Encryption hashes

The key property is reversibility if provided with the seed, but irreversibility otherwise, and collision resistance.

Previously in the signature hashes section I've said that they don't encrypt the input, thus an attacker given a hash could guess the original input. Well, sometimes you do want to encrypt the input, and then be able to decrypt it back.

Which would be a use-case for such a procedure? Hiding small internal identifiers when exporting them as part of an API. Imagine you have a database of patients, where each patient has an incremental primary key; imagine you want to implement an API that lists patients given some query; however you don't want to export the actual sequential numbers, but instead want to export unique random looking identifiers, so that an attacker (or user) can't guess anything from that exported identifier, nor does it allow one to try to exfiltrate data if one fails to properly secure API endpoints.

Unfortunately, I haven't seen many articles about this topic... The current best practice is to throw UUIDv4's at the problem by storing them in mapping tables, or worse, by using them as primary keys (see here and here for some of the counterarguments)...

The only article I've found that touches on this subject was Plan B for UUIDs -- double AES-128. (I've quickly read through it, and not yet internalized its intricacies, thus I can't speak too much about it.)

Some long time ago I've even written my own open-source implementation of such an algorithm, available at github.com/cipriancraciun/java-token-obfuscator, however, I haven't touched it in a while (as in almost 8 years)... (Perhaps one day I'll write another implementation and article about this topic...)

Miscellaneous hashes

This category is for those types of hashes that are too obscure or too focused and thus don't fit in any of the previous categories: