Prelude
In practical computer science hashing is a very important concept. It is used from simple data structures (like hash maps), highly complex data structures (like bloom filters or hyperloglog counters), database indices and sharding, storage and communication integrity, distributed storage, most password authentication and storage mechanisms, digital signatures, other cryptographic constructs based on Merkle trees (including Git or digital ledgers), and possibly many other use-cases I'm not even aware of right now.
However, not every hash algorithm is appropriate in all of these scenarios, and in fact, very few algorithms are usable in more than a couple of situations. Even worse, using the wrong algorithm will lead in the best case scenario to performance problems, but in the worst case scenario to security issues and even financial loss. Thus, knowing which algorithm to pick for which application is crucial.
Therefore I'll try to summarize how I approach the topic of hashing, including use-cases, recommended algorithms, and links to other articles.
And finally, if you want a second opinion -- you should always want one! -- you could also read the Programmers Don't Understand Hash Functions article, written in a similar fashion to (but one year before) my own, but focusing more on the cryptographic aspects.
A note of caution!
This is my personal view built upon practical experience while developing software, less focused on theoretical accuracy and details.
I use the word hash or hashing in the broadest sense, to mean from actual hash functions to cryptographic hash functions, including checksums and other fingerprinting.
I'm also not a computer scientist that has specialized in algorithms, nor am I a cryptographer or security expert. Although I do dabble in both subjects, mostly out of curiosity, and thus I'm not quite divining these ideas out of thin air.
If you think that any of what I describe is wrong (outdated or incomplete), please kindly send me an email (see comments section). (Pointers to existing articles that support your statements are highly appreciated; just opinions without foundation less so.)
Also, consider this to be a live document, one that I'll update as time passes and I find more examples or use-cases. I'll try to apply updates only for corrections or appending new information. If I find out that the article needs a more thorough restructuring I'll publish a new one, leaving this for reference.
What is hashing?
Wikipedia defines a hash function as:
A hash function is any function that can be used to map data of arbitrary size to fixed size values. The values returned by a hash function are called hash values, hash codes, digests, or simply hashes.
For my purpose, I see a hash function as any piece of code that:
- takes as input an object (be it a number, a string, a byte array, or anything native to the programming language); (note that some hash functions work only for numbers, others only for strings, some work for anything, etc.;)
- returns as output a fixed size number or byte array; (numbers and bytes are interchangeable in some cases);
- is deterministic; (for the same input there will be always the same output;)
- "looks random enough"; (this is actually a very important property!)
- optionally takes a second parameter, called a seed or key;
Thus, in Python a hash function might look like this:
def hash (_input, _seed = None) :
_hash = ... # code that churns the input
return _hash
Related to the "looks random enough" aspect, because the input size is almost always larger than the output (which is fixed size), there are usually an infinite number of inputs for which the output is the same, thus these functions are not injective. When two inputs yield the same output it is called a collision. Few hash functions are collision-free, but most are designed so that collisions are minimized.
Hashing classification
As hinted in the beginning, hashing can be used for lots of things, and some algorithms can be used in multiple unrelated scenarios. Therefore I'll try to classify the hash algorithms by intended usage purpose (from a practical point of view) and for each category hint at some good choices and use-cases.
In summary, here are the classes of hash algorithms:
- shuffling hashes (generic hashing functions);
- integrity hashes (checksums or error detection codes);
- permutation hashes (bear with me on this one);
- (cryptographic) signature hashes (cryptographic hash functions);
- (cryptographic) password hashes (password derivation functions);
- (cryptographic) token hashes (key derivation functions, non-password related);
- (cryptographic) encryption hashes (bear with me on this one);
- the best of the rest hashes (all those that don't fit into one of the previous categories);
Shuffling hashes
The key property is performance for small inputs.
This is the most generic and simple form of a hash function: given an input, return a small number (16, 32, or 64 bits).
These are used mainly when implementing data structures (like hash tables, etc.) or for classifying inputs into a small number of buckets (like sharding). (Although when applied to storage, most implementations tend to use cryptographically strong hashes, like the ones described in the signature hashes section.)
Before listing some examples, here are some dos and don'ts:
- must never be used when cryptography or security is involved;
- collisions should always be expected to happen regularly; (you could test if your code works correctly by using a hash function that always returns the value 42);
- must be used with caution when the input is under the control of a potential attacker (as in always, especially in networked services!); when this is the case then use seeded or keyed hash functions; (else suffer the same fate as NodeJS / V8 or Python based servers which could have been easily DoS'ed due to this;)
- if you need fewer bits than the function provides, choose another function that provides the closest (larger) size, then just drop the extra bits; don't just XOR or mix the bits yourself, else you'll break some of the function properties;
- when used for bucketing, choose either a power of 2 or a prime for the desired bucket count, then take the remainder; (also see the previous suggestion;)
- in general, especially with default
hash
function provided by many languages or runtimes, you should not depend on the output beeing the same cross-processes (for example after a restart, or in a different process), as many modern implementations seed their hash functions randomly at initialization; (if you need this property, choose a specific hash, and use it explicitly;)
One can't speak of this category without also mentioning the following project, github.com/rurban/smhasher, which is the state-of-the-art benchmark for all generic hash algorithms (including cryptographically strong hash functions).
Do you want to know which is the best algorithm in this category? There isn't one! Best can mean any of collision resistance, performance (on which CPU? on what kind of inputs? one-time or in a pipeline?), security (i.e. predictability or reversibility), etc.
For my own use-cases, at the moment I use the following:
xxh3
orxxh128
from the xxHash family (which supports seeding); (currently the fastest according to the benchmarks;)- highwayhash (which supports seeding); it was developed by Google and they claim it's faster than SipHash and comes with "security claims"; I would choose this one if attackers are a concern;
- SipHash, a tried-and-tested hash function (which supports seeding); it is present in most programming languages as the default choice for hash-table hashing function (from Rust to Python, Ruby, NodeJS and others); I would choose this one if the other two options aren't already implemented for my programming language;
- djb2, developed by D.J. Berstein, when simplicity is paramount (only 3 lines of code); (and when attackers are not a concern;)
The smhasher
benchmark author has an interesting summary:
When used in a hash table the instruction cache will usually beat the CPU and throughput measured here. In my tests the smallest
FNV1A
beats the fastestcrc32_hw1
with Perl 5 hash tables. Even if those worse hash functions will lead to more collisions, the overall speed advantage and inline-ability beat the slightly worse quality.
Integrity hashes
The key property is detecting small changes in the input.
These are technically called checksums, but for my own classification
(a single word to uniquely describe each category) I used the word integrity because their main purpose
is identifying data corruption either during transmission or storage. The best-known example here is CRC32
.
Their most important quality is that they guarantee to detect changes up to a certain number of flipped bits (the most common error in hardware); above that one gets only a probabilistic best effort detection.
If you are tempted to use these for use-cases described in the previous section, then don't! They are not optimized for collision avoidance or randomness.
I haven't used any of these myself, and I've always resorted to larger signature hashes (i.e. cryptographically strong hashes), and many have chosen to use the (more) secure variants of the shuffling hashes. I would also recommend reading the xxHash issue #229 about this topic.
Two families of algorithms stand out here:
- cyclic redundancy check (CRC's), which are used all over the place in the lower layers of storage and networking (especially in hardware);
- Bose–Chaudhuri–Hocquenghem codes (BCH codes), which I've stumbled upon from the Bech32 encoding format (similar to Base32 but with error detection built-in and tailored for QRCodes); (who knew that something good would stem out of the whole crypto-currency "thing"...)
Also for my personal usage, when it comes to file-system integrity checks I tend to use
MD5.
This is mostly due to it being generally available anywhere (through the md5sum
tool),
and because I already have a collection of fingerprints that are already MD5
.
Permutation hashes
The key properties are small inputs, small outputs, but without collisions.
Granted, I came up with this category myself but I do find it useful sometimes, so here it goes the description:
- given a small number, return another small number, that "looks random enough";
- take any two inputs, and their outputs must be different;
- preferably the size of the input and output is the same, thus the permutation designation;
Where would I use such a function? For example in logging. Imagine you are looking through a large and verbose log of a multi-threaded application; each line is prefixed with a thread identifier, which just happens to be almost sequential; ask yourself how easy it is to visually find the logs pertaining to one thread. Now imagine that instead of that sequential number we get a random-looking hex number; I bet it gets a lot easier now... (If you think I'm crazy, then know I'm not alone...)
Unfortunately, I know of only a single algorithm that solves this problem: Jeff Preshing's sequence of unique random integers (with this C implementation).
These might seem related to perfect hash functions, however, in the latter case the possible inputs are fewer than, for example, all the possible values for a 32 bit integer.
As a side note, if one has inputs and outputs that are multiple of 128 bits, or you don't mind padding, then one could just use the AES block encryption with a fixed key (although this is not that efficient for such a purpose).
Also, someone on HN pointed me to Correlated Multi-Jittered Sampling by Andrew Kensler, although I've not read it myself for the moment. In addition to that, someone else pointed that one could also use "reversible mixing functions" (from the mixing step of various stronger hashing algorithms) for this purpose.
Signature hashes
The key property is irreversibility and collision resistance.
These are technically called cryptographic hash functions, but because they are mainly used in digital signature or HMAC schemes, I name them "signature hashes".
The dos and don'ts:
- don't use any of these algorithms (directly) ever!
- really, don't use any of these algorithms! (you ask yourself what's the problem? read this article about hash length extension attacks and you'll know better;)
- don't think these encrypt or obfuscate the input! they just efficiently mangle the bits from the input; if the attacker sees a hash, and he could guess the possible inputs, then he can easily find out what the original input was; if the attacker knows the input, he can easily compute the hash, and thus he can detect if any of the inputs match the seen hashes; (I do sometimes use what I describe as encryption hashes below;)
- don't use these for passwords! use a password hash algorithm described below;
- always use libraries (like libsodium) that provide higher-level abstractions (like digital signatures and HMAC's);
- if you do feel like you need to use one of these, always use it keyed (you can put the key in the configuration file), or use a HMAC if the algorithm doesn't have a keyed variant;
- although you could use one of these for bucketing, that is wasteful in terms of CPU usage; use a shuffling hash algorithm described above;
The obvious choices in this category are:
- Blake3, which is the latest offer from the Blake family; (or Blake2b) if you must; I tend to use it due to its performance and flexibility (both a HMAC mode and a key derivation mode);
- SHA-3, which is the latest offer from the SHA family, standardized by NIST; (or SHA-2 if you must;) I would suggest using this one when either Blake3 is not available, or when you need the "stamp of approval" on your cryptographic primitives;
The obvious non-starters (due to weaknesses) are MD5
(and MD4
), SHA1
,
and other "trust me this is the best crypto ever" snake-oil algorithms.
Password hashes
The key property is irreversibility and brute-force resistance.
I think there have been countless articles written about how to choose, input, update, exchange, and store passwords, therefore I won't repeat them here. Use your preferred search engine to do your own research, or even better, use an already existing framework for identity management.
But if you must, the obvious choices in this category are (in any order):
- Argon2, which is the "winner" of the Password Hashing Competition;
- scrypt, which is the previous "best pick" of the security community, that is still a good choice; (here is a good article describing its parameters choice;)
- bcrypt, which should be used only when implementations for the previous two are not available;
An honorable mention is PBKDF2, although by today's standards it is considered unsuitable.
See also the TOTP and HOTP algorithms mentioned in the miscellaneous section, as slightly related topics.
Token hashes
The key property is irreversibility and collision resistance.
These are technically called key derivation hash functions, whose main purpose is taking a known (and secret) key to generate any number of unguessable tokens. They should never be used with passwords as the known secret key.
When to use such an algorithm? The best answer is: you'll know when you'll need one.
However, the best explanation (with in-depth details, 99% of which I've already forgotten) I've seen is the Understanding HKDF article; thus if you do need one, you must read this article.
In essence, any keyed cryptographic hash function that accepts a key (or any HMAC function), could be used for such a purpose, however, there are a few standard algorithms:
- Blake3, which offers a built-in key derivation mode;
- HKDF, which is based on any HMAC algorithm;
- PBKDF2, in case the one above is not implemented in your language of choice;
See also the TOTP and HOTP algorithms mentioned in the miscellaneous section, as slightly related topics.
Encryption hashes
The key property is reversibility if provided with the seed, but irreversibility otherwise, and collision resistance.
Previously in the signature hashes section I've said that they don't encrypt the input, thus an attacker given a hash could guess the original input. Well, sometimes you do want to encrypt the input, and then be able to decrypt it back.
Which would be a use-case for such a procedure? Hiding small internal identifiers when exporting them as part of an API. Imagine you have a database of patients, where each patient has an incremental primary key; imagine you want to implement an API that lists patients given some query; however you don't want to export the actual sequential numbers, but instead want to export unique random looking identifiers, so that an attacker (or user) can't guess anything from that exported identifier, nor does it allow one to try to exfiltrate data if one fails to properly secure API endpoints.
Unfortunately, I haven't seen many articles about this topic... The current best practice is to throw UUIDv4's at the problem by storing them in mapping tables, or worse, by using them as primary keys (see here and here for some of the counterarguments)...
The only article I've found that touches on this subject was Plan B for UUIDs -- double AES-128. (I've quickly read through it, and not yet internalized its intricacies, thus I can't speak too much about it.)
Some long time ago I've even written my own open-source implementation of such an algorithm, available at github.com/cipriancraciun/java-token-obfuscator, however, I haven't touched it in a while (as in almost 8 years)... (Perhaps one day I'll write another implementation and article about this topic...)
Also, someone on HN pointed me to How to Encipher Messages on a Small Domain -- Deterministic Encryption and the Thorp Shuffle by Ben Morris, Phillip Rogaway, and Till Stegers, although I've not read it myself for the moment.
Miscellaneous hashes
This category is for those types of hashes that are too obscure or too focused and thus don't fit in any of the previous categories:
- rolling hashes, which can be used for identifying duplicate chunks of data;
- perfect hash functions, which given known and finite set of possible inputs, are able to eliminate all collisions;
- message authentication codes (MAC), like for example Poly1305, which, somewhat related to checksums, should provide tamper resistance for data transmission and storage;
- Soundex, which can be used for identifying closely related words;
- perceptual hashing, which can be used for identifying similar media files (images, audio, videos, etc.);
- time (TOTP) and hash (HOTP) based one-time passwords, which are not actually passwords but instead the most common form of currently deployed two-factor-authentication;
- (nothing else comes to mind at the moment;)
Miscellaneous links
Here are a couple of interesting links to other topics related to hashing, especially to particularly useful ways to use hashes: