[remark] Pre-hashing large password files used with PBKDFs -- Volution Notes

This article was amended on 2024-03-15 to add the section on "Bounded Retrieval Model".

While working on my z-tokens exchange encrypt / exchange decrypt key derivation scheme -- about which I've written before in Experimenting with multi-factor encryption and Misusing random oracles for practical purposes -- I've made a small realization about all current password derivation schemes (including Argon2, Scrypt, and PBKDF2). (For brevity, I'll continue referring to these as just "password derivation".)

This realization is more of a "gotcha", which doesn't impact in any way the security of password derivation algorithms.

However, it might make a big difference if you make the wrong assumptions.

Thus, before explaining my observation, let me describe a possible use-case for such password derivation schemes with large byte sequences as passwords.

Say you are very (as in very-very-very) paranoid, and instead of using a strong password, say at least as strong as password123, you actually use a very large file (as in at least a few hundred MiB, even perhaps a few GiB).

What could this file be:

something you've dd if=/dev/urandom of=/etc/secret-key bs=1M count=1024;
something that is publicly archived (for example on the Internet Archive), that you can easily download when needed;

Where do you use this "password file"? To encrypt multiple (past and future) sensitive small files.

What is the advantage of using a very large file instead of using a very secure passphrase?

Imagine you have a machine whose sole purpose is to encrypt / decrypt these small sensitive files ("small" at least in relation to the "password file" size). Your primary concern now is to make sure that the exfiltration of this "password file" is either difficult (because nothing is impossible), or at least detectable.

You could, for example, have the machine air-gapped, and exchange encrypted / decrypted files only via removable storage (say USB-sticks or SD-cards) or perhaps even QR-codes (or other similar data-matrix). Thus, to make sure that the "password file" is not extractable, you commit to never use any removable storage that is larger than, say, 1% of the "password file". This way, even if your system is compromised, the attacker needs to execute the exfiltration in chunks, thus either delaying or detecting an exfiltration.

Or, perhaps a more practical approach, have the machine be network connected, but monitor and throttle the inbound / outbound network traffic. This way, at least you can observe if someone has tried to exfiltrate your "password file".

(This scheme is not bulletproof, but it's at least another roadblock for the attacker. Think of it as the "poor-man hardware-security-module". Also, it could be paired with other additional factors, as I've pondered in Experimenting with multi-factor encryption.)

So in essence, the previous use-case relies on the assumption that the attacker needs the entire contents of the "password file" to compromise (past or future) sensitive information, and thus the need to protect against the exfiltration of said "password file".

However, and here comes my observation, if your encryption scheme relies on any of the modern password derivation functions (including Argon2, Scrypt, and PBKDF2), then the attacker doesn't actually need the whole "password file".

The attacker needs only a "pre-hash" of the whole "password file". In the case of PBKDF2 and Scrypt (which relies internally on PBKDF2) it actually needs a plain SHA256 hash, and in the case of Argon2 it needs a partial internal state of the Blake2b hash function.

How did I reach this conclusion? See the next section.

Looking at the way Argon2 is defined in RFC 9106 -- Argon2 Memory-Hard Function for Password Hashing and Proof-of-Work Applications, the only place where the password P is processed is as part of H_0, where H^(64) is actually Blake2b.

H_0 = H^(64)(LE32(p) || LE32(T) || LE32(m) || LE32(t) ||
        LE32(v) || LE32(y) || LE32(length(P)) || P ||
        LE32(length(S)) || S ||  LE32(length(K)) || K ||
        LE32(length(X)) || X)

Thus, the attacker needs to compute the internal Blake2b state, up to the point where it feeds P, and exfiltrate that.

Looking at the way PBKDF2 is defined in RFC 2898 -- PKCS #5: Password-Based Cryptography Specification Version 2.0, where PRF is usually HMAC-SHA256, the password P is directly passed as the key to HMAC-SHA256:

PBKDF2 (P, S, c, dkLen) -> DK

   Options:        PRF        underlying pseudorandom function (hLen [...])

   Input:          P          password, an octet string
                   S          salt, an octet string
                   c          iteration count, a positive integer
                   dkLen      intended length in octets of the derived key [...]

   Output:         DK         derived key, a dkLen-octet string

DK = T_1 || T_2 ||  ...  || T_l<0..r-1>

where

T_1 = F (P, S, c, 1)
T_2 = F (P, S, c, 2)
...
T_l = F (P, S, c, l)

where

F (P, S, c, i) = U_1 xor U_2 xor ... xor U_c

where

U_1 = PRF (P, S || INT (i))
U_2 = PRF (P, U_1)
...
U_c = PRF (P, U_{c-1})

Similarly, in the Scrypt algorithm is defined in RFC 7914 -- The scrypt Password-Based Key Derivation Function as follows, the password P is again directly passed (twice) as the password to PBKDF2-HMAC-SHA256:

scrypt (P, S, N, r, p, dkLen) -> DK

   Input:
        P       Passphrase, an octet string.
        S       Salt, an octet string.
        N       CPU/Memory cost parameter, [...]
        r       Block size parameter.
        p       Parallelization parameter, [...]
        dkLen   Intended output length in octets of the derived key [...]

   Output:
        DK      Derived key, of length dkLen octets.

1. Initialize an array B consisting of p blocks of 128 * r octets each:
    B[0] || B[1] || ... || B[p - 1] =
      PBKDF2-HMAC-SHA256 (P, S, 1, p * 128 * r)

2. for i = 0 to p - 1 do
      B[i] = scryptROMix (r, B[i], N)

3. DK = PBKDF2-HMAC-SHA256 (P, B[0] || B[1] || ... || B[p - 1], 1, dkLen)

Finally, looking at the way HMAC is defined in RFC 2104 -- HMAC: Keyed-Hashing for Message Authentication (with minor editorial changes), and taking into account that our "password file" is larger than any value B that is normally used in HMAC constructs (the RFC states its 64 for all the standard constructs, thus including HMAC-SHA256), then the attacker thus needs the SHA256 of the large "password file":

HMAC (H, K, text) -> MAC

    that uses a cryptographic hash function, which we denote by H,
    and a secret key K

MAC = H((K XOR opad) || H((K XOR ipad) || text))

    ipad = the byte 0x36 repeated B times
    opad = the byte 0x5C repeated B times

[...] The authentication key K can be of any length up to B, the
block length of the hash function.  Applications that use keys longer
than B bytes will first hash the key using H and then use the
resultant L byte string as the actual key to HMAC. In any case the
minimal recommended length for K is L bytes (as the hash output
length). [...]

Thus, the protection against the large "password file" exfiltration is now useless...

In the case of Scrypt or PBKDF2 the attacker needs the SHA256 of the large file, and in the case of Argon2 the attacker needs to pre-compute the Blake2b state up to the point where the password is fed in.

Then the attacker just needs to exfiltrate a couple of bytes (32 for SHA256, and 128 for Blake2b); for which even a blinking LED would suffice to exfiltrate it in a couple of seconds...

This observation applies unless we change a bit the way the "password file" is processed before being passed through the password derivation function.

This is something I'm investigating at the moment; I do have a solution, but I want to think about it a bit more...

Hint: assuming that the "salt" used in the password derivation functions is unique (and unguessable by the attacker) for each encrypted sensitive file, we could use that to "mangle" the "password file" before passing it through the password derivation function.

This section was added on 2024-03-15.

After posting this article on Lobsters, fellow reader Joachim Hendrik Schipper kindly pointed out that what I am describing is in fact part of the larger problem of Bounded Retrieval Model (BRM) (first published in 2006).

Quoting from the mentioned paper on BRM:

This leads us to define a new formal model which we call the Bounded Retrieval Model since we assume a bound on the amount of a party's stored data that can be retrieved by the adversary. In practice, this bound would be due to both physical and logical considerations, as we now explain. With respect to internal attackers, this bound may result from the capabilities of a simple Intrusion Detection System (IDS), which can easily monitor any large and repeated access to the party's stored data. [...]

Going down the rabbit hole of citations, I've actually stumbled upon another interesting research paper, Intrusion-Resilience via the Bounded-Storage Model, that tackles exactly the problem I've tried to highlight above. Quoting from this paper:

More precisely, we will grant the adversary the power to break into the honest user's machine and take full control over it. We will assume that the adversary is able to perform arbitrary (efficient) computation on victim's data.

Although in all of these papers, the "large secret file" is not supposed to be used in its entirety each time, but depending on some factor (unguessable by the attacker), only parts of it are used (in order to reduce the overhead on normal users).