Collision attack for StreamHash5 // msm's home

StreamHash is an alternative family of cryptographic hash functions, designed for maximum speed, while being reasonably secure at the same time.

As far as I know, it’s not used in any production software, but is not completely fringe too - it was submitted to the SHA-3 competition and prompted a few academic publications.

StreamHash5 is the newest member of the family. It was created as an improved version of StreamHash4, after I implemented a practical collision attack for it ¹.

In this blog post, I will describe my impractical collision attack on StreamHash5 (and a practical partial attack). I stress that to manage expectations, but it’s enough to consider StreamHash5 broken cryptographically:

Hash size is 512 bits. Hence, the generic birthday attack complexity is 2**256.
My attack complexity is roughly 2**128 * 2**38 = 2**166.

While the project’s readme clearly states that There are currently no known collision attacks easier than the generic birthday attack ².

I’m also confident the attack complexity can be improved. But this research spent the last 2 years in my drawer and I think it’s time to finally publish it.

Partial collision example (line breaks added for clarity):

$ cat data.{0,1}.hex
626974636820646f20317420616c6f6ec0402c500c6570577b3c09fb4e966d2215ea0bb4a024bab440e907c9cc7e3bd0
696620796f752077616e6e61206372794b6dfb042aed03d14e8dbd4936536c3500000000000000000000000000000000

$ streamhash5sum data.0.bin
65bf759586d6b23e3ba7b36b9150920b  # collision
6f1daa35f658c4181fc25960d2b0ac61  # collision
71df8e636e38beecbf1d36ae7f9c26e4  # collision
e4121fda76468f48ae542cb70a31875e

$ streamhash5sum data.1.bin
65bf759586d6b23e3ba7b36b9150920b  # collision
6f1daa35f658c4181fc25960d2b0ac61  # collision
71df8e636e38beecbf1d36ae7f9c26e4  # collision
82ef6e3e544e910fc99b69c77ca58ceb

StreamHash5

Like one wise man with anger issues once said, “talk is cheap, show me the code”. StreamHash5 processes data in 16-byte blocks. The internal state has 4*16 bytes. After every round, the state is xor-ed with the number of bits processed so far. Simplified version (works when len(data) is divisible by 16) below:

def sh5(data):
    offset = 0
    state = list(consts)

    while offset + 16 <= len(data):
        offset += 16
        magic = struct.pack("<I", offset * 8).rjust(8, "\x00")
        state[3] = xor(state[3], magic * 2)
        block(data[offset-16:offset], state)

    return state[0] + state[1] + state[2] + state[3]

Of course, the real magic is in the block compression function. StreamHash can be described as four “almost-AES-CBC” encryptions going on in parallel.

The state is divided into four 16 byte chunks. First, every chunk is xored with a new block of hashed data. After that, every block is encrypted with two rounds of AES (every block uses a different but constant key). In Python:

def xorstate(data, state):
    state[0] = xor(data, state[0])
    state[1] = xor(data, state[1])
    state[2] = xor(data, state[2])
    state[3] = xor(data, state[3])


def halfblock(state):
    state[0] = aes_round(state[0], consts[0])
    state[1] = aes_round(state[1], consts[1])
    state[2] = aes_round(state[2], consts[2])
    state[3] = aes_round(state[3], consts[3])


def block(data, state):
    xorstate(data, state)
    halfblock(state)
    halfblock(state)

In StreamHash5, the halfblock function is called twice per block - this is the improvement over StreamHash4 introduced in the new version.

This can also be described with the following ASCII-art:

     data                                     bits
       |                                        |
       v                                        |
S0 -> xor -> aes_round(C0) -> aes_round(C0) --- | --> S0
S1 -> xor -> aes_round(C1) -> aes_round(C1) --- | --> S1
S2 -> xor -> aes_round(C2) -> aes_round(C2) --- v --> S2
S3 -> xor -> aes_round(C3) -> aes_round(C3) -> xor -> S3

Attack description

The attack itself is just a smart brute-force with a lot of precomputing.

The problem we’re trying to solve can be stated as follows:

AES(AES(A00 ^ A1D, C0), C0) = AES(AES(B00 ^ B1D, C0), C0)
AES(AES(A01 ^ A1D, C1), C1) = AES(AES(B01 ^ B1D, C1), C1)
AES(AES(A02 ^ A1D, C2), C2) = AES(AES(B02 ^ B1D, C2), C2)
AES(AES(A03 ^ A1D, C3), C3) = AES(AES(B03 ^ B1D, C3), C3)

Excuse my notation. C0, C1, C2 and C3. Are algorithm constants. AES is a single round of AES encryption. AnD and BnD describe nth data blocks (in message A and B respectively). And finally, Anm/Bnm describe mth state after processing nth block (remember, there are four states).

In other words, the relationship between the state variables is:

# state 0 in message A
A10 = AES(A00 ^ A1D, C0)
A20 = AES(A10 ^ A2D, C0)
A30 = AES(A20 ^ A3D, C0)
...

# state 1 in message B
B11 = AES(B00 ^ B1D, C1)
B21 = AES(B10 ^ B2D, C1)
B31 = AES(B20 ^ B3D, C1)
...

# etc

Finally, an example with ASCII art again:

       A1D                                     bits
        |                                        |
        v                                        |
A00 -> xor -> aes_round(C0) -> aes_round(C0) --- | --> A10
A01 -> xor -> aes_round(C1) -> aes_round(C1) --- | --> A11
A02 -> xor -> aes_round(C2) -> aes_round(C2) --- v --> A12
A03 -> xor -> aes_round(C3) -> aes_round(C3) -> xor -> A13

So, back to the problem again. I won’t try to solve all four equations. We’ll create a collision for three equations in 2**38 and observe that we can repeat this 2**128 times to get a collision for the fourth block ³. So we start with:

AES(AES(A00 ^ A1D, C0), C0) = AES(AES(B00 ^ B1D, C0), C0)
AES(AES(A01 ^ A1D, C1), C1) = AES(AES(B01 ^ B1D, C1), C1)
AES(AES(A02 ^ A1D, C2), C2) = AES(AES(B02 ^ B1D, C2), C2)

We expand AES round into suboperations - SubBytes, AddRoundKey, MixColumns and ShiftRows, and simplify:

SB(AR(MX(SR(SB(A00 ^ A1D))), C0)) ^ A2D = SB(AR(MX(SR(SB(B00 ^ B1D))), C0)) ^ B2D
SB(AR(MX(SR(SB(A01 ^ A1D))), C1)) ^ A2D = SB(AR(MX(SR(SB(B01 ^ B1D))), C1)) ^ B2D
SB(AR(MX(SR(SB(A02 ^ A1D))), C2)) ^ A2D = SB(AR(MX(SR(SB(B02 ^ B1D))), C2)) ^ B2D

Observe that SR(SB(A00 ^ A1D)) don’t mix bytes in any way. In fact, it’s equivalent to SR(SB(A00)) ^ SR(SB(A1D)). Use this trick to simplify equatoin again (Let A00sr = SR(SB(A00)) etc):

SB(AR(MX(A00sr ^ A1Dsr), C0)) ^ A2D = SB(AR(MX(B00sr ^ B1Dsr), C0)) ^ B2D
SB(AR(MX(A01sr ^ A1Dsr), C1)) ^ A2D = SB(AR(MX(B01sr ^ B1Dsr), C1)) ^ B2D
SB(AR(MX(A02sr ^ A1Dsr), C2)) ^ A2D = SB(AR(MX(B02sr ^ B1Dsr), C2)) ^ B2D

Move variables around, and let d = A2D ^ B2D

d ^ SB(AR(MX(A00sr ^ A1Dsr), C0))= SB(AR(MX(B00sr ^ B1Dsr), C0))
d ^ SB(AR(MX(A01sr ^ A1Dsr), C1))= SB(AR(MX(B01sr ^ B1Dsr), C1))
d ^ SB(AR(MX(A02sr ^ A1Dsr), C2))= SB(AR(MX(B02sr ^ B1Dsr), C2))

Now, introduce helper functions for four constant: let SSB_C0(x) = SB(AR(MX(x), C0))

d ^ SSB_C0(A00sr ^ A1Dsr) = SSB_C0(B00sr ^ B1Dsr)
d ^ SSB_C1(A01sr ^ A1Dsr) = SSB_C1(B01sr ^ B1Dsr)
d ^ SSB_C2(A02sr ^ A1Dsr) = SSB_C2(B02sr ^ B1Dsr)

Why? Because SB(AR(MX(x), C0)) is in a sweet spot, where it’s already hard to analyse mathematically, but there is still only a single MixColumns step - so we can operate on message dwords (32bit chunks) separately. This means we can precompute inverse functions like:

SSB_C0(x) = y
<=>
x ∈ INVSSB_C0(y)  # many inputs may give the same result

This will come in handy.

Now we can just brute-force A1Dsr quickly. After plugging in a specific value for A1D (remember, the first data block) we get:

d0 = SB_C0(B00sr ^ B1Dsr)
d1 = SB_C1(B01sr ^ B1Dsr)
d2 = SB_C2(B02sr ^ B1Dsr)

This means that:

d0 ^ d1 = SB_C0(B00sr ^ B1Dsr) ^ SB_C1(B01sr ^ B1Dsr)

Observe that B00sr and B01sr don’t depend on A1D (or even B1D) . This means that we can define a function:

SB_C0xC1(x) = SB_C0(B00sr ^ x) ^ SB_C1(B01sr ^ x)

And precompute its inverse:

SB_C0xC1(x) = y
<=>
x ∈ INVSB_C0xC1(y)

Going back to our equation, we can find solutions immediately. Pseudo-code:

# solve d0 ^ d1 = SB_C0(B00sr ^ B1Dsr) ^ SB_C1(B01sr ^ B1Dsr)
for B1Dsr in INVSB_C0xC1(d0 ^ d1):
    ...  # do more crypto magic

This takes care of the first two equations. But remember, we need to solve three of them at once:

d0 = SB_C0(B00sr ^ B1Dsr)
d1 = SB_C1(B01sr ^ B1Dsr)
d2 = SB_C2(B02sr ^ B1Dsr)

Sadly, this is as far as clever precomputing gets us. But the good thing is that we can work on 32bit input fragments independently. By picking a random A1D we have 1 / 2*32 chance of satisfying the third equation by pure chance. And when we do it’s over - the attack succeeded.

So the idea is to split a message block into four 32bit parts, find a collision for each part separately, and then combine them into a 128bit block.

The only bad news is that sometimes we won’t find any A1D that works, even after exhaustive checking of all 2**32 uint32 values. In this case, we have to pick other initial states and redo the attack.

That’s really it. I’ve glossed over a few details, but the gist of the attack and the most interesting part is just that - a big precomputed inverse for xor of two AES supersboxes.

Conclusions

Even though I didn’t present a full practical collision, I’m confident StreamHash5 shouldn’t be used in its current form. I see a few ways to improve my attack, and certainly focused work by experts and three-letter agencies would improve it even more.

The usual takeaway is: cryptography is hard - even when you’re a professional cryptographer, rolling your own crypto is risky.

You can read my slides or watch the video. Both are PL-only. ↩︎
https://github.com/mtrojnar/StreamHash5 ↩︎
Can we use the birthday paradox to bring it down to 2**64 somehow? I Don’t know. Maybe? But we don’t control the first three blocks, so it’s not obvious how. ↩︎