Uniform Distribution From Random Data – SHA1 Example
on October 20, 2017
SHA1 is dead (2005 - theoretical attacks; 2011 – deprecated by NIST; 2017), so don’t take this blog post as an endorsement that you should still use SHA1 in your projects. But a recent tinkering I did illustrates very well an important property of cryptographic hashes – uniformity.
Uniformity simply means that hash functions should map their inputs (messages, passwords, files, everything) evenly across their output (all possible digests given their output size).
SHA1, though long deprecated, gives a good example of this.
I was working on Troy Hunt’s compromised passwords list. The uncompressed text file containing all 320 million SHA1-hashed passwords weighs in at a hefty 13.5GB. I needed to use these hashes for a security project I had in mind, and I wanted to significantly speed up searching for a hash by splitting up this huge text file.
Since the SHA1 hashes appear as hexadecimal strings, the natural category for splitting it up was the first character of the hash (0-9, and A-F). That lets me split the 13.5GB text file into 16 parts, so each one would be far less than 10% of the original corpus (or, as Spock would say,
Exactly 6.25% of the original corpus)
In Linux, it’s pretty straightforward to accomplish this without any Python/PHP/C. Just pain old bash in a terminal does the trick:
grep '^0' pwned-passwords-1.0.txt > pwned_group_0.txt grep '^0' pwned-passwords-update-1.txt >> pwned_group_0.txt grep '^0' pwned-passwords-update-2.txt >> pwned_group_0.txt
That’s how I created a textfile that only has hashes starting with zero.
You’ll note that I actually work with 3 files; this is because Troy released 320M passwords in 3 batches, with updates 1 and 2 being considerably smaller (~5% of the first batch combined). For the purpose of this article, I’ll continue to disregard that small detail and pretend, as I did earlier, that there’s simply 1 giant file.
Anyway, so I continue doing the same thing, and create files for hashes that start with 1, 2, 3… until F:
grep '^1' pwned-passwords-1.0.txt > pwned_group_1.txt grep '^2' pwned-passwords-1.0.txt > pwned_group_2.txt . . . grep '^D' pwned-passwords-1.0.txt > pwned_group_D.txt grep '^E' pwned-passwords-1.0.txt > pwned_group_E.txt grep '^F' pwned-passwords-1.0.txt > pwned_group_F.txt
And this is what I got:
I also made a script to compress each one (for ease of transfer). You can see they’re pretty much still uniformly distributed in size:
That’s one interesting thing about cryptographic hash functions - whatever you throw at them, their output will follow a uniform distribution.