Wednesday, July 22, 2009

Know thy Hash

I work a lot with various kinds of hashes in my profession, and I am finding more and more that peoples understanding of hashes is haphazard at best. Hashes are defined to meet the following goals:

  • It is easy to compute a hash for a given length.
  • It is infeasible to find a message given a hash.
  • It is infeasible to change the message without changing the hash.
  • It is infeasible to find to messages with the same hash.

Oddly enough the middle two are more for cryptographic purposes, then they are for "general" hashes. Using a hash to store data into a Hash Table for example, there is no need to use a complex hash.

So this is the first observation, know what your hash was designed for. Hashes are typically designed for one of two purposes. The first is Cryptography examples include MD5, SHA-1, etc. These are generally the slowest to compute, have a large number of output bits. These hashes should be reserved for what they are designed for. Their strength and bit complexity does make they useful at uniquely identifying files.

The second is to discriminate some data from others quickly, generally in a hash table. The first lesson is that uniqueness is never guaranteed. The second is use a hash that is both fast and is very low from a collision point of view. You also want a hash whose result will be in the same number of bits as the native register. 32 or 64 bits right now. Here are some examples of fast hashes.

Make sure to deal with collisions, they are simply far more likely then you think. Also remember that this analysis assumes that the hash is strongly collision free, if that is not the case the results are worse.

Finally, there is a trend to use things that where never designed to be hashes as hashes. For example the CRC-32 algorithm. Using this algorithm as a hash is generally a bad idea because there are hashes that are much faster to compute with similar properties for being collision free.