Comparing string and md5 hash dissimilarity using hashlib and jellyfish

Demonstrating that similarity of pairs of strings is uncorrelated to Jaro and Damerau-Levenshtein similarity of their md5 hashes.

Background

The md5 hash generates 128-bit hashes (which can be represented as a 32-character hex string) that are known to be highly 'divergent' for similar inputs, making them a good candidate for checksums since even a tiny change will result in a completely different hash:

I had never actually verified this for myself so put together some Python code to investigate / prove whether md5 hashes are in any way correlated to the inputs in terms of similarity. To compare similarity I used the Jaro and Damerau-Levenshtein distance metrics.

Generating data

I generated pairs of random strings of length 64, with 2000 pairs in each run.

  • The first dataset contains characters from the ascii_lowercase set
  • The second contains just the hexadecimal characters [0-9a-f].

For each pair the md5 hashes are created, and the Jaro Similarity and Damerau-Levenshtein Distance will be calculated for the pair of strings and the pair of hashes.

Plotting the results as a scatter plot and calculating the correlation

With the two sets of data generated, I calculated the correlation and scatter plotted the similarity metrics for (pairs of strings, pairs of hashes).

The scatter plots seem to show no relationship between the string distance and the hash distance, and the Pearson correlation coefficient bears this out, since it is close to zero.

(To 'prove' this more robustly, of course more different samples of data and different types of strings would need to be used, so this should be taken more as indicative.)

Similarly, the second set of data (hexadecimal characters only) shows the same (lack of) relationship.

Jupyter notebook

The complete Jupyter notebook for the above can be found here (Github Gist).