Comparing string and md5 hash dissimilarity using hashlib and jellyfish

Demonstrating that similarity of pairs of strings is uncorrelated to Jaro and Damerau-Levenshtein similarity of their md5 hashes.

Background

The md5 hash generates 128-bit hashes (which can be represented as a 32-character hex string) that are known to be highly 'divergent' for similar inputs, making them a good candidate for checksums since even a tiny change will result in a completely different hash:

Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
view raw 20230113B.ipynb hosted with ❤ by GitHub

I had never actually verified this for myself so put together some Python code to investigate / prove whether md5 hashes are in any way correlated to the inputs in terms of similarity. To compare similarity I used the Jaro and Damerau-Levenshtein distance metrics.

Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
view raw 20230113C.ipynb hosted with ❤ by GitHub

Generating data

I generated pairs of random strings of length 64, with 2000 pairs in each run.

  • The first dataset contains characters from the ascii_lowercase set
  • The second contains just the hexadecimal characters [0-9a-f].

For each pair the md5 hashes are created, and the Jaro Similarity and Damerau-Levenshtein Distance will be calculated for the pair of strings and the pair of hashes.

import string
import random
import hashlib
import jellyfish
random.seed(23)
text_length = 64
num_samples = 2000
results1 = []
characters = string.ascii_lowercase
for i in range(num_samples):
a = ''.join(random.choice(characters) for i in range(text_length))
b = ''.join(random.choice(characters) for i in range(text_length))
c = hashlib.md5(a.encode()).hexdigest()
d = hashlib.md5(b.encode()).hexdigest()
results1.append((jellyfish.jaro_similarity(a, b), jellyfish.jaro_similarity(c, d), jellyfish.damerau_levenshtein_distance(a, b), jellyfish.damerau_levenshtein_distance(c, d)))
results2 = []
characters = '0123456789abcdef'
for i in range(num_samples):
a = ''.join(random.choice(characters) for i in range(text_length))
b = ''.join(random.choice(characters) for i in range(text_length))
c = hashlib.md5(a.encode()).hexdigest()
d = hashlib.md5(b.encode()).hexdigest()
results2.append((jellyfish.jaro_similarity(a, b), jellyfish.jaro_similarity(c, d), jellyfish.damerau_levenshtein_distance(a, b), jellyfish.damerau_levenshtein_distance(c, d)))
view raw 20230113D.py hosted with ❤ by GitHub

Plotting the results as a scatter plot and calculating the correlation

With the two sets of data generated, I calculated the correlation and scatter plotted the similarity metrics for (pairs of strings, pairs of hashes).

The scatter plots seem to show no relationship between the string distance and the hash distance, and the Pearson correlation coefficient bears this out, since it is close to zero.

(To 'prove' this more robustly, of course more different samples of data and different types of strings would need to be used, so this should be taken more as indicative.)

Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
view raw 20230113E.ipynb hosted with ❤ by GitHub

Similarly, the second set of data (hexadecimal characters only) shows the same (lack of) relationship.

Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
view raw 20230113F.ipynb hosted with ❤ by GitHub

Jupyter notebook

The complete Jupyter notebook for the above can be found here (Github Gist).