Background
I recently studied solutions for finding data (file/document) similarity, trying to improve the performance of my research project: archiving/syncing data on SWIFT (OpenStack Object Storage). I think grouping similar data sets (chunks) can help SWIFT deduplication, with the potential pros and cons below.
Pros:
- Grouping similar data reduces hash searching: instead of searching every hash across all of the data, we only need to search the hashes of one specific group.
- Grouping similar data can increase the deduplication ratio: if we assume similar files share common chunks, then a chunk's key (hash) has a higher probability of appearing again within that group.
Cons:
- Since a file is checked only against the specific group chosen by similarity, the same chunk may also exist in another group that MinHash did not point to, so space consumption can be higher than without grouping. (PS: grouping data by similarity sacrifices some space to improve performance, which makes this statement a bit controversial.)
- People say hash searching costs extremely little in memory (RAM/disk), so searching all hashes in non-grouped deduplication doesn't hurt system performance much. But I have different assumptions here, for the following reason.
- SWIFT is not a database: it supports no querying or processing of data on the servers. All you can do is list the objects within a given container; there is no way to search based on object metadata, so you need to keep your own external search indexes. Thus, I think deduplication over SWIFT is slightly different from a pure H/W transaction at the server end, since hash searching relies heavily on the object list in the manifest object.
Here, I would like to explain a little about why MinHash approximates Jaccard Similarity, and why we can collect similar data sets into a "group" to improve deduplication sync over the internet.
PS: I wrote Python code to show that two metadata indexes I extract from two files follow the MinHash rule and can be grouped together to reduce extra hash searches over the internet. (Check my GitHub: https://github.com/chianingwang/PyMinHash)
MinHash
General Concept
MinHash uses the magic of hashing to quickly estimate Jaccard Similarity. First of all, let me explain the concept behind the equation: P[hmin(A) = hmin(B)] = J(A,B)
h – a hash function mapping the members of A (original text) and B (query text) to distinct integers.
Calculate hmin(S), the minimum hash value over the members of S, for both sets A and B.
- Suppose it turns out that for our chosen h, hmin(A) = hmin(B); call this value HM. Then it is true that HM = hmin(A ∪ B).
- Since HM is by definition a member of both set A and set B, it must therefore also be a member of the intersection of these sets, A ∩ B.
- P[hmin(A) = hmin(B)] = P[hmin(A ∪ B) present in A ∩ B]: hmin(A ∪ B) is essentially a randomly chosen element from A ∪ B, due to the random nature of h, and it lies in A ∩ B iff hmin(A) = hmin(B).
- P[hmin(A ∪ B) present in A ∩ B] = |A ∩ B| / |A ∪ B| = J(A,B): the prior probability of the element's presence is simply the ratio of the sizes of these sets.
- Therefore P[hmin(A) = hmin(B)] = J(A,B), which is of course the Jaccard Similarity.
- P[·] denotes probability; the random hash function h plays the role of a random permutation of the elements.
- So if the min hash of set A equals the min hash of set B with high probability, then by Jaccard Similarity, set A is similar to set B.
- This value is 0 when the two sets are disjoint, 1 when they are equal, and strictly between 0 and 1 otherwise; the closer to 1, the more similar the two sets. (wiki)
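The identity above can be checked empirically: draw many random permutations of the universe to play the role of h, and count how often the minimum-ranked element of A equals that of B. Below is a minimal Python sketch; the example sets, trial count, and function names are my own illustrative assumptions, not from the original proof.

```python
import random

def collision_rate(a, b, trials=2000, seed=7):
    """Estimate P[hmin(A) = hmin(B)] using random permutations as the hash h."""
    universe = sorted(a | b)
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        perm = universe[:]
        rng.shuffle(perm)
        # rank maps each member to a distinct integer, as the proof requires of h
        rank = {v: i for i, v in enumerate(perm)}
        if min(rank[x] for x in a) == min(rank[x] for x in b):
            hits += 1
    return hits / trials

a, b = {0, 3}, {0, 2, 3}
print(collision_rate(a, b))  # should land close to J(A,B) = |A ∩ B| / |A ∪ B| = 2/3
```

With 2000 trials the estimate settles near 2/3, matching |{0,3}| / |{0,2,3}| exactly as the derivation predicts.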
Many Hashes MinHash
The only restriction on the hash function h is that it maps the members of A and B to distinct integers. To be fair in the justification, we should choose different hash functions and show the idea works either way. Here we choose k different hash functions and calculate the value of hmin given by each of these functions for both A and B.
P[hmin(A) = hmin(B)] = J(A,B), estimated as y/k.
- To estimate P[hmin(A) = hmin(B)], and therefore J(A,B), we compare the results. Let y be the number of hash functions for which hmin(A) = hmin(B). We take k as the number of "trials" and y as the number of "successes", so y/k is the estimate for the probability and hence for J(A,B).
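As a sketch of the many-hash estimator: the random linear hash family, prime modulus, and integer-element assumption below are my own illustrative choices, not part of the original post.

```python
import random

def many_hash_minhash(a, b, k=400, seed=1):
    """Estimate J(A,B) as y/k over k random linear hash functions.
    Assumes the set members are integers (e.g. shingle IDs)."""
    p = 2**61 - 1  # a large prime modulus
    rng = random.Random(seed)
    y = 0
    for _ in range(k):
        m, c = rng.randrange(1, p), rng.randrange(p)
        h = lambda x: (m * x + c) % p
        # a "success" trial: hmin(A) == hmin(B) for this hash function
        if min(h(x) for x in a) == min(h(x) for x in b):
            y += 1
    return y / k

a = set(range(100))
b = set(range(50, 150))
print(many_hash_minhash(a, b))  # true J(A,B) = 50/150, about 0.333
```

With k=400 trials the estimate should fall close to the true ratio 1/3, within the O(1/sqrt(k)) error discussed in the summary.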
Single Hash MinHash
Instead, compute the result of a single hash function for each member of each set, and compare multiple results from this function via a random subset.
- Let h be a hash function.
- Define h(k)(S) as the subset of the k members of S with the smallest values of h.
- Define the signature of S as h(k)(S), and estimate the similarity of two sets by comparing their signatures.
- Recall the Jaccard Similarity is "the probability that a random element from the union of two sets is also in their intersection".
- Let X = h(k)(h(k)(A) ∪ h(k)(B)). In English, X is the set found by: finding the k members of A that give the smallest values of h, and then the same for B (we have defined these as the signatures of A and B); taking the union of these two signatures; and finding the k members of this union that give the smallest values of h.
- X must therefore also be equal to h(k)(A ∪ B), as the k smallest values from (A ∪ B) must be drawn from the k smallest values from A and from B, and no other values.
- Since h is a random function, X is a random subset (or Simple Random Sample) of k elements of (A ∪ B).
- Let Y = X ∩ h(k)(A) ∩ h(k)(B): the set of members of X that also belong to A ∩ B.
- We have taken a random sample of k members of (A ∪ B) (k "trials"), and |Y| of these members also belong to A ∩ B (|Y| "successes").
- |Y|/k =~ P[random element from A ∪ B is also in A ∩ B]
- = |A ∩ B| / |A ∪ B|
- = J(A,B)
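The single-hash (bottom-k) procedure above can be sketched in a few lines of Python. The multiplicative hash function, the example sets, and k=100 are illustrative assumptions of mine; any hash mapping members to distinct integers works.

```python
import heapq

def bottom_k(s, h, k):
    """h(k)(S): the k smallest hash values over the members of S."""
    return set(heapq.nsmallest(k, (h(x) for x in s)))

def single_hash_minhash(a, b, h, k):
    ha, hb = bottom_k(a, h, k), bottom_k(b, h, k)
    x = set(heapq.nsmallest(k, ha | hb))  # X = h(k)(h(k)(A) ∪ h(k)(B)) = h(k)(A ∪ B)
    y = x & ha & hb                       # Y = X ∩ h(k)(A) ∩ h(k)(B)
    return len(y) / k

# Knuth-style multiplicative hash; injective on small integers since the
# multiplier is odd (an assumed choice, not prescribed by the method)
h = lambda v: (v * 2654435761) % (2**32)
a = set(range(1000))
b = set(range(500, 1500))
print(single_hash_minhash(a, b, h, k=100))  # true J(A,B) = 500/1500, about 0.333
```

Note that only one hash function is evaluated per element, which is the whole attraction of this variant over the many-hash scheme.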
Proof
Alright, I bet you are quite confused by the proof above; so was I :). Let me use an example to show how it works and how close the similarities come to each other. I follow the equations above to show that MinHash approximates Jaccard Similarity (J(A,B) =~ |Y|/k).
Here, I have 4 data sets: Set 1 = S1, Set 2 = S2, Set 3 = S3 and Set 4 = S4. The union of all shingles is assumed to be {0, 1, 2, 3, 4}. There are two hash functions, H1 and H2, whose values for input X = shingle are shown below. e.g.: for Shingle 0, H1 = (0+1) mod 5 = 1 and H2 = (3×0+1) mod 5 = 1.
| Shingles\Sets | S1 | S2 | S3 | S4 | H1=(X+1) mod 5 | H2=(3X+1) mod 5 |
|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 1 | 1 | 1 |
| 1 | 0 | 0 | 1 | 0 | 2 | 4 |
| 2 | 0 | 1 | 0 | 1 | 3 | 2 |
| 3 | 1 | 0 | 1 | 1 | 4 | 0 |
| 4 | 0 | 0 | 1 | 0 | 0 | 3 |
Then, for each shingle, we compare hash values, discard the larger value, and keep the minimum hash value. The process for each shingle is shown in the tables below. e.g.: Shingle 0 only occurs in S1 and S4, so we put its two hash values into the S1 and S4 columns.
Shingle 0
| Hash\Sets | S1 | S2 | S3 | S4 |
|---|---|---|---|---|
| H1 | 1 | | | 1 |
| H2 | 1 | | | 1 |
Shingle 1
| Hash\Sets | S1 | S2 | S3 | S4 |
|---|---|---|---|---|
| H1 | 1 | | 2 | 1 |
| H2 | 1 | | 4 | 1 |
Shingle 2
| Hash\Sets | S1 | S2 | S3 | S4 |
|---|---|---|---|---|
| H1 | 1 | 3 | 2 | 1<3 |
| H2 | 1 | 2 | 4 | 1<2 |
Shingle 3
| Hash\Sets | S1 | S2 | S3 | S4 |
|---|---|---|---|---|
| H1 | 1<4 | 3 | 2 | 1 |
| H2 | 0<1 | 2 | 0<4 | 0<1 |
e.g.: We know shingle 3's hash values are 4 and 0. Compare them with S1's existing H1 and H2 values, which are 1 and 1: since 4 (new) > 1 (existing), H1 stays 1; but since 1 (existing) > 0 (new), we take 0, the lower hash value, as the new H2 value.
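This running-minimum rule is the whole table-building step; a tiny sketch (the function name is mine):

```python
def update_signature(sig, new_hashes):
    """Keep the running minimum per hash function when a shingle occurs in a set."""
    return [min(cur, new) for cur, new in zip(sig, new_hashes)]

# Shingle 3 hashes to (H1, H2) = (4, 0); S1's current signature is (1, 1)
print(update_signature([1, 1], [4, 0]))  # -> [1, 0]
```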
Shingle 4
| Hash\Sets | S1 | S2 | S3 | S4 |
|---|---|---|---|---|
| H1 | 1 | 3 | 0<2 | 1 |
| H2 | 0 | 2 | 0 | 0 |
Final Result of Minimum Hash of each set
| Hash\Sets | S1 | S2 | S3 | S4 |
|---|---|---|---|---|
| H1 | 1 | 3 | 0 | 1 |
| H2 | 0 | 2 | 0 | 0 |
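The final signatures can be reproduced directly from the set memberships and the two hash functions; a minimal sketch (variable names are mine):

```python
# membership of each shingle, taken from the first table
sets = {
    "S1": {0, 3},
    "S2": {2},
    "S3": {1, 3, 4},
    "S4": {0, 2, 3},
}
h1 = lambda x: (x + 1) % 5
h2 = lambda x: (3 * x + 1) % 5

# the MinHash signature of a set is the minimum of each hash over its members
signatures = {name: (min(map(h1, s)), min(map(h2, s))) for name, s in sets.items()}
print(signatures)  # {'S1': (1, 0), 'S2': (3, 2), 'S3': (0, 0), 'S4': (1, 0)}
```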
Given the MinHash results above, let's check each pair of sets to see how the MinHash ratio compares with the Jaccard Similarity ratio. The fixed integer k we pick is 4, which is the number of results from two sets' two hashes. Y is the set of hash results found in both data sets. The results are in the table below.
Of course, this is not a perfect case: we used only two hash functions, and 5 shingles may be far too small a sample. But as the table below shows, 2 of the 6 pair ratios match exactly and the rest are "close". Thus we can say MinHash can serve as an identity for data-set similarity without checking every permutation of each set's shingles: J(A,B) =~ |Y|/k.
| Ratio\Sim | S(S1, S2) | S(S1, S3) | S(S1, S4) | S(S2, S3) | S(S2, S4) | S(S3, S4) |
|---|---|---|---|---|---|---|
| MinHash | 0 = (0/4) | 1/2 = (2/4) | 1 = (4/4) | 0 = (0/4) | 0 = (0/4) | 1/2 = (2/4) |
| Jaccard | 0 = (0/3) | 1/4 = (1/4) | 2/3 = (2/3) | 0 = (0/4) | 1/3 = (1/3) | 1/5 = (1/5) |
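Both rows of the comparison can be recomputed from the raw sets and their final signatures, with the MinHash ratio taken as the fraction of matching hash positions (a sketch with my own variable names):

```python
from itertools import combinations

sets = {"S1": {0, 3}, "S2": {2}, "S3": {1, 3, 4}, "S4": {0, 2, 3}}
sigs = {"S1": (1, 0), "S2": (3, 2), "S3": (0, 0), "S4": (1, 0)}

for x, y in combinations(sets, 2):
    # MinHash estimate: fraction of hash functions whose minimum agrees
    minhash = sum(a == b for a, b in zip(sigs[x], sigs[y])) / len(sigs[x])
    # exact Jaccard from the raw shingle sets
    jaccard = len(sets[x] & sets[y]) / len(sets[x] | sets[y])
    print(f"S({x}, {y}): MinHash={minhash:.2f}, Jaccard={jaccard:.2f}")
```

Running this reproduces the table: the S1/S4 pair scores highest on both measures, so those two sets would land in the same similarity group.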
However, if you have no interest in diving into all this logic, you can simply leverage the idea to start experimenting. The summary below is all you need to find data groupings via similarity.
Summary
Estimate the Jaccard Similarity between two sets A and B by Single Hash MinHash:
- Choose a hash function h and a fixed integer k --> for the random subset
- Find the signature h(k)(S) for each set, where h(k)(S) is the subset of the k members of S with the smallest values of h
- Find X = h(k) (h(k)(A) ∪ h(k)(B))
- Find Y = X ∩ h(k)(A) ∩ h(k)(B) --> ( h(k) (h(k)(A) ∪ h(k)(B)) ) ∩ h(k)(A) ∩ h(k)(B)
- J(A,B) =~ |Y|/k
It can be shown using Chernoff bounds that both the Single and Many Hash algorithms have expected error O(1/sqrt(k)). You would therefore require k = 400 to estimate J(A,B) with expected error <= 0.05. (wiki)
PS:
I wrote Python code to prove the concept works on my dedup chunk metadata, allowing me to group similar files into the same group and avoid full chunk (hash) checking: https://github.com/chianingwang/PyMinHash
Reference:
- http://en.wikipedia.org/wiki/MinHash
- http://infolab.stanford.edu/~ullman/mmds/ch3.pdf
- http://robertheaton.com/2014/05/02/jaccard-similarity-and-minhash-for-winners/
- http://matthewcasperson.blogspot.com/2013/11/minhash-for-dummies.html
Share Movie Quote:
One day your life is going to flash before your eyes; make sure it's worth watching! - (The Bucket List), 2007