Background
I recently studied solutions for finding data (file/document) similarity, trying to improve the performance of my research project: archiving/syncing data on SWIFT (OpenStack Object Storage). I think grouping similar data sets (chunks) can help SWIFT deduplication, with the potential pros and cons below.
Pros:
- Grouping similar data reduces hash searching: instead of searching every hash across all of the data, we only need to search the hashes of one specific group.
- Grouping similar data can increase the deduplication ratio: if we assume similar files share common chunks, then a chunk's key (hash) has a higher probability of appearing again within that group.
Cons:
- Since a file is checked only against the specific group chosen by similarity, the same chunk may also exist in another group that MinHash did not point to, so space consumption can be higher than without grouping. (PS: grouping data by similarity sacrifices some space to improve performance, which makes this statement a bit controversial.)
- People say hash searching costs extremely little in memory (RAM/disk), so searching all hashes in non-grouped deduplication doesn't hurt system performance much. But I have different assumptions here, for the following reason.
- SWIFT is not a database: it supports no querying or processing of data on the servers. All you can do is list the objects within a given container; there is no way to search based on object metadata, so you need to keep your own external search indexes. Thus, I think deduplication over SWIFT is slightly different from a pure H/W transaction at the server end, since hash searching relies heavily on the object list in the manifest object.
Here, I would like to explain a little about why MinHash approximates Jaccard Similarity, and why we can collect similar data sets into a "group" to improve deduplication sync over the internet.
PS: I wrote Python code to show that two metadata indexes I extract from two files follow the MinHash rule and can be grouped together to reduce extra hash searches over the internet. (Check my GitHub: https://github.com/chianingwang/PyMinHash)
MinHash
General Concept
MinHash uses the magic of hashing to quickly estimate Jaccard Similarity. First of all, let me explain the concept behind the equation: P[hmin(A) = hmin(B)] = J(A,B)
h – a hash function mapping the members of A (original text) and B (query text) to distinct integers.
Calculate hmin(S), the minimum hash value over the members of S, for both sets A and B.
- Suppose it turns out that for our chosen h, hmin(A) = hmin(B); call this value HM. Then it is true that HM = hmin(A ∪ B).
- Since HM is by definition a member of both set A and set B, it must therefore also be a member of the intersection of these sets, A ∩ B.
- P[hmin(A) = hmin(B)] = P[hmin(A ∪ B) present in A ∩ B]: hmin(A ∪ B) is essentially a randomly chosen element from A ∪ B, due to the random nature of h, and it lies in A ∩ B iff hmin(A) = hmin(B).
- P[hmin(A ∪ B) present in A ∩ B] = |A ∩ B| / |A ∪ B| = J(A,B): the prior probability of the element's presence is simply the ratio of the sizes of these sets.
- Therefore P[hmin(A) = hmin(B)] = J(A,B), which is of course the Jaccard Similarity.
- P[·] denotes probability; the random hash function h plays the role of a random permutation of the elements.
- So if the min hash of set A equals the min hash of set B with high probability, then by Jaccard Similarity, set A is similar to set B.
- This value is 0 when the two sets are disjoint, 1 when they are equal, and strictly between 0 and 1 otherwise; the closer to 1, the more similar the two sets. (wiki)
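The identity above can be checked empirically: draw many random permutations of the universe to play the role of h, and count how often the minimum-ranked element of A equals that of B. Below is a minimal Python sketch; the example sets, trial count, and function names are my own illustrative assumptions, not from the original proof.

```python
import random

def collision_rate(a, b, trials=2000, seed=7):
    """Estimate P[hmin(A) = hmin(B)] using random permutations as the hash h."""
    universe = sorted(a | b)
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        perm = universe[:]
        rng.shuffle(perm)
        # rank maps each member to a distinct integer, as the proof requires of h
        rank = {v: i for i, v in enumerate(perm)}
        if min(rank[x] for x in a) == min(rank[x] for x in b):
            hits += 1
    return hits / trials

a, b = {0, 3}, {0, 2, 3}
print(collision_rate(a, b))  # should land close to J(A,B) = |A ∩ B| / |A ∪ B| = 2/3
```

With 2000 trials the estimate settles near 2/3, matching |{0,3}| / |{0,2,3}| exactly as the derivation predicts.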
Many Hashes MinHash
The only restriction on the hash function h is that it maps the members of A and B to distinct integers. To be fair in the justification, we should choose different hash functions and show the idea works either way. Here we choose k different hash functions and calculate the value of hmin given by each of these functions for both A and B.
P[hmin(A) = hmin(B)] = J(A,B), estimated as y/k.
- To estimate P[hmin(A) = hmin(B)], and therefore J(A,B), we compare the results. Let y be the number of hash functions for which hmin(A) = hmin(B). We take k as the number of "trials" and y as the number of "successes", so y/k is the estimate for the probability and hence for J(A,B).
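As a sketch of the many-hash estimator: the random linear hash family, prime modulus, and integer-element assumption below are my own illustrative choices, not part of the original post.

```python
import random

def many_hash_minhash(a, b, k=400, seed=1):
    """Estimate J(A,B) as y/k over k random linear hash functions.
    Assumes the set members are integers (e.g. shingle IDs)."""
    p = 2**61 - 1  # a large prime modulus
    rng = random.Random(seed)
    y = 0
    for _ in range(k):
        m, c = rng.randrange(1, p), rng.randrange(p)
        h = lambda x: (m * x + c) % p
        # a "success" trial: hmin(A) == hmin(B) for this hash function
        if min(h(x) for x in a) == min(h(x) for x in b):
            y += 1
    return y / k

a = set(range(100))
b = set(range(50, 150))
print(many_hash_minhash(a, b))  # true J(A,B) = 50/150, about 0.333
```

With k=400 trials the estimate should fall close to the true ratio 1/3, within the O(1/sqrt(k)) error discussed in the summary.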
Single Hash MinHash
Instead, compute the result of a single hash function for each member of each set, and compare multiple results from this function via a random subset.
- Let h be a hash function.
- Define h(k)(S) as the subset of the k members of S with the smallest values of h.
- Define the signature of S as h(k)(S), and estimate the similarity of two sets by comparing their signatures.
- Recall the Jaccard Similarity is "the probability that a random element from the union of two sets is also in their intersection".
- Let X = h(k)(h(k)(A) ∪ h(k)(B)). In English, X is the set found by: finding the k members of A that give the smallest values of h, and then the same for B (we have defined these as the signatures of A and B); taking the union of these two signatures; and finding the k members of this union that give the smallest values of h.
- X must therefore also be equal to h(k)(A ∪ B), as the k smallest values from (A ∪ B) must be drawn from the k smallest values from A and from B, and no other values.
- Since h is a random function, X is a random subset (or Simple Random Sample) of k elements of (A ∪ B).
- Let Y = X ∩ h(k)(A) ∩ h(k)(B): the set of members of X that also belong to A ∩ B.
- We have taken a random sample of k members of (A ∪ B) (k "trials"), and |Y| of these members also belong to A ∩ B (|Y| "successes").
- |Y|/k =~ P[random element from A ∪ B is also in A ∩ B]
- = |A ∩ B| / |A ∪ B|
- = J(A,B)
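The single-hash (bottom-k) procedure above can be sketched in a few lines of Python. The multiplicative hash function, the example sets, and k=100 are illustrative assumptions of mine; any hash mapping members to distinct integers works.

```python
import heapq

def bottom_k(s, h, k):
    """h(k)(S): the k smallest hash values over the members of S."""
    return set(heapq.nsmallest(k, (h(x) for x in s)))

def single_hash_minhash(a, b, h, k):
    ha, hb = bottom_k(a, h, k), bottom_k(b, h, k)
    x = set(heapq.nsmallest(k, ha | hb))  # X = h(k)(h(k)(A) ∪ h(k)(B)) = h(k)(A ∪ B)
    y = x & ha & hb                       # Y = X ∩ h(k)(A) ∩ h(k)(B)
    return len(y) / k

# Knuth-style multiplicative hash; injective on small integers since the
# multiplier is odd (an assumed choice, not prescribed by the method)
h = lambda v: (v * 2654435761) % (2**32)
a = set(range(1000))
b = set(range(500, 1500))
print(single_hash_minhash(a, b, h, k=100))  # true J(A,B) = 500/1500, about 0.333
```

Note that only one hash function is evaluated per element, which is the whole attraction of this variant over the many-hash scheme.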
Proof
Alright, I bet you are quite confused by the proof above; so was I :). Let me use an example to show how it works and how close the similarities come to each other. I follow the equations above to show that MinHash approximates Jaccard Similarity (J(A,B) =~ |Y|/k).
Here, I have 4 data sets: Set 1 = S1, Set 2 = S2, Set 3 = S3 and Set 4 = S4. The union of all shingles is assumed to be {0, 1, 2, 3, 4}. There are two hash functions, H1 and H2, whose values for input X = shingle are shown below. e.g.: for Shingle 0, H1 = (0+1) mod 5 = 1 and H2 = (3×0+1) mod 5 = 1.
| Shingles\Sets | S1 | S2 | S3 | S4 | H1=(X+1) mod 5 | H2=(3X+1) mod 5 |
|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 1 | 1 | 1 |
| 1 | 0 | 0 | 1 | 0 | 2 | 4 |
| 2 | 0 | 1 | 0 | 1 | 3 | 2 |
| 3 | 1 | 0 | 1 | 1 | 4 | 0 |
| 4 | 0 | 0 | 1 | 0 | 0 | 3 |
Then, for each shingle, we compare hash values, discard the larger value, and keep the minimum hash value. The process for each shingle is shown in the tables below. e.g.: Shingle 0 only occurs in S1 and S4, so we put its two hash values into the S1 and S4 columns.
Shingle 0
| Hash\Sets | S1 | S2 | S3 | S4 |
|---|---|---|---|---|
| H1 | 1 | | | 1 |
| H2 | 1 | | | 1 |
Shingle 1
| Hash\Sets | S1 | S2 | S3 | S4 |
|---|---|---|---|---|
| H1 | 1 | | 2 | 1 |
| H2 | 1 | | 4 | 1 |
Shingle 2
| Hash\Sets | S1 | S2 | S3 | S4 |
|---|---|---|---|---|
| H1 | 1 | 3 | 2 | 1<3 |
| H2 | 1 | 2 | 4 | 1<2 |
Shingle 3
| Hash\Sets | S1 | S2 | S3 | S4 |
|---|---|---|---|---|
| H1 | 1<4 | 3 | 2 | 1 |
| H2 | 0<1 | 2 | 0<4 | 0<1 |
e.g.: We know shingle 3's hash values are 4 and 0. Compare them with S1's existing H1 and H2 values, which are 1 and 1: since 4 (new) > 1 (existing), H1 stays 1; but since 1 (existing) > 0 (new), we take 0, the lower hash value, as the new H2 value.
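This running-minimum rule is the whole table-building step; a tiny sketch (the function name is mine):

```python
def update_signature(sig, new_hashes):
    """Keep the running minimum per hash function when a shingle occurs in a set."""
    return [min(cur, new) for cur, new in zip(sig, new_hashes)]

# Shingle 3 hashes to (H1, H2) = (4, 0); S1's current signature is (1, 1)
print(update_signature([1, 1], [4, 0]))  # -> [1, 0]
```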
Shingle 4
| Hash\Sets | S1 | S2 | S3 | S4 |
|---|---|---|---|---|
| H1 | 1 | 3 | 0<2 | 1 |
| H2 | 0 | 2 | 0 | 0 |
Final Result of Minimum Hash of each set
| Hash\Sets | S1 | S2 | S3 | S4 |
|---|---|---|---|---|
| H1 | 1 | 3 | 0 | 1 |
| H2 | 0 | 2 | 0 | 0 |
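The final signatures can be reproduced directly from the set memberships and the two hash functions; a minimal sketch (variable names are mine):

```python
# membership of each shingle, taken from the first table
sets = {
    "S1": {0, 3},
    "S2": {2},
    "S3": {1, 3, 4},
    "S4": {0, 2, 3},
}
h1 = lambda x: (x + 1) % 5
h2 = lambda x: (3 * x + 1) % 5

# the MinHash signature of a set is the minimum of each hash over its members
signatures = {name: (min(map(h1, s)), min(map(h2, s))) for name, s in sets.items()}
print(signatures)  # {'S1': (1, 0), 'S2': (3, 2), 'S3': (0, 0), 'S4': (1, 0)}
```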
Given the MinHash results above, let's check each pair of sets to see how the MinHash ratio compares with the Jaccard Similarity ratio. The fixed integer k we pick is 4, which is the number of results from two sets' two hashes. Y is the set of hash results found in both data sets. The results are in the table below.
Of course, this is not a perfect case: we used only two hash functions, and 5 shingles may be far too small a sample. But as the table below shows, 2 of the 6 pair ratios match exactly and the rest are "close". Thus we can say MinHash can serve as an identity for data-set similarity without checking every permutation of each set's shingles: J(A,B) =~ |Y|/k.
| Ratio\Sim | S(S1, S2) | S(S1, S3) | S(S1, S4) | S(S2, S3) | S(S2, S4) | S(S3, S4) |
|---|---|---|---|---|---|---|
| MinHash | 0 = (0/4) | 1/2 = (2/4) | 1 = (4/4) | 0 = (0/4) | 0 = (0/4) | 1/2 = (2/4) |
| Jaccard | 0 = (0/3) | 1/4 = (1/4) | 2/3 = (2/3) | 0 = (0/4) | 1/3 = (1/3) | 1/5 = (1/5) |
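Both rows of the comparison can be recomputed from the raw sets and their final signatures, with the MinHash ratio taken as the fraction of matching hash positions (a sketch with my own variable names):

```python
from itertools import combinations

sets = {"S1": {0, 3}, "S2": {2}, "S3": {1, 3, 4}, "S4": {0, 2, 3}}
sigs = {"S1": (1, 0), "S2": (3, 2), "S3": (0, 0), "S4": (1, 0)}

for x, y in combinations(sets, 2):
    # MinHash estimate: fraction of hash functions whose minimum agrees
    minhash = sum(a == b for a, b in zip(sigs[x], sigs[y])) / len(sigs[x])
    # exact Jaccard from the raw shingle sets
    jaccard = len(sets[x] & sets[y]) / len(sets[x] | sets[y])
    print(f"S({x}, {y}): MinHash={minhash:.2f}, Jaccard={jaccard:.2f}")
```

Running this reproduces the table: the S1/S4 pair scores highest on both measures, so those two sets would land in the same similarity group.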
However, if you have no interest in diving into all this logic, you can simply leverage the idea to start experimenting. The summary below is all you need to find data groupings via similarity.
Summary
Estimate the Jaccard Similarity between two sets A and B by Single Hash MinHash:
- Choose a hash function h and a fixed integer k --> for the random subset
- Find the signature h(k)(S) for each set, where h(k)(S) is the subset of the k members of S with the smallest values of h
- Find X = h(k) (h(k)(A) ∪ h(k)(B))
- Find Y = X ∩ h(k)(A) ∩ h(k)(B) --> ( h(k) (h(k)(A) ∪ h(k)(B)) ) ∩ h(k)(A) ∩ h(k)(B)
- J(A,B) =~ |Y|/k
It can be shown using Chernoff bounds that both the Single and Many Hash algorithms have expected error O(1/sqrt(k)). You would therefore require k = 400 to estimate J(A,B) with expected error <= 0.05. (wiki)
PS:
I wrote Python code to prove the concept works on my dedup chunk metadata, allowing me to group similar files into the same group and avoid full chunk (hash) checking: https://github.com/chianingwang/PyMinHash
Reference:
- http://en.wikipedia.org/wiki/MinHash
- http://infolab.stanford.edu/~ullman/mmds/ch3.pdf
- http://robertheaton.com/2014/05/02/jaccard-similarity-and-minhash-for-winners/
- http://matthewcasperson.blogspot.com/2013/11/minhash-for-dummies.html
Share Movie Quote:
One day your life is going to flash before your eyes; make sure it's worth watching! - (The Bucket List), 2007