Friday, October 21, 2016

Searching w/ Hashing Techniques ( Python )

I learned these techniques from school and some programming classes ( Python specifically ). Recently I had a chance to recap them and put them all together in this post. This is more of a personal note, but even so, any comments or advice are welcome and appreciated. 


1. Linear Searching w/o Hashing


This is the most general kind of searching, essentially a brute-force scan against a data structure such as an array or linked list.

e.g. this is an unsorted array:


index: 0   1   2   3   4   5   6   7
value: 15  8   4   9   31  54  60  21


If we want to find 21, we need to scan from array[0] through array[7] before we find it. Thus the time complexity is O(n); here n = 8 (indices 0 ~ 7).
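As a quick sketch, the linear scan above can be written in Python like this (function name is mine):

```python
def linear_search(data, target):
    """Brute-force scan from the front; O(n) comparisons in the worst case."""
    for i, value in enumerate(data):
        if value == target:
            return i        # index where the target was found
    return -1               # target is not in the list

data = [15, 8, 4, 9, 31, 54, 60, 21]
print(linear_search(data, 21))   # -> 7, after scanning all eight slots
```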


2. Grouping w/ Hashing


This technique leverages a grouping concept: categorize the data so that a search only targets a specific group, instead of scanning the whole data set.

e.g. we categorize the unsorted array above into the two groups below, odd and even.

odd
index: 0   1   2   3
value: 15  9   31  21

even
index: 4   5   6   7
value: 8   4   54  60

Here is quick pseudocode to represent the concept:

if x mod 2 == 1:
    move x into the odd array
else:
    move x into the even array

This can cut the search time roughly in half. Dividing the data set into two groups is itself a kind of hashing technique, since f(x) = x mod 2 is exactly the hashing concept: a function mapping each value to its group.
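A minimal sketch of this grouping in Python (the function and variable names are mine):

```python
def build_groups(data):
    """Partition by parity; f(x) = x mod 2 acts as a two-bucket hash."""
    groups = {0: [], 1: []}
    for x in data:
        groups[x % 2].append(x)
    return groups

def grouped_search(groups, target):
    """Scan only the one group the target could possibly belong to."""
    return target in groups[target % 2]

groups = build_groups([15, 8, 4, 9, 31, 54, 60, 21])
print(groups[1])                   # odd group:  [15, 9, 31, 21]
print(groups[0])                   # even group: [8, 4, 54, 60]
print(grouped_search(groups, 21))  # True, after scanning 4 items instead of 8
```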

3. Hashing w/ Chaining 


In this case we leverage a hash table with a hashing technique, but the hash function has collisions, which have to be handled by extending the slots with another data structure such as a linked list.

e.g. here is the example above, but we put the data into the hash table by f(x) = x mod 4:

index:       0   1   2   3   4   5   6   7
value:       15  8   4   9   31  54  60  21
hash (x%4):  3   0   0   1   3   2   0   1

In the table above you can see collisions happen when we use only a 4-slot hash table, as shown in the third row. Thus the chains can be composed from the table above, written as value(original index):


slot 0: 8(1) -> 4(2) -> 60(6)
slot 1: 9(3) -> 21(7)
slot 2: 54(5)
slot 3: 15(0) -> 31(4)


Lastly, you can see chaining happens wherever there is a collision, in slots 0, 1 and 3. The time complexity is O(1) for the hash plus the length of the chain; the longest chain above (slot 0) has 3 nodes, so finding 60 costs 1 hash plus a scan of 3 nodes.
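Here is a small sketch of the chained table above, using Python lists in place of linked lists (class name is mine):

```python
class ChainedHashTable:
    """4-slot hash table; collisions are chained in per-slot lists
    (Python lists standing in for linked lists)."""
    def __init__(self, size=4):
        self.size = size
        self.slots = [[] for _ in range(size)]

    def insert(self, x):
        self.slots[x % self.size].append(x)

    def search(self, x):
        # O(1) to find the slot, then O(chain length) to scan it.
        return x in self.slots[x % self.size]

table = ChainedHashTable()
for x in [15, 8, 4, 9, 31, 54, 60, 21]:
    table.insert(x)
print(table.slots)       # [[8, 4, 60], [9, 21], [54], [15, 31]]
print(table.search(60))  # True: one hash plus a scan of the 3-item chain
```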


4. Open Addressing


Open addressing is a technique to fully utilize the hash table instead of using chaining: when a collision happens, move the pointer to the next available hash table slot, and so on.

e.g. here is the example when the hash table has 8 slots, with hash function f(x) = x mod 8:

index:       0   1   2   3   4   5   6   7
value:       15  8   4   9   31  54  60  21
hash (x%8):  7   0   4   1   7   6   4   5

In the table above you can see collisions happen even with an 8-slot hash table (third row). Thus open addressing composes the table below, inserting the values in array order with forward probing.

slot 0: 8(1)
slot 1: 9(3)
slot 2: 31(4)  <- 31 hashes to 7; slots 7, 0, 1 are taken, so it probes to 2
slot 3: 21(7)  <- 21 hashes to 5; slots 5, 6, 7, 0, 1, 2 are taken, so it probes to 3
slot 4: 4(2)
slot 5: 60(6)  <- 60 hashes to 4; slot 4 is taken, so it probes to 5
slot 6: 54(5)
slot 7: 15(0)


In sum, 31, 60 and 21 all hit collisions, and the technique moves each one into the next available slot: 31 from slot 7 to slot 2, 60 from slot 4 to slot 5, and 21 from slot 5 to slot 3. The time complexity is much like the chaining case, O(1) plus the probe scan, and the worst case can be O(1) + O(n). But open addressing uses space better than chaining, since we don't need an extra data structure to record collisions.
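A sketch of open addressing with forward linear probing (class name is mine; deletion is omitted to keep it short):

```python
class OpenAddressTable:
    """8-slot table with forward linear probing: on a collision,
    step to the next free slot, wrapping around at the end."""
    EMPTY = None

    def __init__(self, size=8):
        self.size = size
        self.slots = [self.EMPTY] * size

    def insert(self, x):
        i = x % self.size
        while self.slots[i] is not self.EMPTY:
            i = (i + 1) % self.size        # probe the next slot
        self.slots[i] = x

    def search(self, x):
        i = x % self.size
        probes = 0
        while self.slots[i] is not self.EMPTY and probes < self.size:
            if self.slots[i] == x:
                return i                   # slot where the value landed
            i = (i + 1) % self.size
            probes += 1
        return -1

table = OpenAddressTable()
for x in [15, 8, 4, 9, 31, 54, 60, 21]:
    table.insert(x)
print(table.slots)       # [8, 9, 31, 21, 4, 60, 54, 15]
print(table.search(31))  # 2: hashed to 7, then probed 0, 1, 2
```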

5. Two ( Multiple ) Hashing Comparison - Bi-Hashing


This technique leverages two or more hash functions to assign the data into the hash table and avoid collisions. A famous example built on this concept is the Bloom filter, which I explained in my previous post (http://chianingwang.blogspot.com/2016/08/how-bloom-filter-works-poc-via-python.html).

e.g. we use fa(x) = x mod 16 and fb(x) = (x mod 16) - 1, and compose the data set from above as below.

index:   0   1   2   3   4   5   6   7
value:   15  8   4   9   31  54  60  21
fa(x):   15  8   4   9   15  6   12  5
fb(x):   14  7   3   8   14  5   11  4

The third row is the fa(x) hashing result and the fourth row is the fb(x) hashing result. When we assign data into the hash table, we assign each value via fa(x) first; if a collision happens, we then use fb(x) as the assignment rule.

Thus we have the hash table below.


slot  0:
slot  1:
slot  2:
slot  3:
slot  4: 4(2)
slot  5: 21(7)
slot  6: 54(5)
slot  7:
slot  8: 8(1)
slot  9: 9(3)
slot 10:
slot 11:
slot 12: 60(6)
slot 13:
slot 14: 31(4)  <- fb(31) = (31 mod 16) - 1 = 15 - 1 = 14
slot 15: 15(0)

As you can see above, the only collision happens with 31: fa(31) = 15 is already taken by 15, so when we notice the collision we calculate fb(31) = 14 to locate the value 31.

The time complexity for this search is O(1) + O(1), since there are two hashes and each one is O(1). This technique is faster than the previous ones, but needs more CPU resource for computing the hashes.
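A sketch of the two-hash insert and search (class name is mine; note the toy fb(x) returns -1 for values hashing to slot 0, which Python would treat as the last slot, but that case never arises with this data set):

```python
def fa(x):
    return x % 16

def fb(x):
    return (x % 16) - 1       # fallback hash, used only on collision

class BiHashTable:
    """16-slot table: try fa(x) first; if that slot is taken, use fb(x)."""
    def __init__(self, size=16):
        self.slots = [None] * size

    def insert(self, x):
        i = fa(x)
        if self.slots[i] is not None:
            i = fb(x)         # second hash resolves the collision
        self.slots[i] = x

    def search(self, x):
        if self.slots[fa(x)] == x:
            return fa(x)      # first O(1) probe hit
        if self.slots[fb(x)] == x:
            return fb(x)      # second O(1) probe hit
        return -1

table = BiHashTable()
for x in [15, 8, 4, 9, 31, 54, 60, 21]:
    table.insert(x)
print(table.search(31))   # fa(31)=15 already holds 15, so fb(31)=14 finds 31
```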

6. Hash Relative Technique for Bi-Hash


As the section title says, we still use a bi- (or multiple-) hash, but the hash functions are designed to be co-related: fa(x) is the base for fb(x), which reduces some of the computing cost while keeping the advantage of bi-hashing. One example of a hash-relative technique is the rolling checksum used in rsync (https://en.wikipedia.org/wiki/Rsync).

e.g. we use the example from above, but this time fa(x) = x mod 16 and fb(x) = ((fa(x) + 1) << 1) mod 16 ( PS: << 1 is a left shift by 1 ). Then the hash table can be quickly composed as below.

Let's first calculate the value fb(31) in binary:


2^5  2^4  2^3  2^2  2^1  2^0   Value
 0    0    1    1    1    1    fa(31) = 15
 0    1    0    0    0    0    15 + 1 = 16
 1    0    0    0    0    0    16 << 1 = 32


The result in the table above is 32, and 32 mod 16 = 0. Thus fb(31) = 0, so 31 fits into slot 0 of the hash table.

As you can see, 31 still collides at first, since fa(31) = 15 = fa(15), the same result as before. So we calculate fb(31) = ((fa(31) + 1) << 1) mod 16 = 0, then locate 31 in slot 0.

In this case the time complexity is still O(1) + O(1), but the computing time may be less, since we use fa(x) as the base for composing fb(x): fb(x) is fa(x) plus 1, left-shifted by 1, modulo 16.


slot  0: 31(4)  <- fb(31) = (((31 mod 16) + 1) << 1) mod 16 = 0
slot  1:
slot  2:
slot  3:
slot  4: 4(2)
slot  5: 21(7)
slot  6: 54(5)
slot  7:
slot  8: 8(1)
slot  9: 9(3)
slot 10:
slot 11:
slot 12: 60(6)
slot 13:
slot 14:
slot 15: 15(0)

The time complexity is about the same as the bi-hash case, but the computing resource / time is less, since we re-use the first hash in the hash-relative technique.
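The derived hash can be checked directly in Python; note that `<<` binds more loosely than `+`, so `fa(x) + 1 << 1` already means `(fa(x) + 1) << 1`:

```python
def fa(x):
    return x % 16

def fb(x):
    # Derived from fa(x), so the second hash reuses the first one's work:
    # add 1, shift left by one bit (i.e. double), wrap into the 16 slots.
    return ((fa(x) + 1) << 1) % 16

print(fa(31))  # 15 -- collides with fa(15)
print(fb(31))  # (15 + 1) << 1 = 32, and 32 % 16 = 0, so 31 lands in slot 0
```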


7. Memory Addressing + Double Size Hash Table


In this section, to speed up searching beyond a traditional hash table, we design a new data type to cooperate with it. For example, consider something like Python's object data structure, as below.


Ref Count
Type (Data)
Object Size
Pointer or Key ( e.g char *)


Then we continue leveraging the example above, but now we give the hash table 20 slots, so the hash function becomes f(x) = x mod 20 and the hashes are as below.


index:        0   1   2   3   4   5   6   7
value:        15  8   4   9   31  54  60  21
hash (x%20):  15  8   4   9   11  14  0   1


If we combine the object data structure with the example above, then we have the data structure below.



index  Ref Count  Type (Data)  Object Size  Pointer or Key ( e.g char * )  Physical Value*
0      1          int          4            0x104347c10                    15
1      1          int          4            0x104347c14                    8
2      1          int          4            0x104347c18                    4
3      1          int          4            0x104347c22                    9
4      1          int          4            0x104347c26                    31
5      1          int          4            0x104347c30                    54
6      1          int          4            0x104347c34                    60
7      1          int          4            0x104347c38                    21

* Physical Value is not part of the data structure.


Thus our hash table can be as below, with two columns: the hash # and the pointer ( key ).


No.  Pointer ( Key )  hash
0    0x104347c34      0(6)
1    0x104347c38      1(7)
2
3
4    0x104347c18      4(2)
5
6
7
8    0x104347c14      8(1)
9    0x104347c22      9(3)
10
11   0x104347c26      11(4)
12
13
14   0x104347c30      14(5)
15   0x104347c10      15(0)
16
17
18
19




The advantage of the above is that a search can proceed in the following pattern.

1. If the pointers ( memory addresses ) are the same, then we know the data is the same, so we have found it. - in the object data structure
2. Else if the stored hashes are different, then we know the data cannot be the same, so we skip that entry. - in the hash table hash #
3. The worst case is an equal hash but a different address; then we compare byte by byte, like the linear search in the 1st section.
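A sketch of that comparison shortcut; `load_value` is a hypothetical helper that dereferences a pointer to the stored data, the expensive step the first two checks try to avoid (note a matching hash alone does not guarantee equality, so equal hashes still fall through to a full comparison):

```python
def entries_match(entry_ptr, entry_hash, probe_ptr, probe_hash, load_value):
    """Comparison shortcut used when probing a hash table slot."""
    if entry_ptr == probe_ptr:
        return True                  # identical address: guaranteed the same
    if entry_hash != probe_hash:
        return False                 # different hashes: guaranteed different
    # Same hash but different address: only now pay for the full comparison.
    return load_value(entry_ptr) == load_value(probe_ptr)

memory = {'0x10': 42, '0x14': 42, '0x18': 7}   # toy pointer -> value store
print(entries_match('0x10', 5, '0x10', 5, memory.get))  # True: same pointer
print(entries_match('0x10', 5, '0x18', 9, memory.get))  # False: hashes differ
print(entries_match('0x10', 5, '0x14', 5, memory.get))  # True: values compared
```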

8. Memory Addressing + Double Size Hash Table + add "Value" Info.


As you can see in the section above, the searching process is: 1. check the memory address, 2. check the hash #, then 3. compare byte by byte with linear searching. There is a quick way to avoid step 3, which is to add the value to the hash table; then we can skip the byte-by-byte comparison, though this increases the space a little bit to keep the extra info in the hash table.


No.  Pointer ( Key )  hash   value
0    0x104347c34      0(6)   60
1    0x104347c38      1(7)   21
2
3
4    0x104347c18      4(2)   4
5
6
7
8    0x104347c14      8(1)   8
9    0x104347c22      9(3)   9
10
11   0x104347c26      11(4)  31
12
13
14   0x104347c30      14(5)  54
15   0x104347c10      15(0)  15
16
17
18
19




In sum, even when we don't have the pointer / hash # info at hand, we can still rely on the hash table alone for a quick search.
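A sketch of the widened table: each slot keeps (pointer, hash, value), so a probe can compare the inline value without following the pointer (table size and addresses are the ones from the example, and this data set happens to have no collisions mod 20):

```python
SIZE = 20                                     # the 20-slot table from the example
table = [None] * SIZE

data = [15, 8, 4, 9, 31, 54, 60, 21]
pointers = ['0x104347c10', '0x104347c14', '0x104347c18', '0x104347c22',
            '0x104347c26', '0x104347c30', '0x104347c34', '0x104347c38']

for ptr, value in zip(pointers, data):
    h = value % SIZE
    table[h] = (ptr, h, value)                # slot stores (pointer, hash, value)

def search(value):
    entry = table[value % SIZE]
    if entry is not None and entry[2] == value:   # compare the inline value
        return entry[0]                           # return the pointer ( key )
    return None

print(search(31))   # '0x104347c26', found at slot 11 via the inline value
```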

9. Sparse Index Pointer w/ Compact & Ordering Dictionary 


As you can see, it is great to do the search via the object data structure and hash table ( dictionary in Python ) above, but the hash table has lots of empty slots, which doesn't utilize space efficiently.
Thus we would like to make the hash table ( dictionary ) more condensed. For this reason we abstract the pointers into a sparse index table, which is more aligned with the real-world case.

e.g. in sparse indexing, if a memory address slot is not used, then we mark it as -1:


sparse indexing: [-1, 0x104347c10, 0x104347c14, 0x104347c18, -1, 0x104347c22, 0x104347c26, 0x104347c30, -1, -1, 0x104347c34, 0x104347c38, -1]

With it we can compact the table, ordered by insertion, as below.


No.  Pointer ( Key )  hash  value
0    0x104347c10      15    15
1    0x104347c14      8     8
2    0x104347c18      4     4
3    0x104347c22      9     9
4    0x104347c26      11    31
5    0x104347c30      14    54
6    0x104347c34      0     60
7    0x104347c38      1     21

In sum for this section, the compact & ordered hash table has no empty slots; only the sparse index does. Instead of having whole empty rows in the hash table, we would rather have a single -1 in the sparse index table.
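A rough sketch of the sparse-index idea, in the spirit of CPython 3.6's compact dict: the sparse array stores small offsets (or -1), and the entries live in a dense, insertion-ordered list. The class name is mine, and collisions are ignored here for brevity:

```python
FREE = -1

class CompactTable:
    """Sparse index of small offsets plus dense, insertion-ordered entries."""
    def __init__(self, size=20):
        self.index = [FREE] * size         # sparse part: one small int per slot
        self.entries = []                  # dense part: (pointer, hash, value)

    def insert(self, ptr, value):
        h = value % len(self.index)
        self.index[h] = len(self.entries)  # slot points at the next dense row
        self.entries.append((ptr, h, value))

    def search(self, value):
        i = self.index[value % len(self.index)]
        if i != FREE and self.entries[i][2] == value:
            return self.entries[i]
        return None

table = CompactTable()
for ptr, v in zip(['0x104347c10', '0x104347c14', '0x104347c18'], [15, 8, 4]):
    table.insert(ptr, v)
print(table.entries)   # dense and in insertion order
print(table.index)     # mostly -1; only slots 15, 8 and 4 hold offsets
```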

The following two techniques are more abstract for specific use case.


10. Key Sharing Hash Table ( Dictionary ) by Mark Shannon 


This design allows dictionaries which are used as attribute dictionaries ( the dunder __dict__ attribute of an object ) to share keys with the attribute dictionaries of other instances of the same class.

e.g. name, age and salary are attributes in the dictionary and can be abstracted into an attribute index as a shared key for cross-referencing within the same class. The data structure can be as in the example below.



Johnny's instance:
attribute:  Name     Age      Salary
Value:      Johnny   40       100000
Hash:       0x1234   0x4567   0x7890

Shared key table ( one per class ):
attribute:        Name          Age           Salary
Pointer ( Key ):  0x104347c10   0x104347c14   0x104347c18

Winnie's instance:
attribute:  Name     Age      Salary
Value:      Winnie   30       120000
Hash:       0x0987   0x7654   0x4321

As in the example above, a search for Johnny's or Winnie's Salary can go through the shared key index first to find the attribute slot, then read the Salary value from the matching instance ( Johnny or Winnie ) as the result.
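A toy sketch of key sharing: the key list is stored once for the class, while each instance keeps only a dense value array indexed the same way (names and layout are mine, simplified from the real key-sharing dict):

```python
shared_keys = ['name', 'age', 'salary']      # stored once for the whole class

johnny_values = ['Johnny', 40, 100000]       # per-instance: values only
winnie_values = ['Winnie', 30, 120000]

def get_attr(values, attr):
    """Find the attribute's offset in the shared key table, then read the
    instance's dense value array at that same offset."""
    return values[shared_keys.index(attr)]

print(get_attr(johnny_values, 'salary'))   # 100000
print(get_attr(winnie_values, 'salary'))   # 120000
```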


11. Extra Version info in Object Data Structure

The purpose here is to carry extra version info in the object data structure, which allows Python to skip checking when the version # is the same, since an unchanged version # means no change has happened.


Ref Count
Type (Data)
Object Size
Pointer or Key ( e.g char *)
Version #