Tuesday, August 23, 2016

What is Bloom Filter ? How It Work ? Proof Of Concept in Python, C++ and Java

The first time I heard "bloom filter" is from the data domain file system paper by Ken Li (https://www.usenix.org/legacy/event/fast08/tech/full_papers/zhu/zhu_html/). The paper leverage bloom filter to reduce the need for on-disk index lookup during the deduplication process. It might be an old school technique for data screening or searching but it was surprised me when I heard algorithm logic since it's very straightforward and easy to understand. 


Today I would like to share how to quick implement Bloom Filter by Python command line as POC and eventually compose into python class/method in module to allow to reuse in the future. Other than python, I use C++ to implement the Bloom Filter and last, use Google Guava Java object directly to implement the Bloom Filter.

Outline

  • Bloom Filter Basic Concept Step by Step
  • How Bloom Filter Work in Detail include ( False-Positive Rate )
  • Python Command Line POC
  • Bloom Filter Python Module (Class/Methods)
  • Bloom Filter C++ Demo
  • Bloom Filter via Google Guava ( Java )

Bloom Filter Basic Concept

Before we start I give an quick example as below. Here is a list or tupple or set ( array ... whatever the container you can think of it. ) with len(set) = 15.

I provided two hash functions, let's assume hash1() and hash2() and I got two kinds of fruit, Apple, and Banana.


  • Hash1('Apple') = 4
  • Hash2('Apple') = 12
  • Hash1('Banana') = 14
  • Hash2('Banana') = 11


 A B A B
01234567891011121314


Now I got Orange, it has Hash1 and Hash2 value as below.


  • Hash1('Orange') = 13
  • Hash2('Orange') = 1


 O A B A O B
01234567891011121314

Does Orange in the set ? Answer is No, It's very obviously.

Apple and Banana these two fruits has 4 hash values in total 15 space's set. And there will has two hash functions as probes to increase possibility ( detect rate )

Thus the probability of a False Positive: (4/15)*(4/15) = 16/225 ~ 0.07 which means only 7% might be fail even the Orange has two hash in the set but it's not either "Apple" or "Banana".

In sum, # of hash functions we call k = 2, n is total # of keys ( filter ) in set and length of set we call m = 15. The false positive rate we call c = 7% in above example. 
  • k = 2
  • n = 4
  • m = 15
  • c = (4/15)^2*100 ~ 7% 


How Bloom Filter Work ?

Detail Explanation

A Bloom filter consists of two components: a set of k hash functions  ( we call k filters ) and a bit set of a given length. We choose the length of the bit set m and the number of hash functions depending on how many keys we want to add to the set and how high an error rate ( false-positive rate ) we are willing to put up with.

All of the hash functions in a Bloom filter are configured so that their range matches the length of the bit set. For example, if a set is 200 bits long, the hash functions return a value between 1 and 200. It's important to use high-quality hash functions in the filter to guarantee that output is equally distributed over all possible values -- "hot spots" in a hash function would increase our false-positive rate.

To enter a key into a Bloom filter, we run it through each one of the k hash functions and treat the result as an offset into the bit set, turning on whatever bit we find at that position. If the bit is already set, we leave it on. There's no mechanism for turning bits off in a Bloom filter.

As an example, let's take a look at a Bloom filter with tree hash functions ( 3 filters ) and a bit set of length 15. We'll use spaces and "x" to represent the bit set, to make it easier to follow along. As you might expect, an empty Bloom filter starts out with all the bits turned off, as seen in Figure 1.

01234567891011121314
                     Figure 1. An empty Bloom filter.

Let's now add the string apple into our filter. To do so, we hash apple through each of our three hash functions and collect the output:


Then we turn on the bits at the corresponding positions in the set -- in this case bits 3, 11, and 12, as shown in Figure 2.
hash1("apples") = 3
hash2("apples") = 12
hash3("apples") = 11


 x x x
01234567891011121314
                     Figure 2. A Bloom filter with three bits enabled.

To add another key, such as banana, we repeat the hashing procedure:
And again turn on the appropriate bits in the set, as shown with highlights in Figure 3.
hash1("banana") = 11
hash2("banana") = 1
hash3("banana") = 8


 x x x x x
01234567891011121314
                     Figure 3. The Bloom filter after adding a second key.

Notice that the bit at position 11 was already turned on -- we had set it when we added apple in the previous step. Bit 11 now does double duty, storing information for both apple and banana. As we add more keys, it may store information for some of them as well.

This overlap is what makes Bloom filters so compact -- any one bit may be encoding multiple keys simultaneously. 

This overlap also means that you can NEVER take a key out of a filter, because you have no guarantee that the bits you turn off don't carry information for other keys. If we tried to remove apple from the filter by reversing the procedure we used to add it in, we would inadvertently turn off one of the bits that encodes bananaThe only way to strip a key out of a Bloom filter is to rebuild the filter from scratch, leaving out the offending key.

Checking to see whether a key already exists in a filter is exactly analogous to adding a new key. We run the key through our set of hash functions, and then check to see whether the bits at those offsets are all turned on. If any of the bits is off, we know for certain the key is not in the filter. If all of the bits are on, we know the key is probably there.
I say "probably" because there's a certain chance our key might be a false positive. For example, let's see what happens when we test our filter for the string orange. We run orange through the set of hash functions:
And then examine the bits at those offsets, as shown in Figure 4.
  • hash1("orange") = 8
  • hash2("orange") = 3
  • hash3("orange") = 12

 x x x x x
01234567891011121314
                     Figure 4. A false-positive in the Bloom filter.

All of the bits at positions 3, 8, and 12 are on, so our filter will report that orange is a valid key.

Of course, orange is not a valid key -- the filter we built contains only apple and banana. The fact that the offsets for orange point to enabled bits is just "coincidence". We have found a false positive -- a key that seems to be in the filter, but isn't really there.

As you might expect, the false-positive rate depends on the bit set length and the number of keys stored in the filter


How to calculate your false positive rate ?

The roomier the bit set, the smaller the probability that all k bits we check will be on, unless the key actually exists in the filter. The relationship between the number of hash functions and the false-positive rate is more subtle. 

If you use too few hash functions, there won't be enough discrimination between keys; but if you use too many, the filter will be very dense, increasing the probability of collisions. You can calculate the false-positive rate for any filter using the formula as below.

Formula for False-Positive Rate: c = ( 1 - e^(-kn/m) )^k

PS: the formula came conduct from Probability_of_false_positives but here I can give a quick explanation but most of the content from wiki.




















and we know , the we can assume 1 - ( 1 - m ) ^ m ( kn / m ), and replace  1 - ( 1 - m ) ^ m by e ( 2.71828 ... ) as below.








Then we can have the formula ( 1 - e^(-kn/m) )^k








In previous simple example we calculate physically is 7, but use formula we got =(1-(2.7183)^(-2*4/15))*2 ~ 0.1709 which lower than physical calculation.

In the case this section which we have three fruits , we got = (1-(2.7183)(-3*5/15))*3 ~ 0.2526


What's the best m we should set up ?

When using Bloom filters, we very frequently have a desired false-positive rate in mind and we are also likely to have a rough idea of how many keys we want to add to the filter. We need some way of finding out how large a bit set is to make sure the false-positive rate never exceeds our limit. The following equation will give us set length from the error rate and number of keys:



Formula for find the # of bit set: m = -kn / ( ln( 1 - c ^ 1/k ) )

In previous POC example with 2 fruits, we calculate 0.07, but use formula we got 0.1709, thus =-2*4/(LN(1-0.1709^1/2)) ~ 136 which match our m 15 is way lower than 136.
In this section's 3 fruit case we got =-3*5/(LN(1-0.2526^1/3)) ~ 256 as well.

In sum, we found the k, n, m, c as below for k=3 example here.

  • k = 3
  • n = 5
  • c = 0.2526 ~ 25% failure positive rate, unless we can increase m to 256
  • m = 15  ~ the proper size of set , m should be around 256.

How about find k and n ?

Where c is the false positive rate, k is the number of hash functions, n is the number of keys in the filter, and m is the length of the filter in bits.


The way to find out m and c, if we have given n or m we can get the k via the formula as below.

k = m/n ln 2, which gives 2^(-k) ~ 0.6185^(m/n)
e.g in above case we have n=12, m=56, then k could be 0.6185^(15/3) ~ 0.23

Same if you know k and m, then you might be able to find out n follow equation as below.







Sum of the c ( Error Rate ), k ( Keys ), Required Size ( m )

To give you a sense of how error rate and number of keys affect the storage size of Bloom filters, Table 1 lists some sample set sizes for a variety of capacity/error rate combinations.


           Table 1
Error Rate
Keys
Required Size
Bytes/Key
1%
1K
1.87 K
1.9
0.1%
1K
2.80 K
2.9
0.01%
1K
3.74 K
3.7
0.01%
10K
37.4 K
3.7
0.01%
100K
374 K
3.7
0.01%
1M
3.74 M
3.7
0.001%
1M
4.68 M
4.7
0.0001%
1M
5.61 M
5.7



Python Command Line POC

In Python Command Line POC, we leverage the ipython note ( now call jupyter ) as a demo tool and leverage python random build-in function as hash function for generate k hash.

We can give an assumption here assume the power ball game we allow 1 ~ 56 numbers and select only 6 between 1 ~ 56 , thus we can know as below as this moment.
  • k = 6
  • n = ?
  • m = 56
  • c = ?
After 'apple' and 'banana' two samples then we got the ipython note output as below.



  • k = 6
  • n = 12
  • m = 56
  • c = 0.9.6819% ( but from formula we got
       (1-2.7183^(-6*12/56))^6 ~ 
    0.143485 ~ 14% )

Bloom Filter Python Module

Python Module (Class/Methods)

we wrote bloom.py as below, we decide three methods in this class "BloomFilter", 


  • 1st of all is dunder init for loading population_size(m) and probes(k) 
  • 2nd is add sample into set. 
  • Last is check whether this value e.g orange in the set or not.
    • It can be dunder contains since in python we can use like # bf = BloomFilter() and "orange" in bf to see whether bloom filter can find it.



=============================================
""" Implement a Bloom Filter from scratch.
        Space-efficient probablistic membership tester.
        No false negatives, some small chance of false positives.
"""

from __future__ import division, print_function
from random import seed, sample


class BloomFilter(object):
    "Probablistic membership tester."

    def __init__(self, population_size=56, probes=6):
        self.pool = set()
        self.population = xrange(population_size)
        self.probes = probes

    def add(self, key):
        seed(key)
        lucky = set(sample(self.population, self.probes))
        self.pool.update(lucky)

    def __contains__(self, key):
        "Implements:  'kanai' in bf"
        seed(key)
        lucky = set(sample(self.population, self.probes))
        return lucky <= self.pool
=============================================

Here is the test output.

In [1]: import bloom
In [2]: bf=bloom.BloomFilter()
In [3]: bf.pool
Out[3]: set()
In [4]: bf.add('apple')
In [5]: bf.add('banana')
In [6]: bf.pool
Out[6]: {0, 3, 6, 15, 17, 26, 32, 35, 39, 42, 46, 50}
In [7]: 'apple' in bf
Out[7]: True
In [8]: 'banana' in bf
Out[8]: True
In [9]: 'orange' in bf
Out[9]: False

As you can see above, we can't find orange when we has k=6 probes and m=56 set size.

we can double check ipython output here.

In [11]: from random import seed, random, randrange, choice, sample
In [12]: seed('orange'); lucky_numbers = set(sample(xrange(56), 6))
In [13]: bf.pool
Out[13]: {0, 3, 6, 15, 17, 26, 32, 35, 39, 42, 46, 50}
In [14]: lucky_numbers
Out[14]: {0, 6, 25, 28, 33, 39}
In [15]: lucky_numbers in bf.pool
Out[15]: False
In [16]: lucky_numbers & bf.pool
Out[16]: {0, 6, 39}
In [17]: lucky_numbers <= bf.pool
Out[17]: False
In [18]: # if we use {0, 6 , 39) only
In [19]: {0, 6, 39} <= bf.pool
Out[19]: True

Is it fun ? 

Bloom Filter C++ Demo

I wrote a Bloom Filter via crc32 checksum in C++ since CRC leverage polynomial concept can speed up the hash calculation. You can check the code in there. https://github.com/chianingwang/BloomFilter

CRC Example

Polynomial functions for common CRC's
CRC-160x8005x16 + x15 + x2 + 1
CRC-CCITT0x1021x16 + x12 + x5 + 1
CRC-DNP0x3D65x16 + x13 + x12 + x11 + x10 + x8 + x6 + x5 + x2 + 1
CRC-320x04C11DB7x32 + x26 + x23 + x22 + x16 + x12 + x11 + x10 + x8 + x7 + x5 + x4 + x2 + x1 + 1

C++ demo code
===

























===
/*
 * BloomFilter.cpp
 *
 *  Created on: Feb 18, 2015
 *      Author: Johnny
 */


/*
 * C++ Program to Implement Bloom Filter
 */
#include <iostream>
#include <cstdio>
#include <cstdlib>
#include <cstring>

using namespace std;

#define POLYNOMIAL 0x04C11DB7L  // Standard CRC-32 polynomial
#define M 32   // Number of bits in the Bloom filter
#define K 4   // Number of bits set per mapping in filter
typedef unsigned short int word16;
typedef unsigned int word32;

static word32 CrcTable[256];       // Table of 8-bit CRC32 remainders
char BFilter[/ 8];      // Bloom filter array of M/8 bytes
word32 NumBytes;            // Number of bytes in Bloom filter

void gen_crc_table(void);
word32 update_crc(word32 crc_accum, char *data_ptr, word32 data_size);
void mapBloom(word32 hash);
int testBloom(word32 hash);
/*
 * Main Function
 */
int main()
{
    FILE *fp1;
    FILE *fp2;
    char inFile1[256];
    char inFile2[256];
    char inString[1024];
    word32 crc32;
    int retCode;
    word32 i;
    cout<<"------------------------------------------ "<<endl;
    cout<<"-  Program to implement a general Bloom filter\n";
    cout<<"------------------------------------------ "<<endl;
    // Determine number of bytes in Bloom Filter
    NumBytes = M / 8;
    if ((% 8) != 0)
    {
        cout<<"*** ERROR - M value must be divisible by 8 \n";
        exit(1);
    }
    // Clear the Bloom filter
    for (i = 0; i < NumBytes; i++)
        BFilter[i] = 0x00;
    // Initialize the CRC32 table
    gen_crc_table();
    cout<<"File name of input list to add to filter ===========> ";
    cin>>inFile1;
    fp1 = fopen(inFile1, "r");
    if (fp1 == NULL)
    {
        cout<<"ERROR in opening input file #1 ("<<inFile1<<") *** \n";
        exit(1);
    }
    cout<<"File name of input list to check for match =========> ";
    cin>>inFile2;
    fp2 = fopen(inFile2, "r");
    if (fp2 == NULL)
    {
        cout<<"ERROR in opening input file #1 ("<<inFile2<<") *** \n";
        exit(1);
    }

    // Read input file #1 and map each string to Bloom filter
    while (1)
    {
        fgets(inString, 1024, fp1);
        if (feof(fp1))
            break;
        for (i = 0; i < K; i++)
        {
            crc32 = update_crc(i, inString, strlen(inString));
            mapBloom(crc32);
        }
    }
    fclose(fp1);

    // Output the Bloom filter
    cout<<"-------------------------------------------------------- \n";
    cout<<"Bloom filter is (M = "<<M<<" bits and K = "<<K<<" mappings)... \n";
    for (i = 0; i < NumBytes; i++)
        printf("%2d", i);
    cout<<endl;
    for (i = 0; i < NumBytes; i++)
        printf("%02X", BFilter[i]);
    cout<<endl;

    // Output results header
    cout<<"-----------------------------------------------------\n";
    cout<<"Matching strings are... \n";

    while(1)
    {
        fgets(inString, 1024, fp2);
        if (feof(fp2))
            break;
        for (i = 0; i < K; i++)
        {
            crc32 = update_crc(i, inString, strlen(inString));
            retCode = testBloom(crc32);
            if (retCode == 0)
          break;
    }
    if (retCode == 1)
        cout<<inString;
  }
  fclose(fp2);
}

/*
 * Function to initialize CRC32 table
 */
void gen_crc_table(void)
{
    register word32 crc_accum;
    register word16 i, j;
    // Initialize the CRC32 8-bit look-up table
    for (i = 0; i < 256; i++)
    {
        crc_accum = ((word32) i << 24);
        for (j = 0; j < 8; j++)
        {
            if (crc_accum & 0x80000000L)
                crc_accum = (crc_accum << 1) ^ POLYNOMIAL;
            else
                crc_accum = (crc_accum << 1);
        }
        CrcTable[i] = crc_accum;
    }
}
/*
 * Function to generate CRC32
 */
word32 update_crc(word32 crc_accum, char *data_blk_ptr, word32 data_blk_size)
{
    register word32 i, j;
    for (j = 0; j < data_blk_size; j++)
    {
        i = ((int) (crc_accum >> 24) ^ *data_blk_ptr++) & 0xFF;
        crc_accum = (crc_accum << 8) ^ CrcTable[i];
    }
    crc_accum = ~crc_accum;

    return crc_accum;
}
/*
 * Function to map hash into Bloom filter
 */
void mapBloom(word32 hash)
{
    int tempInt;
    int bitNum;
    int byteNum;
    unsigned char mapBit;
    tempInt = hash % M;
    byteNum = tempInt / 8;
    bitNum = tempInt % 8;

    mapBit = 0x80;
    mapBit = mapBit >> bitNum;

    // Map the bit into the Bloom filter
    BFilter[byteNum] = BFilter[byteNum] | mapBit;
}
/*
 * Function to test for a Bloom filter match
 */
int testBloom(word32 hash)
{
    int tempInt;
    int bitNum;
    int byteNum;
    unsigned char testBit;
    int retCode;
    tempInt = hash % M;
    byteNum = tempInt / 8;
    bitNum = tempInt % 8;

    testBit = 0x80;
    testBit = testBit >> bitNum;
    if (BFilter[byteNum] & testBit)
        retCode = 1;
    else
        retCode = 0;
    return retCode;
}

Bloom Filter via Google Guava ( Java )

I wrote a deduplication screen code via google guava's bloom filter java object (http://google.github.io/guava/releases/snapshot/api/docs/com/google/common/hash/BloomFilter.html). If you are interesting, feel free to check it out it.







































output:








==================
package pkgBFTest;

import com.google.common.base.*;
import com.google.common.hash.*;

import java.util.ArrayList;
import java.util.List;

public class clsBFTest {
@SuppressWarnings("serial")
public static void main(String args[]) {

class DedupID {
  String id;
}

Funnel<DedupID> personFunnel = new Funnel<DedupID>() {
  @Override
  public void funnel(DedupID dedupid, PrimitiveSink into) {
    into
      .putString(dedupid.id, Charsets.UTF_8);
  }
};

BloomFilter<DedupID> friends = BloomFilter.create(personFunnel, 500, 0.01);

DedupID J = new DedupID();
J.id = "Johnny";

DedupID W = new DedupID();
W.id = "Winnie";

DedupID T = new DedupID();
T.id = "Tom";

List<DedupID> dedupidList = new ArrayList<>();
dedupidList.add(J);
dedupidList.add(W);
dedupidList.add(T);


for(DedupID friend : dedupidList) {
  friends.put(friend);
}
DedupID dude = new DedupID();
dude.id = "Win";

if (friends.mightContain(dude)) {
  System.out.println("yes find "+dude.id );
}else{
  System.out.println("no can not find "+dude.id);
}

}



}


Sum

I explain the rough concept in first section and add more detail for false-positve rate = c ; # of filter = k, size of set = m ... etc. And show you the equations about what's the proper k, m and n for your idea c when you really want to use bloom filter.

After that, I show you three kinds of way to implement the bloom filter which are Python with Random module, C++ with CRC hashes and Google Guava Java object.

Again, hopefully you enjoy the blog and feel free to drop me a message if you have any comment or suggestion.

==============================

Quote from Movie

待在不懂珍惜的人身邊不是忠誠,而是愚蠢。
Staying with someone who doesn't appreciate you isn't loyalty, but stupidity.
火線淘寶 (War Dogs), 2016