Collections – Bloom filters

Introducing Bloom filters
Exploring the Classes
Examples
References

1. Introducing Bloom filters

1.1 Understanding Bloom filters

Bloom filters were introduced to Apache commons-collections in version 4.5. A Bloom filter:

Is a probabilistic data structure defined by Burton Bloom in 1970 [Bloom];
Can be considered as a bit vector representing a set of objects;
Is constructed by hashing data multiple times;
Identifies where things are not ;
Will yield false positives but never false negatives.
Is typically used where hash table solutions are too memory intensive and false positives can be addressed; for example a gating function to a longer operation (e.g. determine if a database lookup should be made);

1.2 Defining a Bloom filter

Bloom filters are constrained by: the number of elements in the set, the number of hash functions, the number of bits in the vector, and the probability of false positives.

\(p\) - is the probability of false positives,
\(n\) - is the number of elements in the set represented by the Bloom filter,
\(m\) - is the number of bits, and
\(k\) - is the number of hash functions.

Mitzenmacher and Upfal [Mitzenmacher] have shown that the relationship between these properties is: \(p = (1- e^{-kn/m})^k\)

Thomas Hurst provides a good calculator [Hurst] where the interplay between these values can be explored.

1.3 Constructing a Bloom filter

A simple Bloom filter is constructed by hashing an object \(k\) times, and using each hash value to turn on a single bit in a bit vector comprising \(m\) bits.

Result: A populated bit vector

byte[][] buffers // the list of buffers to hash

bit[m] bitBuffer;

for buffer in buffers do

for i=1 to k do

long h = hash( buffer, i );

int bitIdx = h mod m;

bitBuffer[bitIdx] = 1;

end for

Algorithm 1 How to construct a Bloom filter

Building a bloom filter with the Apache Commons-Collections implementation looks like:

// select a hash function
HashFunction hFunc = new Murmur128x86Cyclic();
// define the shape - 7 items. 1 in 1000 probability of collisions
Shape shape = new Shape ( hFunc , 7, 1.0/1000);
// create a hasher.
DynamicHasher hasher = new DynamicHasher.Builder( hFunc )
    .with( "banana" ).with( "apple" ).with( "pear" ).build();
// create the bloom filter
BloomFilter filter = new BitSetBloomFilter ( hasher , shape );

This looks more complex than the simple Hadoop Bloom filters. Creating the same filter using the Hadoop library look like:

//define the filter
BloomFilter hadoopBloomFilter = new BloomFilter (1 _000 , 7, Hash.MURMUR_HASH );
// add the fruits
hadoopBloomFilter.add( new Key( "banana".getBytes() ) );
hadoopBloomFilter.add( new Key( "apple".getBytes() ) );
hadoopBloomFilter.add( new Key( "pear".getBytes () ) );

However, the Commons Collection Bloom filter library is designed to meet most Bloom filter use cases. Covering the wide variety of use cases means that the library provides more options and therefore appears more complex. Any application that wants a simple, consistent, cross platform Bloom filter with a tested implementation can build it very quickly from the Commons Collections Bloom filter library.

2. Exploring the Classes

The Commons Collections Bloom filter library has five (5) major components.

HashFunctionIdentity - An interface that defines a hash function.
HashFunction - An interface that extends HashFunctionIdentity to include a the hash function implementation. The hash function converts a byte[] into a long value using a seed.
Hasher - A class that, when provided with a Shape , converts one or more byte[] into an iterator<int> representing the bits that are enabled in the Bloom filter.
Shape - A class that describes the shape of the Bloom filter. The shape is defined by:
- HashFunctionIdentity
- The probability of false positives ( \(p\) )
- The number of elements in the set represented by the Bloom filter ( \(n\) )
- The number of bits ( \(m\) )
- The number of hash functions ( \(k\) )
BloomFilter - An interface that defines the standard functions that a Bloom filter must provide. Implementations generally take a Hasher and a Shape and converts the iterator<int> from the Hasher and preserves them in an internal representation.

2.1 Exploring the HashFunctionIdentity and HashFunction

The HashFunctionIdentity is an interface that defines three (3) basic attributes of a HashFunction :

The common hash function name. This is a value like MD5 or Murmur3-128-x86
The signedness of the calculation. When a byte[] is interpreted as a numerical value is it assumed to be signed or unsigned?
The process type. In a traditional Bloom filter calculation the hash function is called multiple times with different seeds. This technique is called iterative hashing. However Kirsch and Mitzenmacher [kirsch] have shown that it is possible to more quickly generate the necessary values by utilizing two (2) values from the hashing. This is called cyclic hashing.

The HashFunctionIdentity is used in situations where the attributes of the HashFunction are required but the implementation is not. For example, a Bloom filter need only know what the attributes of the HashFunction that generated the bits was, it does not require the implementation.

The HashFunctionIdentity also has a signature created by hashing the string produced by: String.format( "%s-%s-%s", getName(), getSignedness(),getProcess() ).getBytes( "UTF-8" ) with a seed of 0 (zero). The signature is used in quick comparison checks to ensure function compatibility.

The HashFunction is an interface that extends HashFunctionIdentity with an method to access the hash function implementation. HashFunction implementations are often used where the HashFunctionIdentity is required as they are the only provided implementations of HashFunctionIdentity .

HashFunction implementations include:

MD5Cyclic - MD5 hash used in a cyclic manner.
Murmur128x86Cyclic - Murmur3 128-bit x86 hash implementation used in a cyclic manner.
Murmur32x86Iterative - Murmur3 32-bit x86 hash implementation used in an iterative manner.
ObjectsHashIterative - Java Objects hash used in an iterative manner.

Additional HashFunction implementations can be created to support the hashing strategies often found in bioinformatics where k-mer sequences are indexed into Bloom filters. In some cases the data about the k-mer sequence is used as the seed value for the hash.

2.2 Exploring the Shape

As noted earlier the Shape describes the shape of the resulting Bloom filter. The constructor for Shape always takes a HashFunctionIdentity as well as many of the other Shape properties necessary to solve \(p = (1- e^{-kn/m})^k\). These combinations are:

The probability of collisions ( \(p\) ), the number of bits ( \(m\) ), and the number of hash functions ( \(k\) ).
The number of elements ( \(n\) ), and the probability of collisions ( \(p\) ).
The number of elements ( \(n\) ), number of bits ( \(m\) ), and number of hash functions ( \(k\) ).

2.3 Exploring the Hasher

The Hasher converts converts one or more byte[] into an iterator<int> representing the bits that are enabled in the Bloom filter. The bits indexes are in the range [0, \(m\) -1]. The Hasher is, in a sense, a primitive Bloom filter. There are two implementations of Hasher provided:

DynamicHasher - calls the HashFunction as needed to generate the list of bits.
StaticHasher - contains a list of bits that were enabled for a specific Shape and simply replays them. This method is faster than the Dynamic when the Hasher is reused but can only be used for filters with the same Shape as the Hasher.

Other implementations of Hasher are possible. Consider the Hasher as the class that converts one or more byte[] into Bloom filters of any shape.

2.4 Exploring the BloomFilter

The BloomFilter is an interface that describes the functionality of the Bloom filter. Along with the AbstractBloomFilter there are three (3) concrete implementations of BloomFilter provided:

BitSetBloomFilter - Stores the enabled bits in a BitSet object.
HasherBloomFilter - Uses the Hasher implementation as the storage for the bits. This is not efficient but often convenient if few Bloom filter operations are required.
CountingBloomFilter - Uses a sparse list ( HashMap ) of counts for the enabled bits. CountingBloomFilters are an extension to the normal Bloom filter and enable removal of Bloom filters from the set.

2.5 Why Hasher, Shape and BloomFilter?

Other libraries tend to merge the functionality of the Hasher , Shape , and BloomFilter into one or two classes. By separating the responsibility we have developed classes that focus on a single task. This permits a wider range of uses for the classes, for example:

It becomes possible to create client/server style applications where the client creates a Hasher , sends it across the wire, and the server then builds Bloom filters of various Shape depending on the requested functionality.
The library can support Bloom filters that represent the enabled bits in different ways in order to support specific functionality. For example some applications utilize Bloom filters that are very large (large \(m\) value) but have few bits enabled (low \(t\) and \(n\) values). When storing a large number of these it becomes important to have bit vector compression. However, for Bloom filters with a higher number of enabled bits the compression is just excess overhead. Applications can swap out the BloomFilter implementation without impacting the Hasher or Shape selections.
The library can support implementations of attenuated, spectral and other exotic Bloom filter varieties.

2.6 Build your own BloomFilter

The AbstractBloomFilter requires implementation of four methods:

getBits() - Get the enabled bits as an array of long values.
getHasher() - Creates a StaticHasher that returns the indexes of the enabled bits and contains the shame Shape as the Bloom filter.
merge(BloomFilter) - Merges enabled bits from the other Bloom filter into this Bloom filter.
merge(Hasher) - Enables the bits specified by the Hasher in this Bloom filter.

Other methods in AbstractBloomFilter should be overridden to take advantage of the bit encoding the new implementation provides.

3 Examples

3.1 How to use a Bloom filter as a gatekeeper

A ”gatekeeper” is a Bloom filter that determines if a longer running method should be executed. In this example we create a Bloom filter from a list of bad IDs. We then check a candidate ID against the Bloom filter. If the ID is in the Bloom filter we make the expensive call to a remote server to verify the ID. For purposes of illustration server.getBlackList() returns a list of IDs as Strings , and server.checkBlackList( String id ) verifies that the ID is on the black list.

/* setup environment */
// select the hash function
HashFunction hFunc = new Murmur128x86Cyclic();
// 10000 elements, 1/million probability of collision
Shape shape = new Shape( hFunc , 10000, 1.0/1000000);

/* build the gatekeeper */
BloomFilter gateKeeper = new BitSetBloomFilter( shape );
for ( String id : server.getBlackList() ) {
    DynamicHasher hasher = new DynamicHasher.Builder( hFunc ).with( id ).build();
    gateKeeper.merge( hasher );
}

/* use the gatekeeper */
String id = // perhaps entered by user during login
DynamicHasher hasher = new DynamicHasher.Builder( hFunc ).with( id ).build();
boolean inBlackList = gatekeeper.contains( hasher );
if ( inBlackList )
{
    /* it might be in the blacklist, so execute the server call to find out. */
    inBlackList = server.checkBlackList( id );
}
// inBlackList is now true if the ID is in the black list, false otherwise.

3.2 Data sharding

In this example we create a Bloom filter for each data item we are going to save. When the data is inserted in the storage each gatekeeper Bloom filter is checked to see which is ”closest” to the filter being inserted and the object is inserted in the associated container. When searching for data each gatekeeper Bloom filter is checked for the presence of the data Bloom filter if it is in the gatekeeper then the associated container is searched for the object.

/* setup environment */

// Set some limits
int MAX_SHARD_SIZE = 100000;

// select the hash function
HashFunction hFunc = new Murmur128x86Cyclic();

// 100000 elements, 1/million probability of collision
Shape shape = new Shape( hFunc , MAX_SHARD_SIZE , 1.0/1000000);

// create the sharding container
Map < CountingBloomFilter , Map < String , T >> shards = new HashMap <>();

// populate the minimum shards (5)
for ( int i =0; i <5; i ++) {
    shards.put( new CountingBloomFilter( shape ), new HashMap < String , T >>() );
}


/* insert a data object */

T data = // get the data object.
DynamicHasher hasher = new DynamicHasher.Builder( hFunc ).with( data.getId() ).build();
BloomFilter dataBloom = new BitSetBloomFilter( hasher , shape );
int distance = Integer.MAX_VALUE ;
BloomFilter collectionFilter = null
for ( BloomFilter candidate : shards.keySet() ) {
    if ( SetOperations.hammingDistance( candidate , dataBloom ) < distance ) {
        distance = SetOperations.hammingDistance( candidate , dataBloom );
        collectionFilter = candidate ;
    }
}
if ( collectionFilter == null ) {
    // no available collection so create one.
    collectionFilter = new CountingBloomFilter( shape );
    shards.put( collectionFilter , new HashMap < String , T >() );
}
Map < String , T > collection = shards.get( collectionFilter );
if ( collection.size() >= MAX_SHARD_SIZE ) {
    // no space in collection so create a new one.
    collectionFilter = new CountingBloomFilter( shape );
    collection = new HashMap < String , T >();
    shards.put( collectionFilter , collection );
}
collectionFilter.merge( dataBloom );
collections.put( data.getId(), data );


/* search for a data object */

T result = null ;
String id = // get the data ID.
DynamicHasher hasher = new DynamicHasher.Builder( hFunc ).with( id ).build();
BloomFilter dataBloom = new BitSetBloomFilter( hasher , shape );
for ( Map.Entry < BloomFilter , Collection < T >> entry : shards ) {
    if ( shard.key().contains( dataBloom )) {
        T candidate = shard.value().get( id );
        if ( candidate != null ) {
            result = candidate ;
        }
    }
}
// result is either null or contains the desired value.

3.3 Future directions

The separation and definition of the HashFunctionIdentity , HashFunction , and Hasher classes as well as the development of the Cyclic hash function, means that it is possible to construct hashers that can be sent across the wire. This opens up the possibility of sending query requests without exposing the values or properties being queried and may provide the ability to query encrypted data. [Bellovin, Goh, Tajima, Warren1]

Implementations of multidimensional Bloom filters [Crainiceanu] (AKA Bloom filter indexes) may also be enhanced as the Hasher is not bound to a shape. So multidimensional filters can us a Bloom filter of a different Shape from the ones being stored to reduce the number of filters it actually checks for. [Warren2]

Implementations of attenuated, spectral and other exotic Bloom filter varieties is possible using the commons-collection library.

4 References

[Bellovin] Steven M Bellovin and William R Cheswick”. Privacy-Enchanced Searched Using Encrypted Bloom Filters”. 2004. url: https://mice.cs.columbia.edu/getTechreport.php?techreportID=483 (Accessed 26 Jan 2020).

[Bloom] Burton H. Bloom. “Space/Time Trade-offs in Hash Coding with Allowable Errors”. In: Communications of the ACM 13.7 (July 1970), pp. 422–426.

[Crainiceanu] Adina Crainiceanu and Daniel Lemire. Bloofi: Multidimensional Bloom Filters. Accessed on 11-Jan-2020. 2016. url: https://arxiv.org/pdf/1501.01941.pdf (Accessed 26 Jan 2020).

[Goh] Eu-Jin Goh. Secure Indexes. 2004. url: https://crypto.stanford.edu/~eujin/papers/secureindex/secureindex.pdf (Accessed 26 Jan 2020).

[Hurst] Thomas Hurst. Bloom filter calculator. 2019. url: https://hur.st/bloomfilter/ (Accessed 26 Jan 2020).

[Kirsch] Adam Kirsch and Michael Mitzenmacher. Building a Better Bloom Filter. 2008. url: https://www.eecs.harvard.edu/~michaelm/postscripts/tr-02-05.pdf (Accessed 26 Jan 2020).

[Mitzenmacher] Michael Mitzenmacher and Eli Upfal. Probability and computing: Randomized algorithms and probabilistic analysis. Cambridge, Cambridgeshire, UK: Cambridge University Press, 2005, pp. 109–111, 308. isbn: 9780521835404.

[Tajima] Arisa Tajima, Hiroki Sato, and Hayato Yamana. Privacy-Preserving Join Processing over outsourced private datasets with Fully Homomorphic En- cryption and Bloom Filters. 2018. url: https://db-event.jpn.org/deim2018/data/papers/201.pdf (Accessed 26 Jan 2020).

[Warren1] Claude N. Warren Jr. Bloom Encrypted Index. 2019. url: https://github.com/Claudenw/BloomEncryptedIndex (Accessed 26 Jan 2020).

[Warren2] Claude N. Warren Jr. Multidimentional Bloom Filter Implementations. 2019. url: https://github.com/Claudenw/MultidimentionalBloom (Accessed 26 Jan 2020).

Contents