Bloom Filters Part 1: An Introduction

Bloom filters are a magical elixir often used to reduce search space and time. They have other interesting properties that make them applicable in many situations where knowledge of the approximate size of a set, union, or intersection is important, where vast datasets are searched for small matching patterns, and even where it is desirable to search for data without disclosing the data being searched for, or the actual data found, to third parties.

In this series of blog posts we’ll do a deep dive into Bloom filters. In this first post we will touch on the mathematics behind the filters, work through an example of their use, and explore some of their properties. In later posts we will explore the Apache Commons Collections® implementation that is due out in version 4.5 of that library, discuss using Bloom filters for data sharding, and explore some of the unusual Bloom filters, such as counting Bloom filters, stable Bloom filters, and layered Bloom filters, before diving into multidimensional Bloom filters and encrypted data indexing.

Bloom filters are probably used on websites and applications you use every day. They are used to track articles you’ve read, speed up bitcoin clients, detect malicious websites, and improve the performance of caches. We will get back to these later.

What?

But let us start at the beginning. Bloom filters are a probabilistic data structure frequently used to represent sets of objects. They were invented by Burton Bloom[1] in 1970. A Bloom filter is an array of bits (a bit vector) into which a set of values has been hashed, so that some of the bit values are on (value one), or “enabled”, and others are off (value zero), or “disabled”. Multiple Bloom filters can be merged together by creating a union of the two bit vectors (V1 or V2). The resulting Bloom filter (B) is said to contain both items.
The equation for combining can be expressed as: \( B = V1 \cup V2 \)

When searching Bloom filters we generate a Bloom filter for the item we are looking for (the target, T), and then calculate the intersection of T with the bit vector of the Bloom filter that may contain the value (the candidate, C). If the result is equal to T, then a match has been made. This calculation can yield false positives but never false negatives. The equation for matching can be expressed as: \( T \cap C = T \)

There are several properties that define a Bloom filter: the number of bits in the vector (\(m\)), the number of hash functions (\(k\)), and the number of items that will be merged into the filter (\(n\)). From these the probability of false positives (\(p\)) can be calculated:

\[ p = \left( 1 - e^{-kn/m} \right) ^k \]

However, it has been reported that the false positive rate in real deployments is higher than the value given by this equation.[3][4] Theoretically it has been proven that the equation offers a lower bound on the false positive rate, and a more accurate false positive rate has been discovered.[5]

The net result is that we can describe a Bloom filter with a “shape”, and that shape can be derived from combinations of the properties, for example from \(n\) and \(p\), or from \(m\) and \(k\). To compare Bloom filters, they must have the same shape and use the same hashing functions.

Why?

Bloom filters are often used to reduce the search space. For example, consider an application looking for a file that may occur on one of many systems. Without a Bloom filter, each system must be queried for the existence of the file. Generally this is a lengthy process. However, if a Bloom filter is created for each system, then the query can first check the filter. Only if the filter indicates the file might be on the system is the expensive lookup performed. Because the Bloom filter never returns a false negative, this strategy reduces the search space to only those systems that may contain the file.
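The merge and matching operations described above can be illustrated with plain integer bitmasks. This is a minimal sketch for illustration only, not the Apache Commons Collections API:

```python
def merge(v1: int, v2: int) -> int:
    """Merge two Bloom filters: B = V1 union V2 is the bitwise OR of the vectors."""
    return v1 | v2

def might_contain(candidate: int, target: int) -> bool:
    """Match test: true when the intersection of T and C equals T,
    i.e. every bit of the target is also enabled in the candidate.
    May yield false positives, never false negatives."""
    return candidate & target == target

# Two single-item filters merged into one collection filter.
v1 = 0b00100011
v2 = 0b01000001
b = merge(v1, v2)            # 0b01100011

assert might_contain(b, v1)  # merged items always match
assert might_contain(b, v2)
assert not might_contain(b, 0b10000000)  # a bit outside the union cannot match
```

A false positive occurs when the merged bits of other items happen to cover every bit of an item that was never added, which is why the match is only probabilistic.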
Examples of large Bloom filter collections can be found in bioinformatics,[6][7][8] where Bloom filters are used to represent gene sequences, and in Bloom filter based databases, where records are encoded into Bloom filters.[9][10] Medium, the digital publishing company, uses Bloom filters to track what articles have been read.[11] Bitcoin uses them to speed up clients.[12] Bloom filters have been used to improve caching performance[13] and to detect malicious websites.[14]

How?

So, let’s work through an example. Let’s assume we want to put 3 items in a filter (\(n = 3\)) with a 1-in-5 probability of false positives (\(p = 0.2\)); solving the false positive equation for these values yields a filter of \(m = 11\) bits using \(k = 3\) hash functions. Now that we know the shape of our Bloom filters, let’s populate one. In this example we will use a CRC hash; this is not recommended and is only used here for ease of example. We will also use a naive combinatorial hashing technique that should not be used in real life. We start by taking the CRC hash for “CAT”.
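The shape for an example like this can be reproduced from the false positive equation. The sketch below uses the standard optimal-shape formulas \( m = \lceil -n \ln p / (\ln 2)^2 \rceil \) and \( k = \mathrm{round}(\frac{m}{n} \ln 2) \); these formulas are an assumption here, since the post does not show the derivation:

```python
import math

def shape(n: int, p: float) -> tuple[int, int]:
    """Derive (m, k) from the desired item count n and false positive
    probability p, using the standard optimal-shape formulas."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = round(m / n * math.log(2))
    return m, k

# 3 items with a 1-in-5 false positive rate -> 11 bits, 3 hash functions.
m, k = shape(3, 0.2)
assert (m, k) == (11, 3)
```

The same function scales to realistic shapes; for example, a thousand items at a 1% false positive rate needs a vector of roughly ten thousand bits.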
We need to generate 3 hash values (\(k = 3\)). With the naive combinatorial technique, the base hash is split into two parts: one is used as an initial value and the other as an increment, and the \(i\)-th bit index is the initial value plus \(i\) times the increment, modulo the number of bits \(m\).
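A naive combinatorial hash of the kind described here can be sketched as follows. The exact form \((\text{initial} + i \times \text{increment}) \bmod m\) and the concrete values below are assumptions for illustration; the degenerate case matches the behavior the text describes for DOG, where every generated value is 2:

```python
def naive_hashes(initial: int, increment: int, k: int, m: int) -> list[int]:
    """Naive combinatorial hashing: the i-th bit index is
    (initial + i * increment) mod m. Not for real use."""
    return [(initial + i * increment) % m for i in range(k)]

# Weakness: when the increment is a multiple of m, every hash round
# selects the same bit (assumed concrete instance of the DOG case).
assert naive_hashes(2, 11, 3, 11) == [2, 2, 2]

# A well-behaved pair of values spreads indices across the vector.
assert naive_hashes(3, 5, 3, 11) == [3, 8, 2]
```

Commons Collections' EnhancedDoubleHashing, mentioned later in this post, perturbs the increment between rounds precisely to avoid this collapse.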
This yields a Bloom filter of \(\{0,5,6\}\) for CAT: bits 0, 5, and 6 are enabled.
The collection is the union of the three animal filters (CAT, DOG, and GUINEA PIG). This represents the set of animals. The interesting one in this collection is DOG. When we execute the naive combinatorial hashing algorithm on the DOG hash, every value is 2. We’ll come back to this in a moment.

If we perform the same calculations on “HORSE”, we get \(\{2,5,9\}\). Now to see if HORSE is in our collection we solve \(\{2,5,9\} \cap \{0,2,5,6,7,10\} = \{2,5\} \ne \{2,5,9\}\). So HORSE is not in the collection.

If we only put CAT and GUINEA PIG into the collection, we get the same result for the collection. But when testing for DOG we get the true statement \(\{2\} \cap \{0,2,5,6,7,10\} = \{2\}\). The filter says that DOG is in the collection. This is an example of a false positive result.

DOG also shows the weakness of the naive combinatorial hashing technique. A proper implementation can be found in the EnhancedDoubleHashing class in the Apache Commons Collections version 4.5 library. In that class, a tetrahedral number is added to the increment to reduce the probability of a single bit being selected repeatedly over the course of the hash.

Statistics?

Bloom filters lend themselves to several statistics. The most common is the Hamming value or cardinality: simply the number of bits that are enabled in the vector. From this value a number of statistics can be calculated. The Hamming distance is the number of bits that have to be flipped to convert one Bloom filter to another. For example, to convert CAT \(\{0,5,6\}\) to HORSE \(\{2,5,9\}\), we have to turn off bits \(\{0,6\}\) and turn on bits \(\{2,9\}\), so the Hamming distance is 4. Bloom filters with lower Hamming distances are in some sense similar.

Another measure of similarity is the cosine similarity, also known as the Orchini similarity, Tucker coefficient of congruence, or Ochiai similarity.
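The Hamming distance arithmetic above can be checked with a short sketch, using the bit indices from the CAT and HORSE examples:

```python
def to_vector(indices) -> int:
    """Build a bit vector with the given bit indices enabled."""
    v = 0
    for i in indices:
        v |= 1 << i
    return v

def hamming_distance(v1: int, v2: int) -> int:
    """Number of bits that differ: the population count of the XOR."""
    return bin(v1 ^ v2).count("1")

cat = to_vector({0, 5, 6})
horse = to_vector({2, 5, 9})
# Turn off bits {0, 6} and turn on bits {2, 9}: distance 4.
assert hamming_distance(cat, horse) == 4
```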
To calculate the cosine similarity, the cardinality of the intersection (bitwise ‘and’) of the two filters is divided by the square root of the product of the cardinalities of the two filters. The result is a number in the range \([0,1]\). The cosine distance is calculated as 1 minus the cosine similarity.

The final measure of similarity that we will cover is the Jaccard similarity, also known as the Jaccard index, intersection over union, and Jaccard similarity coefficient. To calculate the Jaccard index, the cardinality of the intersection (bitwise ‘and’) of the two Bloom filters is divided by the cardinality of the union (bitwise ‘or’) of the two Bloom filters. The Jaccard distance, like the cosine distance, is calculated as 1 minus the Jaccard similarity.

The similarity and distance statistics can be used to group similar Bloom filters together, for example when distributing files across a system that uses Bloom filters to determine where a file might be located. In this case it might make sense to store the file where the existing collection of Bloom filters has minimal distance to the file’s filter.

In addition to basic similarity and difference, if the shape of the filter is known, some information about the data behind the filters can be estimated. For example, the number of items merged into a filter (\(n\)) can be estimated from the filter’s cardinality (\(c\)):

\[ n = \frac{-m \ln(1 - c/m)}{k} \]

Estimating the size of the union of two filters is simply a matter of calculating the estimated number of items in the filter formed by merging (bitwise ‘or’) the two filters. Estimating the size of the intersection of two filters is the estimated size of the first filter, plus the estimated size of the second, minus the estimated size of their union.

Usage Errors

There are several places where errors can creep into Bloom filter usage.

Saturation errors

Saturation errors arise from underestimating the number of items that will be placed into the filter. Let’s define “saturation” as the number of items merged into a filter divided by the number of items specified in the shape. Once the saturation exceeds 1, the false positive rate rises rapidly.
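The effect of saturation can be seen by evaluating the false positive equation \( p = (1 - e^{-kn/m})^k \) at multiples of the design \(n\). The sketch below uses the toy shape from the example in this post (\(m = 11\), \(k = 3\), designed for \(n = 3\)); using this particular shape is an assumption, and larger real-world shapes show the same explosive growth:

```python
import math

def false_positive_rate(m: int, k: int, n: int) -> float:
    """p = (1 - e^(-kn/m))^k for n items merged into the filter."""
    return (1 - math.exp(-k * n / m)) ** k

M, K, N = 11, 3, 3  # toy shape: 11 bits, 3 hashes, designed for 3 items
for saturation in range(1, 6):
    p = false_positive_rate(M, K, saturation * N)
    print(f"saturation {saturation}: false positive rate {p:.3f}")
```

For this toy shape the rate starts near 0.17 at saturation 1 and climbs above 0.9 by saturation 5, at which point nearly every test reports a match.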
The probability of false positives can be two orders of magnitude larger when the saturation reaches two. After five times the estimated number of items have been added, the false positive rate is so high as to make the filter useless.

Hashing errors

A second source of errors is the generation of the hashes. If a combinatorial hashing algorithm is used and the number of bits is significantly higher than the midpoint of the hash values, then the generated values will be weighted toward the lower bits. For example, if byte values were used to set the increment and the initial value but the number of bits was in excess of 255, then the higher-valued bits could not be selected on the first hash, and in all cases bits far above 255 would rarely be selected.

Review

So far we have covered what Bloom filters are, why we use them, and how to construct and use them; we have also explored a few statistics and looked at potential problems arising from usage errors. In the next post we will see how the Apache Commons Collections code implements Bloom filters and look at how to implement the common usage patterns using the library.

Footnotes