Class EnhancedDoubleHasher

java.lang.Object
org.apache.commons.collections4.bloomfilter.EnhancedDoubleHasher
All Implemented Interfaces:
Hasher

public class EnhancedDoubleHasher extends Object implements Hasher
A Hasher that implements combinatorial hashing as described by Krisch and Mitzenmacher using the enhanced double hashing technique described in the wikipedia article Double Hashing.

Common use for this hasher is to generate bit indices from a byte array output of a hashing or MessageDigest algorithm.

Thoughts on the hasher input

Note that it is worse to create smaller numbers for the initial and increment. If the initial is smaller than the number of bits in a filter then hashing will start at the same point when the size increases; likewise the increment will be the same if it remains smaller than the number of bits in the filter and so the first few indices will be the same if the number of bits changes (but is still larger than the increment). In a worse case scenario with small initial and increment for all items, hashing may not create indices that fill the full region within a much larger filter. Imagine hashers created with initial and increment values less than 255 with a filter size of 30000 and number of hash functions as 5. Ignoring the tetrahedral addition (a maximum of 20 for k=5) the max index is 255 * 4 + 255 = 1275, this covers 4.25% of the filter. This also ignores the negative wrapping but the behavior is the same, some bits cannot be reached.

So this needs to be avoided as the filter probability assumptions will be void. If the initial and increment are larger than the number of bits then the modulus will create a 'random' position and increment within the size.

Since:
4.5
  • Constructor Details

    • EnhancedDoubleHasher

      public EnhancedDoubleHasher(byte[] buffer)
      Constructs the EnhancedDoubleHasher from a byte array.

      This method simplifies the conversion from a Digest or hasher algorithm output to the two values used by the EnhancedDoubleHasher.

      The byte array is split in 2 and the first 8 bytes of each half are interpreted as a big-endian long value. Excess bytes are ignored. If there are fewer than 16 bytes the following conversions are made.

      1. If there is an odd number of bytes the excess byte is assigned to the increment value
      2. The bytes alloted are read in big-endian order any byte not populated is set to zero.

      This ensures that small arrays generate the largest possible increment and initial values.

      Parameters:
      buffer - the buffer to extract the longs from.
      Throws:
      IllegalArgumentException - is buffer length is zero.
    • EnhancedDoubleHasher

      public EnhancedDoubleHasher(long initial, long increment)
      Constructs the EnhancedDoubleHasher from 2 longs. The long values will be interpreted as unsigned values.
      Parameters:
      initial - The initial value for the hasher.
      increment - The value to increment the hash by on each iteration.
  • Method Details

    • indices

      public IndexProducer indices(Shape shape)
      Description copied from interface: Hasher
      Creates an IndexProducer for this hasher based on the Shape.

      The IndexProducer will create indices within the range defined by the number of bits in the shape. The total number of indices will respect the number of hash functions per item defined by the shape. However the count of indices may not be a multiple of the number of hash functions if the implementation has removed duplicates.

      This IndexProducer must be deterministic in that it must return the same indices for the same Shape.

      No guarantee is made as to order of indices.

      Duplicates indices for a single item may be produced.

      Specified by:
      indices in interface Hasher
      Parameters:
      shape - the shape of the desired Bloom filter.
      Returns:
      the iterator of integers