Understanding MinHash in Lucene / Elasticsearch

Xing Zeng
5 min readMar 6, 2020

--

MinHash is a quite common technique in Information Retrieval. Just in case if you had never heard of it, here is the explanation of how it worked on Elasticsearch’s reference page:

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-minhash-tokenfilter.html

A general idea of the way MinHash produces a signature for a document is by applying a random permutation over the whole index vocabulary (random numbering for the vocabulary), and recording the minimum value for this permutation for the document (the minimum number for a vocabulary word that is present in the document). The permutations are run several times; combining the minimum values for all of them will constitute a signature for the document.

In practice, instead of random permutations, a number of hash functions are chosen. A hash function calculates a hash code for each of a document’s tokens and chooses the minimum hash code among them. The minimum hash codes from all hash functions are combined to form a signature for the document.

That explanation, however, does not match with settings described by Elasticsearch in order to configure MinHash correctly:

hash_count :

The number of hashes to hash the token stream with. Defaults to 1.

bucket_count :

The number of buckets to divide the minhashes into. Defaults to 512.

hash_set_size :

The number of minhashes to keep per bucket. Defaults to 1.

Note that the hash count defaults to be 1, which does not match with the explanation of “a number of hash functions are chosen”. While at the same time, the number of buckets is still significantly big to result in a signature of higher dimensionality.

A deeper look into Lucene’s MinHashFilter.java code explains how does it actually work.

The MinHashSet are created from the following lines:

this.hashCount = hashCount;
this.bucketCount = bucketCount;
this.hashSetSize = hashSetSize;
this.withRotation = withRotation;
this.bucketSize = (1L << 32) / bucketCount;
if((1L << 32) % bucketCount != 0)
{
bucketSize++;
}
minHashSets = new ArrayList<>(this.hashCount);
for (int i = 0; i < this.hashCount; i++) {
ArrayList<FixedSizeTreeSet<LongPair>> buckets = new ArrayList<>(this.bucketCount);
minHashSets.add(buckets);
for (int j = 0; j < this.bucketCount; j++) {
FixedSizeTreeSet<LongPair> minSet = new FixedSizeTreeSet<>(this.hashSetSize);
buckets.add(minSet);
}
}

Main points here:

  1. The bucketCount is the same as what ES has claimed, it refers to the size of the signature where each signature is a place where some sort of “recoding the minimal value” will happens
  2. The hashSetSize is an extra functionality ES supported so that on one bucket, you can store more than 1 minimal value.

This explains why if you increase them, you would be more likely to get higher precision, as they can be seen as increasing the dimension of the minHashSet variable, which in turns reduce the likelihood of dissimilar document to collide on a smaller dimensioned space.

The way how hashCount work and how it was able to deal with 512 buckets with one hash function is located at where the values of minHashSetis being actually created:

while (input.incrementToken()) {
found = true;
String current = new String(termAttribute.buffer(), 0, termAttribute.length());

for (int i = 0; i < hashCount; i++) {
byte[] bytes = current.getBytes("UTF-16LE");
LongPair hash = new LongPair();
murmurhash3_x64_128(bytes, 0, bytes.length, 0, hash);
LongPair rehashed = combineOrdered(hash, getIntHash(i));
minHashSets.get(i).get((int) ((rehashed.val2 >>> 32) / bucketSize)).add(rehashed);
}
endOffset = offsetAttribute.endOffset();
}

Now Recall that, if we ignore the remainder which are minor and does not appears for the default parameter:

bucketSize = (1L << 32) / bucketCount

which implies

(rehashed.val2 >>> 32) / bucketSize
= (rehashed.val2 >>> 32) / ((1L << 32) / bucketCount)
= bucketCount * (the upper 32 digits of rehashed.val2 / 2 ** 32)

This means that, for a particular hash function, it will at most only place you into one of the buckets, based on what the hash value is. This resulted in the two following:

  1. One token will affect at most one bucket if ignoring rotation.
  2. Assuming Simple Uniform Hashing Principles and a long enough document, for every bucket, there exists a token in the document that will fill this particular bucket.

It should be easy to see that such an approach will still work on finding similar documents, as dissimilar documents are still going to be more likely to have different values in the bucket resulted from different hashing values.

Another way of thinking about how this unorthodox way of doing MinHash would work is to think from how would you deduce this algorithm from the regular MinHash algorithm. Basically, you can assume for a particular token, it will never become the minimal element of all other buckets except one. This suggests that at least for a particular carefully created group of documents, whatever the signature one may get from Lucene’s MinHash, is still going to have limited differences from what one may get from a more regular MinHash function, depending on how many of the token would actually result in a smaller value in some other buckets.

This also explains why if you increase hashCount, you will have a higher recall, despite the fact that you are also increasing the dimension of the minHashSet. The point is that as the hash function never interferes with each other, so whatever similar document that is unfortunately not being mapped to enough similar values by the first hash function may get recovered by the second hash function.

As for why Lucene used this type of rather an uncommon strategy to deal with LSH, I am not really sure. One benefit I do found is that having only one hash function is much faster than maintaining multiple hash functions. Anyone who is willing to shed some light on it is very welcome.

Updates on December 17, 2020:

I’ve been reading a book, Mining Massive Datasets, recently, just to learn more about some of the standardized processes in Data Mining is as I have never got a chance to formally take such a course in School. Apparently, Lucene’s approach is very similar to a recommended speed up method mentioned in Section 3.3.7 of this book.

The book quoted:

…we divide the rows of M into k/m groups. Then for each hash function, we compute one minhash value by examining only the first m rows of M, a different minhash value by examining only the second m rows, and so on. We thus get k/m minhash values from a single hash function and a single pass over all the rows of M. In fact, if k/m is large enough, we may get all the rows of the signature matrix that we need by a single hash function applied to each of the subsets of rows of M.

Where here:

  • M is, originally in the book, a matrix with each row representing all possible elements in a Set (i.e. possible shingle of tokens), and column representing a Set that has contained some of the elements (i.e. documents)
  • k is the number of all possible elements
  • m is the size of “group” where each “group” is only allowed with some members of k but not the other
  • Thus, there are k/m group that doesn’t touch each other, i.e. “one token will affect at most one bucket if ignoring rotation”

Combining the above and using only one hash function you get the default minhash algorithm in Lucene.

--

--

Xing Zeng
Xing Zeng

Written by Xing Zeng

Study and Work in NLP & Big Data

Responses (1)