Approximate Similarity Search Under Edit Distance Using Locality-Sensitive Hashing

20/06/2019 - 09:25

Edit distance similarity search, also called approximate pattern matching, is a fundamental problem with widespread applications.  The goal is to preprocess a database such that, given a query text, we can quickly find the database entry that is most similar to the query.  This similar text may differ by character replacements, insertions or deletions---in other words, we want the most similar item (or at least an approximately-most-similar item) under edit distance.  This problem has widespread uses in text processing and computational biology.


Previous approaches to this problem fell into three categories.  Brute-force approaches search through the whole database, using very little preprocessing at the cost of large query time.  Prefix-tree approaches can be much faster, but they have very large exponential terms, so they are only efficient when the desired text is very similar (has a small edit distance).  Embedding approaches can overcome these large search costs regardless of the similarity of the desired text, but embeddings can only be used when the approximation factor is very large (in fact superconstant).


I will be presenting a new method that overcomes these issues: a locality-sensitive hashing approach.  This method uses a simple hash function, under which similar texts are more likely to hash together than dissimilar texts.   Using this hash function, we can answer queries using a hash table lookup, achieving good bounds for all approximation factors and moderate distances to the most similar database text.  I will also discuss some potential future directions of this approach, as well as some other recent work on related problems.