Universal hitting sets and minimizers
Handling the flood of deep DNA sequencing data calls for better algorithms. Minimizers are a central recent paradigm that has improved many sequence analysis tasks. We present an alternative paradigm that leads to substantial further improvement in such tasks. For integers k and L>k, we say that a set of k-mers is a universal hitting set (UHS) if every possible L-long sequence must contain a k-mer from the set. We present a heuristic called DOCKS to find a compact UHS.
We show that DOCKS works well in practice and produces UHSs that are very close to a theoretical lower bound. We present results for various values of k and L and by applying them to real genomes show that UHSs indeed offers substantial saving compared to minimizers. In particular, DOCKS uses less than 30% of the 10-mers needed to span the human genome compared to minimizers.
Joint work with Yaron Orenstein (BGU), David Pellow (TAU), Guillaume Marcais, Daniel Bork and Carl Kingsford (CMU)