Re: [Seqan-dev] Approximate Searching in set of strings

Tilo Eißler <eissler@in.tum.de> · Thu, 27 May 2010 10:21:20 +0200

Hi Manuel,

thank you very much for your fast reply.

I'm looking for a replacement for a proprietary search index (based on
suffix tries) which allows for approximate searching. Thus at the moment
I don't know how the approximate searching is implemented exactly and
therefore about its complexity.

I'm not to deep into search index theory so I've asked to not miss
something. Your answer confirms what I expected.
I will toy around a little bit more taking your proposals into account
and get back if I've got further questions.

Thanks again and best wishes,

Tilo

Am 26.05.2010 17:50, schrieb Holtgrewe, Manuel:
> Hi Tilo,
> 
> as far as I know, there is no way to do non-heuristic approximate string searching using indices, i.e. something like O(1) lookups. What exactly are you looking for and trying to solve?
> 
> 
> SeqAn has implementations of various online algorithms for approximate string searching, e.g. see [1]. Notably, Myer's Bitvector algorithm is very fast for searching single strings with a length up to the machine word width in large strings. Going beyond this limit makes things a bit slower since bits have to migrate over machine words.
> 
> Pex, AbndmAlgo and DPSearch also allow for edit distance, DPSearch even allows linear gap and arbitrary substitution scores.
> 
> 
> If you are looking for read mapping, i.e. many short reads to be found with Hamming/edit distance in long strings, have a look at RazerS, our read mapping tool. It uses the SWIFT algorithm to first find regions that overlap with epsilon-matches and then uses Myer's Bitvector algorithm or a naive Hamming search to look for the exact positions.
> 
> 
> Another possible approach is building a q-gram index of your long string (haystack) and then searching for the q-grams of your short string (needle) in the haystack. You can look at the LAGAN demo to see an application of q-gram indices and seeds in SeqAn. FASTA, BLAST, bowtie and bwa are codes that do kinds of "approximate string searching" and could be implemented with the infrastructure of the SeqAn library.
> 
> 
> People around here use Visual Studio, Emacs, XCode and occasionally vim. I have never seen Visual Studio really make sense out of SeqAn style C++, XCode seems to infer symbol names that make sense (which Visual Studio should be able, too). I have not tried to get syntax completion running in either Emacs or XCode.
> 
> HTH,
> Manuel
> 
> [1] http://www.seqan.de/dddoc/html_devel/CLASS_Pattern.html
> 
> 
> _______________________________________________
> seqan-dev mailing list
> seqan-dev@lists.fu-berlin.de
> https://lists.fu-berlin.de/listinfo/seqan-dev

-- 

Dipl.-Inf. Tilo Eißler
Technische Universität München   |tel.   +49-89-289-19478
Institut für Informatik, I10     |
Boltzmannstr. 3                  |Office 02.13.059
85748 Garching b. München        |email  eissler@in.tum.de