Re: [Seqan-dev] Wildcard characters in the haystack?
To quote David:
> The only searches that support wildcards are Shift-And for exact pattern matching and Myers for approximate matching. Both are single-pattern searches. For a multi-pattern Aho-Corasick search with wildcards, all bases would have to be enumerated at the X positions, which would blow up the trie. To avoid this, identical paths could be merged; that sounds like a BSc thesis.
>
> David
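
To illustrate the blow-up David mentions, here is a minimal sketch in plain C++ (no SeqAn) that enumerates every concrete sequence behind a string containing ambiguity codes. The table `kAmbiguity` and the functions `expandChar` and `enumerateSequences` are hypothetical names made up for this example, and the table assumes X stands for any of the 20 standard amino acids, J for I or L, B for D or N, and Z for E or Q.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical ambiguity table (not part of SeqAn): each ambiguous
// amino acid code maps to the concrete residues it may stand for.
static const std::map<char, std::string> kAmbiguity = {
    {'B', "DN"},
    {'Z', "EQ"},
    {'J', "IL"},
    {'X', "ACDEFGHIKLMNPQRSTVWY"},
};

// Returns the concrete residues that character `c` stands for.
// Unambiguous characters expand to themselves.
std::string expandChar(char c)
{
    auto it = kAmbiguity.find(c);
    return it == kAmbiguity.end() ? std::string(1, c) : it->second;
}

// Enumerates every concrete sequence encoded by `seq`.
// The number of results is the product of the alternatives at each
// position, so every additional X multiplies the output by 20.
std::vector<std::string> enumerateSequences(const std::string& seq)
{
    std::vector<std::string> result{""};
    for (char c : seq)
    {
        const std::string alts = expandChar(c);
        std::vector<std::string> next;
        next.reserve(result.size() * alts.size());
        for (const std::string& prefix : result)
            for (char a : alts)
                next.push_back(prefix + a);
        result.swap(next);
    }
    return result;
}
```

With this table, a sequence containing k X characters expands to at least 20^k strings, so enumeration is only feasible when ambiguous positions are rare and short sequences are involved.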
On 20.10.2010 at 09:39, Johannes Junker wrote:
> Hi,
>
> I was just wondering if it is possible in SeqAn to use wildcard
> characters within the haystack. As far as I understood from the
> documentation, a wildcard search is only possible for a needle
> containing wildcard characters against some haystack. However, in the
> case below, the haystack all_protein_sequences may contain ambiguous
> characters (e.g. an X should match all possible amino acid letters in
> the needle, a J should match only I and L, and so on), whereas the
> needles themselves do not contain any ambiguous characters. In the
> current implementation, the protein sequences containing these
> wildcard characters are not matched with their corresponding needles.
> Is there some clever way to do this?
>
> seqan::Finder<seqan::String<char> > finder(all_protein_sequences);
> seqan::Pattern<seqan::StringSet<seqan::String<char> >, seqan::AhoCorasick> pattern(needle);
>
> seqan::String<seqan::Pair<Size, Size> > pat_hits;
> Map<Size, vector<Size> > peptide_to_indices;
> writeDebug_("Finding peptide/protein matches...", 1);
> while (find(finder, pattern))
> {
>     seqan::appendValue(pat_hits, seqan::Pair<Size, Size>(position(pattern), position(finder)));
>     peptide_to_indices[position(pattern)].push_back(position(finder));
> }
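
As a possible fallback for the case described above (ambiguity codes in the haystack, plain needles), here is a brute-force sketch in plain C++ that does not use SeqAn or Aho-Corasick at all. The function names `haystackCharMatches` and `findWithAmbiguousHaystack` are hypothetical, the code table (X = any residue, J = I or L, B = D or N, Z = E or Q) is an assumption, and the reported position is the begin of the match in the haystack, which may differ from what `position(finder)` returns.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Returns true if haystack character `h` may stand for needle residue `n`.
// Assumed codes: X = any residue, J = I or L, B = D or N, Z = E or Q.
bool haystackCharMatches(char h, char n)
{
    switch (h)
    {
        case 'X': return true;
        case 'J': return n == 'I' || n == 'L';
        case 'B': return n == 'D' || n == 'N';
        case 'Z': return n == 'E' || n == 'Q';
        default:  return h == n;
    }
}

// Reports (needle index, begin position in haystack) for every occurrence.
// Brute force: the running time grows with the haystack length times the
// total needle length, so this is a baseline, not a replacement for
// a real multi-pattern algorithm.
std::vector<std::pair<std::size_t, std::size_t>>
findWithAmbiguousHaystack(const std::string& haystack,
                          const std::vector<std::string>& needles)
{
    std::vector<std::pair<std::size_t, std::size_t>> hits;
    for (std::size_t i = 0; i < needles.size(); ++i)
    {
        const std::string& needle = needles[i];
        if (needle.empty() || needle.size() > haystack.size())
            continue;  // nothing to report for this needle
        for (std::size_t pos = 0; pos + needle.size() <= haystack.size(); ++pos)
        {
            // Compare character by character, honoring haystack ambiguity.
            std::size_t k = 0;
            while (k < needle.size() &&
                   haystackCharMatches(haystack[pos + k], needle[k]))
                ++k;
            if (k == needle.size())
                hits.emplace_back(i, pos);
        }
    }
    return hits;
}
```

Hits come out grouped by needle index and, within one needle, in increasing haystack position. If the ambiguous positions are rare, it may be cheaper to run the regular Aho-Corasick search first and use a scan like this only on the proteins that contain ambiguity codes.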
>
> Thanks in advance!
>
> Best,
> Johannes