Re: [Seqan-dev] Wildcard characters in the haystack?


To cite David:

> The only searches that support wildcards are Shift-And for exact pattern matching and Myers for approximate matching. Both are single-pattern searches. To get a multi-pattern Aho-Corasick with wildcards, all bases would have to be enumerated at the X positions, which would blow up the string trie. To resolve this, identical paths could be merged; sounds like a BSc thesis.
> 
> David
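
Until something like that exists, one workaround is to skip the trie and verify each needle directly against the haystack with an ambiguity-aware comparison. The following is only a minimal sketch in plain standard C++, not SeqAn API: the names matchesAmbiguous and findWithAmbiguousHaystack are hypothetical, the ambiguity table assumes the usual amino acid codes (X = any residue, J = I or L, B = D or N, Z = E or Q), and the scan is brute force, so it costs roughly haystack length times needle length per needle.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Returns true if haystack symbol h is compatible with the unambiguous
// needle symbol n. Assumed ambiguity codes (not taken from SeqAn):
// X = any residue, J = I or L, B = D or N, Z = E or Q.
bool matchesAmbiguous(char h, char n)
{
    if (h == n)
        return true;
    switch (h)
    {
        case 'X': return true;
        case 'J': return n == 'I' || n == 'L';
        case 'B': return n == 'D' || n == 'N';
        case 'Z': return n == 'E' || n == 'Q';
        default:  return false;
    }
}

// Brute-force scan. Returns (needle index, begin position in haystack)
// for every window of the haystack that is compatible with a needle.
std::vector<std::pair<std::size_t, std::size_t>>
findWithAmbiguousHaystack(const std::string& haystack,
                          const std::vector<std::string>& needles)
{
    std::vector<std::pair<std::size_t, std::size_t>> hits;
    for (std::size_t k = 0; k < needles.size(); ++k)
    {
        const std::string& needle = needles[k];
        if (needle.empty() || needle.size() > haystack.size())
            continue;
        for (std::size_t pos = 0; pos + needle.size() <= haystack.size(); ++pos)
        {
            // Compare the window symbol by symbol, stopping at the first mismatch.
            std::size_t i = 0;
            while (i < needle.size() &&
                   matchesAmbiguous(haystack[pos + i], needle[i]))
                ++i;
            if (i == needle.size())
                hits.emplace_back(k, pos);
        }
    }
    return hits;
}
```

Calling findWithAmbiguousHaystack("MKXLJDE", {"KAL", "LID", "LLD"}) should report all three needles, since the X stands in for A and the J for either I or L. For many needles this will be much slower than Aho-Corasick, so it probably only makes sense as a second pass over the proteins that actually contain ambiguous characters.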

On 20.10.2010 at 09:39, Johannes Junker wrote:

> Hi,
> 
> I was just wondering if it is possible in SeqAn to use wildcard
> characters within the haystack. As far as I understood from the
> documentation, a wildcard search is only possible for a needle
> containing wildcard characters against some haystack. However, in the
> case below, the haystack all_protein_sequences may contain ambiguous
> characters (e.g. an X should match all possible amino acid letters in
> the needle, a J should match only I and L, and so on), whereas the
> needles themselves do not contain any ambiguous characters. In the
> current implementation, the protein sequences containing these
> wildcard characters are not matched with their corresponding needles.
> Is there some clever way to do this?
> 
> seqan::Finder<seqan::String<char> > finder(all_protein_sequences);
> seqan::Pattern<seqan::StringSet<seqan::String<char> >, seqan::AhoCorasick> pattern(needle);
> 
> seqan::String<seqan::Pair<Size, Size> > pat_hits;
> Map<Size, vector<Size> > peptide_to_indices;
> writeDebug_("Finding peptide/protein matches...", 1);
> while (find(finder, pattern))
> {
>     seqan::appendValue(pat_hits, seqan::Pair<Size, Size>(position(pattern), position(finder)));
>     peptide_to_indices[position(pattern)].push_back(position(finder));
> }
> 
> Thanks in advance!
> 
> Best,
> Johannes
> 
> _______________________________________________
> seqan-dev mailing list
> seqan-dev@lists.fu-berlin.de
> https://lists.fu-berlin.de/listinfo/seqan-dev