[Seqan-dev] Wildcard characters in the haystack?
- From: Johannes Junker <dr.kugelmehl@googlemail.com>
- To: seqan-dev@lists.fu-berlin.de
- Date: Wed, 20 Oct 2010 09:39:26 +0200
- Reply-to: SeqAn Development <seqan-dev@lists.fu-berlin.de>
Hi,
I was wondering whether SeqAn supports wildcard characters within the
haystack. As far as I understand from the documentation, a wildcard
search is only possible when the needle contains the wildcard
characters and is searched against some haystack. In the case below,
however, the haystack all_protein_sequences may contain ambiguous
characters (e.g. an X should match any amino acid letter in the
needle, a J should match only I and L, and so on), whereas the needles
themselves contain no ambiguous characters. With the current
implementation, protein sequences containing these wildcard characters
are not matched by their corresponding needles.
Is there some clever way to do this?
seqan::Finder<seqan::String<char> > finder(all_protein_sequences);
seqan::Pattern<seqan::StringSet<seqan::String<char> >, seqan::AhoCorasick> pattern(needle);

seqan::String<seqan::Pair<Size, Size> > pat_hits;
Map<Size, vector<Size> > peptide_to_indices;
writeDebug_("Finding peptide/protein matches...", 1);
while (find(finder, pattern))
{
  seqan::appendValue(pat_hits, seqan::Pair<Size, Size>(position(pattern), position(finder)));
  peptide_to_indices[position(pattern)].push_back(position(finder));
}
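
To make the intended matching semantics concrete, here is a rough
sketch in plain C++ without SeqAn. It is a naive scan, not Aho-Corasick,
and only illustrates which hits I would expect. The helper names
residueMatches and findAmbiguous are made up for illustration, the B
and Z rows are my assumption based on the usual amino acid ambiguity
codes (only X and J are described above), and the sketch reports begin
positions of hits.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: does haystack residue h accept needle residue n?
// Only the haystack side is treated as ambiguous; needle letters are literal.
bool residueMatches(char h, char n)
{
    switch (h)
    {
        case 'X': return true;                  // any amino acid
        case 'J': return n == 'I' || n == 'L';  // Ile or Leu
        case 'B': return n == 'D' || n == 'N';  // assumption: Asp or Asn
        case 'Z': return n == 'E' || n == 'Q';  // assumption: Glu or Gln
        default:  return h == n;                // unambiguous residue
    }
}

// Naive scan over all needles and all start positions.
// Returns (needle index, begin position in haystack) pairs.
std::vector<std::pair<std::size_t, std::size_t> >
findAmbiguous(const std::string& haystack, const std::vector<std::string>& needles)
{
    std::vector<std::pair<std::size_t, std::size_t> > hits;
    for (std::size_t k = 0; k < needles.size(); ++k)
    {
        const std::string& needle = needles[k];
        if (needle.empty() || needle.size() > haystack.size())
            continue;
        for (std::size_t pos = 0; pos + needle.size() <= haystack.size(); ++pos)
        {
            std::size_t i = 0;
            while (i < needle.size() && residueMatches(haystack[pos + i], needle[i]))
                ++i;
            if (i == needle.size())
                hits.push_back(std::make_pair(k, pos));
        }
    }
    return hits;
}
```

With the haystack "MKXJAL", for example, the needle "KAL" should be
reported at position 1, because X accepts A and J accepts L. This scan
is quadratic in the worst case, so it only documents the expected
behaviour; it is not a replacement for the multi-pattern search above.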
Thanks in advance!
Best,
Johannes