From eissler@in.tum.de Wed Oct 20 11:58:46 2010 Received: from relay1.zedat.fu-berlin.de ([130.133.4.67]) by list1.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1P8VRV-0003gf-BT>; Wed, 20 Oct 2010 11:58:45 +0200 Received: from mail-out1.informatik.tu-muenchen.de ([131.159.0.8]) by relay1.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1P8VRV-0006On-6A>; Wed, 20 Oct 2010 11:58:45 +0200 Received: from [131.159.35.14] (unknown [131.159.35.14]) (using TLSv1 with cipher AES256-SHA (256/256 bits)) (No client certificate requested) by mail.in.tum.de (Postfix) with ESMTP id 84B26CB31 for ; Wed, 20 Oct 2010 11:58:44 +0200 (CEST) Message-ID: <4CBEBD53.5030805@in.tum.de> Date: Wed, 20 Oct 2010 11:58:43 +0200 From: =?ISO-8859-15?Q?Tilo_Ei=DFler?= User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; de-DE; rv:1.9.1.12) Gecko/20100915 Lightning/1.0b1 Thunderbird/3.0.8 MIME-Version: 1.0 To: Seqan developer list Content-Type: multipart/mixed; boundary="------------030702040008000906020607" X-Virus-Scanned: ClamAV using ClamSMTP X-Originating-IP: 131.159.0.8 X-ZEDAT-Hint: A X-purgate: clean X-purgate-type: clean X-purgate-ID: 151147::1287568725-00000C0F-0986B373/0-0/0-0 X-Bogosity: Ham, tests=bogofilter, spamicity=0.001580, version=1.2.2 X-Spam-Flag: NO X-Spam-Checker-Version: SpamAssassin 3.0.4 on Benin.ZEDAT.-Berlin.DE X-Spam-Level: X-Spam-Status: No, score=0.8 required=5.0 tests=RATWARE_GECKO_BUILD Subject: [Seqan-dev] SearchIndex based on external StringSet X-BeenThere: seqan-dev@lists.fu-berlin.de X-Mailman-Version: 2.1.11 Precedence: list Reply-To: SeqAn Development List-Id: SeqAn Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 20 Oct 2010 09:58:46 -0000 This is a multi-part message in MIME format. 
--------------030702040008000906020607
Content-Type: text/plain; charset=ISO-8859-15
Content-Transfer-Encoding: 7bit

Hello,

as the subject says, I'm currently trying to build a SearchIndex based on an external StringSet using the current trunk version of SeqAn.

I started with the "Index Finder StringSet" example program. Then I changed the String type to the External specialisation, which results in the following declaration for the StringSet:

StringSet < String<String<char>, External<> > > mySet;

Afterwards I resized the set and appended values to each entry of the set using the append function.

Building the index and a finder compiles, but searching with the finder and a pattern does not work; more precisely, the compilation fails. I've attached my source code.
I'm confused by the compiler error message, so I'd be grateful if someone could take a look at it and help me :-)
Is there a major flaw in my thought process? Or is it possible to do what I'm trying?

Another (related) topic:

To my knowledge the default index type is the enhanced suffix array. I've read that the build process can use external memory. Is this done by default, or do I need to provide an extra specialisation to achieve this? The resulting suffix array lies in main memory, right?
Thank you very much for your help and best regards,
Tilo

--------------030702040008000906020607
Content-Type: text/x-c++src; name="seqan_ext_stringset_esa.cpp"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment; filename="seqan_ext_stringset_esa.cpp"

#include <iostream>
#include <seqan/index.h>

using namespace seqan;

int main()
{
    StringSet < String<String<char>, External<> > > mySet;
    resize(mySet, 3);
    append(mySet[0], "aggtttccggNNNtagcgcttaa");
    append(mySet[1], "attgctgcNaatgtgctaa");
    append(mySet[2], "atggctgcNaccatgtgctatta");

    Index< StringSet < String<String<char>, External<> > > > myIndex(mySet);
    Finder< Index< StringSet < String<String<char>, External<> > > > > myFinder(myIndex);

    Pattern< String<String<char>, External<> > > pat = "agg";
    // or can I simply use the following as usual?
    // Pattern< String<char> > pat = "agg";

    std::cout << "hit at ";
    while (find(myFinder, pat))
        std::cout << position(myFinder) << " ";
    std::cout << ::std::endl;

    return 0;
}

--------------030702040008000906020607--

From manuel.holtgrewe@fu-berlin.de Wed Oct 20 12:48:43 2010 Received: from outpost1.zedat.fu-berlin.de ([130.133.4.66]) by list1.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1P8WDq-0005mr-Ez>; Wed, 20 Oct 2010 12:48:42 +0200 Received: from relay2.zedat.fu-berlin.de ([130.133.4.80]) by outpost1.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1P8WDq-00048p-Ca>; Wed, 20 Oct 2010 12:48:42 +0200 Received: from exchange6.fu-berlin.de ([160.45.9.133]) by relay2.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1P8WDq-0004US-Az>; Wed, 20 Oct 2010 12:48:42 +0200 Received: from exchange6.fu-berlin.de ([160.45.9.133]) by exchange6.fu-berlin.de ([160.45.9.133]) with mapi; Wed, 20 Oct 2010 12:48:41 +0200 From: "Holtgrewe, Manuel" To: SeqAn Development Date: Wed, 20 Oct 2010 12:48:41 +0200 Thread-Topic: [Seqan-dev] SearchIndex based on external StringSet Thread-Index: ActwRFwwdxb5LIZxQKS0RyUydkFlGQ== Message-ID: 
<942A07C3-EAFC-41D8-B5AA-260B46BAAFAB@fu-berlin.de> References: <4CBEBD53.5030805@in.tum.de> In-Reply-To: <4CBEBD53.5030805@in.tum.de> Accept-Language: en-US, de-DE Content-Language: en-US X-MS-Has-Attach: yes X-MS-TNEF-Correlator: acceptlanguage: en-US, de-DE Content-Type: multipart/mixed; boundary="_002_942A07C3EAFC41D8B5AA260B46BAAFABfuberlinde_" MIME-Version: 1.0 X-Originating-IP: 160.45.9.133 X-ZEDAT-Hint: A X-purgate: clean X-purgate-type: clean X-purgate-ID: 151147::1287571722-00000C0F-5B8F9463/0-0/0-0 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.2 X-Spam-Flag: NO X-Spam-Checker-Version: SpamAssassin 3.0.4 on Benin.ZEDAT.-Berlin.DE X-Spam-Level: X-Spam-Status: No, score=-2.8 required=5.0 tests=ALL_TRUSTED Subject: Re: [Seqan-dev] SearchIndex based on external StringSet X-BeenThere: seqan-dev@lists.fu-berlin.de X-Mailman-Version: 2.1.11 Precedence: list Reply-To: SeqAn Development List-Id: SeqAn Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 20 Oct 2010 10:48:43 -0000

--_002_942A07C3EAFC41D8B5AA260B46BAAFABfuberlinde_
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

On 20.10.2010 at 11:58, Tilo Eißler wrote:

> Hello,
>
> as the subject denotes I'm currently trying to build a SearchIndex based
> on an external StringSet using the actual trunk version of seqan.
>
> I started with the "Index Finder StringSet" example program. Then I've
> altered the String type to the external specialisation which results in
> the following term for the StringSet:
>
> StringSet < String<String<char>, External<> > > mySet;
>
> Afterwards I've resized the set and appended values to each entry of the
> set using the append function.
>
> Building the index and a finder at least compiles as well, but trying to
> search the finder using a pattern does not work, or more precisely, the
> compilation fails. I've attached my sourcecode.
> I'm getting confused with the compiler error message, so I'm asking if
> someone is kind and takes a look at it to help me :-)
> Is there a major flaw in my thought process? Or is it possible to do
> what I'm trying?

1) The type in the Pattern is the type of the string you are searching for (the needle). It can differ from the type of the string you build the index on. The specializations of the string classes are completely independent; only the alphabets have to be compatible (for certain definitions of "compatible" ;).

So, first of all, rather use:

Pattern< String<char> > pat = "agg";

2) Regarding the Strings and the StringSet: I think the specializations were not what you wanted.

External Strings are used like this:

String<char, External<> >

I.e. String<char, External<> > for an external string and not String<***String<char>***, External<> > (the *** mark the wrong part).

Typedefs often help to make the code more readable. The attached program is more readable and should probably do what you want.

> Another (related) topic:
>
> To my knowledge the default index type is the enhanced suffix array.
> I've read that the build process can use external memory. Is this done
> by default or do I need to provide an extra specialisation to achieve this?
> The resulting suffix array lies in main memory, right?

I think this depends on the algorithm you use for building the SA. If I remember correctly, at least the Skew algorithms have an external but no internal implementation. However, this should not matter greatly, since for small indices the kernel will not write the buffers to disk anyway.

I guess it is not documented yet, since I could not find it in the documentation.
index_base.h has the following:

// suffix array construction specs
struct Skew3;
struct Skew7;
struct LarssonSadakane;
struct ManberMyers;
struct SAQSort;
struct QGram_Alg;

// lcp table construction algorithms
struct Kasai;
struct KasaiOriginal; // original, but more space-consuming algorithm

// enhanced suffix array construction algorithms
struct ChildTab;
struct BWT;

David would be the right one to explain this in more detail and eventually document this.

*m

--_002_942A07C3EAFC41D8B5AA260B46BAAFABfuberlinde_
Content-Type: application/octet-stream; name="seqan_ext_stringset_esa.cpp"
Content-Description: seqan_ext_stringset_esa.cpp
Content-Disposition: attachment; filename="seqan_ext_stringset_esa.cpp"; size=818; creation-date="Wed, 20 Oct 2010 12:48:41 GMT"; modification-date="Wed, 20 Oct 2010 12:48:41 GMT"
Content-Transfer-Encoding: 7bit

#include <iostream>

#include <seqan/index.h>

using namespace seqan;

int main() {
    typedef String<char, External<> > TExternalCharString;
    typedef StringSet<TExternalCharString> TStringSet;
    typedef Index<TStringSet> TIndex;

    TStringSet mySet;
    resize(mySet, 3);
    append(mySet[0], "aggtttccggNNNtagcgcttaa");
    append(mySet[1], "attgctgcNaatgtgctaa");
    append(mySet[2], "atggctgcNaccatgtgctatta");

    TIndex myIndex(mySet);
    Finder<TIndex> myFinder(myIndex);

    Pattern<CharString> pat = "agg";
    // or can I simply use the following as usual?
    // Pattern< String<char> > pat = "agg";

    std::cout << "hit at ";
    while (find(myFinder, pat))
        std::cout << position(myFinder) << "  ";
    std::cout << ::std::endl;

    return 0;
}
--_002_942A07C3EAFC41D8B5AA260B46BAAFABfuberlinde_-- From eissler@in.tum.de Wed Oct 20 14:17:47 2010 Received: from relay1.zedat.fu-berlin.de ([130.133.4.67]) by list1.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1P8Xc2-0000zG-MU>; Wed, 20 Oct 2010 14:17:46 +0200 Received: from mail-out1.informatik.tu-muenchen.de ([131.159.0.8]) by relay1.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1P8Xc2-0007q0-GZ>; Wed, 20 Oct 2010 14:17:46 +0200 Received: from [131.159.35.14] (unknown [131.159.35.14]) (using TLSv1 with cipher AES256-SHA (256/256 bits)) (No client certificate requested) by mail.in.tum.de (Postfix) with ESMTP id CB05D36539 for ; Wed, 20 Oct 2010 14:17:45 +0200 (CEST) Message-ID: <4CBEDDE6.60402@in.tum.de> Date: Wed, 20 Oct 2010 14:17:42 +0200 From: =?ISO-8859-1?Q?Tilo_Ei=DFler?= User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; de-DE; rv:1.9.1.12) Gecko/20100915 Lightning/1.0b1 Thunderbird/3.0.8 MIME-Version: 1.0 To: SeqAn Development References: <4CBEBD53.5030805@in.tum.de> <942A07C3-EAFC-41D8-B5AA-260B46BAAFAB@fu-berlin.de> In-Reply-To: <942A07C3-EAFC-41D8-B5AA-260B46BAAFAB@fu-berlin.de> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit X-Virus-Scanned: ClamAV using ClamSMTP X-Originating-IP: 131.159.0.8 X-purgate: clean X-purgate-type: clean X-purgate-ID: 151147::1287577066-00000C0F-32399047/0-0/0-0 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000145, version=1.2.2 X-Spam-Flag: NO X-Spam-Checker-Version: SpamAssassin 3.0.4 on Benin.ZEDAT.-Berlin.DE X-Spam-Level: X-Spam-Status: No, score=0.8 required=5.0 tests=RATWARE_GECKO_BUILD Subject: Re: [Seqan-dev] SearchIndex based on external StringSet X-BeenThere: seqan-dev@lists.fu-berlin.de X-Mailman-Version: 2.1.11 Precedence: list Reply-To: SeqAn Development List-Id: SeqAn Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 20 
Oct 2010 12:17:47 -0000

Hello again,

> 1) The type in the Pattern is the String you want to search. It can differ from the type of the string you want to build an index on. The specializations of the string classes are completely independent, the alphabets have to be compatible (for certain definitions of "compatible" ;).
>
> So, first of all rather use:
>
> Pattern< String<char> > pat = "agg";

OK, my first Pattern variant looks odd anyway ;-)

> 2) To the Strings, StringSet: I think the specializations were not what you wanted.
>
> External Strings are used like this:
>
> String<char, External<> >
>
> I.e. String<char, External<> > for an external string and not String<***String<char>***, External<> > (the *** are there for marking the wrong part).
>
> Typedefs often help to make the code more readable. The attached program is more readable and should probably do what you want.

Hmm, I don't remember how I arrived at my external string. Now I've got it; thank you very much for the corrected version.

>> Another (related) topic:
>>
>> To my knowledge the default index type is the enhanced suffix array.
>> I've read that the build process can use external memory. Is this done
>> by default or do I need to provide an extra specialisation to achieve this?
>> The resulting suffix array lies in main memory, right?
>
> I think this depends on the algorithm you use for building the SA. If I remember correctly, at least the Skew algorithms have external but no internal implementation, however this should not matter greatly since for small indices, the kernel will not write buffers to the disk anyway.

Right, for small indices it doesn't matter, but it may be of interest for large sequences or sets of sequences, or on computers with little main memory. So it matters when building an application that has to handle widely varying amounts of input data.

> I guess it is not documented yet since I could not find it in the documentation.
> index_base.h has the following:
>
> // suffix array construction specs
> struct Skew3;
> struct Skew7;
> struct LarssonSadakane;
> struct ManberMyers;
> struct SAQSort;
> struct QGram_Alg;
>
> // lcp table construction algorithms
> struct Kasai;
> struct KasaiOriginal; // original, but more space-consuming algorithm
>
> // enhanced suffix array construction algorithms
> struct ChildTab;
> struct BWT;
>
> David would be the right one to explain this in more detail and eventually document this.

I came to my question because I read in the dissertation about SeqAn that there are ESA construction algorithms that use external storage, but I haven't been able to find any hint in the documentation as to whether this is done by default.

I'll keep experimenting a bit, but I appreciate any further hints as well :-)

Thanks again and best regards
Tilo

From manuel.holtgrewe@fu-berlin.de Wed Oct 20 14:55:10 2010 Received: from outpost1.zedat.fu-berlin.de ([130.133.4.66]) by list1.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1P8YCD-0002SX-SD>; Wed, 20 Oct 2010 14:55:09 +0200 Received: from relay2.zedat.fu-berlin.de ([130.133.4.80]) by outpost1.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1P8YCD-0003c9-Ol>; Wed, 20 Oct 2010 14:55:09 +0200 Received: from exchange6.fu-berlin.de ([160.45.9.133]) by relay2.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1P8YCC-00024V-VI>; Wed, 20 Oct 2010 14:55:09 +0200 Received: from exchange6.fu-berlin.de ([160.45.9.133]) by exchange6.fu-berlin.de ([160.45.9.133]) with mapi; Wed, 20 Oct 2010 14:55:08 +0200 From: "Holtgrewe, Manuel" To: SeqAn Development Date: Wed, 20 Oct 2010 14:55:07 +0200 Thread-Topic: [Seqan-dev] SearchIndex based on external StringSet Thread-Index: ActwVgX5RIzybT/IS/6/wjVz3T1wGQ== Message-ID: <8B4875D9-021E-4CB7-A6FE-522C00008045@fu-berlin.de> References: 
<4CBEBD53.5030805@in.tum.de> <942A07C3-EAFC-41D8-B5AA-260B46BAAFAB@fu-berlin.de> <4CBEDDE6.60402@in.tum.de> In-Reply-To: <4CBEDDE6.60402@in.tum.de> Accept-Language: en-US, de-DE Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: acceptlanguage: en-US, de-DE Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 X-Originating-IP: 160.45.9.133 X-purgate: clean X-purgate-type: clean X-purgate-ID: 151147::1287579309-00000C0F-FD7F08B7/0-0/0-0 X-Bogosity: Ham, tests=bogofilter, spamicity=0.294253, version=1.2.2 X-Spam-Flag: NO X-Spam-Checker-Version: SpamAssassin 3.0.4 on Burundi.ZEDAT.FU-Berlin.DE X-Spam-Level: X-Spam-Status: No, score=-2.8 required=5.0 tests=ALL_TRUSTED Subject: Re: [Seqan-dev] SearchIndex based on external StringSet X-BeenThere: seqan-dev@lists.fu-berlin.de X-Mailman-Version: 2.1.11 Precedence: list Reply-To: SeqAn Development List-Id: SeqAn Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 20 Oct 2010 12:55:11 -0000 I'm pretty sure the implementation is done externally by default. 
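
As a recap of the declaration issue discussed in this thread: the first template argument of a string is its element type, and the second one selects the storage. The toy below is only a sketch of that idea and is not SeqAn code; `String`, `Alloc`, `External`, `Value` and `holdsChars` are simplified, hypothetical stand-ins chosen for illustration. It shows why nesting a string as the first argument yields a string of strings rather than an external string of chars.

```cpp
#include <type_traits>

// Simplified stand-ins for illustration only; these are NOT the SeqAn classes.
struct Alloc {};     // hypothetical tag: elements kept in main memory
struct External {};  // hypothetical tag: elements kept in external storage

// First parameter: element (value) type. Second parameter: storage tag.
template <typename TValue, typename TSpec = Alloc>
struct String
{
    typedef TValue Value;
    typedef TSpec Spec;
};

// Intended type: an external string whose elements are chars.
typedef String<char, External> TExternalCharString;

// Type from the first attempt: an external string whose elements are
// themselves in-memory strings of char.
typedef String<String<char>, External> TExternalStringOfStrings;

// True only if the element type of TString is a plain char.
template <typename TString>
constexpr bool holdsChars()
{
    return std::is_same<typename TString::Value, char>::value;
}

// Checked at compile time: only the first typedef is a string of chars.
static_assert(holdsChars<TExternalCharString>(),
              "elements of the intended type are chars");
static_assert(!holdsChars<TExternalStringOfStrings>(),
              "elements of the nested type are strings, not chars");
```

The snippet has no main function; compiling it as a translation unit with a C++11 compiler is enough to evaluate the static assertions. A set built from the nested type would therefore be a set of strings of strings, which is presumably why a char-based pattern no longer fits it.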
From dr.kugelmehl@googlemail.com Wed Oct 20 09:39:28 2010 Received: from relay1.zedat.fu-berlin.de ([130.133.4.67]) by list1.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1P8TGh-000622-8Q>; Wed, 20 Oct 2010 09:39:27 +0200 Received: from mail-qy0-f175.google.com ([209.85.216.175]) by relay1.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1P8TGh-0003B1-1Q>; Wed, 20 Oct 2010 09:39:27 +0200 Received: by qyk10 with SMTP id 10so1107047qyk.13 for ; Wed, 20 Oct 2010 00:39:26 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=googlemail.com; s=gamma; h=domainkey-signature:mime-version:received:received:date:message-id :subject:from:to:content-type; bh=c+y1UQbTc6fzrNVx2YcrKn886nPPykt75qMc+csMI34=; b=j2aGrJoGgTC+jWqCL6TCTYO4Coow6t4gDO5a77KostMxe55s16qyO/+OxlHb3I0BSb RJVkgd+5/1EzEuz9wQaB7hlYzxMrnDNvgIFfGkiNEnXE4kWWhJoHKtBIcz8u/ff+3/I6 Rsim6srh4JzBUBzVssNIYsUiChMZkXBY4CBBw= DomainKey-Signature: a=rsa-sha1; c=nofws; d=googlemail.com; s=gamma; h=mime-version:date:message-id:subject:from:to:content-type; b=mhvejgXfYjt1LCgfIECei2w2WDgYFMuK+gH5Eqw/7wTmoCNdvvDnyKUNQzTc0rzMWE 8svr/qfongAHD51N6PNoe0rMR8D4qjYBTQub2c7+/ZNiNA0qFBBA8YM89wiAWFzLc4p2 1Zu2C3NwZ5p/Nj588Qx/vn0ym6JvfNnJiP1/8= MIME-Version: 1.0 Received: by 10.224.127.140 with SMTP id g12mr5585833qas.349.1287560366163; Wed, 20 Oct 2010 00:39:26 -0700 (PDT) Received: by 10.229.43.215 with HTTP; Wed, 20 Oct 2010 00:39:26 -0700 (PDT) Date: Wed, 20 Oct 2010 09:39:26 +0200 Message-ID: From: Johannes Junker To: seqan-dev@lists.fu-berlin.de Content-Type: text/plain; charset=ISO-8859-1 X-Originating-IP: 209.85.216.175 X-purgate: clean X-purgate-type: clean X-purgate-ID: 151147::1287560367-00000C0F-6C8DDC60/0-0/0-0 X-Bogosity: Ham, tests=bogofilter, spamicity=0.070305, version=1.2.2 X-Spam-Flag: NO X-Spam-Checker-Version: SpamAssassin 3.0.4 on Dschibuti.ZEDAT.FU-Berlin.DE X-Spam-Level: x X-Spam-Status: No, score=1.4 
required=5.0 tests=DNS_FROM_RFC_POST, RCVD_BY_IP, SPF_HELO_PASS,SPF_PASS X-Mailman-Approved-At: Fri, 22 Oct 2010 13:39:53 +0200 Subject: [Seqan-dev] Wildcard characters in the haystack? X-BeenThere: seqan-dev@lists.fu-berlin.de X-Mailman-Version: 2.1.11 Precedence: list Reply-To: SeqAn Development List-Id: SeqAn Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 20 Oct 2010 07:39:28 -0000

Hi,

I was just wondering if it is possible in SeqAn to use wildcard characters within the haystack. As far as I understood from the documentation, a wildcard search is only possible for a needle containing wildcard characters against some haystack. However, in the case below, the haystack all_protein_sequences may contain ambiguous characters (e.g. an X should match all possible amino acid letters in the needle, a J should match only I and L, and so on), whereas the needles themselves do not contain any ambiguous characters. In the current implementation, the protein sequences containing these wildcard characters are not matched with their corresponding needles. Is there some clever way to do this?

seqan::Finder > finder(all_protein_sequences);
seqan::Pattern >, seqan::AhoCorasick > pattern(needle);

seqan::String > pat_hits;
Map > peptide_to_indices;
writeDebug_("Finding peptide/protein matches...", 1);
while (find(finder, pattern))
{
    seqan::appendValue(pat_hits, seqan::Pair(position(pattern), position(finder)));
    peptide_to_indices[position(pattern)].push_back(position(finder));
}

Thanks in advance!
Best, Johannes From manuel.holtgrewe@fu-berlin.de Fri Oct 22 13:40:57 2010 Received: from outpost1.zedat.fu-berlin.de ([130.133.4.66]) by list1.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1P9FzU-0002P9-AI>; Fri, 22 Oct 2010 13:40:56 +0200 Received: from relay2.zedat.fu-berlin.de ([130.133.4.80]) by outpost1.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1P9FzU-0007kL-8U>; Fri, 22 Oct 2010 13:40:56 +0200 Received: from exchange6.fu-berlin.de ([160.45.9.133]) by relay2.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1P9FzU-00061V-7n>; Fri, 22 Oct 2010 13:40:56 +0200 Received: from exchange6.fu-berlin.de ([160.45.9.133]) by exchange6.fu-berlin.de ([160.45.9.133]) with mapi; Fri, 22 Oct 2010 13:40:55 +0200 From: "Holtgrewe, Manuel" To: SeqAn Development Date: Fri, 22 Oct 2010 13:40:54 +0200 Thread-Topic: [Seqan-dev] Wildcard characters in the haystack? Thread-Index: Actx3fzGOcI3GmqrT5ymMYaYjMCeDw== Message-ID: References: In-Reply-To: Accept-Language: en-US, de-DE Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: acceptlanguage: en-US, de-DE Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 X-Originating-IP: 160.45.9.133 X-purgate: clean X-purgate-type: clean X-purgate-ID: 151147::1287747656-00000C0F-5C9B59E6/0-0/0-0 X-Bogosity: Unsure, tests=bogofilter, spamicity=0.496957, version=1.2.2 X-Spam-Flag: NO X-Spam-Checker-Version: SpamAssassin 3.0.4 on Benin.ZEDAT.-Berlin.DE X-Spam-Level: X-Spam-Status: No, score=-1.3 required=5.0 tests=ALL_TRUSTED,FU_BOGO_UNSURE Subject: Re: [Seqan-dev] Wildcard characters in the haystack? 
X-BeenThere: seqan-dev@lists.fu-berlin.de X-Mailman-Version: 2.1.11 Precedence: list Reply-To: SeqAn Development List-Id: SeqAn Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 22 Oct 2010 11:40:57 -0000

To cite David:

> the only search that supports wildcards is Shift-And for exact pattern matching or Myers for approximate matching. Both are single pattern searches. To have a multi-pattern Aho-Corasick with wildcards, all bases have to be enumerated at X-positions which would blow the string trie up. To resolve this, identical paths could be merged, sounds like a BSc. thesis.
>
> David

On 20.10.2010 at 09:39, Johannes Junker wrote:

> Hi,
>
> I was just wondering if it is possible in seqan to use wildcard
> characters within the haystack. As far as I understood from the
> documentation, a wildcard search is only possible for a needle
> containing wildcard characters against some haystack. However, in the
> case below, the haystack all_protein_sequences may contain ambiguous
> characters (e.g. an X should match all possible amino acid letters in
> the needle, a J should match only I and L, and so on), whereas the
> needles themselves do not contain any ambiguous characters. In the
> current implementation, the protein sequences containing these
> wildcard characters are not matched with their corresponding needles.
> Is there some clever way to do this?
>
> seqan::Finder > finder(all_protein_sequences);
> seqan::Pattern >, seqan::AhoCorasick > pattern(needle);
>
> seqan::String > pat_hits;
> Map > peptide_to_indices;
> writeDebug_("Finding peptide/protein matches...", 1);
> while (find(finder, pattern))
> {
>     seqan::appendValue(pat_hits, seqan::Pair Size>(position(pattern), position(finder)));
>     peptide_to_indices[position(pattern)].push_back(position(finder));
> }
>
> Thanks in advance!
>
> Best,
> Johannes
>
> _______________________________________________
> seqan-dev mailing list
> seqan-dev@lists.fu-berlin.de
> https://lists.fu-berlin.de/listinfo/seqan-dev

From hauswedell@mi.fu-berlin.de Thu Oct 28 16:00:41 2010 Received: from relay1.zedat.fu-berlin.de ([130.133.4.67]) by list1.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1PBT20-0006Hc-GL>; Thu, 28 Oct 2010 16:00:40 +0200 Received: from einhorn.in-berlin.de ([192.109.42.8]) by relay1.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1PBT20-0006SL-Dj>; Thu, 28 Oct 2010 16:00:40 +0200 X-Envelope-From: hauswedell@mi.fu-berlin.de X-Envelope-To: Received: from fbsdlap.soldiner.lan (dslb-088-075-073-142.pools.arcor-ip.net [88.75.73.142]) (authenticated bits=0) by einhorn.in-berlin.de (8.13.6/8.13.6/Debian-1) with ESMTP id o9SE0dhh032271 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NOT) for ; Thu, 28 Oct 2010 16:00:40 +0200 From: Hannes Hauswedell To: seqan-dev@lists.fu-berlin.de Date: Thu, 28 Oct 2010 16:00:31 +0200 User-Agent: KMail/1.13.5 (FreeBSD/8.0-STABLE; KDE/4.5.1; i386; ; ) MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Message-Id: <201010281600.32237.hauswedell@mi.fu-berlin.de> X-Scanned-By: MIMEDefang_at_IN-Berlin_e.V. 
on 192.109.42.8 X-Originating-IP: 192.109.42.8 X-purgate: clean X-purgate-type: clean X-purgate-ID: 151147::1288274440-00000C0F-F8ADE12A/0-0/0-0 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.2 X-Spam-Flag: NO X-Spam-Checker-Version: SpamAssassin 3.0.4 on Dschibuti.ZEDAT.-Berlin.DE X-Spam-Level: X-Spam-Status: No, score=0.5 required=5.0 tests=FORGED_RCVD_HELO,OPTING_OUT Subject: [Seqan-dev] IRC-Channel X-BeenThere: seqan-dev@lists.fu-berlin.de X-Mailman-Version: 2.1.11 Precedence: list Reply-To: SeqAn Development List-Id: SeqAn Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 28 Oct 2010 14:00:41 -0000

Hi everyone,

I wanted to ask if there is an internal IRC channel or something similar, or if something like that is planned. While this mailing list is not overburdened with traffic right now, a channel might be better suited for asking small questions that don't require pasting code [1], and people can opt out simply by not being in the channel all the time…

Thanks!

Regards,
Hannes

[1] Things like "where was this again?", or "I am trying to do something this way; before I spend 3 days on it, can you tell me if I am on the right track?"