From danielpeterjames@gmail.com Tue Jul 05 16:11:31 2011
From: Daniel James
Date: Tue, 5 Jul 2011 15:11:28 +0100
To: SeqAn Development
Subject: [Seqan-dev] Building QGramSA fibre from StringSet

Hi

I'm running into some unexpected behaviour whilst trying to use an SA fibre of a QGram index that's built over a string set.

The code below runs OK with 1 or 10 as the command line argument (the number of strings in the string set), but fails at 100 with the following exception:

/Users/dj5/usr/local/include/seqan/sequence/string_base.h:238 Assertion failed : static_cast(pos) < static_cast(length(me)) was: 171599218 >= 100 (Trying to access an element behind the last one!)

Have I made a coding blunder or is this a bug?

Many thanks,

Daniel

#include
#include
#include
#include
#include
#include
#include

using namespace std;
using namespace seqan;

// Generates random nucleotides.
struct MyGenerator : public unary_function {
    string syms;
    MyGenerator (string syms = "ACGT") : syms(syms) { srand(time(NULL)); }
    char operator()(void) { return syms[rand() % syms.size()]; }
};

int main(int argc, char** argv)
{
    // Input the number of sequences for the string set.
    stringstream ss(argv[1]);
    unsigned n;
    ss >> n;

    typedef StringSet TMyStringSet;
    typedef Index > > TMyIndex;
    typedef Fibre::Type TMySAFibre;
    typedef Fibre::Type TMyDirFibre;

    // Fill a string set with 60-mer DNA sequences.
    StringSet myStringSet;
    string input_s;
    DnaString input;
    for (unsigned i = 0; i < n; ++i) {
        input_s.resize(60);
        generate_n(input_s.begin(), 60, MyGenerator());
        input = input_s;
        appendValue(myStringSet, input);
    }

    // Build the index.
    TMyIndex index(myStringSet);

    // Require the QGramSA fibre.
    cout << "requiring SA fibre:\n";
    float t0 = clock();
    indexRequire(index, QGramSA());
    cout << (clock() - t0)/CLOCKS_PER_SEC << endl;

    cout << "requiring Dir fibre:\n";
    t0 = clock();
    indexRequire(index, QGramDir());
    cout << (clock() - t0)/CLOCKS_PER_SEC << endl;

    TMySAFibre mySAFibre = getFibre(index, QGramSA());
    cout << "QGramSA length: " << length(mySAFibre) << endl;

    TMyDirFibre myDirFibre = getFibre(index, QGramDir());
    cout << "QGramDir length: " << length(myDirFibre) << endl;

    return 0;
}

--
The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.

From johdro@mpi-inf.mpg.de Wed Jul 06 16:28:27 2011
From: Johannes Dröge
Organization: Universität Düsseldorf/Max-Planck-Institut für Informatik Saarbrücken
To: seqan-dev@lists.fu-berlin.de
Date: Wed, 6 Jul 2011 16:28:12 +0200
Message-Id: <201107061628.12413.johdro@mpi-inf.mpg.de>
Subject: [Seqan-dev] Random access of large FASTA file

Hello,

I am using SeqAn to access a large FASTA file. In this case, I am importing the whole RefSeq DB for random access (into memory or memory-mapped). This can be quite a huge file, so I decided to go for a dynamic strategy, writing a generic SequenceStorage object.
It works well for

typedef seqan::String< seqan::Dna5 > StringType; //(default type)
typedef seqan::String< seqan::Dna5, seqan::Packed<> > StringType;

but not for

typedef seqan::String< seqan::Dna5, seqan::MMap<> > StringType;

Here is the code that imports the data using the MMap trick from the HowTo and puts it into a

StringSet< StringType > data_;

with an index data structure

std::map< std::string, long unsigned int > id2pos_;

--------------------------------------------------------------------------------
seqan::MultiSeqFile db_sequences;
seqan::open( db_sequences.concat, filename.c_str(), seqan::OPEN_RDONLY );
seqan::split( db_sequences, seqan::Fasta() );

for( unsigned int i = 0; i < num_records; ++i ) {
    StringType seq;
    seqan::assignSeq( seq, db_sequences[i], fasta_format_ );

    std::string id;
    seqan::assignSeqId( id, db_sequences[i], fasta_format_ );
    id2pos_[ extractFastaCommentField( id, "gi" ) ] = seqan::assignValueById( data_, seq );
}
--------------------------------------------------------------------------------

1) seqan::assignValueById() causes a segfault at sequence number 33,924 out of 276,313 when using a StringSet with mmap strings.

2) Also, I don't know how to define a StringSet using array strings.

3) Using a regular Dna5 string, the whole operation takes about 5 minutes. A packed string takes much longer to load. Is there any way to speed this up? I could think of a (binary) sink for a StringSet to avoid parsing and recoding every time I load the DB sequences. Is there anything like this (planned)?

I appreciate your help!
Regards,
Johannes

From johdro@mpi-inf.mpg.de Thu Jul 07 15:08:24 2011
From: Johannes Dröge
Organization: Universität Düsseldorf/Max-Planck-Institut für Informatik Saarbrücken
To: SeqAn Development
Date: Thu, 7 Jul 2011 15:08:15 +0200
References: <201107061628.12413.johdro@mpi-inf.mpg.de>
In-Reply-To: <201107061628.12413.johdro@mpi-inf.mpg.de>
Message-Id: <201107071508.16040.johdro@mpi-inf.mpg.de>
Subject: Re: [Seqan-dev] Random access of large FASTA file

Can anybody check whether the MMap strings are used correctly here via the assignSeq function?

Do I still have to keep the MultiSeqFile object after all MMap strings in the StringSet are constructed?

Regards,
Johannes

From weese@campus.fu-berlin.de Thu Jul 07 16:14:36 2011
From: "Weese, David"
To: SeqAn Development
Date: Thu, 7 Jul 2011 16:14:34 +0200
Message-ID: <54AB0ECB-7D70-4C78-9486-80916F136EBA@fu-berlin.de>
References: <201107061628.12413.johdro@mpi-inf.mpg.de>
In-Reply-To: <201107061628.12413.johdro@mpi-inf.mpg.de>
Subject: Re: [Seqan-dev] Random access of large FASTA file

Hi Johannes,

I assume the value of num_records is less than or equal to length(db_sequences). Looking at your code, it seems that you are trying to use a memory mapped string as a temporary variable in a large loop. That may not be the best idea, as it would create a temporary file and delete it in every iteration. It could be that the temporary file could not be opened; you could test that with a #define SEQAN_DEBUG before including any SeqAn header.

You should at least move all the instantiations out of the loop. Still, I don't think you need a memory mapped string (seq) to store a single sequence of a multi-FASTA file. Also, I cannot see where you store the read sequences. It would make sense to use a single StringSet >, Owner > > data_ that stores multiple sequences using a single memory mapped string.

HTH.
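To make the idea of keeping many sequences in one single string more concrete, here is a minimal standalone sketch in plain C++. It is not SeqAn code and does not claim to show how SeqAn implements its string sets: `ConcatSet`, `append`, `get` and `limits` are hypothetical names chosen for illustration, and a `std::string` merely stands in for the memory mapped buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical illustration only (not part of SeqAn).
// All sequences are stored back to back in one buffer; limits[i] and
// limits[i + 1] delimit sequence i inside that buffer.
struct ConcatSet {
    std::string concat;                  // stand-in for one large (e.g. memory mapped) string
    std::vector<std::size_t> limits{0};  // always holds size() + 1 entries

    // Append a sequence to the end of the buffer and record where it ends.
    void append(const std::string& seq) {
        concat += seq;
        limits.push_back(concat.size());
    }

    // Number of stored sequences.
    std::size_t size() const { return limits.size() - 1; }

    // Return a copy of sequence i, cut out of the shared buffer.
    std::string get(std::size_t i) const {
        assert(i < size());
        return concat.substr(limits[i], limits[i + 1] - limits[i]);
    }
};
```

With this layout there is only one large allocation (or one mapped file) for all sequences plus a small array of offsets, instead of one temporary object per record. For example, appending "ACGT" and then "GG" leaves the buffer as "ACGTGG" with limits 0, 4 and 6.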
If the problem still remains, please create a bug ticket with source code and example files.

Cheers,
David

--
David Weese                     weese@inf.fu-berlin.de
Freie Universität Berlin        http://www.inf.fu-berlin.de/
Institut für Informatik         Phone: +49 30 838 75246
Takustraße 9                    Algorithmic Bioinformatics
14195 Berlin                    Room 021

_______________________________________________
seqan-dev mailing list
seqan-dev@lists.fu-berlin.de
https://lists.fu-berlin.de/listinfo/seqan-dev

From johdro@mpi-inf.mpg.de Thu Jul 07 17:11:45 2011
From: Johannes Dröge
Organization: Universität Düsseldorf/Max-Planck-Institut für Informatik Saarbrücken
To: SeqAn Development
Date: Thu, 7 Jul 2011 17:11:39 +0200
References: <201107061628.12413.johdro@mpi-inf.mpg.de> <54AB0ECB-7D70-4C78-9486-80916F136EBA@fu-berlin.de>
In-Reply-To: <54AB0ECB-7D70-4C78-9486-80916F136EBA@fu-berlin.de>
Message-Id: <201107071711.39904.johdro@mpi-inf.mpg.de>
Subject: Re: [Seqan-dev] Random access of large FASTA file

Hello David,

thanks for your comments. I am still not confident about the design concept of memory mapped single strings in SeqAn. The idea of the loop is in any case to create a StringSet whose type depends on the chosen StringType. So it works this way:

1) create a temporary sequence object
2) assign content from the memory mapped multi-FASTA file (MultiSeqFile)
3) store it in the StringSet, which will have ownership (I guess this is done via a copy constructor)

This works fine for standard and packed string types. I would also like to have a StringSet that contains strings that are actually memory mapped from the original multi-FASTA file.
I thought that the assignSeq function would handle this appropriately when I use it with a default-constructed memory mapped sequence object. It seems I misunderstood the design of this sequence type. Is there any way to construct the kind of StringSet I have in mind?

Regards,
Johannes

From weese@campus.fu-berlin.de Thu Jul 07 20:28:04 2011
From: "Weese, David"
To: SeqAn Development
Date: Thu, 7 Jul 2011 20:28:00 +0200
Message-ID: <32E2E994-9A9C-4536-B5D5-0A6970E3723E@fu-berlin.de>
References: <201107061628.12413.johdro@mpi-inf.mpg.de> <54AB0ECB-7D70-4C78-9486-80916F136EBA@fu-berlin.de> <201107071711.39904.johdro@mpi-inf.mpg.de>
In-Reply-To: <201107071711.39904.johdro@mpi-inf.mpg.de>
Subject: Re: [Seqan-dev] Random access of large FASTA file

Hi,

follow the HowTo at http://trac.mi.fu-berlin.de/seqan/wiki/HowTo/EfficientImportOfMillionsOfSequences and simply change:

StringSet > seqs;

into:

StringSet >, Owner > > seqs;

That should do what you want.

Regards,
David

From johdro@mpi-inf.mpg.de Fri Jul 08 16:05:36 2011
From: Johannes Dröge
Organization: Universität Düsseldorf/Max-Planck-Institut für Informatik Saarbrücken
To: SeqAn Development
Date: Fri, 8 Jul 2011 16:05:30 +0200
References: <201107061628.12413.johdro@mpi-inf.mpg.de>
<201107071711.39904.johdro@mpi-inf.mpg.de> <32E2E994-9A9C-4536-B5D5-0A6970E3723E@fu-berlin.de>
In-Reply-To: <32E2E994-9A9C-4536-B5D5-0A6970E3723E@fu-berlin.de>
MIME-Version: 1.0
Content-Type: Text/Plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable
Message-Id: <201107081605.30762.johdro@mpi-inf.mpg.de>
X-Originating-IP: 139.19.1.49
X-purgate: clean
X-purgate-type: clean
X-purgate-ID: 151147::1310133935-00005A17-F217A9FC/0-0/0-0
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.2
X-Spam-Flag: NO
X-Spam-Checker-Version: SpamAssassin 3.0.4 on Dschibuti.ZEDAT.FU-Berlin.DE
X-Spam-Level:
X-Spam-Status: No, score=0.0 required=5.0 tests=FORGED_RCVD_HELO,SPF_PASS
Subject: Re: [Seqan-dev] Random access of large FASTA file
X-BeenThere: seqan-dev@lists.fu-berlin.de
X-Mailman-Version: 2.1.11
Precedence: list
Reply-To: SeqAn Development
List-Id: SeqAn Development
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Fri, 08 Jul 2011 14:05:36 -0000

Sorry, I still don't get it.
How can [ MultiFastaFile ==> Dna5String ==> StringSet >, Owner > > ] work, if it copies the value of the sequence?
Doesn't assignSeq() copy the value into the Dna5String seq?

What happens when I use appendValue to add seq to the StringSet? Where does it actually reside (it should still be in the MultiFasta file)?

I need to access the MultiFastaFile (on the hard disk) as a regular StringSet to read its contents on demand, not copy its sequences into a new memory-mapped file.

Johannes

On Thursday, 7 July 2011 at 20:28:00, Weese, David wrote:
> Hi,
>
> follow the howto on http://trac.mi.fu-berlin.de/seqan/wiki/HowTo/EfficientImportOfMillionsOfSequences and simply change:
>
> StringSet > seqs;
>
> into:
>
> StringSet >, Owner > > seqs;
>
> That should do what you want.
>
> Regards,
> David

From weese@campus.fu-berlin.de Fri Jul 08 19:35:22 2011
Received: from outpost1.zedat.fu-berlin.de ([130.133.4.66]) by list1.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1QfExV-0002bg-CW>; Fri, 08 Jul 2011 19:35:21 +0200
Received: from relay2.zedat.fu-berlin.de ([130.133.4.80]) by outpost1.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1QfExV-0005fc-AT>; Fri, 08 Jul 2011 19:35:21 +0200
Received: from exchange6.fu-berlin.de ([160.45.9.133]) by relay2.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1QfExV-0000sK-59>; Fri, 08 Jul 2011 19:35:21 +0200
Received: from exchange6.fu-berlin.de ([160.45.9.133]) by exchange6.fu-berlin.de ([160.45.9.133]) with mapi; Fri, 8 Jul 2011 19:35:21 +0200
From: "Weese, David"
To: SeqAn Development
Date: Fri, 8 Jul 2011 19:35:20 +0200
Thread-Topic: [Seqan-dev] Random access of large FASTA file
Thread-Index: Acw9lWkH4aSvRvxwQgugBESkPDWPrQ==
Message-ID:
References: <201107061628.12413.johdro@mpi-inf.mpg.de> <201107071711.39904.johdro@mpi-inf.mpg.de> <32E2E994-9A9C-4536-B5D5-0A6970E3723E@fu-berlin.de> <201107081605.30762.johdro@mpi-inf.mpg.de>
In-Reply-To: <201107081605.30762.johdro@mpi-inf.mpg.de>
Accept-Language: de-DE
Content-Language: en-US
X-MS-Has-Attach:
X-MS-TNEF-Correlator:
acceptlanguage: de-DE
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable
MIME-Version: 1.0
X-Originating-IP: 160.45.9.133
X-purgate: clean
X-purgate-type: clean
X-purgate-ID: 151147::1310146521-00005A17-AD3478FA/0-0/0-0
X-Bogosity: Ham, tests=bogofilter, spamicity=0.148144, version=1.2.2
X-Spam-Flag: NO
X-Spam-Checker-Version: SpamAssassin 3.0.4 on Benin.ZEDAT.FU-Berlin.DE
X-Spam-Level:
X-Spam-Status: No, score=-2.8 required=5.0 tests=ALL_TRUSTED
Subject: Re: [Seqan-dev] Random access of large FASTA file
X-BeenThere: seqan-dev@lists.fu-berlin.de
X-Mailman-Version: 2.1.11
Precedence: list
Reply-To: SeqAn Development
List-Id: SeqAn Development
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Fri, 08 Jul 2011 17:35:22 -0000

On 08.07.2011 at 16:05, Johannes Dröge wrote:

> Sorry, I still don't get it.
> How can [ MultiFastaFile ==> Dna5String ==> StringSet >, Owner > > ] work, if it copies the value of the sequence?
> Doesn't assignSeq() copy the value into the Dna5String seq?

assignSeq *extracts* the sequence information from a block that may contain a header, a sequence interspersed by newlines, quality values, etc.
If you want to get sequence substrings of an unprocessed Fasta file, they may contain whitespace.

>
> What happens when I use appendValue to add seq to the StringSet, where does it actually reside (it should still be in the MultiFasta file).

As assignSeq(seq, ...) extracts the sequence character by character, there is no association between seq and the Fasta file.

>
> I need to access the MultiFastaFile (on the hard disk) as a regular StringSet to read its contents on demand, not copy its sequences into a new memory-mapped file.

Then you need to keep the split MultiSeqFile and extract the sequences on demand with assignSeq.
If you access the sequences very often, I would recommend filling a StringSet >, Owner > > (see my last mail), which also resides on your hard disk but can be accessed without assignSeq.

>
> Johannes
>
>
> On Thursday, 7 July 2011 at 20:28:00, Weese, David wrote:
>> Hi,
>>
>> follow the howto on http://trac.mi.fu-berlin.de/seqan/wiki/HowTo/EfficientImportOfMillionsOfSequences and simply change:
>>
>> StringSet > seqs;
>>
>> into:
>>
>> StringSet >, Owner > > seqs;
>>
>> That should do what you want.
>>
>> Regards,
>> David
>>
>>

From danielpeterjames@gmail.com Sun Jul 10 17:50:53 2011
Received: from relay1.zedat.fu-berlin.de ([130.133.4.67]) by list1.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1QfwHU-0002YN-B4>; Sun, 10 Jul 2011 17:50:52 +0200
Received: from mail-yi0-f54.google.com ([209.85.218.54]) by relay1.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1QfwHU-00070f-4p>; Sun, 10 Jul 2011 17:50:52 +0200
Received: by yic13 with SMTP id 13so277791yic.13 for ; Sun, 10 Jul 2011 08:50:51 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=mime-version:in-reply-to:references:from:date:message-id:subject:to :content-type:content-transfer-encoding; bh=5qAe/7b7Q5c2is4ho0laIfgp2thrk2m02ARWsqRM1FA=; b=b4lu+PSQopLrxkUfP7HTolr/JjrNEEyqRv49OecxCRQ0AO2IyNNtypgNMPc87etdmg 6FKKCXe/z5AE7qjisd9NU1oRJdSaI6tHAoeqfc7hok6nSrXSqebGDahTAi/TNb2wCDR9 Vb3xApDwDM5pc1DHBcJJNP5vyiMbucfn5F8Q0=
Received: by 10.236.67.76 with SMTP id i52mr4259807yhd.308.1310313051120; Sun, 10 Jul 2011 08:50:51 -0700 (PDT)
MIME-Version: 1.0
Received: by 10.236.95.42 with
HTTP; Sun, 10 Jul 2011 08:50:30 -0700 (PDT)
In-Reply-To:
References:
From: Daniel James
Date: Sun, 10 Jul 2011 16:50:30 +0100
Message-ID:
To: seqan-dev@lists.fu-berlin.de
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
X-Originating-IP: 209.85.218.54
X-purgate: clean
X-purgate-type: clean
X-purgate-ID: 151147::1310313052-00005A17-C3D1B63C/0-0/0-0
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000001, version=1.2.2
X-Spam-Flag: NO
X-Spam-Checker-Version: SpamAssassin 3.0.4 on Gabun.ZEDAT.FU-Berlin.DE
X-Spam-Level: x
X-Spam-Status: No, score=1.8 required=5.0 tests=DNS_FROM_RFC_ABUSE, DNS_FROM_RFC_POST,RCVD_BY_IP,SPF_HELO_PASS,SPF_PASS
Subject: [Seqan-dev] Fwd: Building QGramSA fibre from StringSet
X-BeenThere: seqan-dev@lists.fu-berlin.de
X-Mailman-Version: 2.1.11
Precedence: list
Reply-To: SeqAn Development
List-Id: SeqAn Development
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Sun, 10 Jul 2011 15:50:53 -0000

Hi

I'm getting an error whilst trying to get the SA fibre of a QGram index over a StringSet.

There's a minimal example below. Is there any chance someone could have a look at this?

I'm on Rev: 9996 of trunk.

Daniel

#include
#include

using namespace seqan;

// Generates random nucleotides.
struct MyGenerator : std::unary_function
{
    std::string syms;
    MyGenerator (std::string syms = "ACGT") : syms(syms) { srand(time(NULL)); }
    char operator()(void) { return syms[rand() % syms.size()]; }
};

int main(int argc, char** argv)
{
    typedef StringSet TMyStringSet;
    typedef Index > > TMyIndex;

    StringSet myStringSet;
    for (unsigned i = 0; i < 100; ++i)
    {
        DnaString input;
        resize(input, 60);
        generate_n(begin(input), 60, MyGenerator());
        appendValue(myStringSet, input);
    }

    std::cout << myStringSet[0] << std::endl;
    std::cout << "requiring QGramSA..." << std::endl;
    TMyIndex index(myStringSet);
    indexRequire(index, QGramSA());
    return 0;
}

---------- Forwarded message ----------
From: Daniel James
Date: 5 July 2011 15:18
Subject: Building QGramSA fibre from StringSet
To: seqan-dev@lists.fu-berlin.de

Hi

I'm running into some unexpected behaviour whilst trying to use an SA fibre of a QGram index that's built over a string set.

The code below runs OK with 1 or 10 as the command line arg (number of strings in the string set), but fails at 100 with the following exception:

/Users/dj5/usr/local/include/seqan/sequence/string_base.h:238 Assertion failed : static_cast(pos) < static_cast(length(me)) was: 171599218 >= 100 (Trying to access an element behind the last one!)

Have I made a coding blunder or is this a bug?

Many thanks,
Daniel

#include
#include
#include
#include
#include
#include
#include

using namespace std;
using namespace seqan;

// Generates random nucleotides.
struct MyGenerator : public unary_function
{
  string syms;
  MyGenerator (string syms = "ACGT") : syms(syms) { srand(time(NULL)); }
  char operator()(void) { return syms[rand() % syms.size()]; }
};

int main(int argc, char** argv)
{
  // Input the number of sequences for the string set.
  stringstream ss(argv[1]);
  unsigned n;
  ss >> n;

  typedef StringSet                 TMyStringSet;
  typedef Index > >                 TMyIndex;
  typedef Fibre::Type               TMySAFibre;
  typedef Fibre::Type               TMyDirFibre;

  // Fill a string set with 60-mer DNA sequences.
  StringSet myStringSet;
  string input_s;
  DnaString input;
  for (unsigned i = 0; i < n; ++i)
  {
      input_s.resize(60);
      generate_n(input_s.begin(), 60, MyGenerator());
      input = input_s;
      appendValue(myStringSet, input);
  }

  // Build the index.
  TMyIndex index(myStringSet);

  // Require the QGramSA fibre.
  cout << "requiring SA fibre:\n";
  float t0 = clock();
  indexRequire(index, QGramSA());
  cout << (clock() - t0)/CLOCKS_PER_SEC << endl;

  cout << "requiring Dir fibre:\n";
  t0 = clock();
  indexRequire(index, QGramDir());
  cout << (clock() - t0)/CLOCKS_PER_SEC << endl;

  TMySAFibre mySAFibre = getFibre(index, QGramSA());
  cout << "QGramSA length: " << length(mySAFibre) << endl;

  TMyDirFibre myDirFibre = getFibre(index, QGramDir());
  cout << "QGramDir length: " << length(myDirFibre) << endl;

  return 0;
}

From f.buske@uq.edu.au Wed Jul 13 13:47:27 2011
Received: from relay1.zedat.fu-berlin.de ([130.133.4.67]) by list1.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1QgxuX-0001lj-VD>; Wed, 13 Jul 2011 13:47:26 +0200
Received: from mailhub4.uq.edu.au ([130.102.149.131]) by relay1.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1QgxuX-0000Cc-6Y>; Wed, 13 Jul 2011 13:47:25 +0200
Received: from smtp4.uq.edu.au (smtp4.uq.edu.au [130.102.128.19]) by mailhub4.uq.edu.au (8.13.8/8.13.8) with ESMTP id p6DBlKDX017825 for ; Wed, 13 Jul 2011 21:47:21 +1000
Received: from Fabian-Buskes-MacBook.local (c122-108-178-120.rochd4.qld.optusnet.com.au [122.108.178.120]) (authenticated bits=0) by smtp4.uq.edu.au (8.13.8/8.13.8) with ESMTP id p6DBlIIG014885 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NOT) for ; Wed, 13 Jul 2011 21:47:20 +1000
Message-ID: <4E1D85C6.6020500@uq.edu.au>
Date: Wed, 13 Jul 2011 21:47:18 +1000
From: Fabian Buske
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:5.0) Gecko/20110624 Thunderbird/5.0
MIME-Version: 1.0
To: SeqAn Development
Content-Type: multipart/mixed; boundary="------------050502070902010701000607"
X-UQ-FilterTime: 1310557641
X-Scanned-By: MIMEDefang 2.58 on UQ Mailhub on 130.102.149.131
X-Originating-IP:
130.102.149.131
X-ZEDAT-Hint: A
X-purgate: clean
X-purgate-type: clean
X-purgate-ID: 151147::1310557645-00005A17-78AC34FE/0-0/0-0
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.2
X-Spam-Flag: NO
X-Spam-Checker-Version: SpamAssassin 3.0.4 on Burundi.ZEDAT.FU-Berlin.DE
X-Spam-Level:
X-Spam-Status: No, score=0.9 required=5.0 tests=FORGED_RCVD_HELO, RATWARE_GECKO_BUILD
Subject: [Seqan-dev] bugfix: missing metafunction in misc_dequeue.h
X-BeenThere: seqan-dev@lists.fu-berlin.de
X-Mailman-Version: 2.1.11
Precedence: list
Reply-To: SeqAn Development
List-Id: SeqAn Development
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Wed, 13 Jul 2011 11:47:27 -0000

This is a multi-part message in MIME format.
--------------050502070902010701000607
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit

Hi,

it seems that seqan/misc/misc_dequeue.h does not currently have the functionality to report the type of its values (missing metafunction).
I didn't check whether this may also explain some of the bugs in the bug-tracker related to the dequeue template.
Anyway, attached is a fix for this one.

Cheers,
Fabian

--
Fabian A. Buske
Institute for Molecular Bioscience
The University of Queensland
Brisbane, Qld. 4072 Australia
Phone: (61)-(7)-334-62608

--------------050502070902010701000607
Content-Type: text/plain; x-mac-type="0"; x-mac-creator="0"; name="misc_dequeue_bugfix.diff"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment; filename="misc_dequeue_bugfix.diff"

Index: misc_dequeue.h
===================================================================
--- misc_dequeue.h (revision 10040)
+++ misc_dequeue.h (working copy)
@@ -108,6 +108,11 @@
 typedef Iter const, PositionIterator> Type;
 };

+template
+struct Value >
+{
+ typedef TValue Type;
+};

 //////////////////////////////////////////////////////////////////////////////

--------------050502070902010701000607--

From Knut.Reinert@fu-berlin.de Thu Jul 14 22:57:41 2011
Received: from outpost1.zedat.fu-berlin.de ([130.133.4.66]) by list1.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1QhSya-0005FF-4i>; Thu, 14 Jul 2011 22:57:40 +0200
Received: from relay2.zedat.fu-berlin.de ([130.133.4.80]) by outpost1.zedat.fu-berlin.de (Exim 4.69) with esmtp (envelope-from ) id <1QhSyY-0001Sj-KQ>; Thu, 14 Jul 2011 22:57:39 +0200
Received: from exchange6.fu-berlin.de ([160.45.9.133]) by relay2.zedat.fu-berlin.de (Exim 4.69) with esmtp (envelope-from ) id <1QhSyH-0006FT-2l>; Thu, 14 Jul 2011 22:57:21 +0200
Received: from exchange6.fu-berlin.de ([160.45.9.133]) by exchange6.fu-berlin.de ([160.45.9.133]) with mapi; Thu, 14 Jul 2011 22:57:19 +0200
From: "Reinert, Knut"
To: AG ABI , Martin Vingron , "Kruglyak, Semyon" , Dirk Evers , Tobias Mann , Bret Barnes , Raffaele Giancarlo , Anthony Cox , Kathrin Trappe , Jochen Singer , Markus Bauer , Ole Schulz-Trieglaff , Nikolaus Rajewsky , Stefan Mundlos , Peter Robinson , Jürgen Kleffe , Hans-Peter Lenhof , Stefan Kurtz , Sven Rahmann , Jens Stoye , Lars Langner , kuss_a , Marcel Grunert , Gunnar Klau , wei chen , Chen Li , Steven Salzberg , "franzime@zedat.fu-berlin.de" , Konrad Ludwig Moritz Rudolph , Franziska
Zickmann , Johannes Krugel , "Dr. Hans-Joachim Hinz" , Ulf Leser , Ulrich Meyer , Marcel Schulz , Cedric Notredame , Andreas Doering , Christoph Dieterich , Carsten Kemena , Michael Stromberg , "bertram.weiss@bayerhealthcare.com" , Lars Bertram , Michael Berthold , Marcel Martin , Tobias Rausch , Johannes Roehr , Martin Riese , "timmermann@molgen.mpg.de" , Johannes Fischer , Kurt Mehlhorn , Peter Sanders , "naeher@uni-trier.de" , Gonzalo Navarro , Andreas Hildebrandt , Aaron Halpern , Robert Giegerich , Rolf Backofen , "stadler@bioinf.uni-leipzig.de" , Kristina Little , Ralf Zimmer , Andreas Keller , Sabrina Krakau , "Dr. Bernhard Balkenhol" , Ralf Herwig , Han-Yu Chuang , "taeubig@informatik.tu-muenchen.de" , Vineet Bafna , Pavel Pevzner , David Haussler , Benedict Paten , "langmead@cs.umd.edu" , Granger Sutton , Shibu Yooseph , Andreas Beutler , Paolo Di Tommaso , "birney@ebi.ac.uk" , "jens-uwe.krause@lgcgenomics.com" , "efritzilas@illumina.com" , "rina.ahmed@mdc-berlin.de" , Camila Mazzoni , "wwong@illumina.com" , Fabian Buske , "Denis C. Bauer" , Franziska Zickmann , Oliver Kohlbacher , "Dr. 
Jan Baumbach" , "lengauer@mpi-sb.mpg.de" , Ernst Althaus , "mario.albrecht@mpi-inf.mpg.de" , "cbock@mpi-inf.mpg.de" , Stefan Canzar , =?iso-8859-1?Q?=22B=E4rwolf=2C_Aneta=22?= , Laurent Mouchard , Max Crochemore , "costas.iliopoulos@gmail.com Iliopoulos" , "steinke@zib.de Steinke" , "kathleen.steinhoefel@kcl.ac.uk" , =?iso-8859-1?Q?Johannes_Dr=F6ge?= , Thomas Dan Otto , Mat , Bernhard Renard , SeqAn Development , "mallgaier@igb-berlin.de" , "monaghan@igb-berlin.de" , =?iso-8859-1?Q?Bj=F6rn_Kahlert?= , Lutz Prechelt , Leon Kuchenbecker Date: Thu, 14 Jul 2011 22:57:02 +0200 Thread-Topic: Invitation to 3rd SeqAn workshop, September 12.-14., Harnack Haus, Berlin Dahlem Thread-Index: AcxCaJ6NuqA1bRFrSBC99lXlr1OsJw== Message-ID: <762980B4-CF7C-40B9-9AA0-6993B5B6283C@fu-berlin.de> Accept-Language: en-US, de-DE Content-Language: en-US X-MS-Has-Attach: yes X-MS-TNEF-Correlator: acceptlanguage: en-US, de-DE Content-Type: multipart/signed; boundary="Apple-Mail-120--490142946"; protocol="application/pkcs7-signature"; micalg=sha1 MIME-Version: 1.0 X-Originating-IP: 160.45.9.133 X-ZEDAT-Hint: A X-purgate: clean X-purgate-type: clean X-purgate-ID: 151147::1310677060-00005A17-281F8CB3/0-0/0-0 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.2 X-Mailman-Approved-At: Fri, 15 Jul 2011 00:20:54 +0200 Subject: [Seqan-dev] Invitation to 3rd SeqAn workshop, September 12.-14., Harnack Haus, Berlin Dahlem X-BeenThere: seqan-dev@lists.fu-berlin.de X-Mailman-Version: 2.1.11 Precedence: list Reply-To: SeqAn Development List-Id: SeqAn Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 14 Jul 2011 20:57:41 -0000 --Apple-Mail-120--490142946 Content-Type: multipart/alternative; boundary=Apple-Mail-118--490143012 --Apple-Mail-118--490143012 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset=iso-8859-1 Dear friends, SeqAn users, and algorithm developers, I invite you (or coworkers) to 
participate in the 3rd SeqAn workshop (www.seqan.de),
which will be from September 12. to 14. 2011 in Berlin, Germany.

The workshop will be free of charge and be sponsored by the German Ministry for Education and Research within the VIP program.
The venue will be the Harnack Haus in Berlin Dahlem (http://www.harnackhaus-berlin.mpg.de/).

More information can be found on the attached flyer and on the VIP project website (http://www.seqan-biostore.de/wp)

In order to plan the details we would like you to confirm your participation by August 14th at the latest.

Please send a mail to Sabrina Krakau <krakau@mi.fu-berlin.de> with the information

a) whether you want to participate
b) whether you would like to give a talk on Monday the 12th about your recent research, open problems, or your experience with SeqAn (see attached schedule).

Please feel free to forward the mail to interested users.

We hope to see you in Berlin in September,

The SeqAn team

-----------------------------------------------------------------------
Prof. Dr.-Ing. Knut Reinert       Phone/fax : +49 30 838 75 222/218 (GE)
                                            : +1 858 8826656 (US)
Algorithmic Bioinformatics        Mobile    : +49 160 7195754 (GE)
                                            : +1 858405 8323 (US)
Freie Universität Berlin          Skype     : knut.reinert
Takustrasse 9                     E-Mail    : knut.reinert@fu-berlin.de
D-14195 Berlin, Germany           Web       : http://knut.reinert.ws
------------------------------------------------------------------------

--Apple-Mail-118--490143012
Content-Type: multipart/mixed; boundary=Apple-Mail-119--490143012

--Apple-Mail-119--490143012
Content-Transfer-Encoding: quoted-printable
Content-Type: text/html; charset=iso-8859-1

Dear friends, SeqAn users, and algorithm developers,

I invite you (or coworkers) to participate in the 3rd SeqAn workshop (www.seqan.de),
which will be from September 12. to 14. 2011 in Berlin, Germany.

The workshop will be free of charge and be sponsored by the German Ministry for Education and Research within the VIP program.
The venue will be the Harnack Haus in Berlin Dahlem (http://www.harnackhaus-berlin.mpg.de/).

More information can be found on the attached flyer and on the VIP project website (http://www.seqan-biostore.de/wp)

In order to plan the details we would like you to confirm your participation by August 14th at the latest.

Please send a mail to Sabrina Krakau <krakau@mi.fu-berlin.de> with the information

a) whether you want to participate
b) whether you would like to give a talk on Monday the 12th about your recent research, open problems, or your experience with SeqAn (see attached schedule).

Please feel free to forward the mail to interested users.

We hope to see you in Berlin in September,

The SeqAn team

Prof. Dr.-Ing. Knut Reinert       Phone/fax : +49 30 838 75 222/218 (GE)
                                            : +1 858 8826656 (US)
Algorithmic Bioinformatics        Mobile    : +49 160 7195754 (GE)
                                            : +1 858405 8323 (US)
Freie Universität Berlin          Skype     : knut.reinert
Takustrasse 9                     E-Mail    : knut.reinert@fu-berlin.de
D-14195 Berlin, Germany           Web       : http://knut.reinert.ws