From danielpeterjames@gmail.com Tue Jul 05 16:11:31 2011
From: Daniel James
Date: Tue, 5 Jul 2011 15:11:28 +0100
To: SeqAn Development
Subject: [Seqan-dev] Building QGramSA fibre from StringSet

Hi

I'm running into some
unexpected behaviour whilst trying to use an SA fibre of a QGram index that's built over a string set.

The below code runs OK with 1 or 10 as command line args (strings in string set), but fails at 100 with the following exception:

/Users/dj5/usr/local/include/seqan/sequence/string_base.h:238 Assertion failed : static_cast(pos) < static_cast(length(me)) was: 171599218 >= 100 (Trying to access an element behind the last one!)

Have I made a coding blunder or is this a bug?

Many thanks,
Daniel

#include #include #include #include #include #include #include

using namespace std;
using namespace seqan;

// Generates random nucleotides.
struct MyGenerator : public unary_function {
    string syms;
    MyGenerator (string syms = "ACGT") : syms(syms) { srand(time(NULL)); }
    char operator()(void) { return syms[rand() % syms.size()]; }
};

int main(int argc, char** argv) {
    // Input the number of sequences for the string set.
    stringstream ss(argv[1]);
    unsigned n;
    ss >> n;

    typedef StringSet TMyStringSet;
    typedef Index > > TMyIndex;
    typedef Fibre::Type TMySAFibre;
    typedef Fibre::Type TMyDirFibre;

    // Fill a string set with 60-mer DNA sequences.
    StringSet myStringSet;
    string input_s;
    DnaString input;
    for (unsigned i = 0; i < n; ++i) {
        input_s.resize(60);
        generate_n(input_s.begin(), 60, MyGenerator());
        input = input_s;
        appendValue(myStringSet, input);
    }

    // Build the index.
    TMyIndex index(myStringSet);

    // Require the QGramSA fibre.
    cout << "requiring SA fibre:\n";
    float t0 = clock();
    indexRequire(index, QGramSA());
    cout << (clock() - t0)/CLOCKS_PER_SEC << endl;

    cout << "requiring Dir fibre:\n";
    t0 = clock();
    indexRequire(index, QGramDir());
    cout << (clock() - t0)/CLOCKS_PER_SEC << endl;

    TMySAFibre mySAFibre = getFibre(index, QGramSA());
    cout << "QGramSA length: " << length(mySAFibre) << endl;

    TMyDirFibre myDirFibre = getFibre(index, QGramDir());
    cout << "QGramDir length: " << length(myDirFibre) << endl;

    return 0;
}

--
The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.

From johdro@mpi-inf.mpg.de Wed Jul 06 16:28:27 2011
From: Johannes Dröge
Organization: Universität Düsseldorf/Max-Planck-Institut für Informatik Saarbrücken
To: seqan-dev@lists.fu-berlin.de
Date: Wed, 6 Jul 2011 16:28:12 +0200
Subject: [Seqan-dev] Random access of large FASTA file

Hello,

I am using SeqAn to access a large FASTA file. In this case, I am importing the whole RefSeq DB for random access (into memory or memory-mapped). This can be quite a huge file, so I decided to go for a dynamic strategy, writing a generic SequenceStorage object.
It works well for

typedef seqan::String< seqan::Dna5 > StringType; //(default type)
typedef seqan::String< seqan::Dna5, seqan::Packed<> > StringType;

but not for

typedef seqan::String< seqan::Dna5, seqan::MMap<> > StringType;

Here is the code that imports the data using the MMap trick from the HowTo and puts it into a

StringSet< StringType > data_;

with an index data structure

std::map< std::string, long unsigned int > id2pos_;

--------------------------------------------------------------------------------
seqan::MultiSeqFile db_sequences;
seqan::open( db_sequences.concat, filename.c_str(), seqan::OPEN_RDONLY );
seqan::split( db_sequences, seqan::Fasta() );

for( unsigned int i = 0; i < num_records; ++i ) {
    StringType seq;
    seqan::assignSeq( seq, db_sequences[i], fasta_format_ );

    std::string id;
    seqan::assignSeqId( id, db_sequences[i], fasta_format_ );
    id2pos_[ extractFastaCommentField( id, "gi" ) ] = seqan::assignValueById( data_, seq );
}
--------------------------------------------------------------------------------

1) seqan::assignValueById() causes a segfault at sequence number 33,924 out of 276,313 when using a StringSet with mmap strings.

2) Also, I don't know how to define a StringSet using array strings.

3) Using a regular Dna5 string, the whole operation takes about 5 minutes. A packed string takes much longer to load. Is there any way to speed this up? I could think of a (binary) sink for a StringSet to avoid parsing and recoding every time I load the DB sequences. Is there anything like this (planned)?

I appreciate your help!
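For the random-access part of the question, one option that avoids loading every sequence up front is to scan the FASTA file once, remember where each record starts, and seek to a record on demand. The following is only a minimal sketch in plain C++, not SeqAn code: the helper names buildFastaIndex and readSequenceAt are hypothetical, and it assumes that the record id is the header text up to the first whitespace (rather than an extracted "gi" field) and that lines end with a plain '\n'.

```cpp
#include <cassert>
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Hypothetical helper (not part of SeqAn): scan a FASTA stream once and map
// each record id to the stream offset where its sequence lines begin.
// Assumption: the id is the header text between '>' and the first whitespace.
std::map<std::string, std::streamoff> buildFastaIndex(std::istream& in) {
    std::map<std::string, std::streamoff> index;
    std::string line;
    while (std::getline(in, line)) {
        if (line.empty() || line[0] != '>')
            continue;  // sequence line, nothing to record
        std::string::size_type end = line.find_first_of(" \t");
        if (end == std::string::npos)
            end = line.size();
        index[line.substr(1, end - 1)] = in.tellg();
    }
    in.clear();  // reset the end-of-file state so the stream can be reused
    return index;
}

// Hypothetical helper: read the sequence that starts at `offset`,
// concatenating lines until the next header line or the end of input.
std::string readSequenceAt(std::istream& in, std::streamoff offset) {
    assert(offset >= 0);
    in.clear();
    in.seekg(offset, std::ios::beg);
    std::string seq, line;
    while (std::getline(in, line) && (line.empty() || line[0] != '>'))
        seq += line;
    return seq;
}
```

The offset map is small compared to the sequences, so it could be written to disk next to the FASTA file; later runs would then not need to parse the file again.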
Regards,
Johannes

From johdro@mpi-inf.mpg.de Thu Jul 07 15:08:24 2011
From: Johannes Dröge
To: SeqAn Development
Date: Thu, 7 Jul 2011 15:08:15 +0200
Subject: Re: [Seqan-dev] Random access of large FASTA file

Can anybody check whether memory mapped strings are used correctly here via the assignSeq function?

Do I still have to keep the MultiSeqFile object after all MMap strings in the StringSet are constructed?

Regards,
Johannes

From weese@campus.fu-berlin.de Thu Jul 07 16:14:36 2011
From: "Weese, David"
To: SeqAn Development
Date: Thu, 7 Jul 2011 16:14:34 +0200
Subject: Re: [Seqan-dev] Random access of large FASTA file

Hi Johannes,

I assume the value of num_records is less than or equal to length(db_sequences). Looking at your code, it seems that you are trying to use a memory mapped string as a temporary variable in a large loop. That is maybe not the best idea, as it would create a temporary file and delete it in every iteration. It could be that the temporary file could not be opened; you could test that with a #define SEQAN_DEBUG before including any SeqAn header.

You should at least move all the instantiations out of the loop. Still, I don't think you need a memory mapped string (seq) to store a single sequence of a multi-FASTA file. Also, I cannot see where you store the read sequences. It would make sense to use a single StringSet >, Owner > > data_ that stores multiple sequences using a single memory mapped string.

HTH.
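To illustrate the idea of keeping all sequences in one backing string, here is a minimal sketch in plain C++. It is not SeqAn's implementation: the class name ConcatStringSet is hypothetical, and std::string merely stands in for the memory mapped string. Sequences are appended to a single buffer, and a vector of limits records where each one starts and ends.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical illustration, not SeqAn code: all sequences live in one
// contiguous buffer, and limits_[i]..limits_[i + 1] delimits sequence i.
class ConcatStringSet {
public:
    ConcatStringSet() : limits_(1, 0) {}

    // Append a sequence to the end of the shared buffer; returns its index.
    std::size_t append(const std::string& seq) {
        concat_ += seq;
        limits_.push_back(concat_.size());
        return limits_.size() - 2;
    }

    // Number of sequences stored so far.
    std::size_t size() const { return limits_.size() - 1; }

    // Copy out sequence i.
    std::string get(std::size_t i) const {
        assert(i < size());
        return concat_.substr(limits_[i], limits_[i + 1] - limits_[i]);
    }

    // The single buffer that holds every sequence back to back.
    const std::string& concat() const { return concat_; }

private:
    std::string concat_;               // stand-in for one memory mapped string
    std::vector<std::size_t> limits_;  // start offsets plus a final end offset
};
```

With a layout like this there is only one buffer, so only one mapping (and one file) would be needed regardless of the number of sequences, as opposed to one temporary file per string.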
If the problem still remains, please create a bug ticket with source code and example files.

Cheers,
David

--
David Weese                      weese@inf.fu-berlin.de
Freie Universität Berlin         http://www.inf.fu-berlin.de/
Institut für Informatik          Phone: +49 30 838 75246
Takustraße 9                     Algorithmic Bioinformatics
14195 Berlin                     Room 021

From johdro@mpi-inf.mpg.de Thu Jul 07 17:11:45 2011
From: Johannes Dröge
To: SeqAn Development
Date: Thu, 7 Jul 2011 17:11:39 +0200
Subject: Re: [Seqan-dev] Random access of large FASTA file

Hello David,

thanks for your comments. I am still not confident about the design concept of memory mapped single strings in SeqAn. The idea of the loop, in any case, is to create a StringSet whose type depends on the chosen StringType. It works this way:

1) create a temporary sequence object
2) assign content from the memory mapped multi-FASTA file (MultiSeqFile)
3) store it in the StringSet, which will have ownership (I guess this is done via a copy constructor)

This works fine for standard and packed string types. I would also like to have a StringSet that contains strings that are actually memory mapped from the original multi-FASTA file.
I thought that the assignSeq function would handle this appropriately when I use it with a default-constructed memory mapped sequence object. It seems I misunderstood the design of this sequence type. Is there any way to construct the kind of StringSet I have in mind?

Regards,
Johannes

From weese@campus.fu-berlin.de Thu Jul 07 20:28:04 2011
From: "Weese, David"
To: SeqAn Development
Date: Thu, 7 Jul 2011 20:28:00 +0200
Subject: Re: [Seqan-dev] Random access of large FASTA file

Hi,

follow the HowTo at http://trac.mi.fu-berlin.de/seqan/wiki/HowTo/EfficientImportOfMillionsOfSequences and simply change:

StringSet > seqs;

into:

StringSet >, Owner > > seqs;

That should do what you want.

Regards,
David

From johdro@mpi-inf.mpg.de Fri Jul 08 16:05:36 2011
From: Johannes Dröge
To: SeqAn Development
Date: Fri, 8 Jul 2011 16:05:30 +0200
References: <201107061628.12413.johdro@mpi-inf.mpg.de>
<201107071711.39904.johdro@mpi-inf.mpg.de> <32E2E994-9A9C-4536-B5D5-0A6970E3723E@fu-berlin.de> In-Reply-To: <32E2E994-9A9C-4536-B5D5-0A6970E3723E@fu-berlin.de> MIME-Version: 1.0 Content-Type: Text/Plain; charset="iso-8859-1" Content-Transfer-Encoding: quoted-printable Message-Id: <201107081605.30762.johdro@mpi-inf.mpg.de> X-Originating-IP: 139.19.1.49 X-purgate: clean X-purgate-type: clean X-purgate-ID: 151147::1310133935-00005A17-F217A9FC/0-0/0-0 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.2 X-Spam-Flag: NO X-Spam-Checker-Version: SpamAssassin 3.0.4 on Dschibuti.ZEDAT.FU-Berlin.DE X-Spam-Level: X-Spam-Status: No, score=0.0 required=5.0 tests=FORGED_RCVD_HELO,SPF_PASS Subject: Re: [Seqan-dev] Random access of large FASTA file X-BeenThere: seqan-dev@lists.fu-berlin.de X-Mailman-Version: 2.1.11 Precedence: list Reply-To: SeqAn Development List-Id: SeqAn Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 08 Jul 2011 14:05:36 -0000

Sorry, I still don't get it.
How can [ MultiFastaFile ==> Dna5String ==> StringSet >, Owner > > ] work, if it copies the value of the sequence?
Doesn't assignSeq() copy the value into the Dna5String seq?

What happens when I use appendValue to add seq to the StringSet, where does it actually reside (it should still be in the MultiFasta file)?

I need to access the MultiFastaFile (on the hard disk) as a regular StringSet to read its contents on demand, not copy its sequences into a new memory-mapped file.

Johannes


On Thursday, 7 July 2011 20:28:00, Weese, David wrote:
> Hi,
>
> follow the howto on http://trac.mi.fu-berlin.de/seqan/wiki/HowTo/EfficientImportOfMillionsOfSequences and simply change:
>
> StringSet > seqs;
>
> into:
>
> StringSet >, Owner > > seqs;
>
> That should do what you want.
>=20 > Regards, > David >=20 >=20 > Am 07.07.2011 um 17:11 schrieb Johannes Dr=F6ge: >=20 > > Hello David, > > thank for your comments. I am still not confident with the design conce= pt of memory mapped single strings in Seqan. The idea of the loop in any ca= se is to create a StringSet which type depends on the choosen StringType. S= o it works this way: > >=20 > > 1) create temporary sequences object > > 2) assign content from memory mapped multi-fasta file (MultiSeqFile) > > 3) store in StringSet which will have ownership (I guess this is done v= ia a copy constructor) > >=20 > > This works fine for standard and packed string types. I would also like= to have a StringSet that contains strings that are actually memory mapped = from the original multi-fasta file. I thought that the assignSeq function w= ould appropriately handle this when I use it with default-constructed memor= y mapped sequence object. I seems I misunderstood the design of this sequen= ce type. Is there any way to construct such a StringSet I have in mind? > >=20 > > Gru=DF Johannes > >=20 > >=20 > > Am Donnerstag, 7. Juli 2011 16:14:34 schrieb Weese, David: > >> Hi Johannes, > >>=20 > >> I assume the value of num_records less or equal to length(db_sequences= ). Looking at your code it seems that you try to use a memory mapped string= as a temporary variable in a large loop. Maybe not the best idea, as it wo= uld create a temporary file and deletes it in every iteration. It could be = that the temporary could not be opened, you could test that with a #define = SEQAN_DEBUG before including any SeqAn header. > >> You should at least move all the instantiations out of the loop. Still= I dont think you need a memory mapped string (seq) to store a single seque= nce of a multi fasta file. Also I cannot see, where you store the read sequ= ences. It would make sense to use a single StringSet >, Ow= ner > > data_ that stores multiple sequences using a single= memory mapped string. > >>=20 > >> HTH. 
If the problem still remains, please create a bug ticket with sou= rce code and example files. > >>=20 > >> Cheers, > >> David > >> -- > >> David Weese weese@inf.fu-berlin.de > >> Freie Universit=E4t Berlin http://www.inf.fu-berlin.de/ > >> Institut f=FCr Informatik Phone: +49 30 838 75246 > >> Takustra=DFe 9 Algorithmic Bioinformatics > >> 14195 Berlin Room 021=20 > >>=20 > >> Am 06.07.2011 um 16:28 schrieb Johannes Dr=F6ge: > >>=20 > >>> Hello, > >>> I am using Seqan to access a large FASTA file. In this case, I am imp= orting the whole RefSeq DB for random access (into memory or memory-mapped)= =2E This can be quite a huge file, so I decided to go for a dynamic strateg= y writing a generic SequenceStorage object. It works well for > >>>=20 > >>> typedef seqan::String< seqan::Dna5 > StringType; //(default type) > >>> typedef seqan::String< seqan::Dna5, seqan::Packed<> > StringType; > >>>=20 > >>> but not for > >>> typedef seqan::String< seqan::Dna5, seqan::MMap<> > StringType; > >>>=20 > >>> Here is the Code that imports the data using the MMap-Trick from the = HowTo and put it into a > >>>=20 > >>> StringSet< StringType > data_; > >>>=20 > >>> with an index data structure=20 > >>>=20 > >>> std::map< std::string, long unsigned int > id2pos_; > >>>=20 > >>> ---------------------------------------------------------------------= =2D---------- > >>> seqan::MultiSeqFile db_sequences; > >>> seqan::open( db_sequences.concat, filename.c_str(), seqan::OPEN_RDONL= Y ); > >>> seqan::split( db_sequences, seqan::Fasta() ); > >>>=20 > >>> for( unsigned int i =3D 0; i < num_records; ++i ) { > >>> StringType seq; > >>> seqan::assignSeq( seq, db_sequences[i], fasta_format_ ); > >>> =09 > >>> std::string id; > >>> seqan::assignSeqId( id, db_sequences[i], fasta_format_ ); > >>> id2pos_[ extractFastaCommentField( id, "gi" ) ] =3D seqan::assignVal= ueById( data_, seq ); > >>> } > >>> ---------------------------------------------------------------------= =2D---------- > >>>=20 > >>> 
1) seqan::assignValueById() will cause a segfault at sequence number = 33,924 out of 276,313 when using a StringSet with mmap strings. > >>>=20 > >>> 2) Also, I don't know how to define a StringSet using array strings. > >>>=20 > >>> 3) Using a regular Dna5 string, the how operation will take about 5 m= inutes. A packed string requires much longer to load. Is there any way to s= peed this up? I could think of a (binary) sink for a StingSet to avoid pars= ing and recoding every time I load the DB sequences. Is there anything like= this (planned)? > >>>=20 > >>> I appreciate your help! > >>>=20 > >>> Gru=DF Johannes > >=20 > > _______________________________________________ > > seqan-dev mailing list > > seqan-dev@lists.fu-berlin.de > > https://lists.fu-berlin.de/listinfo/seqan-dev >=20 >=20 > _______________________________________________ > seqan-dev mailing list > seqan-dev@lists.fu-berlin.de > https://lists.fu-berlin.de/listinfo/seqan-dev >=20 From weese@campus.fu-berlin.de Fri Jul 08 19:35:22 2011 Received: from outpost1.zedat.fu-berlin.de ([130.133.4.66]) by list1.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1QfExV-0002bg-CW>; Fri, 08 Jul 2011 19:35:21 +0200 Received: from relay2.zedat.fu-berlin.de ([130.133.4.80]) by outpost1.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1QfExV-0005fc-AT>; Fri, 08 Jul 2011 19:35:21 +0200 Received: from exchange6.fu-berlin.de ([160.45.9.133]) by relay2.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1QfExV-0000sK-59>; Fri, 08 Jul 2011 19:35:21 +0200 Received: from exchange6.fu-berlin.de ([160.45.9.133]) by exchange6.fu-berlin.de ([160.45.9.133]) with mapi; Fri, 8 Jul 2011 19:35:21 +0200 From: "Weese, David" To: SeqAn Development Date: Fri, 8 Jul 2011 19:35:20 +0200 Thread-Topic: [Seqan-dev] Random access of large FASTA file Thread-Index: Acw9lWkH4aSvRvxwQgugBESkPDWPrQ== Message-ID: 
References: <201107061628.12413.johdro@mpi-inf.mpg.de> <201107071711.39904.johdro@mpi-inf.mpg.de> <32E2E994-9A9C-4536-B5D5-0A6970E3723E@fu-berlin.de> <201107081605.30762.johdro@mpi-inf.mpg.de> In-Reply-To: <201107081605.30762.johdro@mpi-inf.mpg.de> Accept-Language: de-DE Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: acceptlanguage: de-DE Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 X-Originating-IP: 160.45.9.133 X-purgate: clean X-purgate-type: clean X-purgate-ID: 151147::1310146521-00005A17-AD3478FA/0-0/0-0 X-Bogosity: Ham, tests=bogofilter, spamicity=0.148144, version=1.2.2 X-Spam-Flag: NO X-Spam-Checker-Version: SpamAssassin 3.0.4 on Benin.ZEDAT.FU-Berlin.DE X-Spam-Level: X-Spam-Status: No, score=-2.8 required=5.0 tests=ALL_TRUSTED Subject: Re: [Seqan-dev] Random access of large FASTA file X-BeenThere: seqan-dev@lists.fu-berlin.de X-Mailman-Version: 2.1.11 Precedence: list Reply-To: SeqAn Development List-Id: SeqAn Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 08 Jul 2011 17:35:22 -0000

On 08.07.2011 at 16:05, Johannes Dröge wrote:

> Sorry, I still don't get it.
> How can [ MultiFastaFile ==> Dna5String ==> StringSet >, Owner > > ] work, if it copies the value of the sequence?
> Doesn't assignSeq() copy the value into the Dna5String seq?

assignSeq *extracts* the sequence information from a block that may contain a header, a sequence interspersed with newlines, quality values, etc.
If you want to get sequence substrings of an unprocessed Fasta file, they may contain whitespace.

>
> What happens when I use appendValue to add seq to the StringSet, where does it actually reside (it should still be in the MultiFasta file).

As assignSeq(seq, ...) extracts the sequence character by character, there is no association between seq and the Fasta file.
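The point about assignSeq can be illustrated with a minimal sketch in plain C++. This is not SeqAn's implementation, and `extractSequence` is a hypothetical name introduced only for this illustration; it assumes a record is a single block of text with an optional '>' header line followed by sequence lines. It shows why the extracted sequence is an independent copy: the residues are written one by one into a new string, and nothing ties the result back to the bytes of the file.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper (not part of SeqAn): copies the residues of one
// FASTA record into a new string, one character at a time.
std::string extractSequence(const std::string& record)
{
    std::string seq;
    std::size_t pos = 0;

    // Skip the header line if the record starts with '>'.
    if (!record.empty() && record[0] == '>')
    {
        pos = record.find('\n');
        if (pos == std::string::npos)
            return seq;  // header only, no sequence data
        ++pos;
    }

    // Copy everything except whitespace (newlines, carriage returns, blanks).
    for (; pos < record.size(); ++pos)
    {
        unsigned char c = static_cast<unsigned char>(record[pos]);
        if (!std::isspace(c))
            seq.push_back(static_cast<char>(c));
    }
    return seq;
}
```

For a record such as ">gi|1|test\nACGT\nNNAC\n" the function returns "ACGTNNAC" in freshly allocated storage, which is why a substring taken directly from the unprocessed file would still contain the line breaks while the extracted copy does not.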
>
> I need to access the MultiFastaFile (on the hard disk) as a regular StringSet to read its contents on demand, not copy its sequences into a new memory-mapped file.

Then you need to keep the split MultiSeqFile and extract the sequences on demand with assignSeq.
If you access the sequences very often, I would recommend filling a StringSet >, Owner > > (see my last mail), which also resides on your hard disk but can be accessed without assignSeq.

>
> Johannes
>
>
> On Thursday, 7 July 2011 20:28:00, Weese, David wrote:
>> Hi,
>>
>> follow the howto on http://trac.mi.fu-berlin.de/seqan/wiki/HowTo/EfficientImportOfMillionsOfSequences and simply change:
>>
>> StringSet > seqs;
>>
>> into:
>>
>> StringSet >, Owner > > seqs;
>>
>> That should do what you want.
>>
>> Regards,
>> David
>>
>>

From danielpeterjames@gmail.com Sun Jul 10 17:50:53 2011 Received: from relay1.zedat.fu-berlin.de ([130.133.4.67]) by list1.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1QfwHU-0002YN-B4>; Sun, 10 Jul 2011 17:50:52 +0200 Received: from mail-yi0-f54.google.com ([209.85.218.54]) by relay1.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1QfwHU-00070f-4p>; Sun, 10 Jul 2011 17:50:52 +0200 Received: by yic13 with SMTP id 13so277791yic.13 for ; Sun, 10 Jul 2011 08:50:51 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=mime-version:in-reply-to:references:from:date:message-id:subject:to :content-type:content-transfer-encoding; bh=5qAe/7b7Q5c2is4ho0laIfgp2thrk2m02ARWsqRM1FA=; b=b4lu+PSQopLrxkUfP7HTolr/JjrNEEyqRv49OecxCRQ0AO2IyNNtypgNMPc87etdmg 6FKKCXe/z5AE7qjisd9NU1oRJdSaI6tHAoeqfc7hok6nSrXSqebGDahTAi/TNb2wCDR9 Vb3xApDwDM5pc1DHBcJJNP5vyiMbucfn5F8Q0= Received: by 10.236.67.76 with SMTP id i52mr4259807yhd.308.1310313051120; Sun, 10 Jul 2011 08:50:51 -0700 (PDT) MIME-Version: 1.0 Received: by 10.236.95.42 with
HTTP; Sun, 10 Jul 2011 08:50:30 -0700 (PDT) In-Reply-To: References: From: Daniel James Date: Sun, 10 Jul 2011 16:50:30 +0100 Message-ID: To: seqan-dev@lists.fu-berlin.de Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable X-Originating-IP: 209.85.218.54 X-purgate: clean X-purgate-type: clean X-purgate-ID: 151147::1310313052-00005A17-C3D1B63C/0-0/0-0 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000001, version=1.2.2 X-Spam-Flag: NO X-Spam-Checker-Version: SpamAssassin 3.0.4 on Gabun.ZEDAT.FU-Berlin.DE X-Spam-Level: x X-Spam-Status: No, score=1.8 required=5.0 tests=DNS_FROM_RFC_ABUSE, DNS_FROM_RFC_POST,RCVD_BY_IP,SPF_HELO_PASS,SPF_PASS Subject: [Seqan-dev] Fwd: Building QGramSA fibre from StringSet X-BeenThere: seqan-dev@lists.fu-berlin.de X-Mailman-Version: 2.1.11 Precedence: list Reply-To: SeqAn Development List-Id: SeqAn Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 10 Jul 2011 15:50:53 -0000

Hi

I'm getting an error whilst trying to get the SA fibre of a QGram index over a StringSet.

There's a minimal example below. Is there any chance someone could have a look at this? I'm on Rev: 9996 of trunk.

Daniel

#include
#include

using namespace seqan;

// Generates random nucleotides.
struct MyGenerator : std::unary_function {
    std::string syms;
    MyGenerator (std::string syms = "ACGT") : syms(syms) { srand(time(NULL)); }
    char operator()(void) { return syms[rand() % syms.size()]; }
};

int main(int argc, char** argv)
{
    typedef StringSet TMyStringSet;
    typedef Index > > TMyIndex;

    StringSet myStringSet;
    for (unsigned i = 0; i < 100; ++i)
    {
        DnaString input;
        resize(input, 60);
        generate_n(begin(input), 60, MyGenerator());
        appendValue(myStringSet, input);
    }
    std::cout << myStringSet[0] << std::endl;
    std::cout << "requiring QGramSA..." << std::endl;
    TMyIndex index(myStringSet);
    indexRequire(index, QGramSA());
    return 0;
}

---------- Forwarded message ----------
From: Daniel James
Date: 5 July 2011 15:18
Subject: Building QGramSA fibre from StringSet
To: seqan-dev@lists.fu-berlin.de

Hi

I'm running into some unexpected behaviour whilst trying to use an SA fibre of a QGram index that's built over a string set.

The code below runs OK with 1 or 10 as the command line argument (the number of strings in the string set), but fails at 100 with the following exception:

/Users/dj5/usr/local/include/seqan/sequence/string_base.h:238 Assertion failed : static_cast(pos) < static_cast(length(me)) was: 171599218 >= 100 (Trying to access an element behind the last one!)

Have I made a coding blunder or is this a bug?

Many thanks,
Daniel

#include
#include
#include
#include
#include
#include
#include

using namespace std;
using namespace seqan;

// Generates random nucleotides.
struct MyGenerator : public unary_function {
  string syms;
  MyGenerator (string syms = "ACGT") : syms(syms) { srand(time(NULL)); }
  char operator()(void) { return syms[rand() % syms.size()]; }
};

int main(int argc, char** argv)
{
  // Input the number of sequences for the string set.
  stringstream ss(argv[1]);
  unsigned n;
  ss >> n;

  typedef StringSet TMyStringSet;
  typedef Index > > TMyIndex;
  typedef Fibre::Type TMySAFibre;
  typedef Fibre::Type TMyDirFibre;

  // Fill a string set with 60-mer DNA sequences.
  StringSet myStringSet;
  string input_s;
  DnaString input;
  for (unsigned i = 0; i < n; ++i)
  {
      input_s.resize(60);
      generate_n(input_s.begin(), 60, MyGenerator());
      input = input_s;
      appendValue(myStringSet, input);
  }

  // Build the index.
  TMyIndex index(myStringSet);

  // Require the QGramSA fibre.
  cout << "requiring SA fibre:\n";
  float t0 = clock();
  indexRequire(index, QGramSA());
  cout << (clock() - t0)/CLOCKS_PER_SEC << endl;

  cout << "requiring Dir fibre:\n";
  t0 = clock();
  indexRequire(index, QGramDir());
  cout << (clock() - t0)/CLOCKS_PER_SEC << endl;

  TMySAFibre mySAFibre = getFibre(index, QGramSA());
  cout << "QGramSA length: " << length(mySAFibre) << endl;
  TMyDirFibre myDirFibre = getFibre(index, QGramDir());
  cout << "QGramDir length: " << length(myDirFibre) << endl;
  return 0;
}

From f.buske@uq.edu.au Wed Jul 13 13:47:27 2011 Received: from relay1.zedat.fu-berlin.de ([130.133.4.67]) by list1.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1QgxuX-0001lj-VD>; Wed, 13 Jul 2011 13:47:26 +0200 Received: from mailhub4.uq.edu.au ([130.102.149.131]) by relay1.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1QgxuX-0000Cc-6Y>; Wed, 13 Jul 2011 13:47:25 +0200 Received: from smtp4.uq.edu.au (smtp4.uq.edu.au [130.102.128.19]) by mailhub4.uq.edu.au (8.13.8/8.13.8) with ESMTP id p6DBlKDX017825 for ; Wed, 13 Jul 2011 21:47:21 +1000 Received: from Fabian-Buskes-MacBook.local (c122-108-178-120.rochd4.qld.optusnet.com.au [122.108.178.120]) (authenticated bits=0) by smtp4.uq.edu.au (8.13.8/8.13.8) with ESMTP id p6DBlIIG014885 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NOT) for ; Wed, 13 Jul 2011 21:47:20 +1000 Message-ID: <4E1D85C6.6020500@uq.edu.au> Date: Wed, 13 Jul 2011 21:47:18 +1000 From: Fabian Buske User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:5.0) Gecko/20110624 Thunderbird/5.0 MIME-Version: 1.0 To: SeqAn Development Content-Type: multipart/mixed; boundary="------------050502070902010701000607" X-UQ-FilterTime: 1310557641 X-Scanned-By: MIMEDefang 2.58 on UQ Mailhub on 130.102.149.131 X-Originating-IP:
130.102.149.131 X-ZEDAT-Hint: A X-purgate: clean X-purgate-type: clean X-purgate-ID: 151147::1310557645-00005A17-78AC34FE/0-0/0-0 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.2 X-Spam-Flag: NO X-Spam-Checker-Version: SpamAssassin 3.0.4 on Burundi.ZEDAT.FU-Berlin.DE X-Spam-Level: X-Spam-Status: No, score=0.9 required=5.0 tests=FORGED_RCVD_HELO, RATWARE_GECKO_BUILD Subject: [Seqan-dev] bugfix: missing metafunction in misc_dequeue.h X-BeenThere: seqan-dev@lists.fu-berlin.de X-Mailman-Version: 2.1.11 Precedence: list Reply-To: SeqAn Development List-Id: SeqAn Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 13 Jul 2011 11:47:27 -0000

This is a multi-part message in MIME format.
--------------050502070902010701000607
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit

Hi,

it seems that seqan/misc/misc_dequeue.h currently cannot report the type of its values (the Value metafunction is missing).
I didn't check whether this may also explain some of the bugs in the bug tracker related to the dequeue template.
Anyway, attached is a fix for this one.

Cheers,
Fabian

--
Fabian A. Buske
Institute for Molecular Bioscience
The University of Queensland
Brisbane, Qld. 4072 Australia
Phone: (61)-(7)-334-62608

--------------050502070902010701000607
Content-Type: text/plain; x-mac-type="0"; x-mac-creator="0"; name="misc_dequeue_bugfix.diff"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment; filename="misc_dequeue_bugfix.diff"

Index: misc_dequeue.h
===================================================================
--- misc_dequeue.h (revision 10040)
+++ misc_dequeue.h (working copy)
@@ -108,6 +108,11 @@
 typedef Iter const, PositionIterator> Type;
 };
+template
+struct Value >
+{
+ typedef TValue Type;
+};
 //////////////////////////////////////////////////////////////////////////////

--------------050502070902010701000607--

From Knut.Reinert@fu-berlin.de Thu Jul 14 22:57:41 2011 Received: from outpost1.zedat.fu-berlin.de ([130.133.4.66]) by list1.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1QhSya-0005FF-4i>; Thu, 14 Jul 2011 22:57:40 +0200 Received: from relay2.zedat.fu-berlin.de ([130.133.4.80]) by outpost1.zedat.fu-berlin.de (Exim 4.69) with esmtp (envelope-from ) id <1QhSyY-0001Sj-KQ>; Thu, 14 Jul 2011 22:57:39 +0200 Received: from exchange6.fu-berlin.de ([160.45.9.133]) by relay2.zedat.fu-berlin.de (Exim 4.69) with esmtp (envelope-from ) id <1QhSyH-0006FT-2l>; Thu, 14 Jul 2011 22:57:21 +0200 Received: from exchange6.fu-berlin.de ([160.45.9.133]) by exchange6.fu-berlin.de ([160.45.9.133]) with mapi; Thu, 14 Jul 2011 22:57:19 +0200 From: "Reinert, Knut" To: AG ABI , Martin Vingron , "Kruglyak, Semyon" , Dirk Evers , Tobias Mann , Bret Barnes , Raffaele Giancarlo , Anthony Cox , Kathrin Trappe , Jochen Singer , Markus Bauer , Ole Schulz-Trieglaff , Nikolaus Rajewsky , Stefan Mundlos , Peter Robinson , Jürgen Kleffe , Hans-Peter Lenhof , Stefan Kurtz , Sven Rahmann , Jens Stoye , Lars Langner , kuss_a , Marcel Grunert , Gunnar Klau , wei chen , Chen Li , Steven Salzberg , "franzime@zedat.fu-berlin.de" , Konrad Ludwig Moritz Rudolph , Franziska
Zickmann , Johannes Krugel , "Dr. Hans-Joachim Hinz" , Ulf Leser , Ulrich Meyer , Marcel Schulz , Cedric Notredame , Andreas Doering , Christoph Dieterich , Carsten Kemena , Michael Stromberg , "bertram.weiss@bayerhealthcare.com" , Lars Bertram , Michael Berthold , Marcel Martin , Tobias Rausch , Johannes Roehr , Martin Riese , "timmermann@molgen.mpg.de" , Johannes Fischer , Kurt Mehlhorn , Peter Sanders , "naeher@uni-trier.de" , Gonzalo Navarro , Andreas Hildebrandt , Aaron Halpern , Robert Giegerich , Rolf Backofen , "stadler@bioinf.uni-leipzig.de" , Kristina Little , Ralf Zimmer , Andreas Keller , Sabrina Krakau , "Dr. Bernhard Balkenhol" , Ralf Herwig , Han-Yu Chuang , "taeubig@informatik.tu-muenchen.de" , Vineet Bafna , Pavel Pevzner , David Haussler , Benedict Paten , "langmead@cs.umd.edu" , Granger Sutton , Shibu Yooseph , Andreas Beutler , Paolo Di Tommaso , "birney@ebi.ac.uk" , "jens-uwe.krause@lgcgenomics.com" , "efritzilas@illumina.com" , "rina.ahmed@mdc-berlin.de" , Camila Mazzoni , "wwong@illumina.com" , Fabian Buske , "Denis C. Bauer" , Franziska Zickmann , Oliver Kohlbacher , "Dr. 
Jan Baumbach" , "lengauer@mpi-sb.mpg.de" , Ernst Althaus , "mario.albrecht@mpi-inf.mpg.de" , "cbock@mpi-inf.mpg.de" , Stefan Canzar , =?iso-8859-1?Q?=22B=E4rwolf=2C_Aneta=22?= , Laurent Mouchard , Max Crochemore , "costas.iliopoulos@gmail.com Iliopoulos" , "steinke@zib.de Steinke" , "kathleen.steinhoefel@kcl.ac.uk" , =?iso-8859-1?Q?Johannes_Dr=F6ge?= , Thomas Dan Otto , Mat , Bernhard Renard , SeqAn Development , "mallgaier@igb-berlin.de" , "monaghan@igb-berlin.de" , =?iso-8859-1?Q?Bj=F6rn_Kahlert?= , Lutz Prechelt , Leon Kuchenbecker Date: Thu, 14 Jul 2011 22:57:02 +0200 Thread-Topic: Invitation to 3rd SeqAn workshop, September 12.-14., Harnack Haus, Berlin Dahlem Thread-Index: AcxCaJ6NuqA1bRFrSBC99lXlr1OsJw== Message-ID: <762980B4-CF7C-40B9-9AA0-6993B5B6283C@fu-berlin.de> Accept-Language: en-US, de-DE Content-Language: en-US X-MS-Has-Attach: yes X-MS-TNEF-Correlator: acceptlanguage: en-US, de-DE Content-Type: multipart/signed; boundary="Apple-Mail-120--490142946"; protocol="application/pkcs7-signature"; micalg=sha1 MIME-Version: 1.0 X-Originating-IP: 160.45.9.133 X-ZEDAT-Hint: A X-purgate: clean X-purgate-type: clean X-purgate-ID: 151147::1310677060-00005A17-281F8CB3/0-0/0-0 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.2 X-Mailman-Approved-At: Fri, 15 Jul 2011 00:20:54 +0200 Subject: [Seqan-dev] Invitation to 3rd SeqAn workshop, September 12.-14., Harnack Haus, Berlin Dahlem X-BeenThere: seqan-dev@lists.fu-berlin.de X-Mailman-Version: 2.1.11 Precedence: list Reply-To: SeqAn Development List-Id: SeqAn Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 14 Jul 2011 20:57:41 -0000 --Apple-Mail-120--490142946 Content-Type: multipart/alternative; boundary=Apple-Mail-118--490143012 --Apple-Mail-118--490143012 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset=iso-8859-1 Dear friends, SeqAn users, and algorithm developers, I invite you (or coworkers) to 
participate in the 3rd SeqAn workshop (www.seqan.de), which will take place from September 12 to 14, 2011 in Berlin, Germany.

The workshop will be free of charge and is sponsored by the German Ministry for Education and Research within the VIP program.
The venue will be the Harnack Haus in Berlin Dahlem (http://www.harnackhaus-berlin.mpg.de/).

More information can be found on the attached flyer and on the VIP project website (http://www.seqan-biostore.de/wp).

In order to plan the details, we would like you to confirm your participation by August 14th at the latest.

Please send a mail to Sabrina Krakau <krakau@mi.fu-berlin.de> with the information

a) whether you want to participate
b) whether you would like to give a talk on Monday the 12th about your recent research, open problems, or your experience with SeqAn (see attached schedule).

Please feel free to forward the mail to interested users.

We hope to see you in Berlin in September,

The SeqAn team

-----------------------------------------------------------------------
Prof. Dr.-Ing. Knut Reinert      Phone/fax : +49 30 838 75 222/218 (GE)
                                           : +1 858 8826656 (US)
Algorithmic Bioinformatics       Mobile    : +49 160 7195754 (GE)
                                           : +1 858405 8323 (US)
Freie Universität Berlin         Skype     : knut.reinert
Takustrasse 9                    E-Mail    : knut.reinert@fu-berlin.de
D-14195 Berlin, Germany          Web       : http://knut.reinert.ws
------------------------------------------------------------------------

--Apple-Mail-118--490143012--
--Apple-Mail-120--490142946--

From jer15@hermes.cam.ac.uk Fri Jul 15 10:18:10 2011 Received: from relay1.zedat.fu-berlin.de ([130.133.4.67]) by list1.zedat.fu-berlin.de (Exim 4.69)
for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1Qhdb6-0005Mx-Ii>; Fri, 15 Jul 2011 10:18:08 +0200 Received: from ppsw-52.csi.cam.ac.uk ([131.111.8.152]) by relay1.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1Qhdb6-0002ev-FE>; Fri, 15 Jul 2011 10:18:08 +0200 X-Cam-AntiVirus: no malware found X-Cam-SpamDetails: not scanned X-Cam-ScannerInfo: http://www.cam.ac.uk/cs/email/scanner/ Received: from cpc6-dals15-2-0-cust115.hari.cable.virginmedia.com ([82.35.196.116]:45290 helo=[192.168.1.4]) by ppsw-52.csi.cam.ac.uk (smtp.hermes.cam.ac.uk [131.111.8.159]:587) with esmtpsa (PLAIN:jer15) (TLSv1:DHE-RSA-CAMELLIA256-SHA:256) id 1Qhdb6-0006Gg-DT (Exim 4.72) for seqan-dev@lists.fu-berlin.de (return-path ); Fri, 15 Jul 2011 09:18:08 +0100 Message-ID: <4E1FF7BF.9080003@mail.cryst.bbk.ac.uk> Date: Fri, 15 Jul 2011 09:18:07 +0100 From: John Reid User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.17) Gecko/20110424 Lightning/1.0b2 Thunderbird/3.1.10 MIME-Version: 1.0 To: SeqAn Development Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: "J.E. 
Reid" X-Originating-IP: 131.111.8.152 X-purgate: clean X-purgate-type: clean X-purgate-ID: 151147::1310717888-00005A17-0AEBF0C0/0-0/0-0 X-Bogosity: Unsure, tests=bogofilter, spamicity=0.461385, version=1.2.2 X-Spam-Flag: NO X-Spam-Checker-Version: SpamAssassin 3.0.4 on Gabun.ZEDAT.FU-Berlin.DE X-Spam-Level: xx X-Spam-Status: No, score=2.3 required=5.0 tests=FU_BOGO_UNSURE, RATWARE_GECKO_BUILD Subject: [Seqan-dev] Sparse property map X-BeenThere: seqan-dev@lists.fu-berlin.de X-Mailman-Version: 2.1.11 Precedence: list Reply-To: SeqAn Development List-Id: SeqAn Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 15 Jul 2011 08:18:10 -0000 Hi, If I want to store a property for every vertex in a suffix array I can do something like this: String< double > property; const double v = 1.3; resizeVertexMap( index, property ); assignProperty( property, value( top_down_it( index ) ), v ); const double prop_value = getProperty( property, value( top_down_it( index ) ) ); Here the properties are stored in a seqan::String. What if I only want to store properties for a very few of the vertices in a very large index? Is there some sparse storage I can use instead of String? Thanks, John. 
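
For the sparse case, one option is an associative container keyed by a vertex id, so that only the vertices that actually carry a property take up space. The following is a minimal sketch in plain C++, not SeqAn API: SparseProperty, assign, get and storedCount are hypothetical names, and how a stable integer id is obtained from an index vertex descriptor is an assumption that is left open here.

```cpp
#include <cassert>
#include <cstddef>
#include <map>

// Hypothetical sparse property map: values are stored only for ids that were
// explicitly assigned; every other id reports the default value.
template <typename TValue>
class SparseProperty
{
public:
    explicit SparseProperty(TValue const & defaultValue) : default_(defaultValue) {}

    // Store (or overwrite) the property of one vertex id.
    void assign(std::size_t id, TValue const & v) { values_[id] = v; }

    // Look up a property without inserting an entry for unknown ids.
    TValue get(std::size_t id) const
    {
        typename std::map<std::size_t, TValue>::const_iterator it = values_.find(id);
        return it == values_.end() ? default_ : it->second;
    }

    // Number of ids that actually occupy memory.
    std::size_t storedCount() const { return values_.size(); }

private:
    std::map<std::size_t, TValue> values_;
    TValue default_;
};
```

Memory then grows with the number of assigned vertices rather than with the size of the index, at the price of a logarithmic lookup instead of constant-time array access.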
From manuel.holtgrewe@fu-berlin.de Fri Jul 15 10:31:51 2011 Received: from outpost1.zedat.fu-berlin.de ([130.133.4.66]) by list1.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1QhdoM-0005tv-UD>; Fri, 15 Jul 2011 10:31:51 +0200 Received: from inpost2.zedat.fu-berlin.de ([130.133.4.69]) by outpost1.zedat.fu-berlin.de (Exim 4.69) with esmtp (envelope-from ) id <1QhdoM-0003Bk-Rx>; Fri, 15 Jul 2011 10:31:50 +0200 Received: from 91-65-212-104-dynip.superkabel.de ([91.65.212.104] helo=[192.168.0.100]) by inpost2.zedat.fu-berlin.de (Exim 4.69) with esmtpsa (envelope-from ) id <1QhdoM-0006oa-PY>; Fri, 15 Jul 2011 10:31:50 +0200 Message-Id: <7F6EF017-C5D0-4B09-8D93-02CA0B75D42B@fu-berlin.de> From: Manuel Holtgrewe To: SeqAn Development In-Reply-To: <4E1FF7BF.9080003@mail.cryst.bbk.ac.uk> Content-Type: text/plain; charset=ISO-8859-1; format=flowed; delsp=yes Content-Transfer-Encoding: quoted-printable Mime-Version: 1.0 (Apple Message framework v936) Date: Fri, 15 Jul 2011 10:31:50 +0200 References: <4E1FF7BF.9080003@mail.cryst.bbk.ac.uk> X-Mailer: Apple Mail (2.936) X-Originating-IP: 91.65.212.104 X-purgate: clean X-purgate-type: clean X-purgate-ID: 151147::1310718710-00005A17-DA9C462F/0-0/0-0 X-Bogosity: Ham, tests=bogofilter, spamicity=0.112023, version=1.2.2 X-Spam-Flag: NO X-Spam-Checker-Version: SpamAssassin 3.0.4 on Gabun.ZEDAT.FU-Berlin.DE X-Spam-Level: X-Spam-Status: No, score=-2.8 required=5.0 tests=ALL_TRUSTED Subject: Re: [Seqan-dev] Sparse property map X-BeenThere: seqan-dev@lists.fu-berlin.de X-Mailman-Version: 2.1.11 Precedence: list Reply-To: SeqAn Development List-Id: SeqAn Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 15 Jul 2011 08:31:52 -0000

John,

the long-term solution for us is to fix the map module and make the graph module work better with it. A medium-term solution would be to fix graph_property.h. The quick fix for you is to use std::map directly. Comment out the resizeVertexMap(). You can use seqan::_getId(vertexOrEdgeDescriptor) to get a number from a vertex or edge descriptor.

Bests,
Manuel

On 15.07.2011 at 10:18, John Reid wrote:

> Hi,
>
> If I want to store a property for every vertex in a suffix array I can
> do something like this:
>
> String< double > property;
> const double v = 1.3;
> resizeVertexMap( index, property );
> assignProperty( property, value( top_down_it( index ) ), v );
> const double prop_value = getProperty( property, value(
> top_down_it( index ) ) );
>
> Here the properties are stored in a seqan::String. What if I only want
> to store properties for a very few of the vertices in a very large
> index? Is there some sparse storage I can use instead of String?
>
> Thanks,
> John.
>
> _______________________________________________
> seqan-dev mailing list
> seqan-dev@lists.fu-berlin.de
> https://lists.fu-berlin.de/listinfo/seqan-dev

-- 
Manuel Holtgrewe manuel.holtgrewe@fu-berlin.de
Freie Universität Berlin http://www.inf.fu-berlin.de/
Institut für Informatik Phone: +49 30 838 75246
Takustraße 9 Algorithmic Bioinformatics
14195 Berlin Room 021

From jer15@hermes.cam.ac.uk Fri Jul 15 17:58:27 2011 Received: from relay1.zedat.fu-berlin.de ([130.133.4.67]) by list1.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1QhkmY-0006SJ-J0>; Fri, 15 Jul 2011 17:58:26 +0200 Received: from ppsw-41.csi.cam.ac.uk ([131.111.8.141]) by relay1.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1QhkmY-0004k9-Fo>; Fri, 15 Jul 2011 17:58:26 +0200 X-Cam-AntiVirus: no malware found X-Cam-SpamDetails: not scanned X-Cam-ScannerInfo: http://www.cam.ac.uk/cs/email/scanner/ Received: from cpc6-dals15-2-0-cust115.hari.cable.virginmedia.com ([82.35.196.116]:42310 helo=[192.168.1.4]) by ppsw-41.csi.cam.ac.uk (smtp.hermes.cam.ac.uk [131.111.8.156]:587) with
esmtpsa (PLAIN:jer15) (TLSv1:DHE-RSA-CAMELLIA256-SHA:256) id 1QhkmY-0005ie-QL (Exim 4.72) for seqan-dev@lists.fu-berlin.de (return-path ); Fri, 15 Jul 2011 16:58:26 +0100 Message-ID: <4E2063A1.80108@mail.cryst.bbk.ac.uk> Date: Fri, 15 Jul 2011 16:58:25 +0100 From: John Reid User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.17) Gecko/20110424 Lightning/1.0b2 Thunderbird/3.1.10 MIME-Version: 1.0 To: SeqAn Development Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: "J.E. Reid" X-Originating-IP: 131.111.8.141 X-purgate: clean X-purgate-type: clean X-purgate-ID: 151147::1310745506-00005A17-D8261620/0-0/0-0 X-Bogosity: Ham, tests=bogofilter, spamicity=0.231499, version=1.2.2 X-Spam-Flag: NO X-Spam-Checker-Version: SpamAssassin 3.0.4 on Dschibuti.ZEDAT.FU-Berlin.DE X-Spam-Level: X-Spam-Status: No, score=0.8 required=5.0 tests=RATWARE_GECKO_BUILD Subject: [Seqan-dev] global position range X-BeenThere: seqan-dev@lists.fu-berlin.de X-Mailman-Version: 2.1.11 Precedence: list Reply-To: SeqAn Development List-Id: SeqAn Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 15 Jul 2011 15:58:27 -0000 Suppose I am at a vertex of a suffix array created from a set of strings. If I take one occurrence of the vertex I can access the local and global positions via the posGlobalize() and posLocalize() functions. The local position is the id of the suffix's sequence in the string set and the position of the suffix in the string. The global position is an int. What is the range of the global position? Does it range between 0 and N where N is the number of characters in the string set? Thanks, John. 
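
One common layout for such a mapping, sketched here under the assumption that a global position is simply an offset into the concatenation of all sequences: keep a table of cumulative sequence lengths (one more entry than there are sequences), so that global = limits[seqNo] + offset. Under that assumption, global positions run from 0 to N-1, where N is the total number of characters in the string set. The names makeLimits, globalize and localize below are hypothetical plain-C++ helpers, not the SeqAn functions, and the behaviour should be checked against the library before relying on it.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// A local position: (sequence number, offset inside that sequence).
typedef std::pair<std::size_t, std::size_t> LocalPos;

// Cumulative lengths: limits[i] is the global position where sequence i starts,
// limits.back() is the total number of characters N.
std::vector<std::size_t> makeLimits(std::vector<std::string> const & seqs)
{
    std::vector<std::size_t> limits(1, 0);
    for (std::size_t i = 0; i < seqs.size(); ++i)
        limits.push_back(limits.back() + seqs[i].size());
    return limits;
}

// Local -> global: start of the sequence plus the offset within it.
std::size_t globalize(LocalPos const & p, std::vector<std::size_t> const & limits)
{
    return limits[p.first] + p.second;
}

// Global -> local; requires g < limits.back().
LocalPos localize(std::size_t g, std::vector<std::size_t> const & limits)
{
    // Find the first limit strictly greater than g; the sequence containing g
    // is the one that starts at the entry just before it.
    std::vector<std::size_t>::const_iterator it =
        std::upper_bound(limits.begin(), limits.end(), g);
    std::size_t seqNo = static_cast<std::size_t>(it - limits.begin()) - 1;
    return LocalPos(seqNo, g - limits[seqNo]);
}
```

For the three sequences "ACGT", "GG" and "TTT" the limits table is 0, 4, 6, 9, so the local position (1, 1) corresponds to the global position 5 and the largest valid global position is 8.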
From jer15@hermes.cam.ac.uk Fri Jul 15 18:47:05 2011 Received: from relay1.zedat.fu-berlin.de ([130.133.4.67]) by list1.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1QhlXc-0008Ca-Fy>; Fri, 15 Jul 2011 18:47:04 +0200 Received: from ppsw-52.csi.cam.ac.uk ([131.111.8.152]) by relay1.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1QhlXc-00012j-Ck>; Fri, 15 Jul 2011 18:47:04 +0200 X-Cam-AntiVirus: no malware found X-Cam-SpamDetails: not scanned X-Cam-ScannerInfo: http://www.cam.ac.uk/cs/email/scanner/ Received: from cpc6-dals15-2-0-cust115.hari.cable.virginmedia.com ([82.35.196.116]:54032 helo=[192.168.1.4]) by ppsw-52.csi.cam.ac.uk (smtp.hermes.cam.ac.uk [131.111.8.159]:587) with esmtpsa (PLAIN:jer15) (TLSv1:DHE-RSA-CAMELLIA256-SHA:256) id 1QhlXc-0002pu-DU (Exim 4.72) for seqan-dev@lists.fu-berlin.de (return-path ); Fri, 15 Jul 2011 17:47:04 +0100 Message-ID: <4E206F07.5020305@mail.cryst.bbk.ac.uk> Date: Fri, 15 Jul 2011 17:47:03 +0100 From: John Reid User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.17) Gecko/20110424 Lightning/1.0b2 Thunderbird/3.1.10 MIME-Version: 1.0 To: SeqAn Development Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: "J.E. 
Reid" X-Originating-IP: 131.111.8.152 X-purgate: clean X-purgate-type: clean X-purgate-ID: 151147::1310748424-00005A17-35869255/0-0/0-0 X-Bogosity: Ham, tests=bogofilter, spamicity=0.054341, version=1.2.2 X-Spam-Flag: NO X-Spam-Checker-Version: SpamAssassin 3.0.4 on Burundi.ZEDAT.FU-Berlin.DE X-Spam-Level: X-Spam-Status: No, score=0.8 required=5.0 tests=RATWARE_GECKO_BUILD Subject: [Seqan-dev] Metafunction to discover type of pair members X-BeenThere: seqan-dev@lists.fu-berlin.de X-Mailman-Version: 2.1.11 Precedence: list Reply-To: SeqAn Development List-Id: SeqAn Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 15 Jul 2011 16:47:05 -0000 An occurrence of a representative in an index is held as a pair as far as I can work out. I can't find anything in the docs that lets me discover the types of the members of the pair. I think they are both long unsigned ints but it would be nice to know at compile-time. Is there such a metafunction? Thanks, John. 
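
The general technique behind such a query can be sketched in plain C++: a metafunction is a class template whose nested Type is selected by partial specialization on the pair type and a member index. In the sketch below, MemberType and IsSame are hypothetical helpers and std::pair stands in for SeqAn's Pair; the concrete member types chosen for TOccurrence are an assumption for illustration, not what an index actually uses.

```cpp
#include <cassert>
#include <utility>

// Primary template is left undefined: only member indices 1 and 2 make sense.
template <typename TPair, int I>
struct MemberType;

// Index 1 selects the type of the first member.
template <typename T1, typename T2>
struct MemberType<std::pair<T1, T2>, 1> { typedef T1 Type; };

// Index 2 selects the type of the second member.
template <typename T1, typename T2>
struct MemberType<std::pair<T1, T2>, 2> { typedef T2 Type; };

// Compile-time type equality, usable without C++11 <type_traits>.
template <typename A, typename B> struct IsSame { enum { VALUE = 0 }; };
template <typename A> struct IsSame<A, A> { enum { VALUE = 1 }; };

// Illustrative occurrence type: (sequence number, offset in sequence).
typedef std::pair<unsigned, unsigned long> TOccurrence;
typedef MemberType<TOccurrence, 1>::Type TSeqNo;      // unsigned
typedef MemberType<TOccurrence, 2>::Type TSeqOffset;  // unsigned long
```

With this in place the member types are known at compile time and can be used in further typedefs, for example to declare counters that are wide enough for the offsets.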
From manuel.holtgrewe@fu-berlin.de Fri Jul 15 20:36:19 2011 Received: from outpost1.zedat.fu-berlin.de ([130.133.4.66]) by list1.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1QhnFK-0003gR-5e>; Fri, 15 Jul 2011 20:36:18 +0200 Received: from inpost2.zedat.fu-berlin.de ([130.133.4.69]) by outpost1.zedat.fu-berlin.de (Exim 4.69) with esmtp (envelope-from ) id <1QhnFK-0005TT-2t>; Fri, 15 Jul 2011 20:36:18 +0200 Received: from 91-65-212-104-dynip.superkabel.de ([91.65.212.104] helo=[192.168.0.100]) by inpost2.zedat.fu-berlin.de (Exim 4.69) with esmtpsa (envelope-from ) id <1QhnFK-0004bM-0N>; Fri, 15 Jul 2011 20:36:18 +0200 Message-Id: <8898322D-A9F1-45AC-83FB-9CACEDFEEBFD@fu-berlin.de> From: Manuel Holtgrewe To: SeqAn Development In-Reply-To: <4E206F07.5020305@mail.cryst.bbk.ac.uk> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: quoted-printable Mime-Version: 1.0 (Apple Message framework v936) Date: Fri, 15 Jul 2011 20:36:16 +0200 References: <4E206F07.5020305@mail.cryst.bbk.ac.uk> X-Mailer: Apple Mail (2.936) X-Originating-IP: 91.65.212.104 X-purgate: clean X-purgate-type: clean X-purgate-ID: 151147::1310754978-00005A17-B767956F/0-0/0-0 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000151, version=1.2.2 X-Spam-Flag: NO X-Spam-Checker-Version: SpamAssassin 3.0.4 on Benin.ZEDAT.FU-Berlin.DE X-Spam-Level: X-Spam-Status: No, score=-2.8 required=5.0 tests=ALL_TRUSTED Subject: Re: [Seqan-dev] Metafunction to discover type of pair members X-BeenThere: seqan-dev@lists.fu-berlin.de X-Mailman-Version: 2.1.11 Precedence: list Reply-To: SeqAn Development List-Id: SeqAn Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 15 Jul 2011 18:36:20 -0000 Try either metafunctions Key, Value or the nested types I1/I2 of Pair. 
*m

On 15.07.2011 at 18:47, John Reid wrote:

> An occurrence of a representative in an index is held as a pair as far
> as I can work out. I can't find anything in the docs that lets me
> discover the types of the members of the pair. I think they are both
> long unsigned ints but it would be nice to know at compile-time. Is
> there such a metafunction?
>
> Thanks,
> John.
>
> _______________________________________________
> seqan-dev mailing list
> seqan-dev@lists.fu-berlin.de
> https://lists.fu-berlin.de/listinfo/seqan-dev

-- 
Manuel Holtgrewe manuel.holtgrewe@fu-berlin.de
Freie Universität Berlin http://www.inf.fu-berlin.de/
Institut für Informatik Phone: +49 30 838 75246
Takustraße 9 Algorithmic Bioinformatics
14195 Berlin Room 021

From peter.robinson@charite.de Sat Jul 16 17:01:18 2011 Received: from relay1.zedat.fu-berlin.de ([130.133.4.67]) by list1.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1Qi6Mj-00064G-V4>; Sat, 16 Jul 2011 17:01:14 +0200 Received: from mail2.charite.de ([141.42.206.200]) by relay1.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1Qi6Mj-0008JN-Sx>; Sat, 16 Jul 2011 17:01:13 +0200 Received: from localhost (localhost [127.0.0.1]) by mail2.charite.de (Postfix) with ESMTP id 3RGYqs5p13z1tHZ for ; Sat, 16 Jul 2011 17:01:13 +0200 (CEST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=charite.de; h= content-transfer-encoding:content-type:content-type:subject :subject:mime-version:user-agent:from:from:date:date:message-id :received:received; s=default; t=1310828471; x=1312642872; bh=/u nCvseOl4FvHSvDDROuOwt4FHD6H7Vcu9X6BHotEcU=; b=yWDNQWzk2EfSttgQsg OgZsPPLHDdhJR3DabGn5v9t1lssLf9gMo7XWqeDSKpTE9tzBiT4LmumeVdGuqwI0 IXP/sHmkQe775Du1pQP8sOovkPLVz5wdHKuD1P7OHcaeYTgVhGYKydFvv5rs69kp Xub9CJKno9/XOZUvU6AaB3MYw= Received: from postamt.charite.de (postamt.charite.de [141.42.206.36]) (using TLSv1 with cipher AECDH-AES256-SHA (256/256 bits))
(No client certificate requested) by mail2.charite.de (Postfix) with ESMTPS for ; Sat, 16 Jul 2011 17:01:11 +0200 (CEST) Received: from [141.42.174.57] (bioinf-peter.charite.de [141.42.174.57]) by postamt.charite.de (Postfix) with ESMTP id 3RGYqq2Xppz2r0l for ; Sat, 16 Jul 2011 17:01:11 +0200 (CEST) Message-ID: <4E21AA80.2010901@charite.de> Date: Sat, 16 Jul 2011 17:13:04 +0200 From: Peter Robinson User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.17) Gecko/20110606 Icedove/3.1.10 MIME-Version: 1.0 To: "seqan-dev@lists.fu-berlin.de" Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 8bit X-Originating-IP: 141.42.206.200 X-purgate: clean X-purgate-type: clean X-purgate-ID: 151147::1310828473-00005A17-212CCB62/0-0/0-0 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000001, version=1.2.2 X-Spam-Flag: NO X-Spam-Checker-Version: SpamAssassin 3.0.4 on Botsuana.ZEDAT.FU-Berlin.DE X-Spam-Level: x X-Spam-Status: No, score=1.3 required=5.0 tests=RATWARE_GECKO_BUILD, SPF_HELO_PASS,SPF_PASS,TO_ADDRESS_EQ_REAL Subject: [Seqan-dev] fastq / gzip X-BeenThere: seqan-dev@lists.fu-berlin.de X-Mailman-Version: 2.1.11 Precedence: list Reply-To: SeqAn Development List-Id: SeqAn Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 16 Jul 2011 15:01:18 -0000 Hi, I was wondering if anyone has experience with using SeqAn to read gzipped FASTQ files? It would be nice to have this all within SeqAn, but failing that, does any one have any recommendation for a gzip decompression library, perhaps boost? Is there example code anywhere (sorry if I have missed something obvious) thanks Peter -- PD Dr. med. Peter N. Robinson, MSc. 
Institut für Medizinische Genetik und Humangenetik Charité - Universitätsmedizin Berlin Augustenburger Platz 1 13353 Berlin Germany voice: 49-30-450566042 fax: 49-30-450569915 email: peter.robinson@charite.de http://compbio.charite.de/ http://www.human-phenotype-ontology.org Introduction to Bio-Ontologies: http://www.crcpress.com/product/isbn/9781439836651 From Knut.Reinert@fu-berlin.de Sat Jul 16 17:13:01 2011 Received: from outpost1.zedat.fu-berlin.de ([130.133.4.66]) by list1.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1Qi6Y5-0006SL-Hw>; Sat, 16 Jul 2011 17:12:57 +0200 Received: from relay2.zedat.fu-berlin.de ([130.133.4.80]) by outpost1.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1Qi6Y5-0004ok-G9>; Sat, 16 Jul 2011 17:12:57 +0200 Received: from exchange6.fu-berlin.de ([160.45.9.133]) by relay2.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1Qi6Y5-0004iX-Al>; Sat, 16 Jul 2011 17:12:57 +0200 Received: from exchange6.fu-berlin.de ([160.45.9.133]) by exchange6.fu-berlin.de ([160.45.9.133]) with mapi; Sat, 16 Jul 2011 17:12:57 +0200 From: "Reinert, Knut" To: SeqAn Development Date: Sat, 16 Jul 2011 17:13:43 +0200 Thread-Topic: [Seqan-dev] fastq / gzip Thread-Index: AcxDytgn1grPCGdqSRiTvaeUVtrdcQ== Message-ID: <11336539-5303-4623-B932-A4D8FA940FF1@fu-berlin.de> References: <4E21AA80.2010901@charite.de> In-Reply-To: <4E21AA80.2010901@charite.de> Accept-Language: en-US, de-DE Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: acceptlanguage: en-US, de-DE Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: base64 MIME-Version: 1.0 X-Originating-IP: 160.45.9.133 X-purgate: clean X-purgate-type: clean X-purgate-ID: 151147::1310829177-00005A17-B3F22326/0-0/0-0 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.2 X-Spam-Flag: NO X-Spam-Checker-Version: SpamAssassin 3.0.4 on 
Benin.ZEDAT.FU-Berlin.DE X-Spam-Level: X-Spam-Status: No, score=-2.8 required=5.0 tests=ALL_TRUSTED Subject: Re: [Seqan-dev] fastq / gzip X-BeenThere: seqan-dev@lists.fu-berlin.de X-Mailman-Version: 2.1.11 Precedence: list Reply-To: SeqAn Development List-Id: SeqAn Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 16 Jul 2011 15:13:01 -0000

We have that. It was just added; I will send the example later.

Knut

Sent from my iPad, sorry for being short.

On Jul 16, 2011, at 8:01 AM, "Peter Robinson" <peter.robinson@charite.de> wrote:

> Hi,
> I was wondering if anyone has experience with using SeqAn to read
> gzipped FASTQ files? It would be nice to have this all within SeqAn, but
> failing that, does any one have any recommendation for a gzip
> decompression library, perhaps boost? Is there example code anywhere
> (sorry if I have missed something obvious)
>
> thanks Peter
> --
> PD Dr. med. Peter N. Robinson, MSc.
> Institut für Medizinische Genetik und Humangenetik
> Charité - Universitätsmedizin Berlin
> Augustenburger Platz 1
> 13353 Berlin
> Germany
> voice: 49-30-450566042
> fax:   49-30-450569915
> email: peter.robinson@charite.de
> http://compbio.charite.de/
> http://www.human-phenotype-ontology.org
> Introduction to Bio-Ontologies:
> http://www.crcpress.com/product/isbn/9781439836651
>
> _______________________________________________
> seqan-dev mailing list
> seqan-dev@lists.fu-berlin.de
> https://lists.fu-berlin.de/listinfo/seqan-dev

From
manuel.holtgrewe@fu-berlin.de Sat Jul 16 17:20:05 2011 Received: from outpost1.zedat.fu-berlin.de ([130.133.4.66]) by list1.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1Qi6ex-0006kQ-46>; Sat, 16 Jul 2011 17:20:03 +0200 Received: from inpost2.zedat.fu-berlin.de ([130.133.4.69]) by outpost1.zedat.fu-berlin.de (Exim 4.69) with esmtp (envelope-from ) id <1Qi6ex-0005YH-1m>; Sat, 16 Jul 2011 17:20:03 +0200 Received: from 91-65-212-104-dynip.superkabel.de ([91.65.212.104] helo=[192.168.0.100]) by inpost2.zedat.fu-berlin.de (Exim 4.69) with esmtpsa (envelope-from ) id <1Qi6ew-00062s-Tm>; Sat, 16 Jul 2011 17:20:03 +0200 Message-Id: <19444C2B-B13A-4B04-A5AF-0455CA763F79@fu-berlin.de> From: Manuel Holtgrewe To: SeqAn Development In-Reply-To: <4E21AA80.2010901@charite.de> Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes Content-Transfer-Encoding: 7bit Mime-Version: 1.0 (Apple Message framework v936) Date: Sat, 16 Jul 2011 17:20:02 +0200 References: <4E21AA80.2010901@charite.de> X-Mailer: Apple Mail (2.936) X-Originating-IP: 91.65.212.104 X-purgate: clean X-purgate-type: clean X-purgate-ID: 151147::1310829603-00005A17-FF32A5B1/0-0/0-0 X-Bogosity: Ham, tests=bogofilter, spamicity=0.120306, version=1.2.2 X-Spam-Flag: NO X-Spam-Checker-Version: SpamAssassin 3.0.4 on Algerien.ZEDAT.FU-Berlin.DE X-Spam-Level: X-Spam-Status: No, score=-2.8 required=5.0 tests=ALL_TRUSTED Subject: Re: [Seqan-dev] fastq / gzip X-BeenThere: seqan-dev@lists.fu-berlin.de X-Mailman-Version: 2.1.11 Precedence: list Reply-To: SeqAn Development List-Id: SeqAn Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 16 Jul 2011 15:20:05 -0000 Dear Peter, the new module [1] provides the Stream concept, adaptions for <{io,f}stream>, stream classes as well as the class Stream with specializations for reading and writing GZ [2] and BZ2 streams. 
GZ streams are accessed via zlib, which has to be available. Based on these streams, the RecordReader class (the name could change, but the API should otherwise be stable) and its specializations [3] provide easy access to text files. The stream module also offers lexical cast functionality [4]. We also have functionality to read FASTA and FASTQ files from streams as well as from String > objects using the new Stream interface; you could look at the DDDoc documentation of the functions read2 [6] and readRecord [7], the benchmark_stream demo [5], or the tests of the stream module in the extras repository. The documentation is a bit sparse at the moment. The API might change slightly in the future but should mainly be stable. On the positive side: there are comprehensive tests for the new I/O code, so it should work at least as well as the old code, if not better, and by our benchmarks the performance for reading and writing FASTA/FASTQ is better with the new code. Currently, the new code coexists with the old input/output functionality but will eventually replace the old stream API.
Bests, Manuel [1] http://trac.mi.fu-berlin.de/seqan/wiki/Tutorial/FileIO2 [2] http://www.seqan.de/dddoc/html_devel/SPEC_G_Z+_File+_Stream.html [3] http://www.seqan.de/dddoc/html_devel/CLASS_Record_Reader.html [4] http://www.seqan.de/dddoc/html_devel/FUNCTION.lexical_Cast.html [5] http://trac.mi.fu-berlin.de/seqan/browser/trunk/seqan/extras/demos/benchmark_stream.cpp [6] http://www.seqan.de/dddoc/html_devel/FUNCTION.read2.html [7] http://www.seqan.de/dddoc/html_devel/FUNCTION.read_Record.html From johdro@mpi-inf.mpg.de Thu Jul 21 10:59:03 2011 Received: from relay1.zedat.fu-berlin.de ([130.133.4.67]) by list1.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1Qjp5y-0008Fq-1p>; Thu, 21 Jul 2011 10:59:02 +0200 Received: from hera.mpi-sb.mpg.de ([139.19.1.49]) by relay1.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1Qjp5x-0003iO-UZ>; Thu, 21 Jul 2011 10:59:02 +0200 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=mpi-sb.mpg.de; s=mail200803; h=From:To:Subject:Date:References: In-Reply-To:MIME-Version:Content-Type:Content-Transfer-Encoding: Message-Id; bh=tDL218e5nlgGyvkRm5+1tM/feHKAT85thlfkA9sguts=; b=v riC34sdRbEUKQnbzs+ZsbP4tSNvv8knIepgUi+CqDrCeRE5CMgFKA+UyDX7TOeYo rtv1j+zGXev1jOtsF9S8cXJLk26SY8pAs9dFvWaGcZNL3m34vyTFJFQ4+fBVfc9d tyWwZkiv7165HQ6t+Gt1FSClIBRH7gKfTQhsqIjAvQ= Received: from zak.mpi-klsb.mpg.de ([139.19.1.27]:38041) by hera.mpi-sb.mpg.de (envelope-from ) with esmtp (Exim 4.69) id 1Qjp5s-0008Q6-8Y for seqan-dev@lists.fu-berlin.de; Thu, 21 Jul 2011 10:59:01 +0200 Received: from coropuna.cs.uni-duesseldorf.de ([134.99.112.114]:46388 helo=linux-eu7n.site) by zak.mpi-klsb.mpg.de (envelope-from ) with esmtpsa (TLS1.0:DHE_RSA_AES_256_CBC_SHA1:32) (Exim 4.69) id 1Qjp5r-000886-TT for seqan-dev@lists.fu-berlin.de; Thu, 21 Jul 2011 10:58:55 +0200 From: Johannes =?iso-8859-1?q?Dr=F6ge?= Organization: 
Universität Düsseldorf/Max-Planck-Institut für Informatik Saarbrücken To: SeqAn Development Date: Thu, 21 Jul 2011 10:58:51 +0200 User-Agent: KMail/1.13.6 (Linux/2.6.37.6-0.5-desktop; KDE/4.6.0; x86_64; ; ) References: <201107061628.12413.johdro@mpi-inf.mpg.de> <201107081605.30762.johdro@mpi-inf.mpg.de> In-Reply-To: MIME-Version: 1.0 Content-Type: Text/Plain; charset="iso-8859-1" Content-Transfer-Encoding: quoted-printable Message-Id: <201107211058.51290.johdro@mpi-inf.mpg.de> X-Originating-IP: 139.19.1.49 X-purgate: clean X-purgate-type: clean X-purgate-ID: 151147::1311238742-00005A17-1D368470/0-0/0-0 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.2 X-Spam-Flag: NO X-Spam-Checker-Version: SpamAssassin 3.0.4 on Gabun.ZEDAT.FU-Berlin.DE X-Spam-Level: X-Spam-Status: No, score=0.0 required=5.0 tests=FORGED_RCVD_HELO,SPF_PASS Subject: Re: [Seqan-dev] Random access of large FASTA file X-BeenThere: seqan-dev@lists.fu-berlin.de X-Mailman-Version: 2.1.11 Precedence: list Reply-To: SeqAn Development List-Id: SeqAn Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 21 Jul 2011 08:59:03 -0000

Hello David,

since I got the concepts now, here are my plans and suggestions... Maybe you can comment on them.

Since accessing a regular unprocessed FASTA file by means of a StringSet via MMap is not possible, one has to use a split() (first pass) and assignSeq() (second pass). As you suggested, it might be better to process the file and access it directly via a StringSet >, Owner > >. This implies that one has to load and import these sequences, then save them into a file and open it again when one needs them. Is there any documentation on this (or some implementation)? I can imagine this can be done simply by giving it a filename through the constructor or so...

Also I noticed that when I use an in-memory representation of a stringset it will take quite long to open and import it. This is mainly due to the first pass split(), which is logically not really necessary as one could iterate through the file without previously determining all split points. To avoid this, one can use the above strategy of saving a processed Owner string as well, I guess: process the file, save it again, and load it as raw by copying it completely into memory and accessing it through a StringSet. The considerably slower assignSeq() to packed strings would probably also profit from this concept.

By the way, is there any experience with the (reading) access speed of packed strings compared to normal ones, and how much space they save on average?

I noticed that all of this boils down to generic StringSet dump() and load() functions...

Is my thinking correct? If yes, I would probably go for the second, since I need an (in-memory, optionally MMap) frequent random access object of a large FASTA file that minimizes loading time.

Regards,
Johannes

On Friday, 8 July 2011 19:35:20, Weese, David wrote:
>
> On 08.07.2011 at 16:05, Johannes Dröge wrote:
>
> > Sorry, I still don't get it.
> > How can [ MultiFastaFile ==> Dna5String ==> StringSet >, Owner > > ] work, if it copies the value of the sequence?
> > Doesn't assignSeq() copy the value into the Dna5String seq?
>
> assignSeq *extracts* the sequence information from a block that may contain a header, a sequence interspersed by newlines, quality values, etc.
> If you want to get sequence substrings of an unprocessed Fasta file, they may contain whitespace.
>
> >
> > What happens when I use appendValue to add seq to the StringSet, where does it actually reside (it should still be in the MultiFasta file).
>
> As assignSeq(seq, ...) extracts the sequence character-by-character there is no association between seq and the Fasta file.
>
> >
> > I need to access the MultiFastaFile (on the hard disk) as a regular StringSet to read its contents on demand, not copy its sequences into a new memory-mapped file.
>
> Then you need to keep the split MultiSeqFile and extract the sequences on demand with assignSeq.
> If you access the sequences very often I would recommend to fill a StringSet >, Owner > > (see my last mail) which also resides on your hard disk but can be accessed without assignSeq.
>
> >
> > Johannes
> >
> >
> > On Thursday, 7 July 2011 20:28:00, Weese, David wrote:
> >> Hi,
> >>
> >> follow the howto on http://trac.mi.fu-berlin.de/seqan/wiki/HowTo/EfficientImportOfMillionsOfSequences and simply change:
> >>
> >> StringSet > seqs;
> >>
> >> into:
> >>
> >> StringSet >, Owner > > seqs;
> >>
> >> That should do what you want.
> >>
> >> Regards,
> >> David
> >>
> >>
>
>
> _______________________________________________
> seqan-dev mailing list
> seqan-dev@lists.fu-berlin.de
> https://lists.fu-berlin.de/listinfo/seqan-dev
>

From weese@campus.fu-berlin.de Thu Jul 21 15:02:16 2011 Received: from outpost1.zedat.fu-berlin.de ([130.133.4.66]) by list1.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1QjstL-00012P-OS>; Thu, 21 Jul 2011 15:02:15 +0200 Received: from relay2.zedat.fu-berlin.de ([130.133.4.80]) by outpost1.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1QjstL-0000C0-K5>; Thu, 21 Jul 2011 15:02:15 +0200 Received: from exchange6.fu-berlin.de ([160.45.9.133]) by relay2.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1QjstL-0001Wo-Bx>; Thu, 21 Jul 2011 15:02:15 +0200 Received: from exchange6.fu-berlin.de ([160.45.9.133]) by exchange6.fu-berlin.de ([160.45.9.133]) with mapi; Thu, 21 Jul 2011 15:02:15 +0200 From: "Weese, David" To: SeqAn Development Date: Thu, 21 Jul 2011 15:02:14
+0200 Thread-Topic: [Seqan-dev] Random access of large FASTA file Thread-Index: AcxHpmnSlwZzJ7+4Tb2EwLgjbd/3nw== Message-ID: <333F8626-A854-498D-B573-F6F8B8B819E7@fu-berlin.de> References: <201107061628.12413.johdro@mpi-inf.mpg.de> <201107081605.30762.johdro@mpi-inf.mpg.de> <201107211058.51290.johdro@mpi-inf.mpg.de> In-Reply-To: <201107211058.51290.johdro@mpi-inf.mpg.de> Accept-Language: de-DE Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: acceptlanguage: de-DE Content-Type: multipart/alternative; boundary="_000_333F8626A854498DB573F6F8B8B819E7fuberlinde_" MIME-Version: 1.0 X-Originating-IP: 160.45.9.133 X-ZEDAT-Hint: A X-purgate: clean X-purgate-type: clean X-purgate-ID: 151147::1311253335-00005A17-42FA5A65/0-0/0-0 X-Bogosity: Ham, tests=bogofilter, spamicity=0.068897, version=1.2.2 X-Spam-Flag: NO X-Spam-Checker-Version: SpamAssassin 3.0.4 on Botsuana.ZEDAT.FU-Berlin.DE X-Spam-Level: X-Spam-Status: No, score=-2.8 required=5.0 tests=ALL_TRUSTED,HTML_30_40, HTML_MESSAGE Subject: Re: [Seqan-dev] Random access of large FASTA file X-BeenThere: seqan-dev@lists.fu-berlin.de X-Mailman-Version: 2.1.11 Precedence: list Reply-To: SeqAn Development List-Id: SeqAn Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 21 Jul 2011 13:02:17 -0000 --_000_333F8626A854498DB573F6F8B8B819E7fuberlinde_ Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: quoted-printable Am 21.07.2011 um 10:58 schrieb Johannes Dr=F6ge: Hello David, Hi, since I got the concepts now, here are my plans and suggestions... Maybe yo= u can comment on them. Since accessing a regular unprocessed FASTA file by means of a StringSet vi= a MMap is not possible. One has to use a split() (first pass) and assignSeq= () (second pass). As you suggested, it might be better to process the file = and access it directly via a StringSet >, Owner > >. 
This implies that one has to load and import these sequences, then save them into a file and open it again when one needs them. Is there any documentation on this (or some implementation)? I can imagine this can be done simply by giving it a filename through the constructor or so...

Yes, you could do it this way. This requires either:
  1) a persistent StringSet which you construct once and reopen every time
  2) a StringSet which uses a temporary file to store the concatenated sequences (e.g. StringSet<String<Dna5Q, MMap<> >, Owner<ConcatDirect<> > >) that you generate from the fasta file every time you start your application and that is deleted automatically in the StringSet destructor

2) is the easiest way, but the conversion is certainly more time consuming

1) requires making both members (not only the concatenated sequence) of the ConcatDirect StringSet persistent strings. The second member is limits and stores the sequence breakpoints in concat. By default it is an Alloc String. In your application you can specialize:

template <>
struct StringSetLimits<TYourStringSet>
{
    typedef typename Size< TYourStringSet >::Type TSize_;
    typedef String<TSize_, MMap<> > Type;
};

to use a MMap<> String instead. Before doing anything with your StringSet simply call:

TYourStringSet stringSet;
open(stringSet.concat, "yourfile.concat"); // assigns a file to the mmap string
open(stringSet.limits, "yourfile.limits"); // if not called, a temporary file is created = non-persistent

// append your sequences (when run for the first time)
// or
// use the sequences (later)
//
// save() is not required as the string is always in sync with the file on disk

Cheers,
David

Also I noticed that when I use an in-memory representation of a stringset it takes quite long to open and import it. This is mainly due to the first pass split(), which is logically not really necessary as one could iterate through the file without previously determining all split points.
To av= oid this, one can as well use the above strategy of saving a processed Owne= r string as well, I guess. Process the file, save it again an= d load it as raw by copying it completely into memory and access it through= a StringSet. The considerably slower assignSeq() to packed strings would p= robably also profit from this concept. By the way, is there any experience about the (reading) access speed to pac= ked strings compared to normal ones and how much space they save on average= ? I noticed that all of this boils down to generic StringSet dump() and load(= ) functions... Is my thinking correct? If yes, I would probably go for the second since I = need an (in-memory, optionally MMap) frequent random access object of a lar= ge FASTA file that minimizes loading time. Gru=DF Johannes -- David Weese weese@inf.fu-berlin.de Freie Universit=E4t Berlin http://www.inf.fu-berlin.de/ Institut f=FCr Informatik Phone: +49 30 838 75246 Takustra=DFe 9 Algorithmic Bioinformatics 14195 Berlin Room 021 --_000_333F8626A854498DB573F6F8B8B819E7fuberlinde_ Content-Type: text/html; charset="iso-8859-1" Content-Transfer-Encoding: quoted-printable

On 21.07.2011 at 10:58, Johannes Dröge wrote:

> Hello David,

Hi,

> since I got the concepts now, here are my plans and suggestions... Maybe you can comment on them.
>
> Since accessing a regular unprocessed FASTA file by means of a StringSet via MMap is not possible, one has to use a split() (first pass) and assignSeq() (second pass). As you suggested, it might be better to process the file and access it directly via a StringSet<String<Dna5Q, MMap<> >, Owner<ConcatDirect<> > >. This implies that one has to load and import these sequences, then save them into a file and open it again when one needs them. Is there any documentation on this (or some implementation)? I can imagine this can be done simply by giving it a filename through the constructor or so...

Yes, you could do it this way. This requires either:
  1) a persistent StringSet which you construct once and reopen every time
  2) a StringSet which uses a temporary file to store the concatenated sequences (e.g. StringSet<String<Dna5Q, MMap<> >, Owner<ConcatDirect<> > >) that you generate from the fasta file every time you start your application and that is deleted automatically in the StringSet destructor

2) is the easiest way, but the conversion is certainly more time consuming

1) requires making both members (not only the concatenated sequence) of the ConcatDirect StringSet persistent strings. The second member is limits and stores the sequence breakpoints in concat. By default it is an Alloc String. In your application you can specialize:

template <>
struct StringSetLimits<TYourStringSet>
{
    typedef typename Size< TYourStringSet >::Type TSize_;
    typedef String<TSize_, MMap<> > Type;
};

to use a MMap<> String instead. Before doing anything with your StringSet simply call:

TYourStringSet stringSet;
open(stringSet.concat, "yourfile.concat"); // assigns a file to the mmap string
open(stringSet.limits, "yourfile.limits"); // if not called, a temporary file is created = non-persistent

// append your sequences (when run for the first time)
// or
// use the sequences (later)
//
// save() is not required as the string is always in sync with the file on disk

Cheers,
David

> Also I noticed that when I use an in-memory representation of a stringset it takes quite long to open and import it. This is mainly due to the first pass split(), which is logically not really necessary as one could iterate through the file without previously determining all split points. To avoid this, one can as well use the above strategy of saving a processed Owner<ConcatDirect> string, I guess. Process the file, save it again and load it as raw by copying it completely into memory and access it through a StringSet. The considerably slower assignSeq() to packed strings would probably also profit from this concept.
>
> By the way, is there any experience about the (reading) access speed to packed strings compared to normal ones and how much space they save on average?
>
> I noticed that all of this boils down to generic StringSet dump() and load() functions...
>
> Is my thinking correct? If yes, I would probably go for the second since I need an (in-memory, optionally MMap) frequent random access object of a large FASTA file that minimizes loading time.
>
> Regards, Johannes

--
David Weese                      weese@inf.fu-berlin.de
Freie Universität Berlin         http://www.inf.fu-berlin.de/
Institut für Informatik          Phone: +49 30 838 75246
Takustraße 9                     Algorithmic Bioinformatics
14195 Berlin                     Room 021

= --_000_333F8626A854498DB573F6F8B8B819E7fuberlinde_-- From johdro@mpi-inf.mpg.de Thu Jul 21 17:27:01 2011 Received: from relay1.zedat.fu-berlin.de ([130.133.4.67]) by list1.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1Qjv9Q-0006uP-50>; Thu, 21 Jul 2011 17:27:00 +0200 Received: from hera.mpi-sb.mpg.de ([139.19.1.49]) by relay1.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1Qjv9Q-0002Ki-1J>; Thu, 21 Jul 2011 17:27:00 +0200 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=mpi-sb.mpg.de; s=mail200803; h=From:To:Subject:Date:References: In-Reply-To:MIME-Version:Content-Type:Content-Transfer-Encoding: Message-Id; bh=aui3MzhIdPsE3J8bpHVLnxj4N0PCRHAhhd/ljdbhV8c=; b=s VxXGQseCAYWdOT7Y0Ejir9h6SBm+aqD1EV9m4VmS7lOhQnFSn0myCnPoImxo1tke X4ihdmBG6sTonYMNU8V67DTt4O85zyDSEz5riWCrWE5ECpUySqBFz1cXhbxEBO69 rvh0RAtoo4MlxsqFcdlsQjTNI9qo/4hL4lEPsn4Nz8= Received: from infao0710.mpi-klsb.mpg.de ([139.19.1.27]:46446 helo=zak.mpi-klsb.mpg.de) by hera.mpi-sb.mpg.de (envelope-from ) with esmtp (Exim 4.69) id 1Qjv9K-0005po-Kc for seqan-dev@lists.fu-berlin.de; Thu, 21 Jul 2011 17:26:59 +0200 Received: from coropuna.cs.uni-duesseldorf.de ([134.99.112.114]:55810 helo=linux-eu7n.site) by zak.mpi-klsb.mpg.de (envelope-from ) with esmtpsa (TLS1.0:DHE_RSA_AES_256_CBC_SHA1:32) (Exim 4.69) id 1Qjv9K-0002i8-9D for seqan-dev@lists.fu-berlin.de; Thu, 21 Jul 2011 17:26:54 +0200 From: Johannes =?iso-8859-15?q?Dr=F6ge?= Organization: =?iso-8859-1?q?Universit=E4t_D=FCsseldorf/Max-Planck-Institut_f=FCr?= =?iso-8859-1?q?_Informatik?= =?iso-8859-1?q?_Saarbr=FCcken?= To: SeqAn Development Date: Thu, 21 Jul 2011 17:26:49 +0200 User-Agent: KMail/1.13.6 (Linux/2.6.37.6-0.5-desktop; KDE/4.6.0; x86_64; ; ) References: <201107061628.12413.johdro@mpi-inf.mpg.de> <201107211058.51290.johdro@mpi-inf.mpg.de> <333F8626-A854-498D-B573-F6F8B8B819E7@fu-berlin.de> In-Reply-To: 
<333F8626-A854-498D-B573-F6F8B8B819E7@fu-berlin.de> MIME-Version: 1.0 Content-Type: Text/Plain; charset="iso-8859-15" Content-Transfer-Encoding: quoted-printable Message-Id: <201107211726.50032.johdro@mpi-inf.mpg.de> X-Originating-IP: 139.19.1.49 X-purgate: clean X-purgate-type: clean X-purgate-ID: 151147::1311262020-00005A17-C68D4E73/0-0/0-0 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.2 X-Spam-Flag: NO X-Spam-Checker-Version: SpamAssassin 3.0.4 on Burundi.ZEDAT.FU-Berlin.DE X-Spam-Level: X-Spam-Status: No, score=0.0 required=5.0 tests=FORGED_RCVD_HELO,SPF_PASS Subject: Re: [Seqan-dev] Random access of large FASTA file X-BeenThere: seqan-dev@lists.fu-berlin.de X-Mailman-Version: 2.1.11 Precedence: list Reply-To: SeqAn Development List-Id: SeqAn Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 21 Jul 2011 15:27:01 -0000 Thanks for you nice explanation. So open() is the magic function do apply t= o a ConcatDirect StringSet. Sorry for bugging you further, before I start w= ith the implementation I have some final questions... My primary target is the in-memory sequence storage, as I wrote, so for thi= s I do not need to specialize the limits object. I suppose open() works wit= h regular Alloc strings as well? Where are temporary MMap files usually stored? Can I somehow use read-only file access with persistent MMap files? Gru=DF Johannes Am Donnerstag, 21. Juli 2011 15:02:14 schrieb Weese, David: >=20 > Am 21.07.2011 um 10:58 schrieb Johannes Dr=F6ge: >=20 > Hello David, >=20 > Hi, >=20 > since I got the concepts now, here are my plans and suggestions... Maybe = you can comment on them. >=20 > Since accessing a regular unprocessed FASTA file by means of a StringSet = via MMap is not possible. One has to use a split() (first pass) and assignS= eq() (second pass). As you suggested, it might be better to process the fil= e and access it directly via a StringSet >, Owner > >. 
This implies that one has to load and import these sequence= s, then save them into a file and open it again, when one needs them. Is th= ere any documentation on this (or in some implemenation)? I can image this = can be done simply by giving it a filename throught the constructor or so... >=20 > Yes, you could do it this way. This requires either: > 1) a persistent StringSet which you construct once and reopen everytime > 2) a StringSet which uses a temporary file to store the concatenated se= quences (e.g. StringSet >, Owner > >) t= hat you generate from the fasta file everytime you start your application a= nd that is deleted automatically in the StringSet destructor >=20 > 2) is the easiest way, but the conversion is certainly more time consuming >=20 > 1) requires to make both members (not only the concatenated sequence) of = the ConcatDirect persistent strings. The second member is limits and stores= the sequence breakpoints in concat. By default it is an Alloc String. In y= our application you can specialize: >=20 > template <> > struct StringSetLimits > { > typedef typename Size< TYourStringSet >::Type TSize_; > typedef String > Type; > }; >=20 > to use a MMap<> String instead. Before doing anything with your StringSet= simply call: >=20 > TYourStringSet stringSet; > open(stringSet.concat, "yourfile.concat"); // assigns a file to the mmap = string > open(stringSet.limits, "yourfile.limits"); // if not called, a temporary = file is created =3D non-persistent >=20 > // append your sequences (when runned for the first time) > // or > // use the sequences (later) > // > // save() is not required as the string is always in sync with the file o= n disk >=20 > Cheers, > David >=20 >=20 > Also I noticed, that when I use an in-memory representation of a stringse= t it will take quite long to open and import it. 
This is mainly due to the = first pass split(), which is logically not really necessary as one could it= erate through the file without previously determining all split points. To = avoid this, one can as well use the above strategy of saving a processed Ow= ner string as well, I guess. Process the file, save it again = and load it as raw by copying it completely into memory and access it throu= gh a StringSet. The considerably slower assignSeq() to packed strings would= probably also profit from this concept. >=20 > By the way, is there any experience about the (reading) access speed to p= acked strings compared to normal ones and how much space they save on avera= ge? >=20 > I noticed that all of this boils down to generic StringSet dump() and loa= d() functions... >=20 > Is my thinking correct? If yes, I would probably go for the second since = I need an (in-memory, optionally MMap) frequent random access object of a l= arge FASTA file that minimizes loading time. >=20 > Gru=DF Johannes >=20 >=20 >=20 >=20 > -- > David Weese weese@inf.fu-berlin.de > Freie Universit=E4t Berlin http://www.inf.fu-berlin.de/ > Institut f=FCr Informatik Phone: +49 30 838 75246 > Takustra=DFe 9 Algorithmic Bioinformatics > 14195 Berlin Room 021 >=20 >=20 From weese@campus.fu-berlin.de Fri Jul 22 09:22:14 2011 Received: from outpost1.zedat.fu-berlin.de ([130.133.4.66]) by list1.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1QkA3o-0000II-IS>; Fri, 22 Jul 2011 09:22:12 +0200 Received: from relay2.zedat.fu-berlin.de ([130.133.4.80]) by outpost1.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1QkA3o-0005AV-GP>; Fri, 22 Jul 2011 09:22:12 +0200 Received: from exchange6.fu-berlin.de ([160.45.9.133]) by relay2.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1QkA3o-00049g-B4>; Fri, 22 Jul 2011 09:22:12 +0200 Received: from exchange6.fu-berlin.de 
([160.45.9.133]) by exchange6.fu-berlin.de ([160.45.9.133]) with mapi; Fri, 22 Jul 2011 09:22:12 +0200 From: "Weese, David" To: SeqAn Development Date: Fri, 22 Jul 2011 09:22:11 +0200 Thread-Topic: [Seqan-dev] Random access of large FASTA file Thread-Index: AcxIQBLbAL0VC4c+QLm93/AO8fjSKw== Message-ID: References: <201107061628.12413.johdro@mpi-inf.mpg.de> <201107211058.51290.johdro@mpi-inf.mpg.de> <333F8626-A854-498D-B573-F6F8B8B819E7@fu-berlin.de> <201107211726.50032.johdro@mpi-inf.mpg.de> In-Reply-To: <201107211726.50032.johdro@mpi-inf.mpg.de> Accept-Language: de-DE Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: acceptlanguage: de-DE Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 X-Originating-IP: 160.45.9.133 X-purgate: clean X-purgate-type: clean X-purgate-ID: 151147::1311319332-00005A17-520161FC/0-0/0-0 X-Bogosity: Ham, tests=bogofilter, spamicity=0.196366, version=1.2.2 X-Spam-Flag: NO X-Spam-Checker-Version: SpamAssassin 3.0.4 on Benin.ZEDAT.FU-Berlin.DE X-Spam-Level: X-Spam-Status: No, score=-2.8 required=5.0 tests=ALL_TRUSTED Subject: Re: [Seqan-dev] Random access of large FASTA file X-BeenThere: seqan-dev@lists.fu-berlin.de X-Mailman-Version: 2.1.11 Precedence: list Reply-To: SeqAn Development List-Id: SeqAn Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 22 Jul 2011 07:22:14 -0000 Am 21.07.2011 um 17:26 schrieb Johannes Dr=F6ge: > Thanks for you nice explanation. So open() is the magic function do apply= to a ConcatDirect StringSet. Sorry for bugging you further, before I start= with the implementation I have some final questions... >=20 Actually, you don't call open with the StringSet but both of its members li= mits and concat. > My primary target is the in-memory sequence storage, as I wrote, so for t= his I do not need to specialize the limits object. 
I suppose open() works w= ith regular Alloc strings as well? >=20 Yes, but before need to call save() before destructing the String to synchr= onize changes in the string with the file on disk. > Where are temporary MMap files usually stored? In the system-wide temporary folder, unless you defined a different folder = in the environment variable TMPDIR. >=20 > Can I somehow use read-only file access with persistent MMap files? Yes, see: http://www.seqan.de/dddoc/html_devel/FUNCTION.open.html >=20 > Gru=DF Johannes >=20 Gru=DF, David= From jer15@hermes.cam.ac.uk Mon Jul 25 11:49:21 2011 Received: from relay1.zedat.fu-berlin.de ([130.133.4.67]) by list1.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1QlHmq-00043e-FG>; Mon, 25 Jul 2011 11:49:20 +0200 Received: from ppsw-41.csi.cam.ac.uk ([131.111.8.141]) by relay1.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1QlHmq-0008O8-C1>; Mon, 25 Jul 2011 11:49:20 +0200 X-Cam-AntiVirus: no malware found X-Cam-SpamDetails: not scanned X-Cam-ScannerInfo: http://www.cam.ac.uk/cs/email/scanner/ Received: from cpc6-dals15-2-0-cust115.hari.cable.virginmedia.com ([82.35.196.116]:45343 helo=[192.168.1.4]) by ppsw-41.csi.cam.ac.uk (smtp.hermes.cam.ac.uk [131.111.8.156]:587) with esmtpsa (PLAIN:jer15) (TLSv1:DHE-RSA-CAMELLIA256-SHA:256) id 1QlHmq-00068b-Q4 (Exim 4.72) for seqan-dev@lists.fu-berlin.de (return-path ); Mon, 25 Jul 2011 10:49:20 +0100 Message-ID: <4E2D3C1F.5010801@mail.cryst.bbk.ac.uk> Date: Mon, 25 Jul 2011 10:49:19 +0100 From: John Reid User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.18) Gecko/20110617 Lightning/1.0b2 Thunderbird/3.1.11 MIME-Version: 1.0 To: SeqAn Development Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: "J.E. 
Reid" X-Originating-IP: 131.111.8.141 X-purgate: clean X-purgate-type: clean X-purgate-ID: 151147::1311587360-00005A17-1A6BB070/0-0/0-0 X-Bogosity: Ham, tests=bogofilter, spamicity=0.051863, version=1.2.2 X-Spam-Flag: NO X-Spam-Checker-Version: SpamAssassin 3.0.4 on Burundi.ZEDAT.FU-Berlin.DE X-Spam-Level: X-Spam-Status: No, score=0.8 required=5.0 tests=RATWARE_GECKO_BUILD Subject: [Seqan-dev] Time complexity of posGlobalize X-BeenThere: seqan-dev@lists.fu-berlin.de X-Mailman-Version: 2.1.11 Precedence: list Reply-To: SeqAn Development List-Id: SeqAn Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 25 Jul 2011 09:49:21 -0000 Suppose I have an iterator over a suffix array. What is the time complexity of calling posGlobalise() on an occurrence the iterator? I'm asking because I have the option to store indexes to occurrences as global positions or as local positions (pairs of sequence ids, position in sequences). I'm guessing the latter is more time efficient but would take more storage. Thanks, John. 
From weese@campus.fu-berlin.de Tue Jul 26 08:17:03 2011 Received: from outpost1.zedat.fu-berlin.de ([130.133.4.66]) by list1.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1Qlaww-0007p0-OP>; Tue, 26 Jul 2011 08:17:02 +0200 Received: from relay2.zedat.fu-berlin.de ([130.133.4.80]) by outpost1.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1Qlaww-0004iv-MT>; Tue, 26 Jul 2011 08:17:02 +0200 Received: from exchange6.fu-berlin.de ([160.45.9.133]) by relay2.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1Qlaww-00011e-HA>; Tue, 26 Jul 2011 08:17:02 +0200 Received: from exchange6.fu-berlin.de ([160.45.9.133]) by exchange6.fu-berlin.de ([160.45.9.133]) with mapi; Tue, 26 Jul 2011 08:17:02 +0200 From: "Weese, David" To: SeqAn Development Date: Tue, 26 Jul 2011 08:17:22 +0200 Thread-Topic: [Seqan-dev] Time complexity of posGlobalize Thread-Index: AcxLW6IWLG7RoUdDS52WHTtUPN3JqA== Message-ID: <4D8A95A9-E0C5-449E-9C96-3F169ABB71FE@campus.fu-berlin.de> References: <4E2D3C1F.5010801@mail.cryst.bbk.ac.uk> In-Reply-To: <4E2D3C1F.5010801@mail.cryst.bbk.ac.uk> Accept-Language: de-DE Content-Language: de-DE X-MS-Has-Attach: X-MS-TNEF-Correlator: acceptlanguage: de-DE Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 X-Originating-IP: 160.45.9.133 X-purgate: clean X-purgate-type: clean X-purgate-ID: 151147::1311661022-00005A17-F963AFD0/0-0/0-0 X-Bogosity: Ham, tests=bogofilter, spamicity=0.012611, version=1.2.2 X-Spam-Flag: NO X-Spam-Checker-Version: SpamAssassin 3.0.4 on Gabun.ZEDAT.FU-Berlin.DE X-Spam-Level: X-Spam-Status: No, score=-2.8 required=5.0 tests=ALL_TRUSTED Subject: Re: [Seqan-dev] Time complexity of posGlobalize X-BeenThere: seqan-dev@lists.fu-berlin.de X-Mailman-Version: 2.1.11 Precedence: list Reply-To: SeqAn Development List-Id: SeqAn Development List-Unsubscribe: , 
List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 26 Jul 2011 06:17:03 -0000 Hi John, posGlobalize needs in either case constant time. posLocalize takes constant= time only for pairs. Otherwise (for global positions =3D integers) it take= s O(m log m) where m is the number of sequences. And yes, global positions help to save memory. But you can configure the ty= pe used in the position tables by overloading SAValue for your index. With = Packed Pairs you specify the number of bits used for seqno and seqofs. HTH David Sent from my iPod. Sorry for being short. Am 25.07.2011 um 11:49 schrieb "John Reid" : > Suppose I have an iterator over a suffix array. What is the time=20 > complexity of calling posGlobalise() on an occurrence the iterator? I'm=20 > asking because I have the option to store indexes to occurrences as=20 > global positions or as local positions (pairs of sequence ids, position=20 > in sequences). I'm guessing the latter is more time efficient but would=20 > take more storage. >=20 > Thanks, > John. 
>=20 > _______________________________________________ > seqan-dev mailing list > seqan-dev@lists.fu-berlin.de > https://lists.fu-berlin.de/listinfo/seqan-dev From weese@campus.fu-berlin.de Tue Jul 26 11:03:26 2011 Received: from outpost1.zedat.fu-berlin.de ([130.133.4.66]) by list1.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1QldXx-0005wU-Rd>; Tue, 26 Jul 2011 11:03:25 +0200 Received: from relay2.zedat.fu-berlin.de ([130.133.4.80]) by outpost1.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1QldXx-00050L-PZ>; Tue, 26 Jul 2011 11:03:25 +0200 Received: from exchange6.fu-berlin.de ([160.45.9.133]) by relay2.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1QldXx-0006Ra-Hx>; Tue, 26 Jul 2011 11:03:25 +0200 Received: from exchange6.fu-berlin.de ([160.45.9.133]) by exchange6.fu-berlin.de ([160.45.9.133]) with mapi; Tue, 26 Jul 2011 11:03:26 +0200 From: "Weese, David" To: SeqAn Development Date: Tue, 26 Jul 2011 11:03:24 +0200 Thread-Topic: [Seqan-dev] Time complexity of posGlobalize Thread-Index: AcxLcuB4YFW2bKmQRT+Goq3WKpqizQ== Message-ID: <2BA23876-03F0-4DE9-9D56-511A4C05B6AF@fu-berlin.de> References: <4E2D3C1F.5010801@mail.cryst.bbk.ac.uk> <4D8A95A9-E0C5-449E-9C96-3F169ABB71FE@campus.fu-berlin.de> In-Reply-To: <4D8A95A9-E0C5-449E-9C96-3F169ABB71FE@campus.fu-berlin.de> Accept-Language: de-DE Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: acceptlanguage: de-DE Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 X-Originating-IP: 160.45.9.133 X-purgate: clean X-purgate-type: clean X-purgate-ID: 151147::1311671005-00005A17-47512B6C/0-0/0-0 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000152, version=1.2.2 X-Spam-Flag: NO X-Spam-Checker-Version: SpamAssassin 3.0.4 on Benin.ZEDAT.FU-Berlin.DE X-Spam-Level: X-Spam-Status: No, score=-2.8 required=5.0 
tests=ALL_TRUSTED Subject: Re: [Seqan-dev] Time complexity of posGlobalize X-BeenThere: seqan-dev@lists.fu-berlin.de X-Mailman-Version: 2.1.11 Precedence: list Reply-To: SeqAn Development List-Id: SeqAn Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 26 Jul 2011 09:03:26 -0000 Sorry, not O(m log m) but O(log m) - its a binary search. -- David Weese weese@inf.fu-berlin.de Freie Universit=E4t Berlin http://www.inf.fu-berlin.de/ Institut f=FCr Informatik Phone: +49 30 838 75246 Takustra=DFe 9 Algorithmic Bioinformatics 14195 Berlin Room 021=20 Am 26.07.2011 um 08:17 schrieb Weese, David: > Hi John, >=20 > posGlobalize needs in either case constant time. posLocalize takes consta= nt time only for pairs. Otherwise (for global positions =3D integers) it ta= kes O(m log m) where m is the number of sequences. > And yes, global positions help to save memory. But you can configure the = type used in the position tables by overloading SAValue for your index. Wit= h Packed Pairs you specify the number of bits used for seqno and seqofs. >=20 > HTH > David >=20 > Sent from my iPod. Sorry for being short. >=20 > Am 25.07.2011 um 11:49 schrieb "John Reid" : >=20 >> Suppose I have an iterator over a suffix array. What is the time=20 >> complexity of calling posGlobalise() on an occurrence the iterator? I'm= =20 >> asking because I have the option to store indexes to occurrences as=20 >> global positions or as local positions (pairs of sequence ids, position= =20 >> in sequences). I'm guessing the latter is more time efficient but would= =20 >> take more storage. >>=20 >> Thanks, >> John. 
>>=20 >> _______________________________________________ >> seqan-dev mailing list >> seqan-dev@lists.fu-berlin.de >> https://lists.fu-berlin.de/listinfo/seqan-dev >=20 > _______________________________________________ > seqan-dev mailing list > seqan-dev@lists.fu-berlin.de > https://lists.fu-berlin.de/listinfo/seqan-dev From jer15@hermes.cam.ac.uk Wed Jul 27 13:31:50 2011 Received: from relay1.zedat.fu-berlin.de ([130.133.4.67]) by list1.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1Qm2L7-0008BD-OB>; Wed, 27 Jul 2011 13:31:49 +0200 Received: from ppsw-51.csi.cam.ac.uk ([131.111.8.151]) by relay1.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1Qm2L7-0002CV-L5>; Wed, 27 Jul 2011 13:31:49 +0200 X-Cam-AntiVirus: no malware found X-Cam-SpamDetails: not scanned X-Cam-ScannerInfo: http://www.cam.ac.uk/cs/email/scanner/ Received: from cpc6-dals15-2-0-cust115.hari.cable.virginmedia.com ([82.35.196.116]:44507 helo=[192.168.1.4]) by ppsw-51.csi.cam.ac.uk (smtp.hermes.cam.ac.uk [131.111.8.158]:587) with esmtpsa (PLAIN:jer15) (TLSv1:DHE-RSA-CAMELLIA256-SHA:256) id 1Qm2L7-00070f-Xc (Exim 4.72) for seqan-dev@lists.fu-berlin.de (return-path ); Wed, 27 Jul 2011 12:31:49 +0100 Message-ID: <4E2FF724.6010502@mail.cryst.bbk.ac.uk> Date: Wed, 27 Jul 2011 12:31:48 +0100 From: John Reid User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.18) Gecko/20110617 Lightning/1.0b2 Thunderbird/3.1.11 MIME-Version: 1.0 To: SeqAn Development References: <4E2D3C1F.5010801@mail.cryst.bbk.ac.uk> <4D8A95A9-E0C5-449E-9C96-3F169ABB71FE@campus.fu-berlin.de> In-Reply-To: <4D8A95A9-E0C5-449E-9C96-3F169ABB71FE@campus.fu-berlin.de> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: "J.E. 
Reid" X-Originating-IP: 131.111.8.151 X-purgate: clean X-purgate-type: clean X-purgate-ID: 151147::1311766309-00005A17-7E4186EB/0-0/0-0 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000235, version=1.2.2 X-Spam-Flag: NO X-Spam-Checker-Version: SpamAssassin 3.0.4 on Burundi.ZEDAT.FU-Berlin.DE X-Spam-Level: X-Spam-Status: No, score=0.8 required=5.0 tests=RATWARE_GECKO_BUILD Subject: Re: [Seqan-dev] Time complexity of posGlobalize X-BeenThere: seqan-dev@lists.fu-berlin.de X-Mailman-Version: 2.1.11 Precedence: list Reply-To: SeqAn Development List-Id: SeqAn Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 27 Jul 2011 11:31:50 -0000 On 26/07/11 07:17, Weese, David wrote: > Hi John, > > posGlobalize needs in either case constant time. posLocalize takes constant time only for pairs. Otherwise (for global positions = integers) it takes O(m log m) where m is the number of sequences. > And yes, global positions help to save memory. But you can configure the type used in the position tables by overloading SAValue for your index. With Packed Pairs you specify the number of bits used for seqno and seqofs. That sounds interesting. I'm not quite sure where in the documentation I should look to see what how I should overload SAValue. Is there documentation on this? Thanks, John. 
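[Editorial sketch] The complexity claim quoted above (which David's follow-up mail corrects from O(m log m) to O(log m)) can be illustrated without SeqAn. The code below is only a sketch under stated assumptions: the names Limits, LocalPos, buildLimits, toGlobal and toLocal are hypothetical and are not SeqAn's posGlobalize/posLocalize. It assumes a table of sequence start offsets like the limits member described earlier in this thread. Going from a (seqNo, seqOfs) pair to a global position is a single lookup, while going back needs a binary search over the m sequence boundaries.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for a ConcatDirect-style limits table:
// limits[i] is the global start of sequence i, limits.back() is the total length.
typedef std::vector<std::size_t> Limits;
typedef std::pair<std::size_t, std::size_t> LocalPos; // (seqNo, seqOfs)

// Build the limits table from the individual sequence lengths.
Limits buildLimits(const std::vector<std::size_t>& lengths)
{
    Limits limits(1, 0);
    for (std::size_t i = 0; i < lengths.size(); ++i)
        limits.push_back(limits.back() + lengths[i]);
    return limits;
}

// Local -> global: one table lookup and one addition, O(1).
std::size_t toGlobal(const Limits& limits, const LocalPos& p)
{
    return limits[p.first] + p.second;
}

// Global -> local: binary search over the limits, O(log m) for m sequences.
// Precondition (not checked): g < limits.back().
LocalPos toLocal(const Limits& limits, std::size_t g)
{
    // First limit strictly greater than g; the sequence containing g
    // starts one entry earlier.
    Limits::const_iterator it = std::upper_bound(limits.begin(), limits.end(), g);
    std::size_t seqNo = static_cast<std::size_t>(it - limits.begin()) - 1;
    return LocalPos(seqNo, g - limits[seqNo]);
}
```

With three sequences of lengths 60, 60 and 30, the local position (1, 5) maps to the global position 65 and back again. Storing pairs avoids the search at the cost of a wider value per suffix-array entry, which is the trade-off John describes.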
From jer15@hermes.cam.ac.uk Wed Jul 27 14:19:41 2011 Received: from relay1.zedat.fu-berlin.de ([130.133.4.67]) by list1.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1Qm35R-0001Xb-Bq>; Wed, 27 Jul 2011 14:19:41 +0200 Received: from ppsw-41.csi.cam.ac.uk ([131.111.8.141]) by relay1.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1Qm35R-0002DM-8W>; Wed, 27 Jul 2011 14:19:41 +0200 X-Cam-AntiVirus: no malware found X-Cam-SpamDetails: not scanned X-Cam-ScannerInfo: http://www.cam.ac.uk/cs/email/scanner/ Received: from cpc6-dals15-2-0-cust115.hari.cable.virginmedia.com ([82.35.196.116]:47820 helo=[192.168.1.4]) by ppsw-41.csi.cam.ac.uk (smtp.hermes.cam.ac.uk [131.111.8.156]:587) with esmtpsa (PLAIN:jer15) (TLSv1:DHE-RSA-CAMELLIA256-SHA:256) id 1Qm35R-0007Jw-Pq (Exim 4.72) for seqan-dev@lists.fu-berlin.de (return-path ); Wed, 27 Jul 2011 13:19:41 +0100 Message-ID: <4E30025C.5080309@mail.cryst.bbk.ac.uk> Date: Wed, 27 Jul 2011 13:19:40 +0100 From: John Reid User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.18) Gecko/20110617 Lightning/1.0b2 Thunderbird/3.1.11 MIME-Version: 1.0 To: SeqAn Development References: <4E2D3C1F.5010801@mail.cryst.bbk.ac.uk> <4D8A95A9-E0C5-449E-9C96-3F169ABB71FE@campus.fu-berlin.de> <4E2FF724.6010502@mail.cryst.bbk.ac.uk> In-Reply-To: <4E2FF724.6010502@mail.cryst.bbk.ac.uk> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: "J.E. 
Reid" X-Originating-IP: 131.111.8.141 X-purgate: clean X-purgate-type: clean X-purgate-ID: 151147::1311769181-00005A17-51693AA0/0-0/0-0 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000030, version=1.2.2 X-Spam-Flag: NO X-Spam-Checker-Version: SpamAssassin 3.0.4 on Burundi.ZEDAT.FU-Berlin.DE X-Spam-Level: X-Spam-Status: No, score=0.8 required=5.0 tests=RATWARE_GECKO_BUILD Subject: Re: [Seqan-dev] Time complexity of posGlobalize X-BeenThere: seqan-dev@lists.fu-berlin.de X-Mailman-Version: 2.1.11 Precedence: list Reply-To: SeqAn Development List-Id: SeqAn Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 27 Jul 2011 12:19:42 -0000
On 27/07/11 12:31, John Reid wrote:
> On 26/07/11 07:17, Weese, David wrote:
>> Hi John,
>>
>> posGlobalize needs in either case constant time. posLocalize takes
>> constant time only for pairs. Otherwise (for global positions =
>> integers) it takes O(m log m) where m is the number of sequences.
>> And yes, global positions help to save memory. But you can configure
>> the type used in the position tables by overloading SAValue for your
>> index. With Packed Pairs you specify the number of bits used for
>> seqno and seqofs.
> That sounds interesting. I'm not quite sure where in the documentation
> I should look to see how I should overload SAValue. Is there
> documentation on this?
>
I think I worked it out by trial and error:

Index< string_set_t, IndexEsa< Pair< unsigned int, unsigned int, BitCompressed< 17, 13 > > > >

seems to be doing the job. I would like to know if it is just that I don't understand how the docs work or whether it is missing from the docs. Quite often I can't find things in the docs.
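To make the "number of bits used for seqno and seqofs" concrete, here is a minimal standalone sketch of a bit-packed pair. It assumes nothing about SeqAn's own Pair<..., BitCompressed<...> > layout: `PackedPair`, `maxI1` and `maxI2` are hypothetical names for this illustration, and C++ bit-fields stand in for whatever packing the library really uses.

```cpp
#include <cstdint>

// Hypothetical illustration type; not SeqAn's Pair<..., BitCompressed<...> >.
template <unsigned BITS1, unsigned BITS2>
struct PackedPair
{
    static_assert(BITS1 > 0 && BITS2 > 0 && BITS1 + BITS2 <= 32,
                  "both fields must fit into one 32-bit word");

    std::uint32_t i1 : BITS1;   // sequence number (seqno)
    std::uint32_t i2 : BITS2;   // offset within that sequence (seqofs)

    // Largest values each field can hold.
    static constexpr std::uint32_t maxI1() { return (std::uint32_t(1) << BITS1) - 1; }
    static constexpr std::uint32_t maxI2() { return (std::uint32_t(1) << BITS2) - 1; }
};
```

With the 17/13 split from the message above, such a pair could address up to 131072 sequences of at most 8192 characters each; the 19/13 split uses all 32 bits. Values beyond those limits would silently wrap, so the split has to match the data.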
From weese@campus.fu-berlin.de Wed Jul 27 22:35:21 2011 Received: from outpost1.zedat.fu-berlin.de ([130.133.4.66]) by list1.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1QmAp5-0004Hl-92>; Wed, 27 Jul 2011 22:35:19 +0200 Received: from relay2.zedat.fu-berlin.de ([130.133.4.80]) by outpost1.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1QmAp5-0002A0-6c>; Wed, 27 Jul 2011 22:35:19 +0200 Received: from exchange6.fu-berlin.de ([160.45.9.133]) by relay2.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1QmAp5-0005lv-13>; Wed, 27 Jul 2011 22:35:19 +0200 Received: from exchange6.fu-berlin.de ([160.45.9.133]) by exchange6.fu-berlin.de ([160.45.9.133]) with mapi; Wed, 27 Jul 2011 22:35:19 +0200 From: "Weese, David" To: SeqAn Development Date: Wed, 27 Jul 2011 22:35:16 +0200 Thread-Topic: [Seqan-dev] Time complexity of posGlobalize Thread-Index: AcxMnLKUJHxLyjllQm+z+83QxFk31w== Message-ID: <71FF7950-9B04-4094-82F8-89809C7AB0EF@fu-berlin.de> References: <4E2D3C1F.5010801@mail.cryst.bbk.ac.uk> <4D8A95A9-E0C5-449E-9C96-3F169ABB71FE@campus.fu-berlin.de> <4E2FF724.6010502@mail.cryst.bbk.ac.uk> <4E30025C.5080309@mail.cryst.bbk.ac.uk> In-Reply-To: <4E30025C.5080309@mail.cryst.bbk.ac.uk> Accept-Language: de-DE Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: acceptlanguage: de-DE Content-Type: multipart/alternative; boundary="_000_71FF79509B04409482F889809C7AB0EFfuberlinde_" MIME-Version: 1.0 X-Originating-IP: 160.45.9.133 X-ZEDAT-Hint: A X-purgate: clean X-purgate-type: clean X-purgate-ID: 151147::1311798919-00005A17-703DA8EF/0-0/0-0 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000002, version=1.2.2 X-Spam-Flag: NO X-Spam-Checker-Version: SpamAssassin 3.0.4 on Dschibuti.ZEDAT.FU-Berlin.DE X-Spam-Level: X-Spam-Status: No, score=-2.7 required=5.0 tests=ALL_TRUSTED,HTML_50_60, HTML_MESSAGE Subject: Re: [Seqan-dev] Time 
complexity of posGlobalize X-BeenThere: seqan-dev@lists.fu-berlin.de X-Mailman-Version: 2.1.11 Precedence: list Reply-To: SeqAn Development List-Id: SeqAn Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 27 Jul 2011 20:35:21 -0000 --_000_71FF79509B04409482F889809C7AB0EFfuberlinde_ Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: quoted-printable
Hi John,

you will find an introduction to the two position types for multiple sequences in the tutorial:
http://trac.mi.fu-berlin.de/seqan/wiki/Tutorial/Indices#HandlingMultipleSequences

There is a link to the metafunction SAValue, which can be overloaded to change the position type used by an index:
http://www.seqan.de/dddoc/html_devel/METAFUNCTION_S_A_Value.html

The ESA index Index<TText, IndexEsa<> >, as well as most of the subclasses whose TSpec argument is Foo<> (and not Foo), can be further specialized in the form Foo<MoreSpecial<> > and Foo<MoreSpecial<EvenMoreSpecial<> > > and so on. That leaves room for further enhancements or specializations like a custom SAValue.

struct MyIndex;

template <typename TText>
struct SAValue<Index<TText, IndexEsa<MyIndex> > >
{
    typedef Pair<unsigned, unsigned, BitCompressed<19,13> > Type;
};

Would it have helped if that was a part of the tutorial? Or where would you search for that example?

Regards,
David
--
David Weese weese@inf.fu-berlin.de
Freie Universität Berlin http://www.inf.fu-berlin.de/
Institut für Informatik Phone: +49 30 838 75246
Takustraße 9 Algorithmic Bioinformatics
14195 Berlin Room 021

On 27.07.2011 at 14:19, John Reid wrote:

> On 27/07/11 12:31, John Reid wrote:
>> On 26/07/11 07:17, Weese, David wrote:
>>> Hi John,
>>>
>>> posGlobalize needs in either case constant time. posLocalize takes
>>> constant time only for pairs. Otherwise (for global positions =
>>> integers) it takes O(m log m) where m is the number of sequences.
>>> And yes, global positions help to save memory. But you can configure
>>> the type used in the position tables by overloading SAValue for your
>>> index. With Packed Pairs you specify the number of bits used for
>>> seqno and seqofs.
>> That sounds interesting. I'm not quite sure where in the documentation
>> I should look to see how I should overload SAValue. Is there
>> documentation on this?
>
> I think I worked it out by trial and error:
>
> Index< string_set_t, IndexEsa< Pair< unsigned int, unsigned int, BitCompressed< 17, 13 > > > >
>
> seems to be doing the job. I would like to know if it is just that I
> don't understand how the docs work or whether it is missing from the
> docs. Quite often I can't find things in the docs.

_______________________________________________
seqan-dev mailing list
seqan-dev@lists.fu-berlin.de
https://lists.fu-berlin.de/listinfo/seqan-dev
--_000_71FF79509B04409482F889809C7AB0EFfuberlinde_--
From wengl@uci.edu Thu Jul 28 04:12:18 2011 Received: from relay1.zedat.fu-berlin.de ([130.133.4.67]) by list1.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1QmG5B-0007zk-AD>; Thu, 28 Jul 2011 04:12:17 +0200 Received: from smtp2.es.uci.edu ([128.200.80.32]) by relay1.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1QmG5A-0007Ed-Sl>; Thu, 28 Jul 2011 04:12:17 +0200 Received: from [128.195.53.171] (dhcp-053171.ics.uci.edu [128.195.53.171]) (authenticated bits=0) by smtp2.es.uci.edu (8.13.8/8.13.8) with ESMTP id p6S2C2SE004258 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NOT) for ; Wed, 27 Jul 2011 19:12:14 -0700 X-UCInetID: wengl Message-ID: <4E30C59C.4080309@uci.edu> Date: Wed, 27 Jul 2011 19:12:44 -0700 From: Lingjie Weng User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.17) Gecko/20110424 Thunderbird/3.1.10 MIME-Version: 1.0 To: seqan-dev@lists.fu-berlin.de References: <4E2D3C1F.5010801@mail.cryst.bbk.ac.uk> <4D8A95A9-E0C5-449E-9C96-3F169ABB71FE@campus.fu-berlin.de> <4E2FF724.6010502@mail.cryst.bbk.ac.uk> <4E30025C.5080309@mail.cryst.bbk.ac.uk> <71FF7950-9B04-4094-82F8-89809C7AB0EF@fu-berlin.de> In-Reply-To: <71FF7950-9B04-4094-82F8-89809C7AB0EF@fu-berlin.de> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-Originating-IP: 128.200.80.32 X-purgate: clean X-purgate-type: clean X-purgate-ID: 151147::1311819137-00005A17-A5BF3577/0-0/0-0 X-Bogosity: Ham, tests=bogofilter, spamicity=0.005093, version=1.2.2 X-Spam-Flag: NO X-Spam-Checker-Version: SpamAssassin 3.0.4 on Algerien.ZEDAT.FU-Berlin.DE X-Spam-Level: X-Spam-Status: No, score=0.8 required=5.0 tests=RATWARE_GECKO_BUILD Subject: [Seqan-dev] align_myer algorithm in seqan/align/align_myer X-BeenThere: seqan-dev@lists.fu-berlin.de X-Mailman-Version: 2.1.11 Precedence: list Reply-To: SeqAn 
Development List-Id: SeqAn Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 28 Jul 2011 02:12:18 -0000
Hi,

I was wondering if you have combined the Ukkonen algorithm into the align_myers algorithm in Seqan_Release_1.3/seqan/align/align_myers.h. If yes, can you explain how you update the "last active cell"?

In addition, do you have a banded MyersUkkonen algorithm for global alignment? The one I found under Seqan_Release_1.3/seqan/find/find_myers_ukkonen is not applicable to global alignment.

Thanks,
Lingjie Weng
PhD student
University of California, Irvine
From jer15@hermes.cam.ac.uk Thu Jul 28 09:31:57 2011 Received: from relay1.zedat.fu-berlin.de ([130.133.4.67]) by list1.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1QmL4U-0003C2-Pe>; Thu, 28 Jul 2011 09:31:54 +0200 Received: from ppsw-52.csi.cam.ac.uk ([131.111.8.152]) by relay1.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1QmL4U-00061J-KJ>; Thu, 28 Jul 2011 09:31:54 +0200 X-Cam-AntiVirus: no malware found X-Cam-SpamDetails: not scanned X-Cam-ScannerInfo: http://www.cam.ac.uk/cs/email/scanner/ Received: from cpc6-dals15-2-0-cust115.hari.cable.virginmedia.com ([82.35.196.116]:41704 helo=[192.168.1.4]) by ppsw-52.csi.cam.ac.uk (smtp.hermes.cam.ac.uk [131.111.8.159]:587) with esmtpsa (PLAIN:jer15) (TLSv1:DHE-RSA-CAMELLIA256-SHA:256) id 1QmL4U-0002Hs-DR (Exim 4.72) for seqan-dev@lists.fu-berlin.de (return-path ); Thu, 28 Jul 2011 08:31:54 +0100 Message-ID: <4E311069.1010204@mail.cryst.bbk.ac.uk> Date: Thu, 28 Jul 2011 08:31:53 +0100 From: John Reid User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.18) Gecko/20110617 Lightning/1.0b2 Thunderbird/3.1.11 MIME-Version: 1.0 To: SeqAn Development References: <4E2D3C1F.5010801@mail.cryst.bbk.ac.uk> <4D8A95A9-E0C5-449E-9C96-3F169ABB71FE@campus.fu-berlin.de> <4E2FF724.6010502@mail.cryst.bbk.ac.uk> 
<4E30025C.5080309@mail.cryst.bbk.ac.uk> <71FF7950-9B04-4094-82F8-89809C7AB0EF@fu-berlin.de> In-Reply-To: <71FF7950-9B04-4094-82F8-89809C7AB0EF@fu-berlin.de> Content-Type: text/html; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Sender: "J.E. Reid" X-Originating-IP: 131.111.8.152 X-purgate: clean X-purgate-type: clean X-purgate-ID: 151147::1311838314-00005A17-2A138403/0-0/0-0 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.2 X-Spam-Flag: NO X-Spam-Checker-Version: SpamAssassin 3.0.4 on Dschibuti.ZEDAT.FU-Berlin.DE X-Spam-Level: xx X-Spam-Status: No, score=2.1 required=5.0 tests=HTML_60_70,HTML_MESSAGE, HTML_TITLE_EMPTY,MIME_HTML_ONLY,RATWARE_GECKO_BUILD Subject: Re: [Seqan-dev] Time complexity of posGlobalize X-BeenThere: seqan-dev@lists.fu-berlin.de X-Mailman-Version: 2.1.11 Precedence: list Reply-To: SeqAn Development List-Id: SeqAn Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 28 Jul 2011 07:31:57 -0000 On 27/07/11 21:35, Weese, David wrote:
> Hi John,
>
> you will find an introduction to the two position types for multiple sequences in the tutorial:
> http://trac.mi.fu-berlin.de/seqan/wiki/Tutorial/Indices#HandlingMultipleSequences
>
> There is a link to the metafunction SAValue, which can be overloaded to change the position type used by an index:
> http://www.seqan.de/dddoc/html_devel/METAFUNCTION_S_A_Value.html
>
> The ESA index Index<TText, IndexEsa<> >, as well as most of the subclasses whose TSpec argument is Foo<> (and not Foo), can be further specialized in the form Foo<MoreSpecial<> > and Foo<MoreSpecial<EvenMoreSpecial<> > > and so on. That leaves room for further enhancements or specializations like a custom SAValue.
>
> struct MyIndex;
>
> template <typename TText>
> struct SAValue<Index<TText, IndexEsa<MyIndex> > >
> {
>     typedef Pair<unsigned, unsigned, BitCompressed<19,13> > Type;
> };
>
> Would it have helped if that was a part of the tutorial? Or where would you search for that example?

I tend to look in the reference section of the documentation so I was here:
http://www.seqan.de/dddoc/html_devel/SPEC_Index_Esa.html

I could see the definition
Index<TText, IndexEsa<> >

but from this page I had no idea I could overload the SAValue, or indeed if I could whether it would be an argument to the IndexEsa template or to the Index template. Are you saying I need to specialise the metafunction SAValue to overload it for my index? Presumably this must be done in the seqan namespace? Is the SAValue for the index not a template argument to Index?

Thanks,
John.

> Regards,
> David
> --
> David Weese weese@inf.fu-berlin.de
> Freie Universität Berlin http://www.inf.fu-berlin.de/
> Institut für Informatik Phone: +49 30 838 75246
> Takustraße 9 Algorithmic Bioinformatics
> 14195 Berlin Room 021
>
> On 27.07.2011 at 14:19, John Reid wrote:
>
>> On 27/07/11 12:31, John Reid wrote:
>>> On 26/07/11 07:17, Weese, David wrote:
>>>> Hi John,
>>>>
>>>> posGlobalize needs in either case constant time. posLocalize takes
>>>> constant time only for pairs. Otherwise (for global positions =
>>>> integers) it takes O(m log m) where m is the number of sequences.
>>>> And yes, global positions help to save memory. But you can configure
>>>> the type used in the position tables by overloading SAValue for your
>>>> index. With Packed Pairs you specify the number of bits used for
>>>> seqno and seqofs.
>>> That sounds interesting. I'm not quite sure where in the documentation
>>> I should look to see how I should overload SAValue. Is there
>>> documentation on this?
>>
>> I think I worked it out by trial and error:
>>
>> Index< string_set_t, IndexEsa< Pair< unsigned int, unsigned int,
>> BitCompressed< 17, 13 > > > >
>>
>> seems to be doing the job. I would like to know if it is just that I
>> don't understand how the docs work or whether it is missing from the
>> docs. Quite often I can't find things in the docs.
>>
>> _______________________________________________
>> seqan-dev mailing list
>> seqan-dev@lists.fu-berlin.de
>> https://lists.fu-berlin.de/listinfo/seqan-dev
From manuel.holtgrewe@fu-berlin.de Thu Jul 28 12:41:11 2011 Received: from outpost1.zedat.fu-berlin.de ([130.133.4.66]) by list1.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1QmO1f-0007ev-0X>; Thu, 28 Jul 2011 12:41:11 +0200 Received: from relay2.zedat.fu-berlin.de ([130.133.4.80]) by outpost1.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1QmO1e-00060X-Uz>; Thu, 28 Jul 2011 12:41:11 +0200 Received: from exchange6.fu-berlin.de ([160.45.9.133]) by relay2.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1QmO1e-0005Jh-Ny>; Thu, 28 Jul 2011 12:41:10 +0200 Received: from exchange6.fu-berlin.de ([160.45.9.133]) by exchange6.fu-berlin.de ([160.45.9.133]) with mapi; Thu, 28 Jul 2011 12:41:10 +0200 From: "Holtgrewe, Manuel" To: SeqAn Development Date: Thu, 28 Jul 2011 12:41:09 +0200 Thread-Topic: [Seqan-dev] align_myer algorithm in seqan/align/align_myer Thread-Index: AcxNEt1imRpQkrJNTeajlnMxmqkm7g== Message-ID: <68D67F86-FAE5-44CB-B7E1-FB7353982755@fu-berlin.de> References: <4E2D3C1F.5010801@mail.cryst.bbk.ac.uk> <4D8A95A9-E0C5-449E-9C96-3F169ABB71FE@campus.fu-berlin.de> <4E2FF724.6010502@mail.cryst.bbk.ac.uk> <4E30025C.5080309@mail.cryst.bbk.ac.uk> <71FF7950-9B04-4094-82F8-89809C7AB0EF@fu-berlin.de> <4E30C59C.4080309@uci.edu> In-Reply-To: <4E30C59C.4080309@uci.edu> Accept-Language: en-US, de-DE Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: acceptlanguage: en-US, de-DE Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 X-Originating-IP: 160.45.9.133 X-purgate: clean X-purgate-type: clean X-purgate-ID: 151147::1311849671-00005A17-FD265E0F/0-0/0-0 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000015, version=1.2.2 X-Spam-Flag: NO X-Spam-Checker-Version: SpamAssassin 3.0.4 on Benin.ZEDAT.FU-Berlin.DE X-Spam-Level: X-Spam-Status: No, score=-2.8 required=5.0 
tests=ALL_TRUSTED Subject: Re: [Seqan-dev] align_myer algorithm in seqan/align/align_myer X-BeenThere: seqan-dev@lists.fu-berlin.de X-Mailman-Version: 2.1.11 Precedence: list Reply-To: SeqAn Development List-Id: SeqAn Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 28 Jul 2011 10:41:11 -0000
On 28.07.2011 at 04:12, Lingjie Weng wrote:

> Hi,
>
> I was wondering if you have combined the Ukkonen algorithm into the
> align_myers algorithm in Seqan_Release_1.3/seqan/align/align_myers.h.
> If yes, can you explain how you update the "last active cell"?

No, it's implemented in the Finder MyersUkkonen, though.

> In addition, do you have a banded MyersUkkonen algorithm for global
> alignment? The one I found under
> Seqan_Release_1.3/seqan/find/find_myers_ukkonen is not applicable to
> global alignment.

Yes, you are right. The banded Myers Ukkonen is not applicable to global alignments.

You can try to use the banded globalAlignment() algorithm (Gotoh/NW).
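As background to the banded global alignment suggested above, the idea of restricting the dynamic programming matrix to a diagonal band can be sketched in a few lines. This is not SeqAn's globalAlignment() and makes no claim about its interface: `bandedEditDistance` is a hypothetical helper, it uses unit costs (Needleman-Wunsch style, no affine gaps as in Gotoh) and only returns the score, not the alignment.

```cpp
#include <algorithm>
#include <cstdlib>
#include <limits>
#include <string>
#include <vector>

// Hypothetical helper; not a SeqAn function. Returns the unit-cost global
// edit distance of a and b restricted to cells with |i - j| <= band, or -1
// if the band cannot reach the final cell (length difference > band).
int bandedEditDistance(const std::string& a, const std::string& b, int band)
{
    const int n = static_cast<int>(a.size());
    const int m = static_cast<int>(b.size());
    if (band < 0 || std::abs(n - m) > band)
        return -1;

    const int INF = std::numeric_limits<int>::max() / 2;
    std::vector<int> prev(m + 1, INF), cur(m + 1, INF);

    // Row 0: only columns 0..band lie inside the band.
    for (int j = 0; j <= std::min(m, band); ++j)
        prev[j] = j;

    for (int i = 1; i <= n; ++i)
    {
        std::fill(cur.begin(), cur.end(), INF);
        const int lo = std::max(0, i - band);
        const int hi = std::min(m, i + band);
        for (int j = lo; j <= hi; ++j)
        {
            int best = prev[j] + 1;                       // gap in b
            if (j > 0)
            {
                const int sub = (a[i - 1] == b[j - 1]) ? 0 : 1;
                best = std::min(best, prev[j - 1] + sub); // match or mismatch
                best = std::min(best, cur[j - 1] + 1);    // gap in a
            }
            cur[j] = best;
        }
        prev.swap(cur);
    }
    return prev[m];
}
```

The result equals the unrestricted edit distance whenever that distance is at most `band`; with a narrower band it is only an upper bound, which is the usual trade-off of banded alignment.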
Bests,
Manuel
From weese@campus.fu-berlin.de Thu Jul 28 13:26:26 2011 Received: from outpost1.zedat.fu-berlin.de ([130.133.4.66]) by list1.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1QmOjQ-0000xQ-4g>; Thu, 28 Jul 2011 13:26:24 +0200 Received: from relay2.zedat.fu-berlin.de ([130.133.4.80]) by outpost1.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1QmOjQ-00054d-2L>; Thu, 28 Jul 2011 13:26:24 +0200 Received: from exchange6.fu-berlin.de ([160.45.9.133]) by relay2.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1QmOjP-0000ox-PS>; Thu, 28 Jul 2011 13:26:24 +0200 Received: from exchange6.fu-berlin.de ([160.45.9.133]) by exchange6.fu-berlin.de ([160.45.9.133]) with mapi; Thu, 28 Jul 2011 13:26:23 +0200 From: "Weese, David" To: SeqAn Development Date: Thu, 28 Jul 2011 13:26:22 +0200 Thread-Topic: [Seqan-dev] Time complexity of posGlobalize Thread-Index: AcxNGS5bFMepyqHeTESEFFWCs5rauA== Message-ID: <38D74D8C-B8DD-47C8-B3D0-493CB49C41BD@fu-berlin.de> References: <4E2D3C1F.5010801@mail.cryst.bbk.ac.uk> <4D8A95A9-E0C5-449E-9C96-3F169ABB71FE@campus.fu-berlin.de> <4E2FF724.6010502@mail.cryst.bbk.ac.uk> <4E30025C.5080309@mail.cryst.bbk.ac.uk> <71FF7950-9B04-4094-82F8-89809C7AB0EF@fu-berlin.de> <4E311069.1010204@mail.cryst.bbk.ac.uk> In-Reply-To: <4E311069.1010204@mail.cryst.bbk.ac.uk> Accept-Language: de-DE Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: acceptlanguage: de-DE Content-Type: multipart/alternative; boundary="_000_38D74D8CB8DD47C8B3D0493CB49C41BDfuberlinde_" MIME-Version: 1.0 X-Originating-IP: 160.45.9.133 X-ZEDAT-Hint: A X-purgate: clean X-purgate-type: clean X-purgate-ID: 151147::1311852384-00005A17-0F8975AB/0-0/0-0 X-Bogosity: Ham, tests=bogofilter, spamicity=0.045691, version=1.2.2 X-Spam-Flag: NO X-Spam-Checker-Version: SpamAssassin 3.0.4 on Burundi.ZEDAT.FU-Berlin.DE X-Spam-Level: X-Spam-Status: 
No, score=-2.3 required=5.0 tests=ALL_TRUSTED,HTML_20_30, HTML_MESSAGE Subject: Re: [Seqan-dev] Time complexity of posGlobalize X-BeenThere: seqan-dev@lists.fu-berlin.de X-Mailman-Version: 2.1.11 Precedence: list Reply-To: SeqAn Development List-Id: SeqAn Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 28 Jul 2011 11:26:26 -0000 --_000_38D74D8CB8DD47C8B3D0493CB49C41BDfuberlinde_ Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable
> I tend to look in the reference section of the documentation so I was here:
> http://www.seqan.de/dddoc/html_devel/SPEC_Index_Esa.html
>
> I could see the definition
> Index<TText, IndexEsa<> >
>
> but from this page I had no idea I could overload the SAValue, or indeed if I could whether it would be an argument to the IndexEsa template or to the Index template. Are you saying I need to specialise the metafunction SAValue to overload it for my index? Presumably this must be done in the seqan namespace? Is the SAValue for the index not a template argument to Index?

Metafunctions are not only meant for users to query a type of a function or data structure. They also define the types of the data structures, i.e. if you change the return type of a metafunction, you can change the behaviour and the types of data structures in the library. This is a bit different from the STL, where types are defined in a trait structure that is passed as an additional template argument.

Changing the return type of a metafunction is only possible for a type which is more special than the type it was already specialized for. Of course there is a SAValue specialization for all Index classes of the form Index<TText, TSpec>, which returns SAValue<TText>::Type. To be more special, you can specialize for either TText or TSpec. The example in my last mail did the latter. You will find the list of metafunctions specialized for a certain type under "Metafunction" on its documentation page.

Yes, the specialization needs to be done in the seqan namespace. SAValue is not an argument of the Index, as there are many more metafunctions (see Fibre, DefaultIndexCreator, ...) and passing them all would bloat the interface. We intentionally left a single extra argument (TSpec) open for enhancements. If you want to create a specialized index, you can define your own struct (MyIndex in the example) and give it as TSpec. You can then change the behaviour of your index by specializing only the types/tags you want to change, like SAValue, Fibre, ... Just like you would override functions in OOP.

> Thanks,
> John.

Cheers,
David
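The specialization mechanism described in this message can be tried in isolation. The sketch below does not use SeqAn at all: the namespace `sketch` stands in for the library namespace, the empty `Index` and `IndexEsa` templates and the default `SAValue` are hypothetical stand-ins, and `std::pair` replaces the library's pair type. It only shows how a user-defined tag passed as TSpec lets a partial specialization change the type a metafunction returns.

```cpp
#include <type_traits>
#include <utility>

namespace sketch {  // stand-in for the library namespace; all names are hypothetical

template <typename TSpec = void> struct IndexEsa {};
template <typename TText, typename TSpec> struct Index {};

// Library default: every index reports a plain integer position type.
template <typename TIndex>
struct SAValue
{
    typedef unsigned long Type;
};

}  // namespace sketch

// User code: a tag type that makes the index "more special" ...
struct MyIndex;

// ... and a partial specialization, written inside the library namespace,
// that only matches Index<TText, IndexEsa<MyIndex> >.
namespace sketch {

template <typename TText>
struct SAValue<Index<TText, IndexEsa<MyIndex> > >
{
    typedef std::pair<unsigned, unsigned> Type;
};

}  // namespace sketch

struct SomeText {};  // placeholder text type for the checks below

typedef sketch::Index<SomeText, sketch::IndexEsa<> >        DefaultIndex;
typedef sketch::Index<SomeText, sketch::IndexEsa<MyIndex> > CustomIndex;

static_assert(std::is_same<sketch::SAValue<DefaultIndex>::Type, unsigned long>::value,
              "unspecialized indices keep the default position type");
static_assert(std::is_same<sketch::SAValue<CustomIndex>::Type,
                           std::pair<unsigned, unsigned> >::value,
              "the MyIndex tag selects the custom position type");
```

Only the index carrying the MyIndex tag picks up the new type; every other index keeps the default, which is the reason for using a tag instead of changing the default itself.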
--_000_38D74D8CB8DD47C8B3D0493CB49C41BDfuberlinde_--
From jer15@hermes.cam.ac.uk Thu Jul 28 15:56:44 2011 Received: from relay1.zedat.fu-berlin.de ([130.133.4.67]) by list1.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1QmR4r-00073Z-TJ>; Thu, 28 Jul 2011 15:56:41 +0200 Received: from ppsw-52.csi.cam.ac.uk ([131.111.8.152]) by relay1.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1QmR4r-0008J4-On>; Thu, 28 Jul 2011 15:56:41 +0200 X-Cam-AntiVirus: no malware found X-Cam-SpamDetails: not scanned X-Cam-ScannerInfo: http://www.cam.ac.uk/cs/email/scanner/ Received: from wifi-host-18.mrc-bsu.cam.ac.uk ([193.60.87.18]:53270) by ppsw-52.csi.cam.ac.uk (smtp.hermes.cam.ac.uk [131.111.8.159]:587) with esmtpsa (PLAIN:jer15) (TLSv1:DHE-RSA-CAMELLIA256-SHA:256) id 1QmR4r-0004UQ-EV (Exim 4.72) for seqan-dev@lists.fu-berlin.de (return-path ); Thu, 28 Jul 2011 14:56:41 +0100 Message-ID: <4E316A99.2080104@mail.cryst.bbk.ac.uk> Date: Thu, 28 Jul 2011 14:56:41 +0100 From: John Reid User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.18) Gecko/20110617 Lightning/1.0b2 Thunderbird/3.1.11 MIME-Version: 1.0 To: SeqAn Development References: <4E2D3C1F.5010801@mail.cryst.bbk.ac.uk> <4D8A95A9-E0C5-449E-9C96-3F169ABB71FE@campus.fu-berlin.de> <4E2FF724.6010502@mail.cryst.bbk.ac.uk> <4E30025C.5080309@mail.cryst.bbk.ac.uk> <71FF7950-9B04-4094-82F8-89809C7AB0EF@fu-berlin.de> <4E311069.1010204@mail.cryst.bbk.ac.uk> <38D74D8C-B8DD-47C8-B3D0-493CB49C41BD@fu-berlin.de> In-Reply-To: <38D74D8C-B8DD-47C8-B3D0-493CB49C41BD@fu-berlin.de> Content-Type: text/plain; charset=windows-1252; format=flowed Content-Transfer-Encoding: 8bit Sender: "J.E. 
Reid" X-Originating-IP: 131.111.8.152 X-purgate: clean X-purgate-type: clean X-purgate-ID: 151147::1311861401-00005A17-8B0DDA60/0-0/0-0 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000004, version=1.2.2 X-Spam-Flag: NO X-Spam-Checker-Version: SpamAssassin 3.0.4 on Algerien.ZEDAT.FU-Berlin.DE X-Spam-Level: X-Spam-Status: No, score=0.8 required=5.0 tests=RATWARE_GECKO_BUILD Subject: Re: [Seqan-dev] Time complexity of posGlobalize X-BeenThere: seqan-dev@lists.fu-berlin.de X-Mailman-Version: 2.1.11 Precedence: list Reply-To: SeqAn Development List-Id: SeqAn Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 28 Jul 2011 13:56:44 -0000 On 28/07/11 12:26, Weese, David wrote: > >> I tend to look in the reference section of the documentation so I was >> here: >> http://www.seqan.de/dddoc/html_devel/SPEC_Index_Esa.html >> >> I could see the definition >> Index<TText, IndexEsa<> > >> >> but from this page I had no idea I could overload the SAValue, or >> indeed if I could whether it would be an argument to the IndexEsa >> template or to the Index template. Are you saying I need to >> specialise the metafunction SAValue to overload it for my index? >> Presumably this must be done in the seqan namespace? Is the SAValue >> for the index not a template argument to Index? > > Metafunctions shall not only be used by a user to determine a type of > a function or data structure. They are as well defining the types of > data structures, i.e. if you change the return type of a metafunction > you can change the behaviour and the types of data structures in the > library. This is a bit different to the STL, where types are defined > in a trait structure - an additional template argument. > > Changing the return type of a metafunction is only possible for a type > which is more special than the default type (it was already > specialized for). 
Of course there is a SAValue specialization done for > all Index classes of the form Index<TText, TSpec> which returns the > SAValue<TText>::Type. To be more special you can either specialize for > TText or TSpec. The example in my last mail did the latter. You will > find the list of metafunctions specialized for a certain type under > "Metafunction" on its documentation page. > > Yes, the specialization needs to be done in the seqan namespace. > SAValue is not an argument of the Index as there are many more > metafunctions (see Fibre, DefaultIndexCreator, ...) would blow the > interface. We intentionally left a single extra argument (TSpec) open > for enhancements. If you want to create a specialized index you can > define your own struct (MyIndex in the example) and give it as TSpec. > You can then change the behaviour of your index by specializing only > the types/tags you want to change like SAValue, Fibre, ... Just like > you would do it for functions in OOP. >

OK, so if I've understood correctly, I think the following should work. I'm trying to set up an ESA that uses a SAValue of type unsigned. I get a load of compile errors about: No match for ‘getValueI2(unsigned int)’. Can you let me know what I'm doing wrong, or whether this is not possible?

#include <seqan/sequence.h>  // header names were lost in the archive; these two are assumed
#include <seqan/index.h>

using namespace seqan;

struct global {};

// specialise for our index
namespace seqan {

template< typename TText >
struct SAValue< Index< TText, IndexEsa< global > > >
{
    typedef unsigned Type;
};

} // namespace seqan

typedef String< Dna5 > string_t; ///< A string of Dna5.
typedef StringSet< string_t > string_set_t; ///< StringSet type.
typedef Index< string_set_t, IndexEsa< global > > index_t;

int main( int argc, char * argv[] )
{
    string_set_t sequences;
    index_t index( sequences );
    Iterator< index_t, TopDown<> >::Type it( index );
    return 0;
}
From wengl@uci.edu Thu Jul 28 19:25:35 2011 Received: from relay1.zedat.fu-berlin.de ([130.133.4.67]) by list1.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1QmUL0-0006pv-Mf>; Thu, 28 Jul 2011 19:25:34 +0200 Received: from smtp1.es.uci.edu ([128.200.80.31]) by relay1.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1QmUL0-0006R1-8v>; Thu, 28 Jul 2011 19:25:34 +0200 Received: from [128.195.53.171] (dhcp-053171.ics.uci.edu [128.195.53.171]) (authenticated bits=0) by smtp1.es.uci.edu (8.13.8/8.13.8) with ESMTP id p6SHPV5k005024 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NOT) for ; Thu, 28 Jul 2011 10:25:32 -0700 X-UCInetID: wengl Message-ID: <4E319BB8.2060800@uci.edu> Date: Thu, 28 Jul 2011 10:26:16 -0700 From: Lingjie Weng User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.17) Gecko/20110424 Thunderbird/3.1.10 MIME-Version: 1.0 To: seqan-dev@lists.fu-berlin.de References: <4E2D3C1F.5010801@mail.cryst.bbk.ac.uk> <4D8A95A9-E0C5-449E-9C96-3F169ABB71FE@campus.fu-berlin.de> <4E2FF724.6010502@mail.cryst.bbk.ac.uk> <4E30025C.5080309@mail.cryst.bbk.ac.uk> <71FF7950-9B04-4094-82F8-89809C7AB0EF@fu-berlin.de> <4E30C59C.4080309@uci.edu> <68D67F86-FAE5-44CB-B7E1-FB7353982755@fu-berlin.de> In-Reply-To: <68D67F86-FAE5-44CB-B7E1-FB7353982755@fu-berlin.de> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-Originating-IP: 128.200.80.31 X-purgate: clean X-purgate-type: clean X-purgate-ID: 151147::1311873934-00005A17-D7CDD224/0-0/0-0 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000016, version=1.2.2 X-Spam-Flag: NO X-Spam-Checker-Version: SpamAssassin 3.0.4 on Benin.ZEDAT.FU-Berlin.DE X-Spam-Level: 
X-Spam-Status: No, score=0.8 required=5.0 tests=RATWARE_GECKO_BUILD Subject: Re: [Seqan-dev] align_myer algorithm in seqan/align/align_myer X-BeenThere: seqan-dev@lists.fu-berlin.de X-Mailman-Version: 2.1.11 Precedence: list Reply-To: SeqAn Development List-Id: SeqAn Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 28 Jul 2011 17:25:35 -0000

Hi Manuel,

Thanks a lot. Can you point me to the file of the banded globalAlignment() algorithm (Gotoh/NW)?

Thanks,
Lingjie

On 07/28/2011 03:41 AM, Holtgrewe, Manuel wrote:
> On 28.07.2011 at 04:12, Lingjie Weng wrote:
>
>> Hi,
>>
>> I was wondering if you have combined the Ukkonen algorithm into the
>> align_myers algorithm in Seqan_Release_1.3/seqan/align/align_myers.h.
>> If yes, can you explain how you update the "last active cell"?
> No, it's implemented in the Finder MyersUkkonen, though.
>
>> In addition, do you have a banded MyersUkkonen algorithm for global
>> alignment? The one I found under
>> Seqan_Release_1.3/seqan/find/find_myers_ukkonen is not applicable to
>> global alignment.
> Yes, you are right. The banded Myers Ukkonen is not applicable to global alignments.
>
> You can try to use the banded globalAlignment() algorithm (Gotoh/NW).
> > Bests, > Manuel > _______________________________________________ > seqan-dev mailing list > seqan-dev@lists.fu-berlin.de > https://lists.fu-berlin.de/listinfo/seqan-dev > From manuel.holtgrewe@fu-berlin.de Fri Jul 29 01:50:22 2011 Received: from outpost1.zedat.fu-berlin.de ([130.133.4.66]) by list1.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1QmaLL-0003uk-U7>; Fri, 29 Jul 2011 01:50:20 +0200 Received: from inpost2.zedat.fu-berlin.de ([130.133.4.69]) by outpost1.zedat.fu-berlin.de (Exim 4.69) with esmtp (envelope-from ) id <1QmaLL-0007vC-Rn>; Fri, 29 Jul 2011 01:50:19 +0200 Received: from 91-65-212-104-dynip.superkabel.de ([91.65.212.104] helo=[192.168.0.100]) by inpost2.zedat.fu-berlin.de (Exim 4.69) with esmtpsa (envelope-from ) id <1QmaLL-0005wv-Nd>; Fri, 29 Jul 2011 01:50:19 +0200 Message-Id: From: Manuel Holtgrewe To: SeqAn Development In-Reply-To: <4E319BB8.2060800@uci.edu> Content-Type: text/plain; charset=ISO-8859-1; format=flowed; delsp=yes Content-Transfer-Encoding: quoted-printable Mime-Version: 1.0 (Apple Message framework v936) Date: Fri, 29 Jul 2011 01:50:19 +0200 References: <4E2D3C1F.5010801@mail.cryst.bbk.ac.uk> <4D8A95A9-E0C5-449E-9C96-3F169ABB71FE@campus.fu-berlin.de> <4E2FF724.6010502@mail.cryst.bbk.ac.uk> <4E30025C.5080309@mail.cryst.bbk.ac.uk> <71FF7950-9B04-4094-82F8-89809C7AB0EF@fu-berlin.de> <4E30C59C.4080309@uci.edu> <68D67F86-FAE5-44CB-B7E1-FB7353982755@fu-berlin.de> <4E319BB8.2060800@uci.edu> X-Mailer: Apple Mail (2.936) X-Originating-IP: 91.65.212.104 X-purgate: clean X-purgate-type: clean X-purgate-ID: 151147::1311897019-00005A17-7DBC284B/0-0/0-0 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.2 X-Spam-Flag: NO X-Spam-Checker-Version: SpamAssassin 3.0.4 on Burundi.ZEDAT.FU-Berlin.DE X-Spam-Level: X-Spam-Status: No, score=-2.8 required=5.0 tests=ALL_TRUSTED Subject: Re: [Seqan-dev] align_myer algorithm in seqan/align/align_myer X-BeenThere: 
seqan-dev@lists.fu-berlin.de X-Mailman-Version: 2.1.11 Precedence: list Reply-To: SeqAn Development List-Id: SeqAn Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 28 Jul 2011 23:50:22 -0000

Does this help you?

http://www.seqan.de/dddoc/html_devel/FUNCTION.global_Alignment.html

Bests,
Manuel

On 28.07.2011 at 19:26, Lingjie Weng wrote:
> Hi Manuel,
>
> Thanks a lot. Can you point me to the file of the banded
> globalAlignment() algorithm (Gotoh/NW)?
>
> Thanks,
> Lingjie
>
> On 07/28/2011 03:41 AM, Holtgrewe, Manuel wrote:
>> On 28.07.2011 at 04:12, Lingjie Weng wrote:
>>
>>> Hi,
>>>
>>> I was wondering if you have combined the Ukkonen algorithm into the
>>> align_myers algorithm in Seqan_Release_1.3/seqan/align/align_myers.h.
>>> If yes, can you explain how you update the "last active cell"?
>> No, it's implemented in the Finder MyersUkkonen, though.
>>
>>> In addition, do you have a banded MyersUkkonen algorithm for global
>>> alignment? The one I found under
>>> Seqan_Release_1.3/seqan/find/find_myers_ukkonen is not applicable to
>>> global alignment.
>> Yes, you are right. The banded Myers Ukkonen is not applicable to
>> global alignments.
>>
>> You can try to use the banded globalAlignment() algorithm (Gotoh/NW).
>>
>> Bests,
>> Manuel
>> _______________________________________________
>> seqan-dev mailing list
>> seqan-dev@lists.fu-berlin.de
>> https://lists.fu-berlin.de/listinfo/seqan-dev
>>
>
>
> _______________________________________________
> seqan-dev mailing list
> seqan-dev@lists.fu-berlin.de
> https://lists.fu-berlin.de/listinfo/seqan-dev

--
Manuel Holtgrewe            manuel.holtgrewe@fu-berlin.de
Freie Universität Berlin    http://www.inf.fu-berlin.de/
Institut für Informatik     Phone: +49 30 838 75246
Takustraße 9                Algorithmic Bioinformatics
14195 Berlin                Room 021

From weese@campus.fu-berlin.de Fri Jul 29 10:49:40 2011 Received: from outpost1.zedat.fu-berlin.de ([130.133.4.66]) by list1.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1QmilH-0007He-ML>; Fri, 29 Jul 2011 10:49:39 +0200 Received: from relay2.zedat.fu-berlin.de ([130.133.4.80]) by outpost1.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1QmilH-0004T1-KQ>; Fri, 29 Jul 2011 10:49:39 +0200 Received: from exchange6.fu-berlin.de ([160.45.9.133]) by relay2.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1QmilH-0006uc-F3>; Fri, 29 Jul 2011 10:49:39 +0200 Received: from exchange6.fu-berlin.de ([160.45.9.133]) by exchange6.fu-berlin.de ([160.45.9.133]) with mapi; Fri, 29 Jul 2011 10:49:39 +0200 From: "Weese, David" To: SeqAn Development Date: Fri, 29 Jul 2011 10:49:37 +0200 Thread-Topic: [Seqan-dev] Time complexity of posGlobalize Thread-Index: AcxNzHMbr672jjv5R1ava8yt/chfmg== Message-ID: References: <4E2D3C1F.5010801@mail.cryst.bbk.ac.uk> <4D8A95A9-E0C5-449E-9C96-3F169ABB71FE@campus.fu-berlin.de> <4E2FF724.6010502@mail.cryst.bbk.ac.uk> <4E30025C.5080309@mail.cryst.bbk.ac.uk> <71FF7950-9B04-4094-82F8-89809C7AB0EF@fu-berlin.de> <4E311069.1010204@mail.cryst.bbk.ac.uk> <38D74D8C-B8DD-47C8-B3D0-493CB49C41BD@fu-berlin.de> <4E316A99.2080104@mail.cryst.bbk.ac.uk> In-Reply-To: 
<4E316A99.2080104@mail.cryst.bbk.ac.uk> Accept-Language: de-DE Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: acceptlanguage: de-DE Content-Type: text/plain; charset="Windows-1252" Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 X-Originating-IP: 160.45.9.133 X-purgate: clean X-purgate-type: clean X-purgate-ID: 151147::1311929379-00005A17-BD88A773/0-0/0-0 X-Bogosity: Ham, tests=bogofilter, spamicity=0.003247, version=1.2.2 X-Spam-Flag: NO X-Spam-Checker-Version: SpamAssassin 3.0.4 on Burundi.ZEDAT.FU-Berlin.DE X-Spam-Level: X-Spam-Status: No, score=-2.8 required=5.0 tests=ALL_TRUSTED Subject: Re: [Seqan-dev] Time complexity of posGlobalize X-BeenThere: seqan-dev@lists.fu-berlin.de X-Mailman-Version: 2.1.11 Precedence: list Reply-To: SeqAn Development List-Id: SeqAn Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 29 Jul 2011 08:49:40 -0000

> OK, so if I've understood correctly I think the following should work.
> I'm trying to set up an ESA that uses a SAValue of type unsigned. I get a
> load of compile errors about No match for ‘getValueI2(unsigned int)’.
> Can you let me know what I'm doing wrong or if this is not possible.
>

I get them too. Originally, global positions were meant for suffix array construction algorithms that cannot cope with multiple sequences and hence have to use the concatenation of all sequences. The positions in the suffix array then refer to the concatenation (= global positions). Downstream algorithms like suffix tree iterators should work with local as well as with global positions.

It seems that the suffix array construction algorithm Skew7Multi assumes that the suffix array contains local positions; I will have a look at it.

For now, I would recommend using local (bitcompressed) positions as SAValue.

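To make the local/global distinction concrete, here is a minimal standalone sketch in plain C++. It does not use SeqAn, and the names LocalPos, makeLimits, toGlobal and toLocal are hypothetical; they only illustrate what a position relative to the concatenation of all sequences means. In this sketch, going from a local (sequence number, offset) pair to a global position is a table lookup plus an addition, while the reverse direction needs a binary search over the cumulative sequence lengths.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper type (not a SeqAn type): (sequence number, offset).
typedef std::pair<std::size_t, std::size_t> LocalPos;

// Build the table of cumulative lengths: limits[i] is the start of
// sequence i in the concatenation, limits.back() is the total length.
std::vector<std::size_t> makeLimits(const std::vector<std::size_t>& lengths)
{
    std::vector<std::size_t> limits(1, 0);
    for (std::size_t i = 0; i < lengths.size(); ++i)
        limits.push_back(limits.back() + lengths[i]);
    return limits;
}

// Local -> global: one lookup and one addition.
std::size_t toGlobal(const std::vector<std::size_t>& limits, const LocalPos& p)
{
    return limits[p.first] + p.second;
}

// Global -> local: binary search for the sequence containing g.
// Assumes g is smaller than the total length of the concatenation.
LocalPos toLocal(const std::vector<std::size_t>& limits, std::size_t g)
{
    // First entry strictly greater than g; the sequence starts one entry earlier.
    std::vector<std::size_t>::const_iterator it =
        std::upper_bound(limits.begin(), limits.end(), g);
    std::size_t seqNo = static_cast<std::size_t>(it - limits.begin()) - 1;
    return LocalPos(seqNo, g - limits[seqNo]);
}
```

For three sequences of lengths 60, 60 and 40, the local position (1, 5) corresponds to the global position 65 in this sketch, and converting 65 back gives (1, 5) again.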
David

From jer15@hermes.cam.ac.uk Fri Jul 29 11:02:25 2011 Received: from relay1.zedat.fu-berlin.de ([130.133.4.67]) by list1.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1Qmixc-0007pz-1q>; Fri, 29 Jul 2011 11:02:24 +0200 Received: from ppsw-41.csi.cam.ac.uk ([131.111.8.141]) by relay1.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1Qmixb-0003yi-Tw>; Fri, 29 Jul 2011 11:02:24 +0200 X-Cam-AntiVirus: no malware found X-Cam-SpamDetails: not scanned X-Cam-ScannerInfo: http://www.cam.ac.uk/cs/email/scanner/ Received: from cpc6-dals15-2-0-cust115.hari.cable.virginmedia.com ([82.35.196.116]:33536 helo=[192.168.1.4]) by ppsw-41.csi.cam.ac.uk (smtp.hermes.cam.ac.uk [131.111.8.156]:587) with esmtpsa (PLAIN:jer15) (TLSv1:DHE-RSA-CAMELLIA256-SHA:256) id 1Qmixb-00053m-Rz (Exim 4.72) for seqan-dev@lists.fu-berlin.de (return-path ); Fri, 29 Jul 2011 10:02:23 +0100 Message-ID: <4E32771E.9090806@mail.cryst.bbk.ac.uk> Date: Fri, 29 Jul 2011 10:02:22 +0100 From: John Reid User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.18) Gecko/20110617 Lightning/1.0b2 Thunderbird/3.1.11 MIME-Version: 1.0 To: SeqAn Development References: <4E2D3C1F.5010801@mail.cryst.bbk.ac.uk> <4D8A95A9-E0C5-449E-9C96-3F169ABB71FE@campus.fu-berlin.de> <4E2FF724.6010502@mail.cryst.bbk.ac.uk> <4E30025C.5080309@mail.cryst.bbk.ac.uk> <71FF7950-9B04-4094-82F8-89809C7AB0EF@fu-berlin.de> <4E311069.1010204@mail.cryst.bbk.ac.uk> <38D74D8C-B8DD-47C8-B3D0-493CB49C41BD@fu-berlin.de> <4E316A99.2080104@mail.cryst.bbk.ac.uk> In-Reply-To: Content-Type: text/plain; charset=windows-1252; format=flowed Content-Transfer-Encoding: 8bit Sender: "J.E. 
Reid" X-Originating-IP: 131.111.8.141 X-purgate: clean X-purgate-type: clean X-purgate-ID: 151147::1311930144-00005A17-1DBA1146/0-0/0-0 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000009, version=1.2.2 X-Spam-Flag: NO X-Spam-Checker-Version: SpamAssassin 3.0.4 on Algerien.ZEDAT.FU-Berlin.DE X-Spam-Level: X-Spam-Status: No, score=0.8 required=5.0 tests=RATWARE_GECKO_BUILD Subject: Re: [Seqan-dev] Time complexity of posGlobalize X-BeenThere: seqan-dev@lists.fu-berlin.de X-Mailman-Version: 2.1.11 Precedence: list Reply-To: SeqAn Development List-Id: SeqAn Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 29 Jul 2011 09:02:25 -0000

On 29/07/11 09:49, Weese, David wrote:
>> OK, so if I've understood correctly I think the following should work.
>> I'm trying to set up an ESA that uses a SAValue of type unsigned. I get a
>> load of compile errors about No match for ‘getValueI2(unsigned int)’.
>> Can you let me know what I'm doing wrong or if this is not possible.
>>
> I get them too. Originally, global positions were meant for suffix array construction algorithms that cannot cope with multiple sequences and hence have to use the concatenation of all sequences. The positions in the suffix array then refer to the concatenation (= global positions). Downstream algorithms like suffix tree iterators should work with local as well as with global positions.
> It seems that the suffix array construction algorithm Skew7Multi assumes that the suffix array contains local positions; I will have a look at it.
> For now, I would recommend using local (bitcompressed) positions as SAValue.
>

Thanks for looking into it. Some tests I've been doing on traversing the array and storing large numbers of occurrences suggest global positions are more efficient. Obviously these were tests where the array was using local positions as its SAValues.
I wanted to see if I could speed things up further by using an array with global positions for SAValues. So if you can get this working I'd be keen to try it. I think I didn't mention yet but I just made a publication that references Seqan you might like to know about. Do you have a project page that lists third party work that uses Seqan? http://nar.oxfordjournals.org/content/early/2011/07/23/nar.gkr574.abstract Thanks, John. From manuel.holtgrewe@fu-berlin.de Fri Jul 29 11:23:23 2011 Received: from outpost1.zedat.fu-berlin.de ([130.133.4.66]) by list1.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1QmjHt-0000BM-N3>; Fri, 29 Jul 2011 11:23:21 +0200 Received: from relay2.zedat.fu-berlin.de ([130.133.4.80]) by outpost1.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1QmjHt-0001rq-L3>; Fri, 29 Jul 2011 11:23:21 +0200 Received: from exchange6.fu-berlin.de ([160.45.9.133]) by relay2.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1QmjHt-0001Of-Fr>; Fri, 29 Jul 2011 11:23:21 +0200 Received: from exchange6.fu-berlin.de ([160.45.9.133]) by exchange6.fu-berlin.de ([160.45.9.133]) with mapi; Fri, 29 Jul 2011 11:23:21 +0200 From: "Holtgrewe, Manuel" To: SeqAn Development Date: Fri, 29 Jul 2011 11:23:20 +0200 Thread-Topic: [Seqan-dev] Time complexity of posGlobalize Thread-Index: AcxN0SjgXmnLpOLFQ3ursQbvhSFV5g== Message-ID: <2682A7D3-69B9-4B67-9859-F4845BD9B8CD@fu-berlin.de> References: <4E2D3C1F.5010801@mail.cryst.bbk.ac.uk> <4D8A95A9-E0C5-449E-9C96-3F169ABB71FE@campus.fu-berlin.de> <4E2FF724.6010502@mail.cryst.bbk.ac.uk> <4E30025C.5080309@mail.cryst.bbk.ac.uk> <71FF7950-9B04-4094-82F8-89809C7AB0EF@fu-berlin.de> <4E311069.1010204@mail.cryst.bbk.ac.uk> <38D74D8C-B8DD-47C8-B3D0-493CB49C41BD@fu-berlin.de> <4E316A99.2080104@mail.cryst.bbk.ac.uk> <4E32771E.9090806@mail.cryst.bbk.ac.uk> In-Reply-To: <4E32771E.9090806@mail.cryst.bbk.ac.uk> 
Accept-Language: en-US, de-DE Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: acceptlanguage: en-US, de-DE Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 X-Originating-IP: 160.45.9.133 X-purgate: clean X-purgate-type: clean X-purgate-ID: 151147::1311931401-00005A17-E8B4395A/0-0/0-0 X-Bogosity: Ham, tests=bogofilter, spamicity=0.007409, version=1.2.2 X-Spam-Flag: NO X-Spam-Checker-Version: SpamAssassin 3.0.4 on Gabun.ZEDAT.FU-Berlin.DE X-Spam-Level: X-Spam-Status: No, score=-2.8 required=5.0 tests=ALL_TRUSTED Subject: Re: [Seqan-dev] Time complexity of posGlobalize X-BeenThere: seqan-dev@lists.fu-berlin.de X-Mailman-Version: 2.1.11 Precedence: list Reply-To: SeqAn Development List-Id: SeqAn Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 29 Jul 2011 09:23:23 -0000

On 29.07.2011 at 11:02, John Reid wrote:
> I think I didn't mention yet but I just made a publication that
> references Seqan you might like to know about. Do you have a project
> page that lists third party work that uses Seqan?
>
> http://nar.oxfordjournals.org/content/early/2011/07/23/nar.gkr574.abstract

So far, we're collecting users in the wiki. I've added a reference to your publication there.

http://trac.mi.fu-berlin.de/seqan/wiki/SeqAnUsers

Maybe we could also add your organization/employer? Could you send me a name and URL?

Bests,
Manuel

From jer15@hermes.cam.ac.uk Fri Jul 29 11:29:16 2011 Received: from relay1.zedat.fu-berlin.de ([130.133.4.67]) by list1.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1QmjNa-0000PG-6U>; Fri, 29 Jul 2011 11:29:14 +0200 Received: from ppsw-51.csi.cam.ac.uk ([131.111.8.151]) by relay1.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1QmjNa-00075D-3R>; Fri, 29 Jul 2011 11:29:14 +0200 X-Cam-AntiVirus: no malware found X-Cam-SpamDetails: not scanned X-Cam-ScannerInfo: http://www.cam.ac.uk/cs/email/scanner/ Received: from cpc6-dals15-2-0-cust115.hari.cable.virginmedia.com ([82.35.196.116]:35192 helo=[192.168.1.4]) by ppsw-51.csi.cam.ac.uk (smtp.hermes.cam.ac.uk [131.111.8.158]:587) with esmtpsa (PLAIN:jer15) (TLSv1:DHE-RSA-CAMELLIA256-SHA:256) id 1QmjNZ-00011P-Yv (Exim 4.72) for seqan-dev@lists.fu-berlin.de (return-path ); Fri, 29 Jul 2011 10:29:13 +0100 Message-ID: <4E327D69.8070706@mail.cryst.bbk.ac.uk> Date: Fri, 29 Jul 2011 10:29:13 +0100 From: John Reid User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.18) Gecko/20110617 Lightning/1.0b2 Thunderbird/3.1.11 MIME-Version: 1.0 To: SeqAn Development References: <4E2D3C1F.5010801@mail.cryst.bbk.ac.uk> <4D8A95A9-E0C5-449E-9C96-3F169ABB71FE@campus.fu-berlin.de> <4E2FF724.6010502@mail.cryst.bbk.ac.uk> <4E30025C.5080309@mail.cryst.bbk.ac.uk> <71FF7950-9B04-4094-82F8-89809C7AB0EF@fu-berlin.de> <4E311069.1010204@mail.cryst.bbk.ac.uk> <38D74D8C-B8DD-47C8-B3D0-493CB49C41BD@fu-berlin.de> <4E316A99.2080104@mail.cryst.bbk.ac.uk> <4E32771E.9090806@mail.cryst.bbk.ac.uk> <2682A7D3-69B9-4B67-9859-F4845BD9B8CD@fu-berlin.de> In-Reply-To: <2682A7D3-69B9-4B67-9859-F4845BD9B8CD@fu-berlin.de> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: "J.E. 
Reid" X-Originating-IP: 131.111.8.151 X-purgate: clean X-purgate-type: clean X-purgate-ID: 151147::1311931754-00005A17-ACEF6B2D/0-0/0-0 X-Bogosity: Ham, tests=bogofilter, spamicity=0.001169, version=1.2.2 X-Spam-Flag: NO X-Spam-Checker-Version: SpamAssassin 3.0.4 on Botsuana.ZEDAT.FU-Berlin.DE X-Spam-Level: X-Spam-Status: No, score=0.8 required=5.0 tests=RATWARE_GECKO_BUILD Subject: Re: [Seqan-dev] Time complexity of posGlobalize X-BeenThere: seqan-dev@lists.fu-berlin.de X-Mailman-Version: 2.1.11 Precedence: list Reply-To: SeqAn Development List-Id: SeqAn Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 29 Jul 2011 09:29:16 -0000

On 29/07/11 10:23, Holtgrewe, Manuel wrote:
> On 29.07.2011 at 11:02, John Reid wrote:
>
>> I think I didn't mention yet but I just made a publication that
>> references Seqan you might like to know about. Do you have a project
>> page that lists third party work that uses Seqan?
>>
>> http://nar.oxfordjournals.org/content/early/2011/07/23/nar.gkr574.abstract
> So far, we're collecting users in the wiki. I've added a reference to your publication there.
>
> http://trac.mi.fu-berlin.de/seqan/wiki/SeqAnUsers
>
> Maybe we could also add your organization/employer? Could you send me a name and URL?

MRC Biostatistics Unit
http://www.mrc-bsu.cam.ac.uk/index.html

Regards,
John.

From manuel.holtgrewe@fu-berlin.de Fri Jul 29 13:30:34 2011 Received: from outpost1.zedat.fu-berlin.de ([130.133.4.66]) by list1.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1QmlGz-0004uK-Nz>; Fri, 29 Jul 2011 13:30:33 +0200 Received: from relay2.zedat.fu-berlin.de ([130.133.4.80]) by outpost1.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1QmlGz-00048M-La>; Fri, 29 Jul 2011 13:30:33 +0200 Received: from exchange6.fu-berlin.de ([160.45.9.133]) by relay2.zedat.fu-berlin.de (Exim 4.69) for seqan-dev@lists.fu-berlin.de with esmtp (envelope-from ) id <1QmlGz-0004HV-GL>; Fri, 29 Jul 2011 13:30:33 +0200 Received: from exchange6.fu-berlin.de ([160.45.9.133]) by exchange6.fu-berlin.de ([160.45.9.133]) with mapi; Fri, 29 Jul 2011 13:30:33 +0200 From: "Holtgrewe, Manuel" To: SeqAn Development Date: Fri, 29 Jul 2011 13:30:32 +0200 Thread-Topic: New I/O Code in SeqAn, includes BAM/SAM I/O Thread-Index: AcxN4u19gVM8x3w1S1CFiDa0lV4N0A== Message-ID: <68046539-573B-401A-99EF-69F2E2E22B0B@fu-berlin.de> Accept-Language: en-US, de-DE Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: acceptlanguage: en-US, de-DE Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 X-Originating-IP: 160.45.9.133 X-purgate: clean X-purgate-type: clean X-purgate-ID: 151147::1311939033-00005A17-D242F3C3/0-0/0-0 X-Bogosity: Ham, tests=bogofilter, spamicity=0.238628, version=1.2.2 X-Spam-Flag: NO X-Spam-Checker-Version: SpamAssassin 3.0.4 on Burundi.ZEDAT.FU-Berlin.DE X-Spam-Level: X-Spam-Status: No, score=-2.8 required=5.0 tests=ALL_TRUSTED Subject: [Seqan-dev] ANN: New I/O Code in SeqAn, includes BAM/SAM I/O X-BeenThere: seqan-dev@lists.fu-berlin.de X-Mailman-Version: 2.1.11 Precedence: list Reply-To: SeqAn Development List-Id: SeqAn Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: 
Fri, 29 Jul 2011 11:30:35 -0000

Dear all,

the SeqAn team is happy to announce the availability of our updated I/O code. The code is available in the current SVN version and will be available in the 1.4 release we are planning later this year.

New features and improvements include:

* A new stream layer that allows for reading and writing to C++/C streams as well as streams compressed with gzip or bz2.
* Improvements to our FASTA/FASTQ I/O code in terms of performance and robustness.
* A tokenizing/parsing interface that allows for easily writing your own parsers.
* An interface for reading SAM and BAM files record-wise.

The code is fairly stable but might undergo small changes before the release. We are sure that it is highly useful, nevertheless. Please try it out and report any problems to our mailing list or bug tracker.

More information is available in our Tutorial

http://trac.mi.fu-berlin.de/seqan/wiki/Tutorial

Specifically, the following new chapters are interesting for the new system:

http://trac.mi.fu-berlin.de/seqan/wiki/Tutorial/FileIO2
http://trac.mi.fu-berlin.de/seqan/wiki/Tutorial/ReadingSequenceFiles
http://trac.mi.fu-berlin.de/seqan/wiki/Tutorial/Parsing
http://trac.mi.fu-berlin.de/seqan/wiki/Tutorial/SamBamIO

Bests,
Manuel