[Seqan-dev] Random access of large FASTA file

  • From: Johannes Dröge <johdro@mpi-inf.mpg.de>
  • To: seqan-dev@lists.fu-berlin.de
  • Date: Wed, 6 Jul 2011 16:28:12 +0200
  • Organization: Universität Düsseldorf/Max-Planck-Institut für Informatik Saarbrücken
  • Reply-to: SeqAn Development <seqan-dev@lists.fu-berlin.de>
  • Subject: [Seqan-dev] Random access of large FASTA file

Hello,
I am using SeqAn to access a large FASTA file. In this case, I am importing the whole RefSeq DB for random access (into memory or memory-mapped). Since this can be a very large file, I decided to go for a dynamic strategy and wrote a generic SequenceStorage object. It works well for

typedef seqan::String< seqan::Dna5 > StringType; //(default type)
typedef seqan::String< seqan::Dna5, seqan::Packed<> > StringType;

but not for
typedef seqan::String< seqan::Dna5, seqan::MMap<> > StringType;

Here is the code that imports the data using the MMap trick from the HowTo and puts it into a

StringSet< StringType > data_;

with an index data structure 

std::map< std::string, long unsigned int > id2pos_;

--------------------------------------------------------------------------------
seqan::MultiSeqFile db_sequences;
seqan::open( db_sequences.concat, filename.c_str(), seqan::OPEN_RDONLY );
seqan::split( db_sequences, seqan::Fasta() );

for( unsigned int i = 0; i < num_records; ++i ) {
	StringType seq;
	seqan::assignSeq( seq, db_sequences[i], fasta_format_ );
	
	std::string id;
	seqan::assignSeqId( id, db_sequences[i], fasta_format_ );
	id2pos_[ extractFastaCommentField( id, "gi" ) ] = seqan::assignValueById( data_, seq );
}
--------------------------------------------------------------------------------
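
The helper extractFastaCommentField() used in the loop above is not shown in this post. The following is only a minimal sketch of what such a helper might look like, assuming NCBI-style pipe-delimited headers such as "gi|12345|ref|NC_000001.1| description" and an id string without the leading '>'. The signature is inferred from the call site, and the body is a hypothetical stand-in, not the original implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for the helper called in the import loop.
// Returns the token that follows `field` in a '|'-delimited header,
// or an empty string if `field` does not occur as a token.
std::string extractFastaCommentField(const std::string& header,
                                     const std::string& field) {
    assert(!field.empty());  // an empty key would match empty tokens
    std::size_t start = 0;
    while (start <= header.size()) {
        // Delimit the current token [start, end).
        std::size_t end = header.find('|', start);
        if (end == std::string::npos) end = header.size();

        if (header.compare(start, end - start, field) == 0) {
            if (end >= header.size()) return "";  // key is the last token
            // The value is the next token after the key.
            const std::size_t vstart = end + 1;
            std::size_t vend = header.find('|', vstart);
            if (vend == std::string::npos) vend = header.size();
            return header.substr(vstart, vend - vstart);
        }
        start = end + 1;
    }
    return "";
}
```

Note that this sketch compares every token against the key, so a value that happens to equal the key name would be mistaken for it. For "gi" lookups in RefSeq headers this should rarely matter.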

1) seqan::assignValueById() will cause a segfault at sequence number 33,924 out of 276,313 when using a StringSet with mmap strings.

2) Also, I don't know how to define a StringSet using array strings.

3) Using a regular Dna5 string, the whole operation takes about 5 minutes. A packed string takes much longer to load. Is there any way to speed this up? I could imagine a (binary) sink for a StringSet that avoids parsing and recoding every time I load the DB sequences. Does anything like this exist, or is it planned?
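
To make the idea in 3) concrete, here is a minimal sketch of such a binary cache using only the C++ standard library. It is not a SeqAn facility: std::vector<std::string> stands in for the StringSet, and the function names saveSequences/loadSequences as well as the file layout are assumptions made for illustration. The sequences are parsed once, then dumped as a count, an offset table, and the concatenated residues, so later runs can skip FASTA parsing.

```cpp
#include <cassert>
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Assumed file layout (host byte order, so not portable across architectures):
//   uint64 n | uint64 offsets[n + 1] | char data[offsets[n]]

// Writes all sequences to `path`; returns false on any I/O error.
bool saveSequences(const std::string& path, const std::vector<std::string>& seqs) {
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    if (!out) return false;

    // offsets[i] is the start of sequence i in the concatenated data.
    std::vector<std::uint64_t> offsets(1, 0);
    for (const std::string& s : seqs) offsets.push_back(offsets.back() + s.size());
    assert(offsets.size() == seqs.size() + 1);

    const std::uint64_t n = seqs.size();
    out.write(reinterpret_cast<const char*>(&n), sizeof n);
    out.write(reinterpret_cast<const char*>(offsets.data()),
              static_cast<std::streamsize>(offsets.size() * sizeof(std::uint64_t)));
    for (const std::string& s : seqs)
        out.write(s.data(), static_cast<std::streamsize>(s.size()));
    return static_cast<bool>(out);
}

// Reads a file written by saveSequences; returns false on error or bad data.
bool loadSequences(const std::string& path, std::vector<std::string>& seqs) {
    std::ifstream in(path, std::ios::binary);
    if (!in) return false;

    std::uint64_t n = 0;
    if (!in.read(reinterpret_cast<char*>(&n), sizeof n)) return false;

    std::vector<std::uint64_t> offsets(n + 1);
    if (!in.read(reinterpret_cast<char*>(offsets.data()),
                 static_cast<std::streamsize>(offsets.size() * sizeof(std::uint64_t))))
        return false;

    seqs.assign(n, std::string());
    for (std::uint64_t i = 0; i < n; ++i) {
        if (offsets[i + 1] < offsets[i]) return false;  // corrupt offset table
        seqs[i].resize(offsets[i + 1] - offsets[i]);
        if (!seqs[i].empty() &&
            !in.read(&seqs[i][0], static_cast<std::streamsize>(seqs[i].size())))
            return false;
    }
    return true;
}
```

Because the offset table precedes the raw data, a file in this layout could in principle also be memory-mapped and indexed directly instead of being copied into strings. The id2pos_ map would have to be stored separately.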

I appreciate your help!

Regards, Johannes



  • Follow-Ups:
    • Re: [Seqan-dev] Random access of large FASTA file
      • From: Johannes Dröge <johdro@mpi-inf.mpg.de>
    • Re: [Seqan-dev] Random access of large FASTA file
      • From: "Weese, David" <weese@campus.fu-berlin.de>