Re: [Seqan-dev] Random access of large FASTA file
Hi Johannes,
I assume the value of num_records is less than or equal to length(db_sequences). Looking at your code, it seems that you use a memory-mapped string as a temporary variable inside a large loop. That is probably not the best idea, as it creates a temporary file and deletes it again in every iteration. It could be that the temporary file could not be opened; you can test that by adding a #define SEQAN_DEBUG before including any SeqAn header.
You should at least move all the instantiations out of the loop. Still, I don't think you need a memory-mapped string (seq) to store a single sequence of a multi-FASTA file. Also, I cannot see where you store the read sequences. It would make sense to use a single StringSet<String<..,MMap<> >, Owner<ConcatDirect<> > > data_ that stores multiple sequences in a single memory-mapped string.
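To illustrate the layout I mean, here is a rough sketch in plain C++ without SeqAn, so that it compiles on its own. ConcatStore, append, at and loadFasta are made-up names for illustration, not SeqAn API, and the std::string buffer only stands in for the single memory-mapped string. The idea is one contiguous buffer plus a vector of limits, with all temporaries declared once outside the loop:

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for a concat-direct string set: all sequences live in
// one contiguous buffer, and [limits[i], limits[i+1]) delimits sequence i.
struct ConcatStore {
    std::string concat;                         // all sequences back to back
    std::vector<std::size_t> limits;            // number of sequences + 1 entries
    std::map<std::string, std::size_t> id2pos;  // sequence id -> index

    ConcatStore() : limits(1, 0) {}

    std::size_t size() const { return limits.size() - 1; }

    // Append one sequence to the shared buffer and return its index.
    std::size_t append(const std::string& id, const std::string& seq) {
        concat += seq;
        limits.push_back(concat.size());
        const std::size_t pos = size() - 1;
        id2pos[id] = pos;
        return pos;
    }

    // Copy out sequence i (a view into concat would avoid the copy).
    std::string at(std::size_t i) const {
        assert(i < size());
        return concat.substr(limits[i], limits[i + 1] - limits[i]);
    }
};

// Read FASTA records from `in`. The buffers are declared once and reused,
// so nothing is constructed and destroyed per iteration.
void loadFasta(std::istream& in, ConcatStore& store) {
    std::string line, id, seq;
    bool haveRecord = false;
    while (std::getline(in, line)) {
        if (!line.empty() && line[0] == '>') {
            if (haveRecord) store.append(id, seq);  // flush previous record
            id = line.substr(1);
            seq.clear();
            haveRecord = true;
        } else if (haveRecord) {
            seq += line;  // sequence data may span several lines
        }
    }
    if (haveRecord) store.append(id, seq);  // flush the last record
}
```

Calling loadFasta on a std::istringstream (or a std::ifstream) fills the store, and store.at(store.id2pos[id]) then gives random access by id. With this layout only one large buffer has to be backed by a file, instead of one temporary file per sequence.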
HTH. If the problem still remains, please create a bug ticket with source code and example files.
Cheers,
David
--
David Weese weese@inf.fu-berlin.de
Freie Universität Berlin http://www.inf.fu-berlin.de/
Institut für Informatik Phone: +49 30 838 75246
Takustraße 9 Algorithmic Bioinformatics
14195 Berlin Room 021
On 06.07.2011 at 16:28, Johannes Dröge wrote:
> Hello,
> I am using SeqAn to access a large FASTA file. In this case, I am importing the whole RefSeq DB for random access (into memory or memory-mapped). This can be a huge file, so I decided to go for a dynamic strategy and wrote a generic SequenceStorage object. It works well for
>
> typedef seqan::String< seqan::Dna5 > StringType; //(default type)
> typedef seqan::String< seqan::Dna5, seqan::Packed<> > StringType;
>
> but not for
> typedef seqan::String< seqan::Dna5, seqan::MMap<> > StringType;
>
> Here is the code that imports the data using the MMap trick from the HowTo and puts it into a
>
> StringSet< StringType > data_;
>
> with an index data structure
>
> std::map< std::string, long unsigned int > id2pos_;
>
> --------------------------------------------------------------------------------
> seqan::MultiSeqFile db_sequences;
> seqan::open( db_sequences.concat, filename.c_str(), seqan::OPEN_RDONLY );
> seqan::split( db_sequences, seqan::Fasta() );
>
> for( unsigned int i = 0; i < num_records; ++i ) {
>     StringType seq;
>     seqan::assignSeq( seq, db_sequences[i], fasta_format_ );
>
>     std::string id;
>     seqan::assignSeqId( id, db_sequences[i], fasta_format_ );
>     id2pos_[ extractFastaCommentField( id, "gi" ) ] = seqan::assignValueById( data_, seq );
> }
> --------------------------------------------------------------------------------
>
> 1) seqan::assignValueById() will cause a segfault at sequence number 33,924 out of 276,313 when using a StringSet with mmap strings.
>
> 2) Also, I don't know how to define a StringSet using array strings.
>
> 3) Using a regular Dna5 string, the whole operation takes about 5 minutes. A packed string takes much longer to load. Is there any way to speed this up? I could think of a (binary) sink for a StringSet to avoid parsing and recoding every time I load the DB sequences. Is there anything like this (planned)?
>
> I appreciate your help!
>
> Regards, Johannes
>