Re: [Seqan-dev] Random access of large FASTA file

Johannes Dröge <johdro@mpi-inf.mpg.de> · Fri, 8 Jul 2011 16:05:30 +0200

Sorry, I still don't get it.
How can [ MultiFastaFile  ==> Dna5String ==>  StringSet<String<Dna5Q, MMap<> >, Owner<ConcatDirect<> > > ] work if it copies the sequence data?
Doesn't assignSeq() copy the value into the Dna5String seq?

When I use appendValue to add seq to the StringSet, where does the data actually reside? It should still be in the MultiFasta file.

I need to access the MultiFastaFile (on the hard disk) as a regular StringSet to read its contents on demand, not copy its sequences into a new memory-mapped file.

Johannes
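
For illustration, the kind of on-demand access asked for here can be sketched in plain C++, independent of SeqAn: scan the file buffer once, keep only offsets per record, and materialize a sequence when it is requested. This is a minimal sketch, not SeqAn's API. The names FastaRecord, buildFastaIndex, idAt and sequenceAt are hypothetical, the buffer is assumed to be a read-only view of the whole FASTA file (for example obtained via mmap), and each record is assumed to start with '>' at the beginning of a line.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// One FASTA record, described only by offsets into the file buffer
// (hypothetical helper type, not part of SeqAn).
struct FastaRecord {
    std::size_t idBegin, idEnd;    // header line without the leading '>'
    std::size_t seqBegin, seqEnd;  // raw sequence lines, newlines included
};

// Scan the buffer once and record where each entry starts and ends.
// No sequence data is copied here.
std::vector<FastaRecord> buildFastaIndex(std::string_view data) {
    std::vector<FastaRecord> index;
    std::size_t pos = 0;
    while (pos < data.size()) {
        if (data[pos] != '>') {  // skip lines before the first header
            std::size_t nl = data.find('\n', pos);
            pos = (nl == std::string_view::npos) ? data.size() : nl + 1;
            continue;
        }
        std::size_t lineEnd = data.find('\n', pos);
        if (lineEnd == std::string_view::npos) lineEnd = data.size();

        FastaRecord rec;
        rec.idBegin = pos + 1;
        rec.idEnd = lineEnd;
        rec.seqBegin = (lineEnd < data.size()) ? lineEnd + 1 : data.size();

        // The record ends at the next header line or at the end of the buffer.
        std::size_t next = data.find("\n>", lineEnd);
        rec.seqEnd = (next == std::string_view::npos) ? data.size() : next + 1;
        pos = rec.seqEnd;
        index.push_back(rec);
    }
    return index;
}

// View of the header text; points into the buffer, no copy.
std::string_view idAt(std::string_view data, const FastaRecord& rec) {
    return data.substr(rec.idBegin, rec.idEnd - rec.idBegin);
}

// Materialize one sequence on demand, dropping the line breaks.
std::string sequenceAt(std::string_view data, const FastaRecord& rec) {
    std::string seq;
    seq.reserve(rec.seqEnd - rec.seqBegin);
    for (std::size_t i = rec.seqBegin; i < rec.seqEnd; ++i)
        if (data[i] != '\n' && data[i] != '\r') seq.push_back(data[i]);
    return seq;
}
```

With this layout only the index lives in memory. A single record is copied when sequenceAt is called, and because FASTA sequences are wrapped over several lines, some copy or a line-skipping iterator is needed to present them as one contiguous string.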

On Thursday, 7 July 2011 20:28:00, Weese, David wrote:
> Hi,
> 
> follow the howto on http://trac.mi.fu-berlin.de/seqan/wiki/HowTo/EfficientImportOfMillionsOfSequences and simply change:
> 
> StringSet<String<Dna5Q> > seqs;
> 
> into:
> 
> StringSet<String<Dna5Q, MMap<> >, Owner<ConcatDirect<> > > seqs;
> 
> That should do what you want.
> 
> Regards,
> David
> 
> 
> On 07.07.2011 at 17:11, Johannes Dröge wrote:
> 
> > Hello David,
> > thanks for your comments. I am still not comfortable with the design concept of memory-mapped single strings in SeqAn. In any case, the idea of the loop is to create a StringSet whose type depends on the chosen StringType. It works this way:
> > 
> > 1) create a temporary sequence object
> > 2) assign its content from the memory-mapped multi-FASTA file (MultiSeqFile)
> > 3) store it in the StringSet, which takes ownership (I guess this is done via a copy constructor)
> > 
> > This works fine for standard and packed string types. I would also like to have a StringSet that contains strings that are actually memory mapped from the original multi-FASTA file. I thought that the assignSeq function would handle this appropriately when I use it with a default-constructed memory-mapped sequence object. It seems I misunderstood the design of this sequence type. Is there any way to construct the kind of StringSet I have in mind?
> > 
> > Regards, Johannes
> > 
> > 
> > On Thursday, 7 July 2011 16:14:34, Weese, David wrote:
> >> Hi Johannes,
> >> 
> >> I assume the value of num_records is less than or equal to length(db_sequences). Looking at your code, it seems that you try to use a memory-mapped string as a temporary variable in a large loop. That is probably not the best idea, as it creates a temporary file and deletes it in every iteration. It could be that the temporary file could not be opened; you can test that with a #define SEQAN_DEBUG before including any SeqAn header.
> >> You should at least move all the instantiations out of the loop. Still, I don't think you need a memory-mapped string (seq) to store a single sequence of a multi-FASTA file. Also, I cannot see where you store the sequences you read. It would make sense to use a single StringSet<String<..,MMap<> >, Owner<ConcatDirect<> > >  data_ that stores multiple sequences in a single memory-mapped string.
> >> 
> >> HTH. If the problem still remains, please create a bug ticket with source code and example files.
> >> 
> >> Cheers,
> >> David
> >> --
> >> David Weese				weese@inf.fu-berlin.de
> >> Freie Universität Berlin		http://www.inf.fu-berlin.de/
> >> Institut für Informatik			Phone: +49 30 838 75246
> >> Takustraße 9					Algorithmic Bioinformatics
> >> 14195 Berlin					Room 021 
> >> 
> >> On 06.07.2011 at 16:28, Johannes Dröge wrote:
> >> 
> >>> Hello,
> >>> I am using SeqAn to access a large FASTA file. In this case, I am importing the whole RefSeq DB for random access (into memory or memory-mapped). This can be quite a huge file, so I decided to go for a dynamic strategy and wrote a generic SequenceStorage object. It works well for
> >>> 
> >>> typedef seqan::String< seqan::Dna5 > StringType; //(default type)
> >>> typedef seqan::String< seqan::Dna5, seqan::Packed<> > StringType;
> >>> 
> >>> but not for
> >>> typedef seqan::String< seqan::Dna5, seqan::MMap<> > StringType;
> >>> 
> >>> Here is the code that imports the data using the MMap trick from the HowTo and puts it into a
> >>> 
> >>> StringSet< StringType > data_;
> >>> 
> >>> with an index data structure 
> >>> 
> >>> std::map< std::string, long unsigned int > id2pos_;
> >>> 
> >>> --------------------------------------------------------------------------------
> >>> seqan::MultiSeqFile db_sequences;
> >>> seqan::open( db_sequences.concat, filename.c_str(), seqan::OPEN_RDONLY );
> >>> seqan::split( db_sequences, seqan::Fasta() );
> >>> 
> >>> for( unsigned int i = 0; i < num_records; ++i ) {
> >>> 	StringType seq;
> >>> 	seqan::assignSeq( seq, db_sequences[i], fasta_format_ );
> >>> 	
> >>> 	std::string id;
> >>> 	seqan::assignSeqId( id, db_sequences[i], fasta_format_ );
> >>> 	id2pos_[ extractFastaCommentField( id, "gi" ) ] = seqan::assignValueById( data_, seq );
> >>> }
> >>> --------------------------------------------------------------------------------
> >>> 
> >>> 1) seqan::assignValueById() causes a segfault at sequence number 33,924 out of 276,313 when using a StringSet with mmap strings.
> >>> 
> >>> 2) Also, I don't know how to define a StringSet using array strings.
> >>> 
> >>> 3) Using a regular Dna5 string, the whole operation takes about 5 minutes. A packed string takes much longer to load. Is there any way to speed this up? I could think of a (binary) sink for a StringSet to avoid parsing and recoding every time I load the DB sequences. Is there anything like this (planned)?
> >>> 
> >>> I appreciate your help!
> >>> 
> >>> Regards, Johannes
> > 
> > _______________________________________________
> > seqan-dev mailing list
> > seqan-dev@lists.fu-berlin.de
> > https://lists.fu-berlin.de/listinfo/seqan-dev