Re: [Seqan-dev] Random access of large FASTA file



On 08.07.2011 at 16:05, Johannes Dröge wrote:

> Sorry, I still don't get it.
> How can [ MultiFastaFile ==> Dna5String ==> StringSet<String<Dna5Q, MMap<> >, Owner<ConcatDirect<> > > ] work, if it copies the value of the sequence?
> Doesn't assignSeq() copy the value into the Dna5String seq?

assignSeq *extracts* the sequence information from a block that may contain a header, a sequence interspersed by newlines, quality values, etc.
If you take sequence substrings directly from an unprocessed Fasta file, they may contain whitespace.
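
To illustrate why extraction implies a copy, here is a minimal sketch in plain C++. It is not SeqAn code: extractSequence is a hypothetical helper that only mimics what assignSeq does conceptually for a Fasta record, namely skipping the header line and dropping newlines while copying the residues into a separate string.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper (not part of SeqAn): copies the residues out of one raw
// Fasta record, dropping the '>' header line and all whitespace.
std::string extractSequence(const std::string& rawRecord)
{
    std::string seq;
    std::size_t pos = 0;

    // Skip the header line if the record starts with '>'.
    if (!rawRecord.empty() && rawRecord[0] == '>')
    {
        pos = rawRecord.find('\n');
        if (pos == std::string::npos)
            return seq;  // header only, no sequence lines
        ++pos;
    }

    // Copy every non-whitespace character; this is where the copy happens.
    for (; pos < rawRecord.size(); ++pos)
    {
        unsigned char c = static_cast<unsigned char>(rawRecord[pos]);
        if (!std::isspace(c))
            seq.push_back(static_cast<char>(c));
    }
    return seq;
}
```

For a record such as ">id desc\nACGT\nNNAC\n" the helper yields the contiguous residues, whereas a plain substring of the same raw block would still contain the line breaks.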

> 
> What happens when I use appendValue to add seq to the StringSet? Where does it actually reside (it should still be in the MultiFasta file)?

Since assignSeq(seq, ...) extracts the sequence character by character, seq is an independent copy and there is no association between seq and the Fasta file.

> 
> I need to access the MultiFastaFile (on the hard disk) as a regular StringSet to read its contents on demand, not copy its sequences into a new memory-mapped file.

Then you need to keep the split MultiSeqFile and extract the sequences on demand with assignSeq.
If you access the sequences very often, I would recommend filling a StringSet<String<Dna5Q, MMap<> >, Owner<ConcatDirect<> > > (see my last mail), which also resides on your hard disk but can be accessed without assignSeq.
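
As a rough picture of what a concatenated set buys you, here is a sketch in plain C++. It is an assumption-laden illustration, not SeqAn's implementation: ConcatSet, append, seqBegin, length and get are hypothetical names, and a std::string stands in for the memory-mapped buffer. All sequences sit back to back in one buffer, and a list of limits marks where each one starts and ends.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical illustration, not SeqAn's implementation: all sequences live in
// one contiguous buffer, and limits[i]..limits[i+1] delimits sequence i.
struct ConcatSet
{
    std::string concat;                  // stand-in for the memory-mapped buffer
    std::vector<std::size_t> limits{0};  // always holds (number of sequences + 1) entries

    void append(const std::string& seq)
    {
        concat += seq;                   // the residues are copied once, here
        limits.push_back(concat.size());
    }

    std::size_t size() const { return limits.size() - 1; }

    // Start of sequence i inside the buffer; no parsing is needed.
    const char* seqBegin(std::size_t i) const
    {
        assert(i < size());
        return concat.data() + limits[i];
    }

    std::size_t length(std::size_t i) const
    {
        assert(i < size());
        return limits[i + 1] - limits[i];
    }

    // Convenience copy of sequence i.
    std::string get(std::size_t i) const
    {
        return std::string(seqBegin(i), length(i));
    }
};
```

The cost is one copy while filling the set; after that, sequence i is just an offset and a length into the buffer, so repeated access needs no further extraction step.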

> 
> Johannes
> 
> 
> On Thursday, 7 July 2011 20:28:00, Weese, David wrote:
>> Hi,
>> 
>> follow the howto on http://trac.mi.fu-berlin.de/seqan/wiki/HowTo/EfficientImportOfMillionsOfSequences and simply change:
>> 
>> StringSet<String<Dna5Q> > seqs;
>> 
>> into:
>> 
>> StringSet<String<Dna5Q, MMap<> >, Owner<ConcatDirect<> > > seqs;
>> 
>> That should do what you want.
>> 
>> Regards,
>> David
>> 
>>