Re: [Seqan-dev] Random access of large FASTA file

Johannes Dröge <johdro@mpi-inf.mpg.de> · Thu, 21 Jul 2011 10:58:51 +0200

Hello David,
now that I understand the concepts, here are my plans and suggestions. Maybe you can comment on them.

Accessing a regular, unprocessed FASTA file as a StringSet via MMap is not possible, so one has to use split() (first pass) and assignSeq() (second pass). As you suggested, it might be better to process the file once and access it directly via a StringSet<String<Dna5Q, MMap<> >, Owner<ConcatDirect<> > >. This implies that one has to load and import the sequences, save them into a file, and open that file again whenever they are needed. Is there any documentation on this (or an existing implementation)? I can imagine this can be done simply by passing a filename through the constructor or something similar...
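
To make the layout concrete, here is a minimal sketch in plain C++ of what a concat-direct set boils down to: one contiguous buffer plus an array of limits. This is not SeqAn code; the name ConcatSet and its members are hypothetical and only illustrate the idea, assuming C++17.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical stand-in for a concat-direct string set (not the SeqAn type):
// all sequences live back to back in one buffer, and the half-open range
// [limits[i], limits[i + 1]) delimits sequence i.
struct ConcatSet {
    std::string concat;                   // all sequences, concatenated
    std::vector<std::size_t> limits{0};   // always holds size() + 1 entries

    // Append one sequence and record where it ends.
    void append(std::string_view seq) {
        concat.append(seq);
        limits.push_back(concat.size());
    }

    std::size_t size() const { return limits.size() - 1; }

    // Random access without copying: a view into the shared buffer.
    // The view is invalidated by a later append().
    std::string_view operator[](std::size_t i) const {
        assert(i < size());
        return std::string_view(concat).substr(limits[i],
                                               limits[i + 1] - limits[i]);
    }
};
```

Because the whole set is just two flat arrays, it could in principle be written to disk and mapped back without any per-sequence parsing, which is the property the processed-file approach relies on.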

I also noticed that an in-memory representation of a StringSet takes quite long to open and import. This is mainly due to the first-pass split(), which is logically not really necessary, since one could iterate through the file without determining all split points beforehand. To avoid this, I guess one can use the above strategy of saving a processed Owner<ConcatDirect> string here as well: process the file, save it, and later load it raw by copying it completely into memory and accessing it through a StringSet. The considerably slower assignSeq() to packed strings would probably also profit from this concept.

By the way, is there any experience with the (read) access speed of packed strings compared to normal ones, and with how much space they save on average?

I noticed that all of this boils down to generic StringSet dump() and load() functions...
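
A rough sketch of what such a pair of functions could look like for a concat/limits layout is given below. The names dumpSet and loadSet, the ConcatSet struct, and the file layout (entry count, then limits, then raw sequence bytes) are all assumptions made for illustration, not an existing SeqAn API. The sketch writes native byte order and does only minimal validation, so it is not a portable file format.

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical concat/limits pair; limits always starts with 0.
struct ConcatSet {
    std::string concat;
    std::vector<std::uint64_t> limits{0};
};

// Assumed file layout: [n : u64][limits[0..n) : u64 each][concat bytes].
bool dumpSet(const ConcatSet& s, const std::string& path) {
    std::ofstream out(path, std::ios::binary);
    if (!out) return false;
    const std::uint64_t n = s.limits.size();
    out.write(reinterpret_cast<const char*>(&n), sizeof n);
    out.write(reinterpret_cast<const char*>(s.limits.data()),
              static_cast<std::streamsize>(n * sizeof(std::uint64_t)));
    out.write(s.concat.data(), static_cast<std::streamsize>(s.concat.size()));
    out.flush();
    return static_cast<bool>(out);
}

// Reads the file back with three bulk reads; no per-sequence parsing.
bool loadSet(ConcatSet& s, const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) return false;
    std::uint64_t n = 0;
    if (!in.read(reinterpret_cast<char*>(&n), sizeof n) || n == 0) return false;
    std::vector<std::uint64_t> limits(static_cast<std::size_t>(n));
    if (!in.read(reinterpret_cast<char*>(limits.data()),
                 static_cast<std::streamsize>(n * sizeof(std::uint64_t))))
        return false;
    if (limits.front() != 0) return false;  // minimal sanity check only
    std::string concat(static_cast<std::size_t>(limits.back()), '\0');
    if (!in.read(&concat[0], static_cast<std::streamsize>(concat.size())))
        return false;
    s.concat = std::move(concat);
    s.limits = std::move(limits);
    return true;
}
```

Since loading is just bulk reads of flat arrays, the same file could also be memory-mapped instead of copied, which would correspond to the optional MMap variant.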

Is my thinking correct? If so, I would probably go for the second approach, since I need an object (in-memory, optionally MMap) that allows frequent random access to a large FASTA file and minimizes loading time.

Regards, Johannes

On Friday, 8 July 2011 19:35:20, Weese, David wrote:
> 
> On 08.07.2011 at 16:05, Johannes Dröge wrote:
> 
> > Sorry, I still don't get it.
> > How can [ MultiFastaFile ==> Dna5String ==> StringSet<String<Dna5Q, MMap<> >, Owner<ConcatDirect<> > > ] work, if it copies the value of the sequence?
> > Doesn't assignSeq() copy the value into the Dna5String seq?
> 
> assignSeq *extracts* the sequence information from a block that may contain a header, a sequence interspersed by newlines, quality values, etc.
> If you want to get sequence substrings of an unprocessed Fasta file, they may contain whitespace.
> 
> > 
> > What happens when I use appendValue to add seq to the StringSet? Where does it actually reside (it should still be in the MultiFasta file)?
> 
> As assignSeq(seq, ...) extracts the sequence character-by-character, there is no association between seq and the Fasta file.
> 
> > 
> > I need to access the MultiFastaFile (on the hard disk) as a regular StringSet to read its contents on demand, not copy its sequences into a new memory-mapped file.
> 
> Then you need to keep the split MultiSeqFile and extract the sequences on demand with assignSeq.
> If you access the sequences very often, I would recommend filling a StringSet<String<Dna5Q, MMap<> >, Owner<ConcatDirect<> > > (see my last mail), which also resides on your hard disk but can be accessed without assignSeq.
> 
> > 
> > Johannes
> > 
> > 
> > On Thursday, 7 July 2011 20:28:00, Weese, David wrote:
> >> Hi,
> >> 
> >> follow the howto on http://trac.mi.fu-berlin.de/seqan/wiki/HowTo/EfficientImportOfMillionsOfSequences and simply change:
> >> 
> >> StringSet<String<Dna5Q> > seqs;
> >> 
> >> into:
> >> 
> >> StringSet<String<Dna5Q, MMap<> >, Owner<ConcatDirect<> > > seqs;
> >> 
> >> That should do what you want.
> >> 
> >> Regards,
> >> David
> >> 
> >> 
> 
> 