Thanks for your nice explanation. So open() is the magic function to apply to a ConcatDirect StringSet. Sorry for bugging you further, but before I start with the implementation I have some final questions.

My primary target is the in-memory sequence storage, as I wrote, so for this I do not need to specialize the limits object. I suppose open() works with regular Alloc strings as well? Where are temporary MMap files usually stored? Can I somehow use read-only file access with persistent MMap files?

Regards,
Johannes

On Thursday, 21 July 2011 15:02:14, Weese, David wrote:
>
> On 21.07.2011 at 10:58, Johannes Dröge wrote:
>
> > Hello David,
>
> Hi,
>
> > since I got the concepts now, here are my plans and suggestions. Maybe you can comment on them.
> >
> > Accessing a regular, unprocessed FASTA file as a StringSet via MMap is not possible; one has to use split() (first pass) and assignSeq() (second pass). As you suggested, it might be better to process the file and access it directly via a StringSet<String<Dna5Q, MMap<> >, Owner<ConcatDirect<> > >. This implies that one has to load and import these sequences, then save them into a file and open it again when one needs them. Is there any documentation on this (or an example in some implementation)? I can imagine this can be done simply by passing a filename through the constructor or so...
>
> Yes, you could do it this way. This requires either:
> 1) a persistent StringSet which you construct once and reopen every time
> 2) a StringSet which uses a temporary file to store the concatenated sequences (e.g. StringSet<String<Dna5Q, MMap<> >, Owner<ConcatDirect<> > >) that you generate from the FASTA file every time you start your application and that is deleted automatically in the StringSet destructor
>
> 2) is the easiest way, but the conversion is certainly more time-consuming.
> 1) requires making both members of the ConcatDirect (not only the concatenated sequence) persistent strings.
> The second member is limits and stores the sequence breakpoints in concat. By default it is an Alloc String. In your application you can specialize:
>
> template <>
> struct StringSetLimits<TYourStringSet>
> {
>     typedef typename Size< TYourStringSet >::Type TSize_;
>     typedef String<TSize_, MMap<> > Type;
> };
>
> to use an MMap<> String instead. Before doing anything with your StringSet simply call:
>
> TYourStringSet stringSet;
> open(stringSet.concat, "yourfile.concat"); // assigns a file to the mmap string
> open(stringSet.limits, "yourfile.limits"); // if not called, a temporary file is created = non-persistent
>
> // append your sequences (when run for the first time)
> // or
> // use the sequences (later)
> //
> // save() is not required as the string is always in sync with the file on disk
>
> Cheers,
> David
>
> > Also, I noticed that when I use an in-memory representation of a StringSet, it takes quite long to open and import it. This is mainly due to the first-pass split(), which is logically not really necessary, as one could iterate through the file without previously determining all split points. To avoid this, one could use the above strategy of saving a processed Owner<ConcatDirect> string here as well, I guess: process the file, save it again, load it raw by copying it completely into memory, and access it through a StringSet. The considerably slower assignSeq() to packed strings would probably also profit from this concept.
> >
> > By the way, is there any experience with the (read) access speed of packed strings compared to normal ones, and how much space they save on average?
> >
> > I noticed that all of this boils down to generic StringSet dump() and load() functions...
> >
> > Is my thinking correct? If yes, I would probably go for the second option, since I need a frequent-random-access object (in-memory, optionally MMap) for a large FASTA file that minimizes loading time.
> >
> > Regards,
> > Johannes
>
> --
> David Weese                  weese@inf.fu-berlin.de
> Freie Universität Berlin     http://www.inf.fu-berlin.de/
> Institut für Informatik      Phone: +49 30 838 75246
> Takustraße 9                 Algorithmic Bioinformatics
> 14195 Berlin                 Room 021
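
The concat/limits layout discussed in this thread can be illustrated without SeqAn. What follows is only a minimal sketch in standard C++, not SeqAn code: the type ConcatSet and its members append(), size() and get() are hypothetical names chosen for illustration. It also assumes that the breakpoint array holds one more entry than there are sequences and starts with 0. It shows why making both members persistent is enough to reopen the set later: every sequence can be recovered from the concatenated buffer plus the breakpoints alone.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical illustration type; this is NOT SeqAn's StringSet.
struct ConcatSet {
    std::string concat;               // all sequences stored back to back
    std::vector<std::size_t> limits;  // limits[i] = start of sequence i,
                                      // last entry = concat.size() (assumed layout)

    ConcatSet() : limits(1, 0) {}

    // Append one sequence: extend the buffer and record its end position.
    void append(const std::string& seq) {
        concat += seq;
        limits.push_back(concat.size());
    }

    // Number of stored sequences.
    std::size_t size() const { return limits.size() - 1; }

    // Copy out sequence i using the two breakpoints that enclose it.
    std::string get(std::size_t i) const {
        assert(i < size());
        return concat.substr(limits[i], limits[i + 1] - limits[i]);
    }
};
```

With this layout, appending "ACGT" and then "GGN" leaves the buffer holding "ACGTGGN" and the breakpoints 0, 4, 7. Writing out both arrays and reading them back would restore the set without re-parsing the FASTA file, which is the point of the persistent variant 1) above.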