I would recommend using a double-pass, memory-mapped RecordReader, as described here:
I'm not sure how much compression on disk will help you; that depends on where the overhead is.
You could also use the GZFile stream with a single-pass RecordReader. The question is then whether your disk (reading the compressed data) or your CPU (decompressing it) is the bottleneck.
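For reference, a rough sketch of the two reading strategies. The type tags and function names below are SeqAn 1.x-style names as I remember them (e.g. DoublePass<Mapped>, Stream<GZFile>), so please check them against the RecordReader documentation for your version before relying on them:

```cpp
#include <seqan/sequence.h>
#include <seqan/stream.h>
#include <seqan/seq_io.h>

using namespace seqan;

// Variant 1: double-pass RecordReader over a memory-mapped string.
// The first pass measures the records so memory can be allocated
// exactly once; the second pass copies the data.
String<char, MMap<> > mmapString;
open(mmapString, "genome.fa", OPEN_RDONLY);
RecordReader<String<char, MMap<> >, DoublePass<Mapped> > mmapReader(mmapString);

// Variant 2: single-pass RecordReader over a gzip-compressed stream.
// Decompression trades CPU time for disk bandwidth.
Stream<GZFile> gzStream;
open(gzStream, "genome.fa.gz", "r");
RecordReader<Stream<GZFile>, SinglePass<> > gzReader(gzStream);

// Either reader can then be handed to the FASTA parsing routines,
// e.g. readRecord(id, seq, reader, Fasta()) in a loop.
```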
From: John Reid [firstname.lastname@example.org]
Sent: Tuesday, June 26, 2012 4:20 PM
To: SeqAn Development
Subject: Re: [Seqan-dev] Performance advice for whole genome ESA
I've done some more reading (http://trac.seqan.de/wiki/HowTo/EfficientImportOfMillionsOfSequences), and as far as I can tell I should just be using memory-mapped files as a mechanism to read large sequence sets into main memory; this is also the area where compression on disk could help. If I want to iterate over an ESA, I'm best off copying the sequences into a standard SeqAn StringSet in main memory and creating the ESA on top of that. Please let me know if I've got the wrong end of the stick.
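Concretely, the workflow I have in mind looks something like this. This is just a sketch of my understanding, using the index-iterator names as I found them in the SeqAn docs (goDown/goRight/goUp, repLength, etc.); corrections welcome:

```cpp
#include <seqan/sequence.h>
#include <seqan/index.h>

using namespace seqan;

typedef StringSet<Dna5String> TSet;
typedef Index<TSet, IndexEsa<> > TIndex;

// Sequences have already been read into an in-memory StringSet,
// and the ESA is built on top of it: TIndex index(seqs);
void enumerateShortWords(TIndex & index)
{
    // Walk the suffix tree top-down, pruning at ~20 bp so only
    // words up to that length are visited.
    typedef Iterator<TIndex, TopDown<ParentLinks<> > >::Type TIter;
    TIter it(index);
    do
    {
        // representative(it) is the current word,
        // countOccurrences(it) its frequency in the genome.
        if (repLength(it) >= 20 || !goDown(it))
        {
            while (!goRight(it))
                if (!goUp(it))
                    return;  // traversal finished
        }
    } while (true);
}
```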
On 21/06/12 16:33, John Reid wrote:
Hi,

I'm reading the whole mouse genome into a seqan::IndexEsa based on a seqan::StringSet. At the moment I have the genome (2,730,871,774 bp) stored in one uncompressed FASTA file on disk. Once I have the genome loaded, I'm iterating over it many times, looking at all words of up to about 20 bp. I'm wondering if there is a better way to go about this. Should I be looking at memory-mapped files and/or compression on disk? Any pointers or advice would be welcome.

Thanks,
John.

_______________________________________________
seqan-dev mailing list
email@example.com
https://lists.fu-berlin.de/listinfo/seqan-dev