[Seqan-dev] Performance advice for whole genome ESA

From: John Reid <j.reid@mail.cryst.bbk.ac.uk>
To: SeqAn Development <seqan-dev@lists.fu-berlin.de>
Date: Thu, 21 Jun 2012 16:33:26 +0100
Reply-to: SeqAn Development <seqan-dev@lists.fu-berlin.de>

Hi,

I'm reading the whole mouse genome into a seqan::IndexEsa built over a
seqan::StringSet. At the moment the genome (2,730,871,774 bp) is stored
in a single uncompressed FASTA file on disk. Once the genome is loaded,
I iterate over it many times, looking at all words shorter than about
20 bp. I'm wondering whether there is a better way to go about this.
Should I be looking at memory-mapped files and/or compression on disk?
Any pointers or advice would be welcome.

Thanks,
John.
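
To make the access pattern in the message above concrete, here is a
minimal, self-contained sketch of enumerating every word up to a given
length from sorted suffixes, together with its occurrence count. It does
not use SeqAn: the names wordsUpTo and WordCount and the naive sort-based
suffix array are assumptions made purely for illustration, and this
construction would not scale to a whole genome.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

// Hypothetical result type: one distinct word and how often it occurs.
struct WordCount {
    std::string word;
    std::size_t count;
};

// Hypothetical helper (not a SeqAn function): list every distinct word of
// length 1..maxLen in `text`, grouped by length, then in lexicographic order.
std::vector<WordCount> wordsUpTo(const std::string& text, std::size_t maxLen)
{
    // Naive suffix array: start positions sorted by the suffix they begin.
    std::vector<std::size_t> sa(text.size());
    std::iota(sa.begin(), sa.end(), std::size_t{0});
    std::sort(sa.begin(), sa.end(), [&text](std::size_t a, std::size_t b) {
        return text.compare(a, std::string::npos, text, b, std::string::npos) < 0;
    });

    std::vector<WordCount> result;
    for (std::size_t len = 1; len <= maxLen; ++len) {
        for (std::size_t i = 0; i < sa.size();) {
            // Suffixes shorter than `len` cannot start a word of this length.
            if (text.size() - sa[i] < len) {
                ++i;
                continue;
            }
            // All suffixes sharing the same length-`len` prefix are adjacent
            // in the suffix array, so one forward scan finds the whole group.
            std::size_t j = i + 1;
            while (j < sa.size() && text.size() - sa[j] >= len &&
                   text.compare(sa[i], len, text, sa[j], len) == 0) {
                ++j;
            }
            result.push_back({text.substr(sa[i], len), j - i});
            i = j;
        }
    }
    return result;
}
```

Called as wordsUpTo("ACGACG", 2), this yields A, C, G, AC and CG with two
occurrences each and GA with one. An enhanced suffix array gives the same
grouping without re-comparing prefixes, which is why the size of the index
rather than the scan itself tends to be the limiting factor.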