On Jul 5, 2012, at 9:46 AM, John Reid wrote:
Hi Manuel,
Thanks for the advice.
I'm having some memory problems when I build an ESA on a
whole genome (3Gb or so). I don't even know if I can
reasonably expect to do this. Does anyone out there have
any experience with this? If so, what sort of hardware are
you running on and did you have to take any special
measures in software to handle such large sequence sets?
If you build the ESA as it is, it will consume 4 long
ints (32 bytes) per char on a 64-bit machine and take ~96 GB of
memory for a 3 Gbp genome.
But you can redefine index fibres for your needs, i.e.
you can replace long ints with ints or chars.
// TGenome is the type of sequence, e.g. StringSet<Dna5String>
typedef StringSet<Dna5String> TGenome;
typedef Index<TGenome, IndexEsa<> > TGenomeEsa;

namespace seqan
{
    template <>
    struct Fibre<TGenomeEsa, FibreSA>
    {
        // Works for up to 256 contigs of length 4Gbp
        typedef String< Pair<unsigned char, unsigned int, Compressed>,
                        DefaultIndexStringSpec<TGenomeEsa>::Type > Type;

        // Use a mmapped string
        // typedef String< Pair<unsigned char, unsigned int, Compressed>, MMap<> > Type;
    };

    template <>
    struct Fibre<TGenomeEsa, FibreLcp>
    {
        typedef String<unsigned int,
                       DefaultIndexStringSpec<TGenomeEsa>::Type > Type;
    };

    template <>
    struct Fibre<TGenomeEsa, FibreChildtab>
    {
        typedef String<unsigned int,
                       DefaultIndexStringSpec<TGenomeEsa>::Type > Type;
    };
}
In this way your ESA will fit in ~38 GB of memory.
You might want to try out mmapped strings depending
on your memory requirements and the access pattern of your
algorithm.
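The memory figures above follow from simple per-character arithmetic. As a rough sketch in plain C++ (no SeqAn types are used; EsaLayout, bytesPerChar, esaBytes and fitsPackedSA are hypothetical names, and the 5-byte size of the packed SA pair is an assumption), the estimate could be reproduced like this:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>

// Bytes stored per text character by each ESA table.
// Illustrative struct only, not a SeqAn type.
struct EsaLayout {
    std::size_t saBytes;        // one suffix array entry
    std::size_t lcpBytes;       // one LCP table entry
    std::size_t childtabBytes;  // one child table entry
};

// Total bytes the three tables need for a single text character.
inline std::uint64_t bytesPerChar(const EsaLayout& layout) {
    return layout.saBytes + layout.lcpBytes + layout.childtabBytes;
}

// Estimated size of the three tables for a text of textLength characters.
inline std::uint64_t esaBytes(const EsaLayout& layout, std::uint64_t textLength) {
    const std::uint64_t perChar = bytesPerChar(layout);
    // Guard against overflow of the 64-bit product.
    assert(perChar == 0 ||
           textLength <= std::numeric_limits<std::uint64_t>::max() / perChar);
    return perChar * textLength;
}

// Default on a 64-bit machine: the SA entry is a pair of two 8-byte
// integers, LCP and childtab entries are 8 bytes each (4 long ints).
const EsaLayout defaultLayout = {16, 8, 8};

// Reduced fibres: packed pair of a 1-byte and a 4-byte integer
// (assumed to occupy 5 bytes), 4-byte LCP and childtab entries.
const EsaLayout reducedLayout = {5, 4, 4};

// The reduced SA entry can address at most 256 contigs (1-byte index)
// with at most 2^32 positions each (4-byte offset).
inline bool fitsPackedSA(std::uint64_t numContigs, std::uint64_t maxContigLength) {
    return numContigs <= 256u &&
           maxContigLength <= (static_cast<std::uint64_t>(1) << 32);
}
```

For a text of 3e9 characters, esaBytes gives 96e9 bytes for defaultLayout and 39e9 bytes for reducedLayout, which is in line with the two figures quoted above. The text itself is not included in either number.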
You can also try to redefine the Size and StringSetLimits
metafunctions for your sequence types.
namespace seqan
{
    template <>
    struct Size<Dna5String>
    {
        typedef unsigned int Type;
    };

    template <>
    struct StringSetLimits<TGenome>
    {
        typedef String<unsigned char> Type;
    };
}
Please overload metafunctions only in your applications,
not in library modules!
Ciao,
Enrico
Thanks,
John.
On 26/06/12 16:20, Holtgrewe, Manuel wrote:
Hi John,
I would recommend you to use a Double-Pass MMap
RecordReader as described here:
I'm not sure how much compression on disk will
help you, e.g. where the overhead is.
You could also use the GZFile Stream and use a
Single-Pass RecordReader for this. The question is
whether your disk (for reading compressed data) or
your CPU (for decompressing the data) is then the
bottleneck.
Cheers,
Manuel
From: John Reid [j.reid@mail.cryst.bbk.ac.uk]
Sent: Tuesday, June 26, 2012 4:20 PM
To: SeqAn Development
Subject: Re: [Seqan-dev] Performance advice for whole genome ESA
Hi,
I've done some more reading (
http://trac.seqan.de/wiki/HowTo/EfficientImportOfMillionsOfSequences)
and as far as I can tell I should just be using
memory mapped files as a mechanism to read large
sequence sets into main memory. Likewise this is
the area where compression on disk could help.
If I want to iterate over an ESA I'm best off
copying the sequences into a standard seqan
StringSet in main memory and creating the ESA on
top of that. Please let me know if I've got the
wrong end of the stick.
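For illustration only, a memory-mapped read could look roughly like the sketch below. It calls POSIX mmap directly rather than going through SeqAn's own MMap strings or RecordReader, and countFastaBases is a hypothetical helper. It assumes a POSIX system and simply walks the mapped file once, counting sequence characters and skipping header lines.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Count the sequence characters of a FASTA file through a read-only
// memory mapping. Header lines (starting with '>') and line breaks are
// skipped. Returns -1 if the file cannot be opened or mapped.
long long countFastaBases(const char* path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        std::perror("open");
        return -1;
    }

    struct stat st;
    if (fstat(fd, &st) != 0) {
        std::perror("fstat");
        close(fd);
        return -1;
    }
    if (st.st_size == 0) {  // mmap rejects zero-length mappings
        close(fd);
        return 0;
    }

    const std::size_t size = static_cast<std::size_t>(st.st_size);
    void* map = mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    if (map == MAP_FAILED) {
        std::perror("mmap");
        return -1;
    }

    const char* data = static_cast<const char*>(map);
    long long bases = 0;
    bool atLineStart = true;
    bool inHeader = false;
    for (std::size_t i = 0; i < size; ++i) {
        const char c = data[i];
        if (atLineStart)
            inHeader = (c == '>');  // a header line begins with '>'
        if (c == '\n') {
            atLineStart = true;
            continue;
        }
        atLineStart = false;
        if (!inHeader && c != '\r')
            ++bases;
    }

    // Invariant: we can never count more bases than bytes in the file.
    assert(bases <= static_cast<long long>(size));

    munmap(map, size);
    return bases;
}
```

The operating system pages the file in on demand, so no explicit read buffer is needed; whether this beats buffered reading depends on the disk and on the access pattern.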
Regards,
John.
On 21/06/12 16:33,
John Reid wrote:
Hi,
I'm reading the whole mouse genome into a seqan::IndexEsa based on a
seqan::StringSet. At the moment I have the genome (2,730,871,774 bp)
stored in one uncompressed fasta file on disk. Once I have the genome
loaded I'm iterating over it many times looking at all the words shorter than
about 20 bp. I'm wondering if there is a better way to go about this. Should I
be looking at memory mapped files and/or compression on disk? Any
pointers or advice would be welcome.
Thanks,
John.
_______________________________________________
seqan-dev mailing list
seqan-dev@lists.fu-berlin.de
https://lists.fu-berlin.de/listinfo/seqan-dev