On 05/07/12 11:40, Siragusa, Enrico wrote:
On Jul 5, 2012, at 11:02 AM, John Reid wrote:
Great. That looks very
helpful. So in your example, how do you arrive at 38Gb? You
are using unsigned int instead of long unsigned int. Where
does the unsigned char in the Fibre<>::Type come into
the calculation? I think I need my code to handle sequence
sets with more than 256 sequences. I'm guessing if I replace
the unsigned char with unsigned long I get back to 48Gb?
I counted 13n bytes: 1+4 bytes for suffix array values, 4
bytes for lcp values and 4 bytes for childtab values. Then for
n = 3Gbp you get roughly 38Gb.
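The count above can be checked with a few lines of arithmetic. The following is only a sketch: `EsaWidths`, `bytesPerChar` and `esaBytes` are hypothetical names that do not come from SeqAn, and the model assumes the index stores exactly one suffix array pair, one lcp value and one childtab value per text character, with no other overhead.

```cpp
#include <cstdint>

// Byte widths of the values stored per text character in each ESA table.
// (Hypothetical helper type for illustration, not part of SeqAn.)
struct EsaWidths {
    std::uint64_t saSeqId;   // suffix array: sequence number
    std::uint64_t saOffset;  // suffix array: position within the sequence
    std::uint64_t lcp;       // lcp table entry
    std::uint64_t childtab;  // child table entry
};

// Bytes consumed per indexed character: the sum of the four widths.
constexpr std::uint64_t bytesPerChar(const EsaWidths& w) {
    return w.saSeqId + w.saOffset + w.lcp + w.childtab;
}

// Total bytes for a text of n characters under this simple model.
constexpr std::uint64_t esaBytes(std::uint64_t n, const EsaWidths& w) {
    return n * bytesPerChar(w);
}
```

With widths {1, 4, 4, 4} this gives 13 bytes per character, so 39 x 10^9 bytes for n = 3 x 10^9. With four 8-byte values it gives 32 bytes per character, so 96 x 10^9 bytes for the same n.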
Value sizes really depend on your input sequences. How many
strings do you want to index and what is their maximum
length?
Do you have enough memory available? Depending on your
application another index could be more efficient...
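To make the dependence on the input concrete, here is a rough sketch of how one might pick the narrowest unsigned type for the two halves of a suffix array entry. The function names are hypothetical, and the sketch conservatively assumes that an offset must be able to hold the full length of the longest sequence.

```cpp
#include <cstdint>

// Smallest standard unsigned width (in bytes) able to represent maxValue.
constexpr unsigned bytesFor(std::uint64_t maxValue) {
    return maxValue <= 0xFFull       ? 1u
         : maxValue <= 0xFFFFull     ? 2u
         : maxValue <= 0xFFFFFFFFull ? 4u
                                     : 8u;
}

// Width needed for sequence ids 0 .. numSequences - 1.
constexpr unsigned seqIdBytes(std::uint64_t numSequences) {
    return bytesFor(numSequences == 0 ? 0 : numSequences - 1);
}

// Width needed for offsets 0 .. maxLength within the longest sequence.
constexpr unsigned offsetBytes(std::uint64_t maxLength) {
    return bytesFor(maxLength);
}
```

Under these assumptions 256 sequences still fit in an unsigned char, while 257 sequences need at least an unsigned short, which adds one byte per indexed character.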
I want to index all the W-mers where the maximum for W in any given
run of my application could be between 6 and 30. I need to iterate
over them in a tree-based style, as my branch-and-bound algorithm
ignores sets of W-mers based on their common prefix.
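The traversal described above can be sketched without SeqAn. This is not the ESA iterator API: it swaps in a naively sorted list of W-mer start positions and walks the implicit prefix tree over it, skipping a whole subtree as soon as a caller-supplied predicate rejects its prefix. All names (`sortedWmerStarts`, `visitWmers`, `keep`) are hypothetical, and the sort is far too slow for a whole genome.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Start positions of all W-mers in text, sorted lexicographically by W-mer.
std::vector<std::size_t> sortedWmerStarts(const std::string& text, std::size_t w) {
    std::vector<std::size_t> starts;
    if (w == 0 || text.size() < w) return starts;
    for (std::size_t i = 0; i + w <= text.size(); ++i) starts.push_back(i);
    std::sort(starts.begin(), starts.end(),
              [&text, w](std::size_t a, std::size_t b) {
                  return text.compare(a, w, text, b, w) < 0;
              });
    return starts;
}

// Depth-first walk over the prefix tree of the sorted W-mers.
// [lo, hi) is the range of entries sharing their first `depth` characters.
// If keep(prefix) returns false, the whole subtree below prefix is skipped.
// Each distinct surviving W-mer is appended to out exactly once.
void visitWmers(const std::string& text, const std::vector<std::size_t>& starts,
                std::size_t w, std::size_t lo, std::size_t hi, std::size_t depth,
                const std::function<bool(const std::string&)>& keep,
                std::vector<std::string>& out) {
    if (lo >= hi) return;
    const std::string prefix = text.substr(starts[lo], depth);
    if (!keep(prefix)) return;  // prune: ignore every W-mer with this prefix
    if (depth == w) {
        out.push_back(prefix);
        return;
    }
    std::size_t i = lo;
    while (i < hi) {
        // Entries are sorted, so equal characters at `depth` are contiguous.
        const char c = text[starts[i] + depth];
        std::size_t j = i;
        while (j < hi && text[starts[j] + depth] == c) ++j;
        visitWmers(text, starts, w, i, j, depth + 1, keep, out);
        i = j;
    }
}
```

A call starts at the root with lo = 0, hi = starts.size() and depth = 0, so the predicate must accept the empty prefix. In a branch-and-bound setting the predicate would hold the bound test.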
Thanks for the advice,
John.
Thanks,
John.
On 05/07/12 09:27, Siragusa, Enrico wrote:
Hi John,
On Jul 5, 2012, at 9:46 AM, John Reid wrote:
Hi Manuel,
Thanks for the advice.
I'm having some memory problems when I build an ESA
on a whole genome (3Gb or so). I don't even know
if I can reasonably expect to do this. Does anyone
out there have any experience with this? If so,
what sort of hardware are you running on and did
you have to take any special measures in software
to handle such large sequence sets?
If you build the Esa as it is, it will consume 4
long ints per char on a 64bit machine and take ~96Gb
of memory for a 3Gb genome.
But you can redefine index fibres for your needs,
i.e. you can replace long ints with ints or chars.
// TGenome is the type of sequence, e.g. StringSet<Dna5String>
typedef StringSet<Dna5String> TGenome;
typedef Index<TGenome, IndexEsa<> > TGenomeEsa;

namespace seqan
{
    template <>
    struct Fibre<TGenomeEsa, FibreSA>
    {
        // Works for up to 256 contigs of length 4Gbp
        typedef String< Pair<unsigned char, unsigned int, Compressed>,
                        DefaultIndexStringSpec<TGenomeEsa>::Type > Type;

        // Use a mmapped string
        // typedef String< Pair<unsigned char, unsigned int, Compressed>, MMap<> > Type;
    };

    template <>
    struct Fibre<TGenomeEsa, FibreLcp>
    {
        typedef String<unsigned int,
                       DefaultIndexStringSpec<TGenomeEsa>::Type > Type;
    };

    template <>
    struct Fibre<TGenomeEsa, FibreChildtab>
    {
        typedef String<unsigned int,
                       DefaultIndexStringSpec<TGenomeEsa>::Type > Type;
    };
}
In this way your Esa will fit in ~38Gb of
memory.
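The 1+4 bytes per suffix array entry only hold if the pair is stored without padding, which appears to be the point of the Compressed tag. The sketch below shows the general effect with plain structs; `PaddedPair` and `PackedPair` are hypothetical stand-ins, not SeqAn types, and `#pragma pack` is a compiler extension (supported by GCC, Clang and MSVC) rather than standard C++.

```cpp
#include <cstddef>

// Ordinary struct: the compiler may insert padding after seqId so that
// offset is aligned, which typically makes the struct larger than its members.
struct PaddedPair {
    unsigned char seqId;
    unsigned int  offset;
};

// Packed struct: alignment padding is suppressed, so the size is exactly
// the sum of the member sizes.
#pragma pack(push, 1)
struct PackedPair {
    unsigned char seqId;
    unsigned int  offset;
};
#pragma pack(pop)

constexpr std::size_t kPaddedSize = sizeof(PaddedPair);
constexpr std::size_t kPackedSize = sizeof(PackedPair);
```

On common 64-bit platforms the padded variant occupies 8 bytes against 5 for the packed one, which over 3 x 10^9 entries is a difference of about 9 x 10^9 bytes. Unaligned access can be slower on some architectures, so this is a trade of speed for space.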
You might want to try out mmapped strings
depending on your memory requirements and the
access pattern of your algorithm.
You can also try to redefine size and limits
metafunctions for your sequence types.
namespace seqan
{
    template <>
    struct Size<Dna5String>
    {
        typedef unsigned int Type;
    };

    template <>
    struct StringSetLimits<TGenome>
    {
        typedef String<unsigned char> Type;
    };
}
Please overload metafunctions only in your
applications, not in library modules!
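For readers unfamiliar with the mechanism, a metafunction here is a class template with a nested Type, and overloading it means adding a full specialization. The sketch below reproduces the pattern in a hypothetical `demo` namespace with made-up sequence types; it is not SeqAn code.

```cpp
#include <cstddef>
#include <type_traits>

namespace demo {  // hypothetical stand-in, not the seqan namespace

struct SmallSeq {};  // a sequence type known to stay below 4G characters
struct OtherSeq {};  // any other sequence type

// Primary template: the default size type for every sequence type.
template <typename T>
struct Size {
    typedef std::size_t Type;
};

// Full specialization: only SmallSeq gets the narrower size type.
template <>
struct Size<SmallSeq> {
    typedef unsigned int Type;
};

}  // namespace demo

static_assert(std::is_same<demo::Size<demo::SmallSeq>::Type, unsigned int>::value,
              "the specialization is picked for SmallSeq");
static_assert(std::is_same<demo::Size<demo::OtherSeq>::Type, std::size_t>::value,
              "other types keep the default");
```

A specialization like this has to be declared before the first use that instantiates the template, and every translation unit must see the same one. That is one reason to keep such overloads in the application, where you control what gets included and in which order.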
Ciao,
Enrico
Thanks,
John.
On 26/06/12 16:20, Holtgrewe, Manuel wrote:
Hi John,
I would recommend using a
Double-Pass MMap RecordReader as described
here:
I'm not sure how much compression on disk
will help you, e.g. where the overhead is.
You could also use the GZFile Stream and
use a Single-Pass RecordReader for this. The
question is whether your disk (for reading
compressed data) or your CPU (for
decompressing the data) is then the
bottleneck.
Cheers,
Manuel
From: John Reid [j.reid@mail.cryst.bbk.ac.uk]
Sent: Tuesday, June 26, 2012 4:20 PM
To: SeqAn Development
Subject: Re: [Seqan-dev] Performance advice for whole genome ESA
Hi,
I've done some more reading
(http://trac.seqan.de/wiki/HowTo/EfficientImportOfMillionsOfSequences)
and as far as I can tell I should just
be using memory mapped files as a
mechanism to read large sequence sets
into main memory. Likewise this is the
area where compression on disk could
help. If I want to iterate over an ESA
I'm best off copying the sequences into
a standard seqan StringSet in main
memory and creating the ESA on top of
that. Please let me know if I've got the
wrong end of the stick.
Regards,
John.
On 21/06/12 16:33, John Reid wrote:
Hi,
I'm reading the whole mouse genome into a seqan::IndexEsa based on a
seqan::StringSet. At the moment I have the genome (2,730,871,774 bp)
stored in one uncompressed fasta file on disk. Once I have the genome
loaded I'm iterating over it many times looking at all the words shorter
than about 20bp. I'm wondering if there is a better way to go about this. Should I
be looking at memory mapped files and/or compression on disk? Any
pointers or advice would be welcome.
Thanks,
John.
_______________________________________________
seqan-dev mailing list
seqan-dev@lists.fu-berlin.de
https://lists.fu-berlin.de/listinfo/seqan-dev