Re: [Seqan-dev] {Disarmed} Re: Performance advice for whole genome ESA

From: John Reid <j.reid@mail.cryst.bbk.ac.uk>
To: SeqAn Development <seqan-dev@lists.fu-berlin.de>
Date: Thu, 05 Jul 2012 12:44:50 +0100
Reply-to: SeqAn Development <seqan-dev@lists.fu-berlin.de>
Subject: Re: [Seqan-dev] {Disarmed} Re: Performance advice for whole genome ESA

On 05/07/12 11:40, Siragusa, Enrico wrote:

On Jul 5, 2012, at 11:02 AM, John Reid wrote:

Great. That looks very helpful. So in your example, how do you arrive at 38Gb? You are using unsigned int instead of long unsigned int. Where does the unsigned char in the Fibre<>::Type come into the calculation? I think I need my code to handle sequence sets with more than 256 sequences. I'm guessing if I replace the unsigned char with unsigned long I get back to 48Gb?

I counted 13n bytes: 1+4 bytes for suffix array values, 4 bytes for lcp values and 4 bytes for childtab values. For n = 3Gbp that comes to roughly 38Gb.
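
The per-character sizes listed above (1+4, 4 and 4 bytes) can be added up mechanically. The following is a minimal sketch in plain C++, not SeqAn code: the helper names `bytesPerChar` and `esaBytes` are made up for illustration, and it assumes the suffix array pair is stored without padding and that the lcp and childtab tables hold one value per character.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: bytes the ESA needs per text character when the
// suffix array stores a packed (sequence id, offset) pair and the lcp and
// childtab tables each store one value per character.
template <typename TSeqId, typename TOffset, typename TLcp, typename TChild>
constexpr std::size_t bytesPerChar()
{
    return sizeof(TSeqId) + sizeof(TOffset) + sizeof(TLcp) + sizeof(TChild);
}

// Hypothetical helper: total index size in bytes for a text of n characters.
constexpr std::uint64_t esaBytes(std::uint64_t n, std::size_t perChar)
{
    return n * perChar;
}

// 1 + 4 + 4 + 4 = 13 bytes per character with a one-byte sequence id.
static_assert(bytesPerChar<std::uint8_t, std::uint32_t,
                           std::uint32_t, std::uint32_t>() == 13,
              "one-byte id layout should need 13 bytes per character");

// 13 bytes * 3e9 characters = 39e9 bytes.
static_assert(esaBytes(3000000000ULL, 13) == 39000000000ULL,
              "3 Gbp at 13 bytes per character");
```

Under the same assumptions, a two-byte sequence id (up to 65,536 sequences) gives 14 bytes per character, or about 42e9 bytes for 3Gbp.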

Value sizes really depend on your input sequences. How many strings do you want to index, and what is their maximum length?

Do you have enough memory available? Depending on your application, another index could be more efficient...

I want to index all the W-mers where the maximum for W in any given run of my application could be between 6 and 30. I need to iterate over them in a tree-based style, as my branch-and-bound algorithm ignores sets of W-mers based on their common prefix.

Thanks for the advice,
John.
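
The tree-style iteration over W-mers described above, where a whole set of words is skipped based on its common prefix, can be sketched without any index library. This is only an illustration of the access pattern, not SeqAn's ESA iterator: the names `Visitor`, `walkPrefixes` and `walkAllPrefixes` are hypothetical, the alphabet is restricted to ACGT, and occurrence lists are copied at every level, which a real suffix-array-based traversal avoids.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Called for every prefix that occurs in the text; return false to prune
// the whole subtree of longer words that start with this prefix.
using Visitor = std::function<bool(const std::string& prefix,
                                   const std::vector<std::size_t>& occurrences)>;

// Hypothetical helper: depth-first walk over all words of length <= maxW
// that occur in `text`, grouped by common prefix. `occ` holds the start
// positions of the current prefix.
void walkPrefixes(const std::string& text, std::size_t maxW,
                  std::string& prefix, const std::vector<std::size_t>& occ,
                  const Visitor& visit)
{
    if (prefix.size() == maxW)
        return;
    for (char c : std::string("ACGT"))
    {
        // Keep only the occurrences that can be extended by c.
        std::vector<std::size_t> next;
        for (std::size_t pos : occ)
            if (pos + prefix.size() < text.size() &&
                text[pos + prefix.size()] == c)
                next.push_back(pos);
        if (next.empty())
            continue;
        prefix.push_back(c);
        if (visit(prefix, next))  // false prunes every longer word
            walkPrefixes(text, maxW, prefix, next, visit);
        prefix.pop_back();
    }
}

// Convenience entry point: start from the empty prefix, which occurs at
// every position of the text.
void walkAllPrefixes(const std::string& text, std::size_t maxW,
                     const Visitor& visit)
{
    std::vector<std::size_t> all(text.size());
    for (std::size_t i = 0; i < all.size(); ++i)
        all[i] = i;
    std::string prefix;
    walkPrefixes(text, maxW, prefix, all, visit);
}
```

A branch-and-bound search would put its bound test in the visitor: returning false for a prefix discards every W-mer below it in one step.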

Thanks,
John.

On 05/07/12 09:27, Siragusa, Enrico wrote:
Hi John,
On Jul 5, 2012, at 9:46 AM, John Reid wrote:

Hi Manuel,

Thanks for the advice.

I'm having some memory problems when I build an ESA on a whole genome (3Gb or so). I don't even know if I can reasonably expect to do this. Does anyone out there have any experience with this? If so, what sort of hardware are you running on and did you have to take any special measures in software to handle such large sequence sets?

If you build the Esa as it is, it will consume 4 long ints per char on a 64-bit machine and take ~96Gb of memory for a 3Gb genome.

But you can redefine index fibres for your needs, i.e. you can replace long ints with ints or chars.

// TGenome is the type of sequence, e.g. StringSet<Dna5String>
typedef StringSet<Dna5String>       TGenome;
typedef Index<TGenome, IndexEsa<> > TGenomeEsa;

namespace seqan
{
    template <>
    struct Fibre<TGenomeEsa, FibreSA>
    {
        // Works for up to 256 contigs of length 4Gbp
        typedef String< Pair<unsigned char, unsigned int, Compressed>,
                        DefaultIndexStringSpec<TGenomeEsa>::Type > Type;

        // Use a mmapped string
        // typedef String< Pair<unsigned char, unsigned int, Compressed>, MMap<> > Type;
    };

    template <>
    struct Fibre<TGenomeEsa, FibreLcp>
    {
        typedef String<unsigned int, DefaultIndexStringSpec<TGenomeEsa>::Type > Type;
    };

    template <>
    struct Fibre<TGenomeEsa, FibreChildtab>
    {
        typedef String<unsigned int, DefaultIndexStringSpec<TGenomeEsa>::Type > Type;
    };
}

In this way your Esa will fit in ~38Gb of memory.
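
The ~38Gb figure relies on each suffix array entry taking 1+4 = 5 bytes, which only holds if the pair is stored without padding; that is presumably what the `Compressed` tag is for, but treat this as an assumption. The sketch below is plain C++, not SeqAn's implementation: the struct names are hypothetical, and `#pragma pack` is a compiler extension (supported by GCC, Clang and MSVC) rather than standard C++.

```cpp
#include <cstdint>

// Plain struct: the compiler typically pads it so that the 4-byte member
// is aligned, which usually makes it 8 bytes.
struct PaddedPair
{
    std::uint8_t  seqId;
    std::uint32_t offset;
};

// Packed struct: members are laid out back to back, 1 + 4 = 5 bytes.
#pragma pack(push, 1)
struct PackedPair
{
    std::uint8_t  seqId;
    std::uint32_t offset;
};
#pragma pack(pop)

static_assert(sizeof(PackedPair) == 5, "packed pair should be 5 bytes");
static_assert(sizeof(PaddedPair) >= sizeof(PackedPair),
              "padding never shrinks a struct");
```

If the pair were padded to 8 bytes, the same 3Gbp index would need 16 bytes per character instead of 13.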

You might want to try out mmapped strings depending on your memory requirements and the access pattern of your algorithm.

You can also try to redefine the Size and StringSetLimits metafunctions for your sequence types.

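namespace seqan
{
    template <>
    struct Size<Dna5String>
    {
        typedef unsigned int Type;
    };

    template <>
    struct StringSetLimits<TGenome>
    {
        // The limits string stores cumulative sequence lengths, so its
        // value type must be able to hold the total length of the set.
        typedef String<unsigned int> Type;
    };
}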

Please overload metafunctions only in your applications, not in library modules!

Ciao,

Enrico
Thanks,
John.

On 26/06/12 16:20, Holtgrewe, Manuel wrote:
Hi John,

I would recommend using a Double-Pass MMap RecordReader as described here:

http://trac.seqan.de/wiki/Tutorial/ReadingSequenceFiles#DocumentReadingAPI

I'm not sure how much compression on disk will help you; that depends on where the overhead is.

You could also use the GZFile Stream and use a Single-Pass RecordReader for this. The question is whether your disk (for reading compressed data) or your CPU (for decompressing the data) is then the bottleneck.

http://trac.seqan.de/wiki/Tutorial/FileIO2#CompressedStreams

Cheers,

Manuel
From: John Reid [j.reid@mail.cryst.bbk.ac.uk]
Sent: Tuesday, June 26, 2012 4:20 PM
To: SeqAn Development
Subject: Re: [Seqan-dev] Performance advice for whole genome ESA
Hi,

I've done some more reading (http://trac.seqan.de/wiki/HowTo/EfficientImportOfMillionsOfSequences) and as far as I can tell I should just be using memory mapped files as a mechanism to read large sequence sets into main memory. Likewise, this is the area where compression on disk could help. If I want to iterate over an ESA, I'm best off copying the sequences into a standard seqan StringSet in main memory and creating the ESA on top of that. Please let me know if I've got the wrong end of the stick.

Regards,
John.

On 21/06/12 16:33, John Reid wrote:
Hi,

I'm reading the whole mouse genome into a seqan::IndexEsa based on a
seqan::StringSet. At the moment I have the genome (2,730,871,774 bp)
stored in one uncompressed fasta file on disk. Once I have the genome
loaded I'm iterating over it many times, looking at all words shorter than
about 20bp. I'm wondering if there is a better way to go about this. Should I
be looking at memory mapped files and/or compression on disk? Any
pointers or advice would be welcome.

Thanks,
John.

_______________________________________________
seqan-dev mailing list
seqan-dev@lists.fu-berlin.de
https://lists.fu-berlin.de/listinfo/seqan-dev
