Re: [Seqan-dev] {Disarmed} Re: Performance advice for whole genome ESA

  • From: John Reid <j.reid@mail.cryst.bbk.ac.uk>
  • To: seqan-dev@lists.fu-berlin.de
  • Date: Thu, 05 Jul 2012 15:30:43 +0100
  • Reply-to: SeqAn Development <seqan-dev@lists.fu-berlin.de>
  • Subject: Re: [Seqan-dev] {Disarmed} Re: Performance advice for whole genome ESA


On 05/07/12 15:09, Siragusa, Enrico wrote:

On Jul 5, 2012, at 1:44 PM, John Reid wrote:


On 05/07/12 11:40, Siragusa, Enrico wrote:

On Jul 5, 2012, at 11:02 AM, John Reid wrote:

Great. That looks very helpful. So in your example, how do you arrive at 38Gb? You are using unsigned int instead of long unsigned int. Where does the unsigned char in the Fibre<>::Type come into the calculation? I think I need my code to handle sequence sets with more than 256 sequences. I'm guessing that if I replace the unsigned char with unsigned long, I get back to 48Gb?

I counted 13n bytes: 1+4 bytes for suffix array values, 4 bytes for lcp values and 4 bytes for childtab values. Then for n = 3 Gbp you get roughly 38Gb.
Value sizes really depend on your input sequences. How many strings do you want to index, and what is their maximum length?
Do you have enough memory available? Depending on your application, another index could be more efficient...
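
The arithmetic behind that estimate can be sketched in plain C++. This is only an illustration, not SeqAn code: the per-entry sizes quoted above (1+4 bytes per suffix array value, 4 per lcp value, 4 per childtab value) add up to 13 bytes per text position. The struct and function names are hypothetical, and the sketch assumes the (sequence id, offset) pair is stored packed, without padding.

```cpp
#include <cstdint>

// Hypothetical description of the per-position storage of an enhanced suffix array
// over a string set. All sizes are in bytes.
struct EsaLayout {
    std::uint64_t seqIdBytes;     // sequence-id part of a suffix array value
    std::uint64_t offsetBytes;    // position-within-sequence part of a suffix array value
    std::uint64_t lcpBytes;       // one lcp table entry
    std::uint64_t childtabBytes;  // one child table entry
};

// Bytes needed per text position (assumes no padding between the parts).
constexpr std::uint64_t bytesPerPosition(EsaLayout l) {
    return l.seqIdBytes + l.offsetBytes + l.lcpBytes + l.childtabBytes;
}

// Total bytes for a text of n positions.
constexpr std::uint64_t esaBytes(EsaLayout l, std::uint64_t n) {
    return bytesPerPosition(l) * n;
}
```

With `EsaLayout{1, 4, 4, 4}` and n = 3,000,000,000 this gives 39,000,000,000 bytes, in line with the rough figure quoted above. Widening only the sequence id to 2 bytes (enough for more than 256 sequences) gives 14 bytes per position under the same no-padding assumption.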

I want to index all the W-mers where the maximum for W in any given run of my application could be between 6 and 30. I need to iterate over them in a tree-based style, as my branch-and-bound algorithm ignores sets of W-mers based on their common prefix.
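
For readers following the thread, the kind of traversal described here can be sketched with a plain sorted suffix array in standard C++. This is not the SeqAn iterator interface and not the actual branch-and-bound code; the function names, the naive suffix array construction, and the simple "skip this prefix" rule are all assumptions made for illustration. The idea shown is that every prefix corresponds to an interval of the suffix array, so discarding a prefix discards all W-mers below it in one step.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <numeric>
#include <string>
#include <utility>
#include <vector>

// Naive suffix array: sorts all suffix start positions (toy inputs only).
std::vector<std::size_t> buildSuffixArray(const std::string& text) {
    std::vector<std::size_t> sa(text.size());
    std::iota(sa.begin(), sa.end(), std::size_t{0});
    std::sort(sa.begin(), sa.end(), [&](std::size_t a, std::size_t b) {
        return text.compare(a, std::string::npos, text, b, std::string::npos) < 0;
    });
    return sa;
}

// Depth-first walk over all prefixes of length <= w that occur in text.
// [lo, hi) is the suffix array interval of the suffixes starting with `prefix`.
// If prune(prefix) returns true, the whole subtree below that prefix is skipped.
void visitWmers(const std::string& text, const std::vector<std::size_t>& sa,
                std::size_t lo, std::size_t hi, std::string& prefix, std::size_t w,
                const std::function<bool(const std::string&)>& prune,
                const std::function<void(const std::string&, std::size_t)>& report) {
    if (prefix.size() == w) {
        report(prefix, hi - lo);  // hi - lo = number of occurrences of this W-mer
        return;
    }
    const std::size_t depth = prefix.size();
    // Character at `depth` of the suffix starting at pos; -1 if the suffix is too short.
    auto charAt = [&](std::size_t pos) -> int {
        return pos + depth < text.size() ? text[pos + depth] : -1;
    };
    const auto begin = sa.begin() + static_cast<std::ptrdiff_t>(lo);
    const auto end = sa.begin() + static_cast<std::ptrdiff_t>(hi);
    const char alphabet[] = {'A', 'C', 'G', 'T'};
    for (char c : alphabet) {
        // Inside [lo, hi) the suffixes are sorted by their character at `depth`,
        // so the child interval for c can be found by binary search.
        auto first = std::lower_bound(begin, end, c,
            [&](std::size_t pos, char ch) { return charAt(pos) < ch; });
        auto last = std::upper_bound(first, end, c,
            [&](char ch, std::size_t pos) { return ch < charAt(pos); });
        if (first == last) continue;  // no suffix continues with c
        prefix.push_back(c);
        if (!prune(prefix)) {
            visitWmers(text, sa, static_cast<std::size_t>(first - sa.begin()),
                       static_cast<std::size_t>(last - sa.begin()), prefix, w, prune, report);
        }
        prefix.pop_back();
    }
}

// Convenience wrapper: collects (W-mer, count) pairs in lexicographic order,
// skipping every W-mer below skipPrefix (an empty skipPrefix skips nothing).
std::vector<std::pair<std::string, std::size_t>>
collectWmers(const std::string& text, std::size_t w, const std::string& skipPrefix) {
    const std::vector<std::size_t> sa = buildSuffixArray(text);
    std::vector<std::pair<std::string, std::size_t>> out;
    std::string prefix;
    visitWmers(text, sa, 0, sa.size(), prefix, w,
               [&](const std::string& p) { return !skipPrefix.empty() && p == skipPrefix; },
               [&](const std::string& p, std::size_t n) { out.emplace_back(p, n); });
    return out;
}
```

For example, `collectWmers("ACGTACGA", 2, "")` visits the 2-mers AC, CG, GA, GT, TA, while passing `"G"` as the skipped prefix drops GA and GT without looking at them individually. A real branch-and-bound would replace the fixed prefix test with a bound computed from the prefix.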

OK, the values in the code snippet should work for any genome, i.e. they work fine for hg18/hg19.

Concerning index construction: if I remember correctly, the Esa for StringSets is built in external memory by default.
Concerning index querying: if you don't have 40Gb of memory, then overload the fibres to be memory mapped (as in the commented line in the code snippet). In this way only a small part of the index will be kept in memory.

I do have 40Gb of memory.

Alternatively, if you only need the top of the tree along with some sparse subtrees, you could try using a lazy suffix tree (Wotd index in SeqAn) instead of the Esa.
The Wotd provides the same iterator interface as the Esa. Moreover, you can overload the Wotd FibreSA metafunction in exactly the same way.

I have to iterate over the tree many times, each time with sparse subtrees. Over many iterations I should visit all of the nodes to a given depth, but I will try both to see what works best.

Or, if you are very limited by memory, you might want to try the FM-Index (it is not yet in the core library).
The constructed FM-Index would fit into 3 Gb of memory.

This sounds interesting. I'll give it a go if none of the above works, but it sounds like I won't need to.

Thanks again,
John.