
Re: [Seqan-dev] Performance advice for whole genome ESA

  • From: "Holtgrewe, Manuel" <manuel.holtgrewe@fu-berlin.de>
  • To: SeqAn Development <seqan-dev@lists.fu-berlin.de>
  • Date: Tue, 26 Jun 2012 15:20:54 +0000
  • Reply-to: SeqAn Development <seqan-dev@lists.fu-berlin.de>
  • Subject: Re: [Seqan-dev] Performance advice for whole genome ESA

Hi John,

I would recommend using a Double-Pass MMap RecordReader, as described here:

http://trac.seqan.de/wiki/Tutorial/ReadingSequenceFiles#DocumentReadingAPI

I'm not sure how much compression on disk will help you; that depends on where the overhead is.

You could also use the GZFile Stream together with a Single-Pass RecordReader. The question is then whether your disk (reading the compressed data) or your CPU (decompressing the data) is the bottleneck.

http://trac.seqan.de/wiki/Tutorial/FileIO2#CompressedStreams
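One way to approach the disk-versus-CPU question is to measure how fast the uncompressed file can be read sequentially and compare that with the rate at which the same data can be decompressed. The following is a minimal sketch of the first measurement using only the C++ standard library; `ReadStats` and `timeSequentialRead` are hypothetical names, and the 1 MiB buffer size is an arbitrary assumption.

```cpp
#include <chrono>
#include <cstddef>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical result type: bytes read and wall-clock seconds taken.
struct ReadStats {
    std::size_t bytes;
    double seconds;
};

// Hypothetical helper: reads `path` front to back in fixed-size chunks and
// reports how long that took. The data itself is discarded.
ReadStats timeSequentialRead(const std::string& path,
                             std::size_t bufSize = 1 << 20) {
    std::ifstream in(path, std::ios::binary);
    if (!in) throw std::runtime_error("cannot open " + path);

    std::vector<char> buf(bufSize);
    std::size_t total = 0;
    auto t0 = std::chrono::steady_clock::now();
    while (in) {
        in.read(buf.data(), static_cast<std::streamsize>(buf.size()));
        total += static_cast<std::size_t>(in.gcount());  // counts the short final read too
    }
    auto t1 = std::chrono::steady_clock::now();
    return {total, std::chrono::duration<double>(t1 - t0).count()};
}
```

Dividing `bytes` by `seconds` gives a throughput figure. Note that a second run over the same file is likely to be served from the operating system's page cache, so only a cold-cache run says much about the disk.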

Cheers,
Manuel


From: John Reid [j.reid@mail.cryst.bbk.ac.uk]
Sent: Tuesday, June 26, 2012 4:20 PM
To: SeqAn Development
Subject: Re: [Seqan-dev] Performance advice for whole genome ESA

Hi,

I've done some more reading (http://trac.seqan.de/wiki/HowTo/EfficientImportOfMillionsOfSequences) and, as far as I can tell, I should just be using memory-mapped files as a mechanism to read large sequence sets into main memory. Likewise, this is the area where compression on disk could help. If I want to iterate over an ESA, I'm best off copying the sequences into a standard SeqAn StringSet in main memory and creating the ESA on top of that. Please let me know if I've got the wrong end of the stick.
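To make the access pattern concrete, here is a toy sketch of why a suffix array held in main memory suits repeated enumeration of short words: all occurrences of a word of length k sit next to each other in suffix order, so one linear scan counts every distinct word. This is plain standard C++ and not SeqAn's ESA; `buildSuffixArray` and `countWords` are hypothetical helpers, and the naive comparison sort is only suitable for small inputs, not for a whole genome.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <string>
#include <utility>
#include <vector>

// Toy suffix array: sorts suffix start positions by comparing the suffixes
// directly. Far too slow for genomes; real indices use dedicated algorithms.
std::vector<std::size_t> buildSuffixArray(const std::string& text) {
    std::vector<std::size_t> sa(text.size());
    std::iota(sa.begin(), sa.end(), std::size_t{0});
    std::sort(sa.begin(), sa.end(), [&](std::size_t a, std::size_t b) {
        return text.compare(a, std::string::npos, text, b, std::string::npos) < 0;
    });
    return sa;
}

// Returns every distinct word of length k with its occurrence count, in
// lexicographic order. Suffixes shorter than k are skipped.
std::vector<std::pair<std::string, std::size_t>>
countWords(const std::string& text, std::size_t k) {
    std::vector<std::pair<std::string, std::size_t>> out;
    if (k == 0) return out;
    for (std::size_t pos : buildSuffixArray(text)) {
        if (pos + k > text.size()) continue;
        // Equal words are adjacent in suffix order, so comparing with the
        // most recently emitted word is enough.
        if (!out.empty() && text.compare(pos, k, out.back().first) == 0)
            ++out.back().second;
        else
            out.emplace_back(text.substr(pos, k), 1);
    }
    return out;
}
```

For example, `countWords("ACGACG", 3)` yields ACG twice and CGA and GAC once each. Because the scan jumps around the text in suffix order, keeping the sequences resident in memory matters more here than how they were loaded from disk.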

Regards,
John.


On 21/06/12 16:33, John Reid wrote:
Hi,

I'm reading the whole mouse genome into a seqan::IndexEsa based on a
seqan::StringSet. At the moment I have the genome (2,730,871,774 bp)
stored in one uncompressed fasta file on disk. Once I have the genome
loaded I'm iterating over it many times, looking at all the words shorter
than about 20 bp. I'm wondering if there is a better way to go about this. Should I
be looking at memory mapped files and/or compression on disk? Any
pointers or advice would be welcome.

Thanks,
John.

_______________________________________________
seqan-dev mailing list
seqan-dev@lists.fu-berlin.de
https://lists.fu-berlin.de/listinfo/seqan-dev

