Re: [Seqan-dev] Saving an index for StringSet to disk: lots of files

  • From: Rahn, René <Rene.Rahn@fu-berlin.de>
  • To: SeqAn Development <seqan-dev@lists.fu-berlin.de>
  • Date: Wed, 21 Mar 2018 10:26:13 +0100
  • Subject: Re: [Seqan-dev] Saving an index for StringSet to disk: lots of files

Hi Lucas, 

Thanks for writing to us.
In the future, please move such conversations to our issue tracker: https://github.com/seqan/seqan/issues
You will reach a much wider range of people there.

I am not familiar with the serialisation of SA indices, but I think your task can be done more efficiently with a q-gram index.
You can build it over the set of reads and then query, for every k-mer, its occurrences in the reads.
An OpenAddressing q-gram index keeps the memory footprint to roughly 30% more than the space needed for the q-grams that actually occur in the reads.
Building it is fairly fast, so you could run the filter with different q-gram sizes (each size requires recreating the index).
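
To illustrate the idea independently of SeqAn, here is a minimal sketch in plain C++. It is not the SeqAn API: the names `QGramTable`, `buildQGramTable` and `selectReads` are made up for this example, and it uses `std::unordered_map` in place of SeqAn's open addressing scheme. It assumes all query k-mers have the same length q that the table was built with.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper type, not part of SeqAn: maps each q-gram that
// actually occurs in the reads to the ids of the reads containing it.
using QGramTable = std::unordered_map<std::string, std::vector<std::size_t>>;

// Build the table by sliding a window of length q over every read.
QGramTable buildQGramTable(const std::vector<std::string>& reads, std::size_t q)
{
    assert(q > 0);
    QGramTable table;
    for (std::size_t id = 0; id < reads.size(); ++id)
    {
        const std::string& read = reads[id];
        if (read.size() < q)
            continue;  // read too short to contain any q-gram
        for (std::size_t pos = 0; pos + q <= read.size(); ++pos)
        {
            std::vector<std::size_t>& ids = table[read.substr(pos, q)];
            // Reads are visited in order, so checking the last entry is
            // enough to avoid storing the same read id twice.
            if (ids.empty() || ids.back() != id)
                ids.push_back(id);
        }
    }
    return table;
}

// Return one flag per read: true if the read contains at least one of the
// given k-mers. Every k-mer must have the length q used to build the table.
std::vector<bool> selectReads(const QGramTable& table, std::size_t numReads,
                              const std::vector<std::string>& kmers)
{
    std::vector<bool> keep(numReads, false);
    for (const std::string& kmer : kmers)
    {
        auto it = table.find(kmer);
        if (it == table.end())
            continue;  // k-mer occurs in no read
        for (std::size_t id : it->second)
            keep[id] = true;
    }
    return keep;
}
```

The table only stores q-grams that actually occur, which is the property that makes the q-gram approach attractive here. Filtering with a different k-mer length means calling `buildQGramTable` again with the new q.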

Please have a look at: http://seqan.readthedocs.io/en/master/Tutorial/DataStructures/Indices/QgramIndex.html
and http://seqan.readthedocs.io/en/master/Tutorial/Algorithms/PatternMatching/IndexedPatternMatching.html

IHTH,

René


On 16. Mar 2018, at 21:40, Lucas van Dijk <info@lucasvandijk.nl> wrote:

Hi all,

I'm trying to build a simple (short) read filter: given a list of k-mers, keep only the reads that contain at least one of the given k-mers. This will be used to analyse the behaviour of a tool we're working on. It doesn't need to be super memory efficient; it'll mostly be a debugging tool.

My strategy was to use SeqAn to build a suffix array index for the StringSet of reads, and then quickly enumerate which reads contain a given k-mer. I got a simple prototype working that reads the whole FASTQ file into memory as a StringSet and then builds an Index<StringSet<Dna5String>, IndexSa<>>.
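
As a rough illustration of this strategy, the following plain C++ sketch sorts all suffixes of all reads and finds the reads containing a k-mer by binary search. It is not SeqAn's `IndexSa<>` implementation: `Suffix`, `buildSuffixArray` and `readsContaining` are hypothetical names, and the naive sort-based construction is only meant for small inputs.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <utility>
#include <vector>

// One suffix of one read: (read id, start offset). Hypothetical type.
using Suffix = std::pair<std::size_t, std::size_t>;

// Collect every suffix of every read and sort them lexicographically.
std::vector<Suffix> buildSuffixArray(const std::vector<std::string>& reads)
{
    std::vector<Suffix> sa;
    for (std::size_t id = 0; id < reads.size(); ++id)
        for (std::size_t pos = 0; pos < reads[id].size(); ++pos)
            sa.emplace_back(id, pos);

    std::sort(sa.begin(), sa.end(), [&](const Suffix& a, const Suffix& b) {
        return reads[a.first].compare(a.second, std::string::npos,
                                      reads[b.first], b.second,
                                      std::string::npos) < 0;
    });
    return sa;
}

// Return the ids of all reads that contain the given k-mer.
std::set<std::size_t> readsContaining(const std::vector<std::string>& reads,
                                      const std::vector<Suffix>& sa,
                                      const std::string& kmer)
{
    assert(!kmer.empty());
    // Binary search for the first suffix whose first |kmer| characters
    // are not lexicographically smaller than the k-mer.
    auto first = std::lower_bound(
        sa.begin(), sa.end(), kmer,
        [&](const Suffix& s, const std::string& pattern) {
            return reads[s.first].compare(s.second, pattern.size(), pattern) < 0;
        });

    // All suffixes that start with the k-mer are adjacent in the array.
    std::set<std::size_t> ids;
    for (auto it = first; it != sa.end(); ++it)
    {
        if (reads[it->first].compare(it->second, kmer.size(), kmer) != 0)
            break;
        ids.insert(it->first);
    }
    return ids;
}
```

A read passes the filter if its id appears in the result of `readsContaining` for at least one of the given k-mers.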

Because I want to filter the original FASTQ file multiple times with different sets of k-mers, I thought it would be a good idea to store the index on disk so it can be re-used next time. In addition to the main ".sa" file, however, it creates an incredible number of files matching the pattern "*.txt.NUM". I suspect there is a separate file for every sequence in the FASTQ file. These files are very small too: smaller than 1 kB (probably because these are short reads). This number of files really slows down simple operations like `ls`, plus it's really messy.

Furthermore, this was just a relatively small test FASTQ file. If my assumption that a separate file is created for every sequence in the FASTQ file is true, then this approach will not be usable for higher-coverage samples with many more sequences. I expect I will hit some file system limit on the number of files, especially if I have multiple FASTQ samples in the same directory.

Can this behaviour be changed? Why isn't everything stored in a single ".sa" file?

Current source code for reference:
https://github.com/lrvdijk/read-filter

Kind regards,
Lucas

---

René Rahn
Ph.D. Student (de.NBI - CIBI)
--------------------------------
Tel:  (+49) 30 838 72974
Mail: rene.rahn@fu-berlin.de
--------------------------------
Institute of Computer Science
Algorithmic Bioinformatics (ABI)
--------------------------------
Freie Universität Berlin
Takustraße 9
14195 Berlin
--------------------------------
