Subject: [Seqan-dev] Saving an index for StringSet to disk: lots of files
Hi all,
I'm trying
to build a simple (short) read filter: given a list of k-mers, keep only
reads that contain at least one of the given k-mers. This will be used
to analyse the behaviour of a tool we're working on. It doesn't need to
be super memory efficient, it'll mostly be a debugging tool.
My
strategy was to use SeqAn to build a suffix array index over the
StringSet of reads, and quickly enumerate which reads contain a given
k-mer. I got a simple prototype working that reads the whole FASTQ file
into memory as a StringSet, and then builds an
Index<StringSet<Dna5String>, IndexSa<>>.
Because
I want to filter the original FASTQ file multiple times with different
sets of k-mers, I thought it would be a good idea to store the index on
disk, to be re-used next time. In addition to the main ".sa" file,
however, saving creates an incredible number of files matching the
pattern "*.txt.NUM"; I suspect there is a separate file for every
sequence in the FASTQ file. These files are also very small, under 1 KB
each (probably because these are short reads). Having this many files
really slows down simple operations like `ls`, plus it's really messy.
Furthermore,
this was just a relatively small test FASTQ file. If my assumption that
a separate file is created for every sequence in the FASTQ file is
correct, this approach will not scale to higher-coverage samples with
many more sequences. I expect to hit some file system limit on the
number of files, especially if I have multiple FASTQ samples in the
same directory.
Can this behaviour be changed? Why isn't everything stored in a single ".sa" file?