Re: [Seqan-dev] Saving an index for StringSet to disk: lots of files

From: Hannes Hauswedell <hannes.hauswedell@fu-berlin.de>
To: Lucas van Dijk <info@lucasvandijk.nl>
Date: Fri, 23 Mar 2018 11:58:54 +0100
Cc: seqan-dev@lists.fu-berlin.de
Organization: MPI MolGen / FU-Berlin
Subject: Re: [Seqan-dev] Saving an index for StringSet to disk: lots of files

On Friday, 16 March 2018, 20:40:00, Lucas van Dijk wrote:
> Hi all,
> 
> I'm trying to build a simple (short) read filter: given a list of k-mers,
> keep only reads that contain at least one of the given k-mers. This will
> be used to analyse the behaviour of a tool we're working on. It doesn't
> need to be super memory efficient, it'll mostly be a debugging tool.
> 
> My strategy was to use SeqAn and build a suffix array index for the
> StringSet of reads, and quickly enumerate which reads contain a given
> k-mer. I got a simple prototype working that reads the whole FASTQ file in
> memory as a StringSet, and then builds an Index<StringSet<Dna5String>,
> IndexSa<>>.

You should use the StringSet<Dna5String, Owner<ConcatDirect<>>>
specialisation here; it results in only two files being written for the
StringSet (the concatenation of all strings plus a vector of delimiters).
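To illustrate why this layout needs only two buffers no matter how many reads there are, here is a minimal sketch in plain standard C++. It is not SeqAn code and not how SeqAn implements the specialisation; the type ConcatSet and its members append, at and seqOf are hypothetical names chosen for this example only. One string holds all sequences back to back, and one vector holds the boundaries between them.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical illustration of a concatenated string set (not SeqAn's code).
struct ConcatSet {
    std::string concat;                  // all sequences stored back to back
    std::vector<std::size_t> limits{0};  // sequence i spans [limits[i], limits[i+1])

    // Append one sequence: extend the concatenation and record its end.
    void append(const std::string& seq) {
        concat += seq;
        limits.push_back(concat.size());
    }

    // Number of sequences stored.
    std::size_t size() const { return limits.size() - 1; }

    // Copy of sequence i, cut out of the concatenation.
    std::string at(std::size_t i) const {
        assert(i + 1 < limits.size());
        return concat.substr(limits[i], limits[i + 1] - limits[i]);
    }

    // Map a position in the concatenation back to the sequence containing it,
    // e.g. to turn an index hit into a read number.
    std::size_t seqOf(std::size_t pos) const {
        assert(pos < concat.size());
        auto it = std::upper_bound(limits.begin(), limits.end(), pos);
        return static_cast<std::size_t>(it - limits.begin()) - 1;
    }
};
```

Persisting such a set means writing just concat and limits, whereas a set that owns each string separately has one buffer per read to store.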

Hope that helps,
Hannes
-- 
Hannes Hauswedell

Scientific staff & PhD candidate
Freie Universität Berlin / Max Planck Institute for Molecular Genetics

address     Institut für Informatik
            Takustraße 9
            Room 019
            14195 Berlin
telephone   +49 (0)30 838-75241
fax         +49 (0)30 838-75218
e-mail      hannes.hauswedell@fu-berlin.de