On Friday, 16 March 2018, 20:40:00, Lucas van Dijk wrote:
> Hi all,
>
> I'm trying to build a simple (short) read filter: given a list of k-mers,
> keep only reads that contain at least one of the given k-mers. This will
> be used to analyse the behaviour of a tool we're working on. It doesn't
> need to be super memory efficient, it'll mostly be a debugging tool.
>
> My strategy was to use SeqAn and build a suffix array index for the
> StringSet of reads, and quickly enumerate which reads contain a given
> k-mer. I got a simple prototype working that reads the whole FASTQ file in
> memory as StringSet, and then builds an Index<StringSet<Dna5String>,
> IndexSa<>>.

You should use the StringSet<Dna5String, Owner<ConcatDirect<>>>
specialisation here. It will result in only two files being written for the
StringSet: the concatenation of all strings plus a vector of delimiters.

Hope that helps,
Hannes

-- 
Hannes Hauswedell
Scientific staff & PhD candidate
Freie Universität Berlin / Max Planck Institute for Molecular Genetics

address    Institut für Informatik
           Takustraße 9
           Room 019
           14195 Berlin
telephone  +49 (0)30 838-75241
fax        +49 (0)30 838-75218
e-mail     hannes.hauswedell@fu-berlin.de
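
To illustrate the layout mentioned in the reply (one concatenation of all strings plus a vector of delimiters), here is a minimal sketch in plain standard C++. It does not use SeqAn: `ConcatReadSet`, `readIndexOf` and `readsContainingAny` are hypothetical names made up for this example, and the linear `find` scan is only a stand-in for the suffix array lookup the original poster describes. It assumes reads and k-mers are plain `std::string` values.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical stand-in for a concat-direct string set (NOT SeqAn's class):
// all reads live in one buffer, and `limits` marks where each read starts.
struct ConcatReadSet {
    std::string concat;                  // all reads, back to back
    std::vector<std::size_t> limits{0};  // limits[i] = start of read i,
                                         // last entry = concat.size()

    void append(std::string_view read) {
        concat.append(read);
        limits.push_back(concat.size());
    }

    std::size_t size() const { return limits.size() - 1; }

    std::string_view operator[](std::size_t i) const {
        assert(i < size());
        return std::string_view(concat).substr(limits[i],
                                               limits[i + 1] - limits[i]);
    }

    // Map a position in the concatenation back to the read containing it.
    std::size_t readIndexOf(std::size_t pos) const {
        assert(pos < concat.size());
        auto it = std::upper_bound(limits.begin(), limits.end(), pos);
        return static_cast<std::size_t>(it - limits.begin()) - 1;
    }
};

// Naive filter: sorted indices of reads containing at least one k-mer.
// Hits that straddle the boundary between two reads are rejected.
std::vector<std::size_t> readsContainingAny(
        const ConcatReadSet& reads, const std::vector<std::string>& kmers) {
    std::vector<bool> hit(reads.size(), false);
    for (const std::string& kmer : kmers) {
        if (kmer.empty()) continue;
        for (std::size_t pos = reads.concat.find(kmer);
             pos != std::string::npos;
             pos = reads.concat.find(kmer, pos + 1)) {
            std::size_t r = reads.readIndexOf(pos);
            if (pos + kmer.size() <= reads.limits[r + 1]) hit[r] = true;
        }
    }
    std::vector<std::size_t> result;
    for (std::size_t i = 0; i < hit.size(); ++i)
        if (hit[i]) result.push_back(i);
    return result;
}
```

Because the whole set is just two flat arrays, writing it to disk comes down to two files, which matches the behaviour described for the ConcatDirect specialisation. The boundary check in `readsContainingAny` matters for any search over the concatenation: without it, a k-mer formed by the end of one read and the start of the next would be reported as a false hit.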