[Seqan-dev] Creating a Dependent StringSet

Hendrik Weisser <hw5@sanger.ac.uk> · Wed, 24 Feb 2016 10:17:25 +0000

Hi!

I'm working on some code in OpenMS that matches peptides to a protein
sequence database using SeqAn algorithms. There are two cases:

1. An exact string search of the whole peptide/protein sets, followed by
   an approximate search of only unmatched peptides against only proteins
   with sequence ambiguities.

2. An approximate string search of the whole peptide/protein sets.

I'm starting out with StringSets that store the peptides and proteins,
which are used in the exact search. For the approximate search, I then
use Dependent StringSets that I fill with either a subset of the
peptides/proteins (case 1) or with the whole sets (case 2).
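
To make the owner/dependent setup concrete, here is a minimal sketch in
plain C++ rather than SeqAn. It only illustrates the idea of a set that
references entries of another set without copying them. std::vector
stands in for the StringSets, and the names OwnerSet, DependentSet,
viewAll and viewSubset are hypothetical, not part of SeqAn or OpenMS.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-ins: the owner set stores the sequences, the
// dependent set stores only non-owning pointers into the owner set.
using OwnerSet = std::vector<std::string>;
using DependentSet = std::vector<const std::string*>;

// Case 2: reference every entry of the owner set (no sequence is copied).
DependentSet viewAll(const OwnerSet& owner)
{
    DependentSet view;
    view.reserve(owner.size());  // one allocation up front
    for (const std::string& s : owner)
        view.push_back(&s);
    return view;
}

// Case 1: reference only the entries whose flag is set, e.g. the
// sequences that still need the approximate search.
DependentSet viewSubset(const OwnerSet& owner, const std::vector<bool>& keep)
{
    DependentSet view;
    for (std::size_t i = 0; i < owner.size() && i < keep.size(); ++i)
        if (keep[i])
            view.push_back(&owner[i]);
    return view;
}
```

In this sketch, building either view is a single linear pass over the
owner set, and the pointers stay valid only as long as the owner set is
not resized or destroyed.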

My problem is that in case 2, initializing the Dependent StringSet for
the proteins is agonizingly slow. For a big protein database with more
than 300,000 entries it takes more than a day! It is much, much slower
than generating the Owner StringSet in the first place.

This is the code I use:

seqan::StringSet<seqan::Peptide> prot_DB; // full protein DB
[...]
seqan::StringSet<seqan::Peptide, seqan::Dependent<seqan::Generous> > prot_DB_SA;  // for approx. search
Size length_prot_DB = length(prot_DB);
reserve(prot_DB_SA, length_prot_DB);
for (Size i = 0; i < length_prot_DB; ++i)
{
  assignValueById(prot_DB_SA, prot_DB, i);
}

Is this just not a good way of doing it?

Based on the documentation of "assignValueById" for StringSets
(http://docs.seqan.de/seqan/master/class_StringSet.html#StringSet%23assignValueById
- note that the function is erroneously called "getValueById") I would
have expected that "assignValueById(prot_DB_SA, prot_DB)" would append
the whole StringSet, but that doesn't compile.

This is all based on SeqAn 1.4, which OpenMS still uses. Has this
problem been solved already in the newer releases?

Cheers

Hendrik
