Hi!
I'm working on some code in OpenMS that matches peptides to a protein sequence database using SeqAn algorithms. There are two cases:
1. An exact string search of the whole peptide/protein sets, followed by an approximate search of only unmatched peptides against only proteins with sequence ambiguities.
2. An approximate string search of the whole peptide/protein sets.
I'm starting out with StringSets that store the peptides and proteins, which are used in the exact search. For the approximate search, I then use Dependent StringSets that I fill with either a subset of the peptides/proteins (case 1) or with the whole sets (case 2).
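To make the Owner/Dependent setup concrete, here is a rough analogy in plain C++ (standard library only, not SeqAn code and not how SeqAn implements it). The names OwnerSet, DependentSet, referenceAll and referenceAmbiguous are made up for illustration, and treating B, Z and X as the ambiguity codes is an assumption:

```cpp
#include <string>
#include <vector>

// Owner set: holds the actual sequences (comparable in spirit to an Owner StringSet).
using OwnerSet = std::vector<std::string>;
// Dependent set: holds only pointers into an owner set, so nothing is copied.
// Hypothetical stand-in for a Dependent StringSet, not SeqAn's data structure.
using DependentSet = std::vector<const std::string*>;

// Case 2: reference every sequence in the owner set.
DependentSet referenceAll(const OwnerSet& owner)
{
    DependentSet dep;
    dep.reserve(owner.size());
    for (const std::string& s : owner)
        dep.push_back(&s);
    return dep;
}

// Case 1: reference only proteins with sequence ambiguities
// (assumed here to mean: the sequence contains B, Z or X).
DependentSet referenceAmbiguous(const OwnerSet& owner)
{
    DependentSet dep;
    for (const std::string& s : owner)
        if (s.find_first_of("BZX") != std::string::npos)
            dep.push_back(&s);
    return dep;
}
```

In this sketch both variants are a single linear pass over the owner set, which is roughly the cost I would have expected for filling the Dependent StringSet.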
My problem is that in case 2, initializing the Dependent StringSet for the proteins is agonizingly slow. For a big protein database with >300,000 entries it takes more than a day! It is much, much slower than generating the Owner StringSet in the first place.
This is the code I use:
    seqan::StringSet<seqan::Peptide> prot_DB; // full protein DB
    [...]
    seqan::StringSet<seqan::Peptide, seqan::Dependent<seqan::Generous> > prot_DB_SA; // for approx. search
    Size length_prot_DB = length(prot_DB);
    reserve(prot_DB_SA, length_prot_DB);
    for (Size i = 0; i < length_prot_DB; ++i)
    {
        assignValueById(prot_DB_SA, prot_DB, i);
    }
Is this just not a good way of doing it?
Based on the documentation of "assignValueById" for StringSets (http://docs.seqan.de/seqan/master/class_StringSet.html#StringSet%23assignValueById - note that the function is erroneously called "getValueById" there), I would have expected that "assignValueById(prot_DB_SA, prot_DB)" would append the whole StringSet, but that doesn't compile.
This is all based on SeqAn 1.4, which OpenMS still uses. Has this problem been solved already in the newer releases?
Cheers
Hendrik