Re: [Seqan-dev] question about the efficiency of the sequan sequence classes

From: "Holtgrewe, Manuel" <manuel.holtgrewe@fu-berlin.de>
To: SeqAn Development <seqan-dev@lists.fu-berlin.de>
Date: Wed, 28 Aug 2013 14:13:54 +0200
Reply-to: SeqAn Development <seqan-dev@lists.fu-berlin.de>
Subject: Re: [Seqan-dev] question about the efficiency of the sequan sequence classes

Hi Daniel,

It depends on your application and what you do with your strings. Using the SeqAn library can yield more elegant and faster code than using std::string or self-written string classes, but whether it does depends on the actual use case.

For sequences, there are two aspects:

(1) SeqAn's alphabet types Dna and Dna5 store each character internally as a number: 0..3 for Dna and 0..4 for Dna5. This makes things easier for indices and mappings, since they can work directly and efficiently on the ordinal value (ordValue).

For example, if you are counting the nucleotide content along strings, you can simply keep a small container of counters (a String in this case, with 4 elements for Dna or 5 for Dna5) for each position in your reads, which gives a String of Strings. You do not need an explicit mapping 'A' => 0, 'C' => 1, 'G' => 2, 'T' => 3, 'N' => 4, since the mapping is done beforehand.

    // Assumes `reads` is a container of Dna5 strings (for example a StringSet<Dna5String>).
    String<String<unsigned> > counters;
    for (unsigned i = 0; i < length(reads); ++i)
    {
        // Grow the per-position counters if reads[i] is longer than all previous reads.
        if (length(counters) < length(reads[i]))
        {
            unsigned oldSize = length(counters);
            resize(counters, length(reads[i]));
            for (unsigned j = oldSize; j < length(counters); ++j)
                resize(counters[j], 5, 0);  // one counter each for A, C, G, T, N
        }
        // Count the nucleotide at each position of reads[i].
        for (unsigned j = 0; j < length(reads[i]); ++j)
            counters[j][ordValue(reads[i][j])] += 1;
    }
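
For comparison, here is a rough sketch of the same per-position counting on plain std::string, without SeqAn. The names rankOf and countPerPosition are made up for this illustration. It shows the explicit character-to-number mapping that has to be written by hand and that ordValue() makes unnecessary for Dna5 strings.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: map a nucleotide character to a rank 0..4.
// With a SeqAn Dna5 string, ordValue() gives you this number directly.
inline unsigned rankOf(char c)
{
    switch (c)
    {
        case 'A': case 'a': return 0;
        case 'C': case 'c': return 1;
        case 'G': case 'g': return 2;
        case 'T': case 't': return 3;
        default:            return 4;  // everything else is counted as N
    }
}

// counters[j][r] = number of reads that have the nucleotide of rank r at position j.
inline std::vector<std::array<unsigned, 5>>
countPerPosition(std::vector<std::string> const & reads)
{
    std::vector<std::array<unsigned, 5>> counters;
    for (std::string const & read : reads)
    {
        // Grow the table if this read is longer than all previous ones.
        if (counters.size() < read.size())
            counters.resize(read.size(), std::array<unsigned, 5>{});  // zero-filled rows
        for (std::size_t j = 0; j < read.size(); ++j)
        {
            unsigned r = rankOf(read[j]);
            assert(r < counters[j].size());  // rankOf() only returns 0..4
            counters[j][r] += 1;
        }
    }
    return counters;
}
```

The switch in rankOf runs once per character here, whereas a Dna5 string pays for the conversion only once, when the sequence is read in.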

(2) SeqAn's String class additionally lets you choose an alternative implementation. The default implementation simply uses an array and stores each Dna character in one byte. With the Packed String, four Dna characters are packed into one byte (each needs only 2 bits). This costs some computation on each access, but in this case it reduces memory consumption by a factor of 4.
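
To make the 2-bit idea concrete, here is a minimal sketch in plain C++. The class name PackedDna and its members are invented for this illustration, and this is not how SeqAn's Packed String is actually implemented. It only shows where the extra shift and mask per access come from.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical illustration only; this is not SeqAn's Packed String.
class PackedDna
{
public:
    // Append a nucleotide given by its rank 0..3 (A, C, G, T).
    void push_back(unsigned rank)
    {
        assert(rank < 4);
        if (size_ % 4 == 0)
            bytes_.push_back(0);                      // start a new byte every 4 characters
        unsigned shift = 2 * (size_ % 4);             // bit offset inside the current byte
        bytes_.back() |= static_cast<std::uint8_t>(rank << shift);
        ++size_;
    }

    // Read the rank stored at position i: one shift and one mask per access.
    unsigned operator[](std::size_t i) const
    {
        assert(i < size_);
        return (bytes_[i / 4] >> (2 * (i % 4))) & 3u;
    }

    std::size_t size() const { return size_; }            // number of characters
    std::size_t bytesUsed() const { return bytes_.size(); } // number of bytes of storage

private:
    std::vector<std::uint8_t> bytes_;
    std::size_t size_ = 0;
};
```

Storing 9 characters this way takes 3 bytes instead of 9, at the price of the bit arithmetic in operator[].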

We as library writers can now combine these two aspects, alphabets and string implementations, with generic programming and write algorithms that let the user choose the alphabet type and the string implementation according to their requirements and still get the best possible implementation for that case. Because template specialization lets us select the correct implementation of ordValue(), length(), etc. at *compile time*, we do not need virtual functions and thus pay no cost for runtime polymorphism.
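
As a rough sketch of what compile-time selection looks like, consider the following plain C++ example. MyDna, MyDna5, AlphabetSize and countSymbols are hypothetical stand-ins made up for this illustration, not SeqAn types or functions. The alphabet size is resolved by template specialization, so each instantiation gets a fixed-size array and no virtual call is involved.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for alphabet types; SeqAn's real Dna and Dna5 are richer.
struct MyDna  { unsigned char rank; };  // ranks 0..3
struct MyDna5 { unsigned char rank; };  // ranks 0..4, with 4 meaning N

// Compile-time trait, specialized per alphabet.
template <typename TAlphabet> struct AlphabetSize;
template <> struct AlphabetSize<MyDna>  { static constexpr std::size_t value = 4; };
template <> struct AlphabetSize<MyDna5> { static constexpr std::size_t value = 5; };

// Generic algorithm: the counter array has a size known at compile time
// for each alphabet, and the compiler generates one version per alphabet.
template <typename TAlphabet>
std::array<unsigned, AlphabetSize<TAlphabet>::value>
countSymbols(std::vector<TAlphabet> const & seq)
{
    std::array<unsigned, AlphabetSize<TAlphabet>::value> counts{};  // all zeros
    for (TAlphabet const & c : seq)
    {
        assert(c.rank < counts.size());
        counts[c.rank] += 1;
    }
    return counts;
}
```

Calling countSymbols on a std::vector<MyDna> returns an array of 4 counters, and on a std::vector<MyDna5> an array of 5. The choice is made when the code is compiled, not while it runs.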

If you want to use the algorithms in the SeqAn library, you could benefit from using SeqAn sequences. However, many algorithms also work with std::string, and without knowing your application and code it is hard to promise any speedup.

Cheers,

Manuel

From: Bartha Dániel [daniel.bartha@gmail.com]
Sent: Wednesday, August 28, 2013 11:49 AM
To: SeqAn Development
Subject: [Seqan-dev] question about the efficiency of the sequan sequence classes

Hi All,

I have a big question. I wrote an application that currently uses my own custom std::string-based implementation for some DNA mutation work. I basically have to access every single character in the DNA and then do something with it, but that is not important for the question.

I am inclined to rewrite the whole app with SeqAn, but that only makes sense if manipulating and accessing the SeqAn classes is significantly faster than with my own implementation. I read about the efficiency in the Motivation chapter, but does anybody have experience with the concrete speedup one can expect?

Thanks!

Regards: Daniel

Live long and prosper

Bartha Dániel
MTA-VMRI, 2013

Follow-Ups:
- Re: [Seqan-dev] question about the efficiency of the sequan sequence classes
  - From: Bartha Dániel <daniel.bartha@gmail.com>

References:
- [Seqan-dev] question about the efficiency of the sequan sequence classes
  - From: Bartha Dániel <daniel.bartha@gmail.com>
