Re: [Seqan-dev] question about the efficiency of the sequan sequence classes

From: Rahn, René <rene.maerker@fu-berlin.de>
To: SeqAn Development <seqan-dev@lists.fu-berlin.de>
Date: Fri, 13 Sep 2013 12:34:45 +0200
Reply-to: SeqAn Development <seqan-dev@lists.fu-berlin.de>
Subject: Re: [Seqan-dev] question about the efficiency of the sequan sequence classes

Hey Daniel,

I tried out your code examples below. I did make some surprising observations, but they are different from what you were reporting. I replaced some of your functionality: I adapted the select_event function to simply return the complement of a given base, and I removed the randomness used to select the indices and simply converted every index. I loaded the chr22 sequence of the human genome (~50 Mb) and measured the time of running 50 times a) the replicate function and b) the inner loop with the assignment. I ran the experiments with seqan::String<Dna5>, std::vector<Dna5>, std::basic_string<Dna5> and std::string. I also implemented a replicate3 function, which performs best because it reduces the number of times whole strings are copied.

I iterated over the index with both a C++11 range-based for loop and a standard for loop.
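To make the two loop styles concrete, here is a minimal sketch of how such a measurement could be set up. It is not the attached benchmark file: it uses std::string instead of the SeqAn types, and the names complement, allIndices, complementRangeFor, complementIndexFor and timeRuns are hypothetical helpers introduced only for illustration.

```cpp
#include <chrono>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the simplified select_event: returns the
// complement of a base given as a plain char; anything else maps to 'N'.
inline char complement(char base)
{
    switch (base)
    {
        case 'A': return 'T';
        case 'C': return 'G';
        case 'G': return 'C';
        case 'T': return 'A';
        default:  return 'N';
    }
}

// Builds the index 0, 1, ..., n-1 (every position is converted).
inline std::vector<std::size_t> allIndices(std::size_t n)
{
    std::vector<std::size_t> index(n);
    for (std::size_t i = 0; i < n; ++i)
        index[i] = i;
    return index;
}

// C++11 style: range-based for loop over the index.
inline void complementRangeFor(std::string & seq, std::vector<std::size_t> const & index)
{
    for (std::size_t i : index)
        seq[i] = complement(seq[i]);
}

// C++98 style: standard counting for loop over the index.
inline void complementIndexFor(std::string & seq, std::vector<std::size_t> const & index)
{
    for (std::size_t k = 0; k < index.size(); ++k)
        seq[index[k]] = complement(seq[index[k]]);
}

// Calls fn() `runs` times and returns the summed wall-clock time in seconds.
template <typename TFunc>
double timeRuns(TFunc fn, unsigned runs)
{
    double total = 0.0;
    for (unsigned r = 0; r < runs; ++r)
    {
        auto start = std::chrono::steady_clock::now();
        fn();
        auto stop = std::chrono::steady_clock::now();
        total += std::chrono::duration<double>(stop - start).count();
    }
    return total;
}
```

A call such as timeRuns([&] { complementRangeFor(seq, index); }, 50) would then give a summed time comparable in spirit to the "Inner Loop" numbers, although the absolute values depend on the machine and compiler flags.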

Here are my results built in release mode on a 2.3 GHz Core i7.

All times are the sum of 50 experiments.

C++11 style:

Seqan String Time: 11.18 s. Inner Loop: 2.58064 s.

STL Vector Time: 10.9798 s. Inner Loop: 2.53835 s.

STL Basic String Dna5 Time: 10.6501 s. Inner Loop: 3.94554 s.

STL Basic String Char Time: 11.4799 s. Inner Loop: 4.85506 s.

replicate3 Time: 8.67172 s. Inner Loop: 2.52474 s.

C++98 style:

Seqan String Time: 11.0828 s. Inner Loop: 2.49667 s.

STL Vector Time: 10.9178 s. Inner Loop: 2.54614 s.

STL Basic String Dna5 Time: 10.9048 s. Inner Loop: 4.20024 s.

STL Basic String Char Time: 12.3184 s. Inner Loop: 5.61231 s.

replicate3 Time: 9.55719 s. Inner Loop: 3.30052 s.

As you can see, the replicate3 function outperforms the other versions. However, its inner loop gets slower when using the standard for loop, and I am not sure I completely understand why, because I can't observe the same performance drop in the replicate2 function.

However, when comparing the results of the C++11 version, the assignment for seqan::String is on par with std::vector and faster than both std::string versions.

Can you please give us some information about the dimensions of your problem? How many sequences are you replicating? How long are the sequences?

Please consider the following performance booster: always prefer passing parameters by const-reference over passing them by copy (unless you are sure they are just simple types). Copying a big container with many values is slower than copying a 4- or 8-byte reference :).
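As a small illustration of that advice, the following sketch counts how often a sequence is copied when it is passed by value versus by const-reference. CountedSeq, gcByValue and gcByConstRef are hypothetical names made up for this example and are not part of SeqAn or of your code.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical sequence wrapper that counts how often it is copied.
struct CountedSeq
{
    std::string data;
    static int copies;

    explicit CountedSeq(std::string d) : data(std::move(d)) {}
    CountedSeq(CountedSeq const & other) : data(other.data) { ++copies; }
};
int CountedSeq::copies = 0;

// Pass by value: the whole sequence is copied on every call.
inline std::size_t gcByValue(CountedSeq seq)
{
    std::size_t gc = 0;
    for (char c : seq.data)
        if (c == 'G' || c == 'C')
            ++gc;
    return gc;
}

// Pass by const-reference: only a reference is passed, no copy is made.
inline std::size_t gcByConstRef(CountedSeq const & seq)
{
    std::size_t gc = 0;
    for (char c : seq.data)
        if (c == 'G' || c == 'C')
            ++gc;
    return gc;
}
```

Calling gcByValue on an existing CountedSeq increases CountedSeq::copies by one, while gcByConstRef leaves it unchanged. For a chromosome-sized sequence that one copy per call is what dominates the run time.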

I have also attached the benchmark file, so maybe you can run the tests on your machine and report your experience.

Kind regards,

René

On 11.09.2013 at 15:43, Bartha Dániel <daniel.bartha@gmail.com> wrote:

Hi Manuel and People there,

I promised to report on the performance comparison between seqan::String<seqan::Dna5> and std::string. So here are the (for me) surprising results:

I replaced the strings and chars with the SeqAn types throughout my source files. I access the characters in the SeqAn strings through the [] operator and corrected the functions where needed.

The program does its job, but it is 5 times slower than the simple std implementation! That's not exactly what I expected; I thought it would be a little slower or much faster, but not this extreme slowdown.

I suppose it happens because I don't use SeqAn the right way. Do you have an idea what the reason is? I paste the two responsible functions here; it would be great if someone could spend a couple of minutes on them.

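Dna5 eventspace::select_event(Dna5 base, double p)
{
    /** This function only returns a Dna5 character if the random number
        passed in lies in one of the pre-stored intervals, so nothing special. **/
    for (event e : E[base])
    {
        if (e.a > p && p >= e.b)
        {
            return e.to;  // which is a seqan::Dna5 character
        }
    }
    // No interval matched. Falling off the end of a non-void function is
    // undefined behaviour, so return the unchanged base (assumed fallback).
    return base;
}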

seqan::String<seqan::Dna5> replicate2(framework& sys, seqan::String<seqan::Dna5> seq, default_random_engine engine)
{
    // this and the default_random_engine are needed for real random number generation
    uniform_real_distribution<> ur_dist(0, sys.Getscale());

    vector<double> probs(length(seq));
    vector<int> index;

    for (unsigned i = 0; i < probs.size(); ++i)
    {
        probs[i] = ur_dist(engine);
        if (probs[i] > sys.lookup[seq[i]])
            index.push_back(i);
    }
    for (unsigned i : index)
    {
        /** so practically one Dna5 = the other Dna5 variable;
            with assign() it is even a little slower **/
        seq[i] = sys.events.select_event(seq[i], probs[i]);
    }
    return seq;
}

Do you have any idea, or is this slowdown maybe normal?

Thanks, regards:

Daniel

Live long and prosper

Bartha Dániel
MTA-VMRI, 2013

2013/8/28 Bartha Dániel <daniel.bartha@gmail.com>

Hi Manuel (and other c++ fellows),

I will try it and tell you if it's better.

But there is another problem now, and there was already a discussion about it in February: https://lists.fu-berlin.de/pipermail/seqan-dev/2013-February/msg00002.htm

I don't know if it is solved or not, but I still/again get exactly the same error message:

/usr/include/seqan/bam_io/cigar.h||In function ‘bool seqan::operator<(const seqan::CigarElement<TOperation, TCount>&, const seqan::CigarElement<TOperation, TCount>&)’:|
/usr/include/seqan/bam_io/cigar.h|120|error: parse error in template argument list|
||=== Build finished: 1 errors, 0 warnings (0 minutes, 2 seconds) ===|

This is caused by including <seqan/seq_io.h>; the program is otherwise completely empty (return 0;...). I use Ubuntu Linux amd64 and g++ 4.7.3.

I am avoiding this header for now, but the problem doesn't seem to be unique.

Thank you very much again, and have a good day!

Daniel

Live long and prosper

Bartha Dániel
MTA-VMRI, 2013

2013/8/28 Holtgrewe, Manuel <manuel.holtgrewe@fu-berlin.de>

Hi Daniel,

it depends on your application and what you do with your strings. Using the SeqAn library can yield more elegant and faster code than using std::string or self-written string classes but it depends on the actual use case.

For sequences, there are two aspects:

(1) SeqAn's Dna and Dna5 character types store each character internally as a number (0..3 for Dna, 0..4 for Dna5). This makes it easier for indices and mappings since they can work directly and efficiently on the ordinal value (ordValue).

For example, if you are counting the nucleotide content along strings, you can simply have a 5-element container (a String in this case), one counter per Dna5 value, for each position in your reads (thus a String of Strings). You do not need an explicit mapping 'A' => 0, 'C' => 1, 'G' => 2, 'T' => 3, 'N' => 4 since the mapping is done beforehand.

String<String<unsigned> > counters;

for (unsigned i = 0; i < length(reads); ++i)
{
    // Increase number of counters if reads[i] is longer than the previous reads.
    if (length(counters) < length(reads[i]))
    {
        unsigned oldSize = length(counters);
        resize(counters, length(reads[i]));
        // One counter per Dna5 value (A, C, G, T, N), initialized to 0.
        for (unsigned j = oldSize; j < length(counters); ++j)
            resize(counters[j], 5, 0);
    }

    // Count nucleotides for each position in reads[i]:
    // counters[j] holds the counts for position j.
    for (unsigned j = 0; j < length(reads[i]); ++j)
        counters[j][ordValue(reads[i][j])] += 1;
}

(2) SeqAn's String class additionally allows you to choose an alternative implementation. The default implementation simply uses an array and stores each Dna character in one byte. By using the Packed String, you can pack four DNA characters into one byte (each only needs 2 bits). This comes at the cost of some computation, but in this case leads to a 4x reduction in memory consumption.
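To illustrate the idea of storing four 2-bit characters per byte, here is a minimal standalone sketch. It is not SeqAn's Packed String implementation; the class PackedDna and its members are hypothetical and only show where the extra shifting and masking cost comes from.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative 2-bit packed DNA container (not SeqAn's implementation).
// Bases are given as ordinal values 0..3 (A, C, G, T).
class PackedDna
{
public:
    // Reserves one byte per four characters, rounded up.
    explicit PackedDna(std::size_t n) : size_(n), bytes_((n + 3) / 4, 0) {}

    // Writes the ordinal value `ord` at position i.
    void set(std::size_t i, unsigned ord)
    {
        unsigned shift = 2 * (i % 4);
        std::uint8_t & b = bytes_[i / 4];
        // Clear the two bits of this slot, then store the new value.
        b = static_cast<std::uint8_t>((b & ~(3u << shift)) | ((ord & 3u) << shift));
    }

    // Reads the ordinal value at position i.
    unsigned get(std::size_t i) const
    {
        return (bytes_[i / 4] >> (2 * (i % 4))) & 3u;
    }

    std::size_t size() const { return size_; }
    std::size_t storageBytes() const { return bytes_.size(); }

private:
    std::size_t size_;
    std::vector<std::uint8_t> bytes_;
};
```

Ten characters occupy three bytes here instead of ten, and every access pays for one shift and one mask, which is the trade-off described above.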

We as library writers can now combine these two aspects of sequences and alphabets with generic programming and write algorithms that allow the user to change the alphabet type and the string implementation depending on the user's requirements and get the best possible implementation for this case. Because template specialization allows us to decide on the correct implementation of ordValue(), length() etc. at *compile time*, we do not need virtual functions and thus pay no cost for runtime polymorphism.

If you want to use the algorithms in the SeqAn library, then you could benefit from using SeqAn sequences. However, many algorithms also work with std::string, and without knowing your application and code it is hard to make any promise on acceleration.
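A minimal sketch of this kind of compile-time dispatch, written in plain C++ and not taken from the SeqAn sources: the hypothetical type MiniDna and the functions ordValue and countBases below are illustrative only, and the sketch uses overload resolution inside a function template rather than SeqAn's actual specialization machinery.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical alphabet type that stores the ordinal value directly.
struct MiniDna { unsigned char value; };  // 0..3 for A, C, G, T

// For MiniDna the ordinal value is already stored: no mapping needed.
inline unsigned ordValue(MiniDna c) { return c.value; }

// For plain char a mapping has to be computed on every access.
// In this sketch any character other than A, C, G is treated like T.
inline unsigned ordValue(char c)
{
    switch (c)
    {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        default:  return 3;
    }
}

// Generic algorithm: the compiler selects the matching ordValue() for the
// element type of TSequence at compile time; no virtual call is involved.
template <typename TSequence>
std::vector<std::size_t> countBases(TSequence const & seq)
{
    std::vector<std::size_t> counts(4, 0);
    for (auto const & c : seq)
        ++counts[ordValue(c)];
    return counts;
}
```

The same countBases works for a std::string and for a std::vector<MiniDna>, and in both cases the call to ordValue() can be inlined because it is resolved during compilation.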

Cheers,

Manuel

From: Bartha Dániel [daniel.bartha@gmail.com]
Sent: Wednesday, August 28, 2013 11:49 AM
To: SeqAn Development
Subject: [Seqan-dev] question about the efficiency of the sequan sequence classes

Hi All,

I have a big question. I wrote an application that currently uses my own custom std::string-based implementation for some DNA mutation stuff. I basically have to access every single character in the DNA and then do something with it, but that is not important for the question.

I am inclined to rewrite the whole app with SeqAn, but it only makes sense if manipulating and accessing the SeqAn classes is significantly faster than my own implementation. I read about the efficiency in the Motivation chapter, but does anybody have any experience with the concrete speedup that can be expected?

Thanks!

Regards: Daniel

Live long and prosper

Bartha Dániel
MTA-VMRI, 2013

_______________________________________________
seqan-dev mailing list
seqan-dev@lists.fu-berlin.de
https://lists.fu-berlin.de/listinfo/seqan-dev
