Subject: Re: [Seqan-dev] question about the efficiency of the sequan sequence classes
Sorry I meant "it is not always best to pass-by-reference"
On 13/09/13 14:15, John Reid wrote:
Hi Rene,
That looks like an interesting test but I think it is worth
pointing out that it is not always best to pass-by-value as you
assume. If you're interested Dave Abrahams explains why in this
article:
I tried out your code examples below. I did make some
surprising observations, but they are different from what
you were reporting. I replaced some of your functionality.
I adapted the select_event function to simply return the
complement of a given base. I removed the randomness factor
to select the index and simply used every index to be
converted. I loaded the chr22 sequence of the human genome
(~50 Mb) and measured the time of running 50 times a) the
replicate function and b) the inner loop with the
assignment. I did the experiments with the
seqan::String<Dna5>, std::vector<Dna5>,
std::basic_string<Dna5> and std::string. I also
implemented a replicate3 function, which performs best because it
reduces the number of times whole Strings are copied.
I iterated over the indices with both a C++11
range-based for loop and a standard for loop.
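For readers without the benchmark file, the two loop styles can be sketched roughly as follows. This is only an illustration on std::string: `complement`, `complementRangeFor` and `complementIndexFor` are hypothetical names standing in for the adapted select_event and the inner loop, not the code that was actually timed.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for the adapted select_event:
// returns the complement of a given base.
inline char complement(char base)
{
    switch (base)
    {
        case 'A': return 'T';
        case 'C': return 'G';
        case 'G': return 'C';
        case 'T': return 'A';
        default:  return 'N';  // anything else is treated as unknown
    }
}

// C++11 style: range-based for loop that converts every position.
inline void complementRangeFor(std::string & seq)
{
    for (char & base : seq)
        base = complement(base);
}

// C++98 style: index-based for loop that converts every position.
inline void complementIndexFor(std::string & seq)
{
    for (std::size_t i = 0; i < seq.size(); ++i)
        seq[i] = complement(seq[i]);
}
```

Both functions modify the sequence in place, so neither copies the whole string.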
Here are my results built in release mode on a 2.3 GHz
Core i7.
All times are the sum of 50 experiments.
C++11 style:
                            Time         Inner Loop
Seqan String                11.18 s      2.58064 s
STL Vector                  10.9798 s    2.53835 s
STL Basic String Dna5       10.6501 s    3.94554 s
STL Basic String Char       11.4799 s    4.85506 s
replicate3                  8.67172 s    2.52474 s

C++98 style:
                            Time         Inner Loop
Seqan String                11.0828 s    2.49667 s
STL Vector                  10.9178 s    2.54614 s
STL Basic String Dna5       10.9048 s    4.20024 s
STL Basic String Char       12.3184 s    5.61231 s
replicate3                  9.55719 s    3.30052 s
As you can see the replicate3 function outperforms the
other versions, however the inner loop gets slower when
using the standard for loop, and I am not quite sure that I
completely understand why, because I can't observe the same
performance drop in the replicate2 function.
However, in the C++11 version, assignment to a
seqan::String performs like std::vector
and is faster than the std::string versions.
Can you please give us some information about the
dimensions of your problem? How many sequences are you
replicating? How long are the sequences?
Please consider the following performance boosters.
Always prefer passing parameters by const-reference over
passing them by copy (as long as you are sure these are not
just simple types). Copying a big container with many values
is slower than copying a 4/8 Byte reference :).
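A minimal sketch of the difference, using std::string and two hypothetical helper functions (`countUnknown` and `withUnknownMasked` are made up for illustration). The second function shows the case raised elsewhere in this thread where passing by reference is not automatically better: if the function needs its own modifiable copy anyway, taking the parameter by value is reasonable.

```cpp
#include <cstddef>
#include <string>

// Read-only access: take the sequence by const reference.
// No copy of the container is made.
inline std::size_t countUnknown(std::string const & seq)
{
    std::size_t n = 0;
    for (char base : seq)
        if (base == 'N')
            ++n;
    return n;
}

// The function needs a private copy it can modify, so it takes
// the sequence by value and returns the modified copy.
inline std::string withUnknownMasked(std::string seq, char mask)
{
    for (char & base : seq)
        if (base == 'N')
            base = mask;
    return seq;
}
```

Calling `countUnknown(genome)` on a 50 Mb string only passes a reference, while `withUnknownMasked(genome, '-')` makes exactly the one copy the caller asked for.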
I also appended the benchmark file. So maybe you can run
the tests on your machine and report your experience.
I promised to report on the performance
comparison between
seqan::String<seqan::Dna5> and
std::string. So here are the (for me)
surprising results:
I replaced the strings and chars with the SeqAn
types throughout my source files. I access the
characters in the SeqAn strings through the []
operator and corrected the functions where
needed.
The program does its job, but it's 5 times slower
than the simple std implementation! That's not
exactly what I expected. I thought it would be a
little slower or much faster, but not this extreme
a slowdown.
I suppose it happens because I don't use SeqAn the
right way. Do you have an idea what the reason is? I
paste the two responsible functions here; it would
be great if someone could spend a couple of
minutes.
This is caused by including
<seqan/seq_io.h>, even when the program is
completely empty (return 0;...). I use Ubuntu
Linux amd64 and g++ 4.7.3.
I bypass the usage of this header for now, but
it doesn't seem to be unique.
It depends on your application
and what you do with your strings.
Using the SeqAn library can yield
more elegant and faster code than
using std::string or self-written
string classes but it depends on
the actual use case.
For Sequences, there are two
aspects:
(1) Using SeqAn's Dna5, Dna for
characters stores the alphabet as
numbers 0..3/4 internally. This
makes it easier for indices and
mappings since they can work
directly and efficiently on the
ordinal value (ordValue).
For example, if you are
counting the nucleotide content
along strings, you can simply have
a 4-element container (5-element
for Dna5; a String in this case)
for each position in your reads
(thus a String of Strings). You do
not need a separate mapping
'A' => 0, 'C' => 1, 'G' => 2,
'T' => 3, 'N' => 4 since the
mapping is done beforehand.
    String<String<unsigned> > counters;
    for (unsigned i = 0; i < length(reads); ++i)
    {
        // Increase the number of counters if reads[i] is longer
        // than the previous reads.
        if (length(counters) < length(reads[i]))
        {
            unsigned oldSize = length(counters);
            resize(counters, length(reads[i]));
            for (unsigned j = oldSize; j < length(counters); ++j)
                resize(counters[j], 5, 0);
        }
        // Count nucleotides for each position in reads[i]:
        // first index is the position, second the ordinal value.
        for (unsigned j = 0; j < length(reads[i]); ++j)
            counters[j][ordValue(reads[i][j])] += 1;
    }
(2) SeqAn's String class
additionally allows you to choose
an alternative implementation. The
default implementation simply uses
an array and would store a Dna
character in one byte. By using the
Packed String, you can pack four
DNA characters into one byte (each
only needs 2 bits). This comes at
the cost of some computation but
in this case leads to a 4x
reduction in memory consumption.
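To make the 2-bits-per-character idea concrete, here is a hand-rolled sketch in plain C++. It is not SeqAn's Packed String implementation; `PackedDna`, `toCode` and `toChar` are hypothetical names, and the sketch assumes the input contains only A, C, G and T.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Maps A, C, G, T to 0..3. Assumption: no other characters occur.
inline std::uint8_t toCode(char base)
{
    switch (base)
    {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        default:  return 3;  // 'T'
    }
}

// Inverse mapping; only the lowest two bits of code are used.
inline char toChar(std::uint8_t code)
{
    return "ACGT"[code & 3u];
}

// Hypothetical illustration type: stores four bases per byte.
class PackedDna
{
public:
    explicit PackedDna(std::string const & seq)
        : size_(seq.size()), bytes_((seq.size() + 3) / 4, 0)
    {
        for (std::size_t i = 0; i < size_; ++i)
            bytes_[i / 4] |=
                static_cast<std::uint8_t>(toCode(seq[i]) << (2 * (i % 4)));
    }

    // Read access costs a shift and a mask per character.
    char operator[](std::size_t i) const
    {
        return toChar(static_cast<std::uint8_t>(bytes_[i / 4] >> (2 * (i % 4))));
    }

    std::size_t size() const { return size_; }
    std::size_t byteCount() const { return bytes_.size(); }

private:
    std::size_t size_;
    std::vector<std::uint8_t> bytes_;
};
```

An 8-character sequence occupies 2 bytes here instead of 8, and every access pays for the extra shift and mask, which is the trade-off described above.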
We as library writers can now
combine these two aspects of
sequences and alphabets with
generic programming and write
algorithms that allow the user to
change the alphabet type and the
string implementation depending on
the user's requirements and get
the best possible implementation
for this case. Because template
specialization allows us to select
the correct implementation
of ordValue(), length() etc. at
*compile time*, we do not need
virtual functions and thus pay no
cost for runtime polymorphism.
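A rough sketch of this kind of compile-time dispatch in plain C++, without SeqAn. The names `SimpleDna`, `ordinal` and `countBases` are hypothetical, and the sketch uses ordinary function overloading rather than explicit template specialization; in both cases the compiler picks the implementation and no virtual call is involved.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical alphabet type that stores the ordinal value directly.
struct SimpleDna
{
    unsigned char value;  // 0 = A, 1 = C, 2 = G, 3 = T
};

// For SimpleDna the ordinal value is already stored.
inline unsigned ordinal(SimpleDna c)
{
    return c.value;
}

// For plain char a mapping is needed (assumes only A, C, G, T occur).
inline unsigned ordinal(char c)
{
    switch (c)
    {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        default:  return 3;  // 'T'
    }
}

// Generic algorithm: the matching ordinal() overload is selected
// at compile time from the container's value type.
template <typename TSequence>
std::vector<std::size_t> countBases(TSequence const & seq)
{
    std::vector<std::size_t> counts(4, 0);
    for (auto const & c : seq)
        ++counts[ordinal(c)];
    return counts;
}
```

The same `countBases` works for a std::string and for a std::vector<SimpleDna>, so the caller can switch the alphabet and container type without touching the algorithm.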
If you want to use the
algorithms in the SeqAn library
then you could benefit from using
SeqAn sequences. However, many
algorithms also work with
std::string and without knowing
your application and code it is
hard to make any promise on
acceleration.
Cheers,
Manuel
From: Bartha Dániel [daniel.bartha@gmail.com]
Sent: Wednesday, August 28, 2013 11:49 AM
To: SeqAn Development
Subject: [Seqan-dev] question about the efficiency of the sequan sequence classes
Hi All,
I have a big question here. I wrote an application that
currently uses my own custom std::string-based
implementation for some DNA mutation stuff. I basically
have to access every single character in the DNA and then
do something with it, but that is not important for the
question.
I am inclined to rewrite the whole app with SeqAn, but it
only makes sense if manipulating and accessing the SeqAn
classes is significantly faster than my own. I read about
the efficiency in the Motivation chapter, but does anybody
have any experience with the concrete speedup that is
possible?