Re: [Seqan-dev] Memory allocation error with globalAlignmentScore() and globalAlignment()

"Weese, David" <weese@campus.fu-berlin.de> · Wed, 21 Aug 2013 21:36:52 +0200

Hi Johannes,

the number of matches are not recorded by any of the edit distance algorithms. As you said, its <length of the alignment> - <edit distance>. As a lower bound you could use max(length(seq1),length(seq2)) - <edit distance> as the alignment is at least as long as each sequence.

I wouldn't say that Myers is faster only for <64bp sequences. In my tests the unbanded Myers including preprocessing was even faster than classical banded DP algorithms (Sellers) for any read length from 5bp-500bp. I aligned with at most 10% errors (band with = 10% of the read length). Myers was only slower on non-DNA (ASCII) reads of <20bp due to a larger preprocessing.

For more details, see pp.116 of http://www.diss.fu-berlin.de/diss/receive/FUDISS_thesis_000000094456.

Cheers,
David

PS: A banded alignment can be used if the number of errors can be restricted (e.g. to at most k). Then only a diagonal band of at most k+1 diagonals (not 2k+1 as often stated elsewhere) needs to be computed. Read http://publications.mi.fu-berlin.de/1225/ for more information.

--
David Weese, Ph.D.              weese@inf.fu-berlin.de
Freie Universität Berlin        http://www.inf.fu-berlin.de/
Institut für Informatik         Phone: +49 30 838 75137
Takustraße 9                    Algorithmic Bioinformatics
14195 Berlin                    Room 020

Am 21.08.2013 um 17:08 schrieb Johannes Dröge <johdro@mpi-inf.mpg.de>:

> I am computing very many nucleotide alignments between regions of reads/contigs and sequenced genomes. Their length varies quite a lot but often aligning regions are below 100 bp long, sometimes a few thousand bp. I do this in parallel in a producer-consumer model. I guess I can replace the alignment code and provide you with timings on a data set that used to run around 20 hours on 20 processors with the old Seqan code and MyersHirschberg algorithm. Once you have found the bug, I can do the comparison.
> 
> It would be possible for me to choose the alignment algorithm depending on the size of the two regions.
> 
> It would be really cool if getAlignmentScore() would also return the number of matches, if, at some point, the algorithm records this value. This score is oftentimes used more than the number of gaps and mismatches (see percentage identity).
> 
> I'm not so familiar with banded alignment. Do you provide a maximum distance for the alignment?
> 
> Gruß Johannes
> 
> 
> Am 21.08.2013 16:26, schrieb Holtgrewe, Manuel:
>> MyersBitVector assumes edit distance, so the result of globalAlignment() is the edid distance.
>> 
>> I do not know whether MyersBitVector is really faster than the new alignment code, it will certainly depend on the length of the sequences. Myers is certainly fast if you have less than 64 characters in each sequence. Otherwise, the theoretic adavantages might be eaten up by emulating longer words.
>> 
>> getAlignmentScore() does not return the alignment length.
>> 
>> If you can band your alignment, using the Needleman-Wunsch will be quite fast.
>> 
>> What exactly is your application? Is the alignment computation on a critical path/in a computational core/kernel?
>> 
>> I would say that one would have to try the options (Myers-Hirschberg after fixing, MyersBitVector, and Needleman-Wunsch) with a real-world data set and see which is fastest. Note that NW is the only implementation that you can band at the moment.
>> 
>> We would be interested in your benchmark results.
>> 
>> Cheers,
>> Manuel
> 
> _______________________________________________
> seqan-dev mailing list
> seqan-dev@lists.fu-berlin.de
> https://lists.fu-berlin.de/listinfo/seqan-dev