I am computing a large number of nucleotide alignments between regions of reads/contigs and sequenced genomes. Their lengths vary quite a lot, but the aligning regions are often below 100 bp, sometimes a few thousand bp. I do this in parallel in a producer-consumer model.

I could replace the alignment code and provide you with timings on a data set that used to run for about 20 hours on 20 processors with the old SeqAn code and the MyersHirschberg algorithm. Once you have found the bug, I can do the comparison. It would also be possible for me to choose the alignment algorithm depending on the sizes of the two regions.

It would be really useful if getAlignmentScore() also returned the number of matches, provided the algorithm records this value at some point. This count is often needed more than the number of gaps and mismatches (see percentage identity).

I'm not so familiar with banded alignment. Do you provide a maximum distance for the alignment?

Regards,
Johannes

On 21.08.2013 16:26, Holtgrewe, Manuel wrote:
MyersBitVector assumes edit distance, so the result of globalAlignment() is the edit distance. I do not know whether MyersBitVector is really faster than the new alignment code; it will certainly depend on the lengths of the sequences. Myers is certainly fast if you have fewer than 64 characters in each sequence. Otherwise, the theoretical advantages might be eaten up by emulating longer words.

getAlignmentScore() does not return the alignment length. If you can band your alignment, Needleman-Wunsch will be quite fast.

What exactly is your application? Is the alignment computation on a critical path/in a computational core/kernel? I would say that one would have to try the options (Myers-Hirschberg after fixing, MyersBitVector, and Needleman-Wunsch) with a real-world data set and see which is fastest. Note that NW is the only implementation that you can band at the moment. We would be interested in your benchmark results.

Cheers,
Manuel
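For context, here is a self-contained sketch of the single-word Myers bit-parallel edit distance the thread refers to — not SeqAn's implementation, just an illustration of why it is so fast when the shorter sequence fits into one 64-bit machine word: each DP column is updated with a constant number of bitwise operations. The `| 1` carry-in on the shifted horizontal vector makes it compute the global (Levenshtein) distance rather than an approximate-search score; `myersEditDistance` is an illustrative name, not a SeqAn function.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Global edit distance via Myers' bit-parallel algorithm.
// Illustrative sketch: requires the shorter sequence to fit in one
// 64-bit word (1..64 characters); no word emulation for longer inputs.
int myersEditDistance(const std::string &a, const std::string &b) {
    // Edit distance is symmetric, so use the shorter string as pattern.
    const std::string &p = a.size() <= b.size() ? a : b;
    const std::string &t = a.size() <= b.size() ? b : a;
    const size_t m = p.size();
    assert(m > 0 && m <= 64);

    // Per-character match masks: bit i set iff p[i] == c.
    uint64_t peq[256] = {0};
    for (size_t i = 0; i < m; ++i)
        peq[(unsigned char)p[i]] |= 1ULL << i;

    uint64_t pv = ~0ULL, mv = 0;  // vertical +1 / -1 delta vectors
    int score = (int)m;           // D(m, 0) = m
    const uint64_t highBit = 1ULL << (m - 1);

    for (char c : t) {
        uint64_t eq = peq[(unsigned char)c];
        uint64_t xv = eq | mv;
        uint64_t xh = (((eq & pv) + pv) ^ pv) | eq;
        uint64_t ph = mv | ~(xh | pv);  // horizontal +1 deltas
        uint64_t mh = pv & xh;          // horizontal -1 deltas
        if (ph & highBit) ++score;      // track D(m, j) incrementally
        if (mh & highBit) --score;
        ph = (ph << 1) | 1;  // carry-in 1: top row grows by 1 per column
        mh <<= 1;
        pv = mh | ~(xv | ph);
        mv = ph & xv;
    }
    return score;  // D(m, n)
}
```

For sequences longer than 64 characters, implementations emulate longer words by chaining several 64-bit blocks, which is exactly the overhead mentioned above.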