[Seqan-dev] end-to-end consensus alignment for short sequences

Matei David <matei@cs.toronto.edu> · Wed, 14 Oct 2015 16:26:22 -0400

Hi,

First of all, thanks for a great library. I have some questions about
using the library in my project (not developing it per se), I hope this
is the right venue for asking.

I would like to do end-to-end consensus alignment between several
DNA sequences. I currently have this implemented as follows:

    for (unsigned i = 0; i < v.size(); ++i)
    {
        seqan::appendRead(store, v[i].data());
        seqan::appendAlignedRead(
            store, i, 0, 0, 0 + (int)v[i].size());
    }
    options.useContigID = true;
    options.usePositions = true;
    seqan::consensusAlignment(store, options);
    // seqan::reAlignment(store, 0, 1, 10, false); // needed?
    result = store.contigStore[0].seq;

This works ok, but it has 2 shortcomings:

1. I noticed after some runs is that the library only processes the
read store if it contains sequences with length greater than 20bp. I'm
not sure if all sequences, or only some of them, need to be of this
length for it to work. If all sequences are shorter than 20bp, it seems
that the consensus output is just the first sequence. So, for instance:

Reads:
A
AA
AA

Consensus alignment:
A
.

So the output is "A", not "AA" as I would expect. Is there a way to
force the consensus algorithm to run regardless of the size of the
sequences? I couldn't find such an option documented anywhere.

As a workaround, I tried to add an "N" * 20 prefix to all sequences,
but now I'm occasionally getting strange results such as:

Reads:
NNNNNNNNNNNNNNNNNNNNTTTTTTTTTTTTTTTTTTTTTAG
NNNNNNNNNNNNNNNNNNNNTTTTTTTTTTTTTTTTTTTAG
NNNNNNNNNNNNNNNNNNNNTTTTTTTTTTTTTTTTTTTTTAG
NNNNNNNNNNNNNNNNNNNNTTTTTTTTTTTTTTTTTTAG

Consensus alignment:
NNNNNNNNNNNNNNNNNNNNTTTTTTTTTTTTTTTTTTTTNNNNNNNNNNNNNNNNNNNN--TTTTTTTTTTTTTTTTTTTAG
........................................TAG
                                        ....................**.....................
                                        ....................TT.....................
                                         ...................**N....................

In case this gets wrapped, the idea is that the output of the consensus
alignment contains (unexpectedly) more than 1 initial run of Ns.

I tried to do reAlignment() with a bandwidth of 20, but it doesn't
always fix it. Is the 20 1- or 2-sided?

Note- if I were to use real bases instead of Ns, then it would become
nontrivial to remove them from the consensus sequence, because they
might get mixed up with the real sequence.

Is there a way to use seqan to do the consensus on short sequences as
well?

2. The second problem I noticed is that the alignments are not always
end-to-end. This is usually seen only when the tails are homopolymers:

Reads:
TTTTTTTTTTTTTTTTTTTTGAA
TTTTTTTTTTTTTTTTTTTGAGA
TTTTTTTTTTTTTTTTTTTTGAGA
TTTTTTTTTTTTTTTTTTTGAGA
TTTTTTTTTTTTTTTTTTTGAGA
TTTTTTTTTTTTTTTTTTTGAGA

Consensus alignment:
TTTTTTTTTTTTTTTTTTTTGAGA
......................A
........................
 .......................
 .......................
 .......................
 .......................

^ - note the spaces here

Here, I would have hoped the consensus to be the sequence which occurs 4
times, with one less T than the in the largest sequence.

As before, I can potentially work around this by adding a common
homopolymer prefix and suffix, then removing it from the output:

Reads:
CCCCCTTTTTTTTTTTTTTTTTTTTGAACCCCC
CCCCCTTTTTTTTTTTTTTTTTTTGAGACCCCC
CCCCCTTTTTTTTTTTTTTTTTTTTGAGACCCCC
CCCCCTTTTTTTTTTTTTTTTTTTGAGACCCCC
CCCCCTTTTTTTTTTTTTTTTTTTGAGACCCCC
CCCCCTTTTTTTTTTTTTTTTTTTGAGACCCCC

Consensus alignment: (note: one less T)
CCCCC-TTTTTTTTTTTTTTTTTTTGAGACCCCC
.....T.....................*......
.....*............................
.....T............................
.....*............................
.....*............................
.....*............................

However, this works ok only if the homopolymer I add doesn't get
mixed up with the rest of the input sequences. I wonder if there is
a way to get seqan to do end-to-end (re-)alignment, with no free end
gaps.

Thanks,
Matei