FU Logo
  • Startseite
  • Kontakt
  • Impressum
  • Home
  • Listenauswahl
  • Anleitungen

Re: [Seqan-dev] bandedChainAlignment default failing due to default k

<-- thread -->
<-- date -->
  • From: Brett Bowman <bnbowman@gmail.com>
  • To: SeqAn Development <seqan-dev@lists.fu-berlin.de>
  • Date: Mon, 8 Jun 2015 10:46:27 -0700
  • Reply-to: SeqAn Development <seqan-dev@lists.fu-berlin.de>
  • Subject: Re: [Seqan-dev] bandedChainAlignment default failing due to default k

If you're asking for the seed chain used used generate the above example, it's given in the above code:
"""
appendValue(seedChain, TSeed( 0,  0, 14));
appendValue(seedChain, TSeed(30, 31, 14));
"""

So the entire SeedChain is just equivalent to
"""
[ (0, 0, 14, 14),  (30, 31, 44, 45) ]
"""

Unless you mean something else by SeedChain?


Also I found another case of this in a larger sequence (over 100bp on each side), such that I don't think it's a simple size-of-the-matrix issue:

Query: 
TAACTCATCATCAGCAATCCGAACAAAACCGCTGAACCGACCCTCGGACAAACAACAACGGAGGAGGAGAGAGGAGAGAGGAGCAGTTCAGGCGGTTGATTACGGAGGATTAGAAATAC
Reference:
ACGCCATAGTGACTGGCGATGCTGTCGGAATGGACGATATCTCTCTCAACAACAACAACGGAGGAGGAGGAAAAGAGAGAGATATCGTCCATTCCGACAGCATCGCCAGTCACTATGGC

With a single matching 19mer seed in the dead center:
Query      50  AACAACAACGGAGGAGGAG  69
Reference  50  AACAACAACGGAGGAGGAG  69

What I except to see is something like this:
      0     .    :    .    :    .    :    .    :    .    : 
        ------TAACTCAT-CATCAGCAAT-CCGAACAAAAC-C--GC-TGA-AC
              |   | || |  | | | | |||  |    | |  ||  || | 
        GCTTGGT---T-ATGC--CGGTACTGCCGGGC----CTCTTGCGGGATAT

     50     .    :    .    :    .    :    .    :    .    : 
        CGACCCTCGGACAAACAACAACGGAGGAGGA-GAGAGGAGAGAG-GA---
        |  | |||  || |||||||||||||||||| || | |||||||  |   
        C-TCTCTC-AAC-AACAACAACGGAGGAGGAGGAAAAGAGAGAGATATCC

    100     .    :    .    :    .    :    .    :    
        -GC-AGTTCAGG--CGGTTGATTACGGAGGATTAGA--AATAC
         || ||   |||  |||   | ||| | | | || |  ||   
        CGCAAG---AGGCCCGG--CAGTACCG-GCA-TA-ACCAA---


Which I can then trim down to the core region I want here:
       0     .    :    .    :    .    :  
        AACAACAACGGAGGAGGA-GAGAGGAGAGAG
        |||||||||||||||||| || | |||||||
        AACAACAACGGAGGAGGAGGAAAAGAGAGAG


But instead, if my bandExtension is below 14 I get this garbage instead:
      0     .    :    .    :    .    :    .    :    .    : 
        -------------------------------------TAACTCATCATCA
                                             || ||| || |||
        GCTTGGTTATGCCGGTACTGCCGGGCCTCTTGCGGGATATCTC-TC-TCA

     50     .    :    .    :    .    :    .    :    .    : 
        GCAATCCGAACAAAACCGCTGAACCGACCCTCGGACAAACAACAACGGAG
         |||  | |||  || ||  | |  |      |||  || ||  |  || 
        ACAA--C-AAC--AA-CG--G-A--G----GAGGAGGAA-AA-GA--GA-

    100     .    :    .    :    .    :    .    :    .    : 
        GAGGAGA----G-AGGAGAGAG---GAGCAGTTCAGGCGGTTGATTACGG
        || || |    | |  |||| |   | |||| | |  |||   || ||  
        GA-GATATCCCGCA--AGAG-GCCCG-GCAG-T-A-CCGG--CATAACCA

    150     .    :     
        AGGATTAGAAATAC
        |             
        A-------------


-Brett

On Wed, Jun 3, 2015 at 8:28 AM, Rahn, René <rene.maerker@fu-berlin.de> wrote:
Hi Brett, 

the k-Parameter specifies the band width around seeds given by the passed seed chain. This is compliant to the banded chain alignment algorithm described in the LAGAN paper.
So instead of taking the seed as is, one can compute a banded DP around the seed to find the optimal trace into and out of the seed.

My best guess, in your example is, that the entire alignment matrix is to small for a band extension of 15. 
Nevertheless it should work, but maybe this is an uncovered border case. Can you please supply the seed chain?
It shouldn’t be to much on these small sequences?

Cheers,

René

On 23 May 2015, at 00:53, Brett Bowman <bnbowman@gmail.com> wrote:

I'm trying to align two highly similar sequences found via Kmer search:
Query = "ATCTCTCTCAACAAAACAACGAGGAGGAGTGAAAAGAGAGAGAT"
Reference = "ATCTCTCTCAACAACAACAACGGAGGAGGAGGAAAAGAGAGAGAT"

The expected alignment looks like this:

Score: 80
      0     .    :    .    :    .    :    .    :    .  
        ATCTCTCTCAACAA-AACAAC-GAGGAGGAGTGAAAAGAGAGAGAT
        |||||||||||||| |||||| ||||||||| ||||||||||||||
        ATCTCTCTCAACAACAACAACGGAGGAGGAG-GAAAAGAGAGAGAT

But when I align it using the default values suggested by the tutorial, it doesn't show any inserted gaps at all, and I wind up with this instead:

Score: 80
      0     .    :    .    :    .    :    .    :     
        ATCTCTCTCAACAAAACAACGAGGAGGAGTGAAAAGAGAGAGAT
        |||||||||||||| |  |   |  |  | | |||         
        ATCTCTCTCAACAACAACAACGGAGGAGGAGGAAAAGAGAGAGA

I finally traced it down to the k-value (bandExtension value) passed into the alignment algorithm - values of K <= 13 succeed and generate the top-most alignment, while the values of 14-15 like the default (15) report the low-quality alignment.  

Yet oddly, both alignments report the correct alignment score at the end - so it's not failing, precisely.  It's just not storing or displaying the correct alignment.

So I have two questions:
1) What exactly does the  k / bandExtension variable do?
2) What is going on here?

My code is pasted below for your use.

Sincerely,
-Brett

"""
     #include <seqan/seeds.h>

     using namespace seqan;

     DnaString query = "ATCTCTCTCAACAAAACAACGAGGAGGAGTGAAAAGAGAGAGAT";
     DnaString ref   = "ATCTCTCTCAACAACAACAACGGAGGAGGAGGAAAAGAGAGAGAT";
 
     typedef Seed<Simple> TSeed;
     String<TSeed> seedChain;
     appendValue(seedChain, TSeed( 0,  0, 14));
     appendValue(seedChain, TSeed(30, 31, 14));
     Score<int, Simple> scoringScheme(2, -1, -2);
 
     Align<DnaString, ArrayGaps> alignment1;
     resize(rows(alignment1), 2);
     assignSource(row(alignment1, 0), query);
     assignSource(row(alignment1, 1), ref);
 
     Align<DnaString, ArrayGaps> alignment2;
     resize(rows(alignment2), 2);
     assignSource(row(alignment2, 0), query);
     assignSource(row(alignment2, 1), ref);
 
     int result1 = bandedChainAlignment(alignment1, seedChain, scoringScheme, 14);
     std::cout << "Score: " << result1 << std::endl;
     std::cout << alignment1 << std::endl;
 
     int result2 = bandedChainAlignment(alignment2, seedChain, scoringScheme, 13);
     std::cout << "Score: " << result2 << std::endl;
     std::cout << alignment2 << std::endl;
"""
_______________________________________________
seqan-dev mailing list
seqan-dev@lists.fu-berlin.de
https://lists.fu-berlin.de/listinfo/seqan-dev

---

René Rahn
Ph.D. Student
--------------------------------
Tel:  (+49) 30 838 75137
Mail: rene.rahn@fu-berlin.de
--------------------------------
Institute of Computer Science
Algorithmic Bioinformatics (ABI)
--------------------------------
Freie Universität Berlin
Takustraße 9
14195 Berlin
--------------------------------


_______________________________________________
seqan-dev mailing list
seqan-dev@lists.fu-berlin.de
https://lists.fu-berlin.de/listinfo/seqan-dev


<-- thread -->
<-- date -->
  • Follow-Ups:
    • Re: [Seqan-dev] bandedChainAlignment default failing due to default k
      • From: Rahn, René <rene.maerker@fu-berlin.de>
  • References:
    • Re: [Seqan-dev] bandedChainAlignment default failing due to default k
      • From: Rahn, René <rene.maerker@fu-berlin.de>
  • seqan-dev - June 2015 - Archives indexes sorted by:
    [ thread ] [ subject ] [ author ] [ date ]
  • Complete archive of the seqan-dev mailing list
  • More info on this list...

Hilfe

  • FAQ
  • Dienstbeschreibung
  • ZEDAT Beratung
  • postmaster@lists.fu-berlin.de

Service-Navigation

  • Startseite
  • Listenauswahl

Einrichtung Mailingliste

  • ZEDAT-Portal
  • Mailinglisten Portal