FU Logo
  • Startseite
  • Kontakt
  • Impressum
  • Home
  • Listenauswahl
  • Anleitungen

Re: [Seqan-dev] Finder and Pattern

<-- thread -->
<-- date -->
  • From: "Weese, David" <weese@campus.fu-berlin.de>
  • To: SeqAn Development <seqan-dev@lists.fu-berlin.de>
  • Date: Thu, 16 May 2013 09:57:32 +0200
  • Reply-to: SeqAn Development <seqan-dev@lists.fu-berlin.de>
  • Subject: Re: [Seqan-dev] Finder and Pattern

Hi Stephen,

mmh, I don't see why it shouldn't work. Maybe you can send us an example dataset and your full source code for debugging.

David

--
David Weese weese@inf.fu-berlin.de
Freie Universität Berlin http://www.inf.fu-berlin.de/
Institut für Informatik Phone: +49 30 838 75137
Takustraße 9 Algorithmic Bioinformatics
14195 Berlin Room 020

Am 14.05.2013 um 13:16 schrieb "Henderson, Stephen" <s.henderson@ucl.ac.uk>:

Hi
I have yet another question. I have used the countKmers function to read through a fastq file and identify over-represented 5mers (e.g. AAAAA and TTTTT commonly) e.g.  std::vector< std::string > kmers

I now want to plot their fraction at positions across the reads so I iterate through the fast records again trying to count any occurence of each of kmer and it's position. This is essentially the last step of the fastQC program (though seqan seems a lot faster) i.e.:


// more above

NumericMatrix kmerMatrix(kmers.size(), numcycles-4);
seqan::Dna5String seq;         

    for( ; seqIt1 != end(seqs); ++seqIt1)
    {
      seq = value(seqIt1);
      Finder<Dna5String> finder(seq);
      
      for(unsigned i=0;  i< kmers.size(); ++i)
      {
       Pattern<Dna5String, Horspool> pattern(kmers[i]);
        while (find(finder, pattern))
            kmerMatrix(i,beginPosition(finder)) += 1;  
      }
    
    }  
  
    return kmerMatrix; // this to be plot in R using ggplot2
}



The above compiles fine and I think looks liek what I want but I get an odd result with a 5mer skip. I see counts for the first kmer (AAAAA) at first position, then 6th, 11th, 16th etc,


The next kmer at 6th,11th, 16th and so on for all the other kmers: 

# this is the c++ function called from within R
> kmerPlotCpp('test3.fastq', as.vector(kmers$kmers))

      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21]
 [1,] 1826    0    0    0    0 1189    0    0    0     0  1403     0     0     0     0  1371     0     0     0     0  1328
 [2,]    0    0    0    0    0  641    0    0    0     0   482     0     0     0     0   504     0     0     0     0   493
 [3,]    0    0    0    0    0  424    0    0    0     0   545     0     0     0     0   545     0     0     0     0   578
... etc etc


I don't understand why beginPosition(finder) is skipping in 5s. Do you?

Also... is this the best approach? Is there a simpler/more efficient way?

thx (Again)
Stephen Henderson
UCL Cancer Institute

ps I have tried to clear(finder) at the end of the outer loop but this compiled and crashed runtime-- strangely?
_______________________________________________
seqan-dev mailing list
seqan-dev@lists.fu-berlin.de
https://lists.fu-berlin.de/listinfo/seqan-dev

<-- thread -->
<-- date -->
  • References:
    • [Seqan-dev] Finder and Pattern
      • From: "Henderson, Stephen" <s.henderson@ucl.ac.uk>
  • seqan-dev - May 2013 - Archives indexes sorted by:
    [ thread ] [ subject ] [ author ] [ date ]
  • Complete archive of the seqan-dev mailing list
  • More info on this list...

Hilfe

  • FAQ
  • Dienstbeschreibung
  • ZEDAT Beratung
  • postmaster@lists.fu-berlin.de

Service-Navigation

  • Startseite
  • Listenauswahl

Einrichtung Mailingliste

  • ZEDAT-Portal
  • Mailinglisten Portal