[Seqan-dev] Finder and Pattern

From: "Henderson, Stephen" <s.henderson@ucl.ac.uk>
To: "seqan-dev@lists.fu-berlin.de" <seqan-dev@lists.fu-berlin.de>
Date: Tue, 14 May 2013 11:16:20 +0000
Reply-to: SeqAn Development <seqan-dev@lists.fu-berlin.de>
Subject: [Seqan-dev] Finder and Pattern

I have yet another question. I have used the countKmers function to read through a fastq file and identify over-represented 5mers (e.g. AAAAA and TTTTT commonly) e.g. std::vector< std::string > kmers

I now want to plot their fraction at positions across the reads so I iterate through the fast records again trying to count any occurence of each of kmer and it's position. This is essentially the last step of the fastQC program (though seqan seems a lot faster) i.e.:

// more above

NumericMatrix kmerMatrix(kmers.size(), numcycles-4);

seqan::Dna5String seq;

for( ; seqIt1 != end(seqs); ++seqIt1)

{

seq = value(seqIt1);

Finder<Dna5String> finder(seq);

for(unsigned i=0; i< kmers.size(); ++i)

{

Pattern<Dna5String, Horspool> pattern(kmers[i]);

while (find(finder, pattern))

kmerMatrix(i,beginPosition(finder)) += 1;

}

return kmerMatrix; // this to be plot in R using ggplot2

}

The above compiles fine and I think looks liek what I want but I get an odd result with a 5mer skip. I see counts for the first kmer (AAAAA) at first position, then 6th, 11th, 16th etc,

The next kmer at 6th,11th, 16th and so on for all the other kmers:

# this is the c++ function called from within R

> kmerPlotCpp('test3.fastq', as.vector(kmers$kmers))

[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21]

[1,] 1826 0 0 0 0 1189 0 0 0 0 1403 0 0 0 0 1371 0 0 0 0 1328

[2,] 0 0 0 0 0 641 0 0 0 0 482 0 0 0 0 504 0 0 0 0 493

[3,] 0 0 0 0 0 424 0 0 0 0 545 0 0 0 0 545 0 0 0 0 578

... etc etc

I don't understand why beginPosition(finder) is skipping in 5s. Do you?

Also... is this the best approach? Is there a simpler/more efficient way?

thx (Again)

Stephen Henderson

UCL Cancer Institute

ps I have tried to clear(finder) at the end of the outer loop but this compiled and crashed runtime-- strangely?

<-- thread -->

<-- date -->

Follow-Ups:
- Re: [Seqan-dev] Finder and Pattern
  - From: "Weese, David" <weese@campus.fu-berlin.de>

seqan-dev - May 2013 - Archives indexes sorted by:
[ thread ] [ subject ] [ author ] [ date ]
Complete archive of the seqan-dev mailing list
More info on this list...