Hi
I have yet another question. I have used the countKmers function to read through a fastq file and identify over-represented 5mers (e.g. AAAAA and TTTTT commonly) e.g. std::vector< std::string > kmers
I now want to plot their fraction at positions across the reads so I iterate through the fast records again trying to count any occurence of each of kmer and it's position. This is essentially the last step of the fastQC program (though seqan seems a lot
faster) i.e.:
// more above
NumericMatrix kmerMatrix(kmers.size(), numcycles-4);
seqan::Dna5String seq;
for( ; seqIt1 != end(seqs); ++seqIt1)
{
seq = value(seqIt1);
Finder<Dna5String> finder(seq);
for(unsigned i=0; i< kmers.size(); ++i)
{
Pattern<Dna5String, Horspool> pattern(kmers[i]);
while (find(finder, pattern))
kmerMatrix(i,beginPosition(finder)) += 1;
}
}
return kmerMatrix; // this to be plot in R using ggplot2
}
The above compiles fine and I think looks liek what I want but I get an odd result with a 5mer skip. I see counts for the first kmer (AAAAA) at first position, then 6th, 11th, 16th etc,
The next kmer at 6th,11th, 16th and so on for all the other kmers:
# this is the c++ function called from within R
> kmerPlotCpp('test3.fastq', as.vector(kmers$kmers))
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21]
[1,] 1826 0 0 0 0 1189 0 0 0 0 1403 0 0 0 0 1371 0 0 0 0 1328
[2,] 0 0 0 0 0 641 0 0 0 0 482 0 0 0 0 504 0 0 0 0 493
[3,] 0 0 0 0 0 424 0 0 0 0 545 0 0 0 0 545 0 0 0 0 578
... etc etc
I don't understand why beginPosition(finder) is skipping in 5s. Do you?
Also... is this the best approach? Is there a simpler/more efficient way?
thx (Again)
Stephen Henderson
UCL Cancer Institute
ps I have tried to clear(finder) at the end of the outer loop but this compiled and crashed runtime-- strangely?
|