Hi,

it all depends on the type of String you read the sequences into with the RecordReader. It is designed to check the SequenceType's alphabet and use that as the criterion, so the following should fail on non-ACGTN characters:

    CharString seqIds;
    Dna5String faSeqs;
    RecordReader<String<char, MMap<> >, DoublePass<Mapped> > refReader(seqMMapString);
    int read2out = read2(seqIds, faSeqs, refReader, Fasta());

but the following should accept any alphabetical character [1]:

    CharString seqIds;
    CharString faSeqs;
    RecordReader<String<char, MMap<> >, DoublePass<Mapped> > refReader(seqMMapString);
    int read2out = read2(seqIds, faSeqs, refReader, Fasta());

This won't convert anything, however. That could be done later, maybe using ModifiedString<> to avoid copying.

Regards,
Hannes

[1] I just checked that because I want to read a sequence including gaps, which fails (because '-' is non-alphabetical; I am working on a patch). However, the mentioned characters Y and M should not be a problem.
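
For the "convert later" step, here is a rough sketch of what mapping non-ACGT characters to N could look like. It uses only the C++ standard library, not SeqAn: std::string stands in for CharString, and toDna5Char / toDna5InPlace are made-up names for illustration. A ModifiedString<>-based variant that avoids the copy would look different.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper (not a SeqAn function): map one character to the
// five-letter alphabet ACGTN. Anything other than A, C, G, T (in either
// case) becomes 'N', including IUPAC codes such as Y and M and the gap '-'.
char toDna5Char(char c)
{
    switch (std::toupper(static_cast<unsigned char>(c)))
    {
        case 'A': return 'A';
        case 'C': return 'C';
        case 'G': return 'G';
        case 'T': return 'T';
        default:  return 'N';
    }
}

// Hypothetical helper: convert a whole sequence in place, so no second
// buffer is allocated for the converted sequence.
void toDna5InPlace(std::string & seq)
{
    std::transform(seq.begin(), seq.end(), seq.begin(), toDna5Char);
}
```

Calling toDna5InPlace() on a sequence read as plain characters would leave only A, C, G, T and N in it, which is the behaviour Bernd asked about in the quoted message.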
On 04/26/2012 06:29 AM, Holtgrewe, Manuel wrote:
If I remember correctly, the <seqan/stream.h> interface will not allow you to read non-ACGTN characters when using read(..., Fasta()) but will return an error value != 0 in this case.

Bernd, what you can do right now is to load your reads into CharStrings and convert them to Dna5.

In the long term, I guess we will need more control for the users over I/O behaviour. At the moment, the assumptions are fairly conservative, e.g. do not allow non-ACTG(N) characters for Dna(5), and are tailored to what you find in NGS reads and whole genome data.

Cheers,
Manuel

________________________________________
From: Weese, David [weese@campus.fu-berlin.de]
Sent: Wednesday, April 25, 2012 10:44 PM
To: SeqAn Development
Subject: Re: [Seqan-dev] reading fasta with non-DNA5 characters

Hi,

actually it should automatically convert every non-ACGT character to N (or A for Dna targets). Have you already tried reading your files into strings over the Dna5 alphabet?

Cheers,
David

--
David Weese                   weese@inf.fu-berlin.de
Freie Universität Berlin      http://www.inf.fu-berlin.de/
Institut für Informatik       Phone: +49 30 838 75246
Takustraße 9                  Algorithmic Bioinformatics
14195 Berlin                  Room 021

On 25.04.2012 at 11:08, Bernd Jagla wrote:

Hi,

I have a couple of genome sequences that contain characters other than ACTGN (i.e. Y, M,...)...

Is there a way to read those sequences in as well and automatically convert those non-conforming letters to N?

Thanks,
Bernd

PS: I am using:

    RecordReader<String<char, MMap<> >, DoublePass<Mapped> > refReader(seqMMapString);
    int read2out = read2(seqIds, faSeqs, refReader, Fasta());

for reading in the data and get an INVALID_FORMAT error...
_______________________________________________
seqan-dev mailing list
seqan-dev@lists.fu-berlin.de
https://lists.fu-berlin.de/listinfo/seqan-dev