Hi Theo,
The FASTQ format is a good example of the file format fuzziness in bioinformatics. There is no real standard, only an article [1] describing what is in use. Its supplemental material is a tar.gz file containing an example which says that different IDs for the sequence and quality meta lines are an error.
That said, what is the source of the FASTQ file? Ignoring the quality meta would not be a big change in the parser, and it is a change I would be quite willing to make, provided that a "major" source of FASTQ generates such files.
In a future version, such things should be configurable when reading FASTQ, but alas we do not currently have the time to make sequence I/O more configurable.
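To make the idea of a configurable check concrete, here is a rough sketch of what the comparison of the two meta lines could look like. It uses std::string rather than SeqAn strings, and QualMetaPolicy and qualMetaIsValid are hypothetical names invented for illustration; none of this is existing SeqAn API.

```cpp
#include <string>

// Hypothetical policy for validating the text after '+' in a FASTQ record.
// These names are assumptions for illustration, not part of SeqAn.
enum class QualMetaPolicy
{
    Strict,       // must be empty or identical to the sequence meta
    PrefixMatch,  // may carry extra whitespace-separated fields after the id
    Ignore        // never compared
};

// Returns true if qualMeta (the text after '+') is acceptable for a record
// whose '@' line carried seqMeta, under the given policy.
inline bool qualMetaIsValid(std::string const & seqMeta,
                            std::string const & qualMeta,
                            QualMetaPolicy policy)
{
    // An empty quality meta is accepted under every policy.
    if (policy == QualMetaPolicy::Ignore || qualMeta.empty())
        return true;

    if (policy == QualMetaPolicy::Strict)
        return qualMeta == seqMeta;

    // PrefixMatch: qualMeta must start with seqMeta ...
    if (qualMeta.compare(0, seqMeta.size(), seqMeta) != 0)
        return false;
    // ... and either end there or continue with a space or tab,
    // so that "read10" is not accepted for "read1".
    if (qualMeta.size() == seqMeta.size())
        return true;
    char const next = qualMeta[seqMeta.size()];
    return next == ' ' || next == '\t';
}
```

With PrefixMatch, a quality meta of "EAS18:1:1:1:1:1119:0/1 ltrim=1 rtrim=38" would be accepted for the sequence meta "EAS18:1:1:1:1:1119:0/1", while Strict reproduces the current behaviour and rejects it.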
HTH,
Manuel
From: Theodore Omtzigt [theo@stillwater-sc.com]
Sent: Wednesday, January 30, 2013 1:31 AM
To: SeqAn Dev List
Subject: Re: [Seqan-dev] Zlib linking errors under 64bit Windows 7

I have got the linking to work outside of the SeqAn CMake build environment, so I have now at least isolated it to be a problem in the SeqAn build environment.
However, I also run into a format error on a FASTQ file that passes with the standard code from Genome Research Lab (kseq.h). The problem occurs in the last test in this code fragment from read_fasta_fastq.h:

    template <typename TIdString, typename TQualString, typename TFile, typename TPass>
    inline int
    _readQualityBlock(TQualString & qual,
                      RecordReader<TFile, TPass > & reader,
                      unsigned const seqLength,
                      TIdString const & meta,
                      Fastq const & /*tag*/)
    {
        // READ AND CHECK QUALITIES' META
        if (atEnd(reader))
            return EOF_BEFORE_SUCCESS;
        if (value(reader) != '+')
            return RecordReader<TFile, TPass >::INVALID_FORMAT;
        goNext(reader);
        if (resultCode(reader))
            return resultCode(reader);
        if (atEnd(reader))  // empty ID, no sequence, this is legal? TODO
            return 0;

        CharString qualmeta_buffer;
        int res = readLine(qualmeta_buffer, reader);
        if (res && res == EOF_BEFORE_SUCCESS)
            return EOF_BEFORE_SUCCESS;
        else if (res)
            return RecordReader<TFile, TPass >::INVALID_FORMAT;

        // meta string has to be empty or identical to sequence's meta
        if ((qualmeta_buffer != "") && (qualmeta_buffer != meta))
            return RecordReader<TFile, TPass >::INVALID_FORMAT;
        ...

The test fails because qualmeta_buffer is not equal to meta:

    + meta {data_begin=0x001757e0 "EAS18:1:1:1:1:1119:0/1ÍÍÍÍÍÍÍÍÍÍÍÍÍÍýýýý««««««««" data_end=0x001757f6 "ÍÍÍÍÍÍÍÍÍÍÍÍÍÍýýýý««««««««" data_capacity=32 } const seqan::String<char,seqan::Alloc<void> > &
    + qualmeta_buffer {data_begin=0x0017dc40 "EAS18:1:1:1:1:1119:0/1 ltrim=1 rtrim=38ÍÍÍÍÍÍÍÍÍÍÍÍÍÍýýýý««««««««þîþîþîþ" data_end=0x0017dc67 "ÍÍÍÍÍÍÍÍÍÍÍÍÍÍýýýý««««««««þîþîþîþ" data_capacity=49 } seqan::String<char,seqan::Alloc<void> >

The section " ltrim=1 rtrim=38" appears to be a format difference that kseq.h accepts as a valid quality segment, but read_fasta_fastq.h does not. So this question has now bifurcated into a second one: what is considered a valid FASTQ format?