Hi Stephen,

a) If you read with double-pass I/O (I recommend using ONLY an MMap String for double-pass I/O because of the memory used for buffering) and read into a ConcatDirect StringSet, then the memory usage should be 1 byte per Dna5Q, including both the character and the quality.

b) Dna5Q uses one byte. If you store the sequence and qualities separately, you need one byte for the character and one byte for the quality, thus twice the amount of memory. Note that your input file has to use PHRED scores.

c) There is a small overhead (bitmasks, bit shifts, lookup tables) for retrieving the character/quality from a Dna5Q character, but if you iterate sequentially over the data, using Dna5Q will most probably be faster since you only need to transfer half of the data from main memory to registers.

d) Not exactly. Dna5Q is one character; String<Dna5Q> is a string of Dna5Q characters. DnaQ (not Dna5Q) stores the character in the first two bits and the quality in the remaining bits. Each alphabet entry can have qualities 0..62. For Dna5Q, the story is a bit more complicated since N automatically implies a quality of 0, so the pair analogy breaks down here.

HTH,
Manuel

________________________________________
From: Henderson, Stephen [s.henderson@ucl.ac.uk]
Sent: Tuesday, April 23, 2013 1:28 PM
To: seqan-dev@lists.fu-berlin.de
Subject: Re: [Seqan-dev] Dna5Q - how do you access qual or seq ? (Leon Kuchenbecker)

Thanks

That is exactly the sort of syntactic headslap I imagined (doh!). Cheers.

While I'm here, just out of curiosity, do you know:

a) The size in RAM a fastq file takes up when read single pass vs double pass (~ per million 75mers)?
b) Do Dna5Q and the alternative of separate seqs and qual CharStrings take up the same space in RAM?
c) Is there any extra overhead in iterating over or extracting elements from Dna5Q as opposed to a simple CharString?
d) Is Dna5Q something like a Vector of <Pair> ???

... anyway thanks again.
Stephen

-----------------------

Hi Stephen!

> //where seqs is the StringSet of Dna5Q read in above
> seqan::Dna5Q seq= seqs[0];
>
> Rcpp::Rcout << getQualityValue(seq[0]) << std::endl; // error
> [...]

seqan::Dna5Q is a type for storing a single character from the ACGTN alphabet together with its associated quality. If you want to store a sequence of such characters, you need String<Dna5Q>. Thus, the correct types for your variables would be

> seqan::StringSet<seqan::CharString> ids;
> seqan::StringSet<seqan::String<seqan::Dna5Q> > seqsQ;

Your usage of getQualityValue() was correct in principle; the only problem was that you were trying to subscript a single character.

Cheers
Leon

_______________________________________________
seqan-dev mailing list
seqan-dev@lists.fu-berlin.de
https://lists.fu-berlin.de/listinfo/seqan-dev
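
To make Manuel's points (b) to (d) concrete, here is a minimal, library-free C++ sketch of the general idea of packing a base and its quality into one byte. It is an illustration only and not SeqAn's actual DnaQ/Dna5Q implementation: the names packBaseQ, baseOf, qualityOf and packRead are hypothetical, the bit layout (low two bits for the base, upper six bits for the quality) is an assumption, and N is deliberately left out because, as noted above, Dna5Q handles it differently. The quality string is assumed to be PHRED+33 ASCII.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Assumed layout (for illustration, not taken from SeqAn's source):
// low 2 bits = index into "ACGT", upper 6 bits = quality, capped at 62.
constexpr unsigned kMaxQuality = 62;

// Pack one base (A, C, G or T) and its quality into a single byte.
inline std::uint8_t packBaseQ(char base, unsigned quality) {
    const std::string alphabet = "ACGT";
    const std::size_t idx = alphabet.find(base);
    if (idx == std::string::npos)
        throw std::invalid_argument("base must be one of A, C, G, T");
    if (quality > kMaxQuality) quality = kMaxQuality;  // clamp to 0..62
    return static_cast<std::uint8_t>((quality << 2) | idx);
}

// Recover the base: a bitmask plus a table lookup.
inline char baseOf(std::uint8_t packed) {
    return "ACGT"[packed & 0x3];
}

// Recover the quality: a single bit shift.
inline unsigned qualityOf(std::uint8_t packed) {
    return static_cast<unsigned>(packed >> 2);
}

// Pack a whole read; qual is assumed to hold valid PHRED+33 characters.
inline std::vector<std::uint8_t> packRead(const std::string& seq,
                                          const std::string& qual) {
    assert(seq.size() == qual.size());
    std::vector<std::uint8_t> out;
    out.reserve(seq.size());
    for (std::size_t i = 0; i < seq.size(); ++i)
        out.push_back(packBaseQ(seq[i], static_cast<unsigned>(qual[i] - 33)));
    return out;
}
```

With a layout like this, a read of 75 bases occupies 75 bytes of payload, so a million 75-mers come to roughly 75 MB before any container overhead, versus roughly 150 MB when bases and qualities are kept in two separate one-byte-per-element strings. The mask and shift in baseOf and qualityOf are the kind of small per-access overhead mentioned in (c).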