Re: [Seqan-dev] {Disarmed} Re: Performance advice for whole genome ESA

John Reid <j.reid@mail.cryst.bbk.ac.uk> · Sat, 07 Jul 2012 09:01:30 +0100

On 06/07/12 18:51, Siragusa, Enrico wrote:
> On 6 Jul 2012, at 14:55, John Reid wrote:
>
>> On 05/07/12 11:40, Siragusa, Enrico wrote:
>>> On Jul 5, 2012, at 11:02 AM, John Reid wrote:
>>>
>>>> Great. That looks very helpful. So in your example, how do you arrive at 38Gb? You are using unsigned int instead of long unsigned int. Where does the unsigned char in the Fibre<>::Type come into the calculation? I think I need my code to handle sequence sets with more than 256 sequences. I'm guessing if I replace the unsigned char  with unsigned long I get back to 48Gb?
>>> I counted 15n bytes: 1+4 bytes for suffix array values, 4 bytes for lcp values and 4 bytes for childtab values. Then for n equals 3Gbp you get roughly 38Gb.
>>> Value sizes really depend on your input sequences. How many strings do you want to index and which is their maximum length?
>>> Do you dispose of enough memory? Depending on your application another index could be more efficient...
>>>
>> I have one more question if you don't mind. What are the constraints on the string sets I can pass to an index for which I have overloaded these types?
>>
>> For example, I think the types in the FibreSA String of Pairs limit the number of sequences and then the number of items in each sequence. But what about the types for the FibreLcp and the FibreChildtab? Do these need to be the same as the second type in the FibreSA Pair or do they relate to something different? Sorry my knowledge of the internals of suffix arrays is not up to speed.
> Here [1] is a fast introduction to the Esa.
>
> Visually the suffix array represent the leaves of the suffix tree. It stores the position of all suffixes of the text sorted in lexicographical order. Then the max value in the suffix array equals the length of the text.
> For a text consisting of a set of strings we build a generalized suffix array containing pairs (i,j) where i is the index of the string in the set and j is the suffix position.
> If you overload the FibreSA with (unsigned char, unsigned int) the Esa will be able to index a set of up to 256 strings, each one long up to 2^32 characters.
> In this way you can index hg18/19 and any other genome containing not more than 256 contigs.
>
> The lcp table stores the longest common prefix of all pairs of adjacent suffixes. Visually lcp values represent edge lengths in the suffix tree.
> The max lcp value corresponds to the length of the longest path from the root to a branching node, which can be at most the length of the text.
> With some tricks we could spare some space here. If you only visit the top of the suffix tree up to depth 20 or so, then all big lcp values > 20 are superfluous.
> Unfortunately all lcp values are needed at construction time to build the childtab, so you cannot simply redefine FibreLcp to use unsigned chars.
> In practice we could spare 9Gb here, but actually in SeqAn there is no easy way to do it...
>
> The childtab stores informations for tree traversal, i.e. it represents parent-siblings-children of each node.
> Childtab values store table intervals, therefore the max value does not exceed the total length of the indexed text. 
> In practice here you need a type able to index the whole array, i.e. unsigned int for big arrays up to 2^32 elements is enough.
>
> If I understood correctly, you want to index a reference genome and not a large set of short sequences.
> If you want to do the latter, then of course you have to use different data types.
> For example, if you want to index up to 16M reads not longer than 256bp you can set:
> FibreSA: (unsigned short, unsigned char)
> FibreLcp: unsigned char
> FibreChildtab: unsigned int
>
> Ciao,
> Enrico
>
> [1] http://theorie.informatik.uni-ulm.de/Personen/mibrahim/TheEnahancedSuffixArray.pdf
>
That's very helpful. Thanks,
John.