Hi John,
On Jul 5, 2012, at 9:46 AM, John Reid wrote:
If you build the Esa as it is, it will consume 4 long ints (32 bytes) per char on a 64-bit machine and take ~96 GB of memory for a 3 Gbp genome.
But you can redefine the index fibres for your needs, e.g. you can replace long ints with ints or chars.
    // TGenome is the type of the sequence, e.g. StringSet<Dna5String>
    typedef StringSet<Dna5String> TGenome;
    typedef Index<TGenome, IndexEsa<> > TGenomeEsa;

    namespace seqan
    {
        template <>
        struct Fibre<TGenomeEsa, FibreSA>
        {
            // Works for up to 256 contigs of length 4Gbp
            typedef String< Pair<unsigned char, unsigned int, Compressed>, DefaultIndexStringSpec<TGenomeEsa>::Type > Type;
            // Use a mmapped string
            // typedef String< Pair<unsigned char, unsigned int, Compressed>, MMap<> > Type;
        };

        template <>
        struct Fibre<TGenomeEsa, FibreLcp>
        {
            typedef String<unsigned int, DefaultIndexStringSpec<TGenomeEsa>::Type > Type;
        };

        template <>
        struct Fibre<TGenomeEsa, FibreChildtab>
        {
            typedef String<unsigned int, DefaultIndexStringSpec<TGenomeEsa>::Type > Type;
        };
    }
In this way your Esa will fit in ~38 GB of memory.
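As a rough check of the two memory figures, the per-character cost can be worked out by hand. The sketch below is plain C++ and does not use SeqAn. The names bytesPerChar and esaBytes are made up for illustration, and it assumes 8-byte long ints, 4-byte ints, a packed 5-byte suffix array entry, and a genome of exactly 3 Gbp.

```cpp
#include <cstdint>

// Bytes stored per genome character by the three Esa tables
// (suffix array, LCP table, child table). Hypothetical helper.
constexpr std::uint64_t bytesPerChar(std::uint64_t saEntry,
                                     std::uint64_t lcpEntry,
                                     std::uint64_t childEntry)
{
    return saEntry + lcpEntry + childEntry;
}

// Default layout (assumed): the suffix array entry is a pair of two
// 8-byte integers; LCP and child table entries take 8 bytes each.
constexpr std::uint64_t defaultBytesPerChar = bytesPerChar(2 * 8, 8, 8);

// Compact layout: a packed (unsigned char, unsigned int) pair takes
// 1 + 4 bytes; LCP and child table entries take 4 bytes each.
constexpr std::uint64_t compactBytesPerChar = bytesPerChar(1 + 4, 4, 4);

// Total size of the three tables in bytes for a genome of the given length.
constexpr std::uint64_t esaBytes(std::uint64_t genomeLength,
                                 std::uint64_t perChar)
{
    return genomeLength * perChar;
}

// The "4 long ints per char" figure corresponds to 32 bytes per char.
static_assert(defaultBytesPerChar == 32, "4 long ints of 8 bytes each");
static_assert(compactBytesPerChar == 13, "5 + 4 + 4 bytes");
```

Calling esaBytes(3000000000ULL, defaultBytesPerChar) gives 96 * 10^9 bytes, and esaBytes(3000000000ULL, compactBytesPerChar) gives 39 * 10^9 bytes, which is close to the estimates quoted here. The text itself is not counted.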
You might want to try out mmapped strings depending on your memory requirements and the access pattern of your algorithm.
You can also try to redefine the Size and StringSetLimits metafunctions for your sequence types.
    namespace seqan
    {
        template <>
        struct Size<Dna5String>
        {
            typedef unsigned int Type;
        };

        template <>
        struct StringSetLimits<TGenome>
        {
            typedef String<unsigned char> Type;
        };
    }
Please overload metafunctions only in your applications, not in library modules!
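Since the smaller types silently overflow if the input is too large, it may be worth checking the contig lengths before building the index. This is only a sketch in plain C++: fitsCompactEsa is a hypothetical helper, not a SeqAn function, and it assumes that unsigned int is 32 bits wide. The last check is also an assumption on my side, namely that LCP and child table entries must be able to address the whole suffix array.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical helper, not part of SeqAn: returns true if the given contig
// lengths fit the compact types (unsigned char sequence number, 32-bit
// unsigned offsets and table entries).
inline bool fitsCompactEsa(const std::vector<std::uint64_t>& contigLengths)
{
    // Sequence numbers are stored in an unsigned char: at most 256 contigs.
    const std::size_t maxContigs =
        static_cast<std::size_t>(std::numeric_limits<unsigned char>::max()) + 1;
    // Offsets within a contig are stored in a 32-bit unsigned int (assumed width).
    const std::uint64_t maxLength = std::numeric_limits<std::uint32_t>::max();

    if (contigLengths.size() > maxContigs)
        return false;

    std::uint64_t total = 0;
    for (std::uint64_t length : contigLengths)
    {
        if (length > maxLength)
            return false;
        total += length; // cannot overflow: at most 256 values below 2^32
    }

    // Assumption: LCP and child table entries index the whole suffix array,
    // so the total length must fit in 32 bits as well.
    return total <= maxLength;
}
```

For example, fitsCompactEsa({1000, 2000}) returns true, while 257 contigs or a single contig of 5 Gbp are rejected.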
Ciao,
Enrico