Re: [Seqan-dev] Trouble Hashing Kmers into Qgram Index
- From: "Siragusa, Enrico" <Enrico.Siragusa@fu-berlin.de>
- To: SeqAn Development <seqan-dev@lists.fu-berlin.de>
- Date: Sun, 23 Nov 2014 14:21:49 +0100
- Reply-to: SeqAn Development <seqan-dev@lists.fu-berlin.de>
- Subject: Re: [Seqan-dev] Trouble Hashing Kmers into Qgram Index
Hi Brett,
This traversal works on a QGram Index too! For example, replace the TIndex definition at
https://github.com/seqan/seqan/blob/develop/core/demos/index/index_iterator.cpp#L10 with:
typedef Index<CharString, IndexQGram<UngappedShape<3>
> > TIndex;
and the demo will iterate all text q-grams up to length 3 in lexicographic order. Note that the basic QGram Index is practically limited to short q-qgrams, as it relies on a direct addressing hash table of size O(a^q), where a is the alphabet
size. For instance, a 14-Gram Index over the Dna alphabet requires >= 1GB of memory.
Alternatively, you can index Dna up to 31-grams using an OpenAddressing QGram Index:
typedef Index<DnaString, IndexQGram<UngappedShape<q>, OpenAddressing> >
TIndex;
but the above top-down iteration won't work here, as the open addressing hash table maintains the q-grams unsorted. You can iterate all text q-grams in random order by accessing the internal BucketMap Fibre of the OpenAddressing QGram Index. The above
demo becomes:
int main ()
{
typedef
Index<CharString,
IndexQGram<UngappedShape<3>,
OpenAddressing> > TIndex;
typedef
typename
Fibre<TIndex,
FibreBucketMap>::Type
const TBucketMap;
typedef
typename
Size<TBucketMap>::Type TSize;
typedef
typename
Spec<TBucketMap>::Type THash;
TIndex index("tobeornottobe");
indexCreate(index,
FibreSADir());
TBucketMap & bucketMap =
getFibre(index,
FibreBucketMap());
CharString qgram;
for (TSize i =
0; i <
length(bucketMap.qgramCode); ++i)
{
THash hash = bucketMap.qgramCode[i];
if (hash ==
TBucketMap::EMPTY)
continue;
unhash(qgram, hash,
3);
std::cout << qgram <<
std::endl;
}
return
0;
}
I think that QGram Indices should provide a generic iterator over the text q-grams - this looks like an important use case that nobody encountered so far...
You could remove certain q-grams after construction, but this requires manipulating the internal QGram Index Fibres yourself, as there’s no public interface for doing it.
In the basic QGram Index this is not easy/convenient, as the q-gram occurrences have to be removed from the SA Fibre (i.e. the partial suffix array).
Otherwise, you can prune q-grams on the fly during the top-down traversal - the function countOccurrences(it) returns the number of occurrences of the current q-gram.
In the OpenAddressing QGram Index, as far as I know, you should be able to set some hash values to EMPTY.
|
- References:
- [Seqan-dev] Trouble Hashing Kmers into Qgram Index
- From: Brett Bowman <bnbowman@gmail.com>
- Re: [Seqan-dev] Trouble Hashing Kmers into Qgram Index
- From: Rahn, René <rene.maerker@fu-berlin.de>
- Re: [Seqan-dev] Trouble Hashing Kmers into Qgram Index
- From: Brett Bowman <bnbowman@gmail.com>
- [Seqan-dev] Trouble Hashing Kmers into Qgram Index
-
seqan-dev - November 2014 - Archives indexes sorted by:
[ thread ] [ subject ] [ author ] [ date ] - Complete archive of the seqan-dev mailing list
- More info on this list...