Hi Brett,
This traversal works on a QGram Index too! For example, replace the TIndex definition at
https://github.com/seqan/seqan/blob/develop/core/demos/index/index_iterator.cpp#L10 with:
typedef Index<CharString, IndexQGram<UngappedShape<3>
> > TIndex;
and the demo will iterate all text q-grams up to length 3 in lexicographic order. Note that the basic QGram Index is practically limited to short q-qgrams, as it relies on a direct addressing hash table of size O(a^q), where a is the alphabet
size. For instance, a 14-Gram Index over the Dna alphabet requires >= 1GB of memory.
Alternatively, you can index Dna up to 31-grams using an OpenAddressing QGram Index:
typedef Index<DnaString, IndexQGram<UngappedShape<q>, OpenAddressing> >
TIndex;
but the above top-down iteration won't work here, as the open addressing hash table maintains the q-grams unsorted. You can iterate all text q-grams in random order by accessing the internal BucketMap Fibre of the OpenAddressing QGram Index. The above
demo becomes:
int main ()
{
typedef
Index<CharString,
IndexQGram<UngappedShape<3>,
OpenAddressing> > TIndex;
typedef
typename
Fibre<TIndex,
FibreBucketMap>::Type
const TBucketMap;
typedef
typename
Size<TBucketMap>::Type TSize;
typedef
typename
Spec<TBucketMap>::Type THash;
TIndex index("tobeornottobe");
indexCreate(index,
FibreSADir());
TBucketMap & bucketMap =
getFibre(index,
FibreBucketMap());
CharString qgram;
for (TSize i =
0; i <
length(bucketMap.qgramCode); ++i)
{
THash hash = bucketMap.qgramCode[i];
if (hash ==
TBucketMap::EMPTY)
continue;
unhash(qgram, hash,
3);
std::cout << qgram <<
std::endl;
}
return
0;
}
I think that QGram Indices should provide a generic iterator over the text q-grams - this looks like an important use case that nobody encountered so far...
You could remove certain q-grams after construction, but this requires manipulating the internal QGram Index Fibres yourself, as there’s no public interface for doing it.
In the basic QGram Index this is not easy/convenient, as the q-gram occurrences have to be removed from the SA Fibre (i.e. the partial suffix array).
Otherwise, you can prune q-grams on the fly during the top-down traversal - the function countOccurrences(it) returns the number of occurrences of the current q-gram.
In the OpenAddressing QGram Index, as far as I know, you should be able to set some hash values to EMPTY.
|