Re: [Seqan-dev] Trouble Hashing Kmers into Qgram Index

From: "Siragusa, Enrico" <Enrico.Siragusa@fu-berlin.de>
To: SeqAn Development <seqan-dev@lists.fu-berlin.de>
Date: Sun, 23 Nov 2014 14:21:49 +0100
Reply-to: SeqAn Development <seqan-dev@lists.fu-berlin.de>
Subject: Re: [Seqan-dev] Trouble Hashing Kmers into Qgram Index

Hi Brett,

On Nov 21, 2014, at 2:05 AM, Brett Bowman <bnbowman@gmail.com> wrote:

Thanks Rene, those solutions worked for me.

I have two follow-up Qgram Index questions:

1) Is there any way to iterate over all of the Kmers stored in a Qgram index? You have sample code for the other index types here (http://seqan.readthedocs.org/en/master/Tutorial/IndexIterators.html), but it doesn't seem to work for me for the QgramIndx:

"""

typedef Iterator<TIndex, TopDown<ParentLinks<>>> TIter;

TIter::Type it(refSetIndex);

do {

// Print theletters from the root to the current node

std::cout << representative(it) << std::endl;

if (!goDown(it) && !goRight(it))

while (goUp(it) && !goRight(it));

} while (!isRoot(it));

'"""

This traversal works on a QGram Index too! For example, replace the TIndex definition at https://github.com/seqan/seqan/blob/develop/core/demos/index/index_iterator.cpp#L10 with:

    typedef Index<CharString, IndexQGram<UngappedShape<3>
 > > TIndex;

and the demo will iterate all text q-grams up to length 3 in lexicographic order. Note that the basic QGram Index is practically limited to short q-qgrams, as it relies on a direct addressing hash table of size O(a^q), where a is the alphabet size. For instance, a 14-Gram Index over the Dna alphabet requires >= 1GB of memory.

Alternatively, you can index Dna up to 31-grams using an OpenAddressing QGram Index:

    typedef Index<DnaString, IndexQGram<UngappedShape<q>, OpenAddressing> >
 TIndex;

but the above top-down iteration won't work here, as the open addressing hash table maintains the q-grams unsorted. You can iterate all text q-grams in random order by accessing the internal BucketMap Fibre of the OpenAddressing QGram Index. The above demo becomes:

int main ()

{

   
typedef
Index<CharString,
IndexQGram<UngappedShape<3>,
OpenAddressing> > TIndex;

    
typedef 
typename 
Fibre<TIndex,

FibreBucketMap>::Type

const               TBucketMap;

    
typedef 
typename 
Size<TBucketMap>::Type                                  TSize;

    
typedef 
typename 
Spec<TBucketMap>::Type                                  THash;

   
TIndex index("tobeornottobe");

    
indexCreate(index, 
FibreSADir());

    TBucketMap & bucketMap =

getFibre(index, 
FibreBucketMap());

    
CharString qgram;

    
for (TSize i =

0; i < 
length(bucketMap.qgramCode); ++i)

{

        
THash hash = bucketMap.qgramCode[i];

        
if (hash == 
TBucketMap::EMPTY)

continue;

        
unhash(qgram, hash, 
3);

        
std::cout << qgram <<

std::endl;

}

    
return 
0;

}

I think that QGram Indices should provide a generic iterator over the text q-grams - this looks like an important use case that nobody encountered so far...

2) Is there any way to mask or remove certain Kmers from an Index once it's been created? I.e., to hide or delete common, low-complexity Kmers like homopolymers

You could remove certain q-grams after construction, but this requires manipulating the internal QGram Index Fibres yourself, as there’s no public interface for doing it.

In the basic QGram Index this is not easy/convenient, as the q-gram occurrences have to be removed from the SA Fibre (i.e. the partial suffix array).

Otherwise, you can prune q-grams on the fly during the top-down traversal - the function countOccurrences(it) returns the number of occurrences of the current q-gram.

In the OpenAddressing QGram Index, as far as I know, you should be able to set some hash values to EMPTY.

-Brett

On Wed, Nov 19, 2014 at 4:25 AM, Rahn, René <rene.maerker@fu-berlin.de> wrote:

Hey,

the problem lies in the hash(indexShape(index), kmer) function.

The hash function only takes an iterator or pointer as second argument.

So in your case the simple solution would be to call the function hash(indexShape(index), it), while using an iterator to iterate over the text instead of using the unsigned integer. This little behaviour does not seem obvious, as in the tutorial you pass a string literal. But in fact a string literal is just a const char *, so it works.

Here you see an example code to work with a) using the finder interface and b) directly operating on the index.

IHTH.

cheers,

René

typedef Index<DnaString, IndexQGram<UngappedShape<12> > > TQGramIndex;

typedef Finder<TQGramIndex> TFinder;

typedef Iterator<DnaString, Standard>::Type TIterator;

typedef Fibre<TQGramIndex, QGramShape>::Type TShape;

TQGramIndex index(ref);

indexRequire(index, QGramSADir()); // On-demand index creation.

// a) Using the finder interface.

TFinder finder(index);

for (TIterator it = begin(query, Standard()); it != end(query, Standard()) - 12; ++it)

{

std::cout << "Occ at: ";

while(find(finder, infix(query, it, it+12)))

std::cout << position(finder) << " ";

std::cout << std::endl;

clear(finder); // Clear finder for next search.

}

// b) Using the index interface.

TShape & shape = indexShape(index);

hashInit(shape, begin(query, Standard()));

for (TIterator it = begin(query, Standard()); it != end(query, Standard()) - 12; ++it)

{

std::cout << "Occ at: ";

hashNext(shape, it);

for (unsigned i = 0; i < length(getOccurrences(index, shape)); ++i)

std::cout << getOccurrences(index, shape)[i] << " ";

std::cout << std::endl;

}

On 19 Nov 2014, at 07:34, Brett Bowman <bnbowman@gmail.com> wrote:

Sorry to be a bother, but I decided to go back to the docs and start fresh and I'm still having issues.

We're using Qgrams, so I trying to write a simple loop to search for all 12mers from a query sequence "seq" in the index formed from a StringSet of references "refSeq", building on the example code in the Q-Gram Index tutorial:

"""

TFinder qgramFinder(index);

for (size_t i = 0; i < length(seq)-12; i++)

{

TInfix::Type kmer = infix(seq, i, i+12);

std::cout << kmer << std::endl;

seqan::hash(indexShape(index), kmer);

for (unsigned i = 0; i < length(getOccurrences(index, indexShape(index))); ++i)

{

std::cout << getOccurrences(index, indexShape(index))[i] << std::endl;

}

std::cout << std::endl;

}

"""

but I'm getting a fairly obtuse error message I don't quite understand, and doesn't point me back to any specific lines in my code.

"""

/usr/include/seqan/index/shape_base.h:535:55: error: indirection requires pointer operand ('seqan::Segment<const seqan::String<seqan::SimpleType<unsigned

char, seqan::Dna5_>, seqan::Alloc<void> >, seqan::InfixSegment>' invalid)

hash * ValueSize<TValue>::VALUE + ordValue((TValue)*it),

^~~

/usr/include/seqan/index/shape_base.h:548:22: note: in instantiation of function template specialization 'seqan::_hashFixedShape<unsigned long,

seqan::SimpleType<unsigned char, seqan::Dna5_>, seqan::Segment<const seqan::String<seqan::SimpleType<unsigned char, seqan::Dna5_>, seqan::Alloc<void> >,

seqan::InfixSegment>, 12>' requested here

return me.hValue = _hashFixedShape(me.hValue, it, TValue(), UngappedShape<q>());

^

/home/bbowman/git/SRSLI/src/C++/SparseAlignment.hpp:102:16: note: in instantiation of function template specialization 'seqan::hash<seqan::SimpleType<unsigned

char, seqan::Dna5_>, 12, seqan::Segment<const seqan::String<seqan::SimpleType<unsigned char, seqan::Dna5_>, seqan::Alloc<void> >, seqan::InfixSegment>

>' requested here

seqan::hash(indexShape(index), kmer);

^

/home/bbowman/git/SRSLI/src/C++/main.cpp:86:9: note: in instantiation of function template specialization 'FindSeeds2<FindSeedsConfig<12,

seqan::UngappedShape<12>, seqan::IndexQGram<seqan::UngappedShape<12>, seqan::Tag<seqan::Default_> > > >' requested here

FindSeeds2(querySeedHits, refSetIndex, idxAndRecord.second.Seq);

"""

Suggestions?

Sincerely,

Brett

_______________________________________________
seqan-dev mailing list
seqan-dev@lists.fu-berlin.de
https://lists.fu-berlin.de/listinfo/seqan-dev

---

René Rahn

Ph.D. Student

--------------------------------

Tel: (+49) 30 838 75277

Mail: rene.rahn@fu-berlin.de

--------------------------------

Institute of Computer Science

Algorithmic Bioinformatics (ABI)

--------------------------------

Freie Universität Berlin

Takustraße 9

14195 Berlin

--------------------------------

_______________________________________________
seqan-dev mailing list
seqan-dev@lists.fu-berlin.de
https://lists.fu-berlin.de/listinfo/seqan-dev

_______________________________________________
seqan-dev mailing list
seqan-dev@lists.fu-berlin.de
https://lists.fu-berlin.de/listinfo/seqan-dev

<-- thread -->

<-- date -->

References:
- [Seqan-dev] Trouble Hashing Kmers into Qgram Index
  - From: Brett Bowman <bnbowman@gmail.com>
- Re: [Seqan-dev] Trouble Hashing Kmers into Qgram Index
  - From: Rahn, René <rene.maerker@fu-berlin.de>
- Re: [Seqan-dev] Trouble Hashing Kmers into Qgram Index
  - From: Brett Bowman <bnbowman@gmail.com>

seqan-dev - November 2014 - Archives indexes sorted by:
[ thread ] [ subject ] [ author ] [ date ]
Complete archive of the seqan-dev mailing list
More info on this list...