FU Logo
  • Startseite
  • Kontakt
  • Impressum
  • Home
  • Listenauswahl
  • Anleitungen

Re: [Seqan-dev] Trouble Hashing Kmers into Qgram Index

<-- thread -->
<-- date -->
  • From: "Siragusa, Enrico" <Enrico.Siragusa@fu-berlin.de>
  • To: SeqAn Development <seqan-dev@lists.fu-berlin.de>
  • Date: Sun, 23 Nov 2014 14:21:49 +0100
  • Reply-to: SeqAn Development <seqan-dev@lists.fu-berlin.de>
  • Subject: Re: [Seqan-dev] Trouble Hashing Kmers into Qgram Index

Hi Brett,

On Nov 21, 2014, at 2:05 AM, Brett Bowman <bnbowman@gmail.com> wrote:

Thanks Rene, those solutions worked for me.

I have two follow-up Qgram Index questions:

1) Is there any way to iterate over all of the Kmers stored in a Qgram index?  You have sample code for the other index types here (http://seqan.readthedocs.org/en/master/Tutorial/IndexIterators.html), but it doesn't seem to work for me for the QgramIndx:
"""
 typedef Iterator<TIndex, TopDown<ParentLinks<>>> TIter;

 TIter::Type it(refSetIndex);
 do {
     // Print theletters from the root to the current node
     std::cout << representative(it) << std::endl;

     if (!goDown(it) && !goRight(it))
         while (goUp(it) && !goRight(it));
 } while (!isRoot(it));
'"""


This traversal works on a QGram Index too! For example, replace the TIndex definition at https://github.com/seqan/seqan/blob/develop/core/demos/index/index_iterator.cpp#L10 with:

    typedef Index<CharString, IndexQGram<UngappedShape<3> > > TIndex;

and the demo will iterate all text q-grams up to length 3 in lexicographic order. Note that the basic QGram Index is practically limited to short q-qgrams, as it relies on a direct addressing hash table of size O(a^q), where a is the alphabet size. For instance, a 14-Gram Index over the Dna alphabet requires >= 1GB of memory.


Alternatively, you can index Dna up to 31-grams using an OpenAddressing QGram Index:

    typedef Index<DnaString, IndexQGram<UngappedShape<q>, OpenAddressing> > TIndex;

but the above top-down iteration won't work here, as the open addressing hash table maintains the q-grams unsorted. You can iterate all text q-grams in random order by accessing the internal BucketMap Fibre of the OpenAddressing QGram Index. The above demo becomes:

int main ()
{
    typedef Index<CharString, IndexQGram<UngappedShape<3>, OpenAddressing> > TIndex;
    typedef typename Fibre<TIndex, FibreBucketMap>::Type const               TBucketMap;
    typedef typename Size<TBucketMap>::Type                                  TSize;
    typedef typename Spec<TBucketMap>::Type                                  THash;

    TIndex index("tobeornottobe");
    indexCreate(index, FibreSADir());

    TBucketMap & bucketMap = getFibre(index, FibreBucketMap());

    CharString qgram;
    for (TSize i = 0; i < length(bucketMap.qgramCode); ++i)
    {
        THash hash = bucketMap.qgramCode[i];

        if (hash == TBucketMap::EMPTY) continue;

        unhash(qgram, hash, 3);

        std::cout << qgram << std::endl;
    }

    return 0;
}

I think that QGram Indices should provide a generic iterator over the text q-grams - this looks like an important use case that nobody encountered so far...


2) Is there any way to mask or remove certain Kmers from an Index once it's been created?  I.e., to hide or delete common, low-complexity Kmers like homopolymers


You could remove certain q-grams after construction, but this requires manipulating the internal QGram Index Fibres yourself, as there’s no public interface for doing it.

In the basic QGram Index this is not easy/convenient, as the q-gram occurrences have to be removed from the SA Fibre (i.e. the partial suffix array).
Otherwise, you can prune q-grams on the fly during the top-down traversal - the function countOccurrences(it) returns the number of occurrences of the current q-gram.

In the OpenAddressing QGram Index, as far as I know, you should be able to set some hash values to EMPTY.


-Brett




On Wed, Nov 19, 2014 at 4:25 AM, Rahn, René <rene.maerker@fu-berlin.de> wrote:
Hey,

the problem lies in the hash(indexShape(index), kmer) function.
The hash function only takes an iterator or pointer as second argument. 
So in your case the simple solution would be to call the function hash(indexShape(index), it), while using an iterator to iterate over the text instead of using the unsigned integer. This little behaviour does not seem obvious, as in the tutorial you pass a string literal. But in fact a string literal is just a const char *, so it works.

Here you see an example code to work with a) using the finder interface and b) directly operating on the index.
IHTH.

cheers,

René 

    typedef Index<DnaString, IndexQGram<UngappedShape<12> > > TQGramIndex;
    typedef Finder<TQGramIndex> TFinder;
    typedef Iterator<DnaString, Standard>::Type TIterator;
    typedef Fibre<TQGramIndex, QGramShape>::Type TShape;

    TQGramIndex index(ref);
    indexRequire(index, QGramSADir());  // On-demand index creation.

    // a) Using the finder interface.
    TFinder finder(index);

    for (TIterator it = begin(query, Standard()); it != end(query, Standard()) - 12; ++it)
    {
        std::cout << "Occ at: ";
        while(find(finder, infix(query, it, it+12)))
            std::cout << position(finder) << " ";
        std::cout << std::endl;
        clear(finder);  // Clear finder for next search.
    }

    // b) Using the index interface.
    TShape & shape = indexShape(index);
    hashInit(shape, begin(query, Standard()));
    for (TIterator it = begin(query, Standard()); it != end(query, Standard()) - 12; ++it)
    {
        std::cout << "Occ at: ";
        hashNext(shape, it);
        for (unsigned i = 0; i < length(getOccurrences(index, shape)); ++i)
            std::cout << getOccurrences(index, shape)[i] << " ";
        std::cout << std::endl;

    }

On 19 Nov 2014, at 07:34, Brett Bowman <bnbowman@gmail.com> wrote:

Sorry to be a bother, but I decided to go back to the docs and start fresh and I'm still having issues.  

We're using Qgrams, so I trying to write a simple loop to search for all 12mers from a query sequence "seq" in the index formed from a StringSet of references "refSeq", building on the example code in the Q-Gram Index tutorial:
"""
  TFinder qgramFinder(index);
 for (size_t i = 0; i < length(seq)-12; i++)
 {
     TInfix::Type kmer = infix(seq, i, i+12);
     std::cout << kmer << std::endl;

     seqan::hash(indexShape(index), kmer);
     for (unsigned i = 0; i < length(getOccurrences(index, indexShape(index))); ++i)
     {
         std::cout << getOccurrences(index, indexShape(index))[i] << std::endl;
     }
     std::cout << std::endl;
 }
"""

but I'm getting a fairly obtuse error message I don't quite understand, and doesn't point me back to any specific lines in my code.

"""
/usr/include/seqan/index/shape_base.h:535:55: error: indirection requires pointer operand ('seqan::Segment<const seqan::String<seqan::SimpleType<unsigned
      char, seqan::Dna5_>, seqan::Alloc<void> >, seqan::InfixSegment>' invalid)
                        hash * ValueSize<TValue>::VALUE + ordValue((TValue)*it),
                                                                           ^~~
/usr/include/seqan/index/shape_base.h:548:22: note: in instantiation of function template specialization 'seqan::_hashFixedShape<unsigned long,
      seqan::SimpleType<unsigned char, seqan::Dna5_>, seqan::Segment<const seqan::String<seqan::SimpleType<unsigned char, seqan::Dna5_>, seqan::Alloc<void> >,
      seqan::InfixSegment>, 12>' requested here
                return me.hValue = _hashFixedShape(me.hValue, it, TValue(), UngappedShape<q>());
                                   ^
/home/bbowman/git/SRSLI/src/C++/SparseAlignment.hpp:102:16: note: in instantiation of function template specialization 'seqan::hash<seqan::SimpleType<unsigned
      char, seqan::Dna5_>, 12, seqan::Segment<const seqan::String<seqan::SimpleType<unsigned char, seqan::Dna5_>, seqan::Alloc<void> >, seqan::InfixSegment>
      >' requested here
        seqan::hash(indexShape(index), kmer);
               ^
/home/bbowman/git/SRSLI/src/C++/main.cpp:86:9: note: in instantiation of function template specialization 'FindSeeds2<FindSeedsConfig<12,
      seqan::UngappedShape<12>, seqan::IndexQGram<seqan::UngappedShape<12>, seqan::Tag<seqan::Default_> > > >' requested here
        FindSeeds2(querySeedHits, refSetIndex, idxAndRecord.second.Seq);
"""

Suggestions?

Sincerely,
Brett
_______________________________________________
seqan-dev mailing list
seqan-dev@lists.fu-berlin.de
https://lists.fu-berlin.de/listinfo/seqan-dev

---

René Rahn
Ph.D. Student
--------------------------------
Tel:  (+49) 30 838 75277
Mail: rene.rahn@fu-berlin.de
--------------------------------
Institute of Computer Science
Algorithmic Bioinformatics (ABI)
--------------------------------
Freie Universität Berlin
Takustraße 9
14195 Berlin
--------------------------------


_______________________________________________
seqan-dev mailing list
seqan-dev@lists.fu-berlin.de
https://lists.fu-berlin.de/listinfo/seqan-dev


_______________________________________________
seqan-dev mailing list
seqan-dev@lists.fu-berlin.de
https://lists.fu-berlin.de/listinfo/seqan-dev

<-- thread -->
<-- date -->
  • References:
    • [Seqan-dev] Trouble Hashing Kmers into Qgram Index
      • From: Brett Bowman <bnbowman@gmail.com>
    • Re: [Seqan-dev] Trouble Hashing Kmers into Qgram Index
      • From: Rahn, René <rene.maerker@fu-berlin.de>
    • Re: [Seqan-dev] Trouble Hashing Kmers into Qgram Index
      • From: Brett Bowman <bnbowman@gmail.com>
  • seqan-dev - November 2014 - Archives indexes sorted by:
    [ thread ] [ subject ] [ author ] [ date ]
  • Complete archive of the seqan-dev mailing list
  • More info on this list...

Hilfe

  • FAQ
  • Dienstbeschreibung
  • ZEDAT Beratung
  • postmaster@lists.fu-berlin.de

Service-Navigation

  • Startseite
  • Listenauswahl

Einrichtung Mailingliste

  • ZEDAT-Portal
  • Mailinglisten Portal