[Seqan-dev] Parallelization of BGZF

"Holtgrewe, Manuel" <manuel.holtgrewe@fu-berlin.de> · Mon, 18 Jun 2012 20:31:39 +0000

Hi Neil,

I have extracted the parts of SeqAn that are relevant for parallelizing BGZF I/O.

The BGZF format (described on page 8 of http://samtools.sourceforge.net/SAM1.pdf) is basically a sequence of zlib-compressed blocks. Using blocks creates some interesting possibilities, e.g. jumping to positions in such files and compressing and decompressing them in parallel: many blocks can be read at the same time, provided that they are delivered to the user in the order they have in the file.
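
To illustrate the idea (this is not the SeqAn code), here is a minimal Python sketch. It assumes the block layout from the SAM specification: each block is a gzip member whose extra field holds a "BC" subfield with the total block size minus one. The names bgzf_block, split_blocks, inflate_block and parallel_decompress are made up for this example, and split_blocks assumes that "BC" is the only extra subfield, which holds for blocks written by bgzf_block but is not guaranteed for arbitrary files.

```python
import struct
import zlib
from concurrent.futures import ThreadPoolExecutor

HEADER_FMT = "<BBBBIBBHBBHH"              # fixed block header, BC subfield only
HEADER_LEN = struct.calcsize(HEADER_FMT)  # 18 bytes
MAX_PAYLOAD = 0xFF00  # uncompressed bytes per block (an assumption of this sketch)


def bgzf_block(payload, level=6):
    """Wrap `payload` in one BGZF-style block (gzip member with a BC subfield)."""
    deflater = zlib.compressobj(level, zlib.DEFLATED, -15)  # -15: raw deflate
    cdata = deflater.compress(payload) + deflater.flush()
    bsize = HEADER_LEN + len(cdata) + 8 - 1  # total block size minus one
    # Magic 31/139, CM=8, FLG=4 (extra field), MTIME=0, XFL=0, OS=255,
    # XLEN=6, subfield id 'B','C', SLEN=2, BSIZE.
    header = struct.pack(HEADER_FMT, 31, 139, 8, 4, 0, 0, 255, 6, 66, 67, 2, bsize)
    trailer = struct.pack("<II", zlib.crc32(payload), len(payload))
    return header + cdata + trailer


def split_blocks(data):
    """Yield each block's bytes, using BSIZE at offset 16 to find boundaries."""
    pos = 0
    while pos < len(data):
        bsize = struct.unpack_from("<H", data, pos + 16)[0]
        yield data[pos:pos + bsize + 1]
        pos += bsize + 1


def inflate_block(block):
    """Decompress one block and verify its CRC32 and length trailer."""
    crc, isize = struct.unpack_from("<II", block, len(block) - 8)
    payload = zlib.decompress(block[HEADER_LEN:-8], -15)
    if zlib.crc32(payload) != crc or len(payload) != isize:
        raise ValueError("corrupt block")
    return payload


def parallel_decompress(data, workers=4):
    """Inflate all blocks on a thread pool; map() keeps them in file order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return b"".join(pool.map(inflate_block, split_blocks(data)))
```

The blocks are independent, so they can be inflated in any order. The ordering requirement is met here only because map() hands results back in submission order.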

I have set up a public GitHub repository that you can fork and make your modifications in. I don't know how familiar you are with GitHub, but after logging in you can make your own copy of my repository (called forking). Then you can check out the repository using Git (where checkout is called "clone") or Subversion (where checkout is called checkout :) and make your modifications.

https://github.com/holtgrewe/bam_mt

The instructions in the GitHub repository are for Unix-style operating systems, but they should be easy to translate to the Windows world. (A good Subversion client for Windows is TortoiseSVN.)

Note that you need zlib for the compression that BGZF I/O requires. You can get it by installing the SeqAn contribs, but IIRC you have already done so.

I would like us to use the threadpool library for the parallelization, since this will allow us to fix the total number of threads for I/O by sharing one threadpool instance.

http://threadpool.sourceforge.net/

Its priority task queues can help us to balance and order I/O packages. I have already added a copy of this library to the sandbox. Note that this library requires Boost. However, the library is a bit outdated, so it might need adaptation to work with a current Boost.Thread version.
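
As a rough sketch of how a priority queue can restore order when packages finish out of order, here is a small Python stand-in. It does not use the threadpool library or its API. The function name in_file_order is hypothetical, and the pool is Python's ThreadPoolExecutor, which merely plays the role of the shared threadpool instance.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor, as_completed


def in_file_order(pool, func, items):
    """Run func on every item in `pool` and yield the results in input order.

    Results that finish early wait in a min-heap keyed by their index until
    all results with a smaller index have been yielded.
    """
    futures = {pool.submit(func, item): index for index, item in enumerate(items)}
    pending = []    # min-heap of (index, result)
    next_index = 0  # index the consumer is waiting for
    for future in as_completed(futures):
        heapq.heappush(pending, (futures[future], future.result()))
        # Release every result that is now next in line.
        while pending and pending[0][0] == next_index:
            yield heapq.heappop(pending)[1]
            next_index += 1
```

Several streams can pass the same pool object to in_file_order, which caps the total number of I/O threads. Note that this sketch submits all items up front. A real reader would bound the number of blocks in flight to limit memory use.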

I have copied over the sequential Bgzf stream specialization into the sandbox (and called it Bgzf2), copied over the tests and also a small application bam2bam for testing/benchmarking.

You use it like this:

bam2bam IN.bam OUT.bam

The output BAM file should have the same content as the input BAM file. Note that the BAM format is a bit ambiguous, so reading and writing a file might change its content slightly. However, once you have copied a file using the SeqAn bam_io module, reading and writing it again should keep its content identical.

If you want a small example BAM file (BAM files are BGZF-compressed), then maybe use this one:

http://code.google.com/p/gasv/downloads/detail?name=Example.bam

After reading and writing, the md5sum was 9cb9ee685d908aaf4d1b2f4dff211bd0 for me.
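
If md5sum is not at hand (e.g. on Windows), the same check can be scripted. The following Python sketch computes the MD5 digest of a file in chunks and compares two files. The function names md5_of_file and same_content are made up for this example.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """True if both files have the same MD5 digest."""
    return md5_of_file(path_a) == md5_of_file(path_b)
```

For the round-trip test described above, one would copy the example file once with bam2bam, copy the result a second time, and then call same_content on the two copies.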

OK, I hope this is enough to get you started. If there are any problems or issues or points you want to raise for discussion, don't hesitate to send an email.

Cheers,
Manuel
________________________________________
From: Neil Justice [njjustice@gmail.com]
Sent: Friday, June 01, 2012 3:21 PM
To: seqan-dev@lists.fu-berlin.de
Subject: Re: [Seqan-dev] seqan-dev Digest, Vol 33, Issue 1

Hi Manuel,

I'm looking to generally contribute performance-wise to SeqAn. My
domain knowledge of bioinformatics topics is near zero, so C++-centric
optimizations and singlethreaded to multithreaded optimizations are of
most interest to me.

In what files do the performance-critical algorithms you mentioned
reside? q-gram, ESA, WOTD, alignment, etc. How can performance tests
be written against these algorithms?

In what files do the non-parallel compression and decompression of
BGZF files reside? How can a performance test be written against this
code?

Thanks for the seqan-contrib link, that fixed the compiler errors.

Cheers,
Neil

On Fri, Jun 1, 2012 at 5:00 AM,  <seqan-dev-request@lists.fu-berlin.de> wrote:
> Today's Topics:
>
>   1. Re: seqan-dev Digest, Vol 32, Issue 8 (Holtgrewe, Manuel)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Thu, 31 May 2012 11:07:22 +0000
> From: "Holtgrewe, Manuel" <manuel.holtgrewe@fu-berlin.de>
> To: SeqAn Development <seqan-dev@lists.fu-berlin.de>
> Subject: Re: [Seqan-dev] seqan-dev Digest, Vol 32, Issue 8
> Message-ID:
>        <FCCAB9D80C3DAB47B5601C5B0E62872B096300@ex02a.campus.fu-berlin.de>
> Content-Type: text/plain; charset="us-ascii"
>
> Hi Neil,
>
> Regarding performance sensitive code: My main assumption here is that you are looking for an area to contribute to.  Maybe you could elaborate a bit on what you want to do. Do you want to look into performance-tweaking parts of the library or "generally contribute performance-wise" to the library? Below, I give some possible projects for both.
>
> The inner loops are inside the indices (most interesting: q-gram, ESA, WOTD) and the online string search finders (the Myers and banded Myers algorithms are the most interesting here for edit distance). The index building algorithms are also a point for possible optimization. Also important to many applications are the SWIFT (q-gram counting based) and pigeonhole filter algorithms.
>
> Another larger area with inner loops is the alignment algorithms, but we are in the process of rewriting this part of SeqAn. Hopefully this will be completed in 1-2 months so it becomes a possible target.
>
> I am not too sure whether these are low-hanging fruit for performance optimization because we have already looked into most of them in detail. That said, if you are interested in this area, I can give you the exact locations and explain how to write performance test programs with real-world input.
>
> If I may somewhat widen your question from "performance issues" to "performance-related", it might be easier to find a "big bang for the buck" problem to tackle. Here, improved implementations or parallelization of certain important library parts come to mind. Some examples:
>
> - Parallel compression and decompression of BGZF files (they consist of gzip-compressed blocks and could thus be parallelized). This comes down to porting an existing threadpool library to C++11 and using C++11 threads for a parallel BGZF reader/writer and BAM file sorter. I have the beginning of this on my local disk but we could easily make this a GitHub project. This requires no prior knowledge of sequence analysis algorithms but expertise in C++ and parallel programming.
>
> - Space efficient BWT construction. This basically consists of implementing an existing algorithm in SeqAn. More information can be found here: https://www.mi.fu-berlin.de/w/ABI/SpaceEfficientBWTConstruction. The requirement on sequence analysis algorithms is probably quite low here and the necessary parts could be learned on the fly.
>
> - Parallelization of suffix array construction. This is a large chunk of work and would require rewriting the parallel external memory part of the algorithm and probably extending the external memory algorithms and data structures part of SeqAn to support parallelism.
>
> Cheers,
> Manuel
>
> PS: Besides the Tutorial, a valuable resource for looking up SeqAn-related things is http://docs.seqan.de/seqan/dev2/.
>
>
> ------------------------------
>
> _______________________________________________
> seqan-dev mailing list
> seqan-dev@lists.fu-berlin.de
> https://lists.fu-berlin.de/listinfo/seqan-dev
>
>
> End of seqan-dev Digest, Vol 33, Issue 1
> ****************************************
