Re: [Seqan-dev] lambda_indexer trouble
Dear Jose,
I am sorry to hear that it is not working for you as expected. The next Lambda
version will contain indexing methods that are more memory-efficient. I will
still try to answer your questions inline:
On Friday, 8 May 2015 at 10:27:01, Jose Manuel Duarte wrote:
> I am having a lot of trouble using lambda_indexer to index full
> UniRef100 fasta files. I followed the instructions on the Lambda website:
>
> % /path/to/segmasker -infmt fasta -in db.fasta -outfmt interval -out db.seg
>
> % bin/lambda_indexer -d db.fasta -s db.seg
That's correct.
> I first tried with the current UniRef100 release (2015_05), which is
> huge (26GB uncompressed fasta file) and then I came across the memory
> problems that are documented in lambda_indexer's help. So I ended up
> using "-a skew7ext" and ran it in the largest memory system I had
> available (128GB). The program ran, but at some point after "Generating
> Index..." it died with a segfault and no other information.
If this is a different error from the one below, it is unexpected. Can you
open an issue for this in the SeqAn bug tracker with a link to the exact file
used? Please note that the free disk space requirements for skew are very
high (see below).
> Then I decided to try on a smaller UniRef, so I took an older version
> (2012_06, only 8GB uncompressed fasta file). I ran again with "-a
> skew7ext" and this time it did go further, but eventually also died:
>
> Dumping unreduced Subj Sequences… done.
> Generating Index…Asynchronous I/O operation failed (waitFor): "Success"
> [...]
This is always an indicator of running out of disk space in the TMPDIR.
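One way to work around this (a sketch; the scratch path below is an example I chose, not something from Lambda's documentation) is to point TMPDIR at a filesystem with enough free space before starting the indexer:

```shell
# Put the indexer's temporary files on a filesystem with plenty of free space.
# /tmp/lambda_tmp is only an example path; use a large scratch disk instead.
export TMPDIR=/tmp/lambda_tmp
mkdir -p "$TMPDIR"
df -h "$TMPDIR"   # check available space before starting the long run
# bin/lambda_indexer -d db.fasta -s db.seg -a skew7ext
```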
> I'm pretty sure I have enough space available on the disk where I'm
> running it (>100GB). Is there anything obvious that I am doing wrong? Do
> you guys have any experience in indexing large files like this?
Indeed, the disk-space requirements for skew are quite high. As described
in the help page, I have measured about 30x. So if your file is 8GB and, say,
6GB of that is sequence data, then the external space requirement may well be
180GB...
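A quick pre-flight check for that (a sketch; the 30x factor is the figure measured above, and the GNU df flags are an assumption about your platform) could look like:

```shell
# Estimate whether the TMPDIR filesystem can hold ~30x the sequence data
# (e.g. 6 GB of sequence -> roughly 180 GB of temporary files).
NEEDED_GB=180
AVAIL_GB=$(df -BG --output=avail "${TMPDIR:-/tmp}" | tail -n 1 | tr -dc '0-9')
echo "available: ${AVAIL_GB} GB, needed: ~${NEEDED_GB} GB"
```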
You might want to try the quicksort or quicksortbuckets algorithms. They don't
require external disk space, and if you have 128GB of RAM, that should be
enough to build the index for your 8GB file.
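For example (a sketch; the command is only echoed here rather than executed, and the file paths are placeholders):

```shell
# In-memory index construction: no external disk space needed, but the
# whole suffix array build must fit in RAM (hence the 128GB machine).
CMD="bin/lambda_indexer -d db.fasta -s db.seg -a quicksortbuckets"
echo "$CMD"   # run this on the large-memory machine
```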
> Apologies in advance if the message does not belong in the dev list. I
> couldn't find any more appropriate place to post it. I'd like to confirm
> this is a bug rather than me misusing the software before submitting an
> issue to github.
Feel free to write to this list, post on GitHub, or write to me directly! Any
of these is fine :)
Best regards,
--
Hannes Hauswedell
PhD student
Max Planck Institute for Molecular Genetics / Freie Universität Berlin
address Institut für Informatik
Takustraße 9
Room 019
14195 Berlin
telephone +49 (0)30 838-75241
fax +49 (0)30 838-75218
e-mail hannes.hauswedell@[molgen.mpg.de|fu-berlin.de]