Re: [Seqan-dev] lambda_indexer trouble
Dear Jose,
I am sorry to hear that it is not working for you as expected. The next Lambda
version will contain indexing methods that are more memory-efficient. I will
still try to answer your questions inline:
On Friday, 8 May 2015 at 10:27:01, Jose Manuel Duarte wrote:
> I am having a lot of trouble using lambda_indexer to index full
> UniRef100 fasta files. I followed the instructions on the Lambda website:
>
> % /path/to/segmasker -infmt fasta -in db.fasta -outfmt interval -out db.seg
>
> % bin/lambda_indexer -d db.fasta -s db.seg
That's correct.
> I first tried with the current UniRef100 release (2015_05), which is
> huge (26GB uncompressed fasta file) and then I came across the memory
> problems that are documented in lambda_indexer's help. So I ended up
> using "-a skew7ext" and ran it in the largest memory system I had
> available (128GB). The program ran, but at some point after "Generating
> Index..." it died with a segfault and no other information.
If this is a different error from the one below, it is unexpected. Can you
open an issue for this in the SeqAn bug tracker with a link to the exact file
used? Please note that the free disk space requirements for skew are very
high (see below).
> Then I decided to try on a smaller UniRef, so I took an older version
> (2012_06, only 8GB uncompressed fasta file). I ran again with "-a
> skew7ext" and this time it did go further, but eventually also died:
>
> Dumping unreduced Subj Sequences… done.
> Generating Index…Asynchronous I/O operation failed (waitFor): "Success"
> [...]
This is always an indicator of running out of disk space in the TMPDIR.
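One way to work around this (a sketch; the scratch path below is an example I chose, not something from Lambda's documentation) is to point TMPDIR at a filesystem with enough free space before starting the indexer:

```shell
# Put the indexer's temporary files on a filesystem with plenty of free space.
# /tmp/lambda_tmp is only an example path; use a large scratch disk instead.
export TMPDIR=/tmp/lambda_tmp
mkdir -p "$TMPDIR"
df -h "$TMPDIR"   # check available space before starting the long run
# bin/lambda_indexer -d db.fasta -s db.seg -a skew7ext
```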
> I'm pretty sure I have enough space available on the disk where I'm
> running it (>100GB). Is there anything obvious that I am doing wrong? Do
> you guys have any experience in indexing large files like this?
Indeed, the disk-space requirements for skew are quite high. As described
in the help page, I have measured about 30x. So if your file is 8GB and, say,
6GB of that is sequence data, then the external space requirement may well be
180GB...
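A quick pre-flight check for that (a sketch; the 30x factor is the figure measured above, and the GNU df flags are an assumption about your platform) could look like:

```shell
# Estimate whether the TMPDIR filesystem can hold ~30x the sequence data
# (e.g. 6 GB of sequence -> roughly 180 GB of temporary files).
NEEDED_GB=180
AVAIL_GB=$(df -BG --output=avail "${TMPDIR:-/tmp}" | tail -n 1 | tr -dc '0-9')
echo "available: ${AVAIL_GB} GB, needed: ~${NEEDED_GB} GB"
```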
You might want to try the quicksort or quicksortbuckets algorithms. They don't
require external disk space, and if you have 128GB of RAM, that should be
enough to build the index for your 8GB file.
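For example (a sketch; the command is only echoed here rather than executed, and the file paths are placeholders):

```shell
# In-memory index construction: no external disk space needed, but the
# whole suffix array build must fit in RAM (hence the 128GB machine).
CMD="bin/lambda_indexer -d db.fasta -s db.seg -a quicksortbuckets"
echo "$CMD"   # run this on the large-memory machine
```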
> Apologies in advance if the message does not belong in the dev list. I
> couldn't find any more appropriate place to post it. I'd like to confirm
> this is a bug rather than me misusing the software before submitting an
> issue to github.
Feel free to write to this list, post on GitHub, or write to me directly! Any
of these is fine :)
Best regards,
--
Hannes Hauswedell
PhD student
Max Planck Institute for Molecular Genetics / Freie Universität Berlin
address Institut für Informatik
Takustraße 9
Room 019
14195 Berlin
telephone +49 (0)30 838-75241
fax +49 (0)30 838-75218
e-mail hannes.hauswedell@[molgen.mpg.de|fu-berlin.de]