Hi Hannes,

Thanks so much for the answers. Some comments below.
The error looked different, just a segfault without a trace. But of course it must have been the disk space, as you explain.

> If this is a different error from the one below, it is unexpected. Can you open an issue for this in the seqan bug tracker with a link to the exact file used? Please note that the requirements for free disk space for skew are very high (see below).
Alright, I'll try that, thanks for the tip. But at the end of the day I would like to index the current UniRef100. Following your estimates I would need something like 600GB of disk space for that... I might still be able to try it, but surely a few months from now UniRef100 will have a size that will be impossible to deal with. It's great that you guys are already working on new algos for indexing :)

> Indeed the requirements for disk space are quite high for skew. As described in the help-page, I have measured 30x. So if your file is 8GB and say 6GB of this is sequence data, then the external space requirement might well be 180GB... You might want to try the quicksort or quicksortbuckets algorithms. They don't require external disk space and if you have 128GB of RAM, this should be enough to build the index for your 8GB file.
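
To make the arithmetic above concrete, here is a minimal sketch of the estimate, assuming the roughly 30x factor measured for skew holds for other inputs too (it is a measurement on one data set, not a guarantee). The function names are hypothetical and not part of lambda or SeqAn.

```python
# Back-of-the-envelope disk estimate for the skew algorithm.
# ASSUMPTION: external space is about 30x the size of the sequence data,
# as measured and quoted above; real usage may differ.
SKEW_DISK_FACTOR = 30


def skew_disk_estimate_gb(sequence_gb, factor=SKEW_DISK_FACTOR):
    """Estimated external disk space (GB) needed to index `sequence_gb` of sequence data."""
    return sequence_gb * factor


def max_sequence_gb(free_disk_gb, factor=SKEW_DISK_FACTOR):
    """Largest amount of sequence data (GB) that fits into `free_disk_gb` of free disk."""
    return free_disk_gb / factor


# 8GB file of which about 6GB is sequence data
print(skew_disk_estimate_gb(6))   # 180
# sequence data that a 600GB disk would accommodate under the same assumption
print(max_sequence_gb(600))       # 20.0
```

Under this assumption, a 600GB requirement corresponds to about 20GB of sequence data.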
On an unrelated note, I also tried out the pre-indexed nr files you guys distribute from your website. There I get this:
./bin/lambda -q query.fasta -d nr/nr.fasta -p blastp
LAMBDA - the Local Aligner for Massive Biological DatA
======================================================
Version 0.4.7
Loading Subj Sequences… done.
Loading Subj Ids…
/home/mi/h4nn3s/takifugu/seqan-lambda-v0.4.7/core/include/seqan/basic/basic_exception.h:345 FAILED! (Uncaught exception of type std::bad_alloc: std::bad_alloc)
stack trace:
0 [0xa97a0e] seqan::ClassTest::fail() + 0xe
1 [0x8fd5a2] ./bin/lambda()
2 [0x1510ed6] __cxxabiv1::__terminate(void (*)()) + 0x6
3 [0x1510f03] ./bin/lambda()
4 [0x151131e] ./bin/lambda()
5 [0x151121d] operator new(unsigned long) + 0x7d
6 [0xfb8dd2] void seqan::AssignString_<seqan::Tag<seqan::TagExact_> >::assign_<seqan::String<char, seqan::Alloc<void> >, seqan::String<char, seqan::External<seqan::ExternalConfigLarge<seqan::File<seqan::Async<void> >, 4194304u, 2u> > > const>(seqan::String<char, seqan::Alloc<void> >&, seqan::String<char, seqan::External<seqan::ExternalConfigLarge<seqan::File<seqan::Async<void> >, 4194304u, 2u> > > const&) + 0x192
7 [0x919266] ./bin/lambda()
8 [0xfe74d0] int loadSubjects<(seqan::BlastFormatFile)8, (seqan::BlastFormatProgram)1, (seqan::BlastFormatGeneration)0, seqan::SimpleType<unsigned char, seqan::ReducedAminoAcid_<seqan::Tag<seqan::Murphy10_> > >, seqan::Score<int, seqan::ScoreMatrix<seqan::SimpleType<unsigned char, seqan::AminoAcid_>, seqan::Blosum62_> >, seqan::FMIndex<void, seqan::FMIndexConfig<void> > >(GlobalDataHolder<seqan::SimpleType<unsigned char, seqan::ReducedAminoAcid_<seqan::Tag<seqan::Murphy10_> > >, seqan::Score<int, seqan::ScoreMatrix<seqan::SimpleType<unsigned char, seqan::AminoAcid_>, seqan::Blosum62_> >, seqan::FMIndex<void, seqan::FMIndexConfig<void> >, (seqan::BlastFormatFile)8, (seqan::BlastFormatProgram)1, (seqan::BlastFormatGeneration)0>&, LambdaOptions const&) + 0x230
9 [0x14e15f5] int argConv2<(seqan::BlastFormatFile)8, (seqan::BlastFormatProgram)1, (seqan::BlastFormatGeneration)0, seqan::SimpleType<unsigned char, seqan::ReducedAminoAcid_<seqan::Tag<seqan::Murphy10_> > > >(LambdaOptions const&, seqan::Tag<seqan::BlastFormat_<(seqan::BlastFormatFile)8, (seqan::BlastFormatProgram)1, (seqan::BlastFormatGeneration)0> > const&, seqan::SimpleType<unsigned char, seqan::ReducedAminoAcid_<seqan::Tag<seqan::Murphy10_> > > const&) + 0x335
10 [0x15078ec] argConv0(LambdaOptions const&) + 0x6c
11 [0x8e5cb7] main + 0x3e7
12 [0x7fea204d0a40] __libc_start_main + 0xf0
13 [0x8e6299] ./bin/lambda()
Aborted (core dumped)

I am assuming that the nr db is protein and that I can do protein queries (blastp) against it, is that right?
Thanks for all the help!

Jose