KmerGenie estimates the best k-mer length for de novo genome assembly. Given a set of reads, KmerGenie first computes the k-mer abundance histogram for many values of k. Then, for each value of k, it predicts the number of distinct genomic k-mers in the dataset, and returns the k-mer length that maximizes this number. Experiments show that KmerGenie's choices lead to assemblies that are close to the best possible over all k-mer lengths.
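The selection loop described above can be illustrated with a small Python sketch. This is not KmerGenie's code: the function names (kmer_histogram, estimate_genomic_kmers, choose_k) and the min_abundance parameter are hypothetical, k-mers are counted exactly and in memory, and the "prediction" step is replaced by a crude abundance cutoff that treats any k-mer seen fewer than min_abundance times as a likely sequencing error. It only shows the overall shape of the procedure: one histogram per k, one estimate per histogram, then an argmax over k.

```python
from collections import Counter


def kmer_histogram(reads, k):
    """Map each abundance to the number of distinct k-mers seen that many times."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return Counter(counts.values())


def estimate_genomic_kmers(histogram, min_abundance=2):
    """Crude stand-in for the prediction step (an assumption, not KmerGenie's method):
    count the distinct k-mers whose abundance is at least min_abundance."""
    return sum(n for abundance, n in histogram.items() if abundance >= min_abundance)


def choose_k(reads, k_values, min_abundance=2):
    """Return the k with the largest estimate, plus the estimates for every k tried."""
    estimates = {
        k: estimate_genomic_kmers(kmer_histogram(reads, k), min_abundance)
        for k in k_values
    }
    best_k = max(estimates, key=estimates.get)
    return best_k, estimates


if __name__ == "__main__":
    # Two identical reads plus one read with an error in its last base.
    reads = ["ACGACTAC", "ACGACTAC", "ACGACTAG"]
    best_k, estimates = choose_k(reads, [2, 3, 4])
    print(best_k, estimates)
```

On this toy input, the k-mers introduced by the erroneous last base occur only once and are filtered out, and k = 3 gives the largest estimate (6 distinct k-mers, against 5 for k = 2 and k = 4). On real data a fixed cutoff is too blunt, which is why the sketch should be read as an illustration of the argmax structure only.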
KmerGenie predictions can be applied to single-k genome assemblers (e.g. Velvet, SOAPdenovo 2, ABySS, Minia). However, multi-k genome assemblers (e.g. SPAdes, IDBA) generally perform better with their default parameters, which use multiple k values, than with the single best k predicted by KmerGenie.
Download KmerGenie sources here: kmergenie-1.7038.tar.gz
You will need Python and R.
Latest README and CHANGELOG. Major changes since initial release:
Please use Biostars (a dynamic FAQ system; click "New Post" in the top right corner) to report bugs or ask questions. (An archive of posts between 2012 and 2014 can be found here.)
To contact the authors directly: kmergenie at cse.psu.edu
Chikhi R., Medvedev P. Informed and Automated k-Mer Size Selection for Genome Assembly, HiTSeq 2013. [on arXiv]