KmerGenie

Software: KmerGenie

KmerGenie estimates the best k-mer length for de novo genome assembly. Given a set of reads, KmerGenie first computes the k-mer abundance histogram for many values of k. Then, for each value of k, it predicts the number of distinct genomic k-mers in the dataset, and returns the k-mer length that maximizes this number. Experiments show that KmerGenie's choices lead to assemblies that are close to the best possible over all k-mer lengths.
KmerGenie predictions can be applied to single-k genome assemblers (e.g. Velvet, SOAPdenovo 2, ABySS, Minia). However, multi-k genome assemblers (e.g. SPAdes, IDBA) generally perform better with their default parameters, which use multiple k values, than with the single best k predicted by KmerGenie.
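
The selection procedure described above can be illustrated with a small Python sketch. This is not KmerGenie's code: the function names and the min_abundance cutoff are hypothetical, and the step that predicts the number of genomic k-mers is replaced here by a crude rule (count the distinct k-mers seen at least min_abundance times), whereas KmerGenie fits a statistical model to each histogram. The sketch also counts k-mers exactly and ignores reverse complements, which is only reasonable for toy inputs.

```python
from collections import Counter


def kmer_abundance_histogram(reads, k):
    """Return a histogram mapping abundance -> number of distinct k-mers
    that occur exactly that many times across all reads."""
    counts = Counter()
    for read in reads:
        # Slide a window of length k over the read (no windows if the read is shorter than k).
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return Counter(counts.values())


def estimate_genomic_kmers(histogram, min_abundance=2):
    """Crude stand-in for a fitted model (an assumption of this sketch):
    treat every k-mer seen at least `min_abundance` times as genomic and
    every rarer k-mer as a likely sequencing error."""
    return sum(n for abundance, n in histogram.items() if abundance >= min_abundance)


def choose_k(reads, k_values, min_abundance=2):
    """Estimate the number of genomic k-mers for each k and return the k
    with the largest estimate, together with all estimates."""
    estimates = {
        k: estimate_genomic_kmers(kmer_abundance_histogram(reads, k), min_abundance)
        for k in k_values
    }
    best_k = max(estimates, key=estimates.get)
    return best_k, estimates


if __name__ == "__main__":
    # Three error-free copies of a toy read plus one copy with a single substitution.
    toy_reads = ["ACGTTGCA"] * 3 + ["ACGATGCA"]
    best_k, estimates = choose_k(toy_reads, [3, 5])
    print("estimates per k:", estimates)
    print("chosen k:", best_k)
```

In the toy input, the k-mers introduced by the single erroneous read appear only once, so the cutoff excludes them from the estimate for every k. On real data a fixed cutoff is unreliable, which is why a model fitted to the whole histogram is preferable.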

See a sample report generated by KmerGenie from a dataset of bacterial reads.

Download

Download KmerGenie sources here: kmergenie-1.7051.tar.gz
You will need Python and R.

Latest README and CHANGELOG. Major changes since the initial release:

1.7048 (03/14/18): Python3 fixes and other small things
1.7023 (12/24/16): Improved speed of histogram estimation method (using ntCard)
1.7016 (03/21/16): Improved robustness of diploid model
1.6213 (01/30/14): Advanced Help section in the HTML report, to guide interpretation of results
1.5621 (08/02/13): HTML report for easier results examination (inspired by FastQC)
1.5378 (06/25/13): A suitable histogram resolution (-e parameter) is detected automatically, which is useful for small (bacterial) genomes
1.5260 (05/26/13): Improved model R code (thanks to Anton Korobeynikov); two-pass k estimation; histograms are plotted automatically

Support

To contact the authors directly: rayan.chikhi at ens-cachan.org

Article

Chikhi R., Medvedev P. Informed and Automated k-Mer Size Selection for Genome Assembly, HiTSeq 2013. [on arXiv]