Preprint
Assembly-free and alignment-free sample identification using genome skims
Affiliations
Organisations
- (1) University of California, San Diego, grid.266100.3
- (2) University of East Anglia, grid.8273.e
- (3) University of Copenhagen, grid.5254.6, KU
- (4) Norwegian University of Science and Technology, grid.5947.f
Description
Abstract The ability to quickly and inexpensively describe taxonomic diversity is critical in this era of rapid climate and biodiversity changes. The currently preferred molecular technique, barcoding, has been very successful, but is based on short organelle markers. Recently, an alternative genome-skimming approach has been proposed: low-pass sequencing (100Mb – several Gb per sample) is applied to voucher and/or query samples, and marker genes and/or organelle genomes are recovered computationally. The current practice of genome-skimming discards the vast majority of the data because the low coverage of genome-skims prevents assembling the nuclear genomes. In contrast, we suggest using all unassembled reads directly, but existing methods poorly support this goal. We introduce a new alignment-free tool, Skmer, to estimate genomic distances between the query and each reference genome-skim using the k-mer decomposition of reads. We test Skmer on a large set of insect and bird genomes, sub-sampled to create genome-skims. Skmer shows great accuracy in estimating genomic distances, identifying the closest match in a reference dataset, and inferring the phylogeny. The software is publicly available on https://github.com/shahab-sarmashghi/Skmer.git