Publication

Title: Alignment-Free Viral Sequence Classification at Scale
Authors: van Zyl D, Dunaiski M, Tegally H, Baxter C, de Oliveira T, Xavier J.
Journal: doi: 10.1101/2024.12.10.627186 (2024)

Abstract

Background: The rapid increase in nucleotide sequence data generated by next-generation sequencing (NGS) technologies demands efficient computational tools for sequence comparison. Alignment-based methods, such as BLAST, are increasingly overwhelmed by the scale of contemporary datasets due to their high computational demands for classification. This study evaluates alignment-free (AF) methods as scalable and rapid alternatives for viral sequence classification, focusing on identifying techniques that maintain high accuracy and efficiency when applied to extremely large datasets.

Results: We employed six established AF techniques to extract feature vectors from viral genomes, which were subsequently used to train Random Forest classifiers. Our primary dataset comprises 297,186 SARS-CoV-2 nucleotide sequences, categorized into 3502 distinct lineages. Furthermore, we validated our models using dengue and HIV sequences to demonstrate robustness across different viral datasets. Our AF classifiers achieved 97.8% accuracy on the SARS-CoV-2 test set, and 99.8% and 89.1% accuracy on dengue and HIV test sets, respectively.

Conclusion: Despite the high-class dimensionality, we show that word-based AF methods effectively represent viral sequences. Our study highlights the practical advantages of AF techniques, including significantly faster processing compared to alignment-based methods and the ability to classify sequences using modest computational resources.
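To illustrate what a word-based AF feature vector can look like, the sketch below counts overlapping k-mers in a nucleotide sequence and normalizes them into a fixed-length frequency vector. The abstract does not name the six AF techniques that were evaluated, so this is only one common word-based representation and not necessarily one the authors used. The function name `kmer_frequency_vector`, the default `k=3`, and the choice to skip words containing ambiguous bases are all assumptions made for this example.

```python
from itertools import product


def kmer_frequency_vector(sequence, k=3, alphabet="ACGT"):
    """Return normalized k-mer frequencies in a fixed lexicographic order.

    Hypothetical helper for illustration; not taken from the paper.
    """
    # Enumerate all 4**k possible words once so that every sequence maps
    # to the same feature positions regardless of its length.
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}

    counts = [0] * len(kmers)
    seq = sequence.upper()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        # Assumption: words containing ambiguous bases (e.g. N) are skipped.
        j = index.get(word)
        if j is not None:
            counts[j] += 1

    total = sum(counts)
    if total == 0:
        return [0.0] * len(kmers)
    return [c / total for c in counts]


if __name__ == "__main__":
    vector = kmer_frequency_vector("ACGTACGTNN", k=2)
    print(len(vector))  # 16 features for k=2
```

Vectors like this have the same length for every genome, so no alignment is needed before they are stacked into a feature matrix and passed to a classifier such as scikit-learn's `RandomForestClassifier`, in the spirit of the pipeline the abstract describes.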

Citation: van Zyl D, Dunaiski M, Tegally H, Baxter C, de Oliveira T, Xavier J. Alignment-Free Viral Sequence Classification at Scale. doi: 10.1101/2024.12.10.627186 (2024).


KRISP was created through the coordinated effort of the University of KwaZulu-Natal (UKZN), the Technology Innovation Agency (TIA) and the South African Medical Research Council (SAMRC).


Location: K-RITH Tower Building
Nelson R Mandela School of Medicine, UKZN
719 Umbilo Road, Durban, South Africa.
Director: Prof. Tulio de Oliveira