A Random Forest Classifier for Prokaryotes Gene Prediction
Metagenomics is the study of microbial genomes, known as metagenomes, describing them through their microorganism compositions, relationships, and activities, thus enabling deeper knowledge of the fundamentals of life and of the broad microbial diversity. One way to accomplish this task is to analyze information from the genes contained in metagenomes. The process of identifying genes in DNA sequences is usually called gene prediction. This work presents a new gene predictor based on the Random Forest classifier. The proposed model obtains better classification results than state-of-the-art gene prediction tools widely used by the bioinformatics community. Random Forest presented more robust results, being 27% better than Prodigal and 20% better than FragGeneScan in terms of AUC on the independent test set. Feature engineering is revisited for the gene prediction problem, reinforcing the importance of carefully assembling a good feature set. K-mer counting features can be seen as the fundamental building blocks for developing robust gene predictors.
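To make the notion of k-mer counting features concrete, the following minimal sketch turns a DNA sequence into a fixed-length vector of overlapping k-mer counts. The function name `kmer_counts`, the choice k = 3, and the lexicographic ordering of the vocabulary are illustrative assumptions and are not taken from this work; k-mers containing symbols other than A, C, G, and T are simply ignored here.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"  # assumption: only unambiguous nucleotides are counted


def kmer_counts(sequence, k=3):
    """Return overlapping k-mer counts as a vector of length 4**k.

    Hypothetical helper for illustration. Positions follow the
    lexicographic order of all k-mers over ALPHABET (AAA, AAC, ...).
    """
    seq = sequence.upper()
    # Slide a window of width k over the sequence, one base at a time.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    vocabulary = ("".join(p) for p in product(ALPHABET, repeat=k))
    # K-mers outside the vocabulary (e.g. containing N) are dropped.
    return [counts.get(kmer, 0) for kmer in vocabulary]


# A short toy fragment, not data from the study.
features = kmer_counts("ATGGCGTAA", k=3)
```

Vectors of this kind, one per candidate sequence, could then be passed to any standard Random Forest implementation (for example scikit-learn's `RandomForestClassifier`) together with coding/non-coding labels; whether the authors used that library or this exact encoding is not stated in the abstract.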