Prediction of Protein Solubility using Primary Structure Compositional Features: A Machine Learning Perspective

Nouman Rasool; Waqar Hussain; Sajid Mahmood

doi:10.4172/jpb.1000458

Abstract

Obtaining sufficient concentrations of soluble protein is a recurring limiting factor for in vitro methodologies. Solubility is an independent characteristic of a protein that, under specific experimental conditions, can be determined from its amino acid composition. The present study aims to predict protein solubility by applying machine learning approaches to primary structure information. The features comprise amino acid composition as well as physicochemical properties of the amino acids, i.e., canonical value, hydrophobicity, solubility index and solubility score. All four features were calculated for a dataset of 6372 protein sequences (4850 soluble and 1522 insoluble). Using the calculated values, four prediction models were developed based on Multilayer Perceptron (MLP), Random Forest (RF), Decision Tree (DT), and Naïve Bayes Classifier (NBC). Performance was evaluated using the Matthews correlation coefficient (MCC), F-measure, accuracy, precision and recall. Among the four models, MLP was the most accurate for predicting protein solubility, with an accuracy of 95.92%, followed by RF and NBC. The proposed MLP-based model can be used to predict protein solubility as a preliminary step before experimental work. The method is resource- and time-efficient and can help predict the solubility of proteins in place of laborious experimental work.
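The compositional feature and the evaluation measures named in the abstract could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `aa_composition` and `binary_metrics` are hypothetical, only the amino acid composition feature is shown (the abstract does not define the canonical value, solubility index or solubility score), and the metrics use the standard confusion-matrix definitions with "soluble" treated as the positive class.

```python
from collections import Counter
from math import sqrt

# The 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_composition(sequence):
    """Return the fraction of each standard amino acid in a sequence.

    Hypothetical helper: non-standard characters are ignored.
    """
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def binary_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, recall, F-measure and MCC.

    Inputs are confusion-matrix counts; undefined ratios fall back to 0.0.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f_measure = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    # Matthews correlation coefficient.
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "mcc": mcc,
    }


if __name__ == "__main__":
    # Toy sequence and made-up counts, for illustration only.
    features = aa_composition("MKTAYIAKQR")
    print(features["A"], features["K"])
    print(binary_metrics(tp=40, fp=10, tn=45, fn=5))
```

A 20-element composition vector of this kind could be passed, together with any physicochemical features, to a classifier such as an MLP. On an imbalanced dataset like the one described (4850 soluble versus 1522 insoluble sequences), MCC is usually more informative than accuracy alone.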

Journal of Proteomics & Bioinformatics
Open Access
