ISSN: 2090-4924
International Journal of Biomedical Data Mining

Sequence Features and Subset Selection Technique for the Prediction of Protein Trafficking Phenomenon in Eukaryotic Non Membrane Proteins

Geetha Govindan* and Achuthsankar S Nair

Department of Computational Biology and Bioinformatics, University of Kerala, Thiruvananthapuram, India

Corresponding Author:
Geetha Govindan
Department of Computational Biology and Bioinformatics
University of Kerala, Thiruvananthapuram, India
Tel: +91-471-2308759
E-mail: [email protected]

Received date: February 14, 2014; Accepted date: November 22, 2014; Published date: February 15, 2015

Citation: Govindan G, Nair AS (2015) Sequence Features and Subset Selection Technique for the Prediction of Protein Trafficking Phenomenon in Eukaryotic Non Membrane Proteins. Int J Biomed Data Min 3:109. doi: 10.4172/2090-4924.1000109

Copyright: © 2015 Govindan G, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.


Abstract

Protein trafficking or protein sorting is the mechanism by which a cell transports proteins to the appropriate position in the cell or outside of it. This targeting is based on the information contained in the protein. Many methods predict the subcellular location of proteins in eukaryotes from the sequence information. However, most of these methods use a flat structure to perform prediction. In this work, we introduce ensemble methods to predict locations in the eukaryotic protein-sorting non membrane pathway hierarchically. We used features that were extracted exclusively from full length protein sequences with feature subset selection for classification. Sequence driven features, sequence mapped features and sequence autocorrelation features were tested with ensemble learners and classifier performances were compared with and without feature subset selection technique. This study shows the new features extracted from full length eukaryotic protein sequences are effective at capturing biological features among compartments in eukaryotic non membrane pathways at two levels. Feature subset selection techniques helped to reduce the time taken for building the classification model.

Keywords

Sequence driven features; Sequence mapped features; Autocorrelation; Ensemble classifier; Pathways; Protein sorting

Introduction

Eukaryotic cells are organized into several membrane-bound compartments. To perform their functions, newly formed proteins are sorted and delivered to various compartments through the non-membrane and trans-membrane pathways [1]. This sorting process is very complex and still not clearly understood. The most important principle of protein trafficking, however, is that each protein carries the information on its final localization site as part of its amino acid sequence [2]. In 1983, Nishikawa, Kubota and Ooi investigated the prediction of subcellular locations from amino acid compositions and reported that amino acid composition can discriminate between subcellular locations.

Prediction of protein localization sites in the pathways from the amino acid sequence has implications both for the function of the protein and for its possibility of interacting with other proteins in the same compartment [3-5]. The protein sorting pathway in eukaryotes can be represented hierarchically as a tree structure [1,6,7]. At the root level, the pathway differentiates non-membrane and trans-membrane proteins. The non-membrane protein pathway can be further divided into secretory and non-secretory types. In the secretory pathway, proteins are delivered to the endoplasmic reticulum (ER) and then transported to other related locations. ER signal sequences, located in the N-terminal sequence, control this protein transport. In the non-secretory pathway, proteins with organelle-specific signal sequences are imported into the nucleus or mitochondria, according to their signal sequence type. The remaining proteins, which lack sorting signals, stay in the cytosol [8,9], and some are localized by binding to another protein.

A wide variety of methods have been tried over the years to predict the subcellular localisation of proteins from their amino acid sequences (Olof). These methods differ in the sequence features used as input data, the techniques employed, and the time and cost of making the prediction. The success of computational prediction relies on the extraction of relevant biological features from the sequence and on the computational techniques used [10-14]. Studies by Nakashima and Nishikawa [15] have shown that secretory and intracellular proteins differ significantly in their amino acid compositions and residue pair frequencies. Hence, in our study, priority was given to features that can be extracted from the full length protein sequence with various coding schemes, without referencing external databases or outputs generated by external servers.

For computation, we applied ensemble learning [16-20] hierarchically (Figure 1), mimicking the protein trafficking phenomenon. The hierarchy was taken from the location descriptions provided by the Gene Ontology (GO) consortium [21], and the sequence features served as input. First, this approach was used to classify the subcellular location of proteins. Second, the study was extended to determine whether feature subset selection improves prediction performance at the various levels of the hierarchy.


Figure 1: Hierarchical structures of compartments in protein trafficking based on cellular sorting used for this study. (Adapted from references [16-21]).

Materials and Methods

Data set

We used the recently published eukaryotic data set of LocTree2 [17], which contains 1682 proteins, for testing and comparison. It is a manually curated data set with experimental annotations for the subcellular localization of proteins. Sequence bias in this data set was reduced with UniqueProt [22], which ensured that no pair of proteins in the set had a BLAST2 [23] HSSP-value (HVAL) > 0 [24,25]. From it we formed our data set ASN_G_1677 by verifying each protein sequence and its explicit subcellular localization annotation against UniProt release 2013_05 [26]. Annotations based on non-experimental findings ('potential', 'probable' or 'by similarity') and proteins with multiple localizations were excluded. The final data set had 1677 eukaryotic sequences with no over-representation of any particular subcellular protein. The numbers of proteins per pathway and per subcellular location within the non-membrane pathway are listed in Table 1.

Pathway and subcellular location No. of proteins
Trans-Membrane 245
Non-Membrane 1432
Chloroplast 133
Cytosol 212
Endoplasmic reticulum 10
Extra-cellular space 595
Golgi apparatus 2
Mitochondria 136
Nucleus 321
Peroxisome 6
Plastid 14
Vacuole 3
Total 1677

Table 1: Number of proteins in the data set ASN_G_1677.

Sequence feature formation

The sequence feature extraction performed in this study can be classified into three groups. The first group is a method of converting the protein sequence into a numeric sequence by replacing each amino acid with its equivalent numeric values, counts etc. The second group is based on mapping amino acids into sub groups and the third group is based on features obtained from calculations based on autocorrelation.

1. Features directly from sequence. (Sequence driven feature)

2. Features by mapping the sequence. (Sequence mapped feature)

3. Features from sequence autocorrelation. (Sequence autocorrelation feature)

Sequence driven feature (amino acid dipeptide composition, Dipeptide): A dipeptide is a molecule consisting of two amino acids joined by a single peptide bond. The 20 x 20 possible amino acid pairs give a feature vector with a dimension of 400. The advantage of dipeptide composition over amino acid composition is that it encapsulates global information about the fraction of amino acids as well as their sequence order [27].

Consider a protein sequence AAAPYQAACAQ.

The dipeptide count with 0 skips, d0, is obtained by counting all pairs of adjacent amino acids. The count with one skip, d1, pairs each residue with the residue two positions ahead, and in general the dipeptide count 'dNxx' counts the pairs xx that have N residues between them. In Figure 2, d0AA is 3 and d1AA is 2.


Figure 2: Amino acid di-peptide count with skips.

The feature vector that represents a protein sequence by its dipeptide occurrence counts is formulated as follows.

Given a protein sequence P with m amino acid residues, P = [R1 R2 R3 ... Rm], where R1, R2, ..., Rm are the residues, we map the sequence to a fixed-length feature vector for each skip value, F = {f1, f2, f3, ..., f400}, where f1, f2, ..., f400 are the occurrence counts in P of the 400 dipeptides (AA, AC, AD, ..., CA, CC, CD, ..., YV, YW, YY).

The feature vectors of the sequence 'AAAPYQAACAQ' for the dipeptide counts d0, d1 and d2 are as follows:

Feature vector for occurrence frequency d0

= [3 1 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 ……0] (1)

Feature vector for occurrence frequency d1

= [2 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 ……0] (2)

Feature vector for occurrence frequency d2

= [1 0 0 0 0 0 0 0 0 0 0 0 1 2 0 0 0 0 0 1 0 0 0 0 0 ……0] (3)

Each protein sequence is represented as three separate numeric counts of its dipeptide d0, d1 and d2, each having 400 components. The feature vector having 1200 attributes is obtained by concatenating the corresponding vectors of d0, d1 and d2.
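The skip-dipeptide counting described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the alphabetical ordering of the 20 one-letter codes is an assumption inferred from the dipeptide order (AA, AC, AD, ..., YY) given in the text.

```python
from itertools import product

# Assumption: residues ordered alphabetically by one-letter code,
# matching the dipeptide order AA, AC, AD, ..., YY used in the text.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 pairs


def skip_dipeptide_counts(sequence, skip):
    """Count residue pairs that have `skip` residues between them (dN counts)."""
    counts = dict.fromkeys(DIPEPTIDES, 0)
    step = skip + 1
    for i in range(len(sequence) - step):
        pair = sequence[i] + sequence[i + step]
        if pair in counts:  # pairs with non-standard residue codes are ignored
            counts[pair] += 1
    return [counts[d] for d in DIPEPTIDES]


def dipeptide_feature_vector(sequence, max_skip=2):
    """Concatenate the d0, d1, ..., d(max_skip) count vectors."""
    vector = []
    for skip in range(max_skip + 1):
        vector.extend(skip_dipeptide_counts(sequence, skip))
    return vector


features = dipeptide_feature_vector("AAAPYQAACAQ")
d0, d1, d2 = features[:400], features[400:800], features[800:]
```

For the sample sequence, `d0[0]`, `d1[0]` and `d2[0]` hold the AA counts for zero, one and two skips, and the concatenated vector has 1200 entries.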

Sequence mapped feature (composition, transition, distribution (CTD)): The different properties of the amino acids result from the structural variations of their R groups. The side chains define four classes of amino acids: (1) non-polar and neutral, (2) polar and neutral, (3) acidic and polar, and (4) basic and polar. The twenty amino acids that form a protein sequence can also be grouped by other properties, notably charge, hydrophilicity or hydrophobicity, size, and functional groups.

Each of the twenty amino acids can be mapped into one of three groups, coded 1 to 3, by replacing each amino acid code with its group code. From the mapped sequence, features called Composition, Transition and Distribution (CTD) can be calculated.

Composition is the number of amino acids of a particular property divided by the total number of amino acids. Transition characterizes the percentage frequency with which amino acids of a particular property are followed by amino acids of a different property. Distribution measures the chain length within which the first, 25%, 50%, 75% and 100% of the amino acids of a particular property are located.

In this method, the amino acids are divided into three groups for each property type, as shown in Table 2, and encoded by the numeric indices 1, 2 and 3. The properties used are charge, hydrophobicity, normalized van der Waals volume, polarity, polarizability, secondary structure and solvent accessibility [28-33].

Sl | Property | Group 1 | Group 2 | Group 3
1 | Charge | Neutral | Negatively charged | Positively charged
  | Amino acids | A,C,F,G,H,I,L,M,N,P,Q,S,T,V,W,Y | D,E | K,R
2 | Hydrophobicity | Hydrophobic | Neutral | Polar
  | Amino acids | C,F,I,L,M,V,W | A,G,H,P,S,T,Y | D,E,K,N,Q,R
3 | Normalised van der Waals volume | 0-2.78 | 2.95-4.0 | 4.03-8.08
  | Amino acids | A,C,D,G,P,S,T | E,I,L,N,Q,V | F,H,K,M,R,W,Y
4 | Polarity | 4.9-6.2 | 8.0-9.2 | 10.4-13.0
  | Amino acids | C,F,I,L,M,V,W,Y | A,G,P,S,T | D,E,H,K,N,Q,R
5 | Polarisability | 0-0.108 | 0.128-0.186 | 0.219-0.409
  | Amino acids | A,D,G,S,T | C,E,I,L,N,P,Q,V | F,H,K,M,R,W,Y
6 | Secondary structure | Coil | Helix | Strand
  | Amino acids | D,G,N,P,S | A,E,H,K,L,M,Q,R | C,F,I,T,V,W,Y
7 | Solvent accessibility | Buried | Intermediate | Exposed
  | Amino acids | A,C,F,G,I,L,V,W | H,M,P,S,T,Y | D,E,K,N,R,Q

Table 2: Amino acid attributes and division of the amino acids into groups.

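Consider the sample sequence ‘RKEDQNGASTPHYCLVIMFW’. Under the hydrophobicity grouping, with the polar residues coded 1, the neutral residues coded 2 and the hydrophobic residues coded 3, this sequence is encoded as “11111122222223333333”. (This coding numbers the hydrophobicity groups in the reverse of their column order in Table 2.)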

Composition is the global percentage for each encoded class in the sequence. In this example the total count of 1, 2, 3 is 6, 7, 7 and hence composition is calculated as 6/20, 7/20 and 7/20.

$$C_e = \frac{N_e}{N} \qquad (4)$$

where e = 1, 2, 3; $N_e$ is the number of residues coded e in the encoded sequence and N is the total length of the sequence.

The transition from class 1 to 2 is the percentage frequency with which 1 is followed by 2 or 2 is followed by 1 in the encoded sequence.

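The transition descriptor is calculated as

$$T_{mn} = \frac{N_{mn} + N_{nm}}{N - 1} \qquad (5)$$

where mn = “12”, “13”, “23”; $N_{mn}$ and $N_{nm}$ are the numbers of dipeptides encoded as “mn” and “nm”, respectively, in the sequence, and N is the length of the sequence. The sample sequence contains one 1-2 transition and one 2-3 transition among its 19 adjacent pairs, so $T_{12} = T_{23} = 1/19$ and $T_{13} = 0$, giving 2/19 in total.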

The distribution descriptor describes the distribution of each property in the sequence. There are five distribution descriptors for each property and they are the position percentages in the sequence for the first residue, 25% of the residues, 50% of the residues, 75% of the residues and 100% of the residues.

The CTD calculation is performed for the 7 properties on each protein sequence after dividing the sequence into three equal segments. Each property yields 21 attributes per segment (3 composition, 3 transition and 15 distribution values), that is, 21 x 3 = 63 attributes per sequence, and the 7 properties together give the 441 attributes of the final feature vector.
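As an illustration of the CTD calculation for one property, the following Python sketch reproduces the hydrophobicity example above. It is a hedged sketch rather than the authors' code: the function names are hypothetical, values are returned as fractions rather than percentages, and the rule for picking the residue at 25%, 50% and 75% (the k-th occurrence with k rounded up) is an assumption, since the text does not state a rounding convention.

```python
import math

# Group coding taken from the worked example: 1 = polar, 2 = neutral, 3 = hydrophobic.
HYDROPHOBICITY = {"1": "RKEDQN", "2": "GASTPHY", "3": "CLVIMFW"}


def encode(sequence, groups):
    """Replace each amino acid code with its group code."""
    lookup = {aa: code for code, members in groups.items() for aa in members}
    return "".join(lookup[aa] for aa in sequence)


def composition(encoded):
    """Fraction of residues in each of the three groups."""
    return [encoded.count(c) / len(encoded) for c in "123"]


def transition(encoded):
    """Fraction of adjacent pairs that switch between groups m and n."""
    pairs = list(zip(encoded, encoded[1:]))
    result = []
    for m, n in (("1", "2"), ("1", "3"), ("2", "3")):
        hits = sum(1 for a, b in pairs if (a, b) in ((m, n), (n, m)))
        result.append(hits / (len(encoded) - 1))
    return result


def distribution(encoded):
    """Relative positions of the first, 25%, 50%, 75% and 100% residue of each group."""
    result = []
    for c in "123":
        positions = [i + 1 for i, code in enumerate(encoded) if code == c]
        for fraction in (0.0, 0.25, 0.5, 0.75, 1.0):
            if not positions:
                result.append(0.0)
                continue
            # Assumption: take the k-th occurrence, k = ceil(fraction * count), at least 1.
            k = max(1, math.ceil(fraction * len(positions)))
            result.append(positions[k - 1] / len(encoded))
    return result


def ctd(sequence, groups=HYDROPHOBICITY):
    """3 composition + 3 transition + 15 distribution values = 21 attributes."""
    encoded = encode(sequence, groups)
    return composition(encoded) + transition(encoded) + distribution(encoded)


sample = "RKEDQNGASTPHYCLVIMFW"
sample_ctd = ctd(sample)
```

Applying `ctd` to each of the three segments of a sequence and to each of the seven properties would give the 441 attributes mentioned above; segment splitting is left out here to keep the sketch short.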

Sequence autocorrelation features (Autocorrelation Descriptors (ACD)): Sequence autocorrelation-based features build on Tobler's first law of geography: “everything is related to everything else but nearby things are more related than distant things” [34]. They also assume that “the disturbances in each area are systematically related to those in adjacent areas” [35]. Spatial autocorrelation is positive when nearby things are similar and negative when they are dissimilar. It measures the degree to which near and distant things are related. This concept helps to analyze the dependency among the features of sequences in each location.

Autocorrelation features are calculated from the distribution of amino acid properties along the sequence. Amino acid indices related to hydrophobicity are used for the calculation after replacing each amino acid with its equivalent normalized index value, Pi. Three autocorrelation descriptors are used as features: normalized Moreau-Broto autocorrelation descriptors [36,37], Moran autocorrelation descriptors [38] and Geary autocorrelation descriptors [39].

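The Moreau-Broto autocorrelation descriptor is defined as

$$AC(d) = \sum_{i=1}^{N-d} P_i P_{i+d}, \qquad d = 1, 2, \ldots, \mathrm{maxlag} \qquad (6)$$

where d is the lag of the autocorrelation, N is the length of the sequence, $P_i$ and $P_{i+d}$ are the amino acid index values of the selected property at positions i and i+d, respectively, and maxlag is the maximum value of the lag. The normalized Moreau-Broto autocorrelation descriptor is defined as

$$ATS(d) = \frac{AC(d)}{N - d} \qquad (7)$$

The Moran autocorrelation descriptor is defined as

$$I(d) = \frac{\frac{1}{N-d}\sum_{i=1}^{N-d}(P_i - \bar{P})(P_{i+d} - \bar{P})}{\frac{1}{N}\sum_{i=1}^{N}(P_i - \bar{P})^2} \qquad (8)$$

$$\bar{P} = \frac{1}{N}\sum_{i=1}^{N} P_i \qquad (9)$$

where $\bar{P}$ is the mean of the property over the sequence and $P_i$, $P_{i+d}$ have the same meaning as above.

The Geary autocorrelation descriptor is defined as

$$C(d) = \frac{\frac{1}{2(N-d)}\sum_{i=1}^{N-d}(P_i - P_{i+d})^2}{\frac{1}{N-1}\sum_{i=1}^{N}(P_i - \bar{P})^2} \qquad (10)$$

where $\bar{P}$, $P_i$ and $P_{i+d}$ have the same meaning as above. With 39 amino acid properties, 30 lags and three descriptors, the autocorrelation feature vector has 39 x 30 x 3 = 3510 features.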

The combined feature vector from the three groups had 5151 elements (1200 + 441 + 3510).
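The three autocorrelation descriptors can be sketched in Python as follows, assuming the commonly used definitions: a lagged product sum divided by N - d for normalized Moreau-Broto, a lagged covariance divided by the variance for Moran, and half the mean squared lagged difference divided by the sample variance for Geary. The functions operate on a sequence that has already been converted to numeric property values; the function names and the toy values are illustrative only and are not real amino acid indices.

```python
def moreau_broto(values, lag):
    """Normalized Moreau-Broto autocorrelation at the given lag."""
    n = len(values)
    return sum(values[i] * values[i + lag] for i in range(n - lag)) / (n - lag)


def moran(values, lag):
    """Moran autocorrelation: lagged covariance over the variance."""
    n = len(values)
    mean = sum(values) / n
    numerator = sum(
        (values[i] - mean) * (values[i + lag] - mean) for i in range(n - lag)
    ) / (n - lag)
    denominator = sum((v - mean) ** 2 for v in values) / n
    return numerator / denominator


def geary(values, lag):
    """Geary autocorrelation: half mean squared lagged difference over sample variance."""
    n = len(values)
    mean = sum(values) / n
    numerator = sum(
        (values[i] - values[i + lag]) ** 2 for i in range(n - lag)
    ) / (2 * (n - lag))
    denominator = sum((v - mean) ** 2 for v in values) / (n - 1)
    return numerator / denominator


def autocorrelation_features(values, max_lag=30):
    """All three descriptors for lags 1..max_lag (needs len(values) > max_lag)."""
    features = []
    for descriptor in (moreau_broto, moran, geary):
        features.extend(descriptor(values, d) for d in range(1, max_lag + 1))
    return features


toy = [1.0, 2.0, 3.0, 4.0]  # illustrative values only
toy_features = autocorrelation_features([float(v) for v in range(1, 41)])
```

One property with 30 lags gives 90 values per sequence, so repeating the calculation for 39 properties would give the 3510 autocorrelation features reported above.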

Feature subset selection

Feature subset selection is used as a pre-processing step in machine learning. The performance of a classifier depends on the number of features, the sample size and the complexity of the algorithm. Feature selection is effective in removing irrelevant and redundant features, increasing efficiency in learning tasks, improving learning performance such as predictive accuracy, and enhancing the comprehensibility of learned results [40-42]. In this study we adopted a fast filter method called FCBF (Fast Correlation Based Filter), which can identify relevant features as well as redundancy among relevant features without pairwise correlation analysis [43,44]. The FCBF algorithm is designed for high-dimensional data and has been shown to be effective in removing both irrelevant and redundant features. It has two stages: the first is a relevance analysis, which orders the input variables by a relevance score, and the second is a redundancy analysis, which selects the predominant features from the relevant set obtained in the first stage. With FCBF, we reduced the number of features from 5151 to fewer than 100 (between 11 and 77, depending on the level of the hierarchy).
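The two FCBF stages can be illustrated with a small Python sketch. It assumes symmetrical uncertainty as the relevance score, which is the measure usually associated with FCBF, and it assumes features that are already discrete; continuous features such as those used here would first need discretization. The function names, the threshold parameter `delta` and the toy data are hypothetical and do not reproduce the WEKA configuration used in the study.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (bits) of a sequence of discrete values."""
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in Counter(values).values())


def conditional_entropy(x, y):
    """H(X | Y) for two equal-length sequences of discrete values."""
    groups = {}
    for xi, yi in zip(x, y):
        groups.setdefault(yi, []).append(xi)
    return sum(len(g) / len(x) * entropy(g) for g in groups.values())


def symmetrical_uncertainty(x, y):
    """SU(X, Y) = 2 * IG(X | Y) / (H(X) + H(Y)), between 0 and 1."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0
    return 2.0 * (hx - conditional_entropy(x, y)) / (hx + hy)


def fcbf(columns, labels, delta=0.0):
    """Return indices of predominant features.

    Stage 1 (relevance): keep features with SU(feature, class) > delta,
    ordered by decreasing SU.
    Stage 2 (redundancy): for each kept feature p, drop every later feature q
    whose SU with p is at least its SU with the class.
    """
    ranked = sorted(
        ((symmetrical_uncertainty(col, labels), j) for j, col in enumerate(columns)),
        key=lambda item: (-item[0], item[1]),
    )
    ranked = [(su, j) for su, j in ranked if su > delta]
    selected = []
    while ranked:
        _, p = ranked.pop(0)
        selected.append(p)
        ranked = [
            (su_q, q)
            for su_q, q in ranked
            if symmetrical_uncertainty(columns[p], columns[q]) < su_q
        ]
    return selected


# Toy data: f0 predicts the class, f1 duplicates f0's information, f2 is noise.
toy_labels = [0, 0, 1, 1, 0, 1, 0, 1]
f0 = list(toy_labels)
f1 = [1 - v for v in toy_labels]
f2 = [0, 1, 0, 1, 0, 1, 1, 0]
toy_selected = fcbf([f0, f1, f2], toy_labels, delta=0.01)
```

In the toy data the noise feature is removed in the relevance stage and the duplicate is removed in the redundancy stage, so only the first feature remains.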

Computational techniques used (Ensemble Learning)

Ensemble learning combines multiple machine learning models and aggregates their predictions to improve overall prediction accuracy [45]. Multiple learners (base learners) are trained to solve the same problem, and their individual decisions are combined in some way (typically by weighted or unweighted voting) to classify new examples. The method averages over multiple classification models, each of which can have a different input feature vector, so that weak individual models are transformed into a strong ensemble model.

The aim of using the ensemble method is to achieve more accurate classification (on training data) as well as better generalization (on unseen data). Ensemble techniques mitigate the small sample size problem, which is critical in biological applications, and allow multiple prediction models to be tested with different feature sets. The two most popular ensemble classifiers are Bagging [46] and AdaBoostM1 [47]. In this study, these two methods were used to predict protein trafficking in the pathway.

Bagging generates new training sets by sampling the training data with replacement, while boosting uses all instances in each iteration and adapts their weights. In both methods, multiple classifiers are combined by voting to create a meta classifier. In Bagging, each classifier's vote has the same strength, whereas boosting assigns different voting strengths to classifiers based on their accuracy.
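The bootstrap-and-vote idea behind Bagging can be shown with a minimal Python sketch. It is not the WEKA implementation used in this study: the base learner is a hypothetical one-dimensional decision stump, the toy data are invented for illustration, and boosting's reweighting of instances is not shown.

```python
import random
from collections import Counter


def train_stump(samples):
    """Fit a one-feature threshold classifier on (x, label) pairs, labels 0 or 1."""
    xs = sorted({x for x, _ in samples})
    # Candidate thresholds: one below the minimum, then midpoints between values.
    thresholds = [xs[0] - 1.0] + [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    best = None
    for t in thresholds:
        for high_label in (0, 1):
            errors = sum(
                (high_label if x > t else 1 - high_label) != y for x, y in samples
            )
            if best is None or errors < best[0]:
                best = (errors, t, high_label)
    return best[1], best[2]


def stump_predict(stump, x):
    threshold, high_label = stump
    return high_label if x > threshold else 1 - high_label


def bagging_fit(samples, n_models, rng):
    """Train each base learner on a bootstrap sample (drawn with replacement)."""
    models = []
    for _ in range(n_models):
        bootstrap = [rng.choice(samples) for _ in samples]
        models.append(train_stump(bootstrap))
    return models


def bagging_predict(models, x):
    """Unweighted majority vote: every base learner has equal voting strength."""
    votes = Counter(stump_predict(m, x) for m in models)
    return votes.most_common(1)[0][0]


# Toy data (illustrative): class 0 for x < 5, class 1 otherwise.
toy_data = [(float(x), 0 if x < 5 else 1) for x in range(10)]
ensemble = bagging_fit(toy_data, n_models=25, rng=random.Random(0))
```

Each stump sees a slightly different resampled training set, and the final prediction is the class chosen by most stumps.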

Performance evaluation

The basic ensemble classifiers AdaBoostM1 and Bagging were trained with WEKA [48] to classify the location compartment of proteins in the pathway. Two tests were carried out on the ASN_G_1677 dataset for performance evaluation at all levels of the hierarchy shown in Figure 1: (1) a 5 fold cross-validation test (randomly partitioning the dataset into five equally sized sets, training on 4 sets, testing with the 5th set, and averaging the results) and (2) an independent data test (dividing the dataset into two equal-sized random groups, training on one and testing with the other). The classifier performance evaluation parameters Specificity, Sensitivity, Accuracy, Matthews correlation coefficient [49], Positive predictive value [50], Negative predictive value [50] and Receiver operating characteristic [51] were calculated at all levels as per the equations below.

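$$Sp = \frac{TN}{TN + FP}$$

$$Sn = \frac{TP}{TP + FN}$$

$$Acc = \frac{TP + TN}{TP + TN + FP + FN}$$

$$Mcc = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

$$Ppv = \frac{TP}{TP + FP}$$

$$Npv = \frac{TN}{TN + FN}$$

where TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives, respectively.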
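The evaluation parameters can be computed from the four confusion-matrix counts, as in the following Python sketch, which assumes the standard definitions of these measures. The function names and the example counts are hypothetical. Undefined ratios (zero denominator) are returned as NaN here; this is a choice of the sketch, and such cases arise whenever a classifier assigns every protein to one class, as with the rows that show 0% specificity.

```python
import math


def safe_div(numerator, denominator):
    """Return NaN instead of raising when a measure is undefined."""
    return numerator / denominator if denominator else float("nan")


def binary_metrics(tp, tn, fp, fn):
    """Sp, Sn, Acc, Mcc, Ppv and Npv from confusion-matrix counts."""
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "Sp": safe_div(tn, tn + fp),
        "Sn": safe_div(tp, tp + fn),
        "Acc": safe_div(tp + tn, tp + tn + fp + fn),
        "Mcc": safe_div(tp * tn - fp * fn, mcc_denominator),
        "Ppv": safe_div(tp, tp + fp),
        "Npv": safe_div(tn, tn + fn),
    }


# Hypothetical counts for illustration only.
balanced = binary_metrics(tp=90, tn=50, fp=10, fn=5)
# A classifier that labels everything positive: Npv and Mcc are undefined.
one_class = binary_metrics(tp=98, tn=0, fp=2, fn=0)
```

The second example shows why a high Ppv and accuracy can coexist with an uninformative Mcc on strongly unbalanced classes.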

Results and Discussion

Here the final 1677 protein sequences were represented by two feature groups: the combination of the three sequence feature types with and without feature subset selection. The 5 fold cross-validation test and the independent data test were performed on these two feature groups to evaluate the quality of the classifiers. Tables 3 and 4 summarize the performance evaluation parameters of the classifiers for the two feature groups. The parameters Sp, Sn, Mcc, Ppv, Npv, Acc and ROC and the time taken to build the model were obtained from the two tests at the various levels of the pathway, for the two feature groups and the two classifiers. Mcc, which is regarded as a balanced measure even for classes of different sizes, was about 0.5 or higher at level 0 (between the non-membrane and trans-membrane pathways) and at level 1 (between the secretory and non-secretory pathways) in both tests. In both tests, feature subset selection raised the average Mcc to about 0.6. At level 2 (between ER and others) and at level 3 (between extracellular and Golgi), the positive predictive value is high but the Mcc is below zero in most cases. This indicates disagreement between prediction and observation, caused by the small and unbalanced data at these levels.

Classifier Adaboost Bagging
Feature Sp Sn Mcc Ppv Npv Acc ROC Time in Sec. – to build the model Sp Sn Mcc Ppv Npv Acc ROC Time in Sec. – to build the model
Di-peptide+ CTD +ACD (5151 features) 50% 96% 0.521 92 67 89% 0.90 28.97 50% 99% 0.625 92 87 92% 0.93 65.98
Di-peptide+ CTD +ACD with FCBF feature subset selection (75 features) 59% 96% 0.601 93 72 91% 0.92 0.38 55% 99% 0.653 93 88 92% 0.93 0.84

Table 3: Performance evaluation summary of classifiers against features for the 5 fold cross-validation test at all levels of hierarchy for dataset ASN_G_1677. Performance evaluation of classification of proteins between non-membrane and trans membrane pathway at Level 0.

Classifier Adaboost Bagging
Feature Sp Sn Mcc Ppv Npv Acc ROC Time in Sec. – to build the model Sp Sn Mcc Ppv Npv Acc ROC Time in Sec. – to build the model
Di-peptide+ CTD +ACD (5151 features) 86% 69% 0.559 78 79 79% 0.85 26.11 87% 77% 0.645 81 84 83% 0.91 78.5
Di-peptide+ CTD +ACD with FCBF feature subset selection (77 features) 86% 73% 0.591 79 81 80% 0.87 0.36 87% 73% 0.613 81 81 81% 0.88 1.23

Table 3a: Performance evaluation summary of classifiers against features for the 5 fold cross-validation test at all levels of hierarchy for dataset ASN_G_1677. Performance evaluation of classification of proteins between secretory and non-secretory pathway at Level 1.

Classifier Adaboost Bagging
Feature Sp Sn Mcc Ppv Npv Acc ROC Time in Sec. – to build the model Sp Sn Mcc Ppv Npv Acc ROC Time in Sec. – to build the model
Di-peptide+ CTD +ACD (5151 features) 0% 99% 0.01 98 0 98% 0.58 10.3 0% 100% <0 98 <0 98% 0.76 18.86
Di-peptide+ CTD +ACD with FCBF feature subset selection (32 features) 10% 99% 0.15 99 25 98% 0.93 0.02 0% 100% <0 98 <0 98% 0.79 0.05

Table 3b: Performance evaluation summary of classifiers against features for the 5 fold cross-validation test at all levels of hierarchy for dataset ASN_G_1677. Performance evaluation of classification of proteins between ER and others at Level 2.

Classifier Adaboost Bagging
Feature Sp Sn Mcc Ppv Npv Acc ROC Time in Sec. – to build the model Sp Sn Mcc Ppv Npv Acc ROC Time in Sec. – to build the model
Di-peptide+ CTD +ACD (5151 features) 82% 63% 0.454 66 80 75% 0.8 14.58 87% 56% 0.459 71 78 76% 0.82 40.34
Di-peptide+ CTD +ACD with FCBF feature subset selection (33 features) 87% 57% 0.465 71 79 76% 0.81 0.08 89% 63% 0.539 75 81 80% 0.86 0.25

Table 3c: Performance evaluation summary of classifiers against features for the 5 fold cross-validation test at all levels of hierarchy for dataset ASN_G_1677. Performance evaluation of classification of proteins between Nucleus, Cytosol and others at Level 2.

Classifier Adaboost Bagging
Feature Sp Sn Mcc Ppv Npv Acc ROC Time in Sec. – to build the model Sp Sn Mcc Ppv Npv Acc ROC Time in Sec. – to build the model
Di-peptide+ CTD +ACD (5151 features) 0% 100% <0 100 0 99% 0.35 10.2 0% 100% <0 100 <0 100% 0.3 7.59
Di-peptide+ CTD +ACD with FCBF feature subset selection (11 features) 50% 100% 0.498 100 50 100% 0.5 0.02 0% 100% <0 100 <0 100% 0.3 0.02

Table 3d: Performance evaluation summary of classifiers against features for the 5 fold cross-validation test at all levels of hierarchy for dataset ASN_G_1677. Performance evaluation of classification of proteins between extra-cellular and golgi at Level 3.

Classifier Adaboost Bagging
Feature Sp Sn Mcc Ppv Npv Acc ROC Time in Sec. – to build the model Sp Sn Mcc Ppv Npv Acc ROC Time in Sec. – to build the model
Di-peptide+ CTD +ACD (5151 features) 74% 47% 0.208 54 68 63% 0.68 8.72 77% 50% 0.279 59 70 66% 0.72 23.42
Di-peptide+ CTD +ACD with FCBF feature subset selection (26 features) 77% 48% 0.264 58 69 66% 0.7 0.05 76% 59% 0.349 62 74 69% 0.75 0.34

Table 3e: Performance evaluation summary of classifiers against features for the 5 fold cross-validation test at all levels of hierarchy for dataset ASN_G_1677. Performance evaluation of classification of proteins between Nucleus and Cytosol at Level 3.

Classifier Adaboost Bagging
Feature Sp Sn Mcc Ppv Npv Acc ROC Time in Sec. – to build the model Sp Sn Mcc Ppv Npv Acc ROC Time in Sec. – to build the model
Di-peptide+ CTD +ACD (5151 features) 52% 97% 0.583 93 76 91% 0.91 29.53 52% 99% 0.638 93 87 92% 0.94 65.28
Di-peptide+ CTD +ACD with FCBF feature subset selection (75 features) 64% 93% 0.544 94 58 89% 0.9 0.36 56% 97% 0.587 93 72 91% 0.93 0.88

Table 4: Performance evaluation summary of classifiers against features for the independent test at all levels of hierarchy for dataset ASN_G_1677. Performance evaluation of classification of proteins between non membrane and trans membrane pathway at Level 0.

Classifier Adaboost Bagging
Feature Sp Sn Mcc Ppv Npv Acc ROC Time in Sec. – to build the model Sp Sn Mcc Ppv Npv Acc ROC Time in Sec. – to build the model
Di-peptide+ CTD +ACD (5151 features) 82% 71% 0.528 74 79 77% 0.84 25.17 82% 74% 0.558 75 81 78% 0.87 78.3
Di-peptide+ CTD +ACD with FCBF feature subset selection (77 features) 80% 71% 0.514 72 79 76% 0.85 0.36 86% 73% 0.599 79 82 81% 0.87 1.22

Table 4a: Performance evaluation summary of classifiers against features for the independent test at all levels of hierarchy for dataset ASN_G_1677. Performance evaluation of classification of proteins between secretory and non-secretory pathway at Level 1.

Classifier Adaboost Bagging
Feature Sp Sn Mcc Ppv Npv Acc ROC Time in Sec. – to build the model Sp Sn Mcc Ppv Npv Acc ROC Time in Sec. – to build the model
Di-peptide+ CTD +ACD (5151 features) 0% 100% <0 97 <0 97% 0.48 10.41 0% 100% <0 97 <0 97% 0.5 10.63
Di-peptide+ CTD +ACD with FCBF feature subset selection (32 features) 13% 100% 0.241 98 50 97% 0.94 0.03 0% 100% <0 97 <0 97% 0.5 0.03

Table 4b: Performance evaluation summary of classifiers against features for the independent test at all levels of hierarchy for dataset ASN_G_1677. Performance evaluation of classification of proteins between ER and others at Level 2.

Classifier Adaboost Bagging
Feature Sp Sn Mcc Ppv Npv Acc ROC Time in Sec. – to build the model Sp Sn Mcc Ppv Npv Acc ROC Time in Sec. – to build the model
Di-peptide+ CTD +ACD (5151 features) 83% 48% 0.326 61 74 70% 0.77 13.78 85% 49% 0.369 65 75 72% 0.8 40.28
Di-peptide+ CTD +ACD with FCBF feature subset selection (33 features) 91% 50% 0.457 76 76 76% 0.83 0.08 88% 53% 0.442 71 77 75% 0.83 0.25

Table 4c: Performance evaluation summary of classifiers against features for the independent test at all levels of hierarchy for dataset ASN_G_1677. Performance evaluation of classification of proteins between Nucleus, Cytosol and others at Level 2.

Classifier Adaboost Bagging
Feature Sp Sn Mcc Ppv Npv Acc ROC Time in Sec. – to build the model Sp Sn Mcc Ppv Npv Acc ROC Time in Sec. – to build the model
Di-peptide+ CTD +ACD (5151 features) 0% 100% <0 99 <0 99% 0.5 9.89 0% 100% <0 99 <0 99% 0.5 7.22
Di-peptide+ CTD +ACD with FCBF feature subset selection (11 features) 0% 100% <0 99 <0 99% 0.5 0.02 0% 100% <0 99 <0 99% 0.5 0

Table 4d: Performance evaluation summary of classifiers against features for the independent test at all levels of hierarchy for dataset ASN_G_1677. Performance evaluation of classification of proteins between extra-cellular and golgi at Level 3.

Classifier Adaboost Bagging
Feature Sp Sn Mcc Ppv Npv Acc ROC Time in Sec. – to build the model Sp Sn Mcc Ppv Npv Acc ROC Time in Sec. – to build the model
Di-peptide+ CTD +ACD (5151 features) 80% 38% 0.203 57 65 63% 0.67 8.7 82% 42% 0.255 61 67 65% 0.68 24.17
Di-peptide+ CTD +ACD with FCBF feature subset selection (26 features) 74% 44% 0.183 53 66 62% 0.68 0.06 75% 49% 0.245 57 68 64% 0.67 0.14

Table 4e: Performance evaluation summary of classifiers against features for the independent test at all levels of hierarchy for dataset ASN_G_1677. Performance evaluation of classification of proteins between Nucleus and Cytosol at Level 3.

ROC analysis provides a systematic tool for quantifying the impact of variability among individuals' decision thresholds. The term receiver operating characteristic (ROC) originates from the use of radar during World War II. Just as American soldiers deciphered a blip on the radar screen as a German bomber, a friendly plane, or just noise, radiologists face the task of identifying abnormal tissue against a complicated background. As radar technology advanced during the war, the need for a standard system to evaluate detection accuracy became apparent. ROC analysis was developed as a standard methodology to quantify a signal receiver's ability to correctly distinguish objects of interest from the background noise in the system.

Comparison

In this study, a hierarchical system for the prediction of protein subcellular localization was tested. To assess our method thoroughly, we compared it with the published reports of LOCtree [16] and LocTree2 [17]. LOCtree used the amino acid composition, the composition of the 50 N-terminal residues, the pseudo amino acid composition from three secondary structure states and Signal server [52] outputs as a feature vector for a support vector machine. LocTree2 used profiles created with BLAST [23].

The results reported by LocTree2 [17] are directly comparable to ours in terms of the dataset selected, when no feature subset selection is applied. The overall accuracy reported by LocTree2 [17] is the positive predictive value (Ppv) from five fold cross-validation experiments; Table 5 compares it with our Ppv values at all levels.

Level | LocTree2 Ppv (5 fold cross-validation with SVM) | Sequence feature (our method, 5 fold cross-validation) | Adaboost Ppv | Bagging Ppv
Level 0 - Non-membrane and trans-membrane pathway | 90% | Di-peptide + CTD + ACD (5151 features) | 92% | 92%
 |  | Di-peptide + CTD + ACD with FCBF feature subset selection (75 features) | 93% | 93%
Level 1 - Secretory and non-secretory pathway | 83% | Di-peptide + CTD + ACD (5151 features) | 78% | 81%
 |  | Di-peptide + CTD + ACD with FCBF feature subset selection (77 features) | 79% | 81%
Level 2 - ER and others | 75% | Di-peptide + CTD + ACD (5151 features) | 98% | 98%
 |  | Di-peptide + CTD + ACD with FCBF feature subset selection (32 features) | 99% | 98%
Level 2 - Nucleus, Cytosol and others | 75% | Di-peptide + CTD + ACD (5151 features) | 66% | 71%
 |  | Di-peptide + CTD + ACD with FCBF feature subset selection (33 features) | 71% | 75%
Level 3 - Extra-cellular and Golgi | 80% | Di-peptide + CTD + ACD (5151 features) | 100% | 100%
 |  | Di-peptide + CTD + ACD with FCBF feature subset selection (11 features) | 100% | 100%
Level 3 - Nucleus and Cytosol | 67% | Di-peptide + CTD + ACD (5151 features) | 54% | 59%
 |  | Di-peptide + CTD + ACD with FCBF feature subset selection (26 features) | 58% | 62%

Table 5: Comparison of the 5 fold cross-validation results with the published results of LocTree2.

Tables 3 and 4 show that at level 0, between the non-membrane and trans-membrane pathways, both the 5-fold cross-validation and the independent data test with Adaboost M and Bagging gave accuracies above 89% with Mcc above 0.5, with and without feature subset selection. In the 5-fold cross-validation test, Bagging with feature subset selection gave accuracy similar to LocTree2 at level 1, between the secretory and non-secretory pathways, and at level 2, between nucleus, cytosol and others, with an Mcc value greater than 0.44. At level 3, between the nucleus and cytosol pathways, Bagging gave a positive predictive value of 62% with an Mcc of 0.34, while LocTree2 reported 67%.
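For readers who want to reproduce the measures quoted above, the following is a minimal sketch of how accuracy, Ppv and Mcc can be computed from the counts of a binary confusion matrix using their standard definitions. The function names and the example counts are illustrative assumptions; they are not taken from the study's data or software.

```python
import math


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all predictions that are correct."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def ppv(tp: int, fp: int) -> float:
    """Positive predictive value: TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient, returned as 0.0 when undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical counts for one two-class node (not results from the paper).
tp, tn, fp, fn = 62, 70, 38, 30
print(f"accuracy={accuracy(tp, tn, fp, fn):.2f}")
print(f"Ppv={ppv(tp, fp):.2f}")
print(f"Mcc={mcc(tp, tn, fp, fn):.2f}")
```

Because Mcc uses all four cells of the confusion matrix, it can remain modest even when Ppv or accuracy looks high on an unbalanced class distribution, which is why both measures are reported side by side.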

Conclusion

Protein transport to compartments is a topic that is still poorly understood. Major protein localisation prediction methods have been implemented using standard machine learning algorithms with a parallel architecture for classification [53-56]. Here, a novel system of ensemble learners using a hierarchical architecture, with features extracted directly from full-length protein sequences, was tested with and without feature subset selection. Test results at the non-membrane pathway of the hierarchy show that the prediction accuracy can be significantly improved by using the Bagging classifier and FCBF feature subset selection, with a significant reduction in the time needed for model building. Accuracy above 90% using Bagging on independent data tests indicates that the native protein localization is imprinted onto the protein sequence for each compartment. The sequence features tested share a common composition and explicitly utilize the intrinsic correlation between proteins that share these common features. Additionally, this hierarchical structure has provided insights into the sorting process, such as the accurate distinction between the intracellular and secretory pathways. Our study supports the hypothesis reported by Nakashima and Nishikawa [15].

In the future, it should be possible to extend the classification to any level in the hierarchy using these sequence features together with the location descriptions provided by the Gene Ontology (GO) Consortium [21]. This method can predict the final location of the protein as well as the mechanism of localization. Our findings may contribute to the development of clinical strategies related to drug design. We observed that, as one descends the hierarchical path, the prediction accuracy progressively decreases as the complexity of the classification task increases. The best-scoring decisions are at the top, and the worst are at the bottom. A major problem with this type of hierarchical model is its inability to correct a prediction mistake made at the top node.
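The limitation noted above can be illustrated with a small sketch of top-down routing through a decision hierarchy. The tree below is a simplified, hypothetical two-level version of the non-membrane pathway, and the threshold rules and feature names (`sec_score`, `er_score`, `nuc_score`) are stand-ins for trained classifiers; none of them come from the study itself.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

Features = Dict[str, float]


@dataclass
class Node:
    """One node in the hierarchy; leaves have no children and no decider."""
    name: str
    decide: Optional[Callable[[Features], str]] = None
    children: Dict[str, "Node"] = field(default_factory=dict)


def predict_path(root: Node, x: Features) -> List[str]:
    """Walk from the root to a leaf, trusting each node's decision in turn."""
    path = [root.name]
    node = root
    while node.children:
        node = node.children[node.decide(x)]
        path.append(node.name)
    return path


# Toy deciders: simple thresholds standing in for trained ensemble learners.
secretory = Node(
    "secretory",
    lambda x: "ER" if x["er_score"] > 0.5 else "other",
    {"ER": Node("ER"), "other": Node("extracellular/Golgi")},
)
non_secretory = Node(
    "non-secretory",
    lambda x: "nucleus" if x["nuc_score"] > 0.5 else "cytosol",
    {"nucleus": Node("nucleus"), "cytosol": Node("cytosol")},
)
root = Node(
    "non-membrane",
    lambda x: "secretory" if x["sec_score"] > 0.5 else "non-secretory",
    {"secretory": secretory, "non-secretory": non_secretory},
)

nuclear_protein = {"sec_score": 0.2, "er_score": 0.1, "nuc_score": 0.9}
# Same protein, but the top node errs: the strong nuclear signal is never consulted.
misrouted = dict(nuclear_protein, sec_score=0.7)

print(predict_path(root, nuclear_protein))
print(predict_path(root, misrouted))
```

In the second call the lower-level node for nucleus versus cytosol is never reached, so no amount of accuracy at that node can recover the correct leaf once the top decision is wrong.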

References
