CART Assignment of Folding Mechanisms to Homodimers with Known Structures

Copyright: © 2010 Suresh A, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Abbreviations: 2S: 2 State; 3S: 3 State; 3SDI: 3 State Dimer Intermediate; 3SMI: 3 State Monomer Intermediate; B/2: Interface Area; CART: Classification And Regression Tree; I/T: Interface to Total residue ratio; ML: monomer length; PPV: positive predictive value


Background
Homodimers play an important role in catalysis and cellular regulation. Moreover, a couple of homodimers have been described as cancer-targets (U.S. Patent office), (Tanaka et al., 2007;Schulke et al., 2003). Thus, the importance of homodimer structures is recognized. The formation of homodimers in cellular biology is interesting and the mechanism (2-state (2S), 3-state (3S)) of folding is more fascinating . Two-state (2S) homodimers (Zhanhua et Liang et al., 2003). The folding data is obtained by denaturation techniques involving thermal and chemical agents. The denatured fraction is studied by CD, NMR and absorption. But these experiments are very time consuming and tedious. Thus, folding data is known only for a few homodimers, although large number of structures is known and available at the protein databank (PDB). Denaturation experiments (using temperature and chemical agents), although tedious to perform, have played a vital role in understanding the structural architecture and folding pattern of homodimers.
A review of unfolding data of homodimers by Neet and Timm (1994) showed that some homodimers denature by a two-state equilibrium transition (2S) while others have stable intermediates in the process (3S). In addition, the conformational stability of these homodimers was found to be related to the size of the polypeptide and the nature of subunit interface. Li et al. (2005) identified structural parameters (based on 41 homodimer structures) for the classification of homodimers into 2S and 3S. The cluster of small-sized proteins with large interface area and high interface hydrophobicity were found to be 2S homodimers, while the cluster of large proteins with small interface area and low interface hydrophobicity were found Abstract Protein homodimers play a critical role in catalysis and regulation and their mechanism of folding is intriguing. with known 3D structures. Determination of folding mechanisms through classical denaturation experiments is both time consuming, tedious, and expensive. Therefore, it is of interest to predict their folding mechanism. Furthermore, a large number of homodimers structures with unknown folding mechanism are available in the PDB. Hence, it is compelling to predict their folding mechanism using structural features intrinsic of each complex structure. Thus, we developed a classifi cation and regression tree (CART) model using predictive parameters ((a) monomer protein size (ML); (b) interface area (B/2); (c) interface to total residues (I/T) ratio) derived from a dataset (46 homodimers with both known structures and folding mechanism) for folding mechanisms prediction. The dataset was subjectively divided into training (13 2S; 6 3SMI; 3 3SDI) and testing (14 2S; 6 3SMI; 4 3SDI) sets for validation. The model performed fairly well for predicting 2S and 3SMI in both during training and testing using ML and I/T as predictive variables. However, it should be noted that the performance of model in classifying 3SDI is poor. Nonetheless, the model was not stable with the inclusion of the predictive variable B/2 and hence, was not considered during training and testing. The CART model produced accuracies of 85% (2S), 83% (3SMI) and 100% (3SDI) with positive predictive values (PPV) of 100% (2S), 83% (3SMI) and 75% (3SDI) during training. It then produced accuracies of 100% (2S) and 50% (3SMI) with positive predictive values (PPV) of 74% (2S), 60% (3SMI) during testing. Thus, we then used the model to assign folding mechanisms to protein homodimers with known structures and unknown folding mechanisms. This exercise provides a framework for predicted homodimer structures with unknown folding mechanism for further verifi cation through folding experiments. The CART model was able to assign folding mechanisms to all (169) the homodimer structures (with unknown folding data) due its automatically robust learning capabilities unlike the manually developed decision model which left some structures unassigned.
to belong to the 3S category. This study shows the importance of structural features such as monomer protein size (ML) and interface area (B/2) for distinguishing 2S from 3S. Tsai et al. (1997) studied 187 stable and 57 symmetrically related oligomeric interfaces. The architecture of 2S interfaces was found to be similar to protein cores, where the monomer chains fold cooperatively. On the other hand, 3S interfaces were found to resemble binding of already folded proteins and mirrored the monomer architecture only in general outline. Levy et al. (2004) suggested that the native protein 3D structure is the major factor governing the choice of homodimer folding and binding mechanism in 11 homodimers with known unfolding data. The study showed that as far as protein folding is concerned, the protein topology is the main factor governing the protein binding mechanism and the degree of topological frustration of a monomer determines whether the binding will occur between two unfolded or folded chains. Mei et al. (2005), defined Interface amino acid residue (IAR -distance between the first and last amino acid that take part in the inter-subunit interaction) and squared loop length (SLL -sum of the squared distances between two successive residues of the monomer) in 32 homodimer structures to find a possible correlation between protein size, sequence and quaternary structure. They propose that medium-sized proteins of classes A (2S) and C (3SDI) models are highly stable due to their large IAR and SLL . 2S proteins were observed to be small sized, 3SMI were medium sized, while 3SDI proteins often existed as large-sized proteins. I/T measure was also used to group 2S, 3SMI and 3SDI homodimers into categories with large I/T (>50%), moderate I/T (50-25%) and small I/T (<25%) interfaces. The study provided a 2-dimensional insight into the interaction of the interface residues, while considering 2S, 3SMI and 3SDI homodimers. Suresh et al. (2009) described a decision tree model to classify 47 homodimers whose folding data was already known (by denaturation experiments) (Wójciak et al., 2003). The model worked based on the structural parameters protein-size (ML), number of interface to total residue ratio (I/T) and interface Area (B/2) and yielded positive predictive values of 71.4%, 58.4% and 57.1% in classifying 2S, 3SMI and 3SDI respectively. The model was further drawn to predict the folding information to a set of homodimers structures whose folding information was not known. The manually set up decision model was able to establish relationship between structural features and folding mechanism despite its inability to assign some structures with folding mechanism. Thus, assignment of folding mechanism for homodimer structures is of both interest and need using simple yet robust classification models. Here, we describe the development, performance and application of a Classification and Regression Tree (CART) model based on structure derived predictive variables such as ML, I/T and B/2 for the prediction of homodimer folding mechanisms given structures.

Methodology
Homodimer structure dataset with known folding mechanism: A dataset of 46 homodimers with known folding data (Table 1) was used in this analysis to develop the CART model. The dataset consists of 2S (27), 3SMI (12), and 3SDI (7) homodimers. The predictive parameters such (ML), interface area (B/2), and ratio of interface to total residues (I/T) were calculated for each entry in the dataset ( Table 2).

Interface area (B/2):
Interface area was calculated using change in solvent accessible surface area (ASA) from a monomer state to a dimer state. ASA was calculated using algorithm described and implemented in the software Surface Racer 5.0 (Tsodikov et Table 2: Characteristics features of the homodimer dataset with known folding mechanism and structural data. These parameters were used in the development of a CART model for the prediction of folding mechanism for homodimers with known structural data and unknown folding data. Interface to total residue (I/T) ratio: It is the ratio between interface residues (number of residues per monomer involved in homodimer interactions) to the total number of residues in the monomer protein. 3SDI proteins lie in the range of 5 to 50% and 3SMI in the range of 9 to 44%; while the 2S proteins lie in the range of 6 to 80%.

Predictor variables:
The predictor variables used in model development are monomer protein size (ML), interface-area (B/2), and interface residues to total residues number ratio (I/T). The minimum, maximum, average and standard-deviation values are shown for each predictor ( Table 2).

Target variables:
The three target variables defined in the model are 2S, 3SMI and 3SDI.
CART model: Classification and Regression Tree (CART) is a robust decision-tree system used for data-mining and predictive modeling (Breiman et al., 1984;Steinberg and Colla, 1997). CART is also known as Binary recursive partitioning; binary because the data (parent node) is split into two child nodes based on certain splitting conditions; it is recursive because the process is repeated with the child node acting as the parent in the subsequent iteration. The CART looks for all possible splits in the group of predictor variables. The Geni impurity criterion (probability measure of misclassification in a group of known data) is used as the splitting rule by default. During the splitting process, intermediate nodes that can no longer be split from terminal nodes. When all the active nodes become terminal nodes, the tree growing process terminates. Each node of the decision tree (including the terminal node) presents some information about a target variable or a group of target variables.
CART-Classifi cation and Regression Tree; ML-Monomer-Length; B/2-Interface Area; I/T-ratio of number of Interface residues to total number residues in the monomer chain. The unknown dataset is a dataset of homodimer structures from PDB with unknown folding mechanism information. The CART version 6.0 (Salford Systems, http://salford-systems. com/cart.php) is used in this study. The Target (2S, 3SMI, 3SDI) and predictor variables (ML, I/T and B/2) are defined in the CART system. Figure 1 shows a flow-chart describing the methodology employed in the development of the CART model and its implementation. CART classification and model development consists of the training and the testing phase. In the training phase, CART reads the predictor variables from a training subset of 22 homodimers and learns to classify the dataset building an overly large tree ( Table 3, Table  4). CART then classifies the testing set using the overly large tree and calculates the error rate. The largest tree or sub-tree with the minimum error rate is chosen as the CART model. The results from the testing phase are given in (Table 5, Table 6).

CART prediction:
The CART model is then applied to predict the folding information to a dataset of 169 homodimers whose folding information was not known (Table 7).

Results
A dataset of 46 known homodimer structures with known folding data collected from literature is given in ( Table 1). The dataset of 46 homodimers is split indiscriminately into two subsets of 22 and 24 homodimers, each subset representing a uniform distribution of the protein folding types. The predictor variables (ML, B/2 and I/T) used in the model was calculated for each structure in training and testing test. (Table 2) shows the respective minimum, maximum, and average and standard deviation values of the predictors in the dataset The model uses predictor variables ML and I/T in the classification process during training and testing. Node 1 (root node) consists of the 22 homodimers. The data is split based on various conditions of the predictor variables at each level. At each intermediate node, a case went to the left child node if and only if a condition statement regarding the predictor variable was satisfied. The classification model consisted of 4 terminal nodes. Information regarding the distribution of folding cases (2S, 3SMI or 3SDI) is available at each node. Table 3 and Table 4 show the training results of the CART model. The CART classification model produced positive predictive values (PPV, ratio of true positives to the total number of true and false positives) 100%, 83.3% and 75% (with an accuracy of 84.6%, 83.4% and 100%) in classifying 2S, 3SMI and 3SDI homodimers respectively during the training phase. During the learning phase, CART trains on the training subset of 22 known homodimers, and develops a maximal tree. The model developed from training is further tested on the second subset of 24 homodimers (Table 5 and Table 6). The CART model displayed positive predictive values of 73.7%, and 60% (with an accuracy of 100%, and 50%) in testing the subset for 2S, and 3SMI, respectively. However, it was not able to perform well in predicting 3SDI during testing. The CART model is then applied to a dataset of 169 homodimers whose folding data is not known. A folding mechanism was assigned to all structures ( Table 7) unlike the manually set up decision tree model as described elsewhere (Wójciak et al., 2003). The CART model classified 78 homodimers as 2S, 45 as 3SMI and 46 homodimers under 3SDI respectively. Table 7 also gives a detailed comparison of the results obtained from CART and the decision-tree studied earlier.

Discussion
The mechanism of homodimer folding and binding was studied based on thermal denaturation experiments using fluorescence (Bowie et al., 1989;Milla and Sauer,1994 (Apiyo et al., 2001). Homodimers fold by three folding mechanisms (2S, 3SMI and 3SDI). Three dimensional structures for these homodimers with known folding mechanism are already available in the protein data bank (PDB). The consideration of homodimers as drug targets in cancer has been realized (Tanaka et al., 2007) (Shulke et al., 2007).Therefore, it is of importance to study homodimers binding and folding. The documentation of folding mechanisms for homodimers through denaturation experiments is tedious. Thus, folding mechanism is known only for a handful of    Table 4: Summary of results from a training experiment is given (see Table 3). 2S  3SMI  3SDI  2S  14  14  0  0   3SMI  6  3 3 0 3SDI 4 2 2 0 Table 5: CART classification of an independent testing set.  Table 6: Summary of results from a testing experiment is given (see Table 5).  We describe the performance of the CART model in assigning folding mechanisms to homodimer structures. The dataset of 46 homodimers (whose folding data is known) is indiscriminately split into two subsets of twenty-two (22) and twenty-four (24) homodimers. One subset acts as the training set (13-2S; 6-3SMI; 3-3SDI) while the other acts as a testing set (14-2S; 6-3SMI; 4-3SDI). The CART algorithm develops an overly large classification-tree based on structural parameters I/T ratio and Interface Area (B/2) from the training set (Table 3 and Table 4) shows the results of the training phase, wherein 2/13 of 2S homodimers are misclassified as 3SMI, 1/6 of 3SMI homodimers have been misclassified as 3SDI, while no misclassification is seen in 3SDI. The model thus created was found to yield positive predictive values 100%, 83% and 75% (with an accuracy of 85%, 83% and 100%) in learning to classify 2S, 3SMI and 3SDI homodimers respectively. Thus, an overly-large tree is grown by the end of the training phase. CART classifies the test sample to determine the misclassification rate of the largest tree or every sub-tree developed in the training phase. The tree or sub-tree with the lowest misclassification rate is chosen as the CART model. The testing-phase CART showed positive predictive values of 74%, and 60% (with an accuracy of 100%, and 50%) in classifying 2S, and 3SMI homodimers respectively (Tables 5 and Table 6). However, the CART model was not able to classify the 3SDI appropriately. The CART was found to perform better than the decision model in classifying 2S and 3SMI.

Dataset
The CART model was then applied to a larger dataset of 169 homodimers whose folding data was unknown. CART was able to assign folding information (target-variable) to all the unknown homodimers when the structural data from the dataset was passed through the classification model ( Table 7). The decision-model was unable to fit 18 structures unlike the CART model. This was because the parameter values of these unassigned homodimers did not fall into any of the condition values set by the decision-tree. CART categorized 78, 45 and 46 homodimers of the dataset were into 2S, 3SMI and 3SDI homodimers respectively, while the decision model grouped 42, 70 and 39 homodimers into 2S, 3SMI and 3SDI, respectively. These results not only emphasize the importance of the structural features in the prediction of homodimer folding, but also prove CART to be a robust and efficient method for classification and prediction of features in multi-parameter datasets, compared to its predecessor. Further, the predicted data serves as a framework for better understanding of the folding mechanism given their structures. It should be kept in mind that these predicted results should be further verified using denaturation experiments.

Conclusion
Homodimer proteins fold through 3 different mechanisms 2S, 3SMI and 3SDI and are grouped accordingly. The presence of homodimers with unknown folding data emphasizes the need for a more efficient fold classification and prediction method. It was of interest to use the CART system in development of a classification model and further predict the folding information to an unknown dataset of 169 homodimers. The model performed fairly well for predicting 2S and 3SMI in both during training and testing using ML and I/T as predictive variables. However, it should be noted that the performance of model in classifying 3SDI is not discriminative in nature. Nonetheless, the model was not stable with the inclusion of the predictive variable B/2 and hence, was not considered further for prediction during training and testing. The CART model produced accuracies of 85% (2S), 83% (3SMI) and 100% (3SDI) with positive predictive values (PPV) of 100% (2S), 83% (3SMI) and 75% (3SDI) during training. It then produced accuracies of 100% (2S) and 50% (3SMI) with positive predictive values (PPV) of 74% (2S), 60% (3SMI) during testing. Thus, we then used the model to assign folding mechanisms to protein homodimers with known structures and unknown folding mechanisms. This exercise provides a framework for predicted homodimer structures with unknown folding mechanism for further verification through folding experiments. The CART model was able to assign folding mechanisms to all (169) the homodimer structures (with unknown folding data) due to its automatically robust learning capabilities unlike the manually developed decision model which left some structures unassigned.

Author's Contribution
PK conceived the idea. AS designed the experiment and performed the analysis with summarized results. PL participated in the analysis and helped in manuscript preparation.