Fei Li*, Lulu Cao, Huifeng Wu and Jianmin Zhao
Key Laboratory of Coastal Zone Environmental Processes, Yantai Institute of Coastal Zone Research (YIC), Chinese Academy of Sciences (CAS); Shandong Provincial Key Laboratory of Coastal Zone Environmental Processes, YICCAS, Yantai Shandong 264003, PR China
Received Date: May 14, 2014; Accepted Date: June 28, 2014; Published Date: July 05, 2014
Citation: Fei Li, Lulu Cao, Huifeng Wu, Jianmin Zhao (2014) Combined SVM-PLS Method for Predicting Estrogenic Activities of Organic Chemicals in the Coastal Water. J Coast Dev 17:388. doi:10.4172/1410-5217.1000388
Copyright: © 2014 Li F, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Visit for more related articles at Journal of Coastal Zone Management
Endocrine disrupting chemicals (EDCs); Estrogen receptor (ER); Quantitative structure activity relationship (QSAR); PLS; SVM
Scientific and public concern heightens over the potential health effects of exposure to environmental pollutants with endocrine disruption potential . Diminished or excessive production of estrogens is a major problem related to prostate cancer, spinal bular musular atrophy, and female pattern baldness. Consequently, there is a need to develop screening and testing procedures for endocrine disrupting chemicals (EDCs).
Considering the high number of potential EDCs, this remains a labor intensive and time-costing operation. It is crucial to develop efficient and economical alternative modeling approaches for the purpose of predicting the estrogenic activities of potential EDCs. Quantitative structure activity relationship (QSAR) methods are the most promising and successful tools to provide rapid and useful meanings for predicting the biological activity and chemical toxicity. They are considered as an important part of the priority setting process by the endocrine disruptor screening and testing advisory committee (EDSTAC) . QSAR are widely applied for the understanding of the mechanism of chemicals’ binding for the estrogen receptors (ER) [3-6], for androgen receptor (AR) and for several other members of the nuclear receptor family . These include electrostatic models, comparative molecular field analysis (CoMFA) which considers the overall steric and electrostatic properties of the compound of interest, computer graphic and energy (electrostatic and van der Waals) based models for fit into DNA and common reactivity patterns (COREPA) which reflect the stereoelectronic features.
In this paper, a data set consisted of experimental values which were determined by Nishihara et al. , including 517 natural, synthetic, and environmental chemicals from a broad range of structural classes. The data set was used to construct global QSAR models for the whole data set and local models for specific well-known families. Some information descriptors were selected using Forward stepwise (FS) regression from the original 709 DRAGON calculated descriptors and were applied to construct an optimal model based on SVM. Another classical method, partial least square (PLS)  was utilized to establish QSAR model to compare the results with those obtained by SVM. In addition, some careful models for specific well-known families were examined in conjunction with knowledge of the recently reported ligand-ER crystal structures.
Experiment and data set
Because it is expected that the major key target of EDCs is the nuclear hormone receptor, which binds specifically to the steroid hormone and regulates its gene expression, the yeast two-hybrid assay has been developed. Unlike yeast-based assay (YES) , another reporter gene assay using a yeast two-hybrid cells, the method contains the coactivator, so that the system more closely resembles the mammalian hormonal system.A detailed description of the experimental methods is provided in Nishihara et al. .
The overall data set consisted of more than 500 organic chemicals, including natural substances, medicine, pesticides, and industrial chemicals. Table 1 shows a summary of the test compounds with the names of 55 positive compounds. Tested chemicals consisted of natural substances (metabolites, oxidation products, etc.), medicines, food additives, pesticides, and industrial chemicals (PCBs, PCDFs, PAHs, phenols, benzenes, phthalates and adipates, and others). The estrogenic activities to the ER, expressed as log unit of 10% relative effective concentration (logREC10), are listed in Table 1.
Descriptor generation and selection
Structures of chemicals were drawn with the Chem Draw computer program. These were then geometry-optimized with the PM3 Hamiltonian using the software package Chemoffice 6.0 program, and exported into a file format suitable for MOPAC analysis. The resulting geometry was transferred into the DRAGON software that was used to calculate molecular structural descriptors. Molecular descriptor meanings and their calculation procedure are summarized in the DRAGON software, and explained in detail, with related literature references, in the Handbook of Molecular Descriptors by Todeschini and Consonni .
Generally, more descriptors should be considered in QSAR study so as to better characterize molecular structures. However, if no significant relevant or irrelevant descriptors are included, the quality of prediction and robustness of the developed QSAR model may decrease, and its interpretation becomes more difficult. Hence, descriptors selection is necessary for QSAR study.
In this work, the FS regression was employed to select the optimal subset from an original set of 709 calculated descriptors, as also did and described in other studies . As a result, 13 descriptors were obtained, which are listed in Table 2 with their physical-chemical meanings.
PLS regression was adopted here to develop QSAR model, for this method can analyze data with strongly collinear, noisy and numerous predictor variables [9,13]. PLS regression was carried out using the Simca-S package (Umetrics AB, Sweden). The conditions for computation were as follows: cross validation rounds=7, maximum iteration=200, missing data tolerance=50% and significance level (p) limit=0.05. Within Simca, the number of significant PLS components (model dimensionality) is determined by cross-validation. Cross-validation simulates how well a model predicts new data, and gives a statistic Q2cum (cumulative Q2, Q2 means the fraction of the total variation of the dependent variables that can be predicted by a component) for the final PLS model . Q2cum is a good measure of the predictive power and robustness of the model. When Q2cum of a model is larger than 0.5, the model is believed to have a good predictive ability . Besides Q2cum, model adequacy mainly was measured as the number of PLS components (A), the correlation coefficient between observed and predicted values (R), and the significance level (p). In addition, a general standard error (SE) was adopted to compare the prediction precision of different models. SE was defined like that in multiple regression analysis, i.e,
where n stands for the number of observations in the training set.
In PLS-regression modeling, a predictor variable may be important for the modeling of Y. Such variables are identified by large PLSregression coefficients. However, a variable also may be important for modeling of X, which is identified by large loadings. A summary of the importance of an X-variable for both Y and X is given by a parameter, variable importance for the projection (VIP), which is a weighted sum of squares of the PLS-weights, with the weights calculated from the amount of Y-variance of each PLS component . Therefore, terms with large values of VIP are the most relevant for explaining the dependent variable.
All the predictor variables are not necessarily to be included in a PLS model. Inclusion of redundant variables may lead to PLS models with low statistical significance . Accordingly, the following PLS analysis process was followed to obtain an optimal model. First, a PLS model with all the predictor variables was calculated. Then each variable was eliminated and new PLS analysis was performed, leading to a series of new PLS models. The one with the largest Q2cum was selected. If there were several models with the same Q2cum, the model was selected that eliminated the variable for which the VIP was the lowest in the previous model. This procedure was repeated until two predictor variables were left. Finally, the model with the largest Q2cum was selected as the optimal PLS model.
SVM is a new and very promising classification and regression method developed by Vapnik et al. . A detailed description of the theory of SVM can be referred in several excellent books and tutorials [17,18]. SVMs are originally developed for classification problems; they can also be extended to solve nonlinear regression problems by the introduction of Vapnik’s ε-insensitive loss function. The SVM method has a number of interesting properties, including an effective avoidance of overfitting, which improves its ability to build models using large numbers of molecular proterty descriptors with relatively few experimental results in the training set. The application of SVM in regression can be expressed in the following way [14,19,20]:
Suppose the training data,
where xm is the independent variables assembly of No. m sample (n-dimensional); ym is the independent variables assembly of No. m sample which is a measured value; k is the total number of training set. The kernel idea of SVM algorithms is to make a regression hyper plane, which can do the best to fit samples in space. The linear function is formulated in the high dimensional feature space, with the form of function:
where Φ(x) is the high dimensional feature space, which is nonlinearly mapped from the input space x, w is the weight vector to be identified in the function, and b is the threshold.
The optimal regression function is given by the minimum of the functional,
where C is a pre-specified value, and are slack variables representing upper and lower constraints on the outputs of the system.
In this work, anε -insensitive loss function was used which can be presented in Figure 1. Based on the ε-insensitive loss function and Lagrange function, the original fitting problems can be transformed as the corresponding dual Lagrangian form, which can be given by,
Solving Equation.(6) with constraints Equation.(8) determines the Lagrange multipliers: , and the regression function is given by Equation.(3), where
Finally, considering kernel function K(x, xi), the space transformation of inner product operation can be realized. By introducing Lagrange multipliers and exploitingthe optimality constraints, decision function can take thefollowing form:
Where and αiare the introduced Lagrange multipliers.
According to Karush–Kuhn–Tucker (KKT)conditions, only a number of coefficients among and willbe nonzero, and the data points corresponding tothem could bedefinedas support vectors, which can determine the hyper plane. In this equation, K(x, xi)refers to kernelfunction, including linear, polynomial, radial basisfunction (RBF), and sigmoid function.
The regression performance of SVM depends on the combinationof several parameters. They are penalty factor C, ε of theε-insensitive loss function, the kernel type,and its corresponding parameters. The penalty factor C is a regularizationparameter that controls the tradeoff between maximizing themargin and minimizing the training error. If C is too small,hen insufficient stress will be placed on fitting the trainingdata. If C is too large, then the algorithm will over fit thetraining data.The optimal value forεdepends on the type of noisepresent in the data, which is usually unknown. Even ifenough knowledge of the noise is available to select anoptimal value forε, there is thepractical consideration ofthe number of resulting support vectors. “ε-insensitivity”prevents the total training set meeting boundary conditionsand so allows for the possibility of sparsely in the dualformulation’s solution. So, choosing the appropriate valueofεis critical from theory.The kernel type is another important one. In our study, the Gaussian radial basisfunction is selected, because it has only one kernel parameter and has been commonly used in regression, shown as below:
The kernel parameter σcontrols the amplitudeof the Gaussian function andcontrols the generalizationability of SVM. We have to optimize σ and findthe optimal one. So we should take effective and reliable measures to set the three parameters in RBF-SVM. In this study, Random Search Technique is proposed.
The overall performance of SVM is evaluated in terms of R and a root-mean-squared error (RMSE) according to the equation below
Where yk is the desired output, is the actual output of the SVM model, and n is the number of compounds in analyzed set.
The SVM model in our present study was implemented using the software LibSVM that is efficient software for classification and regression developed by Chih-Chang and Chih-Jen Lin . All the algorithms used in this study were written inMatlab 7.0 and run on a personal computer (Intel Celeron-420 processor /1.66GHz 512MB RAM).
After the FS regression, thirteen molecular structural descriptors are obtained and are listed in Table 2 with their physical-chemical meanings.
The PLS regression was used to perform regression analysis, with logREC10 as a dependent variable and thirteen selected 3D descriptors as independent variables. According to the variable selection procedure mentioned above, 8 descriptors (Mor03p, L3e, R8p, RTv+, R8e, R1p+, R7p+ and HATSv) were obtained. The predicted logREC10 values by the PLS method are given in Table 1, and the statistical values of Q2cum, SE and correlation coefficient R are shown in Table 3. The linear function  was built as follows, with parameters defined in Table 2:
logREC10 = 8.253 + 0.282Mor03p - 1.820L3e - 0.988R8p +
3.284RTv+ - 2.774R8e + 0.567R1p+ - 11.915R7p+ - 0.149HATSv (3)
n=55, A=2, R2X(adj)(cum)=0.587, R2Y(adj)(cum)=0.757
Where A is the number of PLS components, R2X(adj)(cum) and R2Y(adj)(cum) stand for cumulative variance of all the predictor variables and dependent variable, respectively, explained by all extracted components.So it can be concluded that two PLS components were selected in the QSAR model (Equation. 3), and the two PLS components explained 58.7% of the variance of the independent variables, and 75.7% of the variance of the dependent variable. R2Y(adj)(cum) should act as a criterion for optimal variable selection since they describe the performance of models.
Q2cum value of our universal QSAR model is as high as 0.678, indicating that the model has good predictive ability and robustness. Figure 1 shows these predicted values of logREC10 versus experimental values. Concerning the goodness of fit of the model, the correlation coefficient(R) between the observed and the predicted logREC10 with multiple correlation coefficientsis 0.870. All the absolute residuals are less than 3×SE and it indicated that there were not outliers (Figure2). Therefore it can be concluded that the fitting results are satisfactory.
All the predictor variables are listed in Table 4. The VIP values indicate the significance of the variable in explaining the variance of the dependent variable.
c=64.0, g=0.0078125, p=0.0625
R2=0.888, MSE=0.021, SE=0.527
Model validation is one of the most important processes of QSAR development . Any model needs to be validated before it is used for “understanding” or for predicting new events such as the biological activities of new compounds or the yield and impurities at other process conditions. Many researchers apply the leave-one-out (LOO) or leave-some-out (LSO) cross-validation procedures. Another widely used approach to estimate model robustness is the so called y-randomization . However, some researches suggested that the only way to estimate the true predictive power of a QSAR model is to compare the predicted and observed activities of an external test set of compounds that were not used in the model development [23-25].
To validate the developed QSAR model, approximately 60% of the compounds under study were selected randomly and used to develop a new PLS model using the same descriptors, then the new model was used to predicate the log RP values of the remaining 40% compounds. The procedure was repeated 10 times, and the final results are shown in Table 3. From the results, it indicates that the developed QSAR models have good robustness and predictive ability.
Comparison of the results
Figure 4 gives the comparison between the results obtained by PLS and SVM based on the SE. As shown in Table 3, the SVM model gives the highest correlation coefficient R. It indicates that the SVM performed better than the PLS method. It also showed the better generalization ability. The reason may be that the SVM method embodies the structural risk minimization principle which minimizes an upper bound of the generalization error rather than minimizes the training error. This eventually leads to better generalization than neural networks, which implement the empirical risk minimization principle and may not converge to global solutions.
From a practical point of view, interpreting the descriptors used in the models could provide some insight into factors that are likely to govern the ER binding of EDCs and help us to understand which interactions may play an important role in the binding process.
Model (3) extracts two PLS components that are relevant to 8 predictor variables. The factors governing logREC10 can be interpreted by PLS weights of the variables included in model (3). The respective weights of the 8 calculated descriptors retained in the PLS model are shown in Table 4. From the PLS weights, one could see how much one descriptor contributes to the interpretation of the variance of estrogenic activity and how they relate to each other.
The first PLS component is loaded primarily on the five descriptors, L3e, Mor03p, R7p+, R8e and HATSv (Mor03p on the positive side, L3e, R7p+, R8e and HATSv on the negative). 3D-MoRSE descriptors appearing in the model are important because they take into account the 3D arrangement of the atoms without ambiguities (in contrast with those coming from chemical graphs), and also because they do not depend on the molecules with great structural variance and being a characteristic common to all of them. This type of indices are based on the idea of obtaining information from the 3D atomic coordinates by the transform used in electron diffraction studies for preparing theoretical scattering curves . In order to take into account the specific contributions of the atoms to the property being studied, different atomic properties can be employed as weighting schemes. Mor03p corresponds to signal 03 and is weighted by atomic polarizabilities. On the other hand, for L3e, R7p+, R8e and HATSv, W* and the corresponding coefficient in model (3) are both negative. L3e is a WHIM descriptor weighted by atomic Sanderson electronegativities, and it remarkably governs logREC10, as indicated by its VIP, the largest among the predictor variables. R7p+ and R8e are R-GETAWAY descriptors, which are derived by combines the information provided by the molecular influence matrix with geometric interatomic distances in the molecule . The negative PLS weights W* and coefficient of R7p+ and R8e in the model (3) indicate the negative correlation relationship between them and logREC10. This type of elaborated 3D descriptors is able to determine the shape and size of the inhibitor. The descriptor R7p+ is R maximal autocorrelation of lag 7 weighted by atomic polarizabilities and R8e is autocorrelation of lag 8 weighted by atomic Sanderson electronegativities. HATSv is an H-GETAWAY descriptor, which encodes both the geometrical information given by the molecular influence matrix H and the topological information given by the molecular graph, weighted by selected atomic weights. The selected descriptor HATSv is a leverage-weighted total index weighted by atomic van der Waals volumes.
The second PLS component that also extract five descriptors, L3e, R8e, HATSv, R8p and R1p+. The negative PLS weights W*  and coefficient of L3e, R8e and R8p in the model (3) also indicate the negative correlation relationship between them and logREC10. R1p+ is also an R-GETAWAY descriptor weighted by atomic polarizabilities.
In conclusion, the molecular descriptors most frequently selected by FS regression can be used to predict the logREC10 value of organic chemicals. The estrogenic activity is related to distributed atomic Sanderson electronegativities, atomic polarizabilities and atomic van der Waals volumes.
Local QSAR models
As shown in Table 1, the chemicals were divided into six “families” based on their structural characters. They were: (1) natural products and related compounds; (2) medicines, food additives, and related compounds; (3) PCBs, PCDFs, PAHs, and related compounds; (4) Phenols; (5) Benzenes and heterocyclics; (6) Phthalates and adipates. To increase our understanding of the structural requirements for a chemical’s binding to ER, three linear QSAR models for three of the families were developed based on the PLS regression method, which are listed in the Table 5.
(1) natural products and related compounds
logREC10=29.602 + 24.545E1e - 29.320H1v - 30.144 R4e (5)
n=12, Q2cum=0.877, R=0.957, SE=0.549, p < 0.0001
(2) medicines, food additives, and related compounds:
logREC10=7.913 + 1.910 RDF020m - 6.706L3p (6)
n=7, Q2cum=0.934, R=0.989, SE=0.420, p < 0.0001
logREC10=16.474 + 0.040MWC01 - 1.846GATS4v - 1.710GATS1e -1.423Mor02m + 0.807Mor21e + 0.695E2e - 8.037As+ 0.009Gm - 0.668Vm - 0.878HATS6p (7)
n=25, Q2cum=0.915, R=0.983, SE=0.157, p < 0.0001
Inspecting the knowledge obtained above, it is possible to gain some information about what factors are likely to govern the ER binding ability for a specific family. This is beneficial for developing a credible model for prediction.
The two methods, PLS and SVM, were used to develop linear and nonlinear QSARs to predict estrogenic activities of 55 structurally diverse organic chemicals. Eight descriptors, which represent the features of the compounds responsible for the binding ability to estrogen receptor, were selected to develop global QSAR models. Inspection of the PLS model indicates that atomic Sanderson electronegativities, polarizabilities and van der Waals volumes may be most relevant factor controlling the binding behavior, affecting the space-matching between the ER protein and the ligand. The two resultant QSAR models were further compared with respect to statistical measures from a leave-some-out process, with SVM yielding the best model performance in terms of self-consistency and ability to predict the activity of the test chemicals. Additionally, 55 chemicals were divided into six well-known “families” according to their chemical structures. Three local QSAR models were developed, which gave us insight into the factors that govern the binding behavior for these specific chemicals.