3D QSAR Analysis on Isatin Derivatives as Carboxyl Esterase Inhibitors Using K-Nearest Neighbor Molecular Field Analysis

Carboxylesterases (CE) are ubiquitous enzymes which are responsible for the metabolism and detoxification of xenobiotics. Carboxylesterases (CEs) hydrolyze endogenous and exogenous esters to the corresponding alcohol and carboxylic acid [1-4]. In mammals, they tend to be expressed in tissues likely to be exposed to xenobiotics, including the liver, lung, small intestine, kidney, and so on [5]. CEs metabolize a number of useful drugs [6-9] such as Demerol and lidocaine, the anticancer agents capecitabine and CPT-11 [10] (irinotecan, 7-ethyl-10-[4-(1-piperidino)-1-piperidino] carbonyloxycamptothecin), breast cancer drug tamoxifen [11], antiviral drug oseltamivir, alzheimer’s drug tacrine [12], narcotics cocaine and heroin [13], antithrombogenic agent clopidogrel, cholesterol lowering drug mevastatin [11]. Therefore, identification and application of selective CE inhibitors may prove useful in modulating the metabolism of esterified drugs in-vivo and for prolonging the bioactivity of agents that are inactivated by CEs or conversely may reduce the toxicity of compounds that are activated by these enzymes.


Introduction
Carboxylesterases (CE) are ubiquitous enzymes which are responsible for the metabolism and detoxification of xenobiotics. Carboxylesterases (CEs) hydrolyze endogenous and exogenous esters to the corresponding alcohol and carboxylic acid [1][2][3][4]. In mammals, they tend to be expressed in tissues likely to be exposed to xenobiotics, including the liver, lung, small intestine, kidney, and so on [5]. CEs metabolize a number of useful drugs [6][7][8][9] such as Demerol and lidocaine, the anticancer agents capecitabine and CPT-11 [10] (irinotecan, 7-ethyl-10-[4-(1-piperidino)-1-piperidino] carbonyloxycamptothecin), breast cancer drug tamoxifen [11], antiviral drug oseltamivir, alzheimer's drug tacrine [12], narcotics cocaine and heroin [13], antithrombogenic agent clopidogrel, cholesterol lowering drug mevastatin [11]. Therefore, identification and application of selective CE inhibitors may prove useful in modulating the metabolism of esterified drugs in-vivo and for prolonging the bioactivity of agents that are inactivated by CEs or conversely may reduce the toxicity of compounds that are activated by these enzymes.

Data set and biological activity
In the present study 49 molecules of Isatin (indole-2,3-dione) derivatives [18] were used which were reported to have carboxylesterase inhibitory activity. Human intestinal carboxylesterase (hiCE) inhibitory data Ki [inhibition constant (µM)] reported have been converted to the logarithmic scale [pKi (moles)] for QSAR study.

Molecular modeling study
Molecular modeling and kNN-MFA studies were performed on HCL computer having genuine Intel Pentium Dual Core Processor and Windows XP operating system using the software Molecular Design Suite (MDS) [26]. Structures were drawn using the 2D draw application and converted to 3D structures. Structures were optimized by energy minimization and geometry optimization was done using Merck Molecular Force Field (MMFF) method with 10000 as maximum number of cycles, 0.01 as convergence criteria (root mean square gradient) and 1.0 as constant (medium's dielectric constant which is 1 for in vacuo) in dielectric properties. The default values of 30.0 and 10.0 Kcal/mol were used for electrostatic and steric energy cutoff.

Abstract
A three dimensional quantitative structure activity relationship (3D QSAR) using k nearest neighbor molecular field analysis (kNN MFA) method was performed on a series of isatin derivatives as carboxylesterase (CE) inhibitors. This study was performed with 49 compounds (data set) using sphere exclusion (SE) algorithm for the division of the data set into training and test set. SE algorithm allows constructing training sets covering all descriptor space areas occupied by representative points. Between 3.0 to 5.5 dissimilarity levels which comprises of test set size 4 to 10, kNN-MFA methodology with stepwise (SW), simulated annealing (SA) and genetic algorithm (GA) was used for building the QSAR models. Four predictive models were generated with SW-kNN MFA (pred_r2=0.7552 to 0.9376), three predictive models were generated with SA-kNN MFA (pred_r2=0.7019 to 0.9367) and two predictive models were generated with GA-kNN MFA (pred_r2=0.8226 to 0.8497). Most significant model generated by stepwise kNN-MFA showed internal predictivity 82.11% (q2=0.8211) and external predictivity 93.76% (q2=0.9376). In this model hydrophobic and steric interactions dominate the CE inhibitory activity. Hydrophobic field descriptor (H_977) with positive range indicates that positive hydrophobic potential is favorable for increase in activity and hence more hydrophobic substituent group is preferred in that region. Steric field descriptor (S_619) with negative range indicates that negative steric potential is favorable for increase in activity and hence less bulky substituent group is preferred in that region. The kNN-MFA contour plots provided further understanding of the relationship between structural features of substituted isatin derivatives and their activities which should be applicable to design newer potential CE inhibitors.

Molecular alignment:
The dataset were aligned by template based alignment method using most active molecule (49, Table 1) as a reference molecule 2 and structure 1 as a template ( Figure 1) for alignment of the molecules. The alignment of all the molecules on the template (Figure 2). In the template based alignment method, a template structure was defined and used as a basis for alignment of a set of molecules.
Grid size: Once the molecules are aligned, a grid or lattice is established which surrounds the set of compounds in potential receptor surface. Current study uses grid resolution 2 Å.
Descriptor calculation: Once the molecules are aligned, a molecular field is computed on a grid of points in space around the molecule. This field provides a description of how each molecule will tend to bind in the active site. Descriptors representing the steric, electrostatic and hydrophobic interaction energies were computed at the lattice points of the grid using a methyl probe of charge +1.
Generation of training and test set: Sphere exclusion algorithm was used for generation of training and test sets. The whole data set was divided into training and test sets using sphere exclusion algorithm [25]. This algorithm allows constructing training sets covering all descriptor space areas occupied by representative points. The higher the dissimilarity level c is, smaller the training set and larger the test set and vice versa. It is expected that the predictive ability of QSAR models generally decrease when the dissimilarity level increases. Once the training and test sets are generated, kNN methodology is applied to descriptors generated over grid.
kNN-MFA methodology for building QSAR models: Models generated by kNN-MFA in conjunction with stepwise (SW) forwardbackward, simulated annealing (SA) and genetic algorithm (GA) variable selection methods. The QSAR models were developed using Stepwise forward-backward variable selection method with pK i activity field as dependent variable and descriptors as independent variable.
The kNN technique is a conceptually simple approach to pattern recognition problems. In this method, an unknown pattern is classified according to the majority of the class memberships of its k nearest neighbors in the training set. The nearness is measured by an appropriate distance metric (e.g., a molecular similarity measure, calculated using field interactions of molecular structures).
Stepwise forward-backward variable selection method [27] (SW-kNN MFA): This method uses stepwise variable selection and k-NN principle to build QSAR model. In each step this method optimize (i) the number of nearest neighbors (k) used to estimate the activity of each compound and (ii) select variables (stepwise) from the original pool of all molecular descriptors that are used to calculate similarities between compounds. It involves a step-by-step search procedure that begins by addition of a single independent variable with optimal k value (optimizing k value from the given range of k values) and highest sum of weighted k-nearest neighbor cross validation (q 2 ) and external validation (pred_r 2 ) value amongst all available descriptors to form a model.
The parameter settings used for SW-kNN MFA are Cross correlation limit as 0.5, maximum number of variable in final equation as n/5 (n is number of compounds in training set), term selection criteria as q 2 , Ftest in as 4 and Ftest out as 3.99, variance cut-off as 0 and Scaling as Auto Scaling, number of maximum neighbors as 5, number of minimum neighbors as 2 and distance based weighted average as prediction method. [24] (SA-kNN MFA): Simulated annealing is to simulate a physical process called annealing, in which a system is heated to a high temperature and then is gradually lowered to a preset temperature value (e.g. room temperature). During this process, the system samples possible configurations according to Boltzmann distribution. At equilibrium, low energy states wil be mostly populated.

Simulated annealing k-NN QSAR algorithm
k-NN MFA method employs the kNN classification principle combined with the variable selection procedure. For each predefined number of variables (nvar) it seeks to optimize using stochastic sampling and simulated annealing as an optimization tool. The parameter settings used for simulated annealing in the present study are Maximum temperature as 100, minimum temperature as 0.01, cross correlation limit as 0.5, terms in model as n/5 (n is number of compounds in training set), iteration at given temperature as 5, decrease temperature by as 10, seed as 0, perturbation limit as 1, term selection criteria as q 2 , variance cut-off as 0 and Scaling as Auto Scaling, number of maximum neighbors as 5, number of minimum neighbors as 2 and distance based weighted average as prediction method.

Genetic algorithm k-NN QSAR algorithm (GA-kNN MFA):
The genetic function approximation (GFA) algorithm (Rogers and Hopfinger) offers a new approach in building quantitative structureactivity relationship (QSAR) and quantitative structure-property relationship (QSPR) models. Replacing regression analysis with the GFA algorithm allows the construction of models competitive with or superior to those produced by standard techniques and makes available additional information not provided by other techniques [28].   The GFA algorithm used to perform a search over the space of possible QSAR/QSPR models using the fitness score (LOO q 2 /cross terms) to estimate the fitness of each model. Such evolution of a population of randomly constructed models leads to the discovery of highly predictive QSARs/QSPRs. The parameter settings used for genetic algorithm in the present study are Cross correlation limit as 0.5, chromosome length as n/5 (n is number of compounds in training set), cross over probability as 0.95, mutation probability as 0.05, population as 10, number of generations as 1000, convergence criteria as 0.01, seed as 0, term selection criteria as q 2 , variance cut-off as 0 and Scaling as Auto Scaling, number of maximum neighbors as 5, number of minimum neighbors as 2 and distance based weighted average as prediction method. [25]: In cross validation, a compound is eliminated in the training set and its biological activity is predicted on the basis of the k-NN principle, i.e., as the weighted average activity of k most similar molecules (k is set to 1 initially). The similarities are evaluated as Euclidean distances between compounds using only the subset of descriptors that corresponds to the current model. This step is repeated until every compound in the training set has been eliminated and its activity is predicted once.

Cross-validation (q 2 ) using weighted k-nearest neighbor
Cross-validated r 2 (q 2 ) value can be calculated by using following equation, where y i and y * are the actual and predicted activities of the i th compound, respectively, and y mean is the average activity of all compounds in the training set. Both summations are over all compounds in the training set. The obtained q 2 value is indicative of the predictive power of the current k-NN QSAR model in predicting compounds in training set.

∑ ∑
Above procedure is repeated for k=2, 3, 4, etc. Formally, the upper limit of k is the total number of compounds in the data set. The k value that leads to the highest q 2 value is chosen for the current k-NN QSAR model.
(1) Predict biological activity of a compound in the test set on the basis of the k-NN principle, i.e., as the weighted average activity of k (that correspond k value for highest q 2 value) most similar molecules in the training set. The similarities are evaluated as Euclidean distances between compounds using only the subset of descriptors that corresponds to the current model (for highest q 2 value).
(1) Repeat step 1 for every compound in the test set.
(2) Calculate the predicted r 2 (pred_r 2 ) value using following equation, where y i and y * are the actual and predicted activities of the i th compound in test set, respectively, and y mean is the average activity of all compounds in the training set. Both summations are over all compounds in the test set. The obtained pred_r 2 value is indicative of the predictive power of the current k-NN QSAR model for external test set.

Results and Discussion
Different training and test set of isatin (1-H-indole-2,3-dione) derivatives were constructed using sphere exclusion data selection method at different dissimilarity levels (3.0 to 5.5). Training and test set were selected using sphere exclusion method at different dissimilarity level if they follow the Uni column statistics, i.e., maximum of the test is less than maximum of training set and minimum of the test set is greater than of training set, which is prerequisite for further QSAR analysis ( Table 2). This result shows that the test is interpolative i.e., derived from the min-max range of training set. The mean and standard deviation of the training and test set provides insight to the relative difference of mean and point density distribution of the two sets. k-Nearest neighbor molecular field analysis (kNN-MFA) was applied using stepwise (SW), simulated annealing (SA) and genetic algorithm (GA) approace QSAR models. Results of models developed by SW-kNN MFA, SA-kNN MFA and GA-kNN MFA are shown in Table 3. Best three significant models generated on the basis of predicted correlation coefficient are shown in Table 4. Following statistical measure was used to correlate biological activity and molecular descriptors: n=number of molecules, Vn=number of descriptors, k=number of nearest neighbor, df=degree of freedom, r2=squared correlation coefficient, q2=cross validated correlation coefficient (by the leave-one out method), pred_r2=predictive correlation coefficient for external test set, pred_ r2se=coefficient of correlation of predicted data set, Z score=the Z score calculated by q2 in the randomization test, and α=the statistical significance parameter obtained by the randomization test. Data fitness plot for models 1-3 are shown in Figures 2-4. Result of the observed and predicted biological activity for the training and test compounds in the models 1-3 are shown in Table-5. The plot of observed vs. predicted activity of training and test sets for models 1 to 3 is shown in Figures 5-7. From the plot it can be seen that kNN-MFA model is able to predict the activity of training set quite well (all points are close to regression line) as well as external. Sphere exclusion (SE) algorithm allows constructing training sets covering all descriptor space areas occupied by representative points. Between 3.0 to 5.5 dissimilarity levels which comprises of test set size 4 to 10, kNN-MFA methodology with stepwise (SW), simulated annealing (SA) and genetic algorithm (GA) was used for building the QSAR models. Four predictive models were generated with SW-kNN MFA (pred_r2=0.7552 to 0.9376), three predictive models were generated with SA-kNN MFA (pred_r2=0.7019 to 0.9367) and two predictive models were generated with GA-kNN MFA (pred_r2=0.8226 to 0.8497). The kNN-MFA contour plots (Figures 8) provided further understanding of the relationship between structural features of substituted isatin derivatives and their activities which should be applicable to design newer potential CE inhibitors. Further increase in dissimilarity value has produced decrease in model quality.

Interpretation of model 1
It is produced by using stepwise kNN-MFA method having internal predictivity 82.11% (q2=0.8211) and external predictivity 93.76% (pred_r2=0.9376). kNN-MFA plot shown in Figure 8. This model showed that hydrophobic (H_977) and steric (S_619) interactions play important role in determining carboxyl esterase inhibitory activity. Hydrophobic field descriptor (H_977) has positive range (0.3552 to 0.7596) indicates that positive hydrophobic potential is favorable for increase in activity and hence more hydrophobic substituent group     is preferred in that region. Compounds having lesser hydrophobic substituent group are not favorable for biological activity. Steric field descriptor (S_619) has negative range (-0.0599 to -0.0386) indicates that negative steric potential is favorable for increase in activity and hence less bulky substituent group is preferred in that region. Compounds having more bulky substituent group is not favorable for biological activity.

Interpretation of model 2
It is produced by using stepwise kNN-MFA method having internal predictivity 82.11% (q2=0.8211) and external predictiviy 89.0% (pred_ r2=0.89). kNN-MFA plot shown in Figure 4. This model showed that hydrophobic (H_986) and steric (S_620) interactions play important role in determining carboxyl esterase inhibitory activity. Hydrophobic field descriptor (H_986) has positive range (0.2413 to 0.2947) indicates that positive hydrophobic potential is favorable for increase in activity and hence more hydrophobic substituent group is preferred in that region. Compounds having lesser hydrophobic substituent group are not favorable for biological activity. Steric field descriptor (S_620) has negative range (-0.0624 to -0.0277) indicates that negative steric potential is favorable for increase in activity and hence less bulky substituent group is preferred in that region. Compounds having more bulky substituent group is not favorable for biological activity.

Conclusions
In conclusion, the model developed to predict the structural features of Isatins (Indole-2,3-diones) to inhibit carboxylesterases reveals useful information about the structural features requirement for the molecule. In optimized models, Model 1 is giving very significant results. The master grid obtained for the various kNN-MFA models show that negative value in electrostatic field descriptors indicates the negative electronic potential is required to increase activity and more electronegative substituents group is preferred in that position, positive range indicates that the group which imparts positive electrostatic potential is favorable for activity so less electronegative group is preferred in that region. Negative range in steric descriptors indicates that negative steric potential is favorable for activity and less bulky substituents group is preferred in that region. Positive value of steric descriptors reveals that positive steric potential is favorable for increase in activity and more bulky group is preferred in that region. On the basis of the spatial arrangement of the various shapes, electrostatic and steric potential contributions model proposed in this work is useful in describing QSAR of Isatins, Indole-2,3-diones derivatives as carboxylesterase inhibitors and can be employed to design new derivatives with more potent inhibitory activity.