ISSN: 2379-1764
Advanced Techniques in Biology & Medicine

Data Set Analysis for the Calculation of the QSAR Models Predictive Efficiency Based on Activity Cliffs

Fatima Adilova* and Alisher Ikramov

Institute of Mathematics, National University of Uzbekistan, Uzbekistan

*Corresponding Author:
Fatima Adilova
Institute of Mathematics
National University of Uzbekistan
Tel: 998712629878
Fax: 998712627357
E-mail: [email protected]

Received date: March 28, 2017; Accepted date: March 31, 2017; Published date: April 07, 2017

Citation: Adilova F, Ikramov A (2017) Data Set Analysis for the Calculation of the QSAR Models Predictive Efficiency Based on Activity Cliffs. Adv Tech Biol Med 5:216. doi:10.4172/2379-1764.1000216

Copyright: © 2017 Adilova F, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.



Abstract

The activity cliff concept is highly relevant to medicinal chemistry. Here we explore the concept of “data set modelability”, i.e., an a priori estimate of the feasibility of obtaining externally predictive QSAR models for a data set of bioactive compounds. This concept emerged from analyzing the effect of so-called “activity cliffs” on the overall performance of QSAR models. Several indexes of “modelability” (SALI, ISAC, and MODI) are already known. We extended MODI to data sets of compounds with real-valued activities. The predictive efficiency of QSAR models is expressed as the correct classification rate of the SVM algorithm, which is compared with the results of two other algorithms: MODI and Voronin’s algorithm as modified by the authors. The comparative analysis of the results was performed using the square of Pearson’s correlation coefficient. Our study showed that evaluating the predictive efficiency of a data set on the basis of “activity cliffs” alone is severely inadequate. More accurate methods for assessing whether effective models can be built on a data sample must take into account other properties of the sample, and not only the presence (and number) of “activity cliffs”.


Keywords: Models’ efficiency; QSAR; MODI; Activity cliffs


Introduction

Quantitative structure-activity relationships (QSAR) have been known for many years; however, in many cases QSAR models do not meet expectations. In principle, their efficiency is determined by the nature of the activity landscape, which depends on the representation of the chemical space used to describe the set of compounds under consideration. Activity landscapes have high dimensionality (>3) and depend on:

• The nature of the analysis (e.g. enzyme-based, cell-based, etc.),

• The area(s) of the chemical space from which these compounds are taken,

• The density distribution of compounds in these areas,

• And, most importantly, the molecular representation of the compounds.

A typical N-dimensional landscape consists of an (N-1)-dimensional chemical space, in which each dimension is a coordinate determined by a single molecular descriptor or a combination of descriptors. The N-th coordinate is the measured activity of each compound considered; in three dimensions, activity landscapes resemble natural landscapes, as shown in Figure 1.


Figure 1: Illustration of the activity landscape with activity cliffs (AC). A. Gradual changes in the structure lead to moderate changes in activity; B. Small changes in the structure lead to strong changes in activity. Reproduced from [9].

For many years, QSAR developed under the assumption that similar molecules tend to have similar activities, which makes activity landscapes look like gently rolling hills (Figure 1A). However, this picture is not universal, and in many cases the landscape rather resembles a canyon (Figure 1B). In other words, very similar molecules sometimes have very different activities; such cases are commonly called activity cliffs (AC) [1]. An AC is quantified by the ratio of the difference in activity between two compounds to the "distance" between them in a given chemical space.
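The ratio just described can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function name `cliff_ratio`, the choice of Euclidean distance, and the toy descriptor vectors and activities are all assumptions made for the example.

```python
import math

def cliff_ratio(act_a, act_b, desc_a, desc_b):
    """Ratio of the activity difference to the descriptor-space distance.

    Hypothetical helper: Euclidean distance is an assumed choice of
    "distance"; any other chemical-space metric could be substituted.
    """
    dist = math.dist(desc_a, desc_b)
    if dist == 0.0:
        # Identical descriptors: an infinite cliff if activities differ.
        return math.inf if act_a != act_b else 0.0
    return abs(act_a - act_b) / dist

# Toy pairs (made-up values): the same small structural change,
# followed by a small or a large change in activity.
smooth = cliff_ratio(6.0, 6.2, [1.0, 2.0], [1.1, 2.0])
cliff = cliff_ratio(6.0, 9.0, [1.0, 2.0], [1.1, 2.0])
print(smooth, cliff)
```

A large ratio flags a pair as a candidate activity cliff; where to place the cut-off is a modelling decision that this sketch leaves open.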

In earlier work [2], Adilova and Ikramov investigated an a priori estimate, based on the MODI index, of the prognostic efficiency of QSAR models built on a sample from a specific set of bioactive compounds. It was shown that in some cases MODI only weakly reflects whether a QSAR model can be built successfully on the sample. In this paper, another measure of similarity is investigated and its applicability to modeling is evaluated. It should be noted that molecular similarity is one of the most widely discussed and powerful concepts in chemical informatics [1,3,4]. Many computational methods for estimating similarity already exist and new ones continue to appear [1,3], but the comparison of compounds and their properties, especially activity, remains one of the most important and frequently used techniques in chemical and pharmaceutical studies.

Materials and Methods

To study each approach, we chose 5231 compounds from the ChEMBL database with activity against CA2 (inhibitory activity against human recombinant carbonic anhydrase II). The original data set was divided into samples of 100 and of 50 compounds each. The 19 descriptors given for each compound in ChEMBL were used in the calculations.

We used MODI [5] as the basic algorithm. For each activity class it calculates the fraction of compounds whose nearest neighbor in descriptor space belongs to the same class, and then averages these fractions over the classes. Its goal is to predict the efficiency of QSAR models built with this sample as the training set.

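The MODI index is calculated by the formula:

$$\mathrm{MODI} = \frac{1}{K}\sum_{i=1}^{K} F_i, \qquad F_i = \frac{N_i^{\mathrm{same}}}{N_i^{\mathrm{total}}} \qquad (1)$$

where K is the number of classes in the sample, $N_i^{\mathrm{total}}$ is the number of elements in class i, and $N_i^{\mathrm{same}}$ is the number of elements in class i whose closest neighbor is an element of the same class.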
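The index just defined (the per-class fraction of compounds whose nearest neighbor shares their class, averaged over the K classes) can be sketched as follows. The function name `modi`, the use of Euclidean distance over raw descriptor vectors, and the toy data are assumptions for illustration; the text does not state which metric or descriptor scaling the authors' C/C++ programs used.

```python
import math

def modi(descriptors, labels):
    """Mean over classes of the fraction of compounds whose nearest
    neighbour (Euclidean distance, an assumed metric) has the same class."""
    same = {}
    total = {}
    for i, (x, y) in enumerate(zip(descriptors, labels)):
        # Nearest neighbour of compound i among all other compounds.
        j = min((k for k in range(len(descriptors)) if k != i),
                key=lambda k: math.dist(x, descriptors[k]))
        total[y] = total.get(y, 0) + 1
        same[y] = same.get(y, 0) + (labels[j] == y)
    return sum(same[c] / total[c] for c in total) / len(total)

# Toy data (made up): two separated classes plus one compound of class 1
# that sits inside the class 0 cluster, i.e. a cliff-like neighbour.
X = [[0.0], [0.1], [0.2], [5.0], [5.1], [0.15]]
y = [0, 0, 0, 1, 1, 1]
print(modi(X, y))
```

In this toy sample the single misplaced compound lowers the index for both classes, because it is also the nearest neighbor of two class 0 compounds.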

We also used another measure of similarity, based on Voronin's approach [6,7], which we have modified. Let the object Ai be described by a set of descriptors. If a descriptor is a real number, the similarity of two objects Ai and Aj with respect to this descriptor is calculated by the formula:

[equation image missing]                   (2)

If a descriptor is binary, or takes values for which the notion of “more or less” cannot be introduced, the similarity is calculated by the formula:

[equation image missing]                   (3)

The general measure of the similarity of two objects is calculated by the formula:

[equation image missing]                   (4)

where the nonnegative coefficients δk are computed on the training sample according to the rule:

[equation image missing]                   (5)
We calculated Voronin’s coefficients and MODI indexes on all the samples (of 50 and of 100 compounds each, and additionally of 200). The calculations were performed by purpose-written programs in C/C++. The solution of (5) for each data set was found with Microsoft Office Excel 2013.

Results and Discussion

To validate the QSAR models based on the Voronin similarity measure, it is necessary to establish a threshold above which two compounds are considered close. Since the measure itself does not define such a parameter, several values were chosen: 0.68, 0.75 and 0.95. These thresholds were tested in order to find the value that gives the best efficiency prediction.

Further, we examined all pairs of compounds close to each other and counted the number of such pairs whose activity values differ significantly (by 100 units and by a factor of 100). If only one of the 19 available Voronin coefficients is equal to 1, Voronin's method was considered inapplicable to the sample. It turned out that Voronin's method is applicable to 98% of the samples of 100 compounds and to 100% of the samples of 50 compounds. For MODI, by contrast, there is no condition under which the algorithm is inapplicable.
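The counting step can be sketched as below. The function name `count_cliff_pairs`, its parameters, and the toy data are hypothetical. The stand-in similarity is not the Voronin measure; it is only a placeholder that maps a one-dimensional descriptor distance into (0, 1] so that the example runs. The sketch uses the threshold 0.75 and the 100-fold criterion mentioned in the text.

```python
from itertools import combinations

def count_cliff_pairs(activities, similarity, threshold=0.75, fold=100.0):
    """Count close pairs (similarity > threshold) and, among them, pairs
    whose activities differ by at least `fold` times.

    Activities must be positive; `similarity(i, j)` is any pairwise measure.
    """
    close = cliffs = 0
    for i, j in combinations(range(len(activities)), 2):
        if similarity(i, j) > threshold:
            close += 1
            hi = max(activities[i], activities[j])
            lo = min(activities[i], activities[j])
            if hi / lo >= fold:
                cliffs += 1
    return close, cliffs

# Toy data (made up). The similarity below is a placeholder, NOT the
# Voronin measure described in the paper.
desc = [1.0, 1.1, 1.2, 4.0]
acts = [10.0, 2000.0, 15.0, 12.0]
sim = lambda i, j: 1.0 / (1.0 + abs(desc[i] - desc[j]))
print(count_cliff_pairs(acts, sim))
```

Raising the threshold shrinks the set of pairs regarded as close, which is one reason the number of detected cliffs depends on the chosen value.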

We trained SVM models on all the samples and then ran each model in prediction mode on the other samples, recording the classification results (% of success). The computational experiment yields a total of 900 values, so for brevity we present only the maximum, minimum and average values (Table 1).

Next, we calculated the square of the Pearson [8] correlation coefficient between the SVM classification results and the prediction results of the algorithms studied (Table 2).
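For reference, the square of Pearson's correlation coefficient can be computed as in the sketch below. The function name `pearson_r2` and the input numbers are illustrative assumptions; the values are made up and are not taken from the tables.

```python
def pearson_r2(xs, ys):
    """Square of Pearson's correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)

# Made-up numbers for illustration only: SVM success rates per sample and
# the value some a priori index assigned to the same samples.
svm_success = [40.0, 35.0, 52.0, 47.0]
index_value = [30.0, 33.0, 36.0, 31.0]
print(round(pearson_r2(svm_success, index_value), 3))
```

A value near 1 would mean the index tracks the SVM success rate almost linearly, while a value near 0 means it carries little linear information about it.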

Then we changed the number of compounds per set to 200. We obtained 8 data sets and ran the same experiments. Voronin’s algorithm rejected set #2. The results (in %) are presented in Table 3.

We also calculated Chi-squared coefficients (Table 4). The results showed that Voronin’s algorithm predicts the efficiency of SVM models better than MODI. However, MODI requires less time, whereas calculating Voronin’s coefficients can be challenging. MODI checks only whether the closest compound is in the same class, while our version of Voronin’s measure considers all similar compounds around each molecule. It is a more general approach and leads to a deeper examination of the actual “activity landscapes”.

Algorithm/data set cardinality   Maximum   Average   Minimum
MODI/50                          40        13        3
MODI/100                         27        15        5.8
SVM/50                           54        23        8
SVM/100                          56        26.1      8
Voronin (0.68)/50                100       39        0
Voronin (0.75)/50                100       88        0
Voronin (0.95)/50                100       50        0
Voronin (0.68)/100               100       51        0
Voronin (0.75)/100               100       84        0
Voronin (0.95)/100               100       52        0

Table 1: The maximum, average and minimum values of the success percent obtained on the sample family by different algorithms.

Pearson's correlation coefficient   MODI   Voronin (0.68)   Voronin (0.75)   Voronin (0.95)
Samples of 100 compounds            0.12   0.05             0.12             0.004
Samples of 50 compounds             0.06   0.03             0.01             0.03

Table 2: Pearson's correlation coefficient between the results of the algorithm SVM classification and prognostic algorithms.

Set #   MODI   SVM MIN   SVM MAX   SVM AVER   Voronin (0.68)   Voronin (0.75)   Voronin (0.95)
1       37.1   40.0      52.5      46.0       47.2             47.4             48.0
2       31.8   35.5      61.5      44.1       44.4             44.4             44.4
3       31.8   36.5      64.5      43.3       45.0             45.0             46.1
4       30.3   37.0      60.0      45.5       46.0             46.2             47.6
5       32.6   41.5      56.5      45.3       44.6             44.7             46.9
6       34.1   38.5      57.5      44.1       47.2             47.0             49.3
7       35.6   37.5      62.0      44.4       45.3             45.4             46.6
8       32.8   36.5      66.0      45.1       45.0             45.1             45.6

Table 3: Minimum, maximum, and average values of SVM models’ success compared to MODI and Voronin’s algorithm predictions for data sets of 200 compounds each.

SVM    MODI   Voronin (0.68)   Voronin (0.75)   Voronin (0.95)
MIN    0.21   0.10             0.11             0.33
MAX    0.27   0.39             0.41             0.41
AVER   0.08   0.11             0.16             0.09

Table 4: Pearson's correlation coefficient between the results of the algorithm SVM classification and prognostic algorithms for data sets of 200 compounds each.

SVM MODI Voronin
95.8% 29.6% 77.3%

Table 5: Comparing results of SVM and prognostic algorithms based on 500 compounds training sample.

Our first study used Voronin’s measure as originally defined (without the logarithms in (2)), but it failed to distinguish values such as 100 and 2000 because the denominator in (2) was large. Using logarithms gave better results.
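The effect of the logarithm can be seen in a small numerical sketch. Formula (2) is not reproduced in this text, so the code assumes a generic range-normalised similarity of the form 1 - |a - b| / (max - min). That form, the function names, and the activity range of 0.1 to 100000 units are assumptions made only for illustration and may differ from the authors' definition.

```python
import math

def range_similarity(a, b, lo, hi):
    """Assumed form: 1 - |a - b| / (hi - lo). Not necessarily formula (2)."""
    return 1.0 - abs(a - b) / (hi - lo)

def log_range_similarity(a, b, lo, hi):
    """The same assumed form applied to log10-transformed positive values."""
    return range_similarity(math.log10(a), math.log10(b),
                            math.log10(lo), math.log10(hi))

# Hypothetical activity range within one sample.
lo, hi = 0.1, 100000.0
plain = range_similarity(100.0, 2000.0, lo, hi)
logged = log_range_similarity(100.0, 2000.0, lo, hi)
print(plain, logged)
```

Under these assumptions the untransformed measure rates 100 and 2000 as nearly identical, because their difference is tiny compared with the full range, while the logarithmic version separates them clearly.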

In addition, 500 compounds were randomly selected from the set of 5231 compounds as the training sample, and all other compounds were used as the test sample. We obtained the following results (Table 5).


Conclusion

Thus, both algorithms for a priori estimation of the prognostic efficiency of a sample showed extremely low effectiveness, and their applicability to the evaluation of samples is limited. Other methods need to be developed that more accurately assess the capabilities of classification algorithms on these samples. At the same time, MODI works efficiently on small samples, while the modified Voronin method works better on large samples. These features should be taken into account when developing new methods. The results were presented as an abstract in [10].


Acknowledgement

Funding for this work was provided by the Uzbekistan Committee of Science & Technology under Grant number A-5-1: "Development of computer applications for computer drug design", 2015-2017.

