Received date: December 02, 2013; Accepted date: January 27, 2014; Published date: January 30, 2014
Citation: Ma C, Huang SH, Zhou Y (2014) Measuring Inequalities in Gene Coexpression Networks of HIV-1 Infection Using the Lorenz Curve and Gini Coefficient. J Data Mining Genomics Proteomics 5:148. doi: 10.4172/2153-0602.1000148
Copyright: © 2014 Ma C, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
The Gini methodology is a family of mathematical models that describe various relations within or between variables [1,2]. Its basic concept is the Gini coefficient (also known as the Gini index or Gini ratio), which measures the inequality of a distribution (e.g., income) with values ranging from 0 (complete equality) to 1 (complete inequality) and has been widely used in economics to quantify income inequality within a country [3,4]. Because it can analyze data with both normalized and non-normalized distributions, the Gini coefficient and its derived statistical algorithms have been extended to disciplines as diverse as social science, chemistry and engineering. Recently, the Gini methodology has also been introduced to biology for inferring transcriptional regulation relationships from gene expression data, and for exploring the symbiosis and pathogenesis of human immunodeficiency virus type 1 (HIV-1) infection.
HIV-1 is a virus that can cause acquired immunodeficiency syndrome (AIDS), leading to thousands of deaths per year worldwide owing to the lack of effective vaccines and a cure. As a powerful systems biology approach, gene co-expression networks (GCNs) have recently been applied to investigate the molecular mechanisms of HIV-1 infection by organizing genes into a network in which two genes with similar expression patterns are connected by an edge [6-8]. An in-depth statistical analysis of HIV-related network properties will help to discover new biomarkers and signatures of HIV-1 infection. Here we applied the Gini methodology to explore inequalities in GCNs constructed with 943 genes differentially expressed in human lymphatic tissues of uninfected subjects and of infected patients at different stages of HIV-1 infection (the acute, asymptomatic, and AIDS stages). More details about the microarray data generation and normalization, and the selection of differentially expressed genes, can be found in Xu et al. To construct the GCNs, the similarity of expression patterns between two genes was measured with the Pearson correlation coefficient (PCC). Two genes were connected in a GCN if the significance level (p-value) of their PCC was lower than 0.05. The p-values were estimated with a permutation method by shuffling the gene expression data in the microarray dataset.
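The network construction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the text does not specify the shuffling scheme, so the sketch assumes that the sample order of one gene is shuffled per gene pair and that the test is two-sided. The function names (`pearson`, `permutation_pvalue`, `build_gcn`), the number of permutations and the seed are all hypothetical choices.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient (PCC) of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: a constant profile is treated as uncorrelated
    return sxy / math.sqrt(sxx * syy)


def permutation_pvalue(x, y, n_perm=1000, rng=None):
    """Two-sided permutation p-value for the PCC of x and y.

    Assumption: the null distribution is built by shuffling the sample
    order of y; the source does not state the exact shuffling scheme.
    """
    rng = rng or random.Random(0)
    observed = abs(pearson(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson(x, shuffled)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


def build_gcn(expr, alpha=0.05, n_perm=1000, seed=0):
    """Return {(gene_a, gene_b): pcc} for gene pairs whose p-value is below alpha.

    expr maps a gene name to its expression values across samples.
    """
    rng = random.Random(seed)
    genes = sorted(expr)
    edges = {}
    for i, gene_a in enumerate(genes):
        for gene_b in genes[i + 1:]:
            p_value = permutation_pvalue(expr[gene_a], expr[gene_b], n_perm, rng)
            if p_value < alpha:
                edges[(gene_a, gene_b)] = pearson(expr[gene_a], expr[gene_b])
    return edges
```

The signed PCC is stored on each edge so that positive and negative connections can be separated later. For a dataset of 943 genes the pairwise loop is slow in pure Python, and a vectorized implementation would be preferable in practice.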
Although the connectivity distributions give some insight into how genes are connected in each GCN (Figure 1A), they fail to quantify the connectivity of the whole network, which makes it difficult to compare two GCNs constructed for different biological conditions. Here the connectivity inequality (CI) is introduced to summarize the distribution of gene connectivity in the whole network with the Gini coefficient algorithm. The CI can be represented graphically with the Lorenz curve, a two-dimensional plot of the cumulative fraction p of the number of genes in the network versus the cumulative fraction L(p) of the total connectivity from these genes. A Lorenz curve closer to the diagonal line indicates that genes in the network are more equally connected. The Gini coefficient is equal to one minus twice the area under the Lorenz curve, and can be computed with the formula:
Figure 1: The connectivity inequality in the gene co-expression networks (GCNs) of HIV-1 infection.
(A) The connectivity distributions of GCNs for uninfected subjects and patients at different stages of HIV-1 infection.
(B) The Lorenz curves of connectivity distribution in four GCNs. p represents the cumulative fraction of the number of genes in the network; L(p) denotes the cumulative fraction of connectivity from these genes.
(C) Gini coefficient of connectivity in four GCNs.
(D) Gini share of positive and negative connectivity in four GCNs.
(E) Gini correlation of positive and negative connectivity in four GCNs.
where n is the number of genes in the network and X(i) is the ith connectivity value sorted in increasing order, 0 ≤ X(1) ≤ X(2) ≤ … ≤ X(n). We observed that the Lorenz curve of the GCN at the AIDS stage deviates from the diagonal line markedly more than those of the other three GCNs (Figure 1B). Likewise, the Gini coefficient of the GCN at the AIDS stage is much higher than those of the other three GCNs. These results indicate dramatic changes in transcriptional regulation at the last stage of HIV infection.
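As an illustration of the connectivity inequality, the sketch below computes the Lorenz curve and the Gini coefficient from a list of gene connectivities. It assumes the standard sorted-values form of the Gini coefficient, the sum over i of (2i − n − 1)·X(i) divided by n times the total connectivity, which is consistent with the definitions of n and X(i) given here and with "one minus twice the area under the Lorenz curve", but is not necessarily the exact expression used by the authors. The function names are hypothetical.

```python
def lorenz_curve(connectivity):
    """Return the Lorenz curve as a list of (p, L(p)) points, starting at (0, 0).

    p is the cumulative fraction of genes (sorted by increasing connectivity)
    and L(p) is the cumulative fraction of total connectivity they account for.
    """
    values = sorted(connectivity)
    n, total = len(values), sum(values)
    if n == 0 or total <= 0:
        raise ValueError("need at least one gene and a positive total connectivity")
    points = [(0.0, 0.0)]
    running = 0.0
    for i, value in enumerate(values, start=1):
        running += value
        points.append((i / n, running / total))
    return points


def gini_coefficient(connectivity):
    """Gini coefficient from sorted values (assumed standard closed form)."""
    values = sorted(connectivity)
    n, total = len(values), sum(values)
    if n == 0 or total <= 0:
        raise ValueError("need at least one gene and a positive total connectivity")
    weighted = sum((2 * i - n - 1) * value for i, value in enumerate(values, start=1))
    return weighted / (n * total)


def gini_from_lorenz(points):
    """One minus twice the trapezoidal area under the Lorenz curve."""
    area = 0.0
    for (p0, l0), (p1, l1) in zip(points, points[1:]):
        area += (p1 - p0) * (l0 + l1) / 2
    return 1 - 2 * area
```

For a finite network the two routes agree exactly, because the Lorenz curve is piecewise linear and the trapezoidal area is therefore exact. With n genes the largest attainable value under this form is (n − 1)/n, reached when a single gene holds all of the connectivity.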
In a GCN, the connectivity of a gene is composed of positive and negative connectivity, which represent connections to other genes with positive and negative PCC values, respectively. The contribution of positive and negative connectivity to the overall inequality of connectivity in the network is defined based on the decomposition of the Gini coefficient (1):
where CIp and CIn are the inequalities of positive and negative connectivity, respectively. Sp and Sn (Sn = 1 − Sp) are two Gini share measures that represent the percentages of positive and negative connections in the whole network, respectively. τp(Xp, X) and τn(Xn, X) are two Gini correlation coefficients ranging from -1 to 1, indicating the contributions of positive and negative connectivity to the CI, respectively. As shown in Figure 1D, the Gini share of positive connectivity in all four networks is remarkably higher than that of negative connectivity, indicating that positive regulation is the dominant relation in the networks for uninfected subjects and for patients at different stages of HIV-1 infection. Interestingly, positive regulation was enhanced at the first two stages of HIV-1 infection, whereas negative regulation was enhanced at the AIDS stage. From the uninfected state to the AIDS stage, the Gini correlation of negative connectivity changed more markedly than that of positive connectivity (Figure 1E), indicating that positive and negative co-expression associations might play different roles in the pathogenesis of HIV infection.
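The decomposition can be illustrated with the sketch below. It assumes the standard decomposition of the Gini coefficient by source, in which the overall inequality is the sum over sources of share × source inequality × Gini correlation, with each term computed from covariances between values and their rank-based cumulative distribution. Whether the authors used exactly this estimator is an assumption, and the function names and the dictionary layout are hypothetical.

```python
def average_ranks(values):
    """1-based ranks in increasing order; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def covariance(a, b):
    """Population covariance of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n


def decompose_connectivity_inequality(positive, negative):
    """Split the Gini coefficient of total connectivity into two source terms.

    positive[i] and negative[i] are the numbers of positive and negative
    connections of gene i; their sum is the total connectivity of gene i.
    """
    total = [p + q for p, q in zip(positive, negative)]
    n = len(total)
    mean_total = sum(total) / n
    cdf_total = [r / n for r in average_ranks(total)]
    result = {"total_gini": 2 * covariance(total, cdf_total) / mean_total}
    for name, part in (("positive", positive), ("negative", negative)):
        mean_part = sum(part) / n
        cdf_part = [r / n for r in average_ranks(part)]
        own_cov = covariance(part, cdf_part)
        share = mean_part / mean_total                      # Gini share
        gini = 2 * own_cov / mean_part if mean_part else 0.0  # source inequality
        # Gini correlation between the source and the total connectivity.
        correlation = covariance(part, cdf_total) / own_cov if own_cov else 0.0
        result[name] = {
            "share": share,
            "gini": gini,
            "correlation": correlation,
            "contribution": share * gini * correlation,
        }
    return result
```

Under these definitions the two contributions sum exactly to the overall Gini coefficient, and the two shares sum to one, which mirrors the relation Sn = 1 − Sp stated above.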
Besides the connectivity, the edge weights (i.e., correlation values) in GCNs also changed during HIV-1 infection. For a given gene i connected to m genes in two networks, the change in correlation strength can be calculated using the differential co-expression (dC) measure, which is computed from the correlation values between gene i and each gene j in the two networks. In this study, we observed differences in the inequality of edge weights between the GCNs of HIV-1 infection (Figure 2). At the acute and asymptomatic stages of HIV-1 infection, the edge weights are more equal than those in the network for uninfected subjects. However, the edge weights become dramatically unequal in the network for patients at the AIDS stage (Figure 2). On this basis, a novel measure, “delta Gini”, was introduced to capture the differences in the inequality of edge weights between two networks. Although the delta Gini and dC were significantly correlated in most network comparisons (except AIDS vs. Uninfected) (Figure 3), the delta Gini provided additional information about the changes of edge weights between two networks. First, the delta Gini ranges from -1 to 1, with positive values indicating that the inequality of edge weights increased and negative values indicating that it decreased. Second, the delta Gini is valuable for identifying candidate biomarkers of HIV-1 that rank low by dC value. For instance, MRC1, a mannose receptor that interacts with several HIV proteins to promote viral spread [13-15], has a delta Gini value of -0.44 (rank=2) and a dC value of 0.96 (rank=173) when comparing the networks constructed for patients at the AIDS stage and for uninfected subjects. Similarly, PPFIBP1, which plays roles in HIV-1 replication, also has a high rank of delta Gini (value=-0.42; rank=3) but a low rank of dC (value=0.89; rank=283).
MDM4 is another representative example, showing a positive and high-ranked delta Gini (value=0.36; rank=22) but a low-ranked dC (value=0.92; rank=235). This gene was recently demonstrated to be a direct calpain substrate that plays roles in HIV-induced neuronal damage. The detailed values of delta Gini and dC for all comparisons of GCNs of HIV-1 infection are listed in Supplemental Table 1.
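The relation between the two measures can be illustrated with a small sketch. Several details are assumptions rather than the authors' definitions: dC is taken as the root mean squared difference of a gene's correlation values across its m shared neighbours, delta Gini is taken as the Gini coefficient of the absolute edge weights in the second network minus that in the first, and the function names are hypothetical.

```python
import math


def gini(values):
    """Gini coefficient of non-negative values (assumed sorted-values form)."""
    ordered = sorted(values)
    n, total = len(ordered), sum(ordered)
    if n == 0 or total <= 0:
        raise ValueError("need at least one value and a positive total")
    weighted = sum((2 * i - n - 1) * v for i, v in enumerate(ordered, start=1))
    return weighted / (n * total)


def differential_coexpression(weights_a, weights_b):
    """dC of one gene: root mean squared difference of its correlation values.

    weights_a[j] and weights_b[j] are the correlations between the gene and
    its j-th shared neighbour in networks A and B (assumed form of dC).
    """
    m = len(weights_a)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(weights_a, weights_b)) / m)


def delta_gini(weights_a, weights_b):
    """Change in edge-weight inequality of one gene from network A to B.

    Assumption: inequality is measured on absolute correlation values, and a
    positive result means the weights are more unequal in network B.
    """
    gini_a = gini([abs(w) for w in weights_a])
    gini_b = gini([abs(w) for w in weights_b])
    return gini_b - gini_a
```

Under these assumed definitions, a gene whose four edge weights all move from 0.5 to 0.9 and a gene whose weights move from 0.5 to 0.1, 0.1, 0.1 and 0.9 have the same dC (0.4), but delta Gini values of 0 and 0.5, respectively. This is the kind of difference that dC alone does not reveal.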
Table 1: List of the top 10 genes with the largest changes of edge weights between two compared networks.
These results indicate that the Gini algorithm is a complementary approach to dC for comparing the differences between two GCNs.