An Exploratory Analysis of Conservation of Co-Expressed Genes across Alzheimer ’ s disease Progression

Alzheimer’s disease (AD) is a complex disease where the analysis of gene expression patterns relies on computational techniques to understand the cause and progression of the disease. Evidence postulates that the complexity of AD stems from the overlap of early-stage markers with normal aging. Furthermore, there is increasing evidence suggesting that gene co-regulation in AD plays a vital role in the progression of the disease. The aim of this work is to identify and track co-regulated genes from incipient to severe cases of AD i.e. samples that exhibited progression of AD. We hypothesize that co-expressed genes associated with two markers of AD (the cognitive marker-mini mental state examination (MMSE) and the pathological marker Neurofibrillary Tangles (NFT)), are conserved across AD progression. For our analysis we used the Blalock dataset and the prominent tool weighted correlation network analysis (WGCNA). Through our analysis we observed that genes GNA11 and MAP2K2 were consistently ranked through the progression of AD. The functional analysis of the identified co-regulated genes at the incipient stages of AD includes RNA and cofactor binding. Through this exploratory study we conclude that from incipient to severe stages of AD the gamut of co-regulated genes vary rather than being conserved across disease severity.


Introduction
Alzheimer's disease (AD) is a complex progressive neurodegenerative disorder of the brain. The complexity of this disease has been attributed to several pathogenic factors that initiate the processes associated with AD. These pathogenic factors, are still elusive, are responsible for processes such as abnormal β amyloid (Aβ) production [1], tau pathology and the progressive formation of neurofibrillary tangles [2], inflammation, synaptic loss, oxidative damage, protein processing or misfolding [3], calcium dyshomeostasis [4], aberrant reentry of neurons into cell cycle, cholesterol synthesis and effects of hormones, or growth factors. This illusive nature of AD has been attributed to three factors, namely (a) the complexity of the disease itself, (b) the variability of gene expression resulting from the cellular heterogeneity of brain tissues, and (c) from interindividual genetic variability [5]. Blalock et al. [6] in conjunction with microarray analysis investigated the role of genes in discriminating between subjects that had mild cognitive impairment, incipient AD, and those with normal aging [6]. They successfully linked patterns of gene expression to a cognitive marker, mini-mental status examination (MMSE) and a pathological marker neurofibrillary tangle (NFT) of AD, independent of AD diagnosis. Furthermore, they reported that these markers are negatively correlated and tested this correlation with the gene expression patterns of 31 separate microarray samples [6]. They identified genes with potentially important implications for the early pathogenesis of AD specifically in the incipient stage of the disease. This observation of gene expression correlating to the markers was further reinforced by Dunckley et al. [5].
In recent times, there is a growing interest in using co-expression analysis to identify subsets of co-expressed genes that act as molecular signatures for genetic disease. Gene co-expression as defined by Lee et al. [7] is an approach in which genes that have similar expression patterns across a set of samples are hypothesized to either have a or belong to a common functional process [8,9]. As shown by Ray and Zhang [10], used gene co-expression analysis to identify the differences among regions of the brain to shed light onto the progression of AD. Prominent computational techniques used for the identification of co-expressed gene subsets include bi-clustering [11]. Bi-clustering techniques are aimed at finding co-expressed genes across a set of microarray samples. However, bi-clustering techniques rely solely on gene expression patterns across samples to establish function similarity. More recently, techniques of co-expression gene networks have become a popular technique used to identify functional relationships between co-expressed genes. Based on the principles of graph theory, these coexpression gene network techniques uses both gene expression patterns and information from prominent markers to identify closely connected groups of genes that represent co-expressed genes. According to Ray et al. [12], gene co-expression network analysis was successfully used to establish the relation between functional clusters of genes and cisregulatory elements to identify differentially expressed genes used to elucidate the biological processes involved in AD.
Based on the observations derived from Blalock et al. [6], the objective of this study is to identify the co-expressed genes in the incipient AD hippocampus samples using the cognitive marker MMSE and the pathological marker NFT. We hypothesize that co-expressed genes from the incipient AD samples are conserved across different degrees of AD severity and this conservation is best captured by taking into consideration the negative correlation between markers of MMSE and NFT. We therefore aim to create a functional map of these incipient co-expressed genes and track their evolution through the progression of the disease.
We focus our discussion on the schematic representation of the proposed methodology and the steps entailed. In the following sections we describe the results obtained and the function interpretation of the gene clusters identified using the gene ontology enrichment analysis software toolkit (GOEAST) 1 \ [13] DAVID-EASE 2 [14], and the tool for pathway analysis known as Pathway Express 3 [15]

Materials and Methods
The proposed methodology was developed based on the data used by Blalock et al. [6]. The data set consists of expression profiles of the brains hippocampi. Obtained from 22 postmortem subjects with Alzheimer's disease (AD) at various stages of severity, the dataset consists of seven incipient AD samples, eight moderate AD samples, and seven severe AD samples respectively. The dataset also contains nine control samples, resulting in a dataset consisting of 31 samples. Each sample is scored by both (a) cognitive marker MMSE and (b) the pathological marker NFT.
Related studies on AD first select a set of differentially expressed genes on which further analysis is performed [10,16,17]. However, comparing lists of genes from various AD studies has been shown to be computationally inefficient as the resultant computational techniques developed become gene specific. Keeping this in mind we propose a framework that is aimed at providing a comprehensive view of the coexpression patterns between genes using the technique of weighted co-expression network analysis (WGCNA) [8,17]. Our objective is to, organizing co-expressed genes into gene clusters (aka modules) that are believed to be of functional significance. The functional significance of genes in each module is determined using both (a) similarity between gene expression profiles and (b) maximum average negative correlation between the cognitive trait MMSE and pathological trait NFT respectively. The proposed framework therefore consists of twostages (as shown in Figure 6) that encapsulates (a) the identification and ranking gene modules/clusters based on significance, and (b) the identification of conserved co-expressed genes through the disease progression.
Lastly, it will be further interesting to use these types of algorithms in the analysis of larger sample sets of incipient, moderate, and severe AD to further understand what genetic networks are selectively targeted in the early, moderate and late stages of the AD process, and what novel therapeutic strategies may be then devised for the clinical management of this expanding health care concern.
Identification and ranking of gene clusters / modules As shown in Figure 6, before describing the process of identification and ranking of gene co-expression gene modules/clusters, the gene expression data is subjected to data preprocessing strategy employed. Data preprocessing: As described, the dataset consisted of 31 samples. Each sample consisted of 22,283 genes. These genes were subjected gene filtering, where genes with p-value<0.5 were considered to be significant. We further pruned the number of genes to find those genes that were common across all thirty one samples in the dataset. The resultant gene set consisted of 4,483 genes. The resultant 1 http://omicslab.genetics.ac.cn/GOEAST/index.php 2 http://david.abcc.ncifcrf.gov/ 3 http://vortex.cs.wayne.edu/projects.htm#Pathway-Express gene set was standardized using the z-score normalization, where the gene expression values were normalized to have a common mean and standard deviation.

WGCNA Analysis:
The proposed framework consists of a one versus one (pair-wise) comparison of sample classes, categorized by the degree of Alzheimer's disease severity. In the case of control versus incipient AD (CI) we considered the normalized expression data of nine control samples along with the seven incipient samples for analysis.

Identification of co-expressed genes
To effectively identify clusters of co-expressed genes for each of the combination of classes, we used the method weighted gene coexpression network analysis (WGCNA) [18,19]. WGCNA treats each gene (i.e., gene expression profile across a set of samples) as node in a network. The i th gene expression profile x i is a vector whose components report the gene expression values across m samples. We used the co-expression similarity s ij as the absolute value of the correlation coefficient between their expression profiles between two genes i and j.
As prescribed in WGCNA, we used the concept of soft threshold to equate the co-expression similarity s ij into a measure of connection strength between genes i and j, respectively. This transformation was brought about by considering the power of the absolute value of the correlation coefficient represented as The choice of the threshold b controls the sensitivity and specificity of the pairwise connection strength independent of the size of the network and the significance of correlation between expression profiles of a pair of genes. Co-expression networks aim at detecting subsets of genes (gene clusters) that are tightly connected to each other. We used topological overlap (TO) as measure of dissimilarity to identify biologically significant cluster of genes [20]. The TO (ω i ) of gene i with other genes in a network is represented as Thus, a gene i with a high ω i has a high overlap with many other genes in the network. The topological overlap of two nodes reflects their relative interconnectedness. The topological overlap matrix (TOM) Ω = [ω i ] provides a similarity measure between multiple genes in the gene co-expression network represented as where ij iu uj u l a a = ∑ , and i iu u k a = ∑ is the node connectivity. As ω ij is a measure of similarity it was subtracted from one, i.e. to convert it to a measure of dissimilarity In this work we used the TOM-based dissimilarity ij ω Ψ to cluster gene expression profiles for module identification. It is believed that TOM-based dissimilarity ij ω Ψ leads to more distinct cluster of genes.

Identification of gene clusters/modules
We define a cluster of genes (module) whose expression profiles are highly correlated across the samples and with high topological overlap. We used the average linkage hierarchical clustering coupled with the TOM-based dissimilarity ij coherent expression profiles. This clustering was implemented using the dynamic tree cut library of the WGCNA package [21] to identify coherent clusters of genes.

Cluster ranking and selection
Once the clusters of co-expressed genes were identified, our immediate objective was to choose those clusters that are biologically significant. In this paper we define the significance (GS) of a cluster of genes based on the averaged values of the markers MMSE and NFT scores across all genes in the cluster [6].
In determining the GS of a cluster of genes, we assign the nonnegative marker score to each gene (assigned by correlation between genes to sample) and averaged across all the genes in that cluster. In this work we deal two traits -MMSE and NFT. We therefore carry out the ranking of clusters independently for each trait. Based on our hypothesis, we believe that clusters of genes that exhibit the higher average deviation of the GS between traits MMSE and NFT relate to higher functional significance. The average deviation δ for a cluster of genes M between the GS scores of MMSE and NFT is captured using the following relation.

Functional significance and conservation of genes
As previously stated, the above approach is carried out on three (one versus one -pair wise) combinations that include severe AD versus incipient AD (SI), control versus incipient AD (CI), and moderate AD versus incipient AD (MI). For each of the above combinations we extract the top four ranked clusters of genes based on average deviation of GS (based on equation 6).
In order to determine the molecular function of a set of genes commonly expressed across AD severity, we take into consideration the pair wise combinations CI, MI, and SI, to identify subsets of genes that are commonly expressed across AD severity ( Figure 7). Since these combinations share a common sample set (incipient AD), we believe that we should observe a set of commonly expressed genes.

Functional annotation of gene clusters
Database for Annotation, Visualization and Integrated Discovery (DAVID) is a web-based tool that provides integrated solutions for the annotation and analysis of genome-scale datasets derived from highthroughput technologies such as microarrays. Analyses of results are dynamically linked to primary data and external data repositories, to provide an in-depth as well as broad-based data coverage. DAVID EASE is useful for summarizing the predominant biological "theme" of a given gene list. Given a list of genes resulting from our analysis is submitted to DAVID and it rapidly calculates over-representation statistics for every possible Gene Ontology term with respect to all genes represented in the data set.
We used DAVID version 6.7 [14] to identify annotation terms significantly enriched in each reference gene set. We used the modified Fisher's exact test, or EASE score, to identify enriched annotation terms derived from GNF-U133A-QUARTILE and gene ontology (GO) annotation terms, which includes biological process (BP), molecular function (MF), and cellular component (CC) categories. We used the more specific GO term categories provided by DAVID, called GO FAT, to minimize the redundancy of the more general GO terms in the analysis to increase the specificity of the terms.
A list of gene symbols was generated for each dataset and used as input into DAVID. We used the Functional Annotation Tool, with the Human Genome U133A plus 2.0 Array as the gene background, to independently analyze each gene set. We used a count threshold of 5 and the default value of 0.1 for the EASE score settings. We also used the Benjamin corrected p-value, with p<0.05 as the significance threshold. Significant annotation terms identified in the GNF annotation category were further filtered using the inter quartile range of the category size, where the 1 st and 3 rd quartile were removed from the results. Significant annotation terms in the remaining GO annotation categories were filtered by removing those terms with a category size less than 100 and greater than 1000.
Apart for using the tool DAVID EASE for gene enrichment, we used the gene enrichment tool Gene Ontology Enrichment Analysis Software Toolkit (GOEAST) [13]. This tool is predominantly used to identify statistically over-represented GO terms within sets of genes. In our analysis, we subject the probe-ID's of different genes in the gene clusters to generate enriched GO terms in graphical format according to their relationships in the hierarchical tree of each GO category (Biological process, Cellular component, and Molecular function), for a visual representation of relationships between genes.
Similarly to identify the pathways associated with genes of interest, we used the tool Pathway express [15] that is part of the popular tool Onto-express [22].

Co-expressed genes in Incipient Alzheimer's disease samples
To extract a candidate set of genes that are co-expressed in the incipient samples, we use the R package weighted gene co-expression network analysis (WGCNA) [8]. In the experimental setup we consider the seven incipient AD samples from the Blalock dataset. To retain an effective baseline for gene expression we take into consideration the nine control samples throughout this analysis. These samples were preprocessed and normalized resulting in a set of 4,483 genes. We then apply the procedure of WGCNA to identify those genes that are coexpressed in the incipient samples.
As part of WGCNA we used soft threshold (β) using Pearsons correlation. As shown in Figure 1 the value of the soft threshold β is determined be of power equal to eight, where we observe stability in both the scale free topology and the mean connectivity respectively.

Clusters of genes
With the soft threshold determined and fixed at eight, in the next step we work towards the identification of clusters of genes. Using the cognitive marker MMSE and the pathological marker NFT independently, we cluster genes based on their topological overlap (TO) using WGCNA. Figure 2 provides the hierarchical clustering of genes based on their TO profiles. We used the automatic module detection using dynamic tree cutting (DTC) to identify clusters of genes.
With the identification of independent clusters, we rank the clusters on the average deviation of MMSE and NFT scores. Based on the observation by Blalock et al. that MMSE and NFT are negatively correlated, we assign the cluster of genes that has the highest average deviation of MMSE and NFT the highest rank (M 1 ). Table 1 shows the ranking of gene clusters in the control versus incipient samples. genes of the highest ranked gene cluster CI-Purple (M 1 ) is shown in Figure 3. The genes of this cluster (Table 2) exhibit maximum average deviation considering the cognitive trait MMSE and the pathological trait NFT. There were 101 genes selected in gene cluster CI-Purple (M 1 ), Figure 3(a) shows the 101 genes expressed in the control samples. Similarly Figure 3(b) shows the expression profile of the same 101 genes in the incipient AD samples. We observe there is a consistency in the expression profiles of the control and incipient genes exhibiting co-expression across sample types.

Functional profile of selected genes
To ascertain common biological functions associated with the 101 genes of cluster CI-Purple (M 1 ), we adopted an integrated bioinformatics approach based on structured biological knowledge provided by the Gene Ontology (GO) consortium [23]. We applied the GO enrichment analysis software toolkit (GOEAST) [13] to the 101 genes. Briefly, all three branches of GO knowledge structure (biological process (BP), molecular function (MF), and cellular component (CC)) were utilized for this analysis. To test for GO category enrichment, we used the following filtering criteria: (a) p-value cutoff of ≤ 0.001, (b) limited the GO annotation category size of × ≤ 1000 to minimize artificial elevation of p-value, and (c) gene count of >4 in significant categories. Applying these filters, a total of nine enriched GO categories were identified for the 101 genes in the cluster CI-Purple (M 1 ): five BP categories, 10 CC categories, and nine MF categories (Table 3). For additional support, we also performed GO enrichment analysis      Table 4 for the different categories derived using DAVID analysis.
We further characterize the functionality of the 101 genes in the CI-Purple (M 1 ) cluster through pathway analysis using the tool Pathway Express [5]. Using the KEGG database, there were 45 pathways that correspond to the genes in CI-Purple (M 1 ) ( Table 5). The significance of each molecular pathway with regards to the 101 genes in CI-Purple (M 1 ) is based on the impact factor. We filtered out those molecular pathways that were below impact factor <1.75. The resultant were 17 molecular pathways, of which the top five were the Chronic myeloid leukemia (3 genes, impact factor 5.424), Ribosome (3 genes, impact factor 5.243), Notch signaling pathway (2 genes, impact factor 4.008), Vibrio cholera infection (2 genes, impact factor 3.552), and Huntington's disease (3 genes, impact factor 3.228). Of the 17 significant molecular pathways we also observe the Parkinson's disease pathway and the Alzheimer's disease pathways.
To visualize the relationship among enriched GO categories, we generated directed acyclic graphs based on GO knowledge base using the tool GOEAST. We focus our attention to the MF graph (Figure 4), enriched terminal nodes relate to RNA binding and nucleotide binding functions of the GO.
We next perform our analysis of conservation of different functional categories through the progress of AD. Here we repeat our analysis of identifying co-expressed genes using samples of moderate AD with incipient AD to extract clusters of genes. Here we observe a cluster of 177 genes MI-Black (M 1 ) possessed the highest average deviation of 0.226 between the cognitive trait MMSE and the pathological trait NFT. Table 6 provides a list of all clusters of genes in this experiment while considering both moderate AD samples along with incipient AD samples. Gene clusters were ranked based on average deviation of gene significance with respect to both traits. In the similar fashion, we carry out our analysis of identification of co-expressed genes clusters considering samples of severe AD along with samples of incipient AD. Table 7 provides a list of gene clusters ranked based on average deviation of gene significance with respect to both MMSE and NFT traits. The highest ranked gene cluster while considering severe AD and incipient AD samples is the SI-Black (M 1 ) containing 171 genes.
In order to identify the conservation of molecular function across disease severity, we find the common genes expressed across gene   clusters (CI-Purple (M 1 ), MI-Black (M 1 ), and SI-Black (M 1 )) that have highest ranks.
To find the common functions retained between incipient AD and moderate AD samples, we consider those 57 genes that are common between gene clusters CI-Purple (M 1 ) and MI-Black (M 1 ). We subject these genes to the gene enrichment analysis using the tool DAVID EASE (Table 8). We observe two BP categories, nine CC categories, and three MF categories. The MF retained was the RNA binding (MF: 8 genes, p-value=5.4 × 10 -3 ) and nucleotide binding (MF: 13 genes, p-value=3.1×10 -2 ), when compared to the MF found in Table 4 of the  CI-Purple (M 1 ). We then subject these common genes to identify the pathways associated with the common 57 genes using the tool pathway express. The resultant 41 pathways associated with these genes and their corresponding impact scores can be found in Table 9.
To identify those incipient co-expressed genes that are conserved through the severe samples; we extend our analysis to find the genes common to all three gene clusters CI-Purple (M 1 ), MI-Black (M 1 ), and SI-Black (M 1 ).   the 40 common genes using the gene enrichment analysis using the tool DAVID EASE. Once again we observe that the MF RNA binding (MF: 11 genes, p-value=2.0×10 -2 ) and nucleotide binding (MF: 6 genes, p-value=2.0×10 -2 ), were retained across through AD samples from incipient, moderate, and severe. We then subject these 40 genes to the tool pathway express to identify the associated pathways with the genes. Table 11 contains a list of the 38 pathways associated with genes.

Discussion
The overarching question being addressed in this work is what are (if any) the gene sets in incipient AD samples that are co-expressed. Our objective is to identify these co-expressed genes sets as they could be potential disease markers that could be traced through their differential expression. We believe that these markers could to some extent be responsible for the phenotypic differences through the progression of AD.
The identification of AD markers in its incipient stages is a challenge considering the complexity of the disease and its close association with the normal aging process. In this work we have used gene co-expression network analysis (WGCNA) which is used to identify networks of co-expressed genes. We believed that functionally co-expressed genes can be modeled as a complex network. Subnetworks (clusters) within this gene network are believed to have a high level of connectivity determined by correlated expression profiles. We therefore subjected these gene clusters to gene ontology enrichment studies to list prominent functional associated with the genes.
Most of the genes in the incipient stage of AD are associated with RNA binding, nucleotide binding, RNA metabolism and processing, and cofactor binding. Nucleotide and RNA binding were the most prominent molecular functions associated with the co-expressed genes in the incipient stage.
Interestingly, much recent independent work has underscored the important role of up-regulated small non-coding RNAs (sncRNAs) called micro RNAs (miRNAs) in contributing to altered RNA signal processing, resulting in an altered transcriptome in AD brain [24,25].
The pathways associated with the co-expressed genes in the incipient stage of AD, are the chronic myeloid leukemia, ribosome, Notch signaling pathway, and the Vibrio cholera infection pathways. These four pathways were ranked the highest based on their impact scores.   Genes that are conserved through the progression of AD from incipient to moderate, include ACTB gene, the MAP2K2, NDUFV1, and GNA11 respectively. The pathways associated with gene ACTB include, regulation of atin cytoskeleton, shigellois, pathogenic Escherichia coli infection, Vibrio cholera infection, cardiac muscle contraction, aderens junction, leukocyte trans-endothelial migration, tight junction, and focal adhesion.
The gene NDUFV1, is associated with the following pathways: Alzheimer's disease, Huntington's disease, and Parkinson's disease.
The gene MAP2K2 is associated with a range of pathways that include, long term depression, gap junction, GnRH signaling pathway, insulin signaling pathway, thyroid cancer, regulation of actin cytoskeleton, bladder cancer, endometrial cancer, non-small cell lung cancer, acute myeloid cancer, glioma, renal cell carcinoma, melanoma, long-term potential, VEGF signaling pathway, chronic myeloid leukemia, Fc epsilon RI signaling pathway, ErbB signaling pathway, prostate cancer, toll-like receptor signaling pathway, melanogenesis, t-cell receptor signaling pathway, natural killer cell mediated cytotoxicity, MAPK signaling pathway, and pathways in cancer.
The most important is the gene GNA11 and its associated pathways relate to long term depression, gap junction, the calcium signaling pathway, and the GnRH signaling pathway ( Figure 5).
Similarly, genes that are functionally significant and at the same time conserved throughout the progression of AD from incipient through severe include: RPL12, RPL13, FASN, GAPDH, CDC34, DCTN1, GNA11, ACTB, and MAP2K2. The corresponding pathways that gene RPL12 and RPL13 belong to Ribosome. Similarly, gene FASN belongs to the Insulin signaling pathway, gene GAPDH belongs to the Alzheimer's disease pathway, CDC34 belongs to the Ubiquitin mediated proteolysis pathway, and gene DCTN1 belongs to the Huntington's disease pathway. The gene GNA11 belongs to four pathways that 2 Table 11: KEGG pathway analysis of 40 common genes between gene clusters CI-Purple (M 1 ), MI-Black (M 1 ), and SI-Black (M 1 ) using the tool Pathway Express.   include long-term depression, gap junction, GnRH signaling pathway, and calcium signaling pathway.
The gene MAP2K2 is associated with 25 known pathways that include: long-term depression, gap junction, GnRH signaling pathway, insulin signaling pathway, regulation of actin cytoskeleton, thyroid cancer, bladder cancer, endometrial cancer, non-small cell lung cancer, acute myeloid leukemia, glioma, renal cell carcinoma, melanoma, long term potentiation, VEGF signaling pathway, chronic myeloid leukemia, Fc epsilon RI signaling pathway, ErbB signaling pathway, prostate cancer, toll-like receptor signaling pathway, melanogenesis, t-cell receptor signaling pathway, natural killer cell mediated cytotoxicity, MAPK signaling pathway, and pathways in cancer.
Of all the associated genes and pathways, the genes GNA11 and MAP2K2 are prominently ranked in an ascending order as the disease progresses and are closely related to Alzheimer's disease.