ISSN: 2157-7420
Journal of Health & Medical Informatics

Integrating Biological Heuristics and Gene Expression Data for Gene Regulatory Network Inference

Armita Zarnegar1, Andrew Stranieri1*, Peter Vamplew1 and Herbert F Jelinek2,3

1Centre for Informatics and Applied Optimization, Federation University, Australia

2Australian School of Advanced Medicine, Macquarie University, Sydney, Australia

3Centre for Research in Complex Systems and School of Community Health, Charles Sturt University, Albury, Australia

*Corresponding Author:
Andrew Stranieri
School of Information Technology and Engineering
Centre for Informatics and Applied Optimisation
Federation University, Australia
Tel: 61(0)3.53279283
Fax: 61(0)411147195
E-mail: [email protected]

Received Date: April 10, 2017; Accepted Date: April 17, 2017; Published Date: April 20, 2017

Citation: Zarnegar A, Stranieri A, Vamplew P, Jelinek HF (2017) Integrating Biological Heuristics and Gene Expression Data for Gene Regulatory Network Inference. J Health Med Informat 8:258. doi: 10.4172/2157-7420.1000258

Copyright: © 2017 Zarnegar A, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.


Abstract

Background: Gene Regulatory Networks (GRNs) offer enhanced insight into the biological functions and biochemical pathways of cells associated with gene regulatory mechanisms. However, obtaining accurate GRNs that explain gene expressions and functional associations remains a difficult task. Only a few studies have incorporated heuristics into a GRN discovery process. Doing so has the potential to improve accuracy and reduce the search space and computational time. A technique for GRN discovery that integrates heuristic information into the discovery process is advanced. The approach incorporates three elements: 1) A novel 2D visualized co-expression function that measures the association between genes; 2) A post-processing step that improves detection of up, down and self-regulation; and 3) The application of heuristics to generate a Hub network as the backbone of the GRN.

Methods: Using available microarray and next generation sequencing data from Escherichia coli, six synthetic benchmark GRN datasets were generated using the neighbourhood addition and cluster addition methods available in SynTReN. Results of the novel 2D-visualization co-expression function were compared with results obtained using Pearson’s correlation and mutual information. The performance of the biological genetics-based heuristics consisting of the 2D-Visualized Co-expression function, post-processing and Hub network was then evaluated by comparing the performance to the GRNs obtained by ARACNe and CLR.

Results: The 2D-Visualized Co-expression function significantly improved gene-gene association matching compared to Pearson’s correlation coefficient (t=3.46, df=5, p=0.02) and Mutual Information (t=4.42, df=5, p=0.007). The heuristics model gave a 60% improvement against ARACNe (p=0.02) and CLR (p=0.019).

Conclusion: Analysis of E. coli data suggests that the GRN discovery technique proposed is capable of identifying significant transcriptional regulatory interactions and the corresponding regulatory networks. Evaluation studies on different benchmarks demonstrated a substantial performance improvement over state-of-the-art systems.

Keywords

Gene expression; Gene regulatory network; Hubs; Association function; Correlation function

Introduction

Gene Regulatory Networks (GRNs) are determined from experimental data indicating that a regulatory gene product either self-regulates or controls other gene products. GRNs are graph-based networks where vertices indicate genetic components such as transcription factors and the edges are the associated regulatory mechanisms [1]. Analysis of GRNs provides insight into how genes cooperate, which can potentially lead to new treatments for diseases by identifying important genes and gene associations. Early endeavours in GRN discovery extracted gene expression data from microarray experiments to derive interactions between genes [2]. However, microarray technology was found to be highly variable and noisy, and expression levels were only measured indirectly with some error [3]. Next Generation Sequencing (NGS) to extract gene expression data [4] was found to be more accurate and reliable; however, it is more expensive and complicated when processing large numbers of samples [5].

Recent studies for GRN discovery have practical limitations in that most use computational and statistical techniques alone and ignore knowledge about previously discovered GRNs [6-8]. The application of domain knowledge in the form of heuristics to reduce the search space and guide an algorithm toward a solution has been useful in complex questions outside bioinformatics [9]. For constructing GRNs, Mootha et al. [10] suggested that domain knowledge can assist computational methods and improve the discovery process. Domain knowledge in the form of Gene Ontology (GO) was used in microarray data analysis and GRN discovery by Segal et al. [11], who utilized gene information to partition the search space and then used Bayesian Networks to find dependencies between genes.

Representations of domain knowledge to inform the GRN discovery process have also been advanced by Lo et al. [12], Zhu et al. [13], and Yang et al. [14] who integrated multiple sources of information, such as position affinity and transcriptional modules, to guide the GRN discovery process. Heuristics that may assist with the GRN discovery process are categorized into two classes: i) Heuristics related to the nature of gene interactions; and ii) Heuristics related to the structure of the gene network. To our knowledge, these two heuristic categories have not been studied in the context of GRN discovery.

Heuristics Related to the Nature of Gene Interactions

Typically, methods for GRN inference employ functions such as Pearson’s Correlation Coefficient (PCC) or Mutual Information (MI) to measure the extent to which a gene’s expression is associated with another gene’s expression. However, PCC and MI measures are not entirely consistent with knowledge related to the nature of gene interactions including activation and inhibition and more complex interactions associated with microRNA regulation of gene transcription [15].

Sahoo et al. [16] proposed a measure of association that detected pairwise associations without using PCC or MI measures. The pairwise association measure discretized expression levels into categories of “high” and “low” and then generated implication relationships to find patterns of these categories. This approach identified indirect relationships between genes and led to the discovery of missing links in some pathways.

The same approach was used in a study by Fioravanti et al. [17] for detecting motifs consisting of self-regulatory and feed-forward loops to construct GRNs. These motifs made the network stable against minor perturbations. In another similar study, Wang et al. [18] defined two types of relationships (similar and prerequisite) and found pairwise relationships between genes that minimized classification errors.

In this study, gene-gene interactions were represented as a two-dimensional grid of expression levels. This replaced PCC and MI, which reduce the interaction between two genes to a single number. A heuristic operationalization of up, down and self-regulation was then devised based on the two-dimensional grid.

Heuristics Related to the Structure of the Network

GRNs tend to be graphs with scale-free properties where most genes are connected to a small number of genes referred to as Hubs. From Kepes [19] and Martinez-Antonio [20] it is known that the distribution of the pairwise correlation coefficients of genes follows a power law in that, while the majority of gene pairs have only a few links, a few pairs display a significant number of edges connected to them. Hub genes are global transcription factors and essential nodes required for regulation of gene expression [4,21]. Hub genes are highly connected with a high degree of output, with typically 10-15 instances of co-regulation with other Hubs [20]. A heuristic that specifies that Hubs should first be identified is promising as Hubs regulate transcription of many other genes.

The current work is premised on the assumption that the GRN discovery process can be improved by using heuristics.

Methods

We have made use of some heuristics from domain knowledge of gene interactions and structure of the GRN to enhance the effectiveness of GRN discovery. The framework that we adopted includes three main components as illustrated in Figure 1 and described in detail below: 1) A 2D Visualized Co-regulation function, which is an association function for measuring pair-wise dependencies between genes; 2) A post-processing component that processes and refines the pair-wise dependencies between genes and reduces the number of false positives; and 3) A hub network that is designed to construct the backbone of the GRN.


Figure 1: System architecture.

2D Visualized co-regulation functions

Existing measures of association to measure gene interactions such as the Pearson’s Correlation Coefficient (PCC) and Mutual Information (MI) have been designed to measure associations between any two variables but are not ideal for measuring the co-regulation patterns of genes as they are not able to precisely capture up-regulation, down regulation or dual interactions between genes [17].

An association measure based more closely on the regulatory relationships for up-regulation, down-regulation, and dual interactions can be defined using a frequentist approach, which implies that attribute values have to group by class for the attribute to be a good predictor. The 2D Visualized Co-regulation function we propose takes this frequentist approach: it generates a matrix of discretized expression levels for two genes and visualizes the gene-gene associations for further interpretation. Each cell represents the proportion of samples in which the first gene and the second gene have expression values within the cell boundaries. For instance, the entry in the upper left cell shows that there were no samples in which both genes simultaneously had a normalized expression level between 0 and 0.1.

The 2D Visualized Co-regulation function proposed here provides a visual depiction of gene associations and can assist biologists with gaining insight into gene interactions. Further, the matrix identifies association patterns of expression as up-regulation, dual and down-regulation relationships, which are mapped to different areas of the matrix based on a set threshold. For example, up-regulation is mapped to the High-High (HH) area of the matrix (bottom right side of the matrix) where both genes are at their high expression levels (Figure 2).


Figure 2: 2D visualized co-regulation function.

Figure 2 is a sample 2D Visualized Co-regulation matrix with discretized expression levels for one gene along rows and discretized expression levels for the other gene along columns. Each cell represents the proportion of samples in which the corresponding genes have expression values within the cell boundaries.
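As a minimal sketch of how such a matrix could be computed, the following assumes expression levels already normalized to [0, 1] and ten equal-width bins (consistent with the 0 to 0.1 cell mentioned in the text). The function name `coregulation_matrix` and the sample profiles are hypothetical and are not taken from the original implementation.

```python
def coregulation_matrix(gene_a, gene_b, bins=10):
    """Return a bins x bins grid of sample proportions for two genes.

    gene_a, gene_b: equal-length lists of expression levels normalized
    to [0, 1]. Rows index gene_a bins, columns index gene_b bins.
    """
    if len(gene_a) != len(gene_b) or len(gene_a) == 0:
        raise ValueError("need two non-empty, equal-length profiles")
    grid = [[0.0] * bins for _ in range(bins)]
    step = 1.0 / len(gene_a)  # each sample contributes an equal share
    for a, b in zip(gene_a, gene_b):
        # A value of exactly 1.0 is placed in the last bin.
        row = min(int(a * bins), bins - 1)
        col = min(int(b * bins), bins - 1)
        grid[row][col] += step
    return grid


# Hypothetical profiles for two genes over five samples.
profile_a = [0.05, 0.15, 0.85, 0.95, 0.92]
profile_b = [0.02, 0.25, 0.82, 0.99, 0.97]
matrix = coregulation_matrix(profile_a, profile_b)
```

Each cell then holds the share of samples falling inside that pair of bins, so the whole grid sums to one and can be rendered directly as a heat map.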

The matrix shows boundaries for the 25th percentile and 75th percentile for each gene. These boundaries are arbitrary; they demarcate areas that define the expression relationship between two genes as Low-Low, High-High, Low-High or High-Low. The results of the new 2D Visualized Co-regulation function presented here were compared with PCC and MI, applying static boundaries and threshold measures.

The depiction of gene interactions using the grid in Figure 2 enabled the operationalization of regulation relations by applying the following heuristics:

Heuristic 1: “A gene up-regulates another if the predominant interactions are HH”.

Heuristic 2: “A gene down-regulates another if the predominant interactions are HL”.

Heuristic 3: “A gene is in a dual relationship with another if the predominant interactions are HL and LH”.

Heuristic 4: “A gene does not regulate another unless it up-regulates, down-regulates or is in a dual relationship with another”.

This formulation contrasts with a Pearson co-efficient and MI whereby a positive association exists between genes A and B if, whenever A is low, B is also low, and whenever A is high, B is also high.
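Heuristics 1 to 4 can be sketched as a small classifier over the four corner regions. This is an illustration under stated assumptions, not the authors' code: "low" and "high" are taken as at or below the 25th percentile and at or above the 75th percentile of each gene, and "predominant" is operationalized with a hypothetical `min_share` cut-off that does not appear in the paper.

```python
from statistics import quantiles


def region_shares(gene_a, gene_b):
    """Proportion of samples in each corner region (HH, HL, LH, LL)."""
    q1_a, _, q3_a = quantiles(gene_a, n=4)
    q1_b, _, q3_b = quantiles(gene_b, n=4)
    counts = {"HH": 0, "HL": 0, "LH": 0, "LL": 0}
    for a, b in zip(gene_a, gene_b):
        level_a = "H" if a >= q3_a else "L" if a <= q1_a else None
        level_b = "H" if b >= q3_b else "L" if b <= q1_b else None
        if level_a and level_b:  # mid-range samples belong to no corner
            counts[level_a + level_b] += 1
    total = len(gene_a)
    return {region: n / total for region, n in counts.items()}


def classify(shares, min_share=0.15):
    """Map region shares to 'up', 'down', 'dual' or 'none'.

    min_share is a hypothetical threshold for calling a region predominant.
    """
    if shares["HL"] >= min_share and shares["LH"] >= min_share:
        return "dual"                      # Heuristic 3
    best = max(("HH", "HL", "LH"), key=shares.get)
    if shares[best] < min_share:
        return "none"                      # Heuristic 4
    # Heuristic 1 (HH) and Heuristic 2 (HL); LH alone is left unlabelled.
    return {"HH": "up", "HL": "down"}.get(best, "none")
```

With percentile-based boundaries, a perfectly anti-correlated pair populates both HL and LH and is therefore labelled dual in this sketch; a "down" label requires HL to dominate without a matching LH population.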

Post-processing

In this study, two post-processing procedures were applied to eliminate false positives and improve accuracy. The first post-processing heuristic eliminates false positive interactions by looking for the absence of the reverse interaction.

For example, if there is a gene pair labelled with an up-regulation interaction (high-high), then we expect that there are few associations indicative of a high-low (or in other words down-regulation) interaction. The heuristic deployed can be expressed as:

Heuristic 5: “If a gene up-regulates another, it cannot also down regulate it”.

If the number of entries inside the down-regulation area (high-low area) exceeds a user-set threshold we label the pair as being a false positive, otherwise it is labelled a true positive.
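The reverse-interaction check can be sketched as follows. The region shares are assumed to be available as a dictionary of proportions, `max_reverse_share` is a hypothetical stand-in for the user-set threshold, and the mirrored rule for down-regulation (checking the HH area) is an assumption added for symmetry rather than something stated in the text.

```python
# Region that contradicts each regulation label (Heuristic 5 and its
# assumed mirror image for down-regulation).
REVERSE_REGION = {"up": "HL", "down": "HH"}


def passes_reverse_check(label, shares, max_reverse_share=0.10):
    """Return True when the contradicting region is sparsely populated.

    label: "up" or "down".
    shares: dict mapping region name to the proportion of samples in it.
    max_reverse_share: hypothetical user-set threshold.
    """
    reverse = REVERSE_REGION[label]
    return shares.get(reverse, 0.0) <= max_reverse_share


# A pair labelled "up" with almost no high-low samples is kept ...
kept = passes_reverse_check("up", {"HH": 0.30, "HL": 0.02})
# ... while one with many high-low samples is flagged as a false positive.
flagged = not passes_reverse_check("up", {"HH": 0.30, "HL": 0.20})
```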

This approach is also applied to detect true self-regulations. A simple rule to apply this post-processing procedure is formulated in Equation 1.

image (1)

The optimal lower threshold of 10% and the upper threshold of 60% were determined experimentally. With this rule, we were able to distinguish all of the self-regulatory false positive pairs in six data sets without removing any true positives and eliminate extreme ranges, which are more likely to be noise.

The second post-processing procedure used the Data Processing Inequality (DPI), a principle from information theory [22], to remove additional false positives. DPI was applied along with the Expected Mutual Information [23]. DPI works on the basis that indirect interactions, in which there is a chain of interactions between two genes, are typically weaker than direct interactions. The longer the chain between two genes, the weaker the interaction is expected to be. In contrast, a direct relationship presents a strong change [4]. DPI was applied to the output of the 2D Visualized Co-regulation function to remove the weakest interaction in each chain and further reduce the number of false positives. The weakest interaction of a chain of three was defined as a score less than 15% of the minimum of the two others determined by DPI.
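A minimal sketch of DPI-style pruning on gene triplets is given below. The 15% rule in the text can be read in more than one way; this sketch assumes the tolerance reading commonly used with ARACNe, in which the weakest edge of a fully connected triplet is removed when its score falls below (1 - tolerance) times the smaller of the other two. The function name and the edge representation are hypothetical.

```python
from itertools import combinations


def dpi_prune(scores, tolerance=0.15):
    """Remove the weakest edge of each fully connected gene triplet.

    scores: dict mapping frozenset({gene1, gene2}) to an association score.
    tolerance: assumed interpretation of the 15% margin in the text.
    """
    genes = sorted({g for pair in scores for g in pair})
    to_remove = set()
    for a, b, c in combinations(genes, 3):
        edges = [frozenset((a, b)), frozenset((b, c)), frozenset((a, c))]
        if not all(e in scores for e in edges):
            continue  # DPI is only applied to closed triplets
        weakest = min(edges, key=scores.get)
        others = [scores[e] for e in edges if e != weakest]
        if scores[weakest] < (1.0 - tolerance) * min(others):
            to_remove.add(weakest)
    # Edges are removed only after all triplets have been inspected.
    return {e: s for e, s in scores.items() if e not in to_remove}
```

Collecting removals before applying them keeps the result independent of the order in which triplets are visited.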

Hub network

The generation of a hub network is a key heuristic to arrive at a plausible GRN.

Heuristic 6: “A GRN is generated by first identifying hub nodes and linking them to form a hub network”.

This heuristic addresses the problem of matching a finite but large set of candidate networks to a given number of genes. Building a network of hubs before adding genes to hubs dramatically reduces the search space and makes it possible to generate a GRN with an accurate structure that is also similar to networks obtained from domain knowledge. Heuristics drawn from domain knowledge regarding the average degree of each hub, as the approximate number of genes connected to each hub, can guide the addition of genes once the Hub Network is established. The network structure is also known to have an approximate average neighbourhood size of five.

The first layer (the core structure) of the network is built by first calculating the 2D Visualized Co-regulation measure for all genes in the expression data. This is followed by the post processing and background correction steps and finally selection of highly connected genes based on the previously mentioned threshold of the hub’s connectivity. This resulted in a network of hubs, which formed the backbone structure of our target network.

The second layer of the GRN consists of genes that are most strongly connected to hub nodes. The addition was weighted based on the known degree of each hub keeping non-hubs to an average degree of one or two [2].

Heuristic 7: “Non-hub genes that were immediately connected to Hubs were those that were most strongly associated with the Hub once normalised, taking the known Hub degree into account”.

We also applied a background correction after computing the pairwise association of each gene by considering these heuristics. In our background correction process, we first normalized the correlation values between each gene and all other genes and then filtered out those below 0.5. For each gene, we then chose the top genes from its list of interacting genes according to the degree of the node and attached them to the node in the network.
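The background correction step can be sketched as below. The normalization method is not specified in the text, so dividing each gene's scores by that gene's maximum score is an assumption made for illustration; the data layout, the function name and the default degree of one for genes without a known degree are likewise hypothetical.

```python
def background_correct(scores, degree, cutoff=0.5):
    """Select the partners to attach to each gene.

    scores: {gene: {partner: raw association score}}.
    degree: {gene: number of partners to attach}; genes missing from
            this mapping are treated as non-hubs with degree one.
    cutoff: normalized scores below this value are filtered out.
    """
    network = {}
    for gene, partners in scores.items():
        top = max(partners.values(), default=0.0)
        if top <= 0.0:
            network[gene] = []
            continue
        # Assumed normalization: scale by the gene's strongest association.
        normalized = {p: s / top for p, s in partners.items()}
        kept = [p for p, s in normalized.items() if s >= cutoff]
        kept.sort(key=lambda p: normalized[p], reverse=True)
        network[gene] = kept[: degree.get(gene, 1)]
    return network


example = background_correct(
    {"hubA": {"g1": 0.8, "g2": 0.6, "g3": 0.2, "g4": 0.5},
     "g9": {"hubA": 0.4, "g1": 0.1}},
    degree={"hubA": 2},
)
```

In this example the hub keeps its two strongest normalized partners, while the non-hub gene keeps only its single top-ranked partner.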

We applied a heuristic selection procedure to ensure the scale-free, multi-layered structural properties of GRNs are retained as is found for the Regulon DB algorithm, which clearly shows three categories of genes based on their connectivity degrees, where the first category includes hub genes [24]. In our heuristic selection procedure if the degree of the node was less than 15 (identified as being the average hub degree in the biological network literature), we selected the node exactly according to its degree; otherwise, we selected according to the formula as shown in Equation (2).

image (2)

In Equation (2), n is the number of genes selected from the normalized ranked list. This equation indicates that if a hub node has a degree of 15 or less, then at most 15 nodes from that list were attached to it. If a hub’s degree was more than 15, we used a function to decrease the number of false positive connections and to build a scale-free network (as GRNs are known to be).

In the background correction process, all non-hub genes get only their top ranked gene attached to them (the genes in the last layer have a degree of one). Based on information from the literature, non-hubs usually have fewer than five connections and are most likely to have one or two. Therefore, we chose to attach only one node to non-hub nodes to build the third layer of the network.

Data

According to Haynes and Brent [25], small changes in the accuracy of data sets or the design of experiments can impact markedly on the performance of algorithms. Mateschke et al. [26] also demonstrated that evaluation on a single data set, especially on real data, is unsuitable to establish differences in the prediction accuracy of inference methods. Thus, it is important to have a benchmark with no experimental error to test different algorithms. However, there is a lack of large-size real benchmarks [26,27]. In view of this, we used six synthetic datasets as GRN benchmarks produced by SynTReN [28] with different types of networks, different levels of noise, and different levels of complicated regulatory relationships. The datasets were generated using the neighbourhood addition and cluster addition methods from the original Escherichia coli (E. coli) network obtained from EcoGene [29]. The neighbourhood addition method used a random graph model while the cluster addition method utilized sub-networks from domain knowledge and gradually added nodes (Table 1).

| Dataset | Number of Experiments | Number of Genes | Subnetwork Selection Method | Biological Noise | Experimental Noise | Probability of Complex Interactions |
|---|---|---|---|---|---|---|
| 1 | 100 | 200 (100 background, 100 foreground) | Neighbour addition | 0.1 | 0.1 | 0.3 |
| 2 | 100 | 200 (100 background, 100 foreground) | Cluster addition | 0.1 | 0.1 | 0.3 |
| 3 | 100 | 200 (100 background, 100 foreground) | Neighbour addition | 0.1 | 0.1 | 0.4 |
| 4 | 100 | 200 (100 background, 100 foreground) | Neighbour addition | 0 | 0.1 | 0.3 |
| 5 | 50 | 200 (100 background, 100 foreground) | Neighbour addition | 0.1 | 0.1 | 0.3 |
| 6 | 100 | 200 (100 background, 100 foreground) | Neighbour addition | 0 | 0 | 0.3 |

Table 1: GRN benchmark networks and their characteristics.

Availability of data and materials

Statistics: Our system was tested on these six data sets and was evaluated based on the true positives, false negatives, precision, recall and F-measure. The F-measure is a summary metric, which is the harmonic mean of precision and recall. This metric has previously been used to compare the performance of well-known GRN discovery systems applied to SynTReN data [30]. Sensitivity, or recall, is the proportion of positives that are correctly identified. False negatives are true interactions that have been incorrectly labelled as absent; the false negative rate equals 1 - sensitivity.

Precision was determined by the standard deviation and the F-measure was applied to indicate accuracy of the heuristics. The F-measure is defined as the harmonic mean of positive predictive value (PPV, i.e., precision) and sensitivity (S, i.e., recall):

$$F = \frac{2 \cdot PPV \cdot S}{PPV + S} \qquad (3)$$
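As an illustration of how these metrics follow from the counts reported in the results tables, the sketch below computes precision, recall and the F-measure. It assumes that the "Total Records" column is the number of predicted interactions (true positives plus false positives); that reading is an inference from the reported precision values, not something the text states.

```python
def precision_recall_f(true_pos, false_neg, total_predicted):
    """Return (precision, recall, F-measure) from raw counts.

    total_predicted is assumed to equal true positives + false positives.
    """
    precision = true_pos / total_predicted if total_predicted else 0.0
    actual = true_pos + false_neg
    recall = true_pos / actual if actual else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Counts for the first MI row of Table 2: 761 records, 48 TP, 245 FN.
p, r, f = precision_recall_f(48, 245, 761)
```

For this row the sketch gives a precision of about 0.063 and a recall of about 0.16, in line with the tabulated values; the F-measure comes out near 0.09.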

For comparing the performance of the entire system, ARACNe [23] and CLR [31] were applied. ARACNe uses mutual information along with DPI to decrease the number of false positives. The Context Likelihood of Relatedness (CLR) algorithm is an extension of the relevance networks class of algorithms [32] and predicts regulations between transcription factors and genes when important mutual information can be detected. CLR adds an adaptive background correction step to the estimation of mutual information.

Results

The performance of the integrated system incorporating heuristics with the 2D Visualized co-regulation function, post processing and hub network algorithm was evaluated on the different benchmarks (Table 1). In this section, we describe results of the application of each of the components of our GRN discovery system as well as the performance of the entire system. The synthetic GRN networks generated (Table 1) differed depending on how they were constructed. The neighbour addition method resulted in greater variation for the median in-degree compared to the cluster addition method, whilst the cluster addition method produced networks similar to domain networks in terms of the network properties [30]. These observations also hold true for topological characteristics other than average directed path length and average in-degree.

The following results concentrate on the first and second synthetic GRN datasets, comparing in more detail the effect of how the synthetic GRNs were generated.

2D visualized co-regulation functions

Our 2D visualized co-regulation function was tested against two common methods used in the literature for measuring pairwise associations, namely Mutual Information (MI) and Pearson correlation. Table 2 and Figure 2 summarize the results of the 2D Visualized Co-regulation function compared with MI and Pearson. Precision and recall were calculated based on the number of true positives, false positives and false negatives discovered using the benchmark datasets and their corresponding goal network.

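| Dataset | Method | Total Records | True Positive | False Negative | Precision | Recall | F-measure |
|---|---|---|---|---|---|---|---|
| 1 | MI | 761 | 48 | 245 | 0.063 | 0.16 | 0.092 |
| 1 | PCC | 516 | 43 | 250 | 0.083 | 0.15 | 0.11 |
| 1 | 2D Visualized Co-regulation | 482 | 48 | 245 | 0.1 | 0.16 | 0.12 |
| 2 | MI | 812 | 61 | 173 | 0.075 | 0.26 | 0.12 |
| 2 | PCC | 598 | 52 | 182 | 0.087 | 0.22 | 0.13 |
| 2 | 2D Visualized Co-regulation | 570 | 57 | 177 | 0.1 | 0.24 | 0.14 |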

Table 2: Accuracy of two gene interactions determined by mutual information, Pearson’s correlation coefficient and 2D visualized co-regulation function.

The 2D Visualized Co-regulation function yielded more true positives and fewer false positives than Pearson’s correlation on both benchmark datasets. Compared with MI, the 2D Visualized Co-regulation function detected an equal number of true positives on the first benchmark dataset and fewer true positives on the second dataset; however, there were markedly fewer false positives. This shows that although the number of true positives found using MI is slightly greater than the number detected by our function (on one dataset), it comes at the cost of many more false positives.

To carry out an overall performance comparison, our 2D Visualized Co-regulation function was applied to all six benchmark datasets. Performance was significantly better than that of the MI (t=4.42, df=5, p=0.0069) and Pearson (t=3.46, df=5, p=0.018) functions, as illustrated in the precision/recall diagram shown in Figure 3.


Figure 3: Performance comparison of correlation functions and 2D visualized co-regulation function.
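The reported p-values can be cross-checked against the reported t statistics. The sketch below uses the closed-form expression for the Student t distribution with five degrees of freedom (the general odd-df series truncated at df=5) and assumes a two-sided test, which the text does not state explicitly. The function name is hypothetical.

```python
import math


def two_sided_p_df5(t):
    """Two-sided p-value for a t statistic with 5 degrees of freedom.

    Uses the closed form valid for df=5:
    P(|T| <= t) = (2/pi) * (theta + sin(theta) * (cos(theta)
                  + (2/3) * cos(theta)**3)), theta = atan(t / sqrt(5)).
    """
    theta = math.atan(abs(t) / math.sqrt(5))
    c, s = math.cos(theta), math.sin(theta)
    central = (2 / math.pi) * (theta + s * (c + (2 / 3) * c ** 3))
    return 1.0 - central


p_mi = two_sided_p_df5(4.42)       # comparison against MI
p_pearson = two_sided_p_df5(3.46)  # comparison against Pearson
```

Under these assumptions the values come out at roughly 0.0069 and 0.018, which agrees with the figures quoted in the text.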

Post-processing

Table 3 summarizes the results of Heuristic Post Processing (Heuristic PP) and Data Processing Inequality (DPI) averaged over the six benchmark networks. The F-measure indicates further improvements after applying each of these post-processing methods.

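| Method | Stage | True Positive | False Negative | False Positive | Precision | Recall | F-measure |
|---|---|---|---|---|---|---|---|
| Heuristic PP | Before | 72 | 221 | 430 | 0.14 | 0.25 | 0.18 |
| Heuristic PP | After | 72 | 221 | 416 | 0.15 | 0.25 | 0.19 |
| DPI | Before | 72 | 221 | 430 | 0.14 | 0.25 | 0.18 |
| DPI | After | 61 | 232 | 298 | 0.17 | 0.21 | 0.19 |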

Table 3: Performance of two post-processing procedures.

Hub network

The hub network generated from gene expression data using our biologically based heuristics method improved results by 60% when compared to results of known E. coli networks published in the literature [20,33].

Integrated system performance

The combination of our 2D Visualized Co-regulation function, Hub Network, and post-processing method as a whole was tested on six different benchmark datasets. The results were compared with those of the ARACNe [23] and CLR [31] on the same datasets.

The performance of our whole system against ARACNe and CLR is presented in Table 4. A paired t-test between the results of our entire system and those of CLR and ARACNe across the different datasets determined that our system performed significantly better than the other two systems (paired t-test on F-measure: ARACNe against Hub Network, p=0.024; CLR against Hub Network, p=0.0187). This demonstrates that in general, our system (Hub Network) outperforms the two existing methods for GRN discovery. Our model not only performs better compared with the two state-of-the-art systems, but also achieves good results with benchmark datasets that are more similar to known networks in terms of network structure.

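| Data Set | Method | Total Records | True Positive | False Negative | Precision | Recall | F-measure |
|---|---|---|---|---|---|---|---|
| 1 | ARACNe | 140 | 24 | 269 | 0.082 | 0.17 | 0.11 |
| 1 | CLR T=10 | 589 | 39 | 254 | 0.13 | 0.07 | 0.09 |
| 1 | Hub Network | 206 | 41 | 252 | 0.20 | 0.14 | 0.17 |
| 2 | ARACNe | 218 | 31 | 203 | 0.13 | 0.14 | 0.14 |
| 2 | CLR T=10 | 842 | 64 | 170 | 0.27 | 0.08 | 0.12 |
| 2 | Hub Network | 221 | 47 | 187 | 0.22 | 0.21 | 0.21 |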

Table 4: Benchmark results of Hub algorithm on the two datasets.

Figure 4 shows the entire network discovered on the first benchmark, which exhibits similar structural properties to known networks.


Figure 4: The predicted regulatory network of Escherichia coli, on the first benchmark.

This structural similarity is partly a result of the similarity of the core structure of our generated network (Hub Network) with that of the real network and partly due to the fact that, unlike some other systems (e.g. CLR), our method can detect loops as well as the direction (up-regulation, down-regulation and dual regulation) of the relationships in the network.

Discussion

Genomic research requires increasingly more data integration to deal with the large data sets from high-throughput molecular biology, such as microarray data and NGS-based methods, that are obtained from different experimental and analytical sources [5]. Shortcomings of current computation models relate to the nonlinearity of the genome data and the large number of genes and other molecules such as transcription factors, promoters, miRNAs and methylation patterns all having an effect on cell function [34]. Databases such as the Universal Protein Binding Microarray Resource for Oligonucleotide Binding Evaluation (UniPROBE) and JASPAR, and compilations from the literature such as TRANSFAC and the B-cell interactome (BCI) that include predicted protein-DNA interactions, are some sources attempting to capture diverse data [35].

Computational methods then attempt to reverse engineer the GRNs from the limited observed phenotypic expression of these variables [15,36]. Results from previous research suggest that the incorporation of domain knowledge as heuristics can help to better identify the context-specific regulatory interactions corresponding to certain phenotypes [37]. However, current algorithms do not always allow domain knowledge, complete or incomplete, to be incorporated. The heuristics incorporated here were derived from general characteristics of gene expression and network interactions and can be used by SynTReN and applied to any other known network such as yeast or human genome data.

Data from microarray or NGS-based methods are also typically collected without knowledge of the stage of the disease. In addition, current statistical methods do not allow linking a gene or gene network with disease progression or risk of future morbidity and mortality and do not allow inferences to be made where a single gene can have a number of functions or similar genes can be identified as associated with a number of different diseases or processes [3,35,38]. Therefore, the interpretation of the results from GRNs becomes complicated and it is difficult to extract a GRN that is not contradictory to current expert knowledge [39].

We argue that using various types of information derived from domain knowledge can improve the accuracy of the discovery process. In addition, using heuristics in GRN discovery is both novel and compelling. Using heuristics from domain knowledge has the potential to result in more accurate GRNs and in a better understanding of gene associations related to phenotypic expression in health and disease. Equally as important is the fact that the extracted GRNs (using such knowledge-based heuristics) will be understandable and plausible to biologists and will be compatible with current expert knowledge.

SynTReN effectively uses known network parts to build a simulated network; its network output is the most similar to real networks. This characteristic made it a suitable candidate for testing our system, mainly because our work relies on heuristics from the known network in order to elevate performance. We therefore needed a simulator that could produce networks as compatible with domain knowledge as possible. SynTReN has been widely used in the literature to generate data in order to compare the performance of different GRN discovery algorithms [26,30].

An initial step towards obtaining meaningful data reduction and inclusion of up-regulation, down-regulation and dual regulation of gene associations is our proposed 2D Visualized Co-regulation function. The visualisation of the outcome as shown in Figure 2 provides visual information on any of the possible gene-gene associations, with results being more accurate than previous PCC or MI methods.

The large number of false positives is a well-known problem and the main challenge for any GRN discovery algorithm [40]. This problem occurs because it is difficult to distinguish direct gene-gene relationships, in which a gene directly up- or down-regulates another gene, from indirect relationships, in which the interaction is mediated through a chain of intermediary genes. This poses a further complication for GRN discovery as sometimes an indirect effect represents a stronger correlation than a direct effect [4,11].

The effectiveness of our Heuristic Post-Processing method is based on domain knowledge indicating that transcription factors are commonly self-regulating to allow effective adjustment to environmental conditions. Other measures of association mentioned in the literature are usually unable to detect self-loops, although as many as 59% of the transcription factors in E. coli are self-regulating [20]. Of all self-regulation relationships, 87% are negative feedback, 6.5% are positive feedback, and 6.5% are dual circuits [4,41]. The dominant form of self-regulation is self-suppression [20].

In addition to their global structure, GRNs exhibit local structures called motifs. Common motifs include self-regulatory and feed-forward loops. These mechanisms create stability against minor perturbations. GRNs have a considerable number of loops, most commonly feed-forward loops, which are merged into the structure of the network. Approximately 59% of the transcription factors in Escherichia coli regulate the transcription rate of their own genes [42]. As most methods for GRN discovery do not detect these loops, they overlook valuable information in their network discovery processes. In this study, our method detects these loops and makes use of the aforementioned heuristic in a post-processing step to reduce the number of false positives in identifying self-regulation (i.e., self-loops). The 2D Visualized Co-regulation function not only leads to improved GRN discovery compared to the DPI method but also comes with a visualization that has been shown to help experts detect the pattern of changes in the expression level of genes across samples.
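To make the two motifs concrete, the following minimal sketch enumerates self-loops and feed-forward loops in a directed edge list. It illustrates the motif definitions only and is not the detection procedure used in this study; the function name `find_motifs` and the toy gene names are hypothetical.

```python
def find_motifs(edges):
    """Return (self_loops, feed_forward_loops) found in a directed edge list.

    edges: iterable of (regulator, target) pairs.
    A feed-forward loop is a triple (x, y, z) of distinct genes with
    edges x -> y, y -> z and x -> z.
    """
    edge_set = set(edges)

    # Adjacency map without self-loops, so x, y and z stay distinct.
    targets = {}
    for src, dst in edge_set:
        if src != dst:
            targets.setdefault(src, set()).add(dst)

    self_loops = sorted(src for src, dst in edge_set if src == dst)

    ffls = []
    for x, x_targets in targets.items():
        for y in x_targets:
            for z in targets.get(y, ()):
                if z != x and z in x_targets:
                    ffls.append((x, y, z))
    return self_loops, sorted(ffls)


toy_edges = [
    ("tfA", "tfA"),    # self-regulation
    ("tfA", "tfB"),
    ("tfB", "geneC"),
    ("tfA", "geneC"),  # closes the feed-forward loop tfA -> tfB -> geneC
    ("tfB", "geneD"),
]
self_loops, ffls = find_motifs(toy_edges)
```

A method that scores only pairs of distinct genes never produces an edge such as `("tfA", "tfA")`, which is why self-regulation is invisible to it.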

The third part of our heuristic approach is the construction of the GRN backbone using knowledge of the main transcription factors and their interactions. Hubs form a strong backbone for the GRN: hub genes encode chromatin regulators, and their function as genetic hubs is evolutionarily conserved across different organisms, making this heuristic suitable for any genome. Their normal function is to act as genetic buffers, minimizing the effects of mutations in other genes [43]. From a computational perspective, identifying these hub genes is therefore critical; if they are not correctly placed, the GRN collapses [20,44,45].

Hubs are connected to other hubs to form a primary structure, a network of hubs, which acts as a backbone to the GRN [46]. Disease conditions are often characterized by lost hub connections rather than additional connections. Lu et al. [45] demonstrated that nodes with high connectivity (hubs and super hubs) tend to have low levels of change in gene expression and genes with a high level of change in expression are more likely to be peripheral nodes with low connectivity. The Hub network algorithm initially builds the backbone of the GRN by identifying plausible hub genes and their connections through our heuristics. We then expand the network from those hubs using the structural property of power law networks [47].
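The hub heuristic described above (high connectivity combined with little change in expression) can be sketched as follows. This is a simplified illustration under stated assumptions, not the Hub network algorithm itself: the function names `candidate_hubs` and `backbone`, the cut-offs `min_degree` and `max_variance`, and the toy data are all hypothetical, and the power-law expansion step is not shown.

```python
from statistics import pvariance


def candidate_hubs(edges, expression, min_degree=3, max_variance=0.5):
    """Pick genes that are highly connected and show little expression change.

    edges: iterable of undirected (gene_a, gene_b) candidate associations.
    expression: dict mapping gene -> list of expression values across samples.
    min_degree, max_variance: illustrative cut-offs, not values from the paper.
    """
    degree = {}
    for a, b in edges:
        if a == b:
            continue  # self-loops do not count towards connectivity here
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1

    hubs = []
    for gene, deg in degree.items():
        if deg < min_degree or gene not in expression:
            continue
        if pvariance(expression[gene]) <= max_variance:
            hubs.append(gene)
    # Most connected first; ties broken alphabetically.
    return sorted(hubs, key=lambda g: (-degree[g], g))


def backbone(edges, hubs):
    """Keep only the associations whose two endpoints are both hubs."""
    hub_set = set(hubs)
    return [(a, b) for a, b in edges
            if a != b and a in hub_set and b in hub_set]


toy_edges = [
    ("h1", "h2"), ("h1", "g1"), ("h1", "g2"),
    ("h2", "g3"), ("h2", "g4"),
    ("g5", "g1"), ("g5", "g2"), ("g5", "g3"),
]
toy_expression = {
    "h1": [1.0, 1.1, 0.9, 1.0],   # well connected, stable expression
    "h2": [2.0, 2.1, 2.0, 1.9],   # well connected, stable expression
    "g5": [0.0, 3.0, 0.5, 4.0],   # well connected, but highly variable
}
hubs = candidate_hubs(toy_edges, toy_expression)
hub_backbone = backbone(toy_edges, hubs)
```

In this toy example `g5` has the same degree as the two hubs but is rejected because its expression varies strongly, mirroring the observation of Lu et al. [45].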

Conclusion

We developed a novel technique for Gene Regulatory Network discovery that integrates heuristic information into the discovery process. The incorporation of three types of heuristics has been shown to effectively inform the generation of gene regulatory networks from gene expression data. All three heuristics are conceptually simple and computationally efficient, scaling to predictions at the level of entire genomes. Overall, this work demonstrates that heuristic information can elevate the performance of GRN inference, outperforming two state-of-the-art GRN inference algorithms, ARACNe and CRL. Future research will involve using information about functional gene sets in combination with our Hub Network. It may also be possible to further increase the performance of our co-regulation function by using more sophisticated discretization methods.

Authors’ Contributions

AZ designed and carried out the data analysis experiments and drafted the manuscript. PV contributed to the writing and provided feedback on the analytical experiments. AS contributed to the design of the analysis and to writing the manuscript. HFJ contributed to the analysis, interpretation and writing of the manuscript.

Acknowledgements

The authors would like to thank Associate Professor Catherine Abbott from Flinders University in South Australia, Associate Professor Izhak Haviv from University of Melbourne, and Associate Professor Ross Lazarus from Harvard University and Medical Bioinformatics at Baker IDI for assessing the visualization of our co-regulation function and providing feedback.

Declarations

The authors have received no funding for this project and have no conflicting interests to declare.

References
