Knowledge Mining of Disease Network can Provide New Insights in Cancer Research through Analysis of Other Diseases

Cancer. The word itself seems to hold a dark power over modern humankind. Nearly every one of us has come into close contact with this life-threatening disease. In recent years, researchers have produced a body of work that has given us a clearer (albeit more complicated) picture of how cancer comes to be, how it develops, and how it can be treated. The roles of genetics (in the form of single nucleotide polymorphisms, or SNPs) [1], epigenetics [2], miRNA [3], copy number variation [4], chromatin structure [5], and protein biomarkers [6] in cancer have all been shown. While great scientific advances have been made in the understanding and treatment of this disease in the last 50 years, we still do not have a clear understanding of the 'how' and 'why'. Given a set of initial conditions in the body defined by genetics, lifestyle, environmental exposure, etc., cancer begins and then develops through an evolutionary process. As a result, every cancer has unique characteristics [7]. Clearly, cancer is a multidimensional problem for which we now have an enormous amount of data. Gaining knowledge from the existing data, however, is a nontrivial task.
In recent years, bioinformatics and computational biology have made a variety of contributions to disease analysis, using existing data in an attempt to increase our understanding of many diseases. Popular topics include the discovery, prediction, and analysis of genes related to disease [8], statistical analysis of SNPs and disease [9], the prediction and discovery of new drug targets [10], the development of the disease ontology and its application to the human genome [11,12], the analysis of protein-protein interaction networks as they relate to disease [13], and many others. Of particular interest is the development of 'disease networks' [14,15], which are in most cases bipartite graphs describing disease-disease as well as disease-gene relationships. In the projection of the disease-gene network that describes disease-disease relationships (Figure 1), nodes indicate diseases and the edge between two nodes represents how these diseases are related. These edges may signify one or more shared genes, metabolic pathways, miRNAs, or a number of other data types. The disease network reveals the interconnected nature of various diseases, which raises the question: can we gain new knowledge of a disease such as cancer by studying 'connected', noncancer diseases? Many diseases, including obesity [16,17], various infections [18], diabetes [19], and possibly even psychological stress [20], have been reported to have some relationship to cancer. Often the relationship type is unknown or only partially known, which indicates that a deeper understanding of these relationships is needed. So far, however, these relationships have been explored as individual links rather than as a whole.
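The projection described above can be sketched in a few lines. This is a minimal illustration, not the construction used in the cited disease network studies: the disease and gene names are hypothetical placeholders, the function `project_diseases` is an assumed helper, and shared genes are the only edge type considered.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical disease-gene associations (placeholders, not real data).
disease_genes = {
    "cancer_A": {"gene_1", "gene_2", "gene_3"},
    "disease_B": {"gene_2", "gene_4"},
    "disease_C": {"gene_2", "gene_3"},
    "disease_D": {"gene_5"},
}


def project_diseases(disease_genes):
    """Project a bipartite disease-gene graph onto its disease nodes.

    Returns a dict mapping each (disease, disease) pair, sorted by name,
    to the set of genes the two diseases share. Pairs with no shared
    gene receive no edge.
    """
    # Invert the mapping: for each gene, which diseases is it linked to?
    gene_to_diseases = defaultdict(set)
    for disease, genes in disease_genes.items():
        for gene in genes:
            gene_to_diseases[gene].add(disease)

    # Every pair of diseases linked to the same gene gets (or extends) an edge.
    edges = defaultdict(set)
    for gene, diseases in gene_to_diseases.items():
        for a, b in combinations(sorted(diseases), 2):
            edges[(a, b)].add(gene)
    return dict(edges)


edges = project_diseases(disease_genes)
for pair, shared in sorted(edges.items()):
    print(pair, sorted(shared))
```

Storing the shared genes on each edge, rather than a bare yes/no link, leaves room to attach further evidence types (pathways, miRNAs) to the same edge later.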
Due to the complicated nature of many diseases, which may involve the failure of multiple levels of biological function including DNA repair, gene regulation, epigenetic and histone modifications, metabolic pathways, etc., elucidation of disease relationships requires a systematic and computational solution. Though there may be a plethora of data available to quantify this disease problem, the data itself does nothing for us if we cannot turn it into knowledge (a similar problem arose after the sequencing of the human genome). Merely combining sources of data is not sufficient. We must identify patterns within the data, which is infeasible to do manually when the number of data points and characteristics to be compared is large. A clearer understanding could be gained by finding, among all attributes of a relationship, those that characterize it most accurately. Several existing machine learning algorithms can help achieve this, including multiple instance learning [21], positive/unlabeled (PU) learning [22], Bayesian inference [23], the alternating decision tree, or ADTree [24], and others. In the past we have used the ADTree algorithm to analyze methylation patterns on DNA [25] and to predict DNA-binding proteins [26]. In both cases, this algorithm helped us to understand which characteristics have the most influence on determining the class to which the examples belong. A similar method of 'rule discovery' is needed in the case of the disease network. Of course, the rules may be heavily dependent upon the types of disease in question (e.g., metabolic, infectious, autoimmune, or genetic). By analyzing a combination of available genetic, epigenetic, and proteomic data, one will be able to use these algorithms to enrich the edges between cancer and other diseases in the disease network, as well as to predict new edges within disease clusters.
The key to understanding the disease network is to enrich the value of existing edges and to infer new ones based on this enriched value.
There is a wealth of information concerning diseases, metabolism, gene ontology, drug targets, miRNA, protein-protein interaction, gene regulation, and gene expression. Unfortunately, there are large areas of missing and overlapping data, as well as many false positives and even more false negatives. This makes it difficult to assemble the puzzle and gain knowledge. One can use algorithms such as ADTree, which can filter through noisy data to find the most informative and conserved characteristics of a disease-disease relationship. Cancer A and noncancer disease B may not share a causal gene according to OMIM, yet they may be related at some distance through a common metabolic pathway, a co-regulating transcription factor, or negative regulation by one or more miRNAs. Any of these three links could be a false positive association. When they are analyzed together with other available data, however, a more complete biological process comes into focus and the noise problem can be mitigated. The ADTree allows us to easily visualize which biological processes contribute most to the disease relationship, eliminating the 'black box' effect of many machine learning algorithms.
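To make the idea of finding the most informative characteristics of a disease-disease relationship concrete, here is a minimal sketch. It does not implement ADTree; it substitutes a much simpler stand-in, ranking binary evidence types by single-feature information gain. The feature names, the six example disease pairs, and their labels are hypothetical and serve only to show the mechanics.

```python
from math import log2

# Hypothetical evidence for six disease pairs (1 = evidence present).
# The label says whether the pair is treated as a known relationship.
FEATURES = ["shared_gene", "shared_pathway", "shared_mirna"]
rows = [
    {"shared_gene": 1, "shared_pathway": 1, "shared_mirna": 0},
    {"shared_gene": 0, "shared_pathway": 1, "shared_mirna": 1},
    {"shared_gene": 0, "shared_pathway": 1, "shared_mirna": 0},
    {"shared_gene": 0, "shared_pathway": 0, "shared_mirna": 1},
    {"shared_gene": 1, "shared_pathway": 0, "shared_mirna": 0},
    {"shared_gene": 0, "shared_pathway": 0, "shared_mirna": 0},
]
labels = [1, 1, 1, 0, 0, 0]


def entropy(labels):
    """Shannon entropy (in bits) of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    if p in (0.0, 1.0):
        return 0.0
    return -(p * log2(p) + (1 - p) * log2(1 - p))


def information_gain(rows, labels, feature):
    """Reduction in label entropy after splitting on one binary feature."""
    n = len(rows)
    remainder = 0.0
    for value in (0, 1):
        subset = [y for x, y in zip(rows, labels) if x[feature] == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder


# Rank evidence types from most to least informative.
ranked = sorted(
    ((f, information_gain(rows, labels, f)) for f in FEATURES),
    key=lambda item: item[1],
    reverse=True,
)
for feature, gain in ranked:
    print(f"{feature}: {gain:.3f} bits")
```

In this toy data the shared pathway separates the two classes perfectly while the other two features carry no information on their own. A real analysis would need many more examples and an algorithm, such as ADTree, that also captures combinations of features.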
Overall, we believe cancer is both unique and related to other diseases. Studying all diseases as a network system can generate many interesting results. For example, drugs for related noncancer diseases may help treat the side effects of cancer drugs; the complex relationship between bacteria and cancer, in which bacteria can be both beneficial and cancer-causing, can provide new ideas about cancer treatment; and the mechanisms and tissue specificity of noncancer diseases may prime the cellular environment for metastasis. We expect that in the near future, with enormous amounts of genotype and phenotype data available for all diseases, a novel viewpoint for cancer research will emerge from the study of the disease network.