Bioprospecting the Bibleome: Adding Evidence to Support the Inflammatory Basis of Cancer

Background and question: BioProspecting is a novel approach that enabled our team to mine genetic-marker-related data from the New England Journal of Medicine (NEJM) utilizing the Systematized Nomenclature of Medicine-Clinical Terms (SNOMED CT) and the Human Genome Organization (HUGO) gene nomenclature. Genes were associated with disorders using the Multi-threaded Clinical Vocabulary Server (MCVS) Natural Language Processing (NLP) engine, whose output was represented as an ontology-network incorporating the semantic encodings of the literature. Metabolic functions were used to identify potentially novel relationships between (genes or proteins) and (diseases or drugs). In an effort to identify genes important to the transformation of normal tissue into a malignancy, we identified the genes linked to multiple cancers and then mapped those genes to metabolic and signaling pathways. Findings: Ten genes were related to 30 or more cancers, 72 genes were related to 20 or more cancers and 191 genes were related to 10 or more cancers. The three pathways most often associated with the top 200 novel cancer markers were the Acute Phase Response Signaling, the Glucocorticoid Receptor Signaling and the Hepatic Fibrosis/Hepatic Stellate Cell Activation pathways. Meaning and implications of the advance: This association highlights the role of inflammation in the induction and perhaps transformation of mortal cells into cancers. BioProspecting can speed our identification and understanding of synergies between articles in the biomedical literature. In this case we found considerable synergy between the Oncology literature and the Sepsis literature. By mapping these associations to known metabolic, regulatory and signaling pathways we were able to identify further evidence for the inflammatory basis of cancer.

Introduction
The Genomics and Bioinformatics research communities have been successful over the past several years in discovering the genetic basis of Mendelian disorders [1]. Computational tools delivered by the field of Bioinformatics have been essential in our search for the genetic basis of Mendelian disease [2]. Bioinformatics research develops technologies in the context of clinical and genetic data standards that have been employed to create linkages between genes and disorders [3]. Unfortunately, many diseases cannot be traced to a single genetic variation; their true origin is a complex interplay of genetic variations, environmental factors, and "lifestyle" characteristics, along with some stochastic processes [4]. To advance medical science and healthcare, a broader understanding of genetic markers and their inter-relationships is required [5]. Our project to "discover" genetic markers by mining the medical literature was an attempt to reveal, and later investigate, linkages between variants discerned through that mining [6].

Polk Award noted that its 1977 award to the New England Journal of Medicine “provided the first significant mainstream visibility for a publication that would achieve enormous attention and prestige in the ensuing decades” [9]. The NEJM publishes papers on original research, widely cited editorials, review articles, correspondence and case reports. The NEJM consistently has the highest impact factor among journals of clinical medicine; in 2010 it was 53.48. The NEJM provides on-line (electronic) access. Our study included the full text of all articles from January 1994 through December 2006.

SNOMED CT (Systematized Nomenclature of Medicine-Clinical Terms)
SNOMED CT (SCT) is a large-scale general medical ontology used to describe current medical and health knowledge [10]. We employed SCT to represent clinical disorders, proteins, chemicals and metabolic functions identified within the biomedical literature. SCT is the most comprehensive of the general medical and biological ontologies. SCT is maintained by the International Health Terminology Standards Development Organization (IHTSDO) and is designed to represent the health and medical domains. Although very broad, with over 291,000 concepts and over 1.2 million relations, it does not provide complete coverage of all medical content. SCT was formed in the late 1990s through a merger of SNOMED RT (Reference Terminology) with another large ontology, Clinical Terms v3, developed in the United Kingdom.
Ontologies, to be practically useful for computation, need to be accurate in two specific ways. First, they need to match as closely as possible our understanding of the natural world. Second, they need to be at least 95% complete in their domain coverage. The first requirement means the ontology needs to be constructed with very close attention to the meanings of its concepts; in other words, the concepts used in the ontology need to represent closely the understanding we have of the real world [11]. The variety of linguistic "usage" of those terms needs to be harnessed to ensure that the ontology has adequate content coverage. One should adequately explore the diversity of semantic roles of the concepts to ensure that only those roles that are useful are included in the ontological modeling, and to remove ambiguity in the use of those roles. Precision of each definition is particularly important in establishing the relationships between concepts and identifying the fundamental atomic concepts and how they combine or "fit together" systematically in compositional expressions. Ontologies also consist of levels of abstraction. In ontological computation there are two basic forms of abstraction: aggregation and generalization. Aggregation hierarchies are created from the ontological indexing by linking like objects. Generalization hierarchies are used throughout SCT as the basic mechanism for relating content (e.g. all oncological disorders).
SCT provides concepts to represent diagnoses, findings, procedures and testing. In a previous article, Elkin et al. published a cohort study reporting that SCT provided 92.3% coverage of common medical problems in cases from the Mayo Clinic [12]. This NLP system has been used in subsequent studies for electronic quality monitoring [13].

HUGO
The Human Genome Organization (HUGO) is an organization involved in the Human Genome Project. HUGO was established in 1989 to foster collaboration between genome scientists around the world. The HUGO Gene Nomenclature Committee (HGNC) is the HUGO committee that assigns a unique gene name and symbol to each human gene.

Materials and Methods
We parsed the full-text content of the New England Journal of Medicine (1994 through 2006) using the Multi-threaded Clinical Vocabulary Server. The MCVS creates an ontology-network of the semantic encodings of genes, proteins, disorders, drugs and metabolic functions identified in the literature, organized by section of the article, including the tables.
The indexing was done utilizing the SNOMED CT and HUGO ontologies. We utilized SNOMED CT as it provides robust clinical indexing. SNOMED CT has >370,000 concepts and >1,000,000 terms (in our lab we added another 790,000 terms to improve its clinical relevance) and HUGO has >26,000 human gene names. This concept-based indexing represents a broad and consistent data infrastructure across articles from the literature [14].
We identified co-occurrences of genes and metabolic functions, proteins and metabolic functions, diseases and metabolic functions and drugs and metabolic functions. Next, we matched these data sets linking pairs (e.g. Disorders and Genes) with a common metabolic function, identifying functional relationships between proteins and diseases, for example. Candidate functional relationships were identified between genes and diseases, proteins and drugs, genes and drugs, and drugs and diseases. Next, we identified the disjoint sets where, for example, a gene and a disease match across metabolic function but have not been mentioned together in any previous NEJM journal article.
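The matching step described above can be sketched as a join on shared metabolic functions followed by a set difference. This is a minimal sketch under stated assumptions and not the MCVS implementation: the input structures (`gene_functions`, `disease_functions`, `cooccurring_pairs`), the function names and all example identifiers are hypothetical, chosen only to illustrate the logic.

```python
from collections import defaultdict


def candidate_pairs(gene_functions, disease_functions):
    """Link genes and diseases that share at least one metabolic function.

    Both arguments map an entity name to the set of metabolic functions it
    co-occurred with (a hypothetical structure, not the MCVS output format).
    Returns {(gene, disease): set of shared functions}.
    """
    # Invert the disease map so each function points at its diseases.
    diseases_by_function = defaultdict(set)
    for disease, functions in disease_functions.items():
        for function in functions:
            diseases_by_function[function].add(disease)

    shared = defaultdict(set)
    for gene, functions in gene_functions.items():
        for function in functions:
            for disease in diseases_by_function.get(function, ()):
                shared[(gene, disease)].add(function)
    return dict(shared)


def disjoint_pairs(candidates, cooccurring_pairs):
    """Keep candidate pairs never mentioned together in the same article."""
    return {pair: funcs for pair, funcs in candidates.items()
            if pair not in cooccurring_pairs}


# Toy, made-up inputs for illustration only.
gene_functions = {"GENE_A": {"inflammation"}, "GENE_B": {"lipid metabolism"}}
disease_functions = {"DISEASE_X": {"inflammation"},
                     "DISEASE_Y": {"inflammation", "lipid metabolism"}}
cooccurring_pairs = {("GENE_A", "DISEASE_X")}  # already mentioned together

candidates = candidate_pairs(gene_functions, disease_functions)
novel = disjoint_pairs(candidates, cooccurring_pairs)
```

In this toy example `GENE_A` and `DISEASE_X` share a function but already co-occur in an article, so only the remaining pairs are kept as potentially novel. The same join applies unchanged to the other entity pairings (proteins and drugs, genes and drugs, drugs and diseases).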
We compiled the genes from the disjoint set (potentially novel markers) that were related to more than three tissue types of cancer. We sorted the genes by the number of cancers associated with the gene. Then we took the 200 genes from the top of this list and mapped them against the metabolic pathways and signaling pathways available within Ingenuity™ [15]. Ingenuity is commercially available pathway analysis software. The pathways associated with the highest number of these possibly novel genes are reported.
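The ranking step can be illustrated with a short sketch. It assumes the disjoint set is available as (gene, cancer) pairs; the function name, parameter names and toy data are hypothetical, and the pathway mapping itself (performed in Ingenuity) is not reproduced here.

```python
from collections import defaultdict


def rank_genes_by_cancer_count(novel_pairs, min_cancers=4, top_n=200):
    """Rank genes by how many distinct cancers they are linked to.

    novel_pairs: iterable of (gene, cancer) tuples from the disjoint set.
    min_cancers=4 encodes "more than three"; top_n=200 follows the text.
    """
    cancers_by_gene = defaultdict(set)
    for gene, cancer in novel_pairs:
        cancers_by_gene[gene].add(cancer)  # sets ignore repeated pairs

    counts = [(gene, len(cancers))
              for gene, cancers in cancers_by_gene.items()
              if len(cancers) >= min_cancers]
    # Sort by descending count; break ties alphabetically for reproducibility.
    counts.sort(key=lambda item: (-item[1], item[0]))
    return counts[:top_n]


# Toy, made-up pairs for illustration only.
pairs = ([("GENE_A", f"cancer_{i}") for i in range(5)]
         + [("GENE_B", f"cancer_{i}") for i in range(4)]
         + [("GENE_C", "cancer_0"), ("GENE_A", "cancer_0")])
ranked = rank_genes_by_cancer_count(pairs)
```

Here `GENE_C` is dropped because it is linked to a single cancer, and the repeated `GENE_A` pair is counted once.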
We compared the probability that a gene would be related to 20 or more cancers by random chance with the rate of identification in our study, and analyzed the results using the McNemar test.
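For reference, McNemar's test operates on the two discordant counts of a paired 2×2 table. The sketch below computes the exact (binomial) two-sided form of the test. The counts are hypothetical, and how the paired table was constructed in this study is not specified in the text, so this illustrates the test itself rather than the analysis performed.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the discordant counts b and c.

    Under the null hypothesis each discordant pair falls in either cell
    with probability 0.5, so min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence against the null
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical discordant counts, for illustration only.
p_value = mcnemar_exact(b=0, c=10)
```

The more lopsided the discordant counts, the smaller the p-value; equal counts give a p-value of 1.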
The conceptual process can be visualized as shown in Figure 1.

Novel relationships
SCT contained 574 metabolic functions that were used to link to Genes, Proteins, Disorders, and Drugs.
We identified 1756 genes that were related to 10,303 different types of cancer. Ten genes were related to 30 or more cancers. The gene associated with the most cancers was MYELIN OLIGODENDROCYTE GLYCOPROTEIN, which was related to 35 types of cancer. There were 72 genes related to 20 or more cancers and 191 genes related to 10 or more cancers. We have annotated the top 50 genes and present them in Table 1. Table 2 shows the pathways containing the top 200 genes from this dataset, which we mapped against the metabolic and signaling pathways to identify the most common pathways that involved this set of genes. The pathways with the greatest overlap with this dataset were the Acute Phase Response Signaling pathway (see Figure 2), the Glucocorticoid Receptor Signaling pathway and the Hepatic Fibrosis/Hepatic Stellate Cell Activation pathway, which includes targets such as TNFα-NFκB.
The chance that a gene identified by our bioprospecting method as related to twenty or more cancers is not truly related to cancer is very small (the odds that all findings would be false positives are 3.9 × 10⁻²⁴:1). If we compare the chance that a gene would be related to twenty or more cancers by chance alone with the current findings, the results are highly statistically significant (p < 0.001; McNemar test).

Table 1 (excerpt): gene, number of related cancers, and annotation

[...] inhibition suppresses retroviral replication in cell culture and primary cells with no measurable drug-induced adverse effects on cell cycle transition, apoptosis, or general cytotoxicity.

XANTHINE DEHYDROGENASE 22
Xdh-null mice were runted and did not live beyond 6 weeks of age. Xdh heterozygous females, although healthy and fertile, were unable to maintain lactation, and their pups died of starvation 2 weeks postpartum. Histologic analysis showed that, in heterozygous females, the mammary epithelium had collapsed, resulting in premature involution of the mammary gland. Electron microscopy showed that Xdh was specifically required for enveloping milk fat droplets with the apical plasma membrane prior to secretion from the lactating mammary gland.

SPLEEN TYROSINE KINASE 22
SYK is activated by oxidative stress; putative tumor suppressor; role in the differentiation of B-cells and many other cell types; inactivated by hypermethylation. Found to be inactivated in a subset of breast cancer. Also prevalent in a case of myelodysplastic syndrome.

ANKYLOSING SPONDYLITIS 22
Mainly affects joints in the spine and the sacroiliac joints of the pelvis, and can cause eventual fusion of the spine.

EOSINOPHIL PEROXIDASE 21
Partially responsible for tissue remodeling; provides a mechanism by which eosinophils kill multicellular parasites (e.g., the nematode worms involved in filariasis) and also certain bacteria (e.g., tuberculosis bacteria).

HYALURONAN SYNTHASE 3 21
Regulator of hyaluronan synthesis, a major constituent of the extracellular matrix.

ADENOSINE A3 RECEPTOR 21
Expressed at high levels in the vascular smooth muscle layer of normal mouse aortas. Knockout mice showed blood pressure comparable to WT, but aorta and heart cAMP levels were elevated. When challenged with adenosine, the KO mice showed further increased cAMP levels in the heart and vascular smooth muscle, and a significant decrease in blood pressure.

HISTONE DEACETYLASE 2 21
KO mice are characterized by partially penetrant embryonic lethality, with abnormalities of myocyte proliferation and differentiation apparent during late gestation.

HISTONE DEACETYLASE 4 21
Regulates chondrocyte hypertrophy and endochondral bone formation in mice by interacting with and inhibiting the activity of Runx2 (600211), a transcription factor necessary for chondrocyte hypertrophy.

Discussion
The biomedical literature is a repository of our accumulated biomedical knowledge. Much of the value contained in this resource is locked in free text and is therefore not in a form that is easily amenable to computational analysis. Biomedical researchers cannot practically keep up with the entirety of the biomedical literature. Natural language processing has the potential to unlock the knowledge within the free-text medical literature. In this experiment we used the information from the NEJM, one of the premier medical journals, to determine if novel relationships could be identified between Genes, Proteins, Drugs and Disorders. These relationships are indeed prevalent and hold tremendous potential to increase our understanding of human disease.
We examined all of the genes related to more than three cancers. In this analysis we identified 10 genes related to thirty or more cancers, 72 genes related to twenty or more cancers and 191 genes related to ten or more cancers. It is possible that the genes in Table 2 will help researchers identify a cure for cancer. We then mapped the top 200 genes to known metabolic and signaling pathways. The three pathways that were most highly correlated with this set of potentially novel gene-cancer diagnosis relationships were all inflammatory pathways (i.e. the Acute Phase Response Signaling pathway, the Glucocorticoid Receptor Signaling pathway and the Hepatic Fibrosis/Hepatic Stellate Cell Activation pathway). This finding highlights the importance of inflammation as a stimulus for the transformation of normal tissue to malignancy. Perhaps the common basis for transformation of cells to malignancies has already been published but remains "hidden" within synergies between articles in the vast biomedical literature. This NLP-based data mining experiment shows the utility of mining the literature for scientific discovery. The authors believe that the entire medical literature should be exposed in this way to provide improved computable access to the knowledge contained in its text. Future research should add the content of other journals to the compendium used to find synergies across articles from the biomedical literature.
Our goal was to identify synergy between articles within the literature indicating potential relationships between genes or proteins and drugs or diseases that have not previously been recognized. This generated a marker discovery database that is searchable from multiple perspectives, such that a disease-oriented researcher could search for the disease they are interested in and find all of the genes, proteins and drugs associated with that disease, organized by function. Alternatively, researchers could ask for all genes related to a medication. A clinical trialist might ask what new diseases might be treated by an existing medication. Researchers could access this information with regard to either known synergies (where the gene and disease have been mentioned in the same article), which verify the utility of the algorithm, or discontinuities (where the gene and disease have a functional relationship but the two entities have never previously been mentioned in the same journal article). The latter may indicate the possibility of a novel relationship that can then be taken back to the bench for further definition, identification and research. Proteins and drugs, for example, that have a functional relationship but have never been recognized to affect one another may be candidates for further basic science analysis, thereby leading to more rapid marker and treatment discovery. We see this as a potential and very promising method for improving research productivity.
This longitudinal research program aims to support the development of novel clinical and translational methods that can encompass a wide range of techniques including new methods of phenotyping where one could use SNOMED CT to phenotype the patients described in this paper. New biomarkers for research are a potential output of this project. In addition, this project may benefit clinical informatics for longitudinal studies that aim to rapidly look at specific associations that may lead to either additional retrospective analysis or prospective clinical trials. The special deliverable of this project is an interface where one can search the already recognized (concordant) relationships, and also disjoint or previously undocumented relationships between genes or proteins and drugs or diseases. As organizing concepts are included, searching for classes (e.g. concepts like beta blockers for drugs or cardiovascular diseases) will be possible without having to articulate the individual sub-classes of information.
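Class-based searching of this kind can be sketched as expanding a class to its subclasses before filtering relationship records. This is an illustration only: the tiny is-a fragment, the record fields (`drug`, `gene`, `status`) and the gene identifiers are hypothetical and do not reflect the actual interface or data.

```python
def descendants(concept, children):
    """Return the concept plus all of its subclasses (transitive closure)."""
    seen = set()
    stack = [concept]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guard against revisiting shared subclasses
        seen.add(current)
        stack.extend(children.get(current, ()))
    return seen


def search_by_class(concept, children, relationships):
    """Find relationship records whose drug falls under the given class."""
    members = descendants(concept, children)
    return [record for record in relationships if record["drug"] in members]


# Hypothetical is-a fragment and records, for illustration only.
children = {"beta blocker": ["metoprolol", "propranolol"]}
relationships = [
    {"drug": "metoprolol", "gene": "GENE_A", "status": "concordant"},
    {"drug": "propranolol", "gene": "GENE_B", "status": "disjoint"},
    {"drug": "aspirin", "gene": "GENE_C", "status": "disjoint"},
]
hits = search_by_class("beta blocker", children, relationships)
```

A query for the class returns records for both of its member drugs without the user having to name either one, and the `status` field distinguishes already recognized relationships from previously undocumented ones.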
Although there are many other Ontologies available [16] and other literature which could be included in such an analysis, this study provides a proof of concept that useful synergies can be identified from the knowledge available across articles in the literature. A computable method for identifying these synergies has the potential to speed scientific discovery.
Biomedical Informatics has the potential to help us uncover novel genetic linkages to human disease. This can lead to significant improvements in patient care through faster knowledge discovery and translational research, bringing that knowledge more rapidly to the bedside and thereby empowering clinical implementation of personalized/individualized medicine.