Received date: June 05, 2015; Accepted date: July 07, 2015; Published date: July 14, 2015
Citation: Lisitsa A, Stewart E, Kolker E (2015) Is it Time for Cognitive Bioinformatics? J Data Mining Genomics Proteomics 6:173. doi: 10.4172/2153-0602.1000173
Copyright: © 2015 Lisitsa A, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
The concept of cognitive bioinformatics has been proposed for structuring knowledge in the field of molecular biology. While cognitive science is regarded as “thinking about the process of thinking”, cognitive bioinformatics strives to capture the process of thought and analysis as applied to the challenging intersection of diverse fields such as biology, informatics, and computer science, collectively known as bioinformatics. Ten years ago, cognitive bioinformatics was introduced as a model of the analysis performed by scientists working with molecular biology and biomedical web resources. At present, the concept can be examined in the context of the opportunities presented by the “data deluge” of life sciences technologies. The unbalanced nature of the accumulating information, along with other challenges, poses currently intractable problems for researchers. Solutions to these problems at the micro- and macro-levels are considered with regard to the role of cognitive approaches in bioinformatics.
Keywords: Digital medicine; Exposome; Omics; Biomarkers
Modern molecular biology studies the function and interactions of biological molecules, processes and systems. It focuses primarily on DNA, RNA, proteins and metabolites, using four corresponding “omics” technologies: genomics, transcriptomics, proteomics, and metabolomics. Advances in sequencing technologies enable researchers not only to decipher the human genome, but also to uncover features of epigenetic regulation. Transcriptomics addresses the level of gene expression; proteomics provides information about the diversity of the protein molecules encoded by the genome; metabolomics focuses on metabolites.
The technical capabilities of modern “omics” significantly exceed the capacity of researchers to meaningfully process the incoming experimental information. A typical microarray study can generate data on 500,000 single-nucleotide substitutions that differentiate the genome of one person from another. Transcriptome analysis measures the activity level of each of the 20,000 human protein-coding genes, with the signal of each gene detected by two or three different sequences. An in-depth study of the proteome can quantify up to 10,000 protein products of the genome and elucidate structural modifications of at least one third of them.
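To make the scale concrete, the figures quoted above can be added up into a rough count of measured features per sample. This is only a back-of-the-envelope sketch: the variable names are illustrative, and simply summing the layers (rather than counting their combinations) is a simplifying assumption, not a method from the cited studies.

```python
# Per-sample feature counts, taken from the figures quoted in the text.
snp_variants = 500_000          # single-nucleotide substitutions (microarray)
genes = 20_000                  # human protein-coding genes
probes_per_gene = (2, 3)        # "two or three different sequences" per gene
proteins = 10_000               # quantifiable protein products
modified_fraction = 1 / 3       # "at least one third", so a lower bound

def features_per_sample(probes):
    """Sum the measured features across the three 'omics' layers."""
    transcript_signals = genes * probes
    modified_proteins = round(proteins * modified_fraction)
    return snp_variants + transcript_signals + proteins + modified_proteins

low, high = (features_per_sample(p) for p in probes_per_gene)
print(f"Roughly {low:,} to {high:,} features per sample")
```

Even this conservative tally gives more than half a million values for a single person, before any interactions between features are considered.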
Current international projects are designed not only to read the DNA sequence, but also to decode the meaningful message of this molecule. The question of how to re-interpret the genome in the context of harmful mutations and diseases is of great interest to biomedical researchers. Sequels to the Human Genome Project include HapMap, ENCODE, 1000 Genomes, Cancer Genome, Human Proteome and a number of other large-scale initiatives. Projects such as the NIH BRAIN Initiative and the Human Microbiome Project are generating data that will build a framework for understanding the human condition. These projects combine the technical and intellectual capacities of numerous countries into international efforts.
There is no lack of baseline data on the state of molecular systems within the human body. On the contrary, the exponentially growing stream of information produced by “omics” is a challenge to the technology of data storage and processing. The unbalanced nature of the accumulating information redoubles the challenges of the 5Vs of Big Data: veracity/reproducibility, variety, value and, to a lesser extent, velocity and volume.
The challenge of interpreting the genome is currently being tackled in many labs with brute-force approaches: exhaustive searches for genetic variants, laborious matching of scattered IDs, and hand-curation of datasets. However, the data assembled from all sources are left to languish in repositories “till better times”, when bioinformatics becomes capable of extracting knowledge from the accumulated data sets.
Is there a foundation in modern bioinformatics for meeting the challenge of the current data deluge? In 2004, Kuchar et al. introduced cognitive bioinformatics as an instrument of scenario analysis performed by scientists working with molecular biology and biomedical web resources. The term “Big Data” is mainly used to describe a massive volume of both structured and unstructured data that is too large or complex to process using traditional database and software techniques. Bioinformatics relies on methods borrowed primarily from statistics, machine learning and pattern recognition (except for the specific tasks of molecular modeling and comparative genomics). Thus, the algorithms of digital data processing cannot be transferred to biological and biomedical outcomes without significant expert adaptation. It is not surprising, for example, that the modern practice of clinical diagnostics has hardly been updated with new biomarkers despite the petabytes of data generated. Potential solutions may lie in the analysis of network interactions from the micro-level (biological molecules) to the macro-level (the exposome).
Socially relevant diseases, which affect almost all people, are multigenic. Thus, genome-wide association studies (GWAS) are conducted to establish the link between multiple variations and the risk of disease development. In controlled environments it is sometimes relatively easy to determine variation/disease relationships. However, establishing a relationship between a set of variations and a disease is rather difficult because the effects of genome variability interfere with those of environmental factors.
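As an illustration of the simplest building block of an association study, the sketch below runs a Pearson chi-square test on a 2x2 table of allele counts in cases and controls. The function name and the counts are hypothetical, and real GWAS pipelines additionally correct for population structure and covariates, which this minimal example ignores.

```python
import math

def allelic_chi_square(case_alt, case_ref, control_alt, control_ref):
    """Pearson chi-square test (1 degree of freedom) on a 2x2 allele-count table."""
    table = [[case_alt, case_ref], [control_alt, control_ref]]
    row_totals = [sum(row) for row in table]
    col_totals = [case_alt + control_alt, case_ref + control_ref]
    total = sum(row_totals)
    if 0 in row_totals or 0 in col_totals:
        raise ValueError("every row and column of the table must be non-empty")
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Hypothetical counts: the variant allele is seen 60 times in cases, 40 in controls.
chi2, p_value = allelic_chi_square(60, 40, 40, 60)

# Bonferroni threshold if 500,000 variants are tested at an overall level of 0.05.
bonferroni_threshold = 0.05 / 500_000

print(f"chi2={chi2:.2f}, p={p_value:.4f}, significant={p_value < bonferroni_threshold}")
```

With these toy counts the single-variant p-value is below 0.01, yet it falls far short of the threshold required once hundreds of thousands of variants are tested together, which is one reason multigenic associations are hard to establish.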
It is possible to reduce the level of genomic variability by moving to the level of proteins. Proteins, as the final products of the genome, have a more direct influence on the condition of the body and can serve as reliable diagnostic markers. This issue has been thoroughly studied in the field of proteomics for the last 15 years. However, clinically useful biomarkers that can impact human health are lacking despite the hundreds of potentially relevant diagnostic molecules reported.
A single biomarker can be fallible because proteins interact with each other. They also interact with genes, forming a genomic regulatory network. Networks of molecular interactions are studied in the framework of interactomics. This term labels the biology to be investigated but does not describe an appropriate approach. Graph theory (Petri nets), Bayesian models, support vector machines, and random forests are some of the approaches that have been used to describe the networks of molecular interactions. At present, all that is known is that an organizational framework of hubs structured around interacting partners can be detected only in networks with more than 200,000 binary interactions. However, experimental measurements on such a scale are not affordable.
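The notion of a hub can be illustrated with a minimal graph sketch that counts the interaction partners of each node in a list of binary interactions. The protein names, the toy data and the degree cut-off are all assumptions made for illustration; as noted above, a meaningful hub framework emerges only in networks far larger than this one.

```python
from collections import defaultdict

# Toy binary interactions (pairs of interacting molecules); placeholder names.
interactions = [
    ("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P1", "P5"),
    ("P2", "P3"), ("P4", "P6"), ("P5", "P6"),
]

def degree_table(edges):
    """Count the distinct interaction partners of every node."""
    partners = defaultdict(set)
    for a, b in edges:
        if a != b:  # ignore self-interactions
            partners[a].add(b)
            partners[b].add(a)
    return {node: len(p) for node, p in partners.items()}

def find_hubs(edges, min_degree):
    """Return (node, degree) pairs with at least min_degree partners, highest first."""
    degrees = degree_table(edges)
    hubs = [(node, d) for node, d in degrees.items() if d >= min_degree]
    return sorted(hubs, key=lambda item: (-item[1], item[0]))

# The cut-off of 3 partners is arbitrary and chosen only for this toy network.
hubs = find_hubs(interactions, min_degree=3)
print(hubs)
```

Degree counting is the simplest possible hub criterion; it says nothing about the dynamics or regulatory direction of the interactions.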
Disease development is determined by multiple conditions and in no way depends solely on the personal genome. The contribution of genetic predetermination to a disease is relatively small. Pathologies in a genetically viable organism are formed by the components of the environment collectively called the exposome. It is the combination of environmental factors that an organism has encountered, together with susceptibilities inherent in the organism, that leads to a disorder.
The data deluge suggests that solutions should encourage and harness the future convergent synthesis of exterior and interior life factors. The prerequisites for such a generalization could emerge from the study of the connectome, the model of the brain as an organ available for study at the social, cellular and molecular levels. It is impossible to prove the unity of these structures, that is, of such systems and of a cell as a “society” of molecules. At the present level of knowledge, mathematical tools do not adequately reflect the fundamental properties of social, neural and biological networks, especially their time-dependent multilayering. Attempts to construct planar, tree-like architectures of biological systems have been exhausted. Thus, there is a need for fundamentally different, probably non-digital, information processing tools to succeed in multi-dimensional descriptions.
Although a clinician can obtain an enormous amount of data on the internal molecular systems of the body, he/she usually lacks detailed information about the exposome. Generally, the exposome is captured as a personal health record based on the often-faulty memories of patients. Recent developments in telecommunications have created an opportunity to collect exposome data. First of all, such data can be received periodically from our communication devices, for example as geolocations. With respect to the health characteristics of an individual, the sequence of geolocations is no less informative than the DNA sequence in the genome. The information kept in electronic financial systems (for example, information on purchases), along with Google searches and Facebook activity, can be used to accurately build the behavioral profile of an individual. Some wrist trackers can measure physical activity, food/drug intake and other health parameters and generate such profiles as well. To avoid reliance on faulty memories, these profiles can then be shared with a physician for a more accurate accounting of the patient’s exposome. In addition, these profiles can be shared with others or connected to a community of like-minded people (www.myfitnesspal.com, www.mynetdiary.com). Such detailed profiles can also be valuable for evaluating drug side effects.
Geo-data are becoming a source for predicting the behavioral profiles that lead to the development of socially relevant diseases: diabetes, pathologies of the cardiovascular system and depression. Further development of these concepts depends on creating a mathematical approach and the appropriate tools to describe the multilayered cognitive slices. Today, there are no published computational principles that can surmount the barrier between the statics of a traditional neural network and the dynamics of a real biological system. A strong attempt was made by IBM, which provided its Watson artificial intelligence for medical applications [23,24].
The methods for quantifying the exposome in terms of environmental factors are generally indirect. For example, if a taxi order is accepted with a delay of less than 5 minutes, it could indicate an aggressive industrial environment. The shorter the delay, the greater the chance that a person’s current location lies within a generally unsuitable environment with heavy traffic, noise and air pollution. Hypothetically, it is even possible to measure the mental environment by analyzing the number of calls per hour or per day, voice timbre in a conversation and the diversity of vocabulary, as has been recently implemented in the service “Okay, Google”.
Of course, the usefulness of exposome data cannot be established until the data are collected and made available for analysis. The good news is that there is no need to invest scientific money in the collection of exposome data. The process of collection is fueled by business and runs in a fully automatic mode. The challenge is to make the collected data available to science and to develop the artificial intelligence to tackle these data.
Multi-layer cognitive biological systems should be based on a detailed description of the surrounding world. Artificial intelligence will perceive the physiological and emotional status of a person only if a critical mass of people consciously and without any pressure share information about their location, genome and diseases, and about many other things that today we prefer to hide. Modern society is not fully prepared to accept such relationships because of ethical requirements regarding the protection of personal data. In medicine, in addition to this issue, there is a considerable need for a physician who can turn biomedical information into life-changing decisions.
A physician never makes a diagnosis solely on the basis of in-vitro test results. Even when a patient has an off-the-scale LDL cholesterol level, the physician will never diagnose without additional investigation, including an examination and questioning of the patient. A correct diagnosis depends, and will always depend, upon the professionalism of the physician, perhaps the only parameter that cannot be formalized within the standards of medical care.
The development of “omics” as technologies providing redundant information on the subject is essentially antagonistic to the concept of medicine. Indeed, if in the future (not that far off, by the way!) bionic implants transmit gigabytes of data on the state of a single molecule, a physician is unlikely to be able to read, understand and use these data without pre-buffering and analytical processing. But such analysis will be completely useless until it is possible to compare the data between people, which will provide a sufficient pool of training images and an assessment of a “human norm”.
It should be assumed that future systems with a cognitive component will be in demand as an interface between a physician and the patient data deluge. At present, the abundance of data on a patient (for example, her/his genome) is more likely to overwhelm a physician than to help in decision making. Currently, bioinformatics is aimed at decreasing the complexity of “omics” data through data reduction. Yet during this reduction, a valuable, individual, personalized picture disappears. Good contact between a physician and a patient within the array of “omics” data will be maintained only if both the physician and the patient are virtualized. After virtualization, the physician’s personal experience of decision-making can be copied repeatedly, resulting in the artificial “proliferation” of physicians. In the long term, the introduction of cognitive technologies in medicine should provide every citizen with a readily available ultra-professional virtual physician with an infinite memory of “clinical records” characterized by hundreds of thousands of molecular parameters. Physicians as persons will become supervisors of the medical community network, and the most complex cases that require human intervention will rise to their level.
Living organisms are complex systems. Thus, immense datasets are being generated by modern “omics” technologies to adequately represent and understand them. Yet these datasets overwhelm the capabilities of any scientist, lab or research center to transform the data into actionable knowledge. Ironically and perhaps fortuitously, some of these datasets also contribute to our understanding of the brain and its cognitive processes and thus can be used to enhance our capabilities. Perhaps cognitive bioinformatics can formalize the logic of human thinking and apply it to organize the existing arrays of experimental data in the form of networks with dynamic behavior. The principles of networking between the neurons of the brain are likely to scale both up, to the macro level of social networks, and down, to the micro level of intermolecular interactions in a cell. By applying one of the wonders of nature, the thought process, we can build this framework and analyze the approaches necessary to understand its secrets.