
ISSN: 2165-7866

Journal of Information Technology & Software Engineering

  • Review Article   
  • J Inform Tech Softw Eng 2017, Vol 7(4): 207
  • DOI: 10.4172/2165-7866.1000207

Artificial Intelligence in Biological Data

Indrajeet Chakraborty1, Amarendranath Choudhury2* and Tuhin Subhra Banerjee3
1Department of Bioinformatics, Karunya University, Coimbatore, Tamil Nadu, India
2Department of Life Science and Bioinformatics, Assam University, Silchar, Assam, India
3Department of Life Sciences, Satpalsa High School, Satpalsa, Birbhum, West Bengal, India
*Corresponding Author: Amarendranath Choudhury, Department of Life Science and Bioinformatics, Assam University, Silchar, Assam 788011, India, Tel: +91 7003017920, Email: [email protected]

Received Date: Aug 03, 2017 / Accepted Date: Aug 29, 2017 / Published Date: Sep 06, 2017


Abstract

Artificial Intelligence (AI), or machine learning, currently serves as a primary choice for data mining and big data analysis. With effective learning and adaptation models, it provides solutions to several engineering applications. These include techniques such as artificial neural network modelling, reasoning-based decision algorithms, simulation models, DNA computing and quantum computing, among several others. With the application of AI in biomedical research, the fuzziness and randomness involved in handling such data have been significantly reduced. Rapid technological advancements have helped AI techniques evolve in a manner that allows such fuzzy data to be handled effectively and much more conveniently. This review presents a comprehensive view of machine learning and AI computing models, advanced data analytics and optimisation approaches used in bioengineering, such as drug design and analysis, medical imaging, and biologically inspired learning and adaptation for analytics.

Keywords: Artificial intelligence; Machine learning; Bioengineering; Big data


Introduction

With computers becoming a necessity of the modern era, it has become imperative for machines to adapt to recent trends in the consumer industry. There has been steady growth in demand for machines that are intelligent, can react autonomously to situations, and can clearly explain the reasoning behind their actions. In layman's terms, therefore, Artificial Intelligence (AI) can be explained as an action performed by a machine that would otherwise have been performed by a human using his or her intelligence [1,2].

With recent advancements in Artificial Intelligence (AI) technology, many people have already grown accustomed to talking to and interacting with their gadgets at home. AI technology dominates much of fictional literature and cinema, presenting a popular but scary picture of the future. It has already begun changing our lives; however, most of the change is yet to come. Almost all of the major IT firms are spending millions on developing and implementing AI, considering it critical to their future. Providing a personalised relationship with machines is a recent trend in product-based industries and is expected to flourish further [3,4].

Artificial Intelligence/Machine Learning

Machine learning is the form of AI that enables a machine to learn without being explicitly programmed for each instance [5,6]. The fundamental aim in this context is to make decisions. At the root level, several neurons (the fundamental units of a learning system) group together to form a network, called a neural network, which is responsible for the learning process [7-9]. The algorithm provides the rules to be followed during learning. The target is to find network parameters that optimise a cost function [10,11]. Training is performed by providing the algorithm with a complete set of training examples, each presented and processed once per pass [12-14]. The trained neural network then represents a complex relationship and is able to classify input data. The complexity depends on the number of operations being carried out simultaneously [9]. These learning methods are usually classified into three sub-categories, namely supervised, unsupervised and reinforcement learning.

Supervised learning

Supervised learning is a closed-loop feedback system in which network parameters are adjusted by comparing the actual and desired outputs of the system. A labelled set of training data is mapped onto its outputs using a general rule or mapping function learned from the inputs [15,16]. The differences between the actual and desired values are taken as the error measure and are used to control the learning process. The learning process is repeated until the error measure becomes sufficiently small or the process fails to converge [17,18]. A gradient descent algorithm is commonly used to minimise the error measure [19-21].
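The feedback loop described above can be illustrated with a minimal sketch. It assumes a one-variable linear model and a mean squared error measure, neither of which is specified in the text; the function name `train_linear`, the learning rate and the toy data are hypothetical choices made only for illustration.

```python
def train_linear(xs, ys, lr=0.05, tol=1e-6, max_epochs=10000):
    """Fit y ~ w*x + b by gradient descent on the mean squared error.

    Illustrative only: the model, learning rate and stopping rule are
    assumptions, not details taken from the reviewed literature.
    """
    w, b = 0.0, 0.0
    n = len(xs)
    mse = float("inf")
    for _ in range(max_epochs):
        # Error measure: difference between actual and desired outputs.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        mse = sum(e * e for e in errors) / n
        if mse < tol:  # stop once the error is sufficiently small
            break
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2.0 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2.0 * sum(errors) / n
        # Adjust the parameters against the gradient.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b, mse


# Labelled toy data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b, mse = train_linear(xs, ys)
```

On this toy data the fitted parameters approach the slope and intercept used to generate the labels; if the learning rate is too large the loop simply runs out of epochs, which corresponds to the failure case mentioned above.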

Unsupervised learning

Unsupervised learning is implemented without any defined output for a given input set. The pattern or rule used for classification is learned by the algorithm itself during training. The task is to generate hypotheses for the input data and then obtain the output according to them. In this process all possible hypotheses are evaluated; however, the output is obtained using the optimal one. The way the final hypothesis is determined governs the sub-classification of unsupervised learning techniques [12].
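As one possible illustration of learning a grouping rule without labels, the sketch below uses k-means clustering. This technique is not named in the text and is chosen here only as a familiar example; the function name `kmeans`, the deterministic starting centroids and the toy points are assumptions.

```python
def kmeans(points, k, iterations=100):
    """Group unlabelled points into k clusters (illustrative k-means sketch)."""
    n = len(points)
    # Assumption: start from evenly spaced data points so the run is repeatable.
    centroids = [points[i * n // k] for i in range(k)]
    assignment = None
    for _ in range(iterations):
        # Assign every point to its nearest centroid (squared Euclidean distance).
        new_assignment = [
            min(
                range(k),
                key=lambda c: sum((p - q) ** 2 for p, q in zip(pt, centroids[c])),
            )
            for pt in points
        ]
        if new_assignment == assignment:  # no change: the grouping is stable
            break
        assignment = new_assignment
        # Move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [pt for pt, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centroids, assignment


# Hypothetical two-dimensional measurements forming two visible groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
centroids, assignment = kmeans(points, k=2)
```

No output labels are supplied at any point; the grouping emerges from the data alone, and each candidate grouping is replaced by a better one until the assignment stops changing.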

Reinforcement learning

Reinforcement learning is described here as a technique that identifies general patterns or classification rules in a training dataset and then applies that experience to another dataset. The classification rule is thus based entirely upon the training provided. The method uses two different datasets that are passed to the learning algorithm [18,22,23]. One of the datasets is analysed and all possible hypotheses are tested on it; the optimal hypothesis is then applied to the other dataset [1,24].

Big Data

Big data is what transforms case-based studies into large-scale, data-driven research. The characteristics of big data are defined by three major features, namely volume, variety and velocity [25,26].

Over the years, the volume of biological data has grown exponentially. This is evident from the fact that ProteomicsDB covers 92% (18,097 of 19,629) of the known human genes annotated in the Swiss-Prot database. Data from millions of patients have already been collected and stored electronically in databases worldwide. Analysis of these accumulated data would not only enhance health-care services but also bring about major breakthroughs in research [27-29]. Medical imaging also produces vast amounts of data with even more complex features and broader dimensions. The Visible Human Project, for example, has archived 39 GB of female datasets [30,31].

Next comes the variety of data types and structures. Biological data draw on several levels of data sources, providing a rich array of data for researchers. From genomics, proteomics and metabolomics to protein interactions, all of these are unstructured and pose challenges for novel investigations [30-33].

Lastly, velocity refers to the speed at which data are produced and processed. Next-generation sequencing (NGS) technologies [34,35] enable billions of DNA sequences to be read each day at relatively low cost. Faster gene sequencing calls for equally fast technologies to process the output. In medicine, big data technology is providing faster tools for discovering new patterns in large datasets.

Biological Big Data

With the advent of enhanced computing and storage capabilities, the level of analysis of biological data has shifted from the sequence level to the molecular level. This switch has been driven by a massive rise in demand for personalised medicine. We are now witnessing an ever-increasing demand for producing, storing and analysing huge datasets within a given time frame. Next-generation sequencing (NGS) has brought about a tide of genomic data, and the present challenge is to store, compute on and efficiently manipulate it. Hadoop and MapReduce are currently the most extensively used tools for this purpose [25,27,29,36-38].
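The MapReduce style of processing mentioned above can be sketched in a few lines. The example below does not use Hadoop or any of its APIs; it only imitates the map, shuffle and reduce phases in a single Python process, applied to counting k-mers in sequencing reads. The function names and the two toy reads are hypothetical.

```python
from collections import defaultdict


def map_phase(read, k):
    """Map step: emit a (k-mer, 1) pair for every k-mer in one read."""
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]


def shuffle_phase(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce step: combine the values of each key into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def count_kmers(reads, k):
    # In a real cluster the map calls would run in parallel on different nodes;
    # here they run one after another purely for illustration.
    pairs = [pair for read in reads for pair in map_phase(read, k)]
    return reduce_phase(shuffle_phase(pairs))


counts = count_kmers(["ACGTAC", "GTACGT"], k=3)
```

Because each read is mapped independently and each key is reduced independently, the same logic can be distributed over many machines, which is what makes the pattern attractive for large genomic datasets.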

The lack of a universal definition of Artificial Intelligence has arguably helped its rapid development. Basically, AI can be regarded as any activity that gives a machine the ability to make decisions and function appropriately with respect to its external surroundings. AI focuses not on expanding scale or speed; rather, the emphasis is on making machines autonomous and comprehensive. To date, nothing has matched human intelligence across reasoning, performing set tasks, perceiving language, handling sensory signals, responding to stimuli, producing artistic and literary works, and even gaming. Recently in the news was AlphaGo, a narrow AI that defeated the 18-time world champion Lee Sedol at the game of Go, a landmark instance of a top-ranked human player being defeated by an algorithm. This marks the beginning of a new age in computing, in which we can expect increasingly capable computers to be built [39,40]. Computers that are tailor-made for neural-network-based calculations provide better coordination between hardware and software for advanced capabilities and performance [26,41].

AI applications

The genesis of deep learning approaches dates back to the twentieth century. The field has grown steadily over the years; however, widespread applications did not begin before 2012. With industry investing in the technology, and with the advent of high-performance computing, enhanced storage and parallel computing facilities, its applications in our daily lives have made it increasingly important. From business to automobiles and from art to linguistics, nothing remains unaffected by its presence. Medical informatics, the application of information technology techniques to medicine, involves the examination of patient records and reports; through the analysis of huge amounts of such data, complex interactions and correlations within them are revealed [42-46]. Practical application is observed in areas such as oncology, liver pathology, thyroid disease diagnosis, rheumatology, dermatology, cardiology, neuropsychology, gynaecology and perinatology [47-50]. The analysis of medical data nevertheless faces serious setbacks, as statistical methods capable of dealing with noisy and missing data are still lacking [51-53]. For this reason, the results drawn from an AI experiment on medical data still carry uncertainty and errors [45,54-56]. A growing trend on the web is the ‘Internet of Things’ (IoT), in which several devices are interconnected and keep sharing useful sensory data and commands among themselves, helping devices understand and respond to the external environment. The technique is now creating new openings in a wide range of sectors, such as healthcare, retail, banking, manufacturing, smart homes and personalised user applications [57-59].

AI in genomics

Over the last ten years, machine learning has made tremendous progress in computer science and is still among its fastest-growing areas. By 2014, the scientific community had published several research works in which machine learning was applied to interpret genome data. Nevertheless, wide-scale practical application is yet to happen. Understanding genes would help transform medicine across the globe [60-62]. The advantage of computers is that they can be given large amounts of training: taught what is right and what is wrong, a machine keeps learning from its mistakes and eventually starts recognising patterns in data. With large amounts of memory and processing power, computers can learn effectively and continuously from huge amounts of unstructured data. The growing influence of AI on medicine and its worldwide implementation are helping to make machine decision-making accurate, personalised and faster. However, the current healthcare system is not yet capable of implementing the rapid advancements being made. The implementation of Electronic Health Records has advanced the clinical setup somewhat and is regarded as the first step towards revolutionising modern healthcare [1,2,5].

The issue here is the hardware limitation encountered while handling huge amounts of data, especially when the training set is large. Such computational tasks require large memory and processing capabilities. To overcome this, better GPUs with more memory are being developed by companies such as Intel and Nvidia. However, this remains a work in progress and currently leaves parallel computing as the only alternative [63,64].

AI in proteomics

Proteins came into the picture once it became possible to obtain them in purified form using mass spectrometry and blotting approaches. Since then, high-throughput methods for protein-based studies, known as ‘proteomics’, have been expanding. With more data available, machine learning has found increasing application in prediction, feature selection, pattern recognition and numerous automation tasks. The major application takes the form of semi-supervised learning, in which the algorithm learns from large datasets of which only a few examples are labelled. The technique is widely applicable because researchers can handle big data by labelling a limited set of examples while still making use of huge sets of unlabelled data. In understanding protein sequences, the first step is to generate a profile for the unknown protein based upon its sequence, using homology modelling techniques [28,32,33,37]. Multiple local alignments of the query sequence are made against existing databases containing non-redundant records of protein structure and evolutionary information. This helps to build a comprehensive representation of the query protein sequence or sequences. The next step is the analysis of this information by the prediction engine, which classifies the sequences into families, superfamilies, folds, clusters and so on, depending on the classifier used [37,65-68].

AI in proteome informatics

Protein structure prediction and the dynamic analysis of predicted structures were among the very first areas to apply machine learning. These methods later came to be broadly known as Artificial Intelligence and have now moved into an even broader interdisciplinary technology called deep learning. Determining protein structures is essential for understanding biological processes and cell functioning [69-72].

Protein structure and fold prediction has had a profound impact on the understanding of protein function. Many new protein sequences have been deposited over the past few years in numerous databases globally. Determining the structure and folding of these proteins experimentally would be not only cumbersome but also costly in time and money. Because experimental verification takes a long time and carries the risk of inherent human error, it is imperative to develop computational techniques for predicting protein structure and fold from the sequence data available in various public-domain databases. The need of the hour is for fast, accurate and automated tools that can rapidly analyse these sequences and predict their function [73-75].

Understanding how proteins attain their three-dimensional structure is still among the baffling mysteries of biology. Such an understanding would prove highly beneficial to medicine, especially the pharmaceutical sector.

In summary, machine-learning-based artificial intelligence is being applied successfully to protein fold prediction and structure prediction. In the coming days, its applications are expected to spread even more prominently to areas such as disease-based genome modification prediction, protein-protein binding site prediction and protein-protein network prediction, as well as other allied research topics [76-78].

AI in phylogeny

Over the last few years, computational biology, or bioinformatics as we know it, has grown by leaps and bounds, mainly because of a massive explosion in the amount of available data and the development of tools for its automated analysis. These methods have now become the “workhorse” of bioinformatics. With simple techniques such as a decision tree builder, we are able to select among data and subsets using limits. These algorithms are an efficient means of reducing computational time, showing a clear advantage over others. However, the logic behind a particular outcome may not always be clearly justified. Among the most popular methods are k-nearest neighbour, classifiers based on Bayes' theorem, neural networks and decision trees [79-82].

Phylogeny is explained through trees in which the root is the origin of evolution, the leaves are species, organisms or genomic sequences, the branches are the relationships among the leaves, and the branch length represents the evolutionary time taken for a particular divergence. A cladogram, however, is a form of representation that does not take time into account. Construction of a tree involves the following steps [82-85].

Alignment: Sequence alignment is the first step in evolutionary tree construction. The sequences can be either nucleotide sequences or amino acid sequences, and are aligned using multiple alignment algorithms.

Alignment check: The alignments of these sequences are then checked and examined for evolutionary similarities.

Distance computation: Once the relationships are established, the next step is to find the evolutionary distance between two nodes.

Validation: Once the tree is built, the final step is to validate the result statistically, for example with resampling techniques such as bootstrapping.

Popular distance computation methods include distance matrix methods, which work by constructing a matrix of pairwise distances. Two similar approaches are the Unweighted Pair Group Method with Arithmetic Mean (UPGMA) and the Neighbour Joining (NJ) method. Both work in a similar fashion, pairing the two leaves (nodes) separated by the shortest distance and then recalculating the distance matrix using the average distance between the newly constructed node and the remaining leaves. The calculation is repeated iteratively until only two clusters of nodes are left [86-88].
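The pairing and averaging procedure described for UPGMA can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `upgma`, the input format (a dictionary of pairwise distances) and the four-taxon toy matrix are hypothetical, branch lengths are not computed, and the loop here continues until a single root cluster remains so that a complete nested tree is returned.

```python
from itertools import combinations


def upgma(names, dist):
    """Build a tree topology by UPGMA (illustrative sketch, no branch lengths).

    names: list of leaf labels.
    dist:  dict mapping a pair of labels, e.g. ("A", "B"), to their distance.
    Returns the tree as nested tuples.
    """
    sizes = {name: 1 for name in names}  # cluster -> number of leaves in it
    d = {frozenset(pair): value for pair, value in dist.items()}
    while len(sizes) > 1:
        # Pair the two clusters separated by the shortest distance.
        a, b = min(combinations(sizes, 2), key=lambda pair: d[frozenset(pair)])
        merged = (a, b)
        # Recalculate distances to the new node as a size-weighted average,
        # which equals the arithmetic mean over all leaf pairs.
        for other in sizes:
            if other in (a, b):
                continue
            d[frozenset((merged, other))] = (
                sizes[a] * d[frozenset((a, other))]
                + sizes[b] * d[frozenset((b, other))]
            ) / (sizes[a] + sizes[b])
        sizes[merged] = sizes.pop(a) + sizes.pop(b)
    return next(iter(sizes))


# Hypothetical pairwise distances between four sequences.
distances = {
    ("A", "B"): 2.0, ("A", "C"): 6.0, ("A", "D"): 10.0,
    ("B", "C"): 6.0, ("B", "D"): 10.0, ("C", "D"): 10.0,
}
tree = upgma(["A", "B", "C", "D"], distances)
```

For this toy matrix the closest pair A and B is joined first, C is attached next and D last. Neighbour Joining follows the same iterative pattern but selects pairs with a corrected criterion, which is not shown here.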

AI in next-generation sequencing

Biological databases are huge collections of biological information that is collected, curated and stored in a defined schema. They include experimental results, high-throughput experimental results, published literature and computational analyses. These databases hold data from a wide variety of fields, such as proteomics, metabolomics, genomics and microarray data analysis, with next-generation sequencing (NGS) data being the latest trend. NGS data now supports whole-genome analysis and has undoubtedly helped to address many issues in genomics and proteomics. These databases can be broadly classified into structure databases and sequence databases. Nucleic acid and protein sequences are stored in sequence databases, also known as primary databases, and protein structures are stored in structure databases, also known as secondary databases. Popular examples are GenBank [80], SwissProt [89] and PIR [90]. GenBank is a fast-growing repository of known genetic sequences. In addition to sequence data, GenBank files contain information such as accession numbers, gene names, phylogenetic classification and references. SwissProt is a protein sequence database that provides a high level of integration with other databases and has a very low level of redundancy. Many software tools have emerged over the years to retrieve, analyse and visualise data from such databases. These tools cover a wide range of handy operations such as homology modelling, similarity analysis and functional analysis. One such fascinating tool currently trending is “MethBank”, which involves whole-genome sequences and provides configurable and interactive data analysis. It runs on a Red Hat Enterprise Linux server with a Java front end and a MySQL-based query environment. The interface is web friendly and helps retrieve a variety of diverse information [91-94].

AI in genomic expression profiling

Owing to technological advancements in genomics, we are now able to analyse the expression of thousands of genes at a time using a microchip. Thanks to this technology, expression profiles for thousands of genes are now available, which has helped greatly in the identification and treatment of a variety of diseases. A ‘DNA microarray’, as it is popularly called, is a high-density gene array with thousands of spots that allows such large numbers of genes to be examined in one go. Here, diseased and healthy control samples can be compared to reveal the abnormalities of the diseased cell state [95-99]. With technology advancing rapidly, more and more researchers are being attracted to microarray technology, which has become an integral part of molecular biology and medical studies. Gene expression analysis can readily reveal findings for a patient by checking disease-related genes. The problem, however, arises in classifying the available data. Classification algorithms based on statistical methods such as the Support Vector Machine (SVM), decision trees and Bayesian networks have been the most popular. With the advent of AI techniques, these have been employed extensively for classification [100-106].
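To make the classification problem concrete, the sketch below labels an expression profile by comparing it with the average profile of each known class. It uses a nearest-centroid rule, a deliberately simpler stand-in for the SVM, decision tree and Bayesian methods named above; the function names, class labels and expression values are all hypothetical.

```python
def centroid(profiles):
    """Average expression value of each gene across a list of samples."""
    return [sum(values) / len(profiles) for values in zip(*profiles)]


def classify(profile, centroids):
    """Return the label whose centroid is closest to the given profile."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(centroids, key=lambda label: sq_dist(profile, centroids[label]))


# Hypothetical expression levels of three genes in labelled samples.
training = {
    "control": [[1.0, 5.0, 2.0], [1.2, 4.8, 2.1]],
    "disease": [[4.0, 1.0, 2.0], [4.2, 1.1, 1.9]],
}
centroids = {label: centroid(rows) for label, rows in training.items()}

# A new, unlabelled sample is assigned to the class it most resembles.
prediction = classify([3.9, 1.3, 2.2], centroids)
```

Real microarray data involve thousands of genes and far fewer samples, so feature selection and a more robust classifier would normally be needed; the sketch only shows the shape of the task.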

Integration of Biological Databases with AI

For decades, we have tried to build computational models for teaching machines. One major setback, however, is the amount of variation in data collected from different sources. Optimised results can be achieved only when data from a variety of sources are integrated and an automated learning algorithm is devised to analyse them and infer predictions based on previous learning experience. Machines are then able to serve requests from users as well as from other servers more efficiently, using minimal computational power [107-114].

Biological databases that once comprised sequences and structures of compounds have now advanced to storing more complex and bulkier data. Most microarray profiling studies are based upon a limited subset of the complete expression dataset. Their full potential can only be reached upon integration and unification of all available data. Unification and standardisation of public data provide deeper and more accurate insights in analysis and make it easier for machines to find patterns in data accurately. ‘ONCOMINE’ is one such tool for rapid interpretation of a gene’s potential role in a particular disease. Expression sets from multiple sources can be retrieved and analysed while being integrated with other resources such as gene ontology annotations and target gene data. When a disease of interest is searched for, a list of all differential expression analyses is made available [115-136].


Conclusion

As history records, humans have continually adapted to new techniques and developed better technology. As reviewed here, AI has been developing ever since its introduction in 1943, when McCulloch and Pitts gave the world the concept of artificial neurons. Since then it has grown rapidly, at times showing unexpected growth as new cutting-edge technology came into play. Compared with traditional methods, machine-learning-based methods are generally regarded as more accurate, robust and reliable. However, since AI has shortcomings as well, we constantly need to look for improvements in its design and application. In the days to come, AI will find extensive applications in many as yet unexplored areas. Its success will eventually be measured by the amount of change it brings to people’s lives. Given the ease with which people are nowadays adapting to AI technologies, the future definitely looks good.


Citation: Chakraborty I, Choudhury A (2017) Artificial Intelligence in Biological Data. J Inform Tech Softw Eng 7:207. Doi: 10.4172/2165-7866.1000207

Copyright: © 2017 Chakraborty I, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
