Received date: June 10, 2014; Accepted date: August 30, 2014; Published date: September 03, 2014
Citation: Mehmood MA, Sehar U, Ahmad N (2014) Use of Bioinformatics Tools in Different Spheres of Life Sciences. J Data Mining Genomics Proteomics 5:158. doi:10.4172/2153-0602.1000158
Copyright: © 2014 Mehmood MA, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Scientific knowledge has never been produced and shared as quickly as it is today. Different areas of science are converging to give rise to new disciplines. Bioinformatics is one such emerging field; it applies computer science, mathematics, and statistics to molecular biology in order to archive, retrieve, and analyse biological data. Although still in its infancy, it has become one of the fastest growing fields and has quickly established itself as an integral component of biological research. Its popularity stems from its ability to analyse huge amounts of biological data quickly and cost-effectively. Bioinformatics helps biologists extract valuable information from biological data by providing various web- and/or computer-based tools, the majority of which are freely available. The present review summarizes some of the tools available to a life scientist for analysing biological data. It focuses on those areas of biological research that benefit most from such tools: analysing DNA and protein sequences to identify various features, predicting the 3D structure of protein molecules, studying molecular interactions, and performing simulations that mimic a biological phenomenon.
Bioinformatics; Life sciences; Sequence analysis; Phylogeny; Structure prediction; Molecular interaction; Molecular dynamic simulations
ADMET: Absorption Distribution Metabolism Excretion and Toxicity; ANN: Artificial Neural Network; BLAST: Basic Local Alignment Search Tool; CADD: Computer Aided Drug Design; cDNA: Complementary DNA; CDS: Coding Sequence; ESTs: Expressed Sequence Tags; GWSA: Genome Wide Sequence Analysis; HMM: Hidden Markov Model; HTS: High Throughput Screening; MSA: Multiple Sequence Alignment; NCBI: National Center for Biotechnology Information; NJ: Neighbour Joining; NMR: Nuclear Magnetic Resonance; ORF: Open Reading Frame; PDB: Protein Data Bank; SNP: Single Nucleotide Polymorphism; UPGMA: Unweighted Pair Group Method with Arithmetic Mean; XRD: X-ray Diffraction
Bioinformatics is an interdisciplinary science that emerged from the combination of disciplines such as biology, mathematics, computer science, and statistics to develop methods for the storage, retrieval, and analysis of biological data. Paulien Hogeweg, a Dutch systems biologist, was the first person to use the term “Bioinformatics”, in 1970, referring to the use of information technology for studying biological systems [2,3]. The launch of user-friendly, interactive automated modelling, along with the creation of the SWISS-MODEL server around 18 years ago, resulted in massive growth of this discipline. Since then, it has become an essential part of the biological sciences, processing biological data at a much faster rate with databases and informatics working at the back end.
Computational tools are routinely used for characterizing genes, determining the structural and physicochemical properties of proteins, performing phylogenetic analyses, and running simulations to study how biomolecules interact in a living cell. Although these tools cannot generate information as reliable as experimentation, which is expensive, time consuming, and tedious, in silico analyses can still support an informed decision before a costly experiment is conducted. For example, a druggable molecule must have certain ADMET (absorption, distribution, metabolism, excretion, and toxicity) properties to pass through clinical trials. A compound lacking the required ADMET properties is likely to be rejected. To avoid such failures, different bioinformatics tools have been developed to predict ADMET properties, allowing researchers to screen a large number of compounds and select the most druggable molecules before launching clinical trials. A number of reviews on various specialized aspects of bioinformatics have been written earlier [6-8]. However, none of these articles is well suited to a scientist who does not work in computational biology. Here, we take the opportunity to introduce various bioinformatics tools to the non-specialist reader to help extract useful information for his/her project. We have therefore selected only those areas where these tools are most helpful for obtaining information from biological data. These areas include analyses of DNA/protein sequences, phylogenetic studies, prediction of the 3D structures of protein molecules, molecular interactions and simulations, as well as drug design. Each section starts with a simple overview of the area, followed by key reports from the literature and, where necessary, a tabulated summary of related tools towards the end of the section.
Sequence analysis refers to the study of the different features of a biomolecule, such as a nucleic acid or protein, that give it its unique function(s). First, the sequences of the corresponding molecule(s) are retrieved from public databases. After refinement, if needed, they are subjected to various tools that predict features related to their function, structure, or evolutionary history, or identify homologues, with high accuracy. The choice of tool depends on the nature of the analysis to be carried out. For example, data retrieval tools such as NCBI’s Entrez allow one to search and retrieve data from a wide range of data domains. Similarly, pattern discovery tools such as Expression Profiler and Gene Quiz allow researchers to search for different patterns in the given data. Another set of tools is dedicated to sequence comparison. Tools such as BLAST (Basic Local Alignment Search Tool) and ClustalW enable one to compare gene or protein sequences to study their evolutionary history or origin. Data visualization tools such as Jalview, GeneView, TreeView, and Genes-Graphs allow researchers to view data in graphical form. These tools use advanced mathematical modelling and statistical inference, such as dynamic programming, Hidden Markov Models (HMM), regression analysis, Artificial Neural Networks (ANN), clustering, and sequence mining, to analyse the given sequence.
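To make the idea of dynamic programming in sequence comparison concrete, the following is a minimal sketch of a global (Needleman-Wunsch style) alignment score. It is not the algorithm used by BLAST or ClustalW, which rely on faster heuristics and richer scoring schemes; the function name and the match, mismatch, and gap values are assumptions chosen only for illustration.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -2) -> int:
    """Score of the best global alignment of a and b (illustrative scoring)."""
    # prev[j] = best score for aligning the processed prefix of a with b[:j]
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        # Aligning a[:i] against an empty prefix of b costs i gaps.
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            # Three choices: align the two residues, or gap in either sequence.
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(diagonal, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev[-1]
```

Each cell of the table reuses the best scores of shorter prefixes, which is what makes the comparison tractable for long sequences.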
These analyses are popular because of their wide application in the biological sciences, their simplicity, and their capacity to generate a wealth of knowledge about the gene or protein in question. They are particularly useful for identifying promoters, terminators, or untranslated regions involved in the regulation of expression; recognizing transit peptides, introns, exons, or open reading frames (ORFs); and identifying variable regions that can serve as signatures for diagnostic purposes. Sequence analyses are therefore among the most frequently performed analyses in bioinformatics. For example, Stoilov et al. used sequence analysis coupled with homology modelling to investigate the genetic basis of primary congenital glaucoma (PCG). The authors were able to pinpoint mutations that impair the proper folding and haem-binding ability of CYP1B1 peptides. Similarly, a genome-wide sequence analysis (GWSA) of Mycobacterium tuberculosis H37Rv revealed that the majority of the bacterium’s proteins were the result of repetitive gene duplication or exon-shuffling events. In a recent study, the gene cbp50 from Bacillus thuringiensis serovar konkukian was predicted to encode a protein that features multiple chitin-binding domains [20,21]. Similarly, Rho-independent transcription terminators from a collection of 343 prokaryotic genomes were predicted quite accurately (<6% false-positive predictions) using various computational tools.
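As a rough illustration of what recognizing an open reading frame involves, the sketch below scans the three forward reading frames of a DNA string for an ATG start codon followed in frame by a stop codon. It is a simplified assumption-laden example, not the method of any tool named in this review: it ignores the reverse strand, alternative start codons, and introns, and the function name and `min_codons` parameter are hypothetical.

```python
from typing import List, Tuple

# Stop codons of the standard genetic code.
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq: str, min_codons: int = 3) -> List[Tuple[int, int]]:
    """Return (start, end) positions, 0-based and end-exclusive, of forward ORFs."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Walk the frame codon by codon; incomplete trailing codons are skipped.
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                # Length counts the start and stop codons.
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)
```

A start codon without a downstream in-frame stop is not reported, which is one of several conventions an ORF finder could adopt.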
Most predictions rely on complementary DNA (cDNA) and Expressed Sequence Tags (ESTs). However, cDNA/EST information is often scarce and incomplete, which makes the task of finding new genes very difficult. Computational scientists have therefore developed another technique, referred to as ab initio gene identification. The potential of this technique was demonstrated in a study that predicted 88% of already verified exons and 90% of the coding nucleotides from Drosophila melanogaster with a very low rate of false-positive identification. Given the accuracy (~90%) delivered by this approach, it could be a reliable tool for annotating lengthy genomic sequences and predicting new genes. Recently, Lencz et al. identified an intergenic single nucleotide polymorphism (SNP; rs11098403) at chromosome 4q26 linked with schizophrenia and bipolar disorder by performing a genome-wide association study (GWAS) coupled with cDNA and RNA-Seq analyses on a set of 23,191 individuals (5,415 schizophrenic, 4,785 bipolar, and 12,991 controls). The SNP rs11098403 was found to be linked with the expression of the neighbouring enzyme NDST3, which is involved in the metabolism of heparan sulphate (HS) in brain tissue. Similarly, Peng and co-workers (2013) predicted the function of 31,987 genes from the draft genome of the forest species Phyllostachys heterocycla using gene prediction modelling approaches based on FgeneSH++. Please refer to Table 1 for a list of tools used in primary sequence analyses.
|BLAST||It is a search tool, used for DNA or protein sequence search based on identity.|||
|HMMER||Homologous protein sequences may be searched from the respective databases using this tool.|||
|Clustal Omega||Multiple sequence alignments may be performed using this program.|||
|Sequerome||Used for sequence profiling.|||
|ProtParam||Used to predict the physico-chemical properties of proteins.|||
|JIGSAW||To find genes, and to predict the splicing sites in the selected DNA sequences.|||
|novoSNP||Used to find the single nucleotide variation in the DNA sequence.|||
|ORF Finder||The putative genes may be subjected to this tool to find Open Reading Frame (ORF).||http://www.ncbi.nlm.nih.gov/projects/gorf/|
|PPP||Prokaryotic promoter prediction tool used to predict the promoter sequences present upstream of the gene.||http://bioinformatics.biol.rug.nl/websoftware/ppp/ppp_start.php|
|Virtual Footprint||Whole prokaryotic genome (with one regular pattern) may be analysed using this program along with promoter regions with several regulator patterns.|||
|WebGeSTer||This is a database containing sequences of transcription terminator sequences and is used to predict the termination sites of the genes during transcription.|||
|Genscan||Used to predict the exon-intron sites in genomic sequences.|||
|Softberry Tools||Several tools are specialized in annotation of animal, plant, and bacterial genomes along with the structure and function prediction of RNA and proteins.||www.softberry.com|
Table 1: Selected tools for primary sequence analyses.
Phylogenetic analyses are procedures used to reconstruct the evolutionary relationships among a group of related molecules or organisms, to predict certain features of a molecule with unknown function, to track gene flow, and to determine genetic relatedness. All of this can be represented on a genealogical tree, or tree of life. The underlying principle of phylogeny is to group living organisms according to their degree of similarity: the greater the similarity, the closer the organisms appear on the tree. Phylogenetic comparative analysis is widely used to control for the lack of statistical independence among species.
The methods used to construct a phylogenetic tree fall into three major groups: distance methods, parsimony methods, and likelihood methods. None of the methods is perfect; each has its own strengths and weaknesses. For example, distance-based trees are easy to set up but less accurate. The maximum parsimony and maximum likelihood methods are (in theory) the most accurate, but they take more time to run. Distance-matrix methods such as Neighbour Joining (NJ) or the Unweighted Pair Group Method with Arithmetic Mean (UPGMA) are the simplest. Neighbour joining is widely regarded as offering a very good trade-off between speed and accuracy among the available methods.
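The distance-matrix idea can be sketched with a small UPGMA implementation. This is a minimal illustration under stated assumptions, not the code of any package listed in Table 2: the function name, the use of `frozenset` keys for pairwise distances, and the nested-tuple output are all hypothetical choices, and branch lengths are omitted.

```python
from itertools import combinations

def upgma(names, dist):
    """Cluster taxa by UPGMA.

    dist maps frozenset({a, b}) to the distance between taxa a and b.
    Returns the tree topology as nested tuples.
    """
    sizes = {name: 1 for name in names}   # cluster -> number of leaves
    d = dict(dist)                        # working copy of the distances
    while len(sizes) > 1:
        # Join the two closest clusters.
        a, b = min(combinations(sizes, 2), key=lambda pair: d[frozenset(pair)])
        merged = (a, b)
        for other in sizes:
            if other == a or other == b:
                continue
            # Size-weighted mean equals the arithmetic mean over all leaf pairs.
            d[frozenset((merged, other))] = (
                sizes[a] * d[frozenset((a, other))]
                + sizes[b] * d[frozenset((b, other))]
            ) / (sizes[a] + sizes[b])
        sizes[merged] = sizes.pop(a) + sizes.pop(b)
    return next(iter(sizes))
```

Because UPGMA always joins the pair with the smallest average distance, it implicitly assumes a constant rate of evolution; neighbour joining relaxes that assumption at a similar computational cost.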
Discussing the details of performing MSA, building trees, and testing best fits is beyond the scope of this article; the reader is therefore referred to the detailed protocol published by the Molecular Genetics Laboratory, Central University of Punjab. Table 2 lists some widely used tools in phylogenetic analyses.
|MEGA (Molecular Evolutionary Genetics Analysis)||Builds phylogenetic trees to study the evolutionary closeness.|||
|MOLPHY||It is molecular phylogenetic analysis tool based on maximum likelihood method.|||
|PAML||A phylogenetic analysis tool based on maximum likelihood.|||
|PHYLIP||A package for phylogenetic studies.|||
|JStree||An open-source library for viewing and editing phylogenetic trees for presentation improvement.|||
|TreeView||Software to view the phylogenetic trees, with the provision of changing view.|||
|Jalview||It is an alignment editor and is used to refine the alignment.|||
Table 2: Some popular tools used for phylogenetic analyses.
Phylogenetic tools are commonly used to test evolutionary hypotheses and have become indispensable for functional genomics, particularly when the function of a gene is not known. For example, prior to expressing an algal membrane protein, plastid terminal oxidase 1 (PTOX1), in tobacco chloroplasts, the authors conducted a phylogenetic analysis to reconstruct the evolutionary history and determine the essential features of that polypeptide. The phylogenetic analysis revealed that the Chlamydomonas reinhardtii PTOX1 (Cr-PTOX1) has the typical signatures of higher plant PTOX, such as iron-binding sites, a conserved exon, and various blocks of amino acids required to act as a plastoquinol terminal oxidase. Similarly, Chen et al. used phylogenetic analysis to study the evolutionary history of respiratory mechanisms in the deep-sea bacterium Shewanella piezotolerans WP3. The phylogenetic analyses, coupled with reverse genetic studies, revealed that of the two nitrate reductases, NAP-α and NAP-β, that are the hallmark of the genus Shewanella, NAP-β evolved long before NAP-α.
A biological sequence database is a vast collection of information about biological molecules such as nucleic acids, proteins, and other polymers, in which each molecule is identified by a unique key. The stored information is not only important for future use but also serves as a tool for primary sequence analyses. With the advancement of high-throughput sequencing techniques, sequencing has reached the whole-genome scale and is generating a massive amount of data every day. The need to submit and store this information so that it is freely available to the scientific community has led to the development of various databases worldwide. Each database has become an autonomous representation of a molecular unit of life. This section deals with such databases, since understanding them helps one retrieve the information relevant to one’s project.
Databases contain a variety of information and are therefore classified as primary, secondary, or composite databases, depending on the information stored in them. The data in a primary database, such as sequence or structure data, are obtained through experimentation, for example by yeast two-hybrid assay, affinity chromatography, XRD, or NMR. SWISS-PROT, UniProt, PIR, GenBank, EMBL, DDBJ, and the Protein Data Bank (PDB) are examples of primary databases. A secondary database contains information derived from the analysis of data stored in primary databases, such as conserved sequences, active sites of a protein family, or conserved secondary motifs of protein molecules [39,40]. Examples of secondary databases include SCOP, CATH, PROSITE, and eMOTIF. Consequently, primary databases are archival in nature, while secondary databases are termed curated databases. A composite database contains information derived from different primary sources. Examples include NRDB (non-redundant database), which contains data obtained from GenBank (CDS translations), PDB, SWISS-PROT, PIR, and PRF. Similarly, the INSD (International Nucleotide Sequence Database) is a composite database that collects nucleic acid sequences from EMBL, GenBank, and DDBJ. UniProt (the universal protein sequence database) is another example; it collates sequences derived from various other databases, namely PIR-PSD, Swiss-Prot, and TrEMBL. Similarly, wwPDB (worldwide PDB) is a composite of the 3D structures in the RCSB (Research Collaboratory for Structural Bioinformatics) PDB, MSD, and PDBj.
GenBank, built by the NCBI, is a vast collection of genome sequences from over 250,000 species. Data from GenBank can be accessed through the NCBI’s integrated retrieval system, Entrez, while the literature is accessible via PubMed. Each sequence record carries information about the literature, bibliography, organism, and a set of other features, including coding regions, promoters, untranslated regions, terminators, exons, introns, repeat regions, and translations. The sequence information stored in GenBank is obtained through submissions from individual laboratories as well as from large-scale genome sequencing projects. Similarly, Xenbase is an updated resource of genomic and biological data on frogs, including Xenopus laevis and Xenopus tropicalis. Xenopus spp. are considered model organisms that provide new knowledge in developmental biology, which may be exploited in modelling and simulation studies of human diseases.
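Programmatic retrieval through Entrez can be sketched as follows. The example only builds a request URL for the NCBI E-utilities `efetch` endpoint and does not contact the server; the helper name `efetch_url` and the placeholder accession are hypothetical, and the current E-utilities documentation should be consulted for usage limits and any additional required parameters.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities efetch service.
EFETCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def efetch_url(accession: str, db: str = "nucleotide",
               rettype: str = "gb", retmode: str = "text") -> str:
    """Build an efetch URL for one record (hypothetical helper, no network access)."""
    query = urlencode({
        "db": db,            # Entrez database to query
        "id": accession,     # accession or UID of the record
        "rettype": rettype,  # "gb" for a GenBank flat file, "fasta" for FASTA
        "retmode": retmode,  # plain-text response
    })
    return f"{EFETCH_BASE}?{query}"

# "YOUR_ACCESSION" is a placeholder, not a real record identifier.
print(efetch_url("YOUR_ACCESSION", rettype="fasta"))
```

The returned GenBank flat file carries the annotated features described above, such as coding regions and translations, alongside the sequence itself.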
The Saccharomyces Genome Database (SGD) contains comprehensive information on the yeast Saccharomyces cerevisiae and also provides bioinformatics tools to explore and analyse the data available in SGD. The SGD may be used to study functional relationships among gene sequences and gene products in other fungi and eukaryotes (http://www.yeastgenome.org/). Similarly, another genome database, “WormBase”, is developed and maintained by an international consortium of computer scientists and molecular biologists to provide accurate, current, and accessible information on the molecular biology of C. elegans and related nematodes (http://www.wormbase.org). The webpage for this database also hosts several tools for precise analysis of the stored information. Another up-to-date database is “FlyBase”, which is dedicated to providing information on the genes and genomes of Drosophila melanogaster, along with tools to search gene sequences, alleles, genetic aberrations, different phenotypes, and images of Drosophila species. Similarly, wFleaBase (http://wfleabase.org/) provides information on genes and genomes for species of the genus Daphnia (water flea); Daphnia is considered a model system for studying the complex interplay between genome structure, gene expression, individual fitness, and population-level responses to chemical contaminants and environmental change. Although wFleaBase contains data from all species of the genus, the primary species are D. pulex and D. magna. Please refer to Table 3 for further information on genome databases.
|DNA Data Bank of Japan||It is a member of the International Nucleotide Sequence Databases (INSD) and is one of the biggest resources for nucleotide sequences.|||
|European Nucleotide Archive||It captures and presents information relating to experimental workflows that are based around nucleotide sequencing.|||
|GenBank||It is a member of the International Nucleotide Sequence Databases (INSD) and is a nucleotide sequence resource.|||
|Rfam||A collection of RNA families, represented by multiple sequence alignments|||
|UniProt||One of the largest collections of protein sequences.|||
|Protein Data Bank||This is another major resource of proteins containing information of experimentally-determined structures of nucleic acids, proteins, and other complex assemblies.|||
|Prosite||Provides information on protein families, conserved domains and active sites of the proteins.|||
|Pfam||Collection of protein families|||
|SWISS PROT||A section of the UniProt Knowledgebase containing the manually annotated protein sequences.|||
|InterPro||Describes the protein families, conserved domains and active sites.|||
|Proteomics Identifications Database||A public source, containing supporting evidence for functional characterization and post-translation modification of proteins and peptides.|||
|Ensembl||It is a database containing annotated genomes of eukaryotes including human, mouse and other vertebrates.|||
|PIR||An integrated public resource to support genomic and proteomic research|||
|Medherb||Resource database for medicinally important herbs|||
|Reactome||A peer-reviewed resource of human biological processes|||
|TextPresso||This database provides full text literature searches of model organism research, helps database curators to identify and extract biological entities which include new allele and gene names and human disease gene orthologs||http://www.textpresso.org/|
|TAIR||The Arabidopsis Information Resource (TAIR) maintains a database of genetic and molecular data for the model plant Arabidopsis thaliana. It provides information on gene structure, gene product, gene expression, DNA and seed stocks, genome maps, genetic and physical markers.||http://www.arabidopsis.org/|
|dictyBase||dictyBase is an online bioinformatics database for Dictyostelium discoideum.|||
|Signalling & Metabolic Pathway Databases|
|KEGG||KEGG is a suite of databases and associated software for understanding and simulating higher-order functional behaviours of the cell or the organism from its genome information.|||
|CMAP||Complement Map Database is a novel and easily accessible research tool to assist the complement community and scientists from related disciplines in exploring the complement network and discovering new connections.|||
|SGMP||The Signaling Gateway Molecule Pages (SGMP) database provides highly structured data on proteins which exist in different functional states participating in signal transduction pathways.|||
|PID||The Pathway Interaction Database (PID) is a collection of curated and peer-reviewed pathways composed of human molecular signaling and regulatory events and key cellular processes. It serves as a resource for studying cellular pathways, with a special emphasis on cancer.|||
|HMDB||The Human Metabolome Database (HMDB) is the most comprehensive curated collection of human metabolite and human metabolism data in the world. It contains records for more than 2180 endogenous metabolites, with information gathered from thousands of books, journal articles and electronic databases, along with an extensive collection of experimental metabolite concentration data compiled from hundreds of mass spectrometry (MS) and nuclear magnetic resonance (NMR) analyses performed on urine, blood and cerebrospinal fluid samples. The HMDB is designed to address the broad needs of biochemists, clinical chemists, physicians, medical geneticists, nutritionists and members of the metabolomics community.|||
Table 3: List of some popular databases.
The most significant protein sequence databases include SWISS-PROT (Swiss Protein Databank), TrEMBL (translation of DNA sequences in EMBL), UniProt (Universal Protein Resource), PIR (Protein Information Resource), and wwPDB (worldwide Protein Data Bank). SWISS-PROT is one of the most comprehensive protein sequence databases. It provides information on its entries generated by both experimental and computational studies. It also provides links to several other data sources such as GenBank, EMBL, DDBJ, and PDB, and to various secondary protein databases covering domains, post-translational modifications, and species-specific data collections. The protein information in SWISS-PROT concentrates mainly on model organisms and humans. TrEMBL, by contrast, provides information on proteins from all organisms.
Similarly, PIR is another comprehensive collection of protein sequences. It offers the user several attractive features, for example searching for a protein molecule via an ‘interactive text search’ and performing various web-based analyses such as sequence alignment, peptide matching, and peptide mass calculations.
UniProt is one of the most comprehensive collections of protein sequence resources and is freely accessible. The UniProt database emerged from the combination of the SWISS-PROT, PIR, and TrEMBL collections. It provides all sorts of protein information, ranging from sequence to function.
The worldwide Protein Data Bank (wwPDB) is designed exclusively to archive every 3D structure of protein molecules and make it freely available to the scientific community. The databank now contains over 83,000 experimentally determined structures. The PDB also continually develops tools to give users better access to the data.
The Rfam database contains comprehensive information about RNA molecules and their various features, such as secondary structures and elements that modulate gene expression. Rfam is hosted by the Wellcome Trust Sanger Institute and is analogous to the Pfam database for annotating protein families. Among the many curated databases available is IntAct, which contains data on protein interactions. All data manually curated by MINT (Molecular INTeraction database) curators have been moved to the IntAct database at EMBL-EBI and merged with the existing IntAct data collections.
MINT is another database that stores information about protein-protein interactions derived from the published literature. Curated databases for information on complex metabolic pathways have also been built. For example, Reactome is a curated database that represents a diverse range of human processes, from metabolism to signal transduction. Reactome is an open-source platform that is freely available for use and redistribution.
The Transporter Classification Database (TCDB) is a collection of membrane transporters. It uses an internationally approved Transport Classification (TC) system for the classification of proteins, which is similar to that of the Enzyme Commission (EC). It differs from the EC system in some respects, however; for example, it provides functional and phylogenetic information as well. Information on more than 600 families of transporters is available in this database. A TC number is assigned to a sequenced homologue of unknown function only if it belongs to a rare or under-represented family. Subunits are denoted by ‘S’ followed by a number, such as S1, S2, S3, and so on, whereas proteins that act as accessory transporters and those whose characterization is not yet complete are denoted by the numbers 8 and 9, respectively. Similarly, the Carbohydrate-Active enzyme Database (CAZy) contains comprehensive information about carbohydrate-modifying enzymes and other information relevant to them. The enzymes are classified into distinct families on the basis of amino acid sequence similarity or the presence of various catalytic domains. Databases on the structure, classification, and ontology of lipid molecules have been discussed in detail elsewhere. Since covering all the databases is beyond the scope of this article, we have listed some popular databases in Table 3.
Protein molecules begin their life as shapeless strings of amino acids, which ultimately fold into a three-dimensional (3D) structure to become biologically active. Folding into the correct topology is a prerequisite for any protein to perform its biological functions. Information on the 3D structure of a protein is therefore necessary to gain insight into its function. Usually, 3D structures are determined by X-ray crystallography or related techniques such as NMR. However, these techniques are expensive, difficult, and time consuming, and are often hampered by poor heterologous expression and failed attempts to obtain good crystals. Consequently, very few structures (~250) determined by XRD and NMR spectroscopy are submitted, compared with nearly a million monthly submissions to NCBI. Information on tertiary structure at the genome scale is therefore lacking for many proteins. Alternatively, a protein’s 3D structure can be predicted using various bioinformatics tools, and structure prediction has consequently become one of the hot topics in bioinformatics.
Bioinformatics approaches can readily identify secondary structure elements in a protein sequence, such as helices, sheets, domains, strands, and coils. Proteins adopt a specific structure owing to weak electrostatic forces, such as hydrogen bonds, between these elements. Therefore, the propensity of certain residues to appear in a particular region of a protein, such as sheets or coils, can be used to predict its secondary structure. The most straightforward approach to predicting the 3D structure of a protein molecule is comparative modelling. In this approach, a related template (with at least 30% sequence identity to the target protein) is selected to predict the unknown structure. However, 3D structural information is scant, and tertiary structures cannot always be inferred from similarity searches. Targets without sufficient identity to a known structure can therefore be modelled using other approaches, known as threading or fold recognition. Where these tools fail to generate a reliable structure, a combination of various physical principles is applied.
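The 30% identity rule of thumb for choosing between comparative modelling and threading can be expressed as a short sketch. The function names are hypothetical, and the convention used here (identical residues divided by aligned columns in which neither sequence has a gap) is only one of several definitions of percent identity in use; the input is assumed to be a pairwise alignment already produced by another tool.

```python
def percent_identity(aligned_target: str, aligned_template: str) -> float:
    """Percent identity over columns where neither aligned sequence has a gap."""
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have equal length")
    # Keep only columns in which both sequences carry a residue.
    columns = [(x, y) for x, y in zip(aligned_target, aligned_template)
               if x != "-" and y != "-"]
    if not columns:
        return 0.0
    matches = sum(x == y for x, y in columns)
    return 100.0 * matches / len(columns)

def suggest_strategy(identity: float, threshold: float = 30.0) -> str:
    """Apply the identity threshold quoted in the text (default 30%)."""
    if identity >= threshold:
        return "comparative modelling"
    return "threading / fold recognition"
```

For instance, aligning `ACDE-G` with `ACDQFG` gives four identical residues over five gap-free columns, which is well above the threshold and would point to comparative modelling.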
Homology modelling is the most commonly used method for template-based prediction of the structure of a target protein. However, the relatively low number of structures available in the PDB hampers this approach. If a homologue of the unknown protein is available, a variety of programs can be used to model the target protein, such as COMPOSER, 3D-JIGSAW, and MODELLER, to name a few.
The majority of secondary structure prediction tools use the frequency with which amino acids are observed at a certain position, estimated from experimentally determined 3D structures. The earliest methods used the observed periodicity of residues to predict secondary structures. More advanced methods for predicting secondary structures, or identifying secondary structure elements in a given protein sequence, use neural networks; NNPredict is one example. NNPredict is a multilayer neural network-based method that labels each amino acid with the letter ‘H’ or ‘E’ for residues appearing in helices or strands, respectively. Similarly, PredictProtein is another automated neural network-based server. It uses multiple sequence alignments to predict various structural and functional annotations of a protein molecule. JPred is another neural network-based method that combines various methods to predict secondary structure. Because it combines different methods, its predictions are usually of higher accuracy. For a snapshot of various protein structure prediction tools, visit http://www.biologie.uni-hamburg.de/b-online/library/genomeweb/GenomeWeb/prot-2-struct.html.
Computer simulations for structure prediction include energy calculations based on physicochemical principles, such as thermodynamic equilibrium at minimum free energy and the global free-energy minimum of the protein surface. A number of tools are available to predict the secondary structure of a protein molecule. One of the most important is ExPASy (the Expert Protein Analysis System), powered by the Swiss Institute of Bioinformatics (SIB). ExPASy provides access to a number of web-based resources, such as SWISS-PROT, TrEMBL, SWISS-2DPAGE, PROSITE, ENZYME and SWISS-MODEL, for studying a protein's structure as well as its function. ExPASy also provides several additional tools for determining similarity, identifying patterns, and studying post-translational modifications. The iterative threading assembly refinement (I-TASSER) server is a web-based tool that generates automated protein structure and function predictions. The server builds 3D models of a target protein via multiple threading, using templates from the PDB.
These approaches have been successfully applied to predict the structure of the chitin-binding proteins CBP50 and CBP24 from B. thuringiensis serovar konkukian S4 using Modeller v9.0 and AutoDock Vina [21,71,72]. Another study explored the pathogenicity of Mycoplasma genitalium strain G37 in sexually transmitted diseases by modelling the hypothetical proteins of the selected strain using the (PS)2-v2 server. Similarly, a putative gene (deacetylase or xylanase), cda1, was subjected to functional annotation, which confirmed that the enzyme encoded by cda1 is a chitin deacetylase and may not have any xylanase activity. Table 4 lists some commonly used tools for predicting the secondary structure of protein molecules.
|Tool|Description|
|---|---|
|CATH|A semi-automatic tool for the categorized organization of proteins.|
|RaptorX|Predicts protein structure based on either single- or multi-template threading.|
|JPRED|Predicts the secondary structure of proteins.|
|PHD|A neural network-based method for predicting the secondary structure of proteins.|
|HMMSTR|A hidden Markov model for the prediction of sequence-structure correlations in proteins.|
|APSSP2|Predicts the secondary structure of proteins.|
|MODELLER|Predicts the 3D structure of proteins based on comparative modelling.|
|Phyre and Phyre2|Web-based servers for protein structure prediction.|
Table 4: Selected tools used to perform structure-function analyses of proteins.
Proteins seldom perform their functions in isolation; they interact with other molecules to execute a given process. Understanding how biomolecules interact with one another has numerous implications, for example for protein folding, drug design and purification techniques, and has therefore become one of the most actively pursued research areas, using either experimental or bioinformatics approaches. An understanding of molecular interactions is also essential to elucidate the biological functions of a molecule. For example, protein-protein interactions play a key role in cellular activities such as signalling, transport, homeostasis, cellular metabolism and various biochemical processes.
In this regard, bioinformatics is useful for predicting protein-protein interactions without resorting to costly and time-consuming physical approaches such as X-ray crystallography and nuclear magnetic resonance (NMR) spectroscopy. Crystal structure coordinates often give a misleadingly static view of interactions, because a complex cannot be represented by a single structure. It has therefore been recognized that the 3D structure of a molecule cannot give a complete picture of every individual interaction, and computational approaches capable of reliably predicting protein-protein interactions have become essential. Such studies generate useful information that enables scientists to identify a specific pathway to manipulate in order to achieve the required change(s) in the cell.
The parameters governing protein-protein interactions include interface size, amino acid composition at the interface, types of chemical groups, complementarity between surfaces, hydrophobicity, hydrogen bonds, and the conformational changes that accompany complex formation. These properties are studied using various protein datasets. The in silico approaches used to study molecular interactions fall into two groups: homology-based and non-homology-based. The homology-based methods, as the name implies, work by direct comparison of protein sequences. The non-homology-based approaches take functional interactions into account collectively. Although homology-based methods remain the preferred methodology, non-homology-based methods are powerful for assigning functions to genes whose homologues have not yet been characterized. For example, Lu et al. predicted 2,865 protein-protein interactions using a multimeric threading approach, and 1,138 of these were confirmed in the DIP (Database of Interacting Proteins). Recently, Hosur et al. developed a new three-step algorithm, Coev2Net, to predict protein-protein interactions. The algorithm predicts interactions with a high level of performance compared with prevalent methods. Similarly, Zhang et al. used the PrePPI algorithm to predict a large number of reliable interactions in both yeast (30,000) and human (>300,000). However, all methods have their own limitations. For instance, they are built on a limited number of examples and cannot be applied to all species or all proteins. In an attempt to develop a universal method, Valente and co-workers designed the Universal In silico Predictor of Protein-Protein Interactions (UNISPPI), which can be applied to a range of diverse species and was therefore termed ‘universal’. Other useful features of the model include its ability to differentiate instances of a complete proteome or even parasite-host associations.
Apart from predicting protein structures, molecular modelling can also assist in identifying the unique conformation that governs the activity of a biomolecule. Other applications include spotting ‘hot-spot’ residues at protein interfaces by docking a small molecule, called a ligand, onto a protein. A large number of software packages are available to perform docking calculations; only a few of the most widely used are discussed here.
The best-known docking program is DOCK, which can assign the ligand-binding site on the receptor quite reliably. It also evaluates the quality of the fit. Another program, GRID, uses a 3D grid to locate protein binding sites for ligands. AUTODOCK is another commonly used suite and is perhaps one of the most cited platforms for protein-ligand docking studies (http://autodock.scripps.edu/). It is run and maintained jointly by The Scripps Research Institute (TSRI) and Olson Laboratory. The High Ambiguity Driven protein-protein Docking (HADDOCK) (http://haddock.science.uu.nl/) is another docking approach for the modelling of bio-molecular complexes. For a comprehensive comparison of various docking programs, the reader is referred to the review article published by . Similarly, an algorithm, IsoRank, was developed to perform the global alignment of multiple protein-protein interaction (PPI) networks so as to maximize the overall match across all input networks. IsoRank was used to compute the first known global alignment of PPI networks, using five species: yeast, fly, worm, mouse and human. The alignment revealed functional orthologs across these species. Later, IsoRankN (IsoRank-Nibble) was developed; it is based on spectral methods and is error-tolerant and computationally efficient.
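The idea behind this kind of network alignment can be sketched for the two-network case: a pair of proteins from different networks scores highly when their neighbours also pair up well, and the scores are found by repeated propagation until they stop changing. The code below is a simplified, IsoRank-style illustration rather than the published implementation; the damping weight `alpha`, the uniform prior used in place of sequence similarity, the greedy matching step and the toy networks are all assumptions.

```python
def isorank_scores(g1, g2, prior=None, alpha=0.6, iterations=200, tol=1e-12):
    """Pairwise node-similarity scores between two undirected networks.

    g1, g2: dict mapping node -> set of neighbours (no isolated nodes).
    prior:  optional dict (node1, node2) -> non-negative weight, for example
            a sequence-similarity score; a uniform prior is used if omitted.
    """
    pairs = [(i, j) for i in g1 for j in g2]
    if prior is None:
        prior = {pair: 1.0 for pair in pairs}
    total = sum(prior.get(pair, 0.0) for pair in pairs)
    e = {pair: prior.get(pair, 0.0) / total for pair in pairs}
    r = dict(e)
    for _ in range(iterations):
        new = {}
        for i, j in pairs:
            # Support from neighbouring pairs, shared out by node degree.
            support = sum(
                r[(u, v)] / (len(g1[u]) * len(g2[v]))
                for u in g1[i]
                for v in g2[j]
            )
            new[(i, j)] = alpha * support + (1.0 - alpha) * e[(i, j)]
        change = sum(abs(new[p] - r[p]) for p in pairs)
        r = new
        if change < tol:
            break
    return r


def greedy_alignment(scores):
    """Take the best-scoring pairs one at a time, using each node only once."""
    used1, used2, mapping = set(), set(), {}
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    for (i, j), _ in ranked:
        if i not in used1 and j not in used2:
            mapping[i] = j
            used1.add(i)
            used2.add(j)
    return mapping


# Two toy three-protein networks, each a simple chain.
net1 = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
net2 = {"x": {"y"}, "y": {"x", "z"}, "z": {"y"}}
scores = isorank_scores(net1, net2)
mapping = greedy_alignment(scores)  # the two hub proteins, b and y, are paired
```

With topology alone, the two end proteins of each chain are indistinguishable, which is why a sequence-similarity prior is normally supplied to break such ties.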
Computer-based techniques can assist and accelerate the discovery of biological mechanisms and of the lead molecules needed for new drugs. For example, virtual screening of flavonoids from Amelanchier alnifolia against the non-structural NS3/4A protease/helicase of hepatitis C virus (HCV) revealed a high binding affinity of quercetin 3-galactoside and quercetin 3-glucoside for HCV NS3/4A. The study suggested that these two compounds might be good candidates for inhibition of HCV NS3/4A. Similarly, Sehar and co-workers used molecular modelling to predict a chitin-binding site in CBP24 and CBP50 of Bacillus thuringiensis, with the aim of studying chitin-degradation pathways for use in engineering fungal resistance mechanisms in plants.
The massive generation of data has led to the development of various databases to organize and facilitate the study of molecular interactions. For example, signal transduction pathway databases may include protein-protein, protein-DNA, protein-RNA, DNA-RNA and DNA-substrate interactions. The Biomolecular Interaction Network Database (BIND) is one of the largest available information resources providing access to pairwise molecular interactions and complexes. Similarly, MINT is another database that stores information on functional interactions between biological molecules. A list of selected tools for studying protein-protein interactions is given in Table 5.
|Tool|Description|
|---|---|
|SMART|A Simple Modular Architecture Retrieval Tool; provides several kinds of information about the query protein.|
|AutoDock|Predicts protein-ligand interactions and is considered a reliable tool.|
|HADDOCK|Models the interactions of bio-molecular complexes such as protein-protein and protein-DNA.|
|BIND|A database that provides access to molecular interactions and bio-complexes.|
|MOE|An integrated package of tools for drug discovery. It combines visualization, modelling, and drug discovery on one platform.|
|STRING|A database of both known and predicted protein interactions.|
|MIMO|A dynamic graph-matching tool for the efficient comparison of biological pathways.|
|IntAct|An open source database system that provides analysis tools for molecular interaction data. All interactions are derived from literature curation or direct user submissions and are freely available.|
|Graemlin|Performs scalable multiple network alignment. Its functional evolution model allows both the generalization of existing alignment scoring schemes and the location of conserved network topologies other than protein complexes and metabolic pathways.|
|PathBLAST|Searches the protein-protein interaction network of any selected organism and extracts all interaction pathways that align with the query.|
|CFinder|Finds and visualizes overlapping dense groups of nodes in networks, and gives a quantitative description of the evolution of social groups. It is efficient for clustering data represented by genetic or social networks and microarrays.|
|MCODE|Suited to both computationally and biologically oriented researchers. Its features include fast network clustering; fine-tuning of results with numerous node-scoring and cluster-finding parameters; interactive cluster boundary and content exploration; multiple result set management; and cluster sub-network creation and plain text export.|
Table 5: Selected tools to study the molecular interactions.
Drug discovery is the process by which new drug molecules are discovered or designed to cure different diseases. Before the advent of bioinformatics tools, scientists used chemistry, pharmacology and clinical sciences to discover new compounds. However, the traditional process is slow and expensive. Market pressure to find new drugs in a short period with minimum risk has fuelled interest in alternative ways of designing drugs, such as bioinformatics. Bioinformatics has greatly facilitated this complex process and is playing a vital role in advancing drug discovery and design, since it is faster to analyse molecules on a computer than by experimental approaches. In fact, a completely new and dedicated field known as Computer Aided Drug Design (CADD) has come into existence to discover novel drug molecules. The whole process of discovering and designing new drug molecules is complicated and challenging. It can be divided into four steps: identification of the drug target, validation of the target, lead identification, and lead optimization. In this section, we briefly discuss how bioinformatics is useful in discovering new drugs.
Drug molecules always act on a target to deliver therapeutic benefit to the patient. The target is a key biomolecule through which the drug molecule produces a desired effect on a metabolic or signalling pathway pertinent to the disease under study, without interfering with the normal functioning of the cell. The very first step in the drug design process is therefore to identify a target involved in the disease. This demands a full knowledge of metabolic processes in normal as well as diseased conditions. The sequencing of the human genome provided researchers with over 30,000 genes to include in their search for new drug targets. Since then, the number of potential drug targets has been increasing day by day. Understanding how a gene functions is indeed key to choosing it as a target. A number of databases have been developed to facilitate the search for new drug targets (Table 6).
After potential targets have been selected, their involvement in a particular disease is studied. This is target validation. The targets are compared for their ability to influence the disease, which is also necessary to determine the likelihood of success in the next phase. Bioinformatics approaches such as modelling enable scientists to tailor compounds to bind at a particular site (see the section on predicting protein structure and function for details on modelling). Next, scientists have to find a compound, the lead compound, capable of altering the action of the target. A number of bioinformatics tools allow virtual screening of a large number of compounds that could bind to, inhibit or activate a protein. Virtual High Throughput Screening (vHTS) enables promising molecules to be identified as early as possible, one of the most pressing needs in the entire drug discovery process. The identified compounds often lack the required properties and are therefore ‘refined’ to produce a more specific effect with fewer side effects. This process is ‘lead optimization’.
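The final step of a virtual screen, picking candidate leads from scored compounds, can be sketched as a filter and a sort. This is an illustration under stated assumptions: the docking scores are taken to be predicted binding energies in kcal/mol (more negative meaning stronger predicted binding), the cut-off of -7.0 is arbitrary, and the compound names and values are placeholders rather than real results.

```python
from typing import NamedTuple


class DockingResult(NamedTuple):
    compound: str
    score: float  # predicted binding energy in kcal/mol; lower is better


def shortlist(results, cutoff=-7.0, top_n=3):
    """Keep compounds scoring at or below the cut-off, strongest binders first."""
    hits = [r for r in results if r.score <= cutoff]
    return sorted(hits, key=lambda r: r.score)[:top_n]


# Placeholder scores, as might be collected from a docking run.
results = [
    DockingResult("cmpd_A", -9.1),
    DockingResult("cmpd_B", -5.2),
    DockingResult("cmpd_C", -7.8),
    DockingResult("cmpd_D", -7.0),
    DockingResult("cmpd_E", -8.4),
]
leads = shortlist(results)  # cmpd_A, cmpd_E, cmpd_C in that order
```

In practice a shortlist like this is only a starting point for lead optimization, since a good docking score says nothing about specificity or side effects.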
A number of computational techniques can give a compound higher specificity and fewer side effects. Because lead optimization is the most expensive step in the entire drug discovery process, scientists often have to develop chemical analogues of such compounds with the desired properties. Following the identification and refinement of the lead molecule, scientists conduct pre-clinical animal safety tests. If the lead molecules do not have the required binding properties, the drug discovery project is likely to fail.
One of the challenges in developing a new drug is predicting the drug-like properties of the lead compound. These properties include charge distribution, solubility, hydrophobicity, pKa, refractivity, molecular weight, and ClogP/LogD. An initial evaluation with the help of bioinformatics techniques can therefore significantly influence the ultimate success of the project.
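A simple first-pass check on some of these properties can be sketched with Lipinski's rule of five, a widely used heuristic that is brought in here for illustration and is not named in the text above. It flags compounds with a molecular weight above 500 Da, a logP above 5, more than 5 hydrogen-bond donors, or more than 10 hydrogen-bond acceptors, and usually tolerates one violation. The compounds and property values below are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Compound:
    name: str
    molecular_weight: float  # in Da
    logp: float              # octanol-water partition coefficient
    h_bond_donors: int
    h_bond_acceptors: int


def rule_of_five_violations(c: Compound) -> int:
    """Count how many of the four rule-of-five limits the compound exceeds."""
    checks = [
        c.molecular_weight > 500,
        c.logp > 5,
        c.h_bond_donors > 5,
        c.h_bond_acceptors > 10,
    ]
    return sum(checks)


def is_drug_like(c: Compound, max_violations: int = 1) -> bool:
    """Treat a compound as drug-like if it breaks at most max_violations limits."""
    return rule_of_five_violations(c) <= max_violations


# Hypothetical candidates with made-up property values.
candidates = [
    Compound("candidate_1", 350.4, 2.1, 2, 5),
    Compound("candidate_2", 612.7, 5.8, 3, 9),
    Compound("candidate_3", 520.0, 3.0, 1, 8),
]
passed = [c.name for c in candidates if is_drug_like(c)]
```

Here candidate_2 breaks two limits (molecular weight and logP) and is filtered out, while candidate_3 breaks only one and is kept.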
Many tools are available to predict drug-like properties and ADMET (absorption, distribution, metabolism, excretion, and toxicity). For example, the OSIRIS Property Explorer is a web-based tool that predicts a number of ADMET properties, including cLogP, solubility, toxicity, and an overall drug score. The values are displayed in different colours: green for good, red for bad and yellow for irrelevant. Although it is a very user-friendly platform, its quality still needs independent assessment. Similarly, ChemSilico is a neural network-based prediction method that calculates ADMET properties. This tool has been trained and validated using over 35,000 tested compounds. Pre-ADME is another freely available web-based utility that predicts the ADMET properties of a druggable compound. Similarly, PASS Online (Prediction of Activity Spectra for Substances) is a Bayesian-based tool that calculates over 4000 different biological activities, including pharmacological effects, mutagenicity, mode of action, toxicity, interaction with metabolic enzymes and transporters, influence on gene expression, and embryotoxicity. Its predictions have been validated at an accuracy of about 85%. It is freely available, but the user has to register first. DREADD (designer receptors exclusively activated by designer drugs) is a recent addition to the toolkit of the computational biologist for the identification of druggable targets for both known and orphan GPCRs (G-protein coupled receptors). It also allows the activities of novel drugs to be studied. For more on DREADDs, please refer to the comprehensive review published by .
Due to space limitations, it is not possible to include every ADMET prediction tool here. For more details on how drug metabolism is studied using various bioinformatics tools, kindly refer to . Table 7 gives a list of various ADMET prediction tools.
It is interesting to note that structure-based predictions are considered more accurate than empirically derived predictions, owing to the detailed understanding of the structure of the substrates and active sites. The growing volume of structural data is now making it possible to design tools that generate mechanistic predictions for ADMET. Even when the structure of a protein is not available, bioinformatics tools such as modelling allow the structure of an unknown protein to be predicted. However, having a 3D structure of a protein is not enough to carry out ADMET work. To comprehend the mechanistic details of an enzyme, understanding the enzyme-substrate interaction is also essential. This is carried out using another set of computational tools, collectively termed docking. These tools allow active sites to be studied at the molecular level by docking a substrate molecule onto a protein.
Each New Chemical Entity (NCE) should have acceptable ADMET properties to pass through clinical trials. ADMET data are necessary to determine the feasibility and safety of the drug in humans. ADMET deficiencies are the leading cause of failure for most drug candidates. The Swiss Institute of Bioinformatics (SIB) has developed an interactive directory of the different in silico tools used at each stage of drug discovery. The directory is accessible at http://www.click2drug.org/citations.html
The development of a number of drugs has been assisted by structure-based design and screening strategies. The discovery of the HIV protease inhibitors is one example. Similarly, Reddy et al. used computational tools to develop a cyclooxygenase (COX)-based anti-inflammatory drug with no gastric side effects. In a study of the various mutant forms of H-Ras (Harvey-Ras) polypeptides found in cancer patients, Jayakanthan and co-workers performed virtual screening of lead compounds. The authors were able to identify two novel leads, 3-aminopropanesulphonic acid and hydroxyurea. The docking analysis revealed that Ile-36, Glu-37, Asp-38 and Ser-39 were involved in the interaction with the ligand. In a study aimed at finding novel targets for glioblastoma, the use of bioinformatics tools led to the discovery of several novel genes related to the disease. The study was also able to discover a regulatory feedback loop mediated by cyclin-dependent kinase 1 (CDK1) and WEE1. Similarly, Wu et al. developed tumour-specific networks to identify targets from differentially expressed tumour genes for breast, colon and lung cancer. Using this approach, the authors were able to identify several new targets for cancer, of which two, calcium/calmodulin-dependent serine protein kinase (CASK) and RuvB-like 1, have recently been verified by experimental approaches. In another study, McDermott and co-workers applied a protein-protein interaction approach to newly discovered differentially expressed proteins from a cell culture model of HCV (hepatitis C virus) to identify novel targets (Figure 1).
Biological activities are the result of molecular interactions that occur in a time-dependent manner. This time-dependent behaviour of a molecule can be studied using another set of bioinformatics tools, collectively referred to as Molecular Dynamics Simulations (MDS). MDS techniques aim to provide detailed information on fluctuations, on dynamic processes such as ion transport, and on the small- and large-scale conformational changes of proteins, nucleic acids and their complexes that occur in biological systems. They also assist in the determination of structures from experimental approaches such as XRD and NMR spectroscopy. MDS tools can also be useful in gaining insights into situations where the use of experimental means is not possible.
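At their core, such simulations advance atomic positions and velocities in small time steps under the forces acting on them. The sketch below shows one common integration scheme, velocity Verlet, applied to a single particle on a harmonic spring as a stand-in for a bonded atom. It is a toy model in arbitrary reduced units, with all parameter values chosen purely for illustration; production packages use full force fields and many thousands of atoms.

```python
import math


def simulate_harmonic_bond(x0=1.2, v0=0.0, k=1.0, mass=1.0, x_eq=1.0,
                           dt=0.01, steps=1000):
    """Velocity-Verlet integration of one particle on a harmonic spring.

    Returns a list of (time, position, velocity, total_energy) tuples.
    """
    def force(x):
        return -k * (x - x_eq)

    def energy(x, v):
        return 0.5 * mass * v * v + 0.5 * k * (x - x_eq) ** 2

    x, v = x0, v0
    f = force(x)
    trajectory = [(0.0, x, v, energy(x, v))]
    for step in range(1, steps + 1):
        v_half = v + 0.5 * dt * f / mass   # half-step velocity update
        x = x + dt * v_half                # full-step position update
        f = force(x)                       # force at the new position
        v = v_half + 0.5 * dt * f / mass   # complete the velocity update
        trajectory.append((step * dt, x, v, energy(x, v)))
    return trajectory


trajectory = simulate_harmonic_bond()
t_end, x_end, _, e_end = trajectory[-1]

# With k = mass = 1 and zero initial velocity, the exact motion is
# x(t) = x_eq + (x0 - x_eq) * cos(t), which gives a reference to compare against.
x_exact = 1.0 + 0.2 * math.cos(t_end)
```

Tracking the total energy along the trajectory is a standard sanity check: with a suitably small time step it should stay close to its starting value.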
For example, to determine which serine residues take part in the phosphorylation of starch branching enzyme IIb (SBEIIb), the authors used MDS together with site-directed mutagenesis and mass spectrometry. The study determined that phospho-Ser297 forms a stable salt bridge with Arg665, part of a conserved Cys-containing domain in plant branching enzymes. This study holds numerous implications for elucidating the biological role of the enzyme in starch biosynthesis in higher plants from a yield-improvement perspective.
In another study, MDS were used to study the selectivity of two membrane-bound transporter proteins, aquaporin-1 (AQP1) and the aquaglyceroporin glycerol facilitator (GlpF), for various solutes such as ammonia, urea, water and glycerol. The study observed that, unlike that of GlpF, the selectivity of AQP1 depended on the hydrophobicity of the solute particles, so that AQP1 could act as a filter for small molecules in vivo.
In a recent study, MDS were employed to investigate the folding process of the Trp-cage protein. The Trp-cage is a small protein (~20 amino acids) that enters a stable state after folding and forms a hydrophobic core around a central Trp residue. Several experimental and simulation approaches had failed to explain the underlying mechanism of folding. The authors used a series of advanced simulation techniques to discover that the central Trp6 residue is critical for the folding process. The single chain interacts with itself, creating a barrier that controls transitions to a near-native folded structure.
Similarly, Isin et al. used MDS to study different conformations of various ligands with the β2-adrenergic receptor and discovered that a novel binding site is involved in binding high molecular weight molecules. The study suggested that modelling different active conformations to identify novel binding sites could be used to refine the mechanisms of action of various drugs. Table 7 lists some popular platforms widely used for MDS analyses.
Bioinformatics is a comparatively young discipline and has progressed very fast in the last few years. It has made it possible to test hypotheses virtually, and therefore allows a better-informed decision to be taken before costly experiments are launched. Although more and more tools for analysing genomes and proteomes, predicting structures, rational drug design and molecular simulation are being developed, none of them is ‘perfect’. The hunt for better packages to solve the given problems will therefore continue. One thing is clear: future research will be guided largely by the availability of databases, which could be either generic or specific. Given the developments in the field, it can also be safely assumed that bioinformatics tools and software packages will be able to give more accurate results and thus more reliable interpretations. Prospects in the field of bioinformatics include its future contribution to the functional understanding of the human genome, leading to enhanced discovery of drug targets and individualised therapy. Thus, bioinformatics and other scientific disciplines have to move hand in hand to flourish for the welfare of humanity.
The authors thank the Higher Education Commission (HEC), Pakistan, for funding their research work, and apologise to all those colleagues whose work could not be discussed here due to space constraints.