Biological Databases- Integration of Life Science Data | OMICS International
ISSN: 0974-7230
Journal of Computer Science & Systems Biology


Biological Databases- Integration of Life Science Data

Nishant Toomula1*, Arun Kumar2, Sathish Kumar D3 and Vijaya Shanti Bheemidi4

1Department of Biotechnology, GITAM Institute of Technology, GITAM University, Visakhapatnam, India

2Department of Biochemistry, GITAM University, Visakhapatnam, India

3Department of Biotechnology, University of Hyderabad, Hyderabad, India

4Department of Biotechnology, Nottingham Trent University, Nottingham, United Kingdom

*Corresponding Author:
Dr. Nishant Toomula
Department of Biotechnology
GITAM Institute of Technology
GITAM University, Visakhapatnam, India
E-mail: [email protected]

Received date: November 02, 2011; Accepted date: December 05, 2011; Published date: December 09, 2011

Citation: Nishant T, Arun Kumar, Sathish Kumar D, Vijaya Shanti B (2011) Biological Databases- Integration of Life Science Data J Comput Sci Syst Biol 4:087-092. doi:10.4172/jcsb.1000081

Copyright: © 2011 Nishant T, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.


Abstract

Over the past few decades, major advances in the field of molecular biology, coupled with advances in genomic technologies, have led to an explosive growth in biological information generated by the scientific community. Biological databases are libraries of life sciences information, collected from scientific experiments, published literature, high-throughput experiment technology, and computational analyses. Information contained in biological databases includes gene function, structure, localization, clinical effects of mutations as well as similarities of biological sequences and structures. This article presents information on some popular bioinformatic databases available online, including sequence, phylogenetic, structure and pathway, and microarray databases.

Keywords

EMBL (European Molecular Biology Laboratory); DDBJ (DNA Data Bank of Japan); NCBI (National Center for Biotechnology Information); Proteomic tools; CATH (Class, architecture, topology, homologous superfamily); Protein domain

Introduction

Bioinformatics is the application of information technology to store, organize and analyze the vast amount of biological data available in the form of sequences and structures of proteins and nucleic acids [1]. The biological information of nucleic acids is available as sequences, while the data of proteins are available as both sequences and structures. Sequences are represented in a single dimension, whereas structures contain the three-dimensional data of sequences.

The National Center for Biotechnology Information (NCBI 2001) defines bioinformatics as: “Bioinformatics is the field of science in which biology, computer science, and information technology merge into a single discipline. There are three important sub-disciplines within bioinformatics: the development of new algorithms and statistics which assess relationships among members of large data sets, the analysis and interpretation of various types of data including nucleotide and amino acid sequences, protein domains, and protein structures; and the development and implementation of tools that enable efficient access and management of different types of information.”

Relational database concepts from computer science and information retrieval concepts from digital libraries are important for understanding biological databases. Biological database design, development, and long-term management are a core area of the discipline of bioinformatics [2]. Data contents include gene sequences [3], textual descriptions, attributes and ontology classifications, citations, and tabular data. These are often described as semi-structured data, and can be represented as tables, key-delimited records, and XML structures [4,5]. Cross-references among databases are common and are expressed through database accession numbers [6,7].
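As a minimal sketch of the key-delimited representation and of accession-number cross-references described above, the following Python snippet parses one flat-file record. The record itself is invented for illustration; its two-letter line codes (ID, AC, DE, DR) and the "//" terminator are only loosely modelled on the Swiss-Prot/EMBL flat-file style, and the function names are hypothetical.

```python
from collections import defaultdict

def parse_record(text):
    """Parse a key-delimited record into {line_code: [values]}.

    Assumes each line starts with a two-letter code followed by
    three spaces, and that '//' marks the end of the record.
    """
    fields = defaultdict(list)
    for line in text.splitlines():
        if not line.strip() or line.startswith("//"):
            continue  # skip blank lines and the end-of-record marker
        fields[line[:2]].append(line[5:].strip())
    return dict(fields)

def cross_references(fields):
    """Return (database, accession) pairs taken from the DR lines."""
    refs = []
    for value in fields.get("DR", []):
        parts = [p.strip() for p in value.rstrip(".").split(";")]
        if len(parts) >= 2:
            refs.append((parts[0], parts[1]))
    return refs

# Invented record: the names and accession numbers are placeholders.
RECORD = """\
ID   EXAMPLE_PROT
AC   X00001;
DE   Hypothetical example protein.
DR   EMBL; Y00002.
DR   PDB; 0XYZ.
//
"""

fields = parse_record(RECORD)
print(fields["ID"])
print(cross_references(fields))
```

Keeping every line code as a list of values reflects the semi-structured nature of such records: a code like DR may legitimately appear any number of times.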

A biological database is a collection of data that is organized so that its contents can easily be accessed, managed, and updated. The activity of preparing a database can be divided into:

• Collection of data in a form which can be easily accessed

• Making it available to a multi-user system

Databases in general can be classified into primary, secondary and composite databases. A primary database contains information on the sequence or structure alone. Examples include Swiss-Prot and PIR for protein sequences, GenBank and DDBJ for nucleotide sequences, and the Protein Data Bank for protein structures [8].

A secondary database contains information derived from a primary database. A secondary sequence database contains information such as the conserved sequences, signature sequences and active site residues of protein families [9], arrived at by multiple sequence alignment of a set of related proteins [10]. A secondary structure [11] database contains entries of the PDB in an organized way. These entries are classified according to their structure, for example all-alpha proteins, all-beta proteins, turns and helices [12,13]. They also contain information on conserved secondary structure motifs of a particular protein. Secondary databases created and hosted by researchers at their individual laboratories include SCOP, developed at Cambridge University; CATH, developed at University College London; PROSITE, of the Swiss Institute of Bioinformatics; and eMOTIF, at Stanford.
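The derivation of a conserved pattern from a multiple sequence alignment can be sketched as follows. This is only an illustration under simple assumptions: the alignment is already computed, gaps are written as "-", and a column counts as conserved only when every sequence has the same residue. The sequences and the function name are invented and do not reproduce the method of any particular secondary database.

```python
def conserved_pattern(alignment):
    """Return the residue at fully conserved columns and 'x' elsewhere."""
    if len({len(seq) for seq in alignment}) != 1:
        raise ValueError("aligned sequences must have equal length")
    pattern = []
    for column in zip(*alignment):
        residues = set(column)
        if len(residues) == 1 and "-" not in residues:
            pattern.append(column[0])  # identical residue in every sequence
        else:
            pattern.append("x")        # variable or gapped position
    return "".join(pattern)

# Invented, pre-aligned sequences of equal length.
ALIGNMENT = ["GCSHAK", "GCTHAR", "GCSH-K"]
print(conserved_pattern(ALIGNMENT))
```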

The first database was created shortly after the insulin protein sequence was made available in 1956. Insulin was the first protein to be sequenced, and it consists of just 51 residues. Around the mid-1960s, the first nucleic acid sequence, that of a yeast tRNA with 77 bases, was determined. During this period, three-dimensional structures of proteins were studied and the well-known Protein Data Bank was developed as the first protein structure database, with only 10 entries in 1972. It has since grown into a large database with over 10,000 entries. While the initial databases of protein sequences were maintained at individual laboratories, the development of a consolidated formal database, the SWISS-PROT protein sequence database, was initiated in 1986. Modern biological databases comprise not only data, but also sophisticated query facilities and bioinformatic data analysis tools [14]; hence, the term "bioinformatic databases" is often used.

Biological databases can be broadly classified into sequence, structure and pathway databases. Sequence databases cover both nucleic acid sequences and protein sequences, whereas structure databases cover primarily proteins.

Sequence databases

Nucleotide and protein sequence databases represent the most widely used and some of the best established biological databases. These databases serve as repositories for wet lab results and the primary source for experimental results. The major public data banks that handle DNA and protein sequences are GenBank [15] in the USA, EMBL (European Molecular Biology Laboratory) in Europe and DDBJ (DNA Data Bank of Japan) in Japan [16].

GenBank: The GenBank nucleotide database is maintained by the National Center for Biotechnology Information (NCBI) [17-19], which is part of the National Institutes of Health (NIH), a federal agency of the US government.

EMBL: The EMBL nucleotide sequence database [20] is maintained by the European Bioinformatics Institute (EBI) in Hinxton and

DDBJ: The DNA Data Bank of Japan is a biological database that collects DNA sequences submitted by researchers. It is run by the National Institute of Genetics, Japan.

Ensembl: The Ensembl database is a repository of stable, automatically annotated human genome sequences. Ensembl annotates and predicts new genes, with annotation from the InterPro [21] protein family databases and with additional annotations from databases of genetic disease-OMIM [22-24], expression-SAGE [25,26] and gene family [27].

SGD: The Saccharomyces Genome Database (SGD) is a scientific database of the molecular biology and genetics of the yeast Saccharomyces cerevisiae.


Figure 1: Schematic representation of Database

dbEST: dbEST [28] is a division of GenBank that contains sequence data and other information on short, "single-pass" cDNA sequences, or Expressed Sequence Tags (ESTs) [29], generated from randomly selected library clones. ESTs are currently the most widely sequenced nucleotide commodity in terms of number of sequences and total nucleotide count [30].

PIR: The Protein Information Resource (PIR) is an integrated public bioinformatics resource that supports genomic and proteomic research and scientific studies [34]. PIR has provided many protein databases and analysis tools to the scientific community, including the PIR-International Protein Sequence Database (PSD) of functionally annotated protein sequences. The PIR-PSD, originally created as the Atlas of Protein Sequence and Structure edited by Margaret Dayhoff, contained protein sequences that were highly annotated with functional, structural, bibliographic, and sequence data [35,36].

Swiss-Prot: Swiss-Prot [37,38] is a protein sequence and knowledge database. It is well known for its minimal redundancy, high quality of annotation, use of standardized nomenclature, and links to specialized databases. As Swiss-Prot is a protein sequence database, its repository contains the amino acid sequence, the protein name and description, taxonomic data, and citation information [39].

TrEMBL: The European Bioinformatics Institute, collaborating with Swiss-Prot, introduced another database, TrEMBL (translation of EMBL nucleotide sequence database) [40]. This database consists of computer annotated entries derived from the translation of all coding sequences in the nucleotide databases.
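The translation step behind such computer-derived entries can be illustrated with a short sketch. It assumes the standard genetic code, a coding sequence that starts in frame, and translation that stops at the first stop codon; the function name and the example sequence are invented, and real annotation pipelines handle many more cases (alternative genetic codes, partial sequences, ambiguity codes).

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(cds):
    """Translate an in-frame coding sequence up to the first stop codon."""
    cds = cds.upper().replace("U", "T")  # accept RNA input as well
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")  # 'X' for unknown codons
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

print(translate("ATGGCCTGGTAA"))
```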

UniProt: The UniProt database is organized into three layers. The UniProt Archive (UniParc) stores the stable, nonredundant corpus of publicly available protein sequence data. The UniProt Knowledgebase (UniProtKB) consists of accurate protein sequences with functional annotation [41,42]. Finally, the UniProt Reference Clusters (UniRef) datasets provide nonredundant reference clusters based primarily on UniProtKB. UniProt also offers users multiple tools, including searches against the individual contributing databases, BLAST [43,44] and multiple sequence alignment, proteomic tools, and bibliographic searches [45].
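The idea of a nonredundant sequence archive can be sketched in a few lines: each distinct sequence is stored once and keeps a list of the source records in which it was seen. This is a simplified illustration rather than UniProt's actual implementation; the function name, database labels and accession numbers below are invented.

```python
def build_nonredundant(entries):
    """Group (source_db, accession, sequence) entries by identical sequence.

    Returns {sequence: [(source_db, accession), ...]} so that every
    distinct sequence appears exactly once.
    """
    archive = {}
    for source_db, accession, sequence in entries:
        key = sequence.upper()  # treat case differences as the same sequence
        archive.setdefault(key, []).append((source_db, accession))
    return archive

# Invented entries: two source records share the same sequence.
ENTRIES = [
    ("DB_A", "A0001", "MKTAYIAK"),
    ("DB_B", "B0417", "mktayiak"),
    ("DB_A", "A0002", "MSDNELQR"),
]

archive = build_nonredundant(ENTRIES)
for sequence, sources in archive.items():
    print(sequence, sources)
```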

Structure databases

Knowledge of protein structures and of molecular interactions is key to understanding protein functions and the complex regulatory mechanisms underlying many biological processes [50].

Protein Data Bank: The PDB (Protein Data Bank) is the single worldwide archive of structural data of biological macromolecules, established at Brookhaven National Laboratory in 1971. It contains structural information on macromolecules determined by X-ray crystallography and NMR methods [51-53]. The PDB is maintained by the Research Collaboratory for Structural Bioinformatics (RCSB). It allows the user to view data both in plain text and through a molecular viewer using Jmol.
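To give a feel for the plain-text view, the sketch below reads atomic coordinates from one ATOM line of the legacy fixed-column PDB file format. Only a few of the defined columns are extracted, the sample line is invented and assembled in code so that each field lands in its column, and the function name is hypothetical.

```python
def parse_atom_line(line):
    """Extract selected fields from an ATOM/HETATM line (fixed columns)."""
    if not line.startswith(("ATOM  ", "HETATM")):
        return None
    return {
        "serial": int(line[6:11]),       # columns 7-11
        "name": line[12:16].strip(),     # columns 13-16
        "res_name": line[17:20].strip(), # columns 18-20
        "chain": line[21],               # column 22
        "res_seq": int(line[22:26]),     # columns 23-26
        "x": float(line[30:38]),         # columns 31-38
        "y": float(line[38:46]),         # columns 39-46
        "z": float(line[46:54]),         # columns 47-54
    }

# Invented sample line, assembled field by field to respect the layout.
SAMPLE = (
    "ATOM  " + f"{1:>5}" + " " + " N  " + " " + "MET" + " " + "A"
    + f"{1:>4}" + "    " + f"{27.340:8.3f}{24.430:8.3f}{2.614:8.3f}"
)

atom = parse_atom_line(SAMPLE)
print(atom)
```

Slicing by column position, rather than splitting on whitespace, matters here because adjacent fields in this format may touch without any separating space.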

SCOP: The SCOP (Structural Classification of Proteins) [54] database was started by Alexey Murzin in 1994. Its purpose is to classify protein 3D structures in a hierarchical scheme of structural classes.

CATH: The CATH database (Class, architecture, topology, homologous superfamily) is a hierarchical classification of protein domain structures, which clusters proteins at four major structural levels.
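The four-level hierarchy lends itself to a dotted numeric identifier, one number per level. The sketch below assumes that form, parses such an identifier into its levels, and reports how many leading levels two domains share. The identifiers are used purely as illustrations, and the type and function names are hypothetical.

```python
from typing import NamedTuple

class CathCode(NamedTuple):
    class_: int
    architecture: int
    topology: int
    homologous_superfamily: int

def parse_cath_code(code):
    """Split a dotted four-level identifier into its hierarchy levels."""
    parts = code.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated levels, got {code!r}")
    return CathCode(*(int(part) for part in parts))

def shared_depth(a, b):
    """Count how many leading hierarchy levels two codes have in common."""
    depth = 0
    for level_a, level_b in zip(a, b):
        if level_a != level_b:
            break
        depth += 1
    return depth

first = parse_cath_code("3.40.50.300")
second = parse_cath_code("3.40.50.720")
print(first)
print(shared_depth(first, second))
```

Two domains that agree on the first three levels, as in this example, share class, architecture and topology but belong to different homologous superfamilies.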

NDB: Nucleic Acid Database, also curated by RCSB and similar to the PDB and the Cambridge Structural Database [55], is a repository for nucleic acid structures. It gives users access to tools for extracting information from nucleic acid structures and distributes data and software.

Pathway databases

The development of metabolic databases derived from the comparative study of metabolic pathways will serve industrial needs more efficiently and further the growth of systems biotechnology [56,57]. Some examples of pathway databases are KEGG [58], BRENDA and BioCyc.

Database | URL | Feature
GenBank [31] | http://www.ncbi.nlm.nih.gov/ | NIH's archival genetic sequence database
EMBL | http://www.ebi.ac.uk/embl/ | EBI's archival genetic sequence database
DDBJ | http://www.ddbj.nig.ac.jp/ | NIG's archival genetic sequence database
SGD | http://www.yeastgenome.org/ | A repository for baker's yeast genome and biological data
EBI genomes | http://www.ebi.ac.uk/genomes/ | Provides access to and statistics for the completed genomes [32]
Ensembl | http://www.ensembl.org/ | Database that maintains automatic annotation on selected eukaryotic genomes [33]
UniGene | http://www.ncbi.nlm.nih.gov/sites/entrez?db=unigene | Each UniGene cluster contains sequences that represent a unique gene, as well as related information
dbEST | http://www.ncbi.nlm.nih.gov/dbEST/ | Division of GenBank that contains expressed sequence tag data

Table 1: Summary of Nucleotide sequence databases.

Database | URL | Feature
Swiss-Prot/TrEMBL [46,47] | http://www.expasy.org/sprot/ | Description of the function of a protein, its domain structure, post-translational modifications, etc.
UniProt [48] | http://www.pir.uniprot.org/ | Central repository for PIR, Swiss-Prot, and TrEMBL
PIR | http://pir.georgetown.edu/ | Strives to be comprehensive, well-organized, accurate, and consistently annotated
Pfam | pfam.sanger.ac.uk/ | Database of protein families defined as domains [49]
PROSITE | www.expasy.ch/prosite/ | Database of protein families and domains

Table 2: Summary of Protein sequence databases.

Database | URL | Feature
PDB | www.rcsb.org/pdb/ | Protein structure repository that provides tools for analyzing these structures
SCOP | scop.mrc-lmb.cam.ac.uk/scop/ | Classification of protein 3D structures in a hierarchical scheme of structural classes
CATH | www.cathdb.info | Hierarchical classification of protein domain structure
NDB | http://ndbserver.rutgers.edu/ | Database housing nucleic acid structural information

Table 3: Summary of Structure databases.

Database | URL | Feature
KEGG | http://www.genome.jp/kegg/ | Resource linking genome information to molecular interaction networks through its PATHWAY, GENES and LIGAND databases
BioCyc | http://www.biocyc.org/ | Compilation of pathway and genome information for different organisms
BRENDA | http://www.brenda-enzymes.org/ | Collection of enzyme functional data
EMP | http://emp.mcs.anl.gov/ | Database of Enzymes and Metabolic Pathways public server
BRITE | http://www.genome.jp/kegg/brite.html | Biomolecular Relations in Information, Transmission and Expression

Table 4: Summary of Pathway databases.

KEGG: The Kyoto Encyclopedia of Genes and Genomes (KEGG) is the primary resource for the Japanese GenomeNet service that attempts to define the relationships between the functional meanings and utilities of the cell or the organism and its genome information [59-61]. KEGG contains three databases: PATHWAY, GENES, and LIGAND. The PATHWAY database stores computerized knowledge on molecular interaction networks. The GENES database contains data concerning sequences of genes and proteins generated by the genome projects. The LIGAND database holds information about the chemical compounds and chemical reactions that are relevant to cellular processes.

BRENDA: BRENDA is the main collection of enzyme functional data [62] available to the scientific community. It is maintained and developed at the Institute of Biochemistry and Bioinformatics at the Technical University of Braunschweig, Germany.

BioCyc: The BioCyc Database Collection is a compilation of pathway and genome information for different organisms [63]. It includes two other databases: EcoCyc [64], which describes Escherichia coli K-12, and MetaCyc [65], which describes pathways for more than 300 organisms.

Conclusion

As biology has increasingly turned into a data rich science, the need for storing and communicating large datasets has grown tremendously. The obvious examples are the nucleotide sequences, the protein sequences, and the 3D structural data produced by X-ray crystallography and NMR. Biological databases are an important tool in assisting scientists to understand and explain a host of biological phenomena from the structure of biomolecules and their interaction, to the whole metabolism of organisms and to understanding the evolution of species. This knowledge helps facilitate the fight against diseases, assists in the development of medications and in discovering basic relationships amongst species in the history of life.

References
