Bioinformatics for High Throughput Sequencing

Volume 4 • Issue 4 • 1000e108 J Data Mining Genomics Proteomics ISSN: 2153-0602 JDMGP, an open access journal

Over 12 years have passed since the publication of the first rough draft of the sequence of the human genome, and almost 10 years since a complete DNA sequence of the euchromatic part of the human genome was published by a large international consortium [1,2]. While the assembly of the initial maps was costly and labor-intensive, it has been a very important step towards understanding biological functions and interactions, phenotypes, diversity, disease, and the interplay between an organism and its environment.
In the last decade, ground-breaking technologies such as pyrosequencing, next generation (nexgen) sequencing and Pacific Biosciences' '3rd generation' sequencing [3,4] have not only greatly accelerated the rate of DNA/RNA sequence generation and scientific discovery, they have also significantly cut the cost of sequencing and the time needed to deeply sequence entire genomes or transcriptomes. While these achievements are certainly laudable, they have also created new challenges: with contemporary data sets extending into the range of tens to hundreds of gigabytes per run, data storage, management and interpretation face new problems. Other issues relate to quality control (QC) in high throughput sequencing studies.
The good news is that an emerging generation of researchers with broad training in computer science, informatics, wet lab sciences and computational biology will be able to tackle these challenges and solve many of the problems once the appropriate, and definitely not inexpensive, infrastructures have been put in place [5].
The potential pay-off of these endeavors, investments and efforts could be enormous. To name a few examples, a deeper understanding of the interactions within microbial communities requires multidisciplinary teams of well trained researchers. Understanding how microbial communities inhabit and interact with the termite gut environment, and how cellulose-degrading enzymes are produced and compartmentalized, starts with the isolation of cellulose-degrading microorganisms, a metagenomic analysis of enzymes from gut-inhabiting microbes and a high resolution analysis of the termite gut environment [6][7][8]. While microbial communities were studied just a few years ago using shotgun sequencing approaches [9], nexgen and 3rd generation sequencing approaches are expected to generate much more information in less time and for a fraction of the price. Results from this type of research may shift paradigms in biotechnology/processing within just a few years.
Human health, susceptibility and acute disease have always been cited as drivers of biotechnology-/sequencing-based diagnostic technology developments. For example, very encouraging results from RNA sequencing, showing recurrent gene fusions, cancer-associated expression of long non-coding RNAs and atypical gene splicing in prostate cancer, may one day make it possible to predict the course of the disease [10]. This is exciting new research in the discovery of biomarkers for tumor aggressiveness, metastasis or response to therapy that might direct therapeutic interventions in prostate cancer patients. Other non-coding RNAs, including micro-RNAs, might also become prognostic markers for disease progression or disease-free survival [11].
This list of promising applications of high throughput sequencing technology could go on for many pages, including the sequencing of plant genomes [12] or of disease-causing viruses, microbes or other agents. We prefer to keep this editorial concise and introduce the reader to a collection of cutting-edge articles that describe innovative solutions to today's problems in the bioinformatic analysis of high throughput sequencing data.
In 2013, the publishers of the Journal of Data Mining in Genomics and Proteomics (JDMGP) and I issued a call to the scientific community to consider submitting high-quality articles to a Special Issue of JDMGP entitled 'Bioinformatics for High Throughput Sequencing' for streamlined peer review and open access publishing. The Open Access publishing model makes these articles available shortly after acceptance, and readers world-wide do not have to pay fees or order a copy through libraries.
The current issue of JDMGP contains an exciting collection of nine research articles from labs that work at the cutting edge of high throughput sequencing. The following two papers focus on algorithm development. A contribution by G. Natsoulis and colleagues describes a novel two-step algorithm for the 'identification of insertion deletion mutations from deep targeted resequencing' [16], while I.Y. Zhbannikov and coauthors present 'SlopMap: A Software Application Tool for Quick and Flexible Identification of Similar Sequences Using Exact K-Mer Matching' in Roche 454- and Illumina-generated data sets [17].
With 3rd generation sequencing technology becoming available to the research community, the article by X. Jiao et al. entitled 'A Benchmark Study on Error Assessment and Quality Control of CCS Reads Derived from the PacBio RS' [20] is very timely. Finally, even the fastest sequencing tools will be underutilized if bottlenecks continue to exist in the pipeline of template generation and processing. E. Avsar-Ban and colleagues present a 'High-Throughput Injection System for Zebrafish Fertilized Eggs' intended to overcome problems in the use of zebrafish as a vertebrate model system [21].
The nine articles in this issue mostly describe the research focus of their respective teams of authors. Efforts are underway at the publishing house to publish a further volume with additional contributions on 'Bioinformatics for High Throughput Sequencing' before the end of the year. Please see the JDMGP's Special Issue web site for the timeline and further information.

Acknowledgement
The skillful assistance of the staff of the Weier laboratory, LBNL, is gratefully acknowledged. This work was supported in part by NIH grants CA136685 and CA168345 and was carried out at the Ernest Orlando Lawrence Berkeley National Laboratory under contract DE-AC02-05CH11231.

Disclaimer
This document was prepared as an account of work sponsored by the United States Government. While this document is believed to contain correct information, neither the United States Government nor any agency thereof, nor The Regents of the University of California, nor any of their employees, makes any warranty, express or implied, or assumes any legal responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by its trade name, trademark, manufacturer, or otherwise, does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government or any agency thereof, or The Regents of the University of California. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof, or The Regents of the University of California.