Bioinformatics in High Throughput Sequencing: Application in Evolving Genetic Diseases

Bioinformatics is a branch of computational biology that applies “informatics” techniques to macromolecules in order to understand and organize the information associated with them. These data are the product of large-scale molecular biology projects, such as the various genome sequencing projects, and of analyses of gene expression, genomics, proteomics and protein-protein interactions. They are collected and stored in different databases. Bioinformatics analysis in molecular biology focuses on macromolecular structures, genome sequences and gene expression data. Techniques developed by computer scientists have enabled researchers to sequence the nearly 3 billion base pairs of the human genome. Recent scientific discoveries that resulted from the application of next generation DNA sequencing technologies have given rise to the science of genomics, and have enabled critical advances in other fields, including epidemiology, forensics, evolutionary biology and medical diagnostics. Technologies for high throughput sequencing, their limitations and their applications are highlighted in this review. Sequencing known genes enables the discovery of novel mutations that could help scientists understand the evolving features of some genetic diseases and the occurrence of many genetic diseases due to mutant variants of one gene or clusters of genes, or even explain the overlapping features of some genetic diseases mapped to nearby or distant loci.


Introduction
Computational, mathematical, statistical and informatics technologies developed in parallel with biological research have enabled scientists to interconnect, integrate and interpret the complex nature of any biological system. In fact, information extraction from complex data is a major problem in biological research, where computational systems, biostatistics and information technologies are finding increasing application. The assemblage and integration of all these technologies to solve problems related to biological systems was termed "bioinformatics" in the mid-1980s [1].
Accordingly, bioinformatics has become an integral part of research and development in the biomedical sciences, and plays an essential role both in deciphering genomic, transcriptomic and proteomic data generated by high throughput experimental technologies, and in organizing information gathered from traditional biology [2].
Defined as a union of biology and informatics, bioinformatics involves the technology that uses computers for storage, retrieval, manipulation and distribution of information related to biological macromolecules, such as DNA, RNA and proteins. Bioinformatics has a major impact on many areas of biotechnology and the biomedical sciences, for example in knowledge-based drug design, forensic DNA analysis and agricultural biotechnology.
Bioinformatics is limited to sequence, structural and functional analysis of genes and genomes and their corresponding products. It is often considered as computational molecular biology [3]. The ultimate goal of bioinformatics is to better understand a living cell, and how it functions at the molecular level. By analyzing raw molecular sequence and structural data, bioinformatics research can generate new insights and provide a "global" perspective of the cell. The reason that the functions of a cell can be better understood by analyzing sequence data is ultimately because the flow of genetic information is dictated by the "central dogma" of biology, in which DNA is transcribed to RNA, which is translated to proteins. Cellular functions are mainly performed by proteins whose capabilities are ultimately determined by their sequences. Therefore, solving functional problems using sequence and sometimes structural approaches has proved to be a fruitful endeavor [3].
Sequence based methods of analyzing individual genes or proteins have been elaborated and expanded, and developed for analyzing large numbers of genes or proteins simultaneously. With the complete genome sequences for an increasing number of organisms at hand, bioinformatics is beginning to provide both conceptual bases and practical methods for detecting systemic functional behaviors of the cell and the organisms [2].
The completion of the first human genome drafts was just the start of the modern DNA sequencing era, which led to the invention and further development of new advanced strategies for high-throughput DNA sequencing, the so-called "high-throughput next generation sequencing" (HT-NGS). These HT-NGS strategies addressed anticipated future needs for sequencing throughput and cost, in a way that enabled a multitude of current and future applications in mammalian genomic research [4]. Alongside these advanced laboratory methodologies, a new generation of bioinformatics tools has emerged as an essential prerequisite for further strategic development and improvement of the output results.

Sequencing Overview
Sequencing has progressed far beyond the analysis of DNA sequences, and is now routinely used to analyze other biological components, such as RNA and protein, as well as how they interact in complex networks. In addition, increasing throughput and decreasing costs are making medical applications of sequencing a reality.
Next-generation sequencing (also 'Next-gen sequencing' or NGS) refers to DNA sequencing methods that came into existence in the last decade, after earlier capillary sequencing methods that relied upon 'Sanger sequencing' [5]. As opposed to the Sanger method of chain-termination sequencing, NGS methods are highly parallelized processes that enable the sequencing of thousands to millions of molecules at once.

Popular NGS methods include pyrosequencing developed by 454 Life Sciences (now Roche), which makes use of luciferase to read out signals as individual nucleotides are added to DNA templates; Illumina sequencing, which uses reversible dye terminator techniques that add a single nucleotide to the DNA template in each cycle; and SOLiD sequencing by Life Technologies, which sequences by preferential ligation of fixed-length oligonucleotides [4]. These advances did not merely make the sequencing of DNA and RNA cheaper and more efficient; they have also helped create innovative new experimental approaches that penetrate deeply into the molecular mechanisms of genome organization and cellular function.
A prime example of the advances that have been facilitated by new sequencing technologies is the NHGRI-funded ENCODE project, which was launched in late 2003, based largely upon methods first developed in yeast [6,7]. The pilot phase of ENCODE relied heavily on microarray-based assays to analyze 1% of the human genome in unprecedented depth [8].
Thanks to advances in high-throughput sequencing, researchers expanded the scope of this project to include the whole human genome [9]. A total of ~1650 high-throughput experiments were performed to analyze transcriptomes, map elements and identify methylation patterns in the human genome. This multi-institution consortium project has assigned biochemical activities to 80% of the genome, particularly annotating the portion of the genome that lies outside the well-studied protein-coding regions, including mapping over four million regulatory regions. This information has also enabled researchers to map genetic variants to gene regulatory regions and assess indirect links to disease [10]. Similar projects annotating the genome have also been performed for Drosophila melanogaster [1], Caenorhabditis elegans [11], and mouse [12].

Whole Genome Re-sequencing
The term "re-sequencing" refers to the act of sequencing multiple individuals from the same species, where a reference genome has been generated, and is used to assist in the interpretation of the data collected using next generation sequencing approaches. For example, re-sequencing of human genomes has been used to discover both mutations [13,14], and polymorphisms [1]. The existence of reference genome sequences has driven this application, which was the first one employed using Roche/454, Illumina/Genome-Analyzer and Applied Biosystems/SOLiD technologies.
Since that landmark study, whole genome re-sequencing continues to be used actively in various projects, including for example the 1000 Genomes Project [1], which aims to discover common sequence variants in healthy human populations, and also in various cancer studies (e.g. [13,14]), including those conducted under the auspices of the large TCGA (http://tcga.cancer.gov/) and ICGC (http://www.icgc.org/, 2010) consortia.
Applications for whole genome re-sequencing continue to emerge, and the steady decrease in cost per base and the increased throughputs associated with the latest technology advances will hopefully make this mode of data collection as appealing financially as it is scientifically.

DNA Sequencing Using Bioinformatics Analysis
Bioinformatics analysis of sequencing data can be divided into several stages. The first step is technology dependent, and deals with processing the data provided by the sequencing instrument. Downstream analysis is then tailored to the type of experiment. When sequencing new genomes, de novo assemblies are required, possibly followed by genome annotation. Re-sequencing projects align the short reads (mapping assembly) against a reference sequence of the source organism; these alignments are then analyzed to detect events relevant to the experiment being conducted (e.g. mutation discovery, detection of structural variants, copy number analysis). The first step of bioinformatics analysis starts during sequencing, and involves signal analysis to transform the sequencing instrument's fluorescent measurements into a sequence of characters representing the nucleotide bases. As sequencers image surfaces densely packed with the DNA sequencing templates and sequencing products, image processing techniques are required to detect the nascent sequences and convert the detected signal into nucleotide bases. Most technologies assign a base quality to each of the nucleotides, which is usually a value representing the confidence of the called base. Although each vendor has methods specific to its technology to evaluate base quality, most provide the user with a Phred-like score: a quality measurement based on a logarithmic scale encoding the probability of error in the corresponding base call [15].
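The logarithmic relationship behind a Phred-like score can be illustrated with a short sketch. It assumes the standard Phred definition, Q = -10 log10(P), where P is the probability that the base call is wrong. The function names are hypothetical, and the ASCII offset of 33 used for decoding is an assumption that matches the common Sanger-style FASTQ encoding; vendor-specific formats may differ.

```python
import math
from typing import List


def phred_from_error_prob(p_error: float) -> float:
    """Convert an error probability into a Phred-like score: Q = -10 * log10(P)."""
    if not 0.0 < p_error <= 1.0:
        raise ValueError("error probability must be in (0, 1]")
    return -10.0 * math.log10(p_error)


def error_prob_from_phred(q: float) -> float:
    """Invert the score: P = 10 ** (-Q / 10)."""
    return 10.0 ** (-q / 10.0)


def decode_quality_string(qualities: str, offset: int = 33) -> List[int]:
    """Decode an ASCII-encoded quality string into integer scores.

    The offset of 33 is an assumption (Sanger-style FASTQ encoding).
    """
    return [ord(ch) - offset for ch in qualities]


if __name__ == "__main__":
    # A 1-in-1000 chance of a wrong call corresponds to a score of about 30.
    print(round(phred_from_error_prob(0.001), 2))
    print(error_prob_from_phred(20))
    print(decode_quality_string("I5!"))
```

On this scale, every 10-point increase in the score corresponds to a tenfold decrease in the estimated error probability.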
To achieve contiguous stretches of overlapping sequence (contigs) in de novo sequencing projects, software that can detect sequence overlaps among large numbers of relatively short sequence reads is required. The process of correctly ordering the sequence reads, called assembly, is complicated by the short read length; the presence of sequencing errors; repeat structures that may reside within the genome; and the sheer volume of data that must be manipulated to detect the sequence overlaps. To address such complications, hybrid methods involving complementary technologies have been successful. For example, by mixing 200 bp 454 sequence reads with Sanger sequences, Goldberg et al. [16] successfully sequenced the genomes of several marine organisms. A different approach eliminated the need for Sanger sequencing by mixing two distinct next generation sequencing technologies [17]. By combining 454's longer reads (250 bp) with short Illumina reads (36 bp), Reinhardt et al. [17] were able to de novo sequence a 6. [...] during sequencing of the mouse genome [18]; by taking advantage of the conserved regions between mouse and human, Gregory et al. [18] were able to build a physical map of mouse clones, establishing a framework for further sequencing. A similar approach can be used to produce better assemblies with next generation sequencing. For example, to sequence the genome of the fungus Sordaria macrospora [19], short reads from 454 and Illumina instruments were first assembled using Velvet [20], and the resulting contigs were then compared to draft sequences of related fungi (Neurospora crassa, N. discreta and N. tetrasperma).
This process helped produce a better assembly by reducing the number of contigs from 5,097 to 4,629, while increasing the N50 (the contig length N, for which 50% of the genome is contained in contigs of length N or larger), from 117 kb to 498 kb.
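The N50 statistic defined above can be computed with a few lines of code. The following is a minimal sketch rather than the method used in the cited study: it assumes that the total length of all contigs stands in for the genome size, which is the usual convention when the true genome size is unknown. The function name and the example contig lengths are hypothetical.

```python
from typing import Iterable


def n50(contig_lengths: Iterable[int]) -> int:
    """Return the length N such that contigs of length >= N
    together cover at least half of the total assembled length."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one contig length is required")
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        # Stop at the first contig that pushes the running sum past 50%.
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


if __name__ == "__main__":
    # Hypothetical contig lengths (arbitrary units) for illustration only.
    print(n50([80, 70, 50, 40, 30, 20]))
```

Because the statistic depends only on the sorted lengths, merging contigs raises the N50 even when the total assembled length is unchanged, which is why it is reported alongside the contig count.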
More recently, new algorithms have been developed that can assemble genomes using only short reads. Most of these methods are based on de Bruijn graphs. Briefly, the logic involves decomposing short reads into shorter fragments of length k (k-mers). The graph is built by creating a node for each k-mer and drawing a link, or "edge," between two nodes when they overlap by k-1 bp. These edges specify a graph in which overlapping sequences are linked. Sequence features can increase the resulting graph's complexity. The graph can, for example, contain loops due to highly similar sequences (e.g. gene family members or repetitive regions), and so-called bubbles can be created when single base differences (e.g. due to polymorphisms or sequencing errors) result in the creation of non-unique edges in the graph, which yield not one, but two possible paths around the sites of the sequence differences.
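The graph construction described above can be sketched in a few lines. This is an illustrative toy, not the implementation used by any of the assemblers cited in this review: it creates one node per distinct k-mer and links two nodes whenever the last k-1 bases of one equal the first k-1 bases of the other. The function names and the two example reads, which differ at a single base, are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def build_de_bruijn_graph(reads: Iterable[str], k: int) -> Dict[str, List[str]]:
    """Map every k-mer (node) to the k-mers that follow it with a k-1 overlap."""
    nodes = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            nodes.add(read[i:i + k])

    # Index nodes by their first k-1 bases so successors can be looked up directly.
    by_prefix = defaultdict(set)
    for node in nodes:
        by_prefix[node[:-1]].add(node)

    # An edge goes from a node to every node whose prefix equals its suffix.
    return {node: sorted(by_prefix.get(node[1:], ())) for node in nodes}


def branch_points(graph: Dict[str, List[str]]) -> List[str]:
    """Nodes with more than one outgoing edge, e.g. where a bubble opens."""
    return sorted(node for node, successors in graph.items() if len(successors) > 1)


if __name__ == "__main__":
    # Two hypothetical reads that differ at one base (G versus A).
    reads = ["TTACGTGGA", "TTACATGGA"]
    graph = build_de_bruijn_graph(reads, k=3)
    print(branch_points(graph))
    print(graph["TAC"])
```

In this example the single base difference opens two paths after the node TAC that rejoin at TGG, which is the bubble structure described in the text.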
Graph complexity and size increase for large genomes, and given that the graph needs to be available in memory for efficient analysis, not all implementations can handle human size genomes. Some publicly available implementations, such as Velvet [20] and Euler-SR [21], have been successfully used to assemble bacterial genomes. Another implementation, ABySS [22], makes use of parallel computing through the Message Passing Interface (MPI), to distribute the graph between many nodes in a computing cluster. In this way, ABySS can efficiently scale up for the assembly of human size genomes, using a collection of inexpensive computers. Two newer assemblers [23], and ALLPATHS-LG [24], are able to assemble human-sized genomes using large memory multi-cpu servers, requiring 150 Gb and 512 Gb RAM, respectively.
For re-sequencing experiments, high-throughput aligners are required to map reads to the reference genome. Many applications have long been available for sequence alignment; however, the number and size of the short reads created by next generation sequencing technologies required the development of more efficient algorithms. Some methods use "hashing" approaches, as is the case for Maq [23], in which the reads are reduced in complexity to unique identifier keys ("hashed"). These can then be used to scan a table made from a similarly "hashed" representation of the reference genome to identify putative read alignments to the reference. Other methods, based on the Burrows-Wheeler transformation, have become popular for read alignment. These include BWA [25], Bowtie [26], and Soap [26]. Although these algorithms are relatively fast compared to Maq [27], they are somewhat limited when it comes to splitting a read to achieve gapped alignments, which can occasionally be required due to insertion/deletion sequence differences ("indels") between the sequence data and the reference. The Mosaik aligner [28] attempts to address this by using the Smith and Waterman (1981) algorithm to align the short reads.
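The general idea of a "hashing" aligner can be sketched as follows. This is a deliberately simplified illustration and not Maq's actual algorithm: it assumes that the first few bases of each read serve as an exact-match seed, looks that seed up in a hash table built from the reference, and then counts mismatches over the full read at each candidate position. All names, the toy reference and the mismatch threshold are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def build_seed_index(reference: str, seed_len: int) -> Dict[str, List[int]]:
    """Hash table from every seed-length substring of the reference to its positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - seed_len + 1):
        index[reference[pos:pos + seed_len]].append(pos)
    return index


def map_read(read: str, reference: str, index: Dict[str, List[int]],
             seed_len: int, max_mismatches: int = 1) -> List[Tuple[int, int]]:
    """Return (position, mismatch count) for each ungapped candidate alignment."""
    hits = []
    for pos in index.get(read[:seed_len], []):
        window = reference[pos:pos + len(read)]
        if len(window) < len(read):
            continue  # the read would run off the end of the reference
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches:
            hits.append((pos, mismatches))
    return hits


if __name__ == "__main__":
    reference = "GATTACAGATTTCAGG"  # toy reference sequence
    index = build_seed_index(reference, seed_len=4)
    print(map_read("TTACAGAT", reference, index, seed_len=4))
    print(map_read("TTACAGTT", reference, index, seed_len=4))
```

Because the comparison is ungapped, a read that spans an indel would fail to map in this sketch, which is the kind of case that gapped methods such as Smith-Waterman alignment are designed to handle.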

Genomic Sequencing in Medical Fields
Genomic sequencing will have an enormous impact on the field of medicine. Until recently, cost and throughput limitations have made general clinical applications infeasible. Currently, though, a price of about 5000 USD for a normal human genome sequence (not counting analysis) and fast throughput (several days to a few weeks) are rapidly making medical sequencing practical. Indeed, high-throughput sequencing has already been used to help diagnose highly genetically heterogeneous disorders, such as X-linked intellectual disability, congenital disorders of glycosylation and congenital muscular dystrophies [29]; to detect carrier status for rare genetic disorders [29,30]; and to provide less invasive detection of fetal aneuploidy through the sequencing of free fetal DNA [31]. Beyond these uses, medical sequencing could potentially be applied in a wide range of areas, such as cancer, hard-to-diagnose diseases and personalized medicine.

From Low Throughput to High Throughput Sequencing Bioinformatics in Clinical Settings
Many genetic diseases are characterized by evolving features, and the overlapping features of some genetic diseases may necessitate extensive study, not only at the DNA level, but also at the protein level. Low-throughput sequencing of known target genes enables the discovery of novel mutations that could help scientists understand the evolving features of some genetic diseases, the occurrence of many genetic diseases due to mutant variants of one gene or cluster of genes, or even the overlapping features of different genetic diseases mapped to nearby or distant loci.
Amplification and sequencing of all the coding sequences of the CTNS gene, including flanking intronic regions, using a Big Dye Primer Cycle Sequencing kit and an ABI 310 Genetic Analyzer (PE Applied Biosystems, Foster City, California, USA), yielded a novel nonsense mutation (c.734G>A) that was homozygous in the probands but heterozygous in the parents [32]. This mutation replaces tryptophan with a premature stop codon at position 245 in cystinosin (W245X). This novel truncating CTNS mutation could explain the congenital heart defects detected, for the first time and not previously reported in the literature, in the two patients with severe infantile cystinosis (Figure 1A and 1B).
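The arithmetic linking the G-to-A substitution at coding nucleotide 734 to the stop codon at position 245 can be checked with a short sketch. It assumes that coding positions are numbered from the first base of the start codon, and that the affected codon is TGG, the only codon for tryptophan in the standard genetic code. The function names are hypothetical.

```python
from typing import Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code


def codon_position(cdna_pos: int) -> Tuple[int, int]:
    """Return (1-based codon number, 0-based offset within the codon)."""
    if cdna_pos < 1:
        raise ValueError("coding positions are 1-based")
    return (cdna_pos - 1) // 3 + 1, (cdna_pos - 1) % 3


def apply_substitution(codon: str, offset: int, ref: str, alt: str) -> str:
    """Replace the reference base at the given offset with the alternative base."""
    if codon[offset] != ref:
        raise ValueError("reference base does not match the codon")
    return codon[:offset] + alt + codon[offset + 1:]


if __name__ == "__main__":
    codon_number, offset = codon_position(734)
    # Assumption: tryptophan is encoded by TGG in the standard genetic code.
    mutant = apply_substitution("TGG", offset, ref="G", alt="A")
    print(codon_number, offset, mutant, mutant in STOP_CODONS)
```

Nucleotide 734 falls on the middle base of codon 245, so the substitution turns TGG into TAG, which agrees with the W245X designation.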
On the other hand, mutations of the LMNA gene (OMIM: 150330), which codes for lamin A/C (HGNC id: 663), have been associated with more than 13 disease variants involving heart, nerve, adipose tissue, skeleton, etc. in different patterns, among which are mandibulo-acral dysplasia (OMIM: 248370) and Hutchinson-Gilford progeria syndrome (OMIM: 176670). A novel p.Arg527Leu LMNA mutation that causes overlapping mandibuloacral dysplasia and progeria syndrome was recently discovered in two unrelated Egyptian families; the affected patients had features of mandibulo-acral dysplasia (stunted growth, hypoplastic mandible, stiff spine, acro-osteolysis of distal phalanges), with some progeroid features, such as pinched nose, premature loss of teeth, loss of hair and scleroderma-like skin atrophy [33]. Patients were homozygous, whereas their parents were heterozygous, for the p.Arg527Leu LMNA mutation (Figure 2A and 2B). Computational predictions of the effects of this substitution suggested an alteration in protein stability, and thus a greater tendency for protein aggregation; such changes might influence its interaction with other proteins. This bioinformatics prediction has recently been proven by the detection of minor ultra-structural changes in heterozygous parents, compared to the severe changes in affected patients, as shown by electron microscopic examination of skin biopsy samples.
More extensively, whole genome sequencing is sometimes mandatory for the elucidation of 'mysterious' clinical diseases, i.e. when dissection of the known candidate gene(s) for a clinical state yields no positive results. In other words, whole-genome and exome sequencing are likely to prove useful in the diagnosis of rare diseases, and in selecting the optimal individualized treatment option for patients. This approach typically involves the use of families; sequencing of affected individuals and relatives, along with inheritance patterns, is used to deduce variants that are associated with a disease.
Whole exome sequencing performed on a four member family led to the discovery of the causative gene for Miller's syndrome, an extremely rare condition that gives rise to micrognathia and cleft lips among other features [34]. Nicholas Volker received a bone marrow transplant after his genome sequence indicated he had a mutation on the X chromosome that led to an inherited immune disorder that was giving him multiple problems. With the new diagnosis at hand, Volker was successfully treated, and his severe inflammatory bowel disease alleviated [35]. Richard Gibbs describes using complete genome sequences of twins diagnosed with dopa-responsive dystonia to identify the appropriate treatment option, which eventually resulted in significant clinical improvements of the twins [36].

Conclusion
Bioinformatics mainly deals with four facets of analysis: DNA sequence analysis, protein structure prediction, functional genomics and proteomics, and systems biology. High-throughput sequencing, with its rapidly decreasing costs and increasing applications, is replacing many other research technologies. Nonetheless, significant challenges remain with NGS; these include data processing and storage. Another significant challenge is genome interpretation, which includes not only the analysis of genomes for functional elements, but the understanding of the significance of variants in individual genomes on human phenotypes and disease. All these add to the still-impractical costs of vast sequencing applications in the clinic.

Figure 2B: Sequencing results of LMNA gene; note homozygous mutation in the patients (T base) and heterozygous mutation in the consanguineous parents (G-T bases overlap), compared to wild sequence in control (G base).

The benefits of sequencing applications in the medical clinic look promising, but it will be necessary to develop ways to map sequencing data onto currently difficult-to-map regions, such as highly repetitive and low-expressed regions. Sequencing technology is rapidly improving, but the analytical capabilities needed to understand everything generated by the sequencers are lagging far behind. We need to advance the computational technologies and skills in bioinformatics as we progress towards the systematic use of high-throughput sequencing in research and medicine.