A Review on New Horizons of Bioinformatics in Next Generation Sequencing, Viral and Cancer Genomics



Introduction
In the last decade, most biological research revolved around molecular biology and genomics. Within genomics, a major share of the work focused on comparative genomics [1][2][3][4] and genome sequencing [5][6][7][8][9]. After the Human Genome Project was completed in 2003, the number of genome sequencing projects rose sharply. This growth was driven by a belief that every complication affecting humans or any other organism is related in some way to the composition of its genome and to the variation within it.
Genome sequencing techniques in the early era were limited to the Sanger and Maxam-Gilbert methods. Maxam-Gilbert sequencing, which relies on base-specific chemical cleavage of DNA rather than on primer extension, was one of the earliest DNA sequencing methods, with a first short sequence reported in 1973. After several modifications to the early approaches, Sanger, working at the MRC Centre in Cambridge, UK, demonstrated a new method in 1977: DNA sequencing [15][16][17][18][19][20][21][22] with chain-terminating inhibitors, also called the chain termination method. Using this method, scientists from the MRC Centre later produced the first complete genome sequence of Epstein-Barr virus in 1984, which was composed of 172,282 nucleotides. Notably, no prior knowledge of the genetic profile [23][24][25] of this virus was available. Both early methods were expensive and time consuming, and so high-throughput sequencing methods were taken into consideration [9][10][11][12][13][14].

Sequencing data and management
After small viral genomes had been sequenced, attention turned to large sequencing projects. Even the sequencing data for a small virus were sizeable, so data handling became an issue in conducting this research. This is where bioinformatics tools [26][27][28][29][30] helped with data management. Integrating biological data with SQL and general-purpose programming languages supported the compilation and analysis of the available data. Data handling was divided into two tasks: building the database as tables of rows and columns, and managing the data stored in it.
SQL was a considerable advantage for bioinformatics because it is easy to use and its commands are less complex than those of other programming languages. For that reason it became popular among biological researchers and easy for them to handle. Data management consisted largely of querying, mostly through SELECT statements and views, which supported a range of data mining tasks [31][32][33].
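The two tasks described above, building a table and then querying it through SELECT statements and views, can be sketched with Python's built-in sqlite3 module. This is a minimal illustration only: the table name, column names, accession identifiers and the two "example virus" records are hypothetical, and only the Epstein-Barr virus genome length comes from the text.

```python
import sqlite3

# In-memory database so the example leaves no file behind.
conn = sqlite3.connect(":memory:")

# Task 1: create the database as a table of rows and columns.
# The schema below is a hypothetical one chosen for illustration.
conn.execute(
    "CREATE TABLE genome ("
    "accession TEXT PRIMARY KEY, organism TEXT, length_nt INTEGER)"
)
records = [
    ("demo_001", "Epstein-Barr virus", 172282),  # length quoted in the text
    ("demo_002", "example virus A", 9800),       # made-up record
    ("demo_003", "example virus B", 30100),      # made-up record
]
conn.executemany("INSERT INTO genome VALUES (?, ?, ?)", records)

# Task 2: manage the data by querying, here through a view plus a SELECT.
conn.execute(
    "CREATE VIEW large_genome AS "
    "SELECT organism, length_nt FROM genome WHERE length_nt > 20000"
)
large = conn.execute(
    "SELECT organism, length_nt FROM large_genome ORDER BY length_nt DESC"
).fetchall()

for organism, length_nt in large:
    print(organism, length_nt)
```

The view keeps the filtering rule in one place, so later queries can reuse it without repeating the WHERE clause.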

Impact of NGS technology in virology
Viruses are the most abundant and the smallest biological entities on this planet, and their genomes are comparatively simple to sequence. Although the available data offer an opportunity to study viral diversity and taxonomic hierarchy at various levels, they also pose challenges for the systematic, structured organization of data and for downstream processing. Extensive computational analyses using a range of algorithms and programs have opened exciting opportunities for virus discovery and diagnostics, in which bioinformatics has played a key role. Molecular analysis of viruses using NGS data has transformed virology. The main aim of bioinformatics was to analyze the relationships among sequence, structure and function, but it eventually also gave rise to new areas of research such as phyloinformatics and immunoinformatics, which translate raw data into information about evolutionary history and protein interactions [34][35][36][37][38].

Bioinformatics methods for viral genomics
Bioinformatics approaches help to estimate and analyze population diversity by studying genetic recombination, mutation and selection, and thereby assist in correlating genotype with phenotype. Many methods are available, some of which are discussed below [39][40][41][42][43].
Quasispecies reconstruction: Quasispecies reconstruction is the estimation of the number of viral variants in a sample and of their frequencies. Every viral variant in a quasispecies is considered a haplotype. Several tools can be used for this process, including Short Read Assembly into Haplotypes (ShoRAH), the Quasispecies Reconstruction algorithm (QuRe) and QuasiRecomb.
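The final counting step of this process can be sketched as follows. This is not how the tools named above work internally: they must first assemble haplotypes from short, error-prone reads. The sketch assumes that step is already done and that each input string is one full-length haplotype observation; the function name and the toy sequences are hypothetical.

```python
from collections import Counter

def haplotype_frequencies(sequences):
    """Return {haplotype: relative frequency}, most frequent first.

    Assumes every item in `sequences` is an already reconstructed,
    full-length haplotype observation.
    """
    counts = Counter(sequences)
    total = sum(counts.values())
    return {hap: n / total for hap, n in counts.most_common()}

# Toy input: five observations of three distinct variants.
observed = ["ACGT", "ACGT", "ACGA", "ACGT", "TCGA"]
freqs = haplotype_frequencies(observed)

print("number of variants:", len(freqs))
for hap, freq in freqs.items():
    print(hap, round(freq, 2))
```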
Population genetics studies: The genetic structure of a population refers to the number of distinct subpopulations, each identified by a characteristic set of allele frequencies. A population analysis can be performed on available genomic data with the model-based STRUCTURE program, which can infer genetic structure in haploid, diploid and polyploid species as required [44,45]. Simulation studies in population genetics play an important role in helping to understand the impact of various evolutionary and demographic scenarios on sequence variation and sequence patterns, and they also allow investigators to better assess and design analytical methods for the study of disease-associated genetic factors. To facilitate these studies, it is imperative to develop simulators that can accurately generate complex genomic data under various genetic models. A number of efficient simulation software packages for large-scale genomic data are currently available, and new simulation programs with more sophisticated capabilities and features continue to emerge. There are three basic simulation frameworks: coalescent, forward and resampling.
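As an illustration of the forward framework, the sketch below follows one allele forward in time under a Wright-Fisher model. It is deliberately minimal and rests on stated assumptions: a haploid population of constant size, a single biallelic locus, and no mutation or selection. The function and parameter names are hypothetical, and real simulators handle far richer genetic models.

```python
import random

def wright_fisher(pop_size, init_freq, generations, seed=0):
    """Simulate neutral drift of one allele, generation by generation.

    Returns the allele frequency in every generation, starting with
    the initial frequency.
    """
    rng = random.Random(seed)  # fixed seed keeps the run reproducible
    freq = init_freq
    trajectory = [freq]
    for _ in range(generations):
        # Each individual in the next generation inherits the allele with
        # probability equal to its current frequency (binomial sampling).
        copies = sum(1 for _ in range(pop_size) if rng.random() < freq)
        freq = copies / pop_size
        trajectory.append(freq)
    return trajectory

traj = wright_fisher(pop_size=100, init_freq=0.5, generations=50)
print("final allele frequency:", traj[-1])
```

A coalescent simulator would instead work backward in time from a sample to its common ancestors, which is usually faster but less flexible for complex selection scenarios.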
Linkage equilibrium: Linkage equilibrium is the statistical independence of alleles at all loci and is evidence of free recombination. Conversely, linkage disequilibrium is a measure of the correlation between the occurrences of nucleotides at different locations in a complete genome. The extent to which recombination occurs can be estimated by specialized programs such as Linkage Analysis and DNA Sequence Polymorphism (DnaSP).
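For two biallelic loci, the correlation described above is commonly summarized by the coefficient D = p(AB) - p(A)p(B) and by the normalized statistic r^2 = D^2 / (p(A)(1 - p(A))p(B)(1 - p(B))). The sketch below computes both from a list of phased haplotypes. The function name and the toy data are hypothetical, and this is not a description of how the programs named above are implemented.

```python
def linkage_disequilibrium(haplotypes, allele_a, allele_b):
    """Return (D, r_squared) for two loci.

    `haplotypes` is a list of (allele at locus 1, allele at locus 2)
    tuples; `allele_a` and `allele_b` are the reference alleles.
    """
    n = len(haplotypes)
    p_a = sum(1 for h in haplotypes if h[0] == allele_a) / n
    p_b = sum(1 for h in haplotypes if h[1] == allele_b) / n
    p_ab = sum(1 for h in haplotypes if h == (allele_a, allele_b)) / n
    d = p_ab - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    # r^2 is undefined when either locus is monomorphic.
    r_squared = d * d / denom if denom > 0 else float("nan")
    return d, r_squared

# Alleles always travel together: complete disequilibrium.
linked = [("A", "G")] * 5 + [("T", "C")] * 5
# Every combination is equally common: linkage equilibrium.
independent = [("A", "G"), ("A", "C"), ("T", "G"), ("T", "C")]

d_linked, r2_linked = linkage_disequilibrium(linked, "A", "G")
d_indep, r2_indep = linkage_disequilibrium(independent, "A", "G")
print(d_linked, r2_linked)
print(d_indep, r2_indep)
```

A value of r^2 near 0 is consistent with free recombination between the two sites, while a value near 1 indicates that they are almost always inherited together.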
Selection pressure analysis: Selection pressure can be classified as pervasive or episodic. Various statistical methods for the analysis of pervasive and episodic selection are available on the Datamonkey web server of the Hypothesis Testing using Phylogenies (HyPhy) software package.
Phylogenetic analysis for viruses: Whole genome-based phylogenetic trees are widely used for many viruses owing to their small genome sizes and the conservation of their genomic structure. Phylogenomics is gaining popularity, particularly for epidemiological monitoring and disease surveillance. When analysed in the context of spatio-temporal data, it helps to explain how a disease spreads and progresses during sudden outbreaks. Programs such as Bayesian Evolutionary Analysis by Sampling Trees (BEAST) support phylogeography studies and are widely used to study the spatio-temporal dynamics of viruses at population scale. BEAST provides a Bayesian Markov chain Monte Carlo (MCMC) framework for parameter estimation and hypothesis testing of evolutionary models from molecular sequence data. It brings together a large number of evolutionary models in a single coherent framework for evolutionary inference.

Current Challenges of Next Generation Sequencing (NGS)
We have now seen a number of applications of bioinformatics and NGS in ongoing and future research. The most important challenge for NGS and bioinformatics is clinical translation, that is, carrying these results into real medical research. As described by Sandeep Pingle in an Illumina blog post on cancer genomics, there are three major approaches to directly detecting somatic alterations in the cancer genome:
1. Whole genome sequencing
2. Whole exome sequencing
3. RNA sequencing (transcriptome)
A somatic alteration of the cancer genome may be a nucleotide substitution, a copy number variation, an insertion or deletion, or a chromosomal rearrangement. Studies of this kind can not only clarify the pathogenesis of the disease but may also lead to the identification of important biomarkers as future targets for drug development [46,47].
In such studies, one of the challenges may be the quality and quantity of the samples available. This can be overcome by increasing sequencing depth, which can compensate for low sample purity and increased ploidy.
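The link between depth, purity and ploidy can be made concrete with a small calculation. A commonly used approximation gives the expected variant allele fraction of a somatic variant as purity x mutated copies / (purity x tumor copy number + (1 - purity) x normal copy number). The sketch below applies it under simplifying assumptions: a clonal variant, a diploid normal genome, and no sequencing error or mapping bias. The function names and the example values are hypothetical.

```python
def expected_vaf(purity, tumor_copies, mutated_copies=1, normal_copies=2):
    """Expected variant allele fraction in a tumor/normal cell mixture.

    purity: fraction of cells in the sample that are tumor cells.
    tumor_copies: copy number of the locus in tumor cells.
    mutated_copies: how many of those copies carry the variant.
    """
    mutant = purity * mutated_copies
    total = purity * tumor_copies + (1.0 - purity) * normal_copies
    return mutant / total

def expected_variant_reads(depth, purity, tumor_copies, mutated_copies=1):
    """Expected number of reads supporting the variant at a given depth."""
    return depth * expected_vaf(purity, tumor_copies, mutated_copies)

# A sample with 30% tumor cells, at two depths and two tumor copy numbers.
for depth in (30, 200):
    for tumor_copies in (2, 4):
        reads = expected_variant_reads(depth, purity=0.3,
                                       tumor_copies=tumor_copies)
        print(f"depth={depth} copies={tumor_copies} "
              f"expected variant reads={reads:.1f}")
```

Lower purity and higher copy number both shrink the fraction of reads that carry the variant, so more total reads are needed before the variant is supported by enough evidence to be called.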
Most of the data obtained with state-of-the-art sequencers are in the form of short reads [48]. Hence, the analysis and interpretation of these data encounter several challenges, including those associated with base calling, sequence alignment and assembly, and variant calling. These challenges have led to the development of innovative computational tools and bioinformatics approaches to facilitate data analysis and clinical translation [49,50].

NGS and its role in personalized medicine
Next-generation sequencing (NGS) has very high potential to revolutionize personalized medicine and to deepen genetic studies. While recent technological advances in NGS have propelled our knowledge and understanding of genomics forward, several technical challenges remain before the next level of understanding and clinical utility can be reached. These challenges need to be identified and resolved using every available means.

Current improvements and highlights
Almost 600 bioinformatics tools were developed between 2012 and 2014 to address these challenges, and they are being used for data analysis and interpretation. Some of these tools assess the quality of short reads, for example FastQC and htSeqTools. A tool called MuTect can be used to detect somatic point mutations present at low allele fractions. MuSiC is a mutational analysis pipeline that can also help to establish correlations among mutations, genes and pathways. Because it uses clinical information in addition to sequencing data, it is able to differentiate between passenger and driver mutations.
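The kind of check that read quality tools perform can be illustrated with the quality string of a FASTQ record. The sketch below is not the implementation of FastQC or htSeqTools. It assumes the common Phred+33 encoding, in which each character encodes a quality score Q = ord(character) - 33 and the probability of a base-calling error is 10^(-Q/10). The function names, the threshold and the example strings are hypothetical.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into a list of Phred scores."""
    return [ord(ch) - offset for ch in quality_string]

def mean_quality(quality_string):
    """Mean Phred score of one read."""
    scores = phred_scores(quality_string)
    return sum(scores) / len(scores)

def error_probability(phred):
    """Probability that a base with this Phred score was called wrongly."""
    return 10 ** (-phred / 10)

def passes_filter(quality_string, min_mean=20):
    """Keep a read only if its mean quality reaches the threshold."""
    return mean_quality(quality_string) >= min_mean

good_read = "IIII"   # 'I' encodes Q40
mixed_read = "II55"  # '5' encodes Q20
bad_read = "!!!!"    # '!' encodes Q0

for name, q in [("good", good_read), ("mixed", mixed_read), ("bad", bad_read)]:
    print(name, mean_quality(q), passes_filter(q))
```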

Future perspectives
At present, cancer researchers use multiple bioinformatics tools, each meeting specific requirements, because cancer genome data:
1. Need to be analyzed in association with a matched normal genome
2. Contain highly rearranged genomes
3. Show enormous heterogeneity
Still, there is hope that a single-interface tool or software package can be developed to detect all the anomalies in sequence data from somatic cancer cells.

Conclusion
With the support of these data and current developments in bioinformatics tools, we may hope for a better tool for cancer genomics studies, one that can handle both clinical information and next-generation sequencing data.