Bioinformatics for Viral Metagenomics

Detection of the presence of known and unknown viruses in biospecimens is today routinely performed using viral metagenomics. Because the sequencing speed and cost per base is rapidly declining with new next generation sequencing technologies, such as HiSeq (Illumina), 454 GS FLX (Roche), SOLiD (ABI) and Ion Torrent Proton (Life Technologies), the bioinformatics analysis is today a most important and increasingly demanding part of viral metagenomics analysis. In this review, we highlight some of the major challenges and the most commonly adapted bioinformatics tools for viral metagenomics.


Introduction
During past decade we have seen dramatic evolution of next generation sequencing (NGS) instruments like Genome Analyzer/ HiSeqSystem (Illumina), 454 GS FLX (Roche), SOLiD (ABI) and Ion Torrent Proton (Life Technologies). A variety of bench top NGS instruments, e.g. the 454 GS Junior (Roche), MiSeq (Illumina) and Ion Torrent PGM (Life Technologies) are now becoming standard equipment in virological laboratories. The performance of various NGS technologies has been reviewed elsewhere [1,2].
NGS technologies can be used to obtain a comprehensive and unbiased sequencing of the DNA present in a sample, without the requirement of any prior PCR or other amplification that requires prior information about sequences that may be present [3]. The complete sequencing of all microbiological sequences that may be present in a sample is termed metagenomics [4]. Viral metagenomics is nowadays routinely used for virus detection and discovery of new viruses [5][6][7][8][9][10][11][12][13].
As previously pointed out, viral metagenomics has the potential to further our knowledge of the role of viruses in human diseases such as cancer [14]. The last few decades have led to the realization that a considerable proportion of cancers are caused by infections and have also provided epidemiological indications that additional cancerassociated infections may exist [14]. With viral metagenomics, it is possible to perform a large-scale analysis of all infections that are present in cancers and in healthy individuals [14]. Sequencing of cancer specimens with NGS has already been used in the discovery of a new cancer-associated virus, MCV [15]. The Human Microbiome Project (HMP) is one of several international efforts to take advantage of metagenomic analysis and measure microbial diversity in microbiomes from healthy and diseased individuals [16].
Modern NGS technologies are capable of generating billions of bases, at a rapidly decreasing cost per base [1,2]. This increases the demands on the bioinformatics for the analysis of data produced by NGS instruments. In this paper, we review some of the most commonly adapted bioinformatics tools for viral metagenomic analysis, from quality filtering to genome assembly and taxonomic classification.

Bioinformatics Pipeline
The bioinformatics analysis of NGS data for viral metagenomics follows a number of distinct steps, as schematically depicted in Figure  1. This review mostly follows the procedures used in our previous publications [10][11][12][13], but with consideration of alternatives and possible improvements.

Quality checking and filtering
The bioinformatics pipelines to analyze next-generation sequencing data usually start by quality checking. The sequences are trimmed according to their Phred quality scores [17]. Phred quality scores are logarithmically related to the base-calling error probabilities. For example, a Phred quality score of 10 corresponds to a base calling accuracy of 90% (10 errors per 100 bp), while quality score of 20 equals to base calling accuracy of 99% (1 error per 1000 bp) [17]. Specific quality filtering conditions can be adapted for different downstream analyses [18].
Next generation sequencing technologies might produce exact and/or nearly duplicated reads due to PCR amplification, PCR errors and/or sequencing errors [19,20]. Identifying and removing these reads, a process also called de-duplication, can significantly reduce computational resources for downstream analysis and improve assembly. Presence of duplicated reads might also introduce overestimation of species abundance. On the other hand, duplicated reads might also include natural duplicates that by chance originate from the same start from the same genomic position [19,20]. Highly abundant species have a higher chance to have natural duplicates [20] and their removal might introduce bias towards underestimation of abundances [19]. The Cdhit-454 tool identifies and distinguishes artificial and natural duplicates in 454 pyrosequencing datasets [19]. CD-HIT-DUP tool identifies duplicates from single or paired Illumina reads [21].
To obtain the dataset that contains reads of interest, e.g. the virusrelated reads for viral metagenomics, sequences that are not a target of the investigation need to be filtered out. This decreases the risk of misassemblieS [18], and also speeds up downstream analysis. NGS projects directed towards detecting of viral communities, generated from human samples subjected to whole genome amplification (WGA) may contain more than 70% of human-related reads unless there has been prior separation of viral capsids or shorter DNAs from long chromosomal DNA [10] (Table 1) and viral reads typically constitute less than 1% of reads (Table 1). With prior selection for viral nucleic acids, the human and bacterial related reads will still be the most commonly obtained reads, followed by sequences classified as "other" and "unknown" [10,11] (Table 1). Enrichment for viral particles by ultracentrifugation is helpful in the analysis of serum samples (Table  1), but has not been useful in the analysis of biopsies or skin swabs (Table 1). Bacterial sequences and sequences classified as "other" and "unknown" may also be present in negative control samples (water) after NGS sequencing [11] (Table 1), and it is therefore imperative that all metagenomic sequencing projects also include sequencing of negative control samples [11]. The background sequences found in water samples might be present due to the background reactivity of Phi29 polymerase reaction [22] or represent environmental contamination. However, water controls have so far been found to be uniformly negative for viral sequences [11] (Table 1).
To identify possible contaminant sequences as well as sequences that are not of interest, the NGS sequences need to be aligned against reference sequences. Different alignment software's are available for different sequencing platforms. There are hash table based softwares such as SSAHA2 [23] MAQ [24] and BFAST [25] as well as suffix/prefix tries based such as BWA-SW [26], SOAP2 [27] and Bowtie2 [27]. Hash table based algorithms require a large amount of operating memory, whereas suffix/prefix tries requires less computational resources.

Assembly
NGS technologies produce billions of short reads from random locations in the genome by oversampling it. Assembly algorithms, in the process called de novo assembly; reconstruct original genomes present in the sample by merging short genomic fragments into longer contiguous sequences ("contigs"). There are two main types of de novo assembly programs: Overlap/Layout/Consensus (OLC) assemblers, most widely applied to the longer reads such as MIRA and Celera Assembler's CABOG pipeline and de Bruijn Graph Assemblers, most widely applied to the shorter reads such as Euler [28], Velvet (www. ebi.ac.uk), ABySS [29], All Paths [30] and SOAP de novo (http:// soap.genomics.org.cn/). The different assembly algorithms have been reviewed elsewhere [31][32][33].
The possibility always exists that assembly algorithms may construct erroneous "chimeric" sequences by the assembly of 2 different sequences from different organisms or species, a problem that may be particularly relevant for viral metagenomics where the bio-specimens may contain a multitude of related viral sequences. To validate assembly results, we suggest to use several assembly algorithms, as well as to perform a remapping of all singletons reads to assembled contigs [3,10].

Similarity based methods
Taxonomic classification or bining of metagenomic reads can be divided into similarity and non-similarity based methods. One of the most famous similarity-based taxonomic classifications is performed by NCBI BLAST searches where sequences are compared to known genomes. However, a large part of the sequencing reads from de novo sequencing projects are classified as unknown [10,11]. This can result from incompleteness of public sequence databases or drawbacks of NGS technologies such as short read lengths and sequencing errors. Because metagenomes might contain a large amount of sequences that have very distant homologs or even no homologs at all in public databases, we suggest that the use of BLASTn [34] nucleotide searches is suboptimal and that more sensitive algorithms, prone to identify more distant homologs may be preferable. One such possibility is to search against the protein database using BLASTx, or the tBLASTx algorithm, that translates query and reference nucleotide sequences in all six frames and then compares them to each other. Remote protein homologs can also be identified by exploring conserved protein domains using BLAST (such as deltablast [35]) or HMM-based (such  as HMM-FRAME [36]) alignment against Pfam [37], CDD [38] and TIGRFAM [39] databases.
Conventional BLAST-based search tools are extremely time consuming and may take days or even weeks to complete when large metagenomic datasets need to be compared against nucleotide or protein databases. Paracel Blast (Striking Development) is software that helps to save time by executing searches on multiple non-sharedmemory processors simultaneously.
To classify sequences from alignment results several methods have been developed. One of the first and most frequently used is MEGAN [40]. In BLAST searches, sequences might have multiple matches and MEGAN finds the 'Lowest Common Ancestor' node of all matching sequences in the phylogenetic tree, which reduces the risk of false positive matches. However, MEGAN might produce false negative results by discarding sequences if they do not satisfy user-defined cutoffs. Because the size of genome is related to the number of reads in metagenomic samples, MEGAN is suboptimal for quantitative metagenomic analyses. This problem has been addressed by the development of the GAAS (Genome relative Abundance and Average Size) tool [41] that iteratively weights each reference genome for all matching reads and the number of reads is then normalized to the length of their genomes. GRAMMy (Genome Relative Abundance estimates based on Mixture Model theory) [42] is another useful tool that, compared to GAAS models, reads assignment ambiguities, genome size biases and read distributions along the genomes on a unified probabilistic framework [42]. However, both GAAS and GRAMMy estimate similarities from the alignment qualities of the reads to the reference genomes and not from the reference genomes directly. Thus, they are suboptimal in case there are highly similar genomes in the reference databases. The Genome Abundance Similarity Correction (GASiC) considers reference genome similarities to correct the observed abundances estimated via read alignments [43].

Composition based methods
Taxonomic classification methods that explore composition of genome such as GC content, codon or short oligomer (k-mers) usage are called composition-based methods. Their advantage is that they can be used for taxonomic classification of sequences that do not have any homologs or are highly divergent from sequences in public databases. Composition-based methods are computationally faster compared to similarity-based methods. However, they have lower accuracy and are very dependent on sequence length.

Genotype abundances, community diversity and structure
To estimate the number of different genotypes (richness) and their relative abundances (evenness) in a metagenomic sample, simple read counts may introduce biases, because longer genomes have a higher chance to be sequenced [41]. Another problem is that large parts of metagenomic sequences are classified as unknown, most probably because of shortcomings in the similarity-based taxonomic classification methods, which might result in biased diversity estimates.
Microbial community structure and their differences between different metagenomic samples can pinpoint the influences of patterns of microbial communities and among them presence of yet unknown microbes. Viral metagenome diversity and Community structure estimation pipelines mainly consist of generating contig spectra by tools like Circonspect (http://biome.sdsu.edu/circonspect), calculating average genome size by tools like GAAS [54,55], and using these two parameters to estimate biodiversity by PHACCS [56]. Genotype abundances and community diversity is estimated by the number of different genotypes in the sample, defined as richness (alpha diversity) and their relative abundances and distribution, defined as evenness (gamma diversity) among the metagenomic samples [57]. The analysis is based on the assumption that more abundant organisms will have longer and higher coverage contigs whereas less abundant organisms will have many small and low coverage contigs in the sample (alpha diversity) [57,58]. The gamma diversity uses the same assumption but estimates diversities among different metagenomic samples [57,58]. It assembles mixed sequences from metagenomic samples to be compared and the amount of similarity is measured by the degree of overlap (i.e., if fragments from one sample can be assembled with fragments from another sample) between the sequences from different samples. Monte Carlo analyses are then performed to estimate the degree of morphing [58].

Sequential blast analysis
Sequential blast analysis is another technique used to find shared and non-shared sequences between metagenomic samples [59]. If there are more than two metagenomic samples one is chosen randomly and is compared to a second randomly selected metagenome, which is used as a BLASTn database [59]. Applying user-defined cutoffs, the common sequences are identified and used as BLASTn database to be compared with a third randomly chosen database. The procedure continues until all metagenomic samples are compared. The entire pipeline may be repeated several times for different random orderings [59].

Discussion
As NGS technologies continue to develop rapidly, the metagenomics and viral metagenomics fields are expanding rapidly. NGS instruments generate large amounts of data that increase the demand on bioinformatics tools and algorithms. We have reviewed some of the most commonly used bioinformatics tools used to construct bioinformatics pipelines for viral metagenomic analysis.
One of the biggest challenges for bioinformatics analysis is taxonomic classification of NGS data as many of the sequences have no homologs in the public databases or are highly divergent, which is especially true for viral sequences [60]. Taxonomic classification by composition-based methods is in its infancy and very few methods have been developed for viral sequence classification and abundance estimations in metagenomic datasets [60]. MGTAXA (http://mgtaxa. jcvi.org) and Taxy-Pro [53] are particularly useful in this regard. As viruses are underrepresented in current genomic reference databases, accurate and realistic estimation of the proportion of viral DNA in metagenomics is a great challenge [53]. Thus, further development of viral sequence classification and abundance estimations methods is essential.
Sequence quality checking, identification and removal of sequences of no interest as well as artificial duplicates are necessary steps to obtain as realistic datasets as possible that represent the sequences of interest (e.g. virus-related reads for viral metagenomics). This will decrease the risk form is assembly [18] and reduce the computational resources for downstream analysis. Different downstream analyses require different quality filtering methods [18].
Because the field of viral metagenomics is rapidly developing, both regarding the NGS technologies used and the bioinformatics tools applied, comparison of results from different studies is difficult and establishment of open access databases with metagenomics data also faces challenges in international comparability. We think that regular reviews of the best practices in the bioinformatics used in viral metagenomics, their advantages and shortcomings, are essential for the development of this important field.