Single-Cell Genome Sequencing for Viral-Host Interactions

Introduction
Current State of Metagenomics
The heterogeneity of microbial communities has historically been ignored in metagenomic studies. Most of our current understanding of the dynamics of natural microbial communities has been derived from studies carried out on bulk populations: generally, millions of cells are collected and analyzed in an ecological survey [1,2]. In the past, such analyses answered questions of the form "What is there?" to address the "dark matter" [3][4][5] of microbial species found in a particular environment and to sample their genetic composition. The data obtained from such population studies are processed using metagenomic tools that are not well suited to answering questions about the individual components of a species. These questions concern topics such as the organization of genes, the evolutionary history of the organisms in the community, and their metabolic exchange repertoire [3,6]. Recently, interest has shifted towards questions of the form "What does this particular part of the community do?" These questions address the immense complexity of network-based interactions in a microbial community and the diversity within a species present in that environment [7][8][9]. The advent of single-cell genomics (SCG) has improved our understanding and our ability to answer these questions by making it possible to compare individual cells. The practical advantages of single-cell techniques were demonstrated by recent applications of SCG to identify copy-number variations in human brain neurons and their consequences for neural cells in pathogenic states [10,11].
The success of SCG approaches resulted from major breakthroughs in two different areas: advances in next-generation sequencing, and new techniques for extracting and isolating biological samples [12][13][14]. SCG provides access to the sequences of all DNA in the analyzed cell, including chromosomal material, plasmids, and pathogens. The genomic material of individual uncultured cells is amplified by techniques such as multiple displacement amplification (MDA), and whole-genome sequencing is then carried out to recover viral genomes for processing [6,12,15,16]. Further analysis of the data obtained from sequencing the cellular components has allowed accurate viral-host pairing without culturing. In parallel with sequencing, new flow cytometry and microfluidic manipulation techniques have allowed single cells to be isolated from most environments [14,17]. Despite these advances, many challenges remain to be addressed before SCG becomes a standard technique for researchers.
The exponential growth in the amount of biological data being generated and curated from single-cell studies will require drastically new measures for data management, analysis and accessibility [6]. High-throughput platforms analyzing whole-genome amplification products of single-cell isolates generate a tremendous amount of data. Interpreting these data quickly and efficiently, so that true genomic variants can be distinguished from background noise, presents a major challenge for bioinformatics [18,19]. Because the amount of genetic material is limited, precise and highly sensitive methods are required to investigate subtle biological heterogeneity and diversity between collected samples [20,21]. In addition, to enable single-cell genome sequencing of uncultured microbial hosts or samples, a number of technical issues critical to the pipeline must first be addressed. These include the removal of background noise, amplification bias and errors, and compatibility with currently available genome sequencing pipelines [2,20,22,23]. Although next-generation sequencing devices provide high-resolution genomic data maps from single cells, the identification of elements unique to viral predation and reprogramming, such as chromosomal variants and alternative transcript processing in viral hosts, is still in its infancy [17,24,25,26] (Figure 1).

Viral Host Responses
Host response-screening of a single cell
Sequencing single cells was not possible until very recently because most bacteria contain a minuscule amount of genetic material that could not be properly extracted and processed [1,2,6]. The first breakthrough was the development of shotgun sequencing of DNA extracted from environmental samples, which broadened the range of environments that could be sampled [6]. The second major development was the introduction of multiple displacement amplification (MDA), which allowed high-resolution amplification of the extracted genome, followed by reconstruction of the target genome through bioinformatics algorithms [6,12,13,14,27]. Single-cell amplification grew out of phi-29 (Φ29) DNA polymerase-based replication of small circular DNA templates [6,28]. The polymerase-based rolling-circle amplification allowed hyper-branching of the DNA strands, which in turn allowed thousands of copies to be generated. Early efforts suffered from considerable amplification bias, in which random regions of the genome became under-represented in the resulting genome [6]. However, with gradually improving protocols, computational corrections, and cross-verification against curated databases, recent studies are obtaining better results with reduced bias [5,28,29].
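The amplification bias described above is often summarized by how unevenly reads cover the genome. The following is a minimal Python sketch of one such summary, not a method taken from the cited studies: the window size, the 10% cutoff, the function names, and the depth values are all hypothetical choices made for illustration.

```python
from statistics import mean, pstdev

def coverage_per_window(depths, window):
    """Average per-base read depth over consecutive, non-overlapping windows."""
    return [mean(depths[i:i + window]) for i in range(0, len(depths), window)]

def coefficient_of_variation(values):
    """Population standard deviation divided by the mean (0 means perfectly even)."""
    mu = mean(values)
    return pstdev(values) / mu if mu else float("inf")

def under_represented(values, fraction=0.1):
    """Indices of windows whose depth falls below `fraction` of the mean depth."""
    cutoff = fraction * mean(values)
    return [i for i, v in enumerate(values) if v < cutoff]

# Hypothetical per-base depths: an evenly covered library versus a skewed,
# MDA-like library in which some regions are nearly absent.
even = [30] * 12
skewed = [90, 90, 90, 0, 0, 1, 60, 60, 60, 3, 3, 3]

even_windows = coverage_per_window(even, 3)
skewed_windows = coverage_per_window(skewed, 3)

print(coefficient_of_variation(even_windows))    # evenness of the uniform library
print(coefficient_of_variation(skewed_windows))  # evenness of the skewed library
print(under_represented(skewed_windows))         # windows that were barely amplified
```

A higher coefficient of variation indicates a more uneven library; in practice the depths would come from aligning reads to a reference or to an assembly rather than from a hand-written list.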
The assembly of the target genome from the amplified genetic material remains a pertinent challenge [4,9]. In most metagenomic studies performed on single-cell genomes, it has been very difficult to assemble the genome of any individual species except for the most abundant ones [3,29,30]. The accurate assemblies that have been obtained from such samples were generated as consensus genomes from multiple fragments [31,32]. The genome-recovery process achieves varying levels of success for different species, which might result from a higher GC content or from restricted access to the genome caused by DNA-bound proteins [5,16,25].
To better understand virus-host interactions in uncultured viruses and to find biologically relevant genomic variants, accurate de novo sequence reconstruction of the viral genome is critical [12,13]. With improved genome-recovery and reconstruction techniques in the near future, we will also gain the ability to assign viral hosts precisely despite the background noise and to predict virus-related genomic elements with greater accuracy and efficiency [5,7,11,23,33].

Host-invasion record: viral-host interactions in the transcriptome
Pathogenicity involves multiple pathways and molecular mediators that cause long-term changes in the genomics of the host; some of these changes have recently been observed in single-cell studies. Once the genetic material from a sample has been extracted and amplified, one approach for quantifying the transcriptional complexity of gene expression in single cells is RNA sequencing, which yields a whole-transcriptome analysis [8,34]. The transcriptome extracted from a single cell reflects the internal composition of the cell, the changes in gene expression in response to stimuli and, more importantly, a record of pathogenic invasion [8,24,25,35]. Viral markers can thus be identified from the transcripts made accessible by host-genome amplification [8]. Single-cell transcriptomics has revealed complex patterns of heterogeneity within sub-populations of cells through classification with computational techniques such as k-means clustering, which have already shown multimodal expression in some studies [7,36]. Currently, new viral diagnostic patterns are being investigated by uncovering similar multimodal expression peaks [34,36]. Recent studies have shown new and promising applications of transcriptome analysis in the context of viral predation [8,17,34]. Gene-element integration during lysogenic conversion in eukaryotic hosts, for instance, can be identified from transcriptome instability patterns in the differential exon expression of the transcripts [8,9].
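To make the clustering step concrete, the sketch below applies k-means (Lloyd's algorithm) to the expression of a single gene across cells and separates a bimodal population into two groups. It is an illustrative toy under stated assumptions, not the procedure of the cited studies: the function name `kmeans_1d`, the deterministic initialization, and the expression values are hypothetical, and real analyses cluster cells in many dimensions, usually after normalization.

```python
def kmeans_1d(values, k=2, iterations=50):
    """Lloyd's algorithm on scalar values; returns (centroids, labels)."""
    ordered = sorted(values)
    # Deterministic start: spread the initial centroids across the sorted data.
    centroids = [ordered[(len(ordered) - 1) * i // max(k - 1, 1)] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: each cell joins its nearest centroid.
        labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            updated.append(sum(members) / len(members) if members else centroids[c])
        if updated == centroids:
            break  # converged: no centroid moved
        centroids = updated
    return centroids, labels

# Hypothetical log-scale expression of one gene in eight cells:
# four cells barely express it, four express it strongly.
expression = [0.1, 0.3, 0.2, 5.1, 4.9, 0.0, 5.3, 4.7]
centroids, labels = kmeans_1d(expression, k=2)
print(centroids)
print(labels)
```

Two well-separated centroids with a balanced split of labels is one simple indication of bimodal expression; whether that split is biologically meaningful still has to be judged against technical noise such as dropout.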
The transcriptome represents only a small fraction of the total genomic sequence; RNA-seq can therefore be used to gain early insights into relational phylogenetics, early versus late gene transcription, or viral reconstruction of the host phenotypic response. For instance, Sen et al. [46] used SPAdes for clustering and construction of basal phenotype maps to understand how the Varicella-zoster virus alters tonsil T-cells. Furthermore, SPAdes is compatible with several commercially available sequencing platforms, such as Ion Torrent and Illumina, which makes it easy to integrate into existing pipelines [37,38,47]. More broadly, single-cell cytometry in combination with bioinformatics tools like SPAdes can be used to generate new hypotheses on the reprogramming of host cells by intracellular parasites such as viruses [2,4]. In addition, the data can provide insight into local alterations in the genome of the differentiated cells, as well as how those alterations differ from the changes happening in the microbial community to support pathogenesis [2,4]. The molecular mediators involved in such processes are part of a complex interaction network, and researchers have started to curate those networks as interactome databases [10,11].
Karr et al. [29] proposed a comprehensive computational model to understand how molecular interactions result in complex phenotypic responses for a cell. The whole-cell computational model is based on fundamental principles of signal transduction and curated components that integrate numerous biological pathways and cellular processes. Moreover, the simulation contains programmable modules which can be used to gain granular control over cell physiology in a manner that is not possible in a physical cell. The compartmentalization of the whole-cell model is being used to gain new insights on protein-DNA associations and biological discovery of new cellular mechanisms.
Through the whole-cell simulation, Karr et al. hypothesized that a cell-cycle control mechanism emerges from the synchronous activity of the independent simulation modules that regulate various cellular behaviors. Similarly, the metabolome, the transcriptome, the genome and the proteome can lead to the emergence of the interactions necessary for viral integration into the host as a "super" module. In the near future, this would allow researchers to explore the capabilities of the modules in new directions to facilitate the understanding of a pathogenic episode.

Interactions at large (Viral-host Interactome)
The application of single-cell genomics to viruses has only recently started to gain prominence, but considerable progress is being made towards one of the broad goals of SCG: the compilation of vast protein-interaction networks between viruses and their hosts in the microbial community [34,41]. The rationale behind this goal is to elucidate the alignment [17,34,37]. There are two approaches to converting RNA-seq data into transcript sequences: the first uses previously established model organisms and model genomes, and the second is de novo assembly [5]. The first approach has become standard for model organisms; however, it does not perform well for non-model organisms, and this is especially true for viruses, where model systems for viral-host interactions are limited [5,28]. One solution to this problem has recently emerged in the form of the Trinity platform [27]. De novo assembly of RNA-seq data enables researchers to study transcriptomes without the need for complete genome coverage from the amplified fragments. The Trinity platform also allows de novo transcriptome assembly from RNA-seq data for non-model organisms [27,30]. It represents a unique opportunity for bioinformatics tools in single-cell genomics as a full-stack toolset for transcriptome analysis that provides information complementary to metagenomic studies [4,15]. Assembly and reconstruction of the whole genome is still a very expensive endeavor; however, advances in nanofluidic and lab-on-chip approaches in the near future will make single-cell experiments more affordable and accurate [13,26,30].

Modeling host responses to viral invasion
The bacterial genome stores a memory of recent pathogenic invasions [21]. The identification of viral markers can be critical in reconstructing that response, along with the key host-viral interactions taking place during the process. Single-cell techniques can provide access to that level of detail from an infected cell, and the data obtained can in turn be used to create simulations or models of the host response. Advanced simulation methods such as whole-cell simulation [29] are computationally very demanding and still only produce results in a limited context. However, next-generation sequencing technologies focused on single-cell techniques, supported by bioinformatics algorithms, have greatly reduced the cost and increased the effectiveness of single-cell sequencing, priming it to tackle the problems of pathogenic-response reconstruction [3,6,16]. Algorithmic developments have also relaxed the earlier assumption of uniform genome coverage, so that assemblers can handle non-uniform coverage and account for chimeric DNA segments accumulated during the MDA reaction [3,33].
A versatile algorithmic tool known as SPAdes uses k-mers to create a de Bruijn graph, on which it performs graph-oriented operations based on factors such as sequence size and graph shape [4]. With an increased coverage, SPAdes has been utilized in recent studies. A recent study exemplifies this process by applying genome-wide screens to influenza virus to scan for the host factors that are required for viral replication [37]. The study identified 1,000 host factors in the virus-host interactome that are less likely to mutate under selective pressure from drugs and can therefore be used in a hit-to-lead progression for drug development [39]. Similar studies using single-cell techniques have identified biologically relevant gene variants from interactome analysis and highlighted the clinical relevance of host-targeting drugs [39,40,42]. Host-targeting drugs offer a reduced-risk approach because resistance takes longer to develop against them. The interactome is being unveiled and extended to clinically relevant viruses through two approaches: a direct approach involving genetic screens and an indirect approach involving single-cell genomics [37,39,40]. The direct approach is limited by its reliance on well-established model systems for virus-host pairs [42]. The indirect approach bypasses this limitation: genetic material is extracted from individual cells, and the extracted genome is then hybridized with the viral genome. This allows the positively hybridized viruses to be selected and sequenced for assignment to the correct host without ever culturing the virus-host system [28]. This methodology was adopted by Martinez-Garcia et al. [5,28] in the general sense to identify viruses from the 'microbial dark matter', which comprises the uncultured environmental viruses.
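The k-mer-based de Bruijn graph construction mentioned at the start of the preceding paragraph can be illustrated with a toy example. The Python sketch below is not SPAdes and omits what makes a real assembler work on single-cell data (error correction, multiple k-mer sizes, handling of uneven coverage and chimeric reads); the function names, the reads, and the choice of k = 4 are hypothetical.

```python
from collections import defaultdict

def kmers(sequence, k):
    """All overlapping substrings of length k, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

def de_bruijn(reads, k):
    """Nodes are (k-1)-mers; each distinct k-mer adds one edge prefix -> suffix."""
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def walk_unambiguous(graph, start):
    """Extend a contig from `start` while each node has exactly one successor."""
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (node,) = graph[node]
        if node in seen:
            break  # stop on a cycle instead of looping forever
        seen.add(node)
        contig += node[-1]
    return contig

# Three hypothetical overlapping reads drawn from the sequence ATGGCGTGCA.
reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
graph = de_bruijn(reads, k=4)
print(walk_unambiguous(graph, "ATG"))
```

Because overlapping reads contribute the same k-mers, the graph merges them automatically; branching nodes, which this walk stops at, are where repeats, sequencing errors, and chimeric fragments show up in practice.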
Similar indirect approaches have been used in the past to study the transmission and mutation of human immunodeficiency virus type 1 (HIV-1). Salazar-Gonzalez et al. [43] used single-cell amplification to obtain intact HIV-1 envelope units from virion RNA and to demonstrate evidence for early mutation and diversification of the viral envelope. In this case, conventional PCR-based techniques using Taq polymerase generated more noise from events such as recombination and nucleotide misincorporation [6]. To reduce the noise, single-cell techniques such as amplification methods were applied to generate a neighbor-joining tree, which was used to estimate the timing of early mutation. The estimate was derived from the most recent common ancestor (MRCA) of the mutated viral strains, which was obtained from the neighbor-joining tree, and ranged from 10 to 31 days [43]. The versatility of single-cell techniques in augmenting metagenomic studies has become even more apparent in the case of virus-host systems, and as the interactome expands, new indirect techniques will allow complex networks to be generated that provide clinical insights [42,44]. Most single-cell genomic techniques have been applied to identify patterns in the genetic code; however, in the near future it will become critical to also analyze the modifiers of the genetic code [15]. In eukaryotic systems, a deep understanding of modifiers such as histones and the code related to them, the so-called histone code, will pave the way for complex system-level insights into the biology of viral-host interactions (Figure 3).

Viral-Host Interactions in Eukaryotic Cells
The dynamics of viral-host interactions in eukaryotic host cells are collectively more complex and rich. Eukaryotic single-cell amplifications are more complex because they involve a greater number of regulatory elements and a sharply different genome organization, marked by the presence of histones and nucleosomes [45][46][47]. Histone-tail modifications have well-defined roles in gene and chromatin regulation; however, their long-term impact on cellular physiology during a pathogenic episode remains largely unclear [48]. Alongside the genetic code, the recently discovered histone code is also known to play a major role in gene regulation in eukaryotic cells.
Current single-cell sequencing studies have shown promise in distinguishing complex patterns of variance in the sequenced cells; however, interpreting the epigenomic code of a cell remains in its infancy [23,49]. The complexity of epigenetic regulation is immense, and defining an underlying code to predict its function in parallel with transcriptome analysis will remain of great interest in the near future [15]. Interestingly, protocols for detecting methylation status in single-cell analysis are becoming more sensitive and are helping to explore the 'methylome' landscapes of individual cells during pathological processes such as tumorigenesis [45,48].
Detection of viral-host modifications to the epigenetic code is only beginning to be understood; however, single-cell techniques can replace traditional methodologies to supplement metagenomic and clinical studies. Furthermore, in the near future, existing single-cell techniques could be applied to search for common overexpression changes such as hypermethylation [46,49]. For instance, a recent study showed that hypermethylation of Somatostatin receptor-1 in the CpG islands assists in the progression of Epstein-Barr virus infection to gastric cancer [49,50]. Such changes in hypermethylation states are predictive of known pathogenic conditions. In the future, hypermethylation modifications could become visible through the reconstruction process and also highlight the changes common to a viral community thriving inside a patient [49]. As single-cell techniques evolve to provide epigenetic information, a database of the epigenetic-code interactome, parallel to the protein databases, will emerge and provide a complete picture of the connections between the transcript and the genetic modifications in the nucleosome [23,45].

Final Remarks
In this review, we discussed the use of single-cell genomic techniques to investigate viral-host relationships and the considerable challenges that stand in the path of achieving that goal. We reviewed the advancements in next-generation sequencing and the bioinformatics algorithms that have allowed reduced noise and better coverage. Next, we discussed the role of the transcriptome constructed from RNA-seq data, which allows quick and early diagnostics. Lastly, we described the application of single-cell techniques to creating the interactome and what the future holds for deciphering the histone code. Presently, there are several limitations to the practical application of single-cell techniques with regard to extraction, amplification, and data analysis [51][52][53][54][55]. Nanofluidic manipulation and fluorescence-activated cell sorting (FACS) approaches, required for the isolation of individual cells and the subsequent extraction of their genomic contents, have had varying amounts of success with different environmental samples. Several groups [13,26,30] are working on improving those results to give reasonable success with most samples obtained. In addition, the multiple displacement amplification (MDA) reaction required to amplify the extracted genome suffers from bias, wherein certain regions are amplified more often than others. Finally, the amount of data generated from whole-genome sequencing of even a single cell is enormous. One pertinent challenge in the area is to reduce the background noise and focus on novel expression targets; new bioinformatics algorithms will improve detection hit rates.
Single-cell genomic techniques have reduced the need for established and cultured model systems to study microbial communities. The interactions among viruses and their hosts have far-reaching consequences for host genetics and biochemistry. Single-cell genomics is also being used in drug-discovery pipelines along with advanced metagenomics and bioinformatics, enabling fast identification of relevant sequences and "genome islands" [4,39]. The discovery of new drug targets through viral-host interactions follows a hit-to-lead progression. Next-generation sequencing technologies have made the study of genomics much more integrative: high-throughput sequencing provides massive amounts of data, which are analyzed by bioinformatics algorithms to identify potential hits [56][57][58][59][60]. The hits are processed to generate biologically relevant leads, which are then understood in the larger context of -omics data such as the interactome for viral-host interactions discussed here. In addition, advances in metagenomics are creating new tools that supplement standard techniques and provide additional information about the complexity emerging from a glimpse of the full landscape.