Received date: May 30, 2015; Accepted date: July 24, 2015; Published date: July 26, 2015
Citation: Wang Y, Cheung AC, Guo JT, Feng B (2015) Genome-wide Massive Sequencing in Embryonic Stem Cell Biology: Recent Insights and Challenges. J Stem Cell Res Ther 5:296. doi:10.4172/2157-7633.1000296
Copyright: © 2015 Wang Y, et al This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Visit for more related articles at Journal of Stem Cell Research & Therapy
The discovery of pluripotent Embryonic Stem Cells (ESCs) in mammals has been vastly transforming stem cell research and regenerative medicine. To understand the molecular mechanism of pluripotency, massive sequencing technologies have been adopted with intense scientific interest due to their advantages, including high resolution, low noise, as well as their extensive coverage across the entire genome. Here we review the principles of genome wide massive sequencing technologies widely performed in ESCs studies, including ChIP-Seq, RNA-Seq and methylCSeq. Recent improvements and applications of these technologies will also be discussed. In addition, a summary of various methodologies used to integrate the massive genome wide sequencing data will be presented. Integrating the massive data that delineate different aspects of ESCs can prompt numerous innovations for understanding the transcription networks in maintaining pluripotency as well as gene regulations and epigenetic modifications in ESCs, which are important for research and clinical applications. Furthermore, we highlight the features that are worthy to pay attention from biologists due to current challenges in massive sequencing data analysis in bioinformatics and biostatistics.
Pluripotent; Embryonic; Stem; Endoderm
Since mouse Embryonic Stem Cells (ESCs) were successfully established and cultured in vitro in the early 1980s [1, 2], research on pluripotent stem cells has become one of the most exciting areas in life sciences. ESCs are derived from the inner cell mass of mammalian blastocysts. They have the ability to indefinitely self-renew while maintaining pluripotency, which means that they can differentiate into all three germ layers (ectoderm, endoderm, and mesoderm). Somatic cells differentiated from ESCs in vitro are shown to possess the morphology and function similar to their counterparts isolated from adult tissues (reviewed in [3-5]). Thus, ESCs have been prospected as a novel source for cell replacement therapies in clinical applications, including organ transplantation and the treatment of debilitating diseases such as diabetes, Parkinson's, and Huntington's disease .
Furthermore, mammalian somatic cells were successfully reprogrammed to ESC-like pluripotent cells, referred to as induced Pluripotent Stem Cells (iPSCs), by Yamanaka and his colleagues in 2006 and 2007 [7,8]. Four pluripotency transcription factors Oct4, Sox2, Klf4, and c-Myc were first used to obtain iPSCs, and subsequent studies have found other factors could also facilitate this reprogramming process. Successful derivations of iPSCs allow us to get access to the pluripotent stem cell without leading to ethnic concern. Patient-specific iPSCs provide a valuable platform for autologous cell therapy and the modelling of human diseases.
The unique properties and unprecedented potential of these pluripotent ESCs have attracted much attention towards its underlying molecular network. Transcription factors and epigenetic modulation complexes specific to pluripotent stem cells have been extensively studied to investigate the molecular regulations of pluripotency in ESCs and iPSCs as illustrated in Figure 1A [9,10].
Emerging massive sequencing technologies, also referred to as next generation sequencing (NGS), have played crucial roles in unraveling genome-wide epigenetic landscapes, DNA binding profiles of transcription factors, as well as transcriptome discoveries. Genomewide NGS datasets provide abundant information with ultra-high resolution (single base pair level) for depicting molecular mechanisms of transcription factor regulations, gene expressions, and epigenetic regulation. So far, three categories of genome-wide deep sequencing technologies have been applied in ESCs research: Chromatin immunoprecipitation followed by deep sequencing (ChIP-Seq), wholegenome RNA sequencing (RNA-Seq) and whole-genome bisulfite sequencing (MethylC-Seq) (Figure 1B). ChIP-seq has become one of the most popular techniques in demonstrating histone modifications and transcription factor (TF)-DNA binding profiles in ESCs studies. ChIP-Seq offers the opportunity for researchers to study gene regulation and epigenetic regulation conveniently due to its advantages including its high resolution, low noise performance and wide coverage [11,12]. RNA-Seq can be applied to quantify gene expression levels in transcriptome-wide levels and determine exon/intron boundaries . DNA methylation, a major epigenetic regulatory mechanism for gene expression and cell differentiation, plays a critical role in functioning and regulating pluripotency networks in ESCs [14,15]. Emerging MethylC-Seq data in ESCs studies provide a new insight into the dynamic nature of DNA methylation and demethylation during cell reprogramming and differentiation, which is fundamental to the knowledge of epigenomics in ESCs. As genome-wide deep sequencing data in ESCs research rapidly expands, it is important and worthy to have a clear insight of genome-wide deep sequencing techniques and their data processing methods for the researchers in the field. In this review, besides addressing the current "state of the art" in ChIP-Seq, RNA-Seq and MethylC-Seq research in ESCs studies, we also discuss the recent progress of technical optimizations and summarize a framework of computational methods and software packages for the processing of giant sequencing data. More importantly, due to the limitations of bioinformatics and biostatistics, we will highlight issues that need attention from biologists in this quickly expanding field.
Chromatin immune-precipitation followed by deep sequencing (ChIP-Seq)
Chromatin immune-precipitation (ChIP) is a technology for assaying histone modifications or protein–DNA binding in vivo [11,16]. In ChIP, antibodies are applied to pull down target proteins or nucleosomes, which bind to specific DNA fragments. Due to rapid development of NGS, CHIP-Seq has been successfully applied since 2007 [11,16]. The DNA fragments pulled down by ChIP are sequenced directly with tens or hundreds of millions of readings and then mapped to genome. The enrichments of DNA fragments on genome pinpoint the binding loci of target proteins. Compared to previous ChIP assay technology (ChIP-chip), ChIP–Seq has advantages such as ultra-high resolution and low noise owing to the single base-pair resolution and high accuracy of NGS. The coverage of ChIP–Seq is “real” genome-wide. Furthermore, the prospect of ChIP–Seq is promising since the cost of NGS is sustainably decreasing and the development of “third or fourth” generation sequencing techniques is with keen anticipation. Currently, ChIP–Seq is widely used for the genome-wide profiling of histone modifications, DNA-binding proteins, and nucleosomes in ESCs studies. More importantly, ChIP–Seq data have been massively mined to analyze direct binding or co-factor effects between transcription factors in pluripotency network in ESCs [16,17].
Sequencing of mRNA for gene expression profiling (RNA–Seq)
Determining the relative abundance of mRNA which reflects gene expression level is a significant topic in cell biology. Since the DNA microarray was developed through hybridization with labeled cDNA probes in 1996 , it has been widely used to detect relative gene expression levels in the past two decades [19-21]. However, the process of nucleic acid hybridization may lead to unavoidable noise . Therefore, the NGS technique, referred to as RNA-Seq, was quickly adopted for genome-wide transcriptome analysis. Compared to microarray technology, RNA-Seq has similar advantages to ChIPSeq, including higher resolution, lower noise and “real” genomewide coverage. More importantly, RNA-Seq opens the gate to study noncoding RNAs, which were hardly covered previously in microarray despite their importance in biological research [23-25]. The principle of conducting RNA-Seq is simple: RNA samples, such as whole transcriptome of cells, are reversely transcribed to construct a cDNA library and then sequenced by NGS platforms. The readings are mapped to the genome and enriched regions will be picked and represented as FPKM (Fragments Per Kilo base of exon per Million fragments mapped) values. In this way, relative gene expression levels and exon/intron boundaries can be discovered. It is worth noting that the quality of RNA sample is very important. The contamination of DNA, ribosome or tRNA should be prevented. At the current stage, RNA-Seq data are cohesively analyzed with ChIP-Seq data on genome map to investigate the molecular mechanisms of transcription factors and their functions of gene regulation in ESCs .
Whole-genome bisulfite sequencing (MethylC-Seq)
DNA methylation and covalent modifications of histone proteins are regarded as epigenetic modifications, and are crucial in controlling transcriptions. Additions of methyl groups to the adenine or cytosine bases of DNA are stable during different states in cell fate process. A bivalent state of DNA demethylation formed by active H3K4 trimethylation (H3K4me3) and repressive H3K27 trimethylation (H3K27me3) [27,28] was identified in ESCs. Genome-scale mapping of DNA methylations and histone modif?ications can be carried out by NGS techniques. MethylC-Seq, also known as bisulphite conversion followed by sequencing (BS–seq), is a technique using bisulphite treatment of DNA . The bisulphite treatment of DNA can lead to deamination of cytosine to uracil, which is subsequently converted to thymine following PCR amplification, whereas methylated cytosines are resistant to deamination and remain as cytosines. Treated DNA samples are analyzed by reading thymidine as indicators of demethylated cytosine positions and cytosine as indicators of methylated cytosine positions. By mapping sequence readings to the genome and calculating the ratio between thymidine and cytosine at base pair resolution, the methylation levels can be compared. MethylC-Seq has been widely applied in ESCs integratively with ChIP-Seq and RNA-Seq data on genome mapping [30-32]. Moreover, genome-scale mapping of DNA methylations plays a critical role in cancer diagnosis and therapies [33-35]. Recently MethylC-Seq technology coupled with ChIP, referred to as BisChIP-seq [36,37], has been developed and performed to investigate cross-talk between chromatin and DNA methylation.
Recent improvements in genome-wide sequencing techniques
Since other recent reviews have covered the details of conventional protocols of ChIP-Seq, RNA-Seq and MethylC-Seq [11,13,29], here we only discuss the protocols of these techniques briefly and focus on the recent progress in technical improvements in terms of four aspects: lowering sample input, increasing throughput, improving accuracy, and reducing costs . Reduction in costs can be optimized by engineering commercial improvements in NGS platforms. However, there are other ways that can make genome-wide sequencing technologies more cost-effective, such as lowering sample input, increasing process efficiency, and improving experimental and computational accuracy. Thus, in this review we focus on possible approaches for lowering sample input, increasing throughput, and improving accuracy.
Approaches in sequencing library construction
It has been shown that heterogeneity of cells populations may exist in biological samples from in vitro derivation of ESCs and reprogramming of iPSCs [30,38]. Such variations may result in different cellular compositions and complicate contaminations of target cell samples. Therefore, only a limited amount of homogeneous cells is obtained in ESCs studies with fluorescence-activated cell sorting (FACS). In RNA-Seq and MethylC-Seq experiments, the sequencing library is constructed from whole transcriptomes and the genome of cell samples respectively. Therefore, the amount of mRNA or DNA samples easily meet the sequencing requirement. In a recent RNA-Seq study, the number of cells used for library construction was reduced to 10 cells . However, in ChIP-Seq, the quality of a library is mainly dependent on the efficiency of immunoprecipitation (IP) antibodies. The steps of DNA purification and fragmentation during library construction apparently result in sample loss. Thus, to obtain a sufficient starting amount of DNA fragments (1–10 ng) following several cycles of PCR, a large number of cells are required in ChIP-Seq experiments. Currently, 107 of homogeneous cells are commonly applied in most of ChIP-Seq experiments in ESCs studies. Therefore, the need to lower sample input and increase throughput is urgent for ChIP-Seq.
To lower the sample input, several attempts have been applied in ChIP-Seq. Native ChIP (N-ChIP), which avoided the formaldehyde crosslinking in library construction, was developed prior to the applications of ChIP-Seq [40,41]. N-ChIP has a higher resolution than cross-linked ChIP (X-ChIP), and lacks unspecific interaction due to formaldehyde cross-linking. N-ChIP is also more sensitive than X-ChIP, making N-ChIP suitable for studies with low cell numbers, such as ESCs. Native ChIP-Seq protocol was developed by Zhao et al. [42,43], and optimized by Lyle et al. . The input number of cells was successfully reduced 200 folds to 100k per IP. However, low cell numbers in Lyle’s group study led to increasing levels of unmapped sequence readings and PCR-generated duplicated readings. Further improvements are needed to overcome this challenge. Furthermore, an ultra-low-input micrococcal nuclease-based native ChIP (ULI-NChIP) sequencing method was developed by Lorincz et al. . In this study, genomewide histone mark H3K27me3 profiles was demonstrated from as few as 1K ESCs. Thereby, high quality libraries from rare cell populations were proved successfully and illustrated by this method. In addition, several other approaches have reduced the number of cell samples to 10K or even 5K in ChIP-Seq [46-49]. However, pre-amplifications of ChIP DNA fragments were required in these experiments before sequencing by in vitro transcription or PCR. These comprehensive pre-amplifications introduced potential bias significantly. To reduce this bias, a small number of 10K cells was successfully used as starting material without pre-amplifications in ChIP-Seq . To further reduce the sample input without pre-amplifications, Huang et al. applied automated microfluidic ChIP technique to obtain the high-quality ChIP-Seq data from only 1K ESCs in 2015 . More importantly, by developing and applying the semi-automatic microfluidic devices in this experiment, the whole ChIP process has been greatly shortened to 8 h. As a result, this protocol shows a great potential to have high throughput in ChIP-Seq applications commercially.
Approaches in processing large data sets
The most impressive feature of NGS is the unprecedented amount of data. Usually, raw data and images are measured in the scale of terabytes per run. Processing such a large amount of raw data from ChIP-Seq, RNA-Seq and methyC-Seq presents a great challenge. Computationally, data analysis performed using reasonable computer time and resources should be of high accuracy. Here we review the data analysis for genome-wide NGS data as a bottom-up process as shown in Figure 1C, which starts with mapping sequence readings to the genome. The recent optimizations of data processing techniques will be discussed. All discussed software packages in this section are illustrated in Table 1.
|Mapping||MAQ||Mapping and Assembly with Quality||Durbin et al., The Wellcome Trust Sanger Institute, UK|||
|Bowtie/Bowtie2||---||Langmead et al., University of Maryland, USA||[54, 55]|
|BWA||Burrows-Wheeler Aligner||Durbin et al., The Wellcome Trust Sanger Institute, UK||[56, 57]|
|Calling Peaks||FindPeaks||---||Fejes et al., BC Cancer Agency, Canada|||
|peakSeq||---||Rozowsky et al., Yale University, USA|||
|cisGenome||---||Wong et al., Stanford University, USA|||
|SiSSRs||Site Identification from Short Sequence Reads||Zhao et al., National Institutes of Health, USA|||
|QuEST||Quantitative Enrichment of Sequence Tags||Sidow et al., Stanford University, USA|||
|MACS||Model-based Analysis of ChIP-Seq||Liu et al., Harvard University, USA|||
|USeq||---||Nix et al., University of Utah, USA|||
|Mapping||TopHat||---||Salzberg et al., The Johns Hopkins University, USA|||
|ERANGE||Enhanced Read Analysis of Gene Expression||Wold et al., California Institute of Technology, USA|||
|QPALMA||Optimal Spliced Alignments of Short Sequence Reads||Rätsch et al., Max Planck Society, Germany|||
|Subread||---||Shi et al., The University of Melbourne, Australia|||
|STAR||Spliced Transcripts Alignment to a Reference||Dobin et al., Cold Spring Harbor Laboratory, USA|||
|HISAT||Hierarchical Indexing for Spliced Alignment of Transcripts||Salzberg et al., The Johns Hopkins University, USA|||
|RPKM/FPKM calculations||IsoInfer||Inference of isoforms||Fenget al., Tongji University, China|||
|Scripture||---||Guttman et al., Massachusetts Institute of Technology, USA|||
|SLIDE||Sparse linear modeling of RNA-Seq data for isoform discovery and abundance estimation||Huang et al., University of California, Berkeley, USA|||
|IsoLasso||Isoforms of Least Absolute Shrinkage and Selection Operator||Li et al.,University of California, Riverside, USA|||
|iReckon||Isoform reconstruction and abundance estimation||Brudno et al.,University of Toronto, Canada|||
|Traph||Transcripts in gRAPHs||Tomescu et al., University of Helsinki, Finland|||
|Cufflinks||---||Pachter et al., University of California, Berkeley, USA|||
|MiTie||Mixed Integer Transcript IdEntification||Behr et al., Sloan-Kettering Institute, USA|||
|StringTie||---||Salzberg et al., The Johns Hopkins University, USA|||
|Mapping and determine of T/C ratio||BSMap||Bisulfite sequence MAPping||Li et al., Baylor College of Medicine, USA|||
|Bismark||---||Krueger et al., The Babraham Institute, UK|||
|BS-Seeker||Precise mapping for bisulfite sequencing||Pellegrini et al., University of California, Los Angeles, USA|||
|MethylCoder||---||Pedersen et al., University of California, Berkeley, USA|||
Table 1: Popular software packages in data processing of genome-wide massively parallel sequencing data.
Mapping: The first step to handling the genome-wide NGS data is mapping the short sequence readings to the genome. The reading lengths of ChIP-Seq, RNA-Seq and MethylC-Seq are 30-50bp, 200-300bp and 200-300bp respectively for high resolution readings. In ESCs studies, the typical mammalian genome sizes are in scale of several gigabytes . Thus, mapping of millions of short readings to a mammalian genome is one of the most intensive steps in the entire process. The mapping of ChIP-Seq and MethylC-Seq data is simpler than RNA-seq data that contains large gaps corresponding to introns that must be considered. Popular aligner software for ChIP-Seq data include Eland of Illumina platform, MAQ , Bowtie/Bowtie2 [54,55], and BWA [56,57]. For MethylC-Seq data mapping, the additional step is to detect the ratio of thymidine/cytosine at methylation sites, which can be achieved by specific aligner software such as BSMap , Bismark , BS-Seeker  and MethylCoder . Aligners for RNA-seq data include TopHat , ERANGE , QPALMA , as well as Subread  and STAR  (for both ChIP-Seq and RNA-Seq data). First generation RNA-seq aligners such as TopHat were based on mapping algorithms for ChIPSeq readings such as Bowtie, and then the addition of splicing steps of transcriptome fragments were handled in loops. However, this method needed enormous amounts of memory and computational time to run. Later on, optimizations of mapping algorithms of RNA-seq were applied in Subread and STAR. Reports on these improved methods describe them to be highly accurate and ultra-fast [65,66]. The only limitation of Subread and STAR is the excessive usage of memory (over 30GB), which makes them impossible for a typical desktop computer to run. Recently, a new RNA-Seq aligner, HISAT , was developed by Salzberg et al. who are also the developers of Bowtie and TopHat. HISAT requires much less memory than previous RNA-Seq aligners while maintaining high accuracy and ultra-fast speed.
It’s worthy to note that it is more challenging to map RNA-Seq readings than ChIP-Seq and MethylC-Seq readings. The challenges in mapping RNA-seq readings are caused by splice junctions, paralogous gene families and pseudogenes. For instance, some readings from one paralog may be mapped to other paralogs or pseudogenes due to sequencing errors, which vary around 1% so far. On the other hand, pseudogenes can be masked when the differences between pseudogenes and encoding genes are greater than the sequencing error, which is expected to improve with the development of new NGS platforms. It is an advantage for RNA-Seq since some of the pseudogenes are hardly masked in traditional RNA experiments .
Peak calling of ChIP-Seq signals: In ChIP-Seq, once alignment is completed, the results of mapped readings can be visualized on genome browsers such as UCSC (https://genome.ucsc.edu) or Ensembl (http:// www.ensembl.org). The visualizations can provide a semi-quantitative observation of informative impressions of enriched regions at a genome loci. However, this visualization cannot quantitatively identify the binding and transcription events or detect the global protein/DNA binding patterns across the entire genome. Therefore, enriched DNA fragments (peaks) at target protein binding locus need to be selected based on statistics. As , the current peak calling software packages for ChIP-Seq signals generally covers following basic steps: (i) detect signal profiles from experimental group, (ii) collect background profiles from control group, (iii) peaks calling criteria, (iv) post-call filtering of artificial peaks and (v) significance ranking of called peaks.
For ChIP-Seq reading signals, additional adjustment will be applied to discriminate the artifacts of single-end readings, which are typically performed nowadays. Single-end sequencing of DNA fragments reads from one of the two strands in the 5’ to 3’ direction, which results in two related distributions besides the expected read upstream and downstream. These “shifts” will be normalized to the standard signal tags input to peak calling criteria. Current peak calling software apply various models to handle the shifts. For instance, in FindPeaks  and peakSeq , shift distances can be input by user; in cisGenome  and SiSSRs , the average of paired tags is applied; in QuEST , shifts are applied locally by cross-correlation. In MACS , the most widely used software package, a global shift is applied by evaluating 1000 high quality pairs on genome wide.
Next, the enriched regions of experimental data will be compared to control data. A region will be identified as a “peak” if the fold enrichment between them is statistically significant. Generally the cutoff of fold enrichment is defined by user. In popular peak calling software such as cisGenome, QuEST, SiSSRs, peakSeq and MACS, control data can be considered as the background signal of peak calling. In recent publications of ChIP-Seq experiments, using control data such as ChIP-Seq signals of protein G or GFP is a popular and confident way of verifying the background. Recently, a novel strategy, KOIN (knockout implemented normalization), was developed to increase signal specificity and reduce noise by knocking-out the target transcription factor as control . Also, fold enrichment of control signals over experimental signals can be used to calculate the False Discovery Rate (FDR) in MACS and QuEST. If control data is not available, Poisson distribution can be applied to calculate the background profiles based on experimental signals. Finally, ChIP-Seqpeaks can be ranked by p-values (the significance of fold enrichment) in most software (cisGenome, QuEST, SiSSRs, MACS), or fold enrichment values if p-values are not provided.
Comparisons of peak calling tools have been investigated before [73,75,76]. Among these tools, MACS have shown remarkably lower FDR, better motif occurrence, and better spatial resolution of FoxA1 and NRSF ChIP-Seq data compared to Peak Finder , Findpeaks and QuEST. On the other hand, in ChIP-Seq analysis of histone modification marker profile of H3K27me3, although FindPeaks, PeakSeq, USeq , and MACS identified various peaks in terms of peak size, number, and position relative to genes, similar conclusions about the effect of H3K27me3 on gene expression were reached. Although the calling of each genome-wide peak was very different in a comparative analysis between 14 peak calling algorithms with extensive ChIP-Seq data of NRSF, FoxA1 and STAT6, the top 1000 and 2000 highest-quality peaks were very stable with high accuracy. Taken altogether, the efficiency of the peak calling tools depends largely on the experimental dataset.
Gene expression calculations of RNA-Seq data: In RNA-Seq experiments, Reads Per Kilobase per Million (RPKM) and Fragments Per Kilobase of exon per Million fragments mapped (FPKM) values are used to represent gene expression levels and quantify the enrichment of RNA fragments located on certain gene exons. Software packages which can be used to calculate the RPKM and FPKM include IsoInfer , Scripture , Slide , IsoLasso , iReckon , Traph  and Cufflinks  and MiTie . Cufflinks had been widely used in past ESCs research. However, it requires massive computational memory and time. Recently, a software running on less memory and time known as StringTie  was developed by Salzberg et al. who previously developed Bowtie, TopHat and HISAT. A comparison of Cufflinks, Traph, Scripture, IsoLasso and StringTie  showed that StringTie performs significantly better on transcriptome assemblies in both simulated and experimental data. Generally StringTie recognizes 40% more mRNA assemblies than Cufflinks, despite only needing 10% of Cufflink’s total computational time.
In summary, recent optimizations of algorithm in the processing of genome-wide NGS data have improved accuracy with reduced memory and computing time.
As discussed above, increasing genome-wide sequencing data are available in ESCs studies. In addition to the generation of genome-wide data in genomics, epigenomics and transcriptomics as discussed above, we also witnessed the rapid increase of proteomics data from ESCs studies . While insights can be provided from each data source, an integrative analysis of multiple data systems can offer a holistic view of gene functions. Data integration of ChIP-Seq, RNA-Seq and MethylC-Seq data in ESCs can provide valuable information about a) annotating functional features of the genome; b) inferring the functions of genetic variants; and c) understanding the mechanisms of gene regulation . However, the large amount of NGS data from diverse technology platforms presents challenges in integrated data analysis. Better strategies need to be developed and optimized in data access and processing.
Tips for integrative analysis of genome-wide NGS data
The key to integrate ChIP-Seq, RNA-Seq and MethylC-Seq data is to reduce data complexity. By calling peaks, data complexity of ChIP-Seq is reduced from tens of millions of sequence readings to only thousands of peaks that encompass the histone modification or transcription binding sites on a genome. Similarly in RNA-Seq data analysis, FPKM calculations and rankings are carried out, significantly expressed genes are ranked and mapped on the genome locus. In MethylC-Seq, the data complexity is reduced through identifying differentially methylated regions (DMRs). After reducing complexity, sequence readings of ChIPSeq, RNA-Seq and MethylC-Seq data are reorganized as thousands of marked regions on the genome. The genomic annotations of these regions can be further analyzed by clustering, principal component analysis (PCA), gene ontology as well as other approaches (Figure 1D).
Figure 1: Overview of genome-wide massive sequencing in pluripotency studies. (A) Histone modifications and transcription regulations in ESCs. Histone modifications cause chromosome conformational arrangement. Transcription factors bind to promotor and enhancer regions. Histone modifications and transcription factors regulate gene expression co-orperatively. (B) Library constructions for genome wide massive sequencing. Genomic DNA are fragmented in ChIP-Seq, and bisulfite-treated in MethylC-Seq. Short DNA fragments which bind to target protein are pulled down by ChIP. In RNA-Seq, mRNA transcriptome are converted to cDNA library. Millions of short DNA readings in ChIP-Seq, RNA-Seq or MethylC-Seq library are sequenced in NGS platform, and are further analyzed using bioinformatics tools. (C) Workflow of bioinformatics processing; DMR: differentially methylated regions. (D) Integrative analysis of genome wide sequencing data. ChIP-Seq, RNA-Seq and MethylC-Seq data can be integrated at a single locus, and further analyzed systematically genome-wide by unsupervised hierarchical clustering, PCA, and gene ontology.
Unsupervised hierarchical clustering and PCA analysis
Although the data complexity can be reduced in terms of gene annotation, it is still hard to represent the biological importance. To integrate high-throughput data, especially with multiple sample groups, the more scalable way is to apply unsupervised clustering approaches. Clustering is an efficient tool for partitioning a large data set into subsets based on their similarity. Unsupervised approaches do not use any prior knowledge of the samples. Unsupervised hierarchical clustering has been widely used in gene-expression profiles such as microarray and RNA-seq data. In ESCs studies, the gene expression profiles were commonly compared by hierarchical clustering. For example, gene expression values are calculated in various cell types or conditions, and clustering identifies sets of co-expressed genes. In hierarchical clustering, relationships among objects are represented by a tree (also referred to as dendogram) with similar objects being grouped into “clusters”. The advantage of hierarchical methods over non-hierarchical clustering methods is that the relationships found between or within clusters can be visualized directly.
One important step in hierarchical clustering is the way to measure the similarity or distance between any two objects or clusters. The pairwise distance can be calculated as Euclidean distance which focuses on the absolute expression value, or Pearson correlation coefficient and Spearman correlation coefficient which rely on the expression profile shape. There are three major methods for calculating the distance/ similarity between any two clusters. Single linkage method defines the distance as the smallest distance of all pairwise distances between members of the two clusters. Complete linkage method calculates the maximum distance of all pairwise distances between members of the two clusters. Average linkage computes the average distance of all pairwise distances between two clusters.
PCA  is an alternative way to cluster information among samples. As a statistical technique for determining key features of a high dimensional dataset, PCA can simplify data complexity and help to visualize high dimensional data . The goal of PCA is to reduce high dimensional data to a few sets (usually two or three) of new orthogonal variables called Principal Components (PCs) that capture most of the variances in the original data set. Whereas the last few PCs are often assumed to capture only the residual ‘noise’ in the data . In ESCs studies, PCA can be applied to clarify the different groups of transcription factors in various expression and regulation levels [92,93].
A number of annotated gene sets can be obtained by the integration of ChIP-Seq, RNA-Seq and MethylC-Seq data. The biological roles of these gene sets, including cellular components, molecular function and biological processes, can be mapped by gene ontology analysis. For instance, over-expressed gene sets in a certain condition can be selected to investigate the pathways involved. GO enrichment analysis highly depends on the GO database when carrying out the cross-relationship test. Briefly, the principle of GO enrichment analysis is testing sample frequency and background frequency. Sample frequency is the ratio between the number of genes in a certain GO term and total number of genes in the sample. Background frequency is the ratio between the number of genes in this GO term in the database and total gene number of whole database . The P-value of sample frequency vs. background frequency obtained from a statistical test, such as Fisher's exact or chisquare, represents the significance of this GO term in the sample.
In ESCs studies, DAVID [95,96] and PANTHER  are online tools that have been most widely used for GO analysis. As discussed above, the principle of GO test is quite straightforward. The accuracy of GO analysis highly depends on the annotations of genes in GO database. The number of functional annotation tools and the knowledge bases of human, mouse and rat genes in GO terms make DAVID and PANTHER efficient tools in ESC studies.
Protein–nucleotide binding motif discovery: Bayesian approach and hidden Markov model
Besides clustering analysis of gene expression profiles and GO enrichment analysis of cellular components, molecular functions, and biological processes as discussed above, the binding motif between protein and DNA (or RNA) on genome-scale is also an important topic in ESCs studies. Predictions of transcription factor binding sites (TFBSs) and motifs are crucial for biologists to understand gene regulation and pluripotent network. TFs bind to DNA in a sequence specific manner. However, TFBSs are generally short and have sequence variations at various positions, which make it a challenging task for predicting TFBS on a genome-wide scale. NGS genome-wide data, especially ChIP-Seq, provide extensive and high resolution information for TFBS data mining and bioinformatics prediction. Bayesian approaches , especially the hidden Markov models (HMM) , are applied in the investigations of protein-nucleotide interaction patterns. Bayesian approach and HMM are machine learning methods and are powerful to recognize the pattern from large data sets. In the modeling of DNA sequences, two sets of probabilities are needed in an HMM model. One is the hidden states emitting probabilities of nucleotides, and the other is the state to state transition probabilities .
Recently, a series of bioinformatics algorithms that identify proteinnucleoid binding motifs using genome-wide NGS data have been reported. In 2009, Choi et al. applied a hierarchical hidden Markov model to analyze simulation data and ChIP data of NRSF and CTCT. They demonstrated an improved TFBS identification with integrative resource usages over single data sources or a simple combination of two . Furthermore, efforts to improve and optimize TF binding motifs by Bayesian approaches had been made [100,102-107]. Bayesian mixture models were also used to perform the integrative analysis of ChIP-Seq and RNA-Seq data in TF binding motif and expression levels analysis . Twelve ECS-specific transcription factors were identified using the web-based TFBS predictor RSAT . RSAT, which can process several thousand peaks within minutes using less memory, makes it easy for new bioinformatics users. Other online databases of TFBS motif including TRANSFEC  and JASPER  allow biologists to conveniently access the TFBS information as well. Besides TF binding motifs, recently HMM was applied to identify RNA sequence motifs of RBM10 binding in 2014 .
Recent applications of genome-wide deep sequencing data in mammalian ESCs studies
Transcription factors discovery
Accumulating discoveries about the functions of transcription factors in ESCs have been reviewed in greater details by Ng and Surani  and Lee et al. . Most of the studies were done using ChIPSeq technology. For instance, Chen et al. performed an integrative analysis of ChIP-Seq data in mouse and identified an ‘Oct4-centric’ module of core pluripotency factors . Oct4, Sox2 and Nanog as well as Smad1, Stat3 and Tcf3 are the downstream effectors of signaling pathways controlled by BMP, LIF and Wnt respectively . Oct4, Sox2 and Nanog form the primary network that governs the robust pluripotent state by binding to their own promoters and forming an auto-regulatory circuitry. They play critical roles in controlling the self-renewal and differentiation of ESCs (reviewed in [114,115]). The associated transcription factors linked to the ‘Oct4-centric’ module identified by ChIP-Seq include Esrrb, Nr5a2, Tcfcp2l1 and Klf4 [17,116]. Besides, a second ‘Myc-centric’ module, also identified by ChIP-Seq, includes c-Myc, n-Myc, E2f1, Zfx, Rex1 and Ronin, which target the genes involved in protein metabolism [17, 117-119]. ‘Oct4- centric’ and ‘Myc-centric’ modules have been reported to auto-regulate their own expression and therefore be the central pluripotent network [9,10]. Additionally, other TFs have been identified by ChIP-seq, which is associated with the core pluripotent network including Prdm14 , SetDB1 , Chd7 , p300 , esBAF , E2F4 , Smad2/3/4, Foxh1 , YY1 , Mediator (Med1 and Med12) and Cohesin (Smc1a, Smc3) , PCL2 , Mbd3 , KAP1 and ZNF486 , GATA1 , BRD2/3/4 , KAP1 , TEAD4 , FOXO3 , p53 , NUP98 , FOXA1/2  and Tbx3 . Recently, a systematic analysis of 38 transcription factors with extensive ChIP-Seq data across the differentiation of hESCs in three germ layers were reported . More importantly, co-occupied transcription factors and their binding sites can be revealed with high resolution by studying the overlapping peaks of ChIP-Seq data. For example, so far at least 14 transcription factors have been found to bind at the enhancer region of Oct4  and eleven of them were identified by ChIP-Seq (Oct4, Sox2, Nanog , Stat3, Smad, Esrrb, Klf4, Tcf3, E2f1, n-Myc and Zfx) [17,113,126,142]. Similarly, at least nine transcription factors have been shown to co-occupy the enhancer region of Nanog gene  and five of them were identified by ChIP-Seq (Nanog, E2f1, Esrrb, Stat3 and Tcfcp2l1) .
Genome-wide NGS datasets can also be used to directly compare the genomic binding sites between species. Taking endogenous retroviral sequence as an example, human OCT4 and NANOG bind to humanspecific ERV1 (endogenous retroviral sequence 1)-repeat transposable elements, whereas mouse Oct4 and Nanog bind to murine-specific ERVK (endogenous retrovirus K)-repeat elements [143,144]. These comparative studies provide valuable information about sequence conservations of TF binding sites between species and demonstrate the diversity of the transcriptional circuitries.
Histone modification and DNA methylation of epigenetics
Histone modifications have been proposed to be essential for the maintenance of pluripotency of ESCs. ChIP-Seq is widely used in ESCs studies to probe histone modifications. It has been shown that histone demethylases can prevent the accumulation of repressive methylation at the promoters of genes that maintain pluripotency of ESCs  and hyperacetylated chromatin in ESCs is proposed to adopt an open structure and reduce repressive methylation . Therefore, histone methylation modifications, such as of H3 lysine 4 (H3K4), H3 lysine 9 (H3K9) and H3 lysine 27 (H3K27); and acetylation modifications, such as H3 acetyl lysine 9 (H3K9ac) and H3 acetyl lysine 27 (H3K27ac), are critical histone modification markers of cell pluripotent states . Analysis of ChIP-Seq data demonstrated that the expressions of Jmjd1a and Jmjd2c genes, which encode histone H3 lysine 9 demethylases, are positively regulated by Oct4 . Jarid2 and Mtf2, components of the Polycomb Repressive Complex 2 (PRC2) that mediate H3K27me3, are downstream targets of Oct4 [17,147]. Moreover, analyzing the ChIPSeq data of these histone markers in ESCs is an important way to depict the pluripotent gene network in ESCs, since the binding of histone modification are expected to enrich in the gene enhancer regions [113,136,139,148-154].
More importantly, massively epigenetic studies of ESCs or iPSCs have been carried out by the combination of ChIP-Seq, RNA-Seq and MethylC-Seq data recently [31,155-163]. DNA methylation plays the crucial role as the epigenetic switch driving somatic cells to pluripotent . Taking a recent epigenetic study of mouse iPSC reprogramming process as an example , by combining RNA-Seq (gene expression profiles), MethylC-Seq (DNA methylation profiles at TF promoter regions), and ChIP-Seq (histone modification profiles) data, the epigenomic mechanism of pluripotency in a roadmap of the reprogramming process was demonstrated. Genes with CpGrich promoters demonstrate consistent low methylation levels and strong engagement of histone markers, whereas genes with CpG-poor promoters are safeguarded by methylation.
non-coding RNA (ncRNA) studies in ESCs
RNA-Seq data can be used to systematically analyze the transcriptomes of ESCs and iPSCs with single-base resolution. Several transcriptome RNA-Seq datasets are available in mESC , mouse iPSC , hESC [150,151] and induced naïve-state hESC . One crucial molecular mechanism of pluripotency in ESCs is the function of non-coding RNAs (ncRNA), which inhibit gene expression by binding to mRNAs. Crosstalk bewteen ChIP-Seq and RNA-Seq data provides insight into the details of ncRNA-mRNA binding events [10,24]. For instance, microRNA (miRNA) such as mir302 and mir290 clusters, which are involved in regulations of the G1 phase of ESCs, were reported to be regulated by Oct4, Sox2 and Nanog with ChIP-Seq data in 2008 . Moreover, analysis of ChIP-Seq data revealed that a 30-aminoacid region of JARID2 mediated interactions with long noncoding RNAs (lncRNAs) and the presence of lncRNAs stimulated JARID2- EZH2 interactions in vitro as well as JARID2-mediated recruitment of PRC2 to chromatin in vivo . Recently, analysis of ChIP-Seq and RNA-Seq data revealed that pluripotency factors OCT3/4, SOX2, and KLF4 transiently activated LTR7, long-terminal repeats of HERV type-H (HERV-H), to maintain the gene expression profile required for the pluripotent state during the reprogramming [166,167]. In addition, ncRNA-mRNA gene pairs were identified through systematic analysis of RNA-Seq data in ESCs [80,168,169]. X chromosome inactivation (XCI) is another key feature of ESCs in pluripotent states. Previously XIST, a long noncoding RNA, was suspected to be crucial in events of XCI. Recently, ChIP-Seq and RNA-Seq experiments showed that loss of XIST expression is not the primary cause of XCI instability and that gene reactivation from the inactive X precedes loss of XIST coating in hPSC . Expression and coating by the long non-coding RNA XACT are early events in XCI erosion and may therefore play a role in mediating this process.
Up until now, alignment and analysis tools have been designed for the short sequence readings of NGS platforms. With the improvement of long sequence accuracy, software programs need to be updated to deal with the long readings of raw data from ChIP-Seq, RNA-Seq and MethylC-seq. Currently, the unmapped readings in raw data are commonly removed in analyses. With improvements of algorithms, the unmapped readings can be re-analyzed to gain further information including single-nucleotide polymorphism annotations and updated genome references. In addition, combination of ChIP-Seq data and chromatin conformation capture methods  can provide valuable information about distal regulatory elements and transchromosomal gene regulation networks. These questions may lead to critical information of hallmarks in ESCs study. Robust software tools for data analysis and closer interaction between laboratory scientists and bioinformaticians are clearly needed.
Besides ChIP-Seq, RNA-Seq and MethylC-Seq, other types of genome-wide NGS data have been carried out in ESCs studies. For instance, reduced representation bisulfite sequencing (RRBS-Seq) , methylated DNA immunoprecipitation (MeDIP-Seq) , as well as methyl-CpG binding domain (MBD-seq)  have been developed to detect DNA methylations. A comparison of these three technologies as well as MethylC-seq suggested the advantages and disadvantages among them . In addition, circular chromosome conformation capture coupled with NGS (4C-seq) , a technique of NGS application in chromosome conformation capture (3C), has been carried out to demonstrate the organization of chromosomes and the physical interactions that occur within and between chromosomes [177-179]. Another example is the recent deep sequencing data, which revealed the genome-wide profiling of Clustered Regularly Interspaced Short Palindromic Repeats-associated protein-9 nuclease (CRISPR-Cas9) off-target effects [180,181]. By mapping NGS readings of CRISPR-Cas9 target fragments to the human genome, CRISPRCas9 off-target rate has been analyzed in a very high resolution, which significantly improves the computational accuracy. In the foreseeable future, increasing novel applications of deep sequencing data in ESCs studies will be desirable.
This work was supported by grants from the Research Grants Council of Hong Kong [CUHK 478812, CUHK 14102214 and CUHK 14104614 to B.F.], the National Science Foundation [DBI0844749 and DBI1356459 to J.G], and in part by funds from the National Natural Science Foundation of China [NSFC 31171433 to B.F.] and Guangdong Science and Technology Bureau International Science and from the Technology Collaboration Program [20130501c to W.Y.C.].