Exploring the Neural Reprogrammome using Bioinformatics Approaches

Recent studies demonstrate that mammalian cells can be artificially reprogrammed by ectopic expression of transcription factors in an unforeseen straightforward manner. Patient-derived reprogrammed cells hold great potential for biomedical applications such as cell replacement therapy, drug toxicity studies and disease modeling. Somatic cells such as fibroblasts can be dedifferentiated into so-called induced pluripotent stem cells (iPSCs) that are similar to embryonic stem cells (ESCs) by overexpression of Oct4, Sox2, Klf4 and cMyc. However, clinical applications using iPSCs carry the risk of tumor formation due to incomplete differentiation. More recently, it has been demonstrated that transcription factor-driven reprogramming enables direct conversion of fibroblasts into neurons, cardiomyocytes, hepatocytes as well as neural progenitors. Various groups elaborated protocols for the direct conversion of fibroblasts into multipotent induced neural stem cells (iNSCs) using different combinations of transcription factors and media conditions. These studies have shown that iNSCs exhibit morphology, gene expression and self-renewing capacity similar to NSCs derived from primary tissue. Moreover, these iNSCs differentiated into neurons, astrocytes and oligodendrocytes indicating multipotency of these cells. Here, we compare the gene expression profile of reprogrammed cells reported in these studies to determine the similarity in expression profile between the generated iNSCs using bioinformatics approaches. We provide a general workflow that can be applied to evaluate the status of reprogrammed cell populations. Using hierarchical clustering analysis and principal component analysis (PCA), we show that iNSCs reported by Thier et al. resemble more closely to those of Han et al. On the other hand, iNSCs generated by Ring et al. are relatively similar to 4F iNSC (late) of the study by Han et al. as judged by hierarchical clustering analysis. Our study demonstrates that bioinformatics approaches are particularly valuable to robustly assess the transcriptional status of reprogrammed cells and to anticipate their cellular functionality.


Introduction
Cellular reprogramming describes a phenomenon of converting a somatic differentiated cell into a different cell type either somatic; or a multipotent and pluripotent, respectively, stem cell. A multipotent cell could be further differentiated into various but limited cell types, while a pluripotent cell could be differentiated into any desired cell type ( Figure 1). This process holds great promise for biomedical applications to use it as a potential treatment for human diseases and study cellular pathology in the cell culture dish [1][2][3][4][5][6][7][8]. Studies demonstrated that differentiated cells can be reprogrammed to an embryonic-like pluripotent state [9] or a multipotent state [10][11][12] and by that generated a lot of interest in understanding the molecular basis of cell fate determination and developmental biology. Takahashi and Yamanaka [9] achieved a breakthrough by converting somatic fibroblasts into pluripotent induced pluripotent stem cells (iPSCs) by overexpression of four transcription factors, challenging the view that cell differentiation is an irreversible process. However, for successful therapeutic applications, iPSCs need to be efficiently differentiated into the desired cell type. Disease-related applications of iPSCs harbor the risk of tumor formation due to incomplete differentiation [13][14][15]. Therefore, researchers have recently focused on reprogramming somatic cells into multipotent stem cells that have a more restricted developmental plasticity with less or no tumorigenic potential and are closer to the desired cell type that is required to be engineered [10][11][12]. Therefore, direct reprogramming of somatic cells into lineage-restricted multipotent stem cells should obviate iPSC generation and bypass the hurdle of unwanted iPSCs in differentiated cultures.
To understand the cellular reprogramming process from a somatic state to a multipotent or pluripotent state, bioinformatics approaches have been utilized to extract meaningful information from vast amount of data [16][17][18][19]. Attempts have been made to use genomic highthroughput technologies such as gene expression microarrays to assess genome-wide mRNA expression, chromatin immunoprecipitation with microarray technology (ChIP-on-ChIP) and ChIP with massive parallel DNA sequencing (ChIP-seq) to assess protein-DNA interactions to understand the cellular conversion process and provide us with new insights towards the reprogramming process [20,21]. Bioinformatics approaches have been used to explore and understand this complex and vast amount of biological data generated from high-throughput technologies. It integrates computational and statistical techniques to explore and extract meaningful information from biological data ( Figure 2A) [22]. By that, bioinformatics approaches could be used to identify novel transcription factors and help to understand the mechanisms of cell reprogramming during the cell conversion process ( Figure 2B).
Regulation of gene expression comprises of a wide range of mechanisms that are used by cells to increase or decrease the production of specific gene products that could be either protein or RNA. To understand the function of a cell, as well as higher levels of biological organization, we need to know the components that constitute it. With the inventions of high throughput technology like microarrays, gene expression of thousands of mRNAs across different samples can be analyzed simultaneously and compared to each other [23]. In order to elucidate the properties of stem cells many researchers are comparing gene expression analysis of pluripotent cells with multipotent and/ or unipotent cells [24]. Transcriptional profiling of multipotent hematopoietic stem cells (HSCs) and neural stem cells (NSCs) with pluripotent embryonic stem cells (ESCs) resulted in identification of genes that were differentially expressed among these three stem cell groups [25]. This knowledge could provide information about gene signatures of different cell types that could be used to determine the identity of artificially generated novel cell types. To unravel the wealth of information from the complex datasets, bioinformatics approaches have been deployed to explore the gene expression data. Clustering algorithms aims to group the genes based on the similar pattern of gene expression [26] while Principal Component Analysis [PCA; [27]] converts the gene expression data sets into diagonalized "eigengenes" and "eigenarrays" spaces in such a way that eigengenes are unique and orthogonal superpositions of the genes. Later the gene expression data is sorted based on eigengenes and eigenarrays reducing the features of the data to their principal components and determining the characteristic pattern of the dataset [24]. On the downside, existence of several microarray platforms including Illumina Bead Arrays, Affymetrix chips, Agilent, Nimblegen, and other platforms renders the comparison of data sets from different platforms complicated and error-prone.
In this study, we analyzed gene expression datasets of recently reported induced neural stem cells (iNSCs) generated from fibroblasts by three different groups [10][11][12] using bioinformatics approaches. The datasets were generated from different platforms. Using hierarchical clustering analysis and principal component analysis (PCA) we demonstrate iNSCs generated by Thier et al. [12] resemble closely to those of Han et al. [10], whereas Ring et al. [11] iNSCs appear distinct from these cells. Thus, coupling gene expression data with bioinformatics approaches will provide us with information about the transcriptional status of the reprogrammed cells.

Materials and Methods
Three freely available microarray datasets, namely GSE30500 [10], GSE37859 [11] and GSE36484 [12] were downloaded from Gene Expression Omnibus (GEO) [28]. All three datasets were recently reported in the context of direct conversion of fibroblasts into multipotent iNSCs using a slightly diverse combination of transcription factors ( Figure 3) and specific media conditions. The details of microarray datasets downloaded from GEO are mentioned in table 1.
All the three microarray datasets were subjected to quality control, data transformation, integration, normalization and analysis as shown in Figure 4. Due to diversified nature of microarray platforms (Table  1) and the formats of the raw data, the datasets were first examined to determine whether they are in the same comparable range. From our Box Plot analysis we observed that the heterogeneous datasets were not in the same comparable range ( Figure S1), as the expression values submitted to GEO from [12] was measured on a linear scale, whereas the other two datasets, [10,11], were measured in log2 scale. constitutively expressed Sox2, Klf4 and c-Myc, and Oct4 as a recombinant protein ( ) for the first 5 days to generate iNSCs. All three groups demonstrated in their respective studies that these iNSCs exhibit morphology, gene expression and the ability to self-renew similar to NSCs derived from primary tissue. Furthermore, these iNSCs differentiated into neurons, astrocytes and oligodendrocytes suggesting multipotency of these cells. Experiment starts with biological question(s) followed by its design and hybridization. Hybridized chip is introduced into the scanner and the intensities of every sample are acquired as image files and later the data is extracted from these image files. The quality of each sample is verified by Box Plot analysis. After verification, the data is transformed to log2 scale and integrated using virtualArray package. The integrated "Expression Set" is normalized between arrays using quantile method. The normalized data are later subjected to hierarchical clustering and PCA to address the biological question(s). The table represents list of gene expression datasets, sample types and its count and the platform and Chip used for generating the datasets. The expression values of the former were transformed into log2 scale using standard R functions and Bioconductor packages to bring all the datasets in the same comparable range for further analysis as performed in Heider et al. [ Figure 5; [29]]. The microarray datasets were then integrated using a virtualArray, a Bioconductor package that can combine raw datasets based on the recently updated annotations from NCBI (National Centre Biotechnology Information) [29]. The redundant probes that represent the same gene were collapsed and only those gene symbols, which were common across all three datasets, were identified and a single combined "Expression Set" (comprising of 12,818 genes) was generated [29]. Quantile normalization between the arrays was performed on this expression set [30] and subjected to further analysis.

GEO datasets Number of Samples
For hierarchical clustering analysis, the normalized "Expression Set" was clustered using agglomerative (using "virtualArrayHclust" function) hierarchical clustering on Euclidean distance matrix and average linkage method [29]. Many other distance measures other than Euclidean distance matrix are used but the Euclidean distance matrix with the commonly used average linkage method is more appropriate for normalized microarray data [31].
For Principal Component Analysis (PCA), the normalized "Expression Set" was also exported as tab-delimited text file from R and then imported into Partek Genomics Suite [32] for PCA with default parameters; correlation dispersion matrix and normalized eigen vector scaling and visualized in 3D scatterplot. The correlation method adjusts the data to be standardized to a mean of zero (mean centered) and a standard deviation of one. This adjustment is performed during the computation and does not modify the original data [32]. The eigenvectors are the direction cosines of the new axes (PCs) relative to the old (original variables), thus they define the rotations of the original axes [32].
The entire analysis was carried using Bioconductor packages, R, a statistical programming environment and Partek genomics Suite that runs in a 32-bit operating system comprising of 4GB RAM (Random Access Memory), a processor of Intel® Core (TM)2 Duo CPU T6500@2.10 GHz.

Results and Discussion
Bioinformatics approaches are increasingly embraced by the stem cell research community because of its potential to uncover the molecular fingerprint of a stem cell state in a short period of time. These approaches are also valuable to reveal critical information about the data obtained from different laboratories and experimental conditions. In this study, we have compared the gene expression profiling data    of three different studies that report direct conversion of fibroblasts into multipotent iNSCs using different combination of transcription factors and media conditions [10][11][12] (Figure 3). Han et al. used a cocktail of 4 factors (4F), Brn4, Sox2, Klf4 and c-Myc, and 5 factors (5F) that included the listed 4 factors plus E47/Tcf3 [10]. They analyzed the whole genome profile of 4F early-and a late-passage (in short 4F (early) and 4F (late)) as well as 5F to evaluate the reprogramming level of the entire transcriptome [10]. In this study, we have outlined cross platform comparison of three different microarray datasets generated by these groups on different platforms. The microarray datasets of the three studies were integrated using virtualArray and then compared using hierarchical clustering analysis and PCA to determine the similarity in gene expression profiling of iNSCs. First, the quality of the samples in all the three datasets before and after transformation was verified using Box plots ( Figure S1; 5). Each box represents individual samples in the datasets and all the boxes were in the same range indicating that all the samples were of good quality ( Figure 5). Hierarchical clustering analysis ( Figure  6 For PCA analysis, the combined expression datasets generated from virtualArray package were exported to Partek Genomics suite [32]. The high dimensionalities of three datasets were reduced and their patterns were identified using Principal Components (PCs) [33]. PCs are the set of new variables, created by linear transformation of original variables (gene expression). Rather looking into gene expression levels, PCA plot shows set of principal components calculated based upon actual gene expression values. Through this approach, the high dimensional complex data can be simplified and visualized, by retaining as much variance possible. The PCA plot explained in this study shows only first three PCs, as they contain the most information. The first 3 PCs were able to retain a total of 40.6% of variation among the data. Each PC explains the variance in the data and they are stacked in a decreasing order i.e. PC#1=20.18 %, PC#2=13.23%, PC#3=7.14% and so on ( Figure  S2). The list of total PCs and the cumulative % of variance proportion accumulating to 100% is shown in Figure S2. 3D scatterplot (Figure 7 [34][35][36][37]. Muller et al. [38] designed a new computational strategy for classifying the stem cells by designing a database, "stem cell matrix" of global gene expression profiles of pluripotent, multipotent and other differentiated cells. Using unsupervised clustering method, they were able to classify the pluripotent stem cells from the other cell lines based on their gene expression. They constructed a pluripotency network called PluriNet based on MATISSE (Module Analysis via Topology of interactions and similarity sets). With visualization and data analysis tools embedded within IntNetDB [39], it provides the researchers a good framework to analyze, visualize and interpret the results. Recently, PluriTest [16], a robust bioinformatics assay for assessing pluripotency, has been developed which can be used to upload and analyze gene expression data of human pluripotent cells based on Illumina platform. This tool compares the known pluripotent gene expression signature with a newly generated reprogrammed cell line to determine the pluripotency status of the candidate cell line and by that avoid teratoma assay in mice. However, this bioinformatics assay has a limitation in assessing pluripotency of human cells from Illumina platform only. More recently, a similar web portal Stemformatics has been generated to assist stem cell biologists to identify and assess relevant datasets, gene sets and pathways for easy visualization and comparison of gene expression generated across platforms, laboratories and cell models [18]. Similar advances using novel stem cell specific bioinformatics tools could replace cumbersome and time-consuming approaches like teratoma assays or in vitro differentiation assays to characterize cell lines to determine their identity [40].
Our study involving cross platform comparison could further help us in identifying biomarkers from various data handling/ organizational tools, which then can be used to compare published cell types with novel cell types. A web interface for iNSCs should be established to easily upload gene expression data for different species and platforms to determine the transcriptional status of newly derived iNSCs and other multipotent stem cell lines. The analysis can be extended in identifying the differentially expressed genes between cell lines, which could further shed light on the key genes involved in the cell conversion process. Therefore, with already existing algorithms and development of new tools and databases, the new experimental datasets can be applied to accessible databases and will infer new results based on known biological processes to understand more about stem cell fate and cellular reprogramming. Hence, combination of both, experimental and computational approaches will help biologists to understand cell fate determination. In conclusion, our study substantiates that bioinformatics approaches are essential to robustly determine the transcriptional state of reprogrammed cells and envisage their functionality.

Future Perspectives
The availability of different platforms for microarray experiments impedes comparison of expression data from the different platforms. This limitation also restricts the extraction of valuable information hidden in the datasets uploaded online in multiple data handling/ organizational tools. Therefore, a tool capable of performing cross platform comparison without losing much information is required and will provide infinite opportunities for researchers to carry out the comparative analysis. The mouse and human stem cells at the system level remain poorly explored; hence, more bioinformatics approaches have to be deployed. As the application of bioinformatics in cellular reprogramming is at an infant stage, therefore integrating the experimental data along with the other "OMICS" data will definitely provide more details of biological properties of the reprogrammome at a broader spectrum.