Mining Datasets for Molecular Subtyping in Cancer

Given the heterogeneity in the clinical behavior of cancer patients with identical histopathological diagnosis, the search for unrecognized molecular subtypes, subtype-specific markers and the evaluation of their clinical-biological relevance are a necessity. This task is benefiting today from the high-throughput genomic technologies and free access to the datasets generated by the international genomic projects and the repositories of information. Machine learning strategies have proven to be useful in the identification of hidden trends in large datasets, contributing to the understanding of the molecular mechanisms and subtyping of cancer. However, the translation of new molecular subclasses and biomarkers into clinical settings requires their analytic validation and clinical trials to determine their clinical utility. Here, we provide an overview of the workflow to identify and confirm cancer subtypes, summarize a variety of methodological principles, and highlight representative studies. The generation of public big data on the most common malignancies is turning the molecular pathology into a database-driven discipline


Introduction
The diagnosis of cancer is made primarily through histopathological classification systems that take into account the morphological characteristics of the tumor, allowing their identification and clinical stage assignment. The histopathological classification systems, despite their contribution to reducing cancer-related mortality, still present uncertainties in providing prognostic information or guidance for determining the most appropriate therapeutic direction. It is also clear that such systems fail to provide information about the underlying molecular mechanisms, which may be the origin of the clinically observed differences. Furthermore, the existing histopathological subtypes are heterogeneous; this is evident at the levels of molecular pathogenesis, clinical course, and treatment responsiveness [1,2]. These limitations necessitate the discovery of new molecular subtypes and the evaluation of their clinical and biological significance. Lowdimensional approaches that consider a limited number of genes and patients are insufficient to address the problem of cancer subtyping. It is necessary to identify patterns in large datasets and at a genomewide scale using machine learning strategies. This task benefits from the high-throughput genomic technologies, the enormous amount of genomic datasets generated by the international genomic projects, and the availability of data analysis algorithms, allowing a comprehensive and unprecedented characterization of the disease.
The machine learning approaches can be used to dissect the complexity of cancer. These are the computational tools that recognize and classify patterns based on models derived from the data. The motivation for this mini review is provide an overview of the workflow for molecular subtyping in cancer. Although there are various methods available for classification, a common analytical framework emerges across several research studies. This common workflow with its outstanding techniques is covered here with interest in the methodological principles and the biological interpretation.
Despite great efforts in cancer biomarkers several factors have impeded translation of research findings into clinical practice [3]. The precise role in the management of patients, of new molecular subclasses and predictors identifies by machine learning approaches, need to be refined and strongly validated. However, it is clear that they have the potential to provide insights about the underling molecular mechanisms and help to dissect the molecular heterogeneity of the disease. Machine learning for cancer subtyping has been performed mainly with expression data. However, this technique can also be applied to other levels of biological information, such as promoter methylation, miRNAs, and single nucleotide polymorphisms, analyzed with hybridization array technology or next generation sequencing, allowing the study of the data structure in many different levels and providing an integrated view of the biological processes involved.

Unsupervised and Supervised Learning for Cancer Study
There two main types of statistical problems associated with tumor classification: the identification of unknown tumor classes and the classification of malignancies into known classes. These two issues can be addressed using in a complementary manner unsupervised and supervised machine learning. These methodologies have supported the discovery of subgroups with biological significance and clinical implications in multiple types of cancer (Table 1). Representative examples include: 1) the distinction between acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML) without the previous knowledge of the classes they belong to [4], a distinction that is critical for successful treatment. The class discovery was made through the use of self-organizing maps (SOM); the clusters identified by this method were compared to the known classes through linear discriminant analysis [4]. 2) The subclassification of diffuse large B-cell lymphoma (DLBCL) into groups related to the differences in the stages of B-cell differentiation (germinal center, activated B -cell) and its association with differences in patient prognosis [5,6]. Average linkage hierarchical clustering was used in this case as the unsupervised strategy. The use of hierarchical clustering analysis and supervised analysis also allowed the classification of breast cancer into at least 4 subtypes (basal, luminal A, luminal B, HER2+) [7,8]. The refinement and comprehensive study of these subtypes has allowed the identification of differences with regard to clinical features, response to treatment, and prognosis. 4) The supervised classification approaches have had a major impact on the ability to influence clinical management as they help predict the outcome of the disease; most efforts have focused on identifying prognostic (clinical outcome) and predictive (response to treatment) markers. Supervised predictors have motivated the development of large-scale validation efforts, especially in breast and colon cancer. Predictors in the areas of breast and colon cancer (MammaPrint, Oncotype DX, Coloprint) [9][10][11][12] have been noted for their progress in clinical trials.
A common methodological framework to identify subtypes can be found in many studies [13], while it does not use the same analytical techniques, follows a similar workflow: discovery cohorts are chosen and pre-processed, unsupervised clustering techniques are applied, supervised classifiers are developed, and clustering and classification are validated in independent cohorts ( Figure 1).

Cohorts and data pre-processing
The starting point should be the selection of characterized cohorts based on histopathological evaluation and clinical monitoring. The clinical relevance of a classification system lies mainly in the stratification of patients based on clinical outcomes such as survival or therapeutic response (if it is the case of cohorts with therapeutic intervention). The subdivisions are evaluated by methods such as the Kaplan-meier estimator or the cox proportional hazard model, for this purpose, it is necessary to have information about the current status of the patient as well as during long clinical follow-up periods, which will allow the estimation of survival endpoints (e.g. overall survival, recurrence-free survival). In the case of microarray data, raw data are pre-processed in a process that involves three steps: background correction to adjust the intensity readings for nonspecific signals; adjustment of the intensity readings for technical variability to ensure that the measurements of all samples are comparable (normalization); and computation of a summary value for the different probes representing each gene (summarization). The most commonly used types of normalization are the RMA [14], the Quantile [15], the Loess [16], and the VNS [17,18]. A filtration step is recommended for removing non-informative genes that may represent noise for clustering; genes that show little variation between patients and those with low signal intensity should be eliminated [19].
Given the heterogeneity of the disease, the cohort must represent a broad spectrum of molecular events to realize the identification of rare subgroups. Therefore, cohorts with a large number of patients are preferred. Because of the difficulty of finding large cohorts, it is a common practice to combine normalized datasets, which will have the challenge of making corrections for non-biological variation such as differences in origin and technical processing. Several correction or adjustment processes have been proposed. These include SVA [20], Combat [21], and DWD [22]. The use of different platforms (e.g. Affymetrix, Agilent) is also a limiting factor when combining cohorts, requiring homogeneous annotation of the probes corresponding to the same gene symbol.

Clustering of patients
The goal of clustering is the identification of natural or inherent structures in a dataset. The clustering divides patients into groups that may represent subtypes of the disease, using measures of distance or similarity (e.g. geometrical distances or correlation). The cluster analysis employs two basic strategies, namely hierarchical clustering and partitioning, in addition to hybrid methods.
The hierarchical clustering is one of the most widely used methods. This generates tree-like structures between elements and can be divided into agglomerative (bottom-up) and divisive (top-down). In bottom-up methods, each data set is considered a cluster. The clusters are iteratively grouped based on their similarity measures. The topdown method starts with a cluster containing all data, and splits are performed recursively until each cluster contains a single data. The type of clustering algorithm, the distance metric, the type of linkage or inter-cluster distances (when appropriate), and the number of clusters must be selected when employing these methods; guidance can be found in the work of Drăghici S [23,24]. The use of a hierarchical grouping system for the subdivision of the data may involve the notion of development or transition between the elements, whereas a nonhierarchical system may involve greater independence or independent emergence. On the other hand, partitioning methods are a group of methods based on a variety of mathematical models. Among the most widely used methods of partitioning are the self-organizing maps (SOM) and the K-means [24].
SOM is a type of artificial neural network (ANN), which computes a set of reference vectors (prototypes) representing the local means of the data. SOM then partitions the data set, with each prototype defining a cluster consisting of the data points nearest to it. The user specifies the number of clusters to be identified [25]. The k-means starts with a fixed a priori number of cluster centers (k). Each data point is assigned to the nearest center, based on its distance from each center, to form a set of temporary clusters. An iterative process recalculates the position of the cluster centers based on the current membership of each cluster and reassigns the points to the k clusters. This process continues until stabilization is achieved [26]. The most common methods for identifying robust subgroups (tolerant to outliers) include the use of a clustering algorithm together with a consensus clustering process originally proposed by Monti et al. [27]. The consensus clustering performs subsampling and, for each subsample, runs a particular type of clustering, estimating consensus values for different numbers of clusters, allowing the assessment of the stability of the clusters discovered. The number of clusters can be determined from the empirical cumulative distribution function (CDF) area [27].
Recently, the non-negative matrix factorization (NMF) consensus clustering has been extensively used [28][29][30][31][32][33]. The NMF is a dimensionality reduction method that can summarize outstanding functional properties in a small number of metagenes (positive linear combination of genes). This is accomplished via a decomposition of the gene expression matrix into two matrices with nonnegative entries. Each sample is assigned to a subtype or cluster by finding the metagene that is most closely related to the sample's expression profile. The robustness of clustering is evaluated by repeating the factorization process using different random initial conditions for the factorization algorithm. This creates a consensus matrix to assess the stability of the resulting clusters [34]. The NMF appears to have some advantages over other methods: it is not based on distances, does not assume a hierarchical structure and provides a quantitative measure to identify the number of clusters. The latter is performed by means of the cophenetic coefficient.
Given the large number of genes and patients, virtually everything can be clustered. On the other hand, given the different nature of clustering algorithms, it is possible to modify the parameters to generate different results using the same data (e.g. the clustering produced by a given algorithm is dependent on the distance metric used). Therefore, the clustering is expected to be useful in the discovery of groups of patients with functional, survival, or phenotypic differences. The value of clustering is demonstrated by the biological information it provides, the utility of the markers found and the extrapolation of the results.
Techniques such as multidimensional scaling (MDS) can be useful for the visualization and initial recognition of high-dimensional data and to identify patterns and evaluate metrics used for the separation of elements. The method converts a similarity matrix to a simple geometrical picture [35]. The principal component analysis (PCA) can be used with exploratory intent and for the purpose of dimensionality reduction. In this method, new features, principal components with the larges variance, are identified and used instead of the original ones [36].

Classification and validation of results
Once the clustering is accomplished, the next phase in many studies consists of exploring the potential clinical utility and validation using Figure 1: A common methodology workflow to identify and confirm cancer subtypes. The unsupervised or class discovery methods can be used to identify the unrecognized tumor subtypes, and the supervised or classification methods can be used to allocate the samples into previously defined classes. The entire workflow also can help to identify potential biomarkers, through feature selection methods applied before classification.  cohorts; for these purposes, the supervised analysis techniques are used. The goal is to design a classifier that is able to accurately predict the class membership of new samples (test data) using data with known class membership (training data) [37]; samples used for training and testing should be large and independent for obtaining reliable results. Once a classification model is executed, it is important to estimate the classifier performance with respect to the sensitivity (true positives), specificity (true negatives), and accuracy (total number of correct predictions). Among the methods used for evaluating the performance of a classifier by splitting the initially labeled data into subsets are the cross-validation and bootstrap methods [26,38].
Two of the most common classification algorithms used for mining genomic data on cancer is the support vector machines (SVMs) and the classifiers based on nearest centroids, such as prediction analysis of microarrays (PAM). The SVMs map the input vector into a feature space of higher dimensionality and identify the hyperplane that separates the data points into two classes. The marginal distance between the decision hyperplane and the instances that are closest to the boundary are maximized. The resulting classifier achieves considerable generalizability for the classification of new samples [38]. The PAM calculates a standardized centroid for each class; the method takes the gene expression profile of a new sample and compares it to each of the class centroids. The class whose centroid it is closest to is the predicted class for that new sample. The method uses a shrinkage technique to assess the contributions of genes to classification as an automated gene selection step [39].
An important consideration for the classification process is the choice of the most discriminative genes for the analysis since this specific group of genes, which can be considered subtype-specific markers, is what makes the distinction between classes possible. The choice of a group of genes is also important to avoid over fitting. If increased error rates of the classification are observed despite the decrease in the error rates during the training process, over fitting may be the cause [38]; this is associated with the presence of a disproportionate number of genes with respect to the number of samples and can be prevented with the use of feature selection methods. Selected features can lead to better classification performance, provide insights into the disease and offer biomarkers with clinical value. These markers could be tested experimentally for evaluate their functional value in the disease and validated in different cohorts for corroborate their role as biomarker. To mention just one example: Armstrong et al. [40] found that lymphoblastic leukemias with MLL translocations can be separated from conventional acute lymphoblastic and acute myelogenous leukemias, they identified a target gene FLT3 that was shown experimentally to be a drug target [40,41].
Approaches to feature selection can be divided into two categories: filter methods and wrapper methods. In the former methods, a statistical measure of the marginal relevance of the features is used (e.g. t-test, SAM); those methods perform explicit feature selection before the classifier construction. Wrapper methods use the accuracy of a resulting classifier to evaluate the features, for example, classification techniques such as the decision trees and random forests [42,43] intrinsically contain a feature selection step that evaluate the "variable importance".
Methods such as significance analysis of microarrays (SAM) can be used as filter to find highly discriminative genes between subgroups. This method identifies differential gene expression relative to the spread of expression across all genes. Sample permutation is used to estimate false discovery rates (FDR) [44]. By adjusting the threshold, it is possible to control the estimated FDR associated with the gene sets. Decision tree is nonparametric and have the advantage of be easy to interpret, it follows a tree-structured where the nodes represent the input variables and the leaves correspond to decisions outcomes. A decision tree classifies by posing a series of questions or decision rules, each question is contained in a node, and every internal node points to one child node for each possible answer to its question [45]. Random forests uses an ensemble of unpruned decision tres (grown fully), each of which is built on a bootstrap sample of the training data using a randomly subset of variables [42] (Figure 2).
The identification of subtype-specific markers or cluster markers is sensitive to outliers. Therefore it is advisable to use only those samples that belong statistically to the core of each of the clusters before the extraction of biomarkers. This can be accomplished using the silhouette width [46]. Classifiers with fewer genes tend to perform poorly. Therefore, a compromise between the size and the performance of the classifier should be considered, for example, considering a sufficiently large number of genes that do not generate error rates higher than 5%. Once a good classifier is generated, it is necessary to reduce the number of genes for potential clinical application and validation.
Once the list of genes that differentiates groups has been obtained, these can be examined in search of functional information, such as signaling pathways involved or biological processes affected, through enrichment analysis. DAVID is a tool used for pathway enrichment and gene ontology analyses [47]. The gene sets enrichment analysis (GSEA) can be used to evaluate whether the differentially expressed genes are enriched in specific gene sets [48].
Despite research efforts in recent years, only a few molecular markers have been established in clinical practice. For example, although several studies have been reported in breast cancer, markers recommended by the American Society of Clinical Oncology (ASCO) are reduced to the status of a few molecules, ER and PR indicated for endocrine therapy, HER2 for anti-HER2 therapy, and the 21-gene recurrence score to determine prognosis. [49][50][51]. Other representative examples include KRAS mutations in colorectal cancer to select patients for treatment with antibodies against epidermal growth factor receptor (EGFR) [52], and EGFR and ALK alterations for therapeutic direction in lung cancer [53].
For a tumor marker to be used in clinical settings, issues related to analytic and clinical validity, and clinical utility must be addressed. Analytic validity relate to analytic accuracy, reliability, and reproducibility. Clinical validity is the demonstration that the marker has a strong association with a clinical outcome. Clinical utility entail that use of the marker has shown to result in a favorable balance between benefit and harm, leading to improved outcomes compared with nonuse of the marker [3]. Recently, Yuan et al. [54] addressed a key issue related to the lack of cancer biomarkers with clinical utility: statistical significance vs. magnitude difference. Predictive models in cancer have relied on statistical significance (P value) to evaluate clinical utility but the size of the difference in the patient outcome should also be considered [55]. For their applications in the clinical management of patients, the new molecular subclasses and predictors identified by machine learning approaches need to be refined and strongly validated. However, it is clear that they have the potential to complement traditional histopathological systems and to provide insight into the underlying molecular mechanisms.  Each tree cast a vote in classifying and it is formed by first selecting at random, at each node, a small group of input features (genes) to split on and, secondly, by calculating the best split based on these features in the training set. The tree is grown to maximum size (without pruning). This randomization scheme is blended with bootstrap to resample, the training data set each time a new individual tree is grown. Each patient not used in the construction of the tree is used as test set (out of bag), the tree is then tested against the "out of bag" patients to estimate the accuracy of the classifier. The entire process is repeated with new random data set divisions and new random gene sets for selection of splitter variables to produce ultimately a forest. The forest can then also be applied to independent patients of unknown class.

Datasets and Analysis Tools
Data generated by genomic cancer projects can be accessed freely or in a controlled manner. Allowing free access to genomic data has become a practice in international projects that generate highthroughput data. This democratization of scientific information has allowed such data to become the starting point of analysis, development of workflows and analytical tools and the generation of new questions from the scientific community [56,57].
Among the projects with genomic data on cancer are The Cancer Genome Atlas (TCGA) (http://cancergenome.nih.gov/dataportal) and The International Cancer Genome Consortium (ICGC) (http://dcc. icgc.org). TCGA is a collaborative effort put together to characterize the genomic changes in major cancers and is jointly conducted by The National Cancer Institute (NCI), The National Human Genome Research Institute (NHGRI) and the different centers and institutes of the National Institute of Health (NIH). This initiative has analyzed over 30 human types of cancer using high-throughput technologies: exome and genome sequencing, expression analysis, copy number variations, DNA methylation, and miRNAs, evaluated using microarray platforms and/or sequencing. The goal of the ICGC is a comprehensive description of genomic, transcriptomic, and epigenomic changes in 50 different tumor types and/or subtypes which are of clinical and societal importance across the globe. It is possible to find several levels of data processing in these projects. Based on the level of intervention and integration (raw, processed or normalized, interpreted and summarized), these are referred to as levels 1-4 and are freely available or controlled for different analytical platforms. Among the public repositories containing large numbers of expression data in compliance with basic rules for publication to the community, are the Gene Expression Omnibus (GEO) of the National Center for Biotechnology Information (NCBI) and the repository Array Express of the European Bioinformatics Institute (EBI).
For the analysis of high-throughput data, a wide range of methods are freely available through the Bioconductor project (http://www. bioconductor.org/). This resource has nearly 1104 software packages and an active user community. The Bioconductor uses the R programming language, which is open source and open development, allowing highly interactive protocols and providing an opportunity for programming one's own analysis. Another free software resource for use with the R environment and with a variety of solutions in statistical genomics is the Comprehensive R Archive Network (CRAN) with approximately 7590 packages. The Gene Pattern (http://www. broadinstitute.org/genepattern) stands out among the tools with a graphical interface, with hundreds of analysis tools and workflows for different types of genomic data.