Recent Advances in Discriminant Analysis for High-dimensional Data Classification



Introduction
High-dimensional data sets pose serious challenges to statistical analysis. With the arrival of new technologies, high-throughput data generation is becoming the norm in many disciplines, such as statistical genetics, epidemiology, astronomy, high-energy physics, and ecology. High-dimensional data arise from a wide range of sources, including digital images, documents, next-generation sequencing, mass spectrometry, metabolomics, microarrays, proteomics, online videos, and web pages. One area with a growing need for new statistical methods and theory for high-dimensional data is the classification of subgroups. For example, cancer classification has traditionally been based on the histopathological appearance of the tumor. However, patients with similar tumor appearance can have very different prognoses and responses to treatment, so classifying cancers by pathological review alone may produce biased results and misclassify tumor subtypes. Microarray data allow the simultaneous measurement of the expression of thousands of genes. Such high-dimensional data have become a standard tool in biomedical studies and are now routinely collected from patients in clinical trials. Identifying informative genes may yield molecular markers for tumor class prediction, and correct classification can help practitioners choose the right treatment for each patient.

Due to the cost and/or experimental difficulty of obtaining sufficient biological material, it is common to see studies whose sample size is much smaller than the number of dimensions. Such problems are referred to as "large p, small n" problems, where p is the number of dimensions (e.g., genes) and n is the sample size. High-dimensional data pose challenges to traditional statistical methods: because n is small, standard estimates of parameters such as means and variances carry increased uncertainty, and statistical analyses based on these estimates are often unreliable. To obtain improved parameter estimates, researchers have developed a number of innovative approaches, which we review below.
A common approach to high-dimensional data classification is discriminant analysis. The main goal of discriminant analysis is to assign an unknown subject to one of $K$ classes on the basis of observed subjects from each class. Let $X_{k,1}, \ldots, X_{k,n_k}$ be independent and identically distributed observations from a $p$-dimensional multivariate normal distribution with mean vector $\mu_k$ and covariance matrix $\Sigma_k$ for class $k = 1, \ldots, K$, and let $n = n_1 + \cdots + n_K$ be the total number of observations. Note that the sample covariance matrices are singular when $p$ is larger than $n$. Therefore, traditional methods such as Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) cannot be applied to high-dimensional data classification directly.
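As a concrete illustration of this singularity issue, the following minimal sketch (using simulated data; the dimensions, seed, and library calls are our own illustration, not from the original article) shows that the sample covariance matrix of n observations in p dimensions has rank at most n - 1 and therefore cannot be inverted when p > n, which is exactly what classical LDA/QDA would require:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 100                      # "large p, small n": more genes than samples

X = rng.normal(size=(n, p))         # simulated expression matrix (n samples x p genes)
S = np.cov(X, rowvar=False)         # p x p sample covariance matrix

print(np.linalg.matrix_rank(S))     # at most n - 1 = 19, far less than p = 100
# Inverting S, as LDA/QDA requires, fails or is numerically meaningless:
# np.linalg.inv(S)  -> raises LinAlgError or returns numerically meaningless values
```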

Recent Advances
To overcome the singularity problem, Dudoit [1] introduced two simplified discriminant rules that assume independence between covariates. For each class $k$, let $\bar{X}_k = (\bar{X}_{k1}, \ldots, \bar{X}_{kp})^T$ be the sample mean vector and $\hat{\Sigma}_k = \mathrm{diag}(s_{k1}^2, \ldots, s_{kp}^2)$ be the sample covariance matrix with all off-diagonal elements set to zero. Also let $\hat{\pi}_k = n_k / n$ be the estimated prior probability of observing a class $k$ subject. The first rule developed in Dudoit [1] is called Diagonal Quadratic Discriminant Analysis (DQDA). It classifies a new subject $X = (X_1, \ldots, X_p)^T$ to the class $k$ that minimizes the discriminant score
$$d_k^{Q}(X) = \sum_{i=1}^{p} \frac{(X_i - \bar{X}_{ki})^2}{s_{ki}^2} + \sum_{i=1}^{p} \log s_{ki}^2 - 2 \log \hat{\pi}_k.$$
The second rule is called Diagonal Linear Discriminant Analysis (DLDA); it classifies the new subject analogously according to the discriminant score
$$d_k^{L}(X) = \sum_{i=1}^{p} \frac{(X_i - \bar{X}_{ki})^2}{s_{i,\mathrm{pool}}^2} - 2 \log \hat{\pi}_k,$$
where $s_{1,\mathrm{pool}}^2, \ldots, s_{p,\mathrm{pool}}^2$ are the variances pooled across the $K$ classes. DQDA and DLDA classifiers are sometimes called "naive Bayes" classifiers because they can arise in a Bayesian setting [2]. Despite ignoring correlations between genes, DLDA and DQDA perform remarkably well with small sample sizes compared to more sophisticated classifiers, in terms of both accuracy and stability. In addition, DQDA and DLDA are easy to implement and have been adopted for the analysis of high-dimensional data in various fields of science.
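To make the two rules concrete, below is a minimal NumPy sketch of DQDA and DLDA as defined above (the function and variable names are our own illustration, not code from the original papers):

```python
import numpy as np

def fit_diagonal(X, y):
    """Estimate per-class means, per-class diagonal variances, pooled variances, and priors.
    X: (n, p) data matrix; y: (n,) integer class labels 0..K-1."""
    classes = np.unique(y)
    n, p = X.shape
    means = np.array([X[y == k].mean(axis=0) for k in classes])          # (K, p)
    vars_k = np.array([X[y == k].var(axis=0, ddof=1) for k in classes])  # (K, p)
    n_k = np.array([(y == k).sum() for k in classes])
    pooled = ((n_k - 1)[:, None] * vars_k).sum(axis=0) / (n - len(classes))  # pooled within-class variances
    priors = n_k / n
    return means, vars_k, pooled, priors

def dqda_score(x, means, vars_k, priors):
    """DQDA score d_k^Q(x); classify to the class with the smallest score."""
    return (((x - means) ** 2) / vars_k).sum(axis=1) + np.log(vars_k).sum(axis=1) - 2 * np.log(priors)

def dlda_score(x, means, pooled, priors):
    """DLDA score d_k^L(x) with variances pooled across the K classes."""
    return (((x - means) ** 2) / pooled).sum(axis=1) - 2 * np.log(priors)

# usage sketch on simulated data
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 200)); y = rng.integers(0, 2, size=30)
means, vars_k, pooled, priors = fit_diagonal(X, y)
x_new = rng.normal(size=200)
print(np.argmin(dqda_score(x_new, means, vars_k, priors)),
      np.argmin(dlda_score(x_new, means, pooled, priors)))
```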
Though DQDA and DLDA work with small sample sizes and outperform some sophisticated classifiers, their performance in the "large p, small n" setting can still be unreliable for several reasons. In this section, we review some significant results developed in the literature to improve diagonal discriminant analysis.
The Nearest Shrunken Centroid (NSC) method proposed by Tibshirani [3] is among the first to improve on diagonal discriminant analysis. This method also assumes a diagonal covariance matrix, but to improve classification performance the mean vector $\mu_k$ is estimated by a "shrunken centroid" rather than by the sample mean: NSC shrinks each class centroid toward the overall centroid by a certain amount. Specifically, let
$$d_{ik} = \frac{\bar{x}_{ik} - \bar{x}_i}{m_k \,(s_i + s_0)},$$
where $\bar{x}_{ik}$ is the $i$th component of the class-$k$ centroid, $\bar{x}_i$ is the $i$th component of the overall centroid, $m_k$ is a normalizing constant depending on $n_k$ and $n$, $s_i$ is the pooled within-class standard deviation for the $i$th component, and $s_0$ is a positive constant taking the same value for all genes. By shrinking $d_{ik}$ toward zero via soft thresholding or hard thresholding, the NSC method obtains shrunken centroids, performs DLDA with them, and classifies the new subject to the class with the nearest shrunken centroid. Note that other variations of NSC are also available in the literature; see, for example, [4,5].
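The following is a minimal sketch of the shrinkage step (soft thresholding of the standardized differences $d_{ik}$ and reconstruction of the shrunken centroids). The choice of normalizer $m_k$, the offset $s_0$, and all names are our own simplifying assumptions, so this illustrates the idea rather than reproducing the exact implementation of [3]:

```python
import numpy as np

def shrunken_centroids(X, y, delta, s0=None):
    """Compute NSC-style shrunken class centroids via soft thresholding.
    X: (n, p) data; y: (n,) labels 0..K-1; delta: shrinkage threshold."""
    classes = np.unique(y)
    n, p = X.shape
    overall = X.mean(axis=0)                                    # overall centroid
    centroids = np.array([X[y == k].mean(axis=0) for k in classes])
    n_k = np.array([(y == k).sum() for k in classes])
    # pooled within-class standard deviation per gene
    ss = sum(((X[y == k] - centroids[j]) ** 2).sum(axis=0) for j, k in enumerate(classes))
    s = np.sqrt(ss / (n - len(classes)))
    if s0 is None:
        s0 = np.median(s)                                       # common heuristic for the offset s0
    m_k = np.sqrt(1.0 / n_k + 1.0 / n)                          # one common choice of normalizer
    d = (centroids - overall) / (m_k[:, None] * (s + s0))       # standardized differences d_ik
    d_shrunk = np.sign(d) * np.maximum(np.abs(d) - delta, 0.0)  # soft thresholding toward zero
    return overall + m_k[:, None] * (s + s0) * d_shrunk         # shrunken centroids, one row per class
```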
Uncorrelated Discriminant Analysis (UDA) is another extension of diagonal discriminant analysis [6]. Let $S_b$ be the between-class scatter matrix, $S_w$ the within-class scatter matrix, and $S_t = S_b + S_w$ the total scatter matrix. A special property of UDA is that the extracted features in the transformed space are uncorrelated. The goal of UDA is to find optimal discriminant vectors that are $S_t$-orthogonal: once $r$ vectors have been obtained, the $(r+1)$th vector is the one that maximizes the Fisher criterion function subject to the $S_t$-orthogonality constraints. Ye et al. [6] showed that the whole set of discriminant vectors can be computed efficiently by solving a single optimization problem.
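As a rough numerical sketch of this construction (the ridge term that keeps $S_t$ invertible when p > n is a simplification of ours, not part of [6]), $S_t$-orthogonal discriminant vectors can be obtained from a symmetric-definite generalized eigenvalue problem, whose eigenvectors are $S_t$-orthogonal by construction:

```python
import numpy as np
from scipy.linalg import eigh

def uda_directions(X, y, n_components=2, ridge=1e-3):
    """Compute discriminant vectors that are S_t-orthogonal.
    A ridge term regularizes S_t, which is singular when p > n."""
    classes = np.unique(y)
    n, p = X.shape
    overall = X.mean(axis=0)
    S_b = np.zeros((p, p)); S_w = np.zeros((p, p))
    for k in classes:
        Xk = X[y == k]
        mk = Xk.mean(axis=0)
        S_b += len(Xk) * np.outer(mk - overall, mk - overall)   # between-class scatter
        S_w += (Xk - mk).T @ (Xk - mk)                          # within-class scatter
    S_t = S_b + S_w + ridge * np.eye(p)                         # regularized total scatter
    # Generalized eigenproblem S_b v = lambda S_t v; eigh returns vectors with v^T S_t v = I,
    # i.e., the discriminant vectors are S_t-orthogonal.
    vals, vecs = eigh(S_b, S_t)
    order = np.argsort(vals)[::-1]
    return vecs[:, order[:n_components]]
```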
Due to the small sample sizes, another direction for improving diagonal discriminant analysis is shrinkage [7,8]. For instance, Pang [7] plugged the shrinkage estimates of variance from Tong [9] into the diagonal discriminant scores, forming two shrinkage-based rules: Shrinkage-based DQDA (SDQDA) and Shrinkage-based DLDA (SDLDA). Pang [7] also applied regularization, as in Friedman [10], to further improve the performance of SDQDA and SDLDA. Combining shrinkage-based variances in diagonal discriminant analysis with regularization in a new classification scheme showed improvement over the original DQDA and DLDA, the Support Vector Machine, and k-Nearest Neighbors in many scenarios. In addition, Pang H [11] applied the shrinkage-based discriminant rules to identify genes that help differentiate between estrogen receptor positive and negative samples, in order to investigate genes that are specific to African American subjects with breast cancer.
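As a loose illustration of the shrinkage idea behind these rules (not the exact optimal estimator of Tong [9]; the fixed weight alpha and the geometric-mean target are simplifying assumptions of ours), each gene-specific variance can be pulled toward a common target on the log scale before being plugged into the discriminant scores:

```python
import numpy as np

def shrink_variances(sample_vars, alpha=0.5):
    """Shrink per-gene sample variances toward their geometric mean on the log scale.
    alpha = 0 keeps the raw variances; alpha = 1 uses a single common variance."""
    log_v = np.log(sample_vars)
    target = log_v.mean()                       # log of the geometric-mean variance
    return np.exp((1 - alpha) * log_v + alpha * target)

# usage sketch: stabilize gene-wise variances before computing DLDA/DQDA scores
rng = np.random.default_rng(2)
raw = rng.chisquare(df=4, size=1000) / 4        # noisy variance estimates (small n)
shrunk = shrink_variances(raw, alpha=0.5)
print(raw.var(), shrunk.var())                  # shrinkage reduces the spread of the estimates
```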
Recently, Huang S [12] observed that diagonal discriminant analysis suffers from a serious drawback: the plug-in discriminant scores are biased. Motivated by this, they proposed bias-corrected diagonal discriminant rules that replace the terms of the plug-in discriminant scores with bias-corrected estimators, yielding, in particular, a bias-corrected version of the DQDA score. It was shown that the proposed bias-corrected score improves on the standard one under the quadratic loss function. Finally, both simulation studies and prediction accuracy analyses demonstrated the superiority of bias correction over the original rules, especially when the design is highly unbalanced.
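To illustrate the flavor of such a correction (this is a generic small-sample fact about normal variance estimates, offered as our own sketch rather than the specific rules of [12]): if $s^2$ is a sample variance with $\nu = n_k - 1$ degrees of freedom, then $(\nu - 2)/\nu \cdot 1/s^2$ is unbiased for $1/\sigma^2$, and $\log s^2 - \psi(\nu/2) + \log(\nu/2)$ is unbiased for $\log \sigma^2$, so the plug-in terms $1/s^2$ and $\log s^2$ in the DQDA score can be replaced by corrected versions:

```python
import numpy as np
from scipy.special import digamma

def corrected_inv_var(s2, nu):
    """Unbiased estimator of 1/sigma^2: E[1/s^2] = nu / ((nu - 2) sigma^2), so rescale (needs nu > 2)."""
    return (nu - 2) / nu / s2

def corrected_log_var(s2, nu):
    """Unbiased estimator of log sigma^2: E[log s^2] = log sigma^2 + psi(nu/2) - log(nu/2)."""
    return np.log(s2) - digamma(nu / 2.0) + np.log(nu / 2.0)

# quick Monte Carlo check of the two corrections (sigma^2 = 1, nu = 9, i.e. n_k = 10)
rng = np.random.default_rng(3)
nu = 9
s2 = rng.chisquare(df=nu, size=200000) / nu
print(corrected_inv_var(s2, nu).mean())   # close to 1 (the plug-in 1/s^2 averages to nu/(nu-2) ~ 1.29)
print(corrected_log_var(s2, nu).mean())   # close to 0 = log(1)
```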

Discussion
Though the progress made thus far is encouraging, we believe more remains to be done given the increasing demand, and further improvements are desirable. First, genes are unlikely to be independent of each other, so the assumptions made in diagonal discriminant analysis and its variations may not be realistic. Pang H [13] are studying and extending block-diagonal discriminant analysis methods, and preliminary results suggest that further improvement in class prediction is possible in real data analysis. Second, the performance of the NSC method and its variations may not be satisfactory when the sample size is small, owing to the large variation of variable selection by cross-validation. In Tong T [14], the authors propose a new algorithm that chooses the tuning parameter for variable selection by minimizing certain risk functions; preliminary simulations indicate that the proposed algorithm compares favorably with the original cross-validated NSC when the sample size is small. Third, one can consider bias-corrected rules for SDQDA and SDLDA. Recall that shrinkage estimation trades a "small" increase in bias for a possibly "significant" decrease in variance. The good performance of SDQDA and SDLDA reported in Huang S [12] is mainly due to the greatly reduced variance of the shrinkage-based discriminant scores; however, the bias term in SDLDA and SDQDA remains, and may even be larger than that in DLDA and DQDA, respectively, since shrinkage can introduce extra bias. To conclude, we reiterate that there is still room for more innovative methodological developments in the area of discriminant analysis for high-dimensional data classification.