
J Comput Sci Syst Biol


Journal of Computer Science & Systems Biology

0974-7230

OMICS Publishing Group

India, USA

10.4172/jcsb.1000028

000063

Research Article

Biochemistry

Proteomics Bioinformatics Genomics Transcriptomics Biomarkers

L₁ Least Square for Cancer Diagnosis using Gene Expression Data

Xiyi Hang¹, Fang-Xiang Wu²,³

¹Department of Electrical and Computer Engineering, California State University, Northridge, CA 91330, USA

²Department of Mechanical Engineering, ³Division of Biomedical Engineering, University of Saskatchewan, Saskatoon, Saskatchewan, S7N 5A9, Canada

* To whom correspondence should be addressed: Fang-Xiang Wu, Division of Biomedical Engineering, University of Saskatchewan, Saskatoon, Saskatchewan, S7N 5A9, Canada; E-mail: xhang@csun.edu, faw341@mail.usask.ca

Received 19 March 2009; accepted 27 April 2009; published 27 April 2009

Volume 2, Issue 2, pages 167-173 (2009)

This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

The performance of most methods for cancer diagnosis using gene expression data greatly depends on careful model selection. Least square for classification has no need of model selection. However, a major drawback prevents it from successful application in microarray data classification: lack of robustness to outliers. In this paper we cast linear regression as a constrained l₁-norm minimization problem to greatly alleviate its sensitivity to outliers, and hence the name l₁ least square. The numerical experiment shows that l₁ least square can match the best performance achieved by support vector machines (SVMs) with careful model selection.

Keywords: l₁-norm minimization; least square regression; classification; cancer; gene expression data; support vector machine

Citation: Hang X, Wu FX (2009) L₁ Least Square for Cancer Diagnosis using Gene Expression Data. J Comput Sci Syst Biol 2: 167-173. doi:10.4172/jcsb.1000028

Introduction

DNA microarray technology, with its high-throughput capability of simultaneously measuring the relative expression levels of tens of thousands of genes, has the potential to provide a more accurate and objective cancer diagnosis than traditional histopathological approaches. Its success, however, greatly depends on the supervised learning algorithm selected to classify the gene expression data.

Many well-established methods are available for gene expression profile classification. According to Lee et al. (2005), they can be classified into four categories: (1) classical methods, such as Fisher's linear discriminant analysis, logistic regression, K-nearest neighbor, and generalized partial least square; (2) classification trees and aggregation methods, such as CART, random forest, bagging, and boosting; (3) machine learning methods, such as neural networks and support vector machines (SVMs); and (4) generalized methods, such as flexible discriminant analysis, mixture discriminant analysis, and the shrunken centroid method.

The performance of many of these methods, however, relies on a careful choice of model parameters, which can be made via a model selection procedure such as cross validation. For example, the model parameters for SVMs include the kernel parameters and the penalty parameter C. A recent controversy regarding the performance comparison between SVM and random forest exemplifies the importance of model selection. The study by Diaz-Uriarte et al. (2006) concludes that random forest outperforms SVM, while Statnikov et al. (2008) reach the opposite conclusion. The main difference between the two studies is that model selection is carefully designed in the latter but not in the former. The incident also shows that model selection may be an obstacle to the wider application of SVMs to gene expression profile classification. Since classification performance is a nonconvex function of the model parameters, it is usually difficult to find optimal model parameters by model selection.

Least square for classification, on the other hand, has no need of model selection. Consider a general classification problem with N classes. A linear model is built for each class k

$$y_k = w_k^T x + w_{k0}, \quad k = 1, 2, \ldots, N. \qquad (1)$$

The N equations can be grouped into

$$y = W \tilde{x} \qquad (2)$$

where $y = [y_1, y_2, \ldots, y_N]^T$, W is a matrix whose kth row is $[w_k^T, w_{k0}]$, and $\tilde{x} = [x^T, 1]^T$. Consider a training dataset $\{(x_i, t_i),\ i = 1, 2, \ldots, n\}$, where $t_i$ is the 1-of-N binary coding vector of the label of the ith feature vector $x_i$, i.e., a vector containing zeros everywhere except for a 1 in the kth position if $x_i$ belongs to category k. Denote by X the feature matrix whose ith row is $[x_i^T, 1]$, and by T the target matrix whose ith row is $t_i^T$. The linear regression model in (2) can be fitted simultaneously to each of the columns of T, and the solution is of the form

$$\hat{W}^T = (X^T X)^{-1} X^T T. \qquad (3)$$

A new sample x is then classified in two steps:

• Calculate the fitted output $\hat{y} = \hat{W} [x^T, 1]^T$ (an N-dimensional vector);

• Label = $\arg\max_k \hat{y}(k)$, k = 1, 2, ..., N.

More details can be found in the literature (Bishop, 2006; Hastie et al., 2001).
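The fitting and labeling steps above can be illustrated with a minimal Python sketch. It is not the authors' MATLAB implementation; the function names, the zero-based class labels, and the toy data are assumptions made for illustration. `numpy.linalg.lstsq` is used in place of the explicit inverse in (3); the two coincide when $X^T X$ is invertible.

```python
import numpy as np

def fit_least_squares(X, labels, n_classes):
    """Fit the linear model of Eq. (3); returns W with one row [w_k^T, w_k0] per class."""
    n = X.shape[0]
    X_aug = np.hstack([X, np.ones((n, 1))])   # rows [x_i^T, 1]
    T = np.zeros((n, n_classes))
    T[np.arange(n), labels] = 1.0             # 1-of-N coding of the labels
    # Least-squares solution of X_aug @ W^T = T, shape (d+1) x N.
    W_t, *_ = np.linalg.lstsq(X_aug, T, rcond=None)
    return W_t.T

def predict_least_squares(W, x):
    """Compute the fitted output y_hat = W [x^T, 1]^T and return its argmax."""
    y_hat = W @ np.append(x, 1.0)
    return int(np.argmax(y_hat))

# Toy data (hypothetical): three well-separated classes in two dimensions.
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0],
              [0.9, 1.1], [2.0, 0.0], [2.1, 0.1]])
labels = np.array([0, 0, 1, 1, 2, 2])
W = fit_least_squares(X, labels, n_classes=3)
predictions = [predict_least_squares(W, x) for x in X]
```

With gene expression data the number of columns of X far exceeds the number of rows, so $X^T X$ is singular and `lstsq` returns the minimum-norm solution instead.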

The above approach, however, is very sensitive to outliers, especially for multicategory classification (N ≥ 3). Furthermore, when least square for classification is applied to gene expression data, problems can become more severe due to the curse of dimensionality caused by the great number of genes in each sample.

Inspired by the recent progress in sparse signal recovery via l₁-norm minimization (Candès et al., 2006; Candès and Tao, 2006; Donoho, 2006), we propose a new approach to overcome the major drawback of least square for classification by casting the linear regression problem as a constrained l₁-norm minimization problem. The obtained sparse solution is much less sensitive to both outliers and the curse of dimensionality. In addition, multicategory classification is realized via one-versus-rest (OVR) and one-versus-one (OVO) approaches, which decompose the original multicategory problem into a series of binary problems. The new method is validated by comparing its cancer diagnosis performance with that of SVMs.

Methods

Binary L1 Least Square

Consider a training dataset $\{(x_i, y_i);\ i = 1, \ldots, n\}$ with $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, +1\}$, where $x_i$ represents the ith sample, a d-dimensional column vector of gene expression values with d the number of genes, and $y_i$ is the label of the ith sample. The two classes are described by a linear model

$$y = [x^T, 1]\, w \qquad (4)$$

for any sample x. Applying the linear model to the training dataset, we have

$$y_i = [x_i^T, 1]\, w, \quad i = 1, 2, \ldots, n. \qquad (5)$$

The n equations can be grouped into

$$y = Xw \qquad (6)$$

where $y = [y_1, y_2, \ldots, y_n]^T$, and X is an n × (d+1) matrix whose ith row is $[x_i^T, 1]$. Since the number of samples is much smaller than the number of genes, i.e., n ≪ d, the system in (6) is underdetermined. The solution is obtained by casting the original problem as the following constrained l₁-norm minimization problem

$$\min_w \|w\|_1 \quad \text{subject to} \quad Xw = y. \qquad (7)$$

The above formulation is inspired by the recent progress in compressed sensing (Candès et al., 2006; Candès and Tao, 2006; Donoho, 2006) and basis pursuit denoising (Chen et al., 2001).

There are quite a few solvers available for the optimization problem defined in (7), such as MOSEK (Andersen, 2002), PDCO-CHOL (Saunders, 2002), PDCO-LSQR (Saunders, 2002), and l₁-magic (Candès and Romberg, 2006), all of which are interior-point methods. In this study we choose the solver SPGL1 (Friedlander and Van den Berg, 2008) for its efficiency in solving large-scale problems. Unlike the other methods, SPGL1 solves the optimization problem by converting it into a root-finding problem. Please refer to Van den Berg and Friedlander (2008) for details on the theory of SPGL1.

Denote by ŵ the solution to (7). Then for any sample x, the label can be simply assigned as sign([x^T ,1]ŵ).
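For readers who want to experiment without SPGL1, problem (7) can also be written as a linear program by splitting w = u - v with u, v ≥ 0 and minimizing the sum of the entries of u and v. The sketch below takes this route with SciPy's general-purpose `linprog`; it is a swapped-in solver rather than the one used in this study, and the function names and random toy data are assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import linprog

def l1_least_square_fit(X, y):
    """Solve min ||w||_1 subject to [X, 1] w = y as a linear program."""
    n, d = X.shape
    A = np.hstack([X, np.ones((n, 1))])       # n x (d+1), rows [x_i^T, 1]
    m = d + 1
    # Split w = u - v with u, v >= 0; at the optimum sum(u) + sum(v) = ||w||_1.
    cost = np.ones(2 * m)
    A_eq = np.hstack([A, -A])
    res = linprog(cost, A_eq=A_eq, b_eq=y, bounds=(0, None))
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:m] - res.x[m:]

def l1_least_square_predict(w, x):
    """Assign the label sign([x^T, 1] w); a score of exactly zero maps to +1 here."""
    return 1 if np.append(x, 1.0) @ w >= 0 else -1

# Toy underdetermined problem (hypothetical data): 10 samples, 50 "genes".
rng = np.random.default_rng(0)
X_train = rng.random((10, 50))
y_train = np.array([1.0, -1.0] * 5)
w_hat = l1_least_square_fit(X_train, y_train)
```

A general LP solver is adequate for small examples like this one; for full microarray datasets a dedicated large-scale solver such as SPGL1 is the choice made in the text.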

Multicategory L1 Least Square: OVR

Consider a multicategory training dataset $\{(x_i, y_i);\ i = 1, \ldots, n\}$ with $x_i \in \mathbb{R}^d$ and $y_i \in \{1, 2, \ldots, N\}$, where N is the number of categories. The OVR approach determines for each class a binary classifier that separates it from the remaining classes. The N linear models are defined as

$$D_k(x) = [x^T, 1]\, w_k, \quad k = 1, 2, \ldots, N. \qquad (8)$$

For category k, after changing the labels of the samples belonging to k to +1 and all others to -1, we apply the linear model to the training dataset

$$y_k = X w_k, \quad k = 1, 2, \ldots, N \qquad (9)$$

where $y_k$ is a label vector containing either +1 or -1. Similarly, the above N underdetermined systems can be solved by the following N constrained l₁-norm minimization problems

$$\min_{w_k} \|w_k\|_1 \quad \text{subject to} \quad X w_k = y_k, \quad k = 1, 2, \ldots, N. \qquad (10)$$

Denote by $\hat{w}_k$ the solution to (10). Then for any sample x, the label is determined by

$$\arg\max_{k = 1, 2, \ldots, N} D_k(x), \quad D_k(x) = [x^T, 1]\, \hat{w}_k. \qquad (11)$$
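The OVR relabeling in (9) and the decision rule in (11) are independent of the solver used for (10). A minimal pure-Python sketch follows; the function names and the example weight vectors are hypothetical and are not fitted to any dataset.

```python
def ovr_label_vectors(labels, n_classes):
    """Eq. (9): for each class k, +1 for samples of class k and -1 for all others."""
    return [[1 if y == k else -1 for y in labels]
            for k in range(1, n_classes + 1)]

def ovr_predict(x, weights):
    """Eq. (11): return the class k (1-based) with the largest D_k(x) = [x^T, 1] w_k."""
    x_aug = list(x) + [1.0]
    scores = [sum(a * b for a, b in zip(x_aug, w_k)) for w_k in weights]
    return 1 + max(range(len(scores)), key=scores.__getitem__)

# Hypothetical example: four samples, three classes.
label_vectors = ovr_label_vectors([1, 2, 3, 2], n_classes=3)

# Made-up weight vectors [w_k^T, w_k0] for a two-gene problem.
example_weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [-1.0, -1.0, 1.0]]
predicted = ovr_predict([2.0, 0.0], example_weights)
```

Each row of `label_vectors` is the right-hand side of one of the N problems in (10).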

Multicategory L1 Least Square: OVO

In the OVO approach, a binary classifier is constructed for each pair of classes. The linear model for class i against class j is given by

$$D_{i,j}(x) = [x^T, 1]\, w_{i,j}. \qquad (12)$$

For the samples of categories i and j, changing their labels to +1 and -1, respectively, and applying the linear model gives

$$y_{i,j} = X_{i,j}\, w_{i,j} \qquad (13)$$

where $y_{i,j}$ is a vector containing either +1 or -1, and $X_{i,j}$ is a matrix whose kth row is $[x_k^T, 1]$ with $x_k$ belonging to either category i or j. The underdetermined system is solved by

$$\min_{w_{i,j}} \|w_{i,j}\|_1 \quad \text{subject to} \quad X_{i,j}\, w_{i,j} = y_{i,j}. \qquad (14)$$

Since $D_{j,i} = -D_{i,j}$, the number of classifiers is $\binom{N}{2}$, i.e., N(N-1)/2, compared to N in the OVR approach.

Denote by $\hat{w}_{i,j}$ the solution to (14). For any sample x, we calculate

$$D_i(x) = \sum_{j=1,\, j \neq i}^{N} \operatorname{sign}\big(D_{i,j}(x)\big) \qquad (15)$$

with $D_{i,j}(x) = [x^T, 1]\, \hat{w}_{i,j}$. The label of x is determined by

$$\arg\max_{i = 1, 2, \ldots, N} D_i(x). \qquad (16)$$
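The voting scheme in (15) and (16) can be sketched in a few lines of pure Python. The function name, the dictionary layout (one weight vector per pair i < j), and the one-gene example weights are assumptions made for illustration; ties are broken in favor of the lowest class index, which the text does not specify.

```python
def ovo_predict(x, pair_weights, n_classes):
    """Eqs. (15)-(16): pair_weights[(i, j)] holds w_ij for each pair i < j."""
    x_aug = list(x) + [1.0]
    votes = [0] * (n_classes + 1)              # index 0 unused; classes are 1..N
    for (i, j), w_ij in pair_weights.items():
        d_ij = sum(a * b for a, b in zip(x_aug, w_ij))
        s = (d_ij > 0) - (d_ij < 0)            # sign(D_ij(x))
        votes[i] += s                          # term of D_i(x)
        votes[j] -= s                          # D_ji(x) = -D_ij(x)
    return max(range(1, n_classes + 1), key=votes.__getitem__)

# Made-up pairwise models for a one-gene, three-class problem:
# D_12(x) = -x + 1.5, D_13(x) = -x + 2.0, D_23(x) = -x + 2.5.
pair_weights = {(1, 2): [-1.0, 1.5], (1, 3): [-1.0, 2.0], (2, 3): [-1.0, 2.5]}
low, mid, high = (ovo_predict([v], pair_weights, 3) for v in (0.0, 2.1, 3.0))
```

Only N(N-1)/2 weight vectors are stored because the classifier for the pair (j, i) is the negative of the one for (i, j).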

Numerical Experiment

A numerical experiment is designed to validate the cancer diagnosis performance of l₁ least square using gene expression data. The performance metric is classification accuracy obtained by 10-fold stratified cross validation. MATLAB R14 is used to implement the new method. The results are compared with binary SVM (Vapnik, 1998) and some popular variants of multicategory SVMs, including OVR-SVM (Kressel, 1999), OVO-SVM (Kressel, 1999), DAGSVM (Platt et al., 2000), the method by Weston and Watkins (WW) (Weston and Watkins, 1999), and the method by Crammer and Singer (CS) (Crammer and Singer, 2000).
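As an illustration of what stratified 10-fold partitioning involves, the following pure-Python sketch assigns sample indices to folds so that each class is spread as evenly as possible. It is one possible scheme, not the partition used in this study; the function name, the seed, and the example class sizes are assumptions.

```python
import random
from collections import defaultdict

def stratified_folds(labels, n_folds=10, seed=0):
    """Split sample indices into folds with roughly equal class proportions."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    position = 0
    for y in sorted(by_class):
        members = by_class[y]
        rng.shuffle(members)                   # random order within each class
        for idx in members:
            folds[position % n_folds].append(idx)  # deal out round-robin
            position += 1
    return folds

# Hypothetical binary dataset with 50 and 27 samples per class.
demo_labels = [0] * 50 + [1] * 27
demo_folds = stratified_folds(demo_labels)
```

Each fold serves once as the test set while the remaining nine folds are used for training; different random partitions generally give slightly different accuracy values.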

The results of SVMs are obtained from GEMS (Gene Expression Model Selector), a software package with a graphical user interface for classification of gene expression data. It is freely available at http://www.gems-system.org/. GEMS was used by Statnikov et al. (2005) for a comprehensive study of the performance of multiple classifiers on gene expression cancer diagnosis. For model selection, polynomial kernels are used with orders $p \in \{1, 2, 3\}$, and the penalty parameter $C \in \{10^{-3+0.5n},\ n = 0, 1, \ldots, 6\}$.
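To make the size of the SVM model selection grid concrete, the short sketch below enumerates the candidate settings, reading the penalty values as 10 raised to the power -3 + 0.5n. The variable names are illustrative only and do not correspond to GEMS options.

```python
from itertools import product

kernel_orders = [1, 2, 3]                                # polynomial orders p
penalties = [10.0 ** (-3 + 0.5 * n) for n in range(7)]   # C from 1e-3 up to 1
grid = list(product(kernel_orders, penalties))           # all (p, C) pairs
```

Under this reading, the grid contains 3 × 7 = 21 candidate models, each of which has to be evaluated by cross validation.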

Six datasets are used in the experiment, all of which are among the eleven datasets used by Statnikov et al. (2005). They are available on the GEMS website both in GEMS format and as MATLAB mat files. For easy comparison and reference, we adopt the names used by Statnikov et al. (2005). The six datasets are summarized below.

DLBCL (Shipp et al., 2002): The binary dataset comes from a study of gene expression of two lymphomas: diffuse large B-cell lymphomas and follicular lymphomas. Each sample contains 5469 genes. The sample number is 77.

Prostate_Tumor (Singh et al., 2002): The binary dataset contains gene expression data of prostate tumor and normal tissues. There are 10509 genes in each sample, and 102 samples.

9_Tumors (Staunton et al., 2001): The dataset comes from a study of 9 human tumor types: NSCLC, colon, breast, ovary, leukaemia, renal, melanoma, prostate, and CNS. There are 60 samples, each of which contains 5726 genes.

11_Tumors (Su et al., 2001): The dataset includes 174 samples of gene expression data of 11 various human tumor types: ovary, bladder/ureter, breast, colorectal, gastroesophagus, kidney, liver, prostate, pancreas, lung adeno, and lung squamous. The number of genes is 12533.

Brain_Tumor1 (Pomeroy et al., 2002): The dataset comes from a study of 5 human brain tumor types: medulloblastoma, malignant glioma, AT/RT, normal cerebellum, and PNET, including 90 samples. Each sample has 5920 genes.

Brain_Tumor2 (Nutt et al., 2003): There are 4 types of malignant glioma in this dataset: classic glioblastomas, classic anaplastic oligodendrogliomas, non-classic glioblastomas, and non-classic anaplastic oligodendrogliomas. The dataset has 50 samples, and the number of genes is 10367.

All the datasets are normalized by rescaling the gene expression values to be between 0 and 1.
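The rescaling step can be sketched as follows. The text does not state whether the rescaling is applied per gene or over the whole matrix; the sketch assumes per-gene (column-wise) min-max scaling, and the function name and toy matrix are hypothetical.

```python
import numpy as np

def rescale_unit_interval(X):
    """Linearly rescale each column (gene) of X to the interval [0, 1]."""
    X = np.asarray(X, dtype=float)
    low = X.min(axis=0)
    span = X.max(axis=0) - low
    span[span == 0] = 1.0       # constant genes map to 0 instead of dividing by zero
    return (X - low) / span

# Toy matrix (hypothetical): 3 samples x 3 genes, the last gene constant.
raw = np.array([[1.0, 10.0, 5.0],
                [3.0, 20.0, 5.0],
                [2.0, 40.0, 5.0]])
scaled = rescale_unit_interval(raw)
```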

Two gene selection methods are used in this experiment to study the impact of gene selection on classification performance: Kruskal-Wallis non-parametric one-way ANOVA (KW) (Gibbons, 2003), and the ratio of between-class to within-class sums of squares (BW) (Dudoit et al., 2002).
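As an illustration of the BW criterion, the sketch below computes, for every gene, the between-class sum of squares divided by the within-class sum of squares and keeps the highest-scoring genes. The function names and toy data are assumptions, and returning a score of 0 for genes with zero within-class variation is a choice made here for the sketch.

```python
import numpy as np

def bw_ratio(X, labels):
    """Per-gene ratio of between-class to within-class sums of squares."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    overall_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for k in np.unique(labels):
        X_k = X[labels == k]
        class_mean = X_k.mean(axis=0)
        between += len(X_k) * (class_mean - overall_mean) ** 2
        within += ((X_k - class_mean) ** 2).sum(axis=0)
    # Genes with zero within-class variation get a score of 0 in this sketch.
    return np.divide(between, within, out=np.zeros_like(between), where=within > 0)

def select_top_genes(scores, m):
    """Indices of the m genes with the largest scores."""
    return np.argsort(scores)[::-1][:m]

# Toy data (hypothetical): gene 0 separates the two classes, gene 1 does not.
X_toy = [[0.0, 0.5], [0.1, 0.1], [1.0, 0.4], [0.9, 0.2]]
y_toy = [0, 0, 1, 1]
scores = bw_ratio(X_toy, y_toy)
best = select_top_genes(scores, 1)
```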

Results

Classification without Gene Selection

Table 1 shows the classification accuracy values obtained by 10-fold stratified cross validation for both l₁ least square and SVMs. The results of SVMs are slightly different from those reported by Statnikov et al. (2005), where these datasets are also used. A possible explanation is that the partition of samples into cross-validation folds in our study differs from that in Statnikov et al. (2005).

For binary datasets Prostate_Tumor and DLBCL, the performance of l₁ least square is slightly below that of SVMs. Note that the results of SVMs are obtained by careful model selection using cross validation, while our method does not need model selection, and is totally automatic. In addition, just like SVM, when applied to binary datasets, the multicategory classifiers of l₁ least square are equivalent to binary classifier for both OVO and OVR approaches.

When applied to classification of multicategory datasets, OVR-l₁ least square can closely match the best performance achieved by SVMs. For both SVM and l₁ least square, the OVO approach performs much worse than the OVR approach on the 9_Tumors dataset.

Classification with Gene Selection

Table 2 shows the best performance achieved by OVR-l₁ least square and SVMs when the gene selection methods KW and BW are used. The results show that both l₁ least square and SVMs perform slightly better than without gene selection (Table 1). The improvement ranges from 0 to 9% for SVMs, but only from 0 to 3.48% for OVR-l₁ least square. Again, the performance of OVR-l₁ least square is comparable to that of SVMs.

Discussion

The success of l₁ least square may lie in its sparse linear model coefficient vector obtained from l₁-norm minimization. Figure 1 shows the model coefficient vector w, which is the solution of l₁ least square for classifying the binary dataset DLBCL. The sparsity suggests that the genes with greater absolute coefficients could have played more important roles in classification. As a result, the classification performance does not depend on all the genes, especially not on those with very small absolute coefficients. The sparsity has the potential to greatly alleviate the curse of dimensionality and increase the robustness to outliers.

Another implication of sparsity is that the genes with larger absolute coefficients may correspond to biological markers. Hence, sparsity could also be used for gene selection. We did a small experiment to verify this possibility. The binary dataset DLBCL is used to fit the l₁ least square model. Gene selection is done by choosing the M genes with the largest absolute coefficients. Binary SVM is used to classify the gene-selected data. The results are compared with the KW and BW methods for gene selection. Figure 2 shows the performance of the three gene selection methods for M = 10, 20, 30, 40, and 50, respectively. The new method significantly outperforms both the KW and BW methods when a small number of genes are selected.
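The coefficient-based gene selection described above amounts to ranking genes by the absolute value of their fitted coefficients. A minimal sketch, assuming the last entry of w is the bias term that multiplies the constant 1 in [x^T, 1]; the function name and the example vector are hypothetical.

```python
def top_genes_by_coefficient(w, m):
    """Indices of the m genes with the largest |w_j|; the last entry of w is the bias."""
    gene_coeffs = w[:-1]
    order = sorted(range(len(gene_coeffs)),
                   key=lambda j: abs(gene_coeffs[j]), reverse=True)
    return order[:m]

# Made-up sparse coefficient vector for five genes plus a bias term.
w_example = [0.0, -2.0, 0.5, 0.0, 1.5, 9.9]
selected = top_genes_by_coefficient(w_example, 2)
```

The selected columns of the data matrix would then be passed to the downstream classifier, binary SVM in the experiment described above.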

Figure 1. The sparse coefficient vector.

Figure 2. The performance of three gene selection methods.

The above gene selection approach is similar in spirit to the lasso (Tibshirani, 1996), formulated as follows

$$\min_w \|Xw - y\|_2^2 \quad \text{subject to} \quad \|w\|_1 \le t \qquad (17)$$

where X, w, and y follow the definitions given in the section on binary l₁ least square, and t is the model parameter of the lasso. In addition, the lasso can also be used for classification by replacing (7) with (17) for the binary case, (10) with

$$\min_{w_k} \|Xw_k - y_k\|_2^2 \quad \text{subject to} \quad \|w_k\|_1 \le t_k \qquad (18)$$

for the multicategory OVR approach, and (14) with

$$\min_{w_{i,j}} \|X_{i,j}\, w_{i,j} - y_{i,j}\|_2^2 \quad \text{subject to} \quad \|w_{i,j}\|_1 \le t_{i,j} \qquad (19)$$

for the multicategory OVO approach.

Similarly, we can also replace l₁ least square regression with the Dantzig selector (Candès and Tao, 2007), which is given below for binary classification

$$\min_w \|w\|_1 \quad \text{subject to} \quad \|X^T(y - Xw)\|_\infty \le (1 + t^{-1}) \sqrt{2 \log d}\; \sigma \qquad (20)$$

where t is the model parameter and σ is the noise standard deviation. The Dantzig selector for multicategory classification can be defined similarly.

Both lasso and Dantzig selector for classification, however, still need to select optimized model parameters by model selection procedure, such as cross validation.

Conclusion

In this paper, we have described a specialized regression method for cancer diagnosis using gene expression data. The new approach, called l₁ least square, casts linear regression as a constrained l₁-norm minimization problem to overcome the major drawback of least square for classification: lack of robustness to outliers. Besides the binary classifier, multicategory l₁ least square classifiers based on the OVO and OVR approaches are also proposed.

The numerical experiment shows that OVR-l₁ least square can match the best performance achieved by SVMs with careful model selection. The main advantage of l₁ least square over other methods, including SVMs, is that it has no need of model selection. As a result, the method based on l₁ least square is totally automatic. l₁ least square also has the potential to be used for gene selection.

The l₁ least square classifier may become a promising automatic cancer diagnosis tool by consistently distinguishing gene profile classes. Those genes with great absolute regression coefficients may serve as biological marker candidates for further investigation.

References

Bishop (2006) Pattern recognition and machine learning. Springer, New York.

Candès, Romberg, Tao (2006) Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Trans. on Information Theory 52: 489-509.

Candès, Tao (2006) Near optimal signal recovery from random projections: Universal encoding strategies? IEEE Trans. on Information Theory 52: 5406-5425.

Candès, Romberg (2006) l1-magic: A Collection of MATLAB Routines for Solving the Convex Optimization Programs Central to Compressive Sampling. [Online] Available: www.acm.caltech.edu/l1magic/

Candès, Tao (2007) The Dantzig selector: Statistical estimation when p is much larger than n. Ann Statist 35: 2313-2351.

Chen, Donoho, Saunders (2001) Atomic decomposition by basis pursuit. SIAM Rev 43: 129-159.

Crammer, Singer (2000) On the learnability and design of output codes for multiclass problems. Proceedings of the Thirteenth Annual Conference on Computational Learning Theory, Stanford University, Palo Alto, CA.

Diaz-Uriarte, Alvarez de Andres (2006) Gene selection and classification of microarray data using random forest. BMC Bioinformatics 7: 3.

Donoho (2006) Compressed sensing. IEEE Trans on Information Theory 52: 1289-1306.

Dudoit, Fridlyand, Speed (2002) Comparison of discrimination methods for the classification of tumors using gene expression data. J Am Stat Assoc 97: 77-87.

Friedlander, Van den Berg (2008) SPGL1, a solver for large scale sparse reconstruction. [Online] Available: http://www.cs.ubc.ca/labs/scl/spgl1/

Gibbons (2003) Nonparametric Statistical Inference, 4th edition. CRC.

Hastie, Tibshirani, Friedman (2001) The elements of statistical learning. New York: Springer.

Kressel (1999) Pairwise classification and support vector machines. In Advances in Kernel Methods: Support Vector Learning (Chapter 15). Cambridge, MA: MIT Press.

Lee, Park, Song (2005) An extensive comparison of recent classification tools applied to microarray data. Computational Statistics & Data Analysis 48: 869-885.

Nutt, Mani, Betensky, Tamayo, Cairncross (2003) Gene expression-based classification of malignant gliomas correlates better with survival than histological classification. Cancer Res 63: 1602-1607.

Platt, Cristianini, Shawe-Taylor (2000) Large margin DAGs for multiclass classification. In Advances in Neural Information Processing Systems 12. MIT Press.

Pomeroy, Tamayo, Gaasenbeek, Sturla, Angelo (2002) Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature 415: 436-442.

Saunders (2002) PDCO: Primal-Dual Interior Method for Convex Objectives. [Online] Available: http://www.stanford.edu/group/SOL/software/pdco.html

Shipp, Ross, Tamayo, Weng, Kutok (2002) Diffuse large B-cell lymphoma outcome prediction by gene expression profiling and supervised machine learning. Nat Med 8: 68-74.

Statnikov, Wang, Aliferis (2008) A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification. BMC Bioinformatics 9: 319.

Staunton, Slonim, Coller, Tamayo, Angelo (2001) Chemosensitivity prediction by transcriptional profiling. Proc Natl Acad Sci USA 98: 10787-10792.

Statnikov, Aliferis, Tsamardinos, Hardin, Levy (2005) A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics 21: 631-643.

Singh, Febbo, Ross, Jackson, Manola (2002) Gene expression correlates of clinical prostate cancer behavior. Cancer Cell: 203-209.

Welsh, Sapinoso, Kern, Dimitrov (2001) Molecular classification of human carcinomas by use of gene expression signatures. Cancer Res 61: 7388-7393.

Tibshirani (1996) Regression shrinkage and selection via the lasso. J Roy Statist Soc ser B 58: 267-288.

The MOSEK Optimization Tools Version 2.5. User's Manual and Reference (2002) [Online] Available: www.mosek.com

Van den Berg, Friedlander (2008) Probing the Pareto frontier for basis pursuit solution. Technical Report 2008, Department of Computer Science, University of British Columbia.

Vapnik (1998) Statistical learning theory. New York: Wiley.

Weston, Watkins (1999) Support vector machines for multi-class pattern recognition. In Proceedings of the Seventh European Symposium On Artificial Neural Networks (ESANN 99), Bruges, April 21-23.

Table 1. Performance without gene selection.

Methods	Prostate Tumor	DLBCL	9 Tumors	11 Tumors	Brain Tumor1	Brain Tumor2
SVM	Binary	93.27%	97.32%	N/A	N/A	N/A	N/A
	OVR	93.27%	97.32%	67.06%	94.99%	90%	75.5%
	OVO	93.27%	97.32%	54.63%	90.22%	90%	73.83%
	DAGSVM	93.27%	97.32%	54.63%	90.22%	90%	73.83%
	WW	93.27%	97.32%	68.17%	94.31%	90%	77.17%
	CS	93.27%	97.32%	68.17%	94.31%	90%	75.5%
l₁LRC	Binary	91.36%	96.07%	N/A	N/A	N/A	N/A
	OVR	91.36%	96.07%	72.21%	96.63%	90%	76.67%
	OVO	91.36%	96.07%	55.33%	91.93%	90%	77.00%

Table 2. Performance with gene selection.

Methods	Prostate Tumor	DLBCL	9 Tumors	11 Tumors	Brain Tumor1	Brain Tumor2
SVM	Accuracy	94.27%	98.75%	72.89%	96.66%	90%	82.83%
	Variant	OVO	OVO	CS	OVR	WW	OVR
	GS	KW 1000	KW 500	BW 3000	KW 1000	NG	KW 500
OVR l₁LRC	Accuracy	94.18%	98.75%	75.69%	96.66%	90%	78.33%
	GS	BW 3050	BW 500	KW 1060	KW 2000	NG	BW 9000