Department of Economics and Statistics, California State University, USA
Received date: July 13, 2015; Accepted date: July 15, 2015; Published date: July 22, 2015
Citation: Sapra S (2015) High-Dimensional Statistical Analysis in Business and Economics. Bus Eco J 6:168. doi:10.4172/2151-6219.1000168
Copyright: © 2015 Sapra S. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
High-dimensional models have become increasingly common in business and economics in recent years. Modern business data involve thousands to millions of records on individuals. Over the past few decades, with the availability of new technologies, it has become common in economics, finance, and marketing to collect data on a large number of features for a limited number of individuals. Fan et al. review the literature on high-dimensional models and their applications in economics and finance. In vector autoregressive models, the number of parameters grows with the size of the model, and estimating these models can quickly become computationally intractable. Panel data, which are widely used in economics, offer another application of high-dimensional data analysis. In finance, volatility matrix estimation is an example of high-dimensional statistics. In marketing, scanner data on household transactions across a large number of products are yet another example.
High-dimensional statistical analysis refers to the aforementioned situations, in which the number of parameters is much greater than the sample size. Unfortunately, common econometric techniques, which are suitable for low-dimensional data in which the number of records exceeds the number of covariates, are not suitable for high-dimensional data. So, what can go wrong if a technique suitable for low-dimensional data is applied in a high-dimensional setting? The main problem is that low-dimensional techniques, such as least squares or logistic regression, will produce a perfect fit on the training data but a poor fit on independent test data. James et al. attribute this overfitting to the excessive flexibility of least squares and logistic regression in high-dimensional settings. Overfitting means that the statistical procedure fits mostly the noise in the data rather than the signal. The key, therefore, is to avoid overfitting with high-dimensional data by using less flexible alternatives to least squares or logistic regression, such as LASSO, principal components regression, ridge regression, and forward stepwise selection. The main idea underlying these techniques is regularization or shrinkage, which means constraining the fitted model by shrinking coefficient estimates toward zero or, as in LASSO and stepwise selection, reducing the number of non-zero coefficient estimates. Another key idea is that high-dimensional statistical problems are plagued by the curse of dimensionality: the quality of the fitted model and its predictions can deteriorate as new features are added. As James et al. explain, if additional signal features, which are truly associated with the response, are included in the model, the quality of the fit does improve. However, if additional noise features, which are not associated with the response, are included, the model fit and predictions deteriorate and the risk of overfitting increases.
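The contrast between an unregularized fit and a shrunken fit can be illustrated with a small simulation. The sketch below is not taken from the works cited. It assumes Gaussian features, a sparse true coefficient vector, a hand-picked penalty `lam = 0.5`, and a plain coordinate-descent LASSO written only for illustration; the helper names `soft_threshold` and `lasso_cd` are hypothetical.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward zero by t, setting it to zero when |z| <= t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Illustrative LASSO: minimise (1/(2n))*||y - Xb||^2 + lam*||b||_1
    by cyclic coordinate descent (no intercept; data assumed centred)."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n   # per-feature scale
    resid = y.copy()                    # residual y - X @ b for b = 0
    for _ in range(n_iter):
        for j in range(p):
            # covariance of feature j with the residual that excludes feature j
            rho = X[:, j] @ resid / n + col_sq[j] * b[j]
            b_new = soft_threshold(rho, lam) / col_sq[j]
            resid += X[:, j] * (b[j] - b_new)
            b[j] = b_new
    return b

def mse(X, y, coef):
    return float(np.mean((y - X @ coef) ** 2))

# Assumed setting: 50 training records, 200 features, only 5 carry signal.
rng = np.random.default_rng(0)
n_train, n_test, p, n_signal = 50, 1000, 200, 5
beta = np.zeros(p)
beta[:n_signal] = 2.0

X_train = rng.standard_normal((n_train, p))
y_train = X_train @ beta + rng.standard_normal(n_train)
X_test = rng.standard_normal((n_test, p))
y_test = X_test @ beta + rng.standard_normal(n_test)

# With p > n the least squares solution is not unique; lstsq returns the
# minimum-norm one, which interpolates the training responses.
coef_ls = np.linalg.lstsq(X_train, y_train, rcond=None)[0]
coef_lasso = lasso_cd(X_train, y_train, lam=0.5)

ls_train, ls_test = mse(X_train, y_train, coef_ls), mse(X_test, y_test, coef_ls)
lasso_train, lasso_test = (mse(X_train, y_train, coef_lasso),
                           mse(X_test, y_test, coef_lasso))
n_nonzero_lasso = int(np.count_nonzero(coef_lasso))

print(f"least squares: train MSE {ls_train:.2e}, test MSE {ls_test:.2f}")
print(f"LASSO:         train MSE {lasso_train:.2f}, test MSE {lasso_test:.2f}")
print(f"non-zero LASSO coefficients: {n_nonzero_lasso} of {p}")
```

Under these assumptions, least squares reproduces the training responses essentially exactly while its test error stays well above the noise variance of 1, and the LASSO fit retains only a subset of the coefficients. In applied work the penalty would normally be chosen by cross-validation rather than fixed in advance.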
High-dimensional data can be a blessing if the additional features are relevant and associated with the response, resulting in a superior predictive model, but can also be a curse if these features constitute noise, in which case the quality of predictions is poor. Furthermore, even for relevant features, the additional variance that accompanies their inclusion may more than offset the reduction in bias. Finally, the results of high-dimensional techniques should be interpreted with caution because of extreme multicollinearity among features. Common measures of model fit used in low-dimensional settings, such as p-values and R-squared, can be misleading in high-dimensional settings, since overfitting will make the model fit appear almost perfect.
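The way training R-squared can mislead is easy to reproduce in a simulation. The following is a minimal sketch under assumed settings that do not come from the article: 40 training records, 3 signal features with unit coefficients, and up to 37 pure-noise features, all Gaussian. It fits nested least squares models and compares R-squared on the training data with R-squared on independent test data.

```python
import numpy as np

def r_squared(y, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Assumed simulation settings (illustrative only).
rng = np.random.default_rng(1)
n_train, n_test, n_signal, max_noise = 40, 1000, 3, 37

def draw(n):
    """Signal features drive y; noise features are unrelated to y."""
    signal = rng.standard_normal((n, n_signal))
    noise = rng.standard_normal((n, max_noise))
    y = signal.sum(axis=1) + rng.standard_normal(n)
    return np.hstack([signal, noise]), y

X_train, y_train = draw(n_train)
X_test, y_test = draw(n_test)

noise_counts = [0, 10, 20, 30, 37]
train_r2, test_r2 = [], []
for k in noise_counts:
    cols = n_signal + k  # nested models: signal features plus the first k noise features
    coef = np.linalg.lstsq(X_train[:, :cols], y_train, rcond=None)[0]
    train_r2.append(r_squared(y_train, X_train[:, :cols] @ coef))
    test_r2.append(r_squared(y_test, X_test[:, :cols] @ coef))
    print(f"{k:2d} noise features: train R2 = {train_r2[-1]:.3f}, "
          f"test R2 = {test_r2[-1]:.3f}")
```

Because the models are nested, training R-squared can only rise as noise features are added, and it reaches 1 once the number of features equals the number of records. Test R-squared moves in the opposite direction, which is why fit should be judged on held-out data or by cross-validation in these settings.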
High-dimensional statistical analysis is a fast-evolving area of research and is highly promising for researchers in business and economics. It is likely to see contributions of better techniques for prediction as well as inference in the coming years. Augmenting the penalized least squares and penalized likelihood methods for variable selection in sparse regression models discussed in Buhlmann and van de Geer and Fan et al. with the data mining techniques detailed in Belloni et al. is a promising approach for improved prediction and inference.