Received date: July 31, 2017; Accepted date: August 27, 2017; Published date: August 31, 2017
Citation: Wang KS, Liu Y, Xie X, Gong S, Xu C, et al. (2017) Principal Component Regression Analysis of Nutrition Factors and Physical Activities with Diabetes. J Biom Biostat 8: 364. doi: 10.4172/2155-6180.1000364
Copyright: © 2017 Wang KS, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
The associations of nutrition factors and physical activities with adult diabetes are inconsistent, and most of these factors are intercorrelated. The aims of this study were to overcome the multicollinearity among these risk factors and to examine their associations with diabetes using principal component analysis (PCA) and regression analysis with principal component scores (PCS). In total, 659 adults with diabetes and 2,827 non-diabetic adults were selected from the 2012 Health Information National Trends Survey (HINTS 4, Cycle 2). PCA was used to deal with multicollinearity of the risk factors. Weighted univariate and multiple logistic regression analyses were used to estimate the associations of potential factors and PCS with diabetes. Odds ratios (ORs) with 95% confidence intervals (CIs) were estimated. The first 3 PCs for nutrition factors and physical activities explained about 70% of the total variance. The first principal component (PC1) is a measure of nutrition factors (fruit and vegetable consumption), PC2 is a measure of physical activities (moderate exercise and strength training), and PC3 reflects calorie information use and soda use. Weighted multiple logistic regression showed that being African American, middle-aged (45-64 years), elderly (65+), never married, or having lower education was associated with increased odds of diabetes. After adjusting for other factors, PC1 showed a marginal association with diabetes (OR=0.84, 95% CI=0.70-1.01), while PC2 and PC3 showed significant associations with diabetes (OR=0.73, 95% CI=0.61-0.86 and OR=0.85, 95% CI=0.74-0.99, respectively). In conclusion, PCA can be used to reduce the number of indicators in complex survey data. The first 3 PCs of nutrition factors and physical activities were associated with diabetes. Promotion of healthy food and physical activities should be encouraged to help decrease the prevalence of diabetes.
Diabetes; Nutrition; Physical activities; Principal component analysis; Weighted logistic regression
Globally, there were 284.6 million people with diabetes in 2010, and the number was predicted to reach 438.4 million in 2025. In the United States (US), over 29 million people were reported to be living with diabetes, and 37% of adults aged 20 years or older were pre-diabetic in 2012 [2,3]. The burden of diabetes is projected to rise from 418 billion dollars in 2010 to 490 billion dollars in 2030. Several factors have been reported to be associated with diabetes, such as family history, ethnic background, aging, being overweight, physical inactivity, alcohol use and smoking [5-8]; however, findings on the impact of alcohol and smoking on diabetes have been inconsistent.
It has been reported that regular consumption of fruit and vegetables, reduced consumption of saturated fats, sodium and sugary drinks, as well as increased physical activity and control of smoking habits could reduce the incidence of diabetes. For example, dietary patterns characterized by high intakes of fruits and vegetables, whole grains, low-fat dairy products, and low glycemic load have been associated with lower risk of type 2 diabetes [10-15]. A meta-analysis showed that increasing the amount of green leafy vegetables in an individual’s diet could help to reduce the risk of type 2 diabetes. Two to three servings/day of vegetables and 2 servings/day of fruit conferred a lower risk of type 2 diabetes than other levels of vegetable and fruit consumption, respectively. However, vegetable but not fruit consumption was found to reduce the risk of type 2 diabetes in Chinese women, while another study showed that fruit and vegetable intake considered separately were not associated with diabetes and that only green leafy vegetable intake was inversely associated with diabetes. A recent study revealed that fruit and vegetable intake was not related to the incidence of type 2 diabetes in older subjects. Furthermore, only small differences in dietary behavior were found between individuals with diabetes and cohort members without diabetes [21,22]. Another study found a nonlinear association of fruit intake with type 2 diabetes. Muraki et al. concluded that there was heterogeneity in the associations between individual fruit consumption and risk of type 2 diabetes. A previous study suggested a correlation between drinking diet soda and glucose control in adults with diabetes, while reduced sugar intake showed improvements in key risk factors for type 2 diabetes. A recent study suggested that the impact of sugar on diabetes may be independent of sedentary behavior, alcohol use, and obesity. A more recent study showed that consumption of soft drinks, sweetened-milk beverages and energy from total sweet beverages was associated with increasing risk of type 2 diabetes.
Principal component analysis (PCA) is one of the most popular methods for variable reduction. It can overcome multicollinearity among risk factors and has been used in the social sciences, health services research, and health sciences [29-32]. For example, PCA has been used to examine dietary patterns in relation to diabetes in US adults, Chinese populations [34-37], and Japanese populations [38,39]. It has also been used to investigate the relationship between physical activity and diabetes [38,40,41].
The associations of nutrition factors and physical activities with adult diabetes are inconsistently reported. For example, high levels of physical activity are associated with reduced risk of diabetes; however, some patients at risk for diabetes were inactive [40,41]. On the other hand, as shown previously, higher intakes of fruit, berries, and vegetables have been associated with reduced risk of diabetes in some observational studies; however, the evidence is limited and inconclusive. Furthermore, most of these nutrition factors and physical activities are correlated. To our knowledge, no study has used PCA to extract PCs of these nutrition factors and physical activities, followed by a logistic regression analysis to examine their associations with diabetes. In the present study, we used data from the 2012 Health Information National Trends Survey (HINTS 4, Cycle 2). The aims of this study were to overcome the multicollinearity among the risk factors and to examine the associations of these factors with diabetes using PCA and weighted logistic regression.
The data were drawn from the 2012 HINTS 4, Cycle 2. The HINTS is a nationally representative survey that has been administered by the US National Cancer Institute (NCI) since 2003. The HINTS target population includes adults aged 18 or older in the civilian noninstitutionalized population of the US. The Cycle 2 data were collected from October 2012 through January 2013. The sample design for the Cycle 2 survey is a two-stage design. In the first stage, a stratified sample of addresses was selected from a file of residential addresses. In the second stage, one adult was selected within each sampled household. Respondent selection was conducted uniformly for all households in Cycle 2 using the Next Birthday Method, in which one questionnaire was sent with each mailing and the adult in the sampled household who would have the next birthday was asked to complete it. Every sampled adult who completed a questionnaire in Cycle 2 received a full-sample weight and a set of 50 replicate weights. The full-sample weight is used to calculate population and subpopulation estimates from the HINTS data collected in Cycle 2, while the replicate weights are used to compute standard errors for these estimates. More extensive background on the HINTS program and data collection efforts is available elsewhere [43,44]. The final HINTS 4 Cycle 2 sample consists of 3,630 respondents. The overall household response rate using the Next Birthday Method was 39.97%. The current study was approved by the IRB of East Tennessee State University.
Subjects were considered to have diabetes if they responded “yes” to the question “Has a doctor or other health professional ever told you that you had any of the following medical conditions: Diabetes or high blood sugar?” Controls were those who responded “no” to the question. Of the 3,630 adults, 3,486 responded to the question, including 659 with diabetes and 2,827 non-diabetic individuals (Table 1).
|Variable||Total (N)||Diabetes||Prevalence (%)||95%CI||p-value|
Abbreviations: AA: African American, CI: Confidence interval, p-value is based on χ2 test.
Table 1: Lifetime prevalence of diabetes (%) within each group of explanatory variables.
Demographic characteristics included gender, age group (18-49 years, 50-64 years, 65+), race, marital status (married/living together, widowed/divorced/separated, and never married), and education. Race was recoded as Hispanic, Non-Hispanic White, Non-Hispanic Black or African American (AA), and other. Education was determined by asking whether he/she had a high school degree or not. Smoking status was classified as never smoking, current smoking, or past smoking.
Soda use was defined by the question “Not counting any diet soda or pop, about how often do you drink regular soda or pop in a typical week?” There are six ordinal levels (don't drink any regular soda or pop, less than 1 day a week, 1-2 days a week, 3-4 days a week, 5-6 days a week, and every day). Fruit consumption was defined by the question “About how many cups of fruit (including 100% pure fruit juice) do you eat or drink each day?” Responses were categorized into seven levels: none, ½ cup or less, ½ cup to 1 cup, 1 to 2 cups, 2-3 cups, 3-4 cups, and 4 or more cups. Vegetable consumption was defined by the question “About how many cups of vegetables (including 100% pure vegetable juice) do you eat or drink each day?” Responses were categorized into the same seven levels: none, ½ cup or less, ½ cup to 1 cup, 1 to 2 cups, 2-3 cups, 3-4 cups, and 4 or more cups. Calorie information use was defined by the question “When available, how often do you use menu information on calories in deciding what to order?” Responses were categorized into five levels: never, rarely, sometimes, often, and always. Moderate exercise was defined by the question “In a typical week, how many days do you do any physical activity of at least moderate intensity?” There are seven levels (none, 1 day a week, 2 days a week, 3 days a week, 4 days a week, 5 days a week, and 6-7 days a week). Strength training was defined by the question “In a typical week, how many days do you do leisure-time physical activities specifically designed to strengthen your muscles?” There are seven levels (none, 1 day a week, 2 days a week, 3 days a week, 4 days a week, 5 days a week, and 6-7 days a week).
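The ordinal coding of these survey items can be sketched as follows. This is only an illustration: the integer codes (0 for the lowest level upward) and the helper name `encode_ordinal` are assumptions, and the codes actually stored in the HINTS data file may differ.

```python
# Hypothetical 0-based ordinal coding for the regular soda item.
SODA_LEVELS = [
    "don't drink any regular soda or pop",
    "less than 1 day a week",
    "1-2 days a week",
    "3-4 days a week",
    "5-6 days a week",
    "every day",
]


def encode_ordinal(response, levels):
    """Return the 0-based rank of a response, or None if it is not a valid level."""
    try:
        return levels.index(response)
    except ValueError:
        return None  # treat unrecognized or missing answers as missing


soda_code = encode_ordinal("3-4 days a week", SODA_LEVELS)
```

Keeping the levels in an ordered list means the same helper can be reused for the seven-level fruit, vegetable, exercise, and strength training items.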
Descriptive statistics and prevalence
All analyses were conducted using the Statistical Analysis System (SAS) (version 9.4, SAS Inc., Cary, NC, USA). Data were weighted to produce overall and stratified estimates that would be nationally representative of the US population. Weights were derived initially from selection probabilities to compensate for planned oversampling procedures. The resulting weights were then calibrated using comparable population characteristics for sex, age, race, and education from data publicly available through the Current Population Survey. A set of 50 replicate weights was used in order to generate an unbiased estimate of the population variance. PROC SURVEYFREQ was used to weight and estimate population proportions in cases and controls and in different demographic strata, while PROC SURVEYMEANS was used to estimate the overall prevalence. The chi-square test was used to compare the prevalence of diabetes across age, gender, and race groups.
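The role of the full-sample and replicate weights can be illustrated with a short sketch. It is not the authors' SAS code: the data are toy values, the function names are hypothetical, and the jackknife coefficient (R - 1) / R is assumed as the default because the article does not state the replication method.

```python
import math


def weighted_proportion(y, w):
    """Weighted share of respondents with y == 1, using survey weights w."""
    return sum(wi for yi, wi in zip(y, w) if yi == 1) / sum(w)


def jackknife_se(y, full_w, rep_ws, coef=None):
    """Standard error of the weighted proportion from replicate weights.

    coef defaults to (R - 1) / R, an assumed jackknife factor.
    """
    r = len(rep_ws)
    if coef is None:
        coef = (r - 1) / r
    theta = weighted_proportion(y, full_w)
    var = coef * sum((weighted_proportion(y, rw) - theta) ** 2 for rw in rep_ws)
    return math.sqrt(var)


# Toy data: 4 respondents and 2 replicate weight sets (HINTS supplies 50).
y = [1, 0, 0, 1]
full_w = [1.0, 2.0, 1.0, 1.0]
rep_ws = [[2.0, 2.0, 0.0, 1.0], [0.0, 2.0, 2.0, 1.0]]

prevalence = weighted_proportion(y, full_w)
se = jackknife_se(y, full_w, rep_ws)
```

The point estimate uses only the full-sample weight; each replicate weight set re-estimates the same quantity, and the spread of those re-estimates around the full-sample value gives the variance.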
Principal component analysis
PCA is an effective method for reducing the dimensionality of multivariate data. It is possible to account for most of the information in the original data set with a relatively small number of PCs, and the PCs are uncorrelated with one another. Generally, the first principal component (PC1) is the linear combination of the variables that captures the maximum amount of information in the data and is correlated with at least some of the observed variables. The general formula (1) is used to compute scores on PC1 extracted in a PCA.
$$PC_1 = b_{11}X_1 + b_{12}X_2 + \dots + b_{1k}X_k \qquad (1)$$
where
$PC_1$ = the participant’s score on the first PC (the first component extracted);
$b_{1k}$ = the coefficient (or weight) for observed variable $k$, as used in creating $PC_1$;
$X_k$ = the participant’s score on observed variable $k$, for each of the observed variables $1, 2, \dots, k$.
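Formula (1) is a weighted sum, which a few lines of Python make concrete. The weights and scores below are hypothetical values chosen for illustration, and standardizing the observed variables first is an assumption that matches a PCA run on a correlation matrix.

```python
def standardize(values):
    """Convert raw scores to z-scores (mean 0, sample standard deviation 1)."""
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / (n - 1)) ** 0.5
    return [(v - mean) / sd for v in values]


def pc_score(weights, x):
    """Formula (1): weighted sum of one participant's observed variables."""
    if len(weights) != len(x):
        raise ValueError("weights and x must have the same length")
    return sum(b * xk for b, xk in zip(weights, x))


b1 = [0.6, 0.6, 0.1]    # hypothetical weights b11, b12, b13
x = [1.0, 0.5, -2.0]    # hypothetical standardized scores X1, X2, X3
score = pc_score(b1, x)  # 0.6*1.0 + 0.6*0.5 + 0.1*(-2.0)
z = standardize([1.0, 2.0, 3.0])
```

Scores on later components are obtained in the same way, with each component using its own set of weights.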
The second PC (PC2) accounts for the maximum amount of variance in the data that was not accounted for by PC1 and is correlated with at least some of the observed variables that did not display strong correlations with PC1. An eigenvalue reflects the amount of variance captured by a given PC. The eigenvalue-one criterion (eigenvalue≥1) is commonly used to decide how many PCs to retain [45,46]. The proportion of variation explained by each PC can be calculated with formula (2). Alternatively, any PC that accounts for at least 5% or 10% of the total variance can be retained.
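Formula (2) is not reproduced in this text. The standard definition, assumed here, divides each eigenvalue by the sum of all eigenvalues; the symbol $\lambda_j$ for the eigenvalue of the $j$th PC is introduced only for this sketch.

```latex
\text{Proportion}_j \;=\; \frac{\lambda_j}{\sum_{i=1}^{k} \lambda_i} \qquad (2)
```

When the PCA is run on a correlation matrix of $k$ standardized variables, the denominator equals $k$, so with six indicators an eigenvalue of 2.1009 corresponds to $2.1009 / 6 \approx 35\%$ of the total variance, in line with the proportions reported in the Results.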
A varimax rotation produces uncorrelated components and is the most commonly used orthogonal rotation in practice. A factor loading of an independent variable is considered large if its absolute value exceeds 0.40. Using ordinal and dichotomous indicators is a very common practice in the social sciences and health sciences. It has been suggested that polychoric correlations be used instead of Pearson’s correlations for categorical variables in PCA and other multivariate analyses. Both polychoric and Pearson’s correlations were calculated using PROC CORR for the PCA, and the PCA itself was performed with PROC FACTOR in SAS. A scree diagram, a graphical display of the eigenvalues, was obtained using the SCREE option in PROC FACTOR. The components on the steep part of the curve, before the first point at which the curve flattens, were retained.
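The extraction step described above (eigenvalues of a correlation matrix, the eigenvalue-one criterion, and loadings) can be sketched in plain Python. This is an illustration rather than the authors' SAS code: the correlation matrix is a toy example, a cyclic Jacobi routine stands in for PROC FACTOR, and polychoric estimation and varimax rotation are omitted.

```python
import math


def jacobi_eigen(matrix, tol=1e-12, max_sweeps=100):
    """Eigenvalues and eigenvectors of a symmetric matrix (cyclic Jacobi)."""
    n = len(matrix)
    a = [row[:] for row in matrix]
    v = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(max_sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-15:
                    continue
                # Rotation angle that zeroes a[p][q].
                if a[p][p] == a[q][q]:
                    theta = math.pi / 4
                else:
                    theta = 0.5 * math.atan(2 * a[p][q] / (a[q][q] - a[p][p]))
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # update columns p and q of A and V
                    a[k][p], a[k][q] = (c * a[k][p] - s * a[k][q],
                                        s * a[k][p] + c * a[k][q])
                    v[k][p], v[k][q] = (c * v[k][p] - s * v[k][q],
                                        s * v[k][p] + c * v[k][q])
                for k in range(n):  # update rows p and q of A
                    a[p][k], a[q][k] = (c * a[p][k] - s * a[q][k],
                                        s * a[p][k] + c * a[q][k])
    pairs = [(a[i][i], [v[k][i] for k in range(n)]) for i in range(n)]
    return sorted(pairs, key=lambda t: t[0], reverse=True)


# Toy correlation matrix: three indicators, each pair correlated at 0.5.
corr = [[1.0, 0.5, 0.5], [0.5, 1.0, 0.5], [0.5, 0.5, 1.0]]
pairs = jacobi_eigen(corr)
eigenvalues = [ev for ev, _ in pairs]
retained = [(ev, vec) for ev, vec in pairs if ev >= 1.0]  # eigenvalue-one rule
# Unrotated loadings: eigenvector entries scaled by sqrt(eigenvalue).
loadings = [[x * math.sqrt(ev) for x in vec] for ev, vec in retained]
```

In this toy case only one component passes the eigenvalue-one rule; the sign of an eigenvector is arbitrary, so loadings are usually interpreted by magnitude and relative sign.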
Weighted multiple logistic regression analysis
Multiple logistic regression analysis (3) with diabetes as a binary trait, adjusted for covariates, was performed using SAS software.
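Formula (3) is not reproduced in this text. A form consistent with the symbol definitions that follow would be the usual logit model; the intercept $\beta_0$ and the summation ranges are assumptions made for this sketch.

```latex
\ln\!\left(\frac{P(Y_1 = 1)}{1 - P(Y_1 = 1)}\right)
  \;=\; \beta_0 \;+\; \sum_{m} \beta_m X_{p+m} \;+\; \sum_{n} \beta_{pcn}\, PC_n \qquad (3)
```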
where $Y_1$ is diabetes status (1 if diabetes), $\beta_m$ is the slope for the $m$th observed variable, and $X_{p+m}$ is the value of observed variable $m$; $\beta_{pcn}$ is the slope for the $n$th PC, and $PC_n$ is the score on the $n$th PC.
The SURVEYLOGISTIC procedure fits logistic regression models for discrete response survey data by the method of maximum likelihood. Asymptotic p-values were obtained, and the odds ratio (OR) and the standard error (SE) of the OR were estimated. Variances of the regression parameters and odds ratios were computed using either the Taylor series (linearization) method or replication (resampling) methods, which estimate the sampling errors of estimators based on complex sample designs [49-52]. Two models were fitted to investigate the relationship between the occurrence of diabetes and its explanatory variables. In model one, simple logistic regression was used to examine the association of each potential risk factor, including the first several PCs, with diabetes. In model two, multiple logistic regression was used to adjust for all potential risk factors of diabetes, including the PCs.
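Weighted maximum likelihood for a logistic model can be sketched with a small Newton-Raphson routine. This is a simplified stand-in for PROC SURVEYLOGISTIC, assuming a single predictor and toy data; it gives point estimates only and does not perform the Taylor series or replication variance estimation described above.

```python
import math


def fit_weighted_logit(x, y, w, iters=50):
    """Weighted ML fit of logit P(y=1) = b0 + b1*x by Newton-Raphson."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi, wi in zip(x, y, w):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            r = wi * (yi - p)        # weighted score contribution
            v = wi * p * (1.0 - p)   # weighted information contribution
            g0 += r
            g1 += r * xi
            h00 += v
            h01 += v * xi
            h11 += v * xi * xi
        det = h00 * h11 - h01 * h01
        # Solve the 2x2 system (information matrix) * step = score.
        s0 = (h11 * g0 - h01 * g1) / det
        s1 = (h00 * g1 - h01 * g0) / det
        b0 += s0
        b1 += s1
        if abs(s0) + abs(s1) < 1e-12:
            break
    return b0, b1


# Toy data: four covariate patterns; the weights act like weighted counts.
x = [0, 0, 1, 1]
y = [1, 0, 1, 0]
w = [10.0, 30.0, 20.0, 20.0]
b0, b1 = fit_weighted_logit(x, y, w)
odds_ratio = math.exp(b1)
```

With a single binary predictor, the fitted odds ratio matches the weighted 2x2 table: (20 × 30) / (20 × 10) = 3.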
Prevalence of diabetes
Table 1 presents the prevalence of diabetes. The overall prevalence of diabetes was 14.6% (13.2% for males and 16.0% for females). There were no significant differences between males and females or among race groups. The prevalence increased with age (8.1%, 20.8% and 25.7% for age groups 18-49, 50-64 and 65+ years, respectively). Higher prevalence was found among individuals with lower education, those who were widowed, divorced, or separated, and former smokers.
Principal component analysis
The correlation coefficients among nutrition factors and physical activities are presented in Table 2. Fruit and vegetable consumption, moderate exercise, strength training, and calorie information use were significantly and positively correlated with one another under both polychoric correlation and Pearson’s correlation (p<0.0001), whereas regular soda use was significantly and negatively correlated with all five other factors (p<0.0001).
| Variable | Calorie information use | Fruit consumption | Vegetable consumption | Soda use | Moderate exercise | Strength training |
|---|---|---|---|---|---|---|
| Calorie information use | 1.000 | 0.1927 | 0.1846 | -0.2163 | 0.1523 | 0.1282 |
Above the diagonal is the Pearson correlation coefficient; below the diagonal is the polychoric correlation coefficient.
The p values of all correlation coefficients are smaller than 0.0001.
Table 2: Correlation of nutrition factors and physical activities.
The first three PCs explained about 70% of the total variation. The eigenvalues of the first three PCs were 2.1009, 1.118 and 0.9786, respectively, and the proportions of variation explained by these three PCs were 35%, 18.6% and 16.3%, respectively (Table 3). The scree diagram in Figure 1 also indicated that retaining the first three PCs is appropriate when the proportion of variation is considered. The rotated factor patterns of the first 3 PCs are presented in Table 4. PC1 is strongly and positively correlated with fruit and vegetable consumption; that is, PC1 increases as the consumption of fruit and vegetables increases. This component can be viewed as a measure of nutrition, with high loadings for fruits and vegetables (both loadings were 0.85). PC2 increases with increasing moderate exercise and strength training (loadings were 0.82 and 0.86, respectively); therefore, it can be treated as a measure of physical activities. PC3 increases with increasing soda use (loading of 0.85) and with decreasing calorie information use (loading of -0.68).
|PC||Eigenvalue||Difference||Variance proportion||Cumulative variance proportion|
PC: Principal component.
Table 3: Eigenvalues and the proportion of variation explained by the principal components.
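The reported proportions can be checked directly from the eigenvalues. The sketch below assumes that the total variance equals 6, the number of standardized indicators entered into the PCA; the variable names are illustrative.

```python
eigenvalues = [2.1009, 1.118, 0.9786]  # first three PCs, as reported in the text
n_variables = 6                        # assumed total variance (six indicators)

proportions = [ev / n_variables for ev in eigenvalues]
cumulative = sum(proportions)

for i, share in enumerate(proportions, start=1):
    print(f"PC{i}: {share:.1%}")
print(f"Cumulative: {cumulative:.1%}")
```

Note that the third eigenvalue (0.9786) falls just below the eigenvalue-one cutoff, so retaining PC3 relies on the proportion-of-variance and scree criteria.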
|Calorie information use||0.16||0.19||-0.68*|
PC: Principal component.
*A factor loading of one independent variable is considered as large if its absolute value exceeds 0.40.
Table 4: Rotated factor pattern of nutrition factors and physical activities.
Weighted logistic regression analyses
The results of univariate and multiple logistic regression analyses of independent factors, including the first 3 PCs, are presented in Table 5. In univariate analysis, all factors except gender and race were associated with diabetes (p<0.05). Multiple logistic regression analyses showed that lower education (OR=1.80, 95% CI=1.27-2.54), being a middle-aged adult (OR=2.18, 95% CI=1.53-3.14) and being an elderly adult (OR=2.83, 95% CI=1.83-4.36) were positively associated with diabetes. African Americans (AAs) (OR=1.98, 95% CI=1.30-3.02) were more likely to have diabetes than Whites. Univariate logistic analysis revealed that the first 3 PCs were negatively associated with diabetes (p<0.05). After adjusting for other factors, PC1 showed a borderline association with diabetes (OR=0.84, 95% CI=0.70-1.01), while PC2 and PC3 showed significant associations with diabetes (OR=0.73, 95% CI=0.61-0.86 and OR=0.85, 95% CI=0.74-0.99, respectively).
| Variable | Crude OR | 95% CI | p-value | Adjusted OR | 95% CI | p-value |
|---|---|---|---|---|---|---|
| > High school | 1 | | | 1 | | |
| ≤ High school | 2.22 | 1.62-3.04 | <0.0001 | 1.80 | 1.27-2.54 | 0.001 |
Abbreviations: AA: African American; PC: Principal component; OR: Odds ratio; CI: Confidence interval.
Table 5: Univariate and multiple logistic regression analyses.
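Odds ratios and their confidence intervals come from logistic coefficients by exponentiation. The sketch below assumes a normal-based 95% interval (a survey procedure may use a t quantile instead), and the helper names are hypothetical. It also shows how to back out the standard error implied by a reported interval.

```python
import math

Z95 = 1.959964  # two-sided 95% quantile of the standard normal (assumed)


def odds_ratio_ci(beta, se, z=Z95):
    """Return (OR, lower, upper) for a logistic coefficient and its SE."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


def se_from_ci(lower, upper, z=Z95):
    """Log-scale standard error implied by a reported confidence interval."""
    return (math.log(upper) - math.log(lower)) / (2 * z)


# Reported adjusted result for PC2: OR = 0.73, 95% CI = 0.61-0.86.
beta_pc2 = math.log(0.73)
se_pc2 = se_from_ci(0.61, 0.86)
or_pc2, lower_pc2, upper_pc2 = odds_ratio_ci(beta_pc2, se_pc2)
```

Because the published values are rounded to two decimals, the recomputed bounds agree with the reported interval only approximately.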
In this study, we found the prevalence of diabetes to be significantly higher in older adults, in those who were widowed, divorced, or separated, in those with low education, and in former smokers. The first 3 principal components (PC1-PC3) for nutrition factors and physical activities explained about 70% of the total variance. PC1 is a measure of nutrition factors (fruit and vegetable consumption), PC2 is a measure of physical activity (moderate exercise and strength training), and PC3 is a measure of calorie information use and soda use. The results from weighted multiple logistic regression showed that race, age, marital status and education were associated with diabetes. Univariate logistic analysis revealed that the first 3 PCs were negatively associated with diabetes (p<0.05). After adjusting for other factors, PC2 and PC3 were significantly associated with diabetes, whereas PC1 showed only a marginal association.
Previous studies have shown that smoking is an independent risk factor for the development of diabetes [53-55]. Recently, a meta-analysis suggested that passive smoking is a risk factor for diabetes even in those who were not themselves active smokers. Both passive and active smoking are associated with diabetes in the elderly population, whereas in Southern California American Indian men aged 25 years or over, morbid obesity and smoking were significantly associated with diabetes. In the present study, former smoking was a risk factor for diabetes in the univariate logistic analysis; however, after adjusting for other factors, the association disappeared. We speculated that smoking may be related to other factors. We further examined the polychoric correlations among these factors and found that smoking was correlated with age group (p=0.0121), education (p<0.0001), gender (p<0.0001) and marital status (p=0.0281).
Previous studies suggest that PCA can reduce recall bias and the complexity of correlated data, which can be easily collected as single indicator variables in surveys [58,59]. For example, PCA has been used to study dietary patterns in relation to diabetes. It has been shown that fruits, green leafy vegetables, and regular soda were associated with lower risk of incident type 2 diabetes in the Multi-Ethnic Study of Atherosclerosis (MESA). Furthermore, in the Singapore Chinese Health Study, the consumption of vegetables, fruits, soy and other legumes, whole grains, nuts, and seeds likely decreased the risk of diabetes, while higher intake of processed meat, sweetened foods and beverages, fried foods, and refined grains increased the risk of developing type 2 diabetes. In the Hong Kong Dietary Survey, a dietary pattern with more vegetables, fruits and fish was associated with reduced risk, and a dietary pattern with more meat and milk products was associated with an increased risk of diabetes. One Japanese study showed that consuming a healthy diet was associated with a lower risk of diabetes among the Japanese. However, another study found that dietary patterns may not be appreciably associated with type 2 diabetes risk in Japanese. In addition, one study suggested that consuming a healthy diet was associated with a lower risk of diabetes among the Japanese, particularly among those who eat regularly, habitually exercise, and are either non- or ex-smokers. In the present study, we found that PC1 was negatively associated with diabetes in univariate logistic analysis (p=0.045); however, after adjusting for other factors, PC1 showed only a marginal association with diabetes (p=0.066), which indicated that diabetic individuals may not have realized the importance of nutrition for their health.
Another risk factor for diabetes is the lack of physical activity. Previous studies have shown that high levels of physical activity are associated with reduced risk of diabetes [60-63]. However, about 46% of primary care patients at risk for diabetes did not engage in any weekly physical activity, and two-thirds of patients with diabetes remain inactive. It has been recommended that moderate to vigorous physical activity can reduce the risk of chronic diseases such as diabetes and its complications [65-67]. Health counsellors should address barriers to physical activity in order to increase patients' adherence to the recommendations. Our current results showed that physical activities (PC2) were associated with a decreased risk of diabetes. To the best of our knowledge, few studies have used PCA to address physical activity. For example, one study conducted exploratory principal components factor analyses of an instrument measuring influences on physical activity; another study used PCA to extract factors describing barriers to physical activity.
A previous study suggested a correlation between drinking diet soda and glucose control in adults with diabetes. Soda use has been associated with greater risks of metabolic syndrome components and type 2 diabetes [28,68], whereas reduced sugar intake showed improvements in key risk factors for type 2 diabetes. A recent study suggested that the impact of consuming sugar on diabetes may be independent of sedentary behavior, alcohol use, and obesity. In the present study, PC3 was negatively associated with diabetes, which suggested that diabetic individuals consumed less regular soda than non-diabetic individuals. In addition, calorie information use was negatively correlated with PC3, and the logistic regression indicated that diabetic individuals used calorie information more than non-diabetic individuals. These results suggest that diabetic individuals pay more attention to their calorie intake in order to comply with their physician’s recommendations for diabetes treatment.
There are several important strengths in this study. First, new and valuable variables were used, including strength training, regular soda use and calorie information use, which have not been intensively investigated in past studies. Furthermore, PCA was used to reduce the dimension of the variables while keeping most of the information, followed by regression on the principal component scores. We are also aware of certain limitations of this study, including the cross-sectional design, which limits the ability to establish causality, as well as possible recall bias, differential misclassification bias, and the effects of differences in how respondents interpreted survey questions.
Our findings support the notion that PCA can be used to reduce the number of indicators in complex survey data. The PCs of nutrition factors and physical activities were associated with diabetes. Promotion of healthy food and physical activities should be encouraged to help decrease the prevalence of diabetes.
The authors would like to thank the NCI for providing the data from the 2012 Health Information National Trends Survey.