# Open Access Scientific Reports

## The Logistic Regression and ROC Analysis of Diagnostic Tests Results for Gestational Diabetic Mellitus

 Research Article Open Access
Oyeka ICA1 and Okeh UM2*
1Department of Applied Statistics, Nnamdi Azikiwe University, Awka, Nigeria
2Department of Industrial Mathematics and Applied Statistics, Ebonyi State University, Abakaliki, Nigeria
 *Corresponding author: Okeh UM Department of Industrial Mathematics and Applied Statistics Ebonyi State University Abakaliki, Nigeria E-mail: [email protected]

Received February 07, 2013; Published March 30, 2013

Citation: Oyeka ICA, Okeh UM (2013) The Logistic Regression and ROC Analysis of Diagnostic Tests Results for Gestational Diabetic Mellitus. 2:654 doi:10.4172/scientificreports.654

Copyright: © 2013 Oyeka ICA, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Abstract

This paper proposes a matrix approach for estimating the parameters of logistic regression, with a view to estimating the effects of risk factors for gestational diabetes mellitus (GDM). Compared with other methods of estimating the parameters of non-linear regression, the proposed method is simpler and its parameter estimates converge more quickly. The odds ratios obtained from the logistic regression were used to interpret the effects of these risk factors. When the proposed method was applied to data from five randomly selected hospitals in Ebonyi State, Nigeria, obesity and family history were positively associated with GDM. The proposed method was seen to compare favorably with other known methods.

Keywords

GDM; Odds ratio; Logistic regression; Dichotomous; Newton-Raphson

Introduction

The constant evolution of medicine over the last two decades has meant that statistics has had to develop methods to solve the new problems that have appeared, and has come to play a central part in methods of diagnosis of diseases [1]. A diagnostic method consists of the application of a test with a group of patients, in order to obtain a provisional diagnosis regarding the presence or the absence of a particular disease [2]. In this work, logistic regression has been proposed for the purpose of estimating the effects of various predictors on some binary outcome of interest. Here, logistic regression regresses a dichotomous dependent variable on a set of independent variables, as a way of knowing the effects of these independent variables [3,4].

We therefore propose a matrix approach for solving a system of nonlinear equations with P+1 unknown parameters. The method is applied to estimating the effects of risk factors on the occurrence of gestational diabetes mellitus (GDM) [5-7]. It is illustrated using data on GDM, and is shown to compare favorably with other existing methods in terms of efficiency.

The Proposed Method

The fundamental model for any multiple regression analysis assumes that the outcome variable is a linear combination of a set of predictors, and this is represented as:

$$Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_P x_{iP} + \varepsilon_i \tag{1}$$

Where β0 is the expected value of Y when the x's are set to 0, βk is the regression coefficient for the corresponding predictor variable $x_{ik}$, and ε is the error of the prediction. The binary logistic model is based on a linear relationship between the natural logarithm (ln) of the odds of an event and a numerical independent variable. The form of this relationship is as follows:

$$\ln\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \beta_1 x \tag{2}$$

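The logistic regression model indirectly models the response variable based on probabilities associated with the values of Y. Let πi be the probability that Y=1 and 1-πi be the probability that Y=0. With $x_{i0} = 1$ for the intercept, these probabilities are represented as:

$$\pi_i = P(Y_i = 1 \mid x_i) = \frac{e^{\sum_{k=0}^{P} x_{ik}\beta_k}}{1 + e^{\sum_{k=0}^{P} x_{ik}\beta_k}}, \qquad 1 - \pi_i = P(Y_i = 0 \mid x_i) = \frac{1}{1 + e^{\sum_{k=0}^{P} x_{ik}\beta_k}} \tag{3}$$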

But, the general form of logistic model is given by

(4)

where $i = 1, 2, \dots, N$, and $\pi_i/(1-\pi_i)$ are the odds of developing the disease for a subject with the given risk factors. By the logit transformation of the inverse of the log odds to favour Y=1, we obtain the linear component as

Similarly,

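Using the inverse of the logit transformation of the natural logarithm of the odds (log odds) to favor Y=1, we equate it to the linear component to have:

$$\ln\left(\frac{\pi_i}{1-\pi_i}\right) = \sum_{k=0}^{P} x_{ik}\beta_k, \qquad i = 1, 2, \dots, N \tag{5}$$

Therefore,

$$\frac{\pi_i}{1-\pi_i} = e^{\sum_{k=0}^{P} x_{ik}\beta_k} \tag{6}$$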
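The relationship between the linear component, the odds and the probability πi can be checked numerically. The following is a minimal sketch rather than the authors' code: the function names and the coefficient values are hypothetical, chosen only to illustrate the logit and its inverse for one binary risk factor.

```python
import math

def linear_predictor(x_row, beta):
    """Return sum_k x_ik * beta_k, where x_row[0] = 1 carries the intercept."""
    return sum(x * b for x, b in zip(x_row, beta))

def inverse_logit(eta):
    """Map log odds eta to a probability pi = e^eta / (1 + e^eta)."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (1.0 + e)

def logit(p):
    """Map a probability p in (0, 1) to its log odds ln(p / (1 - p))."""
    return math.log(p / (1.0 - p))

# Hypothetical coefficients: an intercept and one binary risk factor.
beta = [-1.0, 1.1]
x_exposed = [1.0, 1.0]    # risk factor present
x_unexposed = [1.0, 0.0]  # risk factor absent

p1 = inverse_logit(linear_predictor(x_exposed, beta))
p0 = inverse_logit(linear_predictor(x_unexposed, beta))

# The odds ratio between the two groups equals e^{beta_1}.
odds_ratio = (p1 / (1.0 - p1)) / (p0 / (1.0 - p0))
print(p1, p0, odds_ratio, math.exp(beta[1]))
```

Applying `logit` to `p1` recovers the linear predictor, which is the sense in which the two transformations are inverses of each other.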

Maximum Likelihood Estimation (MLE) for Logistic Regression

We here estimate the P+1 unknown parameters β in Equation 5, with MLE, by finding the set of parameters for which the probability of the observed data is greatest. Since each yi represents a binomial count in the ith population, the joint probability density function of Y is:

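$$f(y \mid \beta) = \prod_{i=1}^{N} \frac{n_i!}{y_i!\,(n_i - y_i)!}\, \pi_i^{y_i} (1-\pi_i)^{n_i - y_i} \tag{7}$$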

where β enters through πi as given in Equation 3. For each population, there are $\binom{n_i}{y_i}$ different ways to arrange yi successes from among ni trials. Since the probability of a success for any one of the ni trials is πi, the probability of yi successes is $\pi_i^{y_i}$. Likewise, the probability of ni-yi failures is $(1-\pi_i)^{n_i-y_i}$. The joint probability density function in Equation 7 expresses the values of Y as a function of known, fixed values for β. The likelihood function has the same form as the probability density function, except that the roles of the parameters are reversed: the likelihood function expresses the values of β in terms of known, fixed values for Y. Thus,

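$$L(\beta \mid y) = \prod_{i=1}^{N} \frac{n_i!}{y_i!\,(n_i - y_i)!}\, \pi_i^{y_i} (1-\pi_i)^{n_i - y_i} \tag{8}$$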

The maximum likelihood estimates are the values for β that maximize the likelihood function in Equation 8. Thus, finding the maximum likelihood estimates requires computing the first and second derivatives of the likelihood function. Since the factorial terms do not contain any of the πi, they are essentially constants that can be ignored. Therefore, maximizing the equation without the factorial terms will come to the same result, as if they were included. By rearranging the terms, the equation to be maximized which is the conditional likelihood can be written as:

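$$\prod_{i=1}^{N} \pi_i^{y_i} (1-\pi_i)^{n_i - y_i}$$

or

$$\prod_{i=1}^{N} \left(\frac{\pi_i}{1-\pi_i}\right)^{y_i} (1-\pi_i)^{n_i} \tag{9}$$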

Recall from Equation 6 that

$$\frac{\pi_i}{1-\pi_i} = e^{\sum_{k=0}^{P} x_{ik}\beta_k} \tag{10}$$

Which, after solving for πi (the same thing as the result of Equation 3) becomes,

$$\pi_i = \frac{e^{\sum_{k=0}^{P} x_{ik}\beta_k}}{1 + e^{\sum_{k=0}^{P} x_{ik}\beta_k}} \tag{11}$$

Substituting Equation 10 for the first term and Equation 11 for the second term, Equation 9 becomes:

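$$\prod_{i=1}^{N} \left(e^{\sum_{k=0}^{P} x_{ik}\beta_k}\right)^{y_i} \left(1 - \frac{e^{\sum_{k=0}^{P} x_{ik}\beta_k}}{1 + e^{\sum_{k=0}^{P} x_{ik}\beta_k}}\right)^{n_i} \tag{12}$$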

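Use $(e^{x})^{y} = e^{xy}$ to simplify the first product, and replace 1 with $\frac{1 + e^{\sum_{k=0}^{P} x_{ik}\beta_k}}{1 + e^{\sum_{k=0}^{P} x_{ik}\beta_k}}$ to simplify the second product. We have: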

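$$\prod_{i=1}^{N} e^{\,y_i \sum_{k=0}^{P} x_{ik}\beta_k} \left(1 + e^{\sum_{k=0}^{P} x_{ik}\beta_k}\right)^{-n_i} \tag{13}$$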

This is the kernel of the likelihood function to maximize. We here simplify further by taking its log. Since the logarithm is a monotonic function, any maximum of the likelihood function will also be a maximum of the log likelihood function, and vice versa. Thus, taking the natural log of Equation 13 yields the log likelihood function:

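$$l(\beta) = \sum_{i=1}^{N} \left[\, y_i \sum_{k=0}^{P} x_{ik}\beta_k - n_i \ln\left(1 + e^{\sum_{k=0}^{P} x_{ik}\beta_k}\right) \right] \tag{14}$$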
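As a numerical check on the algebra leading to the log likelihood, the sketch below evaluates it in two ways: from the simplified form (yi times the linear component, minus ni times ln(1 + e^linear component)) and directly from the logarithm of the product of πi^yi (1-πi)^(ni-yi). The grouped data and the trial value of β are hypothetical and serve only to show that the two expressions agree.

```python
import math

def linear_predictor(x_row, beta):
    """Return sum_k x_ik * beta_k for one row of the design matrix."""
    return sum(x * b for x, b in zip(x_row, beta))

def log_likelihood(beta, X, y, n):
    """Simplified form: sum_i [ y_i * eta_i - n_i * ln(1 + e^{eta_i}) ]."""
    total = 0.0
    for x_row, y_i, n_i in zip(X, y, n):
        eta = linear_predictor(x_row, beta)
        total += y_i * eta - n_i * math.log1p(math.exp(eta))
    return total

def log_kernel(beta, X, y, n):
    """Log of prod_i pi_i^{y_i} (1 - pi_i)^{n_i - y_i}, factorials dropped."""
    total = 0.0
    for x_row, y_i, n_i in zip(X, y, n):
        eta = linear_predictor(x_row, beta)
        pi = 1.0 / (1.0 + math.exp(-eta))
        total += y_i * math.log(pi) + (n_i - y_i) * math.log(1.0 - pi)
    return total

# Hypothetical grouped data: three populations with n_i trials each.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]  # first column is the intercept
y = [2, 5, 8]                              # successes per population
n = [10, 10, 10]                           # trials per population
beta = [-1.0, 0.9]                         # arbitrary trial value of beta

ll = log_likelihood(beta, X, y, n)
kernel = log_kernel(beta, X, y, n)
print(ll, kernel)
```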

To find the critical points of the log likelihood function, set the first derivative with respect to each β equal to zero. In differentiating Equation 14, note that

$$\frac{\partial}{\partial \beta_k} \sum_{k=0}^{P} x_{ik}\beta_k = x_{ik} \tag{15}$$

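since the other terms in the summation do not depend on βk and can thus be treated as constants. In differentiating the second half of Equation 14, take note of the general rule that $\frac{\partial}{\partial x}\ln y = \frac{1}{y}\frac{\partial y}{\partial x}$, as well as the fact that $\frac{\partial}{\partial \beta_k}\left(1 + e^{\sum_{k=0}^{P} x_{ik}\beta_k}\right) = e^{\sum_{k=0}^{P} x_{ik}\beta_k}\, x_{ik}$. But,

$$\frac{e^{\sum_{k=0}^{P} x_{ik}\beta_k}}{1 + e^{\sum_{k=0}^{P} x_{ik}\beta_k}} = \pi_i$$

so that $\frac{\partial}{\partial \beta_k}\, n_i \ln\left(1 + e^{\sum_{k=0}^{P} x_{ik}\beta_k}\right) = n_i \pi_i x_{ik}$. Thus, differentiating Equation 14 with respect to each βk,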

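$$\frac{\partial l(\beta)}{\partial \beta_k} = \sum_{i=1}^{N} \left( y_i x_{ik} - n_i \pi_i x_{ik} \right) \tag{16}$$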

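Therefore,

$$\frac{\partial l(\beta)}{\partial \beta_k} = \sum_{i=1}^{N} x_{ik}\left( y_i - n_i \pi_i \right)$$

So that the gradient of the log likelihood in matrix form is given as:

$$l'(\beta) = X^{T}(y - \mu) \tag{17}$$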

This gradient is a column vector of length P+1, whose elements are $\partial l(\beta)/\partial \beta_k$. Here X is the design matrix with elements $x_{ik}$, and μ is a column vector of length N with elements $\mu_i = n_i \pi_i$. The maximum likelihood estimates for β can be found by setting each of the P+1 equations in Equation 16 equal to zero and solving for each βk. Each such solution, if any exists, specifies a critical point (either a maximum or a minimum). The critical point will be a maximum if the matrix of second partial derivatives (the Hessian matrix) is negative definite, which in particular requires every element on the diagonal of the matrix to be less than zero [8]. The Hessian is formed by differentiating each of the P+1 equations in Equation 16 a second time with respect to each element of β, denoted by $\beta_{k'}$. The general form of the matrix of second partial derivatives is

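$$\frac{\partial^{2} l(\beta)}{\partial \beta_k\, \partial \beta_{k'}} = \frac{\partial}{\partial \beta_{k'}} \sum_{i=1}^{N} \left( y_i x_{ik} - n_i x_{ik} \pi_i \right) = -\sum_{i=1}^{N} n_i x_{ik} \frac{\partial \pi_i}{\partial \beta_{k'}} \tag{18}$$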

The Hessian in a matrix form is given as

$$l''(\beta) = -X^{T} W X \tag{19}$$

Where W is a square matrix of order N, with elements niπi(1-πi) on the diagonal, and zeros everywhere else. To solve Equation 18, we will make use of two general rules for differentiation. First, a rule for differentiating exponential functions:

$$\frac{d}{dx} e^{u(x)} = e^{u(x)} \frac{d}{dx} u(x) \tag{20}$$

In our case, let $u(x) = \sum_{k=0}^{P} x_{ik}\beta_k$. Second, the quotient rule for differentiating the quotient of two functions:

$$\left(\frac{f}{g}\right)'(a) = \frac{g(a)\, f'(a) - f(a)\, g'(a)}{[g(a)]^{2}} \tag{21}$$

Applying these two rules together allows us to solve Equation 18.

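$$\frac{d}{dx}\, \frac{e^{u(x)}}{1 + e^{u(x)}} = \frac{\left(1 + e^{u(x)}\right) e^{u(x)} \frac{d}{dx}u(x) - e^{u(x)}\, e^{u(x)} \frac{d}{dx}u(x)}{\left(1 + e^{u(x)}\right)^{2}} = \frac{e^{u(x)}}{1 + e^{u(x)}} \cdot \frac{1}{1 + e^{u(x)}} \cdot \frac{d}{dx}u(x) \tag{22}$$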

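Since $\pi_i = \frac{e^{\sum_{k=0}^{P} x_{ik}\beta_k}}{1 + e^{\sum_{k=0}^{P} x_{ik}\beta_k}}$ while $1 - \pi_i = \frac{1}{1 + e^{\sum_{k=0}^{P} x_{ik}\beta_k}}$, both as defined earlier, Equation 18 can now be written as: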

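$$\frac{\partial^{2} l(\beta)}{\partial \beta_k\, \partial \beta_{k'}} = -\sum_{i=1}^{N} n_i x_{ik}\, \pi_i (1 - \pi_i)\, x_{ik'} \tag{23}$$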
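The gradient and Hessian derived above can be verified against finite differences of the log likelihood. The sketch below is an illustration under stated assumptions, not part of the original analysis: the data, the trial value of β and all function names are hypothetical, and the gradient is coded as the sum over i of x_ik(yi - ni πi) and the Hessian as minus the sum over i of ni x_ik πi(1-πi) x_ik'.

```python
import math

def fitted_probs(beta, X):
    """pi_i = 1 / (1 + e^{-eta_i}) for every row of the design matrix."""
    return [1.0 / (1.0 + math.exp(-sum(x * b for x, b in zip(row, beta))))
            for row in X]

def log_likelihood(beta, X, y, n):
    """sum_i [ y_i * eta_i - n_i * ln(1 + e^{eta_i}) ]."""
    total = 0.0
    for row, y_i, n_i in zip(X, y, n):
        eta = sum(x * b for x, b in zip(row, beta))
        total += y_i * eta - n_i * math.log1p(math.exp(eta))
    return total

def gradient(beta, X, y, n):
    """Analytic gradient: element k is sum_i x_ik * (y_i - n_i * pi_i)."""
    pi = fitted_probs(beta, X)
    return [sum(row[k] * (y_i - n_i * p)
                for row, y_i, n_i, p in zip(X, y, n, pi))
            for k in range(len(beta))]

def hessian(beta, X, n):
    """Analytic Hessian: element (k, j) is -sum_i n_i x_ik pi_i (1 - pi_i) x_ij."""
    pi = fitted_probs(beta, X)
    size = len(beta)
    return [[-sum(n_i * row[k] * p * (1.0 - p) * row[j]
                  for row, n_i, p in zip(X, n, pi))
             for j in range(size)] for k in range(size)]

def numeric_gradient(beta, X, y, n, h=1e-5):
    """Central finite differences of the log likelihood."""
    grad = []
    for k in range(len(beta)):
        up = list(beta)
        down = list(beta)
        up[k] += h
        down[k] -= h
        grad.append((log_likelihood(up, X, y, n)
                     - log_likelihood(down, X, y, n)) / (2.0 * h))
    return grad

# Hypothetical grouped data and an arbitrary trial value of beta.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
y = [2, 5, 8]
n = [10, 10, 10]
beta = [-1.0, 0.9]

analytic = gradient(beta, X, y, n)
numeric = numeric_gradient(beta, X, y, n)
H = hessian(beta, X, n)
print(analytic, numeric, H)
```

In this example the diagonal of `H` is negative, as negative definiteness requires; a negative diagonal alone would not be enough to establish it.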

Newton-Raphson Iteration Procedure

To find the roots of Equation 16 using the Newton-Raphson method, we generalize the method to a system of P+1 equations. Letting β(0) (also written βold) represent the vector of initial approximations for each βk, the first step of the Newton-Raphson (NR) algorithm in matrix notation gives:

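$$\beta^{(1)} = \beta^{(0)} + \left[-l''\!\left(\beta^{(0)}\right)\right]^{-1} l'\!\left(\beta^{(0)}\right) \tag{24}$$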

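Substituting the values of l′(β) and l′′(β) above simplifies the equation to a matrix form, given as

$$\beta^{(1)} = \left(X^{T} W X\right)^{-1} X^{T} W z \tag{25}$$

where $z = X\beta^{(0)} + W^{-1}(y - \mu)$ is a vector of length N and W is the diagonal weight matrix with entries $n_i \pi_i (1 - \pi_i)$, which reduce to $\pi_i (1 - \pi_i)$ when $n_i = 1$.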

The last equation is a weighted least squares regression, which finds the best least-squares solution at the current weights. It is called recursive weighted least squares because at each step the weight matrix W keeps changing (since the β's are changing). Now, Equation 25 can be written:

$$\beta^{(1)} = \beta^{(0)} + \left[X^{T} W X\right]^{-1} X^{T} (y - \mu) \tag{26}$$

Continue applying Equation 26 until there is essentially no change between the elements of β from one iteration to the next. At that point, the maximum likelihood estimates are said to have converged, and the inverse of the negative of the Hessian in Equation 19, $[X^{T} W X]^{-1}$, holds the estimated variance-covariance matrix of the estimates. Because the estimation algorithm for the parameters of the logistic regression model is iterative, parameter estimates based on small samples may fail to converge, or may converge to local rather than global stationary points. This motivated the use of a large sample in this study. The iterative procedure is handled by SAS software in this work.
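The iteration described above can be sketched in a few lines of standard-library Python. This is an illustrative sketch, not the SAS procedure used in the study: the function names, the starting value β = 0, the tolerance and the two-group data set are all assumptions made for the example. The data are chosen so that the answer is known in closed form: with 20 of 100 events in the unexposed group and 50 of 100 in the exposed group, the fitted slope should equal ln 4.

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    size = len(b)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, size):
            factor = M[r][col] / M[col][col]
            for c in range(col, size + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * size
    for r in range(size - 1, -1, -1):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, size))
        x[r] = (M[r][size] - tail) / M[r][r]
    return x

def score_and_information(beta, X, y, n):
    """Return X^T (y - mu) and X^T W X evaluated at the current beta."""
    size = len(beta)
    score = [0.0] * size
    info = [[0.0] * size for _ in range(size)]
    for row, y_i, n_i in zip(X, y, n):
        eta = sum(x * b for x, b in zip(row, beta))
        pi = 1.0 / (1.0 + math.exp(-eta))
        w = n_i * pi * (1.0 - pi)  # diagonal entry of W
        for k in range(size):
            score[k] += row[k] * (y_i - n_i * pi)
            for j in range(size):
                info[k][j] += row[k] * w * row[j]
    return score, info

def newton_raphson(X, y, n, tol=1e-10, max_iter=25):
    """Iterate beta <- beta + (X^T W X)^{-1} X^T (y - mu) until the step is tiny."""
    beta = [0.0] * len(X[0])
    for iteration in range(1, max_iter + 1):
        score, info = score_and_information(beta, X, y, n)
        step = solve(info, score)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            return beta, iteration
    raise RuntimeError("Newton-Raphson did not converge")

def covariance(beta, X, y, n):
    """Estimated variance-covariance matrix (X^T W X)^{-1} at beta."""
    _, info = score_and_information(beta, X, y, n)
    size = len(beta)
    cols = [solve(info, [1.0 if r == c else 0.0 for r in range(size)])
            for c in range(size)]
    return [[cols[c][r] for c in range(size)] for r in range(size)]

# Hypothetical two-group data: intercept plus one binary risk factor.
X = [[1.0, 0.0], [1.0, 1.0]]
y = [20, 50]
n = [100, 100]

beta_hat, iterations = newton_raphson(X, y, n)
cov = covariance(beta_hat, X, y, n)
print(beta_hat, iterations, math.exp(beta_hat[1]), cov)
```

The stopping rule used here (largest absolute change in any element of β below a tolerance) is one common choice; statistical packages may use other criteria.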

Illustrative Example

To estimate the effects of risk factors on GDM, 1000 subjects (pregnant women at risk for GDM) were sampled through a retrospective study from five randomly selected hospitals in Ebonyi State, covering January 2010 to December 2011. Of these, 490 (49%) were at less than 28 weeks of gestational age and 510 (51%) were at 28 weeks or more. In the total sample, 530 (53%) were gestational diabetic and 470 (47%) were non-gestational diabetic. Since GDM is a dichotomous variable, it is coded as 0 or 1. The independent factors considered in this work are age, category of pregnant women, obesity, income group (INCM), life-style and exercise, family history (F.H) of diabetes, hypertension, and diet habit (D.H); these are also categorical and are coded between 0 and 3. The codes are presented in Table 1.

 Table 1: Code sheet of concerned independent variables.

Results of Analysis

The results are shown in Tables 2 and 3.

 Table 2: Chi-square analysis of covariates showing significance, after comparison with p and phi-value for the sample.

 Table 3: Results of fitting the Multiple Logistic Regression Model, including O.R and 95% C.I, by using stepwise logistic procedure for the sample.

Table 3 shows that three risk factors, obesity, F.H and exercise, were significant, since the p-value for each of these variables was less than 0.05. Since the hospitals where these data were collected are mainly located in urban areas, the results suggest that an urban lifestyle (high-calorie food, remote-controlled equipment, less physical activity and less exercise) contributes to the incidence of obesity in the sample analyzed. Genetic and environmental factors are also causes of obesity. The reference group for obesity was non-obese persons. The O.R for obesity was 3.017, which means that an obese person has 3.017 times the odds of GDM of a non-obese person, keeping all other factors constant. As the O.R for obesity was greater than 1 and the 95% confidence interval for obesity did not include 1, obesity has a positive and statistically significant association with GDM. The reference group for F.H was persons with no F.H. The O.R for F.H was 2.489, which means that a pregnant woman in Ebonyi State with a positive F.H has 2.489 times the odds of GDM of a pregnant woman with no F.H of GDM. Therefore, F.H was significantly different from the reference group and was positively associated with GDM. The reference group for exercise was a sedentary life-style. The O.R for exercise was 0.519; as a general rule, an O.R less than 1 with a significant chi-square indicates that the exposure is protective against the outcome. The 95% confidence interval for exercise also did not include 1, so the O.R for exercise was significantly different from the reference group: women who take light exercise have odds of GDM about 48.1% lower (1 - 0.519 = 0.481) than women with a sedentary life-style.
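The arithmetic behind these interpretations can be sketched as follows. An odds ratio is the exponential of the fitted coefficient, and a Wald 95% confidence interval is obtained by exponentiating the coefficient plus or minus 1.96 standard errors. The three odds ratios below are the ones quoted in the text; the standard error is a hypothetical value used only to show the calculation, since Table 3 is not reproduced here, and the function name is likewise an assumption.

```python
import math

def odds_ratio_with_ci(beta, se, z=1.96):
    """Return (OR, lower, upper) using the Wald interval exp(beta -/+ z * se)."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)

# Odds ratios quoted in the text for the three significant factors.
reported_or = {"obesity": 3.017, "family_history": 2.489, "exercise": 0.519}

# Coefficients implied by those odds ratios: beta = ln(OR).
implied_beta = {name: math.log(value) for name, value in reported_or.items()}

# HYPOTHETICAL standard error, for illustration only; not taken from the paper.
assumed_se = 0.20
or_obesity, lower, upper = odds_ratio_with_ci(implied_beta["obesity"], assumed_se)

# An odds ratio below 1 read as a relative reduction in the odds.
odds_reduction_exercise = 1.0 - reported_or["exercise"]

print(implied_beta)
print(or_obesity, lower, upper)
print(odds_reduction_exercise)
```

An interval that excludes 1, as in this hypothetical case, corresponds to a coefficient that differs significantly from zero at the 5% level.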
In light of the above analysis of the 1000 sampled pregnant women, three risk factors (obesity, F.H and exercise) were significant, so for these variables the empirical findings confirm the concept and theory of risk factors. Clinicians and public health personnel should therefore take appropriate measures to control these risk factors, and prevention programs against GDM should be started. For the remaining five risk factors (age, category of women, income, hypertension and D.H), the empirical findings do not confirm the concept and theories of risk factors, although according to the literature these five variables are also risk factors for diabetes in different regions of the world.

Multivariate Version with Interaction Terms

All interaction terms were calculated separately and tested for significance at the 5% level (Table 4).

 Table 4: Results of significant main effects and interaction terms of sample.

In the sample analysis, the main effects category of women, age, obesity and F.H were significant risk factors. In addition, age showed a significant effect when interacted with category of women (P=0.005), exercise (P<0.001) and D.H (P=0.016). Similarly, obesity was significant when interacted with INCM (P=0.008) and with D.H (reported as P=0.01 and as P=0.012), and D.H (P=0.01) had a significant effect when interacted with INCM. The odds ratios for category of women (0.365) and for age (0.286) indicate that women at less than 28 weeks of gestational age and pregnant women less than 30 years of age were protected against this disease. The odds ratios for obesity (O.R=6.582, P<0.001) and F.H of GDM (O.R=2.679, P<0.001) indicate that obese pregnant women have 6.582 times the odds of the disease of non-obese pregnant women, while pregnant women with GDM in their family have 2.679 times the odds of developing the disease of pregnant women with no F.H of GDM. Exercise was an insignificant factor on its own, but became significant when interacted with age (P<0.001).

Obesity has O.R=6.582 (P<0.001) in the main effects, but when it was interacted with D.H the O.R decreases to 2.223 (P=0.01), which suggests that obesity can be reduced by a balanced or proper diet. Some of these interaction terms were very important, while others were not statistically significant or had no biological interpretation. For example, in the main effects model age and category of women showed insignificant effects, but their interaction showed a significant effect with an odds ratio greater than 1. Similarly, the interaction of INCM and obesity gave a misleading interpretation, with O.R=0.592.
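For readers less familiar with product terms, the sketch below shows the standard way a two-factor interaction enters a logit model and how the odds ratio for one factor then depends on the level of the other. It assumes 0/1 coding for both factors, and every coefficient is hypothetical; none is taken from Table 4, whose coding is not reproduced here.

```python
import math

def log_odds(obese, diet, coef):
    """Logit with main effects and an obesity-by-diet product term."""
    return (coef["const"] + coef["obese"] * obese + coef["diet"] * diet
            + coef["obese_x_diet"] * obese * diet)

def odds_ratio_obesity(diet, coef):
    """Odds ratio for obese versus non-obese at a fixed level of diet."""
    return math.exp(log_odds(1, diet, coef) - log_odds(0, diet, coef))

# HYPOTHETICAL coefficients, chosen only to illustrate the calculation.
coef = {"const": -0.5, "obese": 1.2, "diet": 0.3, "obese_x_diet": -0.4}

or_at_diet0 = odds_ratio_obesity(0, coef)  # equals exp(coef["obese"])
or_at_diet1 = odds_ratio_obesity(1, coef)  # equals exp(obese + obese_x_diet)
ratio_of_ors = or_at_diet1 / or_at_diet0   # equals exp(coef["obese_x_diet"])

print(or_at_diet0, or_at_diet1, ratio_of_ors)
```

Under this coding, the exponential of the interaction coefficient is a ratio of two odds ratios rather than an odds ratio for either factor alone, which is worth keeping in mind when reading interaction estimates.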

Logit Model for Overall Sample with and without Interaction Terms

The model without interaction terms:

The model with interaction terms for the sample is given below

Summary of Conclusions

We here summarize and conclude as follows:

1. In this hospital-based study, the proportion of pregnant women with GDM is greater than the proportion without GDM, and pregnant women at 28 weeks of gestational age or more are more liable to diabetes than those at less than 28 weeks. Among the pregnant women entering the hospitals for GDM screening, those older than thirty years were three times as many as those younger than thirty years, from which we conclude that GDM is more common in women above thirty years and that the prevalence of GDM clearly increases with advancing age. Similarly, obese pregnant women were 1.4 times as many as non-obese pregnant women, while pregnant women with a family history of GDM were approximately equal in number to those without F.H of GDM in this sample. It is also concluded from the epidemiological study that educated pregnant women are aware of GDM and are more careful than uneducated pregnant women.

2. In the sample analysis, the risk factors obesity and F.H were positively associated with GDM, and exercise was protective against this disease.

Exercise is protective against this disease; that is, pregnant women who exercise and lead a simple life-style are at lower risk of GDM and other diseases than pregnant women who lead a sedentary life-style.

Recommendations

References