Robust Logistic and Probit Methods for Binary and Multinomial Regression

In this paper we introduce new robust estimators for the logistic and probit regressions for binary, multinomial, nominal and ordinal data and apply these models to estimate the parameters when outliers or influential observations are present. Maximum likelihood estimates don’t behave well when outliers or influential observations are present. One remedy is to remove influential observations from the data and then apply the maximum likelihood technique on the deleted data. Another approach is to employ a robust technique that can handle outliers and influential observations without removing any observations from the data sets. The robustness of the method is tested using real and simulated data sets. Journal of Biometrics & Biostatistics J o u rn al of Bio metrics & Bistatis t i c s


Introduction
Binary and multinomial regressions are commonly used by medical scientists and researchers for analysis of binary or polytomous outcomes. These methods are routinely used as diagnostic tools in all areas of medicine including oncology and cardiology. Zhou et al. [1] used logistic regression to relate the gene expression with class labels. They also used logistic regression for their microarray-based analysis of cancer classification and prediction. Sator et al. [2] applied a logistic regression model to identify enriched biological groups in gene expression microarray studies. Majid et al. [3] performed logistic regression analysis to predict endoscopic lesions in iron deficiency anemia when there are no gastrointestinal symptoms.
Morris et al. [4] applied multinomial regression technique to analyze the sub-phenotypes by allowing for heterogeneity of genetic effects. Richman et al. [5] investigated the association between European ancestry and renal disease when compared with African Americans, East Asians, and Hispanics. They concluded that European ancestry is protective against the development of renal disease in systematic lupus erythematosus. Their data had some outliers but they were excluded in their final analysis. Timmerman et al. [6] used the logistic regression to distinguish between benign and malignant adnexal mass before surgery. Merritt et al. [7] used the binary and multinomial logistic regressions to investigate the role of dairy food intake and risk of ovarian cancer. The validity of estimation and testing procedures used in the analysis of binary data are heavily dependent on whether or not the model assumptions are satisfied. The maximum likelihood method of estimating binary regression parameters using logistic, probit and many other methods is extremely sensitive to outliers and influential observations.
There is a large literature on the robustness issue of the binary regression. Most of the existing methods attempt to achieve robustness by down weighting observations which are far from the majority of the data, that is, outliers. The reader is referred to papers published by Pregibon [8], Carroll and Pederson [9], and Bianco and Yohai [10]. Bianco and Martinez [11] modified the original score functions of the logistic regression to obtain bounded sensitivity, which is a concept introduced by Morgenthaler [12] using the L 1 -norm instead of the L 2 -norm in the likelihood, resulting in a weighted score function of the original score function. Cantoni and Ronchetti [13] focused on robustness of inference rather than the model. Pregibon [8] suggested resistant fitting methods which taper the standard likelihood to reduce the influence of extreme observations. Kordzakhia et al. [14] introduced a robust logistic regression by minimizing the mean-squared deviance for the worst case contamination. Bergesio and Yohai [15] introduced projection estimators for generalized linear model. These estimators have the same asymptotic normal distribution as the M-estimators. Hobza et al. [16] introduced a median estimator to estimate the parameters of the logistic regression.
Robust binary and multinomial regression estimators for analysis of biomedical data are proposed. This robust method has a bounded influence and high breakdown point and efficiency under normal distribution and is able to estimate the parameters of logistic and probit regression models. The proposed model is computationally simple and can easily be used by researchers.
There are various estimation methods for the estimation of the parameter vector β. The most commonly used method is the logistic regression which is used to analyze the effects of explanatory variables on the binary response y. In the logistic regression the link function π L (x i ;β) is assumed to have the following functional form The logistic transformation of π L (x i ;β) is called the logit function and is given by The probit function is a link function of the form Where Erf function is defined as The tabaistic model introduced by Tabatabai and Argyros [17] has the link function: The tabaistic transformation function is called the tabit function and is defined as  Figure 1 shows the graph of π L , π CLL , π P and π T where of π L , π CCL , π P and π T take values between zero and one. The solid curve is the graph of π L function, the dotted curve is the graph of π P function, the dot-dashed curve is the graph of π CLL function, and the dashed curve is the graph of π T function. Figure 2 shows the graph of logit (π L ). cllogit (π CLL ), probit(π P ), and tabit(π T ). The solid curve, the dotted curve, the dotdashed curve and the dashed curve are the graph of logit (π L ) function, the graph of probit(π P ) function, the graph of cllogit (π CLL ) function, and the graph of tabit(π T ) function, respectively. The principle of maximum likelihood is ordinarily used to estimate the model parameters by Although the maximum likelihood estimator is asymptotically efficient, it is not recommended as a method of choice when outliers are present. The alternative techniques are robust statistical methods.
Tabatabai et al. [18] defined the one parameter family of differentiable functions ρ ω (x) of the form ρ ω (x)=1−Sech(ωx), where the positive real number ω is called the tuning constant.
The bounded function ρ ω : R→ R is a differentiable function satisfying the following properties: Under the normality assumption for the error term ε i , the asymptotic efficiency (Aeff) is defined as Where ψ ω is the derivative of ρ ω and is equal to Where Sech and Tanh represent the hyperbolic secant and hyperbolic tangent, respectively.
The tuning constant ω can be calculated by solving the following equation (1) for ω. The numerical values for ω at the efficiency levels 0.80, 0.85, 0.90, and 0.95 are approximately 0.721, 0.628, 0.525 and 0.405, respectively. Although the choice for tuning constant ω is left for the investigator to decide, we do recommend an efficiency of approximately 90 percent which corresponds to ω=1/2. We now consider the hat matrix of the form where X is the design matrix defined as If the model has intercept, then the column vector X 0 has the form  Figure 2: Graphs of logit (π L ). cllogit (π CLL ), probit (π P ), and tabit (π T ) functions. and for i=1,2,…,n, define The estimator and h ii is the ith diagonal element in the hat matrix. For the logistic model, we have So that the above integral (2) can be written as Where H(k 1 ,k 2 ,k 3 ,t) is the Gauss hypergeometric function 2F1 with parameters k 1 ,k 2 and k 3 . If ω=1, then we have and if ω=1/2, then we have Define the Hessian matrix H b for binary data as Then an estimate of the variance-covariance matrix for vectorβ is with an estimated variance σ 2 given by The null distribution of the statistic 2 n W is asymptotically a chisquare distribution with q degrees of freedom.

Robust Multinomial Logistic Regression Model
In this section we generalize the robust binary method to multinomial regression where the response y includes k categories. When k=2, this model reduces to the binary regression. Now, consider the response matrix Y given by 11 1 This means that y ij =1whenever the ith response is in category j.
The multinomial likelihood function of parameter vector β is defined by For the generalized logit when the first category is the designated reference category and the intercepts are β 01 ,β 02 ,…,β 0k-1 , we get The principle of maximum likelihood can be used to estimate model parameters. The maximum likelihood estimate of the parameters vector β is Then, for the multinomial response, an estimate of the variance- with an estimated variance 2 σ given by For the cumulative logit model for ordinal k-category response, the cumulative probability for the ith response belongs to the response category less than or equal j is and for j=1,…,k−1 the ordinal logit O j (x j ;β) is the log-odds of falling into or below category j against falling above it and is given by . The ordered logit model is sometimes called the proportional odds model.
Then we have and for the ordinal probit we have The robust estimate of ordinal multinomial parameter vector is giveb Vasoconstriction and vasodilation are two important physiological mechanisms used to control the circulation of blood throughout the body. These mechanisms directly affect both the blood pressure and the distribution of the blood in the body. Vasodilation refers to the expansion of blood vessels through relaxation of smooth muscles in the vessel walls. This allows increased flow of blood through these vessels and also decreases the blood pressure. Contraction of the same muscles tightens the blood vessels, which decreases blood flow and increases pressure. Thus, vasodilation and vasoconstriction work in opposition to adjust both blood flow and blood pressure. The usual controls for vasoconstriction and vasodilation are done by smooth muscles and autonomic nervous system, triggered by the medulla. These responses can also be affected by drugs promoting either constriction or dilation. Furthermore, there is a means of control by circulating hormones in the bloodstream, as well as control by intrinsic mechanisms to vasculature, called the myogenic response. The antagonistic operation of vasoconstriction and vasodilation is used by the body for numerous purposes. Primary among these is regulation of the supply of oxygen and nutrients to the cells of the body, to meet their needs. Furthermore, this regulation of blood flow is also needed for thermoregulation within the body. At times of increased metabolic needs or needs for oxygen in certain organs or systems in the body, the blood flow to these regions will also be modulated. Finally, vasoconstriction is also important in restricting blood flow to regions of the body in cases of traumatic injury.
The data set was analyzed originally by Finney [19]. It consists of 39 observations where the binary response variable y=1 or y=0 represent the presence or absence of vasoconstriction of the skin respectively. This experimental data set considers the effect of inhalation of air in a single deep breath on the presence or absence of vasoconstriction in the digits. Presence or absence of vasoconstriction is considered as a categorical variable, and the study considers the effect of two variables, the volume of inhaled air and the rate of inhalation. (Data has hidden outliers so that the robust logistic regression will be useful in its analysis.) In the remainder of this work, we denote by ML and BY, the Maximum Likelihood and Bianco-Yohai methods, respectively. By examining Table 1 we conclude that the new robust estimator has produced the closest parameter estimates to the maximum likelihood estimates when the outliers were removed. In addition, the

Plasma example
The erythrocyte sedimentation rate (ESR) is very important. It is a common hematology test which is simple and inexpensive but can be used to detect infection or acute phase response, which can alert physicians to a wide variety of conditions. The test is very versatile and can assist physicians in detecting conditions from rheumatoid arthritis to systemic lupus erythematosus to multiple myeloma; however it is non-specific and is usually combined with other tests. In practice, ESR is used widely to test for a range of conditions, including inflammation, trauma, and malignant disease. Studies have also suggested the utility of the ESR among the elderly as a general indicator of level of sickness or disease. Recently ESR has also attracted attention for a potential role as a predictor for the development of cardio-vascular disease and heart failure.
The ESR simply measures the rate at which red blood cells precipitate during a period of one hour. Anticoagulated blood is placed in an upright tube, and the rate at which the erythrocytes settle is measured in mm per hour. Although the test is a direct measurement of rate of sedimentation, the balance between factors stimulating sedimentation and factors resisting sedimentation allows for a number of clinically relevant factors to influence this rate. Fibrinogen is the most important factor promoting sedimentation, and the high level of fibrinogen in the blood during the inflammatory process makes this test sensitive to inflammation. High levels of fibrinogen in the blood decrease the repulsive forces experienced between the negatively charged erythrocytes and favor the formation of rouleaux. These stacks of erythrocytes that stick together will settle faster and lead to an increased ESR. Other acute phase reactants, or other large molecules, especially when positively charged, can have a similar effect, although fibrinogen has been observed to have the largest effect.
A recent focus on the inflammatory nature of artherosclerosis has been accompanied by a recent study of increased levels of ESR and elevated risk of coronary heart disease. Erikssen et al. [20] observed that elevated ESR is a strong predictor of mortality from heart failure, suggesting it may serve as a marker for aggressive forms of coronary heart disease. Andresdottir et al. [21] observed an increased risk of coronary heart disease among the top quintile or ESR rates, with a hazard ratio of 1.57 for men and 1.9 for women. The 2005 paper of Inglesson et al. [22] also observes a significant association between elevated ESR and heart failure, suggesting both that inflammation is involved in the processes leading to heart failure and that the ESR may be used in evaluating this process. In addition to the well-established uses of ESR, Saadeh [23] mentions some potential new applications of this test such as bacterial otitis media, acute hematogenous osteomyelitis, AIDS, pelvic inflammatory disease, prostate cancer, and early prediction of stroke severity.
Although the ESR usually detects acute phase response from fibrinogen in blood in conditions such as those mentioned above, in certain cases there are factors which decrease the rate of sedimentation. One important factor that can slow the rate of sedimentation is irregularity in the erythrocytes, either in shape or unusually small size. As a consequence, ESR can detect certain blood diseases (including sickle cell anemia and spherocytosis) which lead to a lower than normal rate of sedimentation, as observed in Bridgen [24]. Other conditions that may also lower ESR include the extreme levels of white blood cells as observed in chronic lymphocytic leukemia. Furthermore the surplus of erythrocytes found in patients with polycythemia makes rouleau formation difficult and decreases the ESR.
In clinical applications the erythrocyte sedimentation rate may in many cases be treated as a categorical variable, with a normal ESR for values less than some given α and an elevated ESR for values greater than α. When representing such a set of data where ESR depends on one or more variables the logistic regression may be used. For instance in the data set from Collett [25], the ESR is considered as a function of two variables, the level of fibrinogen and the level of γ-globulin. The data for 32 individuals represents the levels of fibrinogen and γ-globulin in the blood and whether the ESR level is healthy (< 20 mm/hr) or unhealthy (≥ 20 mm/hr), and the logistic regression is used to describe how both fibrinogen and γ-globulin affect the ESR variable. Since this data set contains (hidden/influential) outliers, both the probit method of regression and the logit method do not give accurate results. However we observed that our new methods for robust logistic regression do represent the data accurately. The logit, when all 32 observations are included in the study, is given by Again, examining Table 2 reveals that the new robust estimator has produced the closest parameter estimates to the maximum likelihood estimates when the outliers were removed as well as the lowest value for the 2 arc χ .

Mental health example
The following example involves the ordinal multinomial regression. The data comes from a mental health study for a random sample of adult  residents of Alachua County, Florida. This data was appeared in Agresti [26]. The mental impairment is divided into four categories (well, mild symptom formation, moderate symptom formation, and impaired). The explanatory variables are life events index X 1 and socioeconomic status X 2 , where X 2 is binary and takes high and low levels. There is no outlier in this data. We just want to show how the method works even when outliers are not present. The logit is given by ( ) Table 3 gives the parameter values using maximum likelihood method as well as the new robust ordinal multinomial method. The R program for this example is provided in the Appendix.

Simulation
To evaluate the performance of the new robust method for logistic regression we conduct a Monte Carlo simulation. In the first round of our simulation, we use one explanatory variable and in the next round, we increase the number of explanatory variables to two. We first generate an independent random sample of size 100 from the standard normal distribution with mean 0 and standard deviation equal to 0.5. We call the variable x. Then we generate a sample of error terms ε i of size 100 from the logistic distribution with mean zero and standard deviation equal to 1. The dependent variable y is generated using the  formula The model parameters are 1 (intercept) and 3 (coefficient for x). We then select a random sample (5%) from the generated sample x and contaminate the selected sample by multiplying each x value by a factor of 10. Then we repeat the above procedures 1000 times. Finally, we estimate both the bias and mean squared errors using the following equations For the two explanatory variables, we generate two independent normal random samples of size 100 from a normal distribution with mean 0 and standard deviation 0.5 and call them x 1 and x 2 respectively. Then we select 5% of this random sample (3% from x 1 and 2% from x 2 ) and multiply the selected samples by 10. Then we generate a sample of error terms ε i of size 100 from the logistic distribution with mean zero and standard deviation equal to 1. The dependent variable y is generated using the formula We calculate the parameter estimates and continue the iteration 1000 times. In addition, we calculate the bias and mean squared errors. Tables 4 and 5 show the results of simulations using ML, BY and the new robust method with one and two explanatory variables, respectively. For binary logistic regression the simulation results indicate that our new robust method is as good as the BY method. The BY method only covers binary logistic regression whereas our method not only covers binary but also covers multinomial regression for both nominal and ordinal responses.

Discussion and Conclusions
In this work we have proposed a new robust method to analyze binary and multinomial regression models. We believe that these new robust methods for binary and multinomial regressions have potential to play a key role in modeling categorical data in medical, biological and engineering sciences. We have shown the lack of robustness of the maximum likelihood technique when outliers are present. In both real examples and simulated ones and when the outliers are present, the new   robust method performed well. In conclusion the motivation was to introduce a new robust loss function of residuals which can attain high breakdown value. The method has high efficiency and high breakdown points with bounded influence function.