Received date: March 06, 2012; Accepted date: July 20, 2012; Published date: July 25, 2012
Citation: Zhao Y, Lee AH (2012) Revisiting the Concentration Curves and Indices as Useful Tools for Assessing Relative and Attributable Risks. J Biomet Biostat S7-019.doi: 10.4172/2155-6180.S7-019
Copyright: © 2012 Zhao Y, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Accurate assessment of the association between exposure and response is central to identifying causality in medical research. The concentration index has been commonly used to study income inequality and socioeconomic related health inequality. This study generalizes applications of the concentration index to measure the relative and attributable risks for describing exposure-response relationships in medical research. Based on cumulative distribution functions, a new measure of correlation is proposed to quantify the association between exposure and response. The
connection between the new and existing measures is discussed. The method enables the semi-parametric analysis of overall association and disparity by risk factors. Both grouped and continuous data situations are considered with two applications. The first example illustrates the relationships between the concentration index, relative and attributable risks. The second example demonstrates how the concentration index can assist in evaluating the association between the radiation dose and the incidence of leukaemia. Logistic regression based decomposition is compared with the new
approach. We found the concentration index analysis useful not only for examining socioeconomic determinants of health, but also for assessing quantitative relations between exposures to health risks and ill-health outcomes.
Dose-response relationship; Health status disparities; Odds ratio; Risk factors
Accurate quantitative assessment of the association between exposure and response is central to identifying causality in medical research [1,2]. Multiple rate ratios or odds ratios are commonly used for quantifying exposure-response associations. However, dichotomizing continuous exposure may produce biased estimates and result in a loss of statistical efficiency, while multiple inferences can lead to false positive results [4-6]. The Lorenz curve and Gini index (GI) provide an alternative for assessing the overall relationship between continuous exposure and response [7-9]. The approach utilizes an integrated quantitative and graphical framework to make more efficient use of information. In addition to performing univariate assessment of inequality for highly skewed variables (such as individual income) [10,11], the GI has been applied to examine continuous exposure-response relations [7,12,13]. On the other hand, when bivariate relations between the exposure and response are of interest, a more general form of GI, the concentration index (CI), is available. Although CI is well suited for measuring socioeconomic inequality in health [15-17], there is still potential for more general applications.
The present study generalizes the application of GI and CI in medical research and demonstrates their usefulness for summarizing rate ratios, odds ratios and attributable risks. A correlation measure is proposed to assess and summarize overall associations between risk factors and ill-health outcome. Two examples illustrate applications of the methodology in comparison with the regression based decomposition. Pros and cons of the approach are also considered in the medical research context. The variance of the rate ratio and the derivations for continuous data are given in the appendices.
Gini and concentration indices
We first review the GI, Lorenz curve, CI, and their role in assessing exposure and response using grouped data. Consider a p-level exposure $X_i$ and an ill-health response $Y_i$, where $Y_i = d_i/n_i$ denotes the response proportions sorted in ascending order ($Y_1 \le \dots \le Y_i \le \dots \le Y_p$); $d_i$ and $n_i$ represent respectively the number of ill-health events and the population size for group i, with $f_i = n_i/N$ being the frequency proportion of the total population N. Let $F_i = \sum_{j=1}^{i} f_j$ be the cumulative frequency proportion and $\mu = \sum_{i=1}^{p} f_i Y_i$ be the mean. Under this setting, the Lorenz curve is defined by $L_i = \frac{1}{\mu}\sum_{j=1}^{i} f_j Y_j$ plotted on the ordinate against $F_i$ along the abscissa [14,15]. It is the cumulative proportion of cases ($L_i$) compared to the cumulative proportion of the at-risk population ($F_i$), ordered by the level of risk. If $L_i = F_i$, the Lorenz curve coincides with the diagonal line, implying that Y is distributed in line with f so that the ill-health is evenly distributed. Otherwise, it lies beneath the diagonal line. The further the Lorenz curve deviates from the diagonal line, the greater is the degree of disparity. Let cov[Y,F] be the covariance between Y and F. The GI can be given by

$$GI = \frac{2}{\mu}\,\mathrm{cov}[Y, F],$$

which represents twice the area between the diagonal line and the Lorenz curve. If every group has exactly the same risk, GI = 0, representing perfect equality. If one group owns all the ill-health risks, GI = 1, representing perfect inequality. Typically, GI varies between 0 and 1, indicating the level of inequality in ill-health risks between groups. At the individual level, the total inequality is clearly $1 - \mu$, by ranking $d_i$ from 0 (alive) to 1 (dead) for the Lorenz curve, where $n_i = 1$ and $\mu$ is the mortality rate. For grouped data, the GI actually reflects the degree of inequality under the current groupings.
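The grouped-data Lorenz curve and GI described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function names `lorenz_points` and `gini_index` and the toy counts are hypothetical, and the GI is computed geometrically as one minus twice the trapezoidal area under the curve, which is one way to realize "twice the area between the diagonal line and the Lorenz curve".

```python
from typing import List, Sequence, Tuple


def lorenz_points(
    cases: Sequence[float], sizes: Sequence[float]
) -> Tuple[List[float], List[float]]:
    """Return cumulative population shares F and case shares L,
    with groups sorted by response proportion Y_i = d_i / n_i (ascending)."""
    order = sorted(range(len(cases)), key=lambda i: cases[i] / sizes[i])
    total_n, total_d = sum(sizes), sum(cases)
    F, L = [0.0], [0.0]
    cum_n = cum_d = 0.0
    for i in order:
        cum_n += sizes[i]
        cum_d += cases[i]
        F.append(cum_n / total_n)
        L.append(cum_d / total_d)
    return F, L


def gini_index(cases: Sequence[float], sizes: Sequence[float]) -> float:
    """One minus twice the trapezoidal area under the Lorenz curve."""
    F, L = lorenz_points(cases, sizes)
    area_under_curve = sum(
        (F[j] - F[j - 1]) * (L[j] + L[j - 1]) / 2 for j in range(1, len(F))
    )
    return 1.0 - 2.0 * area_under_curve


# Hypothetical toy data (not taken from the paper).
same_risk = gini_index(cases=[10, 20, 30], sizes=[100, 200, 300])  # every group at 10%
alive_vs_dead = gini_index(cases=[0, 50], sizes=[50, 50])          # mortality rate 0.5
print(same_risk, alive_vs_dead)
```

In the second call the two "groups" are the survivors and the deaths, so the result corresponds to the individual-level case in the text: with a mortality rate of 0.5 the index comes out at one minus that rate.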
Let $Y_{(k)}$ represent the $Y_i$ reordered by exposure level $X_{(k)}$, where $X_{(1)} \le \dots \le X_{(k)} \le \dots \le X_{(p)}$. We have $f_{(k)} = n_{(k)}/N$, where $n_{(k)}$ is the number of observations in group k. With $\mu$ denoting the mean of Y, the concentration curve is defined by plotting

$$L_{(k)} = \frac{1}{\mu}\sum_{j=1}^{k} f_{(j)} Y_{(j)}$$

against $F_{1(k)} = \sum_{j=1}^{k} f_{(j)}$, the cumulative frequency proportion under the ranking by X. The CI is then given by

$$CI = \frac{2}{\mu}\,\mathrm{cov}[Y, F_1],$$

which is twice the area between the equalitarian line and the concentration curve. Here the groups are ranked by X instead of Y. Unlike GI, CI can be either positive or negative. If the exposure is harmful, 0 < CI ≤ GI. If it is protective, -GI ≤ CI < 0. The standard errors of GI and CI can also be estimated. The absolute value of the ratio, |CI/GI|, indicates the inequality explainable by the exposure. In this context, the concentration curve, CI and the ratio between CI and GI are often used to analyze socioeconomic inequality of health [15-17]. Assuming a regression model with fitted values $\hat{Y}$ and residual $e = Y - \hat{Y}$, where Z represents the predictor(s) entering through a generalized linear link function, the GI and CI can be decomposed into a deterministic component and a residual component:

$$\frac{2}{\mu}\,\mathrm{cov}[Y, F^*] = \frac{2}{\mu}\,\mathrm{cov}[\hat{Y}, F^*] + \frac{2}{\mu}\,\mathrm{cov}[e, F^*],$$

where $F^*$ is either $F$ or $F_1$ for the GI or CI decomposition respectively. In this paper, we examine the situation Z = X.
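To make the ranking by exposure concrete, the following Python sketch computes a concentration index for grouped data and the |CI/GI| ratio. The function name `concentration_index`, its arguments and the toy counts are hypothetical, and the index is again obtained from the trapezoidal area under the curve rather than from a covariance formula, so it should be read as an illustration under those assumptions.

```python
from typing import Sequence


def concentration_index(
    cases: Sequence[float], sizes: Sequence[float], rank_by: Sequence[float]
) -> float:
    """Twice the area between the diagonal and the concentration curve,
    with groups ranked in ascending order of `rank_by`.
    Positive when the curve lies below the diagonal."""
    order = sorted(range(len(cases)), key=lambda k: rank_by[k])
    total_n, total_d = sum(sizes), sum(cases)
    area = 0.0
    prev_F = prev_L = 0.0
    cum_n = cum_d = 0.0
    for k in order:
        cum_n += sizes[k]
        cum_d += cases[k]
        F, L = cum_n / total_n, cum_d / total_d
        area += (F - prev_F) * (L + prev_L) / 2  # trapezoid under the curve
        prev_F, prev_L = F, L
    return 1.0 - 2.0 * area


# Hypothetical toy data: three equally sized exposure groups.
sizes = [100, 100, 100]
exposure = [1, 2, 3]
cases = [20, 10, 30]
rates = [d / n for d, n in zip(cases, sizes)]

ci = concentration_index(cases, sizes, rank_by=exposure)  # ranked by X
gi = concentration_index(cases, sizes, rank_by=rates)     # ranked by Y gives the GI
print(ci, gi, abs(ci / gi))
```

Ranking by the response itself recovers the GI, which is why the same helper serves for both indices; a strictly increasing exposure-response pattern gives CI = GI and a strictly decreasing one gives CI = -GI.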
It is known that the above CI/GI ratio can overestimate the contribution of the exposure responsible for the health inequality. A new correlation measure is therefore proposed for assessing the exposure-response relationship. The correlation between exposure and response can be examined by changes in the frequency proportions from being sorted by Y to being sorted by X. Let var[·] denote the variance of a quantity, and let $F$ and $F_1$ be the cumulative frequency proportions under the rankings by Y and by X respectively. We propose a correlation coefficient between $F$ and $F_1$,

$$\rho = \frac{\mathrm{cov}[F, F_1]}{\sqrt{\mathrm{var}[F]\,\mathrm{var}[F_1]}},$$

as an overall measure of association between the exposure and the ill-health response. Note that ρ assesses the correlation between exposure distribution and response distribution based on cumulative functions, while $F$ and $F_1$ are rearranged using $f_i = f_{(k)}$, and $\mathrm{cov}[F, F_1] = \mathrm{cov}[F_1, F]$. If 0 < ρ ≤ 1, the proportional fractions ranked by Y and those ranked by X are positively correlated. Conversely, -1 ≤ ρ < 0 implies a negative correlation. The coefficient of determination is then $\rho^2$, which yields the proportion of disparity in Y explained by X. Let ε and ε1 represent the residual terms for $F$ and $F_1$ predicted by the ranked Y and those ranked by X. GI and CI are algebraically related to ρ. Similar to the GI and CI, ρ can be decomposed into a model component and a residual component.
Some basic properties of the CI measures are studied further in terms of risk assessment in this section. The relative risk (RR) may be considered as a ratio of the excess risk estimated by a rate ratio, or a density ratio of incremental change in ill-health in response to the change in exposure [20,21]. RR is thus the slope of the tangent line of the concentration curve, evaluated at a point $Y_{(k)}$, viz,

$$RR_{(k)} = \frac{f_{(k)} Y_{(k)} / \mu}{f_{(k)}} = \frac{Y_{(k)}}{\mu},$$

that is, the increment in the cumulative proportion of cases divided by the increment in the cumulative proportion of the population, where $\mu$ is the mean of Y. On the concentration curve, it equals the magnitude of the risk in comparison with the expectation (i.e., the average risk). Clearly, this RR is slightly different from the usual case in epidemiology, which is based on the minimum level of exposure: $RR_{(k)m} = Y_{(k)}/Y_{(1)}$. The variance of $RR_{(k)m}$ can be derived by Taylor expansion (see the derivation in the first section of the Appendix). Note that RR is monotonically increasing for the Lorenz curve by definition, but this is not necessarily the case for the concentration curve. Application of RR is more meaningful in the context of the concentration curve, because it involves both exposure and response, whereas the Lorenz curve involves only the response.
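The two ways of basing the relative risk discussed here, on the average risk and on the minimum exposure level, can be compared with a short Python sketch. The function name `relative_risks` and the toy counts are hypothetical; the sketch assumes grouped counts and takes the group rate divided by the overall rate as the slope of the concentration curve on that group's segment.

```python
from typing import List, Sequence, Tuple


def relative_risks(
    cases: Sequence[float], sizes: Sequence[float], exposure: Sequence[float]
) -> Tuple[List[float], List[float]]:
    """Return (RR based on the mean risk, RR based on the lowest exposure group),
    both listed in ascending order of exposure.
    Assumes the lowest exposure group has a non-zero rate."""
    order = sorted(range(len(cases)), key=lambda k: exposure[k])
    mean_rate = sum(cases) / sum(sizes)
    rates = [cases[k] / sizes[k] for k in order]
    rr_mean = [y / mean_rate for y in rates]  # slope of the concentration curve
    rr_min = [y / rates[0] for y in rates]    # usual epidemiological baseline
    return rr_mean, rr_min


# Hypothetical toy data: risk rising with exposure.
rr_mean, rr_min = relative_risks(
    cases=[10, 20, 30], sizes=[100, 100, 100], exposure=[1, 2, 3]
)
print(rr_mean)  # values below 1 mark groups with less than the average risk
print(rr_min)   # the lowest exposure group is the reference, so it starts at 1
```

The two versions differ only by a constant factor, the ratio of the baseline rate to the mean rate, so they rank the exposure groups identically.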
In medical research, the RR is often approximated by the odds ratio (OR). Denoting the total number of ill-health events by , then we have
As classically defined, attributable risk (AR) is the percentage of the cumulative proportion of the total population developing a disease over a specified interval that is caused by an exposure. The AR also gives the proportion of ill-health events that can be avoided if the exposure is eliminated. When the exposure is continuous, the AR(k) can be viewed as the proportion of the incidence of ill-health that will be reduced if the exposure is reduced to X(k), rather than being totally eliminated [4,24]. In the case of grouped data, the AR(k) takes the form
which measures the health effect of a more relevant reduction in the risk rather than complete elimination of the risk. For example, a smoking cessation policy intends to reach a nominated target level X(k) rather than achieving an unrealistic zero smoking prevalence. The AR can take on negative values if it is used for studying protective factors and the concentration curve lies above the diagonal line. As noted by Llorca and Delgado-Rodriguez, when Y = 0, AR = 1 - RR. When the comparison standard is exchanged (the exposure of concern is changed from being harmful to being protective), AR = 1 - 1/RR [24,25]. The CI is in fact the weighted average of AR. The derivations of RR, AR and ρ for continuous data are given in the second section of the Appendix.
Colorectal polyps: To demonstrate the relationships between CI, RR and AR, let us consider the matched case-control study of the associations of vegetables, fruits, and grain intakes with colorectal polyps. The results of the analysis are summarized in the upper part of Table 1. The total individual level inequality for the matched case-control design is 0.5 and the GI is 0.073, showing that the case-control grouping reflects 15% (0.073/0.5) of the total inequality. According to the new coefficient of determination ρ2, about 34% of the disparity in polyp incidence is explained by the mean servings of fruits and vegetables (X). This result appears more plausible than the |CI/GI| ratio (82%) and the logistic model GI decomposition assessment (49%). The fitted logistic model is
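| Example | Index | Estimate | 95% Confidence Interval | Logistic model decomposition (Contribution) |
|---|---|---|---|---|
| Colorectal polyps | GI | 0.073 | 0.065 to 0.082 | 49% |
| Colorectal polyps | CI | -0.060 | -0.069 to -0.051 | 98% |
| Radiation-induced leukaemia | GI | 0.374 | 0.367 to 0.382 | 84% |
| Radiation-induced leukaemia | CI | 0.338 | 0.330 to 0.346 | 112% |

Colorectal polyps: N = 976; total individual level inequality = 0.5; |CI/GI| = 82%; ρ = -0.583; ρ2 = 34%.

Table 1: Summary of the examples.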
The CI or total AR is negative (-0.0603), indicating that the concentration curve is above the diagonal line and that fruit and vegetable intake is a protective factor. An increase in the fruit and vegetable intakes to the average (5 to 6 servings) could potentially decrease the number of colon polyps by 6%. A decrease in the fruit and vegetable intake to zero could potentially increase colon polyps by 53% (AR(1) = -0.5294, see Table 2). In other words, lower levels of fruit and vegetable intake are associated with an increased risk of polyps among the matched pairs. As shown in Table 2, the RR decreases with an increased level of mean servings per day. Detailed AR and RR estimates are listed in Table 2, in comparison with the logistic model estimates.
|Cumulative proportions||Concentration index
Gini index GI=0.0734; Correlation coefficient ρ=-0.583; Determination coefficient = 34%.
Table 2: Concentration Index and Relative Risk for the Fruit and Vegetable Intake and Colon Polyps.
Radiation-induced leukaemia: The second example is taken from an investigation of leukaemia among patients treated with X-ray for ankylosing spondylitis at 81 British radiotherapy centres between 1935 and 1954. The study aimed to determine the relationship between the doses of radiation given and the incidence of leukaemia. Details of radiation were recorded as the mean spinal-marrow dose (roentgens). The 38 leukaemia cases included definite, probable and presumptive diagnoses. The man-years at risk (61,902 in total) were used to estimate the incidence. We reanalyze the data using the proposed concentration curve approach. The GI and CI analyses are summarized in the lower part of Table 1.
The total individual level inequality for the study design is 0.999 and the GI is 0.374, showing that the study grouping reflects 37% (0.374/0.999) of the total leukaemia incidence inequality (see Table 3). Re-ranking Y by X (radiation dose), CI has the value 0.338. According to the new coefficient of determination ρ2, the radiation dose accounts for 67% of the leukaemia inequality. This result seems more plausible than the CI/GI ratio (90%) and the logistic model GI decomposition (84%). The fitted logistic model is
| Radiation dose X (1) | Incidence /1000 Y (2) | i (3) | k (4) | f (5) | Fi(y)^a (6) | Li(y)^a (7) | F(k)(x) (8) | L(k)(x) (9) | GI (10) | CI (11) | RRm (12) | ORm (13) | AR (14) | (15) |
GI: Gini Index; CI: Concentration Index; RRm: Rate Ratio rebased on minimum level of exposure; ORm: Odds Ratio rebased on minimum level of exposure; AR: Attributable Risk. ^a Cumulative sum with respect to i.
Table 3: Concentration Curve and Index for the Radiation–induced Leukaemia Data, United Kingdom, 1935-1954.
A clear gradient is observed in the RR estimates (columns 12 and 13 of Table 3). Specifically, radiation dose over 2,750 roentgens could increase the leukaemia risk by about 27 times above that at the minimum radiation level. From the AR calculation (column 14 of Table 3), about 60% of leukaemia incidence in the spondylitic patients could be avoided, if the radiation exposure is reduced to the minimum level. If the radiation exposure level is reduced to the average level, 33.8% of the leukaemia incidence could be avoided. The logistic modelled RRs are listed in column 15 of Table 3.
Figure 1 shows that the concentration curve almost coincides with the Lorenz curve and that the high radiation dose rankings explain the majority of the leukaemia incidence disparity. The correlation coefficient ρ is 0.819, which indicates a strong association between the leukaemia risk and radiation exposure; see Figure 2.
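The percentages quoted for the two examples can be retraced from the reported values with a few lines of arithmetic. The Python sketch below is only a consistency check: the helper name `share_of_total_inequality` is hypothetical, and it assumes that the individual level total inequality equals one minus the mean rate, which reproduces the 0.5 and 0.999 used in the text.

```python
def share_of_total_inequality(gini: float, mean_rate: float) -> float:
    """Grouped GI as a fraction of the individual level total (1 - mean rate)."""
    return gini / (1.0 - mean_rate)


# Colorectal polyps: matched case-control design, so half of the subjects are cases.
polyps_share = share_of_total_inequality(0.0734, 0.5)
# Leukaemia: 38 cases over 61,902 man-years at risk.
leukaemia_share = share_of_total_inequality(0.374, 38 / 61902)

# Coefficients of determination from the reported correlation coefficients.
polyps_r2 = (-0.583) ** 2
leukaemia_r2 = 0.819 ** 2

print(f"{polyps_share:.0%} {leukaemia_share:.0%} {polyps_r2:.0%} {leukaemia_r2:.0%}")
# prints: 15% 37% 34% 67%
```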
Note that when the exposure is used for the CI estimation and also used as the predictor for the logistic model based decomposition, the factor contributions for the exposure in the decomposed CI based on the logistic model are 98% and 112% respectively for the colorectal polyps and leukaemia example in Table 1. This indicates that the regression based CI decomposition can over-estimate the contribution of the exposure if the exposure variable is used as the underlying variable and explanatory variable simultaneously.
Applications of GI and CI for assessing risk factors can potentially provide more insightful information about the association between exposure and response in medical research [7,17]. The approach is appealing and straightforward: it summarizes RR, OR, AR, the correlation coefficient and the logistic regression model in a coherent manner. It can analyze three types of variables simultaneously: the exposure or underlying variable (X), the response or ill-health outcome (Y) and the predictors or determinants (Z). This feature is particularly attractive for analyzing the determinants of socioeconomic related health inequality. The method brings together the inequality and relative risk analysis in a unified framework and enables researchers to assess overall exposure-response association.

Our study has demonstrated that the concentration curves and indices are closely linked with RR, AR and regression analysis. The instrumental method provides another approach to investigate the structure of the exposure and response relationship. Different levels of exposure and response are modeled to allow a more detailed examination of the interplay between exposure and response in a graphical manner. The percentile based analysis is appropriate for skewed data and free of the underlying distribution. As demonstrated by the two examples, the method provides a powerful alternative for analyzing the cause-effect relationship for ordinal or continuous variables. It also allows further decomposition by multiple factors for identification of health determinants or adjustment of confounders using multivariate models [19,28]. A new variance estimate of the rate ratio was derived using Taylor expansion without data transformation. The new correlation and determination coefficients are based on a semi-parametric approach to estimate factor contributions. Comparing the contribution estimates, this new method appears more plausible and robust than the |CI/GI| ratio and regression based decomposition.
The logistic regression decomposition is a parametric model directly using the predictor information. If the model is chosen appropriately, the contribution estimates may be more accurate than the semi-parametric approach. There is empirical evidence that the exposure cannot be used simultaneously as the underlying variable for the CI and the predictor in the regression model, as previously recommended. Doing so may overestimate the contribution of the exposure variable due to double-counting.
Several limitations of the method should be noted. The concentration curve analysis is semi-parametric. The measure is relative rather than absolute. It does not use the exposure levels directly but the rankings instead. The same applies to RR, OR and AR. Use of spline regression to define knots of exposure categories might be helpful to address this shortcoming. This limitation can also be addressed by jointly using logistic regression based GI decomposition as demonstrated in the examples. The GI and CI estimation by grouped data may underestimate the true association, because they overlook the within-group variation. The coefficient of determination ρ2 is a conservative measure of the exposure-response association based on the cumulative distributions. The CI is best used in a monotonic exposure-disease relation. With a non-monotonic situation (e.g. quadratic function), the positive and negative contributions may cancel out in the aggregate CI, although the concentration curve will reflect the detailed positive and negative areas, and OR and RR still remain valid.
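The cancellation in a non-monotonic relation can be seen with a small numerical sketch. The Python below uses hypothetical counts for a U-shaped exposure-response pattern and a hypothetical helper, `rank_based_index`, which returns one minus twice the trapezoidal area under the curve for whichever ranking is supplied; it is an illustration under those assumptions only.

```python
from typing import Sequence


def rank_based_index(
    cases: Sequence[float], sizes: Sequence[float], rank_by: Sequence[float]
) -> float:
    """One minus twice the trapezoidal area under the cumulative case-share
    curve, with groups ordered by `rank_by` (ascending)."""
    order = sorted(range(len(cases)), key=lambda k: rank_by[k])
    total_n, total_d = sum(sizes), sum(cases)
    area = 0.0
    prev_F = prev_L = 0.0
    cum_n = cum_d = 0.0
    for k in order:
        cum_n += sizes[k]
        cum_d += cases[k]
        F, L = cum_n / total_n, cum_d / total_d
        area += (F - prev_F) * (L + prev_L) / 2
        prev_F, prev_L = F, L
    return 1.0 - 2.0 * area


# Hypothetical U-shaped pattern: high risk at low and high exposure.
sizes = [100, 100, 100]
exposure = [1, 2, 3]
cases = [30, 10, 30]
rates = [d / n for d, n in zip(cases, sizes)]

ci_u = rank_based_index(cases, sizes, rank_by=exposure)  # ranked by exposure
gi_u = rank_based_index(cases, sizes, rank_by=rates)     # ranked by response
print(ci_u, gi_u)
```

With these counts the index ranked by exposure is essentially zero while the index ranked by response is clearly positive, so the aggregate CI alone would hide a real disparity that the curve itself still shows.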
In conclusion, the concentration curve approach provides a simple and useful alternative for risk factor analysis. It has desirable properties for assessing quantitative relationships between cause and effect. This approach is valuable for the overall assessment of exposure-response relationships, as the focus of health studies shifts from the proximal causes to the distal risk factors.
Variance of the relative risk
Consider two independent binomial random variables $d_l \sim \mathrm{Binomial}(n_l, p_l)$, where l = 1, 2 and $p_l$ is the probability of an ill-health event in each of the $n_l$ observations. For $Y_l = d_l/n_l$ within the domain (0, 1), let $r(Y) = Y_1/Y_2$. The first order Taylor expansion about Y is
Assuming , the second order approximation for is
The variance is
Because of the independence between Y1 and Y2, we therefore have
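For orientation, the standard first-order delta-method result for the ratio of two independent proportions is sketched below. It is offered as a reference point under the binomial assumptions stated above, with the derivatives evaluated at $(p_1, p_2)$ and $\mathrm{var}[Y_l] = p_l(1 - p_l)/n_l$; it is not necessarily identical to the authors' final expression, which may retain higher-order terms.

```latex
\begin{aligned}
r(Y_1, Y_2) &= \frac{Y_1}{Y_2}, \qquad
\frac{\partial r}{\partial Y_1} = \frac{1}{Y_2}, \qquad
\frac{\partial r}{\partial Y_2} = -\frac{Y_1}{Y_2^{2}}, \\[4pt]
\mathrm{var}[r]
  &\approx \frac{1}{p_2^{2}}\,\mathrm{var}[Y_1]
   + \frac{p_1^{2}}{p_2^{4}}\,\mathrm{var}[Y_2]
   = \left(\frac{p_1}{p_2}\right)^{2}
     \left[\frac{1 - p_1}{n_1 p_1} + \frac{1 - p_2}{n_2 p_2}\right].
\end{aligned}
```

The cross term drops out because $Y_1$ and $Y_2$ are independent, which is the step the preceding sentence refers to.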
Derivation of measures for continuous data
To lighten notations, when there is no ambiguity, we adopt the symbols similar to those for the discrete grouped data. Let f(Y) be the probability density function (pdf) of the continuous and non-negative ill-health random variable Y. The Lorenz curve of Y is obtained by plotting
on the ordinate, against the cumulative distribution function (cdf) F(y) along the abscissa, where the expectation exists . By definition, and . The health inequality may be measured by
We further define X, a continuous and non-negative exposure or risk factor of Y with pdf f(X) = f(Y). The cdf for X is
Unlike F(y) in which f(Y) is integrated with respect to Y, F1(x) is given by integrating f(Y) with respect to X. The concentration curve of X can be obtained by plotting
against F1(x) . This definition means that RR may be measured by taking the first order differentiation of L(F1(x)) with respect to F1(x) as
which is the tangent of the concentration curve at a given value of x. In this case, OR is a function of Y,
The AR is
evaluated at F1(x) . Health inequality attributable to X can be measured by CI in the form of
In this paper, F and F1 are rearranged using f(Y) = f(X). As a measure of overall association between the risk factor and ill-health, the correlation coefficient ρ between F and F1 is defined by
Since , we have