Cubic Spline Regression of J-shaped Dose-Response Curves with Likelihood-based Assignments of Grouped Exposure Levels

In epidemiological studies that measure the risk at different levels of exposure, the data is often only available for the analyses that summarized the response data in grouped exposure intervals. In typical methods, the midpoints are used as the assigned exposure levels for each interval. Results of the analysis with grouped data may be sensitive to the assignment of the exposure levels. In this paper, we propose a procedure for assessing J-shaped associations based on the likelihood-based assignment of values to grouped intervals of exposure, and applying the cubic spline regression models. Numerical illustrations and comparisons based on simulations showed that the proposed procedure can yield better estimates for curves than those obtained using the typical assignment method based on the midpoints of each interval. J o ur na l o f B iometrics & Bistatis t i c s ISSN: 2155-6180 Journal of Biometrics & Biostatistics Citation: Takahashi K, Nakao H, Hattori S (2013) Cubic Spline Regression of J-shaped Dose-Response Curves with Likelihood-based Assignments of Grouped Exposure Levels. J Biomet Biostat 4: 181. doi:10.4172/2155-6180.1000181 J Biomet Biostat ISSN: 2155-6180 JBMBS, an open access journal Page 2 of 6 Volume 4 • Issue 5 • 1000181 exposure levels relative to that of 0, even if the level of reference category for reported RRs was not 0. We provide some numerical illustrations and comparisons based on simulations with typical assignments to determine the effects of exposure level assignments. Methods Likelihood based assignment of levels for grouped exposure intervals In some cohort studies with N individuals, the data on exposure xi of each individual i is summarized in a table of grouped exposure interval ( , ) ∈ = i j j j x I u v and the corresponding summarized response data with the frequency Nj, the number of cases aj, the log value of the relative risk yj and its standard error sj for each interval Ij, j=0,1,...,m (Table 3), where the lower endpoint u0 is known but the upper endpoint vm is unknown. Nj is given as the number of individuals, or the person-time, and yj is sometimes shown as the adjusted values of some covariates. sj can also be estimated from the confidence interval of the risks if the standard error of the logRR is not reported [14]. In case-control studies, however, the number of controls bj is also given and yj is the log value of the odds ratio (OR). In typical methods, the midpoints are used as the assigned exposure levels for each interval Ij, (j=1,2,...,m) (here after midpoints assignment). On the other hand, we assign the exposure level based on the summarized data according to the procedure proposed by Takahashi and Tango [6] as follows. We assume that the exposure levels of all individuals in the study are a set of random variables (x1, x2,..., xN) and that a power transformation of the exposure, λ i x , is obtained from a common normal distribution N(μ,σ2) with the mean μ and variance σ2, where 0 log = i i x x for λ=0. Given this assumption, the frequency Nj provides a log-likelihood based on a binomial distribution for the distribution of x=xi. The unknown parameters λ, μ, σ 2 and vm can be estimated by maximizing the log-likelihood. Based on the estimated distribution of x=xi with the probability density function f (x) such as xλ ~ N(μ,σ2) , the assigned exposure level dj for the jth interval Ij=(uj,vj) is calculated as the mean of its truncated distribution:


Introduction
In epidemiological studies, it is often necessary to determine the relationship between exposure levels and the risk of disease. However, the data on exposure levels are often available in intervals because they are generally not recorded for each individual subject. For example in studies on the association between alcohol consumption and the risk of disease, researchers often treat the exposure levels as intervals when they interview participants about their consumption levels, but they are unable to obtain the exact values as continuous variables. Also in traditional meta-analysis based on aggregated data, it is not possible to obtain the original data, and the published articles do not include enough data. In such situations, meta-analysis of observational studies often has to rely on the summarized data where the exposure levels are grouped into intervals available from research reports. Table 1 summarizes data from a study of the association between alcohol consumption and all-cause mortality, which was conducted by Lin et al. [1]. The alcohol intake of current drinkers was classified into five groups: non-drinkers, alcohol intake of 0.1-22.9 g/day, 23.0-45.9 g/day, 46.0-68.9 g/day, and ≥ 69.0 g/day. Table 2 summarizes the characteristics of two studies of coffee consumption and stroke, by Bidel et al. [2] and Grobbee et al. [3]. They were included into a metaanalysis of 11 studies by Larsson and Orsini [4], where the categories used for coffee consumption differed among the studies. Moreover the reference category was assigned to non-drinkers in Grobbee et al. [3], whereas 0-2 cups/day were used in Bidel et al. [2], and hence the meanings of the reported relative risks (RRs) were different and it was inappropriate to combine them directly.
Studies that measure the risk at different levels of exposure are usually analyzed based on a trend estimate by linear (or loglinear) regression analysis. When performing a regression analysis of summarized response data that are grouped into intervals, many researchers use the pre-assigned exposure levels from historical data or the midpoint values of each interval [5]. Results of regression analysis with grouped data may be sensitive to the assignment of the exposure levels. Recently, Takahashi and Tango [6] proposed a method for assigning values by applying the likelihood approach, and they showed the procedure can produce a more accurate linear regression coefficient than the typical procedure which uses the midpoint values.
On the other hand, some studies have reported that the risk of disease has a nonlinear relationship with the exposure level. For example, it is known that the association between alcohol and coronary heart disease [7] or total mortality [8] may be depicted as a J-shaped curve. Some evidence of a J-shaped association has also been reported recently between coffee consumption and the risk of stroke [4]. Curve-fitting methods based on a fractional polynomial model or a spline model has often been applied to these nonlinear dose-response associations for regression [9][10][11]. Di Catelnuovo et al. [8] applied fractional polynomials to fit the association between alcohol intake and the RR of total mortality in a meta-analysis of 34 prospective studies. On the other hand, for example, Larsson and Orsini [4] performed a dose-response meta-analysis and detected a potentially nonlinear association between coffee consumption and stroke using a cubic spline model, and cubic spline regression models may have many advantages over polynomials [12].
In this paper, we propose a procedure for assessing nonlinear associations between exposure levels and the risk of disease from a summarized grouped data, which is based on the assignment of levels to grouped exposure intervals by applying the likelihoodbased assignment procedure proposed in Takahashi and Tango [6]. In particular, we focus on the restricted cubic spline model that was described in Orsini et al. [13] and Larsson and Orsini [4] for J-shaped dose-response curves. We demonstrate how to estimate a J-shaped curve from the grouped summarized data using only four or five class intervals. Also this procedure can provide the log relative risk on each exposure levels relative to that of 0, even if the level of reference category for reported RRs was not 0. We provide some numerical illustrations and comparisons based on simulations with typical assignments to determine the effects of exposure level assignments.

Likelihood based assignment of levels for grouped exposure intervals
In some cohort studies with N individuals, the data on exposure x I u v and the corresponding summarized response data with the frequency N j , the number of cases a j , the log value of the relative risk y j and its standard error s j for each interval I j , j=0,1,…,m (Table  3), where the lower endpoint u 0 is known but the upper endpoint v m is unknown. N j is given as the number of individuals, or the person-time, and y j is sometimes shown as the adjusted values of some covariates. s j can also be estimated from the confidence interval of the risks if the standard error of the logRR is not reported [14]. In case-control studies, however, the number of controls b j is also given and y j is the log value of the odds ratio (OR).
In typical methods, the midpoints are used as the assigned exposure levels for each interval I j , (j=1,2,…,m) (here after midpoints assignment). On the other hand, we assign the exposure level based on the summarized data according to the procedure proposed by Takahashi and Tango [6] as follows. We assume that the exposure levels of all individuals in the study are a set of random variables (x 1 , x 2 ,…, x N ) and that a power transformation of the exposure, λ i x , is obtained from a common normal distribution N(μ,σ 2 ) with the mean μ and variance σ 2 , where 0 log = i i x x for λ=0. Given this assumption, the frequency N j provides a log-likelihood based on a binomial distribution for the distribution of x=x i . The unknown parameters λ, μ, σ 2 and v m can be estimated by maximizing the log-likelihood. Based on the estimated distribution of x=x i with the probability density function f (x) such as x λ ~ N(μ,σ 2 ) , the assigned exposure level d j for the jth interval I j =(u j ,v j ) is calculated as the mean of its truncated distribution: (j=0,1,…,m) (hereafter likelihood-based assignment).

Nonlinear association modeling using cubic splines
Splines are smooth functions that can assume virtually any shape, and the most useful type of spline is generally a cubic spline function, which is restricted to be smooth at the junction of each cubic polynomial [12]. In epidemiological studies, a restricted cubic spline model has been often applied to nonlinear dose-response data. As noted by Larsson and Orsini [4], a restricted cubic spline with three knots was recently applied to a potential nonlinear association that was depicted as a J-shaped curve. However, in some studies, such as that in Table 2, the exposure level of the reference group is not assigned on x=0. Thus, in this paper, we consider a cubic spline model for the log relative risk on the exposure x, logRR(x), that satisfies logRR(d 0 )=0 using assigned exposure levels d 0 for the reference interval I 0 , as follows: First, we construct an approximate covariance estimate for the adjusted log relative risks from a fitted table that conforms to the values proposed by Greenland and Longnecker [15], and we construct a variance-covariance matrix ∑. In this step, we assume the assigned exposure levels d j and standard errors s j for each interval I j to be fixed. The coefficients b of the restricted cubic spline model (1) are estimated using generalized least-squares regression with ∑, i.e., where A' and A -1 imply the transpose and inverse matrices of A, and respectively. Note that, if we need to estimate a spline model for the log relative risk on x relative to x=0, logRR 0 (x), when the reported y implies that relative to the reference x=d 0 , we can determine the spline model as   Table 3: Grouped exposure intervals and summarized response data in a cohort study.
using the same  b estimated from (1), because where is the probability of being a case with an exposure of = .
A crucial problem in spline regression is knot placement [10]. One simple approach is to have the observations x i determine the positions of the knots [16]. Some studies such as those by Larsson and Orsini [4], Harrell et al. [12] and Orsini et al. [13], placed knots at fixed percentiles in the data. Therefore, we examine a procedure here for selecting the positions of three knots among the assigned exposure levels d j (j=1,2,…,m) with the exception of d 0 of the reference group I 0 . Thus, we assign the knots k 1 =d 1 , k 2 =d 2 , k 3 =d 3 when m=3. If m>3, we require a procedure that is based on likelihood. Under the normal assumption of generalized linear regression, y~N m (Xb,Σ), we can derive the loglikelihood function of b and evaluate the likelihood of each resulting model for several candidate knot positions. Thus, the putatively better knots k 1 , k 2 , and k 3 are placed in a position that maximizes

Application
First for the data in Table 1 with person-years for N j , the likelihoodbased assignment procedure assigned the exposure levels as Next for the data in Table 2, the exposure levels were assigned by the likelihood-based assignment as d j =1.65, 3.56, 5.44, and 7.94 cups/ day with λ=2/3 and v m =14.4 in Bidel et al. [2] where the intervals were assumed as 0 ≤ x<2.5, 2.5 ≤ x<4.5, 4.5 ≤ x<6.5, and 6.5 ≤ x, respectively. k 1 =3.56, k 2 =5.44, and k 3 =7.94 were set as the knots, the coefficients were estimated as β 1 =-0.131 and β 1 =-0.024, respectively, and we can determine the estimated curve of logRR 0 from the model (2) ( Figure  2). On the other hand, the midpoints assignment with d 0 =0.0 estimated β 1 =-0.085 and β 1 =-0.023.

Simulation
In this section, we discuss the simulation studies conducted to assess our proposed procedure, wherein we used cubic spline regression for the nonlinear association with the likelihood-based assignments in grouped exposure intervals. Two cubic spline curves x x x p (4) with three knots (k 1 , k 2 , k 3 )=(I) (3, 5, 7) and (II) (2.3, 3.2, 10.5) were assumed to be the true model for the association between exposure x and the log relative risk y in a cohort study, respectively, where p x and p 0 are the probabilities of being a case with the exposure x and x=0, respectively ( Figure 3). We considered interval for grouping, 0 ≤ x<2, 2 ≤ x<4, 4 ≤ x<6, and x ≥ 6, such as that the number of knots in each interval I j was at most one for curve I, whereas the interval 2 ≤ x<4 had two knots, i.e., k 1 =2.3 and k 2 =3.2, for curve II.  Table 1 (black line: the likelihood-based assignment; gray line: the midpoints assignment). Dashed lines represent the 95% confidence intervals based on the asymptotic normal theory.   was close to true b. Overall the curve produced using this procedure had a good fit to the true curve.
Next, the proposed procedure (i), which used the likelihood-based assignment and the selected knots, also had a small bias, small MSE values for each x. In particular, when x ≥ 9 for true curve I, it was shown that the curve produced using this procedure had a better fit to the true curve than that produced using procedure (iv). When x<5 for true curve II, this procedure had a slightly higher bias than procedure (iv). However, the coverage probabilities remained higher than 90%. Procedure (ii), which used the midpoints assignment, was very similar to procedure (i) when x<5 for curves I and II. When x>5, the bias and MSE increased gradually, so this procedure could not deliver stable estimates with a large x.
Procedure (iii), which used the midpoints assignment and d 0 =0, estimated the value slightly higher with a low x compared with the other procedures. With a high x, it showed similar behavior to procedure (ii).

Discussion
In this paper, we proposed a procedure for assessing the nonlinear association between exposure levels and the risk of disease by assigning exposure levels to grouped exposure intervals. In particular, we focused on a restricted cubic spline model for J-shaped dose-response curves. The procedure can be applied to the log relative risks when they are given relative to the reference point x=0 and also to the interval We set the population size N=2,000 and the probability p 0 =0.05 for the reference x=0. We generated a set of exposures x={x 1 , x 2 ,…,x 2,000 } from a truncated normal distribution of N(3.5, 8.0) with the interval 0 ≤ x ≤ 12. By calculating i x p using (4), we generated a set of 1 or 0 Bernoulli random numbers w={w 1 , w 2 ,…,w 2,000 } using Pr{ = 1} = i i x W p for each x i , where the sample was counted as a case when w i =1, and we summarize the generated data{x,w} in Table 4. Note that each relative risk was calculated relative to the reference group 0 ≤ x<2.
We compared the following four procedures: Procedure (i) is our proposed method, and procedures (ii) and (iii) are typical methods that use midpoints, where the highest interval was assigned by assuming that the boundary had the same amplitude as the adjacent category [4]. In addition, we compared the proposed method with procedure (iv), which has true knots. Note that for true curve I, the positions of the three knots in curves according to procedures (ii) and (iii), (k 1 , k 2 , k 3 )=(3.0, 5.0, 7.0), were in the same positions as the true knots.
We generated B=1,000 sets of w 1 , w 2 ,…,w B for the fixed x. The curves were estimated using procedures (i-iv) for each set, and we estimated logRR 0 (x) as  ( ) , and the coverage probability CP(x) for a 95% confidence interval of B=1,000 sets for each x=1,2,…,12 based on the fitted curves produced by each procedure using the summarized data for true curves I and II, respectively. The means and standard deviations of estimated coefficients    Our simulation results showed that the estimated curve was sensitive to the assignment, and the likelihood-based assignment could estimate the nonlinear association accurately. It showed that the proposed procedure using the likelihood-based assignment had a lower bias when used for estimation compared with other procedures that used the midpoints assignment.
One of the applications to estimate a dose-response association from summarized data is that it should be applicable to a metaanalysis. In general, the exposure categories were different in the   Table 5: Comparisons of the estimated values for logRR 0 (x) on x=1,2,…,12, which were derived from the curves using procedure (i-iv) for Curve I, y=-0.2x-0.05x* with knots (3, 5, 7).   studies so they should not be combined. Furthermore, in some of the meta-analysis studies described in Table 2, the reference category was also different, so it was inappropriate to combine them directly. Some methods have been discussed for meta-analysis to obtain the pooled estimate without estimating the association within individual studies. For example, Greenland and Longnecker [15] described the pool-first method for meta-analysis of trend involving data pooling before trend analysis. On the other hand, Rota et al. [11] proposed a random-effects meta-regression model for nonlinear dose-response relationship fitting second-order fractional polynomial models, where the twostep procedures requires initially fitting second-order fractional polynomial models within each study, and then pooling the studyspecific two trend components. They tried to fit a pool-first method on the data of a small number of studies, and they obtained the identical results achieved by using their random-effects approach. It may also be possible to estimate pooled curves using our proposed procedure by a multivariate random effect meta-analysis [17], as discussed in Larsson and Orsini [4]. However, it has to be noted that both pool-first method and two-step procedure use pre-assigned exposure levels for grouped exposure intervals, and they assigned values by the midpoints assignment. Thus, the likelihood-based assignment could give different results in the pooled estimates. Moreover, in situations such as those for assessment of publication or other availability bias by the funnel plot, it is important to accurately estimate individually.
In the procedure reported herein, we fixed three knots in the J-shaped curve. The choice of the location of the knots is a crucial problem and the estimate of the curve was sensitive to their positions. Although we showed simulation results only for a situation of m=3 here, we also examined a procedure where the choice was based on likelihood for m>3, and the results of simulation studies could show that it produced well-fitted curves. In a similar manner to the procedure proposed herein, we can apply restricted cubic spline regression using other numbers of fixed knots to produce a more flexible curve shape. However, the choice of the number of knots is generally a crucial problem in spline regression. In this situation, the model with the best fit can be selected using a similar procedure by evaluating Akaike's Information Criterion (AIC), which is a penalized likelihood that takes into account the number of parameters estimated in the model based on likelihood (3). Also non-cubic spline models have been discussed in epidemiological studies. Further discussions of such models, including evaluations of different methods, are required in the future.
Our work, moreover, could be located in the errors-in-variable field, which aims to correct for bias that arises if measurement error in x is ignored. The statistical approaches developed in those fields might be applied in the situation discussed in this paper. We would also like to leave such a study including comparisons with the proposed procedure here, in our future work.