Department of Mathematics and Statistics, Memorial University, Canada
Received Date: July 29, 2013; Accepted Date: August 26, 2013; Published Date: August 30, 2013
Citation: Abarin T (2013) Gene-environment Interaction Studies with Measurement Error Application in the Complex Diseases in the Newfoundland Population: Environment and Genetics Study. J Biomet Biostat 4:173. doi:10.4172/2155-6180.1000173
Copyright: © 2013 Abarin T. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Newfoundland and Labrador (NL) has had the highest percentage of overweight/obese residents in Canada since 2007. This complex trait is determined by multiple genetic and environmental factors that interact with one another. Existing studies examine such factors under the assumption that they are measured accurately. However, error-prone environmental and genetic factors are unavoidable. The impact of ignoring these errors ranges from biased estimates to false results in detecting associations. Motivated by the CODING study, we present methodologies to estimate model parameters while accounting for measurement error and misclassification. We applied bias-corrected methods to three separate studies: a candidate-gene association study and two gene-environment interaction models, where both environmental and genetic factors are subject to error. Our results based on simulation studies show that the proposed methodologies perform satisfactorily.
Bias-corrected; Gene-environment interaction; Measurement error; Genotyping error; Misclassification
CODING: Complex Diseases in the Newfoundland Population: Environment and Genetics; BC: Bias-corrected; ME: Measurement Error; GEI: Gene-environment Interaction; PA: Physical Activity; PTF: Percent Trunk Fat; FTO: Fat Mass and Obesity
Obesity is a major health issue in Canada. Newfoundland and Labrador (NL) has had the highest percentage of overweight/obese residents in Canada since 2007, and the figure rose by nearly 7%, to 69.3%, in 2011 (Statistics Canada). Obesity is determined by multiple genetic and environmental factors that interact with one another in complicated ways. Existing studies examine such factors under the assumption that they are measured accurately [1-4]. However, unobserved or error-prone environmental factors and/or misclassification in genotyping are unavoidable. In reality, both genetic and environmental factors are likely measured with error. It is now well known that measurement and/or classification errors can influence the results of a study. The impact of ignoring these errors varies from bias and large variability in estimators to low power or even false-negative results in detecting genetic associations [5-7]. In fact, in the presence of measurement error and misclassification, detecting the interaction terms is more challenging than detecting either the genetic or the environmental factors. Motivated by an ongoing, large-scale nutrigenomics (CODING) study of the adult population of Newfoundland, we present methodologies to estimate model parameters while accounting for measurement error and misclassification. We applied bias-corrected methods to three separate studies: a candidate gene association study and two gene-environment interaction models, where both environmental and genetic factors are subject to error. This paper is organized as follows. In Section 2, we introduce the three models and present bias-corrected estimators. In Section 3, we use simulation studies to investigate the finite-sample performance of the proposed estimators in comparison with the naive estimators. The estimation approaches are also illustrated in that section with the analysis of the CODING data.
Model I: Candidate gene association study
Motivated by the CODING study of the Newfoundland population, we present methodologies to estimate the model parameters for three separate studies: a candidate gene association study and two different GEI models. In all three models, we assume that the response is measured accurately.
In this section, we consider a simple linear regression model for typical candidate-gene association studies. The model can be written as

$$Y_i = \beta_0 + \beta_1 G_i + \epsilon_i, \quad i = 1, \dots, n, \tag{1}$$

where $Y_i$ is the response for the ith individual, β0 and β1 are unknown parameters, and G is a binary variable coded for a candidate gene with dominant effect. One can write model (1) in matrix format as

$$Y = X\beta + \epsilon,$$

where $Y$ is the vector of responses, $\epsilon$ is the vector of model error terms with mean zero and variance $\sigma_\epsilon^2$, $\beta = (\beta_0, \beta_1)'$ is the vector of parameters, and $X$ is the n × p design matrix.
Moreover, the binary variable G, with probability of success π, is not observable; instead, a binary variable g is observed with classification error. We denote the sensitivity, or probability of correctly classifying a success in G, by $\theta_{11} = P(g = 1 \mid G = 1)$. Therefore, $1 - \theta_{11}$ is the probability of a false negative. Similarly, the specificity, or probability of correctly classifying a failure in G, is defined as $\theta_{00} = P(g = 0 \mid G = 0)$. Therefore, $1 - \theta_{00}$ is the probability of a false positive. The probability of success for g is determined as follows.

$$P(g = 1) = \theta_{11}\,\pi + (1 - \theta_{00})(1 - \pi).$$

In order to obtain an unbiased estimate of π based on the observed variable, one must correct the bias in g. In fact, with simple algebra, an unbiased estimate of π based on g is

$$\hat{\pi} = \frac{\bar{g} - (1 - \theta_{00})}{\theta_{11} + \theta_{00} - 1},$$

where $\bar{g}$ is the sample mean of g.
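The relation between the observed frequency of g and π can be checked numerically. The following is a minimal sketch, assuming known sensitivity and specificity; the function names, the value π = 0.4, and the sample size are illustrative assumptions rather than settings taken from the study.

```python
import numpy as np


def corrected_prevalence(g, sensitivity, specificity):
    """Moment-based correction of the observed frequency of g = 1."""
    p_obs = float(np.mean(g))  # naive estimate; targets P(g = 1), not pi
    # Solve P(g = 1) = sens * pi + (1 - spec) * (1 - pi) for pi.
    return (p_obs - (1.0 - specificity)) / (sensitivity + specificity - 1.0)


def simulate_prevalence(n=200_000, pi=0.4, sensitivity=0.95, specificity=0.95, seed=0):
    """Generate a true genotype G, a misclassified copy g, and both estimates of pi."""
    rng = np.random.default_rng(seed)
    G = rng.random(n) < pi
    reported_if_carrier = rng.random(n) < sensitivity            # true positives
    reported_if_noncarrier = rng.random(n) < 1.0 - specificity   # false positives
    g = np.where(G, reported_if_carrier, reported_if_noncarrier)
    return float(g.mean()), corrected_prevalence(g, sensitivity, specificity)


naive_pi, bc_pi = simulate_prevalence()
print(f"naive: {naive_pi:.3f}, bias-corrected: {bc_pi:.3f}")
```

With these assumed settings the naive frequency targets 0.95 × 0.4 + 0.05 × 0.6 = 0.41, while the corrected value targets π = 0.4. The correction is undefined when sensitivity + specificity = 1, that is, when g carries no information about G.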
The naive least squares estimator of β in model (1), which ignores the misclassification in g, is $\hat{\beta}_{naive} = (X'X)^{-1}X'Y$, where X = [1, g].
Rewriting model (1) based on the observed variable g as follows; it is easy to see that is an unbiased estimator.
In the second equation, or actually g is assumed to be surrogate, which means that it does not provide any extra information about the distribution of Y given what is already provided by G.
From the above equations, it can be seen that the naive estimator is biased. It is known that this bias is attenuated with large sample size [5,6]. Furthermore, when π is not very small, the naive estimator is sensitive to the sensitivity, in the sense that the smaller θ11, the worse the naive estimator.
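To make the dependence on sensitivity concrete, the sketch below computes the factor by which the naive slope shrinks when g is a surrogate for G, and then undoes it with a simple moment correction. This is an illustrative correction under the surrogacy assumption, not necessarily the estimator of Buonaccorsi used in this paper; the function names and the values β0 = 2, β1 = 0.5, π = 0.4, and unit error variance are assumptions.

```python
import numpy as np


def attenuation_factor(pi, sensitivity, specificity):
    """P(G=1 | g=1) - P(G=1 | g=0): the factor multiplying beta1 in the naive slope."""
    p_obs = sensitivity * pi + (1.0 - specificity) * (1.0 - pi)      # P(g = 1)
    p_carrier_given_pos = sensitivity * pi / p_obs                    # P(G=1 | g=1)
    p_carrier_given_neg = (1.0 - sensitivity) * pi / (1.0 - p_obs)    # P(G=1 | g=0)
    return p_carrier_given_pos - p_carrier_given_neg


def naive_and_corrected_slope(y, g, sensitivity, specificity):
    """Naive OLS slope on g and the same slope divided by the estimated attenuation."""
    y = np.asarray(y, dtype=float)
    g = np.asarray(g).astype(bool)
    naive = y[g].mean() - y[~g].mean()  # OLS slope on a binary regressor
    pi_hat = (g.mean() - (1.0 - specificity)) / (sensitivity + specificity - 1.0)
    return naive, naive / attenuation_factor(pi_hat, sensitivity, specificity)


rng = np.random.default_rng(1)
n, pi, sens, spec = 200_000, 0.4, 0.95, 0.95   # illustrative values
beta0, beta1 = 2.0, 0.5                         # hypothetical coefficients
G = rng.random(n) < pi
g = np.where(G, rng.random(n) < sens, rng.random(n) < 1.0 - spec)
y = beta0 + beta1 * G + rng.normal(0.0, 1.0, n)
naive_b1, corrected_b1 = naive_and_corrected_slope(y, g, sens, spec)
print(f"naive: {naive_b1:.3f}, corrected: {corrected_b1:.3f}")
```

Lowering `sens` in this sketch shrinks the attenuation factor and moves the naive slope further from β1, which is the sensitivity effect described above.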
Modifying the methodology suggested by Buonaccorsi [9] for the linear model with an intercept, the matrix of classification probabilities is defined as
Using the same notation as [9], we have the mean responses for the two genotype groups as µ1 = β0 + β1 and µ2 = β0.
Buonaccorsi [9] proposed a bias-corrected estimator for as where and nw is defined as
With nw1 to be the number of successes in the sample and number of failures in the sample. Returning back the estimates based on β, we have
This method can easily be extended to any candidate gene with an additive effect. We note that genotyping error is usually estimated in one of two ways: either two different genotyping methods are compared, or genotyping with one system is repeated more than once. The latter is less expensive.
Model II: Gene-environment interaction I
Now, we consider the first GEI model as

$$Y_i = \beta_0 + \beta_1 G_i + \beta_2 W_i + \beta_3 G_i W_i + \beta_4 A_i + \epsilon_i. \tag{3}$$

In this model, the environmental factor W is unobservable. Instead, one observes Z, subject to measurement error. The (classical) measurement error model may be expressed as

$$Z = W + U, \tag{4}$$

where U is an unobservable measurement error variable, independent of W, with mean zero and variance, say, $\sigma_u^2$. We also observe g (instead of G) with error. In model (3), there is another environmental factor (A), which is assumed to be measured without error. The interaction term in the model is between two error-prone variables. We are interested in estimating β = (β0, β1, β2, β3, β4)'.
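Defining X to be the design matrix based on the observed variables [1, g, Z, gZ, A], the naive estimator, which ignores both ME and misclassification in the variables, can be expressed as

$$\hat{\beta}_{naive} = (X'X)^{-1}X'Y =
\begin{pmatrix}
n & \sum g & \sum Z & \sum gZ & \sum A \\
\sum g & \sum g^2 & \sum gZ & \sum g^2 Z & \sum gA \\
\sum Z & \sum gZ & \sum Z^2 & \sum gZ^2 & \sum ZA \\
\sum gZ & \sum g^2 Z & \sum gZ^2 & \sum g^2 Z^2 & \sum gZA \\
\sum A & \sum gA & \sum ZA & \sum gZA & \sum A^2
\end{pmatrix}^{-1}
\begin{pmatrix}
\sum Y \\ \sum gY \\ \sum ZY \\ \sum gZY \\ \sum AY
\end{pmatrix}, \tag{5}$$

where the sums in the matrices are over the number of observations.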
The methodology suggested by Buonaccorsi [9] to correct the bias caused by misclassification cannot be applied to this model. Since both sensitivity and specificity are large, the bias caused by this error is small. However, the bias caused by U cannot be ignored. In fact, the larger the variability of U, the worse the naive estimator.
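The terms $\sum Z^2$, $\sum gZ^2$, and $\sum g^2Z^2$ in equation (5) need to be corrected for bias. However, bias-correcting these terms requires $\sigma_u^2$ to be estimated. Generally, estimating $\sigma_u^2$ requires extra information, such as internal or external validation data [5,6]. The BC estimator of β, therefore, can be expressed as follows.

$$\hat{\beta}_{BC} = (X'X - \hat{\Omega})^{-1}X'Y,$$

where $\hat{\Omega}$ is the 5 × 5 matrix whose only nonzero entries are $n\hat{\sigma}_u^2$ in the position of $\sum Z^2$, and $\hat{\sigma}_u^2\sum g$ in the positions of $\sum gZ^2$ and $\sum g^2Z^2$. Since g is binary, $g^2 = g$, so $\sum g^2Z^2 = \sum gZ^2$.
Moreover, since $E(U) = 0$, there is no need for correcting the other terms in the naive estimator.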
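The correction for Model II can be sketched in a few lines: subtract the expected contribution of the measurement error from the three affected sums in X'X and solve the resulting normal equations. This is a minimal illustration under stated assumptions, not the author's code. The function name `fit_model_ii`, the values π = 0.4 and σ²u = 0.5, the unit model error variance, and the large sample size are all assumptions; the coefficient vector (2, 0.1, 0.5, 0.3, 0.2)' is the one used in the simulation study.

```python
import numpy as np


def fit_model_ii(y, g, z, a, sigma_u2=0.0):
    """Least squares on [1, g, Z, gZ, A]; sigma_u2 > 0 applies the moment correction."""
    g = np.asarray(g, dtype=float)
    z = np.asarray(z, dtype=float)
    X = np.column_stack([np.ones_like(g), g, z, g * z, a])
    n, n_g = len(g), g.sum()
    omega = np.zeros((5, 5))
    omega[2, 2] = n * sigma_u2                   # excess in sum Z^2
    omega[2, 3] = omega[3, 2] = n_g * sigma_u2   # excess in sum g Z^2
    omega[3, 3] = n_g * sigma_u2                 # sum g^2 Z^2 = sum g Z^2 because g^2 = g
    return np.linalg.solve(X.T @ X - omega, X.T @ np.asarray(y, dtype=float))


rng = np.random.default_rng(2)
n, pi, sens, spec, sigma_u2 = 100_000, 0.4, 0.95, 0.95, 0.5   # assumed settings
beta = np.array([2.0, 0.1, 0.5, 0.3, 0.2])
G = (rng.random(n) < pi).astype(float)
g = np.where(G == 1, rng.random(n) < sens, rng.random(n) < 1.0 - spec).astype(float)
W, A = rng.normal(size=n), rng.normal(size=n)
Z = W + rng.normal(0.0, np.sqrt(sigma_u2), n)                  # classical ME: Z = W + U
y = beta[0] + beta[1] * G + beta[2] * W + beta[3] * G * W + beta[4] * A + rng.normal(size=n)

naive = fit_model_ii(y, g, Z, A)             # ignores both error sources
bc = fit_model_ii(y, g, Z, A, sigma_u2)      # corrects ME only; misclassification is left as is
print(np.round(naive, 3))
print(np.round(bc, 3))
```

Because the misclassification in g is not corrected here, a small residual bias in the coefficients involving g is expected even after the ME correction.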
Model III: Gene-environment interaction II
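Now, we consider the second GEI model as

$$Y_i = \beta_0 + \beta_1 G_i + \beta_2 A_i + \beta_3 G_i A_i + \beta_4 W_i + \epsilon_i. \tag{7}$$

In this model again, both W and G are unobservable. However, here the interaction is between the misclassified variable and the accurately measured environmental factor. We are interested in estimating β = (β0, β1, β2, β3, β4)'.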
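Defining X to be the design matrix based on the observed variables [1, g, A, gA, Z], the naive estimator can be expressed as

$$\hat{\beta}_{naive} = (X'X)^{-1}X'Y.$$

Here, only $\sum Z^2$ needs to be corrected for bias. The BC estimator of β, therefore, can be expressed as follows.

$$\hat{\beta}_{BC} = (X'X - \hat{\Omega})^{-1}X'Y,$$

where the only nonzero entry of the 5 × 5 matrix $\hat{\Omega}$ is $n\hat{\sigma}_u^2$, in the position of $\sum Z^2$.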
Since the naive estimator does not consider Z or g as random variables, its covariance matrix can be easily written as
The covariance matrix of the bias corrected estimator, however, is conditional on both g and Z, as follows
To examine the finite-sample performance of the bias-corrected approaches for estimating the regression parameters, we carried out simulation studies. For each model, we present the simulation setup and the results separately.
Model I: Candidate gene association study: For this model, we considered n=500 observations. The regression coefficients were , and the model error variance was set to be . The response was generated 1,000 times by using model (1). Both sensitivity and specificity were 0.95. We compared three estimation approaches: True (based on G), Naive (based on g), and BC.
Figure 1 exhibits the magnitude of the biases produced by the three approaches. The figure clearly shows that the True and BC estimators perform well. Note also that, since both sensitivity and specificity were relatively large, the impact of misclassification on the estimators is relatively small.
Model II: Gene-environment interaction I: For this model, we considered n=500 observations. The regression coefficients were β = (2,0.1,0.5,0.3,0.2)', and the model error variance was set to be . The response was generated 1,000 times by using model (3). Both sensitivity and specificity were 0.95. The environmental factors W and A, for simplicity, were generated from a standard normal distribution. The error-prone variable Z was generated from the model Z=W+U, where U is independent of W and has a normal distribution with mean zero and variance $\sigma_u^2$. Here again, we compared the three approaches: True (based on G and W), Naive (based on g and Z), and BC.
Figure 2 shows the magnitude of the biases produced by the three approaches. The figure shows again that the True and BC estimators perform well. The naive use of Z in place of W causes remarkable biases in the estimators of β2 and of β3, the coefficient of the interaction term. Note also that, since the misclassification rates are low, the impact of misclassification on the estimators is negligible. Moreover, since the naive estimate of β0 is unbiased, the box plot for this parameter is omitted.
Model III: Gene-environment interaction II: For this model, we again considered n=500 observations. The regression coefficients were β = (2,0.1,0.5,0.3,0.2)', and the model error variance was set to be . The response was generated 1,000 times by using model (7). Both sensitivity and specificity were 0.95. The environmental factors W and A were generated from a standard normal distribution. The error-prone variable Z was generated from the model Z=W+U, where U is independent of W and has a normal distribution with mean zero and variance $\sigma_u^2$. Here again, we compared the True, Naive, and BC estimation approaches.
Figure 3 shows the magnitude of the biases produced by the three approaches. The figure shows again that the True and BC estimators perform well. The naive use of Z in place of W causes remarkable bias in the estimator of β4, the coefficient of W. Note also that, since the misclassification rates were low, the impact of misclassification on the estimators was negligible. Moreover, since the naive estimate of β0 is unbiased, the box plot for this parameter is omitted.
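A Monte Carlo experiment of this kind for Model III can be sketched as follows. The sample size, number of replications, coefficient vector, and the 0.95 sensitivity and specificity follow the description above; the values π = 0.4, σ²u = 0.5, and unit model error variance, as well as the function names, are assumptions made for illustration, so the numbers produced need not match Figure 3.

```python
import numpy as np


def fit_model_iii(y, g, a, z, sigma_u2=0.0):
    """Least squares on [1, g, A, gA, Z]; only the sum of Z^2 is corrected."""
    X = np.column_stack([np.ones(len(y)), g, a, g * a, z])
    omega = np.zeros((5, 5))
    omega[4, 4] = len(y) * sigma_u2   # expected excess of sum Z^2 over sum W^2
    return np.linalg.solve(X.T @ X - omega, X.T @ y)


def monte_carlo(reps=1000, n=500, sigma_u2=0.5, pi=0.4, sens=0.95, spec=0.95, seed=3):
    """Return the average bias of the naive and ME-corrected estimators."""
    rng = np.random.default_rng(seed)
    beta = np.array([2.0, 0.1, 0.5, 0.3, 0.2])
    naive, bc = np.empty((reps, 5)), np.empty((reps, 5))
    for r in range(reps):
        G = (rng.random(n) < pi).astype(float)
        g = np.where(G == 1, rng.random(n) < sens, rng.random(n) < 1.0 - spec).astype(float)
        W, A = rng.normal(size=n), rng.normal(size=n)
        Z = W + rng.normal(0.0, np.sqrt(sigma_u2), n)
        y = (beta[0] + beta[1] * G + beta[2] * A + beta[3] * G * A
             + beta[4] * W + rng.normal(size=n))
        naive[r] = fit_model_iii(y, g, A, Z)
        bc[r] = fit_model_iii(y, g, A, Z, sigma_u2)
    return naive.mean(axis=0) - beta, bc.mean(axis=0) - beta


naive_bias, bc_bias = monte_carlo()
print("naive bias:", np.round(naive_bias, 3))
print("BC bias:   ", np.round(bc_bias, 3))
```

Under these assumed settings, the naive estimate of β4 is pulled toward zero, while the corrected estimate is close to unbiased on average.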
Application: CODING study
Complex Diseases in the Newfoundland Population: Environment and Genetics (CODING) is an ongoing, large-scale nutrigenomics study in which 2256 individuals from the Newfoundland population were recruited. The variables considered were PTF, measured by dual X-ray absorptiometry, as the response, and, as covariates, the rs9939609 single-nucleotide polymorphism of the FTO gene, genotyped using the high-throughput MassARRAY® platform (Sequenom Inc, San Diego, CA, USA), and PA, measured by the Atherosclerosis Risk in Communities (ARIC) Baecke et al. questionnaire. Subjects were stratified by gender for analysis. Candidate-gene association, gene-physical activity interaction, and gene-age interaction were studied. PTF was assumed to be measured with no error. Age was also assumed to be measured accurately. To avoid collinearity between the variables, age was centred around its mean.
The combination of the Sports and Leisure Time indices was selected for the analysis of PA, which was assumed to be measured with error. FTO genotypes were coded as G=1 and G=0 for the "A" allele with dominant effect. Genotyping error was estimated to be 5%. The purpose of our study was to estimate the coefficients of the three models, accounting for measurement error and genotyping error.
Since there was no extra information available to estimate $\sigma_u^2$, we performed a sensitivity analysis. Note that, for the ME model (4), the variance of Z is always larger than $\sigma_u^2$. In the CODING data, the sample variance of the observed PA was 1.3. Therefore, two arbitrary values, 0.1 and 0.5, were chosen to represent relatively small and relatively large values of $\sigma_u^2$. Evidently, the larger the value of $\sigma_u^2$, the worse the naive estimates of the parameters. Naive (based on observed genotype and PA) and BC (bias-corrected for errors) estimates were calculated for each model with their corresponding standard errors. was calculated using the naive least squares estimators of each model.
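To give a rough sense of scale for the two chosen values, one can compute the reliability ratio Var(W)/Var(Z) = (Var(Z) − σ²u)/Var(Z) implied by the classical ME model. The sketch below is an illustration only: the function name is hypothetical, and the ratio is the attenuation factor of a simple regression on Z alone, so it is only a guide for the multi-covariate models used here.

```python
def reliability_ratio(var_z, sigma_u2):
    """Share of Var(Z) due to the true covariate W under Z = W + U."""
    if not 0.0 <= sigma_u2 < var_z:
        raise ValueError("sigma_u2 must lie in [0, var_z)")
    return (var_z - sigma_u2) / var_z


var_pa = 1.3  # sample variance of the observed PA, as reported above
ratios = {s: reliability_ratio(var_pa, s) for s in (0.1, 0.5)}
for s, lam in ratios.items():
    print(f"sigma_u^2 = {s}: reliability = {lam:.3f}")
```

The two settings correspond to reliabilities of about 0.92 and 0.62, which is consistent with the naive and BC estimates being close in the first case and clearly different in the second.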
Tables 1 and 2 show the results for males and females, separately. As the tables show, when the impact of ME is very small, the Naive and BC estimates for the three models are very similar. However, this is not the case when the impact of ME is relatively large. In Model II, the Naive estimates of the coefficients for G, PA, and their interaction are affected by the large ME. Although there was no correction for misclassification of G in the BC approach, there is a significant difference between the two estimates of β1. The reason is the interaction between G and the error-prone variable PA. In Model III, however, as expected, the only Naive estimate that is highly affected by the ME is the coefficient of PA. As stated in the introduction, the impact of ignoring ME generally varies from bias in the naive estimators to false-positive (or false-negative) results in detecting associations. As there is no estimate available for $\sigma_u^2$ in these data, it is not possible to determine the actual impact. However, some interpretations can be made. The large-sample Wald tests for all the parameters in Model II in Table 2 indicate that the Naive and BC approaches provide similarly significant results. The different signs of the estimates of β1 and β3 under large ME variability, however, lead to different interpretations of these values. The Naive estimates of these parameters imply that, for the low-risk genotype, every additional score in PA yields a 4.7% reduction in PTF. For males with the high-risk genotype, the same increase in PA yields only a 2.8% reduction in PTF. The BC approach, on the other hand, starts with a higher average PTF for males. It also implies that, for males with the low-risk allele, every additional score in PA yields a 6.2% improvement in PTF, whereas for the high-risk genotype this amount is 6.8%.
Model I: PTF = β0 + β1 G + ε,
Model II: PTF = β0 + β1 G + β2 PA + β3 PA*G + β4 Age + ε,
Model III: PTF = β0 + β1 G + β2 Age + β3 Age*G + β4 PA + ε
Table 1: Estimates of model coefficients and the standard errors of naive and BC approach for CODING study–Males.
Model I: PTF = β0 + β1 G + ε,
Model II: PTF = β0 + β1 G + β2 PA + β3 PA*G + β4 Age + ε,
Model III: PTF = β0 + β1 G + β2 Age + β3 Age*G + β4 PA + ε
Table 2: Estimates of model coefficients and the standard errors of naive and BC approach for CODING study–Females.
It is now well known that studies of gene-environment interactions can improve the accuracy and precision of the assessment of both genetic and environmental influences. The existing GEI studies on obesity related traits examine both genetics and environmental factors under the assumption that they are measured accurately. However, in reality, both genetics and environmental factors are likely measured with errors. The impact of ignoring errors in variables varies from bias and large variability in estimators to low power or even false negative (positive) results in detecting genetic associations. In order to obtain more accurate results, the bias caused by the errors needs to be corrected.
In this paper, we studied gene-environment interaction and candidate gene association models in which covariates are subject to misclassification and measurement error. In particular, we proposed bias-corrected methods to account for these errors. The proposed methods are easy to apply and, unlike some other bias-corrected methodologies, do not require distributional assumptions on the ME and/or the error-prone covariates. Our results based on simulation studies show that the proposed methodologies perform satisfactorily. We also analyzed the CODING data, showing that when ME is relatively large, the bias it causes can dramatically affect the parameter estimates and, therefore, the interpretation of the corresponding values.
Other authors have suggested methodologies to deal with ME in linear and nonlinear models. Some studied regression calibration and simulation extrapolation [12-14]. These two methods are only "approximately" consistent, which means that even for large sample sizes they still require small ME to perform well. Likelihood-based methods have also been investigated. Generally, likelihood approaches suffer from restrictive distributional assumptions on the ME, the covariates measured with error, and the model error term. Since error-prone covariates and ME are unobservable, likelihood-based approaches might not be realistic. The approaches proposed in this paper do not require parametric assumptions on the distributions of the unobserved covariates and of the measurement errors, which are difficult to check in practice. They also perform well, no matter how large the ME is. Moreover, the same methodologies may be applied to any interaction model between categorical and continuous variables. However, in those models, both sensitivity and specificity must be estimated.
ME models, in general, require extra information, such as replicate data, internal or external validation data, or instrumental variables, in order to be identifiable. For example, Abarin and Wang proposed a semi-parametric method for estimating the parameters of generalized linear regression models with the classical ME model using instrumental variables. When no extra information is available, a sensitivity analysis is performed.
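As one illustration of how replicate data supply the extra information, the ME variance in the classical model Z = W + U can be estimated from two independent measurements of the same subject, since the difference of the replicates removes W. The sketch below is a standard moment estimator under that assumption; replicate PA measurements were not available in CODING, and the function name and simulated values are hypothetical.

```python
import numpy as np


def me_variance_from_replicates(z1, z2):
    """Estimate Var(U) from two replicate measurements of the same W."""
    d = np.asarray(z1, dtype=float) - np.asarray(z2, dtype=float)  # W cancels: d = U1 - U2
    return float(np.mean(d ** 2) / 2.0)                            # E[d^2] = 2 * Var(U)


rng = np.random.default_rng(4)
n, sigma_u2 = 50_000, 0.5   # hypothetical sample size and ME variance
W = rng.normal(size=n)
Z1 = W + rng.normal(0.0, np.sqrt(sigma_u2), n)
Z2 = W + rng.normal(0.0, np.sqrt(sigma_u2), n)
sigma_u2_hat = me_variance_from_replicates(Z1, Z2)
print(f"estimated ME variance: {sigma_u2_hat:.3f}")
```

The resulting estimate could then be plugged into the bias-corrected estimators in place of the arbitrary values used in a sensitivity analysis.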
The methodology proposed in this paper can be generalized to longitudinal models. Fan et al. proposed a bias-corrected quasi-likelihood approach for longitudinal models in which continuous covariates are subject to error. Generalization of the methodology to longitudinal models with both misclassified and mismeasured covariates, and the interaction between them, has yet to be studied. More studies are also required to extend the methodology proposed in this paper to GEI models in which the classified variable has more than two categories.
Overall, the results of this paper contribute to enhancing the discovery of genetic and environmental factors in GEI studies. We developed modern yet flexible measurement error techniques that will improve the identification of genetic variants, environmental factors, and their interactions associated with any complex trait.
Research was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) and Research and Development Corporation Newfoundland and Labrador (RDC). The author is grateful to Dr. Guang Sun, Faculty of Medicine at Memorial University for providing the data for the analysis. The author is also grateful to the reviewers for their very helpful comments that improved the paper.