Heritability Estimation using Regression Models for Correlation

Heritability estimates a polygenic effect on a trait for a population. Reliable interpretation of heritability is critical in planning further genetic studies to locate a gene responsible for the trait. This study accommodates both single and multiple trait cases by employing regression models for the correlation parameter to infer the heritability. Sharing the properties of the regression approach, the proposed methods are flexible enough to incorporate non-genetic and/or non-additive genetic information in the analysis. The performance of the proposed model is compared with that of the likelihood approach through simulations and an analysis of carotid Intima Media Thickness from the Northern Manhattan Family Study.
*Corresponding author: Hye-Seung Lee, Pediatrics Epidemiology Center, Department of Pediatrics, University of South Florida, Tampa, FL 33612, USA, E-mail: Hye-Seung.Lee@epi.usf.edu
Received July 18, 2011; Accepted October 31, 2011; Published November 15, 2011
Citation: Lee HS, Paik MC, Rundek T, Sacco RL, Dong C, et al. (2011) Heritability Estimation using Regression Models for Correlation. J Biomet Biostat 2:119. doi:10.4172/2155-6180.1000119
Copyright: © 2011 Lee HS, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Journal of Biometrics & Biostatistics, ISSN: 2155-6180, Volume 2, Issue 4, 1000119


Introduction
Heritability is a fundamental concept for quantifying the genetic contribution to a quantitative trait. Through the random effect model for a quantitative trait, locus specific heritability, which is used for linkage analysis, is estimated after adjusting for polygenic heritability. Heritability was originally developed in animal studies, which have more control over environmental effects through the study design. Quantitative trait analysis has recently received attention in human genetic studies of common diseases. Since common diseases are mostly multifactorial, there are likely to be multiple genetic and environmental factors involved in the disease process, and each factor by itself will have limited influence on the disease. Thus, one way to enhance the power to detect a genetic influence on a common disease is to examine subcomponent traits of the disease when those traits are considered to be closer to the action of a gene. In family studies, analyzing quantitatively measured subcomponent traits can be of benefit because they can provide genetic information for both affected and unaffected family members. However, existing analytic methods are limited in their ability to control for non-genetic differences between two family members in the analysis, which is more critical in common disease analyses.
In quantitative genetics, the covariance of a trait is described as the sum of the covariance of a random genetic effect and the covariance of a random environmental effect [1,2]. The genetic variance is the variance of a random genetic effect given the correlation of random genetic effects between family members. The environmental variance is the variance of a random environmental effect given the correlation of random environmental effects between family members. While the correlation of random genetic effects is the probability of sharing genes, which can be determined based on their familial relationship, the correlation of random environmental effects is the probability of sharing environments between them. In non-human genetics, a proper study design can make feasible the estimation of the probability of sharing environments on a trait. But in human genetics, especially for large family studies including multi-generations, it is not trivial to make any reasonable estimation on the probability of sharing environments between two family members or the structure for each family.
There has been discussion on the impact of pedigree structure on heritability estimation. Mallinckrodt et al. [3] found that there was no difference in the precision and accuracy of heritability estimates from simulated data for random or fixed pedigree structures. However, [4] compared heritability estimates for pulse pressure from three different family studies and showed that multi-generation studies produced significantly lower heritability estimates than sib-pair studies. Recently, [5] revisited this issue using simulated data and concluded that there is hardly any impact of pedigree structure on heritability estimates. These contradictory findings may arise because simulated data can comply with any necessary assumption, such as zero correlation of environmental effects between two family members for all families, whereas real data may not.
To reduce the bias due to non-genetic effects, one strategy would be to conduct an analysis adjusted for environmental covariates. Issues in environmental covariate adjustment have been discussed in the setting of quantitative trait linkage analysis. As compared in [6], individual specific covariates are included in the analysis to reduce the residual error of the mean for the individual levels, while pair specific covariates allow or adjust for pair specific differences between pairs on the covariance, which is critical in quantitative genetics. Although both the variance component model and the regression approach can handle pair specific covariates, the regression approach provides more flexibility to incorporate pair specific covariates (e.g., age difference for a pair) in the analysis. In addition, [7] conducted a simulation study of variance component models incorporating covariates in the means, covariance and residuals, under the normality assumption, and concluded that potentially important environmental covariates should be adjusted for in the absence of a correlation between the covariates and the locus.
In this paper, we propose a new analytic method to estimate heritability for both single and multiple trait cases. When phenotypic similarity is translated into the covariance, heritability is commonly estimated as the ratio between the genetic variance estimate and the total phenotypic variance estimate. Unlike existing approaches, the proposed method estimates the heritability itself as a regression parameter, without estimating both variances. For single trait analysis, we implement the efficient three generalized estimating equations (GEEs) proposed by [8]; for multiple trait analysis, we implement the method for multivariate familial correlation estimation proposed by [9]. While this new approach shares the advantages of regression approaches under the GEE scheme, it also allows us to handle various types of relative pairs under one model and to estimate all parameters at the same stage, without requiring the normality assumption. The performance of the proposed model is compared with that of the variance component model through simulations and an analysis of carotid Intima Media Thickness (IMT) from the Northern Manhattan Family Study (NOMAS).

Materials and Methods

Random effect model for a quantitative trait
One's level of a quantitative trait can be viewed as a combination of genetic and environmental effects. With the conventional assumption that genetic and environmental effects are independent, a model for the trait of interest $Y_{ij}$ from the $j$th member of the $i$th family is written as follows:
$$Y_{ij} = \mu_{ij} + g_{ij} + \varepsilon_{ij} \tag{1}$$
where $\mu_{ij}$ is the mean of the trait, while $g_{ij}$ and $\varepsilon_{ij}$ are the random genetic and environmental effects, respectively.
Considering the nature of the genetic effect $g_{ij}$, the model (1) becomes as follows:
$$Y_{ij} = \mu_{ij} + A_{ij} + d_{ij} + \varepsilon_{ij} \tag{2}$$
where $A_{ij}$ is the additive genetic effect whose variance is $\sigma_A^2$, $d_{ij}$ is the dominance effect whose variance is $\sigma_d^2$, and $\varepsilon_{ij}$ is the environmental effect whose variance is $\sigma_e^2$. Note that epistasis was not considered in this model. From the independence assumption, the total phenotypic variance is $\mathrm{Var}(Y_{ij}) = \sigma^2 = \sigma_A^2 + \sigma_d^2 + \sigma_e^2$, and for $j \neq k$,
$$\mathrm{Cov}(Y_{ij}, Y_{ik}) = \mathrm{Cov}(A_{ij}, A_{ik}) + \mathrm{Cov}(d_{ij}, d_{ik}) + \mathrm{Cov}(\varepsilon_{ij}, \varepsilon_{ik}) = 2\theta_{jk}\sigma_A^2 + \Delta_{jk}\sigma_d^2 + g_{ijk}\sigma_e^2$$
where $2\theta_{jk} = \mathrm{Corr}(A_{ij}, A_{ik})$, $\Delta_{jk} = \mathrm{Corr}(d_{ij}, d_{ik})$, and $g_{ijk} = \mathrm{Corr}(\varepsilon_{ij}, \varepsilon_{ik})$. Note that $\theta_{jk}$ and $\Delta_{jk}$ are determined as the probability of sharing additive IBD alleles and the probability of sharing dominance IBD alleles, respectively, which can be expected from the familial relationship of $j$ and $k$, irrespective of $i$. Two alleles (one each from two individuals) are said to be IBD (identical by descent) when they are in the same form and descended from the same ancestral allele. Table 1 shows the $\theta_{jk}$ and $\Delta_{jk}$ for each familial relationship.
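As a small numerical illustration of this decomposition, the Python sketch below evaluates the expected covariance $2\theta_{jk}\sigma_A^2 + \Delta_{jk}\sigma_d^2 + g_{ijk}\sigma_e^2$ and the implied correlation for a relative pair. The function names and the variance values are hypothetical, and the sharing coefficients in the dictionary are the standard expected values for the two relationships shown, not values copied from Table 1.

```python
# Expected (2*theta, Delta) for two common relationships.
# Assumption: standard quantitative-genetics values, not taken from Table 1.
SHARING = {
    "parent-offspring": (0.5, 0.0),
    "full siblings": (0.5, 0.25),
}


def pair_covariance(two_theta, delta, env_corr, var_a, var_d, var_e):
    """Cov(Y_ij, Y_ik) = 2*theta*var_a + Delta*var_d + g*var_e."""
    return two_theta * var_a + delta * var_d + env_corr * var_e


def pair_correlation(two_theta, delta, env_corr, var_a, var_d, var_e):
    """Covariance divided by the total phenotypic variance."""
    total = var_a + var_d + var_e
    return pair_covariance(two_theta, delta, env_corr, var_a, var_d, var_e) / total


# Hypothetical variance components (they sum to 1, so covariance = correlation).
var_a, var_d, var_e = 0.6, 0.1, 0.3
for name, (two_theta, delta) in SHARING.items():
    rho = pair_correlation(two_theta, delta, 0.0, var_a, var_d, var_e)
    print(f"{name}: correlation = {rho:.3f}")
```

With zero environmental correlation, the two relationships differ only through the dominance term, which is why sib pairs carry information about $\sigma_d^2$ that parent-offspring pairs do not.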

Proposed methods
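The correlation between two levels from a pair is expressed as
$$\mathrm{Corr}(Y_{ij}, Y_{ik}) = 2\theta_{jk}\frac{\sigma_A^2}{\sigma^2} + \Delta_{jk}\frac{\sigma_d^2}{\sigma^2} + g_{ijk}\frac{\sigma_e^2}{\sigma^2}$$
where the ratio $\sigma_A^2/\sigma^2$ is the heritability. To estimate the heritability, existing methods estimate the total variance $\sigma^2$ and $\sigma_A^2$, given the $2\theta_{jk}$ and $\Delta_{jk}$ for a pair $(j, k)$ in the $i$th family.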
A likelihood function that assumes a multivariate normal distribution for the trait vector of interest from the members of a family is commonly used in the analysis of extended families. Inference for the heritability estimator under the variance component approach has been well established based on restricted maximum likelihood under the general framework of [10,11]. In addition, Ekstrøm [5] provided an explicit expression for the asymptotic variance of each variance estimator.
We propose an alternative approach to interpret the heritability from the phenotypic correlation. Single trait analyses may employ Pearson's correlation, and multiple trait analyses may employ the maximum canonical correlation. By interpreting the correlation for a pair $(j, k)$ in the $i$th family as a regression model $2\theta_{jk} b_1 + \Delta_{jk} b_2 + g_{ijk} b_3$, the inference on the regression parameter $b_1$ explains the heritability of the trait. To model Pearson's correlation, we employed the 3GEEs of [8]; for the maximum canonical correlation, we employed a joint estimating equation of [9].
Our proposed approach differs from the existing approach in three ways: first, we do not assume normality and use an estimating equation approach rather than a likelihood approach, which makes it flexible to incorporate non-genetic sharing between two individuals; second, we parameterize in such a way that the ratio of $\sigma_A^2$ to $\sigma^2$ can be estimated directly, rather than estimating each variance component separately; lastly, multiple trait analysis can be used to examine pleiotropy, which occurs when a single gene influences multiple traits.
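To make the idea of regressing the pair correlation on $2\theta_{jk}$ concrete, the Python sketch below fits the simplest version of that model: a no-intercept least-squares regression of standardized residual cross-products on $2\theta_{jk}$, with the dominance and non-genetic terms dropped. This is an illustrative simplification and not the 3GEEs of [8] or the joint estimating equation of [9]; the function name and the toy data are hypothetical.

```python
def heritability_through_origin(pairs):
    """Least-squares slope of z_j * z_k on 2*theta, with no intercept.

    pairs: iterable of (two_theta, z_j, z_k), where z_j and z_k are
    standardized residuals (mean 0, variance 1) of the two relatives.
    Assumes zero dominance and zero environmental correlation, so that
    E[z_j * z_k] = 2*theta * b_1.
    """
    numerator = 0.0
    denominator = 0.0
    for two_theta, z_j, z_k in pairs:
        numerator += two_theta * z_j * z_k
        denominator += two_theta ** 2
    if denominator == 0.0:
        raise ValueError("at least one pair must have 2*theta > 0")
    return numerator / denominator


# Toy data constructed so that every cross-product equals 0.6 * 2*theta.
toy_pairs = [
    (0.5, 1.0, 0.3),     # first-degree pair
    (0.25, 0.5, 0.3),    # second-degree pair
    (0.125, 0.25, 0.3),  # third-degree pair
]
print(heritability_through_origin(toy_pairs))
```

Because the slope is estimated directly, no separate estimates of $\sigma_A^2$ and $\sigma^2$ are needed; pair specific covariates would enter as additional columns in a fuller version of the regression.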

Simulation study
Simulation studies compared the performance of the proposed approach with the standard analysis implementing restricted maximum likelihood estimation for single trait cases. Bias and efficiency were compared in two situations: when the trait does not follow a normal distribution, and when a non-zero correlation of environmental effects exists between two family members. In each situation, we generated 100 simulated data sets, each including 100 families with three sib-pairs. For siblings, the correlation of additive genetic effects, $2\theta_{jk}$, is 0.5, while the correlation of dominance genetic effects, $\Delta_{jk}$, is 0.25. For the multivariate normal, t or gamma distribution of a trait, with true polygenic heritabilities of 0.4, 0.6 and 0.8, the bias and variance of each heritability estimate were compared while varying the correlation of environmental effects for each pair over 0, 0.3, 0.6 and 0.9. That is, data generated with zero correlation of environmental effects comply with the definition of the conventional heritability. Table 2 summarizes the total variances and correlations implemented in this simulation.
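One way to generate data of this kind for the normal case is sketched below in Python, using only the standard library. It is a simplified illustration under stated assumptions rather than the simulation code used in the study: the total variance is fixed at 1, the dominance variance is set to zero, each family is taken to have three siblings (which form three sib-pairs), and all function names are hypothetical. The actual variances and correlations used are those in Table 2.

```python
import math
import random


def expected_sib_correlation(h2, env_corr):
    """Implied sib-pair correlation: 0.5*h2 + g*(1 - h2), with no dominance."""
    return 0.5 * h2 + env_corr * (1.0 - h2)


def simulate_families(n_families, h2, env_corr, n_sibs=3, seed=0):
    """Simulate normal sibling traits with total variance 1 (assumption)."""
    rng = random.Random(seed)
    sd_a, sd_e = math.sqrt(h2), math.sqrt(1.0 - h2)
    families = []
    for _ in range(n_families):
        shared_a = rng.gauss(0.0, 1.0)  # additive part common to the sibship
        shared_e = rng.gauss(0.0, 1.0)  # environment common to the sibship
        traits = []
        for _sib in range(n_sibs):
            # Corr(A_ij, A_ik) = 0.5 for full siblings.
            additive = sd_a * math.sqrt(0.5) * (shared_a + rng.gauss(0.0, 1.0))
            # Corr(e_ij, e_ik) = env_corr.
            environment = sd_e * (math.sqrt(env_corr) * shared_e
                                  + math.sqrt(1.0 - env_corr) * rng.gauss(0.0, 1.0))
            traits.append(additive + environment)
        families.append(traits)
    return families


def sib_pair_correlation(families):
    """Pearson correlation over all within-family sib pairs."""
    xs, ys = [], []
    for traits in families:
        for j in range(len(traits)):
            for k in range(j + 1, len(traits)):
                xs.append(traits[j])
                ys.append(traits[k])
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# One simulated data set of 100 families, heritability 0.6, g = 0.3.
data = simulate_families(100, h2=0.6, env_corr=0.3, seed=2011)
print(sib_pair_correlation(data), expected_sib_correlation(0.6, 0.3))
```

With a positive environmental correlation, the sib-pair correlation exceeds $0.5\,\sigma_A^2/\sigma^2$, which is the source of the bias that arises when the conventional heritability is estimated under $g_{ijk} = 0$.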
First, we examined the impact of a positive environmental correlation on the current restricted maximum likelihood approach for the conventional heritability, which assumes $g_{ijk} = 0$. Table 3 presents the simulation results. As expected, the likelihood approach performed best when the true correlation of environmental effects was zero and the trait followed the normal distribution. The bias grew as the true correlation of environmental effects increased, regardless of the size of the true heritability or the distribution.
Incorporating the correlation of environmental effects into the likelihood approach and the proposed approach, we then compared their performance when the true non-zero correlations of environmental effects were known for families. Table 4 presents the results for the heritability estimates including a non-zero correlation of environmental effects in the model. We found that the likelihood method produces better results with estimates of the correlation of environmental effects than with the true correlation specified for the simulated data (results are not shown). Thus, we present the simulation results for the likelihood approach when the correlation of environmental effects was estimated from each simulated data set with the correct specification of the structure for each family, while the proposed method used the correct specification of the correlation of environmental effects. Including a non-zero correlation parameter for environmental effects in the model reduced the bias in both approaches, especially when the true correlation of environmental effects was high, as expected (Table 3 vs. 4). The likelihood approach performed better when the data were generated from the normal distribution, but the proposed method showed greater improvement as the data deviated more from normality. However, the likelihood approach requires the correlation of environmental effects to be estimable or the number of family members to be the same.
The variance estimates for the proposed model's estimates were closer to the simulation variances. The performance of the proposed method was stable in most of the cases considered in the simulations.

Data analysis
Sacco et al. [12] presented heritability and linkage analysis for eight carotid IMT measures from 1390 subjects in 100 Caribbean Hispanic families. The families were selected when probands from the NOMAS were high-risk Caribbean Hispanic members defined by the following criteria: (1) reporting a sibling with a history of myocardial infarction or stroke; or (2) having 2 of 3 quantitative risk phenotypes (maximal carotid plaque thickness, left ventricular mass, or homocysteine level above the 75th percentile in the NOMAS cohort). Families were enrolled if the proband was able to provide a family history, obtained the family members' permission for the research staff to contact them, and had at least 3 first-degree relatives able to participate. Although the probands were identified in Northern Manhattan, family members were enrolled in New York at Columbia University and in the Dominican Republic at the Clinicas Corazones Unidos in Santo Domingo. All subjects provided informed consent, and the study was approved by the Institutional Review Boards of Columbia University, University of Miami, the National Bioethics Committee, and the Independent Ethics Committee of Instituto Oncologico Regional del Cibao in the Dominican Republic. See [13] for the details of the study design.
Carotid IMT was measured by high-resolution B-mode ultrasound and expressed as the mean and the mean of the maximum. Ultrasound measures of carotid IMT have been demonstrated to be valid measures of pathologically defined atherosclerosis, highly reproducible, associated with vascular risk factors, and predictive of stroke and myocardial infarction. The carotid IMT protocols yield measurements of the distance between lumen-intima and media-adventitia ultrasound echoes, from which the IMT and arterial diameter are derived for the 3 carotid segments. The carotid segments were defined as follows: (1) near and far wall of the segment extending from 10 to 20 mm proximal to the tip of the flow divider into the common carotid artery (CCA); (2) near and far wall of the carotid bifurcation beginning at the tip of the flow divider and extending 10 mm proximal to the flow divider tip (BIF); and (3) near and far wall of the proximal 10 mm of the internal carotid artery (ICA). Total IMT was calculated as a mean composite measure of the means of the near and the far wall IMT of all carotid sites (IMTx), and the maximum of the near and the far wall IMT of all carotid sites (IMTm). Carotid segment-specific IMT phenotypes were also examined (BIFx, BIFm, CCAx, CCAm, ICAx, ICAm).
We analyzed the data used in [12] to examine the performance of our proposed method. We first estimated the heritability assuming zero environmental correlations in the proposed model, to compare with the results from SOLAR, which implements the restricted maximum likelihood approach. Our method interprets the relationship between phenotypic correlation and gene sharing as the heritability. Assuming $g_{ijk} = 0$, the regression parameter $b_1$, which is the heritability, would be roughly the sum of the phenotypic correlation estimates for each degree of relatives over the sum of $2\theta_{jk}$ for each degree of relatives. For example, for the trait IMTx, the pairs ($2\theta_{jk}$, correlation of the residual IMTx from the mean model) for each degree of relative pairs would be (0.5, 0.406) for parent-offspring pairs, (0.5, 0.379) for sib-pairs, (0.25, 0.294) for second degree, (0.125, 0.121) for third degree, and (0.0625, 0.105) for fourth degree. Thus, the estimate of $\sigma_A^2/\sigma^2$ is 0.908, as the sum of phenotypic correlations is 1.305 and the sum of $2\theta_{jk}$ is 1.4375. As shown in Table 5, the heritability from the proposed model varied from 0.28 to 0.91, while the heritability reported in [12] ranged from 0.41 to 0.65, after adjusting for the same confounders in the mean model for each carotid IMT. Although the two sets of estimates were quite different, the trend (small to large) was consistent, as was the statistical significance.
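The arithmetic in this IMTx example can be checked directly. The short Python sketch below uses only the five ($2\theta_{jk}$, residual correlation) pairs quoted in the text; the variable names are illustrative, and the ratio of sums is the rough approximation described here, not the GEE estimate itself.

```python
# (2*theta, residual correlation of IMTx) by degree of relationship,
# as quoted in the text.
imtx_pairs = {
    "parent-offspring": (0.5, 0.406),
    "sib-pairs": (0.5, 0.379),
    "second degree": (0.25, 0.294),
    "third degree": (0.125, 0.121),
    "fourth degree": (0.0625, 0.105),
}

sum_two_theta = sum(two_theta for two_theta, _ in imtx_pairs.values())
sum_correlation = sum(corr for _, corr in imtx_pairs.values())
rough_heritability = sum_correlation / sum_two_theta

print(f"sum of 2*theta      = {sum_two_theta:.4f}")
print(f"sum of correlations = {sum_correlation:.3f}")
print(f"rough heritability  = {rough_heritability:.3f}")
```

The second- and fourth-degree correlations exceed their $2\theta_{jk}$ values, which pulls the ratio up toward 0.908.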
In this study, the mean family size was 14 (standard deviation (SD)=8); for family member pairs, age difference varied from 0 to 82 years (mean=20, SD=14, first quartile (Q1)=7, median (Q2)=18 and third quartile (Q3)=29), and BMI difference ranged from 0 to 32.7 (mean=6.0, SD=5.0, Q1=2.2, Q2=4.7 and Q3=8.4). A common practice is to analyze the residuals after adjusting for potential environmental confounders in the mean model. However, it is still not clear whether adjustment in the mean model alone is sufficient to correct for the bias when the covariance is modeled. Table 6 shows the raw correlation and the residual correlation from the mean model for CCAx, BIFx and ICAx, stratified by the quartiles of age or BMI difference for pairs. The raw correlations became smaller when the age difference or BMI difference was greater, regardless of gene sharing. Residual correlations stratified by the quartiles of age difference no longer show the trend, but those stratified by BMI difference seem to retain it.
We applied our proposed methods including the intercept in the model. For three different mean measures of carotid segments, CCAx, BIFx and ICAx, we studied the heritability of each trait and the heritability of the three measures together.
... the multiple trait analysis for CCAx, BIFx and ICAx, simultaneously.
We estimated the three trait heritability for CCAx, BIFx and ICAx to examine whether there could be a common gene effect on those three measures. The multiple trait heritability was 0.738 from Model 1 and 0.784 from Model 2. Unlike in the single trait analysis, the estimate became greater in Model 2, as the age difference was significantly different from zero. In conclusion, all three traits showed a strong genetic effect.

Discussion
In this paper, we proposed new heritability estimation methods implementing regression models for correlation. Unlike existing methods that estimate each variance component, the proposed approach estimates the heritability as a regression parameter through estimating equations, without requiring the normality assumption. In conjunction with linkage analysis, a number of new quantitative trait analysis methods have been developed for the extended family design that employ the GEE, regression based statistics or score statistics to be robust to the normality assumption [14-17]. However, to our knowledge, none of these methods estimates the heritability itself. By interpreting the heritability as a regression parameter under the GEE framework, the proposed approach has the following advantages: 1) flexibility to model the genetic and non-genetic differences between two family members; 2) no normality assumption on the trait of interest; 3) simplicity in expressing the heritability adjusted for pair specific non-genetic factors; 4) robust estimation of the variance of the heritability estimate; 5) simplified inference as discussed in [18]. We also extended our approach to accommodate multiple trait heritability analysis. Although [19] discussed multiple trait analysis using variance component models, there is no analytic method available for multiple trait analysis using correlations from multi-generation families.
Simulation results showed that the proposed method gave an improvement when the true heritability of the trait of interest was high or when the trait did not follow a normal distribution. The likelihood approach increases the bias and greatly underestimates the actual variance when the model is misspecified. However, when the trait follows a normal distribution, the likelihood approach performed better than the proposed method. The model (2) has been extended for multipoint quantitative trait linkage analysis [20]. In that context, zero environmental correlation is often assumed. For heritability estimation assuming zero environmental correlation, the biases from both approaches were, as expected, not negligible when the environmental correlation was high, which made both approaches inaccurate (results were presented for the likelihood approach only). Since zero environmental correlation is very unlikely to hold for all relative pairs, it is desirable to interpret the heritability adjusted for a non-zero correlation of environmental effects in the model to measure the genetic contribution to a quantitative trait. The proposed methods provide a direct interpretation of the adjusted heritability from a model that includes pair specific genetic and non-genetic effects in the analysis. The data analysis implemented the heritability adjusted for a common non-genetic correlation for all pairs and for pair specific differences in the model. The non-genetic correlation, which is the intercept in the correlation regression model, was weak for the three carotid IMT measures (CCAx, BIFx and ICAx) in single trait analysis. In single trait analysis, the differences in age or BMI were not significant, but age difference was significantly different from zero in multiple trait analysis. Regardless, both analyses showed strong genetic components.
The current paper is focused on polygenic heritability, but locus specific heritability, which uses genotypes, can be a direct extension of this model. This new regression approach may reduce the bias resulting from not accounting for environmental effects in quantitative genetics and may increase the power to locate genes for quantitative traits. It is also well known that the marginal approach can produce negative variance component estimates. In the likelihood approach, a hierarchical model can be used to ensure a positive variance estimate. Our approach can also produce a negative heritability estimate, but a transformation of the regression parameter, such as the logit, can constrain the heritability estimate to lie between 0 and 1.
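The logit reparameterization mentioned above can be illustrated with a few lines of Python. This is a minimal sketch with hypothetical function names: the regression would be fitted on an unconstrained parameter, and the back-transformed value is then reported as the heritability. It is not part of the analyses reported in this paper.

```python
import math


def logit(h2):
    """Map a heritability in (0, 1) to the whole real line."""
    if not 0.0 < h2 < 1.0:
        raise ValueError("heritability must lie strictly between 0 and 1")
    return math.log(h2 / (1.0 - h2))


def inverse_logit(eta):
    """Map any real eta back into the unit interval (numerically stable form)."""
    if eta >= 0.0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1.0 + z)


# Any unconstrained value maps to a heritability between 0 and 1.
for eta in (-3.0, 0.0, 1.5):
    print(eta, inverse_logit(eta))
```

A consequence of this design is that estimates can approach but never cross the boundaries, at the cost of interpreting the fitted parameter on the log-odds scale.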