Received date: December 21, 2011; Accepted date: February 15, 2012; Published date: February 21, 2012
Citation: Jamsen KM, Zaloumis SG, Scurrah KJ, Gurrin LC (2012) Specification of Generalized Linear Mixed Models for Family Data using Markov Chain Monte Carlo Methods. J Biomet Biostat S1:003. doi: 10.4172/2155-6180.S1-003
Copyright: © 2012 Jamsen KM, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Visit for more related articles at Journal of Biometrics & Biostatistics
Statistical models imposed on family data can be used to partition phenotypic variation into components due to sharing of both genetic and environmental risk factors for disease. Generalized linear mixed models (GLMMs) are useful tools for the analysis of family data, but it is not always clear how to specify individual-level regression equations so that the resulting within-family variance-covariance matrix of the phenotype reflects the correlation implied by the relatedness of individuals within families. This is particularly challenging when families are of varying sizes and compositions. In this paper we propose a general approach to specifying GLMMs for family data that uses a decomposition of the within-family variance-covariance matrix of the phenotype to set up a series of regression equations with fixed and random effects that corresponds to an appropriate genetic model. This “mechanistic” specification is particularly suited to estimation and evaluation of models within a Markov chain Monte Carlo (MCMC) framework. The proposed approach was assessed with simulated data to demonstrate the accuracy of estimation of the variance components. We analyzed data from the Victorian Family Heart Study (families with two generations over-sampled for those with monozygotic and dizygotic twins) and for a binary phenotype (hypertension) that resulted in substantially reduced computation time in the MCMC framework (via WinBUGS) compared with a maximum likelihood approach (via Stata). The proposed approach to model specification in this paper is easily implemented using standard software and can accommodate prior information on the magnitude of fixed or random effects.
Family data; Generalized linear mixed models; Model specification; Markov chain Monte Carlo
High-throughput genotyping technology now provides epidemiologists with an unprecedented opportunity to explore the association between measured genetic variants and the risk of disease. For most common complex diseases of adult life, however, putative genetic risk factors explain only a small proportion of phenotypic variation. Statistical analysis of family data therefore retains an important role in the search for genetic determinants of disease since, with suitable assumptions, phenotypic variation can be partitioned into shared genetic and common environmental components. Efforts to elucidate the role of genetic variants through their functional effects can then be concentrated on phenotypes where there is strong evidence that observed variation in the phenotype is consistent with the influence of genetic factors.
Generalized Linear Mixed Models (GLMMs) offer a convenient vehicle to perform the variance components analysis described above. Such an analysis relies on a correct specification of the within-family variance-covariance matrix of the phenotype, which is implied by the correlation structure of shared random effects and individual or residual error terms that appear in the linear predictor. It is not always obvious how to generate a series of individual-specific regression equations to achieve this aim, especially when families are of varying sizes and compositions.
Current approaches to specifying GLMMs for family data are tied to particular family compositions and/or phenotypic outcomes. Burton et al.  and Scurrah et al.  proposed methods for binary phenotypes and censored survival data (respectively), but their specifications were derived in an ad-hoc fashion that was dependent on the family structures in the data. Rabe-Hesketh et al.  proposed some model specifications to suit continuously-valued and categorical phenotypes, however they require specific family compositions (e.g. monozygotic (MZ) and dizygotic (DZ) twins grouped together, nuclear families with no MZ twins, etc.). Lange et al. developed FISHER [4,5], which inputs the within-family variance-covariance matrix (calculated from a pedigree file indicating relationships between individuals within families) directly into a multivariate normal likelihood function, thus only continuous phenotypes can be analyzed. Atkinson and Therneau  developed an R package to analyze family data that makes use of the generalized Cholesky decompostion of the matrix representing the relatedness between individuals within a family, but again only for continuously-valued phenotypes. SOLAR  uses a liability threshold model to analyze family data with discrete phenotypes, but these discrete phenotypes can only consist of two categories (binary phenotypes only).
In this paper we propose a general specification for GLMMs to analyze family data where families are of varying sizes and compositions. The specification utilises a decomposition of the withinfamily variance-covariance matrices as the basis for generating a system of regression equations that imply the desired correlation structure between phenotypes. It can be easily implemented in standard statistical packages that support mixed effects models, such as Stata , but is particularly suited to scenarios where the model is expressed in a “mechanistic” way via an explicit statement of a regression equation with fixed and random effects for each datapoint, such as those encapsulated in the WinBUGS software and programming language .
Our models for data have as their basis Fisher’s polygenic model , where the genetic contribution to the phenotype is the combined effect of possibly a large number of separate locations or loci on the human genome. At each of these loci an individual inherits one allele (of possibly many) from each parent. The effect of these alleles is assumed to be additive - the presence of each additional allele of the same type changes the phenotype by the same amount (so the effect of two identical alleles is twice the effect of one allele acting alone), and the effect of two different alleles is the sum of the effect of each allele separately. The result of this assumption is a phenotypic model where covariation between individuals depends only on their level of relatedness. If the underlying additive model holds at a sufficient proportion of the active loci, then the polygenic model will be a reasonable approximation to the true but unknown effect of the genetic factors on the phenotype. Common environmental factors might also contribute to similarity of outcomes within families, with the additional assumption that the correlation between phenotypes does not depend of the degree of relatedness of the individuals.
We can formalise the above by specifying a model for individual participant data:
Yij= μij + aij+ cij + εij, (1)
where Yij is the observed phenotype (continuously-valued) for the jth member of the ith family, μij is the fixed-effect mean that can be expressed in a general linear predictor to capture the effect of measured environmental and/or genetic factors, including possible interactions between the two, is the additive genetic random effect, is a common environment random effect (typically we take cij = ci for all i) and is an individual-specific residual term. We assume that these three sources of variation or random effects are independent of each other, so that the total variation is the sum of the variance components, i.e. . This preliminary model specifies the marginal, univariate distribution of the random effects a,c and ε, and hence the distribution of the phenotype Y, but we need further structure to express the assumed correlation between phenotypes within a family and thus specify the joint distribution of phenotypes.This can be achieved through within-family sharing of the random effects aij and cij. For a family i with j =1,2,…,ni individuals, we re-write equation 1 as a multivariate model:
where Yi is an ni × 1 vector of observed phenotypes, μi is an ni × 1 vector of means, ai, ci, and εi are ni × 1 vectors of additive genetic, common environment and individual-specific random effects with ni × ni design matrices , and respectively.
Setting aside the specification of for the moment, let , the ni × 1 vector of all ones and cij = ci for all i so that individuals within families share a single, common random effect representing the shared family environment. More generally,will be a matrix indicating which related individuals share a common environment. Also, let be the ni × ni identity matrix, . Our revised specification with these assumptions is therefore
where is the ni × ni matrix of all ones. The entries kjj' , of the matrix correspond to the kinship coefficients  that represent the relatedness of individuals j and j’ within the same family. We can then re-state the problem of model specification as the requirement that the additive genetic design matrix satisfies the condition for known Ki.
Since Ki can be interpreted as a correlation matrix, it is symmetric positive-definite and can therefore be decomposed uniquely into the product of a lower triangular matrix (the Cholesky triangle) and its transpose. The product is then the Cholesky decomposition of Ki. We show in the next section that estimates of the linear predictor are invariant to the choice of matrix decomposition.
To illustrate the Cholesky decomposition of the kinship matrix, we now consider two example models, each consisting of a single nuclear family. First, let Yi = (Yi1,Yi2) be a vector of continuously-valued phenotype data from a pair of full siblings. From equation 3 we have
so that the phenotypic correlation between full siblings due to shared genetic factors is 1/2 since they will, on average, share half of their genetic material. The Cholesky triangle of Ksib is then
so that the system of equations in 2 reduces to
Yi1 = μi1 + ai1 + ci + εi1
For a nuclear family of biologically unrelated parents and two children (arbitrarily ordered as father, mother, child 1 and child 2), the kinship matrix Knuc for the within-family correlation structure implied by shared genetic factors is
which implies that the corresponding random effects design matrix Znuc is
so that the system of equations in 2 becomes
Yi1 = μi1 + ai1 + ci + εi1
Yi2 = μi2 + ai2 + ci + εi2
which produces the correct within-family correlation structure.
For a non-continuously-valued phenotype, we specify a generalized linear model for Y using a link function g () to relate the expected value of Y to the linear predictor of fixed and random effects:
McCulloch and Searle  show that a Taylor expansion of the link function around E (Yi) approximately follows a linear mixed model.
Choice of matrix decomposition
Before illustrating the use of the matrix specification above to analyze data from multiple families, we show that predictions from the fitted model are invariant to the choice of the matrix decomposition. Consider two alternative specifications of the model in equation 2
Note that a1i and a2i are alternative specifications of the random effect vector ai. The equations above imply
where for k = 1,2, which implies that if . Consider now the fitted value at the family level taking expectations at the individual level, namely
and using results stated in  we have
which implies that
Thus does not depend on the choice of decomposition. Again using results stated in  we also have
Therefore does not depend on the choice of matrix decomposition for the design matrix of the genetic random effects contribution to the linear predictor.
The model specification described above can be implemented using, for example, R and WinBUGS by the following procedure:
1. Create the kinship matrix for all families, which results in a N × N block diagonal sparse matrix, with N being the total number of individuals in the data. Here we assume that the data can be partitioned into families (possibly extended families or “pedigrees”) and that members in the same family appear as consecutive records in the dataset. The diagonal blocks are the within-family kinship matrices and the off-diagonal blocks are zero (implying outcomes for individuals in different families are uncorrelated). This computation can be done in R using the makekinship command from the kinship package . Note that makekinship does not distinguish between monozygous (MZ) and dizygous (DZ) twins, so the resulting kinship matrix needs to be amended to reflect this distinction.
2. Obtain Z by computing the transpose of the Cholesky decomposition of the kinship matrix, which can also be done in R. This results in a “triangular diagonal” matrix, that is, a block diagonal sparse matrix where the blocks are lower triangular matrices.
3. Set up a three level hierarchical model in WinBUGS. For a continuously-valued phenotype, level one (the individual level, i.e. the jth member of family i ) is specified as
μij = xijβ + zaija + cij (5)
Thus the mean μ for the jth person of family i depends on (i) a vector of fixed-effects regression coefficients β and the corresponding fixed- effects design matrix X (has columns xij); (ii) the N × 1 vector of genetic random effects a (with ) and the corresponding N × N block diagonal sparse matrix of family decompositions Za (which has columns zaij) and (iii) the common environment random effect cij, where cij = ci for all i and . For a general categorical phenotype, the model is specified as the linear model in equation 5 for g(E(Yij)), where g is the link function. In a full probability (i.e. Bayesian) framework, prior distributions would be specified for β, , and .
It is straightforward to implement this process in standard statistical software, for example, by creating an R source file that prepares the data (steps 1 & 2), performs the analysis in WinBUGS via R2WinBUGS  (step 3) and outputs the results as dataframes and tables.
Analysis of simulated data
To assess empirically the validity of the proposed model specification, we conducted a small simulation study. Data were simulated (in R) for 790 independent nuclear families (two parents and one or more children, possibly including MZ or DZ twins), where the family sizes and compositions were chosen to mimic those from the Victorian Family Heart Study (VFHS) .
Three phenotypes were simulated: (i) a continuously-valued phenotype from the linear mixed model in equation 2 with μi = 0, σa = 4, σc = 3 and σε = 2; (ii) a binary phenotype from the generalized linear mixed model specified in equation 4 with g(E(Yi)) = log(E(Yi)/(1- E(Yi))) (i.e the logit transformation of a proportion) and , corresponding approximately to an overall proportion of E(Yi) = 0.25; and (iii) a binary phenotype generated in the same way as (ii) above but with g(E(Yi)) = Φ-1(E(Yi)) where Φ-1 is the inverse cumulative distribution function of the standard normal distribution (i.e. the probit transformation of a proportion).
The simulated data were analyzed by fitting the models from which they were generated, producing three analyses per simulated dataset. The proposed model specification was implemented in WinBUGS (via R2WinBUGS, ) in R. For all models in WinBUGS, a single chain was specified and uniform prior distributions (U(0,30)) were specified for , and . The linear model was run for 5,000 iterations (2,500 burn-in), the logistic model was run for 10,000 iterations (5,000 burn- in) and the probit model was run for 20,000 iterations (10,000 burn-in). For the binary phenotypes the cut function was used to prevent any “orphan” random effects (those that appear in the linear predictor of a single individual and are not shared by any other family members) from contributing to the posterior distribution of the corresponding variance component. This was done since these random effects mimic those associated with a residual error term (which no longer exists in a GLMM specification) and can lead to upwardly biased estimates . For all analyses, the posterior medians and 95% posterior intervals for the variance components were computed. This simulation-estimation process was repeated 10 times for all phenotypes.
The results from the simulation-estimation procedure are displayed in Figure 1. For the linear model all displayed posterior summaries were close to the target values (the median of the posterior medians for σa, σcand σε, were 4.14 (target value 4), 3.04 (3) and 1.90 (2), respectively). The estimates from the logistic model were consistent with the nominal values, with only one posterior interval for σaand one posterior interval for σc excluding the target values (the median of the posterior medians for σaand σcwere 4.05 (4) and 2.82 (3), respectively). The estimates from the probit model were somewhat less consistent with the nominal values, with four posterior intervals for σaand three posterior intervals for σc excluding the target values (although the median of the posterior medians for σaand σcwere 3.79 and 2.90, respectively, which were close to the respective target values of 4 and 3). On average it took 58 seconds per analysis for the linear models, 5.15 minutes for the logistic models and 6.36 minutes for the probit models.
Analysis of data from the victorian family heart study
To illustrate the proposed model specification in an application, data from the Victorian Family Heart Study (VFHS) were analyzed in WinBUGS. The Victorian Family Heart Study was established to investigate the causes of familial patterns in cardiovascular risk factors. The study consisted of adult families recruited in Melbourne, Australia, where each family consisted of both parents and at least one natural adult offspring. For full details of the study see .
Two phenotypes were analyzed: systolic blood pressure (SBP) taken lying down (continuously-valued) and high blood pressure (HBP), defined as a systolic blood pressure reading of >140 mm Hg or a diastolic blood pressure reading of >90 mm Hg. SBP was analyzed using a linear model and HBP was analyzed using a logistic model. In both models, age (dichotomised as <35 years or ≥ 35 years) and sex were included as fixed-effect covariates, and the additive genetic and common environment variance components were estimated (the linear model also included the residual component of variance). Descriptive statistics of the phenotypes and covariates included in the analyses are given in Table 1. In addition, three chains were specified for each model, and convergence was declared when the Gelman-Rubin R-hat statistic was ≤ 1.1. The analyses were also run in Stata to provide a comparison with a maximum likelihood method. Stata was chosen for the maximum likelihood analysis since it comes with pre-packaged routines allowing easy implementation of the proposed model specification for both continuously-valued and categorical phenotypes (xtmixed was used for SBP and xtmelogit was used for HBP, where 6 integration points were specified for xtmelogit). The results from these analyses are displayed in Table 2.
|SBP* lying down (mm Hg)||HBP|
*SBP = systolic blood pressure
HBP = high blood pressure
SD = standard deviation
Table 1: Descriptive statistics of phenotypes by generation and sex in the Victorian Family Heart Study.
|Model||Package||(95% interval)*||(95% interval)*||(95% interval)*||Run time|
|(4.3, 7.6)||(3.6, 5.7)||(11.2, 12.2)|
|(4.1, 8.0)||(3.8, 6.0)||(11.2, 12.6)|
|(1.2, 3.0)||(0.7, 1.7)||n/a|
|(0.9, 3.2)||(0.8, 1.6)||n/a|
*95% posterior intervals for WinBUGS; 95% confidence intervals for Stata
Results based on 3,000 iterations (1,500 burn-in)
Results based on 10,000 iterations (5,000 burn-in)
Table 2: Results from analyzing continuously-valued and binary phenotypes from the Victorian Family Heart Study.
The results were similar between WinBUGS and Stata, however Stata took substantially longer to achieve estimates for the logistic model.
In this work we developed a general method for specifying GLMMs for data from families of varying size and composition. We evaluated the method by analyzing simulated and real data and found that it performed well for parameter estimation and run times for both continuously-valued and binary phenotypes. The model specification is particularly suited to MCMC methods that require a “mechanistic” description of a linear predictor containing both fixed and random effects.
For the simulated continuously-valued phenotypes the proposed model specification delivered unbiased parameter estimates. Using the specification in WinBUGS for the continuously-valued phenotype from the Victorian Family Heart Study (systolic blood pressure) gave similar results to xtmixed in Stata. The analysis did, however, take much longer in WinBUGS - iterative estimation routines for linear mixed models typically converge to the maximum likelihood estimates quickly, so a simulation-based approach to parameter estimation will under-perform in comparison. The MCMC framework does, however, allow for flexibility in model specification, such as the opportunity for the shared environment component of variance to be dependent on the value of other covariates, something that is difficult to achieve in linear modelling routines such as xtmixed in Stata.
For the simulated binary phenotypes the proposed model specification also produced parameter estimates close to the target values, but the coverage of the target parameters by the nominal 95% posterior intervals was not as good as it was for the continuously-valued phenotype. The analysis of the binary phenotype from the Victorian Family Heart Study (high blood pressure) that employed our model specification using WinBUGS yielded similar estimates to xtmelogit in Stata, however the analysis took substantially less time in WinBUGS. The xtmelogit routine uses adaptive quadrature, which can be slow, especially when there are many random effects, a large number of observations and several integration points . Even with the use of multiple processors, it is unlikely that the processing time in Stata could be reduced from days to hours.
It is straightforward to extend the proposed model specification to accommodate multi-category phenotypes. Details on the specification of logistic regression models that incorporate random effects that can reproduce within-family correlation structures for ordinal outcomes due to shared environmental and genetic risk factors are described in Zaloumis . In addition, Ellis et al.  analyzed data from the VFHS to investigate risk factors for male pattern baldness (a four-category ordinal outcome variable) using generalized linear mixed models similar to those proposed here.
We successfully reduced the processing time in WinBUGS by stating explicitly the multiplication of the columns of the design matrix Za by the aij’s. This contrasts with what appears to be a more convenient coding strategy relying on the inprod (inner product) command. Results from a small subset of simulated data (continuously-valued phenotype, 300 families) showed this choice of coding took at least three times as long to achieve convergence of estimates. In addition, it is possible to further reduce the processing time in WinBUGS (and Stata) by “stacking” the diagonal blocks of Za on top of one another, so that the number of columns of this “stacked” matrix is equal to the number of people in the largest observed family. For families that don’t have as many members as the largest family, the remaining columns can be filled in with zeros. This “stacking” approach is appropriate if families are assumed to be unrelated (as they were in the simulation-estimation procedure and the analysis of the VFHS data) and the software allows one to specify a hierarchical model on families where the ai random effects vectors are independent among families.
Although we have shown in this paper that our approach to specifying GLMMs for family data is analytically sound, it does not overcome the computational difficulties of estimating variance components with categorical phenotypes . It is possible to circumvent these problems with careful specification of the model, but a large number of iterations of the Gibbs sampler will be required (i.e. tens if not hundreds of thousands) and accurate estimation will require a substantial number of families (i.e. hundreds if not thousands). Therefore when using our proposed approach to specifying mixed models for family data with categorical phenotypes, a large number of families will be needed for adequate estimation of the variance components, particularly , and the Gibbs sampler will have to be run for several thousands of iterations.
Our proposed general approach to specifying GLMMs for family data avoids the nuisance of having to specify a model that is dataset-specific, which can be tedious and time-consuming. This is an improvement on current approaches, which are tied to particular phenotypes and/or family compositions. The method can be easily implemented in freely available software and was particularly suited to the MCMC framework. The method of specification proposed in this paper should be considered when analyzing family data, particularly when outcomes are categorical, prior information on parameters needs to be incorporated and/or non-standard specifications of the genetic model are of interest.
The authors thank Prof Murray Aitkin for his insightful suggestions regarding the implementation of MCMC when fitting GLMMs.