A Family-based Association Method for Pedigree Including Half-Sib Data

Family datasets could provide good resources for association studies as an initial investigation or a replication study. The current family-based association tests analyze data only from full sibships of a nuclear family or extended pedigrees of related nuclear families. In order to fully exert all possible information in the family data, we propose a “Pedigrees with Half-sibs Association Test” (PHAST) to accommodate half-siblings if they are available. PHAST adopts the idea of transmission score from the Pedigree Disequilibrium Test (PDT) to construct the test statistic. The difference is that it utilizes the identity-by-descent (IBD) information of the marker between sibling pairs (full or half sibs) to construct an allelic transmission statistic. If parental genotypes are missing, EM algorithm is used to infer parental genotypes and compute transmission scores for all possible scenarios. The computer simulation results suggested that our new method has valid type I error rates under varied family structures. Our method could have more power than PDT and FBAT when the sample size of half-sibs increases, especially the families without parental genotypes. In conclusion, our method can serve as an alternative method of the existing family-based association tests. Furthermore, it can relax the ascertainment criteria for studying late onset diseases since limited siblings are available.


Introduction
Association studies have been a predominant approach for analyzing densely spaced genetic markers (e.g. single nucleotide polymorphisms (SNPs)) for detecting genetic variants that may lead to susceptibility genes or genetic modifiers for the traits of interest (disease or quantitative traits). Since the development of transmission/ disequilibrium test (TDT) [1], family-based study design was viewed as a robust approach because of less chances of encountering false positive association results due to population stratification existed in the population samples. However, the development of several prominent association methods such as GENOMIC CONTROL [2,3], STRUCUTRE [4], EIGENSTRAT [5] methods has lessened the concern and changed our view of the case-control study design significantly. As the results, many genome wide association studies (GWAS) utilized easier obtained population-based samples rather than family-based samples.
While case-control design seems to dominate the current practice of the association study, particularly GWAS, familial samples still remain some advantages. First, many familial samples were collected during the era of whole genome linkage scans. These family datasets provide good resources for association studies as either an initial investigation or a replication dataset for other association findings. Second, family data are more robust for detecting genotyping errors (e.g. Mendelian errors) than population-based samples. Third, family data have higher accuracy in inferring haplotypes and confirming the role of rare variants to disease susceptibility. Finally, as the nextgeneration sequencing is becoming an important approach to pinpoint the causal variant, familial samples have the advantage for the first pass of sequencing effort to discover both common and rare variants. Overall, familial samples will still remain important in genetic studies of human complex diseases.
Many family-based association methods have been developed in the past decade. The TDT was the pioneer of all methods, which tests for association in the presence of linkage using parent-offspring triads. Many extensions of the TDT method were developed for various pedigree structures afterward. The sibling TDT (S-TDT) [6] and several other tests [7,8,9] were designed to accommodate discordant sibpairs without parental genotype data. More general methods such as the pedigree disequilibrium test (PDT) [10], and the family-based association test (FBAT) [11] can accommodate multiple affected offspring.
The current family-based association tests share a characteristic, which is to analyze data only from full sibships of a nuclear family, such as parent-offspring triads, parents and multiple affected full siblings, discordant full sibpairs of large size, or extended pedigrees of related nuclear families. Therefore, full siblings or parents have been the target of ascertainment in genetic research. In this study, we explore a new method that can relax the recruitment criteria such as accommodating half-siblings in addition to full siblings if they are available. We also allow our method to handle multiple affected siblings that may exist from pedigrees recruited for linkage studies.
To accommodate these different family structures, we developed a Pedigrees with Half-sibs Association Test (PHAST) to fully exert all possible information in the family data. The development of PHAST was based on the framework of the pedigree disequilibrium test (PDT) but with several new features: (1) applicable to both full siblings or combinations of full and half siblings family data; (2) inferring parental genotypes when they are missing rather than using siblings information to compute test statistics; and (3) accounting for linkage when parental genotypes are missing. Although this new method may have similar properties as the existing methods, it will largely benefit ascertainment process, particularly for aging related diseases when limited siblings are available to ascertain.

Methods
For simplicity, we start with the pedigree structure of affected sibpair (ASP) with parents and concordant half sibpair (HSP) with parents ( Figure 1A). Families with more than two affected siblings could be addressed by following the same procedures. We assume that the marker tested is a biallelic marker with a risk allele 1. Our method adopts the concept of allelic transmission presented in PDT but utilizes the identity-by-descent (IBD) information of the marker between a pair of individuals to construct test statistic. For any ASP family, if the two siblings share 2 or 1 IBD at the target marker, we treat the ASP family as a whole unit to compute the allelic transmission score. If the two siblings share 0 IBD at the target marker, we split the ASP family to two independent case-parent trios. The same rules are applied in HSP family except that no 2 IBD sharing between half sibpairs will occur. The mathematical forms of this strategy are described below for different family structures.

Concordant sibpairs with parents
Define a random variable X for allelic transmission score, which is computed as the number of different copies allele 1 transmitted to the two affected sibs minus the number of allele 1 non-transmitted.
X iT is the number of allele 1 transmitted minus the number of allele 1 non-transmitted for a case-parent trio (Trio). G S = (G S1 , G S2 ) is the marker genotypes for sibling 1 and 2, and G p is the parental genotypes, that is G p = (G P1 , G P2 ) for ASP family and G p = (G P1 , G P2 , G P3 ) for HSP family. ( | , ) p s P IBD k G G = is the probability that the two affected siblings share k IBD, k = 0, 1, or 2 (APPENDIX I). Various computer programs can estimate IBD probability. We used MERLIN [12] for estimating IBD probabilities at a particular marker.
The value of X is observed by counting the number of copes of allele 1 in the affected siblings. Under the null hypothesis of no association between the marker and disease alleles, X should have an expected value of 0. To make computation easy for accounting both ASP and HSP units, we looked into the different status of IBD. Under 0 IBD, the two copies of markers of each sibling inherited from totally different source which indicates these two siblings are independent in allelic PDT case-parent trios as the total allelic transmission and sharing score for both ASP and HSP. Under 1 IBD, in order to assure the expectation of statistic X to be 0 under the null hypothesis, the number of "different" copies of alleles transmitted must be equal to the number of alleles non-transmitted. For 1 IBD sharing HSP, in genotypes (G S1 , G S2 ), there are three different copies of allele transmitted (S1 and S2 sharing one identical copy from the common parent P2) and three alleles remain non-transmitted in (G P1 , G P2 , G P3 ). However, this is not the case in ASP family. For 1 IBD sharing ASP, there are three different copies of allele transmitted but the number of allele nontransmitted is always one. In order to get correct transmission statistic, we transform the original two parents ASP into a pseudo HSP family ( Figure 1A) by repeating one parent P1 as P3. For instance, assume (G P1 , G P2 ) with genotypes (11,12) and (G S1 , G S2 ) with genotypes (11,11), if S1 and S2 share 1 IBD, the transformed pseudo HSP family is (G P1 , G P2 , G P1 ) with genotypes (11,12,11) and the corrected allelic transmission score will be X i1 = 3 -2 = 1 ( Table 1). Under 2 IBD, which will only occur in ASP unit, it will always be equal between the number of different copies of alleles transmitted and the number of alleles nontransmitted. Therefore, for the 2 IBD sharing ASP family, we can calculate transmission statistic directly without any transformation. In the previous example, if S1 and S2 share 2 IBD, G S1 and G S2 are with the same pair of alleles transmitted from the parents P1 and P2. The different copies allele 1 transmitted are 2, but not 4, and X i2 = 2-(# of allele 1 not transmitted in F i ) = 2 -1 = 1.

Discordant sibpairs with parents or parent-offspring triads
For full and half sibpairs with different disease status (discordance sibpairs, DSP) ( Figure 1B,1C) or simple parent-offspring triads, there is only one affected sibling in each family. The random variable X= X iT (the last part of equation (1)) and will follow the PDT framework.

More than two affected siblings
We also formulate PHAST method for taking into account multiple affected siblings (> two affecteds). The same inference principles described for concordance sibpairs are applied. In the example of nuclear family with three affected siblings ( Figure 1D), we identify the IBD status between each pair of siblings first. Split the nuclear family into two independent units if sharing 0 IBD, transform nuclear family transmission. Therefore, 0 i X is formulated to be the summation of two to pseudo HSP family if sharing 1 IBD, and then count the allelic transmission score number X. The possible X scores for nuclear family

Missing parents --inferring parental genotypes
When parental genotypes are missing, PDT method does not infer parental genotypes but utilizes information from discordant sibpairs. However, the allelic sharing scores may vary by the number of available sibling samples genotyped. In our proposed method, we took a step further to infer parental genotypes and compute allelic transmission scores for all possible parental genotypes. At the same time, in order to accommodate linkage between a maker and disease locus, the probability of affected siblings sharing k allele IBD, , need to be taken into account. We used the Expectation Maximization (EM) [13] algorithm to estimate the conditional probability ( | ) p s P G G and IBD parameters Z k . The probability of parental genotypes based on offspring genotypes is formulated as Then, the expected random variable in family i can be estimated by . For ASP and HSP without parental genotypes, the value of X j is calculated by equation (1) consistent with G sj and the jth set of possible parental genotypes G pj . For missing parental genotypes DSP cases, X j is the number of allele 1 transmitted minus the number of allele 1 non-transmitted for each inferred parental genotypes, as defined in PDT method under the scenario of known parental genotypes.

The PHAST test statistic
We define a general association statistic D i based on all ASP, HSP, and DSP subunits in pedigree i.
For the ith pedigree, where i = 1, …, N, a family-specific score 1 1 where F iA , F iH , and F iD are the number of ASP, HSP, and DSP families in pedigree i.
Under the null hypothesis, we can derive Therefore, the PHAST statistic can be written as

Simulation Studies
A series of computer simulations was implemented to study the validity of PHAST. For each simulation, 10,000 replicates were with three affected siblings are listed in Table 2.
Assume a bi-allelic disease locus A with alleles A 1 and A 2 (allele frequencies p 1 and p 2 ) and a single marker M with alleles M 1 and M 2 (allele frequencies q 1 and q 2 ). Linkage disequilibrium (LD) between the disease locus and the marker was set as The haplotypes for parental population were generated based on these frequencies. We assume random mating in the population and form two haplotypes for each offspring by randomly drawing one haplotype from each parent. Genetic markers were simulated under the assumption of complete linkage to the disease locus. Three genetic models (recessive, additive, and dominant) were considered through different disease penetrances. The model parameters are given in Table  3. Disease phenotypes were simulated based on disease locus genotypes and their corresponding penetrances.
The type I error was studied under the null hypothesis of no association between the disease and marker alleles (D = 0). We generated replicate samples N= 200 and N =500 of families with different disease models and five types of family structures, ASP, HSP, DSP, discordant half sibpairs (DHSP) with and without parental genotypes, and nuclear family with three affected siblings, in type I error simulations. To evaluate the power and examine the half sibpairs power contribution, three combinations of different types of family structures were used: (1) 200 families with different ratios of DSP to DHSP, (2) 200 families with different ratios of ASP to HSP, and (3) 200 nuclear families with three affected siblings.

Type I error rates
Tables 4 and 5 present the type I error rates for PHAST, PDT, and FBAT tests in 200 HSP, ASP, and DSP with and without parental genotypes for different simulated genetic models. In the cases that parental genotypes are known, Tables 4 and 5 show that type I error estimates for most tests are very close to the nominal significance level of 0.05. For the scenario of concordant sibpairs without parental genotypes (ASP and HSP), since PDT and FBAT cannot address this type of data, no estimates were obtained for both programs, and a nominal level of type I error estimates was obtained for PHAST. Table 6 shows type I error rates for data simulated from 200 HSP families under different ratios of 1 IBD to 0 IBD families. Both PHAST and PDT are very robust under different ratios of 1 IBD to 0 IBD families. However, FBAT tends to have inflated type I error when the ratio of 1 IBD cases are high and conservative type I error when the 1 IBD ratio is low.  Overall, we show that PHAST has correct type I error estimates under different scenario of family structure, and can handle missing parents for concordant sibpairs (full or half sibs). Families with parental genotypes generally show slightly lower type I error rates than those without parents cases. This was expected because the overall sample size is large when parental genotypes are available.

Power estimates
We also carried out simulations for all combinations of genetic models and different pedigree structures to evaluate the statistical power for PHAST method. Because similar power patterns were found in dominant, additive, and recessive models, we only present the results of additive model here.
Generally, power will increase when the degree of linkage disequilibrium (D) between marker and disease locus increases. For the family structure with parental genotypes cases, PHAST has similar statistical power with PDT and FBAT. For example, when D = 0.21 (maximum LD scenario for marker and disease allele frequencies at 0.3 by equation (2)) for 200 HSP with parents, the power of PHAST is 0.602 and powers of PDT and FBAT are 0.602 and 0.606. Although the statistical power under the maximum LD was not strong, this is mostly due to the small size of data and the lack of parental genotypes. Figure 2 shows the results of power comparison among PHAST, PDT, and FBAT under different ratios of DSP without parents to DHSP without parents. A set of 200 families of difference combinations of DSP and DHSP without parents were simulated for each replicate. It is noted that in this simulation study, PDT and FBAT do not use data from DHSP without parents. Therefore, only PHAST method can utilize the full dataset. The results in Figure 2 show that PHAST has increasing power as the proportion of DHSP without parents increasing while PDT and FBAT have decreased power.        HSP without parents are small due to the limited sibling genotypes available for inferring missing parental genotypes in HSP. The above simulation results for either type of family structure --siblings without parental genotype information, show that PHAST will not reach up to 80% power. However, it shows that PHAST can handle more family types and add to the statistical power. We mentioned earlier that we

Discussion
In this paper, we proposed the PHAST approach to test for association with the inclusion of half-siblings or more than two affected sibling data. Like other family-based association methods, such as PDT and FBAT, PHAST can also handle the conventional family data structures such as trios, ASP with parents, and DSP. The simulation results showed that PHAST method has correct type I error under varied family structures. We also studied the properties of the PHAST, PDT, and FBAT test statistics as HSP families contain different ratios of 1 IBD to 0 IBD families. It is important to point out that type I error rates in the FBAT test are inflated as the ratio of half sib-pair sharing 1 IBD increases. Therefore, the power of FBAT method could lead to misinterpretation when the data with half-siblings are used.
We compared our method with two alternative methods, PDT and FBAT. We found the PHAST and PDT version to be, in most cases, highly correlated. Thus in most cases, the PHAST will provide similar power to detect an association over current methods and will, for some specific genetic models, offer a gain in power. Simulations revealed that PHAST could have more power than PDT and FBAT when the sample size of half-siblings increases, especially the families without parental genotypes. Therefore, PHAST can be considered as a useful tool for studying late onset disease.
In summary, the PHAST method can serve as an alternative for the PDT method when either full-or half-siblings, or both, are available. Moreover, PHAST is potentially applicable to genetic studies with special features. For instance, for the aging related diseases, available samples to be recruited may be limited and parents are mostly not available. In this case, if the expansion to half-siblings is possible, PHAST will be able to handle this type of data. Thus, we hope that PHAST can be additional tool for researchers to facilitate future studies. Accordingly, this method should be able to expand the scope of the current family-based ascertainment strategy, and hopefully add insights into the genetic association study of human complex disease. expanded PHAST statistic to handle more than two affected siblings. Our simulation showed that with additional siblings, power to detect risk effects is improved. Figure 4 shows power among PHAST, PDT, and FBAT for 200 families with three affected siblings and known parental genotypes simulated under the additive model. When D = 0.21, the power of PHAST is 0.576, while the power of PDT and FBAT reach 0.735 and 0.641. However, the type I error of PHAST is closest to the significance level of 0.05 (PHAST: 0.049; PDT: 0.043; FBAT: 0.046). The difference in the analysis strategy between PHAST and the other two methods is that PHAST treated the whole family (two parents+3 affecteds) as a whole unit while PDT and FBAT take families with three affected siblings as three independent trios.