Sample Size Calculation for the Modified Likelihood Ratio Test in Genetic Linkage Analysis

Mixture models provide flexible means of handling heterogeneity in data. The possible latent structure suggested by mixture model analysis should be carefully examined using designed experiments. Sample size determination is an important and difficult step in design of experiments. We investigate the sample size calculation for the modified likelihood ratio test for binomial mixture models arising in genetic linkage analysis. We obtain limiting distributions for the modified likelihood ratio test under two sets of commonly used local alternatives. A simple sample-size formula is obtained and illustrated using both simulations and a genetic linkage study for schizophrenia.
Journal of Biometrics & Biostatistics


Introduction
Mixture models provide flexible means of handling observed or unobserved heterogeneity in data. Data analysis using mixture models can unveil a possible underlying or latent structure. Well-designed clinical trials and scientific experiments are usually needed to examine the validity of the suggested latent structure. Sample size determination is a major issue in those studies; see Chow et al. [1] and references therein. There is a vast literature covering sample size calculation for comparative research studies, especially in medical contexts, for example, hypothesis testing for proportions in two groups.
Instead of considering simple designs such as a two-sample test, we consider calculating the sample size for hypothesis tests in the mixture model framework. More specifically, we propose a formula for determining the sample size required to perform a test of homogeneity. A test of homogeneity, which tests the null hypothesis of a one-component parametric model against the alternative of a two-component mixture, is one of the most difficult and important problems in finite mixture models. There is some literature on power and sample size calculations for tests of homogeneity in finite mixture models. Hall and Stewart [2] provided a theoretical analysis of power in a two-component normal mixture model and addressed the irregular features of the problem. Recently, Chen et al. [3] addressed the issue of sample size calculation for tests of homogeneity using the EM-test and the C(α) test. Instead of a general homogeneity test, we consider a special binomial mixture model arising in genetic linkage analysis. This particular binomial mixture model in pedigree studies has been studied in Lemdani and Pons [4], Liang and Rathouz [5], and Fu et al. [6]. Fu et al. [6] showed that the modified likelihood ratio test (MLRT), which was proposed by Chen [7] and Chen et al. [8], has better power for detecting the aforementioned binomial mixture alternative than the other methods discussed in their paper. Since sample size calculation is test-specific, we choose the MLRT as the basis for sample size determination in the homogeneity test of this special binomial mixture. Following Chen et al. [3], we investigate the power properties of the MLRT under two sets of commonly used local alternatives. A simple sample size formula is obtained and illustrated by both simulations and a genetic linkage study for schizophrenia.
The rest of the article is organized as follows. Section 2 presents the problem setup and gives the asymptotic distribution of the MLRT and the sample-size formula for two local alternatives. Section 3 presents a real data example from a genetic linkage study, and simulation results are given in Section 4. The proof of the theorem is given in the Appendix.

Main Results
The particular binomial mixture model in pedigree studies that we consider is a two-component binomial mixture with one component distribution completely known. This model is commonly used to model recombinant data in pedigree studies and is known as the phase-known model. See Liang and Rathouz [5] and Fu et al. [6] for more details. Suppose we have a random sample X_1, …, X_n drawn from the following binomial mixture model

where γ is the mixing proportion and θ ∈ [0, 0.5] is the component parameter with a specified range. Our interest is to test homogeneity with the null hypothesis specified as

Note that there are two unusual features of the homogeneity test: (1) the null hypothesis lies on the boundary of the parameter space, and (2) the parameters γ and θ are not identifiable under the null model. The log-likelihood function of (γ, θ) is

The modified log-likelihood function is defined as

with C > 0. In this paper, we choose C = 1 as suggested in Fu et al. [6]. The MLRT statistic is defined as

where γ_0 and θ_0 are constants within the parameter space. For testing homogeneity in finite mixture models, we usually encounter two types of loss of identifiability, which lead to the two specified local alternatives.
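To make the quantities in this section concrete, the following Python sketch evaluates a modified log-likelihood and an MLRT statistic by grid search. The displayed formulas are not reproduced in the text here, so everything in the sketch is an assumption: the mixture is taken to be (1 − γ) Bin(m, 0.5) + γ Bin(m, θ), the penalty to be C log γ, and the statistic to be twice the gap between the maximized modified log-likelihood and its value at γ = 1, θ = 0.5. The function names and the grid search are illustrative choices, not the authors' implementation.

```python
import math


def log_lik(xs, m, gamma, theta):
    """Log-likelihood under the ASSUMED model (1-gamma)*Bin(m,0.5) + gamma*Bin(m,theta).

    Binomial coefficients are omitted: they are common to both components
    and cancel in any likelihood ratio.
    """
    half = 0.5 ** m
    return sum(
        math.log((1.0 - gamma) * half + gamma * theta ** x * (1.0 - theta) ** (m - x))
        for x in xs
    )


def modified_log_lik(xs, m, gamma, theta, C=1.0):
    """ASSUMED penalized form: log-likelihood plus C*log(gamma), with C > 0."""
    return log_lik(xs, m, gamma, theta) + C * math.log(gamma)


def mlrt(xs, m, C=1.0, grid=50):
    """ASSUMED MLRT statistic, maximized over a grid with gamma in (0,1], theta in (0,0.5]."""
    null_value = modified_log_lik(xs, m, 1.0, 0.5, C)  # homogeneous fit, zero penalty
    best = null_value
    for i in range(1, grid + 1):
        gamma = i / grid
        for j in range(1, grid + 1):
            theta = 0.5 * j / grid
            best = max(best, modified_log_lik(xs, m, gamma, theta, C))
    return 2.0 * (best - null_value)


# Hypothetical recombinant counts out of m = 9 meioses per family.
homogeneous = [4, 5, 4, 5, 5, 4]
heterogeneous = [0, 0, 1, 0, 4, 5, 4, 5, 0, 1]
print(mlrt(homogeneous, 9), mlrt(heterogeneous, 9))
```

A grid search is used only because it is transparent and needs no external libraries; an EM algorithm or a numerical optimizer would be the more usual choice for larger problems.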
Note that the three basic components of sample size calculation are the significance level α, the target power 1 − β, and a potential alternative model. For the two sequences of local alternative models

The validity of the sample size formula is examined using a real data example and computer simulations, which are given in the next two sections.

Application
We applied the developed theory to a genetic linkage study for schizophrenia conducted at the Johns Hopkins School of Medicine. The details of the study design and data collection can be found in Pulver et al. [10] and Liang and Rathouz [5]. This study included 486 individuals from 54 families with at least two affected relatives. Here "affected" refers to someone who was diagnosed with either schizophrenia or schizoaffective disorder based on the DSM-III-R criteria. Based on previous studies, one is particularly interested in Marker D22S941 on chromosome 22. However, it is well known that schizophrenia is prone to heterogeneity. Research showed that the following two-component binomial mixture, 0.6 Bin(9, 0.5) + 0.4 Bin(9, 0.06), may fit the data well. Suppose our interest is to validate the above mixture structure at the 0.5% level, which is typical in linkage studies, with at least 80% power. The approximate sample size is n_{0.005,0.2} ≈ 10.
We also used computer simulations to check (1) whether the limiting distribution provides a reasonable approximation to the finite-sample distribution under the calculated sample size, and (2) whether the MLRT statistic has the desired power to detect the heterogeneity. In the simulations, we set C = 1 as recommended by Fu et al. [6]. The simulated type I error is 0.4%, and the power of M_n is 87% based on 50,000 repetitions.
Similarly, we consider the situation where the significance level is 1% and the target power is 80%. The approximate sample size is n_{0.01,0.2} ≈ 9. The simulated type I error and power of M_n are around 1.4% and 86%, respectively.
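The kind of simulation check described in this section can be sketched in a few lines of Python. This is a small-scale illustration under stated assumptions, not a reproduction of the reported figures: it assumes the model (1 − γ) Bin(m, 0.5) + γ Bin(m, θ) with penalty C log γ, and it assumes that the null limiting distribution of the statistic is an equal mixture of χ²_0 and χ²_1, which gives the critical value z²_{1−α}. Neither form is stated explicitly in the text here. The function names, the coarse grid, the seeds, and the 200 repetitions (rather than 50,000) are hypothetical choices made to keep the run short.

```python
import math
import random
from statistics import NormalDist


def mlrt(xs, m, C=1.0, grid=25):
    """Grid-search MLRT under the ASSUMED model and penalty C*log(gamma)."""
    counts = [0] * (m + 1)
    for x in xs:
        counts[x] += 1
    half = 0.5 ** m
    null_value = len(xs) * math.log(half)  # gamma = 1, theta = 0.5, zero penalty
    best = null_value
    for j in range(1, grid + 1):
        theta = 0.5 * j / grid
        comp = [theta ** x * (1.0 - theta) ** (m - x) for x in range(m + 1)]
        for i in range(1, grid + 1):
            gamma = i / grid
            value = C * math.log(gamma)
            for x, c in enumerate(counts):
                if c:
                    value += c * math.log((1.0 - gamma) * half + gamma * comp[x])
            best = max(best, value)
    return 2.0 * (best - null_value)


def draw(n, m, gamma, theta, rng):
    """Draw n counts; each comes from Bin(m, theta) with probability gamma, else Bin(m, 0.5)."""
    sample = []
    for _ in range(n):
        p = theta if rng.random() < gamma else 0.5
        sample.append(sum(rng.random() < p for _ in range(m)))
    return sample


def rejection_rate(n, m, gamma, theta, alpha, reps, seed):
    """Share of simulated samples whose statistic exceeds the ASSUMED critical value."""
    rng = random.Random(seed)
    critical = NormalDist().inv_cdf(1.0 - alpha) ** 2  # from 0.5*chi2_0 + 0.5*chi2_1
    hits = sum(mlrt(draw(n, m, gamma, theta, rng), m) > critical for _ in range(reps))
    return hits / reps


# gamma = 0 reproduces the homogeneous null; the alternative is 0.6 Bin(9,0.5) + 0.4 Bin(9,0.06).
type1 = rejection_rate(n=10, m=9, gamma=0.0, theta=0.5, alpha=0.005, reps=200, seed=1)
power = rejection_rate(n=10, m=9, gamma=0.4, theta=0.06, alpha=0.005, reps=200, seed=2)
print(type1, power)
```

With so few repetitions the estimated type I error is very noisy; a serious check would raise `reps` and refine the grid, at the cost of run time in pure Python.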

Simulation Study
We further examined the performance of the sample size calculation formula under other settings. We considered eight alternative models, determined by the combinations of γ = (0.05, 0.3), θ = (0.05, 0.3), and m = (4, 8). We considered two significance levels, 0.5% and 1%, with the same desired power of 80%. For each alternative model, we calculated the required sample size, the simulated type I error rate, and the power of M_n with C = 1 based on 50,000 repetitions. The results are summarized in Tables 1 and 2. From the tables we can see that the proposed sample size formula reliably achieves the desired power under different alternative models.
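The simulation design above is a full factorial grid, which can be enumerated as follows. This is only a sketch of the bookkeeping: the variable names are hypothetical, the second value of m is read as 8, and no sample sizes or results are computed or implied.

```python
from itertools import product

gammas = (0.05, 0.3)   # mixing proportions
thetas = (0.05, 0.3)   # component parameters
ms = (4, 8)            # binomial sizes (second value read as 8)
alphas = (0.005, 0.01)
target_power = 0.80

# Eight alternative models: every combination of gamma, theta, and m.
alternatives = [
    {"gamma": g, "theta": t, "m": m} for g, t, m in product(gammas, thetas, ms)
]

# Each alternative is studied at both significance levels with the same target power.
scenarios = [
    dict(alt, alpha=a, power=target_power) for a, alt in product(alphas, alternatives)
]

for scenario in scenarios:
    print(scenario)
```

Each of the resulting scenarios corresponds to one combination of significance level and alternative model in the design described above.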