Modeling Host-Cancer Genetic Interactions with Multilocus Sequence Data

Cancer susceptibility may be controlled not only by host genes and mutated genes in cancer cells, but also by the epistatic interactions between genes from the host and cancer genomes. We derive a novel statistical model for cancer gene identification by integrating the gene mutation hypothesis of cancer formation into the mixturemodel framework. Within this framework, genetic interactions of DNA sequences (or haplotypes) between host and cancer genes responsible for cancer risk are defined in terms of quantitative genetic principle. Our model was founded on a commonly used genetic association design in which a random sample of patients is drawn from a natural human population. Each patient is typed for single nucleotide polymorphisms (SNPs) on normal and cancer cells and measured for cancer susceptibility. The model is formulated within the maximum likelihood context and implemented with the EM algorithm, allowing the estimation of both population and quantitative genetic parameters. The model provides a general procedure for testing the distribution of haplotypes constructed by SNPs from host and cancer genes and the linkage disequilibria of different orders among the SNPs. The model also formulates a series of testable hypotheses about the effects of host genes, cancer genes, and their interactions on cancer susceptibility. We carried out simulation studies to examine the statistical properties of the model. The implications of this model for cancer gene identification are discussed.


Introduction
Since the recognition of cancer as a genetic disease, a number of familial cancer genes with high-penetrance mutations, such as oncogenes and the tumor suppressors, have been chromosomally identified, isolated or cloned (Balman et al., 2003;Rand et al., 2008). However, growing evidence shows that most cancer is the result of an intricate interaction of lowpenetrance genetic variants with environmental exposures that humans experience (Brennan, 2002). These low-penetrance cancer genes, each usually with a minor effect and cooperating with others in a complicated web, are difficult to detect and, therefore, their contribution to the risk of cancer development remains unclear. There is a pressing demand on the development of powerful statistical models and computational algorithms for identifying and mapping specific DNA sequence variants that regulate cancer susceptibility.
Human cancer cells frequently possess large-scale chromosomal rearrangements due to chromosomal instability (CIN) (Stock and Bialy 2002; Thompson and Compton, 2008) or gene mutation ( tions of chromosomes gained or lost during cell division, resulting in an imbalance in the number of chromosomes per cell (aneuploidy) and an enhanced rate of loss of heterozygosity. Thus, the "aneuploidy hypothesis of cancer" (Stock and Bialy, 2002) proposes that the main differences between normal and abnormal (cancer) cells result from the number of genes rather than the types of genes differentially expressed, as opposed to the "gene-mutation hypothesis" (Jallepalli and Lengauer, 2001). In general, cancer incidence and development are not only affected by the host genes, but also by genes derived from the cancer cells themselves. Given strong mechanistic interactions between the host and cancer tissues (Araujo and McElwain, 2006), these two different systems of genes operate interactively or epistatically to alter the course of cancer progression. Thus, it can be well anticipated that any statistical model for gene detection that incorporates these two hypotheses are likely to make groundbreaking discoveries of cancer genes.
Genetic mapping has proven to be a powerful approach for detecting quantitative trait loci (QTLs) for complex traits. But a QTL may contain multiple genes that operate in a collective way. It is not possible to study the DNA structure, organization and function of a QTL detected from a mapping approach. A more accurate and useful approach for the characterization of genetic variants contributing to quantitative variation is to directly analyze DNA sequences, known as haplotypes, associated with a particular disease (Liu et al., 2004;Lin and Wu, 2006). If a string of DNA sequence is known to increase disease risk, this risk can be prevented by inhibiting the expression of this string using a specialized drug. The control of this disease can be made more efficient if all possible DNA sequences determining its variation are identified in the entire genome. The elucidation of the entire human genome has been accelerated by the haplotype map, or HapMap, constructed by SNPs (The International HapMap Consortium, 2003). More recently, the marvelous plans of sequencing the cancer genomes (Kaiser, 2005) will provide unprecedented fuel for studying the genetic architecture of cancer risk.
In this article, we will derive a statistical model for detecting the actions and interactions of haplotypes derived from the host and cancer genomes for cancer susceptibility. We will incorporate the "gene-mutation hypothesis" into the model. The "aneuploidy hypothesis of cancer" will be considered in a next paper. Through the release of software to the public, our statistical model will serve as a rou-tine means for the genetic diagnosis of cancer risk. Results from the model will provide scientific guidance for clinical doctors to design an optimal treatment scheme in terms of cancer genes and patient's genes.

Design Sampling Strategies
Suppose there is a natural human population at Hardy-Weinberg equilibrium (HWE) from which a random sample is drawn for cancer gene identification. In order to identify DNA sequences responsible for cancer susceptibility, we genotype SNPs from the entire host genome and also SNPs from the cancer genome for the same patient. We assume that the cancer genome is a diploid, whose difference from the normal genome is due to the "genemutation hypothesis". Recent molecular surveys suggest that the human genome contains many discrete haplotype blocks that are sites of closely located SNPs (Daly et al., 2001;Patil et al., 2001;Gabriel et al., 2002). Each block may have a minimal subset of SNPs, i.e., "tagging" SNPs, that can characterize the most common haplotypes. Our model will be based on tagging SNPs within each haplotype block. Although no detailed information about the structure of the cancer genome is available, we can assume that a particular set of SNPs may contribute to cancer formation at the haplotype level. The tenet of our epistatic model is that the effect of a given DNA sequence in the host genome on cancer is masked or enhanced by one or more sequences in the cancer genome.
Quantitative genetic model: Our model will be derived to characterize haplotypes that are responsible for complex traits because the association between haplotype diversity and phenotypic variation has been detected by several genetic studies (Judson, 2000;Bader, 2001;Rha et al., 2007). Among all possible haplotypes are there some particular haplotypes, called the risk haplotype ( ), that perform differently than the rest of the haplotypes, called the non-risk haplotype ( ). The combinations between the risk and non-risk haplotypes, , are called the composite diplotype (Liu et al., 2004;Wu and Lin, 2008).
Thus, by testing the differences in the genotypic value of a trait among composite diplotypes, we can estimate the genetic effects of haplotypes on the trait. It is also feasible to detect epistatic interactions between haplotypes from different genomes.
Let and denote the risk haplotypes and non-risk haplotypes for a series of SNPs genotyped from the host and cancer genomes, respectively. These two genomes form nine different composite diplotypes expressed as . We will use Mather and Jinks's (1982) formulation for genetic epistasis between different loci ( Table  2) to model the genetic effects of the composite diplotypes. The genotypic value ( ) of a joint composite diplotype from the two genomes can be decomposed into nine different components as follows: where stand for the composite diplotypes from the host and cancer genomes, respectively, is the overall mean; a H and a C are the additive effects of haplotypes from the host and cancer genome; d H and d C , the dominance effects of haplotypes; and i aa , i ad , i da , and i dd , the additive × additive, additive × dominance, dominance × additive, and dominance × dominance epistatic effects between the haplotypes from the two different genomes ( Table 2).    (12). This lets us describe the overall mean, additive, dominance, and four kinds of epistatic effects between the two genomes by Thus, by testing the significance of i aa , i ad , i da , and i dd , we can judge whether there is epistasis and how the epistasis affects a phenotypic trait.

Estimation Procedures
We will estimate two types of parameters for gene cancer identification. First is the population genetic parameters ( p ) that describe the distribution and diversity of haplotypes in the sampled population quantified by haplotype frequencies for multiple SNPs from the host and cancer genomes, allele frequencies at these SNPs and their linkage disequilibria. Second is the quantitative genetic parameters ( q ) that describe the genotypic values of composite diplotypes, specified by the action and interaction effects of haplotypes on cancer susceptibility, and residual variance. Given observed phenotypic ( ) and marker data from the host (H) and cancer (C) genomes, we construct a likelihood and factorize it to two components: where the first component is related to haplotype frequencies and the second component related to haplotype effects and variance. Thus, maximizing the likelihood (14) is equivalent to maximizing its two components separately.

Estimating Across-Genome Haplotype Frequencies:
Because the same genotype may be formed from multiple diplotypes, we need to incorporate the EM algorithm to estimate the unknown diplotype of a genotype, which is statistically viewed as a missing data problem. Excoffier and Slatkin, (1995) provided a simple approach for estimating haplotype frequencies, without specifying the configuration of haplotype. We provide a similar estimating algorithm for specifying the haplotype configuration. An observed zygotic genotype is generally expressed as the slashes are used to separate genotypes at different SNPs. Let (which sums to a total sample size of n subjects) be the observation of a typical joint-SNP genotype from the host and cancer genomes. Table 3 is an example of data structure for genotypic observations of two SNPs derived from the host genome and two SNPs from the cancer genome. The table also provides the expected frequencies of different genotypes in terms of haplotype frequencies.
Based on the information about observed data, it is not difficult to construct a multinomial likelihood, log L p |H,C), in which a mixture model is incorporated for those genotypes that are heterozygous at two more SNPs. By maximizing the observed data likelihood, the EM algorithm is derived. In the E step, we calculate the expected number of a particular across-genome haplotype within the mixture of diplotypes that form the same genotypes. For example, such an expected number is calculated for two SNPs from the host genome and two SNPs from the cancer genome using the following formulas: Ω Ω y log L(Ω p , Ω q |y, H, C) = log L(Ω p |H, C)  (15) where In the M step, we estimate a haplotype frequency with the expected number of haplotypes calculated above and the observations given in Table 3 by (16) Both the E and M steps are iterated between equations (15) and (16) until the estimates converge to stable values. The estimates at convergence are the maximum likelihood estimates (MLEs) of haplotype frequencies. The MLEs of allele frequencies at different SNPs and their linkage disequilibria of different orders can be solved from these estimated haplotype frequencies using a system of equations given in Table 1.
For a practical data set, risk haplotypes are unknown. A combinatory approach is used to detect an optimal combination of risk haplotypes derived from the host and cancer genomes. This can be chosen from all 16 possible combinations. The combination that gives the largest likelihood is considered as the best risk-haplotype combination. Under such an optimal combination, we estimate genotypic values of the composite diplotypes including  ,  ,  ,  , , , , , and . From the estimated .... values, we can then solve the additive, dominance, and epistatic effects between haplotypes from the two genomes using a system of equations (13).

Hypothesis Tests
It is imperative to know how different SNPs are associated within and between the host and cancer genomes and how haplotypes trigger cancer susceptibility singly or epistatically across different genomes. Two kinds of major hypotheses can be made to address these questions.
Linkage Disequilibrium Tests: The association between different SNPs within each genome and between two dif-

Computer Simulation
The statistical behavior of the model proposed for cancer gene identification is investigated through simulation studies. We simulate a HWE population of cancer individuals in which a set of SNPs from the host genome are associated with a different set of SNPs from the cancer genome. The haplotypes of these SNPs within and across the genomes trigger main and interaction effects on the susceptibility of a cancer. The allele frequencies of two of the SNPs from each genome and their linkage disequilibria of different orders are given in Table 4   For the assumed population in which multiple SNPs are typed, a quantitative trait that describes cancer susceptibility was simulated, following a normal distribution with means depending on composite diplotypes of SNPs and a residual variance. The genotypic values of composite diplotype are determined by assuming specific values for the additive, dominance and epistatic effects of haplotypes on the cancer trait. Composite diplotypes are formed by assuming two risk haplotypes for the SNPs, one from the host genome and the second from the cancer genome . The genotypic values of a total of 16 composite diplotypes, along with their probabilities calculated from haplotype frequencies, are used to compute the genetic variance. The residual variance is then determined by assuming different heritability levels (0.1 and 0.4).  Table 5 gives the log-likelihoods of population and quantitative genetic parameters by assuming different combinations of risk haplotypes from the host and cancer genomes for simulation data with sample size 200 and heritability 0.1. It can be seen that risk haplotype combination corresponds to the maximum likelihood among all possible combinations, suggesting that the model has correctly selected risk haplotypes. It appears that the power to correctly select the optimal combination of risk haplotypes is very high, even when a modest sample size and/or heritability is assumed (data not shown). The model provides rea-sonable estimates of quantitative genetic parameters ( Table  6). The estimation precision of parameters increases with sample size and heritability. For the additive genetic effects, a modest sample size (200) is quite enough even when the heritability is low (0.1). The good estimation of dominance genetic effects needs an intermediately large sample size (400 or more) for a small heritability. Epistasis, especially the dominance × dominance interaction, can be reasonably estimated when a large sample size (2000) is used. This is especially true for a small heritability.

Discussion
Despite painstaking cumulative efforts to fight against cancer by researchers worldwide in the past five decades, we have still not achieved substantial progress in diagnosis, prevention and treatment of this disease. The latest research, however, has found a possibility to treat, control and prevent cancer by using gene therapy. The successful use of this technique relies on our profound understanding of the genetic architecture of cancer susceptibility and progression. When tremendous progress in genotyping and sequencing the human genome and cancer genome has taken place, we are now in a great position to study the genetic control mechanism of cancer. In this article, we have developed a statistical model for characterizing DNA sequence variants that encode cancer susceptibility.
The novelty of the model lies in three aspects. First, we incorporate the latest discovery of cancer genetics into the model that gene mutation cause cancer (Jallepalli and Lengauer, 2001;Stock and Bialy, 2002;Balman et al., 2003;Greenman et al., 2007). The model is not only able to characterize how gene mutation in the cancer genome acts to regulate cancer, but also can detect the genetic interactions between the host genes and cancer mutation. The model allows the test of haplotype distribution and diversity in the cancer population and patterns of genetic actions and interactions. Second, the model is integrated with multilocus SNP data, detecting cancer genes at the DNA sequence level (Liu et al., 2004;Wu and Lin, 2008). This will provide significant insights into the genetic regulation mechanisms of cancer and cloning of cancer genes. Third, the model was built on the interactions of genes between different genomes. Modeling genome-genome interactions has received an increasing interest in studying the genetic architecture of seed development (Cui and Wu, 2006) and pathogenesis (Foster et al., 2003;Wang et al., 2005).
The model was investigated in terms of its statistical behavior through simulation studies. Different schemes of simulation that consider varying sample sizes and heritabilities were used. The estimation precision of parameters and the power to detect genetic variants for cancer was explored under different schemes. The results from simulation provide scientific support for the model to be used for cancer gene identification in practical data sets. Although we did not include real data to validate our model, the statistical design and algorithm proposed in this work will help cancer geneticists and clinicians launch a novel experiment to test hypotheses about the genetic control of cancer.
This article presents a framework for haplotyping cancer genes, and its extension to including genes × environment interactions, haplotyping in a case-control study, genetic imprinting, and an arbitrary number of SNPs will be possible. As an inherited disease, genetic research of cancer is beneficial from an informative family-structured design in which one or both of the parents and offspring are sampled simultaneously. The general principle of haplotyping genome-genome interactions can be used for such a family design, facilitating our understanding of cancer genetics. Also, cancer can be better viewed as a dynamic trait which undergoes marked developmental transition. Functional mapping advocated by our group (Ma et al., 2002;Liu et al., 2005;Wu and Lin, 2006) can be implemented into the haplotyping model to explore the developmental change of genetic control of cancer in time course. In this article, we focus on the "gene-mutation hypothesis" of cancer formation when the epistatic model was derived. Other hypotheses, such as "aneuploidy hypothesis of cancer" ( Bialy, 2002), should also be integrated into the model, to better understand the genetic mechanisms of cancer formation and progression. The model that incorporate the aneuploidy control of cancer will be reported in other articles.
The preparation of this manuscript is partially supported by Joint NSF/NIH grant DMS/NIGMS-0540745.