Analysis of Sex-Linked Recessive Traits: Optimal Designs for Parameter Estimation and Model Tests | OMICS International
ISSN: 2155-6180
Journal of Biometrics & Biostatistics


Analysis of Sex-Linked Recessive Traits: Optimal Designs for Parameter Estimation and Model Tests

J. Fellman1,2*

1Folkhälsan Institute of Genetics, Department of Genetic Epidemiology, Helsinki, Finland

2Hanken School of Economics, Helsinki, Finland

*Corresponding Author:
J. Fellman
Folkhälsan Institute of Genetics
Department of Genetic Epidemiology, POB 211
FIN-00251 Helsinki, Finland
E-mail: [email protected]

Received date: May 24, 2012; Accepted date: June 19, 2012; Published date: June 20, 2012

Citation: Fellman J (2012) Analysis of Sex-Linked Recessive Traits: Optimal Designs for Parameter Estimation and Model Tests. J Biomet Biostat 3:146. doi: 10.4172/2155-6180.1000146

Copyright: © 2012 Fellman J. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.



The estimation of the gene frequency of sex-linked recessive traits is reconsidered. The estimation formulae for mixed, male, and female samples are presented and compared. Optimal designs for efficient estimation are studied. Male samples are optimal for gene frequencies below 1/3 and female samples for frequencies above 1/3. Mixed samples are never optimal. The model testing problem is discussed. Mixed samples are necessary for model testing. We analyse the loss in efficiency when both estimation and testing must be performed. In general, our results indicate that mixed samples should contain an excess of males. The results obtained are applied to empirical data found in the literature.


Keywords: Maximum likelihood estimation; Model testing; Efficiency; Colour vision; Xg blood groups


In the literature, abundant studies exist concerning probabilistic models in genetics. These have mainly investigated model building and the statistical estimation of gene frequencies. However, in our opinion, experimental design problems have not been examined sufficiently, and this motivates the present study. We evaluate the estimation of the gene frequencies of sex-linked recessive traits, and our basic assumption is that the trait is monogenic and recessive. Such a trait has markedly different phenotype frequencies in the male and female segments of the population: if the trait is recessive and has a gene frequency p in the total population, then the frequency of affected individuals is p among males and p² among females. Consequently, direct comparisons of phenotype frequencies between males and females are of no value. Examples are the genes for colour-blindness and for blood group Xg, which are sex-linked, being located on the X chromosome.

We discuss and compare the maximum likelihood estimators of the gene frequency for mixed, male, and female samples. Among geneticists there is consensus that colour-blindness is not a monogenic trait. Kalmus (1965, p. 63) discussed whether the genes responsible for protan or deutan defects represent one common series of alleles on the X chromosome or two separate series. He stated that the two-loci hypothesis seems better supported. The possibility of testing the genetic model is crucial, and we give alternative methods for model testing. We analyse the loss in efficiency when both estimation and testing must be performed. The results obtained are applied to empirical data found in the literature [3,4].


Maximum likelihood estimation

The model: We consider a monogenic sex-linked recessive trait. We assume that we have a sample of size N consisting of M males and F females and that there are m1 males with a recessive phenotype, m2 males with a dominant phenotype, f1 females with a recessive phenotype and f2 females with a dominant phenotype. If the gene frequency of the recessive trait is p among both males and females [5,6], then the genetic model is given in Table 1.

              Males                               Females
              Number  Affected  Not affected      Number  Affected  Not affected
Observed      M       m1        m2                F       f1        f2
Theoretical   M       Mp        M(1-p)            F       Fp²       F(1-p²)

Table 1: Observed and expected numbers of subjects according to a monogenic recessive sex-linked trait. Affected individuals have recessive phenotypes and unaffected individuals have dominant phenotypes.

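A mixed sample: If we ignore a proportionality factor independent of p, we obtain from Table 1 the likelihood function $L(p)$, with the restriction $0 < p < 1$. The function $L(p)$ can be written

$$L(p) = p^{m_1}(1-p)^{m_2}\,p^{2f_1}(1-p^2)^{f_2} = p^{m_1+2f_1}(1-p)^{m_2+f_2}(1+p)^{f_2}. \tag{1}$$

The log-likelihood function $l(p) = \log L(p)$ is

$$l(p) = (m_1+2f_1)\log p + (m_2+f_2)\log(1-p) + f_2\log(1+p). \tag{2}$$

If the log-likelihood function is written

$$l(p) = \bigl(m_1\log p + m_2\log(1-p)\bigr) + \bigl(2f_1\log p + f_2\log(1-p^2)\bigr), \tag{3}$$

the first parentheses ($l_M(p)$) contain the contribution of the male data and the second parentheses ($l_F(p)$) the contribution of the female data. When we maximize $l(p)$ in (2), we obtain

$$\frac{dl}{dp} = \frac{m_1+2f_1}{p} - \frac{m_2+f_2}{1-p} + \frac{f_2}{1+p}. \tag{4}$$

The condition $dl/dp = 0$ yields an algebraic equation of second degree

$$(M+2F)\,p^2 + m_2\,p - (m_1+2f_1) = 0. \tag{5}$$

This equation has two roots, one negative outside the admissible region (0,1) and one positive. The positive root is

$$\hat{p} = \frac{-m_2 + \sqrt{m_2^2 + 4(M+2F)(m_1+2f_1)}}{2(M+2F)}. \tag{6}$$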

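The upper limit of the discriminant follows from $m_1 + 2f_1 \le M + 2F$:

$$m_2^2 + 4(M+2F)(m_1+2f_1) \le m_2^2 + 4(M+2F)^2 \le \bigl(m_2 + 2(M+2F)\bigr)^2.$$

Consequently, $\hat{p} \le 1$, and $\hat{p}$ belongs to the admissible interval (0,1) whenever the sample contains both affected and unaffected individuals. This estimation result was given by Haldane (1963). One obtains

$$\frac{d^2 l}{dp^2} = -\frac{m_1+2f_1}{p^2} - \frac{m_2+f_2}{(1-p)^2} - \frac{f_2}{(1+p)^2} \tag{7}$$

and $d^2l/dp^2 < 0$. Consequently, the unique solution $\hat{p}$ maximizes $l(p)$ (and $L(p)$). If we accept the model, we can estimate the estimator variance. We have

$$E(m_1) = Mp,\quad E(m_2) = M(1-p),\quad E(f_1) = Fp^2,\quad E(f_2) = F(1-p^2). \tag{8}$$

From (7) and (8), we get the information

$$I(p) = -E\!\left(\frac{d^2 l}{dp^2}\right) = \frac{M}{p(1-p)} + \frac{4F}{1-p^2}. \tag{9}$$

If we introduce $x = M/N$ and $1-x = F/N$, we obtain

$$I(x,p) = N\left[\frac{x}{p(1-p)} + \frac{4(1-x)}{1-p^2}\right] = \frac{N\,[x + (4-3x)p]}{p(1-p^2)}. \tag{10}$$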

We note that for high values of x (a majority of males) the information is high for low values of p and that for low values of x (a majority of females) the information is high for high values of p. Later, we will discuss this observation in more detail.

The inverse of I(x,p) yields the variance

$$V(x,p) = \operatorname{Var}(\hat{p}) = \frac{1}{I(x,p)} = \frac{p(1-p^2)}{N\,[x + (4-3x)p]}. \tag{11}$$

The estimator $\hat{p}$ is asymptotically normal, and the variance $\operatorname{Var}(\hat{p})$ can be estimated by using $\hat{p}$ instead of p in (11). Haldane (1963, formula (5)) gives a slightly different estimate of $\operatorname{Var}(\hat{p})$. His formula contains the observed frequencies and is, in our opinion, not altogether satisfactory. In fact, he estimates p with $\hat{p}_M$ given below in (13) in the “male part” of the formula and with $\hat{p}_F$ given in (16) in the “female part” of the variance formula (cf. formula (11)).
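As a numerical illustration of the mixed-sample formulas, the following Python sketch evaluates the positive root of the quadratic (5) and the standard deviation implied by the inverse information. The function names are hypothetical, and the closed forms are written out under the assumption that they follow directly from the model in Table 1; the counts are the Xg data of Mann et al. (1962) listed in Table 2.

```python
import math


def mixed_mle(m1, m2, f1, f2):
    """Positive root of (M + 2F) p^2 + m2 p - (m1 + 2 f1) = 0 (assumed form of (6))."""
    M, F = m1 + m2, f1 + f2
    a = M + 2 * F
    c = m1 + 2 * f1
    return (-m2 + math.sqrt(m2 * m2 + 4 * a * c)) / (2 * a)


def mixed_sd(p, M, F):
    """Standard deviation from the inverse information M/(p(1-p)) + 4F/(1-p^2)."""
    info = M / (p * (1 - p)) + 4 * F / (1 - p * p)
    return math.sqrt(1 / info)


# Mann et al. (1962): 59 affected and 95 unaffected males,
# 21 affected and 167 unaffected females.
p_hat = mixed_mle(59, 95, 21, 167)
sd_hat = mixed_sd(p_hat, 154, 188)
print(round(p_hat, 6), round(sd_hat, 6))
```

For these counts the sketch should give values close to the entries 0.356021 and 0.025542 reported for this study in Table 2.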

A male sample: If we consider a male sample and ignore the proportionality factor, which is independent of p, we obtain from (3) the log-likelihood function

$$l_M(p) = m_1\log p + m_2\log(1-p). \tag{12}$$

When we maximize $l_M(p)$, we get the “male” estimator

$$\hat{p}_M = \frac{m_1}{M} \tag{13}$$

with the information $I_M(p) = M/(p(1-p))$ and the well-known variance

$$\operatorname{Var}(\hat{p}_M) = \frac{p(1-p)}{M}. \tag{14}$$

The estimator $\hat{p}_M$ is asymptotically normal, and the variance $\operatorname{Var}(\hat{p}_M)$ in (14) can be estimated by using $\hat{p}_M$ instead of p.

A female sample: If we consider only the female part of the sample and ignore the proportionality factor, which is independent of p, we obtain from (3) the log-likelihood function [7,8]

$$l_F(p) = 2f_1\log p + f_2\log(1-p^2). \tag{15}$$

If we maximize the log-likelihood function, we obtain the “female” estimator

$$\hat{p}_F = \sqrt{\frac{f_1}{F}} \tag{16}$$

with the information $I_F(p) = 4F/(1-p^2)$ and the variance

$$\operatorname{Var}(\hat{p}_F) = \frac{1-p^2}{4F}. \tag{17}$$

The estimator $\hat{p}_F$ is consistent, efficient and asymptotically normal, and the variance $\operatorname{Var}(\hat{p}_F)$ in (17) can be estimated by using $\hat{p}_F$ instead of p. According to Huether and Murphy (1980), it is not clear how rapidly these asymptotic properties are approached with increasing sample size. The log-likelihood function (15) yields an unbiased estimate $f_1/F$ of $p^2$, but $\hat{p}_F$ in (16) is biased with a negative bias. Haldane (1956) proposed an improved estimate [9,10]

$$\hat{p}_H = \sqrt{\frac{4f_1+1}{4F+1}}. \tag{18}$$

In order to improve the ML estimates, Huether and Murphy proposed a jackknife procedure. Their estimate is, using our notation,

$$\hat{p}_J = F\sqrt{\frac{f_1}{F}} - \frac{F-1}{F}\left[f_1\sqrt{\frac{f_1-1}{F-1}} + f_2\sqrt{\frac{f_1}{F-1}}\right]. \tag{19}$$

How these improvements influence our gene estimates will be discussed in the Discussion section. Eq. (9) indicates that the information obtained for the whole data set is $I(p) = I_M(p) + I_F(p)$. This is a consequence of the male and female data sets being independent.
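The three female-sample estimators can be compared numerically. The sketch below is a minimal illustration with hypothetical function names; the Haldane form sqrt((4 f1 + 1)/(4F + 1)) and the leave-one-out jackknife are assumptions on our part, chosen because they agree with the values listed in Table 4 for the data of Mann et al. (1962) and Noades et al. (1966).

```python
import math


def female_mle(f1, F):
    """ML estimate sqrt(f1 / F) from a female sample."""
    return math.sqrt(f1 / F)


def haldane_estimate(f1, F):
    """Bias-reduced estimate, assumed form sqrt((4 f1 + 1) / (4 F + 1))."""
    return math.sqrt((4 * f1 + 1) / (4 * F + 1))


def jackknife_estimate(f1, F):
    """Standard leave-one-out jackknife of the ML estimate (assumed form)."""
    f2 = F - f1
    full = math.sqrt(f1 / F)
    # Removing an affected female lowers f1 by one; removing an
    # unaffected female leaves f1 unchanged.
    drop_affected = math.sqrt((f1 - 1) / (F - 1)) if f1 > 0 else 0.0
    drop_unaffected = math.sqrt(f1 / (F - 1))
    mean_leave_one_out = (f1 * drop_affected + f2 * drop_unaffected) / F
    return F * full - (F - 1) * mean_leave_one_out


# Mann et al. (1962): 21 affected among 188 females.
for estimator in (female_mle, haldane_estimate, jackknife_estimate):
    print(estimator.__name__, round(estimator(21, 188), 5))
```

Both corrections move the estimate upwards, which is the direction expected for an estimator with a negative bias.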

Model testing

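A mixed sample: In the mixed data set, there are two degrees of freedom because the row sums for males and females are fixed. After the estimation of p, one degree of freedom remains. According to Table 1, the model can be tested by the quantity [11]

$$\chi^2 = \frac{(m_1 - M\hat{p})^2}{M\hat{p}} + \frac{\bigl(m_2 - M(1-\hat{p})\bigr)^2}{M(1-\hat{p})} + \frac{(f_1 - F\hat{p}^2)^2}{F\hat{p}^2} + \frac{\bigl(f_2 - F(1-\hat{p}^2)\bigr)^2}{F(1-\hat{p}^2)}, \tag{20}$$

where $\hat{p}$ is the mixed-sample estimate given in (6).

Under the null hypothesis that the model holds, this quantity is approximately $\chi^2$ distributed with one degree of freedom.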

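The model can also be tested by the Likelihood Ratio Test (LRT). Consider

$$\Lambda = \frac{\max_{p} L(p)}{\max_{p_M,\,p_F} L_M(p_M)\,L_F(p_F)},$$

where $L_M(p) = p^{m_1}(1-p)^{m_2}$ and $L_F(p) = p^{2f_1}(1-p^2)^{f_2}$ are the male and female likelihoods.

The maximizations give

$$\Lambda = \frac{\hat{p}^{\,m_1+2f_1}(1-\hat{p})^{m_2+f_2}(1+\hat{p})^{f_2}}{\hat{p}_M^{\,m_1}(1-\hat{p}_M)^{m_2}\,\hat{p}_F^{\,2f_1}(1-\hat{p}_F^2)^{f_2}}, \tag{21}$$

where $\hat{p}$, $\hat{p}_M$ and $\hat{p}_F$ are given in (6), (13), and (16), respectively. Under the null hypothesis, $-2\log\Lambda$ is approximately $\chi^2$ distributed with one degree of freedom. In situations not far from the null hypothesis, the $\chi^2$ tests based on (20) and (21) give similar results. In the applications, formula (20) is used.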
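To make the two model tests concrete, the sketch below computes a Pearson statistic over the four cells of Table 1 and a likelihood ratio statistic for a mixed sample. The function names are hypothetical, the formulas are our reading of (20) and (21), and all four counts are assumed to be positive so that the logarithms are defined.

```python
import math


def mixed_mle(m1, m2, f1, f2):
    """Positive root of (M + 2F) p^2 + m2 p - (m1 + 2 f1) = 0."""
    M, F = m1 + m2, f1 + f2
    a, c = M + 2 * F, m1 + 2 * f1
    return (-m2 + math.sqrt(m2 * m2 + 4 * a * c)) / (2 * a)


def log_lik_male(p, m1, m2):
    return m1 * math.log(p) + m2 * math.log(1 - p)


def log_lik_female(p, f1, f2):
    return 2 * f1 * math.log(p) + f2 * math.log(1 - p * p)


def pearson_chi2(m1, m2, f1, f2):
    """Sum of (observed - expected)^2 / expected over the four cells."""
    M, F = m1 + m2, f1 + f2
    p = mixed_mle(m1, m2, f1, f2)
    observed = [m1, m2, f1, f2]
    expected = [M * p, M * (1 - p), F * p * p, F * (1 - p * p)]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def lrt_statistic(m1, m2, f1, f2):
    """-2 log(Lambda): separate male/female fits against the common fit."""
    M, F = m1 + m2, f1 + f2
    p = mixed_mle(m1, m2, f1, f2)
    p_m = m1 / M
    p_f = math.sqrt(f1 / F)
    common = log_lik_male(p, m1, m2) + log_lik_female(p, f1, f2)
    separate = log_lik_male(p_m, m1, m2) + log_lik_female(p_f, f1, f2)
    return 2 * (separate - common)


chi2_mann = pearson_chi2(59, 95, 21, 167)         # Xg data, Mann et al. (1962)
lrt_mann = lrt_statistic(59, 95, 21, 167)
chi2_waaler = pearson_chi2(725, 8324, 40, 9032)   # colour vision, Waaler (1927)
print(round(chi2_mann, 2), round(lrt_mann, 2), round(chi2_waaler, 2))
```

Both statistics are referred to a chi-square distribution with one degree of freedom, whose 5% critical value is 3.84.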

Separate male and female samples: If we estimate p separately for the male and female series, there is no degree of freedom left in either series. Consequently, if we test the hypothesis $p_M = p_F$, we must consider the difference $\hat{p}_M - \hat{p}_F$ with the variance

$$\operatorname{Var}(\hat{p}_M - \hat{p}_F) = \frac{p(1-p)}{M} + \frac{1-p^2}{4F}. \tag{22}$$

Under the null hypothesis, $(\hat{p}_M - \hat{p}_F)/\sqrt{\operatorname{Var}(\hat{p}_M - \hat{p}_F)}$ is approximately standard normal.

If we accept the null hypothesis $p_M = p_F = p$, then we can obtain a weighted estimate of the common gene frequency p. To minimize the variance of the weighted estimate, the weights should be the inverses of the variances in (14) and (17). The weighted estimate is

$$\hat{p}_W = \frac{\dfrac{M}{p(1-p)}\,\hat{p}_M + \dfrac{4F}{1-p^2}\,\hat{p}_F}{\dfrac{M}{p(1-p)} + \dfrac{4F}{1-p^2}} \tag{23}$$

and its theoretical variance is $\left[\dfrac{M}{p(1-p)} + \dfrac{4F}{1-p^2}\right]^{-1}$, which is identical to (11). The estimator $\hat{p}$ maximizes $L(p)$ and $l(p)$, but the weighted estimator $\hat{p}_W$ and the Haldane estimator $\hat{p}$ have asymptotically the same efficiency. Consequently, both estimators are best asymptotic normal (BAN).
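The separate-sample analysis can be sketched as follows. The function name is hypothetical, and plugging the single-sex estimates into the variances (14) and (17) is an assumption on our part; it reproduces the weighted estimate and its standard deviation reported in Table 2 for the data of Mann et al. (1962).

```python
import math


def separate_sample_analysis(m1, M, f1, F):
    """Return (z, weighted estimate, SD of weighted estimate).

    The variances are estimated by inserting the male and female
    estimates into p(1-p)/M and (1-p^2)/(4F), respectively.
    """
    p_m = m1 / M
    p_f = math.sqrt(f1 / F)
    var_m = p_m * (1 - p_m) / M
    var_f = (1 - p_f * p_f) / (4 * F)
    z = (p_m - p_f) / math.sqrt(var_m + var_f)
    w_m, w_f = 1 / var_m, 1 / var_f          # inverse-variance weights
    p_w = (w_m * p_m + w_f * p_f) / (w_m + w_f)
    sd_w = math.sqrt(1 / (w_m + w_f))
    return z, p_w, sd_w


# Mann et al. (1962): 59 of 154 males and 21 of 188 females affected.
z, p_w, sd_w = separate_sample_analysis(59, 154, 21, 188)
print(round(z, 2), round(p_w, 6), round(sd_w, 6))
```

For these data |z| is well below 1.96, so the hypothesis of a common gene frequency is not rejected at the 5% level, in line with the small chi-square value for the same study.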

Design of experiments

In connection with another type of genetic problem, Brown (1975) considers efficient experimental designs for the estimation of genetic parameters. We start from the same basic idea, but use different methods. In his book on colour-blindness, Kalmus (1965, p. 85) states, without further comment, that the population frequency for rare sex-linked recessive traits must be based on male samples. Now we study this problem in more detail. We apply experimental design theory using the inference results in the preceding sections [12].

Designs for parameter estimation: Let us assume that we intend to investigate N (fixed) individuals and that the gene frequency is p. Our problem is now in what proportion M : F we should include males and females in our sample in order to minimize the variance given in (11) or, alternatively, to maximize the information measure (9). We study the information $I(x,p)$ and the variance $V(x,p)$ as functions of p and x. From (9) we get

$$I(x,p) = N\left[\frac{4}{1-p^2} + \frac{(1-3p)\,x}{p(1-p^2)}\right]. \tag{24}$$

The function (24) is a linear function of x. For $p < 1/3$, $I(x,p)$ is an increasing function of x and the maximum is obtained for x = 1, i.e. for a male sample. For $p > 1/3$, $I(x,p)$ is a decreasing function of x and the maximum is obtained for x = 0, i.e. for a female sample. For $p = 1/3$, $I(x,p)$ is constant and all samples are equally good. Our optimal experimental design for parameter estimation is hence

(i) Use a male sample if $p < 1/3$

(ii) Use an arbitrary sample if $p = 1/3$

(iii) Use a female sample if $p > 1/3$

We observe that the optimal design of the experiment depends on the true parameter value. This is common in non-linear situations, but in this case the dependence is very simple. In different populations, the frequency of colour blindness is about 0.08, so rule (i) is in good agreement with Kalmus's (1965) statement.
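The design rule can be checked numerically with a short sketch. The function names are hypothetical; the information per subject is taken to be x/(p(1-p)) + 4(1-x)/(1-p²), which is our reading of the sum of the male and female contributions for a sample with male fraction x.

```python
def information_per_subject(x, p):
    """I(x, p) / N for a sample with male fraction x and gene frequency p."""
    return x / (p * (1 - p)) + 4 * (1 - x) / (1 - p * p)


def optimal_design(p, tol=1e-12):
    """Pick the single-sex sample with the larger information.

    The information is linear in x, so comparing x = 1 with x = 0
    is enough to find the maximizing sample composition.
    """
    gain = information_per_subject(1.0, p) - information_per_subject(0.0, p)
    if gain > tol:
        return "male"
    if gain < -tol:
        return "female"
    return "arbitrary"


for p in (0.08, 1 / 3, 0.5):
    print(p, optimal_design(p))
```

With p = 0.08, roughly the frequency quoted for colour blindness, the male sample carries more than three times the information of a female sample of the same size.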

In practice, the problem is not so simple. Often when we start an investigation, we do not know the gene frequency. If we have prior information (from earlier studies) that the gene frequency is far in a known direction from one-third, we can with confidence use a male or a female sample. If, however, we have no prior information or if the gene frequency is known to be in the neighbourhood of 1/3, then it is difficult to decide whether to use a male or female sample. We can see in Table 2 that for the Xg blood group p is close to 1/3, and this is a good example of this problem.

          N      Recessive  Dominant  a)        χ²    b)        SD        c)        Reference
Males     154    59         95        0.356021  0.88  0.383117  0.039175  0.355486  Mann et al., 1962
Females   188    21         167       0.025542        0.334219  0.034369  0.025836
Males     1751   620        1131      0.341226  2.62  0.354083  0.019206  0.334715  Noades et al., 1966
Females   1667   179        1488      0.008075        0.327687  0.011570  0.009911
Males     3513   1209       2304      0.340577  0.41  0.344150  0.013664  0.338743  Sanger et al., 1971
Females   3271   371        2900      0.005731        0.336780  0.008232  0.007051

Table 2: Xg in different studies. a) Mixed-sample estimate (6), with its SD from (11) in the row below; χ²: test statistic (20); b) single-sex estimates (13) and (16) with their SDs from (14) and (17); c) weighted estimate (23), with its SD in the row below.

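Let us now analyse the efficiency of a mixed sample in more detail. Assume that the true gene frequency is p. Now, we have to compare $V(x,p)$ with $V(1,p)$ if $p \le 1/3$ and with $V(0,p)$ if $p \ge 1/3$, and we obtain the relative efficiencies for the mixed sample

$$E(x,p) = \begin{cases} \dfrac{V(1,p)}{V(x,p)} = \dfrac{x + (4-3x)p}{1+p}, & p \le 1/3, \\[2ex] \dfrac{V(0,p)}{V(x,p)} = \dfrac{x + (4-3x)p}{4p}, & p \ge 1/3. \end{cases} \tag{25}$$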


If we must test the model, it is necessary to include in the sample both males and females. If this is done, there is a loss of efficiency relative to the best (but unknown) design. In general, if we compare a male sample, a female sample, and a mixed sample of the same size, then the efficiency of the mixed sample is always between the efficiencies of the single-sex samples.

In Figure 1, we see how the efficiencies depend on the gene frequency for the single-sex samples (x = 0 and x = 1) and for some mixed samples (x = 0.3333, 0.5155, and 0.6667). The choice of x = 0.5155 and x = 0.6667 will be explained later. We observe that for small values of p the efficiency strongly depends on the true value of p. For $p < 1/3$, the male sample is most efficient. For $p > 1/3$, the female sample is most efficient, but the female sample loses more efficiency for $p < 1/3$ than the male sample loses for $p > 1/3$. Therefore, Figure 1 supports the conclusion that, independently of the true value of p, if we want to play safe a mixed sample should contain an excess of males.


Figure 1: Efficiency as a function of the gene frequency p for different sample compositions (x = 0, 0.3333, 0.5155, 0.6667, 1), where Maximin corresponds to x = 0.6667. For details, see the text.

This result can also be obtained in the following way. We consider the efficiency $E(x,p)$ for a mixed sample as a function of p for a given x. For $p \le 1/3$, we have $E(x,p) = \dfrac{x + (4-3x)p}{1+p}$ and $\dfrac{\partial E}{\partial p} = \dfrac{4(1-x)}{(1+p)^2} \ge 0$, with equality for x = 1, i.e. the sample contains only male subjects. Hence, $E(x,p)$ is an increasing function of p and $E(x,p) \ge E(x,0) = x$ for $0 \le p \le 1/3$.

Similarly, we obtain for $p \ge 1/3$ that $E(x,p) = \dfrac{x + (4-3x)p}{4p}$ and $\dfrac{\partial E}{\partial p} = -\dfrac{x}{4p^2} \le 0$. Now, $E(x,p)$ is a decreasing function of p and $E(x,p) \ge E(x,1) = 1 - x/2$ for $1/3 \le p \le 1$. From these results, it follows that $\min_p E(x,p) = \min(x,\, 1 - x/2)$. Hence, $\max_x \min_p E(x,p) = 2/3$, and this value is obtained for $x = 2/3$ and p = 0 or 1.

Speaking in terms of game theory, the strategy of nature is the choice of p and our strategy is the choice of x, and $E(x,p)$ is the payoff of the game. The maximin solution indicates that we are playing safe. We expect the worst, i.e. that nature has chosen one extreme p value, and consequently, we prepare for it and choose the strategy that maximizes our gain (the efficiency). From this point of view, we should use a sample with 2/3 males and 1/3 females. This mixed sample guarantees at least the efficiency 2/3 (cf. Figure 1).
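The maximin argument can be illustrated with a brute-force grid search. This is only a sketch: the function names and the grid resolution are our own choices, and the efficiency is assumed to be (x + (4-3x)p)/(1+p) for p ≤ 1/3 and (x + (4-3x)p)/(4p) for p ≥ 1/3, i.e. the information of the mixed sample relative to the better single-sex sample.

```python
def efficiency(x, p):
    """Relative efficiency of a sample with male fraction x at gene frequency p."""
    numerator = x + (4 - 3 * x) * p
    if p <= 1 / 3:
        return numerator / (1 + p)   # compared with an all-male sample
    return numerator / (4 * p)       # compared with an all-female sample


def worst_case_efficiency(x, grid=1000):
    """Smallest efficiency over an evenly spaced grid of p values in [0, 1]."""
    return min(efficiency(x, i / grid) for i in range(grid + 1))


# Nature picks p to hurt us most; we pick x to make that worst case
# as good as possible.
candidates = [i / 1000 for i in range(1001)]
best_x = max(candidates, key=worst_case_efficiency)
print(best_x, round(worst_case_efficiency(best_x), 4))
```

The search should settle on the grid point nearest to x = 2/3, with a guaranteed efficiency of about 2/3, and the worst cases occur at the extreme gene frequencies p = 0 and p = 1.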

Designs for model testing: A sample consisting of both males and females is necessary if we have doubts about the model. The doubts may concern the simple recessive inheritance (cf. colour blindness), absence of selection (the same gene frequency in males and females), exact typing independent of the sex, or the non-existence of border cases that are difficult to type. If we have a mixed sample, we can then test the model as we have noted above. This is not possible with a male-, or female-only sample. This problem is a good example of the common situation that an experimental strategy, which is optimal for parameter estimation, is too restricted to be of any value for model testing.

If we want to test the model and to use the difference $\hat{p}_M - \hat{p}_F$, whose variance is given in (22), most efficiently under the null hypothesis, then we have to consider the variance

$$W(x,p) = \frac{p(1-p)}{Nx} + \frac{1-p^2}{4N(1-x)} \tag{26}$$

and to pursue $\min_x \max_p W(x,p)$. This solution indicates that we are again playing safe. We expect the worst situation, i.e. that nature has chosen a p value that maximizes the variance, and consequently, we prepare for it and choose the strategy (x) that minimizes our loss (the variance). In other words, we want to answer the question: Which sample mixture x : (1-x) minimizes $\max_p W(x,p)$? For a given x, we obtain

$$\frac{\partial W}{\partial p} = \frac{1}{N}\left[\frac{1-2p}{x} - \frac{p}{2(1-x)}\right].$$

The corresponding W value is a maximum for $p = \dfrac{2(1-x)}{4-3x}$. This maximum $W_{\max}(x)$, which depends on x, is

$$W_{\max}(x) = \frac{(2-x)^2}{4N\,x(1-x)(4-3x)}.$$

Now, we minimize $W_{\max}(x)$ by using the derivative $dW_{\max}/dx$.

If we use the condition that $\partial W/\partial p = 0$, the derivative reduces to $dW_{\max}/dx = \partial W/\partial x$. Now, we solve the equation $\partial W/\partial x = 0$ under the restriction $p = 2(1-x)/(4-3x)$. The equation simplifies to

$$f(x) = 3x^3 - 18x^2 + 24x - 8 = 0. \tag{27}$$

This equation of third degree satisfies the conditions $f(0) = -8 < 0$ and $f(1) = 1 > 0$. Consequently, the equation has one root or three roots within the interval (0,1). The case of three roots within this interval is impossible because the product of the roots has to be 8/3. Thus, there is only one root within the interval (0,1). In Figure 2, we present the function $f(x)$ in order to locate the roots. An iterative calculation yields the numerical root $x_0 = 0.5155$, and the corresponding p value is $p_0 = 0.3949$. Finally, we obtain

$$\min_x \max_p W(x,p) = W(0.5155,\, 0.3949) = \frac{0.8991}{N}. \tag{28}$$


Figure 2: Graphic representation of the function $f(x) = 3x^3 - 18x^2 + 24x - 8$. The root x = 0.5155 is easily recognized.

The solution $x_0 = 0.5155$ is our best testing strategy in order to meet nature’s worst alternative $p_0 = 0.3949$. This minimax solution of the testing problem does not coincide with the maximin solution of the efficiency problem. Figure 3 shows how $NW(p,x)$ depends on p for different values of x. The minimax property of $x_0 = 0.5155$ is easily seen.
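The minimax solution can be reproduced with a few lines of Python. The sketch below uses plain bisection on the cubic 3x³ - 18x² + 24x - 8, which is our reconstruction of equation (27), together with the scaled variance N·W(x,p) = p(1-p)/x + (1-p²)/(4(1-x)); the function names are hypothetical.

```python
def cubic(x):
    """Left-hand side of the assumed cubic 3x^3 - 18x^2 + 24x - 8 = 0."""
    return 3 * x**3 - 18 * x**2 + 24 * x - 8


def bisect_root(f, lo, hi, tol=1e-12):
    """Bisection for a root of f in [lo, hi]; f(lo) and f(hi) must differ in sign."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(lo) * f(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2


def worst_p(x):
    """Gene frequency that maximizes the variance for a given male fraction x."""
    return 2 * (1 - x) / (4 - 3 * x)


def scaled_variance(x, p):
    """N * W(x, p): variance of the male-female difference times N."""
    return p * (1 - p) / x + (1 - p * p) / (4 * (1 - x))


x0 = bisect_root(cubic, 0.0, 1.0)
p0 = worst_p(x0)
nw0 = scaled_variance(x0, p0)
print(round(x0, 4), round(p0, 4), round(nw0, 4))
```

Evaluating the same functions at x = 2/3 gives a worst case of N·W = 1 at p = 1/3, which shows what the estimation-oriented design costs when the model is tested.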


Figure 3: Variance of the difference $\hat{p}_M - \hat{p}_F$ as a function of the gene frequency p for different sample compositions (x = 0.2000, 0.3333, 0.5155, 0.6667, 0.8000). Minimax corresponds to x = 0.5155. For details, see the text.


We apply our theoretical results to empirical data. We consider both colour vision and Xg blood group data. In Table 2, we present the results of the analyses of blood group data, and in Table 3 the results of the colour vision data. The results obtained by the mixed sample and obtained by combined estimates of male and female samples are fairly similar.

          N      Recessive  Dominant  a)        χ²    b)        SD        c)        Reference
Males     9049   725        8324      0.077226  4.76  0.080119  0.002854  0.076979  Waaler, 1927
Females   9072   40         9032      0.00247         0.066402  0.005238  0.002506
Males     6863   532        6331      0.074505  4.89  0.077517  0.003228  0.074141  Schmidt, 1936
Females   5604   20         5584      0.002862        0.059740  0.006667  0.002905
Males     21231  1687       19544     0.078034  5.62  0.079459  0.001856  0.077898  Koliopoulos et al., 1976
Females   8754   37         8717      0.001740        0.065013  0.005333  0.001753

Table 3: Colour blindness in different studies. a) Mixed-sample estimate (6), with its SD from (11) in the row below; χ²: test statistic (20); b) single-sex estimates (13) and (16) with their SDs from (14) and (17); c) weighted estimate (23), with its SD in the row below.


The reduction of the biases in the female estimates in Tables 2 and 3 is presented in Table 4. The comparison between the maximum likelihood estimates and the improved estimates indicates that the MLE has a negative bias, but with these sample sizes the errors are negligible. The improvements proposed by Haldane (1956) and Huether & Murphy (1980) yield almost identical estimates.

                           ML estimate  Haldane, 1956  Huether & Murphy, 1980
Xg blood groups
Mann et al., 1962          0.33422      0.33598        0.33603
Noades et al., 1966        0.32769      0.32789        0.32789
Sanger et al., 1971        0.33678      0.33688        0.33688
Colour vision
Waaler, 1927               0.06640      0.06661        0.06661
Schmidt, 1936              0.05974      0.06011        0.06012
Koliopoulos et al., 1976   0.06501      0.06523        0.06523

Table 4: Comparison between the maximum likelihood estimates and the improved estimates proposed by Haldane (1956) and Huether & Murphy (1980).

If our minimax design $x = 0.5155$ is used for an estimation problem, then the minimum efficiency is 0.5155, which is obtained for p = 0. If we compare this value with the maximin solution x = 0.6667 for the estimation problem, we observe how much we have to “pay” for the hypothesis testing. On the other hand, if our primary goal is estimation and we choose the design $x = 0.6667$, then the corresponding maximal variance is $1/N$, which is obtained for p = 0.3333. This can be compared with the earlier obtained $0.8991/N$. Hence, if our target is parameter estimation, then the efficiency of the model test is reduced in the proportion 0.8991 : 1.

The common opinion today is that colour blindness is not a one-locus trait. The data of Waaler, Schmidt, and Koliopoulos et al. show statistically significant differences from the one-locus model. The common finding in these studies is that the estimate $\hat{p}_F$ is less than $\hat{p}_M$, and this result supports the two-loci hypothesis. However, the other colour vision data, especially the female data, are very limited. NZHTA Report 7 (1998) presents colour vision data collected from different sources, and the value of that study lies in this collection. In addition, that study presents tests of the sex differences in the distribution between subjects with colour deficiency and normal sight. The tests indicate strong sex differences, but they ignore the sex-linked nature of colour blindness, and consequently, these results are of minor interest.


The author is grateful to an anonymous referee for very constructive suggestions concerning the biasedness among the “female estimates” and has tried to consider these remarks in the text. This work was supported in part by a grant from the Magnus Ehrnrooth Foundation.
