Relative Likelihood Differences to Examine Asymptotic Convergence: A Bootstrap Simulation Approach

Maximum likelihood estimators (mle) and their large sample properties are extensively used in descriptive as well as inferential statistics. In the framework of the large sample distribution of the mle, it is important to know the relationship between the sample size and asymptotic convergence, i.e., for what sample size the mle behaves satisfactorily, attaining asymptotic normality. Previous work has discussed the undesirable impact of using large sample approximations of the mle when such approximations do not hold, and it has been argued that relative likelihood functions should be examined before making inferences based on the mle. Little has been explored, however, regarding the appropriate sample size that would allow the mle to achieve asymptotic normality from a relative likelihood perspective directly. Our work proposes a bootstrap/simulation based approach to examining the relationship between sample size and the asymptotic behavior of the mle. We propose two measures of the convergence of the observed relative likelihood function to the asymptotic relative likelihood function, namely the difference in areas under and the dissimilarity in shape between the two relative likelihood functions. These two measures were applied to datasets from the literature as well as simulated datasets.

Journal of Biometrics & Biostatistics, ISSN: 2155-6180. Citation: Bimali M, Brimacombe M (2015) Relative Likelihood Differences to Examine Asymptotic Convergence: A Bootstrap Simulation Approach. J Biom Biostat 6: 220. doi:10.4172/2155-6180.1000220


Introduction
"Likelihood" is arguably the most pronounced term in the statistical realm; it was defined and popularized by the eminent geneticist and statistician Fisher (1922) [1-4]. The likelihood function is a function of the model parameter(s) based on a given set of data and a pre-defined probability density function (pdf). It can be formally defined as follows: if f(x|θ) is the joint pdf (pmf) of the sample X = (X₁, …, Xₙ) based on iid random variables Xᵢ, then the likelihood function of θ is given by L(θ|x) = c·f(x|θ), where c is a constant with respect to θ.
A key point often reiterated in textbooks is that the likelihood function is a function of θ and is not to be viewed as a probability density itself [5]. However, the shape of the likelihood function relative to its mode is often of interest in estimating θ. Likelihood functions can be mathematically constructed for most statistical distributions, but maximum likelihood estimators may not always have closed form [6]. Nevertheless, most commonly used distributions allow for the computation of maximum likelihood estimators analytically, numerically, or graphically. Several properties of maximum likelihood estimators, such as asymptotic normality, invariance, and ease of computation, have made them popular [7]. In this paper we assume θ is a scalar throughout.
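As a simple illustration of these ideas, the likelihood for a Poisson model can be maximized both in closed form and numerically on a grid; the sketch below uses hypothetical count data (not from the paper) and shows the two approaches agreeing.

```python
import numpy as np

# Hypothetical count data (for illustration only), modeled as Poisson(lam)
x = np.array([3, 7, 5, 9, 6, 4, 8, 5])

def log_likelihood(lam):
    # log L(lam | x) up to an additive constant (-sum(log(x_i!)))
    return float(np.sum(x * np.log(lam) - lam))

# Closed-form mle: for the Poisson model, lam_hat is the sample mean
lam_hat = x.mean()

# Numerical/graphical check: evaluate over a grid and take the argmax
grid = np.linspace(0.1, 15.0, 20000)
ll = np.array([log_likelihood(g) for g in grid])
lam_grid = grid[np.argmax(ll)]
```

The grid search recovers the closed-form mle to within the grid resolution, which is the sense in which mles can be obtained "analytically, numerically or graphically."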
The large sample distribution of the maximum likelihood estimator is often used for inferential purposes. If θ̂ is the mle of θ, then asymptotically θ̂ ∼ N(θ, 1/I(θ)), where I(θ) is the Fisher information. In practice the observed Fisher information I(θ̂) has been used as an approximation in the computation of I(θ) [8].
A common question that arises in statistics is in regard to sample size. In the framework of the large sample distribution of the mle, we are interested in knowing for what sample size the mle behaves satisfactorily, attaining the asymptotic normal distribution. Put differently, does the existing sample size allow us to use the large sample properties of the mle with confidence? If not, what would be an ideal sample size? Sprott et al. (1969) discussed some of the undesirable impacts of using large sample approximations of the mle when such approximations do not seem to hold [9] and suggested examining likelihood functions before making inferences based on the mle. They demonstrated via an example from Bartholomew [10] that drawing inferences from the mle without first examining the likelihood functions can be misleading. Figure 1 gives the plot of the observed relative likelihood (the likelihood function scaled by its mode) obtained from Bartholomew's data and the relative normal likelihood based on the large sample theory of the mle. The plot shows that for a pre-specified value of the relative likelihood, the range of plausible θ values can be in complete disagreement between the two likelihood functions. At a relative likelihood of 10% or higher, the ranges are roughly (20, 110) and (7, 81) for the relative and relative normal likelihood functions respectively [9], approximately a 17% drop in coverage. Sprott (1969) also demonstrated that transformation of the mle can help achieve asymptotic normality with smaller sample sizes. However, little has been explored regarding the appropriate sample size that would allow the mle to achieve asymptotic normality from a relative likelihood perspective directly.
This work proposes a bootstrap/simulation based approach to the above question via the behavior and properties of the relative likelihood function. In particular we measure the proximity of the observed likelihood function to the likelihood function based on large sample properties, both of which are scaled here by their modes to have a maximum at one. The convergence measures proposed by the authors are (i) the difference in area under the two relative likelihood functions and (ii) the dissimilarity in shape of the two likelihood functions (dissimilarity index). We propose that, for a given sample size, if the difference in the area under the two relative likelihood functions and the dissimilarity index between them are both close to 0, the asymptotic approximation of the mle is satisfactorily achieved. To study the properties of these measures and related likelihood convergence, we use the bootstrap to generate samples of varying size based on initial samples for examples in the literature.
The paper is laid out as follows. Section 2 provides a review of the bootstrap method and measures of distance between distributions. In Section 3, we provide the mathematical details of the two measures of convergence. In Section 4 we provide examples by simulating data from exponential families of distributions and apply our method to data available in the literature and textbooks.

Review of Bootstrap and Distances Between Distributions

Bootstrap
The bootstrap is a resampling technique introduced by Efron (1979), with a related long history [11], and has attracted much attention in the past three decades, primarily due to its conceptual simplicity and the computational empowerment of statisticians brought by advances in computer science [12]. The past three decades have witnessed much work dedicated to developing bootstrap methods [13-19]. The bootstrap, at its core, treats the data at hand as a "surrogate population" and samples from it with replacement, with the goal of re-computing the statistic of interest many times. This allows us to examine its bootstrap distribution. Efron has demonstrated that the bootstrap method outperforms other methods such as jackknifing and cross-validation [12]. Despite the simplicity of the bootstrap algorithm, the large sample properties of bootstrap distributions are surprisingly elegant. Singh (1981), for example, demonstrated that the sampling distribution of √n(θ̂ − θ), where θ̂ is an estimate of θ, is well approximated by its bootstrap distribution [20]. Bickel and Freedman have also made substantial contributions to developing bootstrap theory [21-23]. The most common applications of the bootstrap in its basic form involve approximating the standard error of sample estimators, correcting the bias in a sample estimate, and constructing confidence intervals. However, in situations involving dependent data, modified bootstrap approaches such as the moving-block bootstrap are recommended [24]. Romano (1992) has also discussed applications of the bootstrap extensively [18]. Here we use the bootstrap as an approach to simulating samples based on the observed data; the sampling properties of the bootstrap are not used directly, as we observe convergence behavior on the likelihood scale.
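A minimal sketch of the basic nonparametric bootstrap described above (the data, sample size, and number of resamples are assumed for illustration): the statistic of interest is recomputed on samples drawn with replacement, and its bootstrap distribution yields standard error, bias, and percentile interval estimates.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sample from an exponential population; the mle of the
# population mean is the sample mean
x = rng.exponential(scale=10.0, size=25)
theta_hat = x.mean()

# Treat x as a "surrogate population": sample with replacement and
# recompute the statistic of interest many times
B = 2000
boot = np.array([rng.choice(x, size=x.size, replace=True).mean()
                 for _ in range(B)])

se_boot = boot.std(ddof=1)             # bootstrap standard error
bias_boot = boot.mean() - theta_hat    # bootstrap bias estimate
ci = np.percentile(boot, [2.5, 97.5])  # percentile confidence interval
```

The same resampling loop is what generates the "bootstrap samples" of varying size used later in the paper; only the statistic recomputed on each resample changes.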

Distance between distributions
The Kullback-Leibler distance is a commonly used measure of the difference between two statistical distributions [25]. If p(x) and q(x) are two continuous distributions, the KL distance between p(x) and q(x) is defined as follows:

KL(p, q) = ∫ p(x) log( p(x) / q(x) ) dx.
The Kullback-Leibler distance has been applied in areas such as functional linear models, Markovian processes, model selection, and classification analysis [26-29]. It should be noted that the Kullback-Leibler distance is not symmetric, KL(p, q) ≠ KL(q, p), but it can be expressed in a symmetric form [30].
The Bhattacharya distance is another popular measure of the difference between two distributions [31]. If p(x) and q(x) are two continuous distributions, the Bhattacharya distance between p(x) and q(x) is defined as follows:

B(p, q) = −log ∫ √( p(x) q(x) ) dx.

The Bhattacharya distance has also found extensive applications in several fields [32-35]. It assumes the product p(x)q(x) to be non-negative.
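Both distances can be evaluated numerically; the sketch below uses two normal densities chosen for illustration (closed forms exist for normals, which is convenient for checking) and makes the asymmetry of the Kullback-Leibler distance visible.

```python
import numpy as np

def norm_pdf(t, mu, s):
    # Normal density with mean mu and standard deviation s
    return np.exp(-0.5 * ((t - mu) / s) ** 2) / (s * np.sqrt(2 * np.pi))

def integrate(y, t):
    # Trapezoidal rule on a grid
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(t)))

t = np.linspace(-20.0, 20.0, 200001)
p = norm_pdf(t, 0.0, 1.0)   # N(0, 1)
q = norm_pdf(t, 1.0, 2.0)   # N(1, 4)

# KL(p, q) = int p log(p/q) dt; note that KL(p, q) != KL(q, p)
kl_pq = integrate(p * np.log(p / q), t)
kl_qp = integrate(q * np.log(q / p), t)

# Bhattacharya distance: -log int sqrt(p q) dt
bhatt = -np.log(integrate(np.sqrt(p * q), t))
```

For these two normals the closed forms give KL(p, q) ≈ 0.443, KL(q, p) ≈ 1.307, and B(p, q) ≈ 0.162, which the numerical integrals reproduce.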
In lieu of the above two distance measures, we could simply compare the two curves directly on the likelihood scale. In this paper we make use of the bootstrap approach to resample from the actual sample (or simulate data from known distributions) to obtain a "bootstrap sample". For each bootstrap sample, the observed relative likelihood function and the corresponding asymptotic (normal) relative likelihood function are constructed and the area under the two relative likelihood functions computed. As the size of the bootstrap sample increases, we measure the convergence of the observed relative likelihood function to the asymptotic relative likelihood function. The convergence is measured by the difference in area under the curves and a dot product based measure of curve similarity. We note that simulated data are not real-world data.

Method
Background

If f(x|θ) is the pdf (pmf) of the sample, the likelihood function of θ is defined as follows:

L(θ|x) = c·f(x|θ).

For exponential families, the density function can be expressed in the following form:

f(x|θ) = h(x) exp( η(θ)T(x) − A(θ) ),

and the related likelihood function expressed as:

L(θ|x) ∝ exp( η(θ) Σᵢ T(xᵢ) − nA(θ) ).

If θ̂ is the mle of θ, then the likelihood function evaluated at θ̂ is L(θ̂|x). Thus the observed relative likelihood function R(θ) is:

R(θ) = L(θ|x) / L(θ̂|x),

and the asymptotic relative likelihood function for θ̂ assumes the following form:

R_N(θ) = exp( −I(θ̂)(θ − θ̂)² / 2 ),

where I(θ̂) is the observed Fisher information. The difference in area under the two relative likelihood functions is

∆R = | ∫ R(θ) dθ − ∫ R_N(θ) dθ |.

If this expression does not have a closed form solution, numerical methods such as Simpson's rule [36] can be applied:

∫ₐᵇ f(θ) dθ ≈ (h/3) [ f(θ₀) + 4f(θ₁) + 2f(θ₂) + ⋯ + 4f(θₙ₋₁) + f(θₙ) ],

where n (even) is the number of intervals and h = (b − a)/n.
For similar curves we would expect ∆R to be very small. A tolerance level may be set such that values of ∆R below the tolerance are acceptable [37].
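A sketch of this computation for a Poisson sample (hypothetical data; the observed information n/λ̂ follows from the Poisson log-likelihood): both relative likelihood curves are integrated by composite Simpson's rule and ∆R is the difference of the two areas.

```python
import numpy as np

def simpson(f, a, b, m=1000):
    # Composite Simpson's rule with m (even) intervals of width h
    if m % 2:
        m += 1
    t = np.linspace(a, b, m + 1)
    y = f(t)
    h = (b - a) / m
    return h / 3 * (y[0] + y[-1] + 4 * y[1:-1:2].sum() + 2 * y[2:-1:2].sum())

# Hypothetical Poisson counts
x = np.array([9, 12, 8, 11, 10, 13, 9, 10])
n, S = x.size, x.sum()
lam_hat = S / n  # mle

def R_obs(lam):
    # Observed relative likelihood R(lam) = L(lam|x) / L(lam_hat|x)
    return np.exp(S * np.log(lam / lam_hat) - n * (lam - lam_hat))

def R_asym(lam):
    # Asymptotic relative likelihood exp(-I(lam_hat)(lam - lam_hat)^2 / 2),
    # with observed information I(lam_hat) = n / lam_hat for this model
    return np.exp(-n * (lam - lam_hat) ** 2 / (2 * lam_hat))

a, b = lam_hat - 5.0, lam_hat + 5.0  # range over which the curves are compared
delta_R = abs(simpson(R_obs, a, b) - simpson(R_asym, a, b))
```

For this well-behaved sample ∆R is already small, consistent with the Poisson likelihood being nearly normal in shape around its mode.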

Proof:
The general expression for the Taylor expansion of a function f(x) around a is as follows:

f(x) = f(a) + f′(a)(x − a) + f″(a)(x − a)²/2! + ⋯ + f⁽ᵏ⁾(a)(x − a)ᵏ/k! + ⋯

Expanding the log relative likelihood log R(θ) = log L(θ|x) − log L(θ̂|x) around θ̂, the first-order term vanishes because the score is zero at the mle, and the coefficient of the quadratic term is the derivative of the score function evaluated at the mle, −I(θ̂). Thus R(θ) can be approximated as:

R(θ) ≈ exp( −I(θ̂)(θ − θ̂)² / 2 ).

The k! in the denominators of the higher order terms of the Taylor expansion shrinks them toward 0.
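The shrinking of the higher order terms can be checked numerically. A sketch for a Poisson model (simulated data; the comparison window of ±3 standard errors around the mle is an illustrative choice): the maximum gap between log R(θ) and its quadratic approximation decreases as n grows.

```python
import numpy as np

def max_quadratic_gap(x):
    # Maximum |log R(lam) - quadratic Taylor term| over lam_hat +/- 3 se
    n, S = x.size, x.sum()
    lam_hat = S / n
    se = np.sqrt(lam_hat / n)  # large-sample standard error of lam_hat
    lam = np.linspace(lam_hat - 3 * se, lam_hat + 3 * se, 201)
    log_R = S * np.log(lam / lam_hat) - n * (lam - lam_hat)
    quad = -n * (lam - lam_hat) ** 2 / (2 * lam_hat)  # second-order term
    return float(np.max(np.abs(log_R - quad)))

rng = np.random.default_rng(1)
gap_small = max_quadratic_gap(rng.poisson(10.0, size=10))
gap_large = max_quadratic_gap(rng.poisson(10.0, size=1000))
# The gap shrinks roughly like 1/sqrt(n), so gap_large << gap_small
```

On the θ̂ ± 3·se scale the cubic term behaves like n^(-1/2), which is why the quadratic (normal) approximation of the relative likelihood improves with sample size.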
If S₁ᵢ · S₂ᵢ = ‖S₁ᵢ‖ ‖S₂ᵢ‖, the segment vectors S₁ᵢ and S₂ᵢ are parallel. This is the case of perfect similarity. If S₁ᵢ · S₂ᵢ = −‖S₁ᵢ‖ ‖S₂ᵢ‖, the vectors point in opposite directions. This is the case of perfect dissimilarity. Ideally, if the two curves were exactly the same, we would expect

Σᵢ S₁ᵢ · S₂ᵢ = Σᵢ ‖S₁ᵢ‖ ‖S₂ᵢ‖.   (3)

A Dissimilarity Index
Equation (3) can be used to express disagreement between the two curves through a dissimilarity index. If D is the dissimilarity index between the two curves, then

D = 1 − ( Σᵢ S₁ᵢ · S₂ᵢ ) / ( Σᵢ ‖S₁ᵢ‖ ‖S₂ᵢ‖ ),

so that D = 0 when the two curves are perfectly similar. The proposed simulation based approach can be summarized in the following steps.
1. For a given sample x = (x₁, …, xₙ), choose tolerance levels for ∆R and D.
2. Compute ∆R and D for the given sample.
3. If ∆R and D are not sufficiently close to 0, bootstrap from the original sample and compute ∆R and D again for the bootstrap sample.
4. Repeat step 3 until satisfactory convergence is achieved, i.e., ∆R and D are less than the chosen tolerance levels.
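The steps above can be sketched end to end. The sketch below assumes a Poisson model, illustrative tolerance levels, a simple trapezoidal area in place of Simpson's rule, and a grid of ±4 around the mle; it doubles the bootstrap sample size until both measures fall below tolerance (with a safety cap).

```python
import numpy as np

rng = np.random.default_rng(2)

def measures(x):
    # Compute (delta_R, D) for a Poisson model on a grid around the mle
    n, S = x.size, x.sum()
    lam_hat = S / n
    grid = np.linspace(max(lam_hat - 4.0, 0.1), lam_hat + 4.0, 801)
    R = np.exp(S * np.log(grid / lam_hat) - n * (grid - lam_hat))
    R_N = np.exp(-n * (grid - lam_hat) ** 2 / (2 * lam_hat))
    dx = grid[1] - grid[0]
    area = lambda y: np.sum(0.5 * (y[1:] + y[:-1])) * dx  # trapezoidal rule
    delta_R = abs(area(R) - area(R_N))
    # Dissimilarity index from segment vectors (dx, dy) on each curve
    s1 = np.stack([np.full(grid.size - 1, dx), np.diff(R)], axis=1)
    s2 = np.stack([np.full(grid.size - 1, dx), np.diff(R_N)], axis=1)
    dots = np.sum(s1 * s2, axis=1).sum()
    norms = (np.linalg.norm(s1, axis=1) * np.linalg.norm(s2, axis=1)).sum()
    D = 1.0 - dots / norms
    return delta_R, D

# Step 1: a given sample and (illustrative) tolerance levels
x = rng.poisson(10.0, size=5)
tol_area, tol_D = 0.01, 0.001

# Steps 2-4: compute, then bootstrap larger samples until both measures
# fall below tolerance (capped for safety)
sample, m = x, x.size
delta_R, D = measures(sample)
while (delta_R >= tol_area or D >= tol_D) and m < 5000:
    m *= 2
    sample = rng.choice(x, size=m, replace=True)
    delta_R, D = measures(sample)
```

The smallest m at which the loop exits is the informal "sample size for satisfactory asymptotic convergence" under the chosen tolerances.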
The next section contains several simulated examples to demonstrate the application of the above method.

Results
In this section, we examine the convergence of likelihood functions for some of the common distributions, using simulated data as well as data from the literature.

Simulation studies
The convergence of the observed relative likelihood function to the asymptotic relative likelihood function implies that the higher order terms in the Taylor expansion are converging to zero. Our method graphically demonstrates this as a function of n. The idea is to divide the support of the two curves into sufficiently small segments so that each of them can be approximated by a line segment (Figure 2). Each of these segments is equivalent to a vector in two dimensions, and hence we can compute the dot product of the two vectors in each segment. If the two vectors are parallel in each segment, the two curves have similar local curvature and hence are locally similar. In other words, for similar curves, the dot product between the two vectors is equal to the product of their individual L2 norms over each segment (Figure 2).

Curve dissimilarity index
Let θᵢ, i = 1, …, n+1, be the points over which the two curves are segmented, i.e., there are n segments of the two curves in total. (Figure 2: the red and black curves represent two functions; the segment vectors S₁ᵢ(θ) and S₂ᵢ(θ) are formed by connecting consecutive points on each curve.)
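A sketch of this segment-vector construction (the curves here are arbitrary illustrative functions, and the index is taken as one minus the ratio of summed dot products to summed norm products):

```python
import numpy as np

def dissimilarity_index(y1, y2, grid):
    # Segment both curves at the grid points; segment i of curve j is the
    # 2-d vector S_ji = (d_theta_i, d_y_i).  The index is
    #   D = 1 - sum_i(S_1i . S_2i) / sum_i(||S_1i|| ||S_2i||),
    # which is 0 when the curves are everywhere locally parallel
    d_theta = np.diff(grid)
    s1 = np.stack([d_theta, np.diff(y1)], axis=1)
    s2 = np.stack([d_theta, np.diff(y2)], axis=1)
    dots = np.sum(s1 * s2, axis=1).sum()
    norms = (np.linalg.norm(s1, axis=1) * np.linalg.norm(s2, axis=1)).sum()
    return 1.0 - dots / norms

theta = np.linspace(-3.0, 3.0, 601)
bell = np.exp(-theta ** 2 / 2.0)         # a relative-likelihood-shaped curve
shifted = np.exp(-(theta - 0.5) ** 2)    # a narrower, shifted curve

d_same = dissimilarity_index(bell, bell, theta)     # identical curves
d_diff = dissimilarity_index(bell, shifted, theta)  # visibly different curves
```

Identical curves give D near 0, while curves with different local slopes give a strictly larger index; by the segment-wise Cauchy-Schwarz inequality, D always lies between 0 and 2.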

Poisson Distribution
Observed relative likelihood function (blue curve) and asymptotic relative likelihood function (red curve) show increasing overlap as sample size increases. The values of the difference in area (red line) and dissimilarity index (green line) are reported for data of different sample sizes generated under the Weibull distribution, together with the Pearson and Spearman rank correlations between the two measures of convergence.

A group of 20 mice were allocated to individual cages randomly. The cages were then assigned randomly to two treatments, control A and drug B. All animals were infected with tuberculosis, and the number of days until each mouse died was recorded (Table 4).
For mice assigned to the drug, the mean and variance are roughly equal and the data are counts, so a Poisson model is a reasonable choice. Based on the proposed methods, the difference in area under the curves and the dissimilarity index were found to be ∆A = 0.00204 and D = 0.0066, respectively, indicating that the asymptotic normality approximation of the mle holds for the drug B data above (Figure 7).

b) Data from Williams et al. (1995)
The following data were obtained from Williams et al. [39,40]. The data are the weight (in grams) of dry seed in the stomach of each spinifex pigeon.

Exponential Distribution (Williams et al.)
Observed relative likelihood function (blue curve) and asymptotic relative likelihood function (red curve) show increasing overlap as sample size increases. The plot of the relative and relative normal likelihood functions, together with the values of ∆A and D, is given in Figure 8. While the difference in area is small enough, the dissimilarity index is fairly high. It was seen that with larger (bootstrap) samples the dissimilarity index and the difference in area both decrease (Table 5, Figures 9 and 10).

c) Data from Breslow (1984)

The data set is taken from a paper by Breslow, who proposed an iterative algorithm for fitting over-dispersed Poisson log-linear models. The dataset provides the number of revertant colonies of TA98 Salmonella observed on each of the plates processed at 6 dose levels of quinoline [39].

Data (Breslow et al.)
Observed relative likelihood function (blue curve) and asymptotic relative likelihood function (red curve) show increasing overlap as sample size increases. The two convergence measures (Table 6 and Figure 11) suggest that the data at each dose level are large enough for the mle to satisfy asymptotic normality.

Discussion
Our work discusses the issue of the sample size required for asymptotic normality of the mle to hold. We proposed two diagnostic measures for this purpose: ∆R, the difference in area under the observed relative likelihood and asymptotic relative likelihood curves, and D, a dissimilarity index which measures the difference in shape of the two curves. The simulated results show that different distributions have different thresholds of ∆R and D, giving an informal measure of convergence in real-world settings. For example, if we believe that the data at hand follow a Poisson(λ = 10) distribution, we could compute ∆R and D and compare them with the tabulated values in Table 2. If the computed ∆R and D are close to the tabulated values for the given sample size, the assumption of asymptotic normality of the mle is reasonable.
The two measures of convergence were also applied to data from the literature, and bootstrap techniques were used to assess the convergence of the relative likelihood functions as the sample size increased. As seen from the simulated examples as well as the examples from the literature, a "sample size of 30" can be far more than what is actually needed, and the sample size required for satisfactory asymptotic convergence differs across distributions. For example, with the Poisson(λ = 10) distribution, it was seen that samples of sizes less than 10 show convincing convergence. Our future work is directed at generalizing these diagnostic measures to distributions with parameters of more than one dimension.