Received Date: March 10, 2017; Accepted Date: March 27, 2017; Published Date: March 30, 2017
Citation: McCracken CE, Looney SW (2017) On Finding the Upper Confidence Limit for a Binomial Proportion when Zero Successes are Observed. J Biom Biostat 8: 338. doi:10.4172/2155-6180.1000338
Copyright: © 2017 McCracken CE, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
We consider confidence interval estimation for a binomial proportion when the data have already been observed and x, the observed number of successes in a sample of size n, is zero. In this case, the main objective of the investigator is usually to obtain a reasonable upper bound for the true probability of success, i.e., the upper limit of a one-sided confidence interval. In this article, we use observed interval length and p-confidence to evaluate eight methods for finding the upper limit of a confidence interval for a binomial proportion when x is known to be zero. Long-run properties such as expected interval length and coverage probability are not applicable because the sample data have already been observed. We show that many popular approximate methods that are known to have good long-run properties in the general setting perform poorly when x=0 and recommend that the Clopper-Pearson exact method be used instead.
p-Confidence; Exact methods; Interval length; One-sided interval; Sample size
This study was motivated when one of the authors was approached by a clinical investigator in the dental school who was seeking advice on how to design a research study consisting of a sequence of independent trials, each with only two possible outcomes (“success” and “failure”). Based on a previously published study he had conducted, as well as his clinical judgment and experience, the investigator had strong reason to believe that the observed number of “failures” in the study he was planning would likely be zero. His question for us was, “Assuming that there will be no failures in my study, how many trials do I need to conduct so that I can be reasonably sure that the true probability of failure is no greater than 0.05?” Since the trials can be assumed to be independent, it is reasonable to assume that X, the number of “failures” in the clinical study, follows a binomial distribution, with n=the number of independent trials and π=the probability of “failure” on any one trial.
Many research studies in clinical areas and other applied fields meet the assumptions of the binomial distribution. (Without loss of generality, we will refer to the two outcomes as “success” and “failure” and the outcome of primary interest as “success.”) Furthermore, it is not uncommon in these studies for x, the observed number of successes in a sample of size n, to be zero. Examples can be found in [2-6].
Only a few publications in the statistical literature have examined or compared methods for analyzing binomial data in which the observed number of successes is zero; for example, see [7-12]. Observing X=0 in a binomial sample lends itself to the Bayesian approach since one can condition on the observed data, and several authors have considered this approach [13-16]. In the present article, we approach the problem from a frequentist point of view; however, we also include a Bayesian approach for comparison purposes.
We restrict our attention to the situation in which one is interested only in finding the one-sided (upper) confidence limit for the true value of the binomial proportion when the number of successes has already been observed to be zero. The use of one-sided confidence limits in this situation is controversial [8,10], but some authors have recommended their use [17,18]. Following our discussions with the clinical investigator, we decided that a one-sided upper confidence limit would be appropriate. Our review of the relevant literature indicated that there was no clear consensus on the best upper confidence limit to use when the observed number of successes is zero, especially if one wished to approach the problem from a frequentist perspective. Hence, in order to be able to provide a well-informed response to the dental researcher's question, we concluded that it would be worthwhile to systematically compare the most widely-used binomial confidence interval (C.I.) methods under the assumption that x=0.
In Section 2, we describe each of the methods for finding confidence limits for a binomial proportion that we compare; in Section 3, we describe our methodology for comparing the performance of these methods; in Section 4, we provide a summary of the results of our comparisons; and in Section 5, we discuss our results and make recommendations concerning the best methods to use in practice.
In this section, we describe the eight methods we compared for finding πu, the upper 100 (1-α)% confidence limit for the true probability of success for a binomial random variable, under the assumption that the number of successes has already been observed to be 0. We selected these methods either because (1) they are commonly covered in introductory statistics or biostatistics textbooks, or (2) they have been recommended for general use when finding confidence limits for a binomial proportion.
One method we did not include is the Wald interval, which continues to be one of the most widely used methods for finding confidence limits for a binomial proportion. The Wald interval is based on the normal approximation to the binomial, and the approximate 100 (1-α)% upper confidence limit for π for a one-sided interval is given by:
$$\pi_u = p + z_{1-\alpha}\sqrt{\frac{p(1-p)}{n}},$$
where p=x/n is the observed proportion of successes and $z_{1-\alpha}$ is the standard normal deviate associated with an upper-tail area of α. However, when p=0, the Wald interval is (0, 0) for any confidence coefficient and is obviously uninformative. Nevertheless, such intervals are still reported in the scientific literature.
If one uses the normal approximation to the binomial with continuity correction, the upper confidence limit for π is given by:
$$\pi_u = p + z_{1-\alpha}\sqrt{\frac{p(1-p)}{n}} + \frac{1}{2n}.$$
When x=0, this reduces to $\pi_u = 1/(2n)$, which is not a valid confidence limit because it does not depend on the confidence coefficient.
The Agresti-Coull method
Agresti and Coull proposed a modification to the standard 95% Wald interval that consisted of adding four pseudo observations (two successes and two failures) to the sample. When x=0, their “adjusted Wald” approximate 95% upper confidence limit is given by:
$$\pi_u = \tilde{p} + z_{0.95}\sqrt{\frac{\tilde{p}(1-\tilde{p})}{n+4}}, \qquad \tilde{p} = \frac{2}{n+4}.$$
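As a rough numerical illustration of the adjusted Wald limit for x=0, the sketch below adds two successes and two failures and then applies a Wald-type upper limit. The function name is ours, and the use of the one-sided quantile z at 1-α is an assumption on our part; it is chosen because it gives an upper limit whose Clopper-Pearson confidence coefficient is about 98.5% at n=70, the value quoted for this method in the worked example.

```python
from math import sqrt
from statistics import NormalDist


def agresti_coull_upper_zero(n, alpha=0.05):
    """Adjusted Wald upper limit when x = 0 (hypothetical helper name)."""
    z = NormalDist().inv_cdf(1 - alpha)  # one-sided normal quantile (assumption)
    n_adj = n + 4                        # four pseudo-observations added
    p_adj = 2 / n_adj                    # (x + 2) / (n + 4) with x = 0
    return p_adj + z * sqrt(p_adj * (1 - p_adj) / n_adj)


# Upper limit for the motivating example with n = 70 trials.
print(round(agresti_coull_upper_zero(70), 4))
```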
Wilson’s “Score” interval
Wilson proposed a method based on inverting the score test of the null hypothesis H0: π=π0. It is similar to the Wald interval except that the standard error is based on the hypothesized null value π0 instead of the observed proportion of success p. The approximate 100 (1-α)% upper confidence limit is given by
$$\pi_u = \frac{z_{1-\alpha}^{2}}{n + z_{1-\alpha}^{2}}$$
when zero successes are observed.
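The closed form that results from setting p=0 in the score equation can be checked numerically. The following is a minimal sketch with a function name of our own choosing; it evaluates the limit and confirms that it satisfies the score equation π = z·sqrt(π(1-π)/n).

```python
from math import sqrt
from statistics import NormalDist


def wilson_upper_zero(n, alpha=0.05):
    """Score (Wilson) upper limit when x = 0: z^2 / (n + z^2)."""
    z2 = NormalDist().inv_cdf(1 - alpha) ** 2
    return z2 / (n + z2)


n = 70
z = NormalDist().inv_cdf(0.95)
u = wilson_upper_zero(n)
# Residual of the score equation with p = 0; should be essentially zero.
residual = u - z * sqrt(u * (1 - u) / n)
print(round(u, 4), abs(residual) < 1e-12)
```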
Continuity-corrected Wilson interval
Under some conditions, Wilson’s interval fails to maintain its prespecified 1-α nominal coverage probability. Therefore, Casella recommended using the continuity-corrected normal approximation to the binomial when inverting the score test. When x=0, the approximate upper limit for a 100 (1-α)% C.I. is:
$$\pi_u = \frac{z_{1-\alpha}^{2} + 1 + z_{1-\alpha}\sqrt{z_{1-\alpha}^{2} + 2 - 1/n}}{2\,(n + z_{1-\alpha}^{2})}.$$
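For readers who want to reproduce the continuity-corrected value, here is a small sketch. It assumes the limit is the larger root of π - 1/(2n) = z·sqrt(π(1-π)/n), which is what inverting the continuity-corrected score test gives when p=0; the function name is ours.

```python
from math import sqrt
from statistics import NormalDist


def wilson_cc_upper_zero(n, alpha=0.05):
    """Continuity-corrected score upper limit when x = 0 (hypothetical name)."""
    z = NormalDist().inv_cdf(1 - alpha)
    z2 = z * z
    # Larger root of (n + z^2) pi^2 - (1 + z^2) pi + 1/(4n) = 0.
    return (z2 + 1 + z * sqrt(z2 + 2 - 1 / n)) / (2 * (n + z2))


# With n = 70 this rounds to the 0.0500 reported for this method in Table 1.
print(round(wilson_cc_upper_zero(70), 4))
```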
Clopper-Pearson “exact” (C-P) method
Clopper and Pearson proposed what is usually called the “exact” method for constructing binomial confidence limits. This method is based on the inversion of the exact test of H0: π=π0 using the binomial distribution. Thus, the upper limit of an exact 100 (1-α)% one-sided interval is found by solving for πu in the following equation:
$$\sum_{k=0}^{x}\binom{n}{k}\,\pi_u^{k}\,(1-\pi_u)^{n-k} = \alpha.$$
When x=0 successes are observed, the exact method provides the following 100 (1-α)% upper confidence limit:
$$\pi_u = 1 - \alpha^{1/n}.$$
Louis noted that $1-\alpha^{1/n}$ is “exactly the 95 percent Bayesian prediction interval based on a uniform prior distribution for π.” In addition, the C-P upper limit is equivalent to the upper limit of the modified Jeffreys interval proposed by Brown, Cai, and DasGupta when x=0.
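To illustrate the inversion, the sketch below solves the exact-test equation P(X ≤ x | π) = α by bisection and compares the result for x=0 with the closed form 1 - α^(1/n). The helper names and the bisection tolerance are our own choices and are not taken from the original analysis.

```python
from math import comb


def binom_cdf(x, n, pi):
    """P(X <= x) for X ~ Binomial(n, pi)."""
    return sum(comb(n, k) * pi**k * (1 - pi) ** (n - k) for k in range(x + 1))


def cp_upper(x, n, alpha=0.05, tol=1e-12):
    """One-sided exact upper limit: solve binom_cdf(x, n, pi) = alpha for pi."""
    if x == n:
        return 1.0
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        # The CDF decreases in pi, so move right while it is still above alpha.
        if binom_cdf(x, n, mid) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


closed_form = 1 - 0.05 ** (1 / 70)
print(round(cp_upper(0, 70), 4), round(closed_form, 4))
```

The same solver works for x>0, where no simple closed form is available.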
Mid-P adjusted Clopper-Pearson method
Some authors have recommended using the mid-p adjustment to help overcome the possibly extreme conservatism of the Clopper-Pearson method, especially when n is small. This approach involves inverting the mid-p adjusted exact test of H0: π=π0. The mid-p adjusted one-tailed p-value is obtained by subtracting half the point probability of the observed value of X from the appropriate one-tailed exact p-value. When x=0, the mid-p adjustment yields an upper 100 (1-α)% confidence limit of
$$\pi_u = 1 - (2\alpha)^{1/n}.$$
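When x=0 the mid-p adjusted p-value is half the probability of zero successes, (1-π)^n / 2, so setting it equal to α gives a closed form. A minimal sketch follows, with a function name of our own; it also shows that the probability of zero successes at the limit is 2α regardless of n.

```python
def mid_p_upper_zero(n, alpha=0.05):
    """Mid-p adjusted exact upper limit when x = 0: solve (1 - pi)^n / 2 = alpha."""
    return 1 - (2 * alpha) ** (1 / n)


for n in (5, 70, 100):
    u = mid_p_upper_zero(n)
    # Probability of zero successes at the limit; equals 2 * alpha for every n.
    print(n, round(u, 4), round((1 - u) ** n, 4))
```

This is why a nominal 95% mid-p limit behaves like a 90% exact limit in this setting.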
The binomial distribution with large n and small π can be approximated by the Poisson distribution with parameter λ=nπ, and the resulting approximation can be used to derive approximate confidence limits for π. For x=0, the Poisson approximation to the binomial yields an approximate upper 100 (1-α)% confidence limit of
$$\pi_u = \frac{-\ln \alpha}{n}.$$
The Poisson approximation for a 95% upper limit is essentially equivalent to the Rule of 3, in which πu=3/n, because -ln(0.05) ≈ 2.996.
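The link between the Poisson limit and the Rule of 3 is easy to see numerically. In this sketch (function names are ours), the Poisson limit solves exp(-nπ) = α, and for α=0.05 the numerator -ln(α) is just under 3.

```python
from math import log


def poisson_upper_zero(n, alpha=0.05):
    """Poisson-approximation upper limit when x = 0: solve exp(-n * pi) = alpha."""
    return -log(alpha) / n


def rule_of_three(n):
    """Rule of 3 shortcut for an approximate 95% upper limit."""
    return 3 / n


for n in (10, 70, 100):
    print(n, round(poisson_upper_zero(n), 4), round(rule_of_three(n), 4))
```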
The single augmentation with an imaginary failure or success (SAIFS) method
Borkowf suggested an approach similar to the Agresti-Coull method. His method adds an imaginary failure to the sample when finding the lower confidence limit and an imaginary success when finding the upper confidence limit. When zero successes occur in a sample of size n, the SAIFS method gives a one-sided upper confidence limit of
Borkowf showed that replacing the $z_{1-\alpha}$ quantile with a $t_{1-\alpha}$ quantile with n-1 degrees of freedom could improve coverage probabilities, so that is the method we will consider here.
Bayes-Laplace HPD interval
The Bayes-Laplace (B-L) highest posterior density (HPD) credible interval for a binomial proportion is derived using the Bayesian approach with a non-informative Bayes-Laplace prior of beta (1, 1). Tuyl, Gerlach, and Mengersen recommended the B-L prior as a consensus prior when x=0 is observed, and this approach yields the following 100 (1-α)% upper confidence limit:
$$\pi_u = 1 - \alpha^{1/(n+1)}.$$
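A brief sketch of where this limit comes from: with a beta (1, 1) prior and x=0, the posterior is beta (1, n+1), whose density decreases in π, so the HPD region is (0, πu) with posterior mass 1-α below πu. The function name below is ours, and the code relies on the closed-form beta (1, n+1) distribution function 1-(1-π)^(n+1).

```python
def bayes_laplace_upper_zero(n, alpha=0.05):
    """HPD upper limit when x = 0 under a beta(1, 1) prior."""
    return 1 - alpha ** (1 / (n + 1))


n = 70
u = bayes_laplace_upper_zero(n)
# Posterior mass below u under beta(1, n + 1); should equal 1 - alpha.
posterior_mass = 1 - (1 - u) ** (n + 1)
print(round(u, 4), round(posterior_mass, 4))
```

Because the exponent uses n+1 rather than n, this limit is always slightly smaller than the Clopper-Pearson limit for the same n.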
We illustrate the methods described above using data from the clinical study that motivated this investigation. Because dental devices emit electromagnetic energy, dental patients with implanted biotechnological devices can be at risk for adverse events relating to the functioning of their implants. The implant of interest in that study was a neurostimulator (NS) device commonly used to relieve various types of pain, including spinal axial pain, ischemic limb pain, certain anginal pain syndromes, and chronic diabetic neuropathic pain. The purpose of the study was to investigate the effects of electromagnetic interference on an NS during the operation of three dental devices: an electric pulp tester, an apex locator, and an electrocautery unit. The investigator wished to estimate the proportion of experimental trials in which the dental device would damage an NS device located in different tissues. Based on his previously published study, in which he found that various dental devices had no impact on cochlear implants, the investigator had strong reason to believe that the observed number of “failures” in the NS study he was planning would also be zero. He wished to know the number of trials he would need to conduct to be 95% confident that the true probability of failure is no greater than πmax=0.05. A sample size calculation for the C-P method assuming that a one-sided confidence limit would be used in the data analysis yielded
$$n \ge \frac{\ln \alpha}{\ln(1-\pi_{max})} = \frac{\ln 0.05}{\ln 0.95} \approx 58.4,$$
so that at least 59 trials with no failures would be required.
The investigator chose to perform 70 independent trials in order to ensure that data from an adequate number of valid trials would be available. Once the study data were collected, no damage to the NS implant was observed in any of the n=70 trials. For purposes of illustration, we calculated the upper 95% confidence limit using these data for each of the eight methods described above (Table 1). In Table 1, we have also indicated the confidence coefficient of the C-P interval based on each of these upper limits. These coefficients range from 90.0% for the mid-p adjusted C-P upper limit to 98.5% for the Agresti-Coull upper limit.
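The sample size question can be answered by solving 1 - α^(1/n) ≤ πmax for n, which gives n ≥ ln(α)/ln(1-πmax). The following sketch is our own illustration of that calculation with a hypothetical function name; it is not the authors' software.

```python
from math import ceil, log


def min_trials_zero_events(pi_max, alpha=0.05):
    """Smallest n for which the C-P upper limit with x = 0 is at most pi_max."""
    return ceil(log(alpha) / log(1 - pi_max))


n_required = min_trials_zero_events(pi_max=0.05, alpha=0.05)
print(n_required)
# Check the boundary: the limit is at most 0.05 at n_required but not one trial earlier.
print(1 - 0.05 ** (1 / n_required) <= 0.05, 1 - 0.05 ** (1 / (n_required - 1)) <= 0.05)
```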
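| Method | Formula for Upper Limit | Upper Limit | Clopper-Pearson Confidence Coefficient† (%) |
|---|---|---|---|
| Continuity-Corrected Wilson | $\dfrac{z_{1-\alpha}^{2} + 1 + z_{1-\alpha}\sqrt{z_{1-\alpha}^{2} + 2 - 1/n}}{2\,(n + z_{1-\alpha}^{2})}$ | 0.0500 | 97.2 |
| Mid-P Adjusted Clopper-Pearson | $1 - (2\alpha)^{1/n}$ | 0.0324 | 90.0 |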
SAIFS: Single Augmentation with an Imaginary Failure or Success
†The Clopper-Pearson confidence coefficient based on the specified upper limit.
Table 1: Upper Confidence Limits for One-sided Intervals when x=0, n=70, α=.05.
Criteria for comparing the confidence interval methods
In this article, we use two criteria to evaluate the different methods for constructing upper one-sided binomial confidence limits when x is known to be zero: observed interval length and p-confidence. Properties such as coverage probability and expected interval length that are typically used to evaluate long-run performance of C.I. procedures are not applicable for our study because we consider only the situation when the data have already been observed and the value of x is known to be zero. Therefore, the expected values required for coverage probability and expected interval length cannot be calculated.
Observed interval length
Observed interval length is calculated by subtracting the lower confidence limit of a C.I. from the upper confidence limit. Since we are only interested in calculating the upper confidence limit for a one-sided interval, the lower limit is assumed to be zero, and therefore the observed interval length is simply the value of the upper confidence limit.
In the general case, p-confidence is defined for the binomial in terms of the equal-tail p-value function using the binomial distribution and the observed number of successes. In the upper-tailed one-sided case, when x=0, the p-value function reduces to a function of π alone:
$$P(\pi) = \Pr(X = 0 \mid \pi) = (1-\pi)^{n}.$$
Let πu=the upper confidence limit of the one-sided 100(1-α)% C.I. computed using any particular method (continuity-corrected Wilson, Agresti-Coull, etc.). For a C.I. I(0) based on zero successes,
$$\text{p-confidence} = \Big[1 - \sup_{\pi \notin I(0)} P(\pi)\Big] \times 100\% = \big[1 - (1-\pi_u)^{n}\big] \times 100\%.$$
Vos and Hudson give the following interpretation of p-confidence: “The p-confidence is a measure of how large the associated p values are for parameters outside an interval. If we are interpreting a C.I. as the inversion of a hypothesis test, then values of the parameter outside a good approximate C.I. should not have associated p values that are appreciably larger than α where (1−α) × 100% is the confidence level.” Thus, it is desirable for p-confidence to be close to the nominal 100(1-α)% confidence coefficient for approximate confidence intervals.
The relationship between coverage probability, which is a measure of the long-run performance of a C.I. procedure, and p-confidence, which is a “post-data” measure of performance, has a connection with the relationship between Neyman’s Type I error rate and Fisher’s p-value in hypothesis testing. The concept of p-confidence is closely related to the concept of consonance, and therefore is especially helpful when thinking of C.I.’s as inversions of hypothesis testing procedures, as in the present article. It is useful when interpreting the interval (0, πu) to ask the question: “How large could the probability of zero successes in n trials be for values outside this interval?” The answer is “1–p-confidence.”
For example, consider the 95% Agresti-Coull interval when x=0. The values of p-confidence for these intervals range from 95% for n=5 to 98.5% for n=100. In other words, the probability of zero successes in n trials for values outside these intervals ranges from 0.015 for n=100 to 0.05 for n=5. In contrast, the probability of zero successes in n trials for values outside the 95% C-P interval is 0.05 for all n.
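As an illustration of this post-data calculation, the sketch below computes p-confidence for a one-sided interval (0, πu) with x=0 as 100·[1-(1-πu)^n], that is, one minus the largest probability of zero successes attainable outside the interval. The function name is ours. The two upper limits used as inputs are the values reported in Table 1 for n=70.

```python
def p_confidence(upper_limit, n):
    """p-confidence (in percent) of the interval (0, upper_limit) when x = 0."""
    # The probability of zero successes is largest, outside the interval,
    # at the boundary value pi = upper_limit.
    return 100 * (1 - (1 - upper_limit) ** n)


# Table 1 upper limits for n = 70 and nominal 95% confidence.
for method, limit in [("Continuity-corrected Wilson", 0.0500),
                      ("Mid-p adjusted Clopper-Pearson", 0.0324)]:
    print(method, round(p_confidence(limit, 70), 1))
```

For x=0 this quantity coincides with the Clopper-Pearson confidence coefficient reported in the last column of Table 1.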
Another criterion that can be used to evaluate C.I.’s after the data have been observed is p-bias. However, p-bias is equal to zero for the one-sided intervals examined in this article, so it will not be considered further.
To assist us in determining if the p-confidence for an approximate C.I. procedure differs in a meaningful way from the nominal 100 (1-α)% confidence level, we adapted the following “liberal” guideline proposed by Bradley for evaluating the robustness of a statistical test: if the true significance level differs from the nominal level by no more than α/2, one can conclude that the test is robust. If the true significance level differs by more than α/2 from the nominal level (either above or below), one can conclude that the test is not robust. In the present study, we applied the Bradley criterion as follows: if the p-confidence differed from the nominal confidence level by no more than α/2, we concluded that the p-confidence for that procedure was within acceptable limits. If the p-confidence differed by more than α/2 from the nominal confidence level (either above or below), we concluded that the p-confidence for that method was not acceptable. Thus, for a 90% C.I., the p-confidence for a C.I. procedure must be between 85% and 95% to be acceptable; for 95% and 99% C.I.’s, these limits are 92.5% to 97.5% and 98.5% to 99.5%, respectively.
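The acceptability rule can be written down directly. The sketch below works in percentage points and uses function names of our own; it reproduces the limits quoted above and applies the rule to three of the Clopper-Pearson confidence coefficients reported for the n=70 example.

```python
def acceptability_limits(nominal_pct):
    """Bradley-style limits: nominal level plus or minus alpha/2, in percent."""
    half_alpha_pct = (100 - nominal_pct) / 2
    return nominal_pct - half_alpha_pct, nominal_pct + half_alpha_pct


def is_acceptable(p_conf_pct, nominal_pct):
    """True if the p-confidence lies within the acceptability limits."""
    lower, upper = acceptability_limits(nominal_pct)
    return lower <= p_conf_pct <= upper


for level in (90, 95, 99):
    print(level, acceptability_limits(level))

# p-confidence values reported for n = 70 at the nominal 95% level.
for value in (97.2, 90.0, 98.5):
    print(value, is_acceptable(value, 95))
```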
Observed interval length
Figure 1 displays the ratio of observed C.I. length relative to the length of the C-P interval for n=5 (1) 10 (5) 50 (10) 100 and a nominal confidence level of 95%. (Similar figures for 90% and 99% intervals are available from the second author. The conclusions based on these figures are very similar to those based on Figure 1). Of the seven methods that were evaluated relative to the C-P method, the Wilson and mid-p methods produce the shortest intervals for values of n between 5 and 100 (Figure 1). The B-L HPD intervals and the SAIFS intervals for n>10 are also consistently shorter than the C-P intervals. The 95% C.I.’s based on the Poisson approximation (the Rule of 3) and the continuity-corrected Wilson intervals are always longer than the C-P intervals. The Agresti-Coull method consistently produces the longest 95% intervals for n>10.
Figure 1: Ratio of interval length relative to the Clopper-Pearson (C-P) interval for the seven 95% confidence intervals when n=5 (1) 10 (5) 50 (10) 100. A-C: Agresti-Coull Interval; Wilson_CC: Continuity-Corrected Wilson Interval; SAIFS: Single Augmentation with an Imaginary Failure or Success; B-L: Bayes-Laplace Highest Posterior Density Interval.
Figures 2 through 4 give the range of p-confidence values for n=5 (1) 10 (5) 50 (10) 100 and confidence coefficients of 90%, 95%, and 99%, respectively. The “acceptability” limits are indicated on the figures by bold, dotted black lines.
Figure 2: P-confidence for the seven 90% confidence intervals when n=5 (1) 10 (5) 50 (10) 100. The bold face dotted lines indicate the arbitrary “acceptability limits” defined in the text. Wilson_CC: Continuity-Corrected Wilson Interval; C-P: Clopper-Pearson Interval; SAIFS: Single Augmentation with an Imaginary Failure or Success; B-L: Bayes-Laplace Highest Posterior Density Interval.
For 90% C.I.’s, the C-P method, the continuity-corrected Wilson method, the SAIFS method, and the B-L HPD method maintain p-confidence within the acceptable limits of 85% to 95% for all n (Figure 2). Intervals based on the Poisson approximation have acceptable p-confidence except when n=5. The mid-p and Wilson intervals have p-confidence well below the lower acceptability limit of 85% for all values of n.
Similar to the results for 90% intervals, the p-confidence results for 95% intervals for the continuity-corrected Wilson method and the SAIFS method fall within the acceptability limits for all n (Figure 3). For the B-L HPD intervals, p-confidence falls within the acceptability limits as long as n ≥ 7.
Figure 3: P-confidence for the eight 95% confidence intervals when n=5 (1) 10 (5) 50 (10) 100. The bold face dotted lines indicate the arbitrary “acceptability limits” as defined in the text. A-C: Agresti-Coull Interval; Wilson_CC: Continuity-Corrected Wilson Interval; C-P: Clopper-Pearson Interval; SAIFS: Single Augmentation with an Imaginary Failure or Success; B-L: Bayes-Laplace Highest Posterior Density Interval. Note that for 95% intervals, the Poisson approximation is equivalent to the Rule of Three.
In contrast, the 95% Agresti-Coull intervals maintain p-confidence above 98% for n ≥ 20. The Poisson (Rule of 3) intervals have high values of p-confidence for values of n<9 but as n increases, the p-confidence falls between the acceptability limits. The mid-p method maintains a p-confidence of 90%, well outside the lower acceptability limit. The Wilson intervals eventually reach an acceptable level of p-confidence when n ≥ 30, but p-confidence can be as low as 88.5% for smaller values of n.
For 99% C.I.’s, p-confidence for the mid-p intervals is 98%, below the lower acceptability limit of 98.5% (Figure 4). Except for small values of n, the SAIFS intervals have p-confidence below 98.5% and can be as low as 96.6%. For n<20, the Poisson approximation method produces p-confidence as high as 99.9%. The continuity-corrected Wilson intervals have acceptable p-confidence only for n<10; otherwise, p-confidence is greater than the upper acceptability limit. In contrast, the Wilson intervals produce p-confidence less than the lower acceptability limit for n<9 and acceptable values otherwise. The B-L HPD intervals have unacceptable p-confidence for n ≤ 10, but acceptable values otherwise.
Figure 4: P-confidence for the seven 99% confidence intervals when n=5 (1) 10 (5) 50 (10) 100. The bold face dotted lines indicate the arbitrary “acceptability limits” defined in the text. Wilson_CC: Continuity-Corrected Wilson Interval; C-P: Clopper-Pearson Interval; SAIFS: Single Augmentation with an Imaginary Failure or Success; B-L: Bayes-Laplace Highest Posterior Density Interval.
Our results indicate that many popular approximate methods that are known to have good long-run properties in terms of coverage probability and expected interval length in the general binomial setting perform poorly when x=0. For example, the Agresti-Coull method consistently produces unnecessarily long 95% C.I.’s, resulting in values of p-confidence much larger than 95%. Thus, we recommend that the Agresti-Coull method not be used when x=0. Although the mid-p interval has been recommended for general use when finding a C.I. for a binomial proportion, it should not be used when zero successes are observed because it produces intervals that are much too short, resulting in unacceptably low p-confidence.
If one wishes to use an approximate method, the Wilson, continuity-corrected Wilson, Poisson, SAIFS, and Bayes-Laplace HPD intervals are all acceptable for various combinations of sample size and confidence coefficient. However, none of these methods had acceptable values of p-confidence for all combinations of n=5 (1) 10 (5) 50 (10) 100 and nominal confidence coefficients of 90%, 95%, and 99%.
Despite criticism that the C-P method tends to produce confidence intervals that are too long in the general estimation setting [18,20], this method performed quite well when x=0. The moderate length of the C-P intervals, relative to the other methods, together with the fact that C-P intervals will always have p-confidence equal to the nominal confidence level, regardless of sample size, lead us to recommend the C-P method for general use when x=0. An additional advantage is that when x=0 the C-P interval is equivalent to a Bayesian credible interval based on the modified Jeffreys prior recommended by Brown, Cai, and DasGupta for general use when estimating the binomial parameter. In terms of the clinical problem that prompted our investigation, the C-P interval has another advantage in that it has a very simple interpretation for the client: “We can be quite certain that the probability of damaging the neurostimulator is less than 4.2% because, if it is greater than that, observing no damaged neurostimulators in 70 trials is very unlikely, occurring with probability less than 0.05.”
One could argue that we "stacked the deck" in favor of the C-P method in our comparison of the various methods since any exact C.I. method will always have p-confidence equal to the nominal confidence level, regardless of sample size. However, several authors have argued against use of confidence intervals based on the C-P method under any circumstances and instead recommended that an approximate method be used; for example [20,29,34,35]. Agresti and Coull even titled their article "Approximate is Better than ‘Exact’ for Interval Estimation of Binomial Proportions."
Because of this "exact vs. approximate" controversy, we undertook our study to examine the behavior of commonly used and commonly recommended approximate methods, in addition to the exact C-P method, in the special situation when x=0. We defined "acceptability limits" based on a modification of Bradley's robustness criterion to assist us in evaluating the p-confidence of the approximate C.I. methods. If one of the approximate methods had achieved acceptable p-confidence for all combinations of n and confidence coefficient that we considered, we would have also recommended it for routine use when x=0, especially if it was competitive with the C-P method in terms of observed interval length. However, none of the approximate methods satisfied these conditions. For this reason, we recommend that the Clopper-Pearson exact method be used to estimate the upper confidence limit of a one-sided interval whenever x=0 is observed.
The authors wish to thank Steven Roberts for providing us with the clinical example that motivated this study, Carl Russell for suggesting the use of exact methods for finding the upper confidence limit, and Alan Agresti, Rickey Carter, Daniel Linder, Warren May, and Paul Vos for their insightful comments on the manuscript.