Adefisoye JO, Golam Kibria BM* and George F
Department of Mathematics and Statistics, Florida International University, 11200 SW 8th Street, Miami, FL 33199, USA
Received Date: September 23, 2016; Accepted Date: November 08, 2016; Published Date: November 11, 2016
Citation: Adefisoye JO, Golam Kibria BM, George F (2016) Performances of Several Univariate Tests of Normality: An Empirical Study. J Biom Biostat 7: 322. doi:10.4172/2155-6180.1000322
Copyright: © 2016 Adefisoye JO, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
The problem of testing for normality is fundamental in both theoretical and empirical statistical research. This paper compares the performances of eighteen normality tests available in the literature. Since a theoretical comparison is not possible, Monte Carlo simulations were carried out from various symmetric and asymmetric distributions for sample sizes ranging from 10 to 1000. The performances of the test statistics are compared based on the empirical Type I error rate and the power of the test. The simulation results show that the Kurtosis test is the most powerful for symmetric data and the Shapiro-Wilk test is the most powerful for asymmetric data.
Chi-square; Kurtosis; Normality; Shapiro-Wilk; Simulation study; Skewness; Type I and type II errors
Primary 62F03, 62F40
Many statistical procedures, including correlation, regression, t-tests, and analysis of variance, are based on the assumption that the data follow a normal (Gaussian) distribution. In many cases the assumption of normality is critical, for instance when confidence intervals are developed for population parameters such as the mean, correlation, or variance. If the assumptions under which a statistical procedure is developed do not hold, the conclusions drawn using that procedure may not be accurate, so practitioners need to make sure the assumptions are valid. Checking the validity of the normality assumption can be done in two ways: empirically, using graphical analysis, or with formal goodness-of-fit tests. The goodness-of-fit tests, which are formal statistical procedures for assessing the underlying distribution of a data set, are our focus here; they usually provide more reliable results than graphical analysis. Many statistical tests are available in the literature to check whether a given data set comes from a normal distribution. In this article we review the most commonly used normality tests and compare them using power and observed significance level.
The first normality test in the literature is the chi-square goodness-of-fit test (Snedecor and Cochran), which was suggested by Pearson. Later, the famous Kolmogorov-Smirnov goodness-of-fit test was introduced by Kolmogorov. The Kolmogorov-Smirnov statistic quantifies the distance between the empirical distribution function of the sample and the cumulative distribution function of the reference distribution. The Anderson-Darling test assesses whether a sample comes from a specified distribution; it makes use of the fact that, given a hypothesized underlying distribution and assuming the data do arise from it, the probability integral transforms of the data follow a Uniform distribution. The Lilliefors test is a modification of Kolmogorov's test: Kolmogorov's test is appropriate when the parameters of the hypothesized normal distribution are completely known, whereas in the Lilliefors test the parameters may be estimated from the sample. We have included the Lilliefors test but not the KS test in our simulation studies, since in most practical situations the parameters of the null distribution would not be known. The Shapiro and Wilk test was the first test able to detect departures from normality due to skewness, kurtosis, or both. D'Agostino proposed a test based on transformations of the sample kurtosis and skewness. Shapiro and Francia suggested an approximation to the Shapiro-Wilk test which is known to perform well. The Jarque-Bera test is based on the sample skewness and sample kurtosis and uses the Lagrange multiplier procedure on the Pearson family of distributions to obtain tests for normality. In order to improve the efficiency of the Jarque-Bera test, Doornik and Hansen proposed a modification involving the use of the transformed skewness. The skewness test (Bai and Ng) is based on the third sample moment and is used to detect non-normality due to skewness.
In the kurtosis test (Bai and Ng [13,14]) the coefficient of kurtosis of the sample data is used to detect non-normality due to kurtosis. Gel and Gastwirth proposed a robust modification of the Jarque-Bera test: the Robust Jarque-Bera uses a robust estimate of the dispersion in the skewness and kurtosis instead of the second-order central moment. Brys et al. proposed a goodness-of-fit test based on robust measures of skewness and tail weight. Bonett and Seier suggested a modified measure of kurtosis for testing normality. Considering that the Brys test is skewness based and the Bonett-Seier test is kurtosis based, a joint test using both measures was proposed by Romao et al.; the joint test attempts to combine the two focused tests in order to increase the power to detect different kinds of departure from normality. Bontemps and Meddahi proposed a family of normality tests based on moment conditions known as Stein equations and their relation to Hermite polynomials. Gel et al. proposed a directed normality test, which focuses on detecting heavier tails and outliers of symmetric distributions. The last test in our list is the G test proposed by Chen and Ye.
Over forty different tests have been proposed over time to verify normality or the lack of it in a population. The main goal of this paper is to compare the performance of the most commonly used normality tests in terms of the power of the test and the probability of Type I error (α). Yazici and Yolacan, and more recently Yap and Sim, did some work on the comparison of normality tests, but the kurtosis and skewness tests were not in their work. Interestingly, the kurtosis test turned out to be the best test for symmetric distributions, and the skewness test performs well for both symmetric and asymmetric distributions.
The rest of the paper is organized as follows. Section 2 discusses different statistical tests for normality. A simulation study is conducted in Section 3. A real-life data set is analyzed in Section 4, and finally some concluding remarks are given in Section 5.
There are various parametric and nonparametric tests for normality available in the literature. This section discusses widely used statistical methods for testing normality.
Lilliefor’s test [LL]
The test statistic is defined as
D = max_x |F*(x) − S_n(x)|,
where S_n(x) is the empirical distribution function of the sample and F*(x) is the normal cumulative distribution function with mean and standard deviation estimated from the sample.
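As a sketch (not the paper's R code), the Lilliefors statistic — the largest distance between the empirical CDF and a normal CDF fitted with the sample mean and standard deviation — can be computed directly; note that critical values must come from Lilliefors' tables or from simulation, not from the standard Kolmogorov-Smirnov distribution:

```python
import numpy as np
from scipy.stats import norm

def lilliefors_statistic(x):
    """Lilliefors D: max distance between the empirical CDF and the
    normal CDF with mean and sd estimated from the sample."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    z = (x - x.mean()) / x.std(ddof=1)     # standardize with sample estimates
    cdf = norm.cdf(z)
    ecdf_hi = np.arange(1, n + 1) / n      # ECDF just after each order statistic
    ecdf_lo = np.arange(0, n) / n          # ECDF just before each order statistic
    return max(np.max(ecdf_hi - cdf), np.max(cdf - ecdf_lo))

rng = np.random.default_rng(1)
print(lilliefors_statistic(rng.normal(size=100)))  # small D for normal data
```

The two ECDF arrays handle the step discontinuities of the empirical distribution function on both sides of each order statistic.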
Anderson–Darling test [AD]
The AD test is of the form
A² = n ∫ [F_n(x) − Φ(x)]² ψ(x) dΦ(x),
where F_n(x) is the empirical distribution function (EDF), Φ(x) is the cumulative distribution function of the standard normal distribution, and ψ(x) = [Φ(x)(1 − Φ(x))]⁻¹ is a weight function. The critical values for the Anderson-Darling test, along with a more detailed study, have been published in Stephens.
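The Anderson-Darling test for the case of estimated parameters is available in SciPy (an illustration, not the R code used in the paper); `anderson` returns the A² statistic together with critical values for several significance levels:

```python
import numpy as np
from scipy.stats import anderson

rng = np.random.default_rng(0)
x = rng.normal(loc=5, scale=2, size=200)

# A^2 with mean and sd estimated from the sample (the Lilliefors-style case)
res = anderson(x, dist='norm')
print(res.statistic)
# critical_values pair with significance_level percentages
for cv, sl in zip(res.critical_values, res.significance_level):
    print(f"reject at {sl}%: {res.statistic > cv}")
```

Unlike most SciPy tests, `anderson` reports critical values rather than a p-value, because the null distribution depends on which parameters were estimated.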
Chi-Square test [CS]
The chi-square goodness-of-fit test statistic is defined as
χ² = Σᵢ (Oᵢ − Eᵢ)² / Eᵢ,
where Oᵢ and Eᵢ refer to the ith observed and expected frequencies respectively and k is the number of bins/groups. When the null hypothesis is true, the above statistic follows a chi-square distribution with k − 1 degrees of freedom.
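A minimal sketch of the chi-square normality test, using equiprobable bins under the fitted normal; the bin count k=8 is an arbitrary choice of ours, and the degrees of freedom follow the k − 1 convention stated above (some authors additionally subtract the two estimated parameters):

```python
import numpy as np
from scipy.stats import norm, chi2

def chisq_normality(x, k=8):
    """Chi-square GOF test of normality with k equiprobable bins under the
    normal distribution fitted by the sample mean and sd."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # interior bin edges at the fitted normal's quantiles -> expected count n/k per bin
    edges = norm.ppf(np.arange(1, k) / k, loc=x.mean(), scale=x.std(ddof=1))
    observed = np.bincount(np.searchsorted(edges, x), minlength=k)
    expected = n / k
    stat = np.sum((observed - expected) ** 2 / expected)
    # df = k - 1 as in the text above
    return stat, chi2.sf(stat, df=k - 1)

rng = np.random.default_rng(42)
stat, p = chisq_normality(rng.normal(size=500))
print(stat, p)
```

Equiprobable bins keep all expected counts equal to n/k, which avoids the sparse-cell problem of equal-width bins in the tails.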
Skewness test [SK]
The skewness statistic is defined as
g₁ = m₃ / S³, where m₃ = (1/n) Σ (xᵢ − x̄)³ and S is the standard deviation.
Under H₀, a normalizing transformation Z(g₁) of g₁ is approximately standard normally distributed for n > 8 and serves as the test statistic.
Kurtosis test [KU]
The kurtosis statistic is defined as
g₂ = m₄ / S⁴ − 3, where m₄ = (1/n) Σ (xᵢ − x̄)⁴.
Under H₀, a normalizing transformation Z(g₂) of g₂ is approximately standard normally distributed for n ≥ 20, and the test is thus more suitable for this range of sample sizes.
D'Agostino-Pearson test [DK]
The test combines g₁ and g₂ to produce an omnibus test of normality. The test statistic is
DK = Z²(g₁) + Z²(g₂),
where Z(g₁) and Z(g₂) are the normal approximations to g₁ and g₂ respectively. The test statistic follows approximately a chi-square distribution with 2 degrees of freedom when the population is normally distributed. The test is appropriate for a sample size of at least twenty.
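The skewness, kurtosis and omnibus tests above are implemented in SciPy (an illustration; the paper's simulations were written in R). SciPy's `normaltest` is exactly the sum of the squared normalized skewness and kurtosis statistics:

```python
import numpy as np
from scipy.stats import skewtest, kurtosistest, normaltest

rng = np.random.default_rng(7)
x = rng.normal(size=100)

z1, p1 = skewtest(x)      # Z(g1): normalized skewness, valid for n > 8
z2, p2 = kurtosistest(x)  # Z(g2): normalized kurtosis, recommended for n >= 20
k2, p = normaltest(x)     # omnibus statistic Z(g1)^2 + Z(g2)^2 ~ chi-square(2)
print(z1, z2, k2)
```

The identity k2 = z1² + z2² can be checked numerically, mirroring the definition of the DK statistic.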
Shapiro–Wilk test [SW]
The Shapiro-Wilk test uses a W statistic defined as
W = (Σᵢ₌₁..ₘ aᵢ (x₍ₙ₊₁₋ᵢ₎ − x₍ᵢ₎))² / Σᵢ₌₁..ₙ (xᵢ − x̄)²,
where m = n/2 if n is even and m = (n − 1)/2 if n is odd, and x₍ᵢ₎ represents the ith order statistic of the sample. The constants aᵢ are given by
(a₁, …, aₙ) = m′V⁻¹ / (m′V⁻¹V⁻¹m)^(1/2),
where m = (m₁, m₂, …, mₙ)′ are the expected values of the order statistics of independent and identically distributed random variables sampled from the standard normal distribution, and V is the covariance matrix of those order statistics. For more information about the Shapiro-Wilk test refer to the original Shapiro and Wilk paper, and for critical values refer to Pearson and Hartley.
Shapiro-Francia test [SF]
The test statistic is defined as
W′ = (Σᵢ mᵢ x₍ᵢ₎)² / (Σᵢ mᵢ² · Σᵢ (xᵢ − x̄)²).
W′ equals the squared product-moment correlation coefficient between the x₍ᵢ₎ and the mᵢ, and therefore measures the straightness of the normal probability plot of x₍ᵢ₎; small values of W′ indicate non-normality. A detailed discussion of this test, along with critical values, is available in Royston.
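The Shapiro-Wilk W statistic is available directly in SciPy (a hedged illustration; SciPy does not ship a separate Shapiro-Francia variant). W stays close to 1 for normal data and drops for skewed data:

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(3)
normal_sample = rng.normal(size=50)
skewed_sample = rng.exponential(size=50)

w1, p1 = shapiro(normal_sample)   # W near 1: normal probability plot is straight
w2, p2 = shapiro(skewed_sample)   # smaller W: departure from normality
print(f"normal data:      W = {w1:.4f}, p = {p1:.4f}")
print(f"exponential data: W = {w2:.4f}, p = {p2:.4f}")
```
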
Jarque-Bera test [JB]
The test statistic is given as
JB = n (b₁/6 + (b₂ − 3)²/24),
where √b₁ = m₃/m₂^(3/2) and b₂ = m₄/m₂² are the skewness and kurtosis measures, and m₂, m₃, m₄ are the second, third and fourth central moments respectively. The Jarque-Bera statistic is chi-square distributed with two degrees of freedom.
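The JB statistic is simple enough to compute from the central moments as defined above; as a sanity check, the sketch below compares a hand-rolled version against SciPy's implementation:

```python
import numpy as np
from scipy.stats import jarque_bera

def jb_statistic(x):
    """Jarque-Bera from central moments: JB = n*(b1/6 + (b2-3)^2/24),
    with sqrt(b1) = m3/m2^1.5 and b2 = m4/m2^2."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    d = x - x.mean()
    m2, m3, m4 = (d ** 2).mean(), (d ** 3).mean(), (d ** 4).mean()
    b1 = (m3 / m2 ** 1.5) ** 2    # squared skewness
    b2 = m4 / m2 ** 2             # kurtosis
    return n * (b1 / 6 + (b2 - 3) ** 2 / 24)

rng = np.random.default_rng(5)
x = rng.normal(size=300)
jb, p = jarque_bera(x)       # library version for comparison
print(jb_statistic(x), jb)   # the two values agree
```
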
Robust Jarque-Bera test [RJB]
The robust Jarque-Bera (RJB) test statistic is defined as
RJB = (n/C₁)(m₃/Jₙ³)² + (n/C₂)(m₄/Jₙ⁴ − 3)²,
where Jₙ = (√(π/2)/n) Σ |xᵢ − M|, M is the sample median, and C₁ = 6 and C₂ = 64. The RJB statistic is asymptotically chi-square distributed with two degrees of freedom.
Doornik-Hansen test [DH]
The test statistic involves the use of the transformed skewness and transformed kurtosis. The transformed skewness is given by the following expression:
where Y, c and w are obtained by
Bowman and Shenton proposed the transformed kurtosis z₂ as follows:
With ξ and a obtained by
The test statistic proposed by Doornik and Hansen  is given by
The normality hypothesis is rejected for large values of the test statistic. The test is approximately chi-squared distributed with two degrees of freedom.
Brys-Hubert-Struyf MC-MR test [BH]
This test is based on robust measures of skewness and tail weight. The considered robust measure of skewness is the medcouple (MC), defined as
MC = med h(xᵢ, xⱼ) over pairs with xᵢ ≤ m_F ≤ xⱼ,
where med stands for the median, m_F is the sample median, and h is a kernel function given by
h(xᵢ, xⱼ) = ((xⱼ − m_F) − (m_F − xᵢ)) / (xⱼ − xᵢ).
The left medcouple (LMC) and the right medcouple (RMC) are the considered robust measures of left and right tail weight respectively, and are defined by
LMC = −MC(x < m_F) and RMC = MC(x > m_F).
The test statistic T_MC-LR is then defined by
T_MC-LR = n (w − ω)′ V⁻¹ (w − ω),
in which w is set as [MC, LMC, RMC]′, and ω and V are obtained based on the influence function of the estimators in w. According to Brys, et al., for the case of the normal distribution, ω = [0, 0.199, 0.199]′ and V is the corresponding asymptotic covariance matrix given there.
The normality hypothesis of the data is rejected for large values of the test statistic which approximately follows the chi-square distribution with three degrees of freedom.
Bonett-Seier test [BS]
The test statistic T_w is given by:
T_w = √(n + 2) (ω̂ − 3) / 3.54,
in which ω̂ is set by
ω̂ = 13.29 (ln σ̂ − ln τ̂), where τ̂ = (1/n) Σ |xᵢ − x̄|.
This statistic follows a standard normal distribution under the null hypothesis.
Brys-Hubert-Struyf-Bonett-Seier Joint test [BHBS]
The normality hypothesis of the data is rejected by the joint test when rejection is obtained by either one of the two individual tests at a significance level of α/2.
Bontemps-Meddahi tests [BM(1) and BM(2)]
The general expression of the test family is given by:
BM₃₋ₚ = Σₖ₌₃..ₚ ( (1/√n) Σᵢ Hₖ(zᵢ) )²,
where zᵢ = (xᵢ − x̄)/s and Hₖ(⋅) represents the kth order normalized Hermite polynomial.
Different tests can be obtained by assigning different values to p, which represents the maximum order of the considered normalized Hermite polynomials in the expression above. Two different tests are considered in this work, with p=4 and p=6; these tests are termed BM3-4 and BM3-6 (referred to as BM(1) and BM(2) in the tables). The hypothesis of normality is rejected for large values of the test statistic and, according to Bontemps and Meddahi, the general BM3-p family of tests asymptotically follows the chi-square distribution with p − 2 degrees of freedom.
Gel-Miao-Gastwirth test [GMG]
The test is based on the ratio of the standard deviation to the robust measure of dispersion Jₙ, defined as
Jₙ = (√(π/2)/n) Σ |xᵢ − M|,
where M is the sample median. The normality test statistic R_sJ, which should tend to one under a normal distribution, is thus given by:
R_sJ = s / Jₙ.
The normality hypothesis is rejected for large values of the RSJ, and the statistic is asymptotically normally distributed .
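The ratio underlying the GMG test is easy to compute; a hedged sketch (critical values would come from the asymptotic normal distribution cited above, which is omitted here):

```python
import numpy as np

def gmg_ratio(x):
    """R_sJ = s / J_n, where J_n = sqrt(pi/2) * mean(|x - median|).
    Under normality the ratio tends to 1; values well above 1 signal
    heavy tails or outliers."""
    x = np.asarray(x, dtype=float)
    s = x.std(ddof=1)
    jn = np.sqrt(np.pi / 2) * np.abs(x - np.median(x)).mean()
    return s / jn

rng = np.random.default_rng(11)
print(gmg_ratio(rng.normal(size=2000)))            # close to 1 for normal data
print(gmg_ratio(rng.standard_t(df=3, size=2000)))  # above 1 for heavy tails
```
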
G test [G]
The test is used to test if an underlying population distribution is a uniform distribution. Suppose x1, x2,…, xn are the observations of a random sample from a population distribution with distribution function F(x). Suppose also that x(1), x(2),…, x(n) are the corresponding order statistics. The test statistic has the form:
Where x(0) is defined as 0, and x(n+1) is defined as 1.
We can observe that F(x₍₁₎), F(x₍₂₎), …, F(x₍ₙ₎) are the ordered observations of a random sample from the U(0,1) distribution, and thus the G statistic can be expressed as:
When the population distribution is the same as the specified distribution, the value of the test statistic should be close to zero. On the other hand, when the population distribution is far away from the specified distribution, the value should be pretty close to one.
In order to use the test for normality, we can take F(x) to be a normal distribution function. Considering the case where the parameters of the distribution are not known, Lilliefors' idea is adopted: x̄ and s² are calculated from the sample data and used as estimates for μ and σ², so that F(x) is the cumulative distribution function of the N(x̄, s²) distribution. By using the transformation:
The test statistic becomes:
The hypothesis of normality should be rejected at significance level α if the test statistic is greater than its 1 − α critical value. A table of critical values is available in Chen and Ye.
|Normal (0, 1) – Skewness = 0, Kurtosis = 0|
*Tests with acceptable Type I error rates.
Table 1: Simulated Type I error rate at 5% significance level.
Since a theoretical comparison among the test statistics is not feasible, a simulation study has been conducted instead to compare the performance of the test statistics in this section.
The results of the simulation vary across significance levels, sample sizes and alternative distributions. The results at the 0.05 significance level for the different distributions considered are presented in Tables 1-3. First we generate samples of sizes 20, 50, 100 and 500 from the standard normal distribution to compare probabilities of Type I error. The empirical probability of Type I error is defined as the number of times the null hypothesis of normality is rejected divided by the total number of simulations. The results in Table 1 are based on 10,000 simulations. We use the software R 3.1.1 (R Core Team) for all simulations.
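The Type I error simulation can be sketched as follows (a Python/SciPy illustration of the procedure, not the paper's R code; we use fewer replications than the paper's 10,000 to keep the example quick, and the Shapiro-Wilk test as an example):

```python
import numpy as np
from scipy.stats import shapiro

def type1_error_rate(test, n, reps=2000, alpha=0.05, seed=0):
    """Proportion of normal samples rejected by the test: should be near alpha."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(reps):
        x = rng.normal(size=n)           # data generated under H0
        if test(x).pvalue < alpha:
            rejections += 1
    return rejections / reps

for n in (20, 50, 100):
    print(n, type1_error_rate(shapiro, n))
```

Any test exposing a `.pvalue` result (e.g. `scipy.stats.normaltest`, `jarque_bera`) can be plugged in for `test`.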
|Beta (2, 2) – Skewness = 0, Kurtosis = -0.86|
|Beta (3, 3) – Skewness = 0, Kurtosis = -0.67|
|Uniform (0, 1) – Skewness = 0, Kurtosis = -1.20|
|t(10) – Skewness = 0, Kurtosis = 1|
|t (5) – Skewness = 0, Kurtosis = 6|
|Laplace (0, 1) – Skewness = 0, Kurtosis = 3|
*The most powerful test for each sample size.
Table 2: Simulated power for symmetric distributions at 5% significance level.
|Gamma (4, 5) – Skewness = 1, Kurtosis = 4|
|Chi-square (3) – Skewness = 1.63, Kurtosis = 4|
|Exponential (1) – Skewness = 2, Kurtosis = 6|
|Log-Normal (0, 1) – Skewness = 6.18, Kurtosis = 113.94|
|Weibull (2, 2) – Skewness = 0.63, Kurtosis = 0.25|
|Gompertz (0.001, 1) – Skewness = -1, Kurtosis = 1.5|
*The most powerful test for each sample size.
Table 3: Simulated power for asymmetric distributions at 5% significance level.
The empirical power of a test is calculated as the proportion of simulations in which the null hypothesis is rejected when the alternative hypothesis of non-normality is true. For power comparison purposes we have considered the following symmetric distributions: Beta, Uniform, Student's t and Laplace. For the asymmetric class of distributions, we consider the Gamma, Chi-square, Exponential, Log-Normal, Weibull and Gompertz distributions. To compare the power of the tests we generate samples of sizes 20, 50, 100 and 500 from non-normal distributions. The power based on 10,000 simulations from different symmetric distributions is presented in Table 2 and that from asymmetric distributions is shown in Table 3. The critical values for the corresponding test statistics are discussed in Section 2.
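The power calculation differs from the Type I error calculation only in the sampling distribution; a hedged Python sketch (reduced replication count, Shapiro-Wilk as the example test):

```python
import numpy as np
from scipy.stats import shapiro

def power(test, sampler, n, reps=2000, alpha=0.05, seed=0):
    """Proportion of non-normal samples rejected by the test."""
    rng = np.random.default_rng(seed)
    hits = sum(test(sampler(rng, n)).pvalue < alpha for _ in range(reps))
    return hits / reps

# Exponential(1): strongly skewed, easy to detect (cf. Table 3)
print(power(shapiro, lambda rng, n: rng.exponential(size=n), n=50))
# t(10): close to normal, hard to detect (cf. Table 2)
print(power(shapiro, lambda rng, n: rng.standard_t(df=10, size=n), n=50))
```
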
Discussion of simulation results
The best test is the one with maximum power while keeping the nominal significance level. Table 1 gives the type I error rate while Tables 2 and 3 give the power of the tests for the several alternative distributions.
An examination of the performance of the tests in terms of Type I error rate shows that the LL, AD, CS, DK, SK, KU, SW, SF, RJB and DH tests were better than the other tests; these tests have Type I error rates around the specified 5% level. The RJB test has a generally acceptable Type I error rate, but the rate was slightly higher than specified when the sample size was less than 50. The JB, BH, BS, BM(1) and G statistics all have Type I error rates lower than 5% and tend to under-reject, while the BHBS, BM(2) and GMG have Type I error rates higher than 5% and tend to over-reject.
A consideration of the power results showed that different tests performed differently under different combinations of sample size and significance level. A general and expected pattern was observed: as the sample size increases, the power of the test also increases.
With Beta (2, 2) and Beta (3, 3) as the alternative distributions, we have symmetric distributions with short tails. With Beta (2, 2), only the KU at 78.79% exhibited substantial power when the sample size was less than 100, followed by the CS at 64.97%. However, with a sample size of 200, all the tests reached at least 80% except for the BHBS at 77.99%, SF at 75.40%, AD at 70.79% and JB at 61.04%. All other tests do not exhibit adequate power, especially the SK and BH, which had 0.05% and 46.74% power respectively even at n=1000, and are clearly not suitable for these conditions. It is noticed that as the value of the parameter increases, the tail of the distribution shortens and consequently the coefficient of kurtosis decreases, resulting in a loss of power. In fact, for Beta (3, 3), considerable power was not achieved until the sample size reached 200; the kurtosis test was able to achieve 79.72% power at that point.
In the case of a Uniform (0, 1) alternative distribution, the KU test had a power of 88.59% at n=50, proving to be the most powerful under this condition, followed closely by the DK (79.77%). With n=100, all tests except the LL, CS, SK, JB, RJB, BH, BM(1) and G had power greater than 80%; the CS, SK, JB, RJB, BH, BM(1) and G proved particularly poor with n ≤ 50 in this situation, with the SK achieving a power of only 0.07% even at n=1000.
For a t(10) distribution (t-distribution with 10 degrees of freedom), all the tests were poor at detecting non-normality; even at n=500, only the BM(1) and BM(2) achieved a power of 80%, followed closely by the RJB (76.42%), GMG (75.54%), JB (75.16%), DH (74.97%), KU (74.37%) and SF (71.84%). All other tests had power below 70% at n=500 or less. However, the BM(2) is not acceptable as it has an unacceptable Type I error rate.
For a t(5) distribution, which is symmetric and long-tailed, none of the tests was able to achieve a power of 80% even at n=100; those closest to this cut-off were the BM(2) (71.35%), RJB (69.02%), GMG (68.79%), DH (64.10%), JB (62.90%), SF (62.89%) and DK (60.04%).
Considering a Laplace (0, 1) with a mean of zero, the GMG is the most powerful for all sample sizes, achieving a power of 94.86% at n=100, with the AD, SW, SF, RJB, DH, BS, BHBS and BM(2) all achieving power above the 80% threshold. The SK and G tests are the least powerful under this alternative distribution.
In the situation where the alternative distribution is a Gamma (4, 5), the most powerful test was the SW, reaching a power of 95.81% at n=100, followed closely by the DH, BM(2), SF and SK, all achieving more than 90% power at n=100. The least powerful in this situation are the G, KU and BS: the G and KU did not achieve 80% power until n=500, and the BS only achieved a power of 61.99% even at n=1000.
The chi-square (3) distribution proved to be one that was easily identified as non-normal by all tests, with the SW (87.19%), SF (83.50%), AD (79.93%) and DH (79.42%) all achieving adequate power even at a sample size as small as 30. At n=50, all eighteen tests considered had reached at least the 80% threshold except for the KU, BH, BS, BHBS, GMG and G. The least powerful was the BS test, which never achieved 100% power even at n=1000, whereas all other tests did.
Exponential (1) also proved to be a distribution that was easy for the tests to identify as non-normal, with the SW and SF having power above 80% at only n=20. All tests were able to achieve more than 80% power at only n=50 except for the KU, BH, BS, BHBS and GMG. All tests surpassed the 80% threshold at n=100 except for the BS, which only achieved 57.70% power at this sample size and proved the least powerful, never achieving 100% power even at n=1000, whereas all other tests did.
The SW test proved to be the most powerful under the Log-normal alternative distribution, achieving a power of 83.73% at n=20, followed closely by its modified form, the SF (80.13%). All tests surpassed the 80% threshold at n=40 except for the BH and BS, which only achieved powers of 65.76% and 69.87% respectively. The BHBS, a joint test of the BH and BS, proved more powerful than the individual tests by achieving a power of 89.52% at n=40. However, the BHBS is not recommended as it has an unacceptable Type I error rate.
The power results for a Weibull (2, 2) alternative distribution showed that the SW is the most powerful under this distribution. The test achieved a power of 79.33% at n=100, just a little below the 80% rate that is usually described as acceptable. The SW is closely followed by the DH (72.64%) and SF (71.64%). The AD, DK, SK, JB, RJB, BM(1) and BM(2) were also able to achieve at least 80% power at n=200. The BS once again proved to be the least powerful among the tests under this distribution, achieving a power of only 16.94%.
An asymmetric, short-tailed Gompertz alternative distribution showed the SK test to be powerful and a strong rival to the popular SW test; however, none of the tests was able to achieve 80% power until the sample size was increased to 100, at which point all of the tests except the LL, CS, KU, BH, BS, GMG and G had surpassed the threshold. The BS was once more the least powerful under this distribution; despite most of the tests achieving the 80% threshold and a significant number of them achieving 100% at n=500, the BS was only able to achieve 65.88% power.
A Weibull (2, 2) distribution also showed the RJB as the most powerful for sample sizes of 40 or less and the SW for larger sample sizes, as against the BHBS for a sample size of 10 and the SW for larger sample sizes at the 5% level. The most drastic change, however, occurs in the case of the Gompertz (0.001, 1) distribution at the 1% level, where the GMG was the most powerful for a sample size of 10 and the SK for other sample sizes. The SK would probably be the most powerful for a sample size of 10 but for the unavailability of the SK, along with the KU and DK, for sample sizes less than 20. At the 5% level, on the other hand, the RJB was the most powerful for sample sizes of 40 or less and the BM(2) for larger sample sizes.
It is clear from the above discussion that all these tests behave differently depending on the alternative distribution under consideration. Even though the BHBS, BM(2) and GMG proved powerful in certain situations, they are not recommended for testing for normality as they do not effectively control the Type I error rate. The results are in good agreement with those obtained in Yap and Sim. A general and expected pattern was observed: as the sample size increases, the power of every test also increases.
This section illustrates the performance of the tests using a real-life example of medical data. The postmortem interval (PMI) is defined as the elapsed time between death and an autopsy. Knowledge of the PMI is considered essential when conducting medical research on human cadavers. The following data (Data Source: Hayes and Lewis) are PMIs of 22 human brain specimens obtained at autopsy in a recent study:
5.5, 14.5, 6.0, 5.5, 5.3, 5.8, 11.0, 6.1, 7.0, 14.5, 10.4, 4.6, 4.3, 7.2, 10.5, 6.5, 3.3, 7.0, 4.1, 6.2, 10.4, 4.9.
The sample is positively skewed with skewness=0.99 and short-tailed with kurtosis=-0.16; the mean is 7.30, the SD is 3.18 and the sample size is 22. The QQ plot of the PMI data is given below, which certainly indicates that the data are not symmetric (Figure 1).
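The summary statistics quoted above can be reproduced from the data (a sketch; SciPy's default skewness and kurtosis estimators are an assumption on our part, since the paper does not state which estimator it used, so the values may differ slightly from those quoted):

```python
import numpy as np
from scipy.stats import skew, kurtosis, shapiro

# PMI data (hours) for the 22 brain specimens
pmi = np.array([5.5, 14.5, 6.0, 5.5, 5.3, 5.8, 11.0, 6.1, 7.0, 14.5, 10.4,
                4.6, 4.3, 7.2, 10.5, 6.5, 3.3, 7.0, 4.1, 6.2, 10.4, 4.9])

print(f"n = {len(pmi)}, mean = {pmi.mean():.2f}, SD = {pmi.std(ddof=1):.2f}")
print(f"skewness = {skew(pmi):.2f}, excess kurtosis = {kurtosis(pmi):.2f}")
w, p = shapiro(pmi)
print(f"Shapiro-Wilk: W = {w:.4f}, p = {p:.4f}")  # cf. Table 4
```
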
The computed values of the test statistics, along with their p-values and decisions, are presented in Table 4. This dataset was originally modeled by a gamma distribution with shape parameter α=5.25 and scale parameter β=1.39, so one may assume that the hypothesis of normality would be rejected; however, seven of the eighteen tests considered failed to reject this hypothesis, including the popular DK, SW and SF tests. It can be noted that the coefficient of kurtosis of the data is -0.16, close to that of a normal distribution.
|Normality Test||Value of test statistic||P-value (or Critical Value)||Reject Normality or Do not reject at α = 5%|
|DK||5.5303||0.063||Do not reject|
|KU||0.7222||0.4702||Do not reject|
|SW||0.9091||0.2378||Do not reject|
|SF||0.9129||0.2244||Do not reject|
|JB||4.141||0.1261||Do not reject|
|BS||-0.126||0.8997||Do not reject|
|BM(1)||3.6023||0.1651||Do not reject|
Table 4: Test Results from Postmortem interval data.
We have considered eighteen different tests of normality, comprising the most popular along with some of the recently proposed tests. The performance was measured in terms of the Type I error rate and the power of the test [31,32]. The Type I error rate is the rate of rejection of the hypothesis of normality for data from the normal distribution, while the power of the test is the rate of rejection of the normality hypothesis for data generated from a non-normal distribution. We have considered both symmetric and asymmetric distributions in the simulation study. Based on the simulation results we have found several useful test statistics for testing normality. However, among all the methods with acceptable Type I error rates, the Kurtosis test is the most powerful for symmetric data and the Shapiro-Wilk test is the most powerful for asymmetric data. The findings of this paper are in good agreement with Yap and Sim, although the kurtosis and skewness tests were not included in their paper. Interestingly, the kurtosis test turned out to be the best test for symmetric distributions, and the skewness test performs well for both symmetric and asymmetric distributions.
Authors are thankful to referees for their comments that certainly improved the presentation of the paper.