Performances of Several Univariate Tests of Normality: An Empirical Study

The problem of testing for normality is fundamental in both theoretical and empirical statistical research. This paper compares the performance of eighteen normality tests available in the literature. Since a theoretical comparison is not possible, Monte Carlo simulations were performed from various symmetric and asymmetric distributions for sample sizes ranging from 10 to 1000. The performance of the test statistics is compared based on the empirical Type I error rate and the power of the test. The simulation results show that the Kurtosis test is the most powerful for symmetric data and the Shapiro-Wilk test is the most powerful for asymmetric data. Citation: Adefisoye JO, Golam Kibria BM, George F (2016) Performances of Several Univariate Tests of Normality: An Empirical Study. J Biom Biostat 7: 322. doi:10.4172/2155-6180.1000322


Introduction
Many statistical procedures, including correlation, regression, t tests, and analysis of variance, are based on the assumption that the data follow a normal (Gaussian) distribution. In many cases the assumption of normality is critical, for example when confidence intervals are developed for population parameters such as the mean, correlation, or variance. If the assumptions under which the statistical procedures are developed do not hold, the conclusions drawn from these procedures may not be accurate, so practitioners need to make sure the assumptions are valid. The validity of the normality assumption can be checked in two ways: empirically, using graphical analysis, or formally, using goodness-of-fit tests. The goodness-of-fit tests, which are formal statistical procedures for assessing the underlying distribution of a data set, are our focus here. These tests usually provide more reliable results than graphical analysis. Many statistical tests are available in the literature for testing whether a given data set comes from a normal distribution. In this article we review the most commonly used normality tests and compare them using power and observed significance level.
The first normality test in the literature is the chi-square goodness-of-fit test (Snedecor and Cochran [1]), which was suggested by Pearson [2]. Later, the famous Kolmogorov-Smirnov goodness-of-fit test was introduced by Kolmogorov [3]. The Kolmogorov-Smirnov statistic quantifies the distance between the empirical distribution function of the sample and the cumulative distribution function of the reference distribution. The Anderson-Darling test [4] assesses whether a sample comes from a specified distribution. It makes use of the fact that, given a hypothesized underlying distribution and assuming the data do arise from this distribution, the values of the cumulative distribution function evaluated at the data can be assumed to follow a Uniform distribution. The Lilliefors [5] test is a modification of Kolmogorov's test. Kolmogorov's test is appropriate when the parameters of the hypothesized normal distribution are completely known, whereas in the Lilliefors test the parameters can be estimated from the sample. We have included the Lilliefors test but not the KS test in our simulation studies, since in most practical situations the parameters of the null distribution are unknown. The Shapiro and Wilk [6] test is the first test that was able to detect departures from normality due to either skewness or kurtosis or both [7]. D'Agostino [8] proposed a test based on transformations of the sample kurtosis and skewness. Shapiro and Francia [9] suggested an approximation to the Shapiro-Wilk test which is known to perform well [10]. The Jarque-Bera [11] test is based on the sample skewness and sample kurtosis and uses the Lagrange multiplier procedure on the Pearson family of distributions to obtain tests for normality. To improve the efficiency of the Jarque-Bera test, Doornik and Hansen [12] proposed a modification which involves the use of the transformed skewness. The skewness test (Bai and Ng [13]) is based on the third sample moment and is used to detect non-normality due to skewness.
In the kurtosis test (Bai and Ng [13,14]), the sample coefficient of kurtosis is used to detect non-normality due to kurtosis. Gel and Gastwirth proposed a robust modification of the Jarque-Bera test. The robust Jarque-Bera test uses a robust estimate of the dispersion in the skewness and kurtosis instead of the second-order central moment. Brys et al. [15] proposed a goodness-of-fit test based on robust measures of skewness and tail weight. Bonett and Seier [16] suggested a modified measure of kurtosis for testing normality. Considering that the Brys test is a skewness-based test and the Bonett-Seier test is a kurtosis-based test, Romao et al. [17] proposed a joint test using both measures. The joint test attempts to combine the two focused tests in order to increase the power to detect different kinds of departure from normality. Bontemps and Meddahi [18] proposed a family of normality tests based on moment conditions known as Stein equations and their relation to Hermite polynomials. Gel et al. [19] proposed a directed normality test, which focuses on detecting heavier tails and outliers of symmetric distributions. The last one in the list is the G test proposed by Chen and Ye [20]. These tests are compared here in terms of power and type I error (α). Yazici and Yolacan [22] and, more recently, Yap and Sim [23] compared normality tests, but the kurtosis and skewness tests were not included in their work. Interestingly, the kurtosis test turned out to be the best test for symmetric distributions, and the skewness test performs well for both symmetric and asymmetric distributions.
The rest of the paper is organized as follows. Section 2 discusses the statistical tests for normality. A simulation study is conducted in Section 3. A real-life data set is analyzed in Section 4, and some concluding remarks are given in Section 5.

Statistical Methods
There are various parametric and nonparametric tests for normality available in literature. This section discusses widely used statistical methods for normality tests.

Lilliefors test [LL]
The test statistic is defined as:

$$T = \sup_x \left| F^*(x) - S_n(x) \right|$$

where $S_n(x)$ is the sample cumulative distribution function and $F^*(x)$ is the cumulative distribution function (CDF) of the null distribution. For more details and critical values, refer to Conover [24].
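As a rough illustration of how the Lilliefors statistic can be computed, the sketch below evaluates the largest vertical distance between the sample CDF and a normal CDF whose mean and standard deviation are estimated from the data. The function name `lilliefors_statistic` is hypothetical, and using the n-1 sample standard deviation is an assumption; critical values still have to come from tables such as those in Conover [24].

```python
from statistics import NormalDist, mean, stdev


def lilliefors_statistic(sample):
    """Largest distance between the sample CDF and a fitted normal CDF."""
    xs = sorted(sample)
    n = len(xs)
    # Null distribution with parameters estimated from the sample (assumption: n-1 divisor).
    null = NormalDist(mean(xs), stdev(xs))
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = null.cdf(x)
        # The sample CDF jumps from (i-1)/n to i/n at x, so check both sides of the step.
        d = max(d, abs(f - (i - 1) / n), abs(i / n - f))
    return d


if __name__ == "__main__":
    print(lilliefors_statistic([1, 2, 3, 4, 5]))
```

The statistic is then compared with the Lilliefors critical value for the given n and significance level; large values lead to rejection of normality.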

Anderson-Darling test [AD]
The AD test is of the form:

$$AD = n \int_{-\infty}^{\infty} \left[ F_n(x) - \Phi(x) \right]^2 \psi(x) \, d\Phi(x)$$

where $F_n(x)$ is the empirical distribution function (EDF), $\Phi(x)$ is the cumulative distribution function of the standard normal distribution and $\psi(x) = [\Phi(x)(1 - \Phi(x))]^{-1}$ is a weight function. The critical values for the Anderson-Darling test, along with a more detailed study, have been published in Stephens [25].
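In practice the Anderson-Darling integral is usually evaluated through the equivalent sum $A^2 = -n - \frac{1}{n}\sum_{i=1}^{n}(2i-1)\left[\ln \Phi(z_{(i)}) + \ln(1 - \Phi(z_{(n+1-i)}))\right]$ over the standardized order statistics. The sketch below uses that computational form; the function name is hypothetical, the standardization by the sample mean and n-1 standard deviation is an assumption, and no small-sample correction factor is applied.

```python
import math
from statistics import NormalDist, mean, stdev


def anderson_darling_statistic(sample):
    """Anderson-Darling A^2 for normality with estimated mean and sd."""
    xs = sorted(sample)
    n = len(xs)
    m, s = mean(xs), stdev(xs)
    std_normal = NormalDist()
    # Probability integral transform of the standardized order statistics.
    p = [std_normal.cdf((x - m) / s) for x in xs]
    total = sum(
        (2 * i - 1) * (math.log(p[i - 1]) + math.log(1.0 - p[n - i]))
        for i in range(1, n + 1)
    )
    return -n - total / n


if __name__ == "__main__":
    skewed = [-math.log(1 - (i - 0.5) / 50) for i in range(1, 51)]
    print(anderson_darling_statistic(skewed))
```

Large values of the statistic indicate departure from normality; the appropriate critical values are those tabulated in Stephens [25].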

Chi-Square test [CS]
The chi-square goodness-of-fit test statistic is defined as:

$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$

where $O_i$ and $E_i$ refer to the $i$th observed and expected frequencies respectively, and $k$ is the number of bins/groups. When the null hypothesis is true, the above statistic follows a chi-square distribution with $k-1$ degrees of freedom.
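One way to apply the chi-square statistic to a normality check is to bin the data into k classes that are equally likely under a fitted normal distribution, so that every expected frequency is n/k. The sketch below follows that approach; the equiprobable binning, the function name and the use of the n-1 standard deviation are assumptions made for illustration and are not taken from the paper.

```python
from statistics import NormalDist, mean, stdev


def chi_square_normality(sample, k):
    """Chi-square statistic with k equiprobable bins under a fitted normal."""
    n = len(sample)
    null = NormalDist(mean(sample), stdev(sample))
    observed = [0] * k
    for x in sample:
        # Bin index from the fitted CDF; clamp so that cdf == 1.0 falls in the last bin.
        idx = min(int(null.cdf(x) * k), k - 1)
        observed[idx] += 1
    expected = n / k  # equal expected frequency in every bin
    return sum((o - expected) ** 2 / expected for o in observed)


if __name__ == "__main__":
    data = [1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 7, 10, 15, 40]
    print(chi_square_normality(data, 3))
```

The returned value is then referred to the chi-square reference distribution described in the text.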

Skewness test [SK]
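The skewness statistic is defined as:

$$g_1 = \frac{m_3}{m_2^{3/2}}, \qquad m_r = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^r$$

Under $H_0$, the test statistic $Z(g_1)$ is approximately normally distributed for $n > 8$ and is defined as:

$$Z(g_1) = \frac{1}{\sqrt{\ln w}} \ln\left( \frac{Y}{c} + \sqrt{\left(\frac{Y}{c}\right)^2 + 1} \right)$$

where

$$Y = g_1 \sqrt{\frac{(n+1)(n+3)}{6(n-2)}}, \qquad \beta_2 = \frac{3(n^2+27n-70)(n+1)(n+3)}{(n-2)(n+5)(n+7)(n+9)},$$

$$w^2 = -1 + \sqrt{2(\beta_2 - 1)}, \qquad c = \sqrt{\frac{2}{w^2 - 1}}.$$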

Kurtosis test [KU]
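The kurtosis statistic is defined as:

$$g_2 = \frac{m_4}{m_2^{2}} - 3, \qquad m_r = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^r$$

Under $H_0$, the test statistic $Z(g_2)$ is approximately normally distributed for $n \geq 20$ and is thus more suitable for this range of sample sizes. $Z(g_2)$ is given as:

$$Z(g_2) = \frac{\left(1 - \dfrac{2}{9A}\right) - \left[\dfrac{1 - 2/A}{1 + x\sqrt{2/(A-4)}}\right]^{1/3}}{\sqrt{2/(9A)}}$$

where

$$x = \frac{g_2 - E(g_2)}{\sqrt{\mathrm{Var}(g_2)}}, \qquad E(g_2) = -\frac{6}{n+1}, \qquad \mathrm{Var}(g_2) = \frac{24n(n-2)(n-3)}{(n+1)^2(n+3)(n+5)},$$

$$\sqrt{\beta_1(g_2)} = \frac{6(n^2-5n+2)}{(n+7)(n+9)} \sqrt{\frac{6(n+3)(n+5)}{n(n-2)(n-3)}}, \qquad A = 6 + \frac{8}{\sqrt{\beta_1(g_2)}} \left[ \frac{2}{\sqrt{\beta_1(g_2)}} + \sqrt{1 + \frac{4}{\beta_1(g_2)}} \right].$$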

D'Agostino-Pearson K² test [DK]
The test combines $g_1$ and $g_2$ to produce an omnibus test of normality. The test statistic is:

$$K^2 = Z^2(g_1) + Z^2(g_2)$$

where $Z(g_1)$ and $Z(g_2)$ are the normal approximations to $g_1$ and $g_2$, respectively. The test statistic approximately follows a chi-square distribution with 2 degrees of freedom when the population is normally distributed. The test is appropriate for a sample size of at least twenty.
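The skewness, kurtosis and omnibus K² statistics can be computed together. The sketch below assumes the usual D'Agostino transformation for the skewness and the Anscombe-Glynn transformation for the kurtosis as the normal approximations; the function names are hypothetical, and moments are taken with divisor n.

```python
import math


def central_moment(xs, r):
    """r-th central sample moment with divisor n."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** r for x in xs) / len(xs)


def skewness_z(xs):
    """Normal approximation Z(g1) to the sample skewness (assumed D'Agostino form)."""
    n = len(xs)
    g1 = central_moment(xs, 3) / central_moment(xs, 2) ** 1.5
    y = g1 * math.sqrt((n + 1) * (n + 3) / (6.0 * (n - 2)))
    beta2 = (3.0 * (n * n + 27 * n - 70) * (n + 1) * (n + 3)
             / ((n - 2) * (n + 5) * (n + 7) * (n + 9)))
    w2 = -1.0 + math.sqrt(2.0 * (beta2 - 1.0))
    delta = 1.0 / math.sqrt(0.5 * math.log(w2))  # 1 / sqrt(ln w)
    c = math.sqrt(2.0 / (w2 - 1.0))
    return delta * math.asinh(y / c)  # asinh(t) = ln(t + sqrt(t^2 + 1))


def kurtosis_z(xs):
    """Normal approximation Z(g2) to the sample kurtosis (assumed Anscombe-Glynn form)."""
    n = len(xs)
    b2 = central_moment(xs, 4) / central_moment(xs, 2) ** 2  # equals g2 + 3
    mean_b2 = 3.0 * (n - 1) / (n + 1)
    var_b2 = 24.0 * n * (n - 2) * (n - 3) / ((n + 1) ** 2 * (n + 3) * (n + 5))
    x = (b2 - mean_b2) / math.sqrt(var_b2)
    sqrt_beta1 = (6.0 * (n * n - 5 * n + 2) / ((n + 7) * (n + 9))
                  * math.sqrt(6.0 * (n + 3) * (n + 5) / (n * (n - 2) * (n - 3))))
    a = 6.0 + 8.0 / sqrt_beta1 * (2.0 / sqrt_beta1
                                  + math.sqrt(1.0 + 4.0 / sqrt_beta1 ** 2))
    ratio = (1.0 - 2.0 / a) / (1.0 + x * math.sqrt(2.0 / (a - 4.0)))
    cube_root = math.copysign(abs(ratio) ** (1.0 / 3.0), ratio)  # sign-safe cube root
    return ((1.0 - 2.0 / (9.0 * a)) - cube_root) / math.sqrt(2.0 / (9.0 * a))


def k_squared(xs):
    """Omnibus K^2 statistic and its chi-square(2) p-value."""
    z1, z2 = skewness_z(xs), kurtosis_z(xs)
    k2 = z1 * z1 + z2 * z2
    return k2, math.exp(-k2 / 2.0)  # upper tail of chi-square(2) is exp(-x/2)


if __name__ == "__main__":
    print(k_squared(list(range(20))))
```

Each Z value can be referred to the standard normal distribution on its own (the SK and KU tests), while their squared sum gives the DK test.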

Shapiro-Wilk test [SW]
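The Shapiro-Wilk test uses a W statistic which is defined as

$$W = \frac{\left[ \sum_{i=1}^{m} a_i \left( x_{(n+1-i)} - x_{(i)} \right) \right]^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

where $m = n/2$ if $n$ is even and $m = (n-1)/2$ if $n$ is odd, $x_{(i)}$ is the $i$th order statistic, and the $a_i$ are constants derived from the means, variances and covariances of the order statistics of a standard normal sample of size $n$.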

Shapiro-Francia [SF]
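The test statistic is defined as:

$$W' = \frac{\left( \sum_{i=1}^{n} m_i x_{(i)} \right)^2}{\sum_{i=1}^{n} m_i^2 \, \sum_{i=1}^{n} \left( x_{(i)} - \bar{x} \right)^2}$$

where the $m_i$ are the expected values of the standard normal order statistics. $W'$ equals the squared product-moment correlation coefficient between the $x_{(i)}$ and the $m_i$, and therefore measures the straightness of the normal probability plot of the $x_{(i)}$; small values of $W'$ indicate non-normality. A detailed discussion of this test, along with critical values, is available in Royston [27].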
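To make the correlation interpretation of W′ concrete, the sketch below computes the statistic from the ordered sample. It assumes Blom's approximation $m_i \approx \Phi^{-1}\left((i - 0.375)/(n + 0.25)\right)$ for the expected normal order statistics, which is a common substitute for exact values and is not taken from the paper; the function names are hypothetical.

```python
from statistics import NormalDist, mean


def normal_scores(n):
    """Approximate expected standard normal order statistics (Blom's formula, assumed)."""
    nd = NormalDist()
    return [nd.inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]


def shapiro_francia_w(sample):
    """Shapiro-Francia W': squared correlation of ordered data with normal scores."""
    xs = sorted(sample)
    m = normal_scores(len(xs))
    xbar = mean(xs)
    numerator = sum(mi * xi for mi, xi in zip(m, xs)) ** 2
    denominator = sum(mi * mi for mi in m) * sum((xi - xbar) ** 2 for xi in xs)
    return numerator / denominator


if __name__ == "__main__":
    skewed = [1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 7, 10, 15, 40]
    print(shapiro_francia_w(skewed))
```

Values close to one correspond to a nearly straight normal probability plot, while a single large outlier, as in the usage example, pulls W′ well below one.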

Jarque-Bera test [JB]
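The test statistic is given as:

$$JB = n \left[ \frac{\left(\sqrt{b_1}\right)^2}{6} + \frac{(b_2 - 3)^2}{24} \right]$$

where $\sqrt{b_1} = m_3 / m_2^{3/2}$ and $b_2 = m_4 / m_2^2$ are the sample skewness and kurtosis, with $m_r = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^r$. The statistic is asymptotically chi-square distributed with two degrees of freedom.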
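A minimal sketch of the Jarque-Bera computation follows, assuming the standard form $JB = n\left[(\sqrt{b_1})^2/6 + (b_2 - 3)^2/24\right]$ with moments taken with divisor n. The function name is hypothetical, and the p-value uses the asymptotic chi-square distribution with two degrees of freedom, whose upper tail is $e^{-x/2}$.

```python
import math


def jarque_bera(sample):
    """Jarque-Bera statistic and its asymptotic chi-square(2) p-value."""
    n = len(sample)
    mean = sum(sample) / n
    m2 = sum((x - mean) ** 2 for x in sample) / n
    m3 = sum((x - mean) ** 3 for x in sample) / n
    m4 = sum((x - mean) ** 4 for x in sample) / n
    skew = m3 / m2 ** 1.5   # sqrt(b1)
    kurt = m4 / m2 ** 2     # b2
    jb = n * (skew ** 2 / 6.0 + (kurt - 3.0) ** 2 / 24.0)
    return jb, math.exp(-jb / 2.0)


if __name__ == "__main__":
    print(jarque_bera(list(range(20))))
```

Because the chi-square approximation is asymptotic, the p-value should be treated with caution for small samples.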

Robust Jarque-Bera test [RJB]
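The robust Jarque-Bera (RJB) test statistic is defined as

$$RJB = \frac{n}{6} \left( \frac{m_3}{J_n^3} \right)^2 + \frac{n}{64} \left( \frac{m_4}{J_n^4} - 3 \right)^2, \qquad J_n = \frac{\sqrt{\pi/2}}{n} \sum_{i=1}^{n} |x_i - M|$$

where $m_3$ and $m_4$ are the third and fourth central sample moments and $M$ is the sample median.
The RJB statistic is asymptotically $\chi^2$-distributed with two degrees of freedom.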
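The robust variant replaces the second-moment scale by the median-based dispersion $J_n = \frac{\sqrt{\pi/2}}{n}\sum |x_i - M|$. The sketch below assumes the form $RJB = \frac{n}{6}(m_3/J_n^3)^2 + \frac{n}{64}(m_4/J_n^4 - 3)^2$ with a chi-square(2) reference distribution; the function name is hypothetical.

```python
import math
from statistics import median


def robust_jarque_bera(sample):
    """Robust Jarque-Bera statistic and its asymptotic chi-square(2) p-value."""
    n = len(sample)
    mean = sum(sample) / n
    m3 = sum((x - mean) ** 3 for x in sample) / n
    m4 = sum((x - mean) ** 4 for x in sample) / n
    med = median(sample)
    # Robust dispersion: scaled average absolute deviation from the median.
    j_n = math.sqrt(math.pi / 2.0) * sum(abs(x - med) for x in sample) / n
    rjb = (n / 6.0) * (m3 / j_n ** 3) ** 2 + (n / 64.0) * (m4 / j_n ** 4 - 3.0) ** 2
    return rjb, math.exp(-rjb / 2.0)


if __name__ == "__main__":
    print(robust_jarque_bera(list(range(20))))
```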

Doornik-Hansen test [DH]
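The test statistic involves the use of the transformed skewness and transformed kurtosis, computed from the sample skewness $\sqrt{b_1} = m_3/m_2^{3/2}$ and sample kurtosis $b_2 = m_4/m_2^2$. The transformed skewness is given by the following expression:

$$z_1 = \frac{\ln\left( Y/c + \sqrt{(Y/c)^2 + 1} \right)}{\sqrt{\ln w}}$$

where $Y$, $c$ and $w$ are obtained by

$$Y = \sqrt{b_1} \sqrt{\frac{(n+1)(n+3)}{6(n-2)}}, \qquad w^2 = -1 + \sqrt{2(\beta_2 - 1)}, \qquad \beta_2 = \frac{3(n^2+27n-70)(n+1)(n+3)}{(n-2)(n+5)(n+7)(n+9)}, \qquad c = \sqrt{\frac{2}{w^2 - 1}}.$$

Bowman and Shenton [28] proposed the transformed kurtosis $z_2$ as follows:

$$z_2 = \left[ \left( \frac{\xi}{2a} \right)^{1/3} - 1 + \frac{1}{9a} \right] \sqrt{9a}$$

with $\xi$ and $a$ obtained by

$$\xi = (b_2 - 1 - b_1) \, 2k, \qquad k = \frac{(n+5)(n+7)(n^3+37n^2+11n-313)}{12(n-3)(n+1)(n^2+15n-4)},$$

$$a = \frac{(n-2)(n+5)(n+7)(n^2+27n-70)}{6(n-3)(n+1)(n^2+15n-4)} + b_1 \frac{(n-7)(n+5)(n+7)(n^2+2n-5)}{6(n-3)(n+1)(n^2+15n-4)}.$$

The test statistic proposed by Doornik and Hansen [12] is given by

$$DH = z_1^2 + z_2^2$$

The normality hypothesis is rejected for large values of the test statistic. The test statistic is approximately chi-square distributed with two degrees of freedom.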

Brys-Hubert-Struyf MC-LR test [BH]
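This test is based on robust measures of skewness and tail weight. The considered robust measure of skewness is the medcouple (MC), defined as

$$MC = \underset{x_{(i)} \leq m_F \leq x_{(j)}}{\mathrm{med}} \; h\left(x_{(i)}, x_{(j)}\right)$$

where med stands for the median, $m_F$ is the sample median and $h$ is a kernel function given by

$$h\left(x_{(i)}, x_{(j)}\right) = \frac{\left(x_{(j)} - m_F\right) - \left(m_F - x_{(i)}\right)}{x_{(j)} - x_{(i)}}$$

The left medcouple (LMC) and the right medcouple (RMC) are the considered robust measures of left and right tail weight, respectively, and are defined by

$$LMC = -MC(x < m_F), \qquad RMC = MC(x > m_F)$$

The test statistic $T_{MC\text{-}LR}$ is then defined by

$$T_{MC\text{-}LR} = n \, (w - \omega)^{T} V^{-1} (w - \omega)$$

where $w = [MC, LMC, RMC]^{T}$, and $\omega$ and $V$ are the asymptotic mean and covariance matrix of $w$ under normality [15]. The normality hypothesis is rejected for large values of the test statistic, which approximately follows the chi-square distribution with three degrees of freedom.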

Bonett-Seier test [BS]
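The test statistic $T_w$ is given by:

$$T_w = \frac{\sqrt{n+2} \, (\hat{\omega} - 3)}{3.54}, \qquad \hat{\omega} = 13.29 \left[ \ln \sqrt{m_2} - \ln\left( \frac{1}{n} \sum_{i=1}^{n} |x_i - \bar{x}| \right) \right]$$

where $m_2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$. This statistic follows a standard normal distribution under the null hypothesis.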
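The Bonett-Seier statistic compares the log standard deviation with the log mean absolute deviation. The sketch below assumes the form $\hat{\omega} = 13.29\,[\ln\sqrt{m_2} - \ln(n^{-1}\sum|x_i - \bar{x}|)]$ and $T_w = \sqrt{n+2}\,(\hat{\omega} - 3)/3.54$; the function name is hypothetical and the two-sided p-value is an illustrative choice.

```python
import math
from statistics import NormalDist


def bonett_seier(sample):
    """Bonett-Seier T_w statistic with a two-sided standard normal p-value."""
    n = len(sample)
    mean = sum(sample) / n
    sigma_hat = math.sqrt(sum((x - mean) ** 2 for x in sample) / n)
    tau_hat = sum(abs(x - mean) for x in sample) / n  # mean absolute deviation
    omega_hat = 13.29 * (math.log(sigma_hat) - math.log(tau_hat))
    t_w = math.sqrt(n + 2) * (omega_hat - 3.0) / 3.54
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(t_w)))
    return t_w, p_value


if __name__ == "__main__":
    print(bonett_seier(list(range(20))))
```

For normal data the ratio of the standard deviation to the mean absolute deviation is close to $\sqrt{\pi/2}$, which makes $\hat{\omega}$ close to 3 and $T_w$ close to zero.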

Brys-Hubert-Struyf-Bonett-Seier Joint test [BHBS]
The normality hypothesis of the data is rejected for the joint test when rejection is obtained for either one of the two individual tests for a significance level of α/2.

Bontemps-Meddahi tests [BM(1) and BM(2)]
The general expression of the test family is given by:

$$BM_{3-p} = \sum_{k=3}^{p} \left( \frac{1}{\sqrt{n}} \sum_{i=1}^{n} H_k(z_i) \right)^2$$

where $z_i = (x_i - \bar{x})/s$ and $H_k(\cdot)$ represents the $k$th order normalized Hermite polynomial.
Different tests can be obtained by assigning different values to $p$, which represents the maximum order of the considered normalized Hermite polynomials in the expression above. Two different tests are considered in this work, with $p=4$ and $p=6$; these tests are termed $BM_{3-4}$ and $BM_{3-6}$. The hypothesis of normality is rejected for large values of the test statistic and, according to Bontemps and Meddahi [18], the general $BM_{3-p}$ family of tests asymptotically follows the chi-square distribution with $p-2$ degrees of freedom.
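The Hermite-polynomial construction can be sketched as follows. The code assumes the statistic is the sum over k = 3, ..., p of $\left(n^{-1/2}\sum_i H_k(z_i)\right)^2$, builds the normalized polynomials with the recursion $H_k(u) = [u H_{k-1}(u) - \sqrt{k-1}\,H_{k-2}(u)]/\sqrt{k}$ starting from $H_0 = 1$ and $H_1 = u$, and standardizes with the n-1 standard deviation. These choices and all function names are assumptions made for illustration.

```python
import math
from statistics import mean, stdev


def normalized_hermite(k, u):
    """k-th normalized Hermite polynomial evaluated at u, by recursion."""
    h_prev, h = 1.0, u  # H_0 and H_1
    if k == 0:
        return h_prev
    for i in range(2, k + 1):
        h_prev, h = h, (u * h - math.sqrt(i - 1) * h_prev) / math.sqrt(i)
    return h


def bontemps_meddahi(sample, p):
    """BM statistic using Hermite polynomials of order 3 through p."""
    n = len(sample)
    m, s = mean(sample), stdev(sample)
    z = [(x - m) / s for x in sample]
    stat = 0.0
    for k in range(3, p + 1):
        stat += (sum(normalized_hermite(k, zi) for zi in z) / math.sqrt(n)) ** 2
    return stat


def chi2_sf_even(x, df):
    """Upper tail probability of a chi-square distribution with even df."""
    half = x / 2.0
    term, total = 1.0, 0.0
    for j in range(df // 2):
        total += term
        term *= half / (j + 1)
    return math.exp(-half) * total


if __name__ == "__main__":
    data = list(range(20))
    for p_max in (4, 6):
        stat = bontemps_meddahi(data, p_max)
        print(p_max, stat, chi2_sf_even(stat, p_max - 2))
```

With p=4 and p=6 the reference distributions have 2 and 4 degrees of freedom, which is why a closed form for even degrees of freedom is enough here.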

Gel-Miao-Gastwirth test [GMG]
The test is based on the ratio of the standard deviation $s$ to the robust measure of dispersion $J_n$, defined as:

$$J_n = \frac{\sqrt{\pi/2}}{n} \sum_{i=1}^{n} |x_i - M|$$

where $M$ is the sample median. The normality test statistic $R_{sJ}$, which should tend to one under a normal distribution, is thus given by:

$$R_{sJ} = \frac{s}{J_n}$$

The normality hypothesis is rejected for large values of $R_{sJ}$, and the statistic $\sqrt{n}\,(R_{sJ} - 1)$ is asymptotically normally distributed [19].
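A short sketch of the ratio statistic is given below. It takes s as the n-1 sample standard deviation and $J_n = \frac{\sqrt{\pi/2}}{n}\sum|x_i - M|$ with M the sample median. The standardization by $\sqrt{(\pi - 3)/2}$ is the asymptotic standard deviation as we understand it from Gel et al. [19] and should be treated as an assumption, as should the function name.

```python
import math
from statistics import median, stdev


def gel_miao_gastwirth(sample):
    """Return R_sJ, sqrt(n) * (R_sJ - 1), and a standardized z value."""
    n = len(sample)
    med = median(sample)
    j_n = math.sqrt(math.pi / 2.0) * sum(abs(x - med) for x in sample) / n
    r_sj = stdev(sample) / j_n
    root_n_stat = math.sqrt(n) * (r_sj - 1.0)
    # Assumed asymptotic variance (pi - 3) / 2 under normality.
    z = root_n_stat / math.sqrt((math.pi - 3.0) / 2.0)
    return r_sj, root_n_stat, z


if __name__ == "__main__":
    print(gel_miao_gastwirth(list(range(20))))
```

Heavy-tailed samples inflate s relative to $J_n$, so large positive values of z point toward rejection.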

G test [G]
The test is used to test whether an underlying population distribution is a uniform distribution. Suppose $x_1, x_2, \ldots, x_n$ are the observations of a random sample from a population with distribution function $F(x)$. Suppose also that $x_{(1)}, x_{(2)}, \ldots, x_{(n)}$ are the corresponding order statistics. The test statistic has the form:

$$G(x_1, \ldots, x_n) = \frac{(n+1) \sum_{i=1}^{n+1} \left( x_{(i)} - x_{(i-1)} - \frac{1}{n+1} \right)^2}{n}$$

where $x_{(0)}$ is defined as 0 and $x_{(n+1)}$ as 1. We can observe that $F(x_{(1)}), F(x_{(2)}), \ldots, F(x_{(n)})$ are the ordered observations of a random sample from the U(0,1) distribution, and thus the G statistic can be expressed as:

$$G(x_1, \ldots, x_n) = \frac{(n+1) \sum_{i=1}^{n+1} \left( F(x_{(i)}) - F(x_{(i-1)}) - \frac{1}{n+1} \right)^2}{n}$$

where $F(x_{(0)}) = 0$ and $F(x_{(n+1)}) = 1$. When the population distribution is the same as the specified distribution, the value of the test statistic should be close to zero. On the other hand, when the population distribution is far from the specified distribution, the value should be close to one.
In order to use the test for normality, we can take $F(x)$ to be a normal distribution. Considering the case where the parameters of the distribution are not known, Lilliefors' idea is adopted by calculating $\bar{x}$ and $s^2$ from the sample data and using them as estimates for $\mu$ and $\sigma^2$ respectively, so that $F(x)$ is the cumulative distribution function of the $N(\bar{x}, s^2)$ distribution. By using the transformation

$$z = \frac{x - \mu}{\sigma}$$

the test statistic becomes:

$$G(x_1, \ldots, x_n) = \frac{(n+1) \sum_{i=1}^{n+1} \left( \Phi(z_{(i)}) - \Phi(z_{(i-1)}) - \frac{1}{n+1} \right)^2}{n}$$

where $z_{(i)} = (x_{(i)} - \bar{x})/s$, $\Phi(z_{(0)}) = 0$ and $\Phi(z_{(n+1)}) = 1$. The hypothesis of normality should be rejected at significance level α if the test statistic is larger than its 1 - α critical value. A table of critical values is available in Chen and Ye [20].
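The spacing-based G statistic is straightforward to compute once the data are mapped to the unit interval. The sketch below assumes the form $G = \frac{n+1}{n}\sum_{i=1}^{n+1}\left(u_{(i)} - u_{(i-1)} - \frac{1}{n+1}\right)^2$ with $u_{(0)} = 0$ and $u_{(n+1)} = 1$, and obtains the $u_{(i)}$ from a normal CDF fitted with the sample mean and n-1 standard deviation. The function names are hypothetical.

```python
from statistics import NormalDist, mean, stdev


def g_statistic_uniform(us):
    """G statistic for values that should be uniform on (0, 1)."""
    n = len(us)
    points = [0.0] + sorted(us) + [1.0]  # add the end points u_(0) and u_(n+1)
    target = 1.0 / (n + 1)               # expected spacing under uniformity
    total = sum((points[i] - points[i - 1] - target) ** 2 for i in range(1, n + 2))
    return (n + 1) * total / n


def g_statistic_normal(sample):
    """G statistic for normality with estimated mean and standard deviation."""
    null = NormalDist(mean(sample), stdev(sample))
    return g_statistic_uniform([null.cdf(x) for x in sample])


if __name__ == "__main__":
    print(g_statistic_uniform([0.25, 0.5, 0.75]))
    print(g_statistic_normal(list(range(20))))
```

Perfectly even spacings give a value of zero, and the statistic approaches one as the transformed observations pile up at a single point.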

Simulation
Since a theoretical comparison among the test statistics is not feasible, a simulation study is conducted in this section to compare their performance.

Simulation techniques
The results of the simulation vary across levels of significance, sample sizes and alternative distributions. The results at the 0.05 significance level for the distributions considered are presented in Tables 1-3. First, we generate samples of sizes 20, 50, 100 and 500 from the standard normal distribution to compare the probabilities of type I error. The empirical probability of type I error is defined as the number of times the null hypothesis of normality is rejected divided by the total number of simulations. The results in Table 1 are based on 10,000 simulations. We use the software R 3.1.1 (R Core Team [29]) for all simulations.
The empirical power of a test is calculated as the proportion of simulations in which the null hypothesis is rejected when the alternative hypothesis of non-normality is true. For power comparison purposes, we consider the Beta, Uniform, Student's t and Laplace distributions as the class of symmetric distributions. For the class of asymmetric distributions, we consider the Gamma, Chi-square, Exponential, Log-Normal, Weibull and Gompertz distributions. To compare the power of the tests, we generate samples of sizes 20, 50, 100 and 500 from these non-normal distributions. The power based on 10,000 simulations from the symmetric distributions is presented in Table 2, and that from the asymmetric distributions is shown in Table 3. The critical values for the corresponding test statistics are discussed in Section 2.
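The simulation design described above can be sketched in a few lines. The example below is not the authors' R code: it is an illustrative Python version that uses only the Jarque-Bera statistic with its asymptotic chi-square(2) critical value of 5.991 at the 5% level, a reduced number of replications, and hypothetical function names. The rejection rate under normal data estimates the type I error, and the rejection rate under a non-normal alternative estimates the power.

```python
import random


def jarque_bera_statistic(sample):
    """Jarque-Bera statistic from moments with divisor n."""
    n = len(sample)
    mean = sum(sample) / n
    m2 = sum((x - mean) ** 2 for x in sample) / n
    m3 = sum((x - mean) ** 3 for x in sample) / n
    m4 = sum((x - mean) ** 4 for x in sample) / n
    return n * ((m3 / m2 ** 1.5) ** 2 / 6.0 + (m4 / m2 ** 2 - 3.0) ** 2 / 24.0)


def rejection_rate(draw, n, n_sims, critical_value=5.991, seed=1):
    """Fraction of simulated samples for which normality is rejected."""
    rng = random.Random(seed)  # fixed seed so that runs are reproducible
    rejections = 0
    for _ in range(n_sims):
        sample = [draw(rng) for _ in range(n)]
        if jarque_bera_statistic(sample) > critical_value:
            rejections += 1
    return rejections / n_sims


if __name__ == "__main__":
    # Type I error: data generated under the null hypothesis of normality.
    print(rejection_rate(lambda rng: rng.gauss(0.0, 1.0), n=50, n_sims=2000))
    # Power: data generated from an Exponential(1) alternative.
    print(rejection_rate(lambda rng: rng.expovariate(1.0), n=50, n_sims=2000))
```

Replacing `jarque_bera_statistic` and the critical value with any other test gives the corresponding entry of a type I error or power table.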

Discussion of simulation results
The best test is the one with maximum power while maintaining the nominal significance level. Table 1 gives the type I error rates, while Tables 2 and 3 give the power of the tests for the several alternative distributions.
An examination of the performance of the tests in terms of type I error rate shows that the LL, AD, CS, DK, SK, KU, SW, SF, RJB and DH tests performed better than the other tests; these tests have type I error rates that were around the specified 5% level. The RJB test also has a generally acceptable type I error rate, but the rates were slightly higher than specified when the sample size was less than 50. The JB, BH, BS, BM(1) and G statistics all have type I error rates lower than 5% and tend to under-reject, while the BHBS, BM(2) and GMG have type I error rates higher than 5% and tend to over-reject. of the sample size and the significance level. A general and expected pattern was observed: as the sample size increases, the power of the test also increases.
With Beta (2, 2) and Beta (3, 3) as the alternative distributions, we have symmetric distributions with short tails. With Beta (2, 2), only the KU at 78.79% exhibited substantial power when the sample size was less than 100, followed by the CS at 64.97%. However, with a sample size of 200, all the tests reached at least 80% except for BHBS at 77.99%, SF at 75.40%, AD at 70.79% and JB at 61.04%. All other tests do not exhibit substantial power, especially the SK and BH, which had 0.05% and 46.74% power respectively even at n=1000 and are clearly not suitable for these conditions. As the value of the parameter increases, the tails of the distribution lighten and the coefficient of kurtosis moves toward that of the normal distribution, resulting in a loss of power. In fact, for Beta (3, 3), considerable power was not achieved until the sample size was 200; the kurtosis test was able to achieve 79.72% power at this point.
In the case of a Uniform (0, 1) as the alternative distribution, the KU test had a power of 88.59% at n=50, making it the most powerful under this condition, followed closely by DK (79.77%). With n=100, all tests except the LL, CS, SK, JB, RJB, BH, BM(1) and G had power greater than 80%; the CS, SK, JB, RJB, BH, BM(1) and G in particular performed very poorly with n ≤ 50 in this situation, with the SK achieving a power of only 0.07% even at n=1000.
Considering a Laplace (0, 1) with a mean of zero, the GMG is the most powerful for all sample sizes and achieved a power of 94.86% with n=100, with the AD, SW, SF, RJB, DH, BS, BHBS and BM(2) all achieving power above the 80% threshold. The SK and the G tests are the least powerful under this alternative distribution.
In the situation where the alternative distribution is a Gamma (4, 5), the most powerful test was the SW, reaching a power of 95.81% at n=100. It was followed closely by the DH, BM(2), SF and SK, all achieving more than 90% power at n=100. The least powerful in this situation are the G, KU and BS. Neither the G nor the KU achieved 80% power until n=500; the BS only achieved a power of 61.99% even at n=1000.
The chi-square (3) distribution was easy for the tests to identify as non-normal, even at sample sizes as small as 30. At n=50, all eighteen tests considered had reached at least the 80% threshold except for the KU, BH, BS, BHBS, GMG and G. The least powerful was the BS test, which did not achieve 100% power even at n=1000, whereas all other tests did.
Exponential (1) also proved to be a distribution that was easy for the tests to identify as non-normal, with the SW and SF having power above 80% at only n=20. All tests were able to achieve more than 80% power at only n=50 except for the KU, BH, BS, BHBS and GMG. All tests, however, surpassed the 80% threshold at n=100 except for the BS, which only achieved 57.70% power at this sample size and proved the least powerful, not achieving 100% power even at n=1000, whereas all other tests did.
The SW test proved to be the most powerful under the Log-normal alternative distribution, achieving a power of 83.73% at n=20, followed closely by its modified form, the SF (80.13%). All tests surpassed the 80% threshold at n=40 except for the BH and BS, which only achieved power of 65.76% and 69.87% respectively. The BHBS, a joint test of the BH and BS, proved more powerful than the individual tests by achieving a power of 89.52% at n=40. However, the BHBS is not recommended, as it has an unacceptable type I error rate.
The power results for a Weibull (2, 2) alternative distribution showed that the SW is the most powerful under this distribution. The test achieved a power of 79.33% at n=100, which is just below the 80% rate that is usually described as acceptable. The SW is closely followed by the DH (72.64%) and SF (71.64%). The AD, DK, SK, JB, RJB, BM(1) and BM(2) were also able to achieve at least 80% power at n=200. The BS once again proved to be the least powerful among the tests under this distribution, achieving a power of only 16.94%.
An asymmetric, short-tailed Gompertz distribution as the alternative distribution showed the SK test to be powerful and a strong rival to the popular SW test. However, none of the tests was able to achieve 80% power until the sample size was increased to 100, at which point all of the tests except the LL, CS, KU, BH, BS, GMG and G had surpassed the threshold. The BS once more was the least powerful under this distribution; although most of the tests achieved the 80% threshold and a significant number of them achieved 100% at n=500, the BS was only able to achieve 65.88% power.
At the 1% level, a Weibull (2, 2) distribution also showed the RJB as the most powerful for sample sizes of 40 or less and the SW for larger sample sizes, as against the BHBS for a sample size of 10 and the SW for larger sample sizes at the 5% level. The most drastic change, however, occurs for the Gompertz (0.001, 1) distribution at the 1% level, where the GMG was the most powerful for a sample size of 10 and the SK for other sample sizes. The SK would probably have been the most powerful for a sample size of 10 were it not unavailable, along with the KU and DK, for sample sizes less than 20. At the 5% level, on the other hand, the RJB was the most powerful for sample sizes of 40 or less and the BM(2) for larger sample sizes.
As is clear from the above discussion, all these tests behave differently depending on the alternative distribution under consideration. Even though the BHBS, BM(2) and GMG proved powerful in certain situations, they are not recommended for testing normality, as they do not effectively control the type I error rate. The results are in good agreement with those obtained in Yap and Sim [22]. A general and expected pattern was observed: as the sample size increases, the power increases for all tests.

Application
This section illustrates the performance of the tests using a real-life example of medical data. The postmortem interval (PMI) is defined as the elapsed time between death and an autopsy. Knowledge of the PMI is considered essential when conducting medical research on human cadavers. The data (Data Source: Hayes and Lewis [30]) are PMIs of 22 human brain specimens obtained at autopsy in a recent study. The sample is positively skewed with skewness=0.99 and short-tailed with kurtosis=-0.16; the mean is 7.30, the SD is 3.18 and the sample size is 22. The QQ plot of the PMI data (Figure 1) indicates that the data are not symmetric.
The computed values of the test statistics, along with their p-values and decisions, are presented in Table 4. This dataset was originally modeled by a gamma distribution with shape parameter α=5.25 and scale parameter β=1.39, so one may expect the hypothesis of normality to be rejected. However, seven of the eighteen tests considered failed to reject this hypothesis, including the popular DK, SW and SF tests. It can be noted that the coefficient of kurtosis of the data is -0.16, which is close to that of a normal distribution.

Summary and Conclusions
We have considered eighteen different tests of normality, comprising the most popular tests along with some recently proposed ones. The performance was measured in terms of type I error rate and power of the test [31,32]. The type I error rate is the rate of rejection of the hypothesis of normality for data from the normal distribution, while the power of the test is the rate of rejection of the normality hypothesis for data generated from a non-normal distribution. We have considered both symmetric and asymmetric distributions in the simulation study. Based on the simulation results, we have found several useful test statistics for testing normality. Among all the methods with an acceptable type I error rate, the Kurtosis test is the most powerful for symmetric data and the Shapiro-Wilk test [33] is the most powerful for asymmetric data. The findings of this paper are in good agreement with Yap and Sim [22], but the Kurtosis and Skewness tests were not included in their paper. Interestingly, the kurtosis test turned out to be the best test for symmetric distributions, and the Skewness test performs well for both symmetric and asymmetric distributions.