The logic of the Fisher method to combine P-values

Posted on 11.May.2012 by A. M. Winkler

Consider a set of $k$ independent tests, each of these to test a certain null hypothesis $\mathcal{H}_{0|i}$ , $i=\{1, 2, \ldots, k\}$ . For each test, a significance level $P_{i}$ , i.e., a p-value, is obtained. All these p-values can be combined into a joint test whether there is a global effect, i.e., if a global null hypothesis $\mathcal{H}_0$ can be rejected.

There are a number of ways to combine these independent, partial tests. The Fisher method is one of these, and is perhaps the most famous and most widely used. The test was presented in Fisher’s now classical book, Statistical Methods for Research Workers, and was described rather succinctly:

When a number of quite independent tests of significance have been made, it sometimes happens that although few or none can be claimed individually as significant, yet the aggregate gives an impression that the probabilities are on the whole lower than would often have been obtained by chance. It is sometimes desired, taking account only of these probabilities, and not of the detailed composition of the data from which they are derived, which may be of very different kinds, to obtain a single test of the significance of the aggregate, based on the product of the probabilities individually observed.
The circumstance that the sum of a number of values of $\chi^{2}$ is itself distributed in the $\chi^{2}$ distribution with the appropriate number of degrees of freedom, may be made the basis of such a test. For in the particular case when $n=2$ , the natural logarithm of the probability is equal to $\frac{1}{2}\chi^{2}$ . If therefore we take the natural logarithm of a probability, change its sign and double it, we have the equivalent value of $\chi^{2}$ for 2 degrees of freedom. Any number of such values may be added together, to give a composite test, using the Table of $\chi^{2}$ to examine the significance of the result. — Fisher, 1932.

The test is based on the fact that the probability of rejecting the global null hypothesis is related to intersection of the probabilities of each individual test, $\prod_{i} P_{i}$ . However, $\prod_{i} P_{i}$ is not uniformly distributed, even if the null is true for all partial tests, and cannot be used itself as the joint significance level for the global test. To remediate this fact, some interesting properties and relationships among distributions of random variables were exploited by Fisher and embodied in the succinct excerpt above. These properties are discussed below.

The logarithm of uniform is exponential

The cumulative distribution function (cdf) of an exponential distribution is:

$F(x)=1- e^{-\lambda x}$

where $\lambda$ is the rate parameter, the only parameter of this distribution. The inverse cdf is, therefore, given by:

$x = -\dfrac{1}{\lambda}\ln(1-F(x))$

If $P$ is a random variable uniformly distributed in the interval $[0, 1]$ , so is $1-P$ , and it is immaterial to differ between them. As a consequence, the previous equation can be equivalently written as:

$x = -\dfrac{1}{\lambda}\ln(P)$

where $P \sim \mathcal{U}(0,1)$ , which highlights the fact that the negative of the natural logarithm of a random variable distributed uniformly between 0 and 1 follows an exponential distribution with rate parameter $\lambda=1$ .

An exponential with rate 1/2 is chi-squared

The cdf of a chi-squared distribution with $\nu$ degrees of freedom, i.e. $\chi^{2}_{\nu}$ , is given by:

$F(x; \nu) = \dfrac{\int_{0}^{x/2} t^{\frac{\nu}{2}-1}e^{-t}{\rm d}t}{\left(\frac{\nu}{2}-1\right)!}$

If $\nu=2$ , and solving the integral we have:

$F(x; \nu=2) = \dfrac{\int_{0}^{x/2} t^{\frac{2}{2}-1}e^{-t}{\rm d}t}{\left(\frac{2}{2}-1\right)!} = \int_{0}^{x/2} e^{-t}{\rm d}t = 1-e^{-x/2}$

In other words, a $\chi^{2}$ distribution with $\nu=2$ is equivalent to an exponential distribution with rate parameter $\lambda=1/2$ .

The sum of chi-squared is also chi-squared

The moment-generating function (mgf) of a sum of independent variables is the product of the mgfs of the respective variables. The mgf of a $\chi^{2}_{\nu}$ is:

$M(t) = (1-2t)^{-\nu/2}$

The mgf of the sum of $k$ independent variables that follow each a $\chi^{2}_{2}$ distribution is then given by:

$M_{\text{sum}}(t) = \prod_{i=1}^{k} (1-2t)^{-2/2} = (1-2t)^{-k}$

which also defines a $\chi^{2}$ distribution, however with degrees of freedom $\nu=2k$ .

Assembling the pieces

With these facts in mind, how to transform the product $\prod_{i} P_{i}$ into a p-value that is uniformly distributed when the global null is true? The product can be converted into a sum by taking the logarithm. And as shown above, the logarithm of uniformly distributed variables follows an exponential distribution with rate parameter $\lambda=1$ . Multiplication of each $\ln(P_{i})$ by 2 changes the rate parameter to $\lambda=1/2$ and makes this distribution equivalent to a $\chi^{2}$ distribution with degrees of freedom $\nu=2$ . The sum of $k$ of these logarithms also follow a $\chi^2$ distribution, now with $\nu=2k$ degrees of freedom, i.e., $\chi^{2}_{2k}$ .

The statistic for the Fisher method is, therefore, computed as:

$X = -2 \sum_{i=1}^{k} \ln(P_{i})$

with $X$ following a $\chi^{2}_{2k}$ distribution, from which a p-value for the global hypothesis can be easily obtained.

Reference

The details above are not in the book, presumably omitted by Fisher as the knowledge of these derivation details would be of little practical use. Nonetheless, the reference for the book is:

Fisher, R. A., 1932. Statistical Methods for Research Workers, 4th Edition. Oliver and Boyd, Edinburgh.

Confidence intervals for Bernoulli trials

Posted on 21.April.2012 by A. M. Winkler

A Bernoulli trial is an experiment in which the outcome can be one of two mutually exclusive results, e.g. true/false, success/failure, heads/tails and so on. A number of methods are available to compute confidence intervals after many such trials have been performed. The most common have been discussed and reviewed by Brown et al. (2001), and are presented below. Consider $n$ trials, with $X$ successes and a significance level of $\alpha=0.05$ to obtain a 95% confidence interval. For each of the methods, the interval is shown graphically for $1 \leqslant n \leqslant 100$ and $1 \leqslant X \leqslant n$ .

Wald

This is the most common method, discussed in many textbooks, and probably the most problematic for small samples. It is based on a normal approximation to the binomial distribution, and it is often called “Wald interval” for it’s relationship with the Wald test. The interval is calculated as:

Lower bound: $L=p-k\sqrt{pq/n}$
Upper bound: $U=p+k\sqrt{pq/n}$

where $k = \Phi^{-1}\{1-\alpha/2\}$ , $\Phi^{-1}$ is the probit function, $p=X/n$ and $q=1-p$ .

Wald confidence interval.

Wilson

This interval appeared in Wilson (1927) and is defined as:

Lower bound: $L = \tilde{p} - (k/\tilde{n})\sqrt{npq+k^2/4}$
Upper bound: $U = \tilde{p} + (k/\tilde{n})\sqrt{npq+k^2/4}$

where $\tilde{p} = \tilde{X}/\tilde{n}$ , $\tilde{n}=n+k^2$ , $\tilde{X} = X+ k^2/2$ , and the remaining are as above. This is probably the most appropriate for the majority of situations.

Wilson confidence interval.

Agresti-Coull

This interval appeared in Agresti and Coull (1998) and shares many features with the Wilson interval. It is defined as:

Lower bound: $L = \tilde{p} - k\sqrt{\tilde{p}\tilde{q}/\tilde{n}}$
Upper bound: $U = \tilde{p} + k\sqrt{\tilde{p}\tilde{q}/\tilde{n}}$

where $\tilde{q}=1-\tilde{p}$ , and the remaining are as above.

Agresti-Coull confidence interval.

Jeffreys

This interval has a Bayesian motivation and uses the Jeffreys prior (Jeffreys, 1946). It seems to have been introduced by Brown et al. (2001). It is defined as:

Lower bound: $L = \mathcal{B}^{-1}\{\alpha/2,X+1/2,n-X+1/2\}$
Upper bound: $U = \mathcal{B}^{-1}\{1-\alpha/2,X+1/2,n-X+1/2\}$

where $\mathcal{B}^{-1}\{x,s_1,s_2\}$ is the inverse cdf of the beta distribution (not to be confused with the beta function), at the quantile $x$ , and with shape parameters $s_1$ and $s_2$ .

Jeffreys confidence interval.

Clopper-Pearson

This interval was proposed by Clopper and Pearson (1934) and is based on a binomial test, rather than on approximations, hence sometimes being called “exact”, although it is not “exact” in the common sense. It is considered overly conservative.

Lower bound: $L = \mathcal{B}^{-1}\{\alpha/2,X,n-X+1\}$
Upper bound: $U = \mathcal{B}^{-1}\{1-\alpha/2,X+1,n-X\}$

where $\mathcal{B}^{-1}\{x,s_1,s_2\}$ is the inverse cdf of the beta distribution as above.

Clopper-Pearson confidence interval.

Arc-Sine

This interval is based on the arc-sine variance-stabilising transformation. The interval is defined as:

Lower bound: $L = \sin\{\arcsin\{\sqrt{a}\} - k/(2\sqrt{n})\}^2$
Upper bound: $U = \sin\{\arcsin\{\sqrt{a}\} + k/(2\sqrt{n})\}^2$

where $a=\frac{X+3/8}{n+3/4}$ replaces what otherwise would be $p$ (Anscombe, 1948).

Arc-sine confidence interval.

Logit

This interval is based on the Wald interval for $\lambda = \ln\{\frac{X}{n-X}\}$ . It is defined as:

Lower bound: $L = e^{\lambda_L}/(1+e^{\lambda_L})$
Upper bound: $U = e^{\lambda_U}/(1+e^{\lambda_U})$

where $\lambda_L = \lambda - k\sqrt{\hat{V}}$ , $\lambda_U = \lambda + k\sqrt{\hat{V}}$ , and $\hat{V} = \frac{n}{X(n-X)}$ .

Logit confidence interval.

Anscombe

This interval was proposed by Anscombe (1956) and is based on the logit interval:

Lower bound: $L = e^{\lambda_L}/(1+e^{\lambda_L})$
Upper bound: $U = e^{\lambda_U}/(1+e^{\lambda_U})$

The difference is that $\lambda=\ln\{\frac{X+1/2}{n-X+1/2}\}$ and $\hat{V}=\frac{(n+1)(n+2)}{n(X+1)(n-X+1)}$ . The values for $\lambda_L$ and $\lambda_U$ are as above.

Anscombe’s confidence interval.

Octave/MATLAB implementation

A function that computes these intervals is available here: confint.m.

References

Agresti A and Coull BA. Approximate is better than “exact” for interval estimation of binomial proportions. The American Statistician. 1998; 52(2):119-126.
Anscombe FJ The transformation of Poisson, binomial and negative-binomial data. Biometrika. 1948; 35(3/4:246-254.
Anscombe FJ. On estimating binomial response relations. Biometrika. 1956; 43(3/4):461-464.
Brown LD, Cai TT, DasGupta A. Interval estimation for a binomial proportion. Statist Sci. 2001; 16(2):101-133.
Clopper CJ, Pearson ES. The use of confidence or fiducial limits illustrated in the case of the binomial. Biometrika. 1934; 26(4):404-413.
Jeffreys H. An invariant form for the prior probability in estimation problems. Proc. R. Soc. Lond. A. 1946; 186(1007):453-461.
Wilson EB. Probable inference, the law of succession, and statistical inference. J Am Stat Assoc. 1927; 22(158):209-21.

False Discovery Rate: Corrected & Adjusted P-values

Posted on 05.September.2011 by A. M. Winkler

Until mid-1990’s, control of the error rate under multiple testing was done, in general, using family-wise error rate (FWER). An example of this kind of correction is the Bonferroni correction. In an influential paper, Benjamini and Hochberg (1995) introduced the concept of false discovery rate (FDR) as a way to allow inference when many tests are being conducted. Differently than FWER, which controls the probability of committing a type I error for any of a family of tests, FDR allows the researcher to tolerate a certain number of tests to be incorrectly discovered. The word rate in the FDR is in fact a misnomer, as the FDR is the proportion of discoveries that are false among all discoveries, i.e., proportion of incorrect rejections among all rejections of the null hypothesis.

Benjamini and Hochberg’s FDR-controlling procedure

Consider testing $N$ hypotheses, $H_{1}, H_{2}, \ldots , H_{N}$ based on their respective p-values, $p_{1}, p_{2}, \ldots , p_{N}$ . Consider that a fraction $q$ of discoveries are allowed (tolerated) to be false. Sort the p-values in ascending order, $p_{(1)} \leq p_{(2)} \leq \ldots \leq p_{(N)}$ and denote $H_{(i)}$ the hypothesis corresponding to $p_{(i)}$ . Let $k$ be the largest $i$ for which $p_{(i)} \leq \frac{i}{N}\frac{q}{c(N)}$ . Then reject all $H_{(i)}$ , $i=1, 2, \ldots , k$ . The constant $c(N)$ is not in the original publication, and appeared in Benjamini and Yekutieli (2001) for cases in which independence cannot be ascertained. The possible choices for the constant are $c(N)=1$ or $c(N)=\sum_{i=1}^{N} 1/i$ . The second is valid in any situation, whereas the first is valid for most situations, particularly where there are no negative correlations among tests. The B&H procedure has found many applications across different fields, including neuroimaging, as introduced by Genovese et al. (2002).

FDR correction

The procedure described above effectively defines a single number, a threshold, that can be used to declare tests as significant or not at the level $q$ . One might, instead, be interested to know, for each original p-value, the minimum $q$ level in which they would still be rejected. To find this out, define $q_{(i)}=\frac{p_{(i)}N}{i}c(N)$ . This formulation, however, has problems, as discussed next.

FDR adjustment

The problem with the FDR-correction is that $q_{(i)}$ is not a monotonic function of $p_{(i)}$ . This means that if someone uses any $q_{(i)}$ to threshold the set of FDR-corrected values, the result is not the same as would be obtained by applying sequentially the B&H procedure for all these corresponding $q$ levels.

To address this concern, Yekutieli and Benjamini (1999) introduced the FDR-adjustment, in which monotonicity is enforced, and which definition is compatible with the original FDR definition. Let $q*_{(i)}$ be the FDR-adjusted value for $p_{(i)}$ . It’s value is the smallest $q_{(k)}, k \geq i$ , where $q_{(k)}$ is the FDR-corrected as defined above. That’s just it!

Example

Consider the image below, on the left, a square of 9×9 pixels containing each some t-statistic with some sparse, low magnitude signal added. On the right are the corresponding p-values, ranging approximately between 0-1:

Statistical maps

Statistic and p-values

colorscale

Using the B&H procedure to compute the threshold, with $q=0.20$ , and applying this threshold to the image rejects the null hypothesis in 6 pixels. On the right panel, the red line corresponds to the threshold. All p-values (in blue) below this line are declared significant.

FDR-threshold

FDR-threshold

Computing the FDR-corrected values at each pixel, and thresholding at the same $q=0.20$ , however, does not produce the same results, and just 1 pixel is declared significant. Although conservative, the “correction” is far from the nominal false discovery rate and is not appropriate. Note on the right panel the lack of monotonicity.

FDR-corrected

FDR-corrected

Computing instead the FDR-adjusted values, and thresholding again at $q=0.20$ produces the same results as with the simpler FDR-threshold. The adjustment is, thus, more “correct” than the “corrected”. Observe, on the right panel, the monotonicity enforced.

FDR-adjusted

FDR-adjusted

Implementation

An implementation of the three styles of FDR for Octave/MATLAB is available here: fdr.m
SPM users will find a similar feature in the function spm_P_FDR.m.

References

Benjamini Y and Hochberg Y. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. J R Statist Soc B. 1995; 57(1):289-300.
Yekutieli D and Benjamini Y. Resampling-based false discovery rate controlling multiple test procedures for correlated test statistics. J Stat Plan Infer. 1999; 82(1-2):171-96.
Benjamini Y and Yekutieli D. The control of the false discovery rate in multiple testing under dependency. Ann Statist. 2001; 29(4):1165-88.
Genovese CR, Lazar NA, Nichols T. Thresholding of statistical maps in functional neuroimaging using the false discovery rate. Neuroimage. 2002 Apr;15(4):870-8.

Log-normality and the Box-Cox transformation

Posted on 17.July.2011 by A. M. Winkler

The Box-Cox transformation (Box and Cox, 1964) is a way to transform data that ordinarily do not follow to a normal distribution so that it then conforms to it. The transformation is a piecewise function of the power parameter $\lambda$ :

The function is, given the definition, continuous at the singular point $\lambda = 0$ . Plotting the transformation as a function of the original data $y$ and the parameter $\lambda$ produces the following:

An interesting aspect of the Box-Cox transformation is that it can be used as a metric to quantify how normal or log-normal certain data are. This has clear applications in biology. Tissue growth that depends on cellular multiplication is exponential, with a constant factor that is often normally distributed, resulting in tissue size that is log-normally distributed (for discussion, see McAlister, 1879; Koch, 1966; Koch, 1969). Measurements of the final tissue size in different individuals, however, may not perfectly conform to the log-normal due to external influences that may hinder tissue growth. Moreover, it is not always possible to measure the final amount of tissue that is the product of a single lineage of self-multiplicative cells. The combination of different cell lineages, each with their own growth rate, as well as external influences, tend to produce a distribution that is more normally (Gaussianly) distributed. A distribution gradient, therefore, may exist between these two extremes of normality and log-normality. Estimating the parameter $\lambda$ allows one to estimate also how closer to normal or log-normal certain measurements are. If $\lambda \approx 0$ , the original data tend to be log-normally distributed, whereas if $\lambda \approx +1$ , the data can be considered approximately more normally distributed.

References

Box GEP, Cox DR. An analysis of transformations. J R Stat Soc B. 1964; 26(2):211-52.
Koch AL. The logarithm in biology. 1. Mechanisms generating the log-normal distribution exactly. J Theor Biol. 1966; 12(2):276-90.
Koch AL. The logarithm in biology. II. Distributions simulating the log-normal. J Theor Biol. 1969; 23(2):251-68.
McAlister D. The law of the geometric mean. Proc R Soc London. 1879; 29:367-76.

Inverse normal transformation in SOLAR

Posted on 13.July.2011 by A. M. Winkler

SOLAR software can, at the discretion of the user, apply a rank-based inverse-normal transformation to the data using the command inormal. This transformation is the one suggested by Van der Waerden (1952) and is given by:

$\tilde y_i = \Phi^{-1}\left\{\dfrac{r_i}{n+1}\right\}$

where $\tilde y_i$ is the transformed value for observation $i$ , $\Phi^{-1}\left\{\cdot\right\}$ is the probit function, $r_i$ is the ordinary rank of the $i$ -th case among $n$ observations.

This transformation is a particular case of the family of transformations discussed in the paper by Beasley et al. (2009). The family can be represented as:

$\tilde y_i = \Phi^{-1}\left\{\dfrac{r_i+c}{n-2c+1}\right\}$

where $c$ is a constant and the remaining variables are as above. The value of $c$ varies for different proposed methods. Blom (1958) suggests $c=3/8$ , Tukey (1962) suggests $c=1/3$ , Bliss (1967) suggests $c=1/2$ and, as just decribed, Van der Waerden suggests $c=0$ .

Interesting enough, the Q-Q plots produced by Octave use the Bliss (1967) transformation.

An Octave/matlab function to perform these transformations in arbitrary data is here: inormal.m (note that this function does not require or use SOLAR).

References

Van der Waerden BL. Order tests for the two-sample problem and their power. Proc Koninklijke Nederlandse Akademie van Wetenschappen. Ser A. 1952; 55:453–458.
Blom G. Statistical estimates and transformed beta-variables. Wiley, New York, 1958 (page 71).
Tukey JW. The future of data analysis. Ann Math Stat. 1962; 33:1–67.
Bliss CI. Statistics in biology. McGraw-Hill, New York, 1967 (pages 117–120).
Beasley TM, Erickson S, Allison DB. Rank-based inverse normal transformations are increasingly used, but are they merited? Behav Genet. 2009; 39(5):580-95.

Version history

23.Jul.2011: First public release.
19.Jun.2014: Added ability to deal with ties, as well as NaNs in the data.

Normality tests II: Asymptotic approximations

Posted on 07.July.2011 by A. M. Winkler

In the previous post, the procedures to compute a statistic for six common normality tests and how to obtain the respective p-values using asymptotic approximations were briefly reviewed. In this post, it is shown that these approximations, that appear in classic statistical textbooks and papers, are not entirely accurate for arbitrary sample sizes or nominal levels of significance.

For each test, 1,000,000 realisations of normal distributions were simulated using matlab, computed analytically the p-values, and plotted their frequencies at significance intervals of 0.01, with 34 different sample sizes, ranging from 10 to 10,000 (more specifically: 10, 15, 20, 25, 30, 40, 50, 60, 70, 80, 90, 100, 150, 200, 250, 300, 400, 500, 600, 700, 800, 900, 1000, 1500, 2000, 2500, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000). The best approximations should produce flat series of bars between p-values 0-1, and be stable across sample sizes, particularly at the most common significance levels: near or below p=0.05.

In general, any test can be analysed according to power, robustness to sample size, ability to correctly reject some specific types of distributions and other parameters. The analyses presented here refer exclusively to the ability to reject the null when the null is true at the nominal significance level using analytical approximations. Poor performance in this analysis does not imply that the test, per se, is poor. It only implies that the approximation used to derive the p-values is not accurate. The tests may still be very powerful to detect non-normality, but accurate p-values must then rely on tables produced via Monte Carlo simulations, which are, by definition, always exact.

Skewness test

The skewness test is perhaps the one that performed better in terms of asymptotic approximation to a p-value (which does not imply that this test is at all more powerful or has any other feature that would qualify it as better than the others). The approximation produces p-values that are virtually identical to the nominal p-values for any of the sample sizes.

Kurtosis test

The asymptotic approximation for the kurtosis test performed reasonably well for most significance levels and for large sample sizes. However, it produced an excess of low p-values for sample sizes smaller than 20, implying that the test can be anti-conservative with small samples.

D’Agostino-Pearson omnibus test

The p-value approximation for the D’Agostino-Pearson $K^2$ statistic had a poor and erratic behaviour for sample sizes smaller than 200, particularly at the low significance levels, which are generally the ones of greatest interest. For this test, Monte Carlo simulations to derive the critical values are highly recommended.

Jarque-Bera test

Another test for which the asymptotic approximation was poor is the Jarque-Bera test. The approximation to the distribution of the $\text{LM}$ statistic is not valid for small p-values for any of the sample sizes tested, although for sample sizes >1500 the behaviour was reasonably stable. Monte Carlo simulations also more appropriate for this test.

Shapiro-Wilk test

The asymptotic approximation to the distribution of the Shapiro-Wilk’s $W$ statistic seems adequate for sample sizes larger than 15 and smaller than 4000, being slightly anticonservative for most p-values at sample sizes smaller than 20. It has aberrant behaviour at p-values close to 1, which are generally of no interest anyway.

Shapiro-Francia test

For the Shapiro-Francia’s $W'$ statistic, the asymptotic behaviour is similar to the observed for the Shapiro-Wilk test, except that it remains valid for sample sizes larger than 4000.

Other tests

Other popular normality or distributional tests, such as the Kolmogorov-Smirnov test, the Lilliefors test, the Cramér-von Mises criterion or the Anderson-Darling test, do not have well established asymptotic approximations and, as a consequence, depend on Monte Carlo simulations to derive critical values. The p-values for these tests are always accurate, as long as the simulations were performed adequately.

Conclusion

Care should be taken when using formulas that approximate the distribution for the statistic for these tests. The approximations are not perfectly valid for all sample sizes, and can be very inaccurate in many cases. When in doubt, critical values based on Monte Carlo tests might be preferred.

Normality tests I: Brief overview

Posted on 03.July.2011 by A. M. Winkler

Before the fast computers era, it was difficult to test how valid certain statistical methods were. Even in the 1970’s and 1980’s, when computers were already available at Universities and even at home, certain matrix operations were limited by hardware constraints. We are now at a much more comfortable position to analyse how these methods perform. One area where (nowadays) simple computer runs can be illuminating is with respect to normality tests. Tests to evaluate normality or a specific distribution, frequentist approaches can be broadly divided into two categories:

Tests based on comparison (“best fit”) with a given distribution, often specified in terms of its cumulative distribution funtion (cdf). Examples are the Kolmogorov-Smirnov test, the Lilliefors test, the Anderson-Darling test, the Cramér-von Mises criterion, as well as the Shapiro-Wilk and Shapiro-Francia tests.
Tests based on descriptive statistics of the sample. Examples are the skewness test, the kurtosis test, the D’Agostino-Pearson omnibus test, the Jarque-Bera test.

In this article I’ll briefly review six well-known normality tests: (1) the test based on skewness, (2) the test based on kurtosis, (3) the D’Agostino-Pearson omnibus test, (4) the Shapiro-Wilk test, (5) the Shapiro-Francia test, and (6) the Jarque-Bera test. In a subsequent article, I’ll analyse the analytical p-value approximations for these tests, and in a third article, I’ll provide Monte Carlo critical values for each of them.

Skewness test

The skewness of the sample data can be converted to an useful score to evaluate normality with the following transformation (D’Agostino, 1970):

Compute the sample skewness $\sqrt{b_1} = m_3/m_2^{3/2}$ , where $m_k=\sum_i (X_i-\bar X)^k/n$ is the k-th moment, $\bar X=\sum_i X_i/n$ is the sample mean, and $n$ is the sample size.
Compute:
$Y = \sqrt{b_1}\left(\frac{(n+1)(n+3)}{6(n+2)}\right)^{1/2}$
$\beta_2(\sqrt{b_1}) = \frac{3(n^2+27n-70)(n+1)(n+3)}{(n-2)(n+5)(n+7)(n+9)}$
$W^2 = -1+(2(\beta_2(\sqrt{b_1})-1))^{1/2}$
$\delta = 1/\sqrt{\ln W}$
$\alpha = (2/(W^2-1))^{1/2}$
Compute:
$Z(\sqrt{b_1}) = \delta \ln(Y/\alpha + ((Y/\alpha)^2+1)^{1/2})$

The $Z(\sqrt{b_1})$ is the test statistic and it is approximately normally distributed under the null hypothesis that the population data follows a normal distribution.

Kurtosis test

Like the test based on skewness, the test based on kurtosis needs that the sample kurtosis is calculated, then transformed, and finally converted to a p-value. The steps are (Anscombe & Glynn, 1983):

Compute the sample kurtosis $b_2=m_4/m_2^2$ , where $m_k=\sum_i (X_i-\bar X)^k/n$ is the k-th moment, $\bar X=\sum_i X_i/n$ is the sample mean, and $n$ is the sample size.
Compute:
$\mathsf{E}\{b_2\} = 3(n-1)/(n+1)$
$\mathsf{var}\{b_2\} = \frac{24n(n-2)(n-3)}{(n+1)^2(n+3)(n+5)}$
$x = (b_2-\mathsf{E}\{b_2\})/\sqrt{\mathsf{var}\{b_2\}}$
$\sqrt{\beta_1(b_2)}=\frac{6(n^2-5n+2)}{(n+7)(n+9)}\sqrt{\frac{6(n+3)(n+5)}{n(n-2)(n-3)}}$
$A = 6+\frac{8}{\sqrt{\beta_1(b_2)}}\left(\frac{2}{\sqrt{\beta_1(b_2)}}+\sqrt{1+\frac{4}{\beta_1(b_2)}}\right)$ , where $\mathsf{E}\{\cdot\}$ and $\mathsf{var}\{\cdot\}$ represent respectively the expected value and the variance.
Compute:
$Z(b_2) = \left(\left(1-\frac{2}{9A}\right)-\left(\frac{1-2/A}{1+x\sqrt{2/(A-4)}}\right)^{1/3}\right)/\sqrt{\frac{2}{9A}}$

The $Z(b_2)$ is the test statistic and is considered approximately normally distributed under the null hypothesis that the population data follows a normal distribution.

D’Agostino-Pearson omnibus test

The skewness and kurtosis tests can be combined to produce a single, global, “omnibus” statistic. This global test has been proposed by D’Agostino and Pearson (1973) and its statistic is simply $K^2=(Z(\sqrt{b_1}))^2 + (Z(b_2))^2$ . In other words, simply square the statistics from the skewness and kurtosis tests and sum them together. The distribution of the $K^2$ statistic is approximately a $\chi^2$ distribution with two degrees of freedom under the null hypothesis that the sample was drawn from a population with normally distributed values. The skewness, kurtosis and the D’Agostino-Pearson tests have been collectively reviewed and discussed in D’Agostino et al. (1990).

A matlab/Octave implementation of the D’Agostino-Pearson test is available here: daptest.m.

Jarque-Bera test

In a certain regard, this test, introduced by Jarque & Bera (1980) and later discussed in more detail in Jarque & Bera (1987), could also be called an “omnibus” test, that combines skewness and kurtosis into a single statistic. The test statistic is given by $\text{\textsc{lm}}= n\left(\frac{(\sqrt{b_1})^2}{6} + \frac{(b_2-3)^2}{24}\right)$ . The $\text{\textsc{lm}}$ is asymptotically distributed as a $\chi^2$ distribution with two degrees of freedom under the null hypothesis that the sample was drawn from a population with normally distributed values.

A matlab/Octave implementation of the Jarque-Bera test is available here: jbtest.m.

Shapiro-Wilk test

The Shapiro & Wilk (1965) test depends on the covariance matrix between the order statistics of the observations. Values for this covariance matrix have been laboriously tabulated for small samples (Sahan & Greenberg,1956) and algorithms have been developed (Davis & Stephens, 1978; Royston, 1982). In an effort to produce equivalents, yet computationally affordable results, it has been suggested to use approximations. The description here follow the approximation suggested by Royston (1992).

Sort the data in ascending order.
Compute the sample mean $\bar X=\sum_i X_i/n$ , where $n$ is the sample size.
Compute the Blom scores (Blom, 1958) as $\tilde m_i = \Phi^{-1}\{(i-3/8)/(n+1/4)\}$ , where $i$ is the rank order and $\Phi^{-1}\{\cdot\}$ represents the inverse normal cdf.
Compute a set of weights $c_i = \tilde m_i \sqrt{(\mathbf{\tilde m}'\mathbf{\tilde m})}$ . These weights are the same used in the Shapiro-Francia test (see below).
Compute $\tilde a_n= - 2.706056u^5 + 4.434685u^4 - 2.071190u^3 - 0.147981u^2 + 0.221157u + c_n$ , where $u=n^{-1/2}$ . If $n>5$ , compute also $\tilde a_{n-1}= -3.582633u^4 + 5.682633u^4 -1.752461u^3 -0.293762u^2 + 0.042981u + c_{n-1}$ .
Compute $\phi = (\mathbf{\tilde m}'\mathbf{\tilde m} - 2\tilde m_n^2)/(1-2\tilde a_n^2)$ if $n \leq 5$ or $\phi = (\mathbf{\tilde m}'\mathbf{\tilde m} - 2\tilde m_n^2 - 2\tilde m_{n-1}^2)/(1-2\tilde a_n^2-2\tilde a_{n-1}^2)$ otherwise.
Compute the remaining $\tilde a_i = \phi^{-1/2} \tilde m_i$ for $i=2, \ldots ,n-1$ if $n \leq 5$ or $i=3,\ldots ,n-2$ if $n>5$ .
Compute the test statistic $W=\left(\sum_i \tilde a_i X_i \right)^2 / \sum_i (X_i-\bar X)^2$ .

Once the test statistic $W$ has been produced, it can be approximated through a function $g(W)$ to a normal distribution with mean $\mu$ and standard deviation $\sigma$ (Royston, 1993). For sample sizes between 4 and 11 (inclusive), $g(W)=-\ln(\gamma-\ln(1-W))$ , where $\gamma=0.459n-2.273$ . For sample sizes equal to or larger than 12, $g(W)=\ln(1-W)$ .

For $4 \leq n \leq 11$ , the parameters of the statistic transformed (normalized) by $g(W)$ are given by $\mu=-0.0006714n^3 + 0.025054n^2 -0.39978n + 0.5440$ and $\sigma=\mathsf{exp}\{-0.0020322n^3+ 0.062767n^2 -0.77857n + 1.3822\}$ . For $n \geq 12$ , $\mu = 0.0038915u^3 -0.083751u^2 -0.31082u -1.5851$ and $\sigma = \mathsf{exp}\{0.0030302u^2 -0.082676u -0.4803\}$ , where $u = \ln(n)$ .

A $Z$ statistic can then be produced trivially by $(g(W)-\mu)/\sigma$ , and the p-values can be obtained from the normal cdf.

A matlab/Octave implementation of the Shapiro-Wilk test is available here: swtest.m.

Shapiro-Francia test

The test was proposed by Shapiro & Francia (1972) as a simplification of the Shapiro-Wilk test. The simplification consisted in replacing the covariance matrix of the order statistics by the identity matrix. The test is generally considered asymptotically equivalent to the Shapiro-Wilk test for large and independent samples. To apply the test, the steps are (Royston, 1993):

Sort the data in ascending order.
Compute the sample mean $\bar X=\sum_i X_i/n$ , where $n$ is the sample size.
Compute the Blom scores (Blom, 1958) as $\tilde m_i = \Phi^{-1}\{(i-3/8)/(n+1/4)\}$ , where $i$ is the rank order and $\Phi^{-1}\{\cdot\}$ represents the inverse normal cumulative distribution function (cdf).
Compute a set of weights $c_i = \tilde m_i \sqrt{(\mathbf{\tilde m}'\mathbf{\tilde m})}$ .
Compute the test statistic $W'=\left(\sum_i c_i X_i \right)^2 / \sum_i (X_i-\bar X)^2$ .

Similarly as for the Shapiro-Wilk test, once the test statistic $W'$ has been produced, it can be approximated through a function $g(W')=\ln(1-W')$ to a normal distribution with mean $\mu$ and standard deviation $\sigma$ (Royston, 1993). The parameters of the statistic transformed (normalized) by $g(W')$ are given by $\mu=1.0521u - 1.2725$ , where $u = \ln(\ln(n))-\ln(n)$ and $\sigma=-0.26758v+1.030$ , where $v = \ln(\ln(n))+2/\ln(n)$ . These approximations are valid for samples between 5 and 5000 at least.