Is normality testing "essentially useless"?


323

A former colleague once said the following to me:

We usually apply normality tests to the results of processes that, under the null, generate random variables that are only asymptotically or nearly normal (with the 'asymptotically' part dependent on some quantity which we cannot make large); in the era of cheap memory, big data, and fast processors, normality tests should always reject the null of normal distribution for large (though not insanely large) samples. And so, perversely, normality tests should only be used for small samples, when they presumably have lower power and less control over the type I error rate.

Is this a valid argument? Is it a well-known argument? Are there any well-known tests for a "fuzzier" null hypothesis than exact normality?

243

It's not an argument. It is a (somewhat strongly stated) fact that formal normality tests always reject on the huge sample sizes we work with today. It is even easy to prove that as n gets large, even the smallest deviation from perfect normality will lead to a significant result. And since every dataset has some degree of randomness, no single dataset will be a perfectly normally distributed sample. But in applied statistics the question is not whether the data/residuals etc. are perfectly normal, but normal enough for the assumptions to hold.

Let me illustrate with the Shapiro-Wilk test. The code below constructs a set of distributions that approach normality but aren't completely normal. Next, we test with shapiro.test whether a sample from each of these almost-normal distributions deviates from normality. In R:

x <- replicate(100, {  # generate 100 replications of each of the four tests
  c(shapiro.test(rnorm(10)   + c(1, 0, 2, 0, 1))$p.value,  # rnorm draws a random
    shapiro.test(rnorm(100)  + c(1, 0, 2, 0, 1))$p.value,  # sample from the normal
    shapiro.test(rnorm(1000) + c(1, 0, 2, 0, 1))$p.value,  # distribution; the added
    shapiro.test(rnorm(5000) + c(1, 0, 2, 0, 1))$p.value)  # shifts make it near-normal
})
rownames(x) <- c("n10","n100","n1000","n5000")

rowMeans(x<0.05) # the proportion of significant deviations
  n10  n100 n1000 n5000 
 0.04  0.04  0.20  0.87 

The last line checks which fraction of the simulations for each sample size deviates significantly from normality. So in 87% of the cases, a sample of 5000 observations deviates significantly from normality according to the Shapiro-Wilk test. Yet, if you look at the qq-plots, you would never ever decide on a deviation from normality. Below you see, as an example, the qq-plots for one set of random samples,

[Figure: normal qq-plots for one sample of each size (n = 10, 100, 1000, 5000)]

with p-values

  n10  n100 n1000 n5000 
0.760 0.681 0.164 0.007 

178

When thinking about whether normality testing is 'essentially useless', one first has to think about what it is supposed to be useful for. Many people (well... at least, many scientists) misunderstand the question the normality test answers.

The question normality tests answer: Is there convincing evidence of any deviation from the Gaussian ideal? With moderately large real data sets, the answer is almost always yes.

The question scientists often expect the normality test to answer: Do the data deviate enough from the Gaussian ideal to "forbid" use of a test that assumes a Gaussian distribution? Scientists often want the normality test to be the referee that decides when to abandon conventional (ANOVA, etc.) tests and instead analyze transformed data or use a rank-based nonparametric test or a resampling or bootstrap approach. For this purpose, normality tests are not very useful.


13

Let me add one small thing:
Performing a normality test without taking its alpha error into account increases your overall probability of committing a type I (alpha) error.

Never forget that each additional test does this as long as you don't control for alpha-error accumulation. Hence, another good reason to dismiss normality testing, as the numbers below make concrete.
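As a minimal numerical illustration of the accumulation (assuming two independent tests, each run at alpha = 0.05):

# Familywise type I error when a normality pre-test is added to a
# main test, assuming both are independent and run at alpha = 0.05:
alpha <- 0.05
1 - (1 - alpha)^2  # 0.0975: nearly double the nominal 5% level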


59

IMHO normality tests are absolutely useless for the following reasons:

  1. On small samples, there's a good chance that the true distribution of the population is substantially non-normal, but the normality test isn't powerful enough to pick it up (see the small simulation after this list).

  2. On large samples, things like the t-test and ANOVA are pretty robust to non-normality.

  3. The whole idea of a normally distributed population is just a convenient mathematical approximation anyhow. None of the quantities typically dealt with statistically could plausibly have distributions with a support of all real numbers. For example, people can't have a negative height. Something can't have negative mass or more mass than there is in the universe. Therefore, it's safe to say that nothing is exactly normally distributed in the real world.
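To put rough numbers on point 1, here is a small simulation of my own (not from the original answer), applying Shapiro-Wilk to modest samples from clearly non-normal distributions:

# Fraction of rejections at alpha = 0.05 with n = 15: the test detects
# these substantially non-normal distributions only a minority of the time.
set.seed(1)
mean(replicate(1000, shapiro.test(runif(15))$p.value < 0.05))       # uniform
mean(replicate(1000, shapiro.test(rt(15, df = 4))$p.value < 0.05))  # heavy-tailed t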


7

I think a maximum entropy approach could be useful here. We can assign a normal distribution because we believe the data are "normally distributed" (whatever that means) or because we only expect to see deviations of about the same magnitude. Also, because the normal distribution has just two sufficient statistics, it is insensitive to changes in the data which do not alter these quantities. So in a sense you can think of a normal distribution as an "average" over all possible distributions with the same first and second moments. This provides one reason why least squares should work as well as it does.
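A small numerical check of that claim (my own sketch): among distributions matched to mean 0 and variance 1, the normal attains the maximum differential entropy.

# Differential entropies (in nats) of three mean-0, variance-1
# distributions; the normal is the largest, as maximum entropy predicts.
h_normal  <- 0.5 * log(2 * pi * exp(1))  # ~1.419
h_laplace <- 1 + log(2 / sqrt(2))        # Laplace with scale 1/sqrt(2): ~1.347
h_uniform <- log(2 * sqrt(3))            # uniform on [-sqrt(3), sqrt(3)]: ~1.242
c(normal = h_normal, laplace = h_laplace, uniform = h_uniform)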


7

The argument you gave is an opinion. I think the value of normality testing is to make sure that the data do not depart severely from normality. I use it sometimes to decide between a parametric and a nonparametric test for my inference procedure. I think the test can be useful in moderate and large samples (when the central limit theorem does not come into play). I tend to use the Shapiro-Wilk or Anderson-Darling tests, but running SAS I get them all, and they generally agree pretty well. On a different note, I think graphical procedures such as Q-Q plots work equally well; the advantage of a formal test is that it is objective.

In small samples it is true that these goodness-of-fit tests have practically no power, and that makes intuitive sense, because a small sample from a normal distribution might by chance look rather non-normal; that is accounted for in the test. Also, the high skewness and kurtosis that distinguish many non-normal distributions from normal distributions are not easily seen in small samples.


7

I think the first two questions have been thoroughly answered, but I don't think question 3 was addressed. Many tests compare the empirical distribution to a known hypothesized distribution. The critical value for the Kolmogorov-Smirnov test is based on the hypothesized cdf F being completely specified, but the test can be modified to test against a parametric family with estimated parameters. So if "fuzzier" means estimating more than two parameters, then the answer to the question is yes: these tests can be applied to three-parameter families or more. Some tests are designed to have better power against a specific family of distributions. For example, when testing normality, the Anderson-Darling and Shapiro-Wilk tests have greater power than K-S or chi-square when the null hypothesized distribution is normal. Lilliefors devised a test that is preferred for exponential distributions.
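To make the estimated-parameters point concrete, here is a sketch in R (assuming the nortest package, which provides Lilliefors' correction of K-S for the normal family):

# Naively plugging estimated parameters into ks.test gives an invalid,
# overly conservative test; Lilliefors' correction accounts for the
# estimation step.
library(nortest)
set.seed(1)
x <- rexp(50)                        # a clearly non-normal sample
ks.test(x, "pnorm", mean(x), sd(x))  # parameters estimated from x: invalid
lillie.test(x)                       # corrected test, typically rejects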


127

I think that tests for normality can be useful as companions to graphical examinations. They have to be used in the right way, though. In my opinion, this means that many popular tests, such as the Shapiro-Wilk, Anderson-Darling and Jarque-Bera tests never should be used.

Before I explain my standpoint, let me make a few remarks:

  • In an interesting recent paper, Rochon et al. studied the impact of the Shapiro-Wilk test on the two-sample t-test. The two-step procedure of testing for normality before carrying out, for instance, a t-test is not without problems. Then again, neither is the two-step procedure of graphically investigating normality before carrying out a t-test. The difference is that the impact of the latter is much more difficult to investigate (as it would require a statistician to graphically investigate normality $100,000$ or so times...).
  • It is useful to quantify non-normality, for instance by computing the sample skewness, even if you don't want to perform a formal test.
  • Multivariate normality can be difficult to assess graphically and convergence to asymptotic distributions can be slow for multivariate statistics. Tests for normality are therefore more useful in a multivariate setting.
  • Tests for normality are perhaps especially useful for practitioners who use statistics as a set of black-box methods. When normality is rejected, the practitioner should be alarmed and, rather than carrying out a standard procedure based on the assumption of normality, consider using a nonparametric procedure, applying a transformation or consulting a more experienced statistician.
  • As has been pointed out by others, if $n$ is large enough, the CLT usually saves the day. However, what is "large enough" differs for different classes of distributions.

(In my definition) a test for normality is directed against a class of alternatives if it is sensitive to alternatives from that class, but not sensitive to alternatives from other classes. Typical examples are tests that are directed towards skew or kurtotic alternatives. The simplest examples use the sample skewness and kurtosis as test statistics.
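For concreteness, here is a minimal sketch of such a skewness-directed test (my own illustration, using the rough large-sample result that the sample skewness has variance about 6/n under normality):

# Directed test against skewed alternatives: reject when the
# standardized sample skewness is large; sqrt(n/6) * g1 is roughly
# N(0, 1) under normality for large n.
skewness_test <- function(x) {
  n  <- length(x)
  g1 <- mean((x - mean(x))^3) / sd(x)^3
  z  <- sqrt(n / 6) * g1
  c(skewness = g1, z = z, p.value = 2 * pnorm(-abs(z)))
}
skewness_test(rexp(500))        # skewed alternative: rejects
skewness_test(rt(500, df = 5))  # symmetric heavy tails: usually does not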

Directed tests of normality are arguably often preferable to omnibus tests (such as the Shapiro-Wilk and Jarque-Bera tests) since it is common that only some types of non-normality are of concern for a particular inferential procedure.

Let's consider Student's t-test as an example. Assume that we have an i.i.d. sample from a distribution with skewness $\gamma=\frac{E(X-\mu)^3}{\sigma^3}$ and (excess) kurtosis $\kappa=\frac{E(X-\mu)^4}{\sigma^4}-3.$ If $X$ is symmetric about its mean, $\gamma=0$. Both $\gamma$ and $\kappa$ are 0 for the normal distribution.

Under regularity assumptions, we obtain the following asymptotic expansion for the cdf of the test statistic $T_n$: $$P(T_n\leq x)=\Phi(x)+n^{-1/2}\frac{1}{6}\gamma(2x^2+1)\phi(x)-n^{-1}x\Big(\frac{1}{12}\kappa (x^2-3)-\frac{1}{18}\gamma^2(x^4+2x^2-3)-\frac{1}{4}(x^2+3)\Big)\phi(x)+o(n^{-1}),$$

where $\Phi(\cdot)$ is the cdf and $\phi(\cdot)$ is the pdf of the standard normal distribution.

$\gamma$ appears for the first time in the $n^{-1/2}$ term, whereas $\kappa$ appears in the $n^{-1}$ term. The asymptotic performance of $T_n$ is much more sensitive to deviations from normality in the form of skewness than in the form of kurtosis.

It can be verified using simulations that this is true for small $n$ as well. Thus Student's t-test is sensitive to skewness but relatively robust against heavy tails, and it is reasonable to use a test for normality that is directed towards skew alternatives before applying the t-test.
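A simulation along these lines (my own sketch) compares the t-test's actual type I error rate under a skewed and a symmetric heavy-tailed distribution:

# One-sample t-test at nominal 5% with n = 20 and true mean 0 in both
# cases: the skewed exponential inflates the error rate notably more
# than the symmetric heavy-tailed t with 5 df.
set.seed(1)
p_skew <- replicate(5000, t.test(rexp(20) - 1)$p.value)
p_tail <- replicate(5000, t.test(rt(20, df = 5))$p.value)
c(skewed = mean(p_skew < 0.05), heavy_tailed = mean(p_tail < 0.05))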

As a rule of thumb (not a law of nature), inference about means is sensitive to skewness and inference about variances is sensitive to kurtosis.

Using a directed test for normality has the benefit of higher power against "dangerous" alternatives and lower power against alternatives that are less "dangerous", meaning that we are less likely to reject normality because of deviations from normality that won't affect the performance of our inferential procedure. The non-normality is quantified in a way that is relevant to the problem at hand. This is not always easy to do graphically.

As $n$ gets larger, skewness and kurtosis become less important - and directed tests are likely to detect if these quantities deviate from 0 even by a small amount. In such cases, it seems reasonable to, for instance, test whether $|\gamma|\leq 1$ or (looking at the first term of the expansion above) $$|n^{-1/2}\frac{1}{6}\gamma(2z_{\alpha/2}^2+1)\phi(z_{\alpha/2})|\leq 0.01$$ rather than whether $\gamma=0$. This takes care of some of the problems that we otherwise face as $n$ gets larger.
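A sketch of what such a test might look like (my own illustration; the standard error sqrt(6/n) is only a rough normal-theory approximation):

# Test H0: |gamma| <= 1 instead of H0: gamma = 0, so that trivially
# small skewness never triggers rejection, no matter how large n gets.
test_skew_bound <- function(x, bound = 1, alpha = 0.05) {
  n  <- length(x)
  g1 <- mean((x - mean(x))^3) / sd(x)^3
  z  <- (abs(g1) - bound) / sqrt(6 / n)
  c(skewness = g1, reject = z > qnorm(1 - alpha))
}
test_skew_bound(rnorm(1e5))  # gamma = 0: does not reject even at huge n
test_skew_bound(rexp(1e5))   # gamma = 2: rejects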


31

I think that pre-testing for normality (which includes informal assessments using graphics) misses the point.

  1. Users of this approach assume that the normality assessment has in effect a power near 1.0.
  2. Nonparametric tests such as the Wilcoxon, Spearman, and Kruskal-Wallis have an efficiency of 0.95 relative to their parametric counterparts if normality holds.
  3. In view of 2. one can pre-specify the use of a nonparametric test if one even entertains the possibility that the data may not arise from a normal distribution.
  4. Ordinal cumulative probability models (the proportional odds model being a member of this class) generalize standard nonparametric tests. Ordinal models are completely transformation-invariant with respect to $Y$, are robust, powerful, and allow estimation of quantiles and mean of $Y$.
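A minimal sketch of point 4 (assuming the rms package, whose orm function fits ordinal cumulative probability models to a continuous response):

# A proportional odds model as a drop-in replacement for a two-group
# comparison: the group coefficient plays the role of a Wilcoxon test,
# while the fitted model also yields quantiles and means of Y.
library(rms)
set.seed(1)
group <- factor(rep(c("A", "B"), each = 50))
y     <- rexp(100, rate = ifelse(group == "B", 0.5, 1))  # skewed outcome
orm(y ~ group)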

-3

One good use of the normality test that I don't think has been mentioned is to determine whether using z-scores is okay. Let's say you selected a random sample from a population, and you wish to find the probability of selecting one random individual from the population and getting a value of 80 or higher. This can be done only if the distribution is normal, because to use z-scores in this way, the assumption is that the population distribution is normal.
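For example (with made-up numbers), the calculation in question looks like this, and it is valid only if the population really is normal:

# P(X >= 80) for a population assumed normal, using sample estimates
# as stand-ins; the numbers here are hypothetical.
xbar <- 70; s <- 8
z <- (80 - xbar) / s
pnorm(z, lower.tail = FALSE)  # same as 1 - pnorm(80, xbar, s): ~0.106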

But then I guess I can see this being arguable too...


17

Before asking whether a test or any sort of rough check for normality is "useful" you have to answer the question behind the question: "Why are you asking?"

For example, if you only want to put a confidence limit around the mean of a set of data, departures from normality may or may not be important, depending on how much data you have and how big the departures are. However, departures from normality are apt to be crucial if you want to predict what the most extreme value will be in future observations or in the population you have sampled from.
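To see how much the tails matter for extremes (a numerical aside of my own), compare the 99.9th percentile, in standard deviation units, of a normal with that of a heavy-tailed distribution:

# 99.9th percentile in sd units: normal vs t with 3 df (sd = sqrt(3)).
# A normal-based prediction of extremes would badly undershoot here.
qnorm(0.999)                 # ~3.09
qt(0.999, df = 3) / sqrt(3)  # ~5.90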


5

Tests in which "something" important to the analysis is supported by high p-values are, I think, wrong-headed. As others have pointed out, for large data sets a p-value below 0.05 is assured. So the test essentially "rewards" small and fuzzy data sets and "rewards" a lack of evidence. Something like a qq-plot is much more useful. The desire for hard numbers to decide things like this (yes/no, normal/not normal) misses the point that modeling is partially an art, and part of that art is judging how well hypotheses are actually supported.


7

I wouldn't say it is useless, but it really depends on the application. Note that you never really know the distribution the data come from; all you have is a small set of its realizations. Your sample mean is always finite, but the true mean could be undefined or infinite for some types of probability density functions. Let us consider the three analytically tractable types of Lévy stable distributions, i.e. the normal distribution, the Lévy distribution, and the Cauchy distribution. Most of your samples do not have a lot of observations in the tails (i.e. away from the sample mean). So empirically it is very hard to distinguish between the three, and thus the Cauchy (which has an undefined mean) and the Lévy (which has an infinite mean) could easily masquerade as a normal distribution.
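A quick way to see the masquerade break down (my own sketch): track the running mean, which settles down for a normal sample but never does for a Cauchy one.

# Running means of cumulative samples: the normal converges to 0, the
# Cauchy (undefined mean) keeps jumping no matter how large n gets.
set.seed(1)
n <- 1e4
plot(cumsum(rnorm(n)) / seq_len(n), type = "l", ylim = c(-3, 3),
     xlab = "sample size", ylab = "running mean")
lines(cumsum(rcauchy(n)) / seq_len(n), col = "red")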


11

I used to think that tests of normality were completely useless.

However, now I do consulting for other researchers. Often, obtaining samples is extremely expensive, and so they will want to do inference with n = 8, say.

In such a case, it is very difficult to find statistical significance with nonparametric tests, but t-tests with n = 8 are sensitive to deviations from normality. So what we end up with is that we can say "well, conditional on the assumption of normality, we find a statistically significant difference" (don't worry, these are usually pilot studies...).

Then we need some way of evaluating that assumption. I'm half-way in the camp that looking at plots is a better way to go, but truth be told there can be a lot of disagreement about that, which can be very problematic if one of the people who disagrees with you is the reviewer of your manuscript.

In many ways, I still think there are plenty of flaws in tests of normality: for example, we should be thinking about the type II error more than the type I. But there is a need for them.


10

For what it's worth, I once developed a fast sampler for the truncated normal distribution, and goodness-of-fit testing (KS) was very useful in debugging the function. This sampler passes the test with huge sample sizes but, interestingly, the GSL's ziggurat sampler didn't.
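A sketch of that debugging pattern (my own reconstruction, not the original sampler): draw from a truncated normal by rejection and K-S test the draws against the exact truncated cdf; a buggy sampler shows up as a vanishing p-value at large n.

# Rejection sampler for N(0,1) truncated to [a, b], checked against
# the exact truncated cdf with a K-S test.
rtrunc_naive <- function(n, a = 0, b = 2) {
  out <- numeric(0)
  while (length(out) < n) {
    x <- rnorm(2 * n)
    out <- c(out, x[x >= a & x <= b])
  }
  out[seq_len(n)]
}
ptrunc <- function(q, a = 0, b = 2)
  pmin(pmax((pnorm(q) - pnorm(a)) / (pnorm(b) - pnorm(a)), 0), 1)
ks.test(rtrunc_naive(1e5), ptrunc)  # a correct sampler passes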


11

Answers here have already addressed several important points. To quickly summarize:

  • There is no consistent test that can determine whether a set of data truly follow a distribution or not.
  • Tests are no substitute for visually inspecting the data and models to identify high leverage, high influence observations and commenting on their effects on models.
  • The assumptions of many regression routines are often misquoted as requiring normally distributed "data" [read: residuals], and novice statisticians interpret this as requiring the analyst to formally evaluate normality in some sense before proceeding with the analysis.

I am adding an answer firstly to cite one of my personally most frequently accessed and read statistical articles: "The Importance of Normality Assumptions in Large Public Health Datasets" by Lumley et al. It is worth reading in its entirety. The summary states:

The t-test and least-squares linear regression do not require any assumption of Normal distribution in sufficiently large samples. Previous simulation studies show that “sufficiently large” is often under 100, and even for our extremely non-Normal medical cost data it is less than 500. This means that in public health research, where samples are often substantially larger than this, the t-test and the linear model are useful default tools for analyzing differences and trends in many types of data, not just those with Normal distributions. Formal statistical tests for Normality are especially undesirable as they will have low power in the small samples where the distribution matters and high power only in large samples where the distribution is unimportant.

While the large-sample properties of linear regression are well understood, there has been little research into the sample sizes needed for the Normality assumption to be unimportant. In particular, it is not clear how the necessary sample size depends on the number of predictors in the model.

The focus on Normal distributions can distract from the real assumptions of these methods. Linear regression does assume that the variance of the outcome variable is approximately constant, but the primary restriction on both methods is that they assume that it is sufficient to examine changes in the mean of the outcome variable. If some other summary of the distribution is of greater interest, then the t-test and linear regression may not be appropriate.

To summarize: normality is generally not worth the discussion or the attention it receives, in contrast to the importance of answering a particular scientific question. If the desire is to summarize mean differences in data, then the t-test, ANOVA, or linear regression are justified in a much broader sense. Tests based on these models retain the correct alpha level even when distributional assumptions are not met, although power may be adversely affected.

The reasons why normal distributions may receive the attention they do may be for classical reasons, where exact tests based on F-distributions for ANOVAs and Student-T-distributions for the T-test could be obtained. The truth is, among the many modern advancements of science, we generally deal with larger datasets than were collected previously. If one is in fact dealing with a small dataset, the rationale that those data are normally distributed cannot come from those data themselves: there is simply not enough power. Remarking on other research, replications, or even the biology or science of the measurement process is, in my opinion, a much more justified approach to discussing a possible probability model underlying the observed data.

For this reason, opting for a rank-based test as an alternative misses the point entirely. However, I will agree that robust variance estimators like the jackknife or bootstrap offer important computational alternatives that permit conducting tests under a variety of more important violations of model specification, such as violations of independence or identical distribution of the errors.
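As a closing sketch of the bootstrap alternative mentioned above (my own illustration), a percentile bootstrap interval for a difference in means makes no normality claim at all:

# Percentile bootstrap CI for a difference in means of two skewed
# samples; no distributional assumption beyond i.i.d. sampling.
set.seed(1)
x <- rexp(40); y <- rexp(40, rate = 0.8)
boot_diff <- replicate(1e4,
  mean(sample(y, replace = TRUE)) - mean(sample(x, replace = TRUE)))
quantile(boot_diff, c(0.025, 0.975))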