Tests of Normality

Author

Dr Andrew Dalby

This is an update on my original post about normality which goes into a more detailed examination of whether we should be doing the tests or not. The original post was my feeling that the tests are not good to teach and not good for students. They are too much of an algorithmic solution that stops people thinking about and looking at their data. They simply run the test say it is either normal or not and then go about doing a parametric or non-parametric analysis depending on the result.

This auto-pilot view of normality concerned me as it stops people thinking about the data. As a referee in peer review I have seen people do a particular test, decide the data is not normal and then use non-parametric methods when the data seemed to me to be normal and they should have been using parametric analysis. This came up in a post on LinkedIn which has made me update my thoughts about normality testing. I still think that we should not be encouraging students to use them or making them the conventional wisdom for scientific publication, but I needed to re-evaluate my reasoning. Thanks to Natalie Rodrigue, Adrain Olszewski and Mark Ramos and apologies to Muthomi Kaaria who was the target of the original post and subject to some unnecessary criticism.

What I most wanted to know how effective they are in terms of type I and type II errors especially as sample size increases. You would hope that as sample size increases there would be better discrimination of normality.

To do this I carried out some simulations of normally distributed data. All of those datasets were normal and if the test was 100% accurate it should classify them all as normal. To this I have now added some examples of skewed data to look at type II errors.

Normality Tests on a Small Sample of Normally Distributed Data

I can plot the histogram/distribution of the calculated p-values to see how many times the null hypothesis is rejected.

library(ggplot2)
library(nortest)

x <- vector()
for (i in 1:10000){
  y <- shapiro.test(rnorm(8,168,6.4))
  x[i] <- y$p.value
}
x <- as.data.frame(x)
ggplot(x, aes(x)) +
    geom_histogram(color="white", fill="#4b0082", bins=20) +
    labs(title="Histogram of the Shapiro Wilks test for normality for a sample size of 8") +
    xlab("p-value")

x <- vector()
for (i in 1:10000){
  y <- ad.test(rnorm(8,168,6.4))
  x[i] <- y$p.value
}
x <- as.data.frame(x)
ggplot(x, aes(x)) +
    geom_histogram(color="white", fill="#4b0082", bins=20) +
    labs(title="Histogram of the Anderson Darling test for normality for a sample size of 8",) +
    xlab("p-value")

x <- vector()
for (i in 1:10000){
  s <- rnorm(8,168,6.4)
  y <- lillie.test(s)
  x[i] <- y$p.value
}
x <- as.data.frame(x)
ggplot(x, aes(x)) +
    geom_histogram(color="white", fill="#4b0082", bins=20) +
    labs(title="Histogram of the Lilliefors (Kolmogorov Smirnov) test for normality for a sample size of 8") +
  xlab("p-value")

My mistake was now understanding that under H0 being true you should see a uniform distribution for the p-values of the test statistic. This is a bit of a tautologous piece of reasoning because that is how the p-value is defined as the probability of making a type I error. This does not seem intuitive but thinking slowly it has to be true. There has to the a 5% probability of making an error at 5%. This then later brings in to question whether we should be using the null hypothesis significance testing framework at all, but that is another question for another day.

Note that H0 has to be true, H0 has to be a point value (is normality a point value?) and the dependent value has to be continuous (amount of normality is a continuous variable ?!?).

Shapiro-Wilks produces almost a uniform distribution with all of the possible p-values equally represented except at the ends of the distribution. It rejects between 2.5% and 3% of cases as not normal at the 5% value.

Anderson-Darling has an unexpected peak at a p-value of 0.5 which might be an artefact of binning and also rejects about 2.5% of cases at the 5% p-value.

I ran it with more bins and the peak is still there. Note that the drops at the tails are boundary issues. With the narrower bins the shortfall of the lowest and highest values is even more pronounced.

x <- vector()
for (i in 1:10000){
  y <- ad.test(rnorm(8,168,6.4))
  x[i] <- y$p.value
}
x <- as.data.frame(x)
ggplot(x, aes(x)) +
    geom_histogram(color="white", fill="#4b0082", bins=40) +
    labs(title="Histogram of the Anderson Darling test for normality for a sample size of 8",) +
    xlab("p-value")

The Kolmogorov-Smirnov test produces a result very similar to the Shapiro-Wilks case.

Normality Tests on Medium Size Samples of Normally Distributed Data.

Small samples and subject to sample fluctuations and it might be that the tests perform badly on small samples. What happens if we increase sample size and use the tests for normality on these larger samples? If the test was a diagnostic test you would hope for an improvement as you include more information but this is a null hypothesis significance test. We know that the distribution of p-values will be uniformly distributed. This means we will still reject the same proportion of samples as before.

The magic sample size number of 30 is often applied as a cut-off between samples that you would expect to be able to detect normality in clearly and those that you cannot. For samples larger than 30 you can use parametric test methods with medians even if the data is not normally distributed.

x <- vector()
for (i in 1:10000){
  y <- shapiro.test(rnorm(30,168,6.4))
  x[i] <- y$p.value
}
x <- as.data.frame(x)
ggplot(x, aes(x)) +
    geom_histogram(color="white", fill="#4b0082", bins=20) +
    labs(title="Histogram of the Shapiro Wilks test for normality for a sample size of 30") +
    xlab("p-value")

x <- vector()
for (i in 1:10000){
  y <- ad.test(rnorm(30,168,6.4))
  x[i] <- y$p.value
}
x <- as.data.frame(x)
ggplot(x, aes(x)) +
    geom_histogram(color="white", fill="#4b0082", bins=20) +
    labs(title="Histogram of the Anderson Darling test for normality for a sample size of 30",) +
    xlab("p-value")

x <- vector()
for (i in 1:10000){
  s <- rnorm(30,168,6.4)
  y <- lillie.test(s)
  x[i] <- y$p.value
}
x <- as.data.frame(x)
ggplot(x, aes(x)) +
    geom_histogram(color="white", fill="#4b0082", bins=20) +
    labs(title="Histogram of the Kolmogorov Smirnov test for normality for a sample size of 30") +
  xlab("p-value")

For a sample size of 30 the results are the same as that for 8.

Normality Tests on Large Samples of Normally Distributed Data

If you repeat the simulations with a large sample size of 500 where you would expect data sampled from a normal distribution to produce a clear bell shaped distribution curve. You will still get the same histograms for p-value showing that no matter what you will be rejecting about 5% of cases at the 5% level.

x <- vector()
for (i in 1:10000){
  y <- shapiro.test(rnorm(500,168,6.4))
  x[i] <- y$p.value
}
x <- as.data.frame(x)
ggplot(x, aes(x)) +
    geom_histogram(color="white", fill="#4b0082", bins=20) +
    labs(title="Histogram of the Shapiro Wilks test for normality for a sample size of 30") +
    xlab("p-value")

x <- vector()
for (i in 1:10000){
  y <- ad.test(rnorm(500,168,6.4))
  x[i] <- y$p.value
}
x <- as.data.frame(x)
ggplot(x, aes(x)) +
    geom_histogram(color="white", fill="#4b0082", bins=20) +
    labs(title="Histogram of the Anderson Darling test for normality for a sample size of 500",) +
    xlab("p-value")

x <- vector()
for (i in 1:10000){
  s <- rnorm(500,168,6.4)
  y <- lillie.test(s)
  x[i] <- y$p.value
}
x <- as.data.frame(x)
ggplot(x, aes(x)) +
    geom_histogram(color="white", fill="#4b0082", bins=20) +
    labs(title="Histogram of the Kolmogorov Smirnov test for normality for a sample size of 500") +
  xlab("p-value")

Now if I use data where I have some skew will the tests detect that this data should not be normally distributed?

Normality Tests on a Small Sample of Skewed Data

I have set the skew value to be 1.5 which is considered to be a large skew or at least the borderline beyond which we would assume that the data was no longer normally distributed.

library(fGarch)

x <- vector()
for (i in 1:10000){
  y <- shapiro.test(rsnorm(8,mean=168,sd=6.4, xi=1.5))
  x[i] <- y$p.value
}
x <- as.data.frame(x)
ggplot(x, aes(x)) +
    geom_histogram(color="white", fill="#4b0082", bins=20) +
    labs(title="Histogram of the Shapiro Wilks test for normality for a sample size of 8") +
    xlab("p-value")

x <- vector()
for (i in 1:10000){
  y <- ad.test(rsnorm(8,168,6.4,1.5))
  x[i] <- y$p.value
}
x <- as.data.frame(x)
ggplot(x, aes(x)) +
    geom_histogram(color="white", fill="#4b0082", bins=20) +
    labs(title="Histogram of the Anderson Darling test for normality for a sample size of 8",) +
    xlab("p-value")

x <- vector()
for (i in 1:10000){
  s <- rsnorm(8,168,6.4,1.5)
  y <- lillie.test(s)
  x[i] <- y$p.value
}
x <- as.data.frame(x)
ggplot(x, aes(x)) +
    geom_histogram(color="white", fill="#4b0082", bins=20) +
    labs(title="Histogram of the Kolmogorov Smirnov test for normality for a sample size of 8") +
  xlab("p-value")

For the skewed data the small sample is under-powered giving very similar results to the normally distributed data and the tests make a type II error and assume normality when it is not true. The Type II error rate is incredibly high for such small samples at over 90%.

This is precisely when most students and researchers are going to use a normality test as with larger samples normality is clear in the distribution of the data.

What is most frustrating is that the numbers of simulations with a p-value just above 0.05 is much higher indicating that the test would have a better type II error rate when you have a slightly worse type I error rate.

Normality Tests on a Medium Sized Sample of Skewed Data

Repeating the simulations with a sample size of 30 gives better results but at this level a visual inspection of the distribution of the data is likely going to tell you if it is normally distributed or not.

x <- vector()
for (i in 1:10000){
  y <- shapiro.test(rsnorm(30,mean=168,sd=6.4, xi=1.5))
  x[i] <- y$p.value
}
x <- as.data.frame(x)
ggplot(x, aes(x)) +
    geom_histogram(color="white", fill="#4b0082", bins=20) +
    labs(title="Histogram of the Shapiro Wilks test for normality for a sample size of 30") +
    xlab("p-value")

x <- vector()
for (i in 1:10000){
  y <- ad.test(rsnorm(30,168,6.4,1.5))
  x[i] <- y$p.value
}
x <- as.data.frame(x)
ggplot(x, aes(x)) +
    geom_histogram(color="white", fill="#4b0082", bins=20) +
    labs(title="Histogram of the Anderson Darling test for normality for a sample size of 30",) +
    xlab("p-value")

x <- vector()
for (i in 1:10000){
  s <- rsnorm(30,168,6.4,1.5)
  y <- lillie.test(s)
  x[i] <- y$p.value
}
x <- as.data.frame(x)
ggplot(x, aes(x)) +
    geom_histogram(color="white", fill="#4b0082", bins=20) +
    labs(title="Histogram of the Kolmogorov Smirnov test for normality for a sample size of 30") +
  xlab("p-value")

Even with this fairly large sample the type II error rate is still over 80% and you need to go to even larger sample sizes to get good type II error rates.

Normality Tests on Large Samples of Skewed Data

x <- vector()
for (i in 1:10000){
  y <- shapiro.test(rsnorm(500,mean=168,sd=6.4, xi=1.5))
  x[i] <- y$p.value
}
x <- as.data.frame(x)
ggplot(x, aes(x)) +
    geom_histogram(color="white", fill="#4b0082", bins=20) +
    labs(title="Histogram of the Shapiro Wilks test for normality for a sample size of 500") +
    xlab("p-value")

x <- vector()
for (i in 1:10000){
  y <- ad.test(rsnorm(500,168,6.4,1.5))
  x[i] <- y$p.value
}
x <- as.data.frame(x)
ggplot(x, aes(x)) +
    geom_histogram(color="white", fill="#4b0082", bins=20) +
    labs(title="Histogram of the Anderson Darling test for normality for a sample size of 500",) +
    xlab("p-value")

x <- vector()
for (i in 1:10000){
  s <- rsnorm(500,168,6.4,1.5)
  y <- lillie.test(s)
  x[i] <- y$p.value
}
x <- as.data.frame(x)
ggplot(x, aes(x)) +
    geom_histogram(color="white", fill="#4b0082", bins=20) +
    labs(title="Histogram of the Kolmogorov Smirnov test for normality for a sample size of 500") +
  xlab("p-value")

For the large samples the tests give very sharp peaks for p-values showing that the null hypothesis should be rejected and there will only be a very small number of erroneous classifications of normality, but with such large samples a display of the density of the distribution will make it clear if there are deviations from normality.

Coupled with the knowledge that the tests will always incorrectly reject 5% of cases no matter how much data you have or however clearly it hows normality this suggests that tests should not be an automatic go to part of your statistical analysis and visual methods are more important, especially when sample sizes are 30 or less.

DO NOT USE TESTS OF NORMALITY - the QQ-plot is more reliable.

Using Visual Methods - Histograms and QQ-plot to Assess Normality of Data

Now using simulation for a more concrete example I am going to create multiple datasets for student heights based on a normal distribution with the actual mean heights taken from student data. I created them for three different teaching days in two different faculties to represent 6 samples from the same population. Now I want to see from the histograms and QQ-plots if I can apply parametric methods or not.

s1<-rnorm(8,168,6.4)
s2<-rnorm(8,168,6.4)
s3<-rnorm(8,168,6.4)
s4<-rnorm(8,168,6.4)
s5<-rnorm(8,168,6.4)
s6<-rnorm(8,168,6.4)
height_tbl <- data.frame(
Height <- c(s1,s2,s3,s4,s5,s6),
Day <- c(rep("Tuesday",16), rep("Thursday",16),rep("Friday",16)),
School <- c(rep(c(rep("Life Sciences",8),rep("Psychology",8)),3))
)
library(ggplot2)
ggplot(height_tbl, aes(Height))+
  geom_histogram(binwidth=3, color="white", fill="#4b0082")+
  facet_grid(height_tbl$Day~height_tbl$School)

ggplot(height_tbl, aes(sample = Height))+
  stat_qq() + stat_qq_line()+
  facet_grid(height_tbl$Day~height_tbl$School)

s1<-rnorm(30,168,6.4)
s2<-rnorm(30,168,6.4)
s3<-rnorm(30,168,6.4)
s4<-rnorm(30,168,6.4)
s5<-rnorm(30,168,6.4)
s6<-rnorm(30,168,6.4)
height_tbl <- data.frame(
  Height <- c(s1,s2,s3,s4,s5,s6),
  Day <- c(rep("Tuesday",60), rep("Thursday",60),rep("Friday",60)),
  School <- c(rep(c(rep("Life Sciences",30),rep("Psychology",30)),3))
)
library(ggplot2)
ggplot(height_tbl, aes(Height))+
  geom_histogram(binwidth=3, color="white", fill="#4b0082")+
  facet_grid(height_tbl$Day~height_tbl$School)

ggplot(height_tbl, aes(sample = Height))+
  stat_qq() + stat_qq_line()+
  facet_grid(height_tbl$Day~height_tbl$School)

s1<-rnorm(500,168,6.4)
s2<-rnorm(500,168,6.4)
s3<-rnorm(500,168,6.4)
s4<-rnorm(500,168,6.4)
s5<-rnorm(500,168,6.4)
s6<-rnorm(500,168,6.4)
height_tbl <- data.frame(
  Height <- c(s1,s2,s3,s4,s5,s6),
  Day <- c(rep("Tuesday",1000), rep("Thursday",1000),rep("Friday",1000)),
  School <- c(rep(c(rep("Life Sciences",500),rep("Psychology",500)),3))
)
library(ggplot2)
ggplot(height_tbl, aes(Height))+
  geom_histogram(binwidth=3, color="white", fill="#4b0082")+
  facet_grid(height_tbl$Day~height_tbl$School)

ggplot(height_tbl, aes(sample = Height))+
  stat_qq() + stat_qq_line()+
  facet_grid(height_tbl$Day~height_tbl$School)

All of these are normal and all of them come from the same population but as you can see in the histograms they produce very different looking data. For a sample of 5 it is very hard to say whether it is normal or not of if the heights are the same or not. For 30 it becomes clearer in the histograms and then at larger samples the extra details become the issue.

The QQ-plots tell a different story. With both small and large sample the fit to the line is very good and you get the largest deviations and the worst possible fit with a sample size around 30. This is important because we often choose sample sizes of this order of magnitude of size to assure normality and effective sampling. This looks like a sample size of 30 is not a good choice.

For completeness I wanted to look at the histograms and QQ-plots for the skewed data.

Visual Methods for Assessing the Normality of the Skewed Data

library(fGarch)
s1<-rsnorm(8,168,6.4,1.5)
s2<-rsnorm(8,168,6.4,1.5)
s3<-rsnorm(8,168,6.4,1.5)
s4<-rsnorm(8,168,6.4,1.5)
s5<-rsnorm(8,168,6.4,1.5)
s6<-rsnorm(8,168,6.4,1.5)
height_tbl <- data.frame(
Height <- c(s1,s2,s3,s4,s5,s6),
Day <- c(rep("Tuesday",16), rep("Thursday",16),rep("Friday",16)),
School <- c(rep(c(rep("Life Sciences",8),rep("Psychology",8)),3))
)
library(ggplot2)
ggplot(height_tbl, aes(Height))+
  geom_histogram(binwidth=3, color="white", fill="#4b0082")+
  facet_grid(height_tbl$Day~height_tbl$School)

ggplot(height_tbl, aes(sample = Height))+
  stat_qq() + stat_qq_line()+
  facet_grid(height_tbl$Day~height_tbl$School)

s1<-rsnorm(30,168,6.4,1.5)
s2<-rsnorm(30,168,6.4,1.5)
s3<-rsnorm(30,168,6.4,1.5)
s4<-rsnorm(30,168,6.4,1.5)
s5<-rsnorm(30,168,6.4,1.5)
s6<-rsnorm(30,168,6.4,1.5)
height_tbl <- data.frame(
  Height <- c(s1,s2,s3,s4,s5,s6),
  Day <- c(rep("Tuesday",60), rep("Thursday",60),rep("Friday",60)),
  School <- c(rep(c(rep("Life Sciences",30),rep("Psychology",30)),3))
)
library(ggplot2)
ggplot(height_tbl, aes(Height))+
  geom_histogram(binwidth=3, color="white", fill="#4b0082")+
  facet_grid(height_tbl$Day~height_tbl$School)

ggplot(height_tbl, aes(sample = Height))+
  stat_qq() + stat_qq_line()+
  facet_grid(height_tbl$Day~height_tbl$School)

s1<-rsnorm(500,168,6.4,1.5)
s2<-rsnorm(500,168,6.4,1.5)
s3<-rsnorm(500,168,6.4,1.5)
s4<-rsnorm(500,168,6.4,1.5)
s5<-rsnorm(500,168,6.4,1.5)
s6<-rsnorm(500,168,6.4,1.5)
height_tbl <- data.frame(
  Height <- c(s1,s2,s3,s4,s5,s6),
  Day <- c(rep("Tuesday",1000), rep("Thursday",1000),rep("Friday",1000)),
  School <- c(rep(c(rep("Life Sciences",500),rep("Psychology",500)),3))
)
library(ggplot2)
ggplot(height_tbl, aes(Height))+
  geom_histogram(binwidth=3, color="white", fill="#4b0082")+
  facet_grid(height_tbl$Day~height_tbl$School)

ggplot(height_tbl, aes(sample = Height))+
  stat_qq() + stat_qq_line()+
  facet_grid(height_tbl$Day~height_tbl$School)

It is very challenging (impossible) to know that the data is skewed from the small sample size data either in the histograms or QQ-plots. For samples of that size I would assume normality unless there is good theoretical evidence to suggest that the data was generated by a process that is not going to generate normally distributed data, for example a growth or decay process.

Even at a sample size of 30 the histograms do not present a clear story but the QQ-plots are suggestive that something is happening with the data and that it might not be normally distributed.

At the large sample size of 500 the histograms are fairly clear that there is a long tail to the right and the data is clearly not symmetrical. The QQ-plots also show a clear U-shaped curve indicating skewed data.

Conclusions

None of the simulations actually tests what we want to know. Can the tests distinguish between cases or normal and not normal data. The issue is that the null hypothesis is simple but the alternative includes all of the possible distributions or degrees of skew and kurtosis that might occur. That is a more complex set of simulations. But it would be possible to generate data from a wide range of skew values sampled on a uniform distribution and then to see how often the tests correctly identify normal and skewed data as normal and not-normal.

Where we should go to know is that:

Using tests of normality automatically should be discouraged.
When peer reviewing papers for statistical validity we should not reject them based on not having/presenting a test of normality.
- By default with small samples we should assume normality and parametric tests.
- There should be an evaluation of the sensitivity of the results to the assumption of normality.
The case of normality tests is just one example of the limits of null hypothesis significance testing and we should consider alternative frameworks where testing for normality is not a requirement.