I was interested in tests of normality and wanted to know how well they perform in terms of type I and type II errors, especially as sample size increases.
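To do this I carried out some simulations of normally distributed data. All of these datasets are normal, so every rejection is a type I error. If a test were 100% accurate it would classify them all as normal, although a test applied at the 5% level is designed to reject about 5% of truly normal samples.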
I can plot the histogram of the calculated p-values to see how many times the null hypothesis is rejected.

    library(ggplot2)
    library(nortest)

    # p-values from 10,000 Shapiro-Wilk tests on normal samples of size 8
    x <- numeric(10000)
    for (i in 1:10000) {
      y <- shapiro.test(rnorm(8, 168, 6.4))
      x[i] <- y$p.value
    }
    x <- as.data.frame(x)
    ggplot(x, aes(x)) +
      geom_histogram(color = "white", fill = "#4b0082", bins = 20) +
      labs(title = "Histogram of the Shapiro-Wilk test for normality for a sample size of 8") +
      xlab("p-value")

    # p-values from 10,000 Anderson-Darling tests (nortest::ad.test), sample size 8
    x <- numeric(10000)
    for (i in 1:10000) {
      y <- ad.test(rnorm(8, 168, 6.4))
      x[i] <- y$p.value
    }
    x <- as.data.frame(x)
    ggplot(x, aes(x)) +
      geom_histogram(color = "white", fill = "#4b0082", bins = 20) +
      labs(title = "Histogram of the Anderson-Darling test for normality for a sample size of 8") +
      xlab("p-value")
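
    # p-values from 10,000 Kolmogorov-Smirnov tests, sample size 8.
    # Note: this is the two-sample form. Each sample s is compared with a second
    # simulated normal sample t that has the mean and sd estimated from s.
    x <- numeric(10000)
    z <- numeric(10000)
    for (i in 1:10000) {
      s <- rnorm(8, 168, 6.4)
      t <- rnorm(8, mean(s), sd(s))
      y <- ks.test(s, t)
      x[i] <- y$p.value
      z[i] <- mean(s)  # sample means are stored but not used in the plot
    }
    x <- as.data.frame(x)
    ggplot(x, aes(x)) +
      geom_histogram(color = "white", fill = "#4b0082", bins = 20) +
      labs(title = "Histogram of the Kolmogorov-Smirnov test for normality for a sample size of 8") +
      xlab("p-value")
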
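The rejection rate can also be counted directly rather than read off a histogram, where the height of the first bar depends on where ggplot2 places the bin edges. The following is a minimal sketch using base R only, assuming the same simulation settings as above; the seed and the names `n_sims`, `p_sw` and `reject_rate` are illustrative choices, not part of the original simulation.

```r
# Count type I errors directly instead of reading them off a histogram.
set.seed(1)  # arbitrary seed, used only so the run is reproducible

n_sims <- 10000  # number of simulated datasets
n      <- 8      # observations per dataset

# One Shapiro-Wilk p-value per simulated normal sample
p_sw <- replicate(n_sims, shapiro.test(rnorm(n, 168, 6.4))$p.value)

# Proportion of truly normal samples rejected at the 5% level
reject_rate <- mean(p_sw < 0.05)
print(reject_rate)
```

The same pattern should work for the other tests by swapping the call inside `replicate()`.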
Shapiro-Wilk produces an almost rectangular (uniform) distribution, with all possible p-values equally represented except at the ends of the distribution. It rejects about 2.5% of cases as not normal at the 5% level.
Anderson-Darling has an unexpected peak at a p-value of 0.5 and also rejects about 2.5% of cases at the 5% level.
The Kolmogorov-Smirnov test produces a discrete distribution of p-values, which is not ideal. It does, however, identify fewer samples as not normally distributed: fewer than 200 of the 10,000.
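The discreteness follows from the two-sample form of the test used in the simulation: with two samples of 8 observations and no ties, the difference between the two empirical distribution functions can only be a multiple of 1/8, so only a handful of p-values are possible. A small sketch that counts the distinct p-values, assuming the same settings as the simulation (the seed and the names `p_ks` and `n_distinct` are illustrative):

```r
# Count how many different p-values the two-sample KS test can produce at n = 8.
set.seed(2)  # arbitrary seed for reproducibility

p_ks <- replicate(2000, {
  s <- rnorm(8, 168, 6.4)
  t <- rnorm(8, mean(s), sd(s))  # second normal sample, parameters estimated from s
  ks.test(s, t)$p.value
})

# Round before comparing to avoid counting floating-point noise as new values
n_distinct <- length(unique(round(p_ks, 10)))
print(n_distinct)
print(sort(unique(round(p_ks, 4))))
```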
These are small samples, subject to sampling fluctuations, and it might be that the tests perform badly on small samples. What happens if we increase the sample size and apply the tests for normality to these larger samples? Hopefully the tests will then not reject any of the samples, as they should be clearly normal.

    # p-values from 10,000 Shapiro-Wilk tests on normal samples of size 1000
    x <- numeric(10000)
    for (i in 1:10000) {
      y <- shapiro.test(rnorm(1000, 168, 6.4))
      x[i] <- y$p.value
    }
    x <- as.data.frame(x)
    ggplot(x, aes(x)) +
      geom_histogram(color = "white", fill = "#4b0082", bins = 20) +
      labs(title = "Histogram of the Shapiro-Wilk test for normality for a sample size of 1000") +
      xlab("p-value")

    # p-values from 10,000 Anderson-Darling tests (nortest::ad.test), sample size 1000
    x <- numeric(10000)
    for (i in 1:10000) {
      y <- ad.test(rnorm(1000, 168, 6.4))
      x[i] <- y$p.value
    }
    x <- as.data.frame(x)
    ggplot(x, aes(x)) +
      geom_histogram(color = "white", fill = "#4b0082", bins = 20) +
      labs(title = "Histogram of the Anderson-Darling test for normality for a sample size of 1000") +
      xlab("p-value")
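
    # p-values from 10,000 two-sample Kolmogorov-Smirnov tests, sample size 1000.
    # Each sample s is compared with a simulated normal sample t that has the
    # mean and sd estimated from s.
    x <- numeric(10000)
    z <- numeric(10000)
    for (i in 1:10000) {
      s <- rnorm(1000, 168, 6.4)
      t <- rnorm(1000, mean(s), sd(s))
      y <- ks.test(s, t)
      x[i] <- y$p.value
      z[i] <- mean(s)  # sample means are stored but not used in the plot
    }
    x <- as.data.frame(x)
    ggplot(x, aes(x)) +
      geom_histogram(color = "white", fill = "#4b0082", bins = 20) +
      labs(title = "Histogram of the Kolmogorov-Smirnov test for normality for a sample size of 1000") +
      xlab("p-value")
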
In this case Shapiro-Wilk rejects even more cases as not normally distributed, approaching 300 rejections out of 10,000 at the 5% level.
Anderson-Darling performs better but still has the odd peak at a p-value of 0.5.
Kolmogorov-Smirnov performs best, rejecting the fewest samples, but the distribution of its p-values is very noisy, with multiple peaks.
For large samples, testing for normality with Shapiro-Wilk will lead you to reject the assumption of normality nearly 3% of the time when the data are in fact normally distributed. This then has consequences for the subsequent analysis.
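To see how the rejection rate changes with sample size without rerunning each block by hand, the simulation can be wrapped in a loop over several sizes. This is a rough sketch using base R only; the intermediate sample sizes, the reduced number of simulations and the names `sample_sizes` and `reject_rate` are assumptions made to keep the run short, not settings from the original simulation.

```r
# Rejection rate of the Shapiro-Wilk test on normal data at several sample sizes.
set.seed(3)  # arbitrary seed for reproducibility

n_sims       <- 2000                # fewer runs than above, to keep this quick
sample_sizes <- c(8, 30, 100, 1000)

reject_rate <- sapply(sample_sizes, function(n) {
  p <- replicate(n_sims, shapiro.test(rnorm(n, 168, 6.4))$p.value)
  mean(p < 0.05)                    # share of normal samples rejected at 5%
})

print(data.frame(n = sample_sizes, reject_rate = reject_rate))
```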
DO NOT USE TESTS OF NORMALITY - the QQ-plot is more reliable.
Simulation of Student Heights for Different Cohorts from the same Population
Now, using simulation for a more concrete example, I am going to create multiple datasets of student heights based on a normal distribution, with the actual mean height taken from student data. I created them for three different teaching days in two different faculties to represent six samples from the same population. I want to see from the histograms and QQ-plots whether or not I can apply parametric methods.
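A minimal sketch of how such cohorts could be simulated is given here. It is not the original code: the day and faculty labels, the cohort size of 30 and the names `cohorts`, `heights` and `cohort_summary` are hypothetical, and the mean of 168 and standard deviation of 6.4 are assumed to be the same values used in the earlier simulations.

```r
# Six cohorts (3 teaching days x 2 faculties) drawn from one normal population.
set.seed(4)  # arbitrary seed for reproducibility

cohorts <- expand.grid(
  day     = c("Day 1", "Day 2", "Day 3"),   # hypothetical labels
  faculty = c("Faculty A", "Faculty B"),    # hypothetical labels
  stringsAsFactors = FALSE
)
n_per_cohort <- 30  # assumed cohort size; change to 5 or 1000 to compare

# Every cohort is sampled from the same population
heights <- do.call(rbind, lapply(seq_len(nrow(cohorts)), function(i) {
  data.frame(
    day     = cohorts$day[i],
    faculty = cohorts$faculty[i],
    height  = rnorm(n_per_cohort, mean = 168, sd = 6.4)
  )
}))

# Sample mean and standard deviation for each cohort
cohort_summary <- aggregate(height ~ day + faculty, data = heights,
                            FUN = function(h) c(mean = mean(h), sd = sd(h)))
print(cohort_summary)

# One histogram per cohort, drawn only if ggplot2 is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(heights, aes(height)) +
    geom_histogram(color = "white", fill = "#4b0082", bins = 10) +
    facet_grid(faculty ~ day) +
    xlab("height")
  print(p)
}
```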
All of these are normal and all of them come from the same population, but as you can see in the histograms they produce very different-looking data. For a sample of 5 it is very hard to say whether it is normal or not, or whether the heights are the same or not. For 30 it becomes more obvious, and at larger sample sizes the extra detail becomes the issue in the histograms.
The QQ-plots tell a different story. With both small and large samples the fit to the line is very good, and you get the largest deviations and the worst fit with a sample size of around 30. This is important because we often choose sample sizes of this order to assure normality and effective sampling.
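QQ-plots like these can be drawn in base R with `qqnorm()` and `qqline()`. The sketch below is illustrative rather than the original plotting code: the three sample sizes, the population parameters and the name `qq_cor` are assumptions, and the correlation between theoretical and sample quantiles is included only as a rough numeric summary of how closely the points follow a straight line.

```r
# QQ-plots for normal samples of different sizes, side by side.
set.seed(5)  # arbitrary seed for reproducibility

sample_sizes <- c(5, 30, 1000)

op <- par(mfrow = c(1, length(sample_sizes)))  # one panel per sample size
qq_cor <- sapply(sample_sizes, function(n) {
  h  <- rnorm(n, mean = 168, sd = 6.4)
  qq <- qqnorm(h, main = paste("n =", n))      # returns the plotted quantiles
  qqline(h)                                    # reference line through the quartiles
  cor(qq$x, qq$y)                              # rough measure of straightness
})
par(op)                                        # restore the previous layout

print(data.frame(n = sample_sizes, qq_cor = qq_cor))
```

A single draw per sample size is itself subject to sampling fluctuation, so it is worth rerunning with different seeds before comparing the panels.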