This is a continuation of my examination of tests for normality. This time I am going to create simulations that randomly sample from either a normal, a gamma or an exponential distribution, use the tests to classify each sample as normal or not normal, and then compare the test results with which samples actually came from normal distributions. From this I can work out the true positives, true negatives, false positives and false negatives.
The first simulation creates 10,000 random samples of size 10, each drawn from a normal distribution (with probability 0.5), an exponential distribution (0.25) or a gamma distribution (0.25), and calculates the Kolmogorov-Smirnov (Lilliefors), Shapiro-Wilk and Anderson-Darling tests for each sample. Each test's p-value is then converted into a classification: normal if p >= 0.05, not normal otherwise. Keep in mind from the previous post that the p-values for all of these tests are uniformly distributed when the data really are normal.
library(nortest)  # provides lillie.test (Lilliefors KS) and ad.test (Anderson-Darling)

x <- vector()  # truth: 1 = sample drawn from a normal distribution, 0 = not
y <- vector()  # Lilliefors (KS) verdict: 1 = classified normal, 0 = not
z <- vector()  # Shapiro-Wilk verdict
w <- vector()  # Anderson-Darling verdict
for (i in 1:10000){
  a <- sample(1:3, 1, prob=c(0.5, 0.25, 0.25))
  if (a==1){
    b <- rnorm(10, mean=0, sd=1)
    x[i] <- 1
  } else if (a==2){
    b <- rexp(10, rate=1)
    x[i] <- 0
  } else {
    b <- rgamma(10, shape=1, rate=1)
    x[i] <- 0
  }
  temp <- lillie.test(b)
  if (temp$p.value<0.05){y[i] <- 0} else {y[i] <- 1}
  temp <- shapiro.test(b)
  if (temp$p.value<0.05){z[i] <- 0} else {z[i] <- 1}
  # the original run repeated shapiro.test here by mistake, which is why the
  # w tables below are identical to the z tables; ad.test is the intended
  # Anderson-Darling test
  temp <- ad.test(b)
  if (temp$p.value<0.05){w[i] <- 0} else {w[i] <- 1}
}
table(x,y)
##    y
## x      0    1
##   0 1483 3573
##   1  272 4672
table(x,z)
##    z
## x      0    1
##   0 2222 2834
##   1  268 4676
table(x,w)
##    w
## x      0    1
##   0 2222 2834
##   1  268 4676
The performance of the normality tests is not very good, especially for false positives, i.e. the cases where x = 0 (the sample did not come from a normal distribution) but y, z or w is 1 (the test failed to reject normality). More than half of the non-normal samples pass each test at this sample size.
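As a quick sanity check, the error rates can be read straight off these confusion tables. Here is a minimal sketch; the helper rates() is my own illustration, and it assumes the vectors x, y, z and w from the simulation above are still in the workspace:

rates <- function(truth, verdict){
  tab <- table(truth, verdict)
  c(false_positive = tab["0","1"]/sum(tab["0",]),   # non-normal classified as normal
    false_negative = tab["1","0"]/sum(tab["1",]))   # normal classified as non-normal
}
rates(x, y)  # Lilliefors: roughly 0.71 false positives, 0.05 false negatives
rates(x, z)  # Shapiro-Wilk: roughly 0.56 false positives
rates(x, w)  # Anderson-Darling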
I am going to repeat the process, but this time the sample size will be 40. This is a common rule-of-thumb threshold at which the central limit theorem lets us treat the sampling distribution of the mean as approximately normal for t-tests, even if the underlying data are not normally distributed.
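To see why n = 40 is treated as a rough threshold, here is a quick illustrative sketch of the central limit theorem at work (not part of the original analysis): the means of strongly skewed exponential samples of size 40 already look close to normal.

means <- replicate(10000, mean(rexp(40, rate=1)))
hist(means, breaks=50, main="Means of exponential samples, n = 40")
qqnorm(means); qqline(means)  # close to a straight line despite skewed data

With that in mind, here is the repeat of the simulation at n = 40: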
x <- vector()
y <- vector()
z <- vector()
w <- vector()
for (i in 1:10000){
  a <- sample(1:3, 1, prob=c(0.5, 0.25, 0.25))
  if (a==1){
    b <- rnorm(40, mean=0, sd=1)
    x[i] <- 1
  } else if (a==2){
    b <- rexp(40, rate=1)
    x[i] <- 0
  } else {
    b <- rgamma(40, shape=1, rate=1)
    x[i] <- 0
  }
  temp <- lillie.test(b)
  if (temp$p.value<0.05){y[i] <- 0} else {y[i] <- 1}
  temp <- shapiro.test(b)
  if (temp$p.value<0.05){z[i] <- 0} else {z[i] <- 1}
  temp <- ad.test(b)  # again ad.test rather than the repeated shapiro.test
  if (temp$p.value<0.05){w[i] <- 0} else {w[i] <- 1}
}
table(x,y)
##    y
## x      0    1
##   0 4562  452
##   1  259 4727
table(x,z)
##    z
## x      0    1
##   0 4999   15
##   1  269 4717
table(x,w)
##    w
## x      0    1
##   0 4999   15
##   1  269 4717
This produces considerably better results, but only for the false positives; the false negatives stay at around 5%, which is exactly what we would expect given that the p-values of all of these test statistics are uniformly distributed when the data really are normal.
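The constancy of the false negative rate is easy to demonstrate: when the data really are normal, the p-values of a valid normality test are approximately uniform on [0, 1], so about 5% fall below 0.05 at any sample size. A minimal sketch:

p <- replicate(10000, shapiro.test(rnorm(40))$p.value)
hist(p, breaks=20, main="Shapiro-Wilk p-values for truly normal samples")
mean(p < 0.05)  # close to 0.05 regardless of the sample size used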
Sample size is very important for tests of normality to work well, but by the time you reach large sample sizes, normality (or the lack of it) is usually obvious from histograms and QQ-plots, or the central limit theorem justifies parametric tests anyway. In most instances, then, formal tests of normality are neither effective nor particularly useful.
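As an illustration of that last point, here is a sketch comparing QQ-plots of exponential data at small and large sample sizes; the departure from normality that the tests struggle to detect at n = 10 is unmistakable to the eye at n = 500:

par(mfrow=c(1,2))
small <- rexp(10); big <- rexp(500)
qqnorm(small, main="Exponential, n = 10");  qqline(small)
qqnorm(big,   main="Exponential, n = 500"); qqline(big)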