This is a continuation of my examination of tests for normality. This time I am going to create simulations that randomly sample from either a normal, a gamma or an exponential distribution, use the tests to classify each sample as normal or not normal, and then compare the test results with which samples actually came from normal distributions. From this I can work out the true positives, true negatives, false positives and false negatives.
The first simulation creates 10,000 random samples of size 10, each drawn from a normal distribution (with probability 0.5), an exponential distribution (0.25) or a gamma distribution (0.25), and calculates the Kolmogorov-Smirnov (Lilliefors), Shapiro-Wilk and Anderson-Darling tests for each sample. Each test's p-value is then converted into a classification: normal if p >= 0.05, not normal otherwise. Keep in mind from the previous post that the p-values for all of these tests are uniformly distributed when the data really are normal.
library(nortest)  # provides lillie.test (Lilliefors KS) and ad.test (Anderson-Darling)

x <- vector()  # truth: 1 = sample drawn from a normal distribution, 0 = not
y <- vector()  # Lilliefors (KS) verdict: 1 = classified normal, 0 = not
z <- vector()  # Shapiro-Wilk verdict
w <- vector()  # Anderson-Darling verdict
for (i in 1:10000){
  a <- sample(1:3, 1, prob=c(0.5, 0.25, 0.25))
  if (a==1){
    b <- rnorm(10, mean=0, sd=1)
    x[i] <- 1
  } else if (a==2){
    b <- rexp(10, rate=1)
    x[i] <- 0
  } else {
    b <- rgamma(10, shape=1, rate=1)
    x[i] <- 0
  }
  temp <- lillie.test(b)
  if (temp$p.value<0.05){y[i] <- 0} else {y[i] <- 1}
  temp <- shapiro.test(b)
  if (temp$p.value<0.05){z[i] <- 0} else {z[i] <- 1}
  # the original run repeated shapiro.test here by mistake, which is why the
  # w tables below are identical to the z tables; ad.test is the intended
  # Anderson-Darling test
  temp <- ad.test(b)
  if (temp$p.value<0.05){w[i] <- 0} else {w[i] <- 1}
}
table(x,y)
##    y
## x      0    1
##   0 1483 3573
##   1  272 4672
table(x,z)
##    z
## x      0    1
##   0 2222 2834
##   1  268 4676
table(x,w)
##    w
## x      0    1
##   0 2222 2834
##   1  268 4676
The performance of the normality tests is not very good, especially for false positives, i.e. the cases where x = 0 (the sample did not come from a normal distribution) but y, z or w is 1 (the test failed to reject normality). More than half of the non-normal samples pass each test at this sample size.
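As a quick sanity check, the error rates can be read straight off these confusion tables. Here is a minimal sketch; the helper rates() is my own illustration, and it assumes the vectors x, y, z and w from the simulation above are still in the workspace:

rates <- function(truth, verdict){
  tab <- table(truth, verdict)
  c(false_positive = tab["0","1"]/sum(tab["0",]),   # non-normal classified as normal
    false_negative = tab["1","0"]/sum(tab["1",]))   # normal classified as non-normal
}
rates(x, y)  # Lilliefors: roughly 0.71 false positives, 0.05 false negatives
rates(x, z)  # Shapiro-Wilk: roughly 0.56 false positives
rates(x, w)  # Anderson-Darling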
I am going to repeat the process, but this time the sample size will be 40. This is a common rule-of-thumb threshold at which the central limit theorem lets us treat the sampling distribution of the mean as approximately normal for t-tests, even if the underlying data are not normally distributed.
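To see why n = 40 is treated as a rough threshold, here is a quick illustrative sketch of the central limit theorem at work (not part of the original analysis): the means of strongly skewed exponential samples of size 40 already look close to normal.

means <- replicate(10000, mean(rexp(40, rate=1)))
hist(means, breaks=50, main="Means of exponential samples, n = 40")
qqnorm(means); qqline(means)  # close to a straight line despite skewed data

With that in mind, here is the repeat of the simulation at n = 40: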
x <- vector()
y <- vector()
z <- vector()
w <- vector()
for (i in 1:10000){
  a <- sample(1:3, 1, prob=c(0.5, 0.25, 0.25))
  if (a==1){
    b <- rnorm(40, mean=0, sd=1)
    x[i] <- 1
  } else if (a==2){
    b <- rexp(40, rate=1)
    x[i] <- 0
  } else {
    b <- rgamma(40, shape=1, rate=1)
    x[i] <- 0
  }
  temp <- lillie.test(b)
  if (temp$p.value<0.05){y[i] <- 0} else {y[i] <- 1}
  temp <- shapiro.test(b)
  if (temp$p.value<0.05){z[i] <- 0} else {z[i] <- 1}
  temp <- ad.test(b)  # again ad.test rather than the repeated shapiro.test
  if (temp$p.value<0.05){w[i] <- 0} else {w[i] <- 1}
}
table(x,y)
##    y
## x      0    1
##   0 4562  452
##   1  259 4727
table(x,z)
##    z
## x      0    1
##   0 4999   15
##   1  269 4717
table(x,w)
##    w
## x      0    1
##   0 4999   15
##   1  269 4717
This produces considerably better results, but only for the false positives; the false negatives stay at around 5%, which is exactly what we would expect given that the p-values of all of these test statistics are uniformly distributed when the data really are normal.
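The constancy of the false negative rate is easy to demonstrate: when the data really are normal, the p-values of a valid normality test are approximately uniform on [0, 1], so about 5% fall below 0.05 at any sample size. A minimal sketch:

p <- replicate(10000, shapiro.test(rnorm(40))$p.value)
hist(p, breaks=20, main="Shapiro-Wilk p-values for truly normal samples")
mean(p < 0.05)  # close to 0.05 regardless of the sample size used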
Sample size is very important for tests of normality to work well, but by the time you reach large sample sizes, normality (or the lack of it) is usually obvious from histograms and QQ-plots, or the central limit theorem justifies parametric tests anyway. In most instances, then, formal tests of normality are neither effective nor particularly useful.
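As an illustration of that last point, here is a sketch comparing QQ-plots of exponential data at small and large sample sizes; the departure from normality that the tests struggle to detect at n = 10 is unmistakable to the eye at n = 500:

par(mfrow=c(1,2))
small <- rexp(10); big <- rexp(500)
qqnorm(small, main="Exponential, n = 10");  qqline(small)
qqnorm(big,   main="Exponential, n = 500"); qqline(big)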