Normality Revisited Again

Author

Andrew Dalby

Testing For Normality in the “Real World” - Dealing with Skew

When I simulated skewed data to see how well normality tests classify data at different sample sizes, I sampled the skew from a uniform distribution between -3 and 3. It would arguably be more realistic to sample it from a normal distribution with a mean of 0 and a standard deviation of 1.5. Approximately 95% of cases will then have a skew between -3 and 3, and about 68% will lie between -1.5 and 1.5, the threshold used here for treating the data as normal rather than sufficiently skewed.
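
The two percentages quoted above can be checked with base R's pnorm. This is only a sanity check of the sampling distribution for the skew, not part of the original simulation; the helper name prob_within is made up for this sketch.

```r
# Probability that a draw from N(0, sd) falls between -limit and +limit
prob_within <- function(limit, sd = 1.5) {
  pnorm(limit, mean = 0, sd = sd) - pnorm(-limit, mean = 0, sd = sd)
}

p_within_2sd <- prob_within(3)    # skew between -3 and 3
p_within_1sd <- prob_within(1.5)  # skew between -1.5 and 1.5
round(c(p_within_2sd, p_within_1sd), 4)
```

So roughly a third of the simulated samples are expected to fall outside the -1.5 to 1.5 band, which is in line with the prevalence of class 0 reported in the confusion matrices below.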

The question is: how does this affect the tests?

Small Sample Simulation with Shapiro-Wilk Test

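library(fGarch)   # rsnorm(): skew-normal random numbers
library(ggplot2)
library(nortest)  # ad.test() and lillie.test(), used in later sections
library(caret)    # confusionMatrix()

set.seed(1)
predicted <- vector()
actual <- vector()
for (i in 1:10000){
  # draw the skew parameter for this sample, then test a sample of 8 values
  xi<-rnorm(1,0,1.5)
  y <- shapiro.test(rsnorm(8,mean=168,sd=6.4, xi))
  # actual class: 1 = treated as normal (skew within -1.5 to 1.5), 0 = skewed
  if( xi < -1.5){actual[i]=0}
    else if (xi > 1.5){actual[i]=0}
    else{actual[i]=1}
  # predicted class: 0 = normality rejected at the 5% level, 1 = not rejected
  if(y$p.value < 0.05){predicted[i]=0}
    else{predicted[i]=1}
}
predicted <- factor(predicted, levels = c(0,1))
actual <- factor(actual, levels = c(0,1))
confusionMatrix(data=predicted, reference = actual)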

Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0  364  637
         1 2885 6114
                                          
               Accuracy : 0.6478          
                 95% CI : (0.6383, 0.6572)
    No Information Rate : 0.6751          
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.0215          
                                          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.1120          
            Specificity : 0.9056          
         Pos Pred Value : 0.3636          
         Neg Pred Value : 0.6794          
             Prevalence : 0.3249          
         Detection Rate : 0.0364          
   Detection Prevalence : 0.1001          
      Balanced Accuracy : 0.5088          
                                          
       'Positive' Class : 0

The sensitivity of the test is still very low, but the specificity is high. The test classifies almost everything as normal, and only in a small number of cases does it correctly detect that the data are not normal. This is a very high type II error rate: about 89% of the skewed samples are missed. The high specificity, in contrast, indicates a low type I error rate.
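
To make the link between the confusion matrix and the two error rates explicit, the figures can be recomputed by hand in base R. The counts are copied from the output above; the positive class is 0 (not normal), so sensitivity is the share of skewed samples that the test rejects. The variable names are chosen for this sketch.

```r
# Counts from the Shapiro-Wilk confusion matrix for n = 8
# (rows = prediction, columns = reference; class 0 = not normal)
cm <- matrix(c(364, 2885, 637, 6114), nrow = 2,
             dimnames = list(Prediction = c("0", "1"),
                             Reference  = c("0", "1")))

sensitivity <- cm["0", "0"] / sum(cm[, "0"])  # skewed samples correctly rejected
specificity <- cm["1", "1"] / sum(cm[, "1"])  # normal samples correctly retained
accuracy    <- sum(diag(cm)) / sum(cm)

type_II_rate <- 1 - sensitivity  # skewed samples passed as normal
type_I_rate  <- 1 - specificity  # normal samples rejected

round(c(sensitivity = sensitivity, specificity = specificity,
        accuracy = accuracy, type_II = type_II_rate, type_I = type_I_rate), 4)
```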

Small Sample Simulation with Anderson-Darling Test

library(fGarch)
library(ggplot2)
library(nortest)
library(caret)

predicted <- vector()
actual <- vector()
for (i in 1:10000){
  xi<-rnorm(1,0,1.5)
  y <- ad.test(rsnorm(8,168,6.4,xi))
  if( xi < -1.5){actual[i]=0}
    else if (xi > 1.5){actual[i]=0}
    else{actual[i]=1}
  if(y$p.value < 0.05){predicted[i]=0}
    else{predicted[i]=1}
}
predicted <- factor(predicted, levels = c(0,1))
actual <- factor(actual, levels = c(0,1))
confusionMatrix(data=predicted, reference = actual)

Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0  327  630
         1 2805 6238
                                          
               Accuracy : 0.6565          
                 95% CI : (0.6471, 0.6658)
    No Information Rate : 0.6868          
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.0156          
                                          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.1044          
            Specificity : 0.9083          
         Pos Pred Value : 0.3417          
         Neg Pred Value : 0.6898          
             Prevalence : 0.3132          
         Detection Rate : 0.0327          
   Detection Prevalence : 0.0957          
      Balanced Accuracy : 0.5063          
                                          
       'Positive' Class : 0

The results for Anderson-Darling are very similar to those from Shapiro-Wilk.

Small Sample Simulation with Lilliefors Test (Kolmogorov-Smirnov)

library(fGarch)
library(ggplot2)
library(nortest)
library(caret)

predicted <- vector()
actual <- vector()
for (i in 1:10000){
  xi<-rnorm(1,0,1.5)
  y <- lillie.test(rsnorm(8,168,6.4,xi))
  if( xi < -1.5){actual[i]=0}
    else if (xi > 1.5){actual[i]=0}
    else{actual[i]=1}
  if(y$p.value < 0.05){predicted[i]=0}
    else{predicted[i]=1}
}
predicted <- factor(predicted, levels = c(0,1))
actual <- factor(actual, levels = c(0,1))
confusionMatrix(data=predicted, reference = actual)

Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0  271  515
         1 2863 6351
                                          
               Accuracy : 0.6622          
                 95% CI : (0.6528, 0.6715)
    No Information Rate : 0.6866          
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.0144          
                                          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.08647         
            Specificity : 0.92499         
         Pos Pred Value : 0.34478         
         Neg Pred Value : 0.68928         
             Prevalence : 0.31340         
         Detection Rate : 0.02710         
   Detection Prevalence : 0.07860         
      Balanced Accuracy : 0.50573         
                                          
       'Positive' Class : 0

For the Lilliefors test (the Kolmogorov-Smirnov test with the mean and variance estimated from the sample), the specificity is even higher, but at the cost of sensitivity.

Medium Sample Size Simulation with Shapiro-Wilk Test

library(fGarch)
library(ggplot2)
library(nortest)
library(caret)

set.seed(1)
predicted <- vector()
actual <- vector()
for (i in 1:10000){
  xi<-rnorm(1,0,1.5)
  y <- shapiro.test(rsnorm(30,mean=168,sd=6.4, xi))
  if( xi < -1.5){actual[i]=0}
    else if (xi > 1.5){actual[i]=0}
    else{actual[i]=1}
  if(y$p.value < 0.05){predicted[i]=0}
    else{predicted[i]=1}
}
predicted <- factor(predicted, levels = c(0,1))
actual <- factor(actual, levels = c(0,1))
confusionMatrix(data=predicted, reference = actual)

Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 1336 2113
         1 1836 4715
                                          
               Accuracy : 0.6051          
                 95% CI : (0.5954, 0.6147)
    No Information Rate : 0.6828          
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.1092          
                                          
 Mcnemar's Test P-Value : 1.123e-05       
                                          
            Sensitivity : 0.4212          
            Specificity : 0.6905          
         Pos Pred Value : 0.3874          
         Neg Pred Value : 0.7197          
             Prevalence : 0.3172          
         Detection Rate : 0.1336          
   Detection Prevalence : 0.3449          
      Balanced Accuracy : 0.5559          
                                          
       'Positive' Class : 0

There is a considerable improvement in sensitivity with a moderate loss of specificity. The prediction accuracy is still only about 61% (balanced accuracy about 56%), which is not very good.

Medium Sample Size Simulation with Lilliefors Test (Kolmogorov-Smirnov)

library(fGarch)
library(ggplot2)
library(nortest)
library(caret)

predicted <- vector()
actual <- vector()
for (i in 1:10000){
  xi<-rnorm(1,0,1.5)
  y <- lillie.test(rsnorm(30,168,6.4,xi))
  if( xi < -1.5){actual[i]=0}
    else if (xi > 1.5){actual[i]=0}
    else{actual[i]=1}
  if(y$p.value < 0.05){predicted[i]=0}
    else{predicted[i]=1}
}
predicted <- factor(predicted, levels = c(0,1))
actual <- factor(actual, levels = c(0,1))
confusionMatrix(data=predicted, reference = actual)

Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0  764 1231
         1 2443 5562
                                          
               Accuracy : 0.6326          
                 95% CI : (0.6231, 0.6421)
    No Information Rate : 0.6793          
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.0633          
                                          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.2382          
            Specificity : 0.8188          
         Pos Pred Value : 0.3830          
         Neg Pred Value : 0.6948          
             Prevalence : 0.3207          
         Detection Rate : 0.0764          
   Detection Prevalence : 0.1995          
      Balanced Accuracy : 0.5285          
                                          
       'Positive' Class : 0

This is worse than the Shapiro-Wilk result: the gain in sensitivity is much smaller, although the loss of specificity is smaller too.

Large Sample Size Simulation with Shapiro-Wilk Test

library(fGarch)
library(ggplot2)
library(nortest)
library(caret)

set.seed(1)
predicted <- vector()
actual <- vector()
for (i in 1:10000){
  xi<-rnorm(1,0,1.5)
  y <- shapiro.test(rsnorm(500,mean=168,sd=6.4, xi))
  if( xi < -1.5){actual[i]=0}
    else if (xi > 1.5){actual[i]=0}
    else{actual[i]=1}
  if(y$p.value < 0.05){predicted[i]=0}
    else{predicted[i]=1}
}
predicted <- factor(predicted, levels = c(0,1))
actual <- factor(actual, levels = c(0,1))
confusionMatrix(data=predicted, reference = actual)

Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 3111 5597
         1    0 1292
                                          
               Accuracy : 0.4403          
                 95% CI : (0.4305, 0.4501)
    No Information Rate : 0.6889          
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.1256          
                                          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 1.0000          
            Specificity : 0.1875          
         Pos Pred Value : 0.3573          
         Neg Pred Value : 1.0000          
             Prevalence : 0.3111          
         Detection Rate : 0.3111          
   Detection Prevalence : 0.8708          
      Balanced Accuracy : 0.5938          
                                          
       'Positive' Class : 0

The sensitivity has improved to perfection. The type II error rate is 0, as the test now identifies all of the skewed data as not normal. However, this comes at the cost of a large fall in specificity: much of the data labelled normal is now also rejected and classified as not normal.
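
The trade-off follows from the way the rejection rate for any fixed departure from normality grows with sample size. A minimal sketch of that effect in base R is given below. It uses rgamma with shape = 4 as a stand-in for mildly skewed data, which is an assumption made for this illustration and not the rsnorm sampler used in the simulations; the function name rejection_rate is also made up here.

```r
set.seed(1)

# Share of simulated samples for which shapiro.test rejects normality at 5%.
# rgamma(shape = 4) is an assumed stand-in for mildly skewed data.
rejection_rate <- function(n, reps = 200, alpha = 0.05) {
  p <- replicate(reps, shapiro.test(rgamma(n, shape = 4))$p.value)
  mean(p < alpha)
}

sizes <- c(8, 30, 500)
rates <- vapply(sizes, rejection_rate, numeric(1))
names(rates) <- sizes
rates
```

The same mild skew that usually passes at n = 8 is rejected far more often at n = 500, which is why samples labelled normal by the skew threshold end up being rejected in the large-sample runs.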

Test accuracy remains low: it is only 44% here, and the best case across the sample sizes tried is around 65%.
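
One way to put the accuracy figure in context is to compare it with the no-information rate that caret reports, which is the accuracy obtained by always predicting the larger class (here, always calling the data normal). The sketch below recomputes both from the large-sample Shapiro-Wilk counts above; the variable names are chosen for this illustration.

```r
# Counts from the Shapiro-Wilk confusion matrix for n = 500
# (rows = prediction, columns = reference; class 0 = not normal)
cm <- matrix(c(3111, 0, 5597, 1292), nrow = 2,
             dimnames = list(Prediction = c("0", "1"),
                             Reference  = c("0", "1")))

accuracy     <- sum(diag(cm)) / sum(cm)         # share classified correctly
no_info_rate <- max(colSums(cm)) / sum(cm)      # always predict the larger class

c(accuracy = accuracy, no_info_rate = no_info_rate)
```

The test does worse than ignoring the data altogether and calling every sample normal, which is what the P-Value [Acc > NIR] of 1 in the output reflects.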

Large Sample Size Simulation with Lilliefors Test (Kolmogorov-Smirnov)

library(fGarch)
library(ggplot2)
library(nortest)
library(caret)

predicted <- vector()
actual <- vector()
for (i in 1:10000){
  xi<-rnorm(1,0,1.5)
  y <- lillie.test(rsnorm(500,168,6.4,xi))
  if( xi < -1.5){actual[i]=0}
    else if (xi > 1.5){actual[i]=0}
    else{actual[i]=1}
  if(y$p.value < 0.05){predicted[i]=0}
    else{predicted[i]=1}
}
predicted <- factor(predicted, levels = c(0,1))
actual <- factor(actual, levels = c(0,1))
confusionMatrix(data=predicted, reference = actual)

Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 3184 5196
         1    3 1617
                                          
               Accuracy : 0.4801          
                 95% CI : (0.4703, 0.4899)
    No Information Rate : 0.6813          
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.1649          
                                          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.9991          
            Specificity : 0.2373          
         Pos Pred Value : 0.3800          
         Neg Pred Value : 0.9981          
             Prevalence : 0.3187          
         Detection Rate : 0.3184          
   Detection Prevalence : 0.8380          
      Balanced Accuracy : 0.6182          
                                          
       'Positive' Class : 0

For large samples the Kolmogorov-Smirnov test outperforms the Shapiro-Wilk test. It does not have 100% sensitivity, so some skewed data is classified as normal, but the number is very small (3 of 3,187 cases). It has better specificity, so it picks out normally distributed data better, and it has better prediction accuracy.

Conclusion

There is very little difference between the results for the normally distributed and the uniformly distributed skew. If anything, for large samples the specificity is slightly worse when the skew is normally distributed. This still raises questions about using tests for normality as a yes/no gate before subsequent statistical analysis, to choose between parametric and non-parametric methods. Testing for normality should be approached with caution.
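
The simulations above all share one structure and differ only in the sample size and the test. A hedged sketch of that structure as a single function is given below. The name simulate_normality and its arguments are hypothetical and not from the original code; the sampler is passed in as a function so that the skew-normal generator can be swapped.

```r
# Hypothetical helper: one simulation run, returned as a 2 x 2 table of
# predicted versus actual class (0 = not normal, 1 = normal).
simulate_normality <- function(n, sampler, test = shapiro.test,
                               reps = 10000, alpha = 0.05, cutoff = 1.5) {
  xi <- rnorm(reps, mean = 0, sd = 1.5)        # one skew value per sample
  actual <- as.integer(abs(xi) <= cutoff)      # 1 = treated as normal
  p <- vapply(xi, function(s) test(sampler(n, s))$p.value, numeric(1))
  predicted <- as.integer(p >= alpha)          # 1 = normality not rejected
  table(Prediction = factor(predicted, levels = c(0, 1)),
        Reference  = factor(actual,    levels = c(0, 1)))
}

# Example call mirroring the n = 8 Shapiro-Wilk run (requires fGarch):
# height_sampler <- function(n, s) fGarch::rsnorm(n, mean = 168, sd = 6.4, xi = s)
# simulate_normality(8, height_sampler)
```

Because this version draws all the skew values before any samples, the random number stream differs from the loops above, so the counts will be close to the printed tables but will not reproduce them exactly. The resulting table can be passed to caret's confusionMatrix to get the same summary statistics.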

One way of resolving this issue for skewed data is to determine the effect of skew on the subsequent parametric NHST. How much does skew affect t-tests and ANOVA? There is some existing literature on these effects, but it is largely uncited because we have become stuck with the normal/non-normal dichotomy rather than the reality that normality is a continuum in terms of skew.
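
As a starting point for that kind of check, the false positive rate of a t-test can be estimated by simulation when both groups come from the same skewed distribution. The sketch below is only an illustration under stated assumptions: rgamma with shape = 2 is an assumed stand-in for skewed data, the group size of 30 is arbitrary, and false_positive_rate is a name made up here. It does not settle the question for other sample sizes, degrees of skew, or ANOVA.

```r
set.seed(1)

# Share of two-sample t-tests that reject at 5% when both groups are drawn
# from the same distribution, so every rejection is a false positive.
false_positive_rate <- function(draw, n = 30, reps = 2000, alpha = 0.05) {
  p <- replicate(reps, t.test(draw(n), draw(n))$p.value)
  mean(p < alpha)
}

rates <- c(
  normal = false_positive_rate(function(n) rnorm(n, mean = 168, sd = 6.4)),
  skewed = false_positive_rate(function(n) rgamma(n, shape = 2))  # assumed stand-in
)
rates
```

If the two rates are both close to the nominal 5%, skew of that size is doing little harm to the t-test at that sample size, whatever a normality test says about the data.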