Normality Revisited Again

Author

Andrew Dalby

Testing For Normality in the “Real World” - Dealing with Skew

When I simulated skewed data to see how well normality tests classify data at different sample sizes, I sampled the skew from a uniform distribution between -3 and 3. It would arguably be more realistic to sample it from a normal distribution with a mean of 0 and a standard deviation of 1.5. Approximately 95% of cases will then have a skew between -3 and 3, and 68% will lie between -1.5 and 1.5, the range within which I treat the data as normal rather than sufficiently skewed.
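The coverage figures quoted above can be checked directly from the normal distribution function. The short sketch below uses only base R; the helper name `coverage` is introduced here for illustration. It also computes the share of draws that the cutoffs written in the simulation loops (xi < -1.5 or xi > 1) label as not normal.

```r
# Skew parameter drawn as in the simulations: normal, mean 0, sd 1.5
sd_xi <- 1.5

# Share of draws falling inside a symmetric interval [-a, a]
coverage <- function(a, sd = sd_xi) pnorm(a, 0, sd) - pnorm(-a, 0, sd)

within_3   <- coverage(3)    # two standard deviations either side
within_1_5 <- coverage(1.5)  # one standard deviation either side

# Share labelled "not normal" (0) by the cutoffs used in the loops:
# xi < -1.5 or xi > 1
p_nonnormal <- pnorm(-1.5, 0, sd_xi) + (1 - pnorm(1, 0, sd_xi))

print(round(c(within_3 = within_3, within_1_5 = within_1_5,
              p_nonnormal = p_nonnormal), 3))
```

The first two values are about 0.954 and 0.683. The third is about 0.41, which is in line with the prevalence of roughly 0.40 to 0.42 reported in the confusion matrices.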

The question is: how does this affect the tests?

Small Sample Simulation with Shapiro-Wilk Test

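library(fGarch)   # rsnorm(): skew-normal random numbers
library(ggplot2)
library(nortest)  # ad.test(), lillie.test()
library(caret)    # confusionMatrix()

set.seed(1)
predicted <- vector()
actual <- vector()
for (i in 1:10000){
  # draw the skew parameter, then test a sample of 8 generated with it
  xi<-rnorm(1,0,1.5)
  y <- shapiro.test(rsnorm(8,mean=168,sd=6.4, xi))
  # actual: 0 = skewed (xi < -1.5 or xi > 1), 1 = treated as normal
  if( xi < -1.5){actual[i]=0}
    else if (xi > 1){actual[i]=0}
    else{actual[i]=1}
  # predicted: 0 = normality rejected at the 5% level, 1 = not rejected
  if(y$p.value < 0.05){predicted[i]=0}
    else{predicted[i]=1}
}
# class 0 (not normal) is the first level, so caret treats it as 'Positive'
predicted <- factor(predicted, levels = c(0,1))
actual <- factor(actual, levels = c(0,1))
confusionMatrix(data=predicted, reference = actual)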

Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0  423  578
         1 3787 5212
                                          
               Accuracy : 0.5635          
                 95% CI : (0.5537, 0.5733)
    No Information Rate : 0.579           
    P-Value [Acc > NIR] : 0.9992          
                                          
                  Kappa : 7e-04           
                                          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.1005          
            Specificity : 0.9002          
         Pos Pred Value : 0.4226          
         Neg Pred Value : 0.5792          
             Prevalence : 0.4210          
         Detection Rate : 0.0423          
   Detection Prevalence : 0.1001          
      Balanced Accuracy : 0.5003          
                                          
       'Positive' Class : 0

The sensitivity of the test is still very low but the specificity is high. The test classifies almost everything as normal, and only in a small number of cases does it correctly detect that the data are not normal. This corresponds to a very high type II error rate, whereas the high specificity indicates a low type I error rate.
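To make the link between the confusion matrix and the error rates explicit, the counts above can be turned into sensitivity, specificity and the two error rates by hand. This is a minimal base R sketch; the variable names are chosen here for illustration, and the positive class is 0 (not normal), as in the caret output.

```r
# Counts copied from the Shapiro-Wilk confusion matrix above (n = 8).
# Rows are predictions, columns the reference; class 0 = not normal.
cm <- matrix(c(423, 3787, 578, 5212), nrow = 2,
             dimnames = list(Prediction = c("0", "1"),
                             Reference  = c("0", "1")))

sensitivity <- cm["0", "0"] / sum(cm[, "0"])  # skewed samples that were rejected
specificity <- cm["1", "1"] / sum(cm[, "1"])  # samples labelled normal that were not rejected
accuracy    <- sum(diag(cm)) / sum(cm)

type_II_rate <- 1 - sensitivity  # skewed samples passed as normal
type_I_rate  <- 1 - specificity  # samples labelled normal that were rejected

print(round(c(sensitivity = sensitivity, specificity = specificity,
              accuracy = accuracy, type_II = type_II_rate,
              type_I = type_I_rate), 4))
```

The first three values reproduce the caret figures (0.1005, 0.9002 and 0.5635), and the type II rate of about 0.90 is what "very high" means here.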

Small Sample Simulation with Anderson-Darling Test

library(fGarch)
library(ggplot2)
library(nortest)
library(caret)

predicted <- vector()
actual <- vector()
for (i in 1:10000){
  xi<-rnorm(1,0,1.5)
  y <- ad.test(rsnorm(8,168,6.4,xi))
  if( xi < -1.5){actual[i]=0}
    else if (xi > 1){actual[i]=0}
    else{actual[i]=1}
  if(y$p.value < 0.05){predicted[i]=0}
    else{predicted[i]=1}
}
predicted <- factor(predicted, levels = c(0,1))
actual <- factor(actual, levels = c(0,1))
confusionMatrix(data=predicted, reference = actual)

Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0  383  574
         1 3698 5345
                                         
               Accuracy : 0.5728         
                 95% CI : (0.563, 0.5825)
    No Information Rate : 0.5919         
    P-Value [Acc > NIR] : 0.9999         
                                         
                  Kappa : -0.0035        
                                         
 Mcnemar's Test P-Value : <2e-16         
                                         
            Sensitivity : 0.09385        
            Specificity : 0.90302        
         Pos Pred Value : 0.40021        
         Neg Pred Value : 0.59106        
             Prevalence : 0.40810        
         Detection Rate : 0.03830        
   Detection Prevalence : 0.09570        
      Balanced Accuracy : 0.49844        
                                         
       'Positive' Class : 0

The results for the Anderson-Darling test are very similar to those for the Shapiro-Wilk test.

Small Sample Simulation with Lilliefors Test (Kolmogorov-Smirnov)

library(fGarch)
library(ggplot2)
library(nortest)
library(caret)

predicted <- vector()
actual <- vector()
for (i in 1:10000){
  xi<-rnorm(1,0,1.5)
  y <- lillie.test(rsnorm(8,168,6.4,xi))
  if( xi < -1.5){actual[i]=0}
    else if (xi > 1){actual[i]=0}
    else{actual[i]=1}
  if(y$p.value < 0.05){predicted[i]=0}
    else{predicted[i]=1}
}
predicted <- factor(predicted, levels = c(0,1))
actual <- factor(actual, levels = c(0,1))
confusionMatrix(data=predicted, reference = actual)

Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0  331  455
         1 3730 5484
                                          
               Accuracy : 0.5815          
                 95% CI : (0.5718, 0.5912)
    No Information Rate : 0.5939          
    P-Value [Acc > NIR] : 0.9943          
                                          
                  Kappa : 0.0056          
                                          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.08151         
            Specificity : 0.92339         
         Pos Pred Value : 0.42112         
         Neg Pred Value : 0.59518         
             Prevalence : 0.40610         
         Detection Rate : 0.03310         
   Detection Prevalence : 0.07860         
      Balanced Accuracy : 0.50245         
                                          
       'Positive' Class : 0

For the Lilliefors test (a variant of the Kolmogorov-Smirnov test for the case where the mean and variance are estimated from the sample) the specificity is even higher, but at the cost of sensitivity.
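The three small-sample runs above share one loop and differ only in the test function, so they could be wrapped in a single helper. The sketch below is a hypothetical refactoring, not the code that produced the output shown: `run_sim` and its arguments are names introduced here, and because it draws all the skew parameters up front the random number stream differs, so the counts would not match exactly.

```r
# Hypothetical helper, not part of the original scripts.
# test_fun : function returning an object with a $p.value element
# n        : sample size per replicate
# rgen     : function(n, xi) drawing one sample for a given skew parameter
run_sim <- function(test_fun, n, rgen, reps = 10000, alpha = 0.05) {
  xi <- rnorm(reps, 0, 1.5)                    # one skew parameter per replicate
  actual <- as.integer(xi >= -1.5 & xi <= 1)   # 1 = labelled normal, same cutoffs as the loops
  p <- vapply(xi, function(x) test_fun(rgen(n, x))$p.value, numeric(1))
  predicted <- as.integer(p >= alpha)          # 1 = normality not rejected
  table(Prediction = factor(predicted, levels = c(0, 1)),
        Reference  = factor(actual,    levels = c(0, 1)))
}

# With fGarch and nortest installed, the small-sample runs would be roughly:
# gen <- function(n, xi) fGarch::rsnorm(n, mean = 168, sd = 6.4, xi = xi)
# run_sim(shapiro.test, n = 8, rgen = gen)
# run_sim(nortest::ad.test, n = 8, rgen = gen)
# run_sim(nortest::lillie.test, n = 8, rgen = gen)
```

The returned table has the same layout as the caret confusion matrix, so it can be passed to `confusionMatrix()` if the full set of statistics is wanted.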

Medium Sample Size Simulation with Shapiro-Wilk Test

library(fGarch)
library(ggplot2)
library(nortest)
library(caret)

set.seed(1)
predicted <- vector()
actual <- vector()
for (i in 1:10000){
  xi<-rnorm(1,0,1.5)
  y <- shapiro.test(rsnorm(30,mean=168,sd=6.4, xi))
  if( xi < -1.5){actual[i]=0}
    else if (xi > 1){actual[i]=0}
    else{actual[i]=1}
  if(y$p.value < 0.05){predicted[i]=0}
    else{predicted[i]=1}
}
predicted <- factor(predicted, levels = c(0,1))
actual <- factor(actual, levels = c(0,1))
confusionMatrix(data=predicted, reference = actual)

Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 1410 2039
         1 2731 3820
                                          
               Accuracy : 0.523           
                 95% CI : (0.5132, 0.5328)
    No Information Rate : 0.5859          
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : -0.0077         
                                          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.3405          
            Specificity : 0.6520          
         Pos Pred Value : 0.4088          
         Neg Pred Value : 0.5831          
             Prevalence : 0.4141          
         Detection Rate : 0.1410          
   Detection Prevalence : 0.3449          
      Balanced Accuracy : 0.4962          
                                          
       'Positive' Class : 0

There is a considerable improvement in sensitivity (from about 0.10 to 0.34) with a moderate loss of specificity (from about 0.90 to 0.65). The prediction accuracy is still only about 52%, which is not very good.

Medium Sample Size Simulation with Lilliefors Test (Kolmogorov-Smirnov)

library(fGarch)
library(ggplot2)
library(nortest)
library(caret)

predicted <- vector()
actual <- vector()
for (i in 1:10000){
  xi<-rnorm(1,0,1.5)
  y <- lillie.test(rsnorm(30,168,6.4,xi))
  if( xi < -1.5){actual[i]=0}
    else if (xi > 1){actual[i]=0}
    else{actual[i]=1}
  if(y$p.value < 0.05){predicted[i]=0}
    else{predicted[i]=1}
}
predicted <- factor(predicted, levels = c(0,1))
actual <- factor(actual, levels = c(0,1))
confusionMatrix(data=predicted, reference = actual)

Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0  842 1153
         1 3288 4717
                                          
               Accuracy : 0.5559          
                 95% CI : (0.5461, 0.5657)
    No Information Rate : 0.587           
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.0081          
                                          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.2039          
            Specificity : 0.8036          
         Pos Pred Value : 0.4221          
         Neg Pred Value : 0.5893          
             Prevalence : 0.4130          
         Detection Rate : 0.0842          
   Detection Prevalence : 0.1995          
      Balanced Accuracy : 0.5037          
                                          
       'Positive' Class : 0

Compared with the Shapiro-Wilk results, the increase in sensitivity is much smaller (0.20 against 0.34), but so is the loss of specificity (0.80 against 0.65). The overall accuracy is slightly higher, at about 56%.

Large Sample Size Simulation with Shapiro-Wilk Test

library(fGarch)
library(ggplot2)
library(nortest)
library(caret)

set.seed(1)
predicted <- vector()
actual <- vector()
for (i in 1:10000){
  xi<-rnorm(1,0,1.5)
  y <- shapiro.test(rsnorm(500,mean=168,sd=6.4, xi))
  if( xi < -1.5){actual[i]=0}
    else if (xi > 1){actual[i]=0}
    else{actual[i]=1}
  if(y$p.value < 0.05){predicted[i]=0}
    else{predicted[i]=1}
}
predicted <- factor(predicted, levels = c(0,1))
actual <- factor(actual, levels = c(0,1))
confusionMatrix(data=predicted, reference = actual)

Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 3686 5022
         1  323  969
                                          
               Accuracy : 0.4655          
                 95% CI : (0.4557, 0.4753)
    No Information Rate : 0.5991          
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.068           
                                          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.9194          
            Specificity : 0.1617          
         Pos Pred Value : 0.4233          
         Neg Pred Value : 0.7500          
             Prevalence : 0.4009          
         Detection Rate : 0.3686          
   Detection Prevalence : 0.8708          
      Balanced Accuracy : 0.5406          
                                          
       'Positive' Class : 0

The sensitivity has improved markedly, to about 92%. The type II error rate is now low (about 8%), as the test identifies most of the skewed data as not normal. However, this comes at the cost of a large fall in specificity: much of the data labelled normal is now also rejected and classified as not normal.
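The gain in sensitivity with sample size reflects the growing power of the test: the same amount of skew is rarely detected in 8 observations and almost always detected in 500. The sketch below illustrates this with base R only. It is an assumption-laden stand-in rather than the simulation above: it uses gamma data with shape 4 (skewness 1) instead of `rsnorm`, and 2000 replicates per sample size.

```r
set.seed(1)
reps  <- 2000
sizes <- c(8, 30, 500)

# Proportion of Shapiro-Wilk rejections at the 5% level for each sample size.
# Gamma(shape = 4) has skewness 2 / sqrt(4) = 1; it stands in for skewed data.
reject_rate <- sapply(sizes, function(n) {
  mean(replicate(reps, shapiro.test(rgamma(n, shape = 4))$p.value < 0.05))
})
names(reject_rate) <- sizes

print(reject_rate)
```

The rejection rate rises with the sample size, which is the same pattern as the sensitivity across the small, medium and large simulations.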

Test accuracy remains low at about 47%, which is below the no-information rate of about 60%.

Large Sample Size Simulation with Lilliefors Test (Kolmogorov-Smirnov)

library(fGarch)
library(ggplot2)
library(nortest)
library(caret)

predicted <- vector()
actual <- vector()
for (i in 1:10000){
  xi<-rnorm(1,0,1.5)
  y <- lillie.test(rsnorm(500,168,6.4,xi))
  if( xi < -1.5){actual[i]=0}
    else if (xi > 1){actual[i]=0}
    else{actual[i]=1}
  if(y$p.value < 0.05){predicted[i]=0}
    else{predicted[i]=1}
}
predicted <- factor(predicted, levels = c(0,1))
actual <- factor(actual, levels = c(0,1))
confusionMatrix(data=predicted, reference = actual)

Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 3664 4716
         1  442 1178
                                         
               Accuracy : 0.4842         
                 95% CI : (0.4744, 0.494)
    No Information Rate : 0.5894         
    P-Value [Acc > NIR] : 1              
                                         
                  Kappa : 0.0796         
                                         
 Mcnemar's Test P-Value : <2e-16         
                                         
            Sensitivity : 0.8924         
            Specificity : 0.1999         
         Pos Pred Value : 0.4372         
         Neg Pred Value : 0.7272         
             Prevalence : 0.4106         
         Detection Rate : 0.3664         
   Detection Prevalence : 0.8380         
      Balanced Accuracy : 0.5461         
                                         
       'Positive' Class : 0

For large samples the Lilliefors test slightly outperforms the Shapiro-Wilk test. It does not have 100% sensitivity, and about 11% of the skewed samples are classified as normal (sensitivity 0.89 against 0.92). However, it has a better specificity (0.20 against 0.16), so it picks out the data labelled normal slightly better, and it has a better prediction accuracy (48% against 47%).

Conclusion

There is very little difference between the results for normally distributed and uniformly distributed skew. If anything, for large samples the specificity is slightly worse with the normally distributed skew. This still raises questions about using tests for normality as a binary yes/no check before subsequent statistical analysis, to decide between parametric and non-parametric methods. Testing for normality should be approached with caution.

One way of resolving this issue for skewed data is to determine the effect of skew on the subsequent parametric NHST. How much does skew affect t-tests and ANOVA? There is some existing literature on these effects, but it is largely uncited because we have become stuck with the normal/non-normal dichotomy rather than the reality that normality is a continuum in terms of skew.
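The question of how much skew affects a t-test can be explored with the same kind of simulation. The sketch below is one possible starting point under stated assumptions, not a result from the literature: it runs a one-sample t-test against the true mean, so every rejection is a type I error, and compares normal data with exponential data (skewness 2) at a sample size of 30. The helper name `type_I` and the choice of distributions are introduced here for illustration.

```r
set.seed(1)
reps <- 5000
n    <- 30

# One-sample t-test of the true mean: any rejection is a type I error
type_I <- function(rgen, true_mean) {
  mean(replicate(reps, t.test(rgen(n), mu = true_mean)$p.value < 0.05))
}

rates <- c(
  normal = type_I(function(n) rnorm(n, 168, 6.4), 168),
  skewed = type_I(function(n) rexp(n, rate = 1), 1)  # exponential: mean 1, skewness 2
)

print(rates)
```

Comparing the two rates with the nominal 5% level gives a direct measure of how much a given amount of skew distorts the test. Varying `n` and the skewed distribution would turn the normal/non-normal dichotomy into the continuum described above.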