Normality Revisited

Author

Andrew Dalby

Testing For Normality in the “Real World”

Previously I wrote a page about normality testing where I had data that was either normally distributed or that was skewed and I ran tests for normality on them to see if the tests worked in all of the cases.

In the real world we have a model/hypothesis and we cannot know if the data we collect will either agree with the hypothesis or be something completely different. In the case of normality we need to assess how well the test works against the distribution of all of the possible data distributions that we might see. My data might be normal or it might be some other distribution or it might be skewed.

The easiest of these situations to test is the effect of skew. If I say that if the data has a skew of over 1.5 it should not be be considered as normally distributed how well do the tests for normality work in discriminating these differences?

If the skewness is sampled from a uniform distribution from -3 to 3 then 50% of the cases will have values less than -1.5 or more than 1.5 and we want those to be rated as skewed and not normal. I can then apply the normality tests to these data simulations to see how well they perform.

This is doing something that will make hardcore statisticians roll their eyes and might even cause them to have raised blood pressure, because they work on falsification. If the test for normality does not reject the null hypothesis that the data is not normal, you are not proving that the data is normal, just that given that set of evidence you cannot show that it isn’t normal. But pragmatically that is exactly what we assume when there is no significant deviation from normality in the tests. This is why we use parametric methods for most analysis.

I am going to abuse the NHST framework in these simulations but from them I can then construct a truth table where 0 means it is not normal and 1 means it is normal. Then I can compare this to the statistical test where if the p-value is less than 0.05 we reject the null hypothesis of normality (0) otherwise we cannot reject the null hypothesis and we consider it normal (1).

Small Sample Simulation with Shapiro-Wilk Test

library(fGarch)
library(ggplot2)
library(nortest)
library(caret)

set.seed(1)
predicted <- vector()
actual <- vector()
for (i in 1:10000){
  xi<-runif(1,-3,3)
  y <- shapiro.test(rsnorm(8,mean=168,sd=6.4, xi))
  if( xi < -1.5){actual[i]=0}
    else if (xi > 1.5){actual[i]=0}
    else{actual[i]=1}
  if(y$p.value < 0.05){predicted[i]=0}
    else{predicted[i]=1}
}
predicted <- factor(predicted, levels = c(0,1))
actual <- factor(actual, levels = c(0,1))
confusionMatrix(data=predicted, reference = actual)
Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0  536  450
         1 4437 4577
                                          
               Accuracy : 0.5113          
                 95% CI : (0.5015, 0.5211)
    No Information Rate : 0.5027          
    P-Value [Acc > NIR] : 0.04362         
                                          
                  Kappa : 0.0183          
                                          
 Mcnemar's Test P-Value : < 2e-16         
                                          
            Sensitivity : 0.1078          
            Specificity : 0.9105          
         Pos Pred Value : 0.5436          
         Neg Pred Value : 0.5078          
             Prevalence : 0.4973          
         Detection Rate : 0.0536          
   Detection Prevalence : 0.0986          
      Balanced Accuracy : 0.5091          
                                          
       'Positive' Class : 0               
                                          

The sensitivity of the test is very low but the specificity is high. The test classifies almost everything as normal and it is only in a small number of cases that it correctly detects that the data is not normal. This shows a very high type II error rate. Whereas the high specificity indicates a lower type I error rate.

Small Sample Simulation with Anderson Darling

library(fGarch)
library(ggplot2)
library(nortest)
library(caret)

predicted <- vector()
actual <- vector()
for (i in 1:10000){
  xi<-runif(1,-3,3)
  y <- ad.test(rsnorm(8,168,6.4,xi))
  if( xi < -1.5){actual[i]=0}
    else if (xi > 1.5){actual[i]=0}
    else{actual[i]=1}
  if(y$p.value < 0.05){predicted[i]=0}
    else{predicted[i]=1}
}
predicted <- factor(predicted, levels = c(0,1))
actual <- factor(actual, levels = c(0,1))
confusionMatrix(data=predicted, reference = actual)
Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0  542  397
         1 4549 4512
                                          
               Accuracy : 0.5054          
                 95% CI : (0.4956, 0.5152)
    No Information Rate : 0.5091          
    P-Value [Acc > NIR] : 0.7734          
                                          
                  Kappa : 0.0252          
                                          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.1065          
            Specificity : 0.9191          
         Pos Pred Value : 0.5772          
         Neg Pred Value : 0.4980          
             Prevalence : 0.5091          
         Detection Rate : 0.0542          
   Detection Prevalence : 0.0939          
      Balanced Accuracy : 0.5128          
                                          
       'Positive' Class : 0               
                                          

The results for Anderson-Darling are very similar to those from Shapiro-Wilks

Small Sample Simulation with Lilliefors Test (Kolmogorov-Smirnov)

library(fGarch)
library(ggplot2)
library(nortest)
library(caret)

predicted <- vector()
actual <- vector()
for (i in 1:10000){
  xi<-runif(1,-3,3)
  y <- lillie.test(rsnorm(8,168,6.4,xi))
  if( xi < -1.5){actual[i]=0}
    else if (xi > 1.5){actual[i]=0}
    else{actual[i]=1}
  if(y$p.value < 0.05){predicted[i]=0}
    else{predicted[i]=1}
}
predicted <- factor(predicted, levels = c(0,1))
actual <- factor(actual, levels = c(0,1))
confusionMatrix(data=predicted, reference = actual)
Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0  428  352
         1 4506 4714
                                         
               Accuracy : 0.5142         
                 95% CI : (0.5044, 0.524)
    No Information Rate : 0.5066         
    P-Value [Acc > NIR] : 0.0655         
                                         
                  Kappa : 0.0175         
                                         
 Mcnemar's Test P-Value : <2e-16         
                                         
            Sensitivity : 0.08675        
            Specificity : 0.93052        
         Pos Pred Value : 0.54872        
         Neg Pred Value : 0.51128        
             Prevalence : 0.49340        
         Detection Rate : 0.04280        
   Detection Prevalence : 0.07800        
      Balanced Accuracy : 0.50863        
                                         
       'Positive' Class : 0              
                                         

For the Lilliefors test (equivalent to the Kolmogorov-Smirmov test) the specificity is even higher but at the cost of sensitivity.

Medium Sample Size Simulation with Shapiro Wilk Test

library(fGarch)
library(ggplot2)
library(nortest)
library(caret)

set.seed(1)
predicted <- vector()
actual <- vector()
for (i in 1:10000){
  xi<-runif(1,-3,3)
  y <- shapiro.test(rsnorm(30,mean=168,sd=6.4, xi))
  if( xi < -1.5){actual[i]=0}
    else if (xi > 1.5){actual[i]=0}
    else{actual[i]=1}
  if(y$p.value < 0.05){predicted[i]=0}
    else{predicted[i]=1}
}
predicted <- factor(predicted, levels = c(0,1))
actual <- factor(actual, levels = c(0,1))
confusionMatrix(data=predicted, reference = actual)
Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 2204 1428
         1 2819 3549
                                         
               Accuracy : 0.5753         
                 95% CI : (0.5655, 0.585)
    No Information Rate : 0.5023         
    P-Value [Acc > NIR] : < 2.2e-16      
                                         
                  Kappa : 0.1517         
                                         
 Mcnemar's Test P-Value : < 2.2e-16      
                                         
            Sensitivity : 0.4388         
            Specificity : 0.7131         
         Pos Pred Value : 0.6068         
         Neg Pred Value : 0.5573         
             Prevalence : 0.5023         
         Detection Rate : 0.2204         
   Detection Prevalence : 0.3632         
      Balanced Accuracy : 0.5759         
                                         
       'Positive' Class : 0              
                                         

There is a considerable improvement in sensitivity with a moderate loss of specificity. The prediction accuracy is still only about 57% which is not very good.

Medium Sample Size Simulation with Kolmogorov-Smirnov Test

library(fGarch)
library(ggplot2)
library(nortest)
library(caret)

predicted <- vector()
actual <- vector()
for (i in 1:10000){
  xi<-runif(1,-3,3)
  y <- lillie.test(rsnorm(30,168,6.4,xi))
  if( xi < -1.5){actual[i]=0}
    else if (xi > 1.5){actual[i]=0}
    else{actual[i]=1}
  if(y$p.value < 0.05){predicted[i]=0}
    else{predicted[i]=1}
}
predicted <- factor(predicted, levels = c(0,1))
actual <- factor(actual, levels = c(0,1))
confusionMatrix(data=predicted, reference = actual)
Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 1220  817
         1 3824 4139
                                          
               Accuracy : 0.5359          
                 95% CI : (0.5261, 0.5457)
    No Information Rate : 0.5044          
    P-Value [Acc > NIR] : 1.553e-10       
                                          
                  Kappa : 0.0766          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.2419          
            Specificity : 0.8351          
         Pos Pred Value : 0.5989          
         Neg Pred Value : 0.5198          
             Prevalence : 0.5044          
         Detection Rate : 0.1220          
   Detection Prevalence : 0.2037          
      Balanced Accuracy : 0.5385          
                                          
       'Positive' Class : 0               
                                          

This is worse than the Shapiro-Wilks results with a much smaller increase in sensitivity but also a smaller loss of specificity.

Large Sample Size Simulation with Shapiro Wilk Test

library(fGarch)
library(ggplot2)
library(nortest)
library(caret)

set.seed(1)
predicted <- vector()
actual <- vector()
for (i in 1:10000){
  xi<-runif(1,-3,3)
  y <- shapiro.test(rsnorm(500,mean=168,sd=6.4, xi))
  if( xi < -1.5){actual[i]=0}
    else if (xi > 1.5){actual[i]=0}
    else{actual[i]=1}
  if(y$p.value < 0.05){predicted[i]=0}
    else{predicted[i]=1}
}
predicted <- factor(predicted, levels = c(0,1))
actual <- factor(actual, levels = c(0,1))
confusionMatrix(data=predicted, reference = actual)
Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 4940 4034
         1    0 1026
                                          
               Accuracy : 0.5966          
                 95% CI : (0.5869, 0.6062)
    No Information Rate : 0.506           
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.2008          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 1.0000          
            Specificity : 0.2028          
         Pos Pred Value : 0.5505          
         Neg Pred Value : 1.0000          
             Prevalence : 0.4940          
         Detection Rate : 0.4940          
   Detection Prevalence : 0.8974          
      Balanced Accuracy : 0.6014          
                                          
       'Positive' Class : 0               
                                          

The sensitivity has improved to perfection. There is a 0 type II error rate as the test can now identify that all skewed data is not normal. However this comes at a cost of a large fall in specificity in that now lots of normal data is also being rejected and classified as not normal.

Test accuracy still remains below 60%.

Large Sample Size Simulation with Kolmogorov-Smirnov Test

library(fGarch)
library(ggplot2)
library(nortest)
library(caret)

predicted <- vector()
actual <- vector()
for (i in 1:10000){
  xi<-runif(1,-3,3)
  y <- lillie.test(rsnorm(500,168,6.4,xi))
  if( xi < -1.5){actual[i]=0}
    else if (xi > 1.5){actual[i]=0}
    else{actual[i]=1}
  if(y$p.value < 0.05){predicted[i]=0}
    else{predicted[i]=1}
}
predicted <- factor(predicted, levels = c(0,1))
actual <- factor(actual, levels = c(0,1))
confusionMatrix(data=predicted, reference = actual)
Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 5055 3627
         1    4 1314
                                          
               Accuracy : 0.6369          
                 95% CI : (0.6274, 0.6463)
    No Information Rate : 0.5059          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.2674          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.9992          
            Specificity : 0.2659          
         Pos Pred Value : 0.5822          
         Neg Pred Value : 0.9970          
             Prevalence : 0.5059          
         Detection Rate : 0.5055          
   Detection Prevalence : 0.8682          
      Balanced Accuracy : 0.6326          
                                          
       'Positive' Class : 0               
                                          

This time the Kolmogorov-Smirnov test outperforms the Shapiro Wilk test. While it does not have 100% specificity and some skewed data is classified as normal, this is a very small number. It has a better specificity and can pick out normally distributed data better and has a better prediction accuracy.

ROC Curve

By running the simulations for different sample sizes including, 8,30,50,100 and 500 you can construct an ROC curve for the data

FPR <- c(0.0895,0.2869,0.438,0.5576,0.7972)
TPR <- c(0.1078, 0.4388,0.6915,0.9385, 1)
data <- data.frame(FPR,TPR)

ggplot(data, aes(x=FPR,y=TPR))+
  geom_point()+
  theme(aspect.ratio=1)+
  coord_fixed(xlim = c(0,1), ylim=c(0,1))+
  labs(title="ROC Curve for the Shapiro Wilks Test")