Testing For Normality in the “Real World” - Dealing with Skew
When I simulated skewed data to see how well normality tests classify data at different sample sizes, I sampled the skew from a uniform distribution between -3 and 3. It may be more realistic to sample it from a normal distribution with a mean of 0 and a standard deviation of 1.5. Approximately 95% of cases will then have a skew between -3 and 3, and 68% will lie between -1.5 and 1.5, the range within which I treat the data as normal rather than sufficiently skewed.
The question is: how does this affect the tests?
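The 95% and 68% figures can be checked directly from the normal distribution function. This is a minimal sketch in base R; the variable names are my own and are not part of the original simulations.

```r
# Assumed setup: the skew is drawn from N(mean = 0, sd = 1.5), as described in the text.
skew_sd <- 1.5

# Probability that the sampled skew falls between -3 and 3 (two standard deviations).
p_within_3 <- pnorm(3, mean = 0, sd = skew_sd) - pnorm(-3, mean = 0, sd = skew_sd)

# Probability that it falls between -1.5 and 1.5 (one standard deviation),
# the band that the simulations label as "normal".
p_normal_band <- pnorm(1.5, mean = 0, sd = skew_sd) - pnorm(-1.5, mean = 0, sd = skew_sd)

round(c(within_3 = p_within_3, normal_band = p_normal_band), 4)
# within_3 = 0.9545, normal_band = 0.6827
```

So roughly 32% of the simulated samples should end up labelled as skewed, which is in line with the prevalence reported in the outputs that follow.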
Small Sample Simulation with Shapiro-Wilk Test
library(fGarch)
library(ggplot2)
library(nortest)
library(caret)
set.seed(1)
predicted <- vector()
actual <- vector()
for (i in 1:10000) {
  xi <- rnorm(1, 0, 1.5)
  y <- shapiro.test(rsnorm(8, mean = 168, sd = 6.4, xi))
  # actual: 0 = skewed (xi outside -1.5 to 1.5), 1 = treated as normal
  if (xi < -1.5) {
    actual[i] <- 0
  } else if (xi > 1.5) {
    actual[i] <- 0
  } else {
    actual[i] <- 1
  }
  # predicted: 0 = normality rejected (p < 0.05), 1 = not rejected
  if (y$p.value < 0.05) {
    predicted[i] <- 0
  } else {
    predicted[i] <- 1
  }
}
predicted <- factor(predicted, levels = c(0, 1))
actual <- factor(actual, levels = c(0, 1))
confusionMatrix(data = predicted, reference = actual)
Confusion Matrix and Statistics
          Reference
Prediction    0    1
         0  364  637
         1 2885 6114
Accuracy : 0.6478
95% CI : (0.6383, 0.6572)
No Information Rate : 0.6751
P-Value [Acc > NIR] : 1
Kappa : 0.0215
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.1120
Specificity : 0.9056
Pos Pred Value : 0.3636
Neg Pred Value : 0.6794
Prevalence : 0.3249
Detection Rate : 0.0364
Detection Prevalence : 0.1001
Balanced Accuracy : 0.5088
'Positive' Class : 0
The sensitivity of the test is still very low, but the specificity is high. The test classifies almost everything as normal and correctly detects non-normal data in only a small number of cases (about 11% of them). This is a very high type II error rate, whereas the high specificity indicates a low type I error rate.
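To make the link between the confusion matrix and these error rates explicit, the headline statistics can be recomputed by hand. This is a sketch in base R using the four counts from the Shapiro-Wilk output above; the variable names are my own, and the type I and type II labels follow the framing used in this post (class 0, the skewed data, is the positive class).

```r
# Counts copied from the Shapiro-Wilk output above.
# Rows = prediction, columns = reference; class 0 = skewed, class 1 = normal.
cm <- matrix(c(364, 2885, 637, 6114), nrow = 2,
             dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))

# Sensitivity: share of truly skewed samples that the test rejected.
sensitivity <- cm["0", "0"] / sum(cm[, "0"])

# Specificity: share of "normal" samples that the test did not reject.
specificity <- cm["1", "1"] / sum(cm[, "1"])

# Overall accuracy, and the no-information rate (always guess the larger class).
accuracy <- sum(diag(cm)) / sum(cm)
no_information_rate <- max(colSums(cm)) / sum(cm)

# Error rates in the framing used in this post.
type_2_error <- 1 - sensitivity  # skewed data accepted as normal
type_1_error <- 1 - specificity  # normal data rejected

round(c(sensitivity = sensitivity, specificity = specificity,
        accuracy = accuracy, nir = no_information_rate,
        balanced = mean(c(sensitivity, specificity))), 4)
```

The accuracy of 0.6478 is below the no-information rate of 0.6751, so simply calling every sample normal would have classified more cases correctly.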
Small Sample Simulation with Anderson Darling
library(fGarch)
library(ggplot2)
library(nortest)
library(caret)
predicted <- vector()
actual <- vector()
for (i in 1:10000) {
  xi <- rnorm(1, 0, 1.5)
  y <- ad.test(rsnorm(8, 168, 6.4, xi))
  # actual: 0 = skewed (xi outside -1.5 to 1.5), 1 = treated as normal
  if (xi < -1.5) {
    actual[i] <- 0
  } else if (xi > 1.5) {
    actual[i] <- 0
  } else {
    actual[i] <- 1
  }
  # predicted: 0 = normality rejected (p < 0.05), 1 = not rejected
  if (y$p.value < 0.05) {
    predicted[i] <- 0
  } else {
    predicted[i] <- 1
  }
}
predicted <- factor(predicted, levels = c(0, 1))
actual <- factor(actual, levels = c(0, 1))
confusionMatrix(data = predicted, reference = actual)
Confusion Matrix and Statistics
          Reference
Prediction    0    1
         0  327  630
         1 2805 6238
Accuracy : 0.6565
95% CI : (0.6471, 0.6658)
No Information Rate : 0.6868
P-Value [Acc > NIR] : 1
Kappa : 0.0156
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.1044
Specificity : 0.9083
Pos Pred Value : 0.3417
Neg Pred Value : 0.6898
Prevalence : 0.3132
Detection Rate : 0.0327
Detection Prevalence : 0.0957
Balanced Accuracy : 0.5063
'Positive' Class : 0
The results for Anderson-Darling are very similar to those for Shapiro-Wilk.
Small Sample Simulation with Lilliefors Test (Kolmogorov-Smirnov)
library(fGarch)
library(ggplot2)
library(nortest)
library(caret)
predicted <- vector()
actual <- vector()
for (i in 1:10000) {
  xi <- rnorm(1, 0, 1.5)
  y <- lillie.test(rsnorm(8, 168, 6.4, xi))
  # actual: 0 = skewed (xi outside -1.5 to 1.5), 1 = treated as normal
  if (xi < -1.5) {
    actual[i] <- 0
  } else if (xi > 1.5) {
    actual[i] <- 0
  } else {
    actual[i] <- 1
  }
  # predicted: 0 = normality rejected (p < 0.05), 1 = not rejected
  if (y$p.value < 0.05) {
    predicted[i] <- 0
  } else {
    predicted[i] <- 1
  }
}
predicted <- factor(predicted, levels = c(0, 1))
actual <- factor(actual, levels = c(0, 1))
confusionMatrix(data = predicted, reference = actual)
Confusion Matrix and Statistics
          Reference
Prediction    0    1
         0  271  515
         1 2863 6351
Accuracy : 0.6622
95% CI : (0.6528, 0.6715)
No Information Rate : 0.6866
P-Value [Acc > NIR] : 1
Kappa : 0.0144
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.08647
Specificity : 0.92499
Pos Pred Value : 0.34478
Neg Pred Value : 0.68928
Prevalence : 0.31340
Detection Rate : 0.02710
Detection Prevalence : 0.07860
Balanced Accuracy : 0.50573
'Positive' Class : 0
For the Lilliefors test (a Kolmogorov-Smirnov test adapted for an estimated mean and variance), the specificity is even higher, but at the cost of sensitivity.
Medium Sample Size Simulation with Shapiro-Wilk Test
library(fGarch)
library(ggplot2)
library(nortest)
library(caret)
set.seed(1)
predicted <- vector()
actual <- vector()
for (i in 1:10000) {
  xi <- rnorm(1, 0, 1.5)
  y <- shapiro.test(rsnorm(30, mean = 168, sd = 6.4, xi))
  # actual: 0 = skewed (xi outside -1.5 to 1.5), 1 = treated as normal
  if (xi < -1.5) {
    actual[i] <- 0
  } else if (xi > 1.5) {
    actual[i] <- 0
  } else {
    actual[i] <- 1
  }
  # predicted: 0 = normality rejected (p < 0.05), 1 = not rejected
  if (y$p.value < 0.05) {
    predicted[i] <- 0
  } else {
    predicted[i] <- 1
  }
}
predicted <- factor(predicted, levels = c(0, 1))
actual <- factor(actual, levels = c(0, 1))
confusionMatrix(data = predicted, reference = actual)
Confusion Matrix and Statistics
          Reference
Prediction    0    1
         0 1336 2113
         1 1836 4715
Accuracy : 0.6051
95% CI : (0.5954, 0.6147)
No Information Rate : 0.6828
P-Value [Acc > NIR] : 1
Kappa : 0.1092
Mcnemar's Test P-Value : 1.123e-05
Sensitivity : 0.4212
Specificity : 0.6905
Pos Pred Value : 0.3874
Neg Pred Value : 0.7197
Prevalence : 0.3172
Detection Rate : 0.1336
Detection Prevalence : 0.3449
Balanced Accuracy : 0.5559
'Positive' Class : 0
Compared with the small-sample Shapiro-Wilk run, there is a considerable improvement in sensitivity with a moderate loss of specificity. The prediction accuracy is still only about 61% (balanced accuracy about 56%), which is not very good.
Medium Sample Size Simulation with Kolmogorov-Smirnov Test
library(fGarch)
library(ggplot2)
library(nortest)
library(caret)
predicted <- vector()
actual <- vector()
for (i in 1:10000) {
  xi <- rnorm(1, 0, 1.5)
  y <- lillie.test(rsnorm(30, 168, 6.4, xi))
  # actual: 0 = skewed (xi outside -1.5 to 1.5), 1 = treated as normal
  if (xi < -1.5) {
    actual[i] <- 0
  } else if (xi > 1.5) {
    actual[i] <- 0
  } else {
    actual[i] <- 1
  }
  # predicted: 0 = normality rejected (p < 0.05), 1 = not rejected
  if (y$p.value < 0.05) {
    predicted[i] <- 0
  } else {
    predicted[i] <- 1
  }
}
predicted <- factor(predicted, levels = c(0, 1))
actual <- factor(actual, levels = c(0, 1))
confusionMatrix(data = predicted, reference = actual)
Confusion Matrix and Statistics
          Reference
Prediction    0    1
         0  764 1231
         1 2443 5562
Accuracy : 0.6326
95% CI : (0.6231, 0.6421)
No Information Rate : 0.6793
P-Value [Acc > NIR] : 1
Kappa : 0.0633
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.2382
Specificity : 0.8188
Pos Pred Value : 0.3830
Neg Pred Value : 0.6948
Prevalence : 0.3207
Detection Rate : 0.0764
Detection Prevalence : 0.1995
Balanced Accuracy : 0.5285
'Positive' Class : 0
This is worse than the Shapiro-Wilk result: the increase in sensitivity is much smaller, although the loss of specificity is also smaller.
Large Sample Size Simulation with Shapiro-Wilk Test
library(fGarch)
library(ggplot2)
library(nortest)
library(caret)
set.seed(1)
predicted <- vector()
actual <- vector()
for (i in 1:10000) {
  xi <- rnorm(1, 0, 1.5)
  y <- shapiro.test(rsnorm(500, mean = 168, sd = 6.4, xi))
  # actual: 0 = skewed (xi outside -1.5 to 1.5), 1 = treated as normal
  if (xi < -1.5) {
    actual[i] <- 0
  } else if (xi > 1.5) {
    actual[i] <- 0
  } else {
    actual[i] <- 1
  }
  # predicted: 0 = normality rejected (p < 0.05), 1 = not rejected
  if (y$p.value < 0.05) {
    predicted[i] <- 0
  } else {
    predicted[i] <- 1
  }
}
predicted <- factor(predicted, levels = c(0, 1))
actual <- factor(actual, levels = c(0, 1))
confusionMatrix(data = predicted, reference = actual)
Confusion Matrix and Statistics
          Reference
Prediction    0    1
         0 3111 5597
         1    0 1292
Accuracy : 0.4403
95% CI : (0.4305, 0.4501)
No Information Rate : 0.6889
P-Value [Acc > NIR] : 1
Kappa : 0.1256
Mcnemar's Test P-Value : <2e-16
Sensitivity : 1.0000
Specificity : 0.1875
Pos Pred Value : 0.3573
Neg Pred Value : 1.0000
Prevalence : 0.3111
Detection Rate : 0.3111
Detection Prevalence : 0.8708
Balanced Accuracy : 0.5938
'Positive' Class : 0
The sensitivity is now perfect. The type II error rate is 0, as the test identifies all of the skewed data as not normal. However, this comes at the cost of a large fall in specificity: a lot of normal data is now also rejected and classified as not normal.
Test accuracy remains low, around 65% in the best case.
Large Sample Size Simulation with Kolmogorov-Smirnov Test
library(fGarch)
library(ggplot2)
library(nortest)
library(caret)
predicted <- vector()
actual <- vector()
for (i in 1:10000) {
  xi <- rnorm(1, 0, 1.5)
  y <- lillie.test(rsnorm(500, 168, 6.4, xi))
  # actual: 0 = skewed (xi outside -1.5 to 1.5), 1 = treated as normal
  if (xi < -1.5) {
    actual[i] <- 0
  } else if (xi > 1.5) {
    actual[i] <- 0
  } else {
    actual[i] <- 1
  }
  # predicted: 0 = normality rejected (p < 0.05), 1 = not rejected
  if (y$p.value < 0.05) {
    predicted[i] <- 0
  } else {
    predicted[i] <- 1
  }
}
predicted <- factor(predicted, levels = c(0, 1))
actual <- factor(actual, levels = c(0, 1))
confusionMatrix(data = predicted, reference = actual)
Confusion Matrix and Statistics
          Reference
Prediction    0    1
         0 3184 5196
         1    3 1617
Accuracy : 0.4801
95% CI : (0.4703, 0.4899)
No Information Rate : 0.6813
P-Value [Acc > NIR] : 1
Kappa : 0.1649
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.9991
Specificity : 0.2373
Pos Pred Value : 0.3800
Neg Pred Value : 0.9981
Prevalence : 0.3187
Detection Rate : 0.3184
Detection Prevalence : 0.8380
Balanced Accuracy : 0.6182
'Positive' Class : 0
Again the Kolmogorov-Smirnov test outperforms the Shapiro-Wilk test. It does not have 100% sensitivity, so some skewed data is classified as normal, but this is a very small number (3 of 3,187 cases). It has better specificity, so it picks out normally distributed data better, and it has better prediction accuracy.
Conclusion
There is very little difference between the results for the normally distributed and the uniformly distributed skew. If anything, for large samples the normally distributed skew gives slightly worse specificity. This still raises questions about using tests for normality as a binary yes/no check before choosing between parametric and non-parametric methods for the subsequent analysis. Testing for normality should be approached with caution.
One way of resolving this issue for skewed data is to determine the effect of skew on the subsequent parametric NHST. How much does skew affect t-tests and ANOVA? There is some existing literature on these effects, but it is largely uncited because we have become stuck with the normal/non-normal dichotomy rather than the reality that normality can be a continuum in terms of skew.
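One way to start exploring that question is to simulate the type I error rate of a t-test when the null hypothesis is true but the data are skewed. The sketch below uses base R only and is not taken from the simulations above: the helper type1_rate() is hypothetical, the exponential distribution (skewness 2) is a stand-in I have assumed for "skewed data" in place of rsnorm(), and the sample size of 30 and 2000 repetitions are arbitrary choices.

```r
set.seed(1)  # arbitrary seed, only so the sketch is repeatable

# Hypothetical helper: share of one-sample t-tests that reject a TRUE null
# hypothesis at the 5% level. 'draw' is a function that returns n random values
# from a distribution whose true mean is 'true_mean'.
type1_rate <- function(draw, true_mean, n, reps = 2000) {
  rejected <- replicate(reps, t.test(draw(n), mu = true_mean)$p.value < 0.05)
  mean(rejected)
}

# Symmetric reference case: standard normal data, true mean 0.
rate_normal <- type1_rate(function(n) rnorm(n), true_mean = 0, n = 30)

# Skewed stand-in (an assumption): exponential with rate 1, true mean 1.
rate_skewed <- type1_rate(function(n) rexp(n, rate = 1), true_mean = 1, n = 30)

c(normal = rate_normal, skewed = rate_skewed)
```

If the skewed rate stays close to the nominal 0.05, the t-test is tolerating that amount of skew at that sample size; repeating the run over a grid of skew values and sample sizes would treat skew as a continuum rather than a yes/no label.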