Previously I wrote a page about normality testing where I had data that was either normally distributed or that was skewed and I ran tests for normality on them to see if the tests worked in all of the cases.
In the real world we have a model/hypothesis and we cannot know if the data we collect will either agree with the hypothesis or be something completely different. In the case of normality we need to assess how well the test works against the distribution of all of the possible data distributions that we might see. My data might be normal or it might be some other distribution or it might be skewed.
The easiest of these situations to test is the effect of skew. If I say that if the data has a skew of over 1.5 it should not be be considered as normally distributed how well do the tests for normality work in discriminating these differences?
If the skewness is sampled from a uniform distribution from -3 to 3 then 50% of the cases will have values less than -1.5 or more than 1.5 and we want those to be rated as skewed and not normal. I can then apply the normality tests to these data simulations to see how well they perform.
This is doing something that will make hardcore statisticians roll their eyes and might even cause them to have raised blood pressure, because they work on falsification. If the test for normality does not reject the null hypothesis that the data is not normal, you are not proving that the data is normal, just that given that set of evidence you cannot show that it isn’t normal. But pragmatically that is exactly what we assume when there is no significant deviation from normality in the tests. This is why we use parametric methods for most analysis.
I am going to abuse the NHST framework in these simulations but from them I can then construct a truth table where 0 means it is not normal and 1 means it is normal. Then I can compare this to the statistical test where if the p-value is less than 0.05 we reject the null hypothesis of normality (0) otherwise we cannot reject the null hypothesis and we consider it normal (1).
Small Sample Simulation with Shapiro-Wilk Test
library(fGarch)library(ggplot2)library(nortest)library(caret)set.seed(1)predicted <-vector()actual <-vector()for (i in1:10000){ xi<-runif(1,-3,3) y <-shapiro.test(rsnorm(8,mean=168,sd=6.4, xi))if( xi <-1.5){actual[i]=0}elseif (xi >1.5){actual[i]=0}else{actual[i]=1}if(y$p.value <0.05){predicted[i]=0}else{predicted[i]=1}}predicted <-factor(predicted, levels =c(0,1))actual <-factor(actual, levels =c(0,1))confusionMatrix(data=predicted, reference = actual)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 536 450
1 4437 4577
Accuracy : 0.5113
95% CI : (0.5015, 0.5211)
No Information Rate : 0.5027
P-Value [Acc > NIR] : 0.04362
Kappa : 0.0183
Mcnemar's Test P-Value : < 2e-16
Sensitivity : 0.1078
Specificity : 0.9105
Pos Pred Value : 0.5436
Neg Pred Value : 0.5078
Prevalence : 0.4973
Detection Rate : 0.0536
Detection Prevalence : 0.0986
Balanced Accuracy : 0.5091
'Positive' Class : 0
The sensitivity of the test is very low but the specificity is high. The test classifies almost everything as normal and it is only in a small number of cases that it correctly detects that the data is not normal. This shows a very high type II error rate. Whereas the high specificity indicates a lower type I error rate.
Small Sample Simulation with Anderson Darling
library(fGarch)library(ggplot2)library(nortest)library(caret)predicted <-vector()actual <-vector()for (i in1:10000){ xi<-runif(1,-3,3) y <-ad.test(rsnorm(8,168,6.4,xi))if( xi <-1.5){actual[i]=0}elseif (xi >1.5){actual[i]=0}else{actual[i]=1}if(y$p.value <0.05){predicted[i]=0}else{predicted[i]=1}}predicted <-factor(predicted, levels =c(0,1))actual <-factor(actual, levels =c(0,1))confusionMatrix(data=predicted, reference = actual)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 542 397
1 4549 4512
Accuracy : 0.5054
95% CI : (0.4956, 0.5152)
No Information Rate : 0.5091
P-Value [Acc > NIR] : 0.7734
Kappa : 0.0252
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.1065
Specificity : 0.9191
Pos Pred Value : 0.5772
Neg Pred Value : 0.4980
Prevalence : 0.5091
Detection Rate : 0.0542
Detection Prevalence : 0.0939
Balanced Accuracy : 0.5128
'Positive' Class : 0
The results for Anderson-Darling are very similar to those from Shapiro-Wilks
Small Sample Simulation with Lilliefors Test (Kolmogorov-Smirnov)
library(fGarch)library(ggplot2)library(nortest)library(caret)predicted <-vector()actual <-vector()for (i in1:10000){ xi<-runif(1,-3,3) y <-lillie.test(rsnorm(8,168,6.4,xi))if( xi <-1.5){actual[i]=0}elseif (xi >1.5){actual[i]=0}else{actual[i]=1}if(y$p.value <0.05){predicted[i]=0}else{predicted[i]=1}}predicted <-factor(predicted, levels =c(0,1))actual <-factor(actual, levels =c(0,1))confusionMatrix(data=predicted, reference = actual)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 428 352
1 4506 4714
Accuracy : 0.5142
95% CI : (0.5044, 0.524)
No Information Rate : 0.5066
P-Value [Acc > NIR] : 0.0655
Kappa : 0.0175
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.08675
Specificity : 0.93052
Pos Pred Value : 0.54872
Neg Pred Value : 0.51128
Prevalence : 0.49340
Detection Rate : 0.04280
Detection Prevalence : 0.07800
Balanced Accuracy : 0.50863
'Positive' Class : 0
For the Lilliefors test (equivalent to the Kolmogorov-Smirmov test) the specificity is even higher but at the cost of sensitivity.
Medium Sample Size Simulation with Shapiro Wilk Test
library(fGarch)library(ggplot2)library(nortest)library(caret)set.seed(1)predicted <-vector()actual <-vector()for (i in1:10000){ xi<-runif(1,-3,3) y <-shapiro.test(rsnorm(30,mean=168,sd=6.4, xi))if( xi <-1.5){actual[i]=0}elseif (xi >1.5){actual[i]=0}else{actual[i]=1}if(y$p.value <0.05){predicted[i]=0}else{predicted[i]=1}}predicted <-factor(predicted, levels =c(0,1))actual <-factor(actual, levels =c(0,1))confusionMatrix(data=predicted, reference = actual)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 2204 1428
1 2819 3549
Accuracy : 0.5753
95% CI : (0.5655, 0.585)
No Information Rate : 0.5023
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.1517
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.4388
Specificity : 0.7131
Pos Pred Value : 0.6068
Neg Pred Value : 0.5573
Prevalence : 0.5023
Detection Rate : 0.2204
Detection Prevalence : 0.3632
Balanced Accuracy : 0.5759
'Positive' Class : 0
There is a considerable improvement in sensitivity with a moderate loss of specificity. The prediction accuracy is still only about 57% which is not very good.
Medium Sample Size Simulation with Kolmogorov-Smirnov Test
library(fGarch)library(ggplot2)library(nortest)library(caret)predicted <-vector()actual <-vector()for (i in1:10000){ xi<-runif(1,-3,3) y <-lillie.test(rsnorm(30,168,6.4,xi))if( xi <-1.5){actual[i]=0}elseif (xi >1.5){actual[i]=0}else{actual[i]=1}if(y$p.value <0.05){predicted[i]=0}else{predicted[i]=1}}predicted <-factor(predicted, levels =c(0,1))actual <-factor(actual, levels =c(0,1))confusionMatrix(data=predicted, reference = actual)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 1220 817
1 3824 4139
Accuracy : 0.5359
95% CI : (0.5261, 0.5457)
No Information Rate : 0.5044
P-Value [Acc > NIR] : 1.553e-10
Kappa : 0.0766
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.2419
Specificity : 0.8351
Pos Pred Value : 0.5989
Neg Pred Value : 0.5198
Prevalence : 0.5044
Detection Rate : 0.1220
Detection Prevalence : 0.2037
Balanced Accuracy : 0.5385
'Positive' Class : 0
This is worse than the Shapiro-Wilks results with a much smaller increase in sensitivity but also a smaller loss of specificity.
Large Sample Size Simulation with Shapiro Wilk Test
library(fGarch)library(ggplot2)library(nortest)library(caret)set.seed(1)predicted <-vector()actual <-vector()for (i in1:10000){ xi<-runif(1,-3,3) y <-shapiro.test(rsnorm(500,mean=168,sd=6.4, xi))if( xi <-1.5){actual[i]=0}elseif (xi >1.5){actual[i]=0}else{actual[i]=1}if(y$p.value <0.05){predicted[i]=0}else{predicted[i]=1}}predicted <-factor(predicted, levels =c(0,1))actual <-factor(actual, levels =c(0,1))confusionMatrix(data=predicted, reference = actual)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 4940 4034
1 0 1026
Accuracy : 0.5966
95% CI : (0.5869, 0.6062)
No Information Rate : 0.506
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.2008
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 1.0000
Specificity : 0.2028
Pos Pred Value : 0.5505
Neg Pred Value : 1.0000
Prevalence : 0.4940
Detection Rate : 0.4940
Detection Prevalence : 0.8974
Balanced Accuracy : 0.6014
'Positive' Class : 0
The sensitivity has improved to perfection. There is a 0 type II error rate as the test can now identify that all skewed data is not normal. However this comes at a cost of a large fall in specificity in that now lots of normal data is also being rejected and classified as not normal.
Test accuracy still remains below 60%.
Large Sample Size Simulation with Kolmogorov-Smirnov Test
library(fGarch)library(ggplot2)library(nortest)library(caret)predicted <-vector()actual <-vector()for (i in1:10000){ xi<-runif(1,-3,3) y <-lillie.test(rsnorm(500,168,6.4,xi))if( xi <-1.5){actual[i]=0}elseif (xi >1.5){actual[i]=0}else{actual[i]=1}if(y$p.value <0.05){predicted[i]=0}else{predicted[i]=1}}predicted <-factor(predicted, levels =c(0,1))actual <-factor(actual, levels =c(0,1))confusionMatrix(data=predicted, reference = actual)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 5055 3627
1 4 1314
Accuracy : 0.6369
95% CI : (0.6274, 0.6463)
No Information Rate : 0.5059
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.2674
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.9992
Specificity : 0.2659
Pos Pred Value : 0.5822
Neg Pred Value : 0.9970
Prevalence : 0.5059
Detection Rate : 0.5055
Detection Prevalence : 0.8682
Balanced Accuracy : 0.6326
'Positive' Class : 0
This time the Kolmogorov-Smirnov test outperforms the Shapiro Wilk test. While it does not have 100% specificity and some skewed data is classified as normal, this is a very small number. It has a better specificity and can pick out normally distributed data better and has a better prediction accuracy.
ROC Curve
By running the simulations for different sample sizes including, 8,30,50,100 and 500 you can construct an ROC curve for the data
FPR <-c(0.0895,0.2869,0.438,0.5576,0.7972)TPR <-c(0.1078, 0.4388,0.6915,0.9385, 1)data <-data.frame(FPR,TPR)ggplot(data, aes(x=FPR,y=TPR))+geom_point()+theme(aspect.ratio=1)+coord_fixed(xlim =c(0,1), ylim=c(0,1))+labs(title="ROC Curve for the Shapiro Wilks Test")