The normality tests for residuals below relate to Prof. Brian Caffo's textbook "Regression Models for Data Science in R", Chapter 11, Exercise #3.
The exercise asks to "perform an analysis of diagnostic measures including dffits, dfbetas, influence and hat diagonals" for the linear model of driver deaths with kms, PetrolPrice and law as predictors.
The discussion for Exercise #3 is posted at https://youtu.be/XEqlmqFTVOI
As in the previous exercises, the variables kms and PetrolPrice are rescaled to more meaningful units and centered at their means.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(datasets)
data("Seatbelts")
seatbelts <- as.data.frame(Seatbelts)
seatbelts <- mutate(seatbelts,
                    pp  = (PetrolPrice - mean(PetrolPrice)) / sd(PetrolPrice),  # standardized petrol price
                    mm  = kms / 1000,                                           # distance driven, in thousands of km
                    mmc = mm - mean(mm))                                        # centered distance
fit <- lm(DriversKilled ~ pp + mmc + law, data = seatbelts)
round(summary(fit)$coef, 4)
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  124.2263     1.8012 68.9674   0.0000
## pp            -6.9199     1.8514 -3.7377   0.0002
## mmc           -1.2233     0.6657 -1.8378   0.0677
## law          -11.8892     6.0258 -1.9731   0.0500
The first step is to plot the standard diagnostics for the linear model.
par(mfrow = c(2, 2))
plot(fit)
par(mfrow = c(1, 1))
From the video discussion (starting at 1:42), Prof. Caffo discusses the Q-Q plot (top-right panel), which compares the standardized residuals to a reference line. If the standardized residuals align with the line, the presumption is that the residuals (estimators of the errors) are normally distributed.
Since the Q-Q plot shows some deviations from the line in the lower tail (theoretical quantiles <= -1), let's perform traditional normality tests (Shapiro-Wilk test, Kolmogorov-Smirnov (K-S) test, Anderson-Darling test) and examine the p-values (null hypothesis: the residuals are normally distributed).
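For reference, the same comparison can be drawn directly from the standardized residuals with base R; a minimal sketch (rstandard() returns the standardized residuals of the fit):
# Sketch: Q-Q plot of the standardized residuals (same information as the top-right panel)
qqnorm(rstandard(fit), main = "Standardized Residuals Q-Q Plot")
qqline(rstandard(fit), col = "red")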
e <- resid(fit)
shapiro.test(e)
##
## Shapiro-Wilk normality test
##
## data: e
## W = 0.97413, p-value = 0.001267
The p-value from the Shapiro-Wilk test is 0.001267, which is less than alpha = 0.05, so the null hypothesis is rejected in favor of the alternative hypothesis (the residuals are not normally distributed).
res_mean <- mean(resid(fit))
res_sd <- sd(resid(fit))
ks.test(e, "pnorm", res_mean, res_sd, alternative = "greater")
##
## One-sample Kolmogorov-Smirnov test
##
## data: e
## D^+ = 0.091139, p-value = 0.04119
## alternative hypothesis: the CDF of x lies above the null hypothesis
The p-value from the one-sided Kolmogorov-Smirnov (K-S) test is 0.04119, which is less than alpha = 0.05, so the null hypothesis is rejected in favor of the alternative hypothesis (the residuals are not normally distributed).
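One caveat: the K-S test above compares the residuals to a normal distribution whose mean and standard deviation were estimated from those same residuals, which biases the classic K-S p-value. As a hedge, the Lilliefors-corrected variant of the test, available as lillie.test() in the nortest package, could also be run (sketch only; output not shown):
library(nortest)   # also provides ad.test(), used below
lillie.test(e)     # Lilliefors correction of the K-S test for estimated parameters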
library(nortest)
ad.test(e)
##
## Anderson-Darling normality test
##
## data: e
## A = 1.7327, p-value = 0.0001876
The p-value from the Anderson-Darling test is 0.0001876, which is less than alpha = 0.05, so the null hypothesis is rejected in favor of the alternative hypothesis (the residuals are not normally distributed).
All three tests (Shapiro-Wilk, K-S, Anderson-Darling) return p-values below alpha, which indicates that the null hypothesis (residuals are normally distributed) must be rejected in favor of the alternative.
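For a quick side-by-side view, the three p-values can be collected into one named vector; a sketch assuming e, res_mean, res_sd and the nortest package are already in the workspace:
# Sketch: gather the normality-test p-values in a single named vector
c(shapiro = shapiro.test(e)$p.value,
  ks      = ks.test(e, "pnorm", res_mean, res_sd, alternative = "greater")$p.value,
  ad      = ad.test(e)$p.value)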
hist(e, prob = TRUE, main = "Residuals Histogram and Normal Curve Fit")
curve(dnorm(x, res_mean, res_sd), col = "red", add = TRUE)
The histogram shows positive skewness (a longer right tail). This asymmetry could explain the residual deviations in the lower tail of the Q-Q plot.
library(psych)
describe(e)
##    vars   n mean    sd median trimmed  mad    min   max range skew
## X1    1 192    0 22.69  -4.05   -1.32 23.7 -50.69 60.71 111.4 0.46
##    kurtosis   se
## X1    -0.34 1.64
The table above shows a skewness of 0.46, in agreement with the histogram.
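As a closing note, the diagnostic measures named in the exercise statement (dffits, dfbetas, influence and hat diagonals) are all available through base R extractor functions; a minimal sketch of how they could be pulled from the same fit:
# Sketch: influence diagnostics requested by the exercise
head(dffits(fit))                  # change in each fitted value when that observation is dropped
head(dfbetas(fit))                 # change in each coefficient when that observation is dropped
head(hatvalues(fit))               # hat-matrix diagonals (leverage)
summary(influence.measures(fit))   # flags potentially influential observations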