The normality tests for residuals below relate to Prof. Brian Caffo's textbook "Regression Models for Data Science in R", Chapter 11, Exercise #3.
The exercise asks to "perform an analysis of diagnostic measures including dffits, dfbetas, influence and hat diagonals" for the linear model of driver deaths with kms, PetrolPrice and law as predictors.
The discussion for Exercise #3 is posted at https://youtu.be/XEqlmqFTVOI
As in the previous exercises, the variables kms and PetrolPrice are rescaled to more meaningful units and centered at their means.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(datasets)
data("Seatbelts")
seatbelts <- as.data.frame(Seatbelts)
seatbelts <- mutate(seatbelts,
                    pp  = (PetrolPrice - mean(PetrolPrice)) / sd(PetrolPrice),  # standardized petrol price
                    mm  = kms / 1000,                                           # distance driven, in thousands of km
                    mmc = mm - mean(mm))                                        # centered distance
fit <- lm(DriversKilled ~ pp + mmc + law, data = seatbelts)
round(summary(fit)$coef, 4)
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  124.2263     1.8012 68.9674   0.0000
## pp            -6.9199     1.8514 -3.7377   0.0002
## mmc           -1.2233     0.6657 -1.8378   0.0677
## law          -11.8892     6.0258 -1.9731   0.0500
The first step is to plot the standard diagnostics for the linear model.
par(mfrow = c(2, 2))
plot(fit)
par(mfrow = c(1, 1))
From the video discussion (starting at 1:42), Prof. Caffo discusses the Q-Q plot (top-right panel), which compares the standardized residuals to a reference line. If the standardized residuals align with the line, the presumption is that the residuals (estimators of the errors) are normally distributed.
Since the Q-Q plot shows some deviations from the line in the lower tail (theoretical quantiles <= -1), let's perform traditional normality tests (Shapiro-Wilk test, Kolmogorov-Smirnov (K-S) test, Anderson-Darling test) and examine the p-values (null hypothesis: the residuals are normally distributed).
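For reference, the same comparison can be drawn directly from the standardized residuals with base R; a minimal sketch (rstandard() returns the standardized residuals of the fit):
# Sketch: Q-Q plot of the standardized residuals (same information as the top-right panel)
qqnorm(rstandard(fit), main = "Standardized Residuals Q-Q Plot")
qqline(rstandard(fit), col = "red")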
e <- resid(fit)
shapiro.test(e)
##
## Shapiro-Wilk normality test
##
## data: e
## W = 0.97413, p-value = 0.001267
The p-value from the Shapiro-Wilk test is 0.001267, which is less than alpha = 0.05, so the null hypothesis is rejected in favor of the alternative hypothesis (the residuals are not normally distributed).
res_mean <- mean(resid(fit))
res_sd <- sd(resid(fit))
ks.test(e, "pnorm", res_mean, res_sd, alternative = "greater")
##
## One-sample Kolmogorov-Smirnov test
##
## data: e
## D^+ = 0.091139, p-value = 0.04119
## alternative hypothesis: the CDF of x lies above the null hypothesis
The p-value from the one-sided Kolmogorov-Smirnov (K-S) test is 0.04119, which is less than alpha = 0.05, so the null hypothesis is rejected in favor of the alternative hypothesis (the residuals are not normally distributed).
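One caveat: the K-S test above compares the residuals to a normal distribution whose mean and standard deviation were estimated from those same residuals, which biases the classic K-S p-value. As a hedge, the Lilliefors-corrected variant of the test, available as lillie.test() in the nortest package, could also be run (sketch only; output not shown):
library(nortest)   # also provides ad.test(), used below
lillie.test(e)     # Lilliefors correction of the K-S test for estimated parameters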
library(nortest)
ad.test(e)
##
## Anderson-Darling normality test
##
## data: e
## A = 1.7327, p-value = 0.0001876
The p-value from the Anderson-Darling test is 0.0001876, which is less than alpha = 0.05, so the null hypothesis is rejected in favor of the alternative hypothesis (the residuals are not normally distributed).
All three tests (Shapiro-Wilk, K-S, Anderson-Darling) return p-values below alpha, which indicates that the null hypothesis (residuals are normally distributed) must be rejected in favor of the alternative.
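For a quick side-by-side view, the three p-values can be collected into one named vector; a sketch assuming e, res_mean, res_sd and the nortest package are already in the workspace:
# Sketch: gather the normality-test p-values in a single named vector
c(shapiro = shapiro.test(e)$p.value,
  ks      = ks.test(e, "pnorm", res_mean, res_sd, alternative = "greater")$p.value,
  ad      = ad.test(e)$p.value)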
hist(e, prob = TRUE, main = "Residuals Histogram and Normal Curve Fit")
curve(dnorm(x, res_mean, res_sd), col = "red", add = TRUE)
The histogram shows positive skewness (a longer right tail). This asymmetry could explain the residual deviations in the lower tail of the Q-Q plot.
library(psych)
describe(e)
##    vars   n mean    sd median trimmed  mad    min   max range skew
## X1    1 192    0 22.69  -4.05   -1.32 23.7 -50.69 60.71 111.4 0.46
##    kurtosis   se
## X1    -0.34 1.64
The table above shows a skewness of 0.46, in agreement with the histogram.
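As a closing note, the diagnostic measures named in the exercise statement (dffits, dfbetas, influence and hat diagonals) are all available through base R extractor functions; a minimal sketch of how they could be pulled from the same fit:
# Sketch: influence diagnostics requested by the exercise
head(dffits(fit))                  # change in each fitted value when that observation is dropped
head(dfbetas(fit))                 # change in each coefficient when that observation is dropped
head(hatvalues(fit))               # hat-matrix diagonals (leverage)
summary(influence.measures(fit))   # flags potentially influential observations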