Regression Diagnostics
Example: Detect and Rectify Non-Normality of Residuals Using the cars Dataset
Sameer Mathur
---
The residual error terms are assumed to be normally distributed, i.e. \( \epsilon \sim N(0, \sigma^2) \).
If the error terms are not normally distributed, confidence intervals may become too wide or too narrow, and hypothesis tests and interval estimates for the least-squares coefficients become unreliable. Non-normality also often signals a few unusual data points, which should be studied closely to build a better model.
# fitting simple linear model
fitCarsModel <- lm(dist ~ speed, data = cars)
# summary of the fitted model
summary(fitCarsModel)
Call:
lm(formula = dist ~ speed, data = cars)
Residuals:
Min 1Q Median 3Q Max
-29.069 -9.525 -2.272 9.215 43.201
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -17.5791 6.7584 -2.601 0.0123 *
speed 3.9324 0.4155 9.464 1.49e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 15.38 on 48 degrees of freedom
Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
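As a complementary check (a minimal sketch, not part of the original output), the residuals can also be extracted from the fitted model and summarized directly:
# extract the residuals and summarize them; the five-number summary
# matches the Residuals block in the model summary above
res <- residuals(fitCarsModel)
summary(res)
hist(res, main = "Histogram of residuals", xlab = "Residuals")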
# residual plots of OLS model
par(mfrow=c(2,2))
plot(fitCarsModel)
The normality assumption can be checked by inspecting the Normal Q-Q plot (the 2nd of the four diagnostic plots). If the residuals are normally distributed, the points in this normal probability plot should fall approximately along a straight reference line.
# normal probability plot of residuals
plot(fitCarsModel, 2)
In our example, not all the points fall along this reference line, so we cannot assume normality.
A normal quantile-quantile (Q-Q) plot can also be drawn directly with the qqnorm() and qqline() functions, or with the qqPlot() function from the car package.
# normality plot using qqnorm() and qqline()
qqnorm(cars$dist)
qqline(cars$dist, col = "red")
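Since the normality assumption concerns the residuals rather than the raw response, the same base R functions can also be applied to the model residuals; a minimal sketch:
# normal Q-Q plot of the model residuals
qqnorm(residuals(fitCarsModel))
qqline(residuals(fitCarsModel), col = "red")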
# normality plot using qqPlot() in car package
library("car")
qqPlot(cars$dist)
[1] 49 48
The values 49 and 48 printed in the output are the row indices of the two most extreme observations, which qqPlot() flags as potential outliers.
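Those row indices can be used to inspect the flagged observations directly; a quick sketch:
# look at the two observations flagged by qqPlot()
cars[c(48, 49), ]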
Visual inspection, described above, is somewhat subjective. A significance test comparing the sample distribution to a normal one can be used to determine whether the data show a serious deviation from normality.
There are several tests for normality, such as:
Anderson-Darling test
Shapiro-Wilk's test
Kolmogorov-Smirnov (K-S) test
The Anderson-Darling test (AD test, for short) is one of the most commonly used normality tests and can be run with the ad.test() function from the nortest package.
# Anderson-Darling normality test
library(nortest)
ad.test(cars$dist)
Anderson-Darling normality test
data: cars$dist
A = 0.74067, p-value = 0.05021
Here the AD p-value of 0.05021 is right at the conventional 0.05 threshold, so the evidence against normality is borderline. The Shapiro-Wilk test is widely recommended for testing normality and provides better power than the K-S test. It is based on the correlation between the data and the corresponding normal scores.
# Shapiro-Wilk's normality test
shapiro.test(cars$dist)
Shapiro-Wilk normality test
data: cars$dist
W = 0.95144, p-value = 0.0391
From the output, the p-value is less than 0.05, implying that the distribution of the data is significantly different from a normal distribution. In other words, we cannot assume normality.
Note:
Normality tests are sensitive to sample size. Small samples most often pass normality tests. Therefore, it is important to combine visual inspection with a significance test in order to make the right decision.
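For completeness, the K-S test listed above can be run with base R's ks.test(). Using the sample mean and standard deviation as the reference parameters makes the test only approximate (the lillie.test() function in nortest implements the Lilliefors correction for this); a sketch:
# Kolmogorov-Smirnov test against a normal distribution
# with parameters estimated from the sample
ks.test(cars$dist, "pnorm", mean(cars$dist), sd(cars$dist))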
One common way to rectify non-normality is to apply a Box-Cox transformation to the outcome variable. The BoxCoxTrans() function in the caret package estimates the transformation parameter lambda from the data.
# estimate a Box-Cox transformation for the response variable
library(caret)
distTrans <- BoxCoxTrans(cars$dist)
distTrans
Box-Cox Transformation
50 data points used to estimate Lambda
Input data summary:
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.00 26.00 36.00 42.98 56.00 120.00
Largest/Smallest: 60
Sample Skewness: 0.759
Estimated Lambda: 0.5
# append the transformed variable to cars
cars <- cbind(cars, distNew = predict(distTrans, cars$dist))
# first few rows of the dataset
head(cars)
speed dist distNew
1 4 2 0.8284271
2 4 10 4.3245553
3 7 4 2.0000000
4 7 22 7.3808315
5 8 16 6.0000000
6 9 10 4.3245553
Here, predict() applies the estimated Box-Cox transformation to the original dist values, producing the new column distNew.
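As a quick check (a sketch added here, not part of the original analysis), with the estimated \( \lambda = 0.5 \) the Box-Cox transform is \( (y^{\lambda} - 1) / \lambda \), which reproduces the values above:
# apply the Box-Cox formula by hand for the first observation
lambda <- 0.5
(cars$dist[1]^lambda - 1) / lambda # 0.8284271, matches distNew above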
The new regression model will be based on the transformed data.
# fitting simple linear model
fitCarsTransModel <- lm(distNew ~ speed, data = cars)
# summary of the fitted model
summary(fitCarsTransModel)
Call:
lm(formula = distNew ~ speed, data = cars)
Residuals:
Min 1Q Median 3Q Max
-4.1369 -1.3966 -0.3598 1.1817 6.3069
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.55410 0.96888 0.572 0.57
speed 0.64483 0.05957 10.825 1.77e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.205 on 48 degrees of freedom
Multiple R-squared: 0.7094, Adjusted R-squared: 0.7034
F-statistic: 117.2 on 1 and 48 DF, p-value: 1.773e-14
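Keep in mind that predictions from fitCarsTransModel are on the Box-Cox scale. A minimal sketch of back-transforming a prediction to the original scale, assuming \( \lambda = 0.5 \) as estimated above:
# predict on the transformed scale, then invert the Box-Cox transform:
# y = (lambda * yTrans + 1)^(1 / lambda)
lambda <- 0.5
predTrans <- predict(fitCarsTransModel, newdata = data.frame(speed = 20))
(lambda * predTrans + 1)^(1 / lambda)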
Let us compare the normal Q-Q plots of the residuals before and after the transformation.
Before Box-Cox transformation
# normal probability plot of residuals
plot(fitCarsModel, 2)
After Box-Cox transformation
# normal probability plot of residuals
plot(fitCarsTransModel, 2)
From visual inspection alone, it is hard to judge whether the normality of the residuals has improved. Hence, we rerun the statistical tests to check for a change in normality after the Box-Cox transformation.
Anderson-Darling normality test
# Anderson-Darling normality test
library(nortest)
ad.test(cars$distNew)
Anderson-Darling normality test
data: cars$distNew
A = 0.1389, p-value = 0.9732
Shapiro-Wilk's method to test normality
# Shapiro-Wilk's normality test
shapiro.test(cars$distNew)
Shapiro-Wilk normality test
data: cars$distNew
W = 0.99347, p-value = 0.9941
From both tests, the p-value is greater than 0.05, which implies that the distribution of the data is not significantly different from a normal distribution.
In other words, we can assume normality after transforming our outcome variable using the Box-Cox transformation.
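Finally, since the normality assumption is stated for the residuals, a complementary check (a sketch, not part of the original analysis) is to run the Shapiro-Wilk test on the residuals of both models:
# Shapiro-Wilk test on the residuals of the original and transformed models
shapiro.test(residuals(fitCarsModel))
shapiro.test(residuals(fitCarsTransModel))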