Sameer Mathur
Example: Detect and Rectify Non-normality of Residuals using mtcars dataset
Regression Diagnostics
---
The residual error terms are assumed to be normally distributed with mean zero and constant variance, i.e. \( \epsilon \sim N(0, \sigma^2) \).
If the error terms are not normally distributed, confidence intervals may become too wide or too narrow. The least-squares coefficient estimates can still be computed, but inference based on them (t-tests and confidence intervals) becomes unreliable, particularly in small samples. Non-normality often indicates that there are a few unusual data points, which should be studied closely to build a better model.
Motor Trend Car Road Tests
The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973-74 models).
Data Description
Variable  Description
mpg       Miles/(US) gallon
cyl       Number of cylinders
disp      Displacement (cu.in.)
hp        Gross horsepower
drat      Rear axle ratio
wt        Weight (1000 lbs)
qsec      ¼ mile time
vs        Engine (0 = V-shaped, 1 = straight)
am        Transmission (0 = automatic, 1 = manual)
gear      Number of forward gears
carb      Number of carburetors
# importing data
data(mtcars)
# attaching data columns
attach(mtcars)
# data rows and columns
dim(mtcars)
[1] 32 11
# first few rows
head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
# descriptive statistics
library(psych)
describe(mtcars)[, c(1:5, 8:9)]
     vars  n   mean     sd median   min    max
mpg     1 32  20.09   6.03  19.20 10.40  33.90
cyl     2 32   6.19   1.79   6.00  4.00   8.00
disp    3 32 230.72 123.94 196.30 71.10 472.00
hp      4 32 146.69  68.56 123.00 52.00 335.00
drat    5 32   3.60   0.53   3.70  2.76   4.93
wt      6 32   3.22   0.98   3.33  1.51   5.42
qsec    7 32  17.85   1.79  17.71 14.50  22.90
vs      8 32   0.44   0.50   0.00  0.00   1.00
am      9 32   0.41   0.50   0.00  0.00   1.00
gear   10 32   3.69   0.74   4.00  3.00   5.00
carb   11 32   2.81   1.62   2.00  1.00   8.00
# fitting a multiple linear regression model
fitmtcarsModel <- lm(mpg ~ am + wt + hp + disp + cyl, data = mtcars)
# summary of the fitted model
summary(fitmtcarsModel)
Call:
lm(formula = mpg ~ am + wt + hp + disp + cyl, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-3.5952 -1.5864 -0.7157 1.2821 5.5725
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 38.20280    3.66910  10.412 9.08e-11 ***
am           1.55649    1.44054   1.080  0.28984
wt          -3.30262    1.13364  -2.913  0.00726 **
hp          -0.02796    0.01392  -2.008  0.05510 .
disp         0.01226    0.01171   1.047  0.30472
cyl         -1.10638    0.67636  -1.636  0.11393
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.505 on 26 degrees of freedom
Multiple R-squared: 0.8551, Adjusted R-squared: 0.8273
F-statistic: 30.7 on 5 and 26 DF, p-value: 4.029e-10
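The Q-Q plot of residuals can be used to visually check the normality assumption. If the residuals are normally distributed, the points in the normal probability plot should fall approximately along a straight line. Calling plot() on the fitted model with 2 as the second argument draws this plot (the second of the standard diagnostic plots).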
# normal probability plot of residuals
plot(fitmtcarsModel, 2)
In our example, not all of the points fall close to this reference line, so the normality assumption is questionable.
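The same check can be built by hand, which makes it easier to see which observations drive the departure from the line. The following is a minimal sketch, assuming the model formula used earlier; the names `fit` and `std_res` are illustrative and not part of the original analysis.

```r
# Refit the model from the text so that this block runs on its own
fit <- lm(mpg ~ am + wt + hp + disp + cyl, data = mtcars)

# Standardized residuals are on roughly the N(0, 1) scale
std_res <- rstandard(fit)

# Q-Q plot of the standardized residuals with a reference line
qqnorm(std_res, main = "Normal Q-Q plot of standardized residuals")
qqline(std_res, col = "red")

# Cars with the three largest absolute standardized residuals
names(sort(abs(std_res), decreasing = TRUE))[1:3]
```

Points far from the red line in either tail are the observations worth inspecting first.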
The two plots below examine the response variable mpg itself rather than the model residuals.
Normality plot (Q-Q plot) using the qqnorm() and qqline() functions.
# normality plot using qqnorm() and qqline()
qqnorm(mtcars$mpg)
qqline(mtcars$mpg, col = "red")
Normality plot using the qqPlot() function in the car package.
# normality plot using qqPlot() in car package
library("car")
qqPlot(mtcars$mpg)
[1] 20 18
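The values 20 and 18 in the output are the row indices of the two most extreme observations, which qqPlot() labels on the plot. They are candidate outliers and should be examined more closely.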
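The indices returned by qqPlot() can be mapped back to the cars they refer to. This is a small sketch; the vector name `flagged` is hypothetical and simply holds the two indices printed above.

```r
# Row indices reported by qqPlot() in the output above
flagged <- c(20, 18)

# Car names for those rows
rownames(mtcars)[flagged]
# "Toyota Corolla" "Fiat 128"

# A few columns for the flagged cars, to see why their mpg is extreme
mtcars[flagged, c("mpg", "cyl", "wt", "hp")]
```

Both cars are light four-cylinder models with the highest fuel economy in the dataset, so they are extreme values rather than obvious data errors.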
Visual inspection, described in the previous section, can be subjective. A significance test that compares the sample distribution with a normal one can be used to assess whether the data show a serious deviation from normality.
Several normality tests are available, such as:
Anderson-Darling test
Shapiro-Wilk's test
Kolmogorov-Smirnov (K-S) test
The Anderson-Darling test (AD test, for short) is one of the most commonly used normality tests and can be run with the ad.test() function in the nortest package.
# Anderson-Darling normality test
library(nortest)
ad.test(mtcars$mpg)
Anderson-Darling normality test
data: mtcars$mpg
A = 0.57968, p-value = 0.1207
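The Kolmogorov-Smirnov test from the list above can be run in base R with ks.test(). The sketch below assumes the normal mean and standard deviation are estimated from the sample, which makes the standard K-S p-value only approximate; the Lilliefors variant in nortest corrects for this and is called only if that package is installed. The variable names are illustrative.

```r
x <- mtcars$mpg

# One-sample K-S test against a normal distribution with estimated parameters.
# mpg contains tied values, so R may warn that ties should not be present.
ks_result <- ks.test(x, "pnorm", mean = mean(x), sd = sd(x))
ks_result

# Lilliefors-corrected K-S test, run only when nortest is available
if (requireNamespace("nortest", quietly = TRUE)) {
  print(nortest::lillie.test(x))
}
```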
The Shapiro-Wilk test is widely recommended for testing normality and provides better power than the K-S test. It is based on the correlation between the data and the corresponding normal scores.
# Shapiro-Wilk's normality test
shapiro.test(mtcars$mpg)
Shapiro-Wilk normality test
data: mtcars$mpg
W = 0.94756, p-value = 0.1229
From the output, the p-value is greater than 0.05, so we fail to reject the null hypothesis of normality: the distribution of the data does not differ significantly from a normal distribution. In other words, we can assume normality.
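The tests above are applied to mpg, whereas the regression assumption concerns the residuals. A minimal sketch of the same Shapiro-Wilk test on the residuals of the model fitted earlier follows; the names `fit`, `res`, and `sw` are illustrative, and no output is shown because it was not part of the original analysis.

```r
# Refit the model from the text so that this block runs on its own
fit <- lm(mpg ~ am + wt + hp + disp + cyl, data = mtcars)

# Raw residuals of the fitted model
res <- residuals(fit)

# Shapiro-Wilk test applied to the residuals rather than to mpg
sw <- shapiro.test(res)
sw

# Decision at the 5% significance level
if (sw$p.value > 0.05) {
  message("Fail to reject normality of the residuals at the 5% level")
} else {
  message("Reject normality of the residuals at the 5% level")
}
```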
Note:
Normality tests are sensitive to sample size. Small samples often pass normality tests because the tests have little power to detect departures from normality. It is therefore important to combine visual inspection with a significance test to reach the right decision.
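The effect of sample size can be illustrated with a small simulation. This sketch assumes data drawn from a t-distribution with 5 degrees of freedom (heavier tails than the normal) and compares how often the Shapiro-Wilk test rejects normality at two sample sizes. The function `reject_rate` and its parameters are hypothetical helpers written for this illustration.

```r
set.seed(123)

# Share of simulated samples for which Shapiro-Wilk rejects normality
reject_rate <- function(n, n_sim = 200, df = 5, alpha = 0.05) {
  p_values <- replicate(n_sim, shapiro.test(rt(n, df = df))$p.value)
  mean(p_values < alpha)
}

# The same non-normal distribution, sampled at two sizes
rate_small <- reject_rate(20)
rate_large <- reject_rate(2000)

c(n_20 = rate_small, n_2000 = rate_large)
```

The small samples pass the test far more often than the large ones, even though both come from the same non-normal distribution.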