Sameer Mathur
Selecting and Diagnosing a Regression Model Using mtcars
---
Motor Trend Car Road Tests
The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973-74 models).
Data Description
mpg   Miles/(US) gallon
cyl   Number of cylinders
disp  Displacement (cu.in.)
hp    Gross horsepower
drat  Rear axle ratio
wt    Weight (1000 lbs)
qsec  ¼ mile time
vs    Engine (0 = V-shaped, 1 = straight)
am    Transmission (0 = automatic, 1 = manual)
gear  Number of forward gears
carb  Number of carburetors
# importing data
data(mtcars)
# data rows and columns
dim(mtcars)
[1] 32 11
# first few rows
head(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
# data structure
str(mtcars)
'data.frame': 32 obs. of 11 variables:
$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
$ disp: num 160 160 108 258 360 ...
$ hp : num 110 110 93 110 175 105 245 62 95 123 ...
$ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$ wt : num 2.62 2.88 2.32 3.21 3.44 ...
$ qsec: num 16.5 17 18.6 19.4 17 ...
$ vs : num 0 0 1 1 0 1 0 1 1 1 ...
$ am : num 1 1 1 0 0 0 0 0 0 0 ...
$ gear: num 4 4 4 3 3 3 3 4 4 4 ...
$ carb: num 4 4 1 1 2 1 4 2 2 4 ...
# convert 'am' to a factor variable
mtcars$am <- as.factor(mtcars$am)
# convert 'cyl' to a factor variable
mtcars$cyl <- as.factor(mtcars$cyl)
# verify
str(mtcars$am)
Factor w/ 2 levels "0","1": 2 2 2 1 1 1 1 1 1 1 ...
str(mtcars$cyl)
Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
# attaching data columns
attach(mtcars)
# descriptive statistics
library(psych)
describe(mtcars)[, c(1:5, 8:9)]
vars n mean sd median min max
mpg 1 32 20.09 6.03 19.20 10.40 33.90
cyl* 2 32 2.09 0.89 2.00 1.00 3.00
disp 3 32 230.72 123.94 196.30 71.10 472.00
hp 4 32 146.69 68.56 123.00 52.00 335.00
drat 5 32 3.60 0.53 3.70 2.76 4.93
wt 6 32 3.22 0.98 3.33 1.51 5.42
qsec 7 32 17.85 1.79 17.71 14.50 22.90
vs 8 32 0.44 0.50 0.00 0.00 1.00
am* 9 32 1.41 0.50 1.00 1.00 2.00
gear 10 32 3.69 0.74 4.00 3.00 5.00
carb 11 32 2.81 1.62 2.00 1.00 8.00
Here we use automatic model selection with step().
The R function step() performs stepwise variable selection, adding or dropping predictors to minimize the AIC.
Selecting a linear regression model (without interactions) using step().
# automatic model selection
stepLinOLSModel <- step(lm(mpg ~ ., data=mtcars),
trace=0, steps=10000)
summary(stepLinOLSModel)
Call:
lm(formula = mpg ~ wt + qsec + am, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-3.4811 -1.5555 -0.7257 1.4110 4.6610
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.6178 6.9596 1.382 0.177915
wt -3.9165 0.7112 -5.507 6.95e-06 ***
qsec 1.2259 0.2887 4.247 0.000216 ***
am1 2.9358 1.4109 2.081 0.046716 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.459 on 28 degrees of freedom
Multiple R-squared: 0.8497, Adjusted R-squared: 0.8336
F-statistic: 52.75 on 3 and 28 DF, p-value: 1.21e-11
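To see the selection path itself (which predictor is dropped at each step and the resulting AIC), step() can be rerun with tracing enabled. A quick sketch, with output omitted here; the object name stepTrace is ours:
# rerun the search with tracing on to print each elimination step and its AIC
stepTrace <- step(lm(mpg ~ ., data = mtcars), direction = "both", trace = 1)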
We now check the major linear regression assumptions: linearity, normality of residuals, and constant error variance (homoskedasticity).
# residual plots of OLS model
par(mfrow=c(2,2))
plot(stepLinOLSModel)
Linearity
The linearity assumption can be checked by inspecting the Residuals vs Fitted plot (plot on the top-left) from the Diagnostic Plots.
# residual vs. fitted plot
plot(stepLinOLSModel, 1)
In this plot, the red line is not flat: it deviates noticeably from the dotted horizontal line. Hence, we cannot assume linearity.
We will use a Box-Cox transformation of the response to rectify this.
library(caret)
mpgTrans <- BoxCoxTrans(mtcars$mpg)
mpgTrans
Box-Cox Transformation
32 data points used to estimate Lambda
Input data summary:
Min. 1st Qu. Median Mean 3rd Qu. Max.
10.40 15.43 19.20 20.09 22.80 33.90
Largest/Smallest: 3.26
Sample Skewness: 0.611
Estimated Lambda: 0
With fudge factor, Lambda = 0 will be used for transformations
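An estimated Lambda of 0 means the Box-Cox transform reduces to the natural logarithm, so the transformed variable is simply log(mpg). A quick sanity check we add for verification:
# lambda = 0 corresponds to a log transform; verify against the first mpg value
log(mtcars$mpg[1])                               # 3.044522, matching mpgNew for the Mazda RX4 below
all.equal(predict(mpgTrans, mtcars$mpg), log(mtcars$mpg))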
# append the transformed variable to mtcars
mtcars <- cbind(mtcars, mpgNew = predict(mpgTrans, mtcars$mpg))
# first few rows of the dataset
head(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
mpgNew
Mazda RX4 3.044522
Mazda RX4 Wag 3.044522
Datsun 710 3.126761
Hornet 4 Drive 3.063391
Hornet Sportabout 2.928524
Valiant 2.895912
Refitting the model selected earlier by step(), now with the Box-Cox transformed response.
# refit the selected model on the transformed response
fitLinOLSTransModel <- lm(mpgNew ~ wt + qsec + am, data = mtcars)
# summary of the model
summary(fitLinOLSTransModel)
Call:
lm(formula = mpgNew ~ wt + qsec + am, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-0.13879 -0.08114 -0.03466 0.07030 0.26575
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.69410 0.31326 8.600 2.40e-09 ***
wt -0.22456 0.03201 -7.015 1.25e-07 ***
qsec 0.05329 0.01299 4.101 0.00032 ***
am1 0.08558 0.06351 1.347 0.18863
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1107 on 28 degrees of freedom
Multiple R-squared: 0.8752, Adjusted R-squared: 0.8619
F-statistic: 65.47 on 3 and 28 DF, p-value: 9.036e-13
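Note that after the transformation, am1 is no longer significant at the 5% level. Whether am still earns its place can be checked with a partial F-test; a sketch we add for illustration (fitNoAm is our name, and this check is not part of the original selection):
# partial F-test: does am add explanatory power beyond wt and qsec?
fitNoAm <- lm(mpgNew ~ wt + qsec, data = mtcars)
anova(fitNoAm, fitLinOLSTransModel)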
Before Box-Cox transformation
# residual vs. fitted plot
plot(stepLinOLSModel, 1)
After Box-Cox transformation
# residual vs. fitted plot
plot(fitLinOLSTransModel, 1)
After the Box-Cox transformation, we find that the red line is much closer to flat.
Hence, we are more comfortable assuming linearity.
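For a direct visual comparison, the two residuals vs. fitted plots can also be drawn side by side (a convenience sketch):
# draw both residuals-vs-fitted plots in one row for comparison
par(mfrow = c(1, 2))
plot(stepLinOLSModel, 1)        # before transformation
plot(fitLinOLSTransModel, 1)    # after transformation
par(mfrow = c(1, 1))            # reset the plotting layout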
Normality
The Q-Q plot of residuals (the 2nd diagnostic plot) can be used to visually check the normality assumption: the points should approximately follow a straight line.
# normal probability plot of residuals
plot(fitLinOLSTransModel, 2)
In our mtcars model, the points fall approximately along the reference line.
Therefore, we could assume normality.
Further, we will confirm our visual inspection using formal statistical tests.
There are several methods for testing normality, such as:
Anderson-Darling test
Shapiro-Wilk's test
Kolmogorov-Smirnov (K-S) test
The Anderson-Darling test (AD test, for short) is one of the most commonly used normality tests, and can be executed using the ad.test() function from the nortest package.
# Anderson-Darling normality test
library(nortest)
ad.test(mtcars$mpgNew)
Anderson-Darling normality test
data: mtcars$mpgNew
A = 0.22479, p-value = 0.8055
Shapiro-Wilk's method is widely recommended for testing normality and provides better power than the K-S test. It is based on the correlation between the data and the corresponding normal scores.
# Shapiro-Wilk's normality test
shapiro.test(mtcars$mpgNew)
Shapiro-Wilk normality test
data: mtcars$mpgNew
W = 0.97668, p-value = 0.699
From both test outputs, the p-value > 0.05, implying that the distribution of the data is not significantly different from a normal distribution. In other words, we can assume normality. Hence, the tests support our visual interpretation.
Note:
Normality tests are sensitive to sample size: small samples most often pass them. Therefore, it is important to combine visual inspection with significance tests to reach the right decision.
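For completeness, the Kolmogorov-Smirnov test listed above is best applied in its Lilliefors-corrected form when the mean and variance are estimated from the data; the nortest package provides this as lillie.test(). A sketch, not run as part of the analysis above:
# Lilliefors-corrected Kolmogorov-Smirnov normality test
library(nortest)
lillie.test(mtcars$mpgNew)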
Heteroskedasticity
The plots we are interested in are the top-left and bottom-left panels. The top-left is the chart of residuals vs. fitted values, while the bottom-left (scale-location) plot has the square root of the standardized residuals on the y-axis.
If there is no heteroskedasticity, you should see a completely random, equal spread of points throughout the range of the x-axis and a flat red line.
This assumption can be checked by examining the scale-location plot, also known as the spread-location plot (3rd plot) from the Diagnostic Plots.
# residual vs. fitted plot
plot(fitLinOLSTransModel, 1)
# scale-location plot
plot(fitLinOLSTransModel, 3)
We can use statistical tests to check homogeneity of variance.
There are a few tests that come in handy to establish the presence or absence of heteroskedasticity:
The NCV test
The Breusch-Pagan test
# non-constant error variance test
library(car)
ncvTest(fitLinOLSTransModel)
Non-constant Variance Score Test
Variance formula: ~ fitted.values
Chisquare = 2.820686, Df = 1, p = 0.093057
library(lmtest)
# Breusch-Pagan test
bptest(fitLinOLSTransModel)
studentized Breusch-Pagan test
data: fitLinOLSTransModel
BP = 6.07, df = 3, p-value = 0.1083
Both statistical tests give p-values > 0.05, so we fail to reject the null hypothesis of constant error variance. Hence, we can assume homoskedasticity.
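As a closing visual check, the full 2 x 2 diagnostic panel can be redrawn for the transformed model, mirroring the earlier panel for the untransformed fit (a convenience sketch):
# full diagnostic panel for the transformed model
par(mfrow = c(2, 2))
plot(fitLinOLSTransModel)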