Load the data and preview the variables:
data(mtcars)
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
First, look at the relationship between MPG and the other quantitative variables. We can see that all the quantitative variables have some kind of linear relationship with MPG despite some being weaker than others.
pairs(~mpg+disp+drat+wt+qsec,data=mtcars,
main="Simple Scatterplot Matrix")
#source: http://www.statmethods.net/graphs/scatterplot.html
Next, we can see how much explanatory power each variable adds into a linear model. Adding “drat” into the model increased the explanatory power of the model, but not by much.
lm1 <- lm(mpg~disp,data=mtcars)
lm2 <- lm(mpg~disp+drat,data=mtcars)
lm3 <- lm(mpg~disp+drat+wt,data=mtcars)
lm4 <- lm(mpg~disp+drat+wt+qsec,data=mtcars)
anova(lm1,lm2,lm3,lm4)
## Analysis of Variance Table
##
## Model 1: mpg ~ disp
## Model 2: mpg ~ disp + drat
## Model 3: mpg ~ disp + drat + wt
## Model 4: mpg ~ disp + drat + wt + qsec
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 317.16
## 2 29 302.90 1 14.263 2.1157 0.157323
## 3 28 243.75 1 59.142 8.7731 0.006306 **
## 4 27 182.02 1 61.739 9.1583 0.005387 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Next, we can look at the confidence intervals of the slopes in the full linear model and see if there is anything that might be better left outside the model. “disp” is actually a bigger suspect variable that might be increasing the error of the model without adding much explanatory power.
summary(lm4)
##
## Call:
## lm(formula = mpg ~ disp + drat + wt + qsec, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.1428 -1.9328 -0.1491 0.9191 5.5408
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.032229 9.588294 0.942 0.354543
## disp 0.005222 0.011047 0.473 0.640215
## drat 1.870855 1.324623 1.412 0.169265
## wt -4.867682 1.208716 -4.027 0.000412 ***
## qsec 1.052478 0.347782 3.026 0.005387 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.596 on 27 degrees of freedom
## Multiple R-squared: 0.8384, Adjusted R-squared: 0.8144
## F-statistic: 35.01 on 4 and 27 DF, p-value: 2.547e-10
Time to re-run the ANOVA and change the order so that we can see the impact specifically of “disp”. Based on the sequential sum of squares of 1.5064, it is probably a good idea to remove disp.
lm5 <- lm(mpg~drat+wt+qsec,data=mtcars)
#same lm4 as above, just repeating for readibility
lm4 <- lm(mpg~disp+drat+wt+qsec,data=mtcars)
anova(lm5,lm4)
## Analysis of Variance Table
##
## Model 1: mpg ~ drat + wt + qsec
## Model 2: mpg ~ disp + drat + wt + qsec
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 28 183.52
## 2 27 182.01 1 1.5064 0.2235 0.6402
One extra piece of informtion is to look at the variance inflation factors just to see how much variance is being added before making a final decision. We can see that disp is adding a large amount of variance without really adding much explanatory model, so it is definitely a good idea to remove it.
library("car")
vif(lm4)
## disp drat wt qsec
## 8.620676 2.306684 6.432062 1.776040
The working draft linear model based on the quantative variables:
lm6 <- lm(mpg~drat+wt+qsec,data=mtcars)
summary(lm6)
##
## Call:
## lm(formula = mpg ~ drat + wt + qsec, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.1152 -1.8273 -0.2696 1.0502 5.5010
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.3945 8.0689 1.412 0.16892
## drat 1.6561 1.2269 1.350 0.18789
## wt -4.3978 0.6781 -6.485 5.01e-07 ***
## qsec 0.9462 0.2616 3.616 0.00116 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.56 on 28 degrees of freedom
## Multiple R-squared: 0.837, Adjusted R-squared: 0.8196
## F-statistic: 47.93 on 3 and 28 DF, p-value: 3.723e-11
Now to check the whether the linear model assumptions are met. We can see that the independence of the error terms is lacking. Also the normality is lacking.
plot(lm6)
Going back to the VIF figures, the VIF for “wt” was quite high. Maybe it is could be removed. We lost some of the r^2 and (somehow) the standard error actually increased, but we can see that the linear model assumptions around the linear, independence, and normality of the error terms are now being met.
lm7 <- lm(mpg~drat+qsec,data=mtcars)
summary(lm7)
##
## Call:
## lm(formula = mpg ~ drat + qsec, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.9293 -2.4823 -0.1581 2.3151 10.1923
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -27.8399 8.2992 -3.355 0.00223 **
## drat 7.3086 1.3423 5.445 7.37e-06 ***
## qsec 1.2127 0.4016 3.019 0.00524 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.979 on 29 degrees of freedom
## Multiple R-squared: 0.5922, Adjusted R-squared: 0.5641
## F-statistic: 21.06 on 2 and 29 DF, p-value: 2.247e-06
plot(lm7)
Now to go onto looking at the automatic transmission asepct. We can see that the “am” categorical variable has a slope with a coefficient that is statistically significant from zero. We can also see that the R^2 increased and the error decreased.
lm8 <- lm(mpg~drat+qsec+factor(am),data=mtcars)
summary(lm8)
##
## Call:
## lm(formula = mpg ~ drat + qsec + factor(am), data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.4052 -2.3648 -0.0107 1.8616 7.5908
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -23.9232 7.1862 -3.329 0.00245 **
## drat 2.7638 1.7561 1.574 0.12675
## qsec 1.7592 0.3787 4.645 7.31e-05 ***
## factor(am)1 6.5824 1.9254 3.419 0.00195 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.402 on 28 degrees of freedom
## Multiple R-squared: 0.7123, Adjusted R-squared: 0.6815
## F-statistic: 23.11 on 3 and 28 DF, p-value: 9.869e-08
But, do the linear model assumptions still hold? Generally, yes, but there is some dependence in the error term showing. But overall, it looks ok.
plot(lm8)
Conclusion:
The positive slope for the “am” categorical variable shows that automatics have a association with higher MPG. All factors being the equal, automatics are associated with 6.58 MPG higher than manuals.
summary(lm8)
##
## Call:
## lm(formula = mpg ~ drat + qsec + factor(am), data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.4052 -2.3648 -0.0107 1.8616 7.5908
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -23.9232 7.1862 -3.329 0.00245 **
## drat 2.7638 1.7561 1.574 0.12675
## qsec 1.7592 0.3787 4.645 7.31e-05 ***
## factor(am)1 6.5824 1.9254 3.419 0.00195 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.402 on 28 degrees of freedom
## Multiple R-squared: 0.7123, Adjusted R-squared: 0.6815
## F-statistic: 23.11 on 3 and 28 DF, p-value: 9.869e-08