Load the data and preview the variables:

data(mtcars)
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

First, look at the relationship between MPG and the other quantitative variables. We can see that all the quantitative variables have some kind of linear relationship with MPG despite some being weaker than others.

pairs(~mpg+disp+drat+wt+qsec,data=mtcars, 
   main="Simple Scatterplot Matrix")

#source:  http://www.statmethods.net/graphs/scatterplot.html

Next, we can see how much explanatory power each variable adds into a linear model. Adding “drat” into the model increased the explanatory power of the model, but not by much.

lm1 <- lm(mpg~disp,data=mtcars)
lm2 <- lm(mpg~disp+drat,data=mtcars)
lm3 <- lm(mpg~disp+drat+wt,data=mtcars)
lm4 <- lm(mpg~disp+drat+wt+qsec,data=mtcars)

anova(lm1,lm2,lm3,lm4)
## Analysis of Variance Table
## 
## Model 1: mpg ~ disp
## Model 2: mpg ~ disp + drat
## Model 3: mpg ~ disp + drat + wt
## Model 4: mpg ~ disp + drat + wt + qsec
##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)   
## 1     30 317.16                                
## 2     29 302.90  1    14.263 2.1157 0.157323   
## 3     28 243.75  1    59.142 8.7731 0.006306 **
## 4     27 182.02  1    61.739 9.1583 0.005387 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Next, we can look at the confidence intervals of the slopes in the full linear model and see if there is anything that might be better left outside the model. “disp” is actually a bigger suspect variable that might be increasing the error of the model without adding much explanatory power.

summary(lm4)
## 
## Call:
## lm(formula = mpg ~ disp + drat + wt + qsec, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.1428 -1.9328 -0.1491  0.9191  5.5408 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  9.032229   9.588294   0.942 0.354543    
## disp         0.005222   0.011047   0.473 0.640215    
## drat         1.870855   1.324623   1.412 0.169265    
## wt          -4.867682   1.208716  -4.027 0.000412 ***
## qsec         1.052478   0.347782   3.026 0.005387 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.596 on 27 degrees of freedom
## Multiple R-squared:  0.8384, Adjusted R-squared:  0.8144 
## F-statistic: 35.01 on 4 and 27 DF,  p-value: 2.547e-10

Time to re-run the ANOVA and change the order so that we can see the impact specifically of “disp”. Based on the sequential sum of squares of 1.5064, it is probably a good idea to remove disp.

lm5 <- lm(mpg~drat+wt+qsec,data=mtcars)
#same lm4 as above, just repeating for readibility
lm4 <- lm(mpg~disp+drat+wt+qsec,data=mtcars)

anova(lm5,lm4)
## Analysis of Variance Table
## 
## Model 1: mpg ~ drat + wt + qsec
## Model 2: mpg ~ disp + drat + wt + qsec
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     28 183.52                           
## 2     27 182.01  1    1.5064 0.2235 0.6402

One extra piece of informtion is to look at the variance inflation factors just to see how much variance is being added before making a final decision. We can see that disp is adding a large amount of variance without really adding much explanatory model, so it is definitely a good idea to remove it.

library("car")

vif(lm4)
##     disp     drat       wt     qsec 
## 8.620676 2.306684 6.432062 1.776040

The working draft linear model based on the quantative variables:

lm6 <- lm(mpg~drat+wt+qsec,data=mtcars)
summary(lm6)
## 
## Call:
## lm(formula = mpg ~ drat + wt + qsec, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.1152 -1.8273 -0.2696  1.0502  5.5010 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.3945     8.0689   1.412  0.16892    
## drat          1.6561     1.2269   1.350  0.18789    
## wt           -4.3978     0.6781  -6.485 5.01e-07 ***
## qsec          0.9462     0.2616   3.616  0.00116 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.56 on 28 degrees of freedom
## Multiple R-squared:  0.837,  Adjusted R-squared:  0.8196 
## F-statistic: 47.93 on 3 and 28 DF,  p-value: 3.723e-11

Now to check the whether the linear model assumptions are met. We can see that the independence of the error terms is lacking. Also the normality is lacking.

plot(lm6)

Going back to the VIF figures, the VIF for “wt” was quite high. Maybe it is could be removed. We lost some of the r^2 and (somehow) the standard error actually increased, but we can see that the linear model assumptions around the linear, independence, and normality of the error terms are now being met.

lm7 <- lm(mpg~drat+qsec,data=mtcars)
summary(lm7)
## 
## Call:
## lm(formula = mpg ~ drat + qsec, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.9293 -2.4823 -0.1581  2.3151 10.1923 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -27.8399     8.2992  -3.355  0.00223 ** 
## drat          7.3086     1.3423   5.445 7.37e-06 ***
## qsec          1.2127     0.4016   3.019  0.00524 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.979 on 29 degrees of freedom
## Multiple R-squared:  0.5922, Adjusted R-squared:  0.5641 
## F-statistic: 21.06 on 2 and 29 DF,  p-value: 2.247e-06
plot(lm7)

Now to go onto looking at the automatic transmission asepct. We can see that the “am” categorical variable has a slope with a coefficient that is statistically significant from zero. We can also see that the R^2 increased and the error decreased.

lm8 <- lm(mpg~drat+qsec+factor(am),data=mtcars)
summary(lm8)
## 
## Call:
## lm(formula = mpg ~ drat + qsec + factor(am), data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.4052 -2.3648 -0.0107  1.8616  7.5908 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -23.9232     7.1862  -3.329  0.00245 ** 
## drat          2.7638     1.7561   1.574  0.12675    
## qsec          1.7592     0.3787   4.645 7.31e-05 ***
## factor(am)1   6.5824     1.9254   3.419  0.00195 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.402 on 28 degrees of freedom
## Multiple R-squared:  0.7123, Adjusted R-squared:  0.6815 
## F-statistic: 23.11 on 3 and 28 DF,  p-value: 9.869e-08

But, do the linear model assumptions still hold? Generally, yes, but there is some dependence in the error term showing. But overall, it looks ok.

plot(lm8)

Conclusion:

The positive slope for the “am” categorical variable shows that automatics have a association with higher MPG. All factors being the equal, automatics are associated with 6.58 MPG higher than manuals.

summary(lm8)
## 
## Call:
## lm(formula = mpg ~ drat + qsec + factor(am), data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.4052 -2.3648 -0.0107  1.8616  7.5908 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -23.9232     7.1862  -3.329  0.00245 ** 
## drat          2.7638     1.7561   1.574  0.12675    
## qsec          1.7592     0.3787   4.645 7.31e-05 ***
## factor(am)1   6.5824     1.9254   3.419  0.00195 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.402 on 28 degrees of freedom
## Multiple R-squared:  0.7123, Adjusted R-squared:  0.6815 
## F-statistic: 23.11 on 3 and 28 DF,  p-value: 9.869e-08