The assignment

Motor Trend, a magazine about the automobile industry is looking at a data set of a collection of cars. The magazine is interested in exploring the relationship between a set of variables and miles per gallon (MPG) (outcome). It is particularly interested in the following two questions:

  1. Is an automatic or manual transmission better for MPG

  2. Quantify the MPG difference between automatic and manual transmissions

Executive Summary

An initial analysis, taking only transmission and mpg into account, seems to indicate that a manual transmission yields a higher mpg yield than an automatic transmission does. However, taking other variables into account shows that any positive effect of manual over automatic on mpg is reduced, to an extent that we are not confident that the positive effect is not just down to chance for the data given.

Data processing

# Read in the data and explore its structure
data(mtcars)
names(mtcars)
##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"
dim(mtcars)
## [1] 32 11

The variable we will test against “am” is currently numeric, though it takes only two values (0 and 1). It is convenient to convert it to be a factor.

mtcars$am<-as.factor(mtcars$am)
levels(mtcars$am)<-c("automatic", "manual")

Exploratory data analysis

We first assess the effect of transmission on mpg using a boxplot. This indicates that mpg is higher for manual transmissions. However this does not take any of the other variables into account, so it is worthwhile exploring the situation further with some linear regression models.

boxplot(mpg~am, mtcars, col=c("blue", "green"),
        xlab="Transmission",
        ylab="Miles per Gallon",
        main="Miles per Gallon by Transmission type")

Simple regression analysis

We first apply a simple regerssion model to the data to add numeric detail to the boxplot.

mdl_basic<-lm(mpg~am, data=mtcars)
summary(mdl_basic)
## 
## Call:
## lm(formula = mpg ~ am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3923 -3.0923 -0.2974  3.2439  9.5077 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   17.147      1.125  15.247 1.13e-15 ***
## ammanual       7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared:  0.3598, Adjusted R-squared:  0.3385 
## F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285

According to this simple model, on average, manual transmission cars have 7.245 miles per gallon more as compared to automatic transmission cars. However, on this model, only 34% of the variation in the data is explainable on there being a linear dependence of mpg on transmission type (using the adjusted R^2 value). This indicates that a more in-depth study of the effects of the other variables on mpg is needed.

Multivariate regression analysis

We start by including all the variabels in the model

mdl_all<-lm(mpg~., data=mtcars)
summary(mdl_all)$adj.r.squared
## [1] 0.8066423

Thus model has an adjusted R^2 value of 81%, but it may contain unnecessary variables. We shall use a step-wise method to arrive at a smaller set of variables. We hope to achieve an adjusted R^2 value close to 80%.

mdl_min<-lm(mpg~1,data=mtcars)
mdl_fwd=step(mdl_min, direction="forward", scope=(~am+cyl+disp+hp+drat+wt+qsec+vs+gear+carb),data=mtcars)
## Start:  AIC=115.94
## mpg ~ 1
## 
##        Df Sum of Sq     RSS     AIC
## + wt    1    847.73  278.32  73.217
## + cyl   1    817.71  308.33  76.494
## + disp  1    808.89  317.16  77.397
## + hp    1    678.37  447.67  88.427
## + drat  1    522.48  603.57  97.988
## + vs    1    496.53  629.52  99.335
## + am    1    405.15  720.90 103.672
## + carb  1    341.78  784.27 106.369
## + gear  1    259.75  866.30 109.552
## + qsec  1    197.39  928.66 111.776
## <none>              1126.05 115.943
## 
## Step:  AIC=73.22
## mpg ~ wt
## 
##        Df Sum of Sq    RSS    AIC
## + cyl   1    87.150 191.17 63.198
## + hp    1    83.274 195.05 63.840
## + qsec  1    82.858 195.46 63.908
## + vs    1    54.228 224.09 68.283
## + carb  1    44.602 233.72 69.628
## + disp  1    31.639 246.68 71.356
## <none>              278.32 73.217
## + drat  1     9.081 269.24 74.156
## + gear  1     1.137 277.19 75.086
## + am    1     0.002 278.32 75.217
## 
## Step:  AIC=63.2
## mpg ~ wt + cyl
## 
##        Df Sum of Sq    RSS    AIC
## + hp    1   14.5514 176.62 62.665
## + carb  1   13.7724 177.40 62.805
## <none>              191.17 63.198
## + qsec  1   10.5674 180.60 63.378
## + gear  1    3.0281 188.14 64.687
## + disp  1    2.6796 188.49 64.746
## + vs    1    0.7059 190.47 65.080
## + am    1    0.1249 191.05 65.177
## + drat  1    0.0010 191.17 65.198
## 
## Step:  AIC=62.66
## mpg ~ wt + cyl + hp
## 
##        Df Sum of Sq    RSS    AIC
## <none>              176.62 62.665
## + am    1    6.6228 170.00 63.442
## + disp  1    6.1762 170.44 63.526
## + carb  1    2.5187 174.10 64.205
## + drat  1    2.2453 174.38 64.255
## + qsec  1    1.4010 175.22 64.410
## + gear  1    0.8558 175.76 64.509
## + vs    1    0.0599 176.56 64.654

On the basis of this analysis, we choose to regress mpg on am, wt, cyl and hd. We bear in mind that cyl has only three levels for this analysis (4, 6 and 8).

mdl_reduced<-lm(mpg~am+wt+factor(cyl)+hp, data=mtcars)
summary(mdl_reduced)
## 
## Call:
## lm(formula = mpg ~ am + wt + factor(cyl) + hp, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9387 -1.2560 -0.4013  1.1253  5.0513 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  33.70832    2.60489  12.940 7.73e-13 ***
## ammanual      1.80921    1.39630   1.296  0.20646    
## wt           -2.49683    0.88559  -2.819  0.00908 ** 
## factor(cyl)6 -3.03134    1.40728  -2.154  0.04068 *  
## factor(cyl)8 -2.16368    2.28425  -0.947  0.35225    
## hp           -0.03211    0.01369  -2.345  0.02693 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.41 on 26 degrees of freedom
## Multiple R-squared:  0.8659, Adjusted R-squared:  0.8401 
## F-statistic: 33.57 on 5 and 26 DF,  p-value: 1.506e-10

This model has an adjusted R^2 value of 84%, so that it is a good candidate for consideration, as it explains a high proportion of the variability on the basis of an underying trend, and it is reasonably parsimonious. We now note that while manual again contributes to a higher mpg than automatic, its effect is considerably reduced, when we take the other variables into account.

On the basis of the summary data, we look at the model with the variable hp removed (as it only contributes a decrease of 0.03).

mdl_reduced_2<-lm(mpg~am+wt+factor(cyl),data=mtcars)
summary(mdl_reduced_2)
## 
## Call:
## lm(formula = mpg ~ am + wt + factor(cyl), data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.4898 -1.3116 -0.5039  1.4162  5.7758 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   33.7536     2.8135  11.997  2.5e-12 ***
## ammanual       0.1501     1.3002   0.115  0.90895    
## wt            -3.1496     0.9080  -3.469  0.00177 ** 
## factor(cyl)6  -4.2573     1.4112  -3.017  0.00551 ** 
## factor(cyl)8  -6.0791     1.6837  -3.611  0.00123 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.603 on 27 degrees of freedom
## Multiple R-squared:  0.8375, Adjusted R-squared:  0.8134 
## F-statistic: 34.79 on 4 and 27 DF,  p-value: 2.73e-10

In this more parsimonious model, we still have an adjusted R^2 measure of 80%, but in this case the positive effect of manual transmission on miles per gallon is very small (only 0.15).

anova(mdl_all, mdl_reduced, mdl_reduced_2)
## Analysis of Variance Table
## 
## Model 1: mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
## Model 2: mpg ~ am + wt + factor(cyl) + hp
## Model 3: mpg ~ am + wt + factor(cyl)
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1     21 147.49                              
## 2     26 151.03 -5    -3.531 0.1006 0.99093  
## 3     27 182.97 -1   -31.943 4.5480 0.04492 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Inspect PRESS values for the three models
sum(abs(resid(mdl_reduced)/(1-hatvalues(mdl_reduced))))
## [1] 67.53938
sum(abs(resid(mdl_reduced_2)/(1-hatvalues(mdl_reduced_2))))
## [1] 70.546
sum(abs(resid(mdl_all)/(1-hatvalues(mdl_all))))
## [1] 87.80029

We see that of the three models examined, the model using am, wt, cyl and hp seems best. Although this indicates a positive contribution of 1.81 mpg for manual transmission, because of the other results, we are less sure of its overall positive contribution.

Appendix: Graphical analysis

We first look at the distribution of the variable “mpg”.

library(ggplot2)
library(reshape2)
longcar<-melt(mtcars, measure.var="mpg")
g<-ggplot(longcar, aes(x=value))+geom_histogram(aes(y=..density..,fill=variable), binwidth=1, color="black")+geom_density(size=2)
g<-g+labs(title="Density plot of mpg")+labs(x="mpg")+labs(y="Frequency/Density")
g

We now summarize the model “mdl_reduced”" graphically

# We shall plot the fit for mdl_reduced
par(mfrow=c(2,2))
plot(mdl_reduced)