Executive summary

A manual transmission is associated with better fuel economy (mpg), in part because cars with an automatic transmission tend to weigh more and weight is a major determinant of fuel efficiency. In addition, a higher quarter-mile time, qsec (that is, slower acceleration), is associated with better mpg.

Exploratory Data Analysis

The question considered here is whether an automatic or a manual transmission is better for miles per gallon (mpg), using the mtcars dataset that ships with R.

Before any comparisons and statistical inferences are made, the left panel of Figure 1 (Appendix 1) shows that cars with fewer cylinders have, on average, a higher mpg, which is likely related to the weight of the car. Indeed, the right panel of Figure 1 shows a negative association between mpg and the weight of the car, and that cars with an automatic transmission tend to be heavier.
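The pattern in Figure 1 can also be checked numerically. The following is a minimal sketch rather than part of the original analysis; the variable names (wt_by_am, mpg_by_cyl, r_mpg_wt) are chosen here for illustration, and it relies on am being coded 0 for automatic and 1 for manual in mtcars.

```r
# Mean weight (in 1000 lbs) per transmission type; am: 0 = automatic, 1 = manual
wt_by_am <- tapply(mtcars$wt, mtcars$am, mean)

# Mean mpg per number of cylinders
mpg_by_cyl <- tapply(mtcars$mpg, mtcars$cyl, mean)

# Pearson correlation between mpg and weight (negative if heavier cars use more fuel)
r_mpg_wt <- cor(mtcars$mpg, mtcars$wt)

wt_by_am
mpg_by_cyl
r_mpg_wt
```

A larger mean weight for am = 0 and a negative correlation would be consistent with the two panels of Figure 1.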

The relative importance of weight as a predictor is further exemplified when fitting a linear model using each feature individually to predict mpg and calculating the sum of squared errors (SSE) for each.

y <- mtcars$mpg; x1 <- mtcars$wt; n <- length(y)
fit <- lm(y ~ x1)
e <- resid(fit)
sum(e^2)
## [1] 278.3219

The SSE is smallest when using weight as the only feature (see above), followed by the number of cylinders (SSE = 308.33), displacement (SSE = 317.16), and horsepower (SSE = 447.67). Figure 2 (Appendix 1) shows the residual plot with weight as the x variable. Although a model with more features will decrease the SSE, note that displacement is strongly related to both the number of cylinders and horsepower, so these features carry overlapping information.
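The single-feature comparison can be reproduced in one pass. This is a sketch under the assumption that each model is a simple linear regression of mpg on one feature, as in the weight example; the names predictors, sse, and feature_cor are introduced here for illustration.

```r
# Features compared in the text
predictors <- c("wt", "cyl", "disp", "hp")

# Fit mpg ~ <feature> for each one and collect the sum of squared errors
sse <- sapply(predictors, function(p) {
  fit <- lm(reformulate(p, response = "mpg"), data = mtcars)
  sum(resid(fit)^2)
})
round(sort(sse), 2)

# Pairwise correlations among the features, to gauge how much they overlap
feature_cor <- cor(mtcars[, predictors])
round(feature_cor, 2)
```

Sorting the SSE values puts the most predictive single feature first, and the correlation matrix gives a rough indication of how much displacement overlaps with cylinders and horsepower.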

Ideally, one would like to minimize the variation around the regression line using transmission type (am), since it is part of the original question, together with one or more other features. Figure 3 (Appendix 1) compares the residual variation around the mean mpg of all observations (left) with the residual variation when wt and am are used as features (right). The goal is now to develop a model whose features decrease the residual variation as much as possible without overcomplicating the model by using all features.
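The reduction in residual variation shown in Figure 3 can be expressed as a single number. The sketch below compares the intercept-only model with the wt and am model; the names sst, sse_wt_am, and prop_explained are assumptions made for this example.

```r
# Residual sum of squares around the mean mpg (intercept-only model)
sst <- sum(resid(lm(mpg ~ 1, data = mtcars))^2)

# Residual sum of squares when wt and am are used as features
fit_wt_am <- lm(mpg ~ wt + am, data = mtcars)
sse_wt_am <- sum(resid(fit_wt_am)^2)

# Proportion of the variation around the mean that the features remove;
# for a linear model with an intercept this equals R-squared
prop_explained <- 1 - sse_wt_am / sst
c(sst = sst, sse = sse_wt_am, prop_explained = prop_explained)
```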

In order to select a model, one can use the Akaike information criterion (AIC), which provides a measure of the relative quality of a model; a lower AIC value is typically associated with a better model. From the results of backward stepwise selection (Appendix 2), it would appear that the “best” model uses the following features: am, wt, and qsec.
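To make the AIC values in Appendix 2 less opaque, the sketch below recomputes the value for the selected model by hand. For linear models, step() relies on extractAIC(), which uses n * log(RSS / n) + 2 * k, where k is the number of estimated coefficients; this differs by a constant from the value returned by AIC(). The names final, rss, k, aic_manual, and aic_step are chosen here for illustration.

```r
# Model selected by the backward search in Appendix 2
final <- lm(mpg ~ wt + qsec + am, data = mtcars)

n   <- nrow(mtcars)           # number of observations
rss <- sum(resid(final)^2)    # residual sum of squares
k   <- length(coef(final))    # estimated coefficients, including the intercept

# AIC as used by step() for lm fits
aic_manual <- n * log(rss / n) + 2 * k

# Same quantity via extractAIC(), which returns c(edf, AIC)
aic_step <- extractAIC(final)[2]

c(manual = aic_manual, extractAIC = aic_step)
```

Both values should agree with the AIC reported at the final step of the trace in Appendix 2.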

As such, the model’s coefficients are as follows:

summary(lm(mpg ~ am + wt + qsec, data=mtcars))$coefficients
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)  9.617781  6.9595930  1.381946 1.779152e-01
## am           2.935837  1.4109045  2.080819 4.671551e-02
## wt          -3.916504  0.7112016 -5.506882 6.952711e-06
## qsec         1.225886  0.2886696  4.246676 2.161737e-04

These coefficients show that, holding the other features constant, a manual transmission is associated with an increase of approximately 2.94 mpg, while each unit increase in weight (1000 lbs) is associated with a decrease of approximately 3.92 mpg. Finally, a unit increase in qsec (the quarter-mile time in seconds) is associated with a 1.23 increase in mpg.
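One way to read the am coefficient is as the difference in predicted mpg between two otherwise identical cars. The sketch below illustrates this with two hypothetical cars at the average wt and qsec of the dataset, and also requests 95% confidence intervals as a rough indication of the uncertainty in each estimate; new_cars, pred, and am_effect are names assumed for this example.

```r
fit <- lm(mpg ~ am + wt + qsec, data = mtcars)

# Two hypothetical cars that differ only in transmission type
new_cars <- data.frame(am   = c(0, 1),              # 0 = automatic, 1 = manual
                       wt   = mean(mtcars$wt),
                       qsec = mean(mtcars$qsec))
pred <- predict(fit, newdata = new_cars)

# Difference in predicted mpg (manual minus automatic) equals the am coefficient
am_effect <- unname(diff(pred))
c(am_effect = am_effect, coefficient = unname(coef(fit)["am"]))

# 95% confidence intervals for all coefficients
confint(fit, level = 0.95)
```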

Appendix 1

Figure 1. Fuel economy (mpg) by transmission type (left) and by weight (right)

library(ggplot2)    # ggplot()
library(gridExtra)  # grid.arrange()

# Left panel: mpg by transmission type (0 = automatic, 1 = manual), coloured by cylinders
plot1 <- ggplot(mtcars, aes(factor(am), mpg)) +
  geom_point(aes(colour=factor(cyl)), size=4) +
  theme(legend.position="top") + xlab("Transmission type") +
  ylab("Miles per gallon")
# Right panel: mpg by weight, coloured by transmission type
plot2 <- ggplot(mtcars, aes(wt, mpg)) +
  geom_point(aes(colour=factor(am)), size=4) +
  theme(legend.position="top") + xlab("Weight (1000 lbs)") +
  ylab("Miles per gallon")
grid.arrange(plot1, plot2, ncol=2)

Figure 2. Residual plot with weight as the sole predictor for mpg.

y <- mtcars$mpg; x <- mtcars$wt; n <- length(y)
fit <- lm(y ~ x)
e <- resid(fit)
yhat <- predict(fit)

plot(x, e, xlab = "Weight (1000 lbs)",
     ylab="Residuals (mpg)",
     bg="lightblue",
     col="black", cex=1.1, pch=21, frame=FALSE)
abline(h=0, lwd=2)
for (i in 1:n)
  lines(c(x[i], x[i]), c(e[i], 0), col="red", lwd=2)

sum(e^2)
## [1] 278.3219

Figure 3. Residual variation around the mean mpg (intercept only, Itc) or using weight and transmission as features (Itc, slope)

library(ggplot2)

# Residuals from the intercept-only model and from the wt + am model
e <- c(resid(lm(mpg ~ 1, data=mtcars)),
       resid(lm(mpg ~ wt+am, data=mtcars)))
fit <- factor(c(rep("Itc", nrow(mtcars)),
                rep("Itc, slope", nrow(mtcars))))
g <- ggplot(data.frame(e=e, fit=fit), aes(y=e, x=fit, fill=fit))
g <- g + geom_dotplot(binaxis="y", stackdir="center", binwidth=1)
g <- g + xlab("Fitting approach")
g <- g + ylab("Residual mpg")
g

Appendix 2

model <- step(lm(mpg ~ ., data = mtcars), direction = "backward")
## Start:  AIC=70.9
## mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
## 
##        Df Sum of Sq    RSS    AIC
## - cyl   1    0.0799 147.57 68.915
## - vs    1    0.1601 147.66 68.932
## - carb  1    0.4067 147.90 68.986
## - gear  1    1.3531 148.85 69.190
## - drat  1    1.6270 149.12 69.249
## - disp  1    3.9167 151.41 69.736
## - hp    1    6.8399 154.33 70.348
## - qsec  1    8.8641 156.36 70.765
## <none>              147.49 70.898
## - am    1   10.5467 158.04 71.108
## - wt    1   27.0144 174.51 74.280
## 
## Step:  AIC=68.92
## mpg ~ disp + hp + drat + wt + qsec + vs + am + gear + carb
## 
##        Df Sum of Sq    RSS    AIC
## - vs    1    0.2685 147.84 66.973
## - carb  1    0.5201 148.09 67.028
## - gear  1    1.8211 149.40 67.308
## - drat  1    1.9826 149.56 67.342
## - disp  1    3.9009 151.47 67.750
## - hp    1    7.3632 154.94 68.473
## <none>              147.57 68.915
## - qsec  1   10.0933 157.67 69.032
## - am    1   11.8359 159.41 69.384
## - wt    1   27.0280 174.60 72.297
## 
## Step:  AIC=66.97
## mpg ~ disp + hp + drat + wt + qsec + am + gear + carb
## 
##        Df Sum of Sq    RSS    AIC
## - carb  1    0.6855 148.53 65.121
## - gear  1    2.1437 149.99 65.434
## - drat  1    2.2139 150.06 65.449
## - disp  1    3.6467 151.49 65.753
## - hp    1    7.1060 154.95 66.475
## <none>              147.84 66.973
## - am    1   11.5694 159.41 67.384
## - qsec  1   15.6830 163.53 68.200
## - wt    1   27.3799 175.22 70.410
## 
## Step:  AIC=65.12
## mpg ~ disp + hp + drat + wt + qsec + am + gear
## 
##        Df Sum of Sq    RSS    AIC
## - gear  1     1.565 150.09 63.457
## - drat  1     1.932 150.46 63.535
## <none>              148.53 65.121
## - disp  1    10.110 158.64 65.229
## - am    1    12.323 160.85 65.672
## - hp    1    14.826 163.35 66.166
## - qsec  1    26.408 174.94 68.358
## - wt    1    69.127 217.66 75.350
## 
## Step:  AIC=63.46
## mpg ~ disp + hp + drat + wt + qsec + am
## 
##        Df Sum of Sq    RSS    AIC
## - drat  1     3.345 153.44 62.162
## - disp  1     8.545 158.64 63.229
## <none>              150.09 63.457
## - hp    1    13.285 163.38 64.171
## - am    1    20.036 170.13 65.466
## - qsec  1    25.574 175.67 66.491
## - wt    1    67.572 217.66 73.351
## 
## Step:  AIC=62.16
## mpg ~ disp + hp + wt + qsec + am
## 
##        Df Sum of Sq    RSS    AIC
## - disp  1     6.629 160.07 61.515
## <none>              153.44 62.162
## - hp    1    12.572 166.01 62.682
## - qsec  1    26.470 179.91 65.255
## - am    1    32.198 185.63 66.258
## - wt    1    69.043 222.48 72.051
## 
## Step:  AIC=61.52
## mpg ~ hp + wt + qsec + am
## 
##        Df Sum of Sq    RSS    AIC
## - hp    1     9.219 169.29 61.307
## <none>              160.07 61.515
## - qsec  1    20.225 180.29 63.323
## - am    1    25.993 186.06 64.331
## - wt    1    78.494 238.56 72.284
## 
## Step:  AIC=61.31
## mpg ~ wt + qsec + am
## 
##        Df Sum of Sq    RSS    AIC
## <none>              169.29 61.307
## - am    1    26.178 195.46 63.908
## - qsec  1   109.034 278.32 75.217
## - wt    1   183.347 352.63 82.790