This study was commissioned to examine the relationship between a set of variables and the resulting miles per gallon (MPG). A key consideration is the transmission type (manual vs. automatic). Regression modelling techniques are used to draw conclusions, which will be provided to Motor Trend to support their decision making. The study shows that manual transmission is better for MPG, and that the difference between the transmission types is significant when transmission is considered on its own. In a multivariable regression model, weight is the best indicator of MPG.
After loading the data, the first thing to note is that it consists of 32 observations of 10 predictor variables and the outcome (MPG). Since our main focus is the difference between manual and automatic transmission, let's see how MPG is distributed across the two groups.
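library(plyr)   # provides ddply
library(knitr)  # provides kable
# car_data is assumed to have been created from mtcars when the data was loaded
Frequency <- data.frame(table(mtcars$am))[, 2]
y <- data.frame(ddply(car_data, "am", function(x) summary(x$mpg))
,row.names = c("Automatic Cars", "Manual Cars"))
kable(cbind(y, Frequency))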
| | am | Min. | X1st.Qu. | Median | Mean | X3rd.Qu. | Max. | Frequency |
|---|---|---|---|---|---|---|---|---|
| Automatic Cars | 0 | 10.4 | 14.95 | 17.3 | 17.15 | 19.2 | 24.4 | 19 |
| Manual Cars | 1 | 15.0 | 21.00 | 22.8 | 24.39 | 30.4 | 33.9 | 13 |
There are 19 automatic cars and 13 manual cars in the data. A simple summary shows that automatic cars have a lower mean MPG (17.15 vs. 24.39) and a lower range (10.4 to 24.4 vs. 15.0 to 33.9) than manual cars. A boxplot illustration can be found in the Appendix (Figure 1).
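The group comparison can also be reproduced with base R alone. The following is a minimal sketch, assuming only the built-in `mtcars` data set; the names `group_means` and `group_ranges` are illustrative and not part of the original analysis.

```r
# Minimal sketch using base R and the built-in mtcars data set.
# group_means and group_ranges are illustrative names, not from the original analysis.
data(mtcars)

# Mean MPG per transmission type (am: 0 = automatic, 1 = manual)
group_means <- tapply(mtcars$mpg, mtcars$am, mean)

# Minimum and maximum MPG per transmission type
group_ranges <- tapply(mtcars$mpg, mtcars$am, range)

print(round(group_means, 2))
print(group_ranges)
```

The printed means and ranges should agree with the Mean, Min. and Max. columns of the summary table.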
am_fit <- lm(mpg ~ am, data = mtcars)
summary(am_fit)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147368 1.124603 15.247492 1.133983e-15
## am 7.244939 1.764422 4.106127 2.850207e-04
Fitting transmission alone is not sufficient for a model, since the proportion of variance explained is low (R-squared = 0.3597989). This is not surprising: am is a binary variable, so the model can only predict two distinct values. The intercept is 17.147, which is the mean MPG of automatic cars from the exploratory analysis (where am = 0). The slope is 7.24, which means that the mean MPG of manual cars is 7.24 higher than that of automatic cars. In a residual plot of this fit, the data simply splits into two groups, because am is binary as mentioned previously.
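The interpretation of the coefficients can be checked directly. This is a minimal sketch, assuming the built-in `mtcars` data set; `intercept_check`, `slope_check` and `r_squared` are illustrative names. It also uses the fact that, with a single predictor, R-squared equals the squared correlation between predictor and outcome.

```r
# Minimal sketch: verify what the coefficients of mpg ~ am represent.
# Variable names other than am_fit are illustrative, not from the original analysis.
data(mtcars)

am_fit <- lm(mpg ~ am, data = mtcars)
group_means <- tapply(mtcars$mpg, mtcars$am, mean)

# The intercept should equal the mean MPG of automatic cars (am = 0)
intercept_check <- all.equal(unname(coef(am_fit)["(Intercept)"]),
                             group_means[["0"]])

# The slope should equal the difference between the two group means
slope_check <- all.equal(unname(coef(am_fit)["am"]),
                         group_means[["1"]] - group_means[["0"]])

# With one predictor, R-squared is the squared correlation of mpg and am
r_squared <- summary(am_fit)$r.squared
squared_cor <- cor(mtcars$mpg, mtcars$am)^2

print(c(intercept_check, slope_check))
print(c(r_squared = r_squared, squared_cor = squared_cor))
```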
One way to find a better fit would be to include other variables; but which ones? I will look at the correlation between MPG and the other variables.
mpg_corr <- cor(mtcars)
mpg_corr[1,]
## mpg cyl disp hp drat wt
## 1.0000000 -0.8521620 -0.8475514 -0.7761684 0.6811719 -0.8676594
## qsec vs am gear carb
## 0.4186840 0.6640389 0.5998324 0.4802848 -0.5509251
The variables with correlations closest to 1 or -1 are the most strongly linearly related to MPG and would therefore be most useful in a regression model. For this experiment I will use a threshold of |r| > 0.8. Hence, I will add the following 3 variables to the model: cyl (-0.852), disp (-0.848) and wt (-0.868).
All three of these variables are negatively correlated with MPG, as illustrated in Figure 2.
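The threshold rule can be applied programmatically instead of by reading the correlation output. This is a minimal sketch, assuming the built-in `mtcars` data set; `threshold`, `predictors` and `selected` are illustrative names.

```r
# Minimal sketch: select predictors whose absolute correlation with mpg
# exceeds a threshold. Names here are illustrative, not from the original analysis.
data(mtcars)

threshold <- 0.8

# Correlation of every variable with mpg (the "mpg" row of the matrix)
mpg_corr <- cor(mtcars)["mpg", ]

# Drop mpg's correlation with itself, then apply the threshold
predictors <- mpg_corr[names(mpg_corr) != "mpg"]
selected <- predictors[abs(predictors) > threshold]

print(round(selected, 3))
```

Changing `threshold` is an easy way to see how sensitive the variable selection is to this choice; hp, at about -0.776, falls just below the cut-off used here.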
multi_fit <- lm(mpg ~ am + wt + cyl + disp, data = mtcars)
summary(multi_fit)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 40.898313414 3.60154037 11.3557837 8.677574e-12
## am 0.129065571 1.32151163 0.0976651 9.229196e-01
## wt -3.583425472 1.18650433 -3.0201537 5.468412e-03
## cyl -1.784173258 0.61819218 -2.8861142 7.581533e-03
## disp 0.007403833 0.01208067 0.6128661 5.450930e-01
summary(multi_fit)$r.squared
## [1] 0.8326661
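The weight of the vehicle contributes heavily to the model: holding the other variables fixed, each additional unit of wt (1000 lbs) is associated with about 3.58 fewer MPG (p = 0.005). This model is a better fit, since the R-squared has jumped from around 36% to around 83%. Note that once wt, cyl and disp are included, the am coefficient drops to 0.13 and is no longer statistically significant (p = 0.92). This model also produces a well-behaved residual plot (Figure 3).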
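The two models can be compared side by side. This is a minimal sketch, assuming the built-in `mtcars` data set; `fit_comparison` and `am_p_values` are illustrative names, and the adjusted R-squared column is an addition that the original analysis does not report.

```r
# Minimal sketch: compare the simple and the multivariable model.
# fit_comparison and am_p_values are illustrative names; adjusted R-squared
# is shown as an extra check and is not reported in the original analysis.
data(mtcars)

am_fit <- lm(mpg ~ am, data = mtcars)
multi_fit <- lm(mpg ~ am + wt + cyl + disp, data = mtcars)

fit_comparison <- data.frame(
  model = c("mpg ~ am", "mpg ~ am + wt + cyl + disp"),
  r_squared = c(summary(am_fit)$r.squared,
                summary(multi_fit)$r.squared),
  adj_r_squared = c(summary(am_fit)$adj.r.squared,
                    summary(multi_fit)$adj.r.squared)
)
print(fit_comparison)

# p-value of the transmission coefficient in each model
am_p_values <- c(
  simple = summary(am_fit)$coefficients["am", "Pr(>|t|)"],
  multi = summary(multi_fit)$coefficients["am", "Pr(>|t|)"]
)
print(am_p_values)
```

Adjusted R-squared penalises the extra predictors, so it is a slightly more conservative way to confirm that the larger model is the better fit.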
boxplot(mpg~am
,data=mtcars
,main="Miles Per Gallon"
,col="lightyellow"
,border="steelblue"
,names=c("Automatic", "Manual")
, horizontal = TRUE)
box(which="outer")
Figure 1 - Boxplot of MPG across Manual and Automatic transmissions
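library(ggplot2)    # provides qplot
library(gridExtra)  # provides grid.arrange
# par(mfrow) has no effect on ggplot2 output; grid.arrange handles the layout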
cyl_plot <- qplot(cyl, mpg, data=car_data, color=am)
disp_plot <- qplot(disp, mpg, data=car_data, color=am)
wt_plot <- qplot(wt, mpg, data=car_data, color=am)
grid.arrange(cyl_plot, disp_plot, wt_plot, ncol=2)
Figure 2 - cyl, disp, wt plotted against mpg
par(mfrow = c(2, 2))
plot(multi_fit)
box(which="outer")
Figure 3 - Residuals of MPG across cyl, wt, disp and am