Executive Summary

This study was commissioned to examine the relationship between a set of variables and the resulting miles per gallon (MPG). A key consideration of this study is the transmission type (manual vs automatic). Regression modelling techniques are used to draw conclusions, which will be provided to Motor Trend for their decision making. The study shows that manual transmission is better for MPG: in a simple comparison, manual cars average about 7.24 MPG more than automatic cars. In a multivariate regression model, weight is the best indicator of MPG.

Load and Explore Data

After loading the data, the first thing to note is that it consists of 32 observations with 10 predictor variables and the outcome (MPG). Since our main focus is the difference between manual and automatic transmission, let’s see how MPG is distributed within each transmission type.

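library(plyr)   # ddply
library(knitr)  # kable
# car_data is assumed to be a copy of mtcars; its definition is not shown
car_data <- mtcars
Frequency <- data.frame(table(car_data$am))[, 2]
y <- data.frame(ddply(car_data, "am", function(x) summary(x$mpg))
                ,row.names = c("Automatic Cars", "Manual Cars"))
kable(cbind(y, Frequency))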
                am  Min.  X1st.Qu.  Median   Mean  X3rd.Qu.  Max.  Frequency
Automatic Cars   0  10.4     14.95    17.3  17.15      19.2  24.4         19
Manual Cars      1  15.0     21.00    22.8  24.39      30.4  33.9         13

There are 19 automatic cars and 13 manual cars in the data. From a simple summary of the data, we can see that both the mean and the range of MPG are lower for automatic cars (mean 17.15, range 10.4 to 24.4) than for manual cars (mean 24.39, range 15.0 to 33.9). A boxplot illustrating this can be found in the Appendix (Figure 1).
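The group means and ranges quoted above can also be checked with base R alone. The following is a minimal sketch rather than part of the original analysis; the names transmission, group_means and group_ranges are illustrative assumptions.

```r
# Illustrative check of the group summary using base R only (no plyr).
# Label the transmission codes: 0 = automatic, 1 = manual.
transmission <- factor(mtcars$am, levels = c(0, 1),
                       labels = c("Automatic", "Manual"))

# Mean MPG per transmission type.
group_means <- tapply(mtcars$mpg, transmission, mean)

# Spread (max minus min) of MPG per transmission type.
group_ranges <- tapply(mtcars$mpg, transmission, function(x) diff(range(x)))

print(round(group_means, 2))
print(group_ranges)
```

Using tapply keeps the check independent of the plyr and knitr packages, at the cost of a less polished table.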

Simple Linear Regression

am_fit <- lm(mpg ~ am, data = mtcars)
summary(am_fit)$coefficients
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 17.147368   1.124603 15.247492 1.133983e-15
## am           7.244939   1.764422  4.106127 2.850207e-04

We can see that fitting the transmission type alone is not sufficient for a model, since the proportion of variance explained is low (R-squared = 0.3597989). This is to be expected, since am is a binary variable and the model can only estimate two group means. The intercept is 17.147, which we know from the exploratory analysis is the mean MPG of automatic cars (where am = 0). The slope is 7.24, which means that the mean MPG of manual cars is 7.24 higher than that of automatic cars. In a residuals-versus-fitted plot for this fit, the points fall into two vertical bands, one per transmission type, because the binary predictor yields only two fitted values.
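The link between the regression coefficients and the group means, and the R-squared value quoted in the text, can be verified directly. This is a small sketch under the assumption that mtcars is used unmodified; the helper names mean_auto, mean_manual, intercept, slope and am_r2 are illustrative.

```r
# Sketch: relate the simple regression to the two group means.
am_fit <- lm(mpg ~ am, data = mtcars)

mean_auto   <- mean(mtcars$mpg[mtcars$am == 0])
mean_manual <- mean(mtcars$mpg[mtcars$am == 1])

# The intercept equals the automatic mean; the slope equals the
# difference between the manual and automatic means.
intercept <- unname(coef(am_fit)["(Intercept)"])
slope     <- unname(coef(am_fit)["am"])
print(all.equal(intercept, mean_auto))
print(all.equal(slope, mean_manual - mean_auto))

# Proportion of variance in mpg explained by transmission type alone.
am_r2 <- summary(am_fit)$r.squared
print(am_r2)
```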

Correlation

One way to find a better fit would be to include other variables, but which ones? I will look at the correlation between MPG and the other variables.

mpg_corr <-  cor(mtcars)
mpg_corr[1,]
##        mpg        cyl       disp         hp       drat         wt 
##  1.0000000 -0.8521620 -0.8475514 -0.7761684  0.6811719 -0.8676594 
##       qsec         vs         am       gear       carb 
##  0.4186840  0.6640389  0.5998324  0.4802848 -0.5509251

The variables whose correlation with MPG is closest to 1 or -1 have the strongest linear relationship with it and would therefore be the most useful in a regression model. For this experiment I will use a threshold of |r| > 0.8. Hence, I will add the following three variables to the model: wt (-0.868), cyl (-0.852) and disp (-0.848).

It turns out that all three of these variables are negatively correlated with MPG, as illustrated in Figure 2.
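The threshold rule can also be applied programmatically instead of reading the values off the output. The following is a minimal sketch; the names threshold, mpg_cor and selected are illustrative assumptions and do not appear in the original analysis.

```r
# Sketch: pick predictors whose absolute correlation with mpg exceeds a cutoff.
threshold <- 0.8  # cutoff used in the text

# Correlations of every variable with mpg, dropping mpg's correlation with itself.
mpg_cor <- cor(mtcars)["mpg", ]
mpg_cor <- mpg_cor[names(mpg_cor) != "mpg"]

# Keep the variables that pass the cutoff in absolute value.
selected <- names(mpg_cor)[abs(mpg_cor) > threshold]
print(selected)

# Signs of the selected correlations (negative means MPG falls as the variable rises).
print(sign(mpg_cor[selected]))
```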

Multivariate Linear Regression

multi_fit <- lm(mpg ~ am + wt + cyl + disp, data = mtcars)
summary(multi_fit)$coefficients
##                 Estimate Std. Error    t value     Pr(>|t|)
## (Intercept) 40.898313414 3.60154037 11.3557837 8.677574e-12
## am           0.129065571 1.32151163  0.0976651 9.229196e-01
## wt          -3.583425472 1.18650433 -3.0201537 5.468412e-03
## cyl         -1.784173258 0.61819218 -2.8861142 7.581533e-03
## disp         0.007403833 0.01208067  0.6128661 5.450930e-01
summary(multi_fit)$r.squared
## [1] 0.8326661

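We see that the weight of the vehicle contributes heavily to the model: each additional unit of wt (1,000 lbs) is associated with a drop of about 3.58 MPG (p = 0.005), holding the other variables constant. The number of cylinders is also significant (p = 0.008), whereas the coefficients for am and disp are not (p = 0.92 and p = 0.55 respectively). We can conclude that this model fits better, since the R-squared has jumped from about 36% to about 83%. This model also produces a very good residual plot (Figure 3).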
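Because the simple model is nested inside the multivariate one, the improvement can also be assessed more formally. The sketch below is an addition rather than part of the original analysis: it uses an F-test via anova and compares adjusted R-squared, which penalises the extra predictors. The names comparison and adj_r2 are illustrative.

```r
# Sketch: compare the simple and multivariate models.
am_fit    <- lm(mpg ~ am, data = mtcars)
multi_fit <- lm(mpg ~ am + wt + cyl + disp, data = mtcars)

# F-test for nested models: do wt, cyl and disp jointly improve the fit?
comparison <- anova(am_fit, multi_fit)
print(comparison)

# Adjusted R-squared accounts for the number of predictors in each model.
adj_r2 <- c(simple = summary(am_fit)$adj.r.squared,
            multi  = summary(multi_fit)$adj.r.squared)
print(round(adj_r2, 3))
```

A small p-value in the second row of the anova table indicates that the added variables explain variance beyond what transmission type alone captures.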

Appendix

boxplot(mpg~am
        ,data=mtcars
        ,main="Miles Per Gallon"
        ,col="lightyellow"
        ,border="steelblue"
        ,names=c("Automatic", "Manual")
        , horizontal = TRUE)
box(which="outer")

Figure 1 - Boxplot of MPG across Manual and Automatic transmissions

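library(ggplot2)    # qplot
library(gridExtra)  # grid.arrange
# car_data is assumed to be a copy of mtcars; its definition is not shown
car_data <- mtcars
# par(mfrow) does not affect ggplot2 output, so grid.arrange handles the layout
cyl_plot <- qplot(cyl, mpg, data=car_data, color=am)
disp_plot <- qplot(disp, mpg, data=car_data, color=am)
wt_plot <- qplot(wt, mpg, data=car_data, color=am)
grid.arrange(cyl_plot, disp_plot, wt_plot, ncol=2)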

Figure 2 - cyl, disp, wt plotted against mpg

par(mfrow = c(2, 2))
plot(multi_fit)
box(which="outer")

Figure 3 - Residuals of MPG across cyl, wt, disp and am