Author: Ying Jiang
Date: 19th June 2015

Executive summary

This report analyses the mtcars dataset to determine whether an automatic or manual transmission is better for mileage (MPG), and to quantify the MPG difference between automatic and manual transmission vehicles. The findings indicate that manual cars’ mileage is higher than automatic cars’ by 7.2 MPG on average. However, transmission is a relatively minor and indirect indicator of mileage compared to major factors such as car weight. Predicting a car’s mileage from its transmission alone is meaningful, but predictions improve considerably when these major factors are included. Our models suggest that the heavier the car, the more likely it is to be an automatic, and the lower its mileage.

Data analysis

Exploratory analysis

First, we visualize the relationship between mileage (variable mpg) and transmission (factor variable am) (Figure 1):

A preliminary comparison of mean mpg shows that it is higher for manual (24.39) than for auto vehicles (17.15). The p-value for testing the difference between the means is 0.0014, well below the 0.05 threshold implied by a 95% confidence level. This indicates that the mileage for manual vehicles is significantly higher.
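The comparison above can be sketched in a few lines of R. This is a minimal sketch, assuming the built-in `mtcars` data and the default (Welch, unequal-variance) form of `t.test()`; the report does not state which variant was used, and the names `group_means` and `tt` are illustrative only.

```r
# Mean mpg per transmission type (am: 0 = automatic, 1 = manual)
group_means <- tapply(mtcars$mpg, mtcars$am, mean)
print(group_means)

# Two-sample t-test of mpg by transmission.
# Assumption: the default Welch test (var.equal = FALSE) is the one intended.
tt <- t.test(mpg ~ am, data = mtcars)
print(tt$p.value)
```

If the assumption holds, the printed means and p-value should agree with the figures quoted in the paragraph above.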

Fitting models

Simple linear regression on transmission (am)

Let:

\(Y = mpg\)
\(X_1 = am\)
\(Y = \beta_0 + \beta_1 * X_1\)

\(X_1\) is a factor variable: 0 for auto cars, 1 for manual cars. Linearly regressing \(Y\) on \(X_1\) gives the following results:

##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 17.147368   1.124603 15.247492 1.133983e-15
## am           7.244939   1.764422  4.106127 2.850207e-04

The intercept of the linear fit (17.147) is the average mpg of auto cars (\(X_1\) = 0). The coefficient of am (a slope of 7.245) is the amount by which the average mpg of manual cars exceeds that of auto cars. The positive slope indicates that mpg increases with am, i.e. we get higher mileage out of manual cars than auto cars.
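A minimal sketch of this fit, assuming the built-in `mtcars` data (the object names `fit1` and `coef_table` are illustrative, not taken from the original analysis). It also checks the interpretation given above: intercept plus slope should equal the mean mpg of manual cars.

```r
# Simple linear regression of mpg on transmission (am: 0 = auto, 1 = manual)
fit1 <- lm(mpg ~ am, data = mtcars)

# Coefficient table: estimate, standard error, t value, p-value
coef_table <- summary(fit1)$coefficients
print(coef_table)

# Intercept = mean mpg of auto cars; intercept + slope = mean mpg of manual cars
manual_mean_from_fit <- sum(coef(fit1))
print(manual_mean_from_fit)
print(mean(mtcars$mpg[mtcars$am == 1]))
```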

Exploring other models

First, we regress mpg on all other variables to get an idea of which predictors impact mpg the most:

##                Estimate  Std. Error    t value   Pr(>|t|)
## (Intercept) 12.30337416 18.71788443  0.6573058 0.51812440
## cyl         -0.11144048  1.04502336 -0.1066392 0.91608738
## disp         0.01333524  0.01785750  0.7467585 0.46348865
## hp          -0.02148212  0.02176858 -0.9868407 0.33495531
## drat         0.78711097  1.63537307  0.4813036 0.63527790
## wt          -3.71530393  1.89441430 -1.9611887 0.06325215
## qsec         0.82104075  0.73084480  1.1234133 0.27394127
## vs           0.31776281  2.10450861  0.1509915 0.88142347
## am           2.52022689  2.05665055  1.2254035 0.23398971
## gear         0.65541302  1.49325996  0.4389142 0.66520643
## carb        -0.19941925  0.82875250 -0.2406258 0.81217871

am’s coefficient is still positive, but its effect is weak (the p-value associated with the coefficient is large, 0.23). Other predictors, especially car weight, play more important and complex roles in affecting mpg. We construct a pairwise plot (Figure 2) to survey all attributes of cars in the dataset.
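The full regression can be reproduced along these lines. This is a sketch assuming the built-in `mtcars` data; ordering the rows by p-value is an addition for readability and was not necessarily part of the original analysis, and `fit_all`, `coefs` and `ranked` are illustrative names.

```r
# Regress mpg on every other column of mtcars
fit_all <- lm(mpg ~ ., data = mtcars)
coefs <- summary(fit_all)$coefficients

# Order the coefficient table by p-value, smallest first,
# to see which predictors carry the strongest evidence
ranked <- coefs[order(coefs[, "Pr(>|t|)"]), ]
print(ranked)
```

With this ordering, wt should appear at the top, ahead of am.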

Visually inspecting the pairs plot, three variables appear strongly correlated with both mpg and am: disp (displacement), drat (rear axle ratio), and wt (car weight). In particular, wt is significantly associated with am given the data (a low t-test p-value of \(6.27 \times 10^{-6}\); also see Figure 3). We investigate a model including car weight as an additional predictor:
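The visual impression from the pairs plot can be checked numerically. The sketch below assumes the built-in `mtcars` data and, for the weight comparison, the default Welch form of `t.test()`; the names `cor_tab` and `tt_wt` are illustrative.

```r
# Correlations of mpg and am with the three candidate predictors
cor_tab <- cor(mtcars[, c("mpg", "am")], mtcars[, c("disp", "drat", "wt")])
print(round(cor_tab, 2))

# Does car weight differ between automatic and manual cars?
# Assumption: default Welch two-sample t-test
tt_wt <- t.test(wt ~ am, data = mtcars)
print(tt_wt$p.value)
```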

\(Y = mpg\)
\(X_1 = am\)
\(X_2 = wt\)
\(Y = \beta_0 + \beta_1 * X_1 + \beta_2 * X_2\)

##                Estimate Std. Error     t value     Pr(>|t|)
## (Intercept) 37.32155131  3.0546385 12.21799285 5.843477e-13
## factor(am)1 -0.02361522  1.5456453 -0.01527855 9.879146e-01
## wt          -5.35281145  0.7882438 -6.79080719 1.867415e-07

The \(R^2\) value for the multivariable fit is much higher (0.753) than for the model with transmission as the only predictor (0.360), indicating that it explains more of the variability in mpg. However, the coefficient for \(X_1\) is now slightly negative and far from significant (p-value 0.99). This is consistent with \(X_1\) and \(X_2\) being correlated: once weight is accounted for, transmission adds almost nothing. Therefore, we try a third model, adding an interaction term between \(X_1\) and \(X_2\):
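A sketch of the comparison between the first two models, assuming the built-in `mtcars` data. The object names `fit1`, `fit2` and `r2` are illustrative.

```r
# Model 1: transmission only; Model 2: transmission plus weight
fit1 <- lm(mpg ~ am, data = mtcars)
fit2 <- lm(mpg ~ factor(am) + wt, data = mtcars)

# Compare the share of mpg variance explained by each model
r2 <- c(am_only = summary(fit1)$r.squared,
        am_wt   = summary(fit2)$r.squared)
print(r2)

# Transmission coefficient after adjusting for weight
print(summary(fit2)$coefficients["factor(am)1", ])
```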

\(Y = mpg\)
\(X_1 = am\)
\(X_2 = wt\)
\(Y = \beta_0 + \beta_1 * X_1 + \beta_2 * X_2 + \beta_3 * X_1 * X_2\)

##                 Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)    31.416055  3.0201093 10.402291 4.001043e-11
## wt             -3.785908  0.7856478 -4.818836 4.551182e-05
## factor(am)1    14.878423  4.2640422  3.489277 1.621034e-03
## wt:factor(am)1 -5.298360  1.4446993 -3.667449 1.017148e-03
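To read the interaction model, it can help to substitute \(X_1 = 0\) and \(X_1 = 1\) into the fitted equation, which gives one regression line per transmission type. The following worked step uses only the coefficients in the table above, rounded to three decimals:

```latex
\begin{aligned}
\text{auto } (X_1 = 0):\quad   & Y = 31.416 - 3.786\,X_2 \\
\text{manual } (X_1 = 1):\quad & Y = (31.416 + 14.878) + (-3.786 - 5.298)\,X_2
                                 = 46.294 - 9.084\,X_2 \\
\text{manual} - \text{auto}:\quad & 14.878 - 5.298\,X_2 = 0
                                 \;\Longrightarrow\; X_2 \approx 2.81
\end{aligned}
```

Under this fit, mileage falls more steeply with weight for manual cars, and the predicted advantage of manual cars disappears at a weight of roughly 2.81 (wt is recorded in units of 1000 lb in mtcars).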

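This model gives the highest \(R^2\) (0.833). The coefficient for \(X_1\) is positive again, although with the interaction term present it describes the difference between manual and auto cars at \(X_2 = 0\) rather than an overall difference. The residuals-vs-prediction plots (Figure 4) indicate that this third model gives the lowest residuals.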
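The three models can be compared side by side as follows. This is a sketch assuming the built-in `mtcars` data; the residual sum of squares is used here as a numeric stand-in for the residual plots, which is an assumption about what "lowest residuals" means, and all object names are illustrative.

```r
# Refit the three models discussed in this section
fit1 <- lm(mpg ~ am, data = mtcars)
fit2 <- lm(mpg ~ factor(am) + wt, data = mtcars)
fit3 <- lm(mpg ~ wt * factor(am), data = mtcars)  # main effects + interaction

models <- list(am_only = fit1, am_wt = fit2, am_wt_interaction = fit3)

# R^2 and residual sum of squares (RSS) for each model
comparison <- data.frame(
  r_squared = sapply(models, function(m) summary(m)$r.squared),
  rss       = sapply(models, function(m) sum(resid(m)^2)),
  row.names = names(models)
)
print(comparison)
```

Because the models are nested, RSS can only decrease as terms are added, so a lower RSS alone does not show that the extra terms are warranted.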

Conclusion

Overall, manual cars give better mileage: about 7.24 MPG higher on average than automatic cars.

To understand this difference, we note that transmission is correlated with confounding factors that have a more direct impact on mileage. Examples of such factors are displacement volume, rear axle ratio, and especially weight. A model that best explains mileage trends will have to include one or more of these major predictors. However, the complex relationships among the variables reduce the interpretive power of the model, despite an improved fit. Caution therefore needs to be exercised against over-fitting by including too many predictors.

Nevertheless, looking at transmission alone, the significantly higher mileage of manual cars has predictive power for a future car’s mileage given its transmission category.
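As an illustration of prediction from transmission alone, the sketch below uses the simple model to predict mpg for one hypothetical automatic and one hypothetical manual car, with 95% prediction intervals. It assumes the built-in `mtcars` data; `new_cars` and `pred` are illustrative names.

```r
# Simple model: mpg explained by transmission only
fit1 <- lm(mpg ~ am, data = mtcars)

# Two hypothetical new cars: one automatic (am = 0), one manual (am = 1)
new_cars <- data.frame(am = c(0, 1))

# Point predictions with 95% prediction intervals
pred <- predict(fit1, newdata = new_cars, interval = "prediction", level = 0.95)
rownames(pred) <- c("automatic", "manual")
print(pred)
```

The point predictions are simply the two group means; the width of the intervals gives a sense of how much uncertainty remains when only the transmission is known.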