Executive summary

We explore how best to fit a linear model to the mtcars data set in order to determine whether gas mileage (miles per gallon, or MPG) is higher for vehicles with manual or automatic transmissions. Deciding which variables to use as the independent variables, or regressors, entails calculating correlations between the variables, comparing different models, and looking at variance inflation factors. The simplest viable linear model yields the result that manual transmission cars do not necessarily have higher gas mileage than automatic cars.

Exploration of potential regressors for a linear model

The mtcars data set, published in the U.S. magazine Motor Trend in 1974, contains data on fuel consumption (MPG) and ten aspects of vehicle design and performance for 32 automobiles from the 1973 and 1974 model years. As we attempt to determine how much MPG depends upon the type of transmission (in column mtcars$am), we assess which of the other factors may also be relevant in determining fuel economy.

Removing the columns mpg and am from mtcars leaves us with nine potential regressors. The first part of the Appendix contains exploratory plots of mpg versus wt, hp, cyl, disp, drat, qsec, gear, and carb, using am as a factor variable (converted to the factor amLevels that is used in the models below), along with lines depicting the linear regression fit for each factor level. (I omitted a plot of mpg versus vs because that column records only whether the engine is a V engine or a straight engine, and a plot of a fit to binary data was not informative.)
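The conversion of am to amLevels is not shown in this report, so the following is only a sketch of how it might be done. The label "Manual" is inferred from the coefficient name amLevelsManual in the model output; the label "Automatic" and the choice of wt for the example plot are assumptions made for illustration.

```r
data(mtcars)

# Assumed conversion: 0 = automatic, 1 = manual (per the mtcars documentation).
# The label "Automatic" is a guess; only "Manual" appears in the model output.
mtcars$amLevels <- factor(mtcars$am, levels = c(0, 1),
                          labels = c("Automatic", "Manual"))
table(mtcars$amLevels)

# One exploratory plot: mpg versus wt, with a separate regression line
# for each transmission type.
plot(mpg ~ wt, data = mtcars, col = as.integer(mtcars$amLevels), pch = 19,
     xlab = "Weight (1000 lbs)", ylab = "MPG")
for (i in seq_along(levels(mtcars$amLevels))) {
  oneLevel <- mtcars[as.integer(mtcars$amLevels) == i, ]
  abline(lm(mpg ~ wt, data = oneLevel), col = i)
}
legend("topright", legend = levels(mtcars$amLevels), col = 1:2, pch = 19)
```

Storing amLevels as a twelfth column keeps the original eleven numeric columns intact, which is consistent with the mtcars[,1:11] indexing used for the correlation table.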

Next, I looked at correlations between the columns of mtcars.

round(cor(mtcars[,1:11], mtcars[,1:11]), digits = 2)
##        mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
## mpg   1.00 -0.85 -0.85 -0.78  0.68 -0.87  0.42  0.66  0.60  0.48 -0.55
## cyl  -0.85  1.00  0.90  0.83 -0.70  0.78 -0.59 -0.81 -0.52 -0.49  0.53
## disp -0.85  0.90  1.00  0.79 -0.71  0.89 -0.43 -0.71 -0.59 -0.56  0.39
## hp   -0.78  0.83  0.79  1.00 -0.45  0.66 -0.71 -0.72 -0.24 -0.13  0.75
## drat  0.68 -0.70 -0.71 -0.45  1.00 -0.71  0.09  0.44  0.71  0.70 -0.09
## wt   -0.87  0.78  0.89  0.66 -0.71  1.00 -0.17 -0.55 -0.69 -0.58  0.43
## qsec  0.42 -0.59 -0.43 -0.71  0.09 -0.17  1.00  0.74 -0.23 -0.21 -0.66
## vs    0.66 -0.81 -0.71 -0.72  0.44 -0.55  0.74  1.00  0.17  0.21 -0.57
## am    0.60 -0.52 -0.59 -0.24  0.71 -0.69 -0.23  0.17  1.00  0.79  0.06
## gear  0.48 -0.49 -0.56 -0.13  0.70 -0.58 -0.21  0.21  0.79  1.00  0.27
## carb -0.55  0.53  0.39  0.75 -0.09  0.43 -0.66 -0.57  0.06  0.27  1.00

Columns that appeared to be strong candidates for regressors due to their apparent correlation with mpg were wt (-0.8676594), hp (-0.7761684), disp (-0.8475514), and cyl (-0.852162).

However, the table of correlations shows that both hp and wt are highly correlated with both disp and cyl (correlations between 0.78 and 0.89). In fact, displacement is directly related to the number of cylinders (see https://en.wikipedia.org/wiki/Engine_displacement), and the two columns have a correlation of 0.90. Because the variance of the coefficient estimates is inflated when the regressors are highly correlated, it seems logical to leave both disp and cyl out of the linear model. The remaining columns do not correlate nearly as well with mpg as these four.
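The screening described above can also be done programmatically rather than by reading the table. This is a minimal sketch; the variable names cors, mpgCor, and candidates are introduced here for illustration only.

```r
data(mtcars)
cors <- cor(mtcars[, 1:11])

# Correlation of every other column with mpg, strongest (by magnitude) first
mpgCor <- cors["mpg", colnames(cors) != "mpg"]
round(mpgCor[order(abs(mpgCor), decreasing = TRUE)], 2)

# Pairwise correlations among the four strongest candidates, to see
# which of them carry largely overlapping information
candidates <- c("wt", "hp", "disp", "cyl")
round(cors[candidates, candidates], 2)
```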

Finding a model and validating against other models

Ultimately, following the principle of finding a parsimonious model and reducing the variance inflation caused by correlated regressors, I decided that the best fit would use hp and wt as regressors, together with the factor variable amLevels so that we can obtain the fits for both automatic (am = 0, the reference level) and manual (am = 1, labeled Manual in the output) transmissions:

carFit <- lm(mpg ~ hp + wt + amLevels, mtcars)
summary(carFit)$coefficients
##                   Estimate  Std. Error   t value     Pr(>|t|)
## (Intercept)    34.00287512 2.642659337 12.866916 2.824030e-13
## hp             -0.03747873 0.009605422 -3.901830 5.464023e-04
## wt             -2.87857541 0.904970538 -3.180850 3.574031e-03
## amLevelsManual  2.08371013 1.376420152  1.513862 1.412682e-01

Note that the coefficient for amLevelsManual is 2.0837 MPG, with a standard error of 1.3764 MPG.

A plot of the residuals of this fit against the fitted values is included in the Appendix. The residuals show no obvious systematic pattern, which reassures us that this is a good candidate for a model.
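The residual plot could be produced along the following lines. This is a sketch using base graphics, not necessarily the code behind the Appendix figure; the factor labels are assumed as before.

```r
data(mtcars)
# Assumed factor conversion; see the note on amLevels labels.
mtcars$amLevels <- factor(mtcars$am, levels = c(0, 1),
                          labels = c("Automatic", "Manual"))
carFit <- lm(mpg ~ hp + wt + amLevels, data = mtcars)

# Residuals versus fitted values; a dashed horizontal line at zero makes
# any curvature or funnel shape easier to spot.
plot(fitted(carFit), resid(carFit),
     xlab = "Fitted MPG", ylab = "Residual (MPG)")
abline(h = 0, lty = 2)
```

Calling plot(carFit, which = 1) gives a similar display with a smoothed trend line added.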

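For comparison, we can look at the coefficients that we get when we add either cyl or disp to the model. The coefficients for wt, hp, and amLevels do not change sign and stay broadly similar to those of our preferred model above, but the p-values for all of them increase, and neither added regressor is itself significant. This suggests that the extra regressors mostly add uncertainty to the estimates rather than explanatory power: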

summary(lm(mpg ~ hp + wt + cyl + amLevels, mtcars))$coefficients
##                   Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)    36.14653575 3.10478079 11.642218 4.944804e-12
## hp             -0.02495106 0.01364614 -1.828433 7.855337e-02
## wt             -2.60648071 0.91983749 -2.833632 8.603218e-03
## cyl            -0.74515702 0.58278741 -1.278609 2.119166e-01
## amLevelsManual  1.47804771 1.44114927  1.025603 3.141799e-01
summary(lm(mpg ~ hp + wt + disp + amLevels, mtcars))$coefficients
##                    Estimate Std. Error    t value     Pr(>|t|)
## (Intercept)    34.209443370 2.82282610 12.1188632 1.979953e-12
## hp             -0.039323213 0.01243358 -3.1626624 3.842032e-03
## wt             -3.046747000 1.15711931 -2.6330448 1.382936e-02
## disp            0.002489354 0.01037681  0.2398959 8.122229e-01
## amLevelsManual  2.159270737 1.43517565  1.5045341 1.440531e-01
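Another way to make the same comparison is a nested-model F-test with anova(). This is offered as a complementary sketch, not as the method used in this report; the names base, withCyl, and withDisp are illustrative. For a single added regressor, the F-test p-value coincides with the t-test p-value of that regressor in the tables above (about 0.21 for cyl and 0.81 for disp).

```r
data(mtcars)
# Assumed factor conversion; see the note on amLevels labels.
mtcars$amLevels <- factor(mtcars$am, levels = c(0, 1),
                          labels = c("Automatic", "Manual"))

base     <- lm(mpg ~ hp + wt + amLevels, data = mtcars)
withCyl  <- update(base, . ~ . + cyl)   # same fit as hp + wt + cyl + amLevels
withDisp <- update(base, . ~ . + disp)  # same fit as hp + wt + disp + amLevels

# Does adding cyl or disp significantly reduce the residual sum of squares?
cylTest  <- anova(base, withCyl)
dispTest <- anova(base, withDisp)
cylTest
dispTest
```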

We can also compare variance inflation factors to show that including more variables inflates the variance of the coefficient estimates. This is seen in the values returned by calling vif() on linear models with additional regressors:

library(car)
vif(carFit)
##       hp       wt amLevels 
## 2.088124 3.774838 2.271082
vif(lm(mpg ~ hp + wt + cyl + amLevels, mtcars))
##       hp       wt      cyl amLevels 
## 4.310029 3.988305 5.333685 2.546159
vif(lm(mpg ~ hp + wt + disp + amLevels, mtcars))
##       hp       wt     disp amLevels 
## 3.381008 5.963704 7.695157 2.386005
vif(lm(mpg ~ hp + wt + cyl + disp + amLevels, mtcars))
##        hp        wt       cyl      disp  amLevels 
##  4.501859  6.079452  7.209456 10.401420  2.553064
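To make the meaning of these numbers concrete, the variance inflation factor of a regressor can be computed by hand as 1 / (1 - R^2), where R^2 comes from regressing that regressor on the other regressors in the model. The helper manualVif below is hypothetical and written only for this illustration. It uses the numeric am column, which gives the same value as the two-level factor.

```r
data(mtcars)

# Hypothetical helper: regress one regressor on the others and convert
# the resulting R-squared into a variance inflation factor.
manualVif <- function(term, others, data) {
  r2 <- summary(lm(reformulate(others, response = term), data = data))$r.squared
  1 / (1 - r2)
}

# VIFs for the preferred model mpg ~ hp + wt + amLevels
c(hp = manualVif("hp", c("wt", "am"), mtcars),
  wt = manualVif("wt", c("hp", "am"), mtcars),
  am = manualVif("am", c("hp", "wt"), mtcars))
```

These should agree with the vif(carFit) output above. A VIF of 2.09 for hp, for example, means that the variance of the hp coefficient is roughly twice what it would be if hp were uncorrelated with wt and am.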

Conclusion

Our parsimonious model estimates that a car with a manual transmission gets 2.0837 MPG more than an automatic car of the same horsepower and weight, with a standard error of 1.3764 MPG. The estimate is therefore only about 1.5 standard errors from zero (t = 1.51), which is not a big enough margin to draw a firm conclusion. The p-value of 0.14127 means that, at the conventional 0.05 level, we cannot reject the null hypothesis that the type of transmission makes no difference to gas mileage once horsepower and weight are accounted for. In other words, the data do not show that the type of transmission matters to gas mileage in the way that the weight and horsepower of the car do.
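The same conclusion can be expressed as a confidence interval for the transmission coefficient. The sketch below computes it with confint() and then by hand from the estimate, its standard error, and the t distribution with the model's residual degrees of freedom; the factor labels are assumed as before, and est, se, and tCrit are illustrative names.

```r
data(mtcars)
# Assumed factor conversion; see the note on amLevels labels.
mtcars$amLevels <- factor(mtcars$am, levels = c(0, 1),
                          labels = c("Automatic", "Manual"))
carFit <- lm(mpg ~ hp + wt + amLevels, data = mtcars)

# 95% confidence interval for the manual-versus-automatic difference
ci <- confint(carFit, "amLevelsManual", level = 0.95)
ci

# The same interval by hand: estimate +/- t quantile * standard error
est   <- coef(summary(carFit))["amLevelsManual", "Estimate"]
se    <- coef(summary(carFit))["amLevelsManual", "Std. Error"]
tCrit <- qt(0.975, df = df.residual(carFit))
est + c(-1, 1) * tCrit * se
```

An interval that contains zero is consistent with the p-value exceeding 0.05.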

Appendix

  1. Exploratory plots that show the relationship between MPG and several of the other mtcars column values.

  2. Plot of residuals of the selected fit versus the fitted values.