Executive Summary

In this project we pretend we work for Motor Trend, and we are given the following two problems to solve:

  1. “Is an automatic or manual transmission better for MPG?”
  2. “Quantify the MPG difference between automatic and manual transmissions.”

We use the mtcars dataset in R to explore the relationship between miles per gallon (mpg) and transmission type (automatic or manual). We find that manual vehicles have higher mpg, but that there is strong evidence of confounding. Specifically, when weight, number of cylinders or displacement is added to the linear model, transmission type loses its statistical significance; it stays significant alongside horsepower alone, but not once weight is added as well. Because of this, and because variables like horsepower and weight have a more intuitive connection with mpg, we don’t have enough evidence to claim a causal effect of transmission on mpg. Manual vehicles show higher mpg most likely because they are also lighter and have lower horsepower on average. We estimate the difference in mean mpg between manual and automatic vehicles at 7.245 mpg, and we describe the uncertainty in this number with a confidence interval. We also examine residuals and discuss model selection at length. Model accuracy could be improved with a larger sample or more advanced methods, but linear models alone already give useful insight into which factors besides transmission explain mpg.

Interpreting the Coefficients

We begin with a very simple linear model, where mpg is explained by the transmission variable “am”. This is a binary indicator variable, coded 1 for manual vehicles and 0 for automatic vehicles.

summary(lm(mtcars$mpg ~ mtcars$am))$coef
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 17.147368   1.124603 15.247492 1.133983e-15
## mtcars$am    7.244939   1.764422  4.106127 2.850207e-04

We see an intercept of 17.147 and a coefficient for the transmission variable of 7.245. This means that the mean mpg for automatic vehicles is 17.147 mpg, and the mean for manual vehicles is 24.392 (17.147 + 7.245). Note that the am variable is statistically significant (p < .001).
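As a quick check of this interpretation, the following sketch compares the fitted coefficients with the raw group means. The object names (group_means, fit_am, from_model) are illustrative and do not appear in the original analysis.

```r
# Mean mpg by transmission type (0 = automatic, 1 = manual)
group_means <- tapply(mtcars$mpg, mtcars$am, mean)

# Same simple model as above, written with the data argument
fit_am <- lm(mpg ~ am, data = mtcars)
b <- coef(fit_am)

# The intercept should equal the automatic mean, and
# intercept + slope should equal the manual mean
from_model <- c(automatic = unname(b["(Intercept)"]),
                manual    = unname(b["(Intercept)"] + b["am"]))

print(group_means)
print(from_model)
```

With a single 0/1 predictor, least squares reproduces the two group means exactly, so the two printed vectors should agree.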

In the next section we consider other models which include predictors such as vehicle weight, horsepower, number of cylinders and others. As we run statistical tests on these models, we will interpret their coefficients individually.

Exploratory Data Analysis

We begin by drawing a pairs plot across the predictor variables. We omit the variables which have no intuitive connection to the mpg outcome variable. For example, the number of gears is not connected to mpg in any intuitive sense, nor does it show a clear monotonic relationship with mpg, as a simple scatterplot shows. So we omit this variable.

We are left with transmission (am) plus variables for horsepower (hp), weight (wt), number of cylinders (cyl) and displacement (disp). Note that weight is measured in units of 1,000 lbs and displacement is in cubic inches. This will help us interpret our coefficients.

mtcars2 <- mtcars[, c("mpg", "cyl", "disp", "hp", "wt", "am")]
pairs(mtcars2)

The first column of our pairs plot shows the relationship between mpg and our five predictor variables. There appears to be a negative correlation between mpg and each of cyl, disp, hp and wt. The relationship with transmission is also clear (higher mpg for manual vehicles, as discussed previously). See the Appendix for simple linear regression outputs for each of hp, wt, cyl and disp against mpg. Each of these variables is individually related to mpg.
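The visual impression from the pairs plot can be backed up with correlation coefficients. This is a minimal sketch, not part of the original analysis; vars and cors are illustrative names.

```r
# Same six columns as the pairs plot, selected by name
vars <- c("mpg", "cyl", "disp", "hp", "wt", "am")

# Pearson correlation of mpg with each of the five predictors
cors <- cor(mtcars[, vars])["mpg", -1]
print(round(cors, 2))
```

Negative values for cyl, disp, hp and wt and a positive value for am would match what the first column of the pairs plot suggests.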

Model Selection

We saw previously that the transmission variable (am) is statistically significant when included in a simple linear model against mpg. When we add the disp variable, however, the am variable is no longer statistically significant, while the disp variable is statistically significant. This is a key theme in regression models - that adding in new variables can change your coefficients drastically and also change variables’ statistical significance. We plot am versus disp (not shown) and see that manual vehicles have lower displacement than automatic ones. So we have evidence of a confounding variable, disp, which should make us question whether transmission type truly has a direct impact on mpg.

summary(lm(mtcars$mpg ~ mtcars$am + mtcars$disp))$coef
##                Estimate  Std. Error   t value     Pr(>|t|)
## (Intercept) 27.84808111 1.834071377 15.183750 2.452658e-15
## mtcars$am    1.83345825 1.436099585  1.276693 2.118396e-01
## mtcars$disp -0.03685086 0.005781896 -6.373490 5.747528e-07

Similarly we see that the am variable loses statistical significance when you add the wt variable, and that there is evidence of confounding. Manual cars have lower weight on average than automatics, which can be seen with a scatterplot.

summary(lm(mtcars$mpg ~ mtcars$am + mtcars$wt))$coef
##                Estimate Std. Error     t value     Pr(>|t|)
## (Intercept) 37.32155131  3.0546385 12.21799285 5.843477e-13
## mtcars$am   -0.02361522  1.5456453 -0.01527855 9.879146e-01
## mtcars$wt   -5.35281145  0.7882438 -6.79080719 1.867415e-07

The same thing is true for the cyl variable. When we account for the number of cylinders, the am variable once again loses its statistical significance at the 5% level, although only narrowly (p = .056). A simple scatterplot also shows that manual vehicles have fewer cylinders on average.

summary(lm(mtcars$mpg ~ mtcars$am + mtcars$cyl))$coef
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 34.522443  2.6031842 13.261621 7.694408e-14
## mtcars$am    2.567035  1.2914280  1.987749 5.635445e-02
## mtcars$cyl  -2.500958  0.3608282 -6.931159 1.284560e-07
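The confounding claims in this section rest on plots that are not shown. As a numeric counterpart, the sketch below tabulates the mean of each candidate confounder by transmission type. The name means_by_am is illustrative, and hp is included because horsepower comes up later in the discussion.

```r
# Mean displacement, weight, cylinders and horsepower by transmission type
# (am = 0 is automatic, am = 1 is manual)
means_by_am <- aggregate(cbind(disp, wt, cyl, hp) ~ am, data = mtcars, FUN = mean)
print(means_by_am)
```

If the manual row is lower in every column, transmission type is entangled with all four variables, which is consistent with the am coefficient shrinking once any of them enters the model.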

Interestingly, the transmission variable maintains its statistical significance when hp is included.

summary(lm(mtcars$mpg ~ mtcars$am + mtcars$hp))$coef
##               Estimate  Std. Error   t value     Pr(>|t|)
## (Intercept) 26.5849137 1.425094292 18.654845 1.073954e-17
## mtcars$am    5.2770853 1.079540576  4.888270 3.460318e-05
## mtcars$hp   -0.0588878 0.007856745 -7.495191 2.920375e-08

However, when we add the weight variable to the model, transmission type once again loses its statistical significance.

summary(lm(mtcars$mpg ~ mtcars$am + mtcars$hp + mtcars$wt))$coef
##                Estimate  Std. Error   t value     Pr(>|t|)
## (Intercept) 34.00287512 2.642659337 12.866916 2.824030e-13
## mtcars$am    2.08371013 1.376420152  1.513862 1.412682e-01
## mtcars$hp   -0.03747873 0.009605422 -3.901830 5.464023e-04
## mtcars$wt   -2.87857541 0.904970538 -3.180850 3.574031e-03

If we include all five predictor variables in the model, we see that only weight is statistically significant at the 5% level. Horsepower shows a p-value of .055, meaning it narrowly misses the cutoff for statistical significance.
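The output for the five-predictor model is not reproduced in this report. A sketch of how it could be fitted is given here; the order of the terms is an assumption, and fit_all is an illustrative name.

```r
# Full model with all five predictors considered in this report
fit_all <- lm(mpg ~ am + hp + wt + cyl + disp, data = mtcars)

# Coefficient table: estimate, standard error, t value and p-value per term
print(round(summary(fit_all)$coef, 4))
```

The last column of the table holds the p-values referred to in the text.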

In summary, weight appears to be a core component of a model for mpg; transmission does not. There is clear evidence of confounding, which makes it hard to conclude that transmission type directly affects mpg. Instead, we see that manual cars are lighter and have lower horsepower, and both of these factors are connected with higher mpg. So, yes, manual transmission is associated with better mpg, but the relationship is likely driven by factors like horsepower and weight rather than by the automatic/manual variable itself.

Quantifying the Difference & Uncertainty

While there’s not strong evidence that transmission directly affects mpg, it certainly is true that manual cars show higher mpg on average. (Again, manual cars weigh less on average, and weight has a strong inverse relationship with mpg.) But we can still quantify the difference between automatic and manual vehicles, understanding that confounding variables are likely what is driving the difference.

When comparing the means of two groups defined by a two-level factor (e.g. automatic or manual), we can use a two-sample t test. R’s t.test defaults to the Welch version, which does not assume that the two groups have a common variance, and that is the version run here. We reject the null hypothesis that the two groups have the same mean, with a p-value of .001 and a 95% confidence interval of [-11.28, -3.21] for the true difference in means (automatic minus manual).

t.test(mtcars$mpg[mtcars$am==0], mtcars$mpg[mtcars$am==1],
       paired = FALSE)$conf.int
## [1] -11.280194  -3.209684
## attr(,"conf.level")
## [1] 0.95

In other words, based on this dataset, we are 95% confident that the true difference in mean mpg between manual and automatic vehicles is between 3.21 and 11.28 mpg, with manuals higher than automatics. Our best estimate is that manuals are on average approximately 7.245 mpg higher. Again, however, it is a mistake to infer causality between transmission and mpg, given the confounding variables such as weight and horsepower discussed earlier: the difference is probably explained by variables other than transmission type. There is considerable uncertainty in the 7.245 estimate, with a 95% confidence interval that is about 8 mpg wide (3.21 to 11.28). But zero is not contained in the confidence interval, so we can infer that there is a true difference in means between the two groups.
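To connect the t test with the regression from the first section, the sketch below runs both the Welch test and the pooled-variance test and compares them with the simple linear model. It uses the formula interface of t.test rather than the subsetting shown above, and all object names are illustrative.

```r
# Welch test (R's default, var.equal = FALSE) and pooled-variance test
welch  <- t.test(mpg ~ am, data = mtcars)
pooled <- t.test(mpg ~ am, data = mtcars, var.equal = TRUE)

# Point estimate: manual mean minus automatic mean
diff_means <- unname(diff(welch$estimate))

# Simple regression of mpg on am, and the 95% interval for its slope
fit_am <- lm(mpg ~ am, data = mtcars)
ci_lm  <- confint(fit_am)["am", ]

print(diff_means)
print(welch$conf.int)
print(pooled$conf.int)
print(ci_lm)
```

The point estimate equals the am coefficient from the simple model. The pooled interval matches the regression interval up to sign, because t.test reports automatic minus manual while the slope is manual minus automatic. The Welch interval is somewhat wider because it does not pool the two group variances.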

Residuals & Diagnostics

If we use the simple linear model where mpg is explained by weight alone, we can plot some residuals and run diagnostics to investigate further the strength of our model. This may help us develop the model further in the future and is an important step in the modeling process.

fit <- lm(mtcars$mpg ~ mtcars$wt)
summary(fit)$coef
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 37.285126   1.877627 19.857575 8.241799e-19
## mtcars$wt   -5.344472   0.559101 -9.559044 1.293959e-10
e <- resid(fit)
plot(mtcars$wt, e, xlab = "Weight (1,000s of lbs)",
     ylab = "residuals (mpg)")

We see that the middle-weight vehicles appear to have lower residuals, while the vehicles at the low-end and high-end of the weight spectrum have higher residuals. This warrants further investigation and model selection, including potentially adding a squared term for weight and including other variables.
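One way to follow up on the curvature is sketched below: add a squared weight term and compare the two nested models. This is an illustration of the suggestion above, not an analysis from the original report, and fit_lin and fit_quad are hypothetical names.

```r
# Linear-in-weight model and the same model with a squared weight term
fit_lin  <- lm(mpg ~ wt, data = mtcars)
fit_quad <- lm(mpg ~ wt + I(wt^2), data = mtcars)

# Nested-model F test: does the squared term reduce the residual error?
print(anova(fit_lin, fit_quad))

# Residuals of the quadratic fit against weight, with a zero reference line
plot(mtcars$wt, resid(fit_quad),
     xlab = "Weight (1,000s of lbs)", ylab = "residuals (mpg)")
abline(h = 0, lty = 2)
```

If the U-shaped pattern is weaker in this plot than in the one above, the squared term is capturing some of the structure the straight-line model misses.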

Conclusion

Although manual vehicles show higher mpg, that is likely because manual vehicles are also lighter and have lower horsepower on average. We estimated the difference in mean mpg between manual and automatic vehicles at 7.245 mpg, and described the uncertainty in this number with a confidence interval. We also examined residuals and discussed model selection at length. Model accuracy could be improved with a larger sample or more advanced methods, but linear models alone already give useful insight into which factors besides transmission explain mpg.

Appendix

Tables of statistical tests - mpg vs. predictor variables:

summary(lm(mtcars$mpg ~ mtcars$cyl))$coef   
##             Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 37.88458  2.0738436 18.267808 8.369155e-18
## mtcars$cyl  -2.87579  0.3224089 -8.919699 6.112687e-10
summary(lm(mtcars$mpg ~ mtcars$disp))$coef  
##                Estimate  Std. Error   t value     Pr(>|t|)
## (Intercept) 29.59985476 1.229719515 24.070411 3.576586e-21
## mtcars$disp -0.04121512 0.004711833 -8.747152 9.380327e-10
summary(lm(mtcars$mpg ~ mtcars$wt))$coef  
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 37.285126   1.877627 19.857575 8.241799e-19
## mtcars$wt   -5.344472   0.559101 -9.559044 1.293959e-10