Executive Summary

Our goal is to investigate whether the transmission type affected the fuel efficiency (in miles per gallon) of several cars in 1974, and then to quantify that difference. Modelling mpg on transmission type alone yields a positive and statistically significant result: cars with a manual transmission average about 7.24 more miles per gallon than cars with an automatic one. However, once the car’s weight is added to the model, the transmission coefficient becomes slightly negative and statistically insignificant (p value of about 0.99). An ANOVA comparison shows that the model including weight explains the variation in mpg significantly better than transmission type alone. Adding the number of carburetors returns the transmission coefficient to a positive value, but it is still not statistically significant. ANOVA indicates that adding the carburetor variable to the transmission + weight model is a significant improvement over modelling mpg on transmission and weight alone. Overall we conclude that transmission type does not have a significant effect on mpg once weight is accounted for, and that the estimated size and sign of the transmission effect depend on which other variables are included in the model.

Main Questions

Data Processing

To make the data slightly easier to work with and report on, I added a column named “Transmission” with the values “Automatic” and “Manual”. The same information is already present in the “am” column of the original dataset (0 = automatic, 1 = manual), but the labelled column reduces the need to adjust axis labels or legend text in the figures.
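As a quick sanity check on the new column, the sketch below cross-tabulates it against the original “am” values. It assumes the built-in mtcars dataset and uses base R's transform() instead of the dplyr call in the appendix, purely to keep the example free of package dependencies.

```r
# Add the labelled factor (0 = automatic, 1 = manual) using base R only
mtcars1 <- transform(
  mtcars,
  Transmission = factor(am, levels = c(0, 1), labels = c("Automatic", "Manual"))
)

# Cross-tabulate the new labels against the original 0/1 coding;
# every car should fall on the diagonal of this table
counts <- table(mtcars1$Transmission, mtcars1$am)
print(counts)
```

If the recoding is right, the off-diagonal cells are zero and the row totals match the group sizes reported with the predicted values (19 automatic, 13 manual).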

Exploratory analysis

A simple box plot display of the mpg on the y axis and the transmission type on the x axis suggests that a manual transmission may have a positive effect on mpg.

A graph using numeric values for the transmission levels shows a positive slope for the linear regression line. Unfortunately that sort of graph is awkward to interpret, because there is no such thing as a transmission that is (for example) 30% manual and 70% automatic. Real-world transmissions that combine automatic and manual functionality in certain ways would still be described by a discrete (factor) variable rather than a continuous one.

Least Squares Linear Regressions

A regression of mpg on transmission type is easy to interpret: a manual transmission is associated with an mpg about 7.24 higher than an automatic transmission.
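With a single two-level predictor, the regression slope is simply the difference between the two group means. The following sketch illustrates this on the built-in mtcars data; the variable names (group_means, fit_am, slope, mean_diff) are illustrative and do not appear in the appendix code.

```r
# Mean mpg for each transmission group (am: 0 = automatic, 1 = manual)
group_means <- tapply(mtcars$mpg, mtcars$am, mean)

# Slope on am from the simple least squares regression
fit_am <- lm(mpg ~ am, data = mtcars)
slope <- unname(coef(fit_am)["am"])

# Manual-minus-automatic difference in mean mpg
mean_diff <- unname(group_means["1"] - group_means["0"])

# The two numbers should agree up to floating point error
print(c(mean_diff = mean_diff, slope = slope))
```

This is why the coefficient describes an average difference between the two groups of cars rather than the effect of changing the transmission on any one car.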

The predicted values for this regression take only two distinct values: 17.1473684 for the 19 cars with automatic transmissions and 24.3923077 for the 13 cars with manual transmissions.

There is little of interest in a plot of those values; it simply shows two columns of points.

So for the simple linear regression of mpg as a function of transmission type we conclude the following:

  • For an automatic transmission the expected mpg is about 17.1473684.
  • For a manual transmission it is 17.1473684 + 1 * 7.2449393 = 24.3923077.
  • Both coefficients are statistically significant.
  • The F statistic of 16.8602788 and its p value of 0.000285 say that the model fits significantly better than an intercept-only model.
  • The \(R^2\) value of 0.3597989 and the adjusted \(R^2\) value of 0.3384589 indicate that the model nevertheless explains only about a third of the variance in the data.
  • So there must be other variables that help to explain the results. It is likely that the transmission type variable is picking up variation due to some other predictor such as displacement, weight, engine type, number of forward gears, or number of carburetors.
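One way to see which of those predictors the transmission variable might be standing in for is to look at how each one relates to transmission type. The sketch below is an illustrative check, not part of the original analysis; it assumes the candidate predictors correspond to the mtcars columns wt, disp, vs, gear, and carb.

```r
# Mean weight (in 1000 lbs) by transmission type (0 = automatic, 1 = manual)
wt_by_am <- tapply(mtcars$wt, mtcars$am, mean)
print(round(wt_by_am, 2))

# Correlation of the 0/1 transmission indicator with each candidate predictor
candidates <- c("wt", "disp", "vs", "gear", "carb")
cors <- sapply(mtcars[, candidates], function(x) cor(x, mtcars$am))
print(round(cors, 2))
```

A predictor that is strongly correlated with transmission type and also affects mpg can account for part of the raw difference between the two groups.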

Perhaps the vehicle’s weight explains the data better?

If we include the weight in the model then the \(R^2\) value rises to 0.7528348, much higher than for the transmission-only model, which says that the new model accounts for far more of the variation in mpg. The intercept has increased from 17.1473684 to 37.3215513, with a standard error of approximately 3.0546385. More interestingly, the coefficient on the transmission type has gone from 7.2449393 to -0.0236152. Not only has it shrunk, but its sign has flipped: instead of a significant positive effect, the point estimate now says that switching from an automatic transmission to a manual one decreases the mpg very slightly. However, its p value is 0.9879146, so in this model the transmission has no significant effect.
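The standard errors can be turned into confidence intervals with confint(). The sketch below refits the transmission + weight model on the built-in mtcars data (using base R to build the Transmission factor) and prints 95% intervals; the 95% level is an assumption chosen to match the 0.05 threshold used later in the report.

```r
# Rebuild the labelled factor and refit the transmission + weight model
mtcars1 <- transform(
  mtcars,
  Transmission = factor(am, levels = c(0, 1), labels = c("Automatic", "Manual"))
)
fit2 <- lm(mpg ~ Transmission + wt, data = mtcars1)

# 95% confidence intervals for each coefficient
ci <- confint(fit2, level = 0.95)
print(ci)
```

An interval for the transmission coefficient that straddles zero is another way of seeing that the data cannot distinguish a small positive effect from a small negative one once weight is in the model.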

A graph of the residuals as a function of the predicted values shows no significant correlation.

We can use ANOVA to compare the two models.

The low p value of \(1.867415\times 10^{-7}\) indicates that adding weight as a predictor makes a significant improvement to the model.
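The F statistic behind this p value can be reproduced by hand from the residual sums of squares in the ANOVA table in the appendix. This is a minimal sketch using the rounded values printed there (RSS of 720.90 on 30 degrees of freedom and 278.32 on 29), so it matches the table only to about three decimal places.

```r
# Residual sums of squares and degrees of freedom from the ANOVA table
rss1 <- 720.90; df1 <- 30   # mpg ~ Transmission
rss2 <- 278.32; df2 <- 29   # mpg ~ Transmission + wt

# Partial F statistic: drop in RSS per added parameter,
# divided by the residual mean square of the larger model
f_stat <- ((rss1 - rss2) / (df1 - df2)) / (rss2 / df2)

# Upper-tail probability under the F distribution
p_val <- pf(f_stat, df1 - df2, df2, lower.tail = FALSE)

print(c(F = f_stat, p = p_val))
```

Because only one parameter is added, this p value is the same as the one on the weight coefficient in the summary of the larger model.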

Perhaps the number of carburetors is also significant. Being able to burn more fuel at one time would increase performance, and that usually decreases fuel efficiency.

The p values indicate that the intercept, weight, and number of carburetors are all significant predictors. The manual-transmission coefficient is not significant in this model (the automatic baseline is part of the intercept), because its p value of 0.1364895 is larger than our threshold of 0.05. However, the coefficient on the transmission type variable is positive again. The new \(R^2\) value is higher than in the previous model, but it has only gone up by 0.0556801, and we know that adding regressors can never reduce \(R^2\), so the increase on its own is weak evidence of a better model.
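Since \(R^2\) cannot fall when a regressor is added, the adjusted \(R^2\), which penalises extra parameters, is a fairer comparison across the three models. The sketch below computes both for each model on the built-in mtcars data; it uses the numeric am column directly, which gives the same fit as the labelled factor.

```r
# The three nested models considered in the report
fits <- list(
  am_only    = lm(mpg ~ am, data = mtcars),
  am_wt      = lm(mpg ~ am + wt, data = mtcars),
  am_wt_carb = lm(mpg ~ am + wt + carb, data = mtcars)
)

# Collect R^2 and adjusted R^2 into one row per model
fit_stats <- t(sapply(fits, function(f) {
  s <- summary(f)
  c(r2 = s$r.squared, adj_r2 = s$adj.r.squared)
}))

print(round(fit_stats, 4))
```

If the adjusted value also rises from one model to the next, the added variable is doing more than just absorbing a degree of freedom.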

A plot of the residuals vs. the fitted values again shows little correlation.

From another ANOVA comparison we conclude that adding the number of carburetors improves the model significantly over the ‘transmission and weight’ model (p value of 0.008046). Residual plots are in the appendix. They show no obvious heteroskedasticity and a roughly normal distribution of the residuals. None of the points appears to have an unusually large Cook’s distance, so no single observation is likely to be adversely affecting the model.
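The visual checks can be supplemented with a couple of numeric ones. The sketch below is an illustrative addition rather than part of the original analysis: it lists the largest Cook's distances for the final model and runs a Shapiro-Wilk test on the residuals. The 4/n cutoff is a common rule of thumb and an assumption here, not a threshold taken from the report.

```r
# Final model: transmission + weight + number of carburetors
fit3 <- lm(mpg ~ am + wt + carb, data = mtcars)

# Cook's distance for every observation, largest first
cd <- cooks.distance(fit3)
print(round(head(sort(cd, decreasing = TRUE), 3), 3))

# Cars exceeding the 4/n rule-of-thumb cutoff (assumed threshold)
flagged <- names(cd)[cd > 4 / nrow(mtcars)]
print(flagged)

# Shapiro-Wilk test of normality on the residuals
sw <- shapiro.test(residuals(fit3))
print(sw)
```

Any flagged cars are candidates for a closer look rather than automatic removal; with only 32 observations, the rule of thumb is fairly easy to exceed.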

We could continue to add variables and look at the ANOVA comparisons, but the conclusion here is fairly clear.

Appendix

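# Load packages for data manipulation and plotting
library(dplyr)
library(ggplot2)
# Add a labelled factor version of am (0 = automatic, 1 = manual)
mtcars1 <- mtcars %>% mutate(Transmission = as.factor(ifelse(am == 0, "Automatic", "Manual")))
# Box plot of mpg by transmission type
ggplot(mtcars1, aes(x = Transmission, y = mpg, fill = Transmission)) + geom_boxplot()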

# Scatter plot with am treated as a number, plus a least squares line
ggplot(mtcars, aes(x = am, y = mpg, fill = am)) + geom_point(aes(colour = am)) + geom_smooth(method = lm)

# Model 1: mpg as a function of transmission type only
fit1 <- lm(mpg ~ Transmission, data = mtcars1)
summary(fit1)$coefficients
##                     Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)        17.147368   1.124603 15.247492 1.133983e-15
## TransmissionManual  7.244939   1.764422  4.106127 2.850207e-04
# Fitted values for Model 1 (one value per transmission group)
pred1 <- predict(fit1)
table(pred1)
## pred1
## 17.1473684210526 24.3923076923077 
##               19               13
# Model 2: add vehicle weight
fit2 <- lm(mpg ~ Transmission + wt, data = mtcars1)
summary(fit2)$coefficients
##                       Estimate Std. Error     t value     Pr(>|t|)
## (Intercept)        37.32155131  3.0546385 12.21799285 5.843477e-13
## TransmissionManual -0.02361522  1.5456453 -0.01527855 9.879146e-01
## wt                 -5.35281145  0.7882438 -6.79080719 1.867415e-07
# Compare Model 1 and Model 2 with a partial F test
anova(fit1, fit2)
## Analysis of Variance Table
## 
## Model 1: mpg ~ Transmission
## Model 2: mpg ~ Transmission + wt
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1     30 720.90                                  
## 2     29 278.32  1    442.58 46.115 1.867e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Model 3: add the number of carburetors
fit3 <- lm(mpg ~ Transmission + wt + carb, data = mtcars1)
summary(fit3)$coefficients
##                     Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)        34.016296  2.9713276 11.448181 4.485516e-12
## TransmissionManual  2.526258  1.6478794  1.533036 1.364895e-01
## wt                 -3.633994  0.9281206 -3.915433 5.269167e-04
## carb               -1.159288  0.4062840 -2.853393 8.046207e-03
# Compare all three nested models
anova(fit1, fit2, fit3)
## Analysis of Variance Table
## 
## Model 1: mpg ~ Transmission
## Model 2: mpg ~ Transmission + wt
## Model 3: mpg ~ Transmission + wt + carb
##   Res.Df    RSS Df Sum of Sq       F    Pr(>F)    
## 1     30 720.90                                   
## 2     29 278.32  1    442.58 57.4719 2.941e-08 ***
## 3     28 215.62  1     62.70  8.1418  0.008046 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Diagnostic plots for Model 3 in a 2 x 2 grid
par(mfrow = c(2, 2))
plot(fit3)