Motor Trend MPG Study

Question

Is an automatic or manual transmission better for MPG in a car from 1974?
Can we Quantify the MPG difference between an automatic and manual transmission?

Executive Summary

By analyzing the variables in the mtcars dataset, we determine that the variables weight of car, horsepower, and transmission type can be used to create an accurate model capable of predicting MPG of cars in 1974. Our findings indicate that while holding all other variables constant, a car with a manual transmission will achieve 2.08 MPG higher than a car with an automatic transmission. On average, cars with manual transmissions achieve 7.25 MPG better than cars with automatic transmissions when taking into account all other variables.

Explanation of the dataset

The “mtcars” dataset is a dataset extracted from the 1974 edition of Motor Trend magazine. It comprises of the fuel consumption and other statistics of 32 different cars from model years 1973-74.

The variables included in the dataset are miles per gallon, number of cylinders, engine displacement, horsepower, rear axle ratio, weight in half tons (1000 lbs), quarter mile time, engine archetecture (v shaped or straight), transmission type (automatic or manual), number of gears, and number of carberators.

Exploratory Data Analysis

Taking a quick look at how all the different variables effect MPG.

Model 1

For the first model, we will fit all the variables in the mtcars dataset into a multivariate model.

When fitting all available variables to model MPG, we achieve an R-squared of 98%. That means that 98% of the total variability is explained with the linear model. Although that may sound good, R-squared can be a misleading statistic since adding more variables will always increase R-squared, irregardless on whether they should actually be included in a predictive model.

When looking at all the coefficients’ p-values, we see that NONE of them are significant in this model. This is a problem; we cannot take this model seriously in this state. One reason this may be happening is because some variables may not actually be very relevant to predicting MPG. From our exploratory data analysis, there are some suspicions on whether gear count is a good indicator to predict MPG. Adding unneccessary variables to a model increases the actual standard errors of all other regressors, making them all insignificant. Another reason why all the variables are coming up as insignificant is because many of the variables may be related and highly correlated to each other. Thinking empirically, variables such as horsepower and number of cylinders may be highly correlated. A good way of determining whether or not other variables are correlated to each other is to use a correlation matrix.

Correlation Matrix

From the correlation matrix, we can see that there seems to be strong positive correlations between number of cylinders cyl, engine displacement disp, horsepower hp, and weight of car wt. To test whether or not we can omit some similar variables to create a simpler model, we can use an nested models and anova tests.

Nested Models

From the correlation matrix, we saw that car weight had the highest negative correlation to MPG, indicating that it may be a good predictor of MPG. This is further confirmed by the series of anova tests done. From the anova tests, we can see that when other variables are used as the primary predictor to MPG, adding the weight variable to the model is highly significant. And when weight is used as the primary predictor to MPG, adding other variables to the model is not as significant. Out of the four variables that we found were positively correlated to each other, it may be best to include weight in the final model since it is the variable that seems to add most to the model. In addition to weight, it may also be good to include horsepower since according to the nested models, after using weight as the primary predictor of MPG, the horsepower variable seems the most significant.

Model 2

After removing variables that were highly correlated to each other as well as the variable number of gears, we can finally see that at least one variable is showing significance; and it just so happens to be the single variable out of the correlated cluster that we decided to keep. This is good news since we know of at least one good predictor of MPG. Perhaps we can go further with this method. Going back to the correlation matrix, we see that the variables rear axle ratio (drat) and quarter-mile time (qsec) are highly negatively correlated to horsepower. It may be good to analyse whether keeping drat and qsec in the model is significant.

Nested Models Tests

When we keep weight (wt) and horsepower (hp) in the model, we see that adding quarter-mile time (qsec) and rear axle ratio (drat) variables adds nothing to the model. It will be best to omit them from the model.

Model 3

Remember that including more and more variables will inflate the standard errors. Now that we have removed a number of variables, we can now begin to trust the standard errors and p-values from the linear model’s coefficients table more. From the latest linear model, we can see that two variables in particular are not significant at all, Engine Shape (vs) and Number of Carborators (carb). For the next model, we will remove them.

Final Model (Model 4)

For the final model, we can see from the regression summary that all variables included in the model are significant. The final model uses the weight of the car wt, the car’s horsepower hp, and the transmission type (automatic or manual) am0 or am1 to predict the car’s mpg. Looking at the R-squared statistic, 98% of total variability is explained by the model. R-squared is very high and has not gone down since the first model. This is a very encouraging statistic.

The final model equation is:

\[MPG = -2.879*(Weight Of Car)-0.037*(Horsepower Of Car)+34.003*(1 If Transmission Is Automatic 0 If Not)+36.087*(1 If Transmission Is Manual 0 If Not)\]

Since we made transmission type into dummy variables (am0 for automatic, am1 for manual), the intercept was excluded from the model. This is so that we can more easily see the effects on mpg of having an automatic transmission vs a manual transmission.

Interpreting the Coefficients

For every increase in the car’s weight by 1000 lbs, the MPG of the car goes down by -2.8785
For every one increase in the car’s horsepower, the MPG of the car goes down by -.0347
If the car has an automatic transmission, the MPG of the car starts at 34.003 and goes down from there
If the car has a manual transmission, the MPG of the car starts at 36.09 and goes down from there

Residual Plot and Diagnostics

When plotting residuals against fitted or predicted values, we want to make sure there are no patterns occuring. Any systematic patterns between residuals and fitted values may indicate that there are variables omitted from the model that should have been included. From this residual plot, we can see that the residuals appear to be random.

When examining the hat values between residuals and fitted values, we do not see any large hat values (See Appendix B). This indicates that there are no overly large residuals that are influencing the residual plot.

When examining the cook’s distances (see Appendix B), we do not see any overly large cooks distances (over 1). This indicates that there are no extreme outliers that are exerting influence and leverage.

Conclusion

According to the model, cars with manual transmissions achieve 2.08 MPG more than with an automatic transmission holding all other variables constant
On average, cars with a manual transmission achieve 24.39 MPG; 7.25 MPG more than cars with an automatic transmission (17.15 MPG), taking into account all variables.
The accuracy of the model is confirmed by:
- A High R-squared value of 98%
- Low P-values across all coefficients indicating significance
- Residual plot that shows no systematic patterns; random distribution
- Residual sum of almost 0
- Diagnostic Measures such as hat values and cook’s distances are low
Despite the model’s sucess there is still a reasonable amount of variation that is not explained by the model (less than 2%); because of this, the model can not perfectly model MPG, though it can come close

Appendices

Appendix A: All Code Used:

data("mtcars")
library(ggplot2)
library(ggthemes)
library(corrplot)
library(formattable)
library(htmltools)
library(webshot)
library(gridExtra)
webshot::install_phantomjs()
mtcars2 <- mtcars
mtcars$am <- factor(mtcars$am)
mtcars$cyl <- factor(mtcars$cyl)

export_formattable <- function(f, file, width = "100%", height = NULL, 
                               background = "white", delay = 0.2)
{
  w <- as.htmlwidget(f, width = width, height = height)
  path <- html_print(w, background = background, viewer = NULL)
  url <- paste0("file:///", gsub("\\\\", "/", normalizePath(path)))
  webshot(url,
          file = file,
          selector = ".formattable_widget",
          delay = delay)
}

## table of means of mpg by transmission type
meanByTrans <- aggregate(mtcars$mpg, by = list(mtcars$am), FUN = mean)
colnames(meanByTrans)[1] <- "Transmission"
colnames(meanByTrans)[2] <- "Average MPG"
meanByTrans$Transmission <- sapply(meanByTrans$Transmission, FUN = function(x){if(x == 0){"automatic"} else if(x == 1){ "manual"}})

## bar graph of mpg by transmission type

transMeanPlot <- ggplot(meanByTrans, aes(x = Transmission, y = meanByTrans$`Average MPG`)) +
  geom_bar(stat = "identity") +
  geom_bar(stat = "identity", fill = "#a1d99b") +
  geom_text(aes(label = round(meanByTrans$`Average MPG`,2)), vjust = -0.5, size = 4) +
  labs(title = "Average MPG by Transmission", subtitle = "On average, cars with automatic transmissions have lower Miles per Gallon (MPG)", caption = "On average, cars with a manual transmission get 7.25 MPG more than cars with an automatic transmission") +
  ylab("Average MPG") +
  theme_economist() +
  theme(legend.position = "none")



## table of means of mpg by transmission type
  
meanByCyl <- aggregate(mtcars$mpg, by = list(mtcars$cyl), FUN = mean)
colnames(meanByCyl)[1] <- "Number of Cylinders"
colnames(meanByCyl)[2] <- "Average MPG"
meanByCyl$`Number of Cylinders` <- factor(meanByCyl$`Number of Cylinders`)

## bar graph of mpg by number of cylinders

cylMeanPlot <- ggplot(meanByCyl, aes(x = meanByCyl$`Number of Cylinders`, y = meanByCyl$`Average MPG`)) +
  geom_bar(stat = "identity") +
  geom_bar(stat = "identity", fill = "indianred1") +
  geom_text(aes(label = round(meanByCyl$`Average MPG`,2)), vjust = -0.4, size = 4) +
  labs(title = "Average MPG by Number of Cylinders in Engine", subtitle = "On average, cars with more cylinders in their engines have lower Miles per Gallon (MPG)") +
  ylab("Average MPG") +
  xlab("Number of Cylinders") +
  theme_economist() +
  theme(legend.position = "none")



hpPlot <- ggplot(data = mtcars) +
  aes(x = hp, y = mpg) +
  geom_point(color = "#1b9e77") +
  labs(title = "MPG Stats as Horsepower Changes",
    subtitle = "In general, more horsepower is assoiciated with fewer MPG. The correlation is strongly negative.",
    x = "Horsepower",
    y = "Miles Per Gallon (MPG)") +
    geom_smooth(method = 'lm', formula = y ~ x) +
  theme_economist()



wtPlot <- ggplot(data = mtcars) +
  aes(x = wt, y = mpg) +
  geom_point(color = "firebrick") +
  labs(title = "MPG Stats as Car's Weight Changes",
    subtitle = "In general, more weight is assoiciated with fewer MPG. The correlation is strongly negative.",
    x = "Weight (in half-Tons (1000 lbs))",
    y = "Miles Per Gallon (MPG)") +
    geom_smooth(method = 'lm', formula = y ~ x, colour = "darkslateblue") +
  theme_economist()



avgMPGbyGear<- aggregate(mtcars$mpg, by = list(mtcars$gear), FUN = mean)
avgMPGbyGear$Group.1 <- factor(avgMPGbyGear$Group.1)


gearPlot <- ggplot(data = avgMPGbyGear) +
  aes(x = Group.1, y = x) +
  geom_bar(stat = "identity", fill = "#fc9272") +
  geom_text(aes(label = round(avgMPGbyGear$x, 2)), hjust = -0.2, size = 4) +
  labs(title = "Average MPG For Each Number of Gears",
    x = "Number of Gears",
    y = "Average MPG",
    subtitle = "There seems to be little consistency between number of gears and average MPG") +
  theme_economist() +
  coord_flip()



avgHPbyCyl <- aggregate(mtcars$hp, by = list(mtcars$cyl), FUN = mean)


hpByCylPlot <- ggplot(data = avgHPbyCyl) +
  aes(x = Group.1, y = x) +
  geom_bar(stat = "identity", fill = "#74c476") +
  geom_text(aes(label = round(avgHPbyCyl$x,2), vjust = -0.3)) +
  labs(title = "Average Horsepower by Cylinders in Engine",
    x = "Number of Cylinders in Engine",
    y = "Average Horsepower",
    subtitle = "More cylinders in engine seems to indicate higher horsepower") +
  theme_economist()


grid.arrange(transMeanPlot,cylMeanPlot, nrow = 2 )
grid.arrange(hpPlot, wtPlot, nrow = 2)
grid.arrange(gearPlot, hpByCylPlot, nrow = 2)


##fitting all variables into a model
fit1 <- lm(mpg ~ . - 1, data = mtcars)

export_formattable(formattable(round(data.frame(summary(fit1)$coefficients),3), list(Pr...t.. = formatter("span", style = function(x) ifelse(x < .05, "color:green", "color:red")))),"./model1.png")   

mtCor <- cor(mtcars2)
corrplot(mtCor, method = "number", type = "upper")

fit2 <- lm(mpg~wt, data = mtcars)
fit3 <- update(fit2,mpg~wt + cyl)
fit4 <- update(fit2,mpg~wt + cyl + hp)
fit5 <- update(fit2,mpg~wt + cyl + hp + disp)

export_formattable(formattable(round(anova(fit2,fit3,fit4,fit5),3),list('Pr(>F)' = formatter("span", style = function(x) ifelse(x < .05, "color:green", "color:red")))),"./anova1.png")

fit6 <- lm(mpg~hp, data = mtcars)
fit7 <- update(fit6,mpg~hp + cyl)
fit8 <- update(fit6,mpg~hp + cyl + wt)
fit9 <- update(fit6,mpg~hp + cyl + wt + disp)


export_formattable(formattable(round(anova(fit6,fit7,fit8,fit9),3),list('Pr(>F)' = formatter("span", style = function(x) ifelse(x < .05, "color:green", "color:red")))),"./anova2.png")

fit10 <- lm(mpg~cyl, data = mtcars)
fit11 <- update(fit10,mpg~cyl + wt)
fit12 <- update(fit10,mpg~cyl + wt + hp)
fit13 <- update(fit10,mpg~cyl + wt + hp + disp)


export_formattable(formattable(round(anova(fit10,fit11,fit12,fit13),3),list('Pr(>F)' = formatter("span", style = function(x) ifelse(x < .05, "color:green", "color:red")))),"./anova3.png")


fit14 <- lm(mpg~wt, data = mtcars)
fit15 <- update(fit14,mpg~wt + hp)
fit16 <- update(fit14,mpg~wt + hp + cyl)
fit17 <- update(fit14,mpg~wt + hp + cyl + disp)

export_formattable(formattable(round(anova(fit14,fit15,fit16,fit17),3),list('Pr(>F)' = formatter("span", style = function(x) ifelse(x < .05, "color:green", "color:red")))),"./anova4.png")

model1 <- lm(mpg ~ wt + hp + drat + qsec + vs + am + carb - 1, data = mtcars)

export_formattable(formattable(round(data.frame(summary(model1)$coefficients),3), list(Pr...t.. = formatter("span", style = function(x) ifelse(x < .05, "color:green", "color:red")))),"./model2.png")      

##knitr::include_graphics("./model2.png")


fit18 <- lm(mpg~wt, data = mtcars)
fit19 <- update(fit18,mpg~wt + hp)
fit20 <- update(fit18,mpg~wt + hp + drat)
fit21 <- update(fit18,mpg~wt + hp + drat + qsec)

export_formattable(formattable(round(anova(fit18,fit19,fit20,fit21),3)),"./anova5.png")


export_formattable(formattable(round(anova(fit18,fit19,fit20,fit21),3),list('Pr(>F)' = formatter("span", style = function(x) ifelse(x < .05, "color:green", "color:red")))),"./anova5.png")


fit22 <- lm(mpg~wt, data = mtcars)
fit23 <- update(fit22,mpg~wt + hp)
fit24 <- update(fit22,mpg~wt + hp + qsec)
fit25 <- update(fit22,mpg~wt + hp + qsec + drat)

export_formattable(formattable(round(anova(fit22,fit23,fit24,fit25),3),list('Pr(>F)' = formatter("span", style = function(x) ifelse(x < .05, "color:green", "color:red")))),"./anova6.png")

model3 <- lm(mpg~wt + hp + vs + am + carb - 1, data = mtcars)

export_formattable(formattable(round(data.frame(summary(model3)$coefficients),3), list(Pr...t.. = formatter("span", style = function(x) ifelse(x < .05, "color:green", "color:red")))),"./model3.png")      

model4 <- lm(mpg~wt + hp + am - 1, data = mtcars)

export_formattable(formattable(round(data.frame(summary(model4)$coefficients),3), list(Pr...t.. = formatter("span", style = function(x) ifelse(x < .05, "color:green", "color:red")))),"./model4.png")      

residTest <- data.frame(mtcars$mpg,mtcars$wt,mtcars$hp,mtcars$am)
colnames(residTest) <- c("mpg","wt","hp","am")
residTest$mpgPredict <- predict(model4,residTest)
residTest$residuals <- residTest$mpg - residTest$mpgPredict

plot(residTest$residuals ~ residTest$mpgPredict,
     xlab = "Fitted Values",
     ylab = "Residuals",
     main = "Residuals Plotted Against Fitted Values of MPG",
     sub = "There appears to be no pattern between residuals and the fitted values",
     bg = "lightblue",
     col = "black",
     cex = 1.1,
     pch = 21,
     frame = FALSE
     
     )
abline(h=0, lwd = 2)
for (i in 1:length(residTest$mpg)) { 
  lines(c(residTest$mpgPredict[i],residTest$mpgPredict[i]),c(residTest$residuals[i],0),col = "red", lwd = 2)
}

Appendix B: Hat Values and DFbetas table:

hatvals <- round(hatvalues(lm(residTest$residuals~residTest$mpgPredict)),3)
dfb <- round(dffits(lm(residTest$residuals~residTest$mpgPredict)),3)
cooks <- round(cooks.distance(lm(residTest$residuals~residTest$mpgPredict)),3)


diagnosticTab <- data.frame(1:32, hatvals, dfb, cooks)


colnames(diagnosticTab) <- c("Point","Hat Values","dfBetas","cooks distance")
diaTab <- formattable(diagnosticTab)

export_formattable(diaTab,"./diaTab.png")

Appendix C: Miscellaneous Code and Stats:

betaStdevs <- sapply(1:1000, function(x){
MPGsim <- rnorm(n = 1000, mean = mean(mtcars$mpg), sd = sd(mtcars$mpg))
cylSim <- round(rnorm(n = 1000, mean = mean(mtcars2$cyl), sd = sd(mtcars2$cyl)),0)
dispSim <- rnorm(n = 1000, mean = mean(mtcars$disp), sd = sd(mtcars$disp))
wtSim <- rnorm(n = 1000, mean = mean(mtcars$wt), sd = sd(mtcars$wt))
HPsim <- rnorm(n = 1000, mean = mean(mtcars$hp), sd = sd(mtcars$hp))
simData <- data.frame(MPGsim, HPsim, wtSim,cylSim,dispSim)

c(coef(lm(simData$MPGsim ~ simData$HPsim))[2],coef(lm(simData$MPGsim ~ simData$HPsim + simData$wtSim))[2], coef(lm(simData$MPGsim ~ simData$HPsim + simData$wtSim + simData$cylSim))[2], + coef(lm(simData$MPGsim ~ simData$HPsim + simData$wtSim +simData$cylSim + simData$dispSim))[2])

}
)
asdf <- round(apply(betaStdevs,1,sd),5)

##aggregate(mtcars$mpg, by = list(mtcars$carb), FUN = mean)
## in general, more carborators is assoiciated with lower mpg
##aggregate(mtcars$cyl, by = list(mtcars$carb), FUN = mean)
##cor(mtcars$cyl, mtcars$carb)
## more carborators was highly correlated with more cylinders, it was ommitted from the final model
plot(mtcars$disp ~ mtcars$mpg)
## the plot reveals that higher engine displacement is associated with fewer mpg
##cor(mtcars$cyl, mtcars$disp)
## higher engine displacement is highly correlated with number of cylinders, it was omitted from the final model
meanByTrans$`Average MPG`[2] - meanByTrans$`Average MPG`[1]